Crash-only Systems

Finally got around to reading this paper about crash-only software. What a great idea. Fatal errors that force a component to restart are, by their nature, unanticipated and almost impossible to design out, so why not make crash and reboot the default mode of operation? If a system is designed so that it can be crashed and restarted safely at any moment, the number of failure/recovery scenarios is reduced to one.
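To make that concrete for myself, here is a rough sketch in Python of what "one recovery scenario" might look like. None of it comes from the paper: the `CrashOnlyCounter` class, the JSON state file, and the `simulate_crash_and_restart` helper are all made up for illustration, and it assumes the filesystem gives you an atomic rename. The point is only that there is no separate shutdown or repair path; starting up is recovering.

```python
import json
import os
import tempfile


class CrashOnlyCounter:
    """Hypothetical component: startup and recovery are the same code path."""

    def __init__(self, path):
        self.path = path
        self.count = self._recover()  # the only way to start

    def _recover(self):
        # Load the last committed state; a missing file just means a first start.
        try:
            with open(self.path) as f:
                return json.load(f)["count"]
        except FileNotFoundError:
            return 0

    def increment(self):
        new_count = self.count + 1
        self._commit(new_count)  # persist before acknowledging the update
        self.count = new_count
        return new_count

    def _commit(self, count):
        # Write to a temp file, then rename it over the old state file.
        # Assumption: the rename is atomic, so a crash at any point leaves
        # either the old or the new state file intact, never a partial one.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump({"count": count}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)


def simulate_crash_and_restart(path):
    first = CrashOnlyCounter(path)
    first.increment()
    first.increment()
    del first  # "crash": no shutdown hook, no cleanup
    second = CrashOnlyCounter(path)  # a restart is just a normal start
    return second.count
```

Because there is no clean-shutdown method at all, stopping the component and crashing it are the same thing, which is roughly the property the paper is after. A real system would also have to deal with leftover temp files and with callers retrying requests that were in flight at the time of the crash.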

The paper takes a number of good ideas for building robust distributed systems and turns them up to 11. Go read it.