James Carlson wrote:
> Who really is to "blame" when a process drops core? SMF and fault
> management seem to assert that (by default) all processes that
> constitute a "service" -- those in one contract -- are equal and bound
> at the hip. If one fails, they're all suspect. None is more equal
> than the others.

This is just an extension of the philosophy that says that when there is a failure in the code, anywhere, the right answer is to restart. If some random piece of libc code dereferences NULL, should my process die? If my process dies, should SMF expect that anything better will happen when it gets restarted?

In my mind, there are two important questions here:

- Where do you draw the fault boundaries? When and how do you decide that a fault in a low-level component suggests that the high-level component is broken? After all, much of the time the reason that the low-level component failed was that the high-level component fed it bad data. As designers, we can usually tell by looking where the right boundaries are - cron is not at fault when its subprocesses die - but I don't think that any automaton can tell.

- What do you do when a fault occurs? It seems like there are roughly three possibilities: ignore the fault, restart the affected component, or stop the affected component. (See the discussion above for what constitutes a "component".) There is an argument that "ignore" and "restart" are best for system availability - the fault might not be serious and might not recur - and an argument for stopping the component so that its failure will be noticed and can be truly fixed.

While the "self-healing" aspect of the ignore and restart philosophies is appealing, I'm not very happy with it as an overall philosophy. I've seen too many log files with piles of unexplained errors, left unexplained because they didn't *seem* to cause a problem, and then left to contribute noise so that new failures are difficult to detect.

At least during the development and test process, I tend to be a "zero tolerance" person. If anything goes wrong, stop, analyze the failure, and fix it. Ignoring and restarting should be reserved for production environments.
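
To make the three responses to a fault concrete, here is a minimal sketch of a toy supervisor that applies an "ignore", "restart", or "stop" policy to a failing component. This is purely illustrative and is not how SMF or any real restarter works: `Policy`, `supervise`, `make_flaky`, and the `max_restarts` cap are all hypothetical names and assumptions introduced for the example.

```python
from enum import Enum


class Policy(Enum):
    IGNORE = "ignore"    # record the fault and carry on as if nothing happened
    RESTART = "restart"  # run the component again (up to a cap)
    STOP = "stop"        # halt the component so the failure gets noticed


def supervise(component, policy, max_restarts=3):
    """Run component() under a fault policy; return (final_state, fault_log).

    max_restarts is an assumed safety cap so a permanently broken
    component does not loop forever under the restart policy.
    """
    fault_log = []
    restarts = 0
    while True:
        try:
            component()
            return "completed", fault_log
        except Exception as exc:
            fault_log.append(repr(exc))
            if policy is Policy.IGNORE:
                # The fault is logged but nobody acts on it.
                return "running", fault_log
            if policy is Policy.RESTART and restarts < max_restarts:
                restarts += 1
                continue
            # Either the policy is STOP or the restart budget is spent.
            return "stopped", fault_log


def make_flaky(failures):
    """Build a component that raises `failures` times, then succeeds."""
    remaining = [failures]

    def component():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("simulated fault")

    return component
```

Under the restart policy, a component that fails twice ends up "completed" with two entries quietly sitting in the log, which is exactly the kind of unexplained noise described above. Under the stop policy, the first fault halts the component and demands attention.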