James Carlson wrote:
> Who really is to "blame" when a process drops core?  SMF and fault
> management seem to assert that (by default) all processes that
> constitute a "service" -- those in one contract -- are equal and bound
> at the hip.  If one fails, they're all suspect.  None is more equal
> than the others.

This is just an extension of the philosophy that says that when there is 
a failure in the code, anywhere, the right answer is to restart.

If some random piece of libc code dereferences NULL, should my process 
die?  If my process dies, should SMF expect that anything better will 
happen when it gets restarted?

In my mind, there are two important questions here:

- Where do you draw the fault boundaries?  When and how do you decide 
that a fault in a low-level component suggests that the high-level 
component is broken?  After all, much of the time the reason that the 
low-level component failed was that the high-level component fed it bad 
data.  As designers, we can usually tell by looking where the right 
boundaries are - cron is not at fault when its subprocesses die - but I 
don't think that any automaton can tell.

- What do you do when a fault occurs?  It seems like there are roughly 
three possibilities:  ignore the fault, restart the affected component, 
or stop the affected component.  (See the previous question for what 
constitutes a "component".)  There is an argument that "ignore" and 
"restart" are best for system availability - the fault might not be 
serious and might not recur - and an argument for stopping the 
component so that its failure will be noticed and can be truly fixed.
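
To make the three choices concrete, here is a minimal sketch of a 
supervisor that applies one of them when a component faults.  It is an 
illustration only, not how SMF does it: Policy, supervise, make_flaky 
and the max_restarts limit are all hypothetical names invented for this 
example.

```python
from enum import Enum
from typing import Callable, List


class Policy(Enum):
    IGNORE = "ignore"    # note the fault and carry on
    RESTART = "restart"  # run the component again
    STOP = "stop"        # leave it down so someone notices


def supervise(component: Callable[[], None], policy: Policy,
              max_restarts: int = 3) -> List[str]:
    """Run a component under the given policy and return an event log."""
    log: List[str] = []
    attempts = 0
    while True:
        attempts += 1
        try:
            component()
            log.append("ok")
            return log
        except Exception as exc:
            log.append(f"fault: {exc}")
            if policy is Policy.IGNORE:
                log.append("ignored")
                return log
            if policy is Policy.STOP:
                log.append("stopped")
                return log
            # Policy.RESTART: retry, but not forever (hypothetical limit).
            if attempts > max_restarts:
                log.append("gave up")
                return log
            log.append("restarting")


def make_flaky(failures: int) -> Callable[[], None]:
    """Build a hypothetical component that faults `failures` times, then works."""
    remaining = [failures]

    def component() -> None:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("boom")

    return component
```

With make_flaky(2), the restart policy ends in "ok" after two restarts, 
while the stop policy halts at the first fault.  Note that nothing in 
the sketch decides what counts as a "component"; that choice is still 
left to the designer.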

While the "self-healing" aspect of the ignore and restart philosophies 
is appealing, I'm not very happy with it as an overall philosophy.  I've 
seen too many log files with piles of unexplained errors, left 
unexplained because they didn't *seem* to cause a problem, and then left 
to contribute noise so that new failures are difficult to detect.  At 
least during the development and test process, I tend to be a "zero 
tolerance" person:  if anything goes wrong, stop, analyze the failure, 
and fix it.  Ignoring and restarting should be reserved for production 
environments.
