Jordan Brown (Sun) writes:
> James Carlson wrote:
> > Who really is to "blame" when a process drops core?  SMF and fault
> > management seem to assert that (by default) all processes that
> > constitute a "service" -- those in one contract -- are equal and bound
> > at the hip.  If one fails, they're all suspect.  None is more equal
> > than the others.
> 
> This is just an extension of the philosophy that says that when there is 
> a failure in the code, anywhere, the right answer is to restart.

So why not just reboot the whole machine when one process dies?

> If some random piece of libc code dereferences NULL, should my process 
> die?  If my process dies, should SMF expect that anything better will 
> happen when it gets restarted?

I don't think these are really equivalent at all, except on a
superficial level.

If something goes wrong in libc, my process address space may be
trashed.  I have no way of knowing how bad the problem is.  The best
thing I can do is to exit abruptly, so that I don't hurt anyone else
by spewing corrupted data into the rest of the system.

I can't reach into other address spaces, so those guys are still ok.
(Yes, there's an interesting fate-sharing issue with shared memory,
and having a mutual-core-dump pact among processes attached to a
shared memory segment sounds like a cool idea, but we're not talking
about anything like that here.)

That corruption problem does *NOT* arise if I run system("ifconfig -a")
and one of those two sub-processes (the shell or the ifconfig command)
ends up dumping core.

The fundamental underlying issues are quite different, though you're
right that (as I _also_ said in the part you didn't quote) it's the
fault boundary that is at issue.

> While the "self-healing" aspect of the ignore and restart philosophies 
> is appealing, I've seen too many log files with piles of unexplained 
> errors, left unexplained because they didn't *seem* to cause a problem 

That's a separate problem.

You're quite right that doing a restart on failure is a suspicious
thing, and it's something we talked about at length during ARC review
of Greenline years ago.  I still don't really believe in it, but that
ship has long since sailed.

What we're talking about is only who takes the blame when something
goes wrong, not what recovery needs to do.  Today, the default blame
answer is "you and the horse you rode in on."  In some cases, and from
a high level view, that seems right, but there are detailed cases
where it's very wrong, and it's *EASY* for developers to step right
into those bad cases using ordinary UNIX design rules.  (Such as
having event hooks.)
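
For what it's worth, there is a documented knob for shifting that
default blame: a service can ask svc.startd not to treat core dumps
(or externally delivered fatal signals) from members of its contract
as fatal, via the startd/ignore_error property described in
svc.startd(1M).  A sketch, using a hypothetical service FMRI:

```shell
# Sketch: tell svc.startd that core dumps and external signals from
# processes in this service's contract are not service-fatal.
# "site/mysvc" is a hypothetical FMRI; if the startd property group
# does not exist yet, create it first.
svccfg -s site/mysvc addpg startd framework
svccfg -s site/mysvc setprop startd/ignore_error = astring: "core,signal"
svcadm refresh site/mysvc
```

That's a blunt instrument, though -- it ignores those errors for the
whole contract, which is exactly the all-or-nothing granularity being
complained about here.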

Having them not waltz into bizarre behavior seems like a good thing to
me.

-- 
James Carlson, Solaris Networking              <james.d.carlson at sun.com>
Sun Microsystems / 35 Network Drive        71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N   Fax +1 781 442 1677