Jordan Brown (Sun) writes:
> James Carlson wrote:
> > I can't reach into other address spaces, so those guys are still ok.
> > (Yes, there's an interesting fate-sharing issue with shared memory,
> > and having a mutual-core-dump pact among processes attached to a
> > shared memory segment sounds like a cool idea, but we're not talking
> > about anything like that here.)
> 
> There's also shared files, databases, output streams, and so on.  They 
> aren't *totally* independent.  A sed dying, ignored by its parent shell, 
> can lead to damaged data being written into a file, and so on.  I agree, 

Right, but the distinction I was drawing was between the "parent is on
the hook to figure out what to do about failures" design school (i.e.,
traditional UNIX) and the new SMF+contracts school that (at least by
default) bucks that trend.

Obviously, if the parent fails to live up to its end of the bargain in
the old model, bad things are entirely possible.

The "new" thing here is that parents are shot for the sins of
grandchildren, and it happens in unexpected ways.  (Which goes back to
the whole question of understanding how fault boundaries are set.
It's an issue I don't think we understand well, or that any UNIX
designer would _expect_.)

> > You're quite right that doing a restart on failure is a suspicious
> > thing, and it's something we talked about at length during ARC review
> > of Greenline years ago.  I still don't really believe in it, but that
> > ship has long since sailed.
> 
> Over here in my neck of the woods, we've (against my advice) taken it to 
> another level:  our "start the service" command automatically does 
> "svcadm clear" against each of our services.  Hey, even though it failed 
> the last time, perhaps many times, might as well try *again*.  It might 
> work, and that's better than not, right?

Oh, my.
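In other words, the wrapper amounts to roughly this (the FMRIs are
invented for illustration; this is my sketch of what's being
described, not the actual script):

```sh
#!/bin/sh
# Clearing "maintenance" state throws away the record of why the
# service failed last time; enable then tries it all over again.
for fmri in svc:/site/app1:default svc:/site/app2:default; do
        svcadm clear "$fmri" 2>/dev/null
        svcadm enable "$fmri"
done
```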

> yes, I've started to occasionally use "sh -e".)  Should a single warning 
> message in compiling the kernel abort the entire build?  At the moment 
> it does, and I'm happy.  It forces people to fix the problem.  If we 
> kill you and the horse, maybe the next guy will get the horseshoes on 
> right.  Should a single Java thread with an uncaught exception kill the 
> whole program?  It doesn't, and so (among many other reasons) log files 
> get littered with exception messages that get ignored, and I'm unhappy.
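(For anyone following along, the "-e" behavior in question is easy to
see; this toy subshell is mine, not Jordan's build:)

```shell
#!/bin/sh
# With -e, the subshell aborts at the first failing command,
# so "after" is never printed.
out=$(sh -e -c 'echo before; false; echo after')
echo "subshell printed: $out"    # prints "subshell printed: before"
```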

Here's the problem: you're talking about bits that a single designer
(or group of related folks) put together and then -- oops! --
fornicated with the dog on one of them.  They're all in the same
playground.

That's not quite what I'm talking about.  In the case I'm talking
about, the processes that were launched really were never any
responsibility of the original caller.  He may not have known that the
processes existed, and didn't _expect_ to be taking any ownership of
them, and, in some cases (as with event hooks), he probably has no
design-level control over them at all -- some _end user_ is installing
binaries for him to run.

I ought to be able to call system(3C) and not get a fatal signal as a
direct result.  That was once essentially true, other than for
intentionally parricidal processes ... but is no more.
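The traditional model is easy to demonstrate from any shell: a child
dying on a signal merely shows up in the parent's exit-status
bookkeeping (a toy illustration of the old behavior, nothing
contracts-specific):

```shell
#!/bin/sh
# The child terminates itself with SIGTERM; the invoking shell
# survives and sees an exit status of 128 plus the signal number
# (e.g. 143 for SIGTERM in POSIX shells).
sh -c 'kill -TERM $$'
echo "parent still alive; child status was $?"
```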

Now that this fundamental assumption is not true, it's unclear what
level of effort should be put into finding cases where this has
surprising results.  That was what I was talking about before.

> Again, part of the answer may be to have different settings for 
> development and test environments than for production environments.

Yes, that's a possible complicating issue.

Still, even in a development environment, I don't want ifconfig(1M)
(or its invoker) to take a signal if dhcpagent or in.mpathd dies, or
if dhcpagent's event hook script (from the user) drops core.  In those
cases, that'll happen today, and it's not right.

-- 
James Carlson, Solaris Networking              <james.d.carlson at sun.com>
Sun Microsystems / 35 Network Drive        71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N   Fax +1 781 442 1677
