Jordan Brown (Sun) writes:
> James Carlson wrote:
> > I can't reach into other address spaces, so those guys are still ok.
> > (Yes, there's an interesting fate-sharing issue with shared memory,
> > and having a mutual-core-dump pact among processes attached to a
> > shared memory segment sounds like a cool idea, but we're not talking
> > about anything like that here.)
>
> There's also shared files, databases, output streams, and so on.  They
> aren't *totally* independent.  A sed dying, ignored by its parent shell,
> can lead to damaged data being written into a file, and so on.  I agree,
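As an aside, that "sed dying, ignored by its parent shell" failure mode is easy to reproduce; here's a minimal sketch (the file name is just illustrative -- the deliberately broken sed expression stands in for any mid-pipeline death):

```shell
# A sed that fails mid-pipeline is easy for the parent shell to miss:
# the pipeline's exit status is that of the *last* command, so the
# shell sees success even though sed died and out.txt holds bad data.
printf 'a\nb\n' | sed 's/a/A' 2>/dev/null | cat > out.txt
echo "pipeline status: $?"   # prints 0 -- cat succeeded; sed's failure is invisible
```

The shell reports success, and nothing downstream ever learns that the middle of the pipeline died.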
Right, but the distinction I was drawing was between the "parent is on
the hook to figure out what to do about failures" design school (i.e.,
traditional UNIX) and the new SMF+contracts school that (at least by
default) bucks that trend.

Obviously, if the parent fails to live up to its end of the bargain in
the old model, bad things are entirely possible.  The "new" thing here
is that parents are shot for the sins of grandchildren, and it happens
in unexpected ways.  (Which goes back to the whole question of
understanding how fault boundaries are set.  It's an issue I don't
think we understand well, or that any UNIX designer would _expect_.)

> > You're quite right that doing a restart on failure is a suspicious
> > thing, and it's something we talked about at length during ARC review
> > of Greenline years ago.  I still don't really believe in it, but that
> > ship has long since sailed.
>
> Over here in my neck of the woods, we've (against my advice) taken it to
> another level: our "start the service" command automatically does
> "svcadm clear" against each of our services.  Hey, even though it failed
> the last time, perhaps many times, might as well try *again*.  It might
> work, and that's better than not, right?

Oh, my.

> yes, I've started to occasionally use "sh -e".)  Should a single warning
> message in compiling the kernel abort the entire build?  At the moment
> it does, and I'm happy.  It forces people to fix the problem.  If we
> kill you and the horse, maybe the next guy will get the horseshoes on
> right.  Should a single Java thread with an uncaught exception kill the
> whole program?  It doesn't, and so (among many other reasons) log files
> get littered with exception messages that get ignored, and I'm unhappy.

Here's the problem: you're talking about bits that a single designer
(or group of related folks) put together and then -- oops! --
fornicated with the dog on one of them.  They're all in the same
playground.  That's not quite what I'm talking about.
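(For the record, the "sh -e" discipline Jordan mentions above looks like this in miniature -- the script path is made up for the example:)

```shell
# Without -e, the shell shrugs off a failing command and keeps going;
# with -e, the first failure aborts the whole script.
cat > /tmp/demo.sh <<'EOF'
false
echo "still running"
EOF

sh /tmp/demo.sh       # prints "still running"; the failure of false is ignored
sh -e /tmp/demo.sh    # exits nonzero at false, before reaching the echo
```

That's exactly the "kill you and the horse" trade-off: with -e, nothing downstream of a failure gets a chance to do further damage.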
In the case I'm talking about, the processes that were launched really
were never any responsibility of the original caller.  He may not have
known that the processes existed, and didn't _expect_ to be taking any
ownership of them, and, in some cases (as with event hooks), he
probably has no design-level control over them at all -- some _end
user_ is installing binaries for him to run.

I ought to be able to call system(3C) and not get a fatal signal as a
direct result.  That was once essentially true, other than for
intentionally parricidal processes ... but is no more.  Now that this
fundamental assumption is not true, it's unclear what level of effort
should be put into finding cases where this has surprising results.
That was what I was talking about before.

> Again, part of the answer may be to have different settings for
> development and test environments than for production environments.

Yes, that's a possible complicating issue.  Still, even in a
development environment, I don't want ifconfig(1M) (or its invoker) to
take a signal if dhcpagent or in.mpathd dies, or if dhcpagent's event
hook script (from the user) drops core.  In those cases, that'll
happen today, and it's not right.

-- 
James Carlson, Solaris Networking              <james.d.carlson at sun.com>
Sun Microsystems / 35 Network Drive        71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N Fax +1 781 442 1677