James Carlson wrote:
> If something goes wrong in libc, my process address space may be
> trashed.  I have no way of knowing how bad the problem is.  The best
> thing I can do is to exit abruptly, so that I don't hurt anyone else
> by spewing corrupted data into the rest of the system.

Yeah, mostly.  I agree that the process boundaries help a lot to contain 
the fault.  However, I'd say that the underlying component often failed 
because it was fed bad data (which is usually how the libc routine gets 
a NULL to play with), and that says bad things about the higher-level 
component.  (I'll admit that that kind of thing tends to cause ordinary 
failures rather than core dumps, and so the policies here wouldn't 
detect it.)

> I can't reach into other address spaces, so those guys are still ok.
> (Yes, there's an interesting fate-sharing issue with shared memory,
> and having a mutual-core-dump pact among processes attached to a
> shared memory segment sounds like a cool idea, but we're not talking
> about anything like that here.)

There are also shared files, databases, output streams, and so on.  They 
aren't *totally* independent.  A sed dying, ignored by its parent shell, 
can lead to damaged data being written into a file.  I agree, however, 
that the process boundary mostly blocks collateral damage, and the 
damage that can be caused doesn't _tend_ to be as dangerous.
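
To make that concrete, here's a contrived sketch (the file names are 
invented) of the kind of thing I mean.  The shell never looks at sed's 
exit status, so whatever partial output made it to disk gets installed 
anyway:

    #!/bin/sh
    # Rewrite a config file in place.  If sed dies partway through
    # (core dump, SIGKILL, whatever), nothing checks its exit status,
    # so the truncated output still replaces the real file.
    sed 's/old/new/g' /etc/myapp.conf > /etc/myapp.conf.tmp
    mv /etc/myapp.conf.tmp /etc/myapp.conf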

> The fundamental underlying issues are quite different, though you're
> right that (as I _also_ said in the part you didn't quote) it's the
> fault boundary that is at issue.

Sorry, no intent to drop key comments.

> You're quite right that doing a restart on failure is a suspicious
> thing, and it's something we talked about at length during ARC review
> of Greenline years ago.  I still don't really believe in it, but that
> ship has long since sailed.

Over here in my neck of the woods, we've (against my advice) taken it to 
another level:  our "start the service" command automatically does 
"svcadm clear" against each of our services.  Hey, even though it failed 
the last time, perhaps many times, might as well try *again*.  It might 
work, and that's better than not trying, right?
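
To give a flavor of it (the service names are invented, but the shape 
is accurate):

    #!/bin/sh
    # "Start the service": blindly clear any maintenance state left
    # over from the last failure, then enable, no questions asked.
    for svc in site/frontend site/backend site/scheduler; do
        svcadm clear svc:/$svc 2>/dev/null
        svcadm enable svc:/$svc
    done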

> What we're talking about is only who takes the blame when something
> goes wrong, not what recovery needs to do.  Today, the default blame
> answer is "you and the horse you rode in on."  In some cases, and from
> a high level view, that seems right, but there are detailed cases
> where it's very wrong, and it's *EASY* for developers to step right
> into those bad cases using ordinary UNIX design rules.  (Such as
> having event hooks.)
> 
> Having them not waltz into bizarre behavior seems like a good thing to
> me.

I don't disagree, but I don't think the answer is clear-cut either.  At 
the moment my pendulum is swinging towards the intolerant side.  (In 
particular, I've seen too many Java exceptions logged and ignored, and 
too many commands fail in shell scripts with the failure ignored.  And, 
yes, I've started to occasionally use "sh -e".)  Should a single warning 
message in compiling the kernel abort the entire build?  At the moment 
it does, and I'm happy.  It forces people to fix the problem.  If we 
kill you and the horse, maybe the next guy will get the horseshoes on 
right.  Should a single Java thread with an uncaught exception kill the 
whole program?  It doesn't, and so (among many other reasons) log files 
get littered with exception messages that get ignored, and I'm unhappy.
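
For what it's worth, this is the sort of thing "sh -e" buys you (the 
commands and paths are made up):

    #!/bin/sh -e
    # With -e the script stops at the first command that fails.
    # Without it, the shell ignores the failure, runs the rest
    # anyway, and claims success at the end.
    cc -O -o tool tool.c
    cp tool /usr/local/bin/tool
    echo "tool installed"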

Again, part of the answer may be to have different settings for 
development and test environments than for production environments.
