[smf-discuss] svc.startd notices dead child, kills the parent

James Carlson Wed, 30 Apr 2008 08:51:43 -0400

Sebastien Roy writes:
> On Wed, 2008-04-30 at 06:59 -0400, James Carlson wrote:
> > Instead, it's a problem in libdhcpagent.  That's what spawns dhcpagent
> > when necessary, and it should know that *nobody* who is starting that
> > process should really be held responsible for its fate.  It could
> > start the process in a new contract.
> 
> True, but the ifconfig command that nwamd execs should also not cause
> the service to be restarted.


I don't really know.  This gets into an unclear area of SMF and
contracts.

Who really is to "blame" when a process drops core?  SMF and fault
management seem to assert that (by default) all processes that
constitute a "service" -- those in one contract -- are equal and bound
at the hip.  If one fails, they're all suspect.  None is more equal
than the others.

I can believe that's true if the processes in question are (say)
"oraclemumble" and "oraclefrotz."  If one of those goes, then the
other is probably damaged, and the whole service is suspect as a
result.  Restarting may allow you to recover normal operation (one
hopes ... as long as it's not a deterministic machine ;-}).

It's much harder to understand what to do when a process calls
system().  The shell that system(3C) implicitly invokes (but nobody
thinks about) could drop core.  So could the subprocess spawned.  Is
either of those a reason to put a bullet in the caller's head?
Doesn't doing so potentially turn a small problem into a big one?  (As
in the case you describe ...)

It sort of reminds me of the old National Lampoon cover saying, "buy
this magazine or we shoot this dog."  ;-}

>  It's just a transient command that could
> fail for any number of reasons, all of which nwamd should be able to
> handle.  Fixing the larger problem you mention in libdhcpagent won't fix
> this.  I think nwamd just uses system(), so it would be easy enough to
> have it run ifconfig under ctrun.

ctrun itself could fail.  So could the shell that calls it.  :-<

> If nwamd were using the libinetcfg or libdhcpagent API directly, then it
> would have no way of doing that.  It would have to depend on the
> libraries doing the right thing.

Yep.

> > A similar problem exists in ifconfig itself where it invokes
> > in.mpathd.  It needs to start a new contract there, because the
> > service that started ifconfig has nothing to do with the background
> > process that we're starting.  (Same problem, different daemon.)
> 
> It's like buying a new car.  Once you have one particular model, you
> start seeing it everywhere on the road.  Now that I'm aware of this
> contract boundary issue, I'm sure I'll find problems like this
> everywhere. :-}

In the past, it was the caller's problem to call one of the wait*()
functions, and deal with any failures that might occur in some
locally-meaningful way.  With SMF, that's not quite true, though it
_can_ be set up to work that way by specifying special handling.
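
As a rough illustration of that older model, here is a minimal sketch in
Python rather than C. The helper name run_transient is hypothetical and is
not taken from nwamd or any Solaris library. It assumes a POSIX system,
where subprocess reports death-by-signal as a negative return code. The
caller waits for its child and decides locally what a failure means:

```python
import signal
import subprocess
import sys


def run_transient(argv):
    """Run a short-lived helper and describe how it ended.

    Hypothetical helper for illustration only.
    """
    # subprocess.run() waits for the child, much like fork/exec + waitpid().
    rc = subprocess.run(argv).returncode
    if rc == 0:
        return "ok"
    if rc < 0:
        # On POSIX, a negative return code means the child died from
        # signal number -rc.
        return "killed by " + signal.Signals(-rc).name
    return "exit status %d" % rc


if __name__ == "__main__":
    # Stand-in for a misbehaving child: a Python process that signals itself.
    crash = [sys.executable, "-c",
             "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    print(run_transient(crash))
```

Nothing here changes how a contract-based restarter sees the child; it only
shows the caller-side handling that wait*() has always allowed.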

Consider, for example, all those services that have scripting
interfaces -- the ones with event hooks or other script-invocation
features (such as dhcpagent and pppd).  Is every one of those services
held to account when some user-written (!) script does something that
causes a core dump?  And should the whole service be nuked when it
does?

Or consider the fate of developers porting software to Solaris.  Does
anyone expect a SIGTERM out-of-the-blue due to some child of a child
of a child dropping core?

It almost sounds like a case where we have to inspect every exec*()
caller, and then every caller of those callers, to make sure that
"blame" for failure is charged fairly, and new contracts started where
needed.

Perhaps that's alarmist, but it does look like the outer bound of the
problem.  Obviously, I don't think anything should just be dropping
core _ever_, but where the boundaries of each "fault cone" lie is hard
to determine correctly, especially in the interaction between system
functions and user-written bits.

-- 
James Carlson, Solaris Networking              <james.d.carlson at sun.com>
Sun Microsystems / 35 Network Drive        71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N   Fax +1 781 442 1677
