[smf-discuss] Does SMF kill shell scripts randomly ? / was: Re: svc.startd notices dead child, kills the parent

James Carlson Fri, 2 May 2008 09:48:53 -0400

David Powell writes:
> James Carlson wrote:
>  > 'nwam' occasionally execs 'ifconfig'.  As a simple executable, this
>  > isn't such a big deal.  A hidden nasty, though, is that ifconfig can
>  > fork/exec long-lived daemons that provide global services.  (This is
>  > the "on demand" bit again.)
[...]
>    What's special about ifconfig that frees it from the collection of
>    "all processes started by the service"?


There's nothing special about ifconfig.  It doesn't leave the
collection at all.

However, the daemons that it starts (dhcpagent and in.mpathd) are in
fact "special."  They're special in the same way that all common
service-providing daemons are special: they close descriptors and do
the fork-setsid-fork dance in order to _try_ to get out from under the
thumb of the caller (though, obviously, that doesn't work with
contracts), and then they provide a service to the system.

That service they're providing is global.  Some other, completely
unrelated caller can come in later and (through an IPC) send a request
to the existing daemon.  It doesn't have to start a new daemon for its
request.  By that logic, the original invoker (who exec'd ifconfig
which caused the fork/exec of the daemon) is *not* responsible in any
way for the progress of the daemon or the consequences of subsequent
requests.

The problem here is that there's really nothing unusual going on from
a traditional UNIX viewpoint.  It's not at all wrong to fork a daemon
into the background.  It's not wrong to disclaim your connection to
the original invoker so that you can set off into your own world.

Unfortunately, it's no longer possible to do those things on Solaris
simply by using the traditional UNIX mechanisms.  To get out of your
invoker's contract, you must do something special.  That "something
special" could be disabling the error protection using the SMF
manifest (as you've suggested), or it could be creating a new contract
in the parent (if ifconfig somehow "knows" about the problem), or it
could be an enhanced daemon()-like interface that _really_ disclaims
attachment to the original process (which we don't have).
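
To make the "new contract in the parent" option concrete, here is a minimal C sketch, assuming the Solaris libcontract interfaces (ct_pr_tmpl_set_param(), ct_pr_tmpl_set_fatal(), ct_tmpl_set_critical(), ct_tmpl_set_informative(), ct_tmpl_activate(), ct_tmpl_clear()) and linking with -lcontract. The helper name fork_in_new_contract() is hypothetical, the chosen event sets are only one plausible configuration, and the #ifdef fallback to a plain fork() exists only so the sketch builds on systems without contracts.

```c
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

#ifdef __sun
#include <libcontract.h>
#include <sys/contract/process.h>
#endif

/*
 * Hypothetical helper: fork() so that the child lands in a fresh process
 * contract rather than the caller's.  Return values are those of fork(),
 * or -1 if the contract template could not be set up.
 */
pid_t
fork_in_new_contract(void)
{
    pid_t pid;
#ifdef __sun
    /* Assumed ctfs mount point, as described in contract(4). */
    int tmpl = open("/system/contract/process/template", O_RDWR);

    if (tmpl < 0)
        return (-1);
    /*
     * Assumed settings: treat only hardware errors as fatal, limit the
     * fallout to the failing process group, and request no critical or
     * informative events.  Then make this the active template.
     */
    if (ct_pr_tmpl_set_param(tmpl, CT_PR_PGRPONLY) != 0 ||
        ct_pr_tmpl_set_fatal(tmpl, CT_PR_EV_HWERR) != 0 ||
        ct_tmpl_set_critical(tmpl, 0) != 0 ||
        ct_tmpl_set_informative(tmpl, 0) != 0 ||
        ct_tmpl_activate(tmpl) != 0) {
        (void) close(tmpl);
        return (-1);
    }
#endif
    /* With a template active, the child starts in a new contract. */
    pid = fork();
#ifdef __sun
    /* Parent and child alike: later forks should not reuse the template. */
    (void) ct_tmpl_clear(tmpl);
    (void) close(tmpl);
#endif
    return (pid);
}
```

In this sketch the parent still holds the new contract after the fork. A long-lived parent would probably want to abandon it as well (ct_ctl_abandon() on the contract's ctl file), which is left out here for brevity.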

But whatever it is, it's not the default, and it's not sufficient to
rely on the previously well-known design patterns.

>    The service does a variety of things including starting other
>    processes that are automatically considered part of the service.
>    Some of those processes failed.  The coarse-grained (compared to the
>    operations performed by the service) fault management took out the
>    entire service.

Correct.  It's the "considered part of the service" that happens to be
incorrect here.

>    I don't see why understanding how contracts function really matters
>    here.  Unless your *service* is using contracts (which would imply
>    you *have* an understanding of contracts), new processes will always
>    belong to the service they were created by.  Period.  End of story.

Correct.  That's exactly the problem.

Things that did the equivalent of daemon() in the past were *NOT*
expecting to be part of any "service" collection, and those who were
invoking them were *NOT* expecting to be called to account for those
things at any point in the indeterminate future.

Now they are.  That's different.  And it causes exactly the problem I
outlined before.

For another analogy (sent to me by private email), consider close-on-
exec behavior for file descriptors.  One really good reason to use
this feature is because you're opening a file descriptor inside a
library, and the calling application knows nothing about the
descriptor.
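
For readers who haven't used the feature, a minimal C sketch of a library marking its private descriptor close-on-exec follows. The function name lib_open_private() is hypothetical and stands in for whatever the library does internally.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical library-internal helper: open a file for the library's
 * private use and mark it close-on-exec, so that a later fork/exec by
 * the application does not leak it into the new program.
 * Returns the descriptor, or -1 on error.
 */
int
lib_open_private(const char *path)
{
    int fd, flags;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return (-1);

    /* Preserve any existing descriptor flags and add FD_CLOEXEC. */
    flags = fcntl(fd, F_GETFD);
    if (flags < 0 || fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0) {
        (void) close(fd);
        return (-1);
    }
    return (fd);
}
```

Where the platform supports it, passing O_CLOEXEC to open() does the same thing in one step and avoids the window between open() and fcntl() in multithreaded programs.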

You do this because you don't want to share the descriptor accidentally.
If the caller naively (and reasonably, since he shouldn't be on the
hook to understand your internal implementation details) does
fork/exec, he'd accidentally hand off a stray descriptor to the new
process, with unpredictable results.  A similar problem occurs here.
SMF establishes a contract, so that it can track what belongs inside
the service.  But instead of keeping that detail private (which it
really couldn't do anyway), it counts on the members of the service
understanding what they must do to step outside, if need be.

What they must do is not something from ordinary UNIX, and it's not
something found on any other operating system.

>  > The root of this problem (it seems to me) is a lack of pervasive
>  > understanding of the implications of "contracts."  If you do anything
>  > non-trivial with respect to system daemons (where pipe(3C),
>  > system(3C), fork(2)/exec(2), and several other things count as
>  > "non-trivial"), you need to know how the contracts work so that you
>  > can _terminate_ the edge of the fault boundary where it belongs.
> 
>    You have three choices:
> 
>      1) Use service-wide fault detection.
> 
>      2) Turn off that fault detection (i.e. set ignore_errors) and do
>         whatever you would do to manage faults in the absence of SMF
>         and contracts, with exactly the same UNIX semantics the system
>         has always had.
> 
>      3) Use contracts to implement fine-grained fault detection.
> 
>    You only need to understand contracts if you have chosen 3, which is
>    to say you only need to understand contracts if you have explicitly
>    chosen to use contracts, which is tautological.

(1) makes no sense in this context, as the daemons that are being
launched are not actually part of the service, so that's not a
solution.  In fact it's the problem.

(2) is not the default.  You have to do something special to make it
happen, and most reasonable developers are not going to choose
non-default options just on a whim.  Instead, you have to understand
the implications of contracts *AND* you have to know that you *OR*
some grandchild of yours will be starting one of these new services in
order to choose this option and live with its implications.

(3) also isn't the default, nor is it part of the usual daemon design
pattern for UNIX systems.

In all of those cases, those designing daemon-like software to run on
Solaris must understand what contracts mean.  They currently don't --
as evidenced by the bug.  Just like the obscure "controlling terminal"
(mis)feature from the past, this puts a special burden on designers.

I'm not saying that it's not a warranted change or burden.  I'm not
saying we should somehow turn back time and undo the change.

I *am* saying that it's a complicated nuance that needs wider
exposure, and that I strongly believe that we'll be chasing down
interesting interactions like the ifconfig one for some time to come.

-- 
James Carlson, Solaris Networking              <james.d.carlson at sun.com>
Sun Microsystems / 35 Network Drive        71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N   Fax +1 781 442 1677
