James Carlson wrote:
 > David Powell writes:
 >> James Carlson wrote:
 >>  > As we've seen, that's far easier to say than to do.  The contract
 >>  > inheritance and its implications are not well-understood by the folks
 >>  > maintaining all code that runs on Solaris.
 >>
 >>    You don't need to understand contracts to tell SMF that it should
 >>    ignore signals from other services and core files for a service.  I
 >>    haven't seen much misunderstanding that all processes started by a
 >>    service are tracked by that service.
 >>
 >>    Unless you want finer-grained control over error handling than the
 >>    service abstraction allows, and are therefore actively creating
 >>    contracts yourself, "contract inheritance and its implications" are
 >>    just implementation details of how services are maintained.
 >
 > It's no mere detail, because we've seen it escape from the
 > implementation unintentionally.  Perhaps we've lost context here.  The
 > original problem we saw was this:
 >
 > The 'nwam' service starts up.  It reasonably (I think) has the default
 > SMF attributes, because if it goes south, we do indeed want to have
 > the service restarted.
 >
 > 'nwam' occasionally execs 'ifconfig'.  As a simple executable, this
 > isn't such a big deal.  A hidden nasty, though, is that ifconfig can
 > fork/exec long-lived daemons that provide global services.  (This is
 > the "on demand" bit again.)
 >
 > Ordinarily, that behavior of ifconfig would just be an internal
 > implementation detail.  However, in this case, it's not, because those
 > new background processes end up in the same contract as nwam.
 >
 > If one of those global services takes a fault, SMF turns around and
 > puts a slug in nwam's head.  Nwam didn't do anything wrong, restarting
 > it won't fix anything (and in fact probably hurts), but nwam takes the
 > fall anyway.

   What's special about ifconfig that frees it from the collection of
   "all processes started by the service"?

   The service does a variety of things including starting other
   processes that are automatically considered part of the service.
   Some of those processes failed.  The coarse-grained (compared to the
   operations performed by the service) fault management took out the
   entire service.

   I don't see why understanding how contracts function really matters
   here.  Unless your *service* is using contracts (which would imply
   you *have* an understanding of contracts), new processes will always
   belong to the service they were created by.  Period.  End of story.

 > The root of this problem (it seems to me) is a lack of pervasive
 > understanding of the implications of "contracts."  If you do anything
 > non-trivial with respect to system daemons (where pipe(3C),
 > system(3C), fork(2)/exec(2), and several other things count as
 > "non-trivial"), you need to know how the contracts work so that you
 > can _terminate_ the edge of the fault boundary where it belongs.

   You have three choices:

     1) Use service-wide fault detection.

     2) Turn off that fault detection (i.e. set startd/ignore_error) and
        do whatever you would do to manage faults in the absence of SMF
        and contracts, with exactly the same UNIX semantics the system
        has always had.

     3) Use contracts to implement fine-grained fault detection.
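
   As a minimal sketch of choice 2, assuming the service is delivered
   with an SMF manifest: the `startd` framework property group with an
   `ignore_error` value of `core,signal` asks svc.startd not to treat a
   core dump or a fatal signal in a member process as a reason to
   restart the service.  The fragment would sit inside the service's
   `<service>` or `<instance>` element; everything else about the
   manifest is left out here.

   ```xml
   <!-- Choice 2: opt this service out of contract-wide fault handling.
        With this set, svc.startd ignores core dumps and signal deaths
        of processes in the service's contract, so the service itself
        must detect and recover from such failures. -->
   <property_group name='startd' type='framework'>
       <propval name='ignore_error' type='astring' value='core,signal' />
   </property_group>
   ```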

   You only need to understand contracts if you have chosen 3, which is
   to say you only need to understand contracts if you have explicitly
   chosen to use contracts, which is tautological.

   Dave

