David Powell writes:
> James Carlson wrote:
>  > As we've seen, that's far easier to say than to do.  The contract
>  > inheritance and its implications are not well-understood by the folks
>  > maintaining all code that runs on Solaris.
> 
>    You don't need to understand contracts to tell SMF that it should
>    ignore signals from other services and core files for a service.  I
>    haven't seen much misunderstanding that all processes started by a
>    service are tracked by that service.
> 
>    Unless you want finer-grained control over error handling than the
>    service abstraction allows, and are therefore actively creating
>    contracts yourself, "contract inheritance and its implications" are
>    just implementation details of how services are maintained.

It's no mere detail, because we've seen it escape from the
implementation unintentionally.  Perhaps we've lost context here.  The
original problem we saw was this:

The 'nwam' service starts up.  It reasonably (I think) has the default
SMF attributes, because if it goes south, we do indeed want to have
the service restarted.

'nwam' occasionally execs 'ifconfig'.  As a simple executable, this
isn't such a big deal.  A hidden nasty, though, is that ifconfig can
fork/exec long-lived daemons that provide global services.  (This is
the "on demand" bit again.)

Ordinarily, that behavior of ifconfig would just be an internal
implementation detail.  However, in this case, it's not, because those
new background processes end up in the same contract as nwam.

If one of those global services takes a fault, SMF turns around and
puts a slug in nwam's head.  nwam didn't do anything wrong, and
restarting it won't fix anything (in fact, it probably hurts), but
nwam takes the fall anyway.
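
To make the chain of blame concrete, here is a toy model in C.  It is
only a sketch of the bookkeeping as I've described it, not the real
libcontract interface: the names start_service(), model_fork(), and
fault_blames() are made up for illustration, and "daemon" stands in
for whatever ifconfig happens to launch.

```c
#include <assert.h>

#define	MAX_PROCS	16

struct proc {
	const char *name;
	int contract;		/* id of the contract this process lives in */
};

struct model {
	struct proc procs[MAX_PROCS];
	int nprocs;
	int next_contract;	/* last contract id handed out */
};

static int
add_proc(struct model *m, const char *name, int contract)
{
	if (m->nprocs >= MAX_PROCS)
		return (-1);
	m->procs[m->nprocs].name = name;
	m->procs[m->nprocs].contract = contract;
	return (m->nprocs++);
}

/* The restarter starts a service's first process in a fresh contract. */
int
start_service(struct model *m, const char *name)
{
	return (add_proc(m, name, ++m->next_contract));
}

/*
 * fork: the child joins the parent's contract unless the parent asked
 * for a new one first (new_contract != 0).
 */
int
model_fork(struct model *m, int parent, const char *name, int new_contract)
{
	int ct;

	assert(parent >= 0 && parent < m->nprocs);
	ct = new_contract ? ++m->next_contract : m->procs[parent].contract;
	return (add_proc(m, name, ct));
}

/* Nonzero if a fault in 'victim' lands in the contract 'service' runs in. */
int
fault_blames(const struct model *m, int victim, int service)
{
	assert(victim >= 0 && victim < m->nprocs);
	assert(service >= 0 && service < m->nprocs);
	return (m->procs[victim].contract == m->procs[service].contract);
}
```

Start "nwam" with start_service(), fork "ifconfig" from it and
"daemon" from that, all with new_contract set to 0, and fault_blames()
says a fault in "daemon" is charged to nwam.  Fork the daemon with
new_contract set to 1 instead and nwam is off the hook.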

The root of this problem (it seems to me) is a lack of pervasive
understanding of the implications of "contracts."  If you do anything
non-trivial with respect to system daemons (where popen(3C),
system(3C), fork(2)/exec(2), and several other things count as
"non-trivial"), you need to know how the contracts work so that you
can _terminate_ the edge of the fault boundary where it belongs.

Failing to do so has puzzling and obscure results.
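
For what it's worth, terminating the boundary looks roughly like the
sketch below: activate a process contract template before fork(2) so
that the child lands in a new contract, then clear the template
afterwards.  Treat it as an illustration under assumptions, not as
what nwam or ifconfig actually do.  spawn_in_own_contract() and the
two helpers are hypothetical names; the event sets are one plausible
choice, not a recommendation; and the non-Solaris stubs exist only so
the sketch compiles elsewhere.  On Solaris it links with -lcontract.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>		/* for callers that waitpid(2) the child */
#include <unistd.h>

#if defined(__sun)
#include <libcontract.h>
#include <sys/contract/process.h>
#include <sys/ctfs.h>

/* Activate a template so the next fork() creates a new process contract. */
static int
contract_pre_fork(void)
{
	int fd = open(CTFS_ROOT "/process/template", O_RDWR);

	if (fd == -1)
		return (-1);
	if (ct_pr_tmpl_set_param(fd, CT_PR_PGRPONLY) != 0 ||
	    ct_pr_tmpl_set_fatal(fd, CT_PR_EV_HWERR) != 0 ||
	    ct_tmpl_set_critical(fd, 0) != 0 ||
	    ct_tmpl_set_informative(fd, CT_PR_EV_HWERR) != 0 ||
	    ct_tmpl_activate(fd) != 0) {
		(void) close(fd);
		return (-1);
	}
	return (fd);
}

/* Deactivate the template so later forks go back to inheriting. */
static void
contract_post_fork(int fd)
{
	(void) ct_tmpl_clear(fd);
	(void) close(fd);
}
#else
/* Stand-ins so the sketch compiles where contracts do not exist. */
static int contract_pre_fork(void) { return (0); }
static void contract_post_fork(int fd) { (void) fd; }
#endif

/* Hypothetical helper: fork/exec 'path' outside the caller's contract. */
pid_t
spawn_in_own_contract(const char *path, char *const argv[])
{
	int tmpl;
	pid_t pid;

	assert(path != NULL && argv != NULL);
	if ((tmpl = contract_pre_fork()) == -1)
		return (-1);
	pid = fork();
	contract_post_fork(tmpl);	/* parent, child, and fork failure */
	if (pid == 0) {
		(void) execv(path, argv);
		_exit(127);		/* exec failed */
	}
	return (pid);
}
```

Note that the caller ends up holding the new contract.  A real daemon
would still have to decide what to do with it (abandon it with
ct_ctl_abandon(3CONTRACT), for example), and the sketch leaves that
part out.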

In that respect, it's very similar to the POSIX "controlling
terminal" feature.  It's not well enough known, and it can earn you
signals you weren't expecting.

>  > ... but guess what it fails to mention.
> 
>    Please file a bug.

Done, though that wasn't really the point.  See CR 6696703.

-- 
James Carlson, Solaris Networking              <james.d.carlson at sun.com>
Sun Microsystems / 35 Network Drive        71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N   Fax +1 781 442 1677
