David Powell writes:
> James Carlson wrote:
> > As we've seen, that's far easier to say than to do. The contract
> > inheritance and its implications are not well-understood by the folks
> > maintaining all code that runs on Solaris.
>
> You don't need to understand contracts to tell SMF that it should
> ignore signals from other services and core files for a service. I
> haven't seen much misunderstanding that all processes started by a
> service are tracked by that service.
>
> Unless you want finer-grained control over error handling than the
> service abstraction allows, and are therefore actively creating
> contracts yourself, "contract inheritance and its implications" are
> just implementation details of how services are maintained.

It's no mere detail, because we've seen it escape from the
implementation unintentionally.

Perhaps we've lost context here. The original problem we saw was this:

The 'nwam' service starts up. It reasonably (I think) has the default
SMF attributes, because if it goes south, we do indeed want to have the
service restarted.

'nwam' occasionally execs 'ifconfig'. As a simple executable, this
isn't such a big deal. A hidden nasty, though, is that ifconfig can
fork/exec long-lived daemons that provide global services. (This is
the "on demand" bit again.)

Ordinarily, that behavior of ifconfig would just be an internal
implementation detail. However, in this case, it's not, because those
new background processes end up in the same contract as nwam.

If one of those global services takes a fault, SMF turns around and
puts a slug in nwam's head. Nwam didn't do anything wrong, restarting
it won't fix anything (and in fact probably hurts), but nwam takes the
fall anyway.

The root of this problem (it seems to me) is a lack of pervasive
understanding of the implications of "contracts." If you do anything
non-trivial with respect to system daemons (where pipe(3C), system(3C),
fork(2)/exec(2), and several other things count as "non-trivial"), you
need to know how the contracts work so that you can _terminate_ the
edge of the fault boundary where it belongs. Failing to do so has
puzzling and obscure results.

In that respect, it's very much similar to the POSIX "controlling
terminal" feature. It's not well-enough-known, and it can earn you
signals you weren't expecting.

> > ... but guess what it fails to mention.
>
> Please file a bug.

Done, though that wasn't really the point. See CR 6696703.

--
James Carlson, Solaris Networking <james.d.carlson at sun.com>
Sun Microsystems / 35 Network Drive 71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677