David Powell writes:
> James Carlson wrote:
> > 'nwam' occasionally execs 'ifconfig'. As a simple executable, this
> > isn't such a big deal. A hidden nasty, though, is that ifconfig can
> > fork/exec long-lived daemons that provide global services. (This is
> > the "on demand" bit again.)
[...]
> What's special about ifconfig that frees it from the collection of
> "all processes started by the service"?

There's nothing special about ifconfig. It doesn't leave the collection
at all.

However, the daemons that it starts (dhcpagent and in.mpathd) are in
fact "special." They're special in the same way that all common
service-providing daemons are special: they close descriptors and do
the fork-setsid-fork dance in order to _try_ to get out from under the
thumb of the caller (though, obviously, that doesn't work with
contracts), and then they provide a service to the system.

That service they're providing is global. Some other, completely
unrelated caller can come in later and (through an IPC) send a request
to the existing daemon. It doesn't have to start a new daemon for its
request. By that logic, the original invoker (who exec'd ifconfig which
caused the fork/exec of the daemon) is *not* responsible in any way for
the progress of the daemon or the consequences of subsequent requests.

The problem here is that there's really nothing unusual going on from a
traditional UNIX viewpoint. It's not at all wrong to fork a daemon into
the background. It's not wrong to disclaim your connection from the
original invoker so that you can set off into your own world.

Unfortunately, it's no longer possible to do those things on Solaris
simply by using the traditional UNIX mechanisms. To get out of your
invoker's contract, you must do something special. That "something
special" could be disabling the error protection using the SMF manifest
(as you've suggested), or it could be creating a new contract in the
parent (if ifconfig somehow "knows" about the problem), or it could be
an enhanced daemon()-like interface that _really_ disclaims attachment
to the original process (which we don't have). But whatever it is, it's
not the default, and it's not sufficient to rely on the previously
well-known design patterns.

> The service does a variety of things including starting other
> processes that are automatically considered part of the service.
> Some of those processes failed.
> The coarse-grained (compared to the operations performed by the
> service) fault management took out the entire service.

Correct. It's the "considered part of the service" that happens to be
incorrect here.

> I don't see why understanding how contracts function really matters
> here. Unless your *service* is using contracts (which would imply
> you *have* an understanding of contracts), new processes will always
> belong to the service they were created by. Period. End of story.

Correct. That's exactly the problem.

Things that did the equivalent of daemon() in the past were *NOT*
expecting to be part of any "service" collection, and those who were
invoking them were *NOT* expecting to be called to account for those
things at any point in the indeterminate future. Now they are.

That's different. And it causes exactly the problem I outlined before.

For another analogy (sent to me by private email), consider
close-on-exec behavior for file descriptors. One really good reason to
use this feature is that you're opening a file descriptor inside a
library, and the calling application knows nothing about the
descriptor. You do this because you don't want to share the descriptor
accidentally. Without it, if the caller naively (and reasonably, since
he shouldn't be on the hook to understand your internal implementation
details) does fork/exec, he'd accidentally hand off a stray descriptor
to the new process, with unpredictable results.

A similar problem occurs here. SMF establishes a contract, so that it
can track what belongs inside the service. But instead of keeping that
detail private (which it really couldn't do anyway), it counts on the
members of the service understanding what they must do to step outside,
if need be. What they must do is not something from ordinary UNIX, and
it's not something found on any other operating system.

> > The root of this problem (it seems to me) is a lack of pervasive
> > understanding of the implications of "contracts."
> > If you do anything non-trivial with respect to system daemons
> > (where pipe(3C), system(3C), fork(2)/exec(2), and several other
> > things count as "non-trivial"), you need to know how the contracts
> > work so that you can _terminate_ the edge of the fault boundary
> > where it belongs.
>
> You have three choices:
>
> 1) Use service-wide fault detection.
>
> 2) Turn off that fault detection (i.e. set ignore_errors) and do
>    whatever you would do to manage faults in the absence of SMF
>    and contracts, with exactly the same UNIX semantics the system
>    has always had.
>
> 3) Use contracts to implement fine-grained fault detection.
>
> You only need to understand contracts if you have chosen 3, which is
> to say you only need to understand contracts if you have explicitly
> chosen to use contracts, which is tautological.

(1) makes no sense in this context, as the daemons that are being
launched are not actually part of the service, so that's not a
solution. In fact it's the problem.

(2) is not the default. You have to do something special to make it
happen, and most reasonable developers are not going to choose
non-default options just on a whim. Instead, you have to understand the
implications of contracts *AND* you have to know that you *OR* some
grandchild of yours will be starting one of these new services in order
to choose this option and live with its implications.

(3) also isn't the default, nor is it part of the usual daemon design
pattern for UNIX systems.

In all of those cases, those designing daemon-like software to run on
Solaris must understand what contracts mean. They currently don't -- as
evidenced by the bug.

Just like the obscure "controlling terminal" (mis)feature from the
past, this puts a special burden on designers.

I'm not saying that it's not a warranted change or burden. I'm not
saying we should somehow turn back time and undo the change.
I *am* saying that it's a complicated nuance that needs wider exposure,
and that I strongly believe that we'll be chasing down interesting
interactions like the ifconfig one for some time to come.

--
James Carlson, Solaris Networking              <james.d.carlson at sun.com>
Sun Microsystems / 35 Network Drive        71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N   Fax +1 781 442 1677
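
For readers who have not seen it written out, the traditional
fork-setsid-fork dance discussed above looks roughly like the sketch
below. This is only an illustration under stated assumptions:
detach_and_report() is a hypothetical helper (it is not code from
ifconfig, dhcpagent, or in.mpathd), and the pipe is there only so the
caller can observe which session the detached process ended up in. The
sketch uses nothing but portable UNIX calls, which is the point: it
moves the process into a new session, but nothing in it touches the
process contract, so on Solaris the detached process would still be
counted against the invoker's service.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run the classic fork-setsid-fork sequence and
 * report the session ID of the detached grandchild back to the caller
 * through a pipe.  Returns 0 on success, -1 on failure.
 */
int
detach_and_report(pid_t *daemon_sid)
{
    int pfd[2];
    pid_t child, grandchild, sid;
    ssize_t n;

    if (pipe(pfd) == -1)
        return (-1);

    child = fork();
    if (child == -1) {
        (void) close(pfd[0]);
        (void) close(pfd[1]);
        return (-1);
    }

    if (child == 0) {
        (void) close(pfd[0]);

        /*
         * A freshly forked child is never a process group leader,
         * so setsid() can create a new session with no controlling
         * terminal.
         */
        if (setsid() == -1)
            _exit(1);

        /*
         * Fork again so the surviving process is not a session
         * leader and cannot reacquire a controlling terminal.
         */
        grandchild = fork();
        if (grandchild == -1)
            _exit(1);
        if (grandchild > 0)
            _exit(0);   /* intermediate process goes away */

        /*
         * Grandchild: a real daemon would start serving here.
         * Nothing above changed contract membership.
         */
        sid = getsid(0);
        if (write(pfd[1], &sid, sizeof (sid)) != (ssize_t)sizeof (sid))
            _exit(1);
        _exit(0);
    }

    /* Original caller: collect the report and reap the first child. */
    (void) close(pfd[1]);
    n = read(pfd[0], daemon_sid, sizeof (*daemon_sid));
    (void) close(pfd[0]);
    (void) waitpid(child, NULL, 0);

    return (n == (ssize_t)sizeof (*daemon_sid) ? 0 : -1);
}
```

A caller can compare the reported session ID with its own getsid(0) to
confirm that the session changed. Stepping outside the contract as well
would take one of the Solaris-specific measures listed earlier, which
this sketch deliberately leaves out.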