Sebastien Roy wrote:
> Perhaps an SMF expert can shed some light into the following svc.startd
> behavior:
>
> A system is running NWAM, and therefore the network/physical:nwam
> service is enabled, and the nwamd daemon is running. The nwamd daemon
> configures a network interface by exec'ing "ifconfig <intf> dhcp start",
> which causes ifconfig to in turn exec dhcpagent.
>
> At some point later, dhcpagent dies a horrible death by way of SIGSEGV
> and dumps core due to a bug (obviously).
>
> At this point, svc.startd somehow notices the dhcpagent crash and for
> some reason decides that the system would be better off if the
> network/physical:nwam service were restarted. It prints the following
> anonymous message in /var/svc/log/network-physical:nwam.log:
>
> "Stopping because process dumped core."
>
> (It would be nice if svc.startd were a bit more specific in that log
> message, but that's not the core issue.)
Agreed.

> It proceeds to stop and start
> network/physical:nwam. Why does it do this? Is nwamd not to be trusted
> to notice that it's unable to acquire a DHCP lease on this interface and
> deal with this on its own? nwamd is likely capable of noticing that
> something went wrong with the network interface it was responsible for
> and to either retry to acquire a lease, or try on another network
> interface. Even if it's not, it's not inconceivable that it could be.

SMF will trust nwamd to handle failures of the commands it runs, but
nwamd needs to communicate that it deserves that trust. There are two
ways of doing that:

a) "I'm moving out"

   Use startd/ignore_error to tell svc.startd to ignore processes that
   dump core. Note that this will also ignore nwamd when it dumps core.

b) Take responsibility

   If a subcomponent may fail and you don't want its failures to be
   conflated with nwamd's, start it in a separate contract.

Dave
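
For reference, option (a) could look roughly like the fragment below in the service's manifest. This is only a sketch: adding it to the network/physical:nwam manifest is an assumption made for illustration, not something the actual manifest is known to contain. As described in svc.startd(1M), startd/ignore_error takes a comma-separated list drawn from "core" and "signal"; only "core" is needed for the case discussed here.

```xml
<!-- Illustrative fragment: tells svc.startd not to treat a core dump
     by any process in the service's contract as a service failure.
     Note that this covers nwamd itself, not just dhcpagent. -->
<property_group name='startd' type='framework'>
    <propval name='ignore_error' type='astring' value='core' />
</property_group>
```

The property only takes effect once the manifest has been re-imported or the service refreshed. Option (b) avoids the blanket behaviour but requires nwamd to create a new process contract before it execs ifconfig.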