James Carlson wrote:

> SMF/contractfs supposes (by default) that a core dump of any of the
> processes means the whole group is dead and should be restarted.  This
> has the side-effect of meaning that any state associated with the
> entire group is at the mercy of the weakest member of the group, even
> if that member was invoked *unwittingly* by one of the other members.

Yep.  And Dave already covered, in the first message of this thread,
how you can opt out of that behaviour as either a service developer or
an admin.

This default is how we make the entire system more fault tolerant in
the face of things like uncorrectable memory errors.  Remember, in S9 we
had to restart the entire system if an uncorrectable error was caught in
a user process, because we didn't know how far the memory corruption
might have spread.  SMF attempts to set the boundaries that are safest
for unmodified programs, and allows programs willing to be savvy
about their communication and fault boundaries with worker
threads/processes to declare their intent through service properties,
contract library calls, or ctrun.

But I'm not really sure what you're trying to accomplish with this
discussion, Jim.  Are you proposing that we undo this default, which
was set in S10, and break the programs and service definitions that
were built on it?

I do understand your concerns, but I believe that this was one of those
tradeoff calls.  Either we make things behave in the manner least likely
to spread corruption when a fault occurs, or we assume that service
authors who need to protect against spreading corruption will modify
their service in a Solaris-specific way to do so.  We attempted to
choose safety, predictability, and as much Self-Healing as possible
in the face of unmodified programs.  Solaris-specific programs (e.g.
nwam, which started this whole kerfuffle) have a variety of mechanisms
available to tell the system that they know how to handle faults
in certain subprocesses.  Services unwilling to change at the source
level for Solaris (e.g. sendmail) also have a way to opt out.

liane
