[smf-discuss] Smarter testing in SMF?

Neil Garthwaite Thu, 18 May 2006 10:12:06 +0100

Nicolas Williams wrote:

>On Wed, May 17, 2006 at 03:53:53PM -0700, Richard Elling wrote:
>  
>
>>On Wed, 2006-05-17 at 16:48 -0500, Nicolas Williams wrote:
>>    
>>
>>>Or perhaps you could fire off a monitor from the start method of the
>>>actual service to be monitored using ctrun to run the monitor in its own
>>>process contract and restartably.  This avoids having a separate SMF
>>>service polluting the SMF service namespace.
>>>      
>>>
>>This can get a bit complicated.  Suppose FMA kills the monitor
>>contract and the monitor loses its state of the monitored service.
>>For simple monitors, such as "does the process exist," this won't
>>be a problem.  For a monitor which is making a database transaction,
>>then there needs to be enough smarts in the monitor to cancel any
>>in-flight transactions which might interfere with its analysis of
>>the database health.  It is not clear to me that stateless monitors 
>>will be more useful than the current method, so it might be somewhat
>>complex to write a good monitor.
>>    
>>
>
>I'm not sure what this has to do with either Liane's suggestion or mine.
>
>In both approaches the monitor 'method' runs in its own contract and
>with a restarter.  In both cases the monitor can die/be killed and will
>get restarted, and in both cases the monitor has to know what to do to
>recover.
>  
>


Well, I suspect the last point you've raised is the issue Richard is 
referring to when he writes:

"For a monitor which is making a database transaction, then there needs
to be enough smarts in the monitor to cancel any in-flight transactions
which might interfere with its analysis of the database health."


>The question was: how can I build a monitor facility given SMF.  Liane
>provided one answer, and described the long-term direction for SMF in
>this area.  I provided an alternative short-term answer.
>
>Perhaps the full-fledged SMF monitor facility can provide additional
>features that make monitor recovery from monitor restart easier, but I
>think that's a problem that SMF can't generally solve.
>
>  
>
>>Monitors also tend to have timeouts, which further complicates their
>>deployment.  It is not clear to me that we can avoid following the
>>current path of cluster monitors, even as they get more complicated
>>(eg. dynamically adjustable timeouts).  It might be better just to
>>implement a single-node cluster instead, when possible, thus 
>>leveraging the existing agents.
>>    
>>
>
>See above.
>  
>

I do feel that the functionality of existing cluster agents within a 
single-node cluster could be viewed as the target that SMF should aim 
for. Beyond the monitor debate, the concept of "wait_for_online" would 
generally help an SMF start method.

In this regard, consider SMF starting an application that takes a few 
minutes to start, yet whose start command returns almost immediately. 
The application is then not "ready for work" even though SMF would 
report it as online. The check that would need to be coded within the 
service or the SMF start script would be similar in functionality to 
the monitor. It is therefore likely that, when an SMF monitor is 
available, the start method would benefit from periodically calling the 
monitor, for the duration of the start timeout, before returning 
success (or failure) from the start method.
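
To make the "wait_for_online" idea concrete, here is a minimal sketch 
of the polling loop a start method could run after launching the 
application. It is an illustration only, not an SMF interface: the 
function name wait_for_online, the probe callable (standing in for 
whatever monitor check the service has) and the interval parameter are 
all hypothetical, and the timeout is assumed to be chosen no longer 
than the start method's own timeout.

```python
import time
from typing import Callable


def wait_for_online(probe: Callable[[], bool],
                    timeout: float,
                    interval: float = 5.0,
                    clock: Callable[[], float] = time.monotonic,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll probe() until it reports ready or `timeout` seconds elapse.

    probe    -- hypothetical health check; returns True once the
                application is actually "ready for work"
    timeout  -- assumed to be no longer than the start method timeout
    interval -- seconds to wait between probes
    clock and sleep are injectable so the loop can be tested without
    real delays.
    """
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(interval, remaining))
```

A start method written this way would launch the application, call 
wait_for_online() with the same check the monitor uses, and exit with 
success only if it returns True, so that "online" means "ready for 
work" rather than "start command returned".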

Again, I suspect the point being raised is that a single-node cluster 
leveraging existing agents is likely to set the functionality space to 
aim for. When SC3.2 is available, non-global zones will be supported, 
whereby existing cluster agents could be deployed within a single-node 
cluster within non-global zones. This then brings the added benefit of 
being able to have dependencies across non-global zones. In fact this 
exists today with SC3.1 08/05, however only for a subset of agents 
running within a non-global zone.

So, the initial question asked was:

"Can I make SMF smarter in how it decides whether a service has failed?"

In this regard the short-term proposal using ctrun is likely to be 
practical, although I suspect it would then open the door to more 
functionality. As the initial question stated:

"A cluster agent can issue synthetic transactions and do some fairly
sophisticated monitoring to decide whether a service / app is still
alive."

My point is that the original question already draws the comparison 
with a cluster agent. So, in summary, one answer to these questions 
could be a single-node cluster. Restart dependencies across non-global 
zones using the cluster framework/agents may be more functionality than 
is needed, yet one could equally argue that it represents another 
solution where the focus is simply on providing services for an 
application.

Personally, I think functionality is what is being asked for. The tools 
to achieve this currently overlap and will hopefully converge in time.

Regards
Neil
