Nicolas Williams wrote:

>On Wed, May 17, 2006 at 03:53:53PM -0700, Richard Elling wrote:
>
>>On Wed, 2006-05-17 at 16:48 -0500, Nicolas Williams wrote:
>>
>>>Or perhaps you could fire off a monitor from the start method of the
>>>actual service to be monitored using ctrun to run the monitor in its own
>>>process contract and restartably. This avoids having a separate SMF
>>>service polluting the SMF service namespace.
>>
>>This can get a bit complicated. Suppose FMA kills the monitor
>>contract and the monitor loses its state of the monitored service.
>>For simple monitors, such as "does the process exist," this won't
>>be a problem. For a monitor which is making a database transaction,
>>then there needs to be enough smarts in the monitor to cancel an
>>in-flight transactions which might interfere with its analysis of
>>the database health. It is not clear to me that stateless monitors
>>will be more useful than the current method, so it might be somewhat
>>complex to write a good monitor.
>
>I'm not sure what this has to do with either Liane's suggestion or mine.
>
>In both approaches the monitor 'method' runs in its own contract and
>with a restarter. In both cases the monitor can die/be killed and will
>get restarted, and in both cases the monitor has to know what to do to
>recover.

Well, I suspect the last point you've raised is the issue Richard is
referring to when he writes:

  "For a monitor which is making a database transaction, then there
  needs to be enough smarts in the monitor to cancel an in-flight
  transactions which might interfere with its analysis of the
  database health."

>The question was: how can I build a monitor facility given SMF. Liane
>provided one answer, and described the long-term direction for SMF in
>this area. I provided an alternative short-term answer.
>
>Perhaps the full-fledged SMF monitor facility can provide additional
>features that make monitor recovery from monitor restart easier, but I
>think that's a problem that SMF can't generally solve.
>
>>Monitors also tend to have timeouts, which further complicates their
>>deployment. It is not clear to me that we can avoid following the
>>current path of cluster monitors, even as they get more complicated
>>(eg. dynamically adjustable timeouts). It might be better just to
>>implement a single-node cluster instead, when possible, thus
>>leveraging the existing agents.
>
>See above.

I do feel that the functionality of existing cluster agents within a
single-node cluster is the target that SMF should aim for.

Beyond the monitor debate, a "wait_for_online" concept would generally
help an SMF start method. Consider SMF starting an application that
takes a few minutes to start, yet whose start command returns almost
immediately. The application is not "ready for work" even though SMF
reports it as online. The check that would need to be coded within the
service or the SMF start script would be similar in functionality to
the monitor. It is therefore likely that, once an SMF monitor is
available, the start method would benefit from calling the monitor
periodically, for the duration of the start timeout, before returning
success (or failure).
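
To make the "wait_for_online" idea concrete, here is a minimal sketch in
Python. It is not an existing SMF or cluster interface: the function
name, its parameters, and the probe callable are all made up for
illustration, and the probe stands in for whatever check a future
monitor would perform. It polls the probe until it reports success or
until the start timeout has been used up.

```python
import time
from typing import Callable


def wait_for_online(probe: Callable[[], bool],
                    start_timeout: float,
                    interval: float = 5.0,
                    clock: Callable[[], float] = time.monotonic,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll probe() until it succeeds or start_timeout seconds elapse.

    probe         -- hypothetical health check; returns True once the
                     application is actually ready for work
    start_timeout -- polling budget in seconds (an assumption: chosen by
                     the caller, ideally a little below the method
                     timeout configured for the service)
    interval      -- pause between probes, in seconds
    clock, sleep  -- injectable so the loop can be tested without waiting
    """
    deadline = clock() + start_timeout
    while True:
        if probe():
            return True            # application is ready for work
        remaining = deadline - clock()
        if remaining <= 0:
            return False           # budget exhausted, still not ready
        # Never sleep past the deadline.
        sleep(min(interval, remaining))
```

A start method would launch the application, call something like
wait_for_online(my_probe, 170) and exit zero or non-zero depending on
the result. The budget is kept below the service's own start timeout on
the assumption that the method should report failure itself rather than
be killed when that timeout expires.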

Again, I suspect the point being raised is that a single-node cluster
leveraging existing agents is likely to set the level of functionality
to aim for.

When SC3.2 is available, non-global zones will be supported, whereby
existing cluster agents could be deployed within a single-node cluster
within non-global zones. This brings the added benefit of being able to
have dependencies across non-global zones. In fact this exists today
with SC3.1 08/05, however only for a subset of agents running within a
non-global zone.

So, the initial question asked was: can I make SMF smarter in how it
decides whether a service has failed? In this regard the short-term
proposal using ctrun is likely to be practical, although I suspect it
would then open the door to requests for more functionality. As the
initial question stated:

  "A cluster agent can issue synthetic transactions and do some fairly
  sophisticated monitoring to decide whether a service / app is still
  alive."

My point being that a comparison with a cluster agent has already been
made.

So, in summary, one answer to these questions could be a single-node
cluster. Restart dependencies across non-global zones using the cluster
framework/agents may be too much functionality, yet one could equally
argue that it represents another solution where the focus is simply on
providing services for an application.

Personally, I think functionality is what is being asked for. The tools
to achieve this currently overlap and hopefully will converge in time.

Regards
Neil

>_______________________________________________
>smf-discuss mailing list
>smf-discuss at opensolaris.org