Hi Liane,

Please see my comments below.
Regards,
Neil

On 18 Apr 2008, at 19:03, Liane Praza wrote:

> Renaud Manus wrote:
>> This is RFE 6450661 "start method failures should be configurable"
>
> Yep, mostly. For RFE 6450661 (and its friends 6197273 and 6219078,
> which should probably be consolidated), I haven't yet seen a
> specific request that wouldn't be better solved by more intelligent
> restart detection by startd rather than allowing significant
> tweaking of the algorithm through configuration.
>
> So, I would like a clarification on the scenario below.
>
>> Neil Garthwaite wrote:
>>> Hi,
>>>
>>> After reading the svc.startd(1M) man page, and in particular
>>> "SERVICE FAILURE", I'm trying to find out whether I can influence
>>>
>>> "...
>>> If three method failures happen in a row, or if the service
>>> is restarting more than once a second, svc.startd places the
>>> service in the maintenance state.
>>> ..."
>>>
>>> It appears that even a transient service gets retried three times
>>> if the start method fails to exit within the start timeout, i.e.
>>> the start times out. Basically, I have an SMF service which works
>>> fine. However, I'm now injecting faults to determine how robust
>>> it is, and to that end the start method checks that the
>>> application is really up and available for work before exiting.
>>>
>>> In one particular fault-injected case, my SMF service consumes
>>> all of its start timeout and then times out. However, it gets
>>> restarted three times before entering the maintenance state. If
>>> all of the start timeout is consumed, I would simply like the SMF
>>> service to enter the maintenance state and not be retried three
>>> times.
>>>
>>> So, is there a property I can use to influence the number of
>>> retries for start?
>
> If we had general fail-once semantics (that is, any failure of this
> service would cause it to enter maintenance), would that satisfy
> your request?

Yes. Furthermore, for my particular usage I'd be happy if that
applied to either a contract or a transient service, i.e. not
limited to, say, just a transient service.

> Though, as an aside, I am interested in how this works in real
> life. If the transient service fails, and enters maintenance, what
> will the administrator do differently than your stop method script
> to clean up so that they don't have to reboot to repair the service?

Well, I hope this doesn't appear too messy. I'm using SMF with Sun
Cluster. In particular, in SC we have an agent that can fail over
non-global zones (S10 native as well as S8 and lx branded zones)
between SC nodes. In addition to failing over the non-global zone,
if that non-global zone is an S10 zone then we can also
enable/disable an SMF service within the "failover" zone.

In this regard, SC manages the SMF enable/disable and additionally
probes the application that was started by the SMF service. This
allows the probe to detect a wedged or otherwise bad application and
signal back to SC to either perform a local restart or initiate a
failover to another SC node. While we have some "linkage" between SC
and the "failover" zone to do this, we predominantly let SMF do its
thing, i.e. dependencies and start/stop/restart as appropriate (see
the illustration just below).

The issue I have is that when SC decides to enable an SMF service
within a failover zone, there may be all sorts of conditions [read:
faults] within the environment that cause the SMF service to fail or
time out.
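(As a concrete illustration of that linkage, for enable/disable SC
effectively runs the equivalent of the following against the
failover zone; the zone and service names here are hypothetical:

    # Enable the service in the failover zone; -s makes svcadm wait
    # for the instance to come online (or fail) before returning.
    zlogin failover-zone svcadm enable -s svc:/application/myapp:default

    # Disable it again when SC stops or relocates the resource.
    zlogin failover-zone svcadm disable svc:/application/myapp:default

The probe is separate and talks directly to the application.)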
In this scenario, Sun Cluster takes the view that if the application
(SMF service) fails to start within the allocated timeout and within
the failover zone, then the start should either be faulted and
require user intervention, or failed over to another SC node where
the start is retried, or the node rebooted and the start retried on
another SC node. The bottom line is that SC applies a fail-once
semantic (per node) at initial start.

Once the application (SMF service) has started successfully on one
node, the semantics change a little: essentially, the application
can be restarted several times within a sliding time window, where
the number of retries and the window are user-configurable
properties of the SC resource managing the application. If the
number of retries is exceeded within the window, then a failover is
performed.

So, essentially, I'd like the SMF service to be able to fail just
once upon initial startup. While I accept this may be a corner case,
we currently provide this type of service for several Sun Cluster
agents, e.g. WebSphere MQ, Apache Tomcat, MySQL, PostgreSQL,
Informix, Samba and others I can't recall at this time, where we
also provide the SMF manifest as well as the probe for the
application.

As mentioned in my earlier email, the SMF service works fine, very
well in fact. The scenario I'm trying to refine is when the SMF
service fails to start cleanly for the first time after boot. If the
SMF service only failed once upon startup, I would not need my
workaround.

Please note that SC's usage of zones is not confined to "failover"
zones; SC predominantly uses zones that represent virtual nodes to
an application, whereby that application can fail over between nodes
and zones. Failover zones just represent an additional choice.

> (Most of the requests I've heard for fail-once semantics have been
> from administrators rather than service authors, because the
> administrators don't trust that the services can properly clean up
> after themselves.)
>
>>> My only thought to achieve only one start is to keep a count of
>>> the consumed time, i.e. the ksh variable $SECONDS, and then exit
>>> with $SMF_EXIT_ERR_CONFIG or $SMF_EXIT_ERR_FATAL just before the
>>> service times out.
>
> Yep, that's probably the best workaround for now.

Thanks.

>>> I was hoping a transient service would be exempt from being
>>> restarted three times, but it appears not. I would appreciate any
>>> thoughts on how to achieve only one start, or other suggestions
>>> on my idea of keeping a count within the start method.
>
> liane
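P.S. For reference, here's a minimal sketch of the $SECONDS
workaround as it sits in my start method today (the application
paths, probe command and deadline value are placeholders for the
real ones, and it assumes timeout_seconds=300 in the manifest):

    #!/bin/ksh
    . /lib/svc/share/smf_include.sh

    # Keep the deadline safely below the manifest's timeout_seconds
    # so we exit before svc.startd times the method out and
    # schedules a retry.
    DEADLINE=290

    # Placeholder application startup.
    /opt/myapp/bin/start_myapp

    # ksh's $SECONDS counts seconds since the shell started, so it
    # tracks the time this start method has consumed.
    while ! /opt/myapp/bin/probe_myapp; do
            if (( SECONDS >= DEADLINE )); then
                    # A fatal exit status places the service
                    # directly into maintenance instead of being
                    # retried.
                    exit $SMF_EXIT_ERR_FATAL
            fi
            sleep 5
    done

    exit $SMF_EXIT_OK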