Hi Liane,

Please see my comments below.
Regards,
Neil

On 18 Apr 2008, at 19:03, Liane Praza wrote:

> Renaud Manus wrote:
>> This is RFE 6450661 "start method failures should be configurable"
>
> Yep, mostly. For RFE 6450661 (and its friends 6197273 and 6219078,
> which should probably be consolidated), I haven't yet seen a
> specific request that wouldn't be better solved by more intelligent
> restart detection by startd rather than allowing significant
> tweaking of the algorithm through configuration.
>
> So, I would like a clarification on the scenario below.
>
>> Neil Garthwaite wrote:
>>> Hi,
>>>
>>> After reading the svc.startd(1M) man page, and in particular
>>> "SERVICE FAILURE", I'm trying to find out whether I can influence
>>>
>>> "...
>>> If three method failures happen in a row, or if the service
>>> is restarting more than once a second, svc.startd places the
>>> service in the maintenance state.
>>> ..."
>>>
>>> It appears that even a transient service gets retried three times
>>> if the start method fails to exit within the start timeout, i.e.
>>> the start times out. Basically, I have an SMF service which works
>>> fine. However, I'm now injecting faults to determine how robust
>>> it is, and to that end the start method checks that the
>>> application is really up and available for work before exiting.
>>>
>>> In one particular fault-injected case, my SMF service consumes
>>> all of its start timeout and then times out. However, it gets
>>> restarted three times before entering the maintenance state. If
>>> all of the start timeout is consumed, I would simply like the SMF
>>> service to enter the maintenance state and not be retried three
>>> times.
>>>
>>> So, is there a property I can use to influence the number of
>>> retries for start?
>
> If we had general fail-once semantics (that is, any failure of this
> service would cause it to enter maintenance), would that satisfy
> your request?

Yes. Furthermore, for my particular usage I'd be happy if that
applied to either a contract or a transient service, i.e. not
limited to, say, just a transient service.

> Though, as an aside, I am interested in how this works in real
> life. If the transient service fails, and enters maintenance, what
> will the administrator do differently than your stop method script
> to clean up so that they don't have to reboot to repair the service?

Well, I hope this doesn't appear too messy. I'm using SMF with Sun
Cluster. In particular, in SC we have an agent that can fail over
non-global zones (S10 native as well as S8 and lx branded zones)
between SC nodes. In addition to failing over the non-global zone,
if that non-global zone is an S10 zone then we can also
enable/disable an SMF service within the "failover" zone.

In this regard, SC manages the SMF enable/disable and additionally
probes the application that was started by the SMF service. This
allows the probe to detect a wedged or otherwise bad application and
signal back to SC to either perform a local restart or initiate a
failover to another SC node. While we have some "linkage" between SC
and the "failover" zone to do this, we predominantly let SMF do its
thing, i.e. dependencies and start/stop/restart as appropriate (see
the illustration just below).

The issue I have is that when SC decides to enable an SMF service
within a failover zone, there may be all sorts of conditions [read:
faults] within the environment that cause the SMF service to fail or
time out.
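(As a concrete illustration of that linkage, for enable/disable SC
effectively runs the equivalent of the following against the
failover zone; the zone and service names here are hypothetical:

    # Enable the service in the failover zone; -s makes svcadm wait
    # for the instance to come online (or fail) before returning.
    zlogin failover-zone svcadm enable -s svc:/application/myapp:default

    # Disable it again when SC stops or relocates the resource.
    zlogin failover-zone svcadm disable svc:/application/myapp:default

The probe is separate and talks directly to the application.)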
In this scenario, Sun Cluster takes the view that if the application
(SMF service) fails to start within the allocated timeout and within
the failover zone, then the start should either be faulted and
require user intervention, or failed over to another SC node where
the start is retried, or the node rebooted and the start retried on
another SC node. The bottom line is that SC applies a fail-once
semantic (per node) at initial start.

Once the application (SMF service) has started successfully on one
node, the semantics change a little: essentially, the application
can be restarted several times within a sliding time window, where
the number of retries and the window are user-configurable
properties of the SC resource managing the application. If the
number of retries is exceeded within the window, then a failover is
performed.

So, essentially, I'd like the SMF service to be able to fail just
once upon initial startup. While I accept this may be a corner case,
we currently provide this type of service for several Sun Cluster
agents, e.g. WebSphere MQ, Apache Tomcat, MySQL, PostgreSQL,
Informix, Samba and others I can't recall at this time, where we
also provide the SMF manifest as well as the probe for the
application.

As mentioned in my earlier email, the SMF service works fine, very
well in fact. The scenario I'm trying to refine is when the SMF
service fails to start cleanly for the first time after boot. If the
SMF service only failed once upon startup, I would not need my
workaround.

Please note that SC's usage of zones is not confined to "failover"
zones; SC predominantly uses zones that represent virtual nodes to
an application, whereby that application can fail over between nodes
and zones. Failover zones just represent an additional choice.

> (Most of the requests I've heard for fail-once semantics have been
> from administrators rather than service authors, because the
> administrators don't trust that the services can properly clean up
> after themselves.)
>
>>> My only thought to achieve only one start is to keep a count of
>>> the consumed time, i.e. the ksh variable $SECONDS, and then exit
>>> with $SMF_EXIT_ERR_CONFIG or $SMF_EXIT_ERR_FATAL just before the
>>> service times out.
>
> Yep, that's probably the best workaround for now.

Thanks.

>>> I was hoping a transient service would be exempt from being
>>> restarted three times, but it appears not. I would appreciate any
>>> thoughts on how to achieve only one start, or other suggestions
>>> on my idea of keeping a count within the start method.
>
> liane
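P.S. For reference, here's a minimal sketch of the $SECONDS
workaround as it sits in my start method today (the application
paths, probe command and deadline value are placeholders for the
real ones, and it assumes timeout_seconds=300 in the manifest):

    #!/bin/ksh
    . /lib/svc/share/smf_include.sh

    # Keep the deadline safely below the manifest's timeout_seconds
    # so we exit before svc.startd times the method out and
    # schedules a retry.
    DEADLINE=290

    # Placeholder application startup.
    /opt/myapp/bin/start_myapp

    # ksh's $SECONDS counts seconds since the shell started, so it
    # tracks the time this start method has consumed.
    while ! /opt/myapp/bin/probe_myapp; do
            if (( SECONDS >= DEADLINE )); then
                    # A fatal exit status places the service
                    # directly into maintenance instead of being
                    # retried.
                    exit $SMF_EXIT_ERR_FATAL
            fi
            sleep 5
    done

    exit $SMF_EXIT_OK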