[oi-dev] Are there any active healthchecks for SMF in general?
Hello all,

I wonder if there are any RFEs or ongoing work regarding proactive health-checks for SMF services (with the test routine to be defined by the service author or packager and/or by the local system admin)?

I think that just as there are start, stop, refresh methods and so on, there could also be a healthcheck method with its associated timeouts, as well as a frequency of tests, a tolerable number of test failures in a row and/or within a given time range, etc. There could also be a policy to choose what to do if the healthcheck fails (too many times): offline the service, set it to maintenance, restart it, or something else?

In fact, if the healthcheck method is validly defined, it should be fired after running the start method, and only after a successful test should the SMF service state transition from offline* to online. Some service methods exit as soon as the target daemon has started, even though the service only becomes useful a few minutes later.

I've had to script crutches like that for many different projects, usually involving a test routine fired from crontab, or crafting a specialized startup script which includes the needed checks on prerequisite services as well as on the real results of the startup.

As an example, think of Apache Tomcat with its default start scripts: they exit after spawning the JVM, but the user-required webapps can take minutes to initialize and start up. Currently SMF would online the service as soon as the script exited, and proceed to starting up the dependent services. However, for us the service is generally online only when the servlet container has *logged* that its startup routine is complete. If other SMF services do depend on this Tomcat (say, it is running an OpenDJ LDAP server), it is online only when it responds correctly to LDAP queries, and not before.

In the case of webserver SMF services, the tests usually request a healthcheck page or some other page and compare it with the expected healthy template.
For DBMS or LDAP services that would be an SQL or ldapsearch query. In the case of crontabs there are tricks (e.g. lockfiles) to prevent the test script from running in numerous parallel invocations if the tested service takes too long to respond.

Recently (in my vboxsvc [1] project for controlling VirtualBox VMs as SMF service instances), I've taken a different approach and made a background loop initiated and executed by the service method script; part of that loop's job is to check whether the VM is not only running as a process on the Solaris host, but also provides the service it was booted for (if the test method was validly defined, configured and enabled). Originally the loop got there because the service is transient (due to VirtualBox internals) and SMF does not monitor the service's child processes, but we needed to monitor anyway whether the VMs are running or not, and to stop the VM processes gracefully when the SMF service is stopped. Then things got expanded a bit... ;)

I wonder if it would be useful to generalize the solution and/or recode it in some more efficient manner, to be available for all SMF services as an optional part of the framework? Theoretically it is there, somewhat: SMF already checks that child processes exist for contract/wait types of services, and that none died on bad signals such as those that cause core dumps.

What do you think? Would that logic be useful as a generic part of SMF? Can it be left as (includable?) shell scripts and/or rewritten in Perl for efficiency? Would anyone undertake to revise and rewrite the logic in C? ;)

[1] http://sourceforge.net/projects/vboxsvc - my VBoxSvc project which controls VMs as SMF service instances, with optional healthchecks.
[2] http://vboxsvc.svn.sourceforge.net/viewvc/vboxsvc/lib/svc/method/vbox.sh?content-type=text%2Fplain - the main script (keywords: KICKER vmsvccheck monitoring hook).

Thanks for any ideas,
//Jim Klimov

___
oi-dev mailing list
oi-dev@openindiana.org
http://openindiana.org/mailman/listinfo/oi-dev
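
The healthcheck semantics proposed in the message above (a per-probe timeout, a test interval, a tolerated number of failures in a row and/or within a time window, a policy action, and holding the service in offline* until the first successful probe) could be sketched roughly as follows. This is only an illustration in Python under stated assumptions: it is not part of SMF or of vboxsvc, and every name in it (`HealthPolicy`, `HealthTracker`, `run_probe`, `wait_until_ready`, and the policy fields) is hypothetical rather than an existing SMF property or API.

```python
import subprocess
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Optional, Sequence


@dataclass
class HealthPolicy:
    """Hypothetical knobs mirroring the proposal; not real SMF properties."""
    timeout_s: float = 10.0     # how long a single probe may run
    interval_s: float = 30.0    # delay between probes
    max_consecutive: int = 3    # tolerated failures in a row
    max_in_window: int = 5      # tolerated failures within window_s
    window_s: float = 600.0     # length of the sliding time window
    action: str = "restart"     # or "offline", "maintenance"


@dataclass
class HealthTracker:
    """Counts probe failures and reports when the policy limits are exceeded."""
    policy: HealthPolicy
    consecutive: int = 0
    recent: deque = field(default_factory=deque)  # timestamps of failures

    def record(self, ok: bool, now: float) -> Optional[str]:
        """Record one probe result; return the policy action if a limit is exceeded."""
        # Forget failures that have fallen out of the sliding window.
        while self.recent and now - self.recent[0] > self.policy.window_s:
            self.recent.popleft()
        if ok:
            self.consecutive = 0
            return None
        self.consecutive += 1
        self.recent.append(now)
        if (self.consecutive > self.policy.max_consecutive
                or len(self.recent) > self.policy.max_in_window):
            return self.policy.action
        return None


def run_probe(cmd: Sequence[str], timeout_s: float) -> bool:
    """Run the check command; healthy means exit status 0 within the timeout."""
    try:
        result = subprocess.run(cmd, timeout=timeout_s,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        return result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        # A hung or missing check command counts as a failed probe.
        return False


def wait_until_ready(cmd: Sequence[str], policy: HealthPolicy,
                     deadline_s: float) -> bool:
    """Startup gate: poll until the first successful probe or the deadline."""
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        if run_probe(cmd, policy.timeout_s):
            return True
        time.sleep(min(policy.interval_s, max(0.0, end - time.monotonic())))
    return False
```

In this sketch a start method would call `wait_until_ready` before reporting success, and a monitoring loop would feed each later probe result into `HealthTracker.record`, acting only when it returns an action name. The check command itself is whatever the service author supplies, for example a script wrapping an HTTP request or an ldapsearch query.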
Re: [oi-dev] Are there any active healthchecks for SMF in general?
tl;dr Seems like an upstream question (illumos-gate and/or illumos-userland). Even in terms of the upstream, we don't really do RFEs unless there's enough interest in the issue that someone's likely to write some code.

On Wed, Apr 18, 2012 at 12:37 PM, Jim Klimov jimkli...@cos.ru wrote:
> Hello all, I wonder if there are any RFEs or on-going works regarding proactive health-checks for SMF services [...]
Re: [oi-dev] Are there any active healthchecks for SMF in general?
2012-04-18 18:33, Bayard Bell wrote:
> tl;dr Seems like an upstream question (illumos-gate and/or illumos-userland). Even in terms of the upstream, we don't really do RFEs unless there's enough interest in the issue that someone's likely to write some code.

tl;dr is my problem of sorts ;)

Still, this was not so much an RFE per se as a statement that I have to make such-and-such a workaround too often, so I do have solutions to the problem. I wonder if the problem annoys/bugs others too, whether they want the solution, and whether my approach is sane and/or acceptable.

In particular, my SMF healthchecks are embedded into the script code of several method scripts I made for my projects. I think that what I learned from that could be generalized for everyone's benefit, but I might rewrite it into another shell or Perl script. That is likely not a good way to build it into the SMF core, so other interested developers might pick up the algorithm and/or the idea, and code some C...

Thanks,
//Jim
Re: [oi-dev] Are there any active healthchecks for SMF in general?
On Wed, Apr 18, 2012 at 4:00 PM, Jim Klimov jimkli...@cos.ru wrote:
> 2012-04-18 18:33, Bayard Bell wrote:
>> tl;dr Seems like an upstream question (illumos-gate and/or illumos-userland). Even in terms of the upstream, we don't really do RFEs unless there's enough interest in the issue that someone's likely to write some code.
>
> tldr is my problem of sorts ;)

I have the same problem at times.

> Still, this was not so much an RFE per se, but rather a statement that I have to make such-and-such workaround too often, so I do have solutions to the problem. I wonder if the problem annoys/bugs others too, if they want the solution, and if my approach is sane and/or acceptable.

Factor into distinct and clear problems; circulate those to the upstream developer lists for the referenced project and ask whether there's interest in taking solutions on board. If you already have code written, so much the better, but it's still better to explain what problem you're trying to solve if you want to upstream it.

Cheers,
Bayard