[oi-dev] Are there any active healthchecks for SMF in general?

2012-04-18 Thread Jim Klimov

Hello all,

  I wonder if there are any RFEs or ongoing work regarding
proactive health-checks for SMF services (with the test routine
defined by the service author or packager and/or by the local
system admin)?

  I think that just like there are start, stop, refresh
methods and so on, there could also be a healthcheck method
with its associated timeouts, as well as frequency of tests,
tolerable number of test failures in a row and/or within a
given time range, etc. There could also be a policy to choose
what to do if the healthcheck fails (too many times): offline
the service, set it to maintenance, restart it, or something else?
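
A minimal sketch of the "tolerable number of failures in a row" idea, assuming a plain shell loop rather than anything built into SMF. The function name healthcheck_loop and its argument order are hypothetical, and the sketch covers only consecutive failures, not failures within a time window or per-check timeouts.

```shell
#!/bin/sh
# Hypothetical helper, not part of SMF: run a check periodically and call
# a policy action once it has failed MAX_FAILS times in a row.

# healthcheck_loop INTERVAL_SECONDS MAX_FAILS CHECK ACTION
# CHECK and ACTION are names of commands or shell functions.
healthcheck_loop() {
    interval=$1; max_fails=$2; check=$3; action=$4
    fails=0
    while :; do
        if "$check"; then
            fails=0                      # any success resets the streak
        else
            fails=$((fails + 1))
            if [ "$fails" -ge "$max_fails" ]; then
                "$action"                # e.g. restart or mark as degraded
                return 1
            fi
        fi
        sleep "$interval"
    done
}
```

The action could, for instance, wrap `svcadm restart` or `svcadm mark maintenance` for the instance in question; which one is appropriate is exactly the policy choice described above.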

  In fact, if the healthcheck method is validly defined, it
should be fired after running the start method, and only after
a successful test should the SMF service state transition from
offline* to online. Some service methods exit as soon as
the target daemon has started, even though the service only
becomes useful a few minutes later.
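
Until something like that exists, one workaround is a start method that does not return before a probe succeeds, so the instance stays in offline* for that time. This is a sketch under assumptions: wait_until_healthy and the probe are hypothetical names, and the method's timeout_seconds in the manifest would have to be longer than the wait.

```shell
#!/bin/sh
# Exit codes with the same values as in /lib/svc/share/smf_include.sh.
SMF_EXIT_OK=0
SMF_EXIT_ERR_FATAL=95

# wait_until_healthy TIMEOUT_SECONDS INTERVAL_SECONDS PROBE [ARGS...]
# Hypothetical helper: poll PROBE until it succeeds or the time is up.
# Only the sleep time is counted, not the time spent inside PROBE.
wait_until_healthy() {
    timeout=$1; interval=$2; shift 2
    waited=0
    while ! "$@"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Usage inside a start method (both commands are placeholders):
#   /path/to/start-daemon || exit $SMF_EXIT_ERR_FATAL
#   wait_until_healthy 300 5 my_probe || exit $SMF_EXIT_ERR_FATAL
#   exit $SMF_EXIT_OK
```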

  I've had to script crutches like that for many different
projects, usually involving a test routine fired from crontab
or a specialized startup script which includes the needed
checks on prerequisite services as well as on the actual
results of the startup.

  As an example, think of Apache Tomcat with its default start
scripts - they exit after spawning the JVM, but the user-required
webapps can take minutes to initialize and start up. Currently
SMF would online the service as soon as the script exited,
and proceed to starting up the dependent services. However,
for us the service is really online, generically, only when the
servlet container has *logged* that its startup routine is
complete. If other SMF services do depend on this Tomcat (say,
it is running an OpenDJ LDAP server), it is online only when
it responds correctly to LDAP queries, and not before.
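
The log-based test has one trap: the log usually still contains the "startup complete" line from the previous run. A sketch that only looks at lines written after the current start attempt follows. The function names are hypothetical, and the 'Server startup in' text is what stock Tomcat logs at the end of startup, so it should be verified against the version and logging configuration in use. It also assumes the log is appended to, not rotated, during startup.

```shell
#!/bin/sh
# count_lines FILE: print the number of lines in FILE (0 if it is missing).
count_lines() {
    if [ -f "$1" ]; then
        wc -l < "$1" | tr -d ' '
    else
        echo 0
    fi
}

# started_since FILE SKIP: succeed if a startup-complete message appears
# after the first SKIP lines of FILE, i.e. was logged by this start attempt.
started_since() {
    [ -f "$1" ] || return 1
    tail -n "+$(($2 + 1))" "$1" | grep 'Server startup in' > /dev/null
}

# Usage sketch (paths are placeholders):
#   before=$(count_lines /path/to/catalina.out)
#   /path/to/startup.sh
#   ...then poll: started_since /path/to/catalina.out "$before"
```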

  In case of webserver SMF services the tests usually request
a healthcheck page or some other page and compare it with the
expected healthy template. For DBMS or LDAP services that
would be an SQL or ldapsearch query. In case of crontabs there
are tricks (e.g. lockfiles) to keep the test script from
running in numerous parallel invocations if the tested service
takes too long to respond.
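
The lockfile trick can be sketched as below, assuming a lock directory instead of a file because mkdir is atomic. The name run_exclusive, the return code 75 and the lock path are hypothetical choices for illustration; a lock left behind by a killed run would still need separate cleanup.

```shell
#!/bin/sh
# run_exclusive LOCKDIR COMMAND [ARGS...]
# Hypothetical helper: run COMMAND unless a previous run still holds the
# lock. Returns 75 if the lock is taken, otherwise COMMAND's exit status.
run_exclusive() {
    lockdir=$1; shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        return 75                        # a previous check is still running
    fi
    "$@" && status=0 || status=$?
    rmdir "$lockdir"
    return "$status"
}

# Usage from crontab (URL and template path are placeholders):
#   run_exclusive /tmp/webcheck.lock sh -c \
#     'curl -fsS http://localhost:8080/health | cmp -s - /etc/webcheck/healthy.html'
```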

  Recently (in my vboxsvc[1] project for controlling the
VirtualBox VMs as SMF service instances), I've taken a different
approach and made a background loop initiated and executed by
the service method script; part of that loop's job is to check
whether the VM is not only running as a process on the Solaris
host, but also provides the service it was booted for (if the
test method was validly defined and configured and enabled).
Originally the loop got there because the service is transient
(due to VirtualBox internals) and SMF does not monitor the
service's child processes, but we needed to monitor anyway
whether the VMs are running or not, and stop the VM processes
gracefully when the SMF service is stopped. Then things got
expanded a bit... ;)

  I wonder if it would be useful to generalize the solution
and/or recode it in some more efficient manner to be available
for all SMF services as an optional part of the framework?
In a sense part of it is already there: SMF already checks that
child processes exist for contract/wait types of services,
and that none died from a fatal signal or a core dump.

  What do you think? Would that logic be useful as a generic part
of SMF? Can it be left as (includable?) shell scripts and/or
rewritten in Perl for efficiency? Would anyone undertake to
revise and rewrite the logic in C? ;)

[1] http://sourceforge.net/projects/vboxsvc - my VBoxSvc project
which controls VMs as SMF service instances, with optional
healthchecks.

[2] http://vboxsvc.svn.sourceforge.net/viewvc/vboxsvc/lib/svc/method/vbox.sh?content-type=text%2Fplain
- the main script (keywords: KICKER vmsvccheck monitoring hook).

Thanks for any ideas,
//Jim Klimov


___
oi-dev mailing list
oi-dev@openindiana.org
http://openindiana.org/mailman/listinfo/oi-dev


Re: [oi-dev] Are there any active healthchecks for SMF in general?

2012-04-18 Thread Bayard Bell
tl;dr

Seems like an upstream question (illumos-gate and/or illumos-userland).

Even in terms of the upstream, we don't really do RFEs unless there's
enough interest in the issue that someone's likely to write some code.



Re: [oi-dev] Are there any active healthchecks for SMF in general?

2012-04-18 Thread Jim Klimov

2012-04-18 18:33, Bayard Bell wrote:

 tl;dr

 Seems like an upstream question (illumos-gate and/or illumos-userland).

 Even in terms of the upstream, we don't really do RFEs unless there's
 enough interest in the issue that someone's likely to write some code.


tldr is my problem of sorts ;)

Still, this was not so much an RFE per se, but rather a statement
that I have to make such-and-such workaround too often, so I do
have solutions to the problem. I wonder if the problem annoys/bugs
others too, if they want the solution, and if my approach is sane
and/or acceptable.

In particular, my SMF healthchecks are embedded into the script code
of several method scripts I made for my projects. I think that
what I learned from that could be generalized for everyone's
benefit, but I would probably rewrite it as another shell or Perl
script. That is likely not a good form for building it into the SMF
core, so other interested developers might pick up the algorithm
and/or the idea, and code some C...

Thanks,
//Jim


Re: [oi-dev] Are there any active healthchecks for SMF in general?

2012-04-18 Thread Bayard Bell
On Wed, Apr 18, 2012 at 4:00 PM, Jim Klimov jimkli...@cos.ru wrote:
 2012-04-18 18:33, Bayard Bell wrote:

 tl;dr

 Seems like an upstream question (illumos-gate and/or illumos-userland).

 Even in terms of the upstream, we don't really do RFEs unless there's
 enough interest in the issue that someone's likely to write some code.


 tldr is my problem of sorts ;)

I have the same problem at times.

 Still, this was not so much an RFE per se, but rather a statement
 that I have to make such-and-such workaround too often, so I do
 have solutions to the problem. I wonder if the problem annoys/bugs
 others too, if they want the solution, and if my approach is sane
 and/or acceptable.

Factor into distinct and clear problems; circulate those to upstream
developer lists for the referenced project and ask whether there's
interest in taking solutions on board. If you already have code
written, so much the better, but it's still better to explain what
problem you're trying to solve if you want to upstream it.

Cheers,
Bayard
