I can't find an answer to whether an SMF service can run automated health checks (as defined by the service script's author) and restart itself if required.
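To make the question concrete, here is a minimal sketch in Python of the kind of check I mean. It is not an SMF facility, just an illustration: the URL, the FMRI and the 5-second limit are hypothetical placeholders, and the only real system command used is "svcadm restart".

```python
import subprocess
import time
import urllib.request

# Hypothetical values, for illustration only.
CHECK_URL = "http://localhost:8080/"                 # assumed page to probe
SERVICE_FMRI = "svc:/application/magnolia:default"   # assumed service FMRI
MAX_SECONDS = 5.0                                    # assumed response-time limit


def fetch(url, timeout):
    """Fetch the page and return its HTTP status code; raises on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def is_healthy(url=CHECK_URL, max_seconds=MAX_SECONDS,
               fetcher=fetch, clock=time.monotonic):
    """Return True if the page answers with HTTP 200 within max_seconds."""
    start = clock()
    try:
        status = fetcher(url, max_seconds)
    except Exception:
        # Connection refused, timeout, HTTP error: all count as unhealthy.
        return False
    return status == 200 and (clock() - start) <= max_seconds


def check_and_recycle(healthy=is_healthy, restart=None):
    """Restart the service when the check fails; return True if restarted."""
    if healthy():
        return False
    if restart is None:
        # Default action: ask SMF to restart the (assumed) service instance.
        def restart():
            subprocess.run(["svcadm", "restart", SERVICE_FMRI], check=True)
    restart()
    return True
```

Something like check_and_recycle() would be called periodically, which is what our crontab entries do today. The fetcher, clock and restart parameters exist only so the logic can be exercised without a live server.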
For a specific example, we run Magnolia CMS in a Tomcat server in a zone. It depends on a MySQL server to work properly. That MySQL server lives in another zone (usually on another physical server, in fact). One of the problem scenarios is as follows:

* if the MySQL server is not running, the Magnolia web-app is not initialized;
* if the MySQL server was restarted, the web-app's connection breaks.

In either case, Tomcat is running (SMF is happy, since the service's contract is fulfilled), but the end-user service is no longer provided. To fix the problem, the web-app or the whole web-container needs to be restarted.

Other scenarios with different web-applications involve running out of memory, or slowing down to a crawl. In any case, as soon as the end-user service falls below an SLA (say, the web-site's page takes over 5s to render, or some runaway loop consumes 95+% CPU for many minutes), the service should be recycled; this is known to help. We can't fix most third-party applications directly (i.e. rewrite them so that they don't fail), but we have to do our best to reduce downtime for the customers.

We currently have some scripts that run such checks and maintain our services, so their logic (and to some extent their implementation) is not the problem. These scripts are placed into root's (or the webserver user's) crontab and invoke init scripts to recycle services. I want to convert these scripts and crontabs into a single SMF service which includes this complicated self-monitoring, to reduce the complexity of the (default) configuration as well as to improve observability.

I want to stress again that a working contract in the OS is not the only metric that describes a service as truly "online".

Are there any established best practices and examples, or am I doomed to invent another bicycle? ;)

//Jim
--
This message posted from opensolaris.org