Re: perp - how to notify if service suddenly starts dying all the time
On Fri, 17 Jul 2015 08:59:46 +0300 Georgi Chorbadzhiyski georgi.chorbadzhiy...@gmail.com wrote:

On 07/16/15 15:13, Wayne Marshall wrote:

A simple way to notify from perp is to send yourself (admin) an email from within the reset target:

    ...
    reset() {
        case $3 in
        'exit')
            echo "*** $SVNAME: exited status $4 $PERP_SVSECS seconds runtime."
            mail -s "$SVNAME exited" ad...@myserver.com <<END_MAIL
    NOTICE: The $SVNAME service has exited status $4 after runtime of $PERP_SVSECS seconds.
    END_MAIL
            ;;
        'signal')
            echo "*** $SVNAME: killed on signal $5 $PERP_SVSECS seconds runtime."
            ;;
        *)
            echo "*** $SVNAME: stopped ($3) $PERP_SVSECS seconds runtime."
            ;;
        esac
        exit 0
    }
    ...

The above example shows usage of a generic mail(1) command that may vary a little among platforms/mail agents. It also uses a shell here-document to generate the body of the email.

This is just a bare bones starting point. You could embellish this to suit your own site's requirements. Another suggestion is to develop an executable perp_notify script that incorporates the above to provide a consistent notification message, without having to duplicate it within each and every runscript.

Thanks, I already have something like the solution that you've described but I was looking for something else. Maybe an additional ENV variable (or some other mechanism) that keeps, for example, the number of service restarts for the last 1 minute.

I don't want to overwhelm our admin team with notices on every service restart (we are managing thousands of servers). I need a notice only if the service restarts more than X times in a minute, which is a sign that something is most definitely wrong.

I'll have to hack something up. Thanks for the response.

Hi Georgi,

Thanks for the suggestion of an exit loop counter for perpd. It is good information. But by itself it would not be enough for your case, because you would still need to track your last notification externally.

In the meantime, your notification hack can be something fairly simple.
Every time the service exits abnormally, test against a file timestamp somewhere (e.g. /var/run/perp/myservice/exit_notify). Send a new notification and update the timestamp if, say, the last notification was more than 3 minutes ago.

Something is usually wrong if a service exits abnormally, which is the exit condition, as opposed to the signal condition. The exit condition can also be filtered more specifically by exit code.

Unfortunately your custom notification tool will probably need to be developed with something slightly more powerful than shell sh(1) scripting, because test(1) does not offer much in the way of time math facilities to use for the timestamp comparison.

All the best,
Wayne
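If staying within sh(1) is preferred, one possible workaround for the time-math limitation is to store epoch seconds inside the stamp file and compare them with shell arithmetic, rather than comparing file modification times with test(1). What follows is only a minimal sketch under stated assumptions: the function name should_notify is hypothetical (not part of perp), and date(1) is assumed to accept the +%s format.

```shell
#!/bin/sh
# Hypothetical helper (not part of perp): rate-limit notifications
# using a stamp file that holds the epoch time of the last notice.
# Assumption: date(1) supports the "+%s" (epoch seconds) format.

# should_notify STAMPFILE MIN_INTERVAL_SECS
# Returns 0 and records the current time if no notification was
# recorded within the last MIN_INTERVAL_SECS seconds; returns 1
# otherwise (a notice was sent too recently).
should_notify() {
    stamp=$1
    interval=$2
    now=$(date +%s)
    last=0
    if [ -r "$stamp" ]; then
        read -r last < "$stamp" || :
    fi
    # Treat an empty or corrupted stamp file as "never notified".
    case $last in
        ''|*[!0-9]*) last=0 ;;
    esac
    if [ $((now - last)) -ge "$interval" ]; then
        echo "$now" > "$stamp"
        return 0
    fi
    return 1
}
```

Within the 'exit' branch of reset(), the mail command could then be wrapped in a test such as: if should_notify /var/run/perp/myservice/exit_notify 180 ; then ... fi. The 180 second interval mirrors the 3 minutes mentioned above and is only an example value.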
Re: perp - how to notify if service suddenly starts dying all the time
On Thu, 16 Jul 2015 09:52:55 +0300 Georgi Chorbadzhiyski geo...@unixsol.org wrote:

Yesterday, something corrupted the database file that Redis uses, and Redis crashed and then refused to start. I'm using perp to monitor the service and of course perp was doing its job and restarted the service after it died.

The problem is that I can't think of a way to get notified if a service dies all the time. In this case, since Redis has never died on me, it would be enough to know if the service has been restarted X times in the last 30 seconds (for example).

I can monitor the logs but that doesn't seem like a good idea (starting a parallel monitor service for each service that is being monitored). Any ideas?

Here is what my rc.main script for the service looks like (it is pretty standard).

    #!/bin/sh
    exec 2>&1

    TARGET=$1
    SVNAME=$2
    [ -z "$SVNAME" ] && SVNAME=$(basename $(readlink -m $(dirname $0)))

    start() {
        echo "*** $SVNAME: starting..."
        exec runuid -s redis /usr/bin/redis
    }

    reset() {
        case $3 in
        'exit')
            echo "*** $SVNAME: exited status $4 $PERP_SVSECS seconds runtime."
            ;;
        'signal')
            echo "*** $SVNAME: killed on signal $5 $PERP_SVSECS seconds runtime."
            ;;
        *)
            echo "*** $SVNAME: stopped ($3) $PERP_SVSECS seconds runtime."
            ;;
        esac
        exit 0
    }

    eval $TARGET "$@"
    exit 0

Hi Georgi,

A simple way to notify from perp is to send yourself (admin) an email from within the reset target:

    ...
    reset() {
        case $3 in
        'exit')
            echo "*** $SVNAME: exited status $4 $PERP_SVSECS seconds runtime."
            mail -s "$SVNAME exited" ad...@myserver.com <<END_MAIL
    NOTICE: The $SVNAME service has exited status $4 after runtime of $PERP_SVSECS seconds.
    END_MAIL
            ;;
        'signal')
            echo "*** $SVNAME: killed on signal $5 $PERP_SVSECS seconds runtime."
            ;;
        *)
            echo "*** $SVNAME: stopped ($3) $PERP_SVSECS seconds runtime."
            ;;
        esac
        exit 0
    }
    ...

The above example shows usage of a generic mail(1) command that may vary a little among platforms/mail agents. It also uses a shell here-document to generate the body of the email.

This is just a bare bones starting point.
You could embellish this to suit your own site's requirements. Another suggestion is to develop an executable perp_notify script that incorporates the above to provide a consistent notification message, without having to duplicate it within each and every runscript.

All the best,
Wayne
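The perp_notify idea above could start from something like the following sketch. Only the script name comes from the message; the function name format_notice and its argument order are assumptions made for illustration. PERP_SVSECS is read from the environment, as in the runscripts shown earlier.

```shell
#!/bin/sh
# Sketch of a message formatter for a perp_notify script.
# The name format_notice and its argument order are hypothetical.

# format_notice SVNAME HOW [EXITCODE] [SIGNAL]
# Prints one uniform notice line. HOW is the value a runscript
# receives as $3 in its reset target ('exit', 'signal', or other).
format_notice() {
    svname=$1
    how=$2
    secs=${PERP_SVSECS:-unknown}
    case $how in
        exit)   what="exited status $3" ;;
        signal) what="was killed on signal $4" ;;
        *)      what="stopped ($how)" ;;
    esac
    echo "NOTICE: The $svname service $what after runtime of $secs seconds."
}
```

From a reset target this might be called as format_notice "$SVNAME" "$3" "$4" "$5" and piped into mail(1), so that every runscript produces the same message text without carrying its own here-document.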
Re: s6 ordering and run-once?
On Tue, 16 Jun 2015 20:58:35 -0400 Steve Litt sl...@troubleshooters.com wrote:

Does anyone know how to do a run-once service without putting an infinite sleep loop at the end?

In perp you would simply touch(1) a file named flag.once in the service definition directory. See the section STARTUP MODIFICATION in the perpd(8) man page for the full story:

http://b0llix.net/perp/site.cgi?page=perpd.8

What do you do if a oneshot requires that a longrun is already running?

In perp you would simply use the perpok(8) utility to test for the required service, and use the runpause(8) utility to hack oneshot into a persistent process (_not_ using the flag.once flagfile in this case).

Here is a complete perp runscript example for oneshot, as would be found in the file /etc/perp/oneshot/rc.main:

    #!/bin/sh
    exec 2>&1

    if test $1 = 'start' ; then
        if ! perpok -u3 longrun ; then
            echo "dependency failure: longrun not running..."
            exit 1
        else
            echo "starting oneshot..."
            /usr/bin/oneshot
            exec runpause -L 'oneshot' 0 /bin/true
        fi
    fi

    if test $1 = 'reset' ; then
        echo "resetting oneshot..."
    fi

    ### eof

The relevant manual pages:

perpok(8): http://b0llix.net/perp/site.cgi?page=perpok.8
runpause(8): http://b0llix.net/perp/site.cgi?page=runpause.8

Wayne
http://b0llix.net/perp/
Re: staggering runsv startup
Off the top of my head, here is an easy solution in perp that requires no special or supplemental scripting, flagfile tricks, etc.

For multiple service instances of /usr/bin/myserv -- named myserv00, myserv01, myserv02, ..., myservNN -- deploy the following set of service definitions.

First, the basic myserv00 runscript (in /etc/perp/myserv00/rc.main):

    #!/bin/sh
    exec 2>&1

    TARGET=$1
    SVNAME=$2

    start() {
        echo "starting ${SVNAME}..."
        exec /usr/bin/myserv
    }

    reset() {
        echo "resetting ${SVNAME}..."
        exit 0
    }

    eval ${TARGET} "$@"

    ### eof

Next the myserv01 runscript (in /etc/perp/myserv01/rc.main), showing just the start() stanza for brevity:

    ...
    start() {
        if ! perpok -u3 myserv00 ; then
            echo "${SVNAME}: not yet running myserv00"
            exit 1
        fi
        echo "starting ${SVNAME}..."
        exec /usr/bin/myserv
    }
    ...

Next the myserv02 runscript (in /etc/perp/myserv02/rc.main), again the start() stanza:

    ...
    start() {
        if ! perpok -u3 myserv01 ; then
            echo "${SVNAME}: not yet running myserv01"
            exit 1
        fi
        echo "starting ${SVNAME}..."
        exec /usr/bin/myserv
    }
    ...

And so forth, each instance of myservXX using perpok(8) to check if the previous instance is up and running before loading the current instance.

Many permutations of this basic theme are possible and can be fine tuned to match your local installation, e.g.:

* Subsequent instances of myserv may all use myserv00 as the base instance, and modify/increment the -u secs parameter to perpok(8) accordingly.

* Multiple instances of myservNN may be grouped that use the same previous instance with the same -u secs parameter to perpok(8), to permit asynchronous startup of a few at a time, rather than one at a time as shown above.

See also runtools such as runargs(8) that may allow you to set up your runscripts with a basic template, loading the perpok(8) parameters from an external file.

Wayne
http://b0llix.net/perp/

On Thu, 04 Jun 2015 13:41:12 -0700 Jameson Graef Rollins jroll...@finestructure.net wrote:

Hi, all. I am using runit to supervise a large set of nearly identical processes.
Each process accesses certain IO-bound shared resources (e.g. NFS mount) at startup. At system initialization, when runsvdir is launched, it launches all these processes (via runsv) essentially simultaneously. This causes a big resource contention at initialization that occasionally causes problems.

What I would like is to somehow stagger the startup of the processes, to avoid the resource contention. I could do this by putting a random sleep into the ./run scripts, but this would also cause random startup delays on subsequent process restarts via "sv restart" or the like (which we occasionally need to do).

What I would prefer instead is to add random delays to the startup of the *runsv* processes, since this would only apply at system initialization. Unfortunately I can't see any way to do that right now (other than somehow wrapping the runsv binary itself).

Does anyone know any way to accomplish what I'm looking for? I don't believe runsvdir supports any options that would apply here. Is it possible to somehow point runsvdir to an alternate runsv executable to which I could add the random delays?

Any suggestions would be much appreciated. Thanks.

jamie.
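The chained perpok(8) checks described in the reply above repeat the same stanza with only the predecessor name changing, so a single runscript template could derive that name from SVNAME. The sketch below is one hedged way to do it; the function name prev_instance is hypothetical, and it assumes instance names end in exactly two digits, as in the myservNN examples.

```shell
#!/bin/sh
# Hypothetical helper for a shared runscript template.
# Assumption: instance names end in a two-digit suffix (myserv00..myserv99).

# prev_instance NAME
# Prints the name of the preceding instance (myserv03 -> myserv02).
# Prints nothing for the base instance (suffix 00) or for names
# without a two-digit suffix.
prev_instance() {
    name=$1
    base=${name%[0-9][0-9]}
    num=${name#"$base"}
    # Names without a numeric suffix have no predecessor.
    case $num in
        ''|*[!0-9]*) return 0 ;;
    esac
    # Strip one leading zero so "08" and "09" are not read as octal.
    num=${num#0}
    [ "$num" -gt 0 ] || return 0
    printf '%s%02d\n' "$base" $((num - 1))
}
```

In start(), a template might then set prev=$(prev_instance "$SVNAME") and, when prev is non-empty, run the same perpok -u3 "$prev" test shown above before the exec. The base instance gets an empty prev and starts unconditionally.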
taxonomy of dependencies
Quite a lot of clock-cycles are being devoted to the discussion of dependencies among services.

I would like to suggest that not all dependencies are created equal. That is, some (if not most) dependencies are really of no practical consequence -- and we don't need to worry about them in terms of sequentializing service start-up. Other dependencies (and their number is fewer) may in fact require our attention through simple handling within the runscript.

First a preliminary statement: service dependencies are of a different nature than, say, build dependencies or package installation dependencies. Efforts to use these paradigms for ordering service start-up will generally lead to unnecessary complexity.

Unnecessary complexity is a Bad Thing. As the system becomes complex, it becomes opaque, confusing, prone to error, hard to troubleshoot, and difficult to administer. Then you are right back to the problems of, say, sysvinit or systemd from which you are trying to escape.

To the extent that we can recognize and categorize service dependencies, we may simplify our runscripts considerably. What follows below is an effort (admittedly a first-cut) to describe a taxonomy of service dependencies.

0) No dependency. A service has no functional dependency on any other service. (Mentioned here for completeness.) Nothing to worry about.

1) Logical only dependency. A service has only a logical dependency on another service; in terms of functional behavior however, arranging for ordered start-up is unnecessary. There is no pathological functional behavior associated with starting in parallel or in any particular order. Again, nothing to worry about.

2) Functional dependency - soft. A service has a functional dependency on another service, as in, it cannot perform its task without the other service running. Yet there is no pathological functional behavior exhibited by the service in cases where the dependency is not running.
That is, the service simply defers connections or reports "not ready" until the dependency is running. Once again, still nothing for us to worry about in terms of special handling within runscripts, as the services may be started in any order.

3) Functional dependency - medium. A service has a functional dependency on another service, and fails to start if the dependency is not running. This is actually quite a nice arrangement, because the service itself is testing for its dependency without any additional effort on our part. Under a supervision framework, failure of a service starting is absolutely ok. (Many novices fail to grasp the elegance of this essential feature.) The system will automatically attempt to restart the failed service at intervals until the dependency is met. Yet again, nothing to worry about in our runscripts.

4) Functional dependency - hard. A service has a functional dependency on another service, yet will start and run without error in the absence of the dependency, and doing so results in some kind of pathological behavior. The pathological behavior may expose the system to security vulnerability, resource blockage, or provide users with erroneous data or bad results. Now -- finally -- this is a case we have to worry about. The runscripts must be designed to explicitly require the dependency before starting the service.

When working with dependencies in this last category #4, all we have to do is to make runscripts that effectively turn them into category #3. That is, we simply want to immediately fail any service whose required dependency is not running.

An example of dependency handling under the perp system is illustrated in the perpok(8) manual page in the EXAMPLES section:

http://b0llix.net/perp/site.cgi?page=perpok.8

Alternatively, one may write purpose-specific dependency checking utilities, such as with ncat, etc., to make sure that the dependency is not only running, but serving a set of expected results.
Note also that in no case is it necessary for a service runscript to try starting dependencies itself -- this is all left to the supervisor. All the runscript needs for category #4 dependencies is to check for the dependency, and fail immediately if that check fails.

Simplissimo.

--
Wayne
http://b0llix.net/perp/
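Turning a category #4 dependency into category #3 amounts to a check-then-fail step at the top of the start target. A minimal sketch of that step follows; the function name require is hypothetical, and the check command passed to it (perpok(8), or any purpose-specific probe) is supplied by the caller.

```shell
#!/bin/sh
# Hypothetical fail-fast guard for runscripts (not part of perp).

# require CHECK_COMMAND [ARGS...]
# Runs the given check command. If the check fails, reports the
# failure and exits the runscript with status 1, leaving it to the
# supervisor to retry the service later.
require() {
    if ! "$@" ; then
        echo "dependency check failed: $*"
        exit 1
    fi
}
```

A start target might then begin with a line such as require perpok -u3 longrun before the final exec. Because require calls exit, it ends the whole runscript on failure, which is the intended behavior here.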
Re: runit and sv check for dependencies
The "sv check" paradigm is a bit wrong-headed in its approach to dependency handling. It forces a service to block and wait for its dependency. To the rest of the world, that service will then seem like it is up and running normally, when in fact it may only be waiting for an unmet dependency.

The better paradigm for dependency checking is this: for a service with any unmet dependency, fail immediately. The supervisor itself will then automatically take care of trying to restart the service at periodic intervals, until the dependency check for that service succeeds.

The perpok(8) utility for dependency checking is included in the perp distribution and described here:

http://b0llix.net/perp/site.cgi?page=perpok.8

A complete perp runscript for the scenario you describe might look like this:

    #!/bin/sh
    exec 2>&1

    TARGET=$1

    start() {
        echo "starting lighttpd..."
        ## postgresql dependency check:
        if ! perpok -u 3 postgresql ; then
            echo "sorry: dependency check failure postgresql"
            exit 1
        fi
        ## dependency check ok, start lighttpd:
        exec lighttpd -f /etc/lighttpd/lighttpd.conf -D
    }

    reset() {
        echo "resetting lighttpd..."
        exit 0
    }

    eval ${TARGET} "$@"

    ### eof (/etc/perp/lighttpd/rc.main)

In many cases we may generally resist the idea of failure being okay. But in the case of dependency checking within a service management framework, failing -- and failing quickly -- is actually the best thing to do.

Wayne
http://b0llix.net/perp/

On Wed, 14 Jan 2015 16:24:19 + James Byrne james.by...@origamienergy.com wrote:

Hi,

I am working on an embedded Linux system where I want to use the 'runit' tools to start various system services, and I have an issue where "sv check" doesn't seem to behave in a useful way.

I have seen it suggested (specifically in the article at http://rubyists.github.io/2011/05/02/runit-for-ruby-and-everything-else.html) that "sv check" can be used to implement dependencies in the run file.
The example given in the article is:

    /service/lighttpd/run:

    #!/bin/sh -e
    sv -w7 check postgresql
    exec 2>&1 lighttpd -f /etc/lighttpd/lighttpd.conf -D

It goes on to say "This would wait 7 seconds for the postgresql service to be running, exiting with an error if that timeout is reached. runsv will then run this script again. Lighttpd will never be executed unless sv check exits without an error (postgresql is up)."

However in practice this will not work, because "sv check" will return exit code 0 if the postgresql service is down, or if it failed to run at all (i.e. if postgresql/run exited with a non-zero exit code).

Having looked at the code and done various tests (using runit 2.1.2), "sv check" doesn't appear to be very useful with its current behaviour. The documentation is ambiguous about what it does, saying that it will:

"Check for the service to be in the state that’s been requested. Wait up to 7 seconds for the service to reach the requested state, then report the status or timeout."

This doesn't really make sense, because there isn't any such thing as the "requested state". My solution is to make the following change to sv.c:

    --- old/sv.c 2014-08-10 19:22:34.0 +0100
    +++ new/sv.c 2015-01-14 14:29:31.384556297 +
    @@ -227,7 +227,7 @@
         if (!checkscript()) return(0);
         break;
       case 'd': if (pid || svstatus[19] != 0) return(0); break;
    -  case 'C': if (pid) if (!checkscript()) return(0); break;
    +  case 'C': if (!pid || !checkscript()) return(0); break;
       case 't':
       case 'k':
         if (!pid && svstatus[17] == 'd') break;

With this change, "sv check" works in a much more useful way. If all the services specified are up it will exit with exit code 0, and if not it will wait until the timeout for them to come up, and return a non-zero exit code if any are still down.

Is there any reason why I should not make this change? Have I misunderstood what "sv check" is supposed to do? If this change is OK, could it be included in future releases of runit?

Regards,

James Byrne
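For a stock runit without the sv.c patch above, one possible fail-fast alternative is to inspect the status line printed by "sv status" instead of relying on the exit code of "sv check". This is only a sketch: the function name status_is_up is hypothetical, and it assumes that the status line for a running service begins with "run:" and for a stopped one with "down:", which is worth confirming against the installed runit version.

```shell
#!/bin/sh
# Hypothetical helper for a runit ./run script (not part of runit).
# Assumption: "sv status SERVICE" prints a line starting with "run:"
# when the service process is running, e.g.
#   run: postgresql: (pid 1234) 56s

# status_is_up STATUS_LINE
# Returns 0 if the given status line reports a running service,
# 1 otherwise (down, failed, or empty output).
status_is_up() {
    case $1 in
        run:*) return 0 ;;
        *)     return 1 ;;
    esac
}
```

A run script could then use status_is_up "$(sv status postgresql)" || exit 1 ahead of the exec line, so that the service fails immediately and runsv retries it. Note that "run:" only says the dependency's process exists, not that it is ready to serve requests.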
Re: I need your advice on this web page
Hi Laurent,

Thanks for this. I would only add: it is *extremely* unfortunate that service management frameworks have come to be so conflated with pid 1. With the focus turned so exclusively on init, many people are losing sight of the benefits of supervision in general, and portable, cross-platform, init-agnostic service management in particular.

Wayne
http://b0llix.net/perp/

On Fri, 16 Jan 2015 14:12:07 +0100 Laurent Bercot ska-supervis...@skarnet.org wrote:

On 16/01/2015 01:05, Steve Litt wrote:

http://www.troubleshooters.com/linux/init/features_and_benefits.htm

(I'm lacking sleep and I'm going to talk about systemd. Not a good combination. So, apologies in advance for the rant, for the inevitable coarse language, and for the very opinionated post.)

Hi Steve,

The main comment that I have to make after reading your document is that despite your attempt at impartiality, and avowed liking of daemontools-inspired schemes, it is still systemd-centric and biased in its favor. Not purposefully, of course, but simply because the systemd propaganda machine works, and has already taught you to think in systemd terms - which, let it be said openly, are often pure marketing bullshit. Let's dig into some of those.

I. Socket activation.

This has to be my new favorite marketing buzzword. Socket activation, people. (My sockets are activated. I put my feet into them, and now they move. It's awesome.)

Last summer, I wrote a post about it - and you were in that discussion, Steve:

http://forums.gentoo.org/viewtopic-t-994548-postdays-0-postorder-asc-start-25.html#7581522

The short version of it is that socket activation, as systemd defines it, is a hack that mixes several different already-existing concepts in a shaker, and what you get in the end is *worse* than if you had nothing at all - but since everything is mixed and confused, nobody notices, and systemd can pretend it's doing that wonderful thing that no other system does, and people believe it.
When you think "socket activation", the questions to ask are the following. (I wrote answers from the s6 point of view, which mostly applies to other supervision suites too.)

Q1. Does the init system work as a super-server pre-opening and binding sockets so that daemons do not have to do it themselves ?

A1. It is NOT the freaking init system's job to pre-open sockets. Doing so requires the init system to be aware of every single socket-listening daemon on the machine, which translates into a central registry. And that is how you turn Unix into Microsoft Windows.

If you need to pre-open and bind sockets all at once, use inetd. This is exactly what inetd does, and at least it doesn't require running as process 1, or communicating over D-Bus.

Better, use decent superservers, such as s6-tcpserver or tcpsvd, one per service. For Unix domain sockets, which is what systemd focuses on (and rightly so), there's s6-ipcserver. Starting those superservers in parallel, as any supervision suite can, will end up being just as fast as trying to open every possible socket early on in process 1.

There is no reason at all why a superserver should be tied to an init system.

Q2. Does the init system allow you to start processes as soon as the sockets are open, before the servers are ready ? This is the much touted benefit of socket activation on http://0pointer.de/blog/projects/socket-activation.html

A2. HELL NO. WHY ON EARTH WOULD I WANT THAT ?

- Doing so has a serious reliability cost: if a service ends up having issues, but dependent services have been started and are assuming that it's working, hilarity will ensue. I mean, you could also play Dance Dance Revolution on a mat made of old WWII landmines. What could possibly go wrong ?

- It's especially twisted with logging. Sure, start your daemons before the logger is running, no problem. That way, if anything goes wrong, you'll have no way of knowing what happened. Have fun debugging.

- The speed benefits are minimal at best.
As I wrote in my post, daemons can perform their first writes in parallel, but as soon as they have to read, they block anyway, waiting for their dependencies. The only case where daemons write and never read, and could benefit from such a scheme, is... when they write to their logger. And, as we just saw, it is a really good idea to write logs before the logger is guaranteed operational.

This is just not worth it. Simply starting all the services as soon as possible in parallel will have the same benefits - the kernel will schedule everything to the best of its abilities, and there will be no risk of spectacular crashes.

Q3. Does the init system allow you to hold a copy of a bound socket for the daemon to retrieve if it has to restart ?

A3. This is what I call fd-holding, and is the *only* thing of value in socket activation. No, supervision suites do not perform
Re: runit and sv check for dependencies
Your assertion sounds scary and foreboding in theory, but is not an issue in practice. Certainly not an issue with the example runscript provided.

Wayne
http://b0llix.net/perp/

On Fri, 16 Jan 2015 11:11:06 -0500 (EST) Charlie Brady charlieb-supervis...@budge.apana.org.au wrote:

On Fri, 16 Jan 2015, Wayne Marshall wrote:

The better paradigm for dependency checking is this: for a service with any unmet dependency, fail immediately.

That's great in theory, but can be very expensive in CPU and other resources in practice. The dependency check code needs to be very lightweight.
service dependency examples?
Hi,

Would someone kindly provide a real-world example of a service dependency?

That is, some service "foo" that critically depends on another service "bar", and that satisfies either or both of the following conditions:

c1: service "foo" MUST NOT be started, or be attempted to be started, until service "bar" is running, lest something undesirable happens.

c2: service "foo" MUST be terminated whenever service "bar" stops running, lest something undesirable happens.

To elaborate a little, the following scenarios are of *no* interest to this inquiry:

s1: service "foo" fails immediately on startup if service "bar" is not available.

s2: service "foo" terminates itself whenever service "bar" is not available.

s3: service "foo" successfully starts and runs even if service "bar" is not running, and waits or queues or times-out or retries with diagnostics while "bar" is unavailable, but otherwise does nothing undesirable in the absence of "bar".

Many thanks,

Wayne