Re: perp - how to notify if service suddenly starts dying all the time

2015-07-17 Thread Wayne Marshall
On Fri, 17 Jul 2015 08:59:46 +0300
Georgi Chorbadzhiyski georgi.chorbadzhiy...@gmail.com wrote:

 On 07/16/15 15:13, Wayne Marshall wrote:
  A simple way to notify from perp is to send yourself (the admin) an email
  from within the reset target:
  
  ...
  reset() {
  case $3 in
  'exit')
    echo "*** $SVNAME: exited status $4 $PERP_SVSECS seconds runtime."
    mail -s "$SVNAME exited" ad...@myserver.com <<END_MAIL
  NOTICE:
  The $SVNAME service has exited status $4 after runtime of
  $PERP_SVSECS seconds.
  END_MAIL
  ;;
  'signal')
    echo "*** $SVNAME: killed on signal $5 $PERP_SVSECS seconds runtime."
  ;;
  *)
    echo "*** $SVNAME: stopped ($3) $PERP_SVSECS seconds runtime."
  ;;
  esac
  exit 0
  }
  ...
  
  
  The above example shows usage of a generic mail(1) command, which may
  vary a little among platforms/mail agents.  It also uses a shell
  here-document to generate the body of the email.
  
  This is just a bare-bones starting point.  You could embellish it
  to suit your own site's requirements.
  
  Another suggestion is to develop an executable perp_notify script
  that incorporates the above to provide a consistent notification
  message, without having to duplicate it in each and every runscript.
 
 Thanks, I already have something like the solution you've described,
 but I was looking for something else.  Maybe an additional ENV
 variable (or some other mechanism) that keeps, for example, the
 number of service restarts in the last minute.
 
 I don't want to overwhelm our admin team with notices on every service
 restart (we are managing thousands of servers). I need a notice only
 if the service restarts more than X times in a minute, which is a
 sign that something is most definitely wrong.
 
 I'll have to hack something up.
 
 Thanks for the response.
 

Hi Georgi,

Thanks for the suggestion of an exit loop counter for perpd.  It is
good information.  But by itself it would not be enough for your case,
because you would still need to track your last notification externally.

In the meantime, your notification hack can be something fairly simple.
Every time the service exits abnormally, test against a file timestamp
somewhere (e.g. /var/run/perp/myservice/exit_notify).  Send a new
notification and update the timestamp if, say, the last notification was
more than 3 minutes ago.

Something is usually wrong if a service exits abnormally, which is the
'exit' condition, as opposed to the 'signal' condition.  The exit
condition can also be filtered more specifically by exit code.

Unfortunately your custom notification tool will probably need to be
developed with something slightly more powerful than plain sh(1)
scripting, because test(1) does not offer much in the way of time-math
facilities for the timestamp comparison.
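
For what it's worth, here is a rough sketch of that timestamp idea in
plain sh after all, assuming a GNU-style find(1) with -mmin is
available (the stamp path and mail address below are placeholders of
mine, not anything perp provides); find(1) does the age comparison
that test(1) cannot:

#!/bin/sh
## notify_throttle: called from reset() on an abnormal exit;
## sends at most one notice every 3 minutes.
STAMP=/var/run/perp/myservice/exit_notify

## if the stamp exists and is younger than 3 minutes, stay quiet:
if [ -e "$STAMP" ] && [ -n "$(find "$STAMP" -mmin -3 2>/dev/null)" ]; then
  exit 0
fi

mail -s "myservice exited abnormally" admin@example.com <<END_MAIL
NOTICE:
The myservice service exited abnormally at $(date).
END_MAIL

touch "$STAMP"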

All the best,

Wayne




Re: perp - how to notify if service suddenly starts dying all the time

2015-07-16 Thread Wayne Marshall
On Thu, 16 Jul 2015 09:52:55 +0300
Georgi Chorbadzhiyski geo...@unixsol.org wrote:

 Yesterday, something corrupted the database file that Redis uses;
 Redis crashed and then refused to start.
 
 I'm using perp to monitor the service, and of course perp was doing
 its job and restarted the service after it died.  The problem is that
 I can't think of a way to notify myself if a service dies all the
 time.  In this case, since Redis has never died on me before, it would
 be enough to know if the service has been restarted X times in the
 last 30 seconds (for example).
 
 I can monitor the logs, but that doesn't seem like a good idea
 (starting a parallel monitoring service for each service that is
 being monitored).
 
 Any ideas?
 
 Here is what my rc.main script for the service looks like (it is
 pretty standard).
 
 #!/bin/sh
 
 exec 2>&1
 
 TARGET=$1
 SVNAME=$2
 
 [ -z "$SVNAME" ] && SVNAME=$(basename $(readlink -m $(dirname $0)))
 
 start() {
 echo "*** $SVNAME: starting..."
 exec runuid -s redis /usr/bin/redis
 }
 
 reset() {
 case $3 in
 'exit')
 echo "*** $SVNAME: exited status $4 $PERP_SVSECS seconds runtime." ;;
 'signal')
 echo "*** $SVNAME: killed on signal $5 $PERP_SVSECS seconds runtime." ;;
 *)
 echo "*** $SVNAME: stopped ($3) $PERP_SVSECS seconds runtime." ;;
 esac
 exit 0
 }
 
 eval $TARGET "$@"
 
 exit 0
 

Hi Georgi,

A simple way to notify from perp is to send yourself (the admin) an email
from within the reset target:

...
reset() {
case $3 in
'exit')
  echo "*** $SVNAME: exited status $4 $PERP_SVSECS seconds runtime."
  mail -s "$SVNAME exited" ad...@myserver.com <<END_MAIL
NOTICE:
The $SVNAME service has exited status $4 after runtime of $PERP_SVSECS
seconds.
END_MAIL
;;
'signal')
  echo "*** $SVNAME: killed on signal $5 $PERP_SVSECS seconds runtime."
;;
*)
  echo "*** $SVNAME: stopped ($3) $PERP_SVSECS seconds runtime."
;;
esac
exit 0
}
...


The above example shows usage of a generic mail(1) command, which may vary
a little among platforms/mail agents.  It also uses a shell here-document
to generate the body of the email.

This is just a bare-bones starting point.  You could embellish it to
suit your own site's requirements.

Another suggestion is to develop an executable perp_notify script that
incorporates the above to provide a consistent notification message,
without having to duplicate it in each and every runscript.
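
As a rough illustration of that idea (the script path, argument layout,
and mail address below are placeholders of mine that simply mirror the
runscript above; nothing here is shipped with perp), a reset target
could hand its parameters to one shared script:

  perp_notify "$SVNAME" "$3" "$4" "$5"

#!/bin/sh
## /usr/local/bin/perp_notify -- hypothetical shared notifier
## args: svname, reason ('exit'/'signal'/other), exit status, signal
SVNAME=$1
HOW=$2
CODE=$3
SIG=$4

case "$HOW" in
  'exit')   WHAT="exited status $CODE" ;;
  'signal') WHAT="killed on signal $SIG" ;;
  *)        WHAT="stopped ($HOW)" ;;
esac

## PERP_SVSECS is inherited from the runscript's environment:
echo "*** $SVNAME: $WHAT, $PERP_SVSECS seconds runtime."

mail -s "$SVNAME: $WHAT" admin@example.com <<END_MAIL
NOTICE:
The $SVNAME service $WHAT after $PERP_SVSECS seconds runtime.
END_MAIL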

All the best,

Wayne




Re: s6 ordering and run-once?

2015-06-16 Thread Wayne Marshall
On Tue, 16 Jun 2015 20:58:35 -0400
Steve Litt sl...@troubleshooters.com wrote:

   Does anyone know how to do a run-once service without putting an
   infinite sleep loop at the end?

In perp you would simply touch(1) a file named flag.once in the
service definition directory.  See the section STARTUP MODIFICATION in
the perpd(8) man page for the full story:

http://b0llix.net/perp/site.cgi?page=perpd.8
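
A minimal sketch, assuming a hypothetical service directory
/etc/perp/myonce:

## make myonce a run-once service (see STARTUP MODIFICATION in perpd(8)):
touch /etc/perp/myonce/flag.once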

 
 What do you do if a oneshot requires that a longrun is already
 running?


In perp you would simply use the perpok(8) utility to test for the
required service, and use the runpause(8) utility to hack oneshot into
a persistent process (_not_ using the flag.once flagfile in this
case).

Here is a complete perp runscript example for oneshot, as would be found
in the file /etc/perp/oneshot/rc.main:

#!/bin/sh
exec 2>&1

if test "$1" = 'start' ; then
  if ! perpok -u3 longrun ; then
    echo "dependency failure: longrun not running..."
    exit 1
  else
    echo "starting oneshot..."
    /usr/bin/oneshot
    exec runpause -L 'oneshot' 0 /bin/true
  fi
fi

if test "$1" = 'reset' ; then
  echo "resetting oneshot..."
fi

### eof 


The relevant manual pages:

perpok(8):
http://b0llix.net/perp/site.cgi?page=perpok.8

runpause(8):
http://b0llix.net/perp/site.cgi?page=runpause.8


Wayne
http://b0llix.net/perp/
 


Re: staggering runsv startup

2015-06-06 Thread Wayne Marshall
Off the top of my head, here is an easy solution in perp that requires no
special or supplemental scripting, flag-file tricks, etc.

For multiple service instances of /usr/bin/myserv -- named myserv00,
myserv01, myserv02, ..., myservNN -- deploy the following set of
service definitions.

First, basic myserv00 runscript (in /etc/perp/myserv00/rc.main):

#!/bin/sh
exec 2>&1

TARGET=$1
SVNAME=$2

start() {
  echo "starting ${SVNAME}..."
  exec /usr/bin/myserv
}

reset() {
  echo "resetting ${SVNAME}..."
  exit 0
}

eval ${TARGET} "$@"
### eof


Next myserv01 runscript (in /etc/perp/myserv01/rc.main), showing just
the start() stanza for brevity:

...
start() {
  if ! perpok -u3 myserv00 ; then
    echo "${SVNAME}: not yet running myserv00"
    exit 1
  fi
  echo "starting ${SVNAME}..."
  exec /usr/bin/myserv
}
...


Next myserv02 runscript (in /etc/perp/myserv02/rc.main), again the
start() stanza:

...
start() {
  if ! perpok -u3 myserv01 ; then
    echo "${SVNAME}: not yet running myserv01"
    exit 1
  fi
  echo "starting ${SVNAME}..."
  exec /usr/bin/myserv
}
...


And so forth, each instance of myservXX using perpok(8) to check that the
previous instance is up and running before starting the current instance.

Many permutations of this basic theme are possible and can be fine-tuned
to match your local installation, e.g.:

* Subsequent instances of myserv may all use myserv00 as the base
  instance, and modify/increment the -u secs parameter to perpok(8)
  accordingly.

* Multiple instances of myservNN may be grouped to use the same
  previous instance with the same -u secs parameter to perpok(8),
  permitting startup of a few at a time, rather than one at a time as
  shown above.

See also runtools such as runargs(8), which may allow you to set up your
runscripts from a basic template, loading the perpok(8) parameters from
an external file.
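
For example, a sketch of that templating idea, written in plain sh
rather than with runargs(8) itself, and assuming rc.main is invoked
with the service directory as its working directory (both assumptions
of mine; adjust to taste):

#!/bin/sh
## shared rc.main template; each service directory may carry an
## optional ./depends file naming its predecessor, e.g.:
##   DEP=myserv00
##   SECS=3
exec 2>&1

TARGET=$1
SVNAME=$2

start() {
  if [ -r ./depends ]; then
    . ./depends
    if ! perpok -u${SECS:-3} "$DEP" ; then
      echo "${SVNAME}: not yet running ${DEP}"
      exit 1
    fi
  fi
  echo "starting ${SVNAME}..."
  exec /usr/bin/myserv
}

reset() {
  echo "resetting ${SVNAME}..."
  exit 0
}

eval ${TARGET} "$@"
### eof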

Wayne
http://b0llix.net/perp/


On Thu, 04 Jun 2015 13:41:12 -0700
Jameson Graef Rollins jroll...@finestructure.net wrote:

 Hi, all.  I am using runit to supervise a large set of nearly
 identical processes.  Each process accesses certain IO-bound shared
 resources (e.g. NFS mount) at startup.  At system initialization,
 when runsvdir is launched, it launches all these processes (via
 runsv) essentially simultaneously.  This causes a big resource
 contention at initialization that occasionally causes problems.
 
 What I would like is to somehow stagger the startup of the processes,
 to avoid the resource contention.  I could do this by putting a random
 sleep into the ./run scripts, but this would also cause random startup
 delays on subsequent process restarts via sv restart or the like
 (which we occasionally need to do).
 
 What I would prefer instead is to add random delays to the startup of
 the *runsv* processes, since this would only apply at system
 initialization.  Unfortunately I can't see any way to do that right
 now (other than somehow wrapping the runsv binary itself).
 
 Does anyone know any way to accomplish what I'm looking for?  I don't
 believe runsvdir supports any options that would apply here.  Is it
 possible to somehow point runsvdir to an alternate runsv executable to
 which I could add the random delays?
 
 Any suggestions would be much appreciated.  Thanks.
 
 jamie.



taxonomy of dependencies

2015-04-28 Thread Wayne Marshall
Quite a lot of clock-cycles are being devoted to the discussion of
dependencies among services.

I would like to suggest that not all dependencies are created equal.
That is, some (if not most) dependencies are really of no practical
consequence -- and we don't need to worry about them in terms of
sequentializing service start-up.  Other dependencies (and their number
is fewer) may in fact require our attention through simple handling
within the runscript.

First a preliminary statement: service dependencies are of a different
nature than, say, build dependencies or package installation
dependencies.  Efforts to use these paradigms for ordering service
start-up will generally lead to unnecessary complexity.  Unnecessary
complexity is a Bad Thing.  As the system becomes complex, it becomes
opaque, confusing, prone to error, hard to troubleshoot, and difficult
to administer. Then you are right back to the problems of, say,
sysvinit or systemd from which you are trying to escape.

To the extent that we can recognize and categorize service
dependencies, we may simplify our runscripts considerably.  What follows
below is an effort (admittedly a first-cut) to describe a taxonomy of
service dependencies.

0) No dependency.  A service has no functional dependency on any other
service.  (Mentioned here for completeness).  Nothing to worry about.

1) Logical only dependency.  A service has only a logical dependency on
another service; in terms of functional behavior however, arranging for
ordered start-up is unnecessary.  There is no pathological functional
behavior associated with starting in parallel or in any particular
order. Again, nothing to worry about.

2) Functional dependency - soft.  A service has a functional dependency
on another service, as in, it cannot perform its task without the other
service running.  Yet there is no pathological functional behavior
exhibited by the service in cases where the dependency is not running.
That is, the service simply defers connections or reports not ready
until the dependency is running.  Once again, still nothing for us to
worry about in terms of special handling within runscripts, as the
services may be started in any order.

3) Functional dependency - medium.  A service has a functional
dependency on another service, and fails to start if the dependency is
not running.  This is actually quite a nice arrangement, because the
service itself is testing for its dependency without any additional
effort on our part.  Under a supervision framework, failure of a
service starting is absolutely ok.  (Many novices fail to grasp the
elegance of this essential feature.)  The system will automatically
attempt to restart the failed service at intervals until the dependency
is met.  Yet again, nothing to worry about in our runscripts.

4) Functional dependency - hard.  A service has a functional dependency
on another service, yet will start and run without error in the absence
of the dependency, and doing so results in some kind of pathological
behavior.  The pathological behavior may expose the system to security
vulnerability, resource blockage, or provide users with erroneous data
or bad results.  Now -- finally -- this is a case we have to worry
about.  The runscripts must be designed to explicitly require the
dependency before starting the service.

When working with dependencies in this last category #4, all we have to
do is make runscripts that effectively turn them into category #3.
That is, we simply want to immediately fail any service whose required
dependency is not running.

An example of dependency handling under the perp system is illustrated
in the perpok(8) manual page in the EXAMPLES section:

http://b0llix.net/perp/site.cgi?page=perpok.8

Alternatively, one may write purpose-specific dependency checking
utilities, such as with ncat, etc., to make sure that the dependency is
not only running, but serving a set of expected results.
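
A sketch of such a purpose-specific check (the host, port, and use of a
netcat-style nc(1) supporting -z/-w are assumptions of mine, not part of
perp); a runscript's start() would run this and exit 1 on failure,
turning a category #4 dependency into a category #3 one:

#!/bin/sh
## check_dep: succeed only if something is answering on the
## dependency's TCP port (a liveness probe, not just a pid check)
HOST=127.0.0.1
PORT=5432

if ! nc -z -w 2 "$HOST" "$PORT" 2>/dev/null ; then
  echo "dependency check failed: nothing listening on ${HOST}:${PORT}"
  exit 1
fi
exit 0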

Note also that in no case is it necessary for a service runscript to try
starting dependencies itself -- this is all left to the supervisor.
All the runscript needs to do for category #4 dependencies is check for
the dependency, and fail immediately if that check fails.  Simplissimo.

--
Wayne
http://b0llix.net/perp/


Re: runit and sv check for dependencies

2015-01-16 Thread Wayne Marshall
The sv check paradigm is a bit wrong-headed in its approach to
dependency handling.  It forces a service to block and wait for its
dependency.  To the rest of the world, that service will then seem like
it is up and running normally, when in fact it may only be waiting for
an unmet dependency.

The better paradigm for dependency checking is this:  for a service
with any unmet dependency, fail immediately.

The supervisor itself will then automatically take care of trying to
restart the service at periodic intervals, until the dependency check
for that service succeeds.

The perpok(8) utility for dependency checking is included in the perp
distribution and described here:

http://b0llix.net/perp/site.cgi?page=perpok.8

A complete perp runscript for the scenario you describe might look like
this:

#!/bin/sh
exec 2>&1

TARGET=$1

start() {
echo "starting lighttpd..."
## postgresql dependency check:
if ! perpok -u 3 postgresql ; then
echo "sorry: dependency check failure on postgresql"
exit 1
fi
## dependency check ok, start lighttpd:
exec lighttpd -f /etc/lighttpd/lighttpd.conf -D
}

reset() {
echo "resetting lighttpd..."
exit 0
}

eval ${TARGET} "$@"
### eof (/etc/perp/lighttpd/rc.main)


We may generally resist the idea of failure being okay.  But in the case
of dependency checking within a service-management framework, failing --
and failing quickly -- is actually the best thing to do.

Wayne
http://b0llix.net/perp/


On Wed, 14 Jan 2015 16:24:19 +
James Byrne james.by...@origamienergy.com wrote:

 Hi,
 
 I am working on an embedded Linux system where I want to use the
 'runit' tools to start various system services, and I have an issue
 where sv check doesn't seem to behave in a useful way.
 
 I have seen it suggested (specifically in the article at 
 http://rubyists.github.io/2011/05/02/runit-for-ruby-and-everything-else.html) 
 that sv check can be used to implement dependencies in the run
 file. The example given in the article is:
 
 /service/lighttpd/run:
#!/bin/sh -e
sv -w7 check postgresql
    exec 2>&1 lighttpd -f /etc/lighttpd/lighttpd.conf -D
 
 It goes on to say: "This would wait 7 seconds for the postgresql
 service to be running, exiting with an error if that timeout is
 reached. runsv will then run this script again. Lighttpd will never
 be executed unless sv check exits without an error (postgresql is
 up)."
 
 However in practice this will not work, because sv check will
 return exit code 0 if the postgresql service is down, or if it
 failed to run at all (i.e. if postgresql/run exited with a non-zero
 exit code).
 
 Having looked at the code and done various tests (using runit 2.1.2), 
 sv check doesn't appear to be very useful with its current
 behaviour. The documentation is ambiguous about what it does, saying
 that it will:
 
 Check for the service to be in the state that’s been requested. Wait
 up to 7 seconds for the service to reach the requested state, then
 report the status or timeout.
 
 This doesn't really make sense, because there isn't any such thing as 
 the requested state.
 
 My solution is to make the following change to sv.c:
 
 --- old/sv.c  2014-08-10 19:22:34.0 +0100
 +++ new/sv.c  2015-01-14 14:29:31.384556297 +
 @@ -227,7 +227,7 @@
        if (!checkscript()) return(0);
        break;
      case 'd': if (pid || svstatus[19] != 0) return(0); break;
 -    case 'C': if (pid) if (!checkscript()) return(0); break;
 +    case 'C': if (!pid || !checkscript()) return(0); break;
      case 't':
      case 'k':
        if (!pid && svstatus[17] == 'd') break;
 
 With this change, sv check works in a much more useful way.  If all
 the services specified are up, it will exit with exit code 0; if not,
 it will wait until the timeout for them to come up, and return a
 non-zero exit code if any are still down.
 
 Is there any reason why I should not make this change? Have I 
 misunderstood what sv check is supposed to do? If this change is
 OK, could it be included in future releases of runit?
 
 Regards,
 
 James Byrne
 



Re: I need your advice on this web page

2015-01-16 Thread Wayne Marshall
Hi Laurent,

Thanks for this.

I would only add:  it is *extremely* unfortunate that service
management frameworks have come to be so conflated with pid 1.

With the focus turned so exclusively on init, many people are losing
sight of the benefits of supervision in general, and portable,
cross-platform, init-agnostic service management in particular.

Wayne
http://b0llix.net/perp/


On Fri, 16 Jan 2015 14:12:07 +0100
Laurent Bercot ska-supervis...@skarnet.org wrote:

 On 16/01/2015 01:05, Steve Litt wrote:
  http://www.troubleshooters.com/linux/init/features_and_benefits.htm
 
   (I'm lacking sleep and I'm going to talk about systemd. Not a good
 combination. So, apologies in advance for the rant, for the inevitable
 coarse language, and for the very opinionated post.)
 
   Hi Steve,
 
   The main comment that I have to make after reading your document is
 that despite your attempt at impartiality, and avowed liking of
 daemontools-inspired schemes, it is still systemd-centric and biased
 in its favor. Not purposefully, of course, but simply because the
 systemd propaganda machine works, and has already taught you to think
 in systemd terms - which, let it be said openly, are often pure
 marketing bullshit.
 
   Let's dig into some of those.
 
 
   I. Socket activation.
 
   This has to be my new favorite marketing buzzword. Socket
 activation, people. (My sockets are activated. I put my feet into
 them, and now they move. It's awesome.)
   Last summer, I wrote a post about it - and you were in that
 discussion, Steve:
   
 http://forums.gentoo.org/viewtopic-t-994548-postdays-0-postorder-asc-start-25.html#7581522
 
   The short version of it is that socket activation, as systemd
 defines it, is a hack that mixes several different already-existing
 concepts in a shaker, and what you get in the end is *worse* than if
 you had nothing at all - but since everything is mixed and confused,
 nobody notices, and systemd can pretend it's doing that wonderful
 thing that no other system does, and people believe it.
   When you think socket activation, the questions to ask are the
 following. (I wrote answers from the s6 point of view, which mostly
 applies to other supervision suites too.)
 
   Q1. Does the init system work as a super-server pre-opening and
 binding sockets so that daemons do not have to do it themselves ?
 
   A1. It is NOT the freaking init system's job to pre-open sockets.
 Doing so requires the init system to be aware of every single
 socket-listening daemon on the machine, which translates into a
 central registry. And that is how you turn Unix into Microsoft
 Windows. If you need to pre-open and bind sockets all at once, use
 inetd. This is exactly what inetd does, and at least it doesn't
 require running as process 1, or communicating over D-Bus.
   Better, use decent superservers, such as s6-tcpserver or tcpsvd, one
 per service. For Unix domain sockets, which is what systemd focuses on
 (and rightly so), there's s6-ipcserver. Starting those superservers in
 parallel, as any supervision suite can, will end up being just as
 fast as trying to open every possible socket early on in process 1.
 There is no reason at all why a superserver should be tied to an init
 system.
 
   Q2. Does the init system allow you to start processes as soon as the
 sockets are open, before the servers are ready ? This is the much
 touted benefit of socket activation on
   http://0pointer.de/blog/projects/socket-activation.html
 
   A2. HELL NO. WHY ON EARTH WOULD I WANT THAT ?
   - Doing so has a serious reliability cost: if a service ends up
 having issues, but dependent services have been started and are
 assuming that it's working, hilarity will ensue. I mean, you could
 also play Dance Dance Revolution on a mat made of old WWII landmines.
 What could possibly go wrong ?
   - It's especially twisted with logging. Sure, start your daemons
 before the logger is running, no problem. That way, if anything goes
 wrong, you'll have no way of knowing what happened. Have fun
 debugging.
   - The speed benefits are minimal at best. As I wrote in my post,
 daemons can perform their first writes in parallel, but as soon as
 they have to read, they block anyway, waiting for their dependencies.
 The only case where daemons write and never read, and could benefit
 from such a scheme, is... when they write to their logger. And, as we
 just saw, it is a really good idea to write logs before the logger is
 guaranteed operational. This is just not worth it. Simply starting
 all the services as soon as possible in parallel will have the same
 benefits - the kernel will schedule everything to the best of its
 abilities, and there will be no risk of spectacular crashes.
 
   Q3. Does the init system allow you to hold a copy of a bound socket
 for the daemon to retrieve if it has to restart ?
 
   A3. This is what I call fd-holding, and is the *only* thing of
 value in socket activation. No, supervision suites do not perform
 

Re: runit and sv check for dependencies

2015-01-16 Thread Wayne Marshall
Your assertion sounds scary and foreboding in theory, but is not an
issue in practice.

Certainly not an issue with the example runscript provided.

Wayne
http://b0llix.net/perp/


On Fri, 16 Jan 2015 11:11:06 -0500 (EST)
Charlie Brady charlieb-supervis...@budge.apana.org.au wrote:

 
 On Fri, 16 Jan 2015, Wayne Marshall wrote:
 
  The better paradigm for dependency checking is this:  for a service
  with any unmet dependency, fail immediately.
 
 That's great in theory, but can be very expensive in CPU and other 
 resources in practice. The dependency check code needs to be very 
 lightweight.
 



service dependency examples?

2014-11-03 Thread Wayne Marshall
Hi,

Would someone kindly provide a real-world example of a service
dependency?  That is, some service foo that critically depends on
another service bar, and that satisfies either or both of the
following conditions:

 c1: service foo MUST NOT be started, or be attempted to be started,
 until service bar is running, lest something undesirable happens.

 c2: service foo MUST be terminated whenever service bar stops
 running, lest something undesirable happens.

To elaborate a little, the following scenarios are of *no* interest to
this inquiry:

 s1: service foo fails immediately on startup if service bar is not
 available.

 s2: service foo terminates itself whenever service bar is not
 available.

 s3: service foo successfully starts and runs even if service bar
 is not running, and waits or queues or times-out or retries with
 diagnostics while bar is unavailable, but otherwise does nothing
 undesirable in the absence of bar.

Many thanks,

Wayne