Thanks Gerrit! You've exactly pinpointed the issue. I can reproduce it reliably:
1. Create service/ and service/run, and let it get picked up by runsvdir 2. Create service/log/ and service/log/run Now this is in the bad state I originally described. I suppose a workaround for us is pretty simple: if the service directory does not exist (initial deploy), do the rsync to a temp directory and mv it into place. Do you think that runsv should more gracefully handle this order of events? Thanks for your help. -Caleb On Tue, Jul 29, 2014 at 8:11 AM, Gerrit Pape <p...@smarden.org> wrote: > On Fri, Jul 25, 2014 at 05:16:08PM -0700, Caleb Spare wrote: >> I've been using runit for a while now and it has been mostly >> wonderful. I'm noticing a persistent issue and I'm not sure how to >> debug it. > >> The problem I'm seeing is that, very occasionally, runsv will get into >> a bad state where svlogd is not running. (I'm not sure if it fails to >> start svlogd or if this happens later on after it has been running >> properly.) When the problem occurs, pstree shows something like this: >> >> runsvdir-+-runsv-+-foo---5*[{foo}] >> | `-svlogd >> |-runsv-+-bar---21*[{bar}] >> | `-svlogd >> `-runsv---baz---250*[{baz}] >> >> Here you can see that the baz process does not have an associated >> svlogd process. Further: >> >> $ sudo sv s foo >> run: foo: (pid 4885) 526260s; run: log: (pid 875) 526517s >> $ sudo sv s baz >> run: baz: (pid 2337) 2983swarning: baz: unable to open supervise/ok: >> file does not exist >> ; run: log: (pid 2337) 2983s >> >> Two strange things there: the warning about supervise/ok and also that >> the pid for 'log' is the same as for 'baz'. > > The warning actually should be fine and applies to the log service, but > not how the output os formatted. For the final line, this is a bug in > sv, here's a patch: > > diff --git a/runit-2.1.1/src/sv.c b/runit-2.1.1/src/sv.c > index 43224b5..5f3aa71 100644 > --- a/runit-2.1.1/src/sv.c > +++ b/runit-2.1.1/src/sv.c > @@ -153,7 +153,7 @@ int status(char *unused) { > int rc; > > rc =svstatus_get(); > - switch(r) { case -1: if (lsb) done(4); case 0: return(0); } > + switch(rc) { case -1: if (lsb) done(4); case 0: return(0); } > rc =svstatus_print(*service); > if (chdir("log") == -1) { > if (errno != error_noent) { > >> When runsv is in this bad state, the output from baz goes right to >> runsvdir and ends up in /var/log/upstart/runsvdir.log. >> >> The fix I've been using is to 'sv d baz' and then kill the offending >> runsv process. Runsvdir will quickly restart it and then everything >> will be working: > >> I'm unsure what causes this rare problem. We only do simple things >> with the runit: sv {t,d,u}. When we deploy services, we rsync a >> directory from elsewhere on the box into /etc/services/<name> and then >> 'sv t <name>'. That source dir only has ./run, ./finish, and >> ./log/run. > > Since the runsv service is running, but not the appendant log service, > and a restart of runsv fixes the issue, I guess it's a race when > deploying the service directories through rsync. It may be that runsv > starts up in the very moment where service/run already exists, but the > subdirectory service/log/ not yet. > > Regards, Gerrit.