Re: [PATCH/RFC] Re: logpipe error handling patch

2014-08-08 Thread Gerrit Pape
On Thu, Aug 07, 2014 at 11:39:08AM +0200, Jan Pobrislo wrote:
> On Fri, 1 Aug 2014 05:32:16 +
> Gerrit Pape  wrote:
> > On Wed, Jul 30, 2014 at 03:01:16PM +0200, Jan Pobrislo wrote:
> > > I was under the assumption that the pipe is left open so you can
> > > send signals in case the supervised processes have trouble exiting
> > > the normal way.
> > 
> > The service stopped already in this situation, I'm with Laurent in
> > this case.  But not the log service, so its control pipe should stay
> > open and commands processed.
> 
> Can't it happen the other way around? (log/run exiting and ./run
> being stalled) I didn't dig too deep into runsv code to see if that case
> is handled differently.

This is about runsv behavior when receiving the x command through the
control interface, which only is supported for the main service, not the
log service.  So, no.

> > What do you think about this patch?:
> > 
> > 
> > When runsv is told to exit, it now no longer processes any control
> > characters read from the control pipe (it still does for the log
> > service).  It waits until the log service has terminated and then
> > exits.  The sv program now properly reports this through the
> > status command.
> 
> Using sv with runsv in "wait" state:
> * Will I still be able to see PIDs of the processes until they exit?
> * Will I be able to send signals to them using sv kill etc. or will I
>   have to extract the PIDs and kill them manually?

In the wait state, the service is down, so there's no pid.  For the
appendant log service you still can use all the functionality of sv.

> * How this interacts with the control scripts?

Nothing should change in this regard.

The devground branch in my git repository has a slightly revised version
of the patch.  I'd be happy to receive feedback, there might still be
some glitches.

Regards, Gerrit.


Re: [PATCH/RFC] Re: logpipe error handling patch

2014-08-07 Thread Jan Pobrislo
On Fri, 1 Aug 2014 05:32:16 +
Gerrit Pape  wrote:
> On Wed, Jul 30, 2014 at 03:01:16PM +0200, Jan Pobrislo wrote:
> > I was under the assumption that the pipe is left open so you can
> > send signals in case the supervised processes have trouble exiting
> > the normal way.
> 
> The service stopped already in this situation, I'm with Laurent in
> this case.  But not the log service, so its control pipe should stay
> open and commands processed.

Can't it happen the other way around? (log/run exiting and ./run
being stalled) I didn't dig too deep into runsv code to see if that case
is handled differently.

> What do you think about this patch?:
> 
> 
> When runsv is told to exit, it now no longer processes any control
> characters read from the control pipe (it still does for the log
> service).  It waits until the log service has terminated and then
> exits.  The sv program now properly reports this through the
> status command.

Using sv with runsv in "wait" state:
* Will I still be able to see PIDs of the processes until they exit?
* Will I be able to send signals to them using sv kill etc. or will I
  have to extract the PIDs and kill them manually?
* How this interacts with the control scripts?

Thanks for clarifications


[PATCH/RFC] Re: logpipe error handling patch

2014-07-31 Thread Gerrit Pape
On Wed, Jul 30, 2014 at 03:01:16PM +0200, Jan Pobrislo wrote:
> On Wed, 30 Jul 2014 00:37:47 +0100
> Laurent Bercot  wrote:
> > At this point, the service is
> > dysfunctional and there's no way to recover it: so runsv should not
> > be accepting commands anymore, and close its control pipe instead.
> 
> I was under the assumption that the pipe is left open so you can send
> signals in case the supervised processes have trouble exiting the
> normal way.

The service stopped already in this situation, I'm with Laurent in this
case.  But not the log service, so its control pipe should stay open and
commands processed.

What do you think about this patch?:


When runsv is told to exit, it now no longer processes any control
characters read from the control pipe (it still does for the log
service).  It waits until the log service has terminated and then
exits.  The sv program now properly reports this through the status
command.
---
 man/runsv.8 |  1 +
 man/sv.8|  1 +
 package/CHANGES |  2 ++
 src/runsv.c | 25 ++---
 src/sv.c|  5 -
 5 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/man/runsv.8 b/man/runsv.8
index 7c5abfc..8d364c6 100644
--- a/man/runsv.8
+++ b/man/runsv.8
@@ -150,6 +150,7 @@ exits.
 If the service is down and a log service exists,
 .B runsv
 closes the standard input of the log service, and waits for it to terminate.
+No more control characters will be processed.
 If the log service is down,
 .B runsv
 exits.
diff --git a/man/sv.8 b/man/sv.8
index 7b09602..0c9f4c3 100644
--- a/man/sv.8
+++ b/man/sv.8
@@ -78,6 +78,7 @@ exits.
 If the service is down and a log service exists,
 .BR runsv (8)
 closes the standard input of the log service and waits for it to terminate.
+No more commands will be processed.
 If the log service is down,
 .BR runsv (8)
 exits.
diff --git a/package/CHANGES b/package/CHANGES
index 82d117c..1286c8f 100644
--- a/package/CHANGES
+++ b/package/CHANGES
@@ -1,5 +1,7 @@
 2.1.2
 
+  * runsv.c, sv.c: on exit, properly wait for log service to terminate.
+  * man/runsv.8, man/sv.8: adapt.
   * man/sv.8: "sv exit" does not send a TERM signal to the log service
 (thx Jonathan Nieder).
   * fmt_ptime.c: 64 bits time_t fix for svlogd (tnx Jérémie
diff --git a/src/runsv.c b/src/runsv.c
index ecf4677..dc628fa 100644
--- a/src/runsv.c
+++ b/src/runsv.c
@@ -31,6 +31,7 @@ int selfpipe[2];
 #define S_DOWN 0
 #define S_RUN 1
 #define S_FINISH 2
+#define S_WAIT 3
 /* ctrl */
 #define C_NOOP 0
 #define C_TERM 1
@@ -149,6 +150,9 @@ void update_status(struct svdir *s) {
   case S_FINISH:
 buffer_puts(&b, "finish");
 break;
+  case S_WAIT:
+buffer_puts(&b, "wait");
+break;
   }
   if (s->ctrl & C_PAUSE) buffer_puts(&b, ", paused");
   if (s->ctrl & C_TERM) buffer_puts(&b, ", got TERM");
@@ -566,6 +570,13 @@ int main(int argc, char **argv) {
 continue;
   }
 svd[0].state =S_DOWN;
+if (svd[0].want == W_EXIT) {
+  svd[0].state =S_WAIT;
+  svd[1].want =W_EXIT;
+  update_status(&svd[1]);
+  if (close(logpipe[1]) == -1) warn("unable to close logpipe[1]");
+  if (close(logpipe[0]) == -1) warn("unable to close logpipe[0]");
+}
 taia_uint(&deadline, 1);
 taia_add(&deadline, &svd[0].start, &deadline);
 taia_now(&svd[0].start);
@@ -586,22 +597,14 @@ int main(int argc, char **argv) {
 }
   }
 }
-if (read(svd[0].fdcontrol, &ch, 1) == 1) ctrl(&svd[0], ch);
+if (read(svd[0].fdcontrol, &ch, 1) == 1)
+  if (svd[0].state != S_WAIT) ctrl(&svd[0], ch);
 if (haslog)
   if (read(svd[1].fdcontrol, &ch, 1) == 1) ctrl(&svd[1], ch);
 
 if (sigterm) { ctrl(&svd[0], 'x'); sigterm =0; }
 
-if ((svd[0].want == W_EXIT) && (svd[0].state == S_DOWN)) {
-  if (svd[1].pid == 0) _exit(0);
-  if (svd[1].want != W_EXIT) {
-svd[1].want =W_EXIT;
-/* stopservice(&svd[1]); */
-update_status(&svd[1]);
-if (close(logpipe[1]) == -1) warn("unable to close logpipe[1]");
-if (close(logpipe[0]) == -1) warn("unable to close logpipe[0]");
-  }
-}
+if (svd[0].state == S_WAIT) if (svd[1].pid == 0) _exit(0);
   }
   _exit(0);
 }
diff --git a/src/sv.c b/src/sv.c
index 06b2e41..db0bd1d 100644
--- a/src/sv.c
+++ b/src/sv.c
@@ -26,6 +26,7 @@
 #define RUN "run: "
 #define FINISH  "finish: "
 #define DOWN"down: "
+#define WAIT"wait: "
 #define TIMEOUT "timeout: "
 #define KILL"kill: "
 
@@ -132,9 +133,10 @@ unsigned int svstatus_print(char *m) {
   case 0: outs(DOWN); break;
   case 1: outs(RUN); break;
   case 2: outs(FINISH); break;
+  case 3: outs(WAIT); break;
   }
   outs(m); outs(": ");
-  if (svstatus[19]) {
+  if (pid) {
 outs("(pid "); sulong[fmt_ulong(sulong, pid)] =0;
 outs(sulong); outs(") ");
   }
@@ -146,6 +148,7 @@ unsigned int svstatus_print(char *m) {
   if (pid && svstatus[16]) outs(", paused");
   if (!pid

Re: logpipe error handling patch

2014-07-30 Thread Jan Pobrislo
On Wed, 30 Jul 2014 00:37:47 +0100
Laurent Bercot  wrote:

> On 29/07/2014 20:59, Jan Pobrislo wrote:
> > There are two possible resolutions - either make runsv create new
> > pipe when asked to start up again or make it die since it's unable
> > to spawn a child.
> 
>   Actually, there's only one possible resolution.
>   When runsv is told to exit, it kills the service and closes the log
> pipe, which is impossible to reopen.

Yes, it's impossible to reopen the very same pipe. What's possible is
to create new pipe for use with newly spawned processes. The existing
ones are shutting down now anyway, or at least are supposed to be, so
indeed we need to do the functional equivalent of runsv restart even if
we do create new pipe.

Note that I don't really like this approach much, just exiting as my
first patch does is much simpler and cleaner imo.

> At this point, the service is
> dysfunctional and there's no way to recover it: so runsv should not
> be accepting commands anymore, and close its control pipe instead.

I was under the assumption that the pipe is left open so you can send
signals in case the supervised processes have trouble exiting the
normal way.

>   runsv could even exit right after sending the SIGTERM to the
> service, that's the quick and easy solution that closes both the log
> pipe and the control pipe. However, it also means runsvdir will
> restart it, and the new runsv instance will try and spawn a new
> logger when the old one may still be hanging around, and that could
> cause conflicts.

I'm trying to build my systems so they can recover from any percentage
of killed processes at any time, except of PID1, so this would have
to be handled either way. A good logger here will exit on held locks and
only read it's input when it knows it can write. But indeed it's
unnecessary complication.

Normally one would remove a symlink to the service directory when
sending the x command, so it doesn't get respawned by runsvdir.
It's still possible though for it to be re-added back when marking the
service up again, so it would be possible for such behavior to occur.
To prevent it we either need to exit unconditionally (ignoring any up
or once commands after x) or create new pipe as described above after
both processes exited.

> A more advanced behaviour is to have potentially
> configurable timeouts in runsv to ensure that the logger finally dies
> and the service can be restarted:

You can look at start-stop-daemon in OpenRC to see example of this.
From manpage:

These options are only used for stopping daemons:

-R, --retry timeout | signal/timeout
The retry specification can be either a timeout in seconds or
multiple signal/timeout pairs (like SIGTERM/5).

For runsvdir itself I can see --retry HUP/5/TERM/5 being used in the
default initscript.

You can have configurable shutdown behavior this way, but presumably
this is exactly what should go into service/control/t -- but that
wouldn't work with the log indeed. I wonder if there would be any
benefit making runsv understand similar syntax, but that'd probably
make things just unnecessarily complex.

Only thing I can think of that exiting unconditionally after x
interferes is the o (once) command; I presume that when executing up
and down commands the ./down file is created/removed appropriately so
it doesn't matter if runsv is restarted. The once command is a tricky
one in this regard, but I can't conjure in my mind an use-case where
it'd matter.


Re: logpipe error handling patch

2014-07-29 Thread Laurent Bercot

On 29/07/2014 20:59, Jan Pobrislo wrote:

There are two possible resolutions - either make runsv create new pipe
when asked to start up again or make it die since it's unable to spawn
a child.


 Actually, there's only one possible resolution.
 When runsv is told to exit, it kills the service and closes the log pipe,
which is impossible to reopen. At this point, the service is dysfunctional
and there's no way to recover it: so runsv should not be accepting commands
anymore, and close its control pipe instead. Any sv command after "sv x"
should return an error, because the service is not in a state where it can
process commands; it will only be functional again when the logger has died,
runsv has died and runsvdir has restarted runsv.

 runsv could even exit right after sending the SIGTERM to the service, that's
the quick and easy solution that closes both the log pipe and the control
pipe. However, it also means runsvdir will restart it, and the new runsv
instance will try and spawn a new logger when the old one may still be hanging
around, and that could cause conflicts.
 A more advanced behaviour is to have potentially configurable timeouts in runsv
to ensure that the logger finally dies and the service can be restarted:
 - close the log pipe
 - give the logger some grace time
 - send it a SIGTERM (and a SIGCONT, just to be sure. Stopped processes don't 
die.)
 - give it some more grace time
 -  send it a SIGKILL
 - give it a bit more time
 - panic loudly.

--
 Laurent



RE: logpipe error handling patch

2014-07-29 Thread James Powell
Thanks Jan. I'll try Gerrit's git repository first to see if it has any further 
updates before I use the patch.

> Date: Tue, 29 Jul 2014 22:44:51 +0200
> From: c...@webprojekty.cz
> To: supervision@list.skarnet.org
> Subject: Re: logpipe error handling patch
> 
> On Tue, 29 Jul 2014 13:08:37 -0700
> James Powell  wrote:
> 
> > Yeah its a workaround but I've implemented it anyway in my LFS hint
> > to be sure it works right.
> > 
> > If you don't mind me asking Jan, can I host it on the runit-for-LFS
> > project on GoogleCode?
> 
> Sure, go ahead if you want. I think it's better to fix this in the
> upstream package though, one less patch to mantain.
  

Re: logpipe error handling patch

2014-07-29 Thread Jan Pobrislo
On Tue, 29 Jul 2014 13:08:37 -0700
James Powell  wrote:

> Yeah its a workaround but I've implemented it anyway in my LFS hint
> to be sure it works right.
> 
> If you don't mind me asking Jan, can I host it on the runit-for-LFS
> project on GoogleCode?

Sure, go ahead if you want. I think it's better to fix this in the
upstream package though, one less patch to mantain.


Re: logpipe error handling patch

2014-07-29 Thread Jan Pobrislo
Wrote a quick mockup for restoring the pipe (untested, may eat your
pet). I suppose there'll be better places to do pipe initialization at,
if you want to go this route, but this one ought to work.diff --git a/src/runsv.c b/src/runsv.c
index ecf4677..991e211 100644
--- a/src/runsv.c
+++ b/src/runsv.c
@@ -57,7 +57,7 @@ struct svdir svd[2];
 int sigterm =0;
 int haslog =0;
 int pidchanged =1;
-int logpipe[2];
+int logpipe[2] = {0,0};
 char *dir;
 
 void usage () { strerr_die4x(1, "usage: ", progname, USAGE, "\n"); }
@@ -89,6 +89,15 @@ void s_term() {
   write(selfpipe[1], "", 1); /* XXX */
 }
 
+void setup_logpipe(void) {
+  if(logpipe[0])
+return;
+  if (pipe(logpipe) == -1)
+fatal("unable to create log pipe");
+  coe(logpipe[0]);
+  coe(logpipe[1]);
+}
+
 void update_status(struct svdir *s) {
   unsigned long l;
   int fd;
@@ -280,6 +289,7 @@ void startservice(struct svdir *s) {
   }
 
   if (s->pid != 0) stopservice(s); /* should never happen */
+  if (haslog) setup_logpipe();
   while ((p =fork()) == -1) {
 warn("unable to fork, sleeping");
 sleep(5);
@@ -428,10 +438,7 @@ int main(int argc, char **argv) {
   taia_now(&svd[1].start);
   if (stat("log/down", &s) != -1)
 svd[1].want =W_DOWN;
-  if (pipe(logpipe) == -1)
-fatal("unable to create log pipe");
-  coe(logpipe[0]);
-  coe(logpipe[1]);
+  setup_logpipe();
 }
   }
 
@@ -599,7 +606,9 @@ int main(int argc, char **argv) {
 /* stopservice(&svd[1]); */
 update_status(&svd[1]);
 if (close(logpipe[1]) == -1) warn("unable to close logpipe[1]");
+logpipe[1] = 0;
 if (close(logpipe[0]) == -1) warn("unable to close logpipe[0]");
+logpipe[0] = 0;
   }
 }
   }


RE: logpipe error handling patch

2014-07-29 Thread James Powell
Yeah its a workaround but I've implemented it anyway in my LFS hint to be sure 
it works right.

If you don't mind me asking Jan, can I host it on the runit-for-LFS project on 
GoogleCode?

Sent from my Windows Phone

From: Jan Pobrislo<mailto:c...@webprojekty.cz>
Sent: ‎7/‎29/‎2014 1:00 PM
To: supervision@list.skarnet.org<mailto:supervision@list.skarnet.org>
Subject: Re: logpipe error handling patch

On Tue, 29 Jul 2014 15:22:36 +
Gerrit Pape  wrote:

> Thanks for the patch, but it works around the bug and doesn't fix its
> root cause.

There are two possible resolutions - either make runsv create new pipe
when asked to start up again or make it die since it's unable to spawn
a child. My patch implements latter (because it's the simpler of
possible resolutions) and will not break anything if the former is ever
implemented. runsv instance that passes wrong filedescriptors to its
child is broken and should be killed so it can be respawned.

I can try to implement some more graceful behaviour if you can think of
one, but I'd still like to see this patch included as it doesn't break
anything - it's just a sanity check.


Re: logpipe error handling patch

2014-07-29 Thread Jan Pobrislo
On Tue, 29 Jul 2014 15:22:36 +
Gerrit Pape  wrote:
 
> Thanks for the patch, but it works around the bug and doesn't fix its
> root cause.

There are two possible resolutions - either make runsv create new pipe
when asked to start up again or make it die since it's unable to spawn
a child. My patch implements latter (because it's the simpler of
possible resolutions) and will not break anything if the former is ever
implemented. runsv instance that passes wrong filedescriptors to its
child is broken and should be killed so it can be respawned.

I can try to implement some more graceful behaviour if you can think of
one, but I'd still like to see this patch included as it doesn't break
anything - it's just a sanity check.


Re: logpipe error handling patch

2014-07-29 Thread Gerrit Pape
On Sat, Jul 26, 2014 at 10:44:18AM +0200, c...@webprojekty.cz wrote:
> Hello, I wrote a patch for how broken filedescriptors when creating child
> are handled. Currently only the child dies, leaving non-working instance of
> runsv in place. My patch makes it kill the parent so the runsv process can
> be respawned properly when it gets to the state in which it can't produce
> working children.

Thanks for the patch, but it works around the bug and doesn't fix its
root cause.  Laurent was correct when analysing this a few years ago,
thanks for that
 http://permalink.gmane.org/gmane.comp.sysutils.supervision.general/2026

Here's how to reproduce:
$ mkdir -p bug/log
$ cat 

logpipe error handling patch

2014-07-26 Thread ccx
Hello, I wrote a patch for how broken filedescriptors when creating child
are handled. Currently only the child dies, leaving non-working instance of
runsv in place. My patch makes it kill the parent so the runsv process can
be respawned properly when it gets to the state in which it can't produce
working children.

This should fix the behaviour described in
Message-id: CAN_+VLUtc+JBAATN55b2o6G8tmH4Y8OKLCf-yA6maBt+c1=g...@mail.gmail.com
http://blog.gmane.org/gmane.comp.sysutils.supervision.general/month=20130301
and
Message-id: pine.lnx.4.64.1008161622390.20...@e-smith.charlieb.ott.istop.com
http://blog.gmane.org/gmane.comp.sysutils.supervision.general/month=20100801

--- runsv.c.orig2014-07-24 16:17:43.087191947 +0200
+++ runsv.c 2014-07-24 18:27:29.720588407 +0200
@@ -62,6 +62,11 @@
 
 void usage () { strerr_die4x(1, "usage: ", progname, USAGE, "\n"); }
 
+void fatal_child(char *m) {
+  strerr_warn5("runsv ", dir, ": fatal: ", m, ": ", &strerr_sys);
+  kill(getppid(), SIGTERM);
+  _exit(111);
+}
 void fatal(char *m) {
   strerr_die5sys(111, "runsv ", dir, ": fatal: ", m, ": ");
 }
@@ -289,14 +294,14 @@
 if (haslog) {
   if (s->islog) {
 if (fd_copy(0, logpipe[0]) == -1)
-  fatal("unable to setup filedescriptor for ./log/run");
+  fatal_child("unable to setup filedescriptor for ./log/run");
 close(logpipe[1]);
 if (chdir("./log") == -1)
   fatal("unable to change directory to ./log");
   }
   else {
 if (fd_copy(1, logpipe[1]) == -1)
-  fatal("unable to setup filedescriptor for ./run");
+  fatal_child("unable to setup filedescriptor for ./run");
 close(logpipe[0]);
   }
 }