Re: [PATCH/RFC] Re: logpipe error handling patch
On Thu, Aug 07, 2014 at 11:39:08AM +0200, Jan Pobrislo wrote: > On Fri, 1 Aug 2014 05:32:16 + > Gerrit Pape wrote: > > On Wed, Jul 30, 2014 at 03:01:16PM +0200, Jan Pobrislo wrote: > > > I was under the assumption that the pipe is left open so you can > > > send signals in case the supervised processes have trouble exiting > > > the normal way. > > > > The service stopped already in this situation, I'm with Laurent in > > this case. But not the log service, so its control pipe should stay > > open and commands processed. > > Can't it happen the other way around? (log/run exiting and ./run > being stalled) I didn't dig too deep into runsv code to see if that case > is handled differently. This is about runsv behavior when receiving the x command through the control interface, which only is supported for the main service, not the log service. So, no. > > What do you think about this patch?: > > > > > > When runsv is told to exit, it now no longer processes any control > > characters read from the control pipe (it still does for the log > > service). It waits until the log service has terminated and then > > exits. The sv program now properly reports this through the > > status command. > > Using sv with runsv in "wait" state: > * Will I still be able to see PIDs of the processes until they exit? > * Will I be able to send signals to them using sv kill etc. or will I > have to extract the PIDs and kill them manually? In the wait state, the service is down, so there's no pid. For the appendant log service you still can use all the functionality of sv. > * How this interacts with the control scripts? Nothing should change in this regard. The devground branch in my git repository has a slightly revised version of the patch. I'd be happy to receive feedback, there might still be some glitches. Regards, Gerrit.
Re: [PATCH/RFC] Re: logpipe error handling patch
On Fri, 1 Aug 2014 05:32:16 + Gerrit Pape wrote: > On Wed, Jul 30, 2014 at 03:01:16PM +0200, Jan Pobrislo wrote: > > I was under the assumption that the pipe is left open so you can > > send signals in case the supervised processes have trouble exiting > > the normal way. > > The service stopped already in this situation, I'm with Laurent in > this case. But not the log service, so its control pipe should stay > open and commands processed. Can't it happen the other way around? (log/run exiting and ./run being stalled) I didn't dig too deep into runsv code to see if that case is handled differently. > What do you think about this patch?: > > > When runsv is told to exit, it now no longer processes any control > characters read from the control pipe (it still does for the log > service). It waits until the log service has terminated and then > exits. The sv program now properly reports this through the > status command. Using sv with runsv in "wait" state: * Will I still be able to see PIDs of the processes until they exit? * Will I be able to send signals to them using sv kill etc. or will I have to extract the PIDs and kill them manually? * How this interacts with the control scripts? Thanks for clarifications
[PATCH/RFC] Re: logpipe error handling patch
On Wed, Jul 30, 2014 at 03:01:16PM +0200, Jan Pobrislo wrote: > On Wed, 30 Jul 2014 00:37:47 +0100 > Laurent Bercot wrote: > > At this point, the service is > > dysfunctional and there's no way to recover it: so runsv should not > > be accepting commands anymore, and close its control pipe instead. > > I was under the assumption that the pipe is left open so you can send > signals in case the supervised processes have trouble exiting the > normal way. The service stopped already in this situation, I'm with Laurent in this case. But not the log service, so its control pipe should stay open and commands processed. What do you think about this patch?: When runsv is told to exit, it now no longer processes any control characters read from the control pipe (it still does for the log service). It waits until the log service has terminated and then exits. The sv program now properly reports this through the status command. --- man/runsv.8 | 1 + man/sv.8| 1 + package/CHANGES | 2 ++ src/runsv.c | 25 ++--- src/sv.c| 5 - 5 files changed, 22 insertions(+), 12 deletions(-) diff --git a/man/runsv.8 b/man/runsv.8 index 7c5abfc..8d364c6 100644 --- a/man/runsv.8 +++ b/man/runsv.8 @@ -150,6 +150,7 @@ exits. If the service is down and a log service exists, .B runsv closes the standard input of the log service, and waits for it to terminate. +No more control characters will be processed. If the log service is down, .B runsv exits. diff --git a/man/sv.8 b/man/sv.8 index 7b09602..0c9f4c3 100644 --- a/man/sv.8 +++ b/man/sv.8 @@ -78,6 +78,7 @@ exits. If the service is down and a log service exists, .BR runsv (8) closes the standard input of the log service and waits for it to terminate. +No more commands will be processed. If the log service is down, .BR runsv (8) exits. diff --git a/package/CHANGES b/package/CHANGES index 82d117c..1286c8f 100644 --- a/package/CHANGES +++ b/package/CHANGES @@ -1,5 +1,7 @@ 2.1.2 + * runsv.c, sv.c: on exit, properly wait for log service to terminate. + * man/runsv.8, man/sv.8: adapt. * man/sv.8: "sv exit" does not send a TERM signal to the log service (thx Jonathan Nieder). * fmt_ptime.c: 64 bits time_t fix for svlogd (tnx Jérémie diff --git a/src/runsv.c b/src/runsv.c index ecf4677..dc628fa 100644 --- a/src/runsv.c +++ b/src/runsv.c @@ -31,6 +31,7 @@ int selfpipe[2]; #define S_DOWN 0 #define S_RUN 1 #define S_FINISH 2 +#define S_WAIT 3 /* ctrl */ #define C_NOOP 0 #define C_TERM 1 @@ -149,6 +150,9 @@ void update_status(struct svdir *s) { case S_FINISH: buffer_puts(&b, "finish"); break; + case S_WAIT: +buffer_puts(&b, "wait"); +break; } if (s->ctrl & C_PAUSE) buffer_puts(&b, ", paused"); if (s->ctrl & C_TERM) buffer_puts(&b, ", got TERM"); @@ -566,6 +570,13 @@ int main(int argc, char **argv) { continue; } svd[0].state =S_DOWN; +if (svd[0].want == W_EXIT) { + svd[0].state =S_WAIT; + svd[1].want =W_EXIT; + update_status(&svd[1]); + if (close(logpipe[1]) == -1) warn("unable to close logpipe[1]"); + if (close(logpipe[0]) == -1) warn("unable to close logpipe[0]"); +} taia_uint(&deadline, 1); taia_add(&deadline, &svd[0].start, &deadline); taia_now(&svd[0].start); @@ -586,22 +597,14 @@ int main(int argc, char **argv) { } } } -if (read(svd[0].fdcontrol, &ch, 1) == 1) ctrl(&svd[0], ch); +if (read(svd[0].fdcontrol, &ch, 1) == 1) + if (svd[0].state != S_WAIT) ctrl(&svd[0], ch); if (haslog) if (read(svd[1].fdcontrol, &ch, 1) == 1) ctrl(&svd[1], ch); if (sigterm) { ctrl(&svd[0], 'x'); sigterm =0; } -if ((svd[0].want == W_EXIT) && (svd[0].state == S_DOWN)) { - if (svd[1].pid == 0) _exit(0); - if (svd[1].want != W_EXIT) { -svd[1].want =W_EXIT; -/* stopservice(&svd[1]); */ -update_status(&svd[1]); -if (close(logpipe[1]) == -1) warn("unable to close logpipe[1]"); -if (close(logpipe[0]) == -1) warn("unable to close logpipe[0]"); - } -} +if (svd[0].state == S_WAIT) if (svd[1].pid == 0) _exit(0); } _exit(0); } diff --git a/src/sv.c b/src/sv.c index 06b2e41..db0bd1d 100644 --- a/src/sv.c +++ b/src/sv.c @@ -26,6 +26,7 @@ #define RUN "run: " #define FINISH "finish: " #define DOWN"down: " +#define WAIT"wait: " #define TIMEOUT "timeout: " #define KILL"kill: " @@ -132,9 +133,10 @@ unsigned int svstatus_print(char *m) { case 0: outs(DOWN); break; case 1: outs(RUN); break; case 2: outs(FINISH); break; + case 3: outs(WAIT); break; } outs(m); outs(": "); - if (svstatus[19]) { + if (pid) { outs("(pid "); sulong[fmt_ulong(sulong, pid)] =0; outs(sulong); outs(") "); } @@ -146,6 +148,7 @@ unsigned int svstatus_print(char *m) { if (pid && svstatus[16]) outs(", paused"); if (!pid
Re: logpipe error handling patch
On Wed, 30 Jul 2014 00:37:47 +0100 Laurent Bercot wrote: > On 29/07/2014 20:59, Jan Pobrislo wrote: > > There are two possible resolutions - either make runsv create new > > pipe when asked to start up again or make it die since it's unable > > to spawn a child. > > Actually, there's only one possible resolution. > When runsv is told to exit, it kills the service and closes the log > pipe, which is impossible to reopen. Yes, it's impossible to reopen the very same pipe. What's possible is to create new pipe for use with newly spawned processes. The existing ones are shutting down now anyway, or at least are supposed to be, so indeed we need to do the functional equivalent of runsv restart even if we do create new pipe. Note that I don't really like this approach much, just exiting as my first patch does is much simpler and cleaner imo. > At this point, the service is > dysfunctional and there's no way to recover it: so runsv should not > be accepting commands anymore, and close its control pipe instead. I was under the assumption that the pipe is left open so you can send signals in case the supervised processes have trouble exiting the normal way. > runsv could even exit right after sending the SIGTERM to the > service, that's the quick and easy solution that closes both the log > pipe and the control pipe. However, it also means runsvdir will > restart it, and the new runsv instance will try and spawn a new > logger when the old one may still be hanging around, and that could > cause conflicts. I'm trying to build my systems so they can recover from any percentage of killed processes at any time, except of PID1, so this would have to be handled either way. A good logger here will exit on held locks and only read it's input when it knows it can write. But indeed it's unnecessary complication. Normally one would remove a symlink to the service directory when sending the x command, so it doesn't get respawned by runsvdir. It's still possible though for it to be re-added back when marking the service up again, so it would be possible for such behavior to occur. To prevent it we either need to exit unconditionally (ignoring any up or once commands after x) or create new pipe as described above after both processes exited. > A more advanced behaviour is to have potentially > configurable timeouts in runsv to ensure that the logger finally dies > and the service can be restarted: You can look at start-stop-daemon in OpenRC to see example of this. From manpage: These options are only used for stopping daemons: -R, --retry timeout | signal/timeout The retry specification can be either a timeout in seconds or multiple signal/timeout pairs (like SIGTERM/5). For runsvdir itself I can see --retry HUP/5/TERM/5 being used in the default initscript. You can have configurable shutdown behavior this way, but presumably this is exactly what should go into service/control/t -- but that wouldn't work with the log indeed. I wonder if there would be any benefit making runsv understand similar syntax, but that'd probably make things just unnecessarily complex. Only thing I can think of that exiting unconditionally after x interferes is the o (once) command; I presume that when executing up and down commands the ./down file is created/removed appropriately so it doesn't matter if runsv is restarted. The once command is a tricky one in this regard, but I can't conjure in my mind an use-case where it'd matter.
Re: logpipe error handling patch
On 29/07/2014 20:59, Jan Pobrislo wrote: There are two possible resolutions - either make runsv create new pipe when asked to start up again or make it die since it's unable to spawn a child. Actually, there's only one possible resolution. When runsv is told to exit, it kills the service and closes the log pipe, which is impossible to reopen. At this point, the service is dysfunctional and there's no way to recover it: so runsv should not be accepting commands anymore, and close its control pipe instead. Any sv command after "sv x" should return an error, because the service is not in a state where it can process commands; it will only be functional again when the logger has died, runsv has died and runsvdir has restarted runsv. runsv could even exit right after sending the SIGTERM to the service, that's the quick and easy solution that closes both the log pipe and the control pipe. However, it also means runsvdir will restart it, and the new runsv instance will try and spawn a new logger when the old one may still be hanging around, and that could cause conflicts. A more advanced behaviour is to have potentially configurable timeouts in runsv to ensure that the logger finally dies and the service can be restarted: - close the log pipe - give the logger some grace time - send it a SIGTERM (and a SIGCONT, just to be sure. Stopped processes don't die.) - give it some more grace time - send it a SIGKILL - give it a bit more time - panic loudly. -- Laurent
RE: logpipe error handling patch
Thanks Jan. I'll try Gerrit's git repository first to see if it has any further updates before I use the patch. > Date: Tue, 29 Jul 2014 22:44:51 +0200 > From: c...@webprojekty.cz > To: supervision@list.skarnet.org > Subject: Re: logpipe error handling patch > > On Tue, 29 Jul 2014 13:08:37 -0700 > James Powell wrote: > > > Yeah its a workaround but I've implemented it anyway in my LFS hint > > to be sure it works right. > > > > If you don't mind me asking Jan, can I host it on the runit-for-LFS > > project on GoogleCode? > > Sure, go ahead if you want. I think it's better to fix this in the > upstream package though, one less patch to mantain.
Re: logpipe error handling patch
On Tue, 29 Jul 2014 13:08:37 -0700 James Powell wrote: > Yeah its a workaround but I've implemented it anyway in my LFS hint > to be sure it works right. > > If you don't mind me asking Jan, can I host it on the runit-for-LFS > project on GoogleCode? Sure, go ahead if you want. I think it's better to fix this in the upstream package though, one less patch to mantain.
Re: logpipe error handling patch
Wrote a quick mockup for restoring the pipe (untested, may eat your pet). I suppose there'll be better places to do pipe initialization at, if you want to go this route, but this one ought to work.diff --git a/src/runsv.c b/src/runsv.c index ecf4677..991e211 100644 --- a/src/runsv.c +++ b/src/runsv.c @@ -57,7 +57,7 @@ struct svdir svd[2]; int sigterm =0; int haslog =0; int pidchanged =1; -int logpipe[2]; +int logpipe[2] = {0,0}; char *dir; void usage () { strerr_die4x(1, "usage: ", progname, USAGE, "\n"); } @@ -89,6 +89,15 @@ void s_term() { write(selfpipe[1], "", 1); /* XXX */ } +void setup_logpipe(void) { + if(logpipe[0]) +return; + if (pipe(logpipe) == -1) +fatal("unable to create log pipe"); + coe(logpipe[0]); + coe(logpipe[1]); +} + void update_status(struct svdir *s) { unsigned long l; int fd; @@ -280,6 +289,7 @@ void startservice(struct svdir *s) { } if (s->pid != 0) stopservice(s); /* should never happen */ + if (haslog) setup_logpipe(); while ((p =fork()) == -1) { warn("unable to fork, sleeping"); sleep(5); @@ -428,10 +438,7 @@ int main(int argc, char **argv) { taia_now(&svd[1].start); if (stat("log/down", &s) != -1) svd[1].want =W_DOWN; - if (pipe(logpipe) == -1) -fatal("unable to create log pipe"); - coe(logpipe[0]); - coe(logpipe[1]); + setup_logpipe(); } } @@ -599,7 +606,9 @@ int main(int argc, char **argv) { /* stopservice(&svd[1]); */ update_status(&svd[1]); if (close(logpipe[1]) == -1) warn("unable to close logpipe[1]"); +logpipe[1] = 0; if (close(logpipe[0]) == -1) warn("unable to close logpipe[0]"); +logpipe[0] = 0; } } }
RE: logpipe error handling patch
Yeah its a workaround but I've implemented it anyway in my LFS hint to be sure it works right. If you don't mind me asking Jan, can I host it on the runit-for-LFS project on GoogleCode? Sent from my Windows Phone From: Jan Pobrislo<mailto:c...@webprojekty.cz> Sent: 7/29/2014 1:00 PM To: supervision@list.skarnet.org<mailto:supervision@list.skarnet.org> Subject: Re: logpipe error handling patch On Tue, 29 Jul 2014 15:22:36 + Gerrit Pape wrote: > Thanks for the patch, but it works around the bug and doesn't fix its > root cause. There are two possible resolutions - either make runsv create new pipe when asked to start up again or make it die since it's unable to spawn a child. My patch implements latter (because it's the simpler of possible resolutions) and will not break anything if the former is ever implemented. runsv instance that passes wrong filedescriptors to its child is broken and should be killed so it can be respawned. I can try to implement some more graceful behaviour if you can think of one, but I'd still like to see this patch included as it doesn't break anything - it's just a sanity check.
Re: logpipe error handling patch
On Tue, 29 Jul 2014 15:22:36 + Gerrit Pape wrote: > Thanks for the patch, but it works around the bug and doesn't fix its > root cause. There are two possible resolutions - either make runsv create new pipe when asked to start up again or make it die since it's unable to spawn a child. My patch implements latter (because it's the simpler of possible resolutions) and will not break anything if the former is ever implemented. runsv instance that passes wrong filedescriptors to its child is broken and should be killed so it can be respawned. I can try to implement some more graceful behaviour if you can think of one, but I'd still like to see this patch included as it doesn't break anything - it's just a sanity check.
Re: logpipe error handling patch
On Sat, Jul 26, 2014 at 10:44:18AM +0200, c...@webprojekty.cz wrote: > Hello, I wrote a patch for how broken filedescriptors when creating child > are handled. Currently only the child dies, leaving non-working instance of > runsv in place. My patch makes it kill the parent so the runsv process can > be respawned properly when it gets to the state in which it can't produce > working children. Thanks for the patch, but it works around the bug and doesn't fix its root cause. Laurent was correct when analysing this a few years ago, thanks for that http://permalink.gmane.org/gmane.comp.sysutils.supervision.general/2026 Here's how to reproduce: $ mkdir -p bug/log $ cat
logpipe error handling patch
Hello, I wrote a patch for how broken filedescriptors when creating child are handled. Currently only the child dies, leaving non-working instance of runsv in place. My patch makes it kill the parent so the runsv process can be respawned properly when it gets to the state in which it can't produce working children. This should fix the behaviour described in Message-id: CAN_+VLUtc+JBAATN55b2o6G8tmH4Y8OKLCf-yA6maBt+c1=g...@mail.gmail.com http://blog.gmane.org/gmane.comp.sysutils.supervision.general/month=20130301 and Message-id: pine.lnx.4.64.1008161622390.20...@e-smith.charlieb.ott.istop.com http://blog.gmane.org/gmane.comp.sysutils.supervision.general/month=20100801 --- runsv.c.orig2014-07-24 16:17:43.087191947 +0200 +++ runsv.c 2014-07-24 18:27:29.720588407 +0200 @@ -62,6 +62,11 @@ void usage () { strerr_die4x(1, "usage: ", progname, USAGE, "\n"); } +void fatal_child(char *m) { + strerr_warn5("runsv ", dir, ": fatal: ", m, ": ", &strerr_sys); + kill(getppid(), SIGTERM); + _exit(111); +} void fatal(char *m) { strerr_die5sys(111, "runsv ", dir, ": fatal: ", m, ": "); } @@ -289,14 +294,14 @@ if (haslog) { if (s->islog) { if (fd_copy(0, logpipe[0]) == -1) - fatal("unable to setup filedescriptor for ./log/run"); + fatal_child("unable to setup filedescriptor for ./log/run"); close(logpipe[1]); if (chdir("./log") == -1) fatal("unable to change directory to ./log"); } else { if (fd_copy(1, logpipe[1]) == -1) - fatal("unable to setup filedescriptor for ./run"); + fatal_child("unable to setup filedescriptor for ./run"); close(logpipe[0]); } }