I do want to supervise it, e.g. to restart it when it is killed by SIGSEGV, but not when a library is missing.
My daemon does not have readiness notification, so s6-rc considers the transition to be successful. I call "s6-svc -d ." in the finish script, so the daemon is not restarted by s6-supervise, but s6-rc still lists it as "up". To get s6-rc back to a coherent state, I need to call "s6-rc -d change svc", but I first need to wait until s6-rc has finished its pending transition.

Basically, when s6-rc reads 'd' or 'D' on the fifodir, it could check whether the service is still up or not (s6-svstat). However, I am not sure whether it is acceptable for s6-rc to receive such state changes from the outside: what should it do with the dependencies then? Bring them down? For applications that collaborate (i.e. with readiness notification) you can probably do that, because dependent services are not yet started, but for others it seems hazardous.

Maybe for non-collaborating daemons, up transitions should be considered successful only if the daemon stays up for 1 second. It sounds awful at first, but thinking about it, it may not be such a bad idea...

Kr,
Lionel

________________________________________
From: firstname.lastname@example.org <email@example.com> on behalf of Laurent Bercot <ska-supervis...@skarnet.org>
Sent: Thursday, March 2, 2017 4:00:27 PM
To: firstname.lastname@example.org
Subject: Re: [s6-rc] How to handle longrun failures

>Using s6-rc, I am not sure how to handle longrun failures. Say I have a
>daemon which fails to start (e.g. missing library, cannot read its
>config...). I don't want to start it again.

It sounds like you don't want to supervise this daemon. In that case, run it as a oneshot that backgrounds itself, and make sure the parent exits nonzero if the child doesn't succeed. But if you do want to supervise it, keep reading:

> For oneshot transitions the return code determines whether the
>transition is successful or not. For longruns I see the only reason for
>an up transition to fail is a timeout on readiness notification.
>However I do not want to use a timeout in this case. Typically, in the
>finish script of a longrun service, I would like to decide, based on
>the return code or signal number, to put the service down.

That makes sense, and it's possible to do it at the s6 level (just call s6-svc -d . in the finish script). However, from the s6-rc point of view, you have asked a supervised service to transition from down to up, so it will not stop trying until the service is actually up or it times out.

My advice for now would be to:
1. write your ./finish script with a s6-svc -d when you want to stop restarting the daemon
2. set a reasonable timeout-up value in your s6-rc definition, so when the daemon fails and ./finish tells s6 to stop restarting it, the notification never arrives and s6-rc eventually times out and gives up.

It's kind of ugly, but it's the best you can do for now. I will think about implementing a way for s6 to tell s6-rc to fail a longrun transition instantly, without waiting for a timeout. It's a good idea, thanks for mentioning it.

--
Laurent
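
For illustration, step 1 of the advice above could be sketched as the following ./finish script. It is only a sketch under stated assumptions: s6-supervise passes the exit code of ./run as the first argument (256 if ./run was killed by a signal) and the signal number as the second, but the helper name should_stay_down and the choice of exit codes treated as permanent failures (127 and 78) are hypothetical and would need adapting to the actual daemon.

```shell
#!/bin/sh
# Sketch of a ./finish script for an s6 longrun.
# s6-supervise runs ./finish with:
#   $1 = exit code of ./run (256 if ./run was killed by a signal)
#   $2 = signal number (meaningful only when $1 is 256; unused here)

# ASSUMPTION: the set of "permanent failure" exit codes is hypothetical.
#   127 = commonly reported when a command or shared library is missing
#   78  = EX_CONFIG from sysexits.h, if the daemon uses it for config errors
# should_stay_down is a hypothetical helper, not part of s6.
should_stay_down() {
    code=$1
    case "$code" in
        256)    return 1 ;;  # killed by a signal (e.g. SIGSEGV): restart
        78|127) return 0 ;;  # permanent failure: stay down
        *)      return 1 ;;  # anything else: let s6-supervise restart
    esac
}

# Only act when called with arguments, as s6-supervise does.
if [ "$#" -ge 1 ] && should_stay_down "$1"; then
    # Tell s6-supervise not to restart the service.
    s6-svc -d .
fi
```

For step 2, the s6-rc source definition directory of the service accepts an optional timeout-up file containing a value in milliseconds; with that set, s6-rc gives up on the up transition once the timeout expires.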