Re: [s6-rc] How to handle longrun failures

2017-03-02 Thread Laurent Bercot

My daemon does not have readiness notification, so s6-rc considers
the transition to be successful. I call s6-svc -d . in the finish script,
so the daemon is not restarted by s6-supervise, but s6-rc lists it as "up".
To get s6-rc back to a coherent state, I need to call "s6-rc -d change svc",
but I first need to wait until s6-rc has finished its pending transition.


 As a temporary workaround, you could for instance set up a 2-second
timeout-up (the file takes milliseconds, so 2000), a notification-fd file
(containing, say, 3), and change your run script to something like:

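background
{
  # wait one second, then check that the daemon is still running
  if { sleep 1 }
  if { pipeline { s6-svstat . } grep -q ^up }
  # send the readiness newline on the notification fd
  fdmove 1 3
  echo
}
# close the notification fd before running the daemon itself
fdclose 3
your-real-run-script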

 so that if your daemon is up after 1 second, the service will be
considered ready, and if it is not, s6-rc will time out after 2 seconds
and consider the transition failed.
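
 For reference, the two extra files this workaround relies on could be
created as in the shell sketch below. The source directory
(./s6-rc-source) and the service name (mydaemon) are placeholders of mine,
not names from this thread; timeout-up is expressed in milliseconds.

```shell
#!/bin/sh
# Hypothetical source directory and service name; adjust to your setup.
src="${SRC:-./s6-rc-source}"
svc="${SVC:-mydaemon}"

mkdir -p "$src/$svc"
echo longrun > "$src/$svc/type"
# fd on which the run script writes its readiness newline
echo 3 > "$src/$svc/notification-fd"
# give up on the up transition after 2000 ms without notification
echo 2000 > "$src/$svc/timeout-up"
```

 The run script shown above goes in the same directory, and the source
then has to be recompiled with s6-rc-compile as usual.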


Basically when s6-rc reads 'd' or 'D' on the fifodir it could check whether
the service is still up or not (s6-svstat). However I am not sure if it is
acceptable that s6-rc receives such state changes from the outside:
what should it do with the dependencies then? Bring them down?


 s6-rc does nothing with dependencies in a single run. When it sees that
a transition fails, it just marks it as failed, and keeps working on
remaining available transitions. It exits when there is no more work that
it can do without retrying.
 If it exits 1, then some transitions failed. It is then up to the user to
retry - or to perform appropriate actions: "s6-rc -a list" shows the list
of active services, so it's possible to know what should be up but is not.
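
 A rough shell sketch of that last check: feed it the names that s6-rc
considers active, and it prints those whose supervised process is not
currently up. The function name not_up, the SVSTAT override, and the
/run/s6-rc/servicedirs path in the usage comment are assumptions of mine,
so adapt them to your live directory.

```shell
#!/bin/sh
# Reads service names on stdin (one per line) and prints the ones whose
# supervised process is not up.
# $1: directory containing the service directories (assumed layout).
# SVSTAT may be set to another command for testing; default is s6-svstat.
not_up() {
  svdirs=$1
  while read -r name; do
    # oneshots and bundles have no service directory: skip them
    [ -d "$svdirs/$name" ] || continue
    if ! "${SVSTAT:-s6-svstat}" "$svdirs/$name" </dev/null 2>/dev/null \
        | grep -q '^up'; then
      echo "$name"
    fi
  done
}

# Usage (not run here, path is an assumption):
#   s6-rc -a list | not_up /run/s6-rc/servicedirs
```

 This only compares the two states at one instant; it does not change
what s6-rc believes, so "s6-rc -d change" is still needed afterwards.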


 It is very much intentional that there are two distinct notions of state:

 - the state of the process, handled by s6
 - the state of the service, handled by s6-rc
 The process state is temporary, the service state is permanent (until
a new s6-rc invocation, which may or may not change the service state
depending on whether a transition is requested and succeeds).



Maybe for non-collaborating daemons, up transitions should be considered
successful only if the daemon stays up for 1 second.


 Yes, and it can be achieved by kludging the run script as above ;)
 I definitely don't want to make it official because it's unreliable
and daemons should ideally provide readiness notification, but it's an
existing possibility for users.

--
 Laurent



Re: [s6-rc] How to handle longrun failures

2017-03-02 Thread Laurent Bercot

Using s6-rc, I am not sure how to handle longrun failures. Say I have a
daemon which fails to start (e.g. missing library, cannot read its
config...). I don't want to start it again.


 It sounds like you don't want to supervise this daemon. In that case,
run it as a oneshot that backgrounds itself, and make sure the parent
exits nonzero if the child doesn't succeed.
 But if you do want to supervise it, keep reading:


For oneshot transitions the return code determines whether the
transition is successful or not. For longruns, the only reason I see for
an up transition to fail is a timeout on readiness notification.
However I do not want to use a timeout in this case. Typically, in the
finish script of a longrun service, I would like to decide, based on
the return code or signal number, to put the service down.


 That makes sense, and it's possible to do it at the s6 level (just call
s6-svc -d . in the finish script). However, from the s6-rc point of
view, you have asked a supervised service to transition from down to up,
so it will not stop trying until the service is actually up or it
times out.

 My advice for now would be to:
1. write your ./finish script with an s6-svc -d . call for when you want
to stop restarting the daemon
2. set a reasonable timeout-up value in your s6-rc definition, so when
the daemon fails and ./finish tells s6 to stop restarting it, the
notification never arrives and s6-rc eventually times out and gives up.
 It's kind of ugly, but it's the best you can do for now.
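
 A minimal sketch of such a ./finish script, in plain sh. s6-supervise
runs ./finish with the exit code of ./run as the first argument (256 if
it was killed by a signal) and the signal number as the second. The
policy in should_stay_down is purely illustrative: treating exit codes
78, 126 and 127 as "do not restart" is my assumption, not something
s6 defines.

```shell
#!/bin/sh
# $1: exit code of ./run, or 256 if ./run was killed by a signal
# $2: signal number (only meaningful when $1 is 256)

# Hypothetical policy: stay down on errors that a restart will not fix.
# 78 is EX_CONFIG in sysexits.h; 126 and 127 are the usual shell codes
# for "cannot execute" and "command not found".
should_stay_down() {
  case "$1" in
    78|126|127) return 0 ;;
    *) return 1 ;;
  esac
}

if should_stay_down "${1:-0}"; then
  # tell s6-supervise not to restart the daemon
  s6-svc -d .
fi
```

 With the timeout-up from step 2 in place, s6-rc then gives up on the
transition once the timeout expires.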

 I will think about implementing a way for s6 to tell s6-rc to fail a
longrun transition instantly, without waiting for a timeout. It's a good
idea, thanks for mentioning it.

--
 Laurent