On 08/09/15 14:50, Laurent Bercot wrote:
> On 08/09/2015 14:10, Jan Bramkamp wrote:
>> How would the ./run script or more likely the daemon it exec()ed into
>> die from a failed child process?
>
> The child process could s6-svc -t if it fails to find readiness, for
> instance. There should be an option in the polling tool to kill the
> daemon if the polling does not succeed.
> I went too far in saying "the run script will die": there needs to
> be support for that, indeed. But "the service is stuck" problem is
> easy to fix.
Not if something kills the polling script, e.g. a stray kill -9 $WRONG_PID.
Such things shouldn't happen, but that's why I want a supervision tree
rooted in init. If anything happens to a subtree, the supervisor for that
subtree restarts it, and if something happens to the root of the
supervision tree (init), the kernel panics and a hardware watchdog
triggers within a few seconds. To let services fail and restart, the
infrastructure has to notice errors. Maybe adding an optional timeout
between forking the ./run script and the readiness notification to
s6-supervise would solve the problem without depending on other daemons.
Since such errors are expected to be very rare, a longer recovery time
(whatever the admin guessed as a worst-case startup time) would be
an appropriate trade-off if it avoids complexity. It would make sense to
signal this condition to the ./finish script and at least log it from there.