On 08/09/15 14:50, Laurent Bercot wrote:
On 08/09/2015 14:10, Jan Bramkamp wrote:

How would the ./run script or more likely the daemon it exec()ed into
die from a failed child process?

  The child process could run s6-svc -t if it fails to detect readiness,
for instance. There should be an option in the polling tool to kill the
daemon if the polling does not succeed.
  I went too far in saying "the run script will die": there needs to
be support for that, indeed. But the "service is stuck" problem is
easy to fix.
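A minimal sketch of the kill-on-failure option described above, in Python for illustration only. The function name `poll_or_kill`, the `is_ready` callback, and the attempt/interval defaults are hypothetical, not part of any s6 tool. In a real s6 service the helper would more likely invoke `s6-svc -t` on the service directory than signal the PID directly.

```python
import os
import signal
import time
from typing import Callable


def poll_or_kill(daemon_pid: int, is_ready: Callable[[], bool],
                 attempts: int = 10, interval: float = 0.5) -> bool:
    """Poll is_ready() up to `attempts` times (hypothetical helper).

    Returns True as soon as the daemon reports ready. If it never does,
    sends SIGTERM to the daemon so the supervisor restarts the service
    instead of leaving it stuck, and returns False.
    """
    for _ in range(attempts):
        if is_ready():
            return True
        time.sleep(interval)
    # Readiness never came: terminate the daemon.
    os.kill(daemon_pid, signal.SIGTERM)
    return False
```

The weakness is the one raised in the reply: if the polling process itself is killed, nothing performs the final `os.kill`.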

Not if something kills the polling script, e.g. a stray kill -9 $WRONG_PID. Such things shouldn't happen, but that's why I want a supervision tree rooted in init. If anything happens to a subtree, the supervisor for that subtree restarts it, and if something happens to the root of the supervision tree (init), the kernel panics and a hardware watchdog triggers within a few seconds. For services to be allowed to fail and restart, the infrastructure has to notice the errors.

Maybe adding an optional timeout to s6-supervise, covering the interval between forking the ./run script and receiving the readiness notification, would solve the problem without depending on other daemons. Since such errors are expected to be very rare, a longer recovery time (whatever the admin guessed as a worst-case startup time) would be an appropriate trade-off if it avoids complexity. It would make sense to signal this condition to the ./finish script and at least log it from there.
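A rough Python sketch of the proposed supervisor-side timeout, assuming a readiness protocol in which the service writes a newline to a file descriptor it is handed (as with s6's notification-fd). The name `start_with_readiness_timeout` and its `run` callback are hypothetical; this is not how s6-supervise is implemented. Because the timer lives in the supervisor, killing a helper process cannot defeat it.

```python
import os
import select
import signal
import time
from typing import Callable, Tuple


def start_with_readiness_timeout(run: Callable[[int], None],
                                 timeout: float) -> Tuple[bool, int]:
    """Fork `run(notify_fd)` and wait up to `timeout` seconds for a newline.

    Returns (True, pid) if readiness was signalled in time. Otherwise the
    child is sent SIGTERM and (False, pid) is returned; the caller reaps it
    and could report the timeout to ./finish (hypothetical).
    """
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: stands in for ./run; it gets the write end as notify fd.
        os.close(r)
        try:
            run(w)
        finally:
            os._exit(0)
    os.close(w)
    poller = select.poll()
    poller.register(r, select.POLLIN)
    deadline = time.monotonic() + timeout
    ready = False
    while not ready:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # readiness deadline passed
        if not poller.poll(max(1, int(remaining * 1000))):
            break  # poll() timed out
        chunk = os.read(r, 64)
        if not chunk:
            break  # EOF: child closed the fd without notifying
        ready = b"\n" in chunk
    os.close(r)
    if not ready:
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass
    return ready, pid
```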
