On 23/06/2016 03:46, Thomas Lau wrote:
> LOL, well I am trying to do drill test and see how resilience of runit could be, this is one of the minor downfall.

Current supervisors have no way of knowing that they died and their child is still running. Hence, when they start again, they attempt to run their child again, which will probably fail since the old instance of the child is still running. So, they will periodically try and start the child again, only to fail again, and so on. On daemontools and s6, the period is 1 second. I'm not sure about runit, but it should be around 1s too.

Yes, it is a problem, and I don't like that behaviour much, but the alternatives are actually worse.

Currently, the consequences of the issue are that when a supervisor dies and restarts:

 - Depending on the run script, the daemon's logs are flooded with error messages from the run script failing to exec into the daemon.
 - Every second, some CPU is used to try and start the daemon.

I think those drawbacks are acceptable and trying to fix them is not a good idea:

 - Supervisors dying without their daemons dying are an extremely rare occurrence, not worth special-casing unless it causes systemic, unrecoverable failure, which is not the case.
 - What we'd want ideally: the new instance of the supervisor would "grab" the old instance of the daemon. But that is impossible under Unix, and any attempt to do that is doomed to use the same hacks that non-supervision systems use and that supervision aims to step away from.
 - Any attempt to kill the old instance of the daemon in order to properly start a new supervised instance is a policy decision, which belongs to the admin; the supervisor program can't make that decision automatically.
 - As is, even if the supervisor dies, the service keeps running; it's in "degraded mode" because the current instance isn't watched by a supervisor, but it's still running, and that's what's important. And if the daemon dies, a new, supervised instance will automatically take its place, as if the supervisor had never died: things will fix themselves on their own.
 - For critical services, the log flooding should trigger an alerting system that will notify the admins that there's a problem, and appropriate action can then be taken (i.e. either do nothing or kill the current instance of the daemon).
 - The periodic attempt to start a new instance of the daemon is generally not expensive. This is one of the reasons for the 1s respawning period: it gives the system time to breathe, without the "respawning too fast" problem that can be observed with, for instance, sysvinit. If the daemon uses a lot of resources before it notices it cannot succeed, that's a design issue in the daemon, not the supervisor; and even in that case, on critical machines there should be an alerting system that notices the spike in resource usage and notifies the admins.
 - Attempts to handle that edge case in the supervisor itself would add a lot (a real whole lot) of complexity, for very uncertain benefits.

So, yeah. Even if your logs freak out, your memcached is still running, and that's what you want. And stop voluntarily killing your runsv for testing purposes: the day when your runsv accidentally dies before the daemon it's supervising is the day when something's seriously wrong with your system and you have much bigger problems than spurious log messages.

--
Laurent
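
For illustration only, the respawn behaviour described above can be sketched in a few lines of Python. This is not how runsv or s6-supervise is implemented; it is a toy model under stated assumptions. The names supervise, demo and DAEMON are hypothetical, the stand-in daemon refuses to start by failing to take an flock() lock (a real daemon might instead fail to bind a port), and exit status 111 is an arbitrary choice. The delay argument plays the role of the 1 second respawning period.

```python
import fcntl
import os
import subprocess
import sys
import tempfile
import time

# Hypothetical stand-in for a daemon: it takes an exclusive lock and
# exits with status 111 if another instance already holds that lock.
DAEMON = """
import fcntl, sys
f = open(sys.argv[1], "w")
try:
    fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
except OSError:
    sys.exit(111)
sys.exit(0)  # a real daemon would keep running here
"""


def supervise(argv, attempts, delay=1.0):
    """Start argv, wait for it to exit, pause, then start it again.

    A real supervisor loops forever; `attempts` only lets the sketch end.
    Returns the exit status of every attempt.
    """
    statuses = []
    for _ in range(attempts):
        statuses.append(subprocess.call(argv))
        time.sleep(delay)  # breathing period between respawns
    return statuses


def demo(attempts=3, delay=0.1):
    with tempfile.TemporaryDirectory() as tmp:
        lock = os.path.join(tmp, "daemon.lock")
        argv = [sys.executable, "-c", DAEMON, lock]
        # This process plays the orphaned old daemon: it holds the lock.
        old = open(lock, "w")
        fcntl.flock(old, fcntl.LOCK_EX | fcntl.LOCK_NB)
        # The restarted supervisor keeps trying, and keeps failing.
        while_old_runs = supervise(argv, attempts, delay)
        old.close()  # the old daemon dies and its lock is released
        # The next attempt succeeds without anyone intervening.
        after_old_dies = supervise(argv, 1, delay)
    return while_old_runs, after_old_dies
```

Calling demo() returns the exit statuses seen while the old instance is alive, followed by the status of the first attempt after it is gone. In this model every attempt fails cheaply until the old instance exits, and then the next one succeeds on its own.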
