Great, detailed explanation. You are right that the resource footprint of the supervise program is small, and that runsv dying would point to a different problem anyway. I was just simulating a crash to see what happens; that is what I observed, and I was wondering whether we could fine-tune every part to make it more reliable for our case. That doesn't seem possible, but that's fine.
I wonder how Solaris does its supervision? Its supervision program is well known for being rock solid.

On Jun 23, 2016 8:41 PM, "Laurent Bercot" <[email protected]> wrote:

> On 23/06/2016 03:46, Thomas Lau wrote:
>
>> LOL, well, I am trying to do a drill test to see how resilient runit
>> can be; this is one of its minor shortcomings.
>>
>
>  Current supervisors have no way of knowing that they died and
> their child is still running. Hence, when they start again, they attempt
> to run their child again, which will probably fail since the old instance
> of the child is still running. So, they will periodically try and start
> the child again, only to fail again, and so on.
>  On daemontools and s6, the period is 1 second. I'm not sure about runit,
> but it should be around 1s too.
>
>  Yes, it is a problem, and I don't like that behaviour much, but the
> alternatives are actually worse. Currently, the consequences of the
> issue are that when a supervisor dies and restarts:
>  - depending on the run script, the daemon's logs are flooded with error
> messages from the run script failing to exec into the daemon.
>  - Every second, some CPU is used to try and start the daemon.
>
>  I think those drawbacks are acceptable and trying to fix them is not a
> good idea:
>
>  - Supervisors dying without their daemons dying are an extremely rare
> occurrence, not worth special-casing unless it causes systemic,
> unrecoverable failure, which is not the case.
>  - What we'd want ideally: the new instance of the supervisor would "grab"
> the old instance of the daemon. But that is impossible under Unix, and
> any attempt to do that is doomed to use the same hacks that non-supervision
> systems use and that supervision aims to step away from.
>  - Any attempt to kill the old instance of the daemon in order to properly
> start a new supervised instance is a policy decision, which belongs to the
> admin; the supervisor program can't make that decision automatically.
>  - As is, even if the supervisor dies, the service keeps running; it's in
> "degraded mode" because the current instance isn't watched by a supervisor,
> but it's still running, and that's what's important. And if the daemon dies,
> a new, supervised instance will automatically take its place, as if the
> supervisor had never died: things will fix themselves on their own.
>  - For critical services, the log flooding should trigger an alerting
> system that will notify the admins that there's a problem, and appropriate
> action can then be taken (i.e. either do nothing or kill the current
> instance of the daemon).
>  - The periodic attempt to start a new instance of the daemon is generally
> not expensive. This is one of the reasons for the 1s respawning period: it
> gives the system time to breathe, without the "respawning too fast" problem
> that can be observed with, for instance, sysvinit. If the daemon uses a lot
> of resources before it notices it cannot succeed, that's a design issue
> in the daemon, not the supervisor; and even in that case, on critical
> machines there should be an alerting system that notices the spike in
> resource usage and notifies the admins.
>  - Attempts to handle that edge case in the supervisor itself would add a
> lot (a real whole lot) of complexity, for very uncertain benefits.
>
>  So, yeah. Even if your logs freak out, your memcached is still running,
> and that's what you want. And stop voluntarily killing your runsv for
> testing purposes: the day when your runsv accidentally dies before the
> daemon it's supervising is the day when something's seriously wrong with
> your system and you have much bigger problems than spurious log messages.
>
> --
>  Laurent
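
For what it's worth, the respawn behaviour described in the quoted mail can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not how runsv, s6-supervise or daemontools are actually implemented; the names supervise, respawn_delay and max_spawns are made up for the sketch. The point it shows is that a supervisor only learns about the exit of a child it started itself, so a restarted supervisor simply tries to start the daemon again and pauses between attempts.

```python
import subprocess
import sys
import time


def supervise(argv, respawn_delay=1.0, max_spawns=None):
    """Start argv, wait for it to exit, pause, then start it again.

    Hypothetical helper for illustration only. It returns the list of
    exit codes it has seen once max_spawns children have been run;
    with max_spawns=None it loops forever, like a real supervisor.
    """
    exit_codes = []
    while max_spawns is None or len(exit_codes) < max_spawns:
        child = subprocess.Popen(argv)   # start the daemon as our own child
        exit_codes.append(child.wait())  # we only see the exit of *our* child
        time.sleep(respawn_delay)        # pause so a failing child cannot spin the CPU
    return exit_codes


if __name__ == "__main__":
    # Stand-in for a run script whose daemon cannot start, for example
    # because an older, unsupervised instance is still running.
    failing_daemon = [sys.executable, "-c", "import sys; sys.exit(111)"]
    print(supervise(failing_daemon, respawn_delay=1.0, max_spawns=3))
```

Nothing in the loop can find or adopt a process started by an earlier supervisor, so each attempt fails and is retried after the delay, which is the once-per-second log noise I saw.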
