On Mon, 27 Jun 2016 14:02:31 +0200 Joan Picanyol i Puig <[email protected]> wrote:
> However, couldn't they know whether their child did not cease to run because
> of a signal they sent?
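On the quoted question itself, here is a minimal Python sketch (not how runit or s6 do it) of what a parent can learn from waitpid(). The helper names `classify`, `stop_child` and `spawn_sleeper` are made up for illustration. The wait status carries only the signal number, not the sender, so the best a parent can do is compare it with the signal it sent itself.

```python
import os
import signal
import time


def classify(status, sent=None):
    """Describe a waitpid() status; `sent` is the signal the parent sent, if any."""
    if os.WIFSIGNALED(status):
        sig = os.WTERMSIG(status)
        # The status holds only the signal number, never who sent it.
        return "expected-signal" if sig == sent else "unexpected-signal"
    return "exited"


def stop_child(pid, sig=signal.SIGKILL):
    """Send `sig` to the child, reap it, and classify how it ended."""
    os.kill(pid, sig)
    _, status = os.waitpid(pid, 0)
    return classify(status, sent=sig)


def spawn_sleeper():
    """Fork a child that only sleeps until it is signalled (hypothetical daemon)."""
    pid = os.fork()
    if pid == 0:
        try:
            while True:
                time.sleep(60)
        finally:
            # Never fall back into the parent's code path from the child.
            os._exit(0)
    return pid
```

If somebody else sends the same signal number at the same moment, the parent cannot tell the difference; that is a limit of the interface, not of the sketch.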
Some systems (Linux) allow registering a signal that the kernel sends to a child on "parent death", but that is not portable, and it requires both parties to be aware that such a mechanism is in place (e.g. both the supervisor and the daemon would have to support it).

On Fri, 24 Jun 2016 08:33:50 +0800 Thomas Lau <[email protected]> wrote:
> ... if we could fine tune every parts which makes it more reliable for
> our case, which doesn't seems possible but that's fine.

I am pretty confused and forgetful sometimes myself. I have also been messing with, and abusing, supervisors for some time now, yet I haven't seen either runsv or s6-supervise die from some internal state breakage. I have sometimes broken my experimental VMs and a few real boxes in really bad ways, but the supervisor kept running unfazed.

As was already said, when managing children, the parent should "never die" if problems are to be avoided. From this point of view, it really seems that extremely great care was taken so that neither runit nor s6 ever dies during normal operation. Great care means that almost all I/O calls are encapsulated in protected wrappers, and memory is usually pre-allocated statically. Think about it for a second.

Although neither supervisor seems to use an OS "protected process" mechanism, the supervisors' runtime footprint is so minuscule that they are probably the smallest processes running on the machine. Talking about protection, the BSDs have the madvise(MADV_PROTECT) call, which marks a process as "important" (this breaks in FreeBSD jails), but not even the official init uses it. I bet Linux has something similar. However, given the way these things are coded, that is probably not worth the effort.

I wonder whether the situation you described really happened naturally or was the result of some manual intervention (`kill -9`, `kill -6`, or a libc abort perhaps?), because the chances of a supervisor crashing are so insanely low.
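For reference, the Linux "parent death" signal mentioned at the top of this reply is registered with prctl(PR_SET_PDEATHSIG). Below is a rough, Linux-only sketch in Python via ctypes. The wrapper names are made up for illustration, the constants are copied from <linux/prctl.h>, and neither runit nor s6 is claimed to do this.

```python
import ctypes
import os
import signal

# Option numbers from <linux/prctl.h>; Linux-only.
PR_SET_PDEATHSIG = 1
PR_GET_PDEATHSIG = 2

_libc = ctypes.CDLL(None, use_errno=True)


def set_parent_death_signal(sig=signal.SIGTERM):
    """Ask the kernel to send `sig` to this process when its parent dies.

    Passing 0 clears the setting.
    """
    rc = _libc.prctl(PR_SET_PDEATHSIG, ctypes.c_ulong(int(sig)),
                     ctypes.c_ulong(0), ctypes.c_ulong(0), ctypes.c_ulong(0))
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def get_parent_death_signal():
    """Return the registered parent-death signal (0 means none)."""
    current = ctypes.c_int(0)
    rc = _libc.prctl(PR_GET_PDEATHSIG, ctypes.byref(current),
                     ctypes.c_ulong(0), ctypes.c_ulong(0), ctypes.c_ulong(0))
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return current.value
```

The call has to be made in the child itself, which is why the daemon (or whatever launches it) has to cooperate, and a careful child would also check getppid() afterwards in case the parent died before the call.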
To minimize PEBKAC, I made a similar rule (like Colin) for myself: either always use the control program provided by the supervision package, or learn the signals (used internally by the supervisor of choice) by heart. But apart from manual tinkering and some research exercises, there should be no point in learning the signals in question, since both the "scandir monitor" and the supervisor should be "uncrashable" under normal conditions.

To reiterate, I believe anything else on the machine should crash sooner than any part of runit or s6. Maybe you had some physical memory corruption, or somebody else somehow terminated the supervisor?

> I am wondering how does Solaris do their supervision? Their supervision
> program is well known for solid running.

From what I was able to dig out (without bothering to actually try it in a VM), both Solaris and Illumos use the "contracts" subsystem, an in-kernel facility exposed through the filesystem. SMF probably relies on that, much as systemd relies on cgroups on Linux, or launchd relies on Mach IPC on OS X. All these interfaces are usually completely orthogonal to classic Unix concepts (apart from being exposed through the filesystem) and very, very system-specific. There is usually not even direct parity in functionality. The authors of both s6 and runit put quite a lot of thought and work into the portability of their packages, so it seems very unlikely to me that these system-specific capabilities will ever be supported directly.

eto
