Re: supervision-scripts 2015-06

2015-07-12 Thread Guillermo
2015-07-08 22:07 GMT-03:00 Avery Payne:

 + A small comparison table has been added to the wiki, showing 6 different
 frameworks and their various related concepts.  It is incomplete but
 somewhat usable, and contains various links to sites, downloads, etc.
 Because there is still a lot of missing information, any additional hints,
 comments, or links sent to me are gladly accepted.

All right. According to my current understanding:

* License. nosh is ISC; the COPYING file from the source tarball
further says it may be considered copyright licensed under the terms
of any of the ISC license, the simplified (2 clause) BSD license, the
FreeBSD license or the MIT Expat license.

* Compatibility. For s6, I believe it was mentioned somewhere that it
is tested at least on FreeBSD, OpenBSD and Solaris. It is also in the
FreeBSD Ports Collection, and runit is too. On the other hand, the
execline scripts produced by s6-linux-init-maker (from the
s6-linux-init package) that help getting s6-svscan to run as process 1
are Linux-specific.

* Supports slashpackage install. Possible for nosh, using the source package.

* Comes with init. The entry for s6 would probably be better stated as
s6-linux-init-maker's output. This output is an execline script, the
stage 1 init, that, when saved to a file, put in a suitable place in
the filesystem, and made executable, can be used as the argument of
the kernel's init parameter and run as process 1. After performing
some initialization tasks it replaces itself with s6-svscan, and, if
s6-linux-init-maker was called with default options, uses /run/service
as its scan directory. One still needs to write a stage 2 init and
pass it to s6-linux-init-maker with the -2 option (or, without it, the
stage 2 init has to be /etc/rc.init), the same way as one needs to
write an /etc/runit/1 file for runit.

* The row containing readproctitle. It could be renamed to
"supervision tree log handling" or something like that, and may fit
better in section Logging. The entries for s6 and nosh could be
"catch-all logger (s6-svscan-log)", again from the s6-linux-init
package, and "catch-all logger (cyclog)", respectively. The catch-all
logger is a supervised long-lived process that receives and logs
(to a log directory) all output from the supervision tree, including
that of processes that do not have a dedicated logger.

For s6, s6-linux-init-maker with default options produces a directory
that must be copied to /etc/s6-linux-init/run-image, and contains a
service directory named s6-svscan-log for a catch-all logger. When
the machine boots, it will be copied to s6-svscan's scan directory by
the stage 1 init. This catch-all logger is an s6-log process that logs
to directory /run/uncaught-logs.

For nosh, if system-manager runs as process 1, it will automatically
create and supervise a cyclog process with hardcoded options, that
logs to directory /run/system-manager/log, and pretty much acts as a
catch-all logger, too.

For runit, you did not mention the program that has the self-renaming
behaviour, i.e. the one you have to look for in ps' output to get the
logs. It is runsvdir.

* Supervisor Programs. For nosh, the program similar to svscan would
be service-dt-scanner, in the sense that it runs on a scan directory.
Footnote 2 would apply (only to service-dt-scanner), since it can also
be called as svscan. The program similar to supervise would be
service-manager, in the sense that supervised processes run as its
children, and it will restart them if they exit or crash (and if the
restart file in their respective service directory says so).

But things don't work in nosh as they do in the other software
packages. In particular, when system-manager runs as process 1, there
is a single service-manager process (supervised by system-manager),
not one per supervised process, there is no scan directory, and
service-dt-scanner doesn't run. System state changes are carried out
by the system-control program, called either by the system's
administrator, or in some cases automatically by system-manager.
system-control runs as a short-lived process that exits when it is
done.

* Supports using a supervisor-as-init. For the reasons stated in the
bullet above, for nosh it would be a No or N/A.

* Force a service to stay in the same process group. I haven't used
daemontools[-encore] or runit, but looking at their source code,
pgrphack and chpst -P use the setsid() system call, so this row should
rather be "create a new session and make the service the session
leader", and the entries for s6 and nosh would be s6-setsid and
setsid, respectively.

* TAI64 Tools. nosh also provides a tai64n and a tai64nlocal program,
with the same names as their daemontools counterparts.
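
To illustrate the setsid() point made above for pgrphack and chpst -P, here is a minimal sketch in C of "create a new session, make the service the session leader, then exec it". The function name run_in_new_session is made up for this example, and the fork()/waitpid() wrapper is only there so the sketch can be run on its own; the real tools are chain-loading programs that call setsid() and then exec the rest of their command line without forking.

```c
#define _DEFAULT_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[0] as the leader of a brand new session.
 * Returns the command's exit status, or -1 on error or abnormal exit.
 */
int run_in_new_session(char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* A freshly forked child is never a process group leader,
           so setsid() is allowed to succeed here. */
        if (setsid() < 0)
            _exit(126);
        /* After setsid() the caller leads its own session. */
        if (getsid(0) != getpid())
            _exit(125);
        execvp(argv[0], argv);
        _exit(127);             /* exec failed */
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_in_new_session() with an argument vector such as { "true", NULL } returns the exit status of the command, which has been run as session leader (and therefore also as process group leader) of a new session.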

Except for setpgrp, the program names with question marks in the
column for nosh are correct. Footnote 2 also applies to the ones in
section Service Control & Communication: they can be called using
the daemontools name.

Cheers!
G.


Re: nosh: service-dt-scanner gets repeatedly killed by SIGABRT

2015-07-12 Thread Jonathan de Boyne Pollard

Guillermo:

Jonathan de Boyne Pollard:

If there's no error output, crank up strace and see what the last few system 
calls are.  It's probably worthwhile doing that anyway, in fact.

[...]

a read() call on the file descriptor returned by the inotify_init() that 
produces an EINVAL error, followed by rt_sigprocmask() with a SIG_UNBLOCK 
argument, and the tgkill() that sends the SIGABRT.


Remember that I said that my immediate suspicion is a (fourth) libkqueue 
bug?  It's a fourth libkqueue bug.


And it's here:

* https://github.com/mheily/libkqueue/blob/master/src/linux/vnode.c#l70

As the inotify(7) manual page says, if an event is larger than the 
buffer size given to read(), it fails with EINVAL.  And events can be 
larger than sizeof(struct inotify_event).  libkqueue doesn't deal with 
this failure properly, leading to a call to abort():


* https://github.com/mheily/libkqueue/blob/master/src/linux/platform.c#l181
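
The EINVAL behaviour can be reproduced without libkqueue. What follows is a minimal sketch, assuming a Linux kernel recent enough to have inotify_init1() and a writable /tmp; the function name read_create_event and the file names are made up for this example and are not taken from libkqueue or nosh. It watches a fresh directory, creates a file in it (so the queued event carries a name and is therefore larger than the bare structure), and tries to read that event into a buffer of the given size.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns what read() returned for the IN_CREATE
 * event, and stores the errno of the first failing call in *err.
 */
ssize_t read_create_event(size_t bufsize, int *err)
{
    char dir[] = "/tmp/inotify-demo-XXXXXX";
    char path[64];
    char *buf = NULL;
    ssize_t n = -1;
    int fd = -1, file;

    *err = 0;
    if (mkdtemp(dir) == NULL) { *err = errno; return -1; }
    snprintf(path, sizeof path, "%s/newfile", dir);

    /* Non-blocking, so that read() cannot hang if no event is queued. */
    fd = inotify_init1(IN_NONBLOCK);
    if (fd < 0) { *err = errno; goto out; }
    buf = malloc(bufsize);
    if (buf == NULL) { *err = errno; goto out; }
    if (inotify_add_watch(fd, dir, IN_CREATE) < 0) { *err = errno; goto out; }

    /* Creating a file queues an event that includes the file's name. */
    file = open(path, O_CREAT | O_WRONLY, 0600);
    if (file < 0) { *err = errno; goto out; }
    close(file);

    n = read(fd, buf, bufsize);
    if (n < 0)
        *err = errno;
out:
    unlink(path);
    rmdir(dir);
    free(buf);
    if (fd >= 0)
        close(fd);
    return n;
}
```

With bufsize set to sizeof(struct inotify_event) the read is expected to fail with EINVAL, as the manual page describes for kernels since 2.6.21. With sizeof(struct inotify_event) + NAME_MAX + 1, which is the size the manual page suggests for reading at least one event, it is expected to succeed.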

nosh code never calls abort(), never calls raise(SIGABRT), and would 
have printed some kind of message if an unhandled exception had led to 
an abort being raised by the C++ library.


The output that you are seeing from service-dt-scanner is because of a 
spurious wakeup.


* https://github.com/mheily/libkqueue/blob/master/src/linux/platform.c#l199

You can turn these debug messages on by setting the KQUEUE_DEBUG=1 
environment variable (and compiling the library in debug mode), apparently.


* https://github.com/mheily/libkqueue/blob/master/src/common/kqueue.c#l68

libkqueue is receiving events from inotify that the caller of kevent() 
isn't actually interested in, resulting in a spurious wakeup from the 
call to kevent() with no actual event to report. The output to standard 
error is a minor bug in service-dt-scanner, because it assumes that 
every time that it is woken up and kevent() returns successfully there 
will be at least one event.  It's finding nonsense in the event buffer 
and printing out a debug message when it ignores the nonsense.  This is 
fixed in version 1.18, but this isn't really the cause of your problem 
here.  It's just distracting log noise.


The problem here is that inotify is waking kevent() up because you 
listed the directory.  I suspect this change in your version of 
libkqueue, at first glance:


* https://github.com/mheily/libkqueue/commit/e41cc259a0318b0e7925521d0fe3bc7433971ace


After the spurious wakeup, there is a second event enqueued by the 
kernel, that is bigger than sizeof(struct inotify_event). Whether that's 
an uninteresting event too, and whether it is also caused by your 
listing the directory, is unknown.  libkqueue isn't passing a buffer big 
enough to read it so that we see what it is, and is abort()ing when the 
kernel returns an error because the read buffer is too small.


This will be a tricky one for the libkqueue people to fix, since 
libkqueue isn't currently geared up to process multiple events from 
inotify at once, which it would have to be prepared for if it were to 
start using a bigger buffer.  But it is a libkqueue problem to be 
fixed.  All that service-dt-scanner is doing is registering just one 
event of interest, and calling kevent() in a fairly tight loop that's in 
fact doing nothing else (apart from dumping the value of the spurious 
event).
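
For reference, processing several inotify events from one larger read buffer looks roughly like the sketch below. This is not libkqueue code and not a proposed patch; the names for_each_inotify_event and inotify_handler are made up for this example. It only shows the record walking that a bigger buffer would require: each record is a struct inotify_event header followed by len bytes of (padded) name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/inotify.h>

/* Hypothetical callback type: called once per decoded event. */
typedef void (*inotify_handler)(uint32_t mask, const char *name, void *ctx);

/*
 * Walk a buffer filled by a single read() on an inotify descriptor.
 * Returns the number of complete events found; handle may be NULL.
 */
size_t for_each_inotify_event(const char *buf, size_t len,
                              inotify_handler handle, void *ctx)
{
    size_t off = 0, count = 0;

    while (len - off >= sizeof(struct inotify_event)) {
        struct inotify_event ev;
        size_t step;

        /* Copy the fixed-size header out to avoid alignment problems. */
        memcpy(&ev, buf + off, sizeof ev);
        step = sizeof ev + ev.len;
        if (step > len - off)
            break;              /* truncated record: stop here */

        if (handle != NULL)
            handle(ev.mask, ev.len > 0 ? buf + off + sizeof ev : "", ctx);

        count++;
        off += step;
    }
    return count;
}
```

A caller would read() into a buffer of, say, a few kilobytes and pass the buffer and the byte count that read() returned to for_each_inotify_event(), instead of assuming that each read() yields exactly one fixed-size event.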