Re: s6 bites noob

2019-02-03 Thread Kelly Dean


Laurent Bercot writes:
> foo/run not existing is a temporary error condition that can happen
> at any time, not only at the start of s6-supervise. This is a very
> different case: the supervisor is already running and the user is
> relying on its monitoring foo.

But run not existing when supervise starts is a different case from run 
disappearing after supervise is already running. Even though supervise should 
continue running if run disappears, that doesn't imply that it shouldn't abort 
on startup if run doesn't exist in the first place.

Another example of orneriness: supervise automatically does its own 
initialization, but the s6-rc program (not the eponymous suite) doesn't. 
Instead, the suite has a separate init program, s6-rc-init, that's normally run 
at boot time. But if it isn't run at boot time (which is a policy decision), 
s6-rc doesn't automatically run it if necessary. If rc shouldn't 
auto-initialize, neither should supervise.
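For reference, the one-time initialization that s6-rc refuses to do for you 
looks something like this (all paths are illustrative):

```shell
# Normally run once at boot, after s6-svscan is up on the scan directory.
# -c: compiled database; -l: live state directory; last arg: scan directory.
s6-rc-init -c /etc/s6-rc/compiled -l /run/s6-rc /run/service
```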

Another one: the -d option to s6-rc is overloaded. When used with change, it 
means to down the selected services. But when used with list, it means to 
invert the selection. I'm going to repeatedly forget this.
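Concretely (service name is a placeholder):

```shell
s6-rc -d change myservice   # -d with "change": bring myservice down
s6-rc -da list              # -d with "list": invert the -a selection, i.e. show inactive services
```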

One more: the doc for the s6-rc program says it's meant to be the one-stop shop 
of service management, after compilation and initialization are done. It has 
subcommands list, listall, diff, and change. But s6-rc-update is a separate 
program, not a subcommand of s6-rc. I suppose there's a reason for this, but it 
complicates the user interface with a seemingly arbitrary distinction: 
depending on the particular subcommand, you either write "s6-rc subcommand" or 
put a dash in and write "s6-rc-subcommand".
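So switching to a newly compiled database is a separate command (path 
illustrative):

```shell
s6-rc-update /etc/s6-rc/compiled-new   # live update; not "s6-rc update"
```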

The docs advise entirely copying the service repository to a ramdisk, then 
using (a link to) the copy as the scan directory. This makes the running system 
independent of the original repo. But the doc for s6-rc-init says the rc system 
remains dependent on the original compiled database, and there's no explanation 
of why it isn't also copied in order to make the running system independent.
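Concretely, the advice amounts to something like this sketch (all paths are 
illustrative and assume a tmpfs on /run; per the s6-rc-init doc, the compiled 
database is notably not part of the copy):

```shell
# Copy the service repository to a ramdisk so the running system
# no longer depends on the original repo.
mkdir -p /run/service-repo
cp -a /etc/service-repo/. /run/service-repo/
ln -sfn /run/service-repo /run/service    # scan directory points at the copy
```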

I tried to test the logger. Set up a service repo with foo and bar, each with a 
run like
#!/bin/bash
echo foo starting
sleep 2
echo foo dying

foo and bar are funneled to a logger that has this run file:
s6-log -l 1 s100 T /home/user/testlogs
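For the record, the wiring that funnels foo and bar into that logger lives in 
the s6-rc source definition directories; a sketch (the directory and service 
names are mine, and each longrun still needs its run script, omitted here):

```shell
# Declare foo and bar as producers for a single consumer service, svclog.
mkdir -p src/foo src/bar src/svclog
echo longrun > src/foo/type
echo svclog > src/foo/producer-for             # foo's stdout feeds svclog
echo longrun > src/bar/type
echo svclog > src/bar/producer-for             # bar's does too
echo longrun > src/svclog/type
printf 'foo\nbar\n' > src/svclog/consumer-for  # svclog reads from both
echo logpipe > src/svclog/pipeline-name        # bundle name for the whole funnel
```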

Try to start the bundle. Hangs. Press ^C. Get:
s6-rc: warning: unable to start service s6rc-fdholder: command crashed with 
signal 2.

Ok, Colin Booth mentioned permission issues when running as non-root. It 
shouldn't be a problem, since all of this (including svscan) is running as the 
same user. Permission problems should only come into play when trying to do 
things inter-user. Anyway, I checked the s6-rc-compile doc. Looks like -h won't 
be necessary, since it defaults to the owner of the svscan proc. But -u is 
needed, since it defaults to allowing only root--even though I've never run any 
of this as root, and I've never asked it to try to do anything as root, and 
I've never told it that it should expect to be root, or even mentioned root at 
all.

And I'm not really sure the doc is right, because it says -u controls who's 
allowed to start and stop services, yet I've already used rc to start and stop 
regular (longrun) services as my non-root user before, with no problem (I had a 
problem only with oneshot), even though the doc says that since I didn't 
compile with -u, it should have disallowed that.
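For reference, the recompile in question looks roughly like this (directory 
names are mine):

```shell
# -u 1000: allow uid 1000 to start and stop the compiled services.
# -h is omitted: it defaults to the owner of the s6-svscan process.
s6-rc-compile -u 1000 compiled-new src
```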

Anyway, recompile with -u 1000, re-update, and try again. Now, I can't even do 
s6-rc -a list; I get:
s6-rc fatal: unable to take locks: Permission denied

Maybe I missed an essential step, and screwed something up? I'm bewildered, 
tired, and going to bed. After reading more of the docs than I expected to be 
necessary, I'm still unable to get s6 to do the basic job I need: manage a 
small group of services, and funnel and log their output. It's especially 
frustrating having to fight with software that generates gratuitous intra-user 
permission errors.

I'll try again in the morning, with replenished willpower.


svscan and supervise

2019-02-03 Thread Jonathan de Boyne Pollard

Kelly Dean:


Surely this is a common question.


It's a common redesign.

In the original daemontools, |supervise| knows nothing at all about the 
difference between "log" and "main" services.  Linking up the standard 
I/Os with pipes, to connect "main" services to "log" services, and 
knowing the directory layout that delineates them, is entirely the 
domain of |svscan|, which in its turn knows nothing at all about the 
control/status API and about supervision.  M. Bercot has stuck with this 
design.


As laid out in this Frequently Given Answer, three of the other 
toolsets did not, and went down the path of tighter integration of the 
twain. They baked in a relationship between "main" and "log" services. 
In the original daemontools design, it was theoretically possible to run 
|supervise| under something else, that set up different relationships.  
Indeed, in the original daemontools |svscan| came along (in version 
0.60) about two years after |supervise| did; /originally/, in early 
versions, people were expected to run |supervise| directly, and it was 
more like |daemon| in its operation.  This was until it became apparent 
that "dæmonization" is a fallacy, and simply does not work on modern 
operating systems where login sessions have gone through too many 
one-way trapdoors for escaping to a dæmon context to be viable.  Running 
dæmons outwith login session context /right from the top/ became the 
way, with |svscan| and (the also since much-replaced) |svscanboot|.  
This was of course how service management had been done on AIX since 
1990 and on AT&T Unix since 1988.


My nosh toolset did not stick with the original design in this regard, 
either.  But I did not go for tighter integration, which I thought to be 
the wrong thing to do.  |service-manager| 
manages a whole bunch of services, does all of the waiting and spawning 
in a single process, which is conventionally also a subreaper, and 
provides all of the control/status APIs.  But the knowledge of the 
relationships amongst those services, and of directory structures, is 
entirely /outwith/ that program.


It is, rather, in programs like |system-control| (in its |start| and 
|stop| subcommands) and |service-dt-scanner|. 
They decide the policy, which services' standard I/Os are plumbed 
together and how a "main" service indicates its "log" service. They 
instruct |service-manager| with the |service/| and |supervise/| 
directories, passing it open file descriptors for them, and what to 
plumb to what, and it just runs the mechanism.  They even implement /two 
different/ policies, one with ordering and dependency processing and a 
full /service bundle/ mechanism and the other more like the old 
daemontools and s6 /scan directory/.  There is no reason that a third 
tool could not implement a third policy still.


There is not even a requirement of a 1:1 relationship between "main" and 
"log" services, and indeed the set of pre-supplied service bundles has 
fan-in arrangements for a few of the short-lived single-shot services 
that run at bootstrap/shutdown, with them sharing a single logger. The 
pre-supplied per-user services have three fan-in sets, too.




Re: s6 bites noob

2019-02-03 Thread Laurent Bercot

s6-supervise aborts on startup if foo/supervise/control is already open, but 
perpetually retries if foo/run doesn't exist. Both of those problems indicate 
the user is doing something wrong. Wouldn't it make more sense for both 
problems to result in the same behavior (either retry or abort, preferably the 
latter)?


foo/supervise/control being already open indicates there's already a
s6-supervise process monitoring foo - in which case spawning another
one makes no sense, so s6-supervise aborts.

foo/run not existing is a temporary error condition that can happen
at any time, not only at the start of s6-supervise. This is a very
different case: the supervisor is already running and the user is
relying on its monitoring foo. At that point, the supervisor really
should not die, unless explicitly asked to; and "nonexistent foo/run"
is perfectly recoverable, you just have to warn the user and try
again later.

It's simply the difference between a fatal error and a recoverable
error. In most simple programs, all errors can be treated as fatal:
if you're not in the nominal case, just abort and let the user deal
with it. But in a supervisor, the difference is important, because
surviving all kinds of trouble is precisely what a supervisor is
there for.



https://cr.yp.to/daemontools/supervise.html indicates the original version of 
supervise aborts in both cases.


That's what it suggests, but it is unclear ("may exit"). I have
forgotten what daemontools' supervise does when foo/run doesn't
exist, but I don't think it dies. I think it loops, just as
s6-supervise does. You should test it.
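A quick way to check, assuming daemontools' supervise is on your PATH:

```shell
mkdir /tmp/norun          # a service directory with no ./run script
supervise /tmp/norun      # observe whether it exits or loops complaining
```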



 I also don't understand the reason for svscan and supervise being different. 
Supervise's job is to watch one daemon. Svscan's job is to watch a collection 
of supervise procs. Why not omit supervise, and have svscan directly watch the 
daemons? Surely this is a common question.


You said it yourself: supervise's job is to watch one daemon, and
svscan's job is to watch a collection of supervise processes. That is
not the same job at all. And if it's not the same job, a Unix guideline
says they should be different programs: one function = one tool. With
experience, I've found this guideline to be 100% justified, and
extremely useful.
Look at s6-svscan's and s6-supervise's source code. You will find
they share very few library functions - there's basically no code
duplication, no functionality duplication, between them.

Supervising several daemons from one unique process is obviously
possible. That's for instance what perpd, sysvinit and systemd do.
But if you look at perpd's source code (which is functionally and
stylistically the closest to svscan+supervise) you'll see that
it's almost as long as the source code of s6-svscan plus s6-supervise
combined, while not being the perfectly nonblocking state machine that
s6-supervise is.

Combining functionality into a single process adds complexity.
Putting separate functionality in separate processes reduces
complexity, because it takes advantage of the natural boundaries
provided by the OS. It allows you to do just as much with much less
code.



I understand svscan must be as simple as possible, for reliability, because it 
must not die. But I don't see how combining it with supervise would really make 
it more complex. It already has supervise's functionality built in (watch a 
target proc, and restart it when it dies).


No, the functionality isn't the same at all, and "restart a process
when it dies" is an excessively simplified view of what s6-supervise
does. If that was all there is to it, a "while true ; do ./run ; done"
shell script would do the job; but if you've had to deal with that
approach once in a production environment, you intimately and
painfully know how terrible it is.

s6-svscan knows how s6-supervise behaves, and can trust it and rely
on an interface between the two programs since they're part of the
same package. Spawning and watching a s6-supervise process is easy,
as easy as calling a function; s6-svscan's complexity comes from the
fact that it needs to manage a *collection* of s6-supervise
processes. (Actually, the brunt of its complexity comes from supporting
pipes between a service and a logger, but that's beside the point.)

On the other hand, s6-supervise does not know how ./run behaves, can
make no assumption about it, cannot trust it, must babysit it no matter
how bad it gets, and must remain stable no matter how much shit it
throws at you. This is a totally different job - and a much harder job
than watching a thousand nice, friendly s6-supervise programs.
Part of the proof is that s6-supervise's source code is bigger than
s6-svscan's.

By all means, if you want a single supervisor for all your services,
try perp. It may suit you. But I don't think having fewer processes
in your "ps" output is a worthwhile goal: it's purely cosmetic, and
you have to balance that against the real benefits that separating
processes provides.

--
Laurent



Re: s6 bites noob

2019-02-03 Thread Kelly Dean


Laurent Bercot writes:
> It is impossible to portably wait for the appearance of a file.
> And testing the existence of the file first, before creating the
> subdirs, wouldn't help, because it would be a TOCTOU.

s6-supervise aborts on startup if foo/supervise/control is already open, but 
perpetually retries if foo/run doesn't exist. Both of those problems indicate 
the user is doing something wrong. Wouldn't it make more sense for both 
problems to result in the same behavior (either retry or abort, preferably the 
latter)?

https://cr.yp.to/daemontools/supervise.html indicates the original version of 
supervise aborts in both cases.

I also don't understand the reason for svscan and supervise being different. 
Supervise's job is to watch one daemon. Svscan's job is to watch a collection 
of supervise procs. Why not omit supervise, and have svscan directly watch the 
daemons? Surely this is a common question.

I suppose supervise on its own might be convenient during testing, to have a 
lone supervise proc watching a daemon. But this could be done just as well with 
a combined svscan-supervise, with the daemon being the only entry in the 
collection of watched procs.

I understand svscan must be as simple as possible, for reliability, because it 
must not die. But I don't see how combining it with supervise would really make 
it more complex. It already has supervise's functionality built in (watch a 
target proc, and restart it when it dies).