Hi grin,
Thanks for your comments. I'm going to reply to a few points.
> It was okay, I went to the documentation. I am mainly about this.
I have to admit the documentation of s6 is old and has mostly grown
locally, as new parts were added; I rarely go back to it and refactor
it as scope or interactions change, as subsystems are added, etc. I do
that a lot for code, but not for documentation, sorry. That is why you
feel there are layers missing: the documentation is not a coherent,
holistic thing, it is a hodgepodge of reductionist pages that have
developed over time.
So, yeah, it's not very good, because a lot less tending work has gone
into it than has gone into the code.
> There seem to be very simple introductory materials AND very tech
> reference docs but nothing in-between. I was looking for some simple
> architectural scheme, like:
>
> s6-svscan                       - manages supervise daemons
>  \
>  |--s6-supervise daemon1        - spawns and monitors daemon1
>  |   \- /usr/bin/daemon1 args
>  \--s6-supervise daemon1/log    - spawns and monitors log of daemon1
>      \- s6-log T /var/log/s6/daemon1
An explanation of the typical process tree that you get when launching
a supervision tree would be a good element to add, yes.
> and some suggested directory structure like
> /etc/s6/sv/              - collection of daemons
>     sv/daemon1/run       - + sample
>     sv/daemon1/type      - + sample
>     sv/daemon1/log/run   - ...
> /service/                - the live services
>     symlink to ../sv/daemon1
That one is on purpose. From the start, a design principle of s6 has
been to provide *mechanism*, and not *policy*. People are free to put
their service directories and related things wherever they want, and s6
will not constrain you in any way.
When I first developed s6, I didn't really know what good practices
were; I didn't know what a good directory structure would be. I expected
good practices to emerge over time. So I didn't write any policy
suggestions in the documentation.
And indeed, good practices *have* emerged over time, and I could
definitely make policy suggestions now. But I did not add suggestions
to the s6 documentation; rather, I chose to gather all policy-related
decisions in the higher-level user interface, s6-frontend:
https://skarnet.org/software/s6-frontend/
The default values for the s6-frontend variables are my policy
recommendations.
But I should probably add *some* recommendations to the s6
documentation as well now, if only to stop people from putting their
service directories under /etc 😛
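For illustration only, here is a minimal Python sketch of one possible
layout, loosely following the one quoted above. The base directory, the
make_service and write_script helpers and the daemon1 name are all
hypothetical; s6 itself imposes none of them. The run script contents are
taken from the process tree quoted earlier.

```python
import tempfile
from pathlib import Path


def write_script(path, command):
    """Write a tiny executable /bin/sh run script that execs command."""
    path.write_text("#!/bin/sh\nexec " + command + "\n")
    path.chmod(0o755)


def make_service(base, name, command, logdir):
    """Create base/sv/name (with a logger) and link it into base/service."""
    base = Path(base)
    svdir = base / "sv" / name
    (svdir / "log").mkdir(parents=True, exist_ok=True)
    write_script(svdir / "run", command)
    write_script(svdir / "log" / "run", "s6-log T " + logdir)
    scandir = base / "service"
    scandir.mkdir(exist_ok=True)
    link = scandir / name
    if not link.is_symlink():
        # Relative link, like the "../sv/daemon1" in the quoted layout.
        link.symlink_to(Path("..") / "sv" / name)
    return link


if __name__ == "__main__":
    base = tempfile.mkdtemp()  # stand-in for wherever you keep your services
    make_service(base, "daemon1", "/usr/bin/daemon1 args", "/var/log/s6/daemon1")
    for entry in sorted(Path(base).rglob("*")):
        print(entry.relative_to(base))
```

Pointing s6-svscan at the resulting service/ directory would then
supervise daemon1 and its logger. Where that directory lives is exactly
the policy part that s6 leaves open.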
> And some tutorial like example what the huge amount of separate tools
> are for.
So, about tutorials.
I have always said that tutorials are best written *not* by the author
of the software, but by users who discover the software step by step and
know what questions emerge, what the logical progression is. I still
stand by it; the author is the worst person to write tutorials, and that
is definitely a task better done by the *community*, who knows what it
needs from a tutorial much better than the author would.
If you (or anyone else) are willing to document your struggles and
your progress and the answers to the things you wondered about, that
would be a great addition to the documentation indeed. So far nobody
has stepped up to do that - but it's not too late.
> - if supervise is killed the daemon keeps running, while a new
> supervise tries to start a new one. Is this intentional? Definitely
> not convenient.
This is intentional, yes.
What is the purpose of a supervision suite? To maximize the uptime of
a daemon. If you restart a daemon every time its supervisor dies, you
are bounding the daemon's uptime by the supervisor's uptime. You are
not increasing the daemon's uptime, you are *decreasing* it. That is
antithetical to the purpose of a supervisor.
So, yes, the old daemon keeps running. The new s6-supervise process
will try and start a new instance; that is normally not a problem
because the administrator should be notified when the daemon fails
and can kill the old instance *at their convenience* (as opposed to
whenever the supervisor dies). If the restart attempts are an issue,
you can mitigate that with a lock-fd file in the service directory, and
s6-supervise will block until the old instance is dead.
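The idea behind that kind of lock can be sketched in a few lines of
Python. This is not s6's code, only an illustration of the general
technique, assuming flock()-style locks that stay held for as long as
some process keeps the inherited descriptor open. The names
spawn_with_lock and is_locked are hypothetical.

```python
import fcntl
import os
import subprocess
import sys
import tempfile


def spawn_with_lock(lock_path, argv):
    """Start argv holding an exclusive lock on lock_path via an inherited fd.

    Blocks until any previous holder of the lock has exited.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # waits while an old instance holds it
        # The child inherits fd, so the lock outlives our own copy of it.
        return subprocess.Popen(argv, pass_fds=(fd,))
    finally:
        os.close(fd)  # only the child's copy keeps the lock alive now


def is_locked(lock_path):
    """Return True if some live process still holds the lock."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return True
    else:
        return False
    finally:
        os.close(fd)  # also drops the lock if we just acquired it


if __name__ == "__main__":
    lock = os.path.join(tempfile.mkdtemp(), "lock")
    old = spawn_with_lock(lock, [sys.executable, "-c", "import time; time.sleep(1)"])
    print("old instance holds the lock:", is_locked(lock))
    # This second spawn waits until the first child has exited.
    new = spawn_with_lock(lock, [sys.executable, "-c", "pass"])
    new.wait()
    old.wait()
```

The point is that the lock does not depend on the supervisor staying
alive: it is released only when the old instance itself goes away.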
> - the restarted supervise seem to lose the death-toll immediately,
> including any proof that the daemon was running or dying or killed.
Does it really? The death tally file is supposed to be stable. If
you can isolate a case where the death tally file isn't taken into
account, please send it to the list, because it sounds like a bug.
What can happen is that the new s6-supervise doesn't detect the next
death of the *old* service instance, since it's not tracking it, but
the previous deaths should still be there.
Please note that these situations are supposed to be handled correctly
because the point of a supervision suite is to behave well even in
pathological cases, and I will do my best to ensure that there is no
cascading failure or anything of the kind, but a supervisor's death
is a *seriously abnormal* event. In the 15 years of existence of s6,
it has *never* occurred that a supervisor process died without being
explicitly killed by an administrator, for resilience testing purposes
or otherwise. This just *should not happen* in normal use. So yes,
things should still work when it happens, but "not convenient" is
certainly not an issue. Degraded mode is not convenient; the goal of
degraded mode is to keep things working until an administrator comes
and fixes the issue. If you want convenience, don't kill your
s6-supervise processes.
> - you're in love with TAI64N which is fine, but the provided
> tai64nlocal isn't intelligent enough to match it in the middle of a
> text, like the output of s6-svdt or in `ls -la`. It's super human
> unfriendly this way.
s6-tai64nlocal was made to translate TAI64N into local time in
log files written by s6-log, where the timestamp is at the beginning
of the line. It is not meant to be a generic filter translating any
TAI64N stamp it finds.
But now that you're saying it, yeah, it could be a cool option to add
to the tool. I'll think about it.
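As a rough idea of what such an option could do, here is a Python sketch
that rewrites every TAI64N label found anywhere in a line. It is not part
of s6, and localize_tai64n is a hypothetical name. The fixed 10-second
TAI-UTC offset follows the classic daemontools convention and is an
assumption, not a leap-second-aware conversion, so treat the results as
approximate.

```python
import re
from datetime import datetime, timezone

# "@" + 16 hex digits of seconds + 8 hex digits of nanoseconds.
TAI64N_RE = re.compile(r"@([0-9a-f]{16})([0-9a-f]{8})(?![0-9a-f])")
TAI64_EPOCH = 1 << 62  # label value corresponding to the 1970 epoch


def localize_tai64n(text, tz=None, tai_minus_utc=10):
    """Replace every TAI64N label in text with a human-readable time.

    tz=None means the system's local time zone. tai_minus_utc is an
    assumed constant offset, not a leap-second-aware conversion.
    """

    def render(match):
        secs = int(match.group(1), 16) - TAI64_EPOCH - tai_minus_utc
        nanos = int(match.group(2), 16)
        when = datetime.fromtimestamp(secs, tz=timezone.utc).astimezone(tz)
        return when.strftime("%Y-%m-%d %H:%M:%S") + ".%09d" % nanos

    return TAI64N_RE.sub(render, text)


if __name__ == "__main__":
    # Build a sample label one day after the epoch and embed it mid-line.
    label = "@%016x%08x" % (TAI64_EPOCH + 10 + 86400, 123456789)
    print(localize_tai64n("last death at " + label + " (signal 15)", tz=timezone.utc))
```

Text without any label passes through unchanged, so a filter like this
could sit at the end of any pipeline.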
> Without specific examples I'm not even sure how some tools would work,
> like svlink, svlisten* and various event based tools. Yes, I could
> spend much more time to read through all reference pages (with no
> examples) and try to figure it out, but I am not sure what would give
> me the motivation for that, beyond the sometimes pretty abstract gains
> compared to, say, runit.
The best example of the various event-based tools is s6-rc, where
services only start when their dependencies have notified readiness;
s6-rc literally invokes s6-svlisten1. But it's true it's not
documented.
Readiness notification was the primary objective of the ftrigr
library; the various event-based tools were just added because they're
cool things to have as generic building blocks.
I don't think the gains of s6 over runit are abstract at all. I think
being able to run a service manager over s6 is a very visible and
significant gain. I think not having the supervisor hang on a bad
customized control script is an important thing. I think regularly
maintained software with relatively frequent releases and author-
community interactions is a very concrete and practical gain.
I agree that the whole s6 documentation - and more generally, the
whole skaware documentation - could use more examples.
Well, as they say, contributions welcome 🙂
> Do not take me wrong: I think the code is fine, but possibly the
> documentation, especially the onboarding would need more love, and
> possibly more specific examples, especially for more complex cases.
As always, it's a question of trade-offs, and of time and energy and
motivation management.
Every day I need to make a choice. I could go back and rework old
documentation, refactor and improve it, add examples, etc. Or I could
keep working on the next piece of software, possibly the next brick of
the s6 stack, and make actual progress towards something that may some
day improve people's lives. And have a lot more fun in the process.
I hear your criticism. Most of it is valid. I'm just saying, yeah,
sorry if the doc is not as good as you wish it to be; it is what it is.
But if anyone wants to contribute better documentation, please, that
would be awesome.
--
Laurent