Re: Propagating parent PID to ./run

2022-01-04 Thread Laurent Bercot

If getppid() returns 1, it means the service has already been orphaned.


I don't think that's guaranteed by Posix.

Apparently another data point: 
https://gist.github.com/gsauthof/8c8406748e536887c45ec14b2e476cbc


 I thought it was, but apparently you're right:

https://pubs.opengroup.org/onlinepubs/9699919799/functions/_exit.html

"The parent process ID of all of the existing child processes and zombie 
processes of the calling process shall be set to the process ID of an 
implementation-defined system process. That is, these processes shall be 
inherited by a special system process."


 That "special system process" is pid 1 on every Unix on Earth and in
Heaven, but of course systemd, being from Hell, had to be the extra
special snowflake (that for some reason doesn't melt in Hell) and do
things differently, for no reason at all.

 Subreapers without pid namespaces are not only useless, they are
*actively harmful*.

 Well, there goes this idea. But I'm *not* adding code to s6 to deal
with this systemd idiosyncrasy ('syncra' being optional); I don't
want to encourage parent detection hacks anyway.



What are your thoughts for this specific scenario? My understanding is that the 
supervisor would be relaunched, and another instance of the service would be 
started. I'd like to avoid/deal with the situation of the evil-twin service 
instance.


 Any reasonably written service will lock a resource at start, before
becoming ready; if the old instance is still around, it will try and
fail to lock the resource. Either it dies and will be restarted one
second later, until an admin finds and kills the old instance; or it
blocks on the lock, using no cpu, until the old instance is killed.
Since the goal of s6 is to maximize uptime even in a degraded state,
s6 takes no special action for that - but you could have a ./finish
script that sends a special alert when the service fails several times
in a few seconds. (I think by default it's better to have that degraded
state than to have service downtime that is not explicitly prompted by
an admin.)
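
 To make the locking behaviour above concrete, here is a minimal sketch
in Python. It is only an illustration: the function name, the lock file
path and the choice of flock() are assumptions, not something s6 or any
particular service prescribes. With block=False a second instance fails
immediately (and can exit, to be restarted by the supervisor); with
block=True it sleeps in the kernel until the old instance is gone.

```python
import fcntl
import os


def acquire_instance_lock(path, block=False):
    """Take an exclusive flock() on path.

    Returns the open file descriptor on success, or None if another
    instance already holds the lock (non-blocking mode only).
    The lock is tied to the descriptor, so the kernel releases it
    automatically when the process holding it dies.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    flags = fcntl.LOCK_EX if block else fcntl.LOCK_EX | fcntl.LOCK_NB
    try:
        fcntl.flock(fd, flags)
    except BlockingIOError:
        # Somebody else holds the lock: most likely the old instance.
        os.close(fd)
        return None
    return fd
```

 A service would call this once at startup, before announcing
readiness, and keep the returned descriptor open for its whole lifetime.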

 If the service doesn't lock any resource, then you have lock-fd,
which was designed to handle this - and the documentation is
inaccurate, I'll fix it. The behaviour is that the new instance *blocks*
until the old instance is dead; s6-supervise writes a warning message
to its own stderr so the situation is detected. (Retrying after 60
seconds only happens in a few unlikely error situations.)

--
 Laurent



Re: Propagating parent PID to ./run

2022-01-04 Thread Earl Chew via skaware

Laurent,

Thanks for the quick reply.

If getppid() returns 1, it means the service has already been orphaned. 


I don't think that's guaranteed by Posix.

Apparently another data point: 
https://gist.github.com/gsauthof/8c8406748e536887c45ec14b2e476cbc


Normally you should never be worried about the supervisor dying. ... 
the lock-fd file is meant to avoid that


What are your thoughts for this specific scenario? My understanding is 
that the supervisor would be relaunched, and another instance of the 
service would be started. I'd like to avoid/deal with the situation of 
the evil-twin service instance.



An optional regular file named lock-fd.


Oh ... it seems I missed this part of the documentation.

Earl

Laurent Bercot wrote:


Is there any appetite for providing a way for ./run to know the PID 
of its parent s6-supervise instance?


This information allows the supervised child to know that it has been 
orphaned, and to tie its fate to its parent (eg PDEATHSIG 
https://stackoverflow.com/a/36945270).


Using getppid(2) alone is not reliable because the child might have 
been orphaned between the fork(2) and getppid(2) calls.


 getppid(2) is totally reliable.
 If getppid() returns 1, it means the service has already been orphaned.
(Don't use subreapers without pid namespaces! they're useless and
break that property.)

 So you can call getppid() in your ./run, exit if it's 1, and otherwise
record it and continue running with your prctl().

 But from a systems designer point of view, I would advise *not* doing
that. s6 was designed to maximize the uptime of the service; it is
100% intentional that the service does not die if the supervisor dies.
Going out of your way to make the service die when the supervisor does
*decreases your uptime*, since now the uptime depends on two processes
being alive, not just one.

 Normally you should never be worried about the supervisor dying. It
has been specifically written to be extremely stable. And, just in
case, if what you don't like is the log spam whenever the supervisor
happens to die and comes back up and the previous instance of the
service is still alive: the lock-fd file is meant to avoid that.

 The point of supervision is to take burden *off* services. Services
should not care how they're launched, under a supervisor or not, in
what circumstances, etc. The need to add detection shenanigans and
special cases is a sign that you're probably not using the framework
as it was intended to be used.

--
 Laurent



Re: Propagating parent PID to ./run

2022-01-04 Thread Laurent Bercot




Is there any appetite for providing a way for ./run to know the PID of its 
parent s6-supervise instance?

This information allows the supervised child to know that it has been orphaned, 
and to tie its fate to its parent (eg PDEATHSIG 
https://stackoverflow.com/a/36945270).

Using getppid(2) alone is not reliable because the child might have been 
orphaned between the fork(2) and getppid(2) calls.


 getppid(2) is totally reliable.
 If getppid() returns 1, it means the service has already been orphaned.
(Don't use subreapers without pid namespaces! they're useless and
break that property.)

 So you can call getppid() in your ./run, exit if it's 1, and otherwise
record it and continue running with your prctl().
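
 A minimal sketch of that sequence, Linux only and written in Python for
brevity. The function name and the orphan_pid parameter are made up for
illustration. It assumes orphans are reparented to pid 1, which does not
hold under a subreaper. The death signal is armed first and the parent
is checked afterwards, so a supervisor that died before the prctl() call
is still noticed.

```python
import ctypes
import os
import signal

PR_SET_PDEATHSIG = 1  # value from <linux/prctl.h>


def tie_fate_to_parent(sig=signal.SIGTERM, orphan_pid=1):
    """Ask the kernel to send sig to this process when its parent dies.

    Returns the parent's pid, or None if the process had already been
    reparented to orphan_pid by the time the check ran.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.prctl(PR_SET_PDEATHSIG, ctypes.c_ulong(int(sig))) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    # Check only after arming: if the parent died earlier, no signal
    # will ever be delivered, but getppid() now shows the new parent.
    ppid = os.getppid()
    return None if ppid == orphan_pid else ppid
```

 A ./run program would exit if this returns None and exec into the real
daemon otherwise; the setting survives execve() unless the target is a
set-user-ID, set-group-ID or capability-bearing binary.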

 But from a systems designer point of view, I would advise *not* doing
that. s6 was designed to maximize the uptime of the service; it is
100% intentional that the service does not die if the supervisor dies.
Going out of your way to make the service die when the supervisor does
*decreases your uptime*, since now the uptime depends on two processes
being alive, not just one.

 Normally you should never be worried about the supervisor dying. It
has been specifically written to be extremely stable. And, just in
case, if what you don't like is the log spam whenever the supervisor
happens to die and comes back up and the previous instance of the
service is still alive: the lock-fd file is meant to avoid that.

 The point of supervision is to take burden *off* services. Services
should not care how they're launched, under a supervisor or not, in
what circumstances, etc. The need to add detection shenanigans and
special cases is a sign that you're probably not using the framework
as it was intended to be used.

--
 Laurent



Propagating parent PID to ./run

2022-01-04 Thread Earl Chew via skaware
Is there any appetite for providing a way for ./run to know the PID of 
its parent s6-supervise instance?


This information allows the supervised child to know that it has been 
orphaned, and to tie its fate to its parent (eg PDEATHSIG 
https://stackoverflow.com/a/36945270).


Using getppid(2) alone is not reliable because the child might have been 
orphaned between the fork(2) and getppid(2) calls.


Mechanisms that might be used include a) setting an environment variable 
(eg S6_PPID) before executing ./run, or b) passing the PID as an 
argument when executing ./run.


The environment variable approach can be used when s6-supervise is 
deployed in standalone settings (eg exec env S6_PPID=$$ s6-supervise 
servicedir), but this approach is not presently a good fit in s6-svscan 
scenarios.
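
For illustration, the child side of option a) might look like the sketch below, written in Python. S6_PPID is only the name proposed above, not an existing s6 variable, and the function name and its parameters are hypothetical. The sketch does not guard against the recorded PID being reused by an unrelated process.

```python
import os


def supervisor_still_parent(environ=None, getppid=os.getppid):
    """Compare the PID recorded in S6_PPID with the current parent PID.

    Returns True if they match, False if the process has been
    reparented, and None if S6_PPID (a hypothetical variable) is unset.
    """
    environ = os.environ if environ is None else environ
    recorded = environ.get("S6_PPID")
    if recorded is None:
        return None  # nobody told us who the supervisor is
    return int(recorded) == getppid()
```

Because the comparison is against a recorded value rather than a fixed PID, it does not depend on which process the system chooses to adopt orphans.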


Earl