Re: Race in reboot/poweroff path at init?

2017-11-23 Thread Deb McLemore
Hi all,

We put together a test to try out the various suggestions
on double-forking processes to see if polling for a zombie process
being reaped by INIT would help provide a method to synchronize
the ready state of the signal handlers for reboot and poweroff paths
in Busybox INIT during early IPL.

The usermode helper that kicks Busybox /sbin/poweroff
(started by PID=2) is very early, even before Busybox INIT
is do_execve'd from kernel_init.

We found that any process zombies in this early stage of kernel_init
get flushed when the do_execve is done for Busybox /sbin/init.

Specifically we traced the program flow to proc_flush_task_mnt
from linux/fs/proc/base.c.

We tested out the following patch and the race is fixed using
the abstract socket.



>From 1a7ac3aba3e94bd04de48e9619647cca853dca06 Mon Sep 17 00:00:00 2001
From: Deb McLemore 
Date: Wed, 25 Oct 2017 08:06:20 -0500
Subject: [PATCH] init: Add handshake to poweroff/reboot for signal handler
 setup

Add an abstract socket to synchronize the readiness of init to receive
the SIGUSR2 to catch poweroff/reboot during an IPL phase (e.g. soft power
off via BMC).

Signed-off-by: Deb McLemore 
---
 init/halt.c | 37 +
 init/init.c | 23 +++
 2 files changed, 60 insertions(+)

diff --git a/init/halt.c b/init/halt.c
index c6c857f..510c4d2 100644
--- a/init/halt.c
+++ b/init/halt.c
@@ -83,6 +83,10 @@
 #include "libbb.h"
 #include "reboot.h"
 
+#ifdef __linux__
+#include 
+#endif
+
 #if ENABLE_FEATURE_WTMP
 #include 
 
@@ -112,6 +116,12 @@ static void write_wtmp(void)
 int halt_main(int argc, char **argv) MAIN_EXTERNALLY_VISIBLE;
 int halt_main(int argc UNUSED_PARAM, char **argv)
 {
+
+#ifdef __linux__
+    struct sockaddr_un bb_addr2;
+    int fdBB2 = 0;
+    int fdrc = 0;
+#endif
 static const int magic[] = {
     RB_HALT_SYSTEM,
     RB_POWER_OFF,
@@ -170,6 +180,33 @@ int halt_main(int argc UNUSED_PARAM, char **argv)
         /* talk to init */
         if (!ENABLE_FEATURE_CALL_TELINIT) {
             /* bbox init assumed */
+
+                /* Use socket connect to INIT to assure that
+                 * signal handlers are ready
+                 */
+                #ifdef __linux__
+                memset(_addr2,
+                    0,
+                    sizeof(struct sockaddr_un));
+                bb_addr2.sun_family = AF_UNIX;
+                strcpy(bb_addr2.sun_path,
+                    "\0bb_signals");
+                while (1) {
+                    fdBB2 = socket(AF_UNIX,
+                        SOCK_CLOEXEC | SOCK_DGRAM, 0);
+                    if (fdBB2 != -1)
+                        break;
+                    sleep(1);
+                }
+                while (1) {
+                    fdrc = connect(fdBB2,
+                    (struct sockaddr *)_addr2,
+                    sizeof(struct sockaddr_un));
+                    if (fdrc == 0)
+                        break;
+                    sleep(1);
+                }
+                #endif
             rc = kill(1, signals[which]);
         } else {
             /* SysV style init assumed */
diff --git a/init/init.c b/init/init.c
index 6f3374e..3308a6e 100644
--- a/init/init.c
+++ b/init/init.c
@@ -133,6 +133,7 @@
 #ifdef __linux__
 # include 
 # include 
+# include 
 #endif
 #include "reboot.h" /* reboot() constants */
 
@@ -1046,6 +1047,12 @@ static void sleep_much(void)
 int init_main(int argc, char **argv) MAIN_EXTERNALLY_VISIBLE;
 int init_main(int argc UNUSED_PARAM, char **argv)
 {
+
+#ifdef __linux__
+    struct sockaddr_un bb_addr;
+    int fdBB = 0;
+#endif
+
 if (argv[1] && strcmp(argv[1], "-q") == 0) {
     return kill(1, SIGHUP);
 }
@@ -1191,6 +1198,22 @@ int init_main(int argc UNUSED_PARAM, char **argv)
     sigprocmask_allsigs(SIG_UNBLOCK);
 }
 
+#ifdef __linux__
+    memset(_addr, 0, sizeof(struct sockaddr_un));
+    bb_addr.sun_family = AF_UNIX;
+    strcpy(bb_addr.sun_path, "\0bb_signals");
+    /* Create an abstract socket to use for synchronizing poweroff and
+     * reboot which could be kicked at IPL before the INIT process
+     * completes signal setup to listen for the signals
+     * Ignore failures and keep on, this just helps
+     * close the exposed window when nothing will get signaled
+     * The abstract socket needs to stay persistent, so no cleanup
+     */
+    fdBB = socket(AF_UNIX, SOCK_CLOEXEC | SOCK_DGRAM, 0);
+    if (fdBB != -1)
+        bind(fdBB, (struct sockaddr *)_addr, sizeof(bb_addr));
+#endif
+
 /* Now run everything that needs to be run */
 /* First run the sysinit command */
 run_actions(SYSINIT);
-- 
2.7.4

___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox


Re: Race in reboot/poweroff path at init?

2017-11-06 Thread Jeremy Kerr
Hi Denys,

>>  - installing the signal handlers even earlier in init_main(). However,
>>this will only reduce the window for lost events, rather than
>>eliminating it; or
> 
> Sure, this should be done. How about this:
> --- a/init/init.c
> +++ b/init/init.c
> @@ -1064,6 +1064,12 @@ int init_main(int argc UNUSED_PARAM, char **argv)
>  #endif
> 
>  if (!DEBUG_INIT) {
> +/* Some users send poweroff signals to init VERY early.
> + * To handle this, mask signals early,
> + * and unmask them only after signal handlers are installed.
> + */
> +sigprocmask_allsigs(SIG_BLOCK);
> +
>  /* Expect to be invoked as init with PID=1 or be invoked as linuxrc 
> */
>  if (getpid() != 1
>   && (!ENABLE_LINUXRC || applet_name[0] != 'l') /* not linuxrc? */
> @@ -1204,6 +1187,8 @@ int init_main(int argc UNUSED_PARAM, char **argv)
>  + (1 << SIGHUP)  /* reread /etc/inittab */
>  #endif
>  , record_signo);
> +
> +sigprocmask_allsigs(SIG_UNBLOCK);
>  }
> 
>  /* Now run everything that needs to be run */
> 
> This covers code which opens and parses /etc/inittab,
> which can be slow (if storage is slow), and can make
> race realistic in real world.
> 
> Can you test whether this change makes the race go away in your case?

This looks good, but it's not going to completely fix the race - in some
cases, we've seen the 'poweroff' binary started before init.

>>  - using a synchronous channel to send the shutdown/reboot message
>>between the poweroff/reboot helpers, rather than an asynchronous
>>signal. Say, have init listening on a socket, allowing the poweroff
>>binary to wait and/or retry.
>>
>> However, before I go down the wrong path here: does anyone have other
>> ideas that might help eliminating dropped poweroff/reboot events?
> 
> The test that processes are being reaped is a good idea.

OK, sounds like something we should experiment with.

Thanks for the input!

Cheers,


Jeremy
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox


Re: Race in reboot/poweroff path at init?

2017-11-02 Thread Denys Vlasenko
On Tue, Oct 10, 2017 at 10:52 AM, Jeremy Kerr  wrote:
> Hi all,
>
> I've been debugging an issue where we can't reboot or poweroff a machine
> in the early stages of busybox init. Using the poweroff case as an
> example:
>
>  - kernel starts /sbin/init
>
>  - kernel receives a poweroff event, so calls __orderly_poweroff.
>Effectively, these will just call out to the /sbin/poweroff usermode
>helper.
>
>  - /sbin/poweroff just does a:
>
>  kill(1, SIGUSR2);
>
>  - However, /sbin/init has not yet installed a signal handler for
>SIGUSR2. Because we're PID 1, this means the signal is ignored, and
>so the command to poweroff the machine is dropped.
>
>  - init keeps booting rather than powering off.
>
> In our particular case, the "poweroff event" is an IPMI soft shutdown
> message. However, the same would apply for any other path that involves
> orderly_poweroff or orderly_reboot.
>
> Even though the signal handlers are installed fairly early in init, we
> can still hit the race between this and the SIGUSR2 being sent fairly
> reliably.
>
> I see a couple of options for resolving this:
>
>  - installing the signal handlers even earlier in init_main(). However,
>this will only reduce the window for lost events, rather than
>eliminating it; or

Sure, this should be done. How about this:

--- a/init/init.c
+++ b/init/init.c
@@ -1064,6 +1064,12 @@ int init_main(int argc UNUSED_PARAM, char **argv)
 #endif

 if (!DEBUG_INIT) {
+/* Some users send poweroff signals to init VERY early.
+ * To handle this, mask signals early,
+ * and unmask them only after signal handlers are installed.
+ */
+sigprocmask_allsigs(SIG_BLOCK);
+
 /* Expect to be invoked as init with PID=1 or be invoked as linuxrc */
 if (getpid() != 1
  && (!ENABLE_LINUXRC || applet_name[0] != 'l') /* not linuxrc? */
@@ -1204,6 +1187,8 @@ int init_main(int argc UNUSED_PARAM, char **argv)
 + (1 << SIGHUP)  /* reread /etc/inittab */
 #endif
 , record_signo);
+
+sigprocmask_allsigs(SIG_UNBLOCK);
 }

 /* Now run everything that needs to be run */

This covers code which opens and parses /etc/inittab,
which can be slow (if storage is slow), and can make
race realistic in real world.

Can you test whether this change makes the race go away in your case?

>  - using a synchronous channel to send the shutdown/reboot message
>between the poweroff/reboot helpers, rather than an asynchronous
>signal. Say, have init listening on a socket, allowing the poweroff
>binary to wait and/or retry.
>
> However, before I go down the wrong path here: does anyone have other
> ideas that might help eliminating dropped poweroff/reboot events?

The test that processes are being reaped is a good idea.
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox


Re: Race in reboot/poweroff path at init?

2017-10-11 Thread Ralf Friedl

Laurent Bercot wrote:

 That is true, I hadn't thought of abstract sockets, and it would work.

 However, changing the way poweroff signals init is a big change, and
in particular, making init listen to a socket is very significant: you
know need to multiplex reaping zombies with listening to connections.
It's not hard, but it requires patching init significantly, reworking
its control flow around an asynchronous event loop; and at that point
you're just better off using a better pid 1 process (such as s6-svscan,
which already does this - have you considered switching inits? ;)). And
changes of that magnitude are pretty dangerous - I'm quite convinced
there are some people relying on the "send a signal to init" mechanism
and whose systems would break if that mechanism were changed.

 The "send a signal to init" mechanism is a good one, and does not
fundamentally need to be changed; your problem is a readiness detection
one, i.e. whatever mechanism you're using to tell init to shutdown, you
can't use it before init is ready to handle it.

 And that's where abstract sockets can be useful: you can use abstract
sockets as a simple synchronization mechanism. The easiest way to do it
would be: when init has installed its signal handlers, it creates an
abstract socket (close-on-exec, of course) and binds it - but does not
listen to it and does nothing else with it). And poweroff checks for
the presence of this socket by repeatedly attempting connect(). When
errno changes from ENOENT to ECONNREFUSED, it means the socket has been
created and init is now ready to receive signals.

 This requires minimal change to init and a small polling addition
to poweroff - which I think is much better and safer than a heavy change
to busybox's init.
I think your idea with wait is better. It doesn't require any additional 
code to the init process, it is likely more portable, it doesn't consume 
memory for an open socket and the changes to poweroff are also smaller.
I would modify it so that it doesn't check for the presence of the 
grandchild, as the grandchild may be started from a process that does 
not do wait. Just check whether the parent PID is 1 or not.


Basically:
if ((pid1 = fork ()) > 0) exit (0);
if ((pid2 = fork ()) > 0) exit (0);
while (pid1 > 0 && pid2 > 0 && getppid () != 1) // only wait if both 
fork calls were successful

  sleep (1);
If fork fails, it should signal init from the parent process. If there 
are not enough resources to create a new process, we can safely assume 
that init is ready to receive a signal.

___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox



Re: Race in reboot/poweroff path at init?

2017-10-11 Thread Laurent Bercot

That's OK, as the helper (/sbin/poweroff) has the opportunity to retry
if the connect() fails (because init hasn't established the listening
socket yet). The main difference is that the sender can detect failure,
and retry if necessary.
AF_UNIX sockets in the abstract namespace don't require a path bound to
the filesystem, so perhaps they would be available early enough - or
have I missed something there?


 That is true, I hadn't thought of abstract sockets, and it would work.

 However, changing the way poweroff signals init is a big change, and
in particular, making init listen to a socket is very significant: you
know need to multiplex reaping zombies with listening to connections.
It's not hard, but it requires patching init significantly, reworking
its control flow around an asynchronous event loop; and at that point
you're just better off using a better pid 1 process (such as s6-svscan,
which already does this - have you considered switching inits? ;)). And
changes of that magnitude are pretty dangerous - I'm quite convinced
there are some people relying on the "send a signal to init" mechanism
and whose systems would break if that mechanism were changed.

 The "send a signal to init" mechanism is a good one, and does not
fundamentally need to be changed; your problem is a readiness detection
one, i.e. whatever mechanism you're using to tell init to shutdown, you
can't use it before init is ready to handle it.

 And that's where abstract sockets can be useful: you can use abstract
sockets as a simple synchronization mechanism. The easiest way to do it
would be: when init has installed its signal handlers, it creates an
abstract socket (close-on-exec, of course) and binds it - but does not
listen to it and does nothing else with it). And poweroff checks for
the presence of this socket by repeatedly attempting connect(). When
errno changes from ENOENT to ECONNREFUSED, it means the socket has been
created and init is now ready to receive signals.

 This requires minimal change to init and a small polling addition
to poweroff - which I think is much better and safer than a heavy change
to busybox's init.

--
 Laurent

___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox


Re: Race in reboot/poweroff path at init?

2017-10-11 Thread Laurent Bercot
   There's the sigqueue() mechanism out there. From the man page, it 
seems it's essentially dedicated to send data together with the signal, 
but it also has a queueing mechanism implemented in the kernel. Wether 
this allows the message to be kept in the queue until the destination 
process unmasks it, this isn't written explicitely in the man, but 
maybe somebody knows it. Anyway your case is a perfect test bench.


 From what I understand from the POSIX page for sigqueue(), it only 
makes

a difference if the receiver has already installed a signal handler with
SA_SIGINFO. If the receiver hasn't installed a signal handler yet, the
signal just gets delivered as is.
 So it's not a mechanism that can be used to defer signals until the
receiver has installed a signal handler.

--
 Laurent

___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox


Re: Race in reboot/poweroff path at init?

2017-10-11 Thread Didier Kryn

Le 11/10/2017 à 04:43, Jeremy Kerr a écrit :

Hi Laurent,

Thanks for the reply, good to get some conversation going here!


- using a synchronous channel to send the shutdown/reboot message
   between the poweroff/reboot helpers, rather than an asynchronous
   signal. Say, have init listening on a socket, allowing the poweroff
   binary to wait and/or retry.

  That would not work either: you could receive the event before init
starts listening to the socket.

That's OK, as the helper (/sbin/poweroff) has the opportunity to retry
if the connect() fails (because init hasn't established the listening
socket yet). The main difference is that the sender can detect failure,
and retry if necessary.

AF_UNIX sockets in the abstract namespace don't require a path bound to
the filesystem, so perhaps they would be available early enough - or
have I missed something there?

[Non-Linux platforms may not support the same abstract namespace, so
I'd need to implement a fallback there, but I don't (yet) know if this
same race is relevant on those platforms...]

I'd rather not just wear the race, as that means we've missed shutdown
events, which is quite user-visible. The signal-after-reaped-grandchild
approach seems okay too, if other methods aren't workable. Or even
Tito's suggestion of a repeated signal, which has the advantage of a
minimal change to code.
There's the sigqueue() mechanism out there. From the man page, it 
seems it's essentially dedicated to send data together with the signal, 
but it also has a queueing mechanism implemented in the kernel. Wether 
this allows the message to be kept in the queue until the destination 
process unmasks it, this isn't written explicitely in the man, but maybe 
somebody knows it. Anyway your case is a perfect test bench.


Didier

___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox

Re: Race in reboot/poweroff path at init?

2017-10-10 Thread Jeremy Kerr
Hi Laurent,

Thanks for the reply, good to get some conversation going here!

>> - using a synchronous channel to send the shutdown/reboot message
>>   between the poweroff/reboot helpers, rather than an asynchronous
>>   signal. Say, have init listening on a socket, allowing the poweroff
>>   binary to wait and/or retry.
> 
>  That would not work either: you could receive the event before init
> starts listening to the socket.

That's OK, as the helper (/sbin/poweroff) has the opportunity to retry
if the connect() fails (because init hasn't established the listening
socket yet). The main difference is that the sender can detect failure,
and retry if necessary.

AF_UNIX sockets in the abstract namespace don't require a path bound to
the filesystem, so perhaps they would be available early enough - or
have I missed something there?

[Non-Linux platforms may not support the same abstract namespace, so
I'd need to implement a fallback there, but I don't (yet) know if this
same race is relevant on those platforms...]

I'd rather not just wear the race, as that means we've missed shutdown
events, which is quite user-visible. The signal-after-reaped-grandchild
approach seems okay too, if other methods aren't workable. Or even
Tito's suggestion of a repeated signal, which has the advantage of a
minimal change to code.

Cheers,


Jeremy
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox


Re: Race in reboot/poweroff path at init?

2017-10-10 Thread Tito


On October 10, 2017 4:52:17 PM GMT+08:00, Jeremy Kerr  wrote:
>Hi all,
>
>I've been debugging an issue where we can't reboot or poweroff a
>machine
>in the early stages of busybox init. Using the poweroff case as an
>example:
>
> - kernel starts /sbin/init
>
> - kernel receives a poweroff event, so calls __orderly_poweroff.
>   Effectively, these will just call out to the /sbin/poweroff usermode
>   helper.
>
> - /sbin/poweroff just does a:
>
> kill(1, SIGUSR2);
>
> - However, /sbin/init has not yet installed a signal handler for
>   SIGUSR2. Because we're PID 1, this means the signal is ignored, and
>   so the command to poweroff the machine is dropped.
>
> - init keeps booting rather than powering off.
>
>In our particular case, the "poweroff event" is an IPMI soft shutdown
>message. However, the same would apply for any other path that involves
>orderly_poweroff or orderly_reboot.
>
>Even though the signal handlers are installed fairly early in init, we
>can still hit the race between this and the SIGUSR2 being sent fairly
>reliably.
>
>I see a couple of options for resolving this:
>
> - installing the signal handlers even earlier in init_main(). However,
>   this will only reduce the window for lost events, rather than
>   eliminating it; or
>
> - using a synchronous channel to send the shutdown/reboot message
>   between the poweroff/reboot helpers, rather than an asynchronous
>   signal. Say, have init listening on a socket, allowing the poweroff
>   binary to wait and/or retry.
>
>However, before I go down the wrong path here: does anyone have other
>ideas that might help eliminating dropped poweroff/reboot events?
>
>Regards,
>
>
>Jeremy
>___
>busybox mailing list
>busybox@busybox.net
>http://lists.busybox.net/mailman/listinfo/busybox
Hi, just a silly idea: would running the poweroff or reboot signal in a loop do 
any harm or eventually be somehow an improvement?
Just my 0,2 cents from vacation after a couple of beers.
Ciao, Tito
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox


Re: Race in reboot/poweroff path at init?

2017-10-10 Thread Laurent Bercot



- using a synchronous channel to send the shutdown/reboot message
  between the poweroff/reboot helpers, rather than an asynchronous
  signal. Say, have init listening on a socket, allowing the poweroff
  binary to wait and/or retry.


 That would not work either: you could receive the event before init
starts listening to the socket.

 There will always be a window where init can't receive events. The
kernel starts it barebones, with no channel of communication to other
processes; if an event arrives before it starts establishing these 
channels,

you're out of luck. The best you can do is make the window as small as
possible.

 If you need to be 100% safe, then you need to somehow queue the events
before init starts processing them. But that's tricky, because it's
extremely early - you have nothing but the kernel and the /sbin/poweroff
process' memory. You don't even have the guarantee that you can write to
a filesystem: you only have the rootfs and it may be read-only. You 
don't

even have a tmpfs yet. You can't be certain you have a devtmpfs mounted.
You don't have /dev/shm. You don't have /proc.

 So it's a matter of finding a way to queue events that don't involve
writing to the filesystem at all. That severely restricts your options:
for instance, POSIX message queues sound like a perfect fit, but Linux
implements them via a virtual filesystem that needs to be mounted first,
so it's a no-go.
 Signals are actually pretty good: all they require is that init has
installed a handler, which can be done early. The only issue is that
you can't queue them.

 What I would do is add a check to /sbin/poweroff that init has 
progressed

to a point where its signal handlers are installed, and if it's not
there yet, poll until it is (i.e. sleep and retry).

 What check to use? well at this point it's very hackish. The only
thing I can think of that doesn't depend on the contents of /etc/inittab
is that when init reaps zombies, we know it has its signal handlers
installed. So... I would have poweroff doublefork a process (have the
child communicate the pid of the grandchild before dying), the 
grandchild

dies - at this point it's a zombie waiting for init to reap it - and
poweroff repeatedly hits the grandchild with kill(), using signal 0 just
to be safe. When kill() fails with ESRCH, it means the zombie has
disappeared and init is now ready to accept signals.

 It's really ugly, but it's the best I can come up with that makes no
unsafe assumptions. Whether implementing that in /sbin/poweroff is
better than simply eating the race condition... that's your call.

--
 Laurent

___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox