Re: [LEDE-DEV] libubox, procd: init process hangs

2016-06-07 Thread Yousong Zhou

On 7 June 2016 at 06:11, Xinxing Hu  wrote:
> Hi Guys,
>
> I have another idea about this issue. Maybe it is not kernel, but uloop
> related. I read procd and libubox code a little bit, and it seems there is a
> potential issue existing in uloop_run().
>
> In general, uloop_run() is running in a while loop:
>
> while()
> 1, Process timeouts list
>
> 2, Handle terminated child processes
>
> 3, uloop_run_events(timeout) => calls epoll_wait()
> done
>
> During boot, procd_inittab_run("sysinit") is called in Step1, which calls
> add_initd(). add_initd() would add an entry in timeouts list, whose callback
> function is to execute an rc.d/S* script.
>
> When the while loop goes back to Step1 again, the timeouts list would be
> processed, and an rc.d/S* script would be executed in a child process while
> the parent process remains in the while loop. If everything goes fine, when
> the child process is terminated, the parent process will handle terminated
> child process by calling waitpid() in the while loop. A process callback
> function will also be called, which adds another timeout entry in timeouts
> list. This new entry corresponds to the next rc.d/S* script to be executed.
> When the while loop reaches Step1 again, the next rc.d/S* script would be
> invoked.
>
> Everything looks OK till now. However, due to process scheduling, problems
> might happen when uloop_run_events(uloop_get_next_timeout()) is called.
> For instance: if the child process is still running when
> uloop_get_next_timeout() is called, then the timeouts list is already
> empty at that time, so the return value of uloop_get_next_timeout() would
> be -1. Furthermore, if the child process terminates and the signal handler
> runs before epoll_wait() is called, then epoll_wait() will block the
> parent process until some other event it is listening for arrives. In
> this sense, other events arriving just hide this issue. During boot, as
> long as the /etc/rc.d/S* scripts have not finished executing, epoll_wait()
> should never block indefinitely.
>
> I think a potential solution might be: during initialization, we let uloop
> listen for a kind of 'dummy' event. Every time a child process
> finishes executing an rc.d/S* script, we send a 'dummy' event. In this case,
> epoll_wait would never block indefinitely during boot.

Interesting.  It looks like the same race can also affect the
uloop_canceled check.  Python's Tornado library uses a pipe() as a
"waker" so that it "calls the given callback on the next I/O loop
iteration."

Can you give the attached patch a try and see whether it solves the issue
for you?  It has only been run-tested on QEMU malta to make sure the
patched libubox still runs.
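
For readers without the attachment, here is a minimal sketch of the
self-pipe "waker" pattern mentioned above.  It is not the attached patch
and not libubox code; the names waker_pipe, waker_init, waker_install,
waker_signal and waker_drain are made up for illustration.  The idea is
that the signal handler writes one byte to a pipe whose read end is
registered with epoll, so a signal that arrives just before epoll_wait()
still makes the next epoll_wait() return.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical names; [0] is the read end, [1] the write end. */
static int waker_pipe[2] = { -1, -1 };

/* Signal handler: only calls write(), which is async-signal-safe. */
static void waker_signal(int signo)
{
	int saved_errno = errno;
	char c = (char)signo;
	ssize_t r = write(waker_pipe[1], &c, 1);

	(void)r; /* a full pipe means a wake-up is already pending */
	errno = saved_errno;
}

/* Create the pipe, make both ends non-blocking, add the read end to epoll. */
static int waker_init(int epfd)
{
	struct epoll_event ev = { .events = EPOLLIN };

	if (pipe(waker_pipe) < 0)
		return -1;
	for (int i = 0; i < 2; i++) {
		int fl = fcntl(waker_pipe[i], F_GETFL);

		if (fl < 0 || fcntl(waker_pipe[i], F_SETFL, fl | O_NONBLOCK) < 0)
			return -1;
		fcntl(waker_pipe[i], F_SETFD, FD_CLOEXEC);
	}
	ev.data.fd = waker_pipe[0];
	return epoll_ctl(epfd, EPOLL_CTL_ADD, waker_pipe[0], &ev);
}

/* Route a signal (e.g. SIGCHLD) through the waker. */
static int waker_install(int signo)
{
	struct sigaction sa = { 0 };

	sa.sa_handler = waker_signal;
	sigemptyset(&sa.sa_mask);
	return sigaction(signo, &sa, NULL);
}

/* Drain pending wake-up bytes; returns how many were read. */
static int waker_drain(void)
{
	char buf[64];
	ssize_t n;
	int total = 0;

	while ((n = read(waker_pipe[0], buf, sizeof(buf))) > 0)
		total += (int)n;
	return total;
}
```

In an event loop one would call waker_init() once, install the handler
with waker_install(SIGCHLD), and call waker_drain() followed by the usual
child reaping whenever epoll reports the read end as readable.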

yousong

>
> Best Regards,
> Xinxing
>
>
>
>
> On 2016/5/17 18:03, Mats Karrman wrote:
> Hi Felix, others,
>
> I have been experiencing problems with the init scripts dispatch
> suddenly stopping (indefinitely).
> This happens maybe once in 100 reboots.
> After inserting a new start script that launches another daemon
> (cgrulesengd) very early in the boot process, the failures started to
> come a lot more frequently, maybe once in 10 reboots, making this a real
> issue.
> I'm normally using the versions of procd and libubox selected by the OpenWrt
> BB branch but I have tested the latest versions from the git repos with
> the same result.
> So far I have only got this to happen on a quite fast board (ARM dual
> Cortex-A9 @ 1GHz).
> Inserting trace prints in libubox changes behavior, also suggesting the
> problem is timing dependent.
>
> When init hangs:
> - it is still possible to log in on console
> - there is always a zombie start script, e.g. S11sysctl.
> - by killing a process (e.g. ubusd or cgrulesengd) the init process
> continues.
> - otherwise generating an event, e.g. inserting something into a USB port
> also makes the init continue.
>
> I have traced the problem down to the "epoll_wait" call in
> libubox::uloop.c::uloop_fetch_events().
> The following patch makes sure epoll_wait is never called without a timeout.
> My tests show that this solves the problem.
> I have been able to observe the case when the boot gets stuck and then
> continues after the 8s timeout.
> However I'm not sure that this is the correct fix for the problem as
> there may be other reasons that there is no event in the first place.
> Your feedback would be welcome!
>
> BR // Mats
> Currently working for Inteno Broadband Technology AB
>
> ___
> Lede-dev mailing list
> Lede-dev@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/lede-dev


0001-uloop-use-a-waker-for-notifying-sigchld-and-loop-can.patch
Description: Binary data

Re: [LEDE-DEV] libubox, procd: init process hangs

2016-05-19 Thread Mats Karrman

On 2016-05-18 15:09, Mats Karrman wrote:

On 2016-05-18 14:03, Felix Fietkau wrote:

On 2016-05-18 14:00, Mats Karrman wrote:

On 2016-05-18 13:01, Felix Fietkau wrote:

On 2016-05-18 11:38, Mats Karrman wrote:

On 2016-05-17 17:31, Mats Karrman wrote:

On 2016-05-17 13:29, Felix Fietkau wrote:
I just took a look at the code and uloop's processing of signals looked
a bit racy to me. I've pushed a commit that makes it use signalfd if
available. I also found that waitpid wasn't being retried on signal
interrupt, so I added an extra check there. The changes are in libubox
git, but not in OpenWrt/LEDE yet.
Please test if this fixes your issue.

Thanks,

- Felix

Tried that but no immediate success, but it might have provided
some additional clues. Now the boot hangs early on *every* boot
but after logging in I found something different in the ps list.
There is a Broadcom utility (smd) that is called from one of the
start scripts (S10environment). Its purpose is to set scheduling
priority and CPU affinity for some of the Broadcom proprietary
processes. The smd program handles fork in a rather ugly way. The
parent only loops until it receives SIGCHLD and then exits without
any wait. With the modified libubox I get a zombie smd child and
sleeping smd parent and S11environment (no other zombie).

Not sure exactly how this happened but I got to think about
something written in the wait man page:

"""
If a parent process terminates, then its "zombie" children (if any)
are adopted by init(8), which automatically performs a wait to
remove the zombies.
"""

Is this wait really (unconditionally) implemented in procd or could
that be what I accomplished with the "forced timeout" patch?

I fixed the ugly fork and got the system to boot once.
Then tried the original libubox with the fixed smd program but
this was not enough to get things working (25 reboots to hang).

Now I'm running reboot tests with your new libubox and fixed smd...

More than 250 reboots without problem :)

Clearly the smd program is broken, but still it doesn't feel good that it
manages to hang the init process. Considering that timing is involved
it's difficult to draw any firm conclusions, but it seems like having
uloop's epoll_wait time out occasionally isn't such a bad idea?

I agree, that definitely needs fixing. What kernel are you using?

It's the 3.4.11-rt19 from the Broadcom SDK v4.16, so very old...

Now I also noticed, with your libubox fixes (and my fixed smd) I still get
some zombies, even though the system seems to boot OK all the way
(the corresponding services being defunct though).
With my epoll_wait timeout fix on the original libubox, this does not
happen.

Can you try backporting this to your kernel?
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=128dd1759d96ad36c379240f8b9463e8acfd37a1

- Felix

OK, did that.

First, I tested the original libubox without my epoll_wait timeout fix
(but with the fixed smd): init hung after 7 reboots, in the same state
as before.

Second, I tested your libubox (with the fixed smd): there seems to be no
change, i.e. some processes/scripts end up as zombies.

Third, I added my epoll_wait timeout fix to your libubox, which made no
difference.

Also noted that other stuff is not working properly with your fixes in
libubox, e.g. system upgrade hangs with a zombie sh, and after copying
files to the box with scp I get a zombie dropbear. Child reaping in
general seems to be broken.

I backed down to the latest libubox before your fixes and the
zombies/problems disappear, so they have no relation to the kernel patch
or smd (the original init hang problem is still there, of course).

// Mats


Re: [LEDE-DEV] libubox, procd: init process hangs

2016-05-18 Thread Mats Karrman




On 2016-05-17 17:31, Mats Karrman wrote:


On 2016-05-17 13:29, Felix Fietkau wrote:

I just took a look at the code and uloop's processing of signals looked
a bit racy to me. I've pushed a commit that makes it use signalfd if
available. I also found that waitpid wasn't being retried on signal
interrupt, so I added an extra check there. The changes are in libubox
git, but not in OpenWrt/LEDE yet.
Please test if this fixes your issue.

Thanks,

- Felix

Tried that but no immediate success, but it might have provided
some additional clues. Now the boot hangs early on *every* boot
but after logging in I found something different in the ps list.
There is a Broadcom utility (smd) that is called from one of the
start scripts (S10environment). Its purpose is to set scheduling
priority and CPU affinity for some of the Broadcom proprietary
processes. The smd program handles fork in a rather ugly way. The
parent only loops until it receives SIGCHLD and then exits without
any wait. With the modified libubox I get a zombie smd child and
sleeping smd parent and S11environment (no other zombie).
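
For comparison, a conventional fork-and-wait helper looks roughly like
the sketch below. It is a generic illustration, not the actual smd fix
(the smd source is not shown in this thread); run_and_wait and demo_child
are hypothetical names. The point is that the parent reaps its own child
with waitpid() instead of spinning until SIGCHLD arrives and then exiting.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical child job: returns the exit code passed in via arg. */
static int demo_child(void *arg)
{
	return *(int *)arg;
}

/*
 * Fork, run fn(arg) in the child, and block until that child is reaped.
 * Returns the child's exit code, or -1 on error or abnormal termination.
 */
static int run_and_wait(int (*fn)(void *), void *arg)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(fn(arg) & 0xff);

	/* Retry if a signal interrupts the wait; give up on other errors. */
	while (waitpid(pid, &status, 0) < 0) {
		if (errno != EINTR)
			return -1;
	}
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With this shape no zombie is left behind for init to adopt, because the
child has already been waited for by the time the parent exits.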

Not sure exactly how this happened but I got to think about
something written in the wait man page:

"""
If  a parent process terminates, then its "zombie" children (if any)
are adopted by init(8), which automatically performs a wait to
remove the zombies.
"""

Is this wait really (unconditionally) implemented in procd or could
that be what I accomplished with the "forced timeout" patch?

I fixed the ugly fork and got the system to boot once.
Then tried the original libubox with the fixed smd program but
this was not enough to get things working (25 reboots to hang).

Now I'm running reboot tests with your new libubox and fixed smd...

More than 250 reboots without problem :)

Clearly the smd program is broken, but still it doesn't feel good that it
manages to hang the init process. Considering that timing is involved
it's difficult to draw any firm conclusions, but it seems like having
uloop's epoll_wait time out occasionally isn't such a bad idea?

// Mats



Re: [LEDE-DEV] libubox, procd: init process hangs

2016-05-17 Thread Felix Fietkau

Hi Mats,

On 2016-05-17 12:03, Mats Karrman wrote:
> Hi Felix, others,
> 
> I have been experiencing problems with the init scripts dispatch 
> suddenly stopping (indefinitely).
> This happens maybe once in 100 reboots.
> After inserting a new start script that launches another daemon 
> (cgrulesengd) very early in the boot process, the failures started to 
> come a lot more frequently, maybe once in 10 reboots, making this a real 
> issue.
> I'm normally using the versions of procd and libubox selected by the OpenWrt
> BB branch but I have tested the latest versions from the git repos with
> the same result.
> So far I have only got this to happen on a quite fast board (ARM dual
> Cortex-A9 @ 1GHz).
> Inserting trace prints in libubox changes behavior, also suggesting the 
> problem is timing dependent.
> 
> When init hangs:
> - it is still possible to log in on console
> - there is always a zombie start script, e.g. S11sysctl.
> - by killing a process (e.g. ubusd or cgrulesengd) the init process
> continues.
> - otherwise generating an event, e.g. inserting something into a USB port
> also makes the init continue.
> 
> I have traced the problem down to the "epoll_wait" call in 
> libubox::uloop.c::uloop_fetch_events().
> The following patch makes sure epoll_wait is never called without a timeout.
> My tests show that this solves the problem.
> I have been able to observe the case when the boot gets stuck and then 
> continues after the 8s timeout.
> However I'm not sure that this is the correct fix for the problem as 
> there may be other reasons that there is no event in the first place.
> Your feedback would be welcome!

I just took a look at the code and uloop's processing of signals looked
a bit racy to me. I've pushed a commit that makes it use signalfd if
available. I also found that waitpid wasn't being retried on signal
interrupt, so I added an extra check there. The changes are in libubox
git, but not in OpenWrt/LEDE yet.
Please test if this fixes your issue.
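
A rough sketch of the two ideas described above, for illustration only.
This is not the libubox commit; sigchld_fd_open, waitpid_retry and
sigchld_fd_handle are hypothetical names. It assumes a Linux system with
signalfd(2): SIGCHLD is blocked and delivered through a file descriptor
that can be added to the epoll set, and waitpid() is retried when it
fails with EINTR.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Block SIGCHLD and return an fd that becomes readable when one is pending. */
static int sigchld_fd_open(void)
{
	sigset_t mask;

	sigemptyset(&mask);
	sigaddset(&mask, SIGCHLD);
	if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
		return -1;
	return signalfd(-1, &mask, SFD_NONBLOCK | SFD_CLOEXEC);
}

/* waitpid() that retries when interrupted by a signal. */
static pid_t waitpid_retry(pid_t pid, int *status, int flags)
{
	pid_t ret;

	do {
		ret = waitpid(pid, status, flags);
	} while (ret < 0 && errno == EINTR);
	return ret;
}

/*
 * Consume queued signal records, then reap every child that has exited.
 * Returns the number of children reaped.
 */
static int sigchld_fd_handle(int sfd)
{
	struct signalfd_siginfo si;
	int status, reaped = 0;

	/* Several exits may be coalesced into one record, so loop on both. */
	while (read(sfd, &si, sizeof(si)) == (ssize_t)sizeof(si))
		;
	while (waitpid_retry(-1, &status, WNOHANG) > 0)
		reaped++;
	return reaped;
}
```

Because the pending signal stays readable on the descriptor until it is
consumed, a child that exits just before epoll_wait() is entered still
wakes the loop.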

Thanks,

- Felix


[LEDE-DEV] libubox, procd: init process hangs

2016-05-17 Thread Mats Karrman


Hi Felix, others,

I have been experiencing problems with the init scripts dispatch 
suddenly stopping (indefinitely).

This happens maybe once in 100 reboots.
After inserting a new start script that launches another daemon 
(cgrulesengd) very early in the boot process, the failures started to 
come a lot more frequently, maybe once in 10 reboots, making this a real 
issue.
I'm normally using the versions of procd and libubox selected by the OpenWrt
BB branch but I have tested the latest versions from the git repos with
the same result.
So far I have only got this to happen on a quite fast board (ARM dual
Cortex-A9 @ 1GHz).
Inserting trace prints in libubox changes behavior, also suggesting the 
problem is timing dependent.


When init hangs:
- it is still possible to log in on console
- there is always a zombie start script, e.g. S11sysctl.
- by killing a process (e.g. ubusd or cgrulesengd) the init process
continues.
- otherwise generating an event, e.g. inserting something into a USB port
also makes the init continue.


I have traced the problem down to the "epoll_wait" call in 
libubox::uloop.c::uloop_fetch_events().

The following patch makes sure epoll_wait is never called without a timeout.
My tests show that this solves the problem.
I have been able to observe the case when the boot gets stuck and then 
continues after the 8s timeout.
However I'm not sure that this is the correct fix for the problem as 
there may be other reasons that there is no event in the first place.
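
To spell out what the change does: epoll_wait() treats a negative timeout
as "block until an event arrives", so the patch substitutes a finite
fallback whenever uloop would otherwise pass -1. The helper below is only
a sketch of that idea; bounded_timeout and FALLBACK_TIMEOUT_MS are
hypothetical names, and 8000 is simply the value used in the patch.

```c
#include <assert.h>

/* Hypothetical name; 8000 ms is the value chosen in the patch. */
#define FALLBACK_TIMEOUT_MS 8000

/*
 * Map "no timeout" (any negative value) to a finite fallback so that
 * the caller never blocks forever; other values pass through unchanged.
 */
static int bounded_timeout(int timeout)
{
	return timeout < 0 ? FALLBACK_TIMEOUT_MS : timeout;
}
```

The cost is one spurious wake-up every 8 seconds on an idle loop, and a
missed SIGCHLD is noticed late rather than never, which is why this is a
workaround for the race and not a fix for it.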

Your feedback would be welcome!

BR // Mats
Currently working for Inteno Broadband Technology AB



diff --git a/uloop.c b/uloop.c
index ea160a0..8343bc5 100644
--- a/uloop.c
+++ b/uloop.c
@@ -256,7 +256,7 @@ static int uloop_fetch_events(int timeout)
 {
 	int n, nfds;
 
-	nfds = epoll_wait(poll_fd, events, ARRAY_SIZE(events), timeout);
+	nfds = epoll_wait(poll_fd, events, ARRAY_SIZE(events), timeout < 0 ? 8000 : timeout);
 
 	for (n = 0; n < nfds; ++n) {
 		struct uloop_fd_event *cur = &cur_fds[n];
 		struct uloop_fd *u = events[n].data.ptr;

