Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Jan Kiszka wrote:
>>> I have been banging my head against this issue for several days now, first
>>> sorting out an unrelated bug I came across along the way, then trying to
>>> understand what happens, and finally getting mad about why this may only
>>> happen with Xenomai:
>>> One process, two threads, running under gdb control (no breakpoints,
>>> just the automatically set ones that track thread creation/destruction).
>>> It already happens with only one CPU. The first thread decides to issue
>>> exit() exactly while the second one is on its way from primary to
>>> secondary mode due to running into a breakpoint (int3 -> xnpod_trap_fault
>>> -> xnshadow_relax...). The group exit of thread A causes SIGKILL to be
>>> set in thread B, but triggers no further actions due to B already being
>>> awake and on its way to queue and handle the other signal (SIGTRAP). Now
>>> when B comes to dequeue the next signal it finds SIGTRAP and SIGKILL
>>> set, but picks up SIGTRAP due to its lower number. Now ptrace causes B
>>> to stop, gdb gets confused, sends a SIGSTOP to A, which is already a
>>> zombie, and waits for A to confirm the stop, which never happens. If
>>> someone is interested, I can provide an LTTng dump of this scenario.
>>> My problem is now that I still don't understand what prevents this
>>> deadlock on vanilla Linux. Does Xenomai create a thread schedule here
>>> that is impossible there? Or does it only widen an otherwise very
>>> small race window that also exists with mainline? Before making a fool
>>> of myself on LKML, I would like to collect some further ideas on the
>>> workaround or fix(?) below that cures this deadlock for me.
>> After reading this comment
>> I'm now about to escalate the issue to LKML. This really looks like a
>> mainline bug, probably just triggered more quickly by the large latency
>> between signal queuing and receiver scheduling that the
>> primary->secondary mode switch introduces.
> That said, I think gdb is buggy too: the kill function probably returns
> some error which says that the thread no longer exists, which gdb
> probably ignores since it awaits a signal from that killed thread.
According to my traces, there is no error returned. However, gdb /may/
see that the group leader, which issued the sys_exit_group, is now in
TASK_DEAD state - before trying to block on it, becoming TASK_TRACED again.
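
For illustration, the dequeue order described in the original report (SIGTRAP
is picked before SIGKILL "due to its lower number") can be modeled in user
space roughly as below. This is only a simplified sketch, not the kernel's
dequeue code: lowest_pending() and sig_bit() are hypothetical helpers, and
the pending set is assumed to be a plain bitmask in which bit (sig - 1)
stands for signal number sig.

```c
#include <signal.h>

/* Hypothetical helper: bit that represents signal 'sig' in the mask. */
static unsigned long sig_bit(int sig)
{
    return 1UL << (sig - 1);
}

/*
 * Hypothetical helper: return the lowest-numbered signal set in
 * 'pending', or 0 if no signal is pending. This only models the
 * "lowest number wins" rule from the report, nothing else.
 */
static int lowest_pending(unsigned long pending)
{
    int nbits = (int)(8 * sizeof pending);

    for (int sig = 1; sig <= nbits; sig++) {
        if (pending & sig_bit(sig))
            return sig;
    }
    return 0;
}

/* Thread B's situation in the report: SIGTRAP and SIGKILL both pending. */
static int first_signal_for_thread_b(void)
{
    unsigned long pending = sig_bit(SIGTRAP) | sig_bit(SIGKILL);

    return lowest_pending(pending);
}
```

On Linux, SIGTRAP is 5 and SIGKILL is 9, so under this model
first_signal_for_thread_b() yields SIGTRAP, which matches the order observed
in the traces.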
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux