Re: [Xenomai-core] Houston, we have a circular problem

2008-05-06 Thread Philippe Gerum
Jan Kiszka wrote:
> Philippe Gerum wrote:
>> Jan Kiszka wrote:
>>> Gilles Chanteperdrix wrote:
 On Mon, May 5, 2008 at 6:08 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>  do_schedule_event() is the culprit when it reads the pending signals on the
>  shared queue (XNDEBUG check for rearming the timers),
 A stupid suggestion: if we know that the spinlock is always locked
 when calling do_schedule_event, maybe we can simply avoid the lock
 there ?
>>> Would be the best solution - but I don't think so. After reading a bit
>>> more into the lockdep output, I think the issue is that _some_other_
>>> task may hold the siglock and then acquire our rq_lock, but not
>>> necessarily along a similar code path we took to acquire the siglock now.
>>>
>> Actually, this locking around the sigmask retrieval looks overkill, since we
>> only address ptracing signals here, and those should go through the shared
>> pending set, not through the task's private one. I.e., there should be no way
>> to get fooled by any asynchronous update of that mask.
> 
> This is a debug helper anyway, so we risk (if I got this right) at worst
> a spurious unfreeze of the Xenomai timers. Does not really compare to
> the current deadlock...
>

As a matter of fact, we don't test any condition under the protection of this
lock, so aside from the memory barrier induced on SMP, this lock does not buy us
anything. Except a deadlock, that is...
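
For illustration, a minimal userspace sketch of the copy-then-test pattern the
hunk below keeps (wrap_get_sigpending() and the nucleus context are only
mimicked here; this is not the actual shadow.c code):

/* Take an unlocked snapshot of the pending set, then test only the
 * ptrace-related signals on that private copy. */
#include <signal.h>
#include <stdio.h>

static sigset_t shared_pending;		/* stands in for the shared signal queue */

static void get_sigpending_snapshot(sigset_t *dst)
{
	*dst = shared_pending;		/* lockless copy, racy by design */
}

int main(void)
{
	sigset_t pending;

	sigemptyset(&shared_pending);
	sigaddset(&shared_pending, SIGSTOP);	/* pretend the debugger queued a stop */

	get_sigpending_snapshot(&pending);

	/* Only the snapshot is inspected; a concurrent update of the real
	 * set can at worst yield a stale answer, never corrupt the test. */
	if (sigismember(&pending, SIGSTOP) || sigismember(&pending, SIGINT))
		printf("would rearm the Xenomai timers (XNDEBUG path)\n");

	return 0;
}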

> I will let my colleagues run the hunk below tomorrow (which works for me) -
> let's see if they manage to crash it again :P (they are experts in this!).
> 
> Jan
> 
> Index: xenomai-2.4.x/ksrc/nucleus/shadow.c
> ===================================================================
> --- xenomai-2.4.x/ksrc/nucleus/shadow.c   (Revision 3734)
> +++ xenomai-2.4.x/ksrc/nucleus/shadow.c   (Arbeitskopie)
> @@ -2194,9 +2194,7 @@ static inline void do_schedule_event(str
>   if (signal_pending(next)) {
>   sigset_t pending;
>  
> - spin_lock(&wrap_sighand_lock(next));	/* Already interrupt-safe. */
>   wrap_get_sigpending(&pending, next);
> - spin_unlock(&wrap_sighand_lock(next));
>  
>   if (sigismember(&pending, SIGSTOP) ||
>   sigismember(&pending, SIGINT))
> 


-- 
Philippe.



Re: [Xenomai-core] Houston, we have a circular problem

2008-05-05 Thread Jan Kiszka
Philippe Gerum wrote:
> Jan Kiszka wrote:
>> Gilles Chanteperdrix wrote:
>>> On Mon, May 5, 2008 at 6:08 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
 do_schedule_event() is the culprit when it reads the pending signals on the
 shared queue (XNDEBUG check for rearming the timers),
>>> A stupid suggestion: if we know that the spinlock is always locked
>>> when calling do_schedule_event, maybe we can simply avoid the lock
>>> there ?
>> Would be the best solution - but I don't think so. After reading a bit
>> more into the lockdep output, I think the issue is that _some_other_
>> task may hold the siglock and then acquire our rq_lock, but not
>> necessarily along a similar code path we took to acquire the siglock now.
>>
> 
> Actually, this locking around the sigmask retrieval looks overkill, since we
> only address ptracing signals here, and those should go through the shared
> pending set, not through the task's private one. I.e., there should be no way
> to get fooled by any asynchronous update of that mask.

This is a debug helper anyway, so we risk (if I got this right) at worst
a spurious unfreeze of the Xenomai timers. Does not really compare to
the current deadlock...

I will let my colleagues run the hunk below tomorrow (which works for me) -
let's see if they manage to crash it again :P (they are experts in this!).

Jan

Index: xenomai-2.4.x/ksrc/nucleus/shadow.c
===================================================================
--- xenomai-2.4.x/ksrc/nucleus/shadow.c (Revision 3734)
+++ xenomai-2.4.x/ksrc/nucleus/shadow.c (Arbeitskopie)
@@ -2194,9 +2194,7 @@ static inline void do_schedule_event(str
if (signal_pending(next)) {
sigset_t pending;
 
-   spin_lock(&wrap_sighand_lock(next));	/* Already interrupt-safe. */
wrap_get_sigpending(&pending, next);
-   spin_unlock(&wrap_sighand_lock(next));
 
if (sigismember(&pending, SIGSTOP) ||
sigismember(&pending, SIGINT))





Re: [Xenomai-core] Houston, we have a circular problem

2008-05-05 Thread Philippe Gerum
Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
>> On Mon, May 5, 2008 at 6:08 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>>>  do_schedule_event() is the culprit when it reads the pending signals on the
>>>  shared queue (XNDEBUG check for rearming the timers),
>> A stupid suggestion: if we know that the spinlock is always locked
>> when calling do_schedule_event, maybe we can simply avoid the lock
>> there ?
> 
> Would be the best solution - but I don't think so. After reading a bit
> more into the lockdep output, I think the issue is that _some_other_
> task may hold the siglock and then acquire our rq_lock, but not
> necessarily along a similar code path we took to acquire the siglock now.
> 

Actually, this locking around the sigmask retrieval looks overkill, since we
only address ptracing signals here, and those should go through the shared
pending set, not through the task's private one. I.e., there should be no way to
get fooled by any asynchronous update of that mask.
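
To make the distinction concrete, a simplified mock of the two queues in
question (assumed names and layout, not the real kernel structures):

#include <signal.h>
#include <stdbool.h>
#include <stdio.h>

struct mock_signal_struct {
	sigset_t shared_pending;		/* process-wide queue */
};

struct mock_task_struct {
	sigset_t pending;			/* per-task private queue */
	struct mock_signal_struct *signal;	/* shared by the whole thread group */
};

/* Ptrace-related SIGSTOP/SIGINT land in the shared, process-wide set, so
 * the XNDEBUG check only needs that one; a stale, unlocked read of it is
 * harmless for this purpose. */
static bool ptrace_stop_pending(const struct mock_task_struct *t)
{
	return sigismember(&t->signal->shared_pending, SIGSTOP) ||
	       sigismember(&t->signal->shared_pending, SIGINT);
}

int main(void)
{
	struct mock_signal_struct sig;
	struct mock_task_struct task = { .signal = &sig };

	sigemptyset(&task.pending);
	sigemptyset(&sig.shared_pending);
	sigaddset(&sig.shared_pending, SIGINT);	/* e.g. interrupt request from gdb */

	printf("rearm Xenomai timers: %s\n",
	       ptrace_stop_pending(&task) ? "yes" : "no");
	return 0;
}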

-- 
Philippe.



Re: [Xenomai-core] Houston, we have a circular problem

2008-05-05 Thread Philippe Gerum
Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
>> On Mon, May 5, 2008 at 6:08 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>>>  do_schedule_event() is the culprit when it reads the pending signals on the
>>>  shared queue (XNDEBUG check for rearming the timers),
>> A stupid suggestion: if we know that the spinlock is always locked
>> when calling do_schedule_event, maybe we can simply avoid the lock
>> there ?
> 
> Would be the best solution - but I don't think so. After reading a bit
> more into the lockdep output, I think the issue is that _some_other_
> task may hold the siglock and then acquire our rq_lock, but not
> necessarily along a similar code path we took to acquire the siglock now.
> 

Mm, indeed. Grabbing any further kernel lock when we hold the runqueue lock is
probably a very bad idea, since any normal locking sequence would go exactly the
opposite way...
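
A tiny userspace illustration of that inversion (pthread mutexes standing in
for rq->lock and sighand->siglock; obviously not the kernel code itself): one
path takes the siglock and then needs the runqueue lock for a wakeup, while
the schedule hook does it the other way round, which is exactly the AB-BA
pattern lockdep complains about.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t siglock = PTHREAD_MUTEX_INITIALIZER;

/* Signal delivery path: siglock held, then a wakeup needs the runqueue lock. */
static void *signal_sender(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&siglock);
	pthread_mutex_lock(&rq_lock);
	pthread_mutex_unlock(&rq_lock);
	pthread_mutex_unlock(&siglock);
	return NULL;
}

/* Schedule hook path: runqueue lock already held, then the hook grabs siglock. */
static void *schedule_hook(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&rq_lock);
	pthread_mutex_lock(&siglock);
	pthread_mutex_unlock(&siglock);
	pthread_mutex_unlock(&rq_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	/* Run both paths concurrently often enough and each thread ends up
	 * holding one lock while waiting forever for the other. */
	pthread_create(&a, NULL, signal_sender, NULL);
	pthread_create(&b, NULL, schedule_hook, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	printf("no deadlock this time\n");
	return 0;
}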

-- 
Philippe.



Re: [Xenomai-core] Houston, we have a circular problem

2008-05-05 Thread Jan Kiszka
Gilles Chanteperdrix wrote:
> On Mon, May 5, 2008 at 6:08 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>>  do_schedule_event() is the culprit when it reads the pending signals on the
>>  shared queue (XNDEBUG check for rearming the timers),
> 
> A stupid suggestion: if we know that the spinlock is always locked
> when calling do_schedule_event, maybe we can simply avoid the lock
> there ?

Would be the best solution - but I don't think so. After reading a bit
more into the lockdep output, I think the issue is that _some_other_
task may hold the siglock and then acquire our rq_lock, but not
necessarily along a similar code path we took to acquire the siglock now.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux



Re: [Xenomai-core] Houston, we have a circular problem

2008-05-05 Thread Gilles Chanteperdrix
On Mon, May 5, 2008 at 6:08 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>  do_schedule_event() is the culprit when it reads the pending signals on the
>  shared queue (XNDEBUG check for rearming the timers),

A stupid suggestion: if we know that the spinlock is always locked
when calling do_schedule_event, maybe we can simply avoid the lock
there ?

-- 
 Gilles



Re: [Xenomai-core] Houston, we have a circular problem

2008-05-05 Thread Philippe Gerum
Jan Kiszka wrote:
> Hi,
> 
> after hacking away the barriers I-pipe erected in front of lockdep
> (patches will follow on adeos-main), I was finally able to "visualize" a
> bit more what our colleagues see in reality on SMP: some ugly, not yet
> understood circular dependency when running some Xenomai app under gdb.
> What lockdep tries to tell us remains unclear, unfortunately:
> 
> [  874.356703]
> [  874.356957] ===
> 
> Here it hangs because of this (caught via QEMU):
> 
> (gdb) bt
> #0  __delay (loops=1) at arch/x86/lib/delay_64.c:34
> #1  0x80372712 in _raw_spin_lock (lock=0x81000232c6c0) at 
> lib/spinlock_debug.c:111
> #2  0x80479d3d in _spin_lock (lock=0x81000232c6c0) at 
> kernel/spinlock.c:182
> #3  0x8022e546 in task_rq_lock (p=0x81002e792000, 
> flags=0x81002f487910) at kernel/sched.c:615
> #4  0x8022e6b6 in try_to_wake_up (p=0x1, state=, 
> sync=341) at kernel/sched.c:1562
> #5  0x8022e9a5 in default_wake_function (curr=, 
> mode=0, sync=341, key=0xf48791c8) at kernel/sched.c:3840
> #6  0x8024ae51 in autoremove_wake_function (wait=0x1, mode=0, 
> sync=341, key=0xf48791c8) at kernel/wait.c:132
> #7  0x8022bdc7 in __wake_up_common (q=, mode=1, 
> nr_exclusive=1, sync=0, key=0x0) at kernel/sched.c:3861
> #8  0x8022df43 in __wake_up (q=0x805a6240, mode=1, 
> nr_exclusive=1, key=0x0) at kernel/sched.c:3880
> #9  0x80235838 in wake_up_klogd () at kernel/printk.c:1013
> #10 0x80235a30 in release_console_sem () at kernel/printk.c:1059
> #11 0x802360be in vprintk (fmt=0x12 , 
> args=0x81002f487a72) at kernel/printk.c:807
> #12 0x802361e5 in printk (fmt=0x8054c0fd "\n", '='  55 times>, "\n") at kernel/printk.c:664
> #13 0x80256268 in print_circular_bug_header 
> (entry=0x809cb8c0, depth=2) at kernel/lockdep.c:902
> #14 0x80256f84 in check_noncircular (source=, 
> depth=1) at kernel/lockdep.c:973
> #15 0x80256f8e in check_noncircular (source=, 
> depth=0) at kernel/lockdep.c:975
> #16 0x80257a45 in __lock_acquire (lock=0x81002e9ff960, 
> subclass=0, trylock=0, read=0, check=2, hardirqs_off=1, 
> ip=18446744071564715933) at kernel/lockdep.c:1324
> #17 0x80258500 in lock_acquire (lock=0x1, subclass=0, 
> trylock=-2140427232, read=-192441912, check=2, ip=) at 
> kernel/lockdep.c:2703
> #18 0x80479d35 in _spin_lock (lock=0x81002e9ff948) at 
> kernel/spinlock.c:181
> #19 0x8028679d in schedule_event (event=, 
> ipd=0x0, data=0x81002e198000) at kernel/xenomai/nucleus/shadow.c:2197
> #20 0x80274bfc in __ipipe_dispatch_event (event=33, 
> data=0x81002e198000) at kernel/ipipe/core.c:828
> #21 0x80477637 in schedule () at kernel/sched.c:1897
> #22 0x80247598 in worker_thread (__cwq=) at 
> kernel/workqueue.c:314
> #23 0x8024ad16 in kthread (_create=) at 
> kernel/kthread.c:78
> #24 0x8020d238 in child_rip ()
> #25 0x in ?? ()
> 
> The lock in question should be task->sighand->siglock, but as we hit the
> bug inside the scheduler, printk deadlocks now :(. Need to dig out some
> patch of Steven Rostedt (IIRC) that may overcome the second deadlock.
> 
> But maybe someone already hears a bell ringing. Would be highly
> appreciated as gdb is effectively unusable here.
> 

do_schedule_event() is the culprit when it reads the pending signals on the
shared queue (XNDEBUG check for rearming the timers), but we really need to know
who is the first locker to fix that properly. Any chance busting the spinlocks
and running with printk_sync() mode on the current domain would get us the
traces out?


-- 
Philippe.



Re: [Xenomai-core] Houston, we have a circular problem

2008-05-05 Thread Jan Kiszka
Jan Kiszka wrote:
> Hi,
> 
> after hacking away the barriers I-pipe erected in front of lockdep
> (patches will follow on adeos-main), I was finally able to "visualize" a
> bit more what our colleagues see in reality on SMP: some ugly, not yet
> understood circular dependency when running some Xenomai app under gdb.
> What lockdep tries to tell us remains unclear, unfortunately:
> 
> [  874.356703]
> [  874.356957] ===

Got it!

[0.00]
[0.00] ===
[0.00] [ INFO: possible circular locking dependency detected ]
[0.00] 2.6.24.6-xeno_64 #313
[0.00] ---
[0.00] gdb/4385 is trying to acquire lock:
[0.00]  ((spinlock_t *)&sighand->siglock){}, at: 
[] schedule_event+0x7f/0x578
[0.00]
[0.00] but task is already holding lock:
[0.00]  (&rq->rq_lock_key){++..}, at: [] 
schedule+0x176/0x7ff
[0.00]
[0.00] which lock already depends on the new lock.
[0.00]
[0.00]
[0.00] the existing dependency chain (in reverse order) is:
[0.00]
[0.00] -> #2 (&rq->rq_lock_key){++..}:
[0.00][] __lock_acquire+0xb91/0xd80
[0.00][] lock_acquire+0x9d/0xbc
[0.00][] task_rq_lock+0x7f/0xb8
[0.00][] _spin_lock+0x2a/0x36
[0.00][] task_rq_lock+0x7f/0xb8
[0.00][] try_to_wake_up+0x29/0x306
[0.00][] default_wake_function+0x12/0x14
[0.00][] __wake_up_common+0x4b/0x7a
[0.00][] complete+0x3d/0x51
[0.00][] migration_thread+0x0/0x22b
[0.00][] kthread+0x2c/0x7c
[0.00][] child_rip+0xa/0x12
[0.00][] restore_args+0x0/0x30
[0.00][] kthread+0x0/0x7c
[0.00][] child_rip+0x0/0x12
[0.00][] 0x
[0.00]
[0.00] -> #1 ((spinlock_t *)&q->lock){++..}:
[0.00][] __lock_acquire+0xb91/0xd80
[0.00][] lock_acquire+0x9d/0xbc
[0.00][] __wake_up_sync+0x23/0x53
[0.00][] _spin_lock_irqsave+0x69/0x79
[0.00][] __wake_up_sync+0x23/0x53
[0.00][] do_notify_parent+0x1ea/0x207
[0.00][] kmem_cache_free+0xc6/0xcf
[0.00][] _raw_write_lock+0xe/0x90
[0.00][] do_exit+0x5fd/0x7e0
[0.00][] do_exit+0x707/0x7e0
[0.00][] __call_usermodehelper+0x0/0x61
[0.00][] request_module+0x0/0x166
[0.00][] child_rip+0xa/0x12
[0.00][] restore_args+0x0/0x30
[0.00][] call_usermodehelper+0x0/0x183
[0.00][] child_rip+0x0/0x12
[0.00][] 0x
[0.00]
[0.00] -> #0 ((spinlock_t *)&sighand->siglock){}:
[0.00][] print_circular_bug_entry+0x4d/0x54
[0.00][] __lock_acquire+0xa93/0xd80
[0.00][] lock_acquire+0x9d/0xbc
[0.00][] schedule_event+0x7f/0x578
[0.00][] _spin_lock+0x2a/0x36
[0.00][] schedule_event+0x7f/0x578
[0.00][] __ipipe_dispatch_event+0xe4/0x1db
[0.00][] schedule+0x5a8/0x7ff
[0.00][] do_wait+0xb5e/0xc4c
[0.00][] _raw_read_unlock+0xe/0x2d
[0.00][] do_wait+0xb8d/0xc4c
[0.00][] default_wake_function+0x0/0x14
[0.00][] mcount+0x4c/0x72
[0.00][] sys_wait4+0x2d/0x2f
[0.00][] system_call+0x92/0x97
[0.00][] 0x
[0.00]
[0.00] other info that might help us debug this:
[0.00]
[0.00] 1 lock held by gdb/4385:
[0.00]  #0:  (&rq->rq_lock_key){++..}, at: [] 
schedule+0x176/0x7ff
[0.00]
[0.00] stack backtrace:
[0.00] Pid: 4385, comm: gdb Not tainted 2.6.24.6-xeno_64 #313
[0.00]
[0.00] Call Trace:
[0.00]  [] print_circular_bug_tail+0x75/0x80
[0.00]  [] print_circular_bug_entry+0x4d/0x54
[0.00]  [] __lock_acquire+0xa93/0xd80
[0.00]  [] lock_acquire+0x9d/0xbc
[0.00]  [] schedule_event+0x7f/0x578
[0.00]  [] _spin_lock+0x2a/0x36
[0.00]  [] schedule_event+0x7f/0x578
[0.00]  [] __ipipe_dispatch_event+0xe4/0x1db
[0.00]  [] schedule+0x5a8/0x7ff
[0.00]  [] do_wait+0xb5e/0xc4c
[0.00]  [] _raw_read_unlock+0xe/0x2d
[0.00]  [] do_wait+0xb8d/0xc4c
[0.00]  [] default_wake_function+0x0/0x14
[0.00]  [] mcount+0x4c/0x72
[0.00]  [] sys_wait4+0x2d/0x2f
[0.00]  [] system_call+0x92/0x97
[0.00]


My quick translation is that we must not send signals from the
schedule_event callback, at least as that hook is c

[Xenomai-core] Houston, we have a circular problem

2008-05-05 Thread Jan Kiszka
Hi,

after hacking away the barriers I-pipe erected in front of lockdep
(patches will follow on adeos-main), I was finally able to "visualize" a
bit more what our colleagues see in reality on SMP: some ugly, not yet
understood circular dependency when running some Xenomai app under gdb.
What lockdep tries to tell us remains unclear, unfortunately:

[  874.356703]
[  874.356957] ===

Here it hangs because of this (caught via QEMU):

(gdb) bt
#0  __delay (loops=1) at arch/x86/lib/delay_64.c:34
#1  0x80372712 in _raw_spin_lock (lock=0x81000232c6c0) at 
lib/spinlock_debug.c:111
#2  0x80479d3d in _spin_lock (lock=0x81000232c6c0) at 
kernel/spinlock.c:182
#3  0x8022e546 in task_rq_lock (p=0x81002e792000, 
flags=0x81002f487910) at kernel/sched.c:615
#4  0x8022e6b6 in try_to_wake_up (p=0x1, state=, 
sync=341) at kernel/sched.c:1562
#5  0x8022e9a5 in default_wake_function (curr=, 
mode=0, sync=341, key=0xf48791c8) at kernel/sched.c:3840
#6  0x8024ae51 in autoremove_wake_function (wait=0x1, mode=0, sync=341, 
key=0xf48791c8) at kernel/wait.c:132
#7  0x8022bdc7 in __wake_up_common (q=, mode=1, 
nr_exclusive=1, sync=0, key=0x0) at kernel/sched.c:3861
#8  0x8022df43 in __wake_up (q=0x805a6240, mode=1, 
nr_exclusive=1, key=0x0) at kernel/sched.c:3880
#9  0x80235838 in wake_up_klogd () at kernel/printk.c:1013
#10 0x80235a30 in release_console_sem () at kernel/printk.c:1059
#11 0x802360be in vprintk (fmt=0x12 , 
args=0x81002f487a72) at kernel/printk.c:807
#12 0x802361e5 in printk (fmt=0x8054c0fd "\n", '=' , "\n") at kernel/printk.c:664
#13 0x80256268 in print_circular_bug_header (entry=0x809cb8c0, 
depth=2) at kernel/lockdep.c:902
#14 0x80256f84 in check_noncircular (source=, 
depth=1) at kernel/lockdep.c:973
#15 0x80256f8e in check_noncircular (source=, 
depth=0) at kernel/lockdep.c:975
#16 0x80257a45 in __lock_acquire (lock=0x81002e9ff960, subclass=0, 
trylock=0, read=0, check=2, hardirqs_off=1, ip=18446744071564715933) at 
kernel/lockdep.c:1324
#17 0x80258500 in lock_acquire (lock=0x1, subclass=0, 
trylock=-2140427232, read=-192441912, check=2, ip=) at 
kernel/lockdep.c:2703
#18 0x80479d35 in _spin_lock (lock=0x81002e9ff948) at 
kernel/spinlock.c:181
#19 0x8028679d in schedule_event (event=, ipd=0x0, 
data=0x81002e198000) at kernel/xenomai/nucleus/shadow.c:2197
#20 0x80274bfc in __ipipe_dispatch_event (event=33, 
data=0x81002e198000) at kernel/ipipe/core.c:828
#21 0x80477637 in schedule () at kernel/sched.c:1897
#22 0x80247598 in worker_thread (__cwq=) at 
kernel/workqueue.c:314
#23 0x8024ad16 in kthread (_create=) at 
kernel/kthread.c:78
#24 0x8020d238 in child_rip ()
#25 0x in ?? ()

The lock in question should be task->sighand->siglock, but as we hit the
bug inside the scheduler, printk deadlocks now :(. Need to dig out some
patch of Steven Rostedt (IIRC) that may overcome the second deadlock.

But maybe someone already hears a bell ringing. Would be highly
appreciated as gdb is effectively unusable here.

Thanks,
Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux
