Re: [Xenomai-core] Summary: Xenomai 2.3.2 and 2.4 lock-ups and OOPSes

2007-09-13 Thread Philippe Gerum
On Thu, 2007-09-13 at 10:06 +0200, Peter Soetens wrote:
> >
> > Please point me at the actual Orocos test code that breaks, with the
> > hope to get a fairly standalone test case from it; if you do have a
> > standalone test case already, this would be even better. I intend to
> > address this issue asap.
> 
> I stripped the Xenomai OS layer out of Orocos and built a testcase
> which causes a complete lockup. In order to avoid the lock-up, comment
> out some of the thread/mutex/semaphore creations. All these classes
> call the functions from fosi.c and fosi_internal_join.cpp, so the
> testcase could be reduced to calling only these functions...
> It seems you need to construct a number of threads and mutexes in the
> same application before it happens. With only 'little' use of these
> primitives, our applications run fine 1000s of times in a row.
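A minimal sketch of that kind of testcase, for reference (assumed
native-skin API; names and counts are invented here, this is not the
actual trigger-bug.cpp):

#include <stdio.h>
#include <sys/mman.h>
#include <native/task.h>
#include <native/mutex.h>
#include <native/sem.h>

#define NOBJS 16    /* enough objects to exercise creation/deletion */

static void worker(void *arg)
{
        RT_SEM *sem = (RT_SEM *)arg;
        rt_sem_p(sem, TM_INFINITE);     /* block until main releases us */
}

int main(void)
{
        RT_TASK task[NOBJS];
        RT_MUTEX mutex[NOBJS];
        RT_SEM sem[NOBJS];
        char name[16];
        int i;

        mlockall(MCL_CURRENT | MCL_FUTURE);

        for (i = 0; i < NOBJS; i++) {
                snprintf(name, sizeof(name), "sem%d", i);
                rt_sem_create(&sem[i], name, 0, S_FIFO);
                snprintf(name, sizeof(name), "mtx%d", i);
                rt_mutex_create(&mutex[i], name);
                snprintf(name, sizeof(name), "task%d", i);
                rt_task_create(&task[i], name, 0, 50, T_JOINABLE);
                rt_task_start(&task[i], &worker, &sem[i]);
        }

        for (i = 0; i < NOBJS; i++) {
                rt_sem_v(&sem[i]);          /* let the worker terminate */
                rt_task_join(&task[i]);     /* reap it */
                rt_mutex_delete(&mutex[i]);
                rt_sem_delete(&sem[i]);
        }

        return 0;
}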

Ok. I can confirm that something tragically buggy is happening at
nucleus level when running this testcase. So far, I have already
triggered the -ENOMEM issue, and a nifty crash during task creation.
I'll dig into this, thanks.

-- 
Philippe.





Re: [Xenomai-core] Summary: Xenomai 2.3.2 and 2.4 lock-ups and OOPSes

2007-09-13 Thread Peter Soetens
Quoting Peter Soetens <[EMAIL PROTECTED]>:
> I stripped the Xenomai OS layer out of Orocos and built a testcase
> which causes a complete lockup. In order to avoid the lock-up, comment
> out some of the thread/mutex/semaphore creations.

That's in trigger-bug.cpp; use 'make' to build the executable.
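For reference, a native-skin testcase of this kind typically builds with
something along these lines (assumed command line, not the actual
Makefile shipped with the testcase):

g++ $(xeno-config --xeno-cflags) -o trigger-bug trigger-bug.cpp \
    $(xeno-config --xeno-ldflags) -lnative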

Peter

-- 
www.fmtc.be




Re: [Xenomai-core] Summary: Xenomai 2.3.2 and 2.4 lock-ups and OOPSes

2007-09-07 Thread Gilles Chanteperdrix
On 9/7/07, Gilles Chanteperdrix <[EMAIL PROTECTED]> wrote:
> Philippe Gerum wrote:
>  > On Fri, 2007-09-07 at 11:27 +0200, Peter Soetens wrote:
>  > > Just in case you hooked off the long discussion about the issues we
>  > > found from Xenomai 2.3.2 on:
>  > >
>  > >   o We are using the xeno_native skin, create Xeno tasks and
>  > > semaphores, but have strong indications that the crashes are caused
>  > > by the memory allocation scheme of Xenomai in combination with task
>  > > creation/deletion
>  > >   o We found two ways to break Xenomai, causing a 'Killed'
>  > > (rt_task_delete) and causing an OOPS (rt_task_join).
>  > >   o They happen on 2.6.20 and 2.6.22 kernels
>  > >   o On the 2.3 branch, r2429 works, r2433 causes the faults. The
>  > > patch is small, and in the ChangeLog:
>  > >
>  > > 2007-05-11  Philippe Gerum  <[EMAIL PROTECTED]>
>  > >
>  > > * include/nucleus/heap.h (xnfreesafe): Use xnpod_current_p() when
>  > > checking for deferral.
>  > >
>  > > * include/nucleus/pod.h (xnpod_current_p): Give exec mode
>  > > awareness to this predicate, checking for primary/secondary mode
>  > > of shadows.
>  > >
>  > > 2007-05-11  Gilles Chanteperdrix  <[EMAIL PROTECTED]>
>  > >
>  > > * ksrc/skins: Always defer thread memory release in deletion hook
>  > > by calling xnheap_schedule_free() instead of xnfreesafe().
>  > >
>  > >   o We reverted this patch on HEAD of the 2.3 branch, but got -ENOMEM
>  > > errors during Xenomai resource allocations, indicating that later
>  > > changes depend on this patch. So we use clean HEAD again further on
>  > > to find the causes:
>  > >  o A first test (in Orocos) creates one thread, two semaphores, lets
>  > > it wait on them and cleans up the thread.
>  >
>  > Please point me at the actual Orocos test code that breaks, with the
>  > hope to get a fairly standalone test case from it; if you do have a
>  > standalone test case already, this would be even better. I intend to
>  > address this issue asap.
>
> While waiting for a piece of code that causes the crash, I had a look
> at the code involved. The only suspicious thing I see is that the
> correct working of the native skin's thread termination depends on the
> execution order of the two deletion hooks, the one in task.c and the
> one in syscall.c. As a matter of fact, if the one in task.c is executed
> before the one in syscall.c, the task magic is changed and
> xnshadow_unmap will never be called. I suspect this is true for all
> skins, but I do not know if this could cause a crash.

There are two magics involved, so this supposition is wrong.

-- 
   Gilles Chanteperdrix



Re: [Xenomai-core] Summary: Xenomai 2.3.2 and 2.4 lock-ups and OOPSes

2007-09-07 Thread Gilles Chanteperdrix
Philippe Gerum wrote:
 > On Fri, 2007-09-07 at 11:27 +0200, Peter Soetens wrote:
 > > Just in case you hooked off the long discussion about the issues we found 
 > > from
 > > Xenomai 2.3.2 on:
 > > 
 > >   o We are using the xeno_native skin, create Xeno tasks and semaphores, 
 > > but 
 > > have strong indications that the crashes are caused by the memory 
 > > allocation 
 > > scheme of Xenomai in combination with task creation/deletion
 > >   o We found two ways to break Xenomai, causing a 'Killed' 
 > > (rt_task_delete) 
 > > and causing an OOPS (rt_task_join).
 > >   o They happen on 2.6.20 and 2.6.22 kernels
 > >   o On the 2.3 branch, r2429 works, r2433 causes the faults. The patch is 
 > > small, and in the ChangeLog: 
 > > 
 > > 2007-05-11  Philippe Gerum  <[EMAIL PROTECTED]>
 > > 
 > > * include/nucleus/heap.h (xnfreesafe): Use xnpod_current_p() when
 > > checking for deferral.
 > > 
 > > * include/nucleus/pod.h (xnpod_current_p): Give exec mode
 > > awareness to this predicate, checking for primary/secondary mode
 > > of shadows.
 > > 
 > > 2007-05-11  Gilles Chanteperdrix  <[EMAIL PROTECTED]>
 > > 
 > > * ksrc/skins: Always defer thread memory release in deletion hook
 > > by calling xnheap_schedule_free() instead of xnfreesafe().
 > > 
 > >   o We reverted this patch on HEAD of the 2.3 branch, but got -ENOMEM 
 > > errors 
 > > during Xenomai resource allocations, indicating that later changes depend 
 > > on 
 > > this patch. So we use clean HEAD again further on to find the causes:
 > >  o A first test (in Orocos) creates one thread, two semaphores, lets it 
 > > wait 
 > > on them and cleans up the thread.
 > 
 > Please point me at the actual Orocos test code that breaks, with the
 > hope to get a fairly standalone test case from it; if you do have a
 > standalone test case already, this would be even better. I intend to
 > address this issue asap.

While waiting for a piece of code that causes the crash, I had a look
at the code involved. The only suspicious thing I see is that the
correct working of the native skin's thread termination depends on the
execution order of the two deletion hooks, the one in task.c and the
one in syscall.c. As a matter of fact, if the one in task.c is executed
before the one in syscall.c, the task magic is changed and
xnshadow_unmap will never be called. I suspect this is true for all
skins, but I do not know if this could cause a crash.
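The suspected hazard can be pictured like this (pseudocode with invented
names, not the actual skin sources; note that the reply elsewhere in
this thread points out that two distinct magics are involved, so the
hazard does not actually materialize):

/* Deletion hook from task.c (sketch): */
static void task_delete_hook(xnthread_t *thread)
{
        RT_TASK *task = thread2rtask(thread);
        task->magic = 0;        /* the task magic is changed here... */
        /* ...and the task memory is scheduled for release. */
}

/* Deletion hook from syscall.c (sketch): */
static void syscall_delete_hook(xnthread_t *thread)
{
        /* ...so if this hook ran second and keyed off that same magic
         * to recognize a shadow, the test would fail and
         * xnshadow_unmap() would never be called. */
}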

-- 
Gilles Chanteperdrix.



Re: [Xenomai-core] Summary: Xenomai 2.3.2 and 2.4 lock-ups and OOPSes

2007-09-07 Thread Philippe Gerum
On Fri, 2007-09-07 at 11:27 +0200, Peter Soetens wrote:
> Just in case you hooked off the long discussion about the issues we found from
> Xenomai 2.3.2 on:
> 
>   o We are using the xeno_native skin, create Xeno tasks and semaphores, but 
> have strong indications that the crashes are caused by the memory allocation 
> scheme of Xenomai in combination with task creation/deletion
>   o We found two ways to break Xenomai, causing a 'Killed' (rt_task_delete) 
> and causing an OOPS (rt_task_join).
>   o They happen on 2.6.20 and 2.6.22 kernels
>   o On the 2.3 branch, r2429 works, r2433 causes the faults. The patch is 
> small, and in the ChangeLog: 
> 
> 2007-05-11  Philippe Gerum  <[EMAIL PROTECTED]>
> 
> * include/nucleus/heap.h (xnfreesafe): Use xnpod_current_p() when
> checking for deferral.
> 
> * include/nucleus/pod.h (xnpod_current_p): Give exec mode
> awareness to this predicate, checking for primary/secondary mode
> of shadows.
> 
> 2007-05-11  Gilles Chanteperdrix  <[EMAIL PROTECTED]>
> 
> * ksrc/skins: Always defer thread memory release in deletion hook
> by calling xnheap_schedule_free() instead of xnfreesafe().
> 
>   o We reverted this patch on HEAD of the 2.3 branch, but got -ENOMEM errors 
> during Xenomai resource allocations, indicating that later changes depend on 
> this patch. So we use clean HEAD again further on to find the causes:
>  o A first test (in Orocos) creates one thread, two semaphores, lets it wait 
> on them and cleans up the thread.
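For context, the deferral logic these ChangeLog entries refer to works
roughly as follows (paraphrased from the entry wording, not the actual
nucleus source; kheap is the nucleus system heap):

static inline void xnfreesafe(xnthread_t *thread, void *object,
                              void **link)
{
        if (xnpod_current_p(thread))
                /* The thread is deleting itself: its TCB is still in
                 * use, so queue the block for deferred release. */
                xnheap_schedule_free(&kheap, object, link);
        else
                /* Deleting another thread: release immediately. */
                xnheap_free(&kheap, object);
}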

Please point me at the actual Orocos test code that breaks, with the
hope to get a fairly standalone test case from it; if you do have a
standalone test case already, this would be even better. I intend to
address this issue asap.

>  o During rt_task_delete, our program gets 'Killed' (without a
> joinable thread), hence a user-space problem. However, gdb is of no
> use; all thread info is lost.

SIGKILL is sent from the nucleus upon a call to rt_task_delete() which
targets a non-current task, in order to make sure this user-space task
will go away from a Linux context, since we don't want the kernel TCB
Xenomai maintains for it to be wiped out before the mated userland
thread has really exited. IOW, this case boils down to an asynchronous
cancellation, where Linux is asked to kick out the target task first;
Xenomai then catches the event and cleans up the TCB on its side
afterwards.

As per POSIX, a lethal signal sent to a single thread zaps all other
threads belonging to the same process, which explains why your process
dies. We could be a bit smarter by handling this situation using a
hidden exit from a trapped signal handler, I guess.
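To make the case concrete, a minimal sketch (invented example, not taken
from the original report): main() deletes a worker which is not the
current task, so the nucleus makes the worker exit through the SIGKILL
path described above.

#include <unistd.h>
#include <sys/mman.h>
#include <native/task.h>

static RT_TASK worker;

static void worker_fn(void *arg)
{
        rt_task_sleep(1000000000ULL);   /* ~1s in nanosecond timing mode;
                                           deleted while sleeping */
}

int main(void)
{
        mlockall(MCL_CURRENT | MCL_FUTURE);
        rt_task_create(&worker, "worker", 0, 50, 0);
        rt_task_start(&worker, &worker_fn, NULL);
        usleep(100000);             /* let the worker reach its sleep */
        rt_task_delete(&worker);    /* asynchronous cancellation path */
        return 0;
}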

>  o We made the thread joinable (T_JOINABLE), and then joined. This bypassed 
> the Kill on the first run but causes an OOPS the second time the same 
> application is started:
> 
> Oops:  [#1]
> PREEMPT
> CPU:0
> EIP:0060:[]Not tainted VLI
> EFLAGS: 00010002   (2.6.20.9-ipipe-1.8-08 #2)
> EIP is at get_free_range+0x56/0x160 [xeno_nucleus]
> eax: f3a81d01   ebx: 0200   ecx: 0101   edx: fef62b00
> esi: 0101   edi: 0200   ebp: f0f33ec4   esp: f0f33e98
> ds: 007b   es: 007b   ss: 0068
> Process NonPeriodicActi (pid: 3020, ti=f0f32000 task=f7ce61b0 
> task.ti=f0f32000)
> Stack:  0600 fef62b80 f3a81b24 f3a8 fef62ba4 f3a80720 0101
>0600 f0f33f18 f7ce6360 f0f33ee4 fef4a948 fef62b80 f0f33f08 
>0400 f0f33f18 f7ce6360 f0f33f50 ff13e1de 0282 0282 bfab6350
> Call Trace:
>  [] show_trace_log_lvl+0x1f/0x35
>  [] show_stack_log_lvl+0xaa/0xcf
>  [] show_registers+0x1c9/0x392
>  [] die+0x116/0x245
>  [] do_page_fault+0x287/0x61d
>  [] __ipipe_handle_exception+0x63/0x136
>  [] error_code+0x79/0x88
>  [] xnheap_alloc+0x15b/0x17d [xeno_nucleus]

The only explanation, looking at this backtrace, is that the system heap
has been corrupted by the previous exit; likely a side effect of the
deferral.

>  [] __rt_task_create+0xe0/0x171 [xeno_native]
>  [] losyscall_event+0xaf/0x170 [xeno_nucleus]
>  [] __ipipe_dispatch_event+0xc0/0x1da
>  [] __ipipe_syscall_root+0x43/0x10a
>  [] system_call+0x29/0x41
>  ===
> Code: 74 61 85 c0 74 5d c7 45 e0 00 00 00 00 8b 4d e4 8b 49 10 89 4d ec 85 c9 
> 74 38 8b 45 dc 8b 78 0c 89 4d f0 89 ce 89 fb eb 02 89 ce <8b> 09 8d 04 3e 39 
> c1 0f 94 c2 3b 5d d8 0f 92 c0 01 fb 84 c2 75
> EIP: [] get_free_range+0x56/0x160 [xeno_nucleus] SS:ESP 
> 0068:f0f33e98
> [hard lockup]
> 
>   o Our application also mixes the original RT_TASK struct and the
> return value of the rt_task_self() function call when calling rt_
> functions. Switching between the two influences the crashing behaviour
> as well; not further investigated.
> 

This should not make any difference regarding the bug above. Both
methods boil down to returning an abstract handle to the task, which
serves as an index to the in-kernel TCB maintained by Xenomai. In any
case, this handle is fully validated before use.
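For reference, the two ways of designating the task that Peter mentions
(a sketch assuming the native skin user-space API):

#include <sys/mman.h>
#include <native/task.h>

static RT_TASK desc;    /* descriptor filled in by rt_task_create() */

static void body(void *arg)
{
        RT_TASK_INFO info;

        /* Either handle resolves to the same in-kernel TCB: */
        rt_task_inquire(rt_task_self(), &info); /* via rt_task_self() */
        rt_task_inquire(&desc, &info);          /* via original struct */
}

int main(void)
{
        mlockall(MCL_CURRENT | MCL_FUTURE);
        rt_task_create(&desc, "demo", 0, 50, T_JOINABLE);
        rt_task_start(&desc, &body, NULL);
        rt_task_join(&desc);
        return 0;
}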