Re: [Xenomai-core] local_irq_save/local_irq_restore in real-time interrupt handler and slab corruption.

2007-11-26 Thread Gilles Chanteperdrix
On Nov 13, 2007 7:02 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>
> Gilles Chanteperdrix wrote:
> > On Nov 13, 2007 6:44 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
> >> Gilles Chanteperdrix wrote:
> >>> On Nov 13, 2007 6:10 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>  Gilles Chanteperdrix wrote:
> > On Nov 13, 2007 3:17 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
> >> Gilles Chanteperdrix wrote:
> >>> Hi,
> >>>
> >>> I am chasing a slab corruption bug which happens on a Xenomai+RTnet
> >>> enabled box under heavy non real-time network load (which passes
> >>> through rtnet and rtmac_vnic to Linux, which does NAT and resends it to
> >>> another rtmac_vnic). When reading some I-pipe tracer traces, I
> >>> remarked that I forgot to replace a local_irq_save/local_irq_restore
> >>> with local_irq_save_hw/local_irq_restore_hw in a real-time interrupt
> >>> handler. I fixed this bug, and the slab corruption seems to be gone.
> >> Hope you mean rtdm_lock_irqsave/irqrestore instead. Otherwise Xenomai's
> >> domain state would not be updated appropriately - which is at least 
> >> unclean.
> > It is some low level secondary timer handling code, there is no rtdm
> > involved. The code protected by the interrupt masking routines is one
> > or two inline assembly instructions.
> >
> >> BTW, CONFIG_IPIPE_DEBUG_CONTEXT should have caught this bug as well.
> > I am using an old I-pipe patch without CONFIG_IPIPE_DEBUG_CONTEXT.
> > The I-pipe patch and Xenomai update is scheduled for when the porting of
> > RT applications and drivers is finished.
> >
> > Besides, the BUG_ON(!ipipe_root_domain_p) calls in ipipe_restore_root
> > and ipipe_unstall_root are unconditional.
> >
>  What bothers me, is that even looking at the old 1.3 series here and on,
>  the code should exhibit a call chain like
>  local_irq_restore -> raw_local_irq_restore() -> __ipipe_restore_root ->
>  __ipipe_unstall_root -> __ipipe_sync_stage, without touching the current
>  domain pointer, which is ok, since well, it has to be right in the first
>  place. If we were running over a real-time handler, then I assume the
>  Xenomai domain was active. So BUG_ON() should have triggered if present
>  in __ipipe_unstall_root.
> >>> I am using an I-pipe arm 1.5-04 (now that I have done cat
> >>> /proc/ipipe/version, I really feel ashamed). And it has no BUG_ON in
> >>> __ipipe_unstall_root or __ipipe_restore_root. I promise, one day, I
> >>> will switch to Xenomai 2.4.
> >>>
>  Additionally, calling __ipipe_sync_pipeline() would sync the current
>  stage, i.e. Xenomai, and run the real-time ISRs, not the Linux handlers.
> 
>  Mm, ok, in short: I have no clue.
> >>> The system runs stably, so I have to assume that calling
> >>> local_irq_restore in a real-time interrupt handler can cause slab
> >>> corruption. Strange.
> >>>
> >> I guess this is likely not on your critical path, but when time allows,
> >> I'd be interested to know whether such a bug still occurs when using
> >> purely kernel-only tasking, assuming that you currently see this bug
> >> with userland tasks. Basically, I wonder if migrating shadows between
> >> both domains would not reveal the bug, since your real-time handler
> >> starts being preemptible by hw IRQs as soon as it returns from
> >> __ipipe_unstall_root, which forces local_irq_enable_hw().
> >
> > Actually, I had only kernel-only tasking, since in my test I had
> > removed everything and only kept the RTnet drivers and stack and tested
> > Linux routing (my basic goal was to improve the non-real-time traffic
> > rate).
> >
>
> Ah, ok. So maybe the preemption issue? Would the ISR be fine with being
> re-entered, for instance? Any potential trashing in sight? I guess that
> you could check whether this is related to using a local version of
> local_irq_restore in this particular code spot, which would basically do
> what __ipipe_unstall_root does, minus the local_irq_enable_hw() call.

Now that I have spent some time again in arch/arm/kernel/entry*.S, I
think I understand what happened. Because of the call to
local_irq_enable_hw() in ipipe_unstall_root(), and because
ipipe_run_isr calls local_irq_disable_nohead, we exit the irq_handler
with hardware irqs on. And some paths in entry.S assume that hardware
irqs are off. I do not remember why, but doing the kernel/user context
switch with hardware irqs on may result in random register
corruption, which is exactly what causes the slab corruption I
observe.

-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] local_irq_save/local_irq_restore in real-time interrupt handler and slab corruption.

2007-11-13 Thread Gilles Chanteperdrix
On Nov 13, 2007 6:54 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>
> Gilles Chanteperdrix wrote:
> > On Nov 13, 2007 6:45 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
> >> Gilles Chanteperdrix wrote:
> >>> On Nov 13, 2007 6:10 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>  Gilles Chanteperdrix wrote:
> > On Nov 13, 2007 3:17 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
> >> Gilles Chanteperdrix wrote:
> >>> Hi,
> >>>
> >>> I am chasing a slab corruption bug which happens on a Xenomai+RTnet
> >>> enabled box under heavy non real-time network load (which passes
> >>> through rtnet and rtmac_vnic to Linux, which does NAT and resends it to
> >>> another rtmac_vnic). When reading some I-pipe tracer traces, I
> >>> remarked that I forgot to replace a local_irq_save/local_irq_restore
> >>> with local_irq_save_hw/local_irq_restore_hw in a real-time interrupt
> >>> handler. I fixed this bug, and the slab corruption seems to be gone.
> >> Hope you mean rtdm_lock_irqsave/irqrestore instead. Otherwise Xenomai's
> >> domain state would not be updated appropriately - which is at least 
> >> unclean.
> > It is some low level secondary timer handling code, there is no rtdm
> > involved. The code protected by the interrupt masking routines is one
> > or two inline assembly instructions.
> >
> >> BTW, CONFIG_IPIPE_DEBUG_CONTEXT should have caught this bug as well.
> > I am using an old I-pipe patch without CONFIG_IPIPE_DEBUG_CONTEXT.
> > The I-pipe patch and Xenomai update is scheduled for when the porting of
> > RT applications and drivers is finished.
> >
> > Besides, the BUG_ON(!ipipe_root_domain_p) calls in ipipe_restore_root
> > and ipipe_unstall_root are unconditional.
> >
>  What bothers me, is that even looking at the old 1.3 series here and on,
>  the code should exhibit a call chain like
>  local_irq_restore -> raw_local_irq_restore() -> __ipipe_restore_root ->
>  __ipipe_unstall_root -> __ipipe_sync_stage, without touching the current
>  domain pointer, which is ok, since well, it has to be right in the first
>  place. If we were running over a real-time handler, then I assume the
>  Xenomai domain was active. So BUG_ON() should have triggered if present
>  in __ipipe_unstall_root.
> >>> I am using an I-pipe arm 1.5-04 (now that I have done cat
> >>> /proc/ipipe/version, I really feel ashamed). And it has no BUG_ON in
> >>> __ipipe_unstall_root or __ipipe_restore_root. I promise, one day, I
> >>> will switch to Xenomai 2.4.
> >>>
>  Additionally, calling __ipipe_sync_pipeline() would sync the current
>  stage, i.e. Xenomai, and run the real-time ISRs, not the Linux handlers.
> 
>  Mm, ok, in short: I have no clue.
> >>> The system runs stably, so I have to assume that calling
> >>> local_irq_restore in a real-time interrupt handler can cause slab
> >>> corruption. Strange.
> >> What about instrumenting the involved I-pipe code path with
> >> ipipe_trace_specials and then restoring the buggy code. You may even
> >> ipipe_trace_freeze on that spot so that you can watch in pre/post trace
> >> what happens. May help to understand if this was the only issue, or if
> >> we may need some further measures for future versions.
> >
> > I have used the tracer, but the "slab corruption" message triggers a
> > long time after the bug. At least, with a 128K backtrace, I could not
> > find the place where the bug happened.
>
> Well, IF this local_irq fiddling here is supposed to be the reason, you
> should already be able to spot unexpected code paths around its usage.
> That was my idea. If there is nothing, you are probably only pushing
> around some still existing race window.

That is what I fear.


-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] local_irq_save/local_irq_restore in real-time interrupt handler and slab corruption.

2007-11-13 Thread Philippe Gerum
Philippe Gerum wrote:
> Gilles Chanteperdrix wrote:
>> On Nov 13, 2007 6:10 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>>> Gilles Chanteperdrix wrote:
 On Nov 13, 2007 3:17 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
> Gilles Chanteperdrix wrote:
>> Hi,
>>
>> I am chasing a slab corruption bug which happens on a Xenomai+RTnet
>> enabled box under heavy non real-time network load (which passes
>> through rtnet and rtmac_vnic to Linux, which does NAT and resends it to
>> another rtmac_vnic). When reading some I-pipe tracer traces, I
>> remarked that I forgot to replace a local_irq_save/local_irq_restore
>> with local_irq_save_hw/local_irq_restore_hw in a real-time interrupt
>> handler. I fixed this bug, and the slab corruption seems to be gone.
> Hope you mean rtdm_lock_irqsave/irqrestore instead. Otherwise Xenomai's
> domain state would not be updated appropriately - which is at least 
> unclean.
 It is some low level secondary timer handling code, there is no rtdm
 involved. The code protected by the interrupt masking routines is one
 or two inline assembly instructions.

> BTW, CONFIG_IPIPE_DEBUG_CONTEXT should have caught this bug as well.
 I am using an old I-pipe patch without CONFIG_IPIPE_DEBUG_CONTEXT.
 The I-pipe patch and Xenomai update is scheduled for when the porting of
 RT applications and drivers is finished.

 Besides, the BUG_ON(!ipipe_root_domain_p) calls in ipipe_restore_root
 and ipipe_unstall_root are unconditional.

>>> What bothers me, is that even looking at the old 1.3 series here and on,
>>> the code should exhibit a call chain like
>>> local_irq_restore -> raw_local_irq_restore() -> __ipipe_restore_root ->
>>> __ipipe_unstall_root -> __ipipe_sync_stage, without touching the current
>>> domain pointer, which is ok, since well, it has to be right in the first
>>> place. If we were running over a real-time handler, then I assume the
>>> Xenomai domain was active. So BUG_ON() should have triggered if present
>>> in __ipipe_unstall_root.
>> I am using an I-pipe arm 1.5-04 (now that I have done cat
>> /proc/ipipe/version, I really feel ashamed). And it has no BUG_ON in
>> __ipipe_unstall_root or __ipipe_restore_root. I promise, one day, I
>> will switch to Xenomai 2.4.
>>
>>> Additionally, calling __ipipe_sync_pipeline() would sync the current
>>> stage, i.e. Xenomai, and run the real-time ISRs, not the Linux handlers.
>>>
>>> Mm, ok, in short: I have no clue.
>> The system runs stably, so I have to assume that calling
>> local_irq_restore in a real-time interrupt handler can cause slab
>> corruption. Strange.
>>
> 
> I guess this is likely not on your critical path, but when time allows,
> I'd be interested to know whether such a bug still occurs when using
> purely kernel-only tasking, assuming that you currently see this bug
> with userland tasks. Basically, I wonder if migrating shadows between
> both domains would not reveal the bug, since your real-time handler
> starts being preemptible by hw IRQs as soon as it returns from
> __ipipe_unstall_root, which forces local_irq_enable_hw().
> 

Well, the 1.5 series still has a deep log, so you would also have to
make sure that no IRQ is pending in the pipeline.

-- 
Philippe.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] local_irq_save/local_irq_restore in real-time interrupt handler and slab corruption.

2007-11-13 Thread Philippe Gerum
Gilles Chanteperdrix wrote:
> On Nov 13, 2007 6:44 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>> Gilles Chanteperdrix wrote:
>>> On Nov 13, 2007 6:10 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
 Gilles Chanteperdrix wrote:
> On Nov 13, 2007 3:17 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>> Gilles Chanteperdrix wrote:
>>> Hi,
>>>
>>> I am chasing a slab corruption bug which happens on a Xenomai+RTnet
>>> enabled box under heavy non real-time network load (which passes
>>> through rtnet and rtmac_vnic to Linux, which does NAT and resends it to
>>> another rtmac_vnic). When reading some I-pipe tracer traces, I
>>> remarked that I forgot to replace a local_irq_save/local_irq_restore
>>> with local_irq_save_hw/local_irq_restore_hw in a real-time interrupt
>>> handler. I fixed this bug, and the slab corruption seems to be gone.
>> Hope you mean rtdm_lock_irqsave/irqrestore instead. Otherwise Xenomai's
>> domain state would not be updated appropriately - which is at least 
>> unclean.
> It is some low level secondary timer handling code, there is no rtdm
> involved. The code protected by the interrupt masking routines is one
> or two inline assembly instructions.
>
>> BTW, CONFIG_IPIPE_DEBUG_CONTEXT should have caught this bug as well.
> I am using an old I-pipe patch without CONFIG_IPIPE_DEBUG_CONTEXT.
> The I-pipe patch and Xenomai update is scheduled for when the porting of
> RT applications and drivers is finished.
>
> Besides, the BUG_ON(!ipipe_root_domain_p) calls in ipipe_restore_root
> and ipipe_unstall_root are unconditional.
>
 What bothers me, is that even looking at the old 1.3 series here and on,
 the code should exhibit a call chain like
 local_irq_restore -> raw_local_irq_restore() -> __ipipe_restore_root ->
 __ipipe_unstall_root -> __ipipe_sync_stage, without touching the current
 domain pointer, which is ok, since well, it has to be right in the first
 place. If we were running over a real-time handler, then I assume the
 Xenomai domain was active. So BUG_ON() should have triggered if present
 in __ipipe_unstall_root.
>>> I am using an I-pipe arm 1.5-04 (now that I have done cat
>>> /proc/ipipe/version, I really feel ashamed). And it has no BUG_ON in
>>> __ipipe_unstall_root or __ipipe_restore_root. I promise, one day, I
>>> will switch to Xenomai 2.4.
>>>
 Additionally, calling __ipipe_sync_pipeline() would sync the current
 stage, i.e. Xenomai, and run the real-time ISRs, not the Linux handlers.

 Mm, ok, in short: I have no clue.
>>> The system runs stably, so I have to assume that calling
>>> local_irq_restore in a real-time interrupt handler can cause slab
>>> corruption. Strange.
>>>
>> I guess this is likely not on your critical path, but when time allows,
>> I'd be interested to know whether such a bug still occurs when using
>> purely kernel-only tasking, assuming that you currently see this bug
>> with userland tasks. Basically, I wonder if migrating shadows between
>> both domains would not reveal the bug, since your real-time handler
>> starts being preemptible by hw IRQs as soon as it returns from
>> __ipipe_unstall_root, which forces local_irq_enable_hw().
> 
> Actually, I had only kernel-only tasking, since in my test I had
> removed everything and only kept the RTnet drivers and stack and tested
> Linux routing (my basic goal was to improve the non-real-time traffic
> rate).
> 

Ah, ok. So maybe the preemption issue? Would the ISR be fine with being
re-entered, for instance? Any potential trashing in sight? I guess that
you could check whether this is related to using a local version of
local_irq_restore in this particular code spot, which would basically do
what __ipipe_unstall_root does, minus the local_irq_enable_hw() call.
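
I.e. something in this vein (untested, made-up helper name, field names
from memory -- check them against your ipipe version):

    /* Do what __ipipe_unstall_root() does, minus the local_irq_enable_hw()
     * tail (and, for simplicity, the pipeline sync), so we can tell whether
     * re-enabling hw IRQs from the RT handler is what actually hurts. */
    static inline void test_restore_root_nohw(unsigned long flags)
    {
            if (!flags)
                    __clear_bit(IPIPE_STALL_FLAG,
                                &ipipe_root_domain->cpudata[smp_processor_id()].status);
    }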

-- 
Philippe.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] local_irq_save/local_irq_restore in real-time interrupt handler and slab corruption.

2007-11-13 Thread Jan Kiszka
Gilles Chanteperdrix wrote:
> On Nov 13, 2007 6:45 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>> Gilles Chanteperdrix wrote:
>>> On Nov 13, 2007 6:10 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
 Gilles Chanteperdrix wrote:
> On Nov 13, 2007 3:17 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>> Gilles Chanteperdrix wrote:
>>> Hi,
>>>
>>> I am chasing a slab corruption bug which happens on a Xenomai+RTnet
>>> enabled box under heavy non real-time network load (which passes
>>> through rtnet and rtmac_vnic to Linux, which does NAT and resends it to
>>> another rtmac_vnic). When reading some I-pipe tracer traces, I
>>> remarked that I forgot to replace a local_irq_save/local_irq_restore
>>> with local_irq_save_hw/local_irq_restore_hw in a real-time interrupt
>>> handler. I fixed this bug, and the slab corruption seems to be gone.
>> Hope you mean rtdm_lock_irqsave/irqrestore instead. Otherwise Xenomai's
>> domain state would not be updated appropriately - which is at least 
>> unclean.
> It is some low level secondary timer handling code, there is no rtdm
> involved. The code protected by the interrupt masking routines is one
> or two inline assembly instructions.
>
>> BTW, CONFIG_IPIPE_DEBUG_CONTEXT should have caught this bug as well.
> I am using an old I-pipe patch without CONFIG_IPIPE_DEBUG_CONTEXT.
> The I-pipe patch and Xenomai update is scheduled for when the porting of
> RT applications and drivers is finished.
>
> Besides, the BUG_ON(!ipipe_root_domain_p) calls in ipipe_restore_root
> and ipipe_unstall_root are unconditional.
>
 What bothers me, is that even looking at the old 1.3 series here and on,
 the code should exhibit a call chain like
 local_irq_restore -> raw_local_irq_restore() -> __ipipe_restore_root ->
 __ipipe_unstall_root -> __ipipe_sync_stage, without touching the current
 domain pointer, which is ok, since well, it has to be right in the first
 place. If we were running over a real-time handler, then I assume the
 Xenomai domain was active. So BUG_ON() should have triggered if present
 in __ipipe_unstall_root.
>>> I am using an I-pipe arm 1.5-04 (now that I have done cat
>>> /proc/ipipe/version, I really feel ashamed). And it has no BUG_ON in
>>> __ipipe_unstall_root or __ipipe_restore_root. I promise, one day, I
>>> will switch to Xenomai 2.4.
>>>
 Additionally, calling __ipipe_sync_pipeline() would sync the current
 stage, i.e. Xenomai, and run the real-time ISRs, not the Linux handlers.

 Mm, ok, in short: I have no clue.
>>> The system runs stably, so I have to assume that calling
>>> local_irq_restore in a real-time interrupt handler can cause slab
>>> corruption. Strange.
>> What about instrumenting the involved I-pipe code path with
>> ipipe_trace_specials and then restoring the buggy code. You may even
>> ipipe_trace_freeze on that spot so that you can watch in pre/post trace
>> what happens. May help to understand if this was the only issue, or if
>> we may need some further measures for future versions.
> 
> I have used the tracer, but the "slab corruption" message triggers a
> long time after the bug. At least, with a 128K backtrace, I could not
> find the place where the bug happened.

Well, IF this local_irq fiddling here is supposed to be the reason, you
should already be able to spot unexpected code paths around its usage.
That was my idea. If there is nothing, you are probably only pushing
around some still existing race window.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] local_irq_save/local_irq_restore in real-time interrupt handler and slab corruption.

2007-11-13 Thread Gilles Chanteperdrix
On Nov 13, 2007 6:44 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>
> Gilles Chanteperdrix wrote:
> > On Nov 13, 2007 6:10 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
> >> Gilles Chanteperdrix wrote:
> >>> On Nov 13, 2007 3:17 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>  Gilles Chanteperdrix wrote:
> > Hi,
> >
> > I am chasing a slab corruption bug which happens on a Xenomai+RTnet
> > enabled box under heavy non real-time network load (which passes
> > through rtnet and rtmac_vnic to Linux, which does NAT and resends it to
> > another rtmac_vnic). When reading some I-pipe tracer traces, I
> > remarked that I forgot to replace a local_irq_save/local_irq_restore
> > with local_irq_save_hw/local_irq_restore_hw in a real-time interrupt
> > handler. I fixed this bug, and the slab corruption seems to be gone.
>  Hope you mean rtdm_lock_irqsave/irqrestore instead. Otherwise Xenomai's
>  domain state would not be updated appropriately - which is at least 
>  unclean.
> >>> It is some low level secondary timer handling code, there is no rtdm
> >>> involved. The code protected by the interrupt masking routines is one
> >>> or two inline assembly instructions.
> >>>
>  BTW, CONFIG_IPIPE_DEBUG_CONTEXT should have caught this bug as well.
> >>> I am using an old I-pipe patch without CONFIG_IPIPE_DEBUG_CONTEXT.
> >>> The I-pipe patch and Xenomai update is scheduled for when the porting of
> >>> RT applications and drivers is finished.
> >>>
> >>> Besides, the BUG_ON(!ipipe_root_domain_p) calls in ipipe_restore_root
> >>> and ipipe_unstall_root are unconditional.
> >>>
> >> What bothers me, is that even looking at the old 1.3 series here and on,
> >> the code should exhibit a call chain like
> >> local_irq_restore -> raw_local_irq_restore() -> __ipipe_restore_root ->
> >> __ipipe_unstall_root -> __ipipe_sync_stage, without touching the current
> >> domain pointer, which is ok, since well, it has to be right in the first
> >> place. If we were running over a real-time handler, then I assume the
> >> Xenomai domain was active. So BUG_ON() should have triggered if present
> >> in __ipipe_unstall_root.
> >
> > I am using an I-pipe arm 1.5-04 (now that I have done cat
> > /proc/ipipe/version, I really feel ashamed). And it has no BUG_ON in
> > __ipipe_unstall_root or __ipipe_restore_root. I promise, one day, I
> > will switch to Xenomai 2.4.
> >
> >> Additionally, calling __ipipe_sync_pipeline() would sync the current
> >> stage, i.e. Xenomai, and run the real-time ISRs, not the Linux handlers.
> >>
> >> Mm, ok, in short: I have no clue.
> >
> > The system runs stably, so I have to assume that calling
> > local_irq_restore in a real-time interrupt handler can cause slab
> > corruption. Strange.
> >
>
> I guess this is likely not on your critical path, but when time allows,
> I'd be interested to know whether such a bug still occurs when using
> purely kernel-only tasking, assuming that you currently see this bug
> with userland tasks. Basically, I wonder if migrating shadows between
> both domains would not reveal the bug, since your real-time handler
> starts being preemptible by hw IRQs as soon as it returns from
> __ipipe_unstall_root, which forces local_irq_enable_hw().

Actually, I had only kernel-only tasking, since in my test I had
removed everything and only kept the RTnet drivers and stack and tested
Linux routing (my basic goal was to improve the non-real-time traffic
rate).

-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] local_irq_save/local_irq_restore in real-time interrupt handler and slab corruption.

2007-11-13 Thread Gilles Chanteperdrix
On Nov 13, 2007 6:45 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>
> Gilles Chanteperdrix wrote:
> > On Nov 13, 2007 6:10 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
> >> Gilles Chanteperdrix wrote:
> >>> On Nov 13, 2007 3:17 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>  Gilles Chanteperdrix wrote:
> > Hi,
> >
> > I am chasing a slab corruption bug which happens on a Xenomai+RTnet
> > enabled box under heavy non real-time network load (which passes
> > through rtnet and rtmac_vnic to Linux, which does NAT and resends it to
> > another rtmac_vnic). When reading some I-pipe tracer traces, I
> > remarked that I forgot to replace a local_irq_save/local_irq_restore
> > with local_irq_save_hw/local_irq_restore_hw in a real-time interrupt
> > handler. I fixed this bug, and the slab corruption seems to be gone.
>  Hope you mean rtdm_lock_irqsave/irqrestore instead. Otherwise Xenomai's
>  domain state would not be updated appropriately - which is at least 
>  unclean.
> >>> It is some low level secondary timer handling code, there is no rtdm
> >>> involved. The code protected by the interrupt masking routines is one
> >>> or two inline assembly instructions.
> >>>
>  BTW, CONFIG_IPIPE_DEBUG_CONTEXT should have caught this bug as well.
> >>> I am using an old I-pipe patch without CONFIG_IPIPE_DEBUG_CONTEXT.
> >>> The I-pipe patch and Xenomai update is scheduled for when the porting of
> >>> RT applications and drivers is finished.
> >>>
> >>> Besides, the BUG_ON(!ipipe_root_domain_p) calls in ipipe_restore_root
> >>> and ipipe_unstall_root are unconditional.
> >>>
> >> What bothers me, is that even looking at the old 1.3 series here and on,
> >> the code should exhibit a call chain like
> >> local_irq_restore -> raw_local_irq_restore() -> __ipipe_restore_root ->
> >> __ipipe_unstall_root -> __ipipe_sync_stage, without touching the current
> >> domain pointer, which is ok, since well, it has to be right in the first
> >> place. If we were running over a real-time handler, then I assume the
> >> Xenomai domain was active. So BUG_ON() should have triggered if present
> >> in __ipipe_unstall_root.
> >
> > I am using an I-pipe arm 1.5-04 (now that I have done cat
> > /proc/ipipe/version, I really feel ashamed). And it has no BUG_ON in
> > __ipipe_unstall_root or __ipipe_restore_root. I promise, one day, I
> > will switch to Xenomai 2.4.
> >
> >> Additionally, calling __ipipe_sync_pipeline() would sync the current
> >> stage, i.e. Xenomai, and run the real-time ISRs, not the Linux handlers.
> >>
> >> Mm, ok, in short: I have no clue.
> >
> > The system runs stably, so I have to assume that calling
> > local_irq_restore in a real-time interrupt handler can cause slab
> > corruption. Strange.
>
> What about instrumenting the involved I-pipe code path with
> ipipe_trace_specials and then restoring the buggy code. You may even
> ipipe_trace_freeze on that spot so that you can watch in pre/post trace
> what happens. May help to understand if this was the only issue, or if
> we may need some further measures for future versions.

I have used the tracer, but the "slab corruption" message triggers a
long time after the bug. At least, with a 128K backtrace, I could not
find the place where the bug happened.

-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] local_irq_save/local_irq_restore in real-time interrupt handler and slab corruption.

2007-11-13 Thread Jan Kiszka
Gilles Chanteperdrix wrote:
> On Nov 13, 2007 6:10 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>> Gilles Chanteperdrix wrote:
>>> On Nov 13, 2007 3:17 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
 Gilles Chanteperdrix wrote:
> Hi,
>
> I am chasing a slab corruption bug which happens on a Xenomai+RTnet
> enabled box under heavy non real-time network load (which passes
> through rtnet and rtmac_vnic to Linux, which does NAT and resends it to
> another rtmac_vnic). When reading some I-pipe tracer traces, I
> remarked that I forgot to replace a local_irq_save/local_irq_restore
> with local_irq_save_hw/local_irq_restore_hw in a real-time interrupt
> handler. I fixed this bug, and the slab corruption seems to be gone.
 Hope you mean rtdm_lock_irqsave/irqrestore instead. Otherwise Xenomai's
 domain state would not be updated appropriately - which is at least 
 unclean.
>>> It is some low level secondary timer handling code, there is no rtdm
>>> involved. The code protected by the interrupt masking routines is one
>>> or two inline assembly instructions.
>>>
 BTW, CONFIG_IPIPE_DEBUG_CONTEXT should have caught this bug as well.
>>> I am using an old I-pipe patch without CONFIG_IPIPE_DEBUG_CONTEXT.
>>> The I-pipe patch and Xenomai update is scheduled for when the porting of
>>> RT applications and drivers is finished.
>>>
>>> Besides, the BUG_ON(!ipipe_root_domain_p) calls in ipipe_restore_root
>>> and ipipe_unstall_root are unconditional.
>>>
>> What bothers me, is that even looking at the old 1.3 series here and on,
>> the code should exhibit a call chain like
>> local_irq_restore -> raw_local_irq_restore() -> __ipipe_restore_root ->
>> __ipipe_unstall_root -> __ipipe_sync_stage, without touching the current
>> domain pointer, which is ok, since well, it has to be right in the first
>> place. If we were running over a real-time handler, then I assume the
>> Xenomai domain was active. So BUG_ON() should have triggered if present
>> in __ipipe_unstall_root.
> 
> I am using an I-pipe arm 1.5-04 (now that I have done cat
> /proc/ipipe/version, I really feel ashamed). And it has no BUG_ON in
> __ipipe_unstall_root or __ipipe_restore_root. I promise, one day, I
> will switch to Xenomai 2.4.
> 
>> Additionally, calling __ipipe_sync_pipeline() would sync the current
>> stage, i.e. Xenomai, and run the real-time ISRs, not the Linux handlers.
>>
>> Mm, ok, in short: I have no clue.
> 
> The system runs stably, so I have to assume that calling
> local_irq_restore in a real-time interrupt handler can cause slab
> corruption. Strange.

What about instrumenting the involved I-pipe code path with
ipipe_trace_specials and then restoring the buggy code. You may even
ipipe_trace_freeze on that spot so that you can watch in pre/post trace
what happens. May help to understand if this was the only issue, or if
we may need some further measures for future versions.
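
E.g. something like this (sketch; requires the tracer to be enabled in
the kernel, the id/value arguments are arbitrary):

    #include <linux/ipipe_trace.h>

    /* drop markers along the suspected path, e.g. in __ipipe_unstall_root() */
    ipipe_trace_special(0x10, __LINE__);

    /* and freeze the trace right at the interesting spot, so the pre/post
     * trace around that point can be read back afterwards */
    ipipe_trace_freeze(__LINE__);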

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] local_irq_save/local_irq_restore in real-time interrupt handler and slab corruption.

2007-11-13 Thread Philippe Gerum
Gilles Chanteperdrix wrote:
> On Nov 13, 2007 6:10 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>> Gilles Chanteperdrix wrote:
>>> On Nov 13, 2007 3:17 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
 Gilles Chanteperdrix wrote:
> Hi,
>
> I am chasing a slab corruption bug which happens on a Xenomai+RTnet
> enabled box under heavy non real-time network load (which passes
> through rtnet and rtmac_vnic to Linux, which does NAT and resends it to
> another rtmac_vnic). When reading some I-pipe tracer traces, I
> remarked that I forgot to replace a local_irq_save/local_irq_restore
> with local_irq_save_hw/local_irq_restore_hw in a real-time interrupt
> handler. I fixed this bug, and the slab corruption seems to be gone.
 Hope you mean rtdm_lock_irqsave/irqrestore instead. Otherwise Xenomai's
 domain state would not be updated appropriately - which is at least 
 unclean.
>>> It is some low level secondary timer handling code, there is no rtdm
>>> involved. The code protected by the interrupt masking routines is one
>>> or two inline assembly instructions.
>>>
 BTW, CONFIG_IPIPE_DEBUG_CONTEXT should have caught this bug as well.
>>> I am using an old I-pipe patch without CONFIG_IPIPE_DEBUG_CONTEXT.
>>> The I-pipe patch and Xenomai update is scheduled for when the porting of
>>> RT applications and drivers is finished.
>>>
>>> Besides, the BUG_ON(!ipipe_root_domain_p) calls in ipipe_restore_root
>>> and ipipe_unstall_root are unconditional.
>>>
>> What bothers me, is that even looking at the old 1.3 series here and on,
>> the code should exhibit a call chain like
>> local_irq_restore -> raw_local_irq_restore() -> __ipipe_restore_root ->
>> __ipipe_unstall_root -> __ipipe_sync_stage, without touching the current
>> domain pointer, which is ok, since well, it has to be right in the first
>> place. If we were running over a real-time handler, then I assume the
>> Xenomai domain was active. So BUG_ON() should have triggered if present
>> in __ipipe_unstall_root.
> 
> I am using an I-pipe arm 1.5-04 (now that I have done cat
> /proc/ipipe/version, I really feel ashamed). And it has no BUG_ON in
> __ipipe_unstall_root or __ipipe_restore_root. I promise, one day, I
> will switch to Xenomai 2.4.
> 
>> Additionally, calling __ipipe_sync_pipeline() would sync the current
>> stage, i.e. Xenomai, and run the real-time ISRs, not the Linux handlers.
>>
>> Mm, ok, in short: I have no clue.
> 
> The system runs stably, so I have to assume that calling
> local_irq_restore in a real-time interrupt handler can cause slab
> corruption. Strange.
> 

I guess this is likely not on your critical path, but when time allows,
I'd be interested to know whether such a bug still occurs when using
purely kernel-only tasking, assuming that you currently see this bug
with userland tasks. Basically, I wonder if migrating shadows between
both domains would not reveal the bug, since your real-time handler
starts being preemptible by hw IRQs as soon as it returns from
__ipipe_unstall_root, which forces local_irq_enable_hw().

-- 
Philippe.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] local_irq_save/local_irq_restore in real-time interrupt handler and slab corruption.

2007-11-13 Thread Gilles Chanteperdrix
On Nov 13, 2007 6:10 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
> Gilles Chanteperdrix wrote:
> > On Nov 13, 2007 3:17 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
> >> Gilles Chanteperdrix wrote:
> >>> Hi,
> >>>
> >>> I am chasing a slab corruption bug which happens on a Xenomai+RTnet
> >>> enabled box under heavy non real-time network load (which passes
> >>> through rtnet and rtmac_vnic to Linux, which does NAT and resends it to
> >>> another rtmac_vnic). When reading some I-pipe tracer traces, I
> >>> remarked that I forgot to replace a local_irq_save/local_irq_restore
> >>> with local_irq_save_hw/local_irq_restore_hw in a real-time interrupt
> >>> handler. I fixed this bug, and the slab corruption seems to be gone.
> >> Hope you mean rtdm_lock_irqsave/irqrestore instead. Otherwise Xenomai's
> >> domain state would not be updated appropriately - which is at least 
> >> unclean.
> >
> > It is some low level secondary timer handling code, there is no rtdm
> > involved. The code protected by the interrupt masking routines is one
> > or two inline assembly instructions.
> >
> >> BTW, CONFIG_IPIPE_DEBUG_CONTEXT should have caught this bug as well.
> >
> > I am using an old I-pipe patch without CONFIG_IPIPE_DEBUG_CONTEXT.
> > The I-pipe patch and Xenomai update is scheduled for when the porting of
> > RT applications and drivers is finished.
> >
> > Besides, the BUG_ON(!ipipe_root_domain_p) calls in ipipe_restore_root
> > and ipipe_unstall_root are unconditional.
> >
>
> What bothers me, is that even looking at the old 1.3 series here and on,
> the code should exhibit a call chain like
> local_irq_restore -> raw_local_irq_restore() -> __ipipe_restore_root ->
> __ipipe_unstall_root -> __ipipe_sync_stage, without touching the current
> domain pointer, which is ok, since well, it has to be right in the first
> place. If we were running over a real-time handler, then I assume the
> Xenomai domain was active. So BUG_ON() should have triggered if present
> in __ipipe_unstall_root.

I am using an I-pipe arm 1.5-04 (now that I have done cat
/proc/ipipe/version, I really feel ashamed). And it has no BUG_ON in
__ipipe_unstall_root or __ipipe_restore_root. I promise, one day, I
will switch to Xenomai 2.4.

>
> Additionally, calling __ipipe_sync_pipeline() would sync the current
> stage, i.e. Xenomai, and run the real-time ISRs, not the Linux handlers.
>
> Mm, ok, in short: I have no clue.

The system runs stably, so I have to assume that calling
local_irq_restore in a real-time interrupt handler can cause slab
corruption. Strange.

-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] local_irq_save/local_irq_restore in real-time interrupt handler and slab corruption.

2007-11-13 Thread Philippe Gerum
Gilles Chanteperdrix wrote:
> On Nov 13, 2007 3:17 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>> Gilles Chanteperdrix wrote:
>>> Hi,
>>>
>>> I am chasing a slab corruption bug which happens on a Xenomai+RTnet
>>> enabled box under heavy non real-time network load (which passes
>>> through rtnet and rtmac_vnic to Linux, which does NAT and resends it to
>>> another rtmac_vnic). When reading some I-pipe tracer traces, I
>>> remarked that I forgot to replace a local_irq_save/local_irq_restore
>>> with local_irq_save_hw/local_irq_restore_hw in a real-time interrupt
>>> handler. I fixed this bug, and the slab corruption seems to be gone.
>> Hope you mean rtdm_lock_irqsave/irqrestore instead. Otherwise Xenomai's
>> domain state would not be updated appropriately - which is at least unclean.
> 
> It is some low level secondary timer handling code, there is no rtdm
> involved. The code protected by the interrupt masking routines is one
> or two inline assembly instructions.
> 
>> BTW, CONFIG_IPIPE_DEBUG_CONTEXT should have caught this bug as well.
> 
> I am using an old I-pipe patch without CONFIG_IPIPE_DEBUG_CONTEXT.
> The I-pipe patch and Xenomai update is scheduled for when the porting of
> RT applications and drivers is finished.
>
> Besides, the BUG_ON(!ipipe_root_domain_p) calls in ipipe_restore_root
> and ipipe_unstall_root are unconditional.
>

What bothers me, is that even looking at the old 1.3 series here and on,
the code should exhibit a call chain like
local_irq_restore -> raw_local_irq_restore() -> __ipipe_restore_root ->
__ipipe_unstall_root -> __ipipe_sync_stage, without touching the current
domain pointer, which is ok, since well, it has to be right in the first
place. If we were running over a real-time handler, then I assume the
Xenomai domain was active. So BUG_ON() should have triggered if present
in __ipipe_unstall_root.

Additionally, calling __ipipe_sync_pipeline() would sync the current
stage, i.e. Xenomai, and run the real-time ISRs, not the Linux handlers.

Mm, ok, in short: I have no clue.

-- 
Philippe.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] local_irq_save/local_irq_restore in real-time interrupt handler and slab corruption.

2007-11-13 Thread Jan Kiszka
Gilles Chanteperdrix wrote:
> Hi,
> 
> I am chasing a slab corruption bug which happens on a Xenomai+RTnet
> enabled box under heavy non real-time network load (which passes
> through rtnet and rtmac_vnic to Linux, which does NAT and resends it to
> another rtmac_vnic). When reading some I-pipe tracer traces, I
> remarked that I forgot to replace a local_irq_save/local_irq_restore
> with local_irq_save_hw/local_irq_restore_hw in a real-time interrupt
> handler. I fixed this bug, and the slab corruption seems to be gone.

Hope you mean rtdm_lock_irqsave/irqrestore instead. Otherwise Xenomai's
domain state would not be updated appropriately - which is at least unclean.
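
I.e. roughly this pattern (untested sketch, assuming an RTDM driver
context):

    #include <rtdm/rtdm_driver.h>

    rtdm_lockctx_t ctx;

    rtdm_lock_irqsave(ctx);
    /* ... the one or two protected instructions ... */
    rtdm_lock_irqrestore(ctx);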

BTW, CONFIG_IPIPE_DEBUG_CONTEXT should have caught this bug as well.

> 
> So, my question is: is it possible? I mean, if local_irq_restore in
> the real-time interrupt handler calls __ipipe_sync_stage, the root
> domain is not stalled, so there should be no problem Linux-wise with
> playing the root domain interrupts. I would rather expect the I-pipe
> state to be jammed (after all, we are probably reentering functions
> that should not be reentered), not the Linux state.

If I grab it correctly ATM, __ipipe_sync_stage() does not check for the
domain stall bits, it assumes the caller has done so. Thus,

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] local_irq_save/local_irq_restore in real-time interrupt handler and slab corruption.

2007-11-13 Thread Gilles Chanteperdrix
On Nov 13, 2007 3:17 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
> Gilles Chanteperdrix wrote:
> > Hi,
> >
> > I am chasing a slab corruption bug which happens on a Xenomai+RTnet
> > enabled box under heavy non real-time network load (which passes
> > through rtnet and rtmac_vnic to Linux, which does NAT and resends it to
> > another rtmac_vnic). When reading some I-pipe tracer traces, I
> > remarked that I forgot to replace a local_irq_save/local_irq_restore
> > with local_irq_save_hw/local_irq_restore_hw in a real-time interrupt
> > handler. I fixed this bug, and the slab corruption seems to be gone.
>
> Hope you mean rtdm_lock_irqsave/irqrestore instead. Otherwise Xenomai's
> domain state would not be updated appropriately - which is at least unclean.

It is some low level secondary timer handling code, there is no rtdm
involved. The code protected by the interrupt masking routines is one
or two inline assembly instructions.

>
> BTW, CONFIG_IPIPE_DEBUG_CONTEXT should have caught this bug as well.

I am using an old I-pipe patch without CONFIG_IPIPE_DEBUG_CONTEXT.
The I-pipe patch and Xenomai update is scheduled for when the porting of
RT applications and drivers is finished.

Besides, the BUG_ON(!ipipe_root_domain_p) calls in ipipe_restore_root
and ipipe_unstall_root are unconditional.

>
> >
> > So, my question is: is it possible? I mean, if local_irq_restore in
> > the real-time interrupt handler calls __ipipe_sync_stage, the root
> > domain is not stalled, so there should be no problem Linux-wise with
> > playing the root domain interrupts. I would rather expect the I-pipe
> > state to be jammed (after all, we are probably reentering functions
> > that should not be reentered), not the Linux state.
>
> If I grab it correctly ATM, __ipipe_sync_stage() does not check for the
> domain stall bits, it assumes the caller has done so. Thus,

Yes, but the flags saved by local_irq_save tell whether the domain is
stalled, and local_irq_restore calls __ipipe_unstall_root, which calls
__ipipe_sync_stage, only if the domain was not stalled.
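
From memory, the relevant part looks roughly like this (simplified
sketch, not the exact 1.5-04 code):

    void __ipipe_restore_root(unsigned long flags)
    {
            if (flags)
                    __ipipe_stall_root();   /* root stays stalled, no sync */
            else
                    __ipipe_unstall_root();
    }

    void __ipipe_unstall_root(void)
    {
            local_irq_disable_hw();
            /* clear the root domain stall bit */
            /* if IRQs are pending in the root stage log:
                     __ipipe_sync_stage(IPIPE_IRQMASK_ANY); */
            local_irq_enable_hw();
    }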


-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


[Xenomai-core] local_irq_save/local_irq_restore in real-time interrupt handler and slab corruption.

2007-11-13 Thread Gilles Chanteperdrix
Hi,

I am chasing a slab corruption bug which happens on a Xenomai+RTnet
enabled box under heavy non real-time network load (which passes
through rtnet and rtmac_vnic to Linux, which does NAT and resends it to
another rtmac_vnic). When reading some I-pipe tracer traces, I
remarked that I forgot to replace a local_irq_save/local_irq_restore
with local_irq_save_hw/local_irq_restore_hw in a real-time interrupt
handler. I fixed this bug, and the slab corruption seems to be gone.
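
For reference, the change boils down to something like this (simplified
sketch of I-pipe-patched kernel code; the real spot is in our low-level
timer handling, the helper name here is made up):

    static void my_rt_timer_ack(void)       /* hypothetical */
    {
            unsigned long flags;

            /* Before (wrong in a real-time handler: this manipulates the
             * root domain's *virtual* interrupt state):
             *
             *      local_irq_save(flags);
             *      ... one or two inline assembly instructions ...
             *      local_irq_restore(flags);
             */

            /* After (only the hardware interrupt state is touched, the
             * I-pipe virtual state is left alone): */
            local_irq_save_hw(flags);
            /* ... same couple of instructions ... */
            local_irq_restore_hw(flags);
    }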

So, my question is: is it possible? I mean, if local_irq_restore in
the real-time interrupt handler calls __ipipe_sync_stage, the root
domain is not stalled, so there should be no problem Linux-wise with
playing the root domain interrupts. I would rather expect the I-pipe
state to be jammed (after all, we are probably reentering functions
that should not be reentered), not the Linux state.

-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core