Gilles Chanteperdrix wrote:
> On Nov 13, 2007 6:44 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>> Gilles Chanteperdrix wrote:
>>> On Nov 13, 2007 6:10 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>>>> Gilles Chanteperdrix wrote:
>>>>> On Nov 13, 2007 3:17 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>>>>>> Gilles Chanteperdrix wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am chasing a slab corruption bug which happens on a Xenomai+RTnet
>>>>>>> enabled box under heavy non-real-time network load (which passes
>>>>>>> through rtnet and rtmac_vnic to Linux, which does NAT and resends it
>>>>>>> to another rtmac_vnic). While reading some I-pipe tracer traces, I
>>>>>>> noticed that I had forgotten to replace a local_irq_save/
>>>>>>> local_irq_restore pair with local_irq_save_hw/local_irq_restore_hw
>>>>>>> in a real-time interrupt handler. I fixed this bug, and the slab
>>>>>>> corruption seems to be gone.
>>>>>> Hope you mean rtdm_lock_irqsave/irqrestore instead. Otherwise Xenomai's
>>>>>> domain state would not be updated appropriately - which is at least 
>>>>>> unclean.
>>>>> It is some low-level secondary timer handling code; there is no RTDM
>>>>> involved. The code protected by the interrupt masking routines is one
>>>>> or two inline assembly instructions.
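
Just to make sure we picture the same pattern, I take it the spot boiled
down to something like the sketch below -- the handler name and the asm
are of course made up, only the masking calls matter:

/* Hypothetical sketch of the spot described above; only the masking
 * calls are the point, the rest is a placeholder. */
static void rt_timer_ack(void)
{
	unsigned long flags;

	/* Buggy version: local_irq_save(flags)/local_irq_restore(flags).
	 * Over the I-pipe these only stall the root (Linux) stage, and
	 * the restore path ends up in __ipipe_restore_root(). */
	local_irq_save_hw(flags);	/* the fix: really mask hw IRQs */
	__asm__ __volatile__("nop");	/* stands for the one or two real
					 * asm instructions */
	local_irq_restore_hw(flags);
}

In an RTDM driver, rtdm_lock_irqsave()/rtdm_lock_irqrestore() would be
the portable spelling Jan refers to, but as you say, no RTDM is involved
down there.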
>>>>>
>>>>>> BTW, CONFIG_IPIPE_DEBUG_CONTEXT should have caught this bug as well.
>>>>> I am using an old I-pipe patch without CONFIG_IPIPE_DEBUG_CONTEXT.
>>>>> Updating the I-pipe patch and Xenomai is scheduled for when the
>>>>> porting of the RT applications and drivers is finished.
>>>>>
>>>>> Besides, the BUG_ON(!ipipe_root_domain_p) checks in ipipe_restore_root
>>>>> and ipipe_unstall_root are unconditional.
>>>>>
>>>> What bothers me is that, even looking at the old 1.3 series onward,
>>>> the code should exhibit a call chain like
>>>> local_irq_restore -> raw_local_irq_restore() -> __ipipe_restore_root ->
>>>> __ipipe_unstall_root -> __ipipe_sync_stage, without touching the current
>>>> domain pointer, which is ok, since, well, it has to be right in the
>>>> first place. If we were running over a real-time handler, then I assume
>>>> the Xenomai domain was active. So the BUG_ON() should have triggered,
>>>> had it been present in __ipipe_unstall_root.
>>> I am using an I-pipe arm 1.5-04 patch (now that I have done cat
>>> /proc/ipipe/version, I really feel ashamed). And it has no BUG_ON in
>>> __ipipe_unstall_root or __ipipe_restore_root. I promise, one day, I
>>> will switch to Xenomai 2.4.
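
In case you want to cherry-pick just that check into your 1.5-04 tree, it
amounts to something like this (simplified and written from memory, not
the literal code of any particular patch):

void __ipipe_restore_root(unsigned long x)
{
	/* Catches callers running over a non-root (e.g. Xenomai) domain. */
	BUG_ON(!ipipe_root_domain_p);

	if (x)
		__ipipe_stall_root();
	else
		__ipipe_unstall_root();	/* ends with local_irq_enable_hw() */
}

The same BUG_ON() sits at the top of __ipipe_unstall_root().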
>>>
>>>> Additionally, calling __ipipe_sync_pipeline() would sync the current
>>>> stage, i.e. Xenomai, and run the real-time ISRs, not the Linux handlers.
>>>>
>>>> Mm, ok, in short: I have no clue.
>>> The system now runs stably, so I have to assume that calling
>>> local_irq_restore in a real-time interrupt handler can cause slab
>>> corruption. Strange.
>>>
>> I guess this is likely not on your critical path, but when time allows,
>> I'd be interested to know whether such a bug still occurs with purely
>> kernel-based tasking, assuming that you currently see this bug with
>> userland tasks. Basically, I wonder whether migrating shadows between
>> both domains might reveal the bug, since your real-time handler starts
>> being preemptible by hw IRQs as soon as it returns from
>> __ipipe_unstall_root, which forces local_irq_enable_hw().
> 
> Actually, I had kernel-only tasking, since in my test I had removed
> everything and kept only the RTnet drivers and stack, and tested
> Linux routing (my basic goal was to improve the non-real-time traffic
> rate).
> 

Ah, ok. So maybe the preemption issue? Would the ISR be fine with being
re-entered, for instance? Any potential trashing in sight? I guess you
could check whether this is related, by using a local version of
local_irq_restore in this particular code spot, which would basically do
what __ipipe_unstall_root does, except for calling local_irq_enable_hw().
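
Something along these lines, I mean -- untested, and the helper name and
per-CPU accessors below are written from memory of more recent patches,
so they will likely need adjusting to whatever your 1.5-04 tree actually
provides:

/* Debug experiment only: the unstall path of local_irq_restore, i.e.
 * what __ipipe_unstall_root() does, minus the final
 * local_irq_enable_hw(). Accessor names are assumptions to be checked
 * against the actual 1.5-04 code. */
static inline void __local_unstall_root_nohw(void)
{
	local_irq_disable_hw();

	__clear_bit(IPIPE_STALL_FLAG,
		    &ipipe_root_domain->cpudata[smp_processor_id()].status);

	if (ipipe_root_domain->cpudata[smp_processor_id()].irq_pending_hi != 0)
		__ipipe_sync_stage(IPIPE_IRQMASK_ANY);

	/* Purposely no local_irq_enable_hw() here. */
}

If the corruption stays away with that in place of the original
local_irq_restore(), but comes back with a plain __ipipe_unstall_root(),
that would point at the hw IRQ window as the culprit.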

-- 
Philippe.
