Gilles Chanteperdrix wrote:
> On Nov 13, 2007 6:45 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>> Gilles Chanteperdrix wrote:
>>> On Nov 13, 2007 6:10 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>>>> Gilles Chanteperdrix wrote:
>>>>> On Nov 13, 2007 3:17 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>>>>>> Gilles Chanteperdrix wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am chasing a slab corruption bug which happens on a Xenomai+RTnet
>>>>>>> enabled box under heavy non-real-time network load (the traffic
>>>>>>> passes through rtnet and rtmac_vnic to Linux, which does NAT and
>>>>>>> resends it to another rtmac_vnic). While reading some I-pipe tracer
>>>>>>> traces, I noticed that I had forgotten to replace a
>>>>>>> local_irq_save/local_irq_restore pair with
>>>>>>> local_irq_save_hw/local_irq_restore_hw in a real-time interrupt
>>>>>>> handler. I fixed this bug, and the slab corruption seems to be gone.
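
As an illustration only, a minimal sketch of what such a fix typically
looks like on an I-pipe kernel; the handler and its body are hypothetical,
only the plain vs. _hw masking macros come from the report above:

  /* Hypothetical low-level handler; everything except the masking
   * macros is made up. */
  static void my_rt_timer_handler(void)
  {
          unsigned long flags;

          /* Wrong inside a real-time handler: on an I-pipe kernel,
           * local_irq_save() only acts on the virtualized interrupt
           * state of the Linux (root) domain:
           *
           *      local_irq_save(flags);
           */

          /* Correct: really mask interrupts at the hardware level. */
          local_irq_save_hw(flags);

          /* ... the one or two inline assembly instructions ... */

          local_irq_restore_hw(flags);
  }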
>>>>>> I hope you mean rtdm_lock_irqsave/irqrestore instead. Otherwise,
>>>>>> Xenomai's domain state would not be updated appropriately - which is,
>>>>>> at the very least, unclean.
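
The RTDM spelling Jan refers to would look roughly like this; only
rtdm_lockctx_t, rtdm_lock_irqsave() and rtdm_lock_irqrestore() are actual
RTDM API, the surrounding function is hypothetical:

  #include <rtdm/rtdm_driver.h>

  static void rt_critical_section(void)
  {
          rtdm_lockctx_t context;

          rtdm_lock_irqsave(context);     /* mask IRQs the RTDM way */
          /* ... critical section ... */
          rtdm_lock_irqrestore(context);
  }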
>>>>> It is some low-level secondary timer handling code; there is no RTDM
>>>>> involved. The code protected by the interrupt masking routines amounts
>>>>> to one or two inline assembly instructions.
>>>>>
>>>>>> BTW, CONFIG_IPIPE_DEBUG_CONTEXT should have caught this bug as well.
>>>>> I am using an old I-pipe patch without CONFIG_IPIPE_DEBUG_CONTEXT.
>>>>> Updating the I-pipe patch and Xenomai is scheduled for when the porting
>>>>> of the RT applications and drivers is finished.
>>>>>
>>>>> Besides, the BUG_ON(!ipipe_root_domain_p) checks in ipipe_restore_root
>>>>> and ipipe_unstall_root are unconditional anyway.
>>>>>
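
For reference, in the I-pipe versions that do carry the check, it sits
roughly as follows (reconstructed from memory, so the details certainly
differ between versions and architectures):

  void __ipipe_restore_root(unsigned long x)
  {
          /* The unconditional sanity check mentioned above. */
          BUG_ON(!ipipe_root_domain_p);

          if (x)
                  __ipipe_stall_root();
          else
                  __ipipe_unstall_root();
  }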
>>>> What bothers me is that, even looking at the old 1.3 series and onward,
>>>> the code should exhibit a call chain like
>>>> local_irq_restore -> raw_local_irq_restore -> __ipipe_restore_root ->
>>>> __ipipe_unstall_root -> __ipipe_sync_stage, without touching the current
>>>> domain pointer - which is fine, since it has to be correct in the first
>>>> place. If we were running over a real-time handler, then I assume the
>>>> Xenomai domain was active, so the BUG_ON() should have triggered if it
>>>> were present in __ipipe_unstall_root.
>>> I am using I-pipe ARM 1.5-04 (now that I have run cat
>>> /proc/ipipe/version, I really feel ashamed), and it has no BUG_ON in
>>> __ipipe_unstall_root or __ipipe_restore_root. I promise, one day, I
>>> will switch to Xenomai 2.4.
>>>
>>>> Additionally, calling __ipipe_sync_pipeline() would sync the current
>>>> stage, i.e. Xenomai, and run the real-time ISRs, not the Linux handlers.
>>>>
>>>> Mm, ok, in short: I have no clue.
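
To make the scenario concrete, a heavily simplified sketch of the tail of
that chain; the real __ipipe_unstall_root differs across versions, and the
elided bodies are placeholders:

  void __ipipe_unstall_root(void)
  {
          local_irq_disable_hw();

          /* Clear the virtual stall bit of the root (Linux) stage. */
          /* ... */

          /* If interrupts were logged meanwhile, the sync step replays
           * them for the *current* stage - i.e. the Xenomai domain if we
           * got here from a real-time handler, hence the remark above
           * about running the real-time ISRs rather than the Linux
           * handlers:
           *
           *      __ipipe_sync_stage(...);
           */

          local_irq_enable_hw();
  }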
>>> The system now runs stably, so I have to assume that calling
>>> local_irq_restore in a real-time interrupt handler can cause slab
>>> corruption. Strange.
>> What about instrumenting the involved I-pipe code path with
>> ipipe_trace_special calls and then restoring the buggy code? You could
>> even call ipipe_trace_freeze at that spot so that you can watch in the
>> pre/post trace what happens. That may help to understand whether this was
>> the only issue, or whether we need some further measures for future
>> versions.
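
If it helps, such instrumentation could look roughly like this; the id
values and the wrapper are made up, only ipipe_trace_special() and
ipipe_trace_freeze() are actual tracer API:

  #include <linux/ipipe_trace.h>

  /* Hypothetical wrapper around the suspicious call: */
  static inline void traced_irq_restore(unsigned long flags)
  {
          ipipe_trace_special(0xaa, flags);   /* mark entry            */
          local_irq_restore(flags);           /* the call under test   */
          ipipe_trace_special(0xbb, 0);       /* mark that we returned */
          ipipe_trace_freeze(0);              /* optional: stop trace  */
  }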
> 
> I have used the tracer, but the "slab corruption" message triggers a
> long time after the bug; even with a 128K backtrace, I could not
> find the place where the bug happened.

Well, IF this local_irq fiddling here really is the reason, you should
already be able to spot unexpected code paths around its usage - that was
my idea. If there is nothing, you are probably only shifting some
still-existing race window around.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux

_______________________________________________
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core
