On Nov 13, 2007 6:44 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>
> Gilles Chanteperdrix wrote:
> > On Nov 13, 2007 6:10 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
> >> Gilles Chanteperdrix wrote:
> >>> On Nov 13, 2007 3:17 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
> >>>> Gilles Chanteperdrix wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I am chasing a slab corruption bug which happens on a Xenomai+RTnet
> >>>>> enabled box under heavy non real-time network load (which passes
> >>>>> through rtnet and rtmac_vnic to Linux which does NAT and resend it to
> >>>>> another rtmac_vnic). When reading some I-pipe tracer traces, I
> >>>>> remarked that I forgot to replace a local_irq_save/local_irq_restore
> >>>>> with local_irq_save_hw/local_irq_restore_hw in a real-time interrupt
> >>>>> handler. I fixed this bug, and the slab corruption seems to be gone.
> >>>> Hope you mean rtdm_lock_irqsave/irqrestore instead. Otherwise Xenomai's
> >>>> domain state would not be updated appropriately - which is at least
> >>>> unclean.
> >>> It is some low level secondary timer handling code, there is no rtdm
> >>> involved. The code protected by the interrupt masking routines is one
> >>> or two inline assembly instructions.
> >>>
> >>>> BTW, CONFIG_IPIPE_DEBUG_CONTEXT should have caught this bug as well.
> >>> I am using an old I-pipe patch without CONFIG_IPIPE_DEBUG_CONTEXT.
> >>> I-pipe patch and Xenomai update is scheduled for when RT applications
> >>> and drivers porting will be finished.
> >>>
> >>> Besides the BUG_ON(!ipipe_root_domain_p) in ipipe_restore_root and
> >>> ipipe_unstall_root are unconditional.
> >>>
> >> What bothers me, is that even looking at the old 1.3 series here and on,
> >> the code should exhibit a call chain like
> >> local_irq_restore -> raw_local_irq_restore() -> __ipipe_restore_root ->
> >> __ipipe_unstall_root -> __ipipe_sync_stage, without touching the current
> >> domain pointer, which is ok, since well, it has to be right in the first
> >> place. If we were running over a real-time handler, then I assume the
> >> Xenomai domain was active. So BUG_ON() should have triggered if present
> >> in __ipipe_unstall_root.
> >
> > I am using an I-pipe arm 1.5-04 (now that I have done cat
> > /proc/ipipe/version, I really feel ashamed). And it has no BUG_ON in
> > __ipipe_unstall_root or __ipipe_restore_root. I promise, one day, I
> > will switch to Xenomai 2.4.
> >
> >> Additionally, calling __ipipe_sync_pipeline() would sync the current
> >> stage, i.e. Xenomai, and run the real-time ISRs, not the Linux handlers.
> >>
> >> Mm, ok, in short: I have no clue.
> >
> > The system runs stably, so I have to assume that calling
> > local_irq_restore in a real-time interrupt handler can cause slab
> > corruption. Strange.
> >
>
> I guess this is likely not on your critical path, but when time allows,
> I'd be interested to know whether such bug still occurs when using a
> purely kernel-only tasking, assuming that you currently see this bug
> with userland tasks. Basically, I wonder if migrating shadows between
> both domains would not reveal the bug, since your real-time handler
> starts being preemptible by hw IRQs as soon as it returns from
> __ipipe_unstall_root, which forces local_irq_enable_hw().

Actually, I had only kernel-only tasking: in my test I had removed
everything except the RTnet drivers and stack, and tested only Linux
routing (my basic goal was to improve the non-real-time traffic rate).

-- 
Gilles Chanteperdrix

_______________________________________________
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core