On 2010-11-03 12.55, Jan Kiszka wrote:
> Am 03.11.2010 12:50, Jan Kiszka wrote:
>> Am 03.11.2010 12:44, Anders Blomdell wrote:
>>> Anders Blomdell wrote:
>>>> Jan Kiszka wrote:
>>>>> Am 01.11.2010 17:55, Anders Blomdell wrote:
>>>>>> Jan Kiszka wrote:
>>>>>>> Am 28.10.2010 11:34, Anders Blomdell wrote:
>>>>>>>> Jan Kiszka wrote:
>>>>>>>>> Am 28.10.2010 09:34, Anders Blomdell wrote:
>>>>>>>>>> Anders Blomdell wrote:
>>>>>>>>>>> Anders Blomdell wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm trying to use rt_eepro100 for sending raw ethernet packets,
>>>>>>>>>>>> but I'm occasionally experiencing weird behaviour.
>>>>>>>>>>>>
>>>>>>>>>>>> Versions of things:
>>>>>>>>>>>>
>>>>>>>>>>>> linux-2.6.34.5
>>>>>>>>>>>> xenomai-2.5.5.2
>>>>>>>>>>>> rtnet-39f7fcf
>>>>>>>>>>>>
>>>>>>>>>>>> The test program runs on two computers with an "Intel Corporation
>>>>>>>>>>>> 82557/8/9/0/1 Ethernet Pro 100 (rev 08)" controller, where one
>>>>>>>>>>>> computer acts as a mirror, sending back packets received from the
>>>>>>>>>>>> ethernet (only those two computers on the network), and the other
>>>>>>>>>>>> sends packets and measures roundtrip time. Most packets come back
>>>>>>>>>>>> in approximately 100 us, but occasionally the reception times out
>>>>>>>>>>>> (once in about 100000 packets or more). The packet is then
>>>>>>>>>>>> immediately received when reception is retried, which might
>>>>>>>>>>>> indicate a race between rt_dev_recvmsg and the interrupt, but I
>>>>>>>>>>>> might be missing something obvious.
>>>>>>>>>>> Changing one of the ethernet cards to an "Intel Corporation 82541PI
>>>>>>>>>>> Gigabit Ethernet Controller (rev 05)", while keeping everything
>>>>>>>>>>> else constant, changes behavior somewhat; after receiving a few
>>>>>>>>>>> 100000 packets, reception stops entirely (-EAGAIN is returned),
>>>>>>>>>>> while transmission proceeds as it should (and the mirror returns
>>>>>>>>>>> packets).
>>>>>>>>>>>
>>>>>>>>>>> Any suggestions on what to try?
>>>>>>>>>> Since the problem disappears with 'maxcpus=1', I suspect I have an
>>>>>>>>>> SMP issue (machine is a Core2 Quad), so I'll move to xenomai-core.
>>>>>>>>>> (The original message can be found at
>>>>>>>>>> http://sourceforge.net/mailarchive/message.php?msg_name=4CC82C8D.3080808%40control.lth.se
>>>>>>>>>> )
>>>>>>>>>>
>>>>>>>>>> Xenomai-core gurus: what is the correct way to debug SMP issues?
>>>>>>>>>> Can I run the I-pipe tracer and expect to be able to save at least
>>>>>>>>>> 150 us of traces for all cpus? Any hints/suggestions/insights are
>>>>>>>>>> welcome...
>>>>>>>>> The i-pipe tracer unfortunately only saves traces for the CPU that
>>>>>>>>> triggered the freeze. To have a full picture, you may want to try
>>>>>>>>> the ftrace port I posted recently for 2.6.35.
>>>>>>>> 2.6.35.7 ?
>>>>>>>>
>>>>>>> Exactly.
>>>>>> Finally managed to get ftrace to work
>>>>>> (one possible bug: had to manually copy
>>>>>> include/xenomai/trace/xn_nucleus.h to
>>>>>> include/xenomai/trace/events/xn_nucleus.h), and it looks like it can
>>>>>> be very useful...
>>>>>>
>>>>>> But I don't think it will give much info at the moment, since no
>>>>>> xenomai/ipipe interrupt activity shows up, and adding that is far
>>>>>> above my league :-(
>>>>>
>>>>> You could use the function tracer, provided you are able to stop the
>>>>> trace quickly enough on error.
>>>>>
>>>>>> My current theory is that the problem occurs when something like this
>>>>>> takes place:
>>>>>>
>>>>>> CPU-i            CPU-j       CPU-k            CPU-l
>>>>>>
>>>>>> rt_dev_sendmsg
>>>>>>                  xmit_irq
>>>>>>                              rt_dev_recvmsg   recv_irq
>>>>>
>>>>> Can't follow. Who races here, and what will go wrong then?
>>>> That's the good question. Find attached:
>>>>
>>>> 1. .config (so you can check for stupid mistakes)
>>>> 2. console log
>>>> 3. latest version of test program
>>>> 4. 
>>>> tail of ftrace dump
>>>>
>>>> These are the xenomai tasks running when the test program is active:
>>>>
>>>> CPU  PID    CLASS  PRI  TIMEOUT  TIMEBASE  STAT  NAME
>>>>   0  0      idle    -1  -        master    R     ROOT/0
>>>>   1  0      idle    -1  -        master    R     ROOT/1
>>>>   2  0      idle    -1  -        master    R     ROOT/2
>>>>   3  0      idle    -1  -        master    R     ROOT/3
>>>>   0  0      rt      98  -        master    W     rtnet-stack
>>>>   0  0      rt       0  -        master    W     rtnet-rtpc
>>>>   0  29901  rt      50  -        master          raw_test
>>>>   0  29906  rt       0  -        master    X     reporter
>>>>
>>>> The lines of interest from the trace are probably:
>>>>
>>>> [003] 2061.347855: xn_nucleus_thread_resume: thread=f9bf7b00
>>>>                    thread_name=rtnet-stack mask=2
>>>> [003] 2061.347862: xn_nucleus_sched: status=2000000
>>>> [000] 2061.347866: xn_nucleus_sched_remote: status=0
>>>>
>>>> since this is the only place where a packet gets delayed, and the only
>>>> place in the trace where sched_remote reports status=0.
>>> Since the cpu that has rtnet-stack and hence should be resumed is doing
>>> heavy I/O at the time of the fault: could it be that
>>> send_ipi/schedule_handler needs barriers to make sure that decisions
>>> are made on the right status?
>>
>> That was my first idea as well - but we should run all relevant code
>> under nklock here. But please correct me if I miss something.

Wouldn't we need a write barrier before the send_ipi regardless of what
locks we hold? Otherwise there is no guarantee that the memory write
reaches the target cpu before the interrupt does.
>
> Mmmh -- not everything. The inlined XNRESCHED entry test in
> xnpod_schedule runs outside nklock. But doesn't releasing nklock imply
> a memory write barrier? Let me meditate...

Wouldn't we need a read barrier then (but maybe the irq handling takes
care of that; I'm not familiar with the code yet)?

Meditate all you need. BTW: the ftrace stuff is great, I'm looking
forward to being able to trace everything this way :-)

/Anders

-- 
Anders Blomdell                  Email: [email protected]
Department of Automatic Control
Lund University                  Phone: +46 46 222 4625
P.O. Box 118                     Fax:   +46 46 138118
SE-221 00 Lund, Sweden

_______________________________________________
Xenomai-core mailing list
[email protected]
https://mail.gna.org/listinfo/xenomai-core
