On 03.11.2010 13:07, Anders Blomdell wrote:
> On 2010-11-03 12.55, Jan Kiszka wrote:
>> On 03.11.2010 12:50, Jan Kiszka wrote:
>>> On 03.11.2010 12:44, Anders Blomdell wrote:
>>>> Anders Blomdell wrote:
>>>>> Jan Kiszka wrote:
>>>>>> On 01.11.2010 17:55, Anders Blomdell wrote:
>>>>>>> Jan Kiszka wrote:
>>>>>>>> On 28.10.2010 11:34, Anders Blomdell wrote:
>>>>>>>>> Jan Kiszka wrote:
>>>>>>>>>> On 28.10.2010 09:34, Anders Blomdell wrote:
>>>>>>>>>>> Anders Blomdell wrote:
>>>>>>>>>>>> Anders Blomdell wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm trying to use rt_eepro100 for sending raw ethernet
>>>>>>>>>>>>> packets, but I'm occasionally experiencing weird behaviour.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Versions of things:
>>>>>>>>>>>>>
>>>>>>>>>>>>>   linux-2.6.34.5
>>>>>>>>>>>>>   xenomai-2.5.5.2
>>>>>>>>>>>>>   rtnet-39f7fcf
>>>>>>>>>>>>>
>>>>>>>>>>>>> The test program runs on two computers with "Intel Corporation
>>>>>>>>>>>>> 82557/8/9/0/1 Ethernet Pro 100 (rev 08)" controllers (only
>>>>>>>>>>>>> those two computers on the network), where one computer acts
>>>>>>>>>>>>> as a mirror, sending back packets received from the ethernet,
>>>>>>>>>>>>> and the other sends packets and measures roundtrip time. Most
>>>>>>>>>>>>> packets come back in approximately 100 us, but occasionally
>>>>>>>>>>>>> the reception times out (once in about 100000 packets or
>>>>>>>>>>>>> more); the packet is then immediately received when reception
>>>>>>>>>>>>> is retried, which might indicate a race between rt_dev_recvmsg
>>>>>>>>>>>>> and the interrupt, but I might be missing something obvious.
>>>>>>>>>>>> Changing one of the ethernet cards to an "Intel Corporation
>>>>>>>>>>>> 82541PI Gigabit Ethernet Controller (rev 05)", while keeping
>>>>>>>>>>>> everything else constant, changes the behavior somewhat; after
>>>>>>>>>>>> receiving a few 100000 packets, reception stops entirely
>>>>>>>>>>>> (-EAGAIN is returned), while transmission proceeds as it
>>>>>>>>>>>> should (and the mirror returns packets).
>>>>>>>>>>>>
>>>>>>>>>>>> Any suggestions on what to try?
>>>>>>>>>>> Since the problem disappears with 'maxcpus=1', I suspect I have
>>>>>>>>>>> an SMP issue (machine is a Core2 Quad), so I'll move to
>>>>>>>>>>> xenomai-core. (original message can be found at
>>>>>>>>>>> (original message can be found at
>>>>>>>>>>> http://sourceforge.net/mailarchive/message.php?msg_name=4CC82C8D.3080808%40control.lth.se
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> )
>>>>>>>>>>>
>>>>>>>>>>> Xenomai-core gurus: what is the correct way to debug SMP issues?
>>>>>>>>>>> Can I run the I-pipe tracer and expect to be able to save at
>>>>>>>>>>> least 150 us of traces for all cpus? Any
>>>>>>>>>>> hints/suggestions/insights are welcome...
>>>>>>>>>> The i-pipe tracer unfortunately only saves traces for the CPU
>>>>>>>>>> that triggered the freeze. To have a full picture, you may want
>>>>>>>>>> to try my ftrace port I posted recently for 2.6.35.
>>>>>>>>> 2.6.35.7 ?
>>>>>>>>>
>>>>>>>> Exactly.
>>>>>>> Finally managed to get ftrace to work (one possible bug: I had to
>>>>>>> manually copy include/xenomai/trace/xn_nucleus.h to
>>>>>>> include/xenomai/trace/events/xn_nucleus.h), and it looks like it
>>>>>>> can be very useful...
>>>>>>>
>>>>>>> But I don't think it will give much info at the moment, since no
>>>>>>> xenomai/ipipe interrupt activity shows up, and adding that is far above
>>>>>>> my league :-(
>>>>>>
>>>>>> You could use the function tracer, provided you are able to stop the
>>>>>> trace quickly enough on error.
>>>>>>
>>>>>>> My current theory is that the problem occurs when something like this
>>>>>>> takes place:
>>>>>>>
>>>>>>>   CPU-i        CPU-j        CPU-k        CPU-l
>>>>>>>
>>>>>>> rt_dev_sendmsg
>>>>>>>         xmit_irq
>>>>>>> rt_dev_recvmsg            recv_irq
>>>>>>
>>>>>> Can't follow. What races here, and what would go wrong then?
>>>>> That's the good question. Find attached:
>>>>>
>>>>> 1. .config (so you can check for stupid mistakes)
>>>>> 2. console log
>>>>> 3. latest version of test program
>>>>> 4. tail of ftrace dump
>>>>>
>>>>> These are the xenomai tasks running when the test program is active:
>>>>>
>>>>> CPU  PID    CLASS  PRI      TIMEOUT   TIMEBASE   STAT       NAME
>>>>>   0  0      idle    -1      -         master     R          ROOT/0
>>>>>   1  0      idle    -1      -         master     R          ROOT/1
>>>>>   2  0      idle    -1      -         master     R          ROOT/2
>>>>>   3  0      idle    -1      -         master     R          ROOT/3
>>>>>   0  0      rt      98      -         master     W          rtnet-stack
>>>>>   0  0      rt       0      -         master     W          rtnet-rtpc
>>>>>   0  29901  rt      50      -         master                raw_test
>>>>>   0  29906  rt       0      -         master     X          reporter
>>>>>
>>>>>
>>>>>
>>>>> The lines of interest from the trace are probably:
>>>>>
>>>>> [003]  2061.347855: xn_nucleus_thread_resume: thread=f9bf7b00   
>>>>>                   thread_name=rtnet-stack mask=2
>>>>> [003]  2061.347862: xn_nucleus_sched: status=2000000
>>>>> [000]  2061.347866: xn_nucleus_sched_remote: status=0
>>>>>
>>>>> since this is the only place where a packet gets delayed, and the only
>>>>> place in the trace where sched_remote reports a status=0
>>>> Since the CPU that has rtnet-stack and hence should be resumed is
>>>> doing heavy I/O at the time of the fault: could it be that
>>>> send_ipi/schedule_handler needs barriers to make sure that decisions
>>>> are made on the right status?
>>>
>>> That was my first idea as well - but we should run all relevant code
>>> under nklock here. But please correct me if I'm missing something.
> Wouldn't we need a write barrier before the send_ipi regardless of what
> locks we hold? Otherwise there is no guarantee that the memory write
> reaches the target cpu before the interrupt does.

Yeah, the problem is that if xnpod_resume_thread and the next
xnpod_schedule run under the same nklock, we won't issue the barrier,
as we won't release the lock! So there is indeed a need to issue an
additional barrier. Can you check this?

diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h
index df56417..66b52ad 100644
--- a/include/nucleus/sched.h
+++ b/include/nucleus/sched.h
@@ -187,6 +187,7 @@ static inline int xnsched_self_resched_p(struct xnsched *sched)
   if (current_sched != (__sched__))    {                               \
       xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched);  \
       setbits((__sched__)->status, XNRESCHED);                         \
+      xnarch_memory_barrier();                                         \
   }                                                                    \
 } while (0)
 

> 
>>
>> Mmmh -- not everything. The inlined XNRESCHED entry test in
>> xnpod_schedule runs outside nklock. But doesn't releasing nklock imply a
>> memory write barrier? Let me meditate...
> Wouldn't we need a read barrier then (but maybe the irq handling takes
> care of that, I'm not familiar with the code yet)?

A read barrier is not required here, as we do not need to order load
operations with respect to each other in the reschedule IRQ handler.

> 
> Meditate all you need. BTW: the ftrace stuff is great, I'm looking
> forward to being able to trace everything this way :-)

You can always help: there is a lot of boring^Winteresting tracepoint
conversion waiting in Xenomai; see the few already converted nucleus
tracepoints.
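For anyone picking this up, a conversion follows the usual TRACE_EVENT() shape, in the style of the existing xn_nucleus events whose output appears in the trace excerpt above (thread=..., thread_name=..., mask=...). The event name and fields below are illustrative, not taken from the Xenomai tree:

```c
/* Hypothetical sketch in the style of include/xenomai/trace/xn_nucleus.h */
TRACE_EVENT(xn_nucleus_example,
	TP_PROTO(struct xnthread *thread, unsigned long mask),
	TP_ARGS(thread, mask),
	TP_STRUCT__entry(
		__field(struct xnthread *, thread)
		__string(name, xnthread_name(thread))
		__field(unsigned long, mask)
	),
	TP_fast_assign(
		__entry->thread = thread;
		__assign_str(name, xnthread_name(thread));
		__entry->mask = mask;
	),
	TP_printk("thread=%p thread_name=%s mask=%lu",
		  __entry->thread, __get_str(name), __entry->mask)
);
```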

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

_______________________________________________
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core
