Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Anders Blomdell

Anders Blomdell wrote:

Jan Kiszka wrote:

Am 01.11.2010 17:55, Anders Blomdell wrote:

Jan Kiszka wrote:

Am 28.10.2010 11:34, Anders Blomdell wrote:

Jan Kiszka wrote:

Am 28.10.2010 09:34, Anders Blomdell wrote:

Anders Blomdell wrote:

Anders Blomdell wrote:

Hi,

I'm trying to use rt_eepro100 for sending raw ethernet packets, but I'm
occasionally experiencing weird behaviour.

Versions of things:

  linux-2.6.34.5
  xenomai-2.5.5.2
  rtnet-39f7fcf

The test program runs on two computers with an Intel Corporation
82557/8/9/0/1 Ethernet Pro 100 (rev 08) controller, where one computer
acts as a mirror sending back packets received from the ethernet (only
those two computers are on the network), and the other sends packets and
measures roundtrip time. Most packets come back in approximately 100 us,
but occasionally the reception times out (once in about 10 packets or
more), yet the packet is received immediately when reception is retried,
which might indicate a race between rt_dev_recvmsg and the interrupt, but
I might be missing something obvious.

Changing one of the ethernet cards to an Intel Corporation 82541PI
Gigabit Ethernet Controller (rev 05), while keeping everything else
constant, changes behavior somewhat; after receiving a few 10 packets,
reception stops entirely (-EAGAIN is returned), while transmission
proceeds as it should (and the mirror returns packets).

Any suggestions on what to try?

Since the problem disappears with 'maxcpus=1', I suspect I have an SMP
issue (the machine is a Core2 Quad), so I'll move to xenomai-core.
(original message can be found at
http://sourceforge.net/mailarchive/message.php?msg_name=4CC82C8D.3080808%40control.lth.se )

Xenomai-core gurus: which is the correct way to debug SMP issues?
Can I run the I-pipe tracer and expect to be able to save at least 150 us
of traces for all cpus? Any hints/suggestions/insights are welcome...

The i-pipe tracer unfortunately only saves traces for the CPU that
triggered the freeze. To have a full picture, you may want to try my
ftrace port I posted recently for 2.6.35.

2.6.35.7 ?


Exactly.

Finally managed to get the ftrace to work
(one possible bug: had to manually copy
include/xenomai/trace/xn_nucleus.h to
include/xenomai/trace/events/xn_nucleus.h), and it looks like it can be
very useful...

But I don't think it will give much info at the moment, since no
xenomai/ipipe interrupt activity shows up, and adding that is far above
my league :-(


You could use the function tracer, provided you are able to stop the
trace quickly enough on error.


My current theory is that the problem occurs when something like this
takes place:

  CPU-i            CPU-j       CPU-k            CPU-l

  rt_dev_sendmsg
                   xmit_irq
                               rt_dev_recvmsg   recv_irq


Can't follow. What races here, and what will go wrong then?

That's the good question. Find attached:

1. .config (so you can check for stupid mistakes)
2. console log
3. latest version of test program
4. tail of ftrace dump

These are the xenomai tasks running when the test program is active:

CPU  PID    CLASS  PRI  TIMEOUT  TIMEBASE  STAT  NAME
  0  0      idle    -1  -        master    R     ROOT/0
  1  0      idle    -1  -        master    R     ROOT/1
  2  0      idle    -1  -        master    R     ROOT/2
  3  0      idle    -1  -        master    R     ROOT/3
  0  0      rt      98  -        master    W     rtnet-stack
  0  0      rt       0  -        master    W     rtnet-rtpc
  0  29901  rt      50  -        master          raw_test
  0  29906  rt       0  -        master    X     reporter



The lines of interest from the trace are probably:

[003]  2061.347855: xn_nucleus_thread_resume: thread=f9bf7b00
                    thread_name=rtnet-stack mask=2
[003]  2061.347862: xn_nucleus_sched: status=200
[000]  2061.347866: xn_nucleus_sched_remote: status=0

since this is the only place where a packet gets delayed, and the only
place in the trace where sched_remote reports a status=0.

Since the cpu that has rtnet-stack (and hence should be resumed) is doing
heavy I/O at the time of the fault; could it be that
send_ipi/schedule_handler needs barriers to make sure that decisions are
made on the right status?
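
To make the suspected ordering concrete, this is the kind of pattern I
have in mind (just a sketch using the nucleus names, not the actual code
paths; the cpumask argument is only a placeholder):

	/* CPU A: resumes rtnet-stack, which lives on CPU B (sketch) */
	setbits(remote_sched->status, XNRESCHED);   /* plain store */
	/* without a write barrier here, the IPI below could be handled
	   on CPU B before the XNRESCHED store is visible there */
	xnarch_send_ipi(remote_sched_cpumask);      /* placeholder mask */

	/* CPU B: reschedule IPI handler (sketch) */
	if (!testbits(sched->status, XNRESCHED))
		return;   /* sees a stale status -> the resume is delayed */
	xnpod_schedule();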


/Anders




Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Jan Kiszka
Am 03.11.2010 12:44, Anders Blomdell wrote:
 The lines of interest from the trace are probably:

 [003]  2061.347855: xn_nucleus_thread_resume: thread=f9bf7b00
                     thread_name=rtnet-stack mask=2
 [003]  2061.347862: xn_nucleus_sched: status=200
 [000]  2061.347866: xn_nucleus_sched_remote: status=0

 since this is the only place where a packet gets delayed, and the only
 place in the trace where sched_remote reports a status=0.
 Since the cpu that has rtnet-stack and hence should be resumed is doing
 heavy I/O at the time of the fault; could it be that
 send_ipi/schedule_handler needs barriers to make sure that decisions are
 made on the right status?

That was my first idea as well - but we should run all relevant code
under nklock here. But please correct me if I miss something.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux



Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Jan Kiszka
Am 03.11.2010 12:50, Jan Kiszka wrote:
 Am 03.11.2010 12:44, Anders Blomdell wrote:
 Anders Blomdell wrote:
 Jan Kiszka wrote:
 Am 01.11.2010 17:55, Anders Blomdell wrote:
 Jan Kiszka wrote:
 Am 28.10.2010 11:34, Anders Blomdell wrote:
 Jan Kiszka wrote:
 Am 28.10.2010 09:34, Anders Blomdell wrote:
 Anders Blomdell wrote:
 Anders Blomdell wrote:
 Hi,

 I'm trying to use rt_eepro100, for sending raw ethernet packets,
 but I'm
 experincing occasionally weird behaviour.

 Versions of things:

   linux-2.6.34.5
   xenomai-2.5.5.2
   rtnet-39f7fcf

 The testprogram runs on two computers with Intel Corporation
 82557/8/9/0/1 Ethernet Pro 100 (rev 08) controller, where one
 computer
 acts as a mirror sending back packets received from the ethernet
 (only
 those two computers on the network), and the other sends
 packets and
 measures roundtrip time. Most packets comes back in approximately
 100
 us, but occasionally the reception times out (once in about
 10
 packets or more), but the packets gets immediately received when
 reception is retried, which might indicate a race between
 rt_dev_recvmsg
 and interrupt, but I might miss something obvious.
 Changing one of the ethernet cards to a Intel Corporation 82541PI
 Gigabit Ethernet Controller (rev 05), while keeping everything
 else
 constant, changes behavior somewhat; after receiving a few 10
 packets, reception stops entirely (-EAGAIN is returned), while
 transmission proceeds as it should (and mirror returns packets).

 Any suggestions on what to try?
 Since the problem disappears with 'maxcpus=1', I suspect I have
 a SMP
 issue (machine is a Core2 Quad), so I'll move to xenomai-core.
 (original message can be found at
 http://sourceforge.net/mailarchive/message.php?msg_name=4CC82C8D.3080808%40control.lth.se


 )

 Xenomai-core gurus: which is the corrrect way to debug SMP issues?
 Can I run I-pipe-tracer and expect to be able save at least 150
 us of
 traces for all cpus? Any hints/suggestions/insigths are welcome...
 The i-pipe tracer unfortunately only saves traces for a the CPU that
 triggered the freeze. To have a full pictures, you may want to
 try my
 ftrace port I posted recently for 2.6.35.
 2.6.35.7 ?

 Exactly.
 Finally managed to get the ftrace to work
 (one possible bug: had to manually copy
 include/xenomai/trace/xn_nucleus.h to
 include/xenomai/trace/events/xn_nucleus.h), and it looks like it can be
 very useful...

 But I don't think it will give much info at the moment, since no
 xenomai/ipipe interrupt activity shows up, and adding that is far above
 my league :-(

 You could use the function tracer, provided you are able to stop the
 trace quickly enough on error.

 My current theory is that the problem occurs when something like this
 takes place:

   CPU-iCPU-jCPU-kCPU-l

 rt_dev_sendmsg
 xmit_irq
 rt_dev_recvmsgrecv_irq

 Can't follow. When races here, and what will go wrong then?
 Thats the good question. Find attached:

 1. .config (so you can check for stupid mistakes)
 2. console log
 3. latest version of test program
 4. tail of ftrace dump

 These are the xenomai tasks running when the test program is active:

 CPU  PIDCLASS  PRI  TIMEOUT   TIMEBASE   STAT   NAME
   0  0  idle-1  - master R  ROOT/0
   1  0  idle-1  - master R  ROOT/1
   2  0  idle-1  - master R  ROOT/2
   3  0  idle-1  - master R  ROOT/3
   0  0  rt  98  - master W  rtnet-stack
   0  0  rt   0  - master W  rtnet-rtpc
   0  29901  rt  50  - masterraw_test
   0  29906  rt   0  - master X  reporter



 The lines of interest from the trace are probably:

 [003]  2061.347855: xn_nucleus_thread_resume: thread=f9bf7b00   
   thread_name=rtnet-stack mask=2
 [003]  2061.347862: xn_nucleus_sched: status=200
 [000]  2061.347866: xn_nucleus_sched_remote: status=0

 since this is the only place where a packet gets delayed, and the only
 place in the trace where sched_remote reports a status=0
 Since the cpu that has rtnet-stack and hence should be resumed is doing
 heavy I/O at the time of fault; could it be that
 send_ipi/schedule_handler needs barriers to make sure taht decisions are
 made on the right status?
 
 That was my first idea as well - but we should run all relevant code
 under nklock here. But please correct me if I miss something.

Mmmh -- not everything. The inlined XNRESCHED entry test in
xnpod_schedule runs outside nklock. But doesn't releasing nklock imply a
memory write barrier? Let me meditate...
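
For reference, the entry test in question looks like this in
include/nucleus/pod.h; it reads sched->status without holding nklock:

	if (testbits(sched->status,
		     XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED)
		return;

If the writer's nklock release acts as a full write barrier and the IPI
cannot overtake it, the XNRESCHED bit should already be visible when this
test runs on the target CPU - that is the assumption to verify.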

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Anders Blomdell
On 2010-11-03 12.55, Jan Kiszka wrote:
 Since the cpu that has rtnet-stack and hence should be resumed is doing
 heavy I/O at the time of the fault; could it be that
 send_ipi/schedule_handler needs barriers to make sure that decisions are
 made on the right status?

 That was my first idea as well - but we should run all relevant code
 under nklock here. But please correct me if I miss something.
Wouldn't we need a write barrier before the send_ipi regardless of what
locks we hold? Otherwise there is no guarantee that the memory write
reaches the target cpu before the interrupt does.

 
 Mmmh -- not everything. The inlined XNRESCHED entry test in
 xnpod_schedule runs outside nklock. But doesn't releasing nklock imply a
 memory write barrier? Let me meditate...
Wouldn't 

Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Jan Kiszka
Am 03.11.2010 13:07, Anders Blomdell wrote:
 On 2010-11-03 12.55, Jan Kiszka wrote:
 Am 03.11.2010 12:50, Jan Kiszka wrote:
 Am 03.11.2010 12:44, Anders Blomdell wrote:
 Anders Blomdell wrote:
 Jan Kiszka wrote:
 Am 01.11.2010 17:55, Anders Blomdell wrote:
 Jan Kiszka wrote:
 Am 28.10.2010 11:34, Anders Blomdell wrote:
 Jan Kiszka wrote:
 Am 28.10.2010 09:34, Anders Blomdell wrote:
 Anders Blomdell wrote:
 Anders Blomdell wrote:
 Hi,

 I'm trying to use rt_eepro100, for sending raw ethernet packets,
 but I'm
 experincing occasionally weird behaviour.

 Versions of things:

   linux-2.6.34.5
   xenomai-2.5.5.2
   rtnet-39f7fcf

 The testprogram runs on two computers with Intel Corporation
 82557/8/9/0/1 Ethernet Pro 100 (rev 08) controller, where one
 computer
 acts as a mirror sending back packets received from the ethernet
 (only
 those two computers on the network), and the other sends
 packets and
 measures roundtrip time. Most packets comes back in approximately
 100
 us, but occasionally the reception times out (once in about
 10
 packets or more), but the packets gets immediately received when
 reception is retried, which might indicate a race between
 rt_dev_recvmsg
 and interrupt, but I might miss something obvious.
 Changing one of the ethernet cards to a Intel Corporation 82541PI
 Gigabit Ethernet Controller (rev 05), while keeping everything
 else
 constant, changes behavior somewhat; after receiving a few 10
 packets, reception stops entirely (-EAGAIN is returned), while
 transmission proceeds as it should (and mirror returns packets).

 Any suggestions on what to try?
 Since the problem disappears with 'maxcpus=1', I suspect I have
 a SMP
 issue (machine is a Core2 Quad), so I'll move to xenomai-core.
 (original message can be found at
 http://sourceforge.net/mailarchive/message.php?msg_name=4CC82C8D.3080808%40control.lth.se


 )

 Xenomai-core gurus: which is the corrrect way to debug SMP issues?
 Can I run I-pipe-tracer and expect to be able save at least 150
 us of
 traces for all cpus? Any hints/suggestions/insigths are welcome...
 The i-pipe tracer unfortunately only saves traces for a the CPU that
 triggered the freeze. To have a full pictures, you may want to
 try my
 ftrace port I posted recently for 2.6.35.
 2.6.35.7 ?

 Exactly.
 Finally managed to get the ftrace to work
 (one possible bug: had to manually copy
 include/xenomai/trace/xn_nucleus.h to
 include/xenomai/trace/events/xn_nucleus.h), and it looks like it can be
 very useful...

 But I don't think it will give much info at the moment, since no
 xenomai/ipipe interrupt activity shows up, and adding that is far above
 my league :-(

 You could use the function tracer, provided you are able to stop the
 trace quickly enough on error.

 My current theory is that the problem occurs when something like this
 takes place:

   CPU-iCPU-jCPU-kCPU-l

 rt_dev_sendmsg
 xmit_irq
 rt_dev_recvmsgrecv_irq

 Can't follow. When races here, and what will go wrong then?
 Thats the good question. Find attached:

 1. .config (so you can check for stupid mistakes)
 2. console log
 3. latest version of test program
 4. tail of ftrace dump

 These are the xenomai tasks running when the test program is active:

 CPU  PIDCLASS  PRI  TIMEOUT   TIMEBASE   STAT   NAME
   0  0  idle-1  - master R  ROOT/0
   1  0  idle-1  - master R  ROOT/1
   2  0  idle-1  - master R  ROOT/2
   3  0  idle-1  - master R  ROOT/3
   0  0  rt  98  - master W  rtnet-stack
   0  0  rt   0  - master W  rtnet-rtpc
   0  29901  rt  50  - masterraw_test
   0  29906  rt   0  - master X  reporter



 The lines of interest from the trace are probably:

 [003]  2061.347855: xn_nucleus_thread_resume: thread=f9bf7b00   
   thread_name=rtnet-stack mask=2
 [003]  2061.347862: xn_nucleus_sched: status=200
 [000]  2061.347866: xn_nucleus_sched_remote: status=0

 since this is the only place where a packet gets delayed, and the only
 place in the trace where sched_remote reports a status=0
 Since the cpu that has rtnet-stack and hence should be resumed is doing
 heavy I/O at the time of fault; could it be that
 send_ipi/schedule_handler needs barriers to make sure taht decisions are
 made on the right status?

 That was my first idea as well - but we should run all relevant code
 under nklock here. But please correct me if I miss something.
 Wouldn't we need a write-barrier before the send_ipi regardless of what locks 
 we
 hold, otherwise no guarantees that the memory write reaches the target cpu
 before the interrupt does?

Yeah, the problem is that if xnpod_resume_thread and the next
xnpod_reschedule are under the same nklock, we won't issue the barrier
as we 

Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Anders Blomdell

Jan Kiszka wrote:

additional barrier. Can you check this?

diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h
index df56417..66b52ad 100644
--- a/include/nucleus/sched.h
+++ b/include/nucleus/sched.h
@@ -187,6 +187,7 @@ static inline int xnsched_self_resched_p(struct xnsched *sched)
 	if (current_sched != (__sched__)){				\
 		xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched); \
 		setbits((__sched__)->status, XNRESCHED);		\
+		xnarch_memory_barrier();				\
 	}								\
 } while (0)


In progress, if nothing breaks before, I'll report status tomorrow morning.


Mmmh -- not everything. The inlined XNRESCHED entry test in
xnpod_schedule runs outside nklock. But doesn't releasing nklock imply a
memory write barrier? Let me meditate...

Wouldn't we need a read barrier then (but maybe the irq handling takes
care of that, not familiar with the code yet)?


A read barrier is not required here as we do not need to order load
operations w.r.t. each other in the reschedule IRQ handler.

Only if taking the interrupt is equivalent to:

  read interrupt status
  memory_read_barrier
  execute handler

The processor manuals should have the answer to this (or it might already
be in the code)...
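
In code, the question is whether something like the following is needed
at the top of the reschedule path (a sketch only; xnpod_schedule_handler
is the real entry point, but its body is simplified here):

	/* reschedule IPI handler on the target CPU (simplified sketch) */
	void xnpod_schedule_handler(void)
	{
		struct xnsched *sched = xnpod_current_sched();

		/* needed only if taking the IPI does not already imply
		   an implicit read barrier */
		xnarch_memory_barrier();

		/* this load must see the XNRESCHED bit set by the sending
		   CPU before it issued the IPI */
		if (!testbits(sched->status, XNRESCHED))
			return;

		xnpod_schedule();
	}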



You can always help: there is a lot of boring^Winteresting tracepoint
conversion waiting in Xenomai, see the few already converted nucleus
tracepoints.

As soon as I have my system running, I'll put some effort into this.

/Anders




Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Anders Blomdell

Anders Blomdell wrote:

Jan Kiszka wrote:

additional barrier. Can you check this?


In progress, if nothing breaks before, I'll report status tomorrow morning.
It still breaks (in approximately the same way). I'm currently putting a 
barrier in the other macro doing a RESCHED, also adding some tracing to 
see if a read barrier is needed.


Interesting side-note:

Hard disk accesses seem to get really slow after the error has occurred
(a kernel install progresses at 2-3 modules per second), while lots of
idle time is reported on all cpus. Weird...


/Anders



Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Anders Blomdell

Anders Blomdell wrote:

Anders Blomdell wrote:

Jan Kiszka wrote:

additional barrier. Can you check this?

In progress, if nothing breaks before, I'll report status tomorrow 
morning.
It still breaks (in approximately the same way). I'm currently putting a 
barrier in the other macro doing a RESCHED, also adding some tracing to 
see if a read barrier is needed.
Nope, no luck there either. Will start interesting tracepoint 
adding/conversion :-(


Any reason why xn_nucleus_sched_remote should ever report status = 0?

/Anders




Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Jan Kiszka
Am 03.11.2010 17:46, Anders Blomdell wrote:
 Anders Blomdell wrote:
 Anders Blomdell wrote:
 Jan Kiszka wrote:
 additional barrier. Can you check this?


 In progress, if nothing breaks before, I'll report status tomorrow 
 morning.
 It still breaks (in approximately the same way). I'm currently putting a 
 barrier in the other macro doing a RESCHED, also adding some tracing to 
 see if a read barrier is needed.
 Nope, no luck there either. Will start interesting tracepoint 
 adding/conversion :-(

Strange. But it was too easy anyway...

 
 Any reason why xn_nucleus_sched_remote should ever report status = 0?

Really don't know yet. You could trigger on this state and call
ftrace_stop() then. Provided you had the function tracer enabled, that
should give a nice picture of what happened before.
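
Something along these lines, right where the suspicious state is
detected (a sketch; the exact spot, and whether ftrace_stop() is the
right call for your tree, needs checking):

	/* e.g. in xnpod_schedule_handler(), sketch only */
	if (!testbits(sched->status, XNRESCHED)) {
		ftrace_stop();	/* freeze the function trace for post-mortem */
		/* then continue as before */
	}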

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux



Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Anders Blomdell

Jan Kiszka wrote:

Am 03.11.2010 17:46, Anders Blomdell wrote:

Anders Blomdell wrote:

Anders Blomdell wrote:

Jan Kiszka wrote:

additional barrier. Can you check this?

In progress, if nothing breaks before, I'll report status tomorrow 
morning.
It still breaks (in approximately the same way). I'm currently putting a 
barrier in the other macro doing a RESCHED, also adding some tracing to 
see if a read barrier is needed.
Nope, no luck there either. Will start interesting tracepoint 
adding/conversion :-(


Strange. But it was too easy anyway...


Any reason why xn_nucleus_sched_remote should ever report status = 0?


Really don't know yet. You could trigger on this state and call
ftrace_stop() then. Provided you had the functions tracer enabled, that
should give a nice pictures of what happened before.


Isn't there a race between these two (still waiting for compilation to
be finished)?


static inline int __xnpod_test_resched(struct xnsched *sched)
{
	int resched = testbits(sched->status, XNRESCHED);
#ifdef CONFIG_SMP
	/* Send resched IPI to remote CPU(s). */
	if (unlikely(xnsched_resched_p(sched))) {
		xnarch_send_ipi(sched->resched);
		xnarch_cpus_clear(sched->resched);
	}
#endif
	clrbits(sched->status, XNRESCHED);
	return resched;
}

#define xnsched_set_resched(__sched__) do {				\
	xnsched_t *current_sched = xnpod_current_sched();		\
	setbits(current_sched->status, XNRESCHED);			\
	if (current_sched != (__sched__)) {				\
		xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched); \
		setbits((__sched__)->status, XNRESCHED);		\
		xnarch_memory_barrier();				\
	}								\
} while (0)

I would suggest (if I have got all the macros right):

static inline int __xnpod_test_resched(struct xnsched *sched)
{
	int resched = testbits(sched->status, XNRESCHED);
	if (unlikely(resched)) {
#ifdef CONFIG_SMP
		/* Send resched IPI to remote CPU(s). */
		xnarch_send_ipi(sched->resched);
		xnarch_cpus_clear(sched->resched);
#endif
		clrbits(sched->status, XNRESCHED);
	}
	return resched;
}

/Anders




Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Philippe Gerum
On Wed, 2010-11-03 at 20:38 +0100, Anders Blomdell wrote:
 Isn't there a race between these two (still waiting for compilation to
 be finished)?

We always hold the nklock in both contexts.

 

-- 
Philippe.





[Xenomai-core] Is anybody using the pSOS skin in userland?

2010-11-03 Thread ronny meeus
Hello

we are investigating the usage of the pSOS+ skin to port a large legacy pSOS
application to Linux.
The application model consists of several processes in which the application
lives. All processes will make use of the pSOS library.

After playing around with the library for some time we have observed several
missing service calls, bugs and differences in behaviour compared to a real
pSOS implementation:
- missing sm_ident
- missing t_getreg / t_setreg in userland (patch already included in 2.5.5)
- not possible to use skin from the context of different processes (patch
already included in 2.5.5)
- added support for identical task/queue/semaphore/region names by making
names unique.
- strange behaviour in the pSOS message queue (see the post "Possible memory
leak in psos skin message queue handling").

I can (and will) deliver patches for all issues I have found, but I'm
wondering whether there are other people using the pSOS skin (in userland)
in a real-life application. The target for my project would be an embedded
system with strong reliability requirements (very stable / long-running,
etc.).
Any feedback is welcome and appreciated.

It is not clear to me either which tests are executed before a new version
is released. Is there any test-suite available for the pSOS skin?

Best regards,
Ronny


Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Jan Kiszka
Am 03.11.2010 21:41, Philippe Gerum wrote:
 On Wed, 2010-11-03 at 20:38 +0100, Anders Blomdell wrote:
 Isn't there a race between these two (still waiting for compilation to
 be finished)?
 
 We always hold the nklock in both contexts.
 

But we do not always use atomic ops for manipulating status bits (though
we do in other cases where there is no need - different story). This may
fix the race:

diff --git a/ksrc/nucleus/intr.c b/ksrc/nucleus/intr.c
index d7a772f..af8ebeb 100644
--- a/ksrc/nucleus/intr.c
+++ b/ksrc/nucleus/intr.c
@@ -85,7 +85,7 @@ static void xnintr_irq_handler(unsigned irq, void *cookie);
 
 void xnintr_host_tick(struct xnsched *sched) /* Interrupts off. */
 {
-	__clrbits(sched->status, XNHTICK);
+	clrbits(sched->status, XNHTICK);
 	xnarch_relay_tick();
 }
 
@@ -105,11 +105,13 @@ void xnintr_clock_handler(void)
 	trace_mark(xn_nucleus, irq_enter, "irq %u", XNARCH_TIMER_IRQ);
 	trace_mark(xn_nucleus, tbase_tick, "base %s", nktbase.name);
 
+	xnlock_get(&nklock);
+
 	++sched->inesting;
 	__setbits(sched->status, XNINIRQ);
 
-	xnlock_get(&nklock);
 	xntimer_tick_aperiodic();
+
 	xnlock_put(&nklock);
 
 	xnstat_counter_inc(&nkclock.stat[xnsched_cpu(sched)].hits);
@@ -117,7 +119,7 @@ void xnintr_clock_handler(void)
 		      &nkclock.stat[xnsched_cpu(sched)].account, start);
 
 	if (--sched->inesting == 0) {
-		__clrbits(sched->status, XNINIRQ);
+		clrbits(sched->status, XNINIRQ);
 		xnpod_schedule();
 	}
 	/*
@@ -178,7 +180,7 @@ static void xnintr_shirq_handler(unsigned irq, void *cookie)
 	trace_mark(xn_nucleus, irq_enter, "irq %u", irq);
 
 	++sched->inesting;
-	__setbits(sched->status, XNINIRQ);
+	setbits(sched->status, XNINIRQ);
 
 	xnlock_get(&shirq->lock);
 	intr = shirq->handlers;
@@ -220,7 +222,7 @@ static void xnintr_shirq_handler(unsigned irq, void *cookie)
 	xnarch_end_irq(irq);
 
 	if (--sched->inesting == 0) {
-		__clrbits(sched->status, XNINIRQ);
+		clrbits(sched->status, XNINIRQ);
 		xnpod_schedule();
 	}
 
@@ -247,7 +249,7 @@ static void xnintr_edge_shirq_handler(unsigned irq, void *cookie)
 	trace_mark(xn_nucleus, irq_enter, "irq %u", irq);
 
 	++sched->inesting;
-	__setbits(sched->status, XNINIRQ);
+	setbits(sched->status, XNINIRQ);
 
 	xnlock_get(&shirq->lock);
 	intr = shirq->handlers;
@@ -303,7 +305,7 @@ static void xnintr_edge_shirq_handler(unsigned irq, void *cookie)
 	xnarch_end_irq(irq);
 
 	if (--sched->inesting == 0) {
-		__clrbits(sched->status, XNINIRQ);
+		clrbits(sched->status, XNINIRQ);
 		xnpod_schedule();
 	}
 	trace_mark(xn_nucleus, irq_exit, "irq %u", irq);
@@ -446,7 +448,7 @@ static void xnintr_irq_handler(unsigned irq, void *cookie)
 	trace_mark(xn_nucleus, irq_enter, "irq %u", irq);
 
 	++sched->inesting;
-	__setbits(sched->status, XNINIRQ);
+	setbits(sched->status, XNINIRQ);
 
 	xnlock_get(&xnirqs[irq].lock);
 
@@ -493,7 +495,7 @@ static void xnintr_irq_handler(unsigned irq, void *cookie)
 	xnarch_end_irq(irq);
 
 	if (--sched->inesting == 0) {
-		__clrbits(sched->status, XNINIRQ);
+		clrbits(sched->status, XNINIRQ);
 		xnpod_schedule();
 	}
 

Jan




Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Jan Kiszka
Am 03.11.2010 23:03, Jan Kiszka wrote:
 But we do not always use atomic ops for manipulating status bits (though
 we do in other cases where there is no need - different story). This may
 fix the race:

Err, nonsense. As we manipulate xnsched::status also outside of nklock
protection, we must _always_ use atomic ops.

This screams for a cleanup: local-only bits like XNHTICK or XNINIRQ
should be pushed into a separate status word that can then be safely
modified non-atomically.
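
To spell out one way the XNRESCHED bit can get lost with the current mix
(a sketch of a possible interleaving, not taken from a trace):

	/*
	 * CPU 0, IRQ path, no nklock:       CPU 1, under nklock:
	 *
	 * tmp = sched->status;              (plain read half of __setbits)
	 *                                   setbits(sched->status, XNRESCHED);
	 *                                   xnarch_send_ipi(...);
	 * sched->status = tmp | XNINIRQ;    (stale write-back: XNRESCHED is
	 *                                    overwritten and lost)
	 *
	 * The IPI handler on CPU 0 then finds XNRESCHED clear, which would
	 * be consistent with the xn_nucleus_sched_remote: status=0 trace
	 * Anders posted.
	 */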

Jan





Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Jan Kiszka
Am 03.11.2010 23:11, Jan Kiszka wrote:
 Err, nonsense. As we manipulate xnsched::status also outside of nklock
 protection, we must _always_ use atomic ops.

 This screams for a cleanup: local-only bits like XNHTICK or XNINIRQ
 should be pushed into a separate status word that can then be safely
 modified non-atomically.

Second try to fix and clean up the sched status bits. Anders, please
test.

Jan

diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h
index 01ff0a7..5987a1f 100644
--- a/include/nucleus/pod.h
+++ b/include/nucleus/pod.h
@@ -277,12 +277,10 @@ static inline void xnpod_schedule(void)
 	 * context is active, or if we are caught in the middle of a
 	 * unlocked context switch.
 	 */
-#if XENO_DEBUG(NUCLEUS)
 	if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK))
 		return;
-#else /* !XENO_DEBUG(NUCLEUS) */
-	if (testbits(sched->status,
-		     XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED)
+#if !XENO_DEBUG(NUCLEUS)
+	if (!sched->resched)
 		return;
 #endif /* !XENO_DEBUG(NUCLEUS) */
 
diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h
index df56417..1850208 100644
--- a/include/nucleus/sched.h
+++ b/include/nucleus/sched.h
@@ -44,7 +44,6 @@
 #define XNINTCK		0x1000	/* In master tick handler context */
 #define XNINIRQ		0x0800	/* In IRQ handling context */
 #define XNSWLOCK	0x0400	/* In context switch */
-#define XNRESCHED	0x0200	/* Needs rescheduling */
 #define XNHDEFER	0x0100	/* Host tick deferred */
 
 struct xnsched_rt {
@@ -63,7 +62,8 @@ typedef struct xnsched {
 	xnflags_t status;		/*!< Scheduler specific status bitmask. */
 	int cpu;
 	struct xnthread *curr;		/*!< Current thread. */
-	xnarch_cpumask_t resched;	/*!< Mask of CPUs needing rescheduling. */
+	xnarch_cpumask_t remote_resched; /*!< Mask of CPUs needing rescheduling. */
+	int resched;			/*!< Rescheduling needed. */
 
 	struct xnsched_rt rt;		/*!< Context of built-in real-time class. */
 #ifdef CONFIG_XENO_OPT_SCHED_TP
@@ -164,30 +164,21 @@ struct xnsched_class {
 #define xnsched_cpu(__sched__)	({ (void)__sched__; 0; })
 #endif /* CONFIG_SMP */
 
-/* Test all resched flags from the given scheduler mask. */
-static inline int xnsched_resched_p(struct xnsched *sched)
-{
-	return testbits(sched->status, XNRESCHED);
-}
-
-static inline int xnsched_self_resched_p(struct xnsched *sched)
-{
-	return testbits(sched->status, XNRESCHED);
-}
-
 /* Set self resched flag for the given scheduler. */
 #define xnsched_set_self_resched(__sched__) do {		\
-  setbits((__sched__)->status, XNRESCHED);			\
+	(__sched__)->resched = 1;				\
 } while (0)
 
 /* Set specific resched flag into the local scheduler mask. */
 #define xnsched_set_resched(__sched__) do {			\
-  xnsched_t *current_sched = xnpod_current_sched();		\
-  setbits(current_sched->status, XNRESCHED);			\
-  if (current_sched != (__sched__)){				\
-	  xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched); \
-	  setbits((__sched__)->status, XNRESCHED);		\
-  }								\
+	xnsched_t *current_sched = xnpod_current_sched();	\
+	current_sched->resched = 1;				\
+	if (current_sched != (__sched__)) {			\
+		xnarch_cpu_set(xnsched_cpu(__sched__),		\
+			       current_sched->remote_resched);	\
+		(__sched__)->resched = 1;			\
+		xnarch_memory_barrier();			\
+	}							\
 } while (0)
 
 void xnsched_zombie_hooks(struct xnthread *thread);
@@ -209,7 +200,7 @@ struct xnsched *xnsched_finish_unlocked_switch(struct xnsched *sched);
 static inline
 int xnsched_maybe_resched_after_unlocked_switch(struct xnsched *sched)
 {
-	return testbits(sched->status, XNRESCHED);
+	return sched->resched;
 }
 
 #else /* !CONFIG_XENO_HW_UNLOCKED_SWITCH */
diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c
index 9e135f3..f7f8b2c 100644
--- a/ksrc/nucleus/pod.c
+++ b/ksrc/nucleus/pod.c
@@ -284,7 +284,7 @@ void xnpod_schedule_handler(void) /* Called with hw interrupts off. */
 	trace_xn_nucleus_sched_remote(sched);
 #if defined(CONFIG_SMP) && defined(CONFIG_XENO_OPT_PRIOCPL)
 	if 

Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Gilles Chanteperdrix
Jan Kiszka wrote:
 Second try to fix and clean up the sched status bits. Anders, please
 test.
 
 Jan
 
 diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h
 index 01ff0a7..5987a1f 100644
 --- a/include/nucleus/pod.h
 +++ b/include/nucleus/pod.h
 @@ -277,12 +277,10 @@ static inline void xnpod_schedule(void)
 	 * context is active, or if we are caught in the middle of a
 	 * unlocked context switch.
 	 */
 -#if XENO_DEBUG(NUCLEUS)
 	if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK))
 		return;
 -#else /* !XENO_DEBUG(NUCLEUS) */
 -	if (testbits(sched->status,
 -		     XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED)
 +#if !XENO_DEBUG(NUCLEUS)
 +	if (!sched->resched)
 		return;
  #endif /* !XENO_DEBUG(NUCLEUS) */

Having only one test was really nice here, maybe we simply add a
barrier before reading the status?

-- 
Gilles.



Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Jan Kiszka
Am 04.11.2010 00:11, Gilles Chanteperdrix wrote:
 Having only one test was really nice here, maybe we simply add a
 barrier before reading the status?
 

I agree - but the alternative is letting all modifications of
xnsched::status use atomic bitops (that's required when folding all bits
into a single word). And that should be much more costly, specifically
on SMP.
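
In other words, the choice is roughly between the two flavours (a sketch;
assuming setbits/clrbits map to atomic RMW operations and the __-prefixed
variants to plain ones):

	/* non-atomic, only safe for bits owned by one CPU / one context */
	__setbits(sched->status, XNINIRQ);

	/* atomic, required once another CPU may modify the same word,
	   e.g. a locked RMW on x86 - noticeably more expensive */
	setbits(sched->status, XNRESCHED);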

Jan





Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Gilles Chanteperdrix
Jan Kiszka wrote:
 I agree - but the alternative is letting all modifications of
 xnsched::status use atomic bitops (that's required when folding all bits
 into a single word). And that should be much more costly, specifically
 on SMP.

What about issuing a barrier before testing the status?

-- 
Gilles.



Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Jan Kiszka
Am 04.11.2010 00:18, Gilles Chanteperdrix wrote:
 I agree - but the alternative is letting all modifications of
 xnsched::status use atomic bitops (that's required when folding all bits
 into a single word). And that should be much more costly, specifically
 on SMP.

 What about issuing a barrier before testing the status?
 

The problem is not about reading but writing the status concurrently,
thus it's not about the code you see above.

Jan





Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Gilles Chanteperdrix
Jan Kiszka wrote:
 What about issuing a barrier before testing the status?

 The problem is not about reading but writing the status concurrently,
 thus it's not about the code you see above.

The bits are modified under nklock, which implies a barrier when
unlocked. Furthermore, an IPI is guaranteed to be received on the remote
CPU after this barrier, so a barrier should be enough to see the
modifications which have been made remotely.


-- 
Gilles.



Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Jan Kiszka
Am 04.11.2010 00:44, Gilles Chanteperdrix wrote:
 Jan Kiszka wrote:
 Am 04.11.2010 00:18, Gilles Chanteperdrix wrote:
 Jan Kiszka wrote:
 Am 04.11.2010 00:11, Gilles Chanteperdrix wrote:
 Jan Kiszka wrote:
 Am 03.11.2010 23:11, Jan Kiszka wrote:
 Am 03.11.2010 23:03, Jan Kiszka wrote:
 But we not not always use atomic ops for manipulating status bits (but
 we do in other cases where this is no need - different story). This may
 fix the race:
 Err, nonsense. As we manipulate xnsched::status also outside of nklock
 protection, we must _always_ use atomic ops.

 This screams for a cleanup: local-only bits like XNHTICK or XNINIRQ
 should be pushed in a separate status word that can then be safely
 modified non-atomically.
 Second try to fix and clean up the sched status bits. Anders, please
 test.

 Jan

 diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h
 index 01ff0a7..5987a1f 100644
 --- a/include/nucleus/pod.h
 +++ b/include/nucleus/pod.h
 @@ -277,12 +277,10 @@ static inline void xnpod_schedule(void)
   * context is active, or if we are caught in the middle of a
   * unlocked context switch.
   */
 -#if XENO_DEBUG(NUCLEUS)
  if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK))
  return;
 -#else /* !XENO_DEBUG(NUCLEUS) */
 -if (testbits(sched->status,
 - XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED)
 +#if !XENO_DEBUG(NUCLEUS)
 +if (!sched->resched)
  return;
  #endif /* !XENO_DEBUG(NUCLEUS) */
 Having only one test was really nice here, maybe we simply need a
 barrier before reading the status?

 I agree - but the alternative is letting all modifications of
 xnsched::status use atomic bitops (that's required when folding all bits
 into a single word). And that should be much more costly, specifically
 on SMP.
 What about issuing a barrier before testing the status?


 The problem is not about reading but writing the status concurrently,
 thus it's not about the code you see above.
 
 The bits are modified under nklock, which implies a barrier when
 unlocked. Furthermore, an IPI is guaranteed to be received on the remote
 CPU after this barrier, so, a barrier should be enough to see the
 modifications which have been made remotely.

Check nucleus/intr.c for tons of unprotected status modifications.
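
To make the objection concrete, a sketch of the kind of cleanup the patch aims
at (field and helper names are assumptions, not the committed change): bits
that other CPUs may touch stay in a word protected by nklock or atomic ops,
while bits only ever modified by the owning CPU, such as the XNINIRQ
bookkeeping done from the interrupt entry/exit paths in nucleus/intr.c, move
to a separate word that plain ops may touch safely:

struct xnsched {
	unsigned long status;	/* cross-CPU bits (XNRESCHED, ...): nklock/atomic only */
	unsigned long lstatus;	/* local-only bits (XNINIRQ, XNHTICK, ...): plain ops  */
	/* ... */
};

/* nucleus/intr.c style bookkeeping, always on the owning CPU, no lock: */
static inline void sched_irq_enter(struct xnsched *sched)
{
	__setbits(sched->lstatus, XNINIRQ);	/* cannot collide with remote writers */
}

static inline void sched_irq_exit(struct xnsched *sched)
{
	__clrbits(sched->lstatus, XNINIRQ);
}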

Jan





Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Gilles Chanteperdrix
Jan Kiszka wrote:
 Am 04.11.2010 00:44, Gilles Chanteperdrix wrote:
 Jan Kiszka wrote:
 Am 04.11.2010 00:18, Gilles Chanteperdrix wrote:
 Jan Kiszka wrote:
 Am 04.11.2010 00:11, Gilles Chanteperdrix wrote:
 Jan Kiszka wrote:
 Am 03.11.2010 23:11, Jan Kiszka wrote:
 Am 03.11.2010 23:03, Jan Kiszka wrote:
 But we do not always use atomic ops for manipulating status bits (but
 we do in other cases where there is no need - different story). This
 may fix the race:
 Err, nonsense. As we manipulate xnsched::status also outside of nklock
 protection, we must _always_ use atomic ops.

 This screams for a cleanup: local-only bits like XNHTICK or XNINIRQ
 should be pushed in a separate status word that can then be safely
 modified non-atomically.
 Second try to fix and clean up the sched status bits. Anders, please
 test.

 Jan

 diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h
 index 01ff0a7..5987a1f 100644
 --- a/include/nucleus/pod.h
 +++ b/include/nucleus/pod.h
 @@ -277,12 +277,10 @@ static inline void xnpod_schedule(void)
  * context is active, or if we are caught in the middle of a
  * unlocked context switch.
  */
 -#if XENO_DEBUG(NUCLEUS)
 if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK))
 return;
 -#else /* !XENO_DEBUG(NUCLEUS) */
 -   if (testbits(sched->status,
 -XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED)
 +#if !XENO_DEBUG(NUCLEUS)
 +   if (!sched->resched)
 return;
  #endif /* !XENO_DEBUG(NUCLEUS) */
 Having only one test was really nice here, maybe we simply need a
 barrier before reading the status?

 I agree - but the alternative is letting all modifications of
 xnsched::status use atomic bitops (that's required when folding all bits
 into a single word). And that should be much more costly, specifically
 on SMP.
 What about issuing a barrier before testing the status?

 The problem is not about reading but writing the status concurrently,
 thus it's not about the code you see above.
 The bits are modified under nklock, which implies a barrier when
 unlocked. Furthermore, an IPI is guaranteed to be received on the remote
 CPU after this barrier, so, a barrier should be enough to see the
 modifications which have been made remotely.
 
 Check nucleus/intr.c for tons of unprotected status modifications.

Ok. Then maybe, we should reconsider the original decision to start
fiddling with the XNRESCHED bit remotely.
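
What reverting to such a scheme could look like, roughly (function names are
placeholders): the requesting CPU never writes the remote xnsched::status; it
only records the target in a mask of its own scheduler and sends an IPI, and
the target CPU sets XNRESCHED on its own status from the IPI handler:

/* Requesting CPU: only local state is written, then the target is kicked. */
static void request_resched(struct xnsched *local_sched, int target_cpu)
{
	xnarch_cpu_set(target_cpu, local_sched->resched);	/* local cpumask */
	send_resched_ipi(target_cpu);				/* placeholder   */
}

/* Target CPU, rescheduling IPI handler: the owning CPU is the only writer
 * of its own status word, so no atomic op or barrier is needed here. */
static void resched_ipi_handler(void)
{
	struct xnsched *sched = xnpod_current_sched();

	__setbits(sched->status, XNRESCHED);
	xnpod_schedule();
}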

-- 
Gilles.



Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Jan Kiszka
Am 04.11.2010 00:56, Gilles Chanteperdrix wrote:
 Jan Kiszka wrote:
 Am 04.11.2010 00:44, Gilles Chanteperdrix wrote:
 Jan Kiszka wrote:
 Am 04.11.2010 00:18, Gilles Chanteperdrix wrote:
 Jan Kiszka wrote:
 Am 04.11.2010 00:11, Gilles Chanteperdrix wrote:
 Jan Kiszka wrote:
 Am 03.11.2010 23:11, Jan Kiszka wrote:
 Am 03.11.2010 23:03, Jan Kiszka wrote:
 But we do not always use atomic ops for manipulating status bits (but
 we do in other cases where there is no need - different story). This
 may fix the race:
 Err, nonsense. As we manipulate xnsched::status also outside of nklock
 protection, we must _always_ use atomic ops.

 This screams for a cleanup: local-only bits like XNHTICK or XNINIRQ
 should be pushed in a separate status word that can then be safely
 modified non-atomically.
 Second try to fix and clean up the sched status bits. Anders, please
 test.

 Jan

 diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h
 index 01ff0a7..5987a1f 100644
 --- a/include/nucleus/pod.h
 +++ b/include/nucleus/pod.h
 @@ -277,12 +277,10 @@ static inline void xnpod_schedule(void)
 * context is active, or if we are caught in the middle of a
 * unlocked context switch.
 */
 -#if XENO_DEBUG(NUCLEUS)
 if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK))
 return;
 -#else /* !XENO_DEBUG(NUCLEUS) */
 -  if (testbits(sched->status,
 -   XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED)
 +#if !XENO_DEBUG(NUCLEUS)
 +  if (!sched->resched)
return;
  #endif /* !XENO_DEBUG(NUCLEUS) */
 Having only one test was really nice here, maybe we simply need a
 barrier before reading the status?

 I agree - but the alternative is letting all modifications of
 xnsched::status use atomic bitops (that's required when folding all bits
 into a single word). And that should be much more costly, specifically
 on SMP.
 What about issuing a barrier before testing the status?

 The problem is not about reading but writing the status concurrently,
 thus it's not about the code you see above.
 The bits are modified under nklock, which implies a barrier when
 unlocked. Furthermore, an IPI is guaranteed to be received on the remote
 CPU after this barrier, so, a barrier should be enough to see the
 modifications which have been made remotely.

 Check nucleus/intr.c for tons of unprotected status modifications.
 
 Ok. Then maybe, we should reconsider the original decision to start
 fiddling with the XNRESCHED bit remotely.

...which removed complexity and fixed a race? Let's instead review the
checks done in xnpod_schedule vs. its callers; I bet there is more to
save (IOW: remove the need to test for sched->resched).
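
A rough sketch of where that review could lead, under the assumption (not
established in this thread) that callers only invoke xnpod_schedule() when a
reschedule was actually requested, so the inline fast path no longer needs a
resched flag at all:

/* Purely illustrative: the non-debug fast path would collapse to the same
 * nested-context filter the debug build already uses. */
static inline void xnpod_schedule(void)
{
	struct xnsched *sched = xnpod_current_sched();

	/* Bail out only where context switching is not allowed right now. */
	if (testbits(sched->status, XNKCOUT | XNINIRQ | XNSWLOCK))
		return;

	__xnpod_schedule(sched);	/* the slow path re-checks under nklock */
}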

Jan





Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-03 Thread Gilles Chanteperdrix
Jan Kiszka wrote:
 Am 04.11.2010 00:56, Gilles Chanteperdrix wrote:
 Jan Kiszka wrote:
 Am 04.11.2010 00:44, Gilles Chanteperdrix wrote:
 Jan Kiszka wrote:
 Am 04.11.2010 00:18, Gilles Chanteperdrix wrote:
 Jan Kiszka wrote:
 Am 04.11.2010 00:11, Gilles Chanteperdrix wrote:
 Jan Kiszka wrote:
 Am 03.11.2010 23:11, Jan Kiszka wrote:
 Am 03.11.2010 23:03, Jan Kiszka wrote:
 But we do not always use atomic ops for manipulating status bits (but
 we do in other cases where there is no need - different story). This
 may fix the race:
 Err, nonsense. As we manipulate xnsched::status also outside of 
 nklock
 protection, we must _always_ use atomic ops.

 This screams for a cleanup: local-only bits like XNHTICK or XNINIRQ
 should be pushed in a separate status word that can then be safely
 modified non-atomically.
 Second try to fix and clean up the sched status bits. Anders, please
 test.

 Jan

 diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h
 index 01ff0a7..5987a1f 100644
 --- a/include/nucleus/pod.h
 +++ b/include/nucleus/pod.h
 @@ -277,12 +277,10 @@ static inline void xnpod_schedule(void)
* context is active, or if we are caught in the middle of a
* unlocked context switch.
*/
 -#if XENO_DEBUG(NUCLEUS)
   if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK))
   return;
 -#else /* !XENO_DEBUG(NUCLEUS) */
 - if (testbits(sched->status,
 -  XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED)
 +#if !XENO_DEBUG(NUCLEUS)
 + if (!sched->resched)
   return;
  #endif /* !XENO_DEBUG(NUCLEUS) */
 Having only one test was really nice here, maybe we simply need a
 barrier before reading the status?

 I agree - but the alternative is letting all modifications of
 xnsched::status use atomic bitops (that's required when folding all bits
 into a single word). And that should be much more costly, specifically
 on SMP.
 What about issuing a barrier before testing the status?

 The problem is not about reading but writing the status concurrently,
 thus it's not about the code you see above.
 The bits are modified under nklock, which implies a barrier when
 unlocked. Furthermore, an IPI is guaranteed to be received on the remote
 CPU after this barrier, so, a barrier should be enough to see the
 modifications which have been made remotely.
 Check nucleus/intr.c for tons of unprotected status modifications.
 Ok. Then maybe, we should reconsider the original decision to start
 fiddling with the XNRESCHED bit remotely.
 
 ...which removed complexity and fixed a race? Let's instead review the
 checks done in xnpod_schedule vs. its callers; I bet there is more to
 save (IOW: remove the need to test for sched->resched).

Not that much complexity... and the race was a false positive in debug
code, no big deal. At least it worked, and it has done so for a long
time. No atomic needed, no barrier, only one test in xnpod_schedule. And
a nice invariant: sched->status is always accessed on the local cpu.
What else?
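
For reference, a sketch of why that invariant makes the old single, plain test
sufficient (this restates the argument above, it is not the actual code): if
XNRESCHED is only ever set by the owning CPU, with remote requests arriving
via IPI and applied locally, then xnpod_schedule() can combine everything in
one unsynchronized read:

/* All writers of sched->status run on this CPU, so a plain load can never
 * race with a remote writer: at worst a just-sent IPI has not arrived yet,
 * and the reschedule simply happens when it does. */
static inline int resched_needed(struct xnsched *sched)
{
	return testbits(sched->status,
			XNKCOUT | XNINIRQ | XNSWLOCK | XNRESCHED) == XNRESCHED;
}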


-- 
Gilles.
