Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-06 Thread Anders Blomdell

Gilles Chanteperdrix wrote:

Anders Blomdell wrote:

Gilles Chanteperdrix wrote:

Jan Kiszka wrote:

Am 05.11.2010 00:24, Gilles Chanteperdrix wrote:

Jan Kiszka wrote:

Am 04.11.2010 23:06, Gilles Chanteperdrix wrote:

Jan Kiszka wrote:

At first sight, here you are more breaking things than cleaning them.
Still, it has the SMP record for my test program, still runs with ftrace 
on (after 2 hours, where it previously failed after maximum 23 minutes).

My version was indeed still buggy, I'm reworking it ATM.

If I get the gist of Jan's changes, they are (using the IPI to transfer 
one bit of information: your cpu needs to reschedule):


xnsched_set_resched:
-  setbits((__sched__)-status, XNRESCHED);

xnpod_schedule_handler:
+   xnsched_set_resched(sched);

If you (we?) decide to keep the debug checks, under what circumstances 
would the current check trigger (in laymans language, that I'll be able 
to understand)?

That's actually what /me is wondering as well. I do not see yet how you
can reliably detect a missed reschedule reliably (that was the purpose
of the debug check) given the racy nature between signaling resched and
processing the resched hints.

The purpose of the debugging change is to detect a change of the
scheduler state which was not followed by setting the XNRESCHED bit.

But that is nucleus business, nothing skins can screw up (as long as
they do not misuse APIs).

Yes, but it happens that we modify the nucleus from time to time.


Getting it to work is relatively simple: we add a scheduler change set
remotely bit to the sched structure which is NOT in the status bit, set
this bit when changing a remote sched (under nklock). In the debug check
code, if the scheduler state changed, and the XNRESCHED bit is not set,
only consider this a but if this new bit is not set. All this is
compiled out if the debug is not enabled.

I still see no benefit in this check. Where to you want to place the bit
set? Aren't that just the same locations where
xnsched_set_[self_]resched already is today?

Well no, that would be another bit in the sched structure which would
allow us to manipulate the status bits from the local cpu. That
supplementary bit would only be changed from a distant CPU, and serve to
detect the race which causes the false positive. The resched bits are
set on the local cpu to get xnpod_schedule to trigger a rescheduling on
the distance cpu. That bit would be set on the remote cpu's sched. Only
when debugging is enabled.


But maybe you can provide some motivating bug scenarios, real ones of
the past or realistic ones of the future.

Of course. The bug is anything which changes the scheduler state but
does not set the XNRESCHED bit. This happened when we started the SMP
port. New scheduling policies would be good candidates for a revival of
this bug.


You don't gain any worthwhile check if you cannot make the
instrumentation required for a stable detection simpler than the proper
problem solution itself. And this is what I'm still skeptical of.
The solution is simple, but finding the problem without the 
instrumentation is way harder than with the instrumentation, so the 
instrumentation is worth something.


Reproducing the false positive is surprisingly easy with a simple
dual-cpu semaphore ping-pong test. So, here is the (tested) patch, 
using a ridiculous long variable name to illustrate what I was 
thinking about:


diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h
index cf4..454b8e8 100644
--- a/include/nucleus/sched.h
+++ b/include/nucleus/sched.h
@@ -108,6 +108,9 @@ typedef struct xnsched {
struct xnthread *gktarget;
 #endif

+#ifdef CONFIG_XENO_OPT_DEBUG_NUCLEUS
+   int debug_resched_from_remote;
+#endif
 } xnsched_t;

 union xnsched_policy_param;
@@ -185,6 +188,8 @@ static inline int xnsched_resched_p(struct xnsched *sched)
   xnsched_t *current_sched = xnpod_current_sched();\
   __setbits(current_sched-status, XNRESCHED); \
   if (current_sched != (__sched__)){   \
+ if (XENO_DEBUG(NUCLEUS))  \
+ __sched__-debug_resched_from_remote = 1; \
   xnarch_cpu_set(xnsched_cpu(__sched__), current_sched-resched);  \
   }\
 } while (0)
diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c
index 4cb707a..50b0f49 100644
--- a/ksrc/nucleus/pod.c
+++ b/ksrc/nucleus/pod.c
@@ -2177,6 +2177,10 @@ static inline int __xnpod_test_resched(struct xnsched 
*sched)
xnarch_cpus_clear(sched-resched);
}
 #endif
+   if (XENO_DEBUG(NUCLEUS)  sched-debug_resched_from_remote) {
+   sched-debug_resched_from_remote = 0;
+   resched = 1;
+   }
clrbits(sched-status, XNRESCHED);
return resched;
 }


I am still uncertain.
Will only work if all is done under nklock, otherwise two 

Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-06 Thread Gilles Chanteperdrix
Anders Blomdell wrote:
 Gilles Chanteperdrix wrote:
 Anders Blomdell wrote:
 Gilles Chanteperdrix wrote:
 Jan Kiszka wrote:
 Am 05.11.2010 00:24, Gilles Chanteperdrix wrote:
 Jan Kiszka wrote:
 Am 04.11.2010 23:06, Gilles Chanteperdrix wrote:
 Jan Kiszka wrote:
 At first sight, here you are more breaking things than cleaning 
 them.
 Still, it has the SMP record for my test program, still runs with 
 ftrace 
 on (after 2 hours, where it previously failed after maximum 23 
 minutes).
 My version was indeed still buggy, I'm reworking it ATM.

 If I get the gist of Jan's changes, they are (using the IPI to 
 transfer 
 one bit of information: your cpu needs to reschedule):

 xnsched_set_resched:
 -  setbits((__sched__)-status, XNRESCHED);

 xnpod_schedule_handler:
 +xnsched_set_resched(sched);
  
 If you (we?) decide to keep the debug checks, under what 
 circumstances 
 would the current check trigger (in laymans language, that I'll be 
 able 
 to understand)?
 That's actually what /me is wondering as well. I do not see yet how 
 you
 can reliably detect a missed reschedule reliably (that was the purpose
 of the debug check) given the racy nature between signaling resched 
 and
 processing the resched hints.
 The purpose of the debugging change is to detect a change of the
 scheduler state which was not followed by setting the XNRESCHED bit.
 But that is nucleus business, nothing skins can screw up (as long as
 they do not misuse APIs).
 Yes, but it happens that we modify the nucleus from time to time.

 Getting it to work is relatively simple: we add a scheduler change set
 remotely bit to the sched structure which is NOT in the status bit, 
 set
 this bit when changing a remote sched (under nklock). In the debug 
 check
 code, if the scheduler state changed, and the XNRESCHED bit is not set,
 only consider this a but if this new bit is not set. All this is
 compiled out if the debug is not enabled.
 I still see no benefit in this check. Where to you want to place the bit
 set? Aren't that just the same locations where
 xnsched_set_[self_]resched already is today?
 Well no, that would be another bit in the sched structure which would
 allow us to manipulate the status bits from the local cpu. That
 supplementary bit would only be changed from a distant CPU, and serve to
 detect the race which causes the false positive. The resched bits are
 set on the local cpu to get xnpod_schedule to trigger a rescheduling on
 the distance cpu. That bit would be set on the remote cpu's sched. Only
 when debugging is enabled.

 But maybe you can provide some motivating bug scenarios, real ones of
 the past or realistic ones of the future.
 Of course. The bug is anything which changes the scheduler state but
 does not set the XNRESCHED bit. This happened when we started the SMP
 port. New scheduling policies would be good candidates for a revival of
 this bug.

 You don't gain any worthwhile check if you cannot make the
 instrumentation required for a stable detection simpler than the proper
 problem solution itself. And this is what I'm still skeptical of.
 The solution is simple, but finding the problem without the 
 instrumentation is way harder than with the instrumentation, so the 
 instrumentation is worth something.

 Reproducing the false positive is surprisingly easy with a simple
 dual-cpu semaphore ping-pong test. So, here is the (tested) patch, 
 using a ridiculous long variable name to illustrate what I was 
 thinking about:

 diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h
 index cf4..454b8e8 100644
 --- a/include/nucleus/sched.h
 +++ b/include/nucleus/sched.h
 @@ -108,6 +108,9 @@ typedef struct xnsched {
 struct xnthread *gktarget;
  #endif

 +#ifdef CONFIG_XENO_OPT_DEBUG_NUCLEUS
 +   int debug_resched_from_remote;
 +#endif
  } xnsched_t;

  union xnsched_policy_param;
 @@ -185,6 +188,8 @@ static inline int xnsched_resched_p(struct xnsched 
 *sched)
xnsched_t *current_sched = xnpod_current_sched();\
__setbits(current_sched-status, XNRESCHED); \
if (current_sched != (__sched__)){   \
 + if (XENO_DEBUG(NUCLEUS))  \
 + __sched__-debug_resched_from_remote = 1; \
xnarch_cpu_set(xnsched_cpu(__sched__), current_sched-resched);  \
}\
  } while (0)
 diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c
 index 4cb707a..50b0f49 100644
 --- a/ksrc/nucleus/pod.c
 +++ b/ksrc/nucleus/pod.c
 @@ -2177,6 +2177,10 @@ static inline int __xnpod_test_resched(struct 
 xnsched *sched)
 xnarch_cpus_clear(sched-resched);
 }
  #endif
 +   if (XENO_DEBUG(NUCLEUS)  sched-debug_resched_from_remote) {
 +   sched-debug_resched_from_remote = 0;
 +   resched = 1;
 +   }
 clrbits(sched-status, 

Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-06 Thread Philippe Gerum
On Sat, 2010-11-06 at 21:37 +0100, Gilles Chanteperdrix wrote:
 Anders Blomdell wrote:
  Gilles Chanteperdrix wrote:
  Anders Blomdell wrote:
  Gilles Chanteperdrix wrote:
  Jan Kiszka wrote:
  Am 05.11.2010 00:24, Gilles Chanteperdrix wrote:
  Jan Kiszka wrote:
  Am 04.11.2010 23:06, Gilles Chanteperdrix wrote:
  Jan Kiszka wrote:
  At first sight, here you are more breaking things than cleaning 
  them.
  Still, it has the SMP record for my test program, still runs with 
  ftrace 
  on (after 2 hours, where it previously failed after maximum 23 
  minutes).
  My version was indeed still buggy, I'm reworking it ATM.
 
  If I get the gist of Jan's changes, they are (using the IPI to 
  transfer 
  one bit of information: your cpu needs to reschedule):
 
  xnsched_set_resched:
  -  setbits((__sched__)-status, XNRESCHED);
 
  xnpod_schedule_handler:
  +  xnsched_set_resched(sched);
 
  If you (we?) decide to keep the debug checks, under what 
  circumstances 
  would the current check trigger (in laymans language, that I'll be 
  able 
  to understand)?
  That's actually what /me is wondering as well. I do not see yet how 
  you
  can reliably detect a missed reschedule reliably (that was the 
  purpose
  of the debug check) given the racy nature between signaling resched 
  and
  processing the resched hints.
  The purpose of the debugging change is to detect a change of the
  scheduler state which was not followed by setting the XNRESCHED bit.
  But that is nucleus business, nothing skins can screw up (as long as
  they do not misuse APIs).
  Yes, but it happens that we modify the nucleus from time to time.
 
  Getting it to work is relatively simple: we add a scheduler change 
  set
  remotely bit to the sched structure which is NOT in the status bit, 
  set
  this bit when changing a remote sched (under nklock). In the debug 
  check
  code, if the scheduler state changed, and the XNRESCHED bit is not 
  set,
  only consider this a but if this new bit is not set. All this is
  compiled out if the debug is not enabled.
  I still see no benefit in this check. Where to you want to place the 
  bit
  set? Aren't that just the same locations where
  xnsched_set_[self_]resched already is today?
  Well no, that would be another bit in the sched structure which would
  allow us to manipulate the status bits from the local cpu. That
  supplementary bit would only be changed from a distant CPU, and serve 
  to
  detect the race which causes the false positive. The resched bits are
  set on the local cpu to get xnpod_schedule to trigger a rescheduling on
  the distance cpu. That bit would be set on the remote cpu's sched. Only
  when debugging is enabled.
 
  But maybe you can provide some motivating bug scenarios, real ones of
  the past or realistic ones of the future.
  Of course. The bug is anything which changes the scheduler state but
  does not set the XNRESCHED bit. This happened when we started the SMP
  port. New scheduling policies would be good candidates for a revival of
  this bug.
 
  You don't gain any worthwhile check if you cannot make the
  instrumentation required for a stable detection simpler than the proper
  problem solution itself. And this is what I'm still skeptical of.
  The solution is simple, but finding the problem without the 
  instrumentation is way harder than with the instrumentation, so the 
  instrumentation is worth something.
 
  Reproducing the false positive is surprisingly easy with a simple
  dual-cpu semaphore ping-pong test. So, here is the (tested) patch, 
  using a ridiculous long variable name to illustrate what I was 
  thinking about:
 
  diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h
  index cf4..454b8e8 100644
  --- a/include/nucleus/sched.h
  +++ b/include/nucleus/sched.h
  @@ -108,6 +108,9 @@ typedef struct xnsched {
  struct xnthread *gktarget;
   #endif
 
  +#ifdef CONFIG_XENO_OPT_DEBUG_NUCLEUS
  +   int debug_resched_from_remote;
  +#endif
   } xnsched_t;
 
   union xnsched_policy_param;
  @@ -185,6 +188,8 @@ static inline int xnsched_resched_p(struct xnsched 
  *sched)
 xnsched_t *current_sched = xnpod_current_sched();\
 __setbits(current_sched-status, XNRESCHED); \
 if (current_sched != (__sched__)){   \
  + if (XENO_DEBUG(NUCLEUS))  \
  + __sched__-debug_resched_from_remote = 1; \
 xnarch_cpu_set(xnsched_cpu(__sched__), current_sched-resched);  \
 }\
   } while (0)
  diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c
  index 4cb707a..50b0f49 100644
  --- a/ksrc/nucleus/pod.c
  +++ b/ksrc/nucleus/pod.c
  @@ -2177,6 +2177,10 @@ static inline int __xnpod_test_resched(struct 
  xnsched *sched)
  xnarch_cpus_clear(sched-resched);
  }
   #endif
  +   if 

Re: [Xenomai-core] Potential problem with rt_eepro100

2010-11-06 Thread Jan Kiszka
Am 06.11.2010 23:49, Philippe Gerum wrote:
 On Sat, 2010-11-06 at 21:37 +0100, Gilles Chanteperdrix wrote:
 Anders Blomdell wrote:
 Gilles Chanteperdrix wrote:
 Anders Blomdell wrote:
 Gilles Chanteperdrix wrote:
 Jan Kiszka wrote:
 Am 05.11.2010 00:24, Gilles Chanteperdrix wrote:
 Jan Kiszka wrote:
 Am 04.11.2010 23:06, Gilles Chanteperdrix wrote:
 Jan Kiszka wrote:
 At first sight, here you are more breaking things than cleaning 
 them.
 Still, it has the SMP record for my test program, still runs with 
 ftrace 
 on (after 2 hours, where it previously failed after maximum 23 
 minutes).
 My version was indeed still buggy, I'm reworking it ATM.

 If I get the gist of Jan's changes, they are (using the IPI to 
 transfer 
 one bit of information: your cpu needs to reschedule):

 xnsched_set_resched:
 -  setbits((__sched__)-status, XNRESCHED);

 xnpod_schedule_handler:
 +  xnsched_set_resched(sched);

 If you (we?) decide to keep the debug checks, under what 
 circumstances 
 would the current check trigger (in laymans language, that I'll be 
 able 
 to understand)?
 That's actually what /me is wondering as well. I do not see yet how 
 you
 can reliably detect a missed reschedule reliably (that was the 
 purpose
 of the debug check) given the racy nature between signaling resched 
 and
 processing the resched hints.
 The purpose of the debugging change is to detect a change of the
 scheduler state which was not followed by setting the XNRESCHED bit.
 But that is nucleus business, nothing skins can screw up (as long as
 they do not misuse APIs).
 Yes, but it happens that we modify the nucleus from time to time.

 Getting it to work is relatively simple: we add a scheduler change 
 set
 remotely bit to the sched structure which is NOT in the status bit, 
 set
 this bit when changing a remote sched (under nklock). In the debug 
 check
 code, if the scheduler state changed, and the XNRESCHED bit is not 
 set,
 only consider this a but if this new bit is not set. All this is
 compiled out if the debug is not enabled.
 I still see no benefit in this check. Where to you want to place the 
 bit
 set? Aren't that just the same locations where
 xnsched_set_[self_]resched already is today?
 Well no, that would be another bit in the sched structure which would
 allow us to manipulate the status bits from the local cpu. That
 supplementary bit would only be changed from a distant CPU, and serve 
 to
 detect the race which causes the false positive. The resched bits are
 set on the local cpu to get xnpod_schedule to trigger a rescheduling on
 the distance cpu. That bit would be set on the remote cpu's sched. Only
 when debugging is enabled.

 But maybe you can provide some motivating bug scenarios, real ones of
 the past or realistic ones of the future.
 Of course. The bug is anything which changes the scheduler state but
 does not set the XNRESCHED bit. This happened when we started the SMP
 port. New scheduling policies would be good candidates for a revival of
 this bug.

 You don't gain any worthwhile check if you cannot make the
 instrumentation required for a stable detection simpler than the proper
 problem solution itself. And this is what I'm still skeptical of.
 The solution is simple, but finding the problem without the 
 instrumentation is way harder than with the instrumentation, so the 
 instrumentation is worth something.

 Reproducing the false positive is surprisingly easy with a simple
 dual-cpu semaphore ping-pong test. So, here is the (tested) patch, 
 using a ridiculous long variable name to illustrate what I was 
 thinking about:

 diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h
 index cf4..454b8e8 100644
 --- a/include/nucleus/sched.h
 +++ b/include/nucleus/sched.h
 @@ -108,6 +108,9 @@ typedef struct xnsched {
 struct xnthread *gktarget;
  #endif

 +#ifdef CONFIG_XENO_OPT_DEBUG_NUCLEUS
 +   int debug_resched_from_remote;
 +#endif
  } xnsched_t;

  union xnsched_policy_param;
 @@ -185,6 +188,8 @@ static inline int xnsched_resched_p(struct xnsched 
 *sched)
xnsched_t *current_sched = xnpod_current_sched();\
__setbits(current_sched-status, XNRESCHED); \
if (current_sched != (__sched__)){   \
 + if (XENO_DEBUG(NUCLEUS))  \
 + __sched__-debug_resched_from_remote = 1; \
xnarch_cpu_set(xnsched_cpu(__sched__), current_sched-resched);  \
}\
  } while (0)
 diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c
 index 4cb707a..50b0f49 100644
 --- a/ksrc/nucleus/pod.c
 +++ b/ksrc/nucleus/pod.c
 @@ -2177,6 +2177,10 @@ static inline int __xnpod_test_resched(struct 
 xnsched *sched)
 xnarch_cpus_clear(sched-resched);
 }
  #endif
 +   if (XENO_DEBUG(NUCLEUS)  sched-debug_resched_from_remote) {
 +