Re: [Xenomai-core] Potential problem with rt_eepro100
Gilles Chanteperdrix wrote: Anders Blomdell wrote: Gilles Chanteperdrix wrote: Jan Kiszka wrote: Am 05.11.2010 00:24, Gilles Chanteperdrix wrote: Jan Kiszka wrote: Am 04.11.2010 23:06, Gilles Chanteperdrix wrote: Jan Kiszka wrote: At first sight, here you are more breaking things than cleaning them. Still, it has the SMP record for my test program, still runs with ftrace on (after 2 hours, where it previously failed after maximum 23 minutes). My version was indeed still buggy, I'm reworking it ATM. If I get the gist of Jan's changes, they are (using the IPI to transfer one bit of information: your cpu needs to reschedule): xnsched_set_resched: - setbits((__sched__)-status, XNRESCHED); xnpod_schedule_handler: + xnsched_set_resched(sched); If you (we?) decide to keep the debug checks, under what circumstances would the current check trigger (in laymans language, that I'll be able to understand)? That's actually what /me is wondering as well. I do not see yet how you can reliably detect a missed reschedule reliably (that was the purpose of the debug check) given the racy nature between signaling resched and processing the resched hints. The purpose of the debugging change is to detect a change of the scheduler state which was not followed by setting the XNRESCHED bit. But that is nucleus business, nothing skins can screw up (as long as they do not misuse APIs). Yes, but it happens that we modify the nucleus from time to time. Getting it to work is relatively simple: we add a scheduler change set remotely bit to the sched structure which is NOT in the status bit, set this bit when changing a remote sched (under nklock). In the debug check code, if the scheduler state changed, and the XNRESCHED bit is not set, only consider this a but if this new bit is not set. All this is compiled out if the debug is not enabled. I still see no benefit in this check. Where to you want to place the bit set? Aren't that just the same locations where xnsched_set_[self_]resched already is today? Well no, that would be another bit in the sched structure which would allow us to manipulate the status bits from the local cpu. That supplementary bit would only be changed from a distant CPU, and serve to detect the race which causes the false positive. The resched bits are set on the local cpu to get xnpod_schedule to trigger a rescheduling on the distance cpu. That bit would be set on the remote cpu's sched. Only when debugging is enabled. But maybe you can provide some motivating bug scenarios, real ones of the past or realistic ones of the future. Of course. The bug is anything which changes the scheduler state but does not set the XNRESCHED bit. This happened when we started the SMP port. New scheduling policies would be good candidates for a revival of this bug. You don't gain any worthwhile check if you cannot make the instrumentation required for a stable detection simpler than the proper problem solution itself. And this is what I'm still skeptical of. The solution is simple, but finding the problem without the instrumentation is way harder than with the instrumentation, so the instrumentation is worth something. Reproducing the false positive is surprisingly easy with a simple dual-cpu semaphore ping-pong test. So, here is the (tested) patch, using a ridiculous long variable name to illustrate what I was thinking about: diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h index cf4..454b8e8 100644 --- a/include/nucleus/sched.h +++ b/include/nucleus/sched.h @@ -108,6 +108,9 @@ typedef struct xnsched { struct xnthread *gktarget; #endif +#ifdef CONFIG_XENO_OPT_DEBUG_NUCLEUS + int debug_resched_from_remote; +#endif } xnsched_t; union xnsched_policy_param; @@ -185,6 +188,8 @@ static inline int xnsched_resched_p(struct xnsched *sched) xnsched_t *current_sched = xnpod_current_sched();\ __setbits(current_sched-status, XNRESCHED); \ if (current_sched != (__sched__)){ \ + if (XENO_DEBUG(NUCLEUS)) \ + __sched__-debug_resched_from_remote = 1; \ xnarch_cpu_set(xnsched_cpu(__sched__), current_sched-resched); \ }\ } while (0) diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c index 4cb707a..50b0f49 100644 --- a/ksrc/nucleus/pod.c +++ b/ksrc/nucleus/pod.c @@ -2177,6 +2177,10 @@ static inline int __xnpod_test_resched(struct xnsched *sched) xnarch_cpus_clear(sched-resched); } #endif + if (XENO_DEBUG(NUCLEUS) sched-debug_resched_from_remote) { + sched-debug_resched_from_remote = 0; + resched = 1; + } clrbits(sched-status, XNRESCHED); return resched; } I am still uncertain. Will only work if all is done under nklock, otherwise two
Re: [Xenomai-core] Potential problem with rt_eepro100
Anders Blomdell wrote: Gilles Chanteperdrix wrote: Anders Blomdell wrote: Gilles Chanteperdrix wrote: Jan Kiszka wrote: Am 05.11.2010 00:24, Gilles Chanteperdrix wrote: Jan Kiszka wrote: Am 04.11.2010 23:06, Gilles Chanteperdrix wrote: Jan Kiszka wrote: At first sight, here you are more breaking things than cleaning them. Still, it has the SMP record for my test program, still runs with ftrace on (after 2 hours, where it previously failed after maximum 23 minutes). My version was indeed still buggy, I'm reworking it ATM. If I get the gist of Jan's changes, they are (using the IPI to transfer one bit of information: your cpu needs to reschedule): xnsched_set_resched: - setbits((__sched__)-status, XNRESCHED); xnpod_schedule_handler: +xnsched_set_resched(sched); If you (we?) decide to keep the debug checks, under what circumstances would the current check trigger (in laymans language, that I'll be able to understand)? That's actually what /me is wondering as well. I do not see yet how you can reliably detect a missed reschedule reliably (that was the purpose of the debug check) given the racy nature between signaling resched and processing the resched hints. The purpose of the debugging change is to detect a change of the scheduler state which was not followed by setting the XNRESCHED bit. But that is nucleus business, nothing skins can screw up (as long as they do not misuse APIs). Yes, but it happens that we modify the nucleus from time to time. Getting it to work is relatively simple: we add a scheduler change set remotely bit to the sched structure which is NOT in the status bit, set this bit when changing a remote sched (under nklock). In the debug check code, if the scheduler state changed, and the XNRESCHED bit is not set, only consider this a but if this new bit is not set. All this is compiled out if the debug is not enabled. I still see no benefit in this check. Where to you want to place the bit set? Aren't that just the same locations where xnsched_set_[self_]resched already is today? Well no, that would be another bit in the sched structure which would allow us to manipulate the status bits from the local cpu. That supplementary bit would only be changed from a distant CPU, and serve to detect the race which causes the false positive. The resched bits are set on the local cpu to get xnpod_schedule to trigger a rescheduling on the distance cpu. That bit would be set on the remote cpu's sched. Only when debugging is enabled. But maybe you can provide some motivating bug scenarios, real ones of the past or realistic ones of the future. Of course. The bug is anything which changes the scheduler state but does not set the XNRESCHED bit. This happened when we started the SMP port. New scheduling policies would be good candidates for a revival of this bug. You don't gain any worthwhile check if you cannot make the instrumentation required for a stable detection simpler than the proper problem solution itself. And this is what I'm still skeptical of. The solution is simple, but finding the problem without the instrumentation is way harder than with the instrumentation, so the instrumentation is worth something. Reproducing the false positive is surprisingly easy with a simple dual-cpu semaphore ping-pong test. So, here is the (tested) patch, using a ridiculous long variable name to illustrate what I was thinking about: diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h index cf4..454b8e8 100644 --- a/include/nucleus/sched.h +++ b/include/nucleus/sched.h @@ -108,6 +108,9 @@ typedef struct xnsched { struct xnthread *gktarget; #endif +#ifdef CONFIG_XENO_OPT_DEBUG_NUCLEUS + int debug_resched_from_remote; +#endif } xnsched_t; union xnsched_policy_param; @@ -185,6 +188,8 @@ static inline int xnsched_resched_p(struct xnsched *sched) xnsched_t *current_sched = xnpod_current_sched();\ __setbits(current_sched-status, XNRESCHED); \ if (current_sched != (__sched__)){ \ + if (XENO_DEBUG(NUCLEUS)) \ + __sched__-debug_resched_from_remote = 1; \ xnarch_cpu_set(xnsched_cpu(__sched__), current_sched-resched); \ }\ } while (0) diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c index 4cb707a..50b0f49 100644 --- a/ksrc/nucleus/pod.c +++ b/ksrc/nucleus/pod.c @@ -2177,6 +2177,10 @@ static inline int __xnpod_test_resched(struct xnsched *sched) xnarch_cpus_clear(sched-resched); } #endif + if (XENO_DEBUG(NUCLEUS) sched-debug_resched_from_remote) { + sched-debug_resched_from_remote = 0; + resched = 1; + } clrbits(sched-status,
Re: [Xenomai-core] Potential problem with rt_eepro100
On Sat, 2010-11-06 at 21:37 +0100, Gilles Chanteperdrix wrote: Anders Blomdell wrote: Gilles Chanteperdrix wrote: Anders Blomdell wrote: Gilles Chanteperdrix wrote: Jan Kiszka wrote: Am 05.11.2010 00:24, Gilles Chanteperdrix wrote: Jan Kiszka wrote: Am 04.11.2010 23:06, Gilles Chanteperdrix wrote: Jan Kiszka wrote: At first sight, here you are more breaking things than cleaning them. Still, it has the SMP record for my test program, still runs with ftrace on (after 2 hours, where it previously failed after maximum 23 minutes). My version was indeed still buggy, I'm reworking it ATM. If I get the gist of Jan's changes, they are (using the IPI to transfer one bit of information: your cpu needs to reschedule): xnsched_set_resched: - setbits((__sched__)-status, XNRESCHED); xnpod_schedule_handler: + xnsched_set_resched(sched); If you (we?) decide to keep the debug checks, under what circumstances would the current check trigger (in laymans language, that I'll be able to understand)? That's actually what /me is wondering as well. I do not see yet how you can reliably detect a missed reschedule reliably (that was the purpose of the debug check) given the racy nature between signaling resched and processing the resched hints. The purpose of the debugging change is to detect a change of the scheduler state which was not followed by setting the XNRESCHED bit. But that is nucleus business, nothing skins can screw up (as long as they do not misuse APIs). Yes, but it happens that we modify the nucleus from time to time. Getting it to work is relatively simple: we add a scheduler change set remotely bit to the sched structure which is NOT in the status bit, set this bit when changing a remote sched (under nklock). In the debug check code, if the scheduler state changed, and the XNRESCHED bit is not set, only consider this a but if this new bit is not set. All this is compiled out if the debug is not enabled. I still see no benefit in this check. Where to you want to place the bit set? Aren't that just the same locations where xnsched_set_[self_]resched already is today? Well no, that would be another bit in the sched structure which would allow us to manipulate the status bits from the local cpu. That supplementary bit would only be changed from a distant CPU, and serve to detect the race which causes the false positive. The resched bits are set on the local cpu to get xnpod_schedule to trigger a rescheduling on the distance cpu. That bit would be set on the remote cpu's sched. Only when debugging is enabled. But maybe you can provide some motivating bug scenarios, real ones of the past or realistic ones of the future. Of course. The bug is anything which changes the scheduler state but does not set the XNRESCHED bit. This happened when we started the SMP port. New scheduling policies would be good candidates for a revival of this bug. You don't gain any worthwhile check if you cannot make the instrumentation required for a stable detection simpler than the proper problem solution itself. And this is what I'm still skeptical of. The solution is simple, but finding the problem without the instrumentation is way harder than with the instrumentation, so the instrumentation is worth something. Reproducing the false positive is surprisingly easy with a simple dual-cpu semaphore ping-pong test. So, here is the (tested) patch, using a ridiculous long variable name to illustrate what I was thinking about: diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h index cf4..454b8e8 100644 --- a/include/nucleus/sched.h +++ b/include/nucleus/sched.h @@ -108,6 +108,9 @@ typedef struct xnsched { struct xnthread *gktarget; #endif +#ifdef CONFIG_XENO_OPT_DEBUG_NUCLEUS + int debug_resched_from_remote; +#endif } xnsched_t; union xnsched_policy_param; @@ -185,6 +188,8 @@ static inline int xnsched_resched_p(struct xnsched *sched) xnsched_t *current_sched = xnpod_current_sched();\ __setbits(current_sched-status, XNRESCHED); \ if (current_sched != (__sched__)){ \ + if (XENO_DEBUG(NUCLEUS)) \ + __sched__-debug_resched_from_remote = 1; \ xnarch_cpu_set(xnsched_cpu(__sched__), current_sched-resched); \ }\ } while (0) diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c index 4cb707a..50b0f49 100644 --- a/ksrc/nucleus/pod.c +++ b/ksrc/nucleus/pod.c @@ -2177,6 +2177,10 @@ static inline int __xnpod_test_resched(struct xnsched *sched) xnarch_cpus_clear(sched-resched); } #endif + if
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 06.11.2010 23:49, Philippe Gerum wrote: On Sat, 2010-11-06 at 21:37 +0100, Gilles Chanteperdrix wrote: Anders Blomdell wrote: Gilles Chanteperdrix wrote: Anders Blomdell wrote: Gilles Chanteperdrix wrote: Jan Kiszka wrote: Am 05.11.2010 00:24, Gilles Chanteperdrix wrote: Jan Kiszka wrote: Am 04.11.2010 23:06, Gilles Chanteperdrix wrote: Jan Kiszka wrote: At first sight, here you are more breaking things than cleaning them. Still, it has the SMP record for my test program, still runs with ftrace on (after 2 hours, where it previously failed after maximum 23 minutes). My version was indeed still buggy, I'm reworking it ATM. If I get the gist of Jan's changes, they are (using the IPI to transfer one bit of information: your cpu needs to reschedule): xnsched_set_resched: - setbits((__sched__)-status, XNRESCHED); xnpod_schedule_handler: + xnsched_set_resched(sched); If you (we?) decide to keep the debug checks, under what circumstances would the current check trigger (in laymans language, that I'll be able to understand)? That's actually what /me is wondering as well. I do not see yet how you can reliably detect a missed reschedule reliably (that was the purpose of the debug check) given the racy nature between signaling resched and processing the resched hints. The purpose of the debugging change is to detect a change of the scheduler state which was not followed by setting the XNRESCHED bit. But that is nucleus business, nothing skins can screw up (as long as they do not misuse APIs). Yes, but it happens that we modify the nucleus from time to time. Getting it to work is relatively simple: we add a scheduler change set remotely bit to the sched structure which is NOT in the status bit, set this bit when changing a remote sched (under nklock). In the debug check code, if the scheduler state changed, and the XNRESCHED bit is not set, only consider this a but if this new bit is not set. All this is compiled out if the debug is not enabled. I still see no benefit in this check. Where to you want to place the bit set? Aren't that just the same locations where xnsched_set_[self_]resched already is today? Well no, that would be another bit in the sched structure which would allow us to manipulate the status bits from the local cpu. That supplementary bit would only be changed from a distant CPU, and serve to detect the race which causes the false positive. The resched bits are set on the local cpu to get xnpod_schedule to trigger a rescheduling on the distance cpu. That bit would be set on the remote cpu's sched. Only when debugging is enabled. But maybe you can provide some motivating bug scenarios, real ones of the past or realistic ones of the future. Of course. The bug is anything which changes the scheduler state but does not set the XNRESCHED bit. This happened when we started the SMP port. New scheduling policies would be good candidates for a revival of this bug. You don't gain any worthwhile check if you cannot make the instrumentation required for a stable detection simpler than the proper problem solution itself. And this is what I'm still skeptical of. The solution is simple, but finding the problem without the instrumentation is way harder than with the instrumentation, so the instrumentation is worth something. Reproducing the false positive is surprisingly easy with a simple dual-cpu semaphore ping-pong test. So, here is the (tested) patch, using a ridiculous long variable name to illustrate what I was thinking about: diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h index cf4..454b8e8 100644 --- a/include/nucleus/sched.h +++ b/include/nucleus/sched.h @@ -108,6 +108,9 @@ typedef struct xnsched { struct xnthread *gktarget; #endif +#ifdef CONFIG_XENO_OPT_DEBUG_NUCLEUS + int debug_resched_from_remote; +#endif } xnsched_t; union xnsched_policy_param; @@ -185,6 +188,8 @@ static inline int xnsched_resched_p(struct xnsched *sched) xnsched_t *current_sched = xnpod_current_sched();\ __setbits(current_sched-status, XNRESCHED); \ if (current_sched != (__sched__)){ \ + if (XENO_DEBUG(NUCLEUS)) \ + __sched__-debug_resched_from_remote = 1; \ xnarch_cpu_set(xnsched_cpu(__sched__), current_sched-resched); \ }\ } while (0) diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c index 4cb707a..50b0f49 100644 --- a/ksrc/nucleus/pod.c +++ b/ksrc/nucleus/pod.c @@ -2177,6 +2177,10 @@ static inline int __xnpod_test_resched(struct xnsched *sched) xnarch_cpus_clear(sched-resched); } #endif + if (XENO_DEBUG(NUCLEUS) sched-debug_resched_from_remote) { +