Re: [Xenomai-core] Potential problem with rt_eepro100

Gilles Chanteperdrix Fri, 05 Nov 2010 17:28:50 -0700

Anders Blomdell wrote:
> Gilles Chanteperdrix wrote:
>> Jan Kiszka wrote:
>>> Am 05.11.2010 00:24, Gilles Chanteperdrix wrote:
>>>> Jan Kiszka wrote:
>>>>> Am 04.11.2010 23:06, Gilles Chanteperdrix wrote:
>>>>>> Jan Kiszka wrote:
>>>>>>>>> At first sight, here you are more breaking things than cleaning them.
>>>>>>>> Still, it has the SMP record for my test program, still runs with 
>>>>>>>> ftrace 
>>>>>>>> on (after 2 hours, where it previously failed after maximum 23 
>>>>>>>> minutes).
>>>>>>> My version was indeed still buggy, I'm reworking it ATM.
>>>>>>>
>>>>>>>> If I get the gist of Jan's changes, they are (using the IPI to 
>>>>>>>> transfer 
>>>>>>>> one bit of information: your cpu needs to reschedule):
>>>>>>>>
>>>>>>>> xnsched_set_resched:
>>>>>>>> -      setbits((__sched__)->status, XNRESCHED);
>>>>>>>>
>>>>>>>> xnpod_schedule_handler:
>>>>>>>> +      xnsched_set_resched(sched);
>>>>>>>>        
>>>>>>>> If you (we?) decide to keep the debug checks, under what circumstances 
>>>>>>>> would the current check trigger (in laymans language, that I'll be 
>>>>>>>> able 
>>>>>>>> to understand)?
>>>>>>> That's actually what /me is wondering as well. I do not see yet how you
>>>>>>> can reliably detect a missed reschedule reliably (that was the purpose
>>>>>>> of the debug check) given the racy nature between signaling resched and
>>>>>>> processing the resched hints.
>>>>>> The purpose of the debugging change is to detect a change of the
>>>>>> scheduler state which was not followed by setting the XNRESCHED bit.
>>>>> But that is nucleus business, nothing skins can screw up (as long as
>>>>> they do not misuse APIs).
>>>> Yes, but it happens that we modify the nucleus from time to time.
>>>>
>>>>>> Getting it to work is relatively simple: we add a "scheduler change set
>>>>>> remotely" bit to the sched structure which is NOT in the status bit, set
>>>>>> this bit when changing a remote sched (under nklock). In the debug check
>>>>>> code, if the scheduler state changed, and the XNRESCHED bit is not set,
>>>>>> only consider this a but if this new bit is not set. All this is
>>>>>> compiled out if the debug is not enabled.
>>>>> I still see no benefit in this check. Where to you want to place the bit
>>>>> set? Aren't that just the same locations where
>>>>> xnsched_set_[self_]resched already is today?
>>>> Well no, that would be another bit in the sched structure which would
>>>> allow us to manipulate the status bits from the local cpu. That
>>>> supplementary bit would only be changed from a distant CPU, and serve to
>>>> detect the race which causes the false positive. The resched bits are
>>>> set on the local cpu to get xnpod_schedule to trigger a rescheduling on
>>>> the distance cpu. That bit would be set on the remote cpu's sched. Only
>>>> when debugging is enabled.
>>>>
>>>>> But maybe you can provide some motivating bug scenarios, real ones of
>>>>> the past or realistic ones of the future.
>>>> Of course. The bug is anything which changes the scheduler state but
>>>> does not set the XNRESCHED bit. This happened when we started the SMP
>>>> port. New scheduling policies would be good candidates for a revival of
>>>> this bug.
>>>>
>>> You don't gain any worthwhile check if you cannot make the
>>> instrumentation required for a stable detection simpler than the proper
>>> problem solution itself. And this is what I'm still skeptical of.
>> The solution is simple, but finding the problem without the 
>> instrumentation is way harder than with the instrumentation, so the 
>> instrumentation is worth something.
>>
>> Reproducing the false positive is surprisingly easy with a simple
>> dual-cpu semaphore ping-pong test. So, here is the (tested) patch, 
>> using a ridiculous long variable name to illustrate what I was 
>> thinking about:
>>
>> diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h
>> index 8888cf4..454b8e8 100644
>> --- a/include/nucleus/sched.h
>> +++ b/include/nucleus/sched.h
>> @@ -108,6 +108,9 @@ typedef struct xnsched {
>>         struct xnthread *gktarget;
>>  #endif
>>
>> +#ifdef CONFIG_XENO_OPT_DEBUG_NUCLEUS
>> +       int debug_resched_from_remote;
>> +#endif
>>  } xnsched_t;
>>
>>  union xnsched_policy_param;
>> @@ -185,6 +188,8 @@ static inline int xnsched_resched_p(struct xnsched 
>> *sched)
>>    xnsched_t *current_sched = xnpod_current_sched();                    \
>>    __setbits(current_sched->status, XNRESCHED);                         \
>>    if (current_sched != (__sched__))    {                               \
>> +         if (XENO_DEBUG(NUCLEUS))                                      \
>> +                 __sched__->debug_resched_from_remote = 1;             \
>>        xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched);  \
>>    }                                                                    \
>>  } while (0)
>> diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c
>> index 4cb707a..50b0f49 100644
>> --- a/ksrc/nucleus/pod.c
>> +++ b/ksrc/nucleus/pod.c
>> @@ -2177,6 +2177,10 @@ static inline int __xnpod_test_resched(struct xnsched 
>> *sched)
>>                 xnarch_cpus_clear(sched->resched);
>>         }
>>  #endif
>> +       if (XENO_DEBUG(NUCLEUS) && sched->debug_resched_from_remote) {
>> +               sched->debug_resched_from_remote = 0;
>> +               resched = 1;
>> +       }
>>         clrbits(sched->status, XNRESCHED);
>>         return resched;
>>  }
>>
>>
>> I am still uncertain.
> Will only work if all is done under nklock, otherwise two almost 
> simultaneous xnsched_resched_p from different cpus, might lead to one of 
> the ipi wakeups sees the 0 written due to handling the first ipi interrupt.


This is a patch artifact, the function modified are xnsched_set_resched
and xnpod_test_resched, and both are run with the nklock locked.

-- 
                                                                Gilles.

_______________________________________________
Xenomai-core mailing list
[email protected]
https://mail.gna.org/listinfo/xenomai-core

Re: [Xenomai-core] Potential problem with rt_eepro100

Reply via email to