Am 04.11.2010 15:53, Anders Blomdell wrote:
> Jan Kiszka wrote:
>> Am 04.11.2010 14:18, Anders Blomdell wrote:
>>> Gilles Chanteperdrix wrote:
>>>> Jan Kiszka wrote:
>>>>> Am 04.11.2010 10:26, Jan Kiszka wrote:
>>>>>> Am 04.11.2010 10:16, Gilles Chanteperdrix wrote:
>>>>>>> Jan Kiszka wrote:
>>>>>>>> Take a step back and look at the root cause for this issue again.
>>>>>>>> Unlocked
>>>>>>>>
>>>>>>>>     if need-resched
>>>>>>>>         __xnpod_schedule
>>>>>>>>
>>>>>>>> is inherently racy and will always be (not only for the remote
>>>>>>>> reschedule case BTW).
>>>>>>> Ok, let us examine what may happen with this code if we only set the
>>>>>>> XNRESCHED bit on the local cpu. First, bits other than XNRESCHED do
>>>>>>> not matter, because they cannot change under our feet. So, we have
>>>>>>> two cases for this race:
>>>>>>> 1- we see the XNRESCHED bit, but it has been cleared once nklock is
>>>>>>> locked in __xnpod_schedule.
>>>>>>> 2- we do not see the XNRESCHED bit, but it gets set right after we
>>>>>>> test it.
>>>>>>>
>>>>>>> 1 is not a problem.
>>>>>> Yes, as long as we remove the debug check from the scheduler code (or
>>>>>> fix it somehow). The scheduling code already catches this race.
>>>>>>
>>>>>>> 2 is not a problem, because anything which sets the XNRESCHED bit
>>>>>>> (it may only be an interrupt, in fact) will cause xnpod_schedule to
>>>>>>> be called right after that.
>>>>>>>
>>>>>>> So no, there is no race here, provided that we only set the XNRESCHED
>>>>>>> bit on the local cpu.
>>>>>>>
>>>>>>>> So we either have to accept this and remove the
>>>>>>>> debugging check from the scheduler or push the check back to
>>>>>>>> __xnpod_schedule where it once came from. When this is cleaned up,
>>>>>>>> we can look into the remote resched protocol again.
>>>>>>> The problem of the debug check is that it checks whether the
>>>>>>> scheduler state is modified without the XNRESCHED bit being set.
>>>>>>> And this is the
>>>>>>> problem, because yes, in that case, we have a race: the scheduler
>>>>>>> state may be modified before the XNRESCHED bit is set by an IPI.
>>>>>>>
>>>>>>> If we want to fix the debug check, we have to have a special bit in
>>>>>>> sched->status, used only for the purpose of debugging. Or remove the
>>>>>>> debug check.
>>>>>> Exactly my point. Is there any benefit in keeping the debug check? The
>>>>>> code to make it work may end up as "complex" as the logic it verifies,
>>>>>> at least that's my current feeling.
>>>>>>
>>>>> This would be the radical approach of removing the check (and cleaning
>>>>> up some bits). If it's acceptable, I would split it up properly.
>>>> This debug check saved our asses when debugging SMP issues, and I
>>>> suspect it may help debugging skin issues. So, I think we should try
>>>> and keep it.
>>>>
>>>> At first sight, here you are more breaking things than cleaning them.
>>> Still, it holds the SMP record for my test program, and it still runs
>>> with ftrace on (after 2 hours, where it previously failed after at most
>>> 23 minutes).
>>
>> My version was indeed still buggy, I'm reworking it ATM.
> Any reason why the two changes below would fail (I need to get things
> working real soon now)?
Maybe they don't fail, but they definitely cause reschedules where we do not
need them. I stopped thinking about those details after starting the rework.
The safer option for now is likely to revert 56ff4329ff, keeping nucleus
debugging off.

>>
>>> If I get the gist of Jan's changes, they are (using the IPI to transfer
>>> one bit of information: your cpu needs to reschedule):
>>>
>>> xnsched_set_resched:
>>> -    setbits((__sched__)->status, XNRESCHED);
>>>
>>> xnpod_schedule_handler:
>>> +    xnsched_set_resched(sched);
>>>
>>> If you (we?) decide to keep the debug checks, under what circumstances
>>> would the current check trigger (in layman's language, so that I'll be
>>> able to understand)?
>>
>> That's actually what /me is wondering as well. I do not see yet how you
>> can reliably detect a missed reschedule (that was the purpose of the
>> debug check), given the racy nature of the interplay between signaling
>> resched and processing the resched hints.
> The only thing I can think of are atomic set/clear operations on an
> independent variable.

And the set would have to be confined to a few central locations - otherwise
it will be simpler to manually check for the proper pairing of runqueue
manipulation and xnsched_set_resched - if that was the fault pattern the
test was supposed to catch (which would have had nucleus scope, no skins
involved).

Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

_______________________________________________
Xenomai-core mailing list
[email protected]
https://mail.gna.org/listinfo/xenomai-core
