Am 04.11.2010 15:53, Anders Blomdell wrote:
> Jan Kiszka wrote:
>> Am 04.11.2010 14:18, Anders Blomdell wrote:
>>> Gilles Chanteperdrix wrote:
>>>> Jan Kiszka wrote:
>>>>> Am 04.11.2010 10:26, Jan Kiszka wrote:
>>>>>> Am 04.11.2010 10:16, Gilles Chanteperdrix wrote:
>>>>>>> Jan Kiszka wrote:
>>>>>>>> Take a step back and look at the root cause for this issue again.
>>>>>>>> Unlocked
>>>>>>>>
>>>>>>>>     if need-resched
>>>>>>>>         __xnpod_schedule
>>>>>>>>
>>>>>>>> is inherently racy and will always be (not only for the remote
>>>>>>>> reschedule case BTW).
>>>>>>> Ok, let us examine what may happen with this code if we only set the
>>>>>>> XNRESCHED bit on the local cpu. First, bits other than XNRESCHED do
>>>>>>> not matter, because they cannot change under our feet. So, we have
>>>>>>> two cases for this race:
>>>>>>> 1- we see the XNRESCHED bit, but it has been cleared once nklock is
>>>>>>> locked in __xnpod_schedule.
>>>>>>> 2- we do not see the XNRESCHED bit, but it gets set right after we
>>>>>>> test it.
>>>>>>>
>>>>>>> 1 is not a problem.
>>>>>> Yes, as long as we remove the debug check from the scheduler code (or
>>>>>> fix it somehow). The scheduling code already catches this race.
>>>>>>
>>>>>>> 2 is not a problem, because anything which sets the XNRESCHED bit
>>>>>>> (it may only be an interrupt, in fact) will cause xnpod_schedule to
>>>>>>> be called right after that.
>>>>>>>
>>>>>>> So no, there is no race here, provided that we only set the XNRESCHED
>>>>>>> bit on the local cpu.
>>>>>>>
>>>>>>>> So we either have to accept this and remove the
>>>>>>>> debugging check from the scheduler or push the check back to
>>>>>>>> __xnpod_schedule where it once came from. When this is cleaned up,
>>>>>>>> we can look into the remote resched protocol again.
>>>>>>> The problem of the debug check is that it checks whether the
>>>>>>> scheduler state is modified without the XNRESCHED bit being set.
>>>>>>> And this is the
>>>>>>> problem, because yes, in that case, we have a race: the scheduler
>>>>>>> state may be modified before the XNRESCHED bit is set by an IPI.
>>>>>>>
>>>>>>> If we want to fix the debug check, we have to have a special bit in
>>>>>>> sched->status, used only for the purpose of debugging. Or remove the
>>>>>>> debug check.
>>>>>> Exactly my point. Is there any benefit in keeping the debug check? The
>>>>>> code to make it work may end up as "complex" as the logic it verifies,
>>>>>> at least that's my current feeling.
>>>>>>
>>>>> This would be the radical approach of removing the check (and cleaning
>>>>> up some bits). If it's acceptable, I would split it up properly.
>>>> This debug check saved our asses when debugging SMP issues, and I
>>>> suspect it may help debugging skin issues. So, I think we should try
>>>> and keep it.
>>>>
>>>> At first sight, here you are more breaking things than cleaning them.
>>> Still, it holds the SMP record for my test program, and it still runs
>>> with ftrace on (after 2 hours, where it previously failed after at most
>>> 23 minutes).
>>
>> My version was indeed still buggy, I'm reworking it ATM.
> Any reason why the two changes below would fail (I need to get things
> working real soon now)?
Maybe they don't fail, but they definitely cause reschedules where we do not
need them. I stopped thinking about those details after starting the rework.
The safer option for now is likely to revert 56ff4329ff, keeping nucleus
debugging off.

>>
>>> If I get the gist of Jan's changes, they are (using the IPI to transfer
>>> one bit of information: your cpu needs to reschedule):
>>>
>>> xnsched_set_resched:
>>> -    setbits((__sched__)->status, XNRESCHED);
>>>
>>> xnpod_schedule_handler:
>>> +    xnsched_set_resched(sched);
>>>
>>> If you (we?) decide to keep the debug checks, under what circumstances
>>> would the current check trigger (in layman's language, so that I'll be
>>> able to understand)?
>>
>> That's actually what /me is wondering as well. I do not see yet how you
>> can reliably detect a missed reschedule (that was the purpose of the
>> debug check), given the racy nature of the interplay between signaling
>> resched and processing the resched hints.
> The only thing I can think of are atomic set/clear operations on an
> independent variable.

And the set would have to be confined to a few central locations - otherwise
it will be simpler to manually check for the proper pairing of runqueue
manipulation and xnsched_set_resched - if that was the fault pattern the
test was supposed to catch (which would have had nucleus scope, no skins
involved).

Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

_______________________________________________
Xenomai-core mailing list
[email protected]
https://mail.gna.org/listinfo/xenomai-core
