Gilles Chanteperdrix wrote:
Jan Kiszka wrote:
Am 04.11.2010 10:26, Jan Kiszka wrote:
Am 04.11.2010 10:16, Gilles Chanteperdrix wrote:
Jan Kiszka wrote:
Take a step back and look at the root cause for this issue again. Unlocked

        if need-resched

is inherently racy and will always be (not only for the remote
reschedule case BTW).
Ok, let us examine what may happen with this code if we only set the
XNRESCHED bit on the local cpu. First, other bits than XNRESCHED do not
matter, because they can not change under our feet. So, we have two
cases for this race:
1- we see the XNRESCHED bit, but it has been cleared once nklock is
locked in __xnpod_schedule.
2- we do not see the XNRESCHED bit, but it get set right after we test it.

1 is not a problem.
Yes, as long as we remove the debug check from the scheduler code (or
fix it somehow). The scheduling code already catches this race.

2 is not a problem, because anything which sets the XNRESCHED (it may
only be an interrupt in fact) bit will cause xnpod_schedule to be called
right after that.

So no, no race here provided that we only set the XNRESCHED bit on the
local cpu.

 So we either have to accept this and remove the
debugging check from the scheduler or push the check back to
__xnpod_schedule where it once came from. When this it cleaned up, we
can look into the remote resched protocol again.
The problem of the debug check is that it checks whether the scheduler
state is modified without the XNRESCHED bit being set. And this is the
problem, because yes, in that case, we have a race: the scheduler state
may be modified before the XNRESCHED bit is set by an IPI.

If we want to fix the debug check, we have to have a special bit, on in
the sched->status flag, only for the purpose of debugging. Or remove the
debug check.
Exactly my point. Is there any benefit in keeping the debug check? The
code to make it work may end up as "complex" as the logic it verifies,
at least that's my current feeling.

This would be the radical approach of removing the check (and cleaning
up some bits). If it's acceptable, I would split it up properly.

This debug check saved our asses when debugging SMP issues, and I
suspect it may help debugging skin issues. So, I think we should try and
keep it.

At first sight, here you are more breaking things than cleaning them.
Still, it has the SMP record for my test program, still runs with ftrace on (after 2 hours, where it previously failed after maximum 23 minutes).

If I get the gist of Jan's changes, they are (using the IPI to transfer one bit of information: your cpu needs to reschedule):

-      setbits((__sched__)->status, XNRESCHED);

+       xnsched_set_resched(sched);
If you (we?) decide to keep the debug checks, under what circumstances would the current check trigger (in laymans language, that I'll be able to understand)?


Xenomai-core mailing list

Reply via email to