Philippe Gerum wrote:
> On Mon, 2010-04-26 at 18:06 +0200, Jan Kiszka wrote:
>> Jan Kiszka wrote:
>>> Jan Kiszka wrote:
>>>> Hi,
>>>>
>>>> I'm meditating over an oops in rpi_clear_remote. NULL pointer deref, it
>>>> /seems/ like thread->rpi is invalid. Looking at the code, I wonder if
>>>> this could explain the bug:
>>>>
>>>>
>>>> static void rpi_clear_remote(struct xnthread *thread)
>>>> {
>>>> 	...
>>>> 	rpi = thread->rpi;
>>>> 	if (unlikely(rpi == NULL))
>>>> 		return;
>>>>
>>>> 	xnlock_get_irqsave(&rpi->rpilock, s);
>>>>
>>>> 	/*
>>>> 	 * The RPI slot - if present - is always valid, and won't
>>>> 	 * change since the thread is resuming on this CPU and cannot
>>>> 	 * migrate under our feet. We may grab the remote slot lock
>>>> 	 * now.
>>>> 	 */
>>>> 	xnsched_pop_rpi(thread);
>>>> 	thread->rpi = NULL;
>>>>
>>>> 	...
>>>>
>>>> So we deref (xnsched_pop_rpi) and clear thread->rpi under rpilock, but
>>>> we check for it without any protection? Sounds racy. I think 'thread' is
>>>> not only pointing to the current thread but could refer to a foreign one
>>>> as well, right? Don't we need this:
>>>>
>>>> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>>>> index 872c37f..1f995d6 100644
>>>> --- a/ksrc/nucleus/shadow.c
>>>> +++ b/ksrc/nucleus/shadow.c
>>>> @@ -331,6 +331,12 @@ static void rpi_clear_remote(struct xnthread *thread)
>>>>
>>>>  	xnlock_get_irqsave(&rpi->rpilock, s);
>>>>
>>>> +	/* Re-check under lock, someone may have cleared rpi by now. */
>>>> +	if (unlikely(thread->rpi == NULL)) {
>>>> +		xnlock_put_irqrestore(&rpi->rpilock, s);
>>>> +		return;
>>>> +	}
>>>> +
>>>>  	/*
>>>>  	 * The RPI slot - if present - is always valid, and won't
>>>>  	 * change since the thread is resuming on this CPU and cannot
>>> Another worry: Can thread->rpi become != rpi without being NULL? Or can
>>> we really only race for clearance here?
>>>
>> I think so now, therefore I'm proposing this:
>>
>> ----------->
>>
>> Most RPI services work on the current task or the one to be scheduled in
>> next, thus they are naturally serialized. But rpi_next is not, as it can
>> walk the chain of RPI requests for a CPU independently. In that case,
>> clearing RPI via rpi_clear_remote can race with rpi_next, and if the
>> former loses after checking thread->rpi for NULL, we will dereference a
>> NULL pointer in xnsched_pop_rpi().
>>
>> Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>
>> ---
>>  ksrc/nucleus/shadow.c |    9 +++++++++
>>  1 files changed, 9 insertions(+), 0 deletions(-)
>>
>> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>> index 872c37f..cf7c08f 100644
>> --- a/ksrc/nucleus/shadow.c
>> +++ b/ksrc/nucleus/shadow.c
>> @@ -332,6 +332,15 @@ static void rpi_clear_remote(struct xnthread *thread)
>>  	xnlock_get_irqsave(&rpi->rpilock, s);
>>
>>  	/*
>> +	 * Re-check under lock. Someone may have invoked rpi_next and cleared
>> +	 * rpi by now.
>> +	 */
>> +	if (unlikely(!rpi_p(thread))) {
>> +		xnlock_put_irqrestore(&rpi->rpilock, s);
>> +		return;
>> +	}
>> +
>> +	/*
>>  	 * The RPI slot - if present - is always valid, and won't
>>  	 * change since the thread is resuming on this CPU and cannot
>>  	 * migrate under our feet. We may grab the remote slot lock
>>
>
> The suggested patch papers over the actual issue, which is that
> rpi_clear_remote() may not invoke rpi_next(), because it may only affect
I don't think that in our case rpi_clear_remote called rpi_next and
therefore crashed. It should rather have been the scenario of both running
in parallel on different CPUs: the former on behalf of a migrated shadow
that wants to clear its remainders on the remote CPU, the latter on that
CPU, picking a new top RPI after some other shadow was removed from the
queue. Is this a possible scenario, and would your patch cover it?

> the RPI state of the argument thread, which must be a local one, and not
> that of any random thread that happens to be linked to the remote RPI
> queue.
>
> By calling rpi_next(), rpi_clear_remote() shoots itself in the foot,
> allowing a concurrent invocation of itself on a remote CPU to fiddle
> with the rpi backlink of a thread which is not active on the
> local/per-cpu Xenomai scheduler instance, which is the point where
> things start to hit the crapper.
>
> Now, unless I can't even synchronize the couple of neurons I have left
> at this hour, the following patch should better fix the issue, because
> it restores the two basic rules that apply to the whole RPI machinery,
> namely:
>
> - rpi_* calls may only alter the contents of the local scheduler's RPI
>   queue, with the notable exception of rpi_clear_remote(), which may
>   remove the given _local_ thread only, from a remote RPI slot.
>
> - rpi_* calls may only change the RPI state of threads which are
>   controlled by the local Xenomai scheduler instance, except rpi_push()
>   when called for setting up the RPI state of an emerging thread, in
>   which case this is a no-conflict zone.
>
> That breakage was introduced in the early 2.5.1 timeframe, so 2.4.x
> should be immune from this particular bug.
>
> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
> index 872c37f..1397ed1 100644
> --- a/ksrc/nucleus/shadow.c
> +++ b/ksrc/nucleus/shadow.c
> @@ -340,7 +340,12 @@ static void rpi_clear_remote(struct xnthread *thread)
>  	xnsched_pop_rpi(thread);
>  	thread->rpi = NULL;
>
> -	if (rpi_next(rpi, s) == NULL)
> +	/*
> +	 * If the remote RPI queue was emptied, prepare for kicking
> +	 * xnshadow_rpi_check() on the relevant CPU to demote the root
> +	 * thread priority there.
> +	 */
> +	if (xnsched_peek_rpi(rpi) == NULL) /* No rpi_next() here. */
>  		rcpu = xnsched_cpu(rpi);
>
>  	xnlock_put_irqrestore(&rpi->rpilock, s);
>

I have to confess, I do not understand how the patch relates to our
crash. But that's because I still have only a semi-understanding of this
frighteningly complex RPI code. However, the fact that thread->rpi is now
again only checked outside its protecting lock leaves me with a very bad
feeling...

Jan
_______________________________________________
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core