Re: [Xenomai-core] [PATCH] nucleus: Plug race between rpi_clear_remote and rpi_next

2010-05-02, Jan Kiszka
Philippe Gerum wrote:
> On Sat, 2010-05-01 at 19:47 +0200, Jan Kiszka wrote:
>> Philippe Gerum wrote:
>>> Don't you think that, quoting yourself:
>>>
>>> "I have to confess, I do not understand how the patch may relate to our
>>> crash. But that's because I still only have a semi-understanding of this
>>> frightening complex RPI code"
>>>
>>> does not fit that well with any kind of strong assertion made later on
>>> the same topic, not backed by facts?
>> See, as long as you did not directly explain why my concerns on this
>> naturally racy construct were not valid (I brought up quite a few
>> concrete questions),
> 
> Which were answered. And you seem to consider that a construct is racy
> by design without taking care to understand why it could be
> perfectly valid in the proper context. This is what I was pointing out
> all along.

Depends on the POV. But what counts is the outcome of this, and that's a
fix.

Jan





Re: [Xenomai-core] [PATCH] nucleus: Plug race between rpi_clear_remote and rpi_next

2010-05-01, Philippe Gerum
On Sat, 2010-05-01 at 19:47 +0200, Jan Kiszka wrote:
> Philippe Gerum wrote:
> > Don't you think that, quoting yourself:
> > 
> > "I have to confess, I do not understand how the patch may relate to our
> > crash. But that's because I still only have a semi-understanding of this
> > frightening complex RPI code"
> > 
> > does not fit that well with any kind of strong assertion made later on
> > the same topic, not backed by facts?
> 
> See, as long as you did not directly explain why my concerns on this
> naturally racy construct were not valid (I brought up quite a few
> concrete questions),

Which were answered. And you seem to consider that a construct is racy
by design without taking care to understand why it could be
perfectly valid in the proper context. This is what I was pointing out
all along.

>  I couldn't help raising them over and over again.
> This has, in fact, nothing to do with understanding the RPI code in all
> its details. It's about reviewing basic patterns of it. But now that the
> critical bit is gone, I'm surely no longer concerned. :)
> 
> > 
> > So, I'm suggesting that we move on, and end this discussion in a
> > positive manner, i.e. by fixing this code. I hear your concerns, like I
> > always do, and I'm trying to find a solution that does not paper over
> > another issue. I guess we should be able to settle on this.
> > 
> > I pushed two patches in my for-upstream queue, with lengthy comments to
> > explain what they do and why this is needed:
> > 
> > http://git.xenomai.org/?p=xenomai-rpm.git;a=commit;h=ac5c739dabcb14334c2e390a9e3064f11f97283c
> > http://git.xenomai.org/?p=xenomai-rpm.git;a=commit;h=d3242401b8e1cf2c28822f801b681194243b4394
> > 
> > - the first patch is a plain revert of the commit introducing
> > rpi_next(), which caused the bug you observed in SMP mode, and that your
> > patch plugged successfully.
> > 
> > - the second patch is the proper fix for the issue rpi_next() tried to
> > address.
> > 
> > Could you please try them, and report whether this also fixes the issue
> > for you? In the meantime, I will be analyzing the RPI code once again,
> > to check whether your patch still covers a possible case, or not.
> 
> I will try my best, but the issue showed up only once in countless
> application runs. We will roll out the patches at the next chance (we
> already had to push my workaround as we need to deliver the fastsynch
> fix, so it may take a while).
> 

No problem with that; you were the only one hit by the issue so far.

> Jan
> 


-- 
Philippe.





Re: [Xenomai-core] [PATCH] nucleus: Plug race between rpi_clear_remote and rpi_next

2010-05-01, Jan Kiszka
Philippe Gerum wrote:
> Don't you think that, quoting yourself:
> 
> "I have to confess, I do not understand how the patch may relate to our
> crash. But that's because I still only have a semi-understanding of this
> frightening complex RPI code"
> 
> does not fit that well with any kind of strong assertion made later on
> the same topic, not backed by facts?

See, as long as you did not directly explain why my concerns on this
naturally racy construct were not valid (I brought up quite a few
concrete questions), I couldn't help raising them over and over again.
This has, in fact, nothing to do with understanding the RPI code in all
its details. It's about reviewing basic patterns of it. But now that the
critical bit is gone, I'm surely no longer concerned. :)

> 
> So, I'm suggesting that we move on, and end this discussion in a
> positive manner, i.e. by fixing this code. I hear your concerns, like I
> always do, and I'm trying to find a solution that does not paper over
> another issue. I guess we should be able to settle on this.
> 
> I pushed two patches in my for-upstream queue, with lengthy comments to
> explain what they do and why this is needed:
> 
> http://git.xenomai.org/?p=xenomai-rpm.git;a=commit;h=ac5c739dabcb14334c2e390a9e3064f11f97283c
> http://git.xenomai.org/?p=xenomai-rpm.git;a=commit;h=d3242401b8e1cf2c28822f801b681194243b4394
> 
> - the first patch is a plain revert of the commit introducing
> rpi_next(), which caused the bug you observed in SMP mode, and that your
> patch plugged successfully.
> 
> - the second patch is the proper fix for the issue rpi_next() tried to
> address.
> 
> Could you please try them, and report whether this also fixes the issue
> for you? In the meantime, I will be analyzing the RPI code once again,
> to check whether your patch still covers a possible case, or not.

I will try my best, but the issue showed up only once in countless
application runs. We will roll out the patches at the next chance (we
already had to push my workaround as we need to deliver the fastsynch
fix, so it may take a while).

Jan





Re: [Xenomai-core] [PATCH] nucleus: Plug race between rpi_clear_remote and rpi_next

2010-05-01, Philippe Gerum
On Tue, 2010-04-27 at 12:40 +0200, Jan Kiszka wrote:
> Philippe Gerum wrote:
> > On Tue, 2010-04-27 at 11:34 +0200, Jan Kiszka wrote:
> >> Philippe Gerum wrote:
> >>> On Tue, 2010-04-27 at 11:27 +0200, Jan Kiszka wrote:
>  Philippe Gerum wrote:
> > On Tue, 2010-04-27 at 10:25 +0200, Jan Kiszka wrote:
> >> Philippe Gerum wrote:
> >>> On Tue, 2010-04-27 at 08:46 +0200, Jan Kiszka wrote:
>  Philippe Gerum wrote:
> > On Mon, 2010-04-26 at 18:06 +0200, Jan Kiszka wrote:
> >> Jan Kiszka wrote:
> >>> Jan Kiszka wrote:
>  Hi,
> 
>  I'm meditating over an oops in rpi_clear_remote. NULL pointer 
>  deref, it
>  /seems/ like thread->rpi is invalid. Looking at the code, I 
>  wonder if
>  this could explain the bug:
> 
> 
>  static void rpi_clear_remote(struct xnthread *thread)
>  {
>  ...
>  rpi = thread->rpi;
>  if (unlikely(rpi == NULL))
>  return;
> 
>  xnlock_get_irqsave(&rpi->rpilock, s);
> 
>  /*
>   * The RPI slot - if present - is always valid, and won't
>   * change since the thread is resuming on this CPU and 
>  cannot
>   * migrate under our feet. We may grab the remote slot 
>  lock
>   * now.
>   */
>  xnsched_pop_rpi(thread);
>  thread->rpi = NULL;
> 
>  ...
> 
>  So we deref (xnsched_pop_rpi) and clear thread->rpi under 
>  rpilock, but
>  we check for it without any protection? Sounds racy. I think 
>  'thread' is
>  not only pointing to the current thread but could refer to a 
>  foreign one
>  as well, right? Don't we need this:
> 
>  diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>  index 872c37f..1f995d6 100644
>  --- a/ksrc/nucleus/shadow.c
>  +++ b/ksrc/nucleus/shadow.c
>  @@ -331,6 +331,12 @@ static void rpi_clear_remote(struct 
>  xnthread *thread)
>   
>   xnlock_get_irqsave(&rpi->rpilock, s);
>   
>  +/* Re-check under lock, someone may have cleared rpi by 
>  now. */
>  +if (unlikely(thread->rpi == NULL)) {
>  +xnlock_put_irqrestore(&rpi->rpilock, s);
>  +return;
>  +}
>  +
>   /*
>    * The RPI slot - if present - is always valid, and 
>  won't
>    * change since the thread is resuming on this CPU and 
>  cannot
> >>> Another worry: Can thread->rpi become != rpi without being NULL? 
> >>> Or can
> >>> we really only race for clearance here?
> >>>
> >> I think so now, therefore I'm proposing this:
> >>
> >> --->
> >>
> >> Most RPI services work on the current task or the one to be 
> >> scheduled in
> >> next, thus are naturally serialized. But rpi_next is not, as it can 
> >> walk
> >> the chain of RPI requests for a CPU independently. In that case,
> >> clearing RPI via rpi_clear_remote can race with rpi_next, and if 
> >> the
> >> former loses after checking thread->rpi for NULL, we will 
> >> dereference a
> >> NULL pointer in xnsched_pop_rpi().
> >>
> >> Signed-off-by: Jan Kiszka 
> >> ---
> >>  ksrc/nucleus/shadow.c |9 +
> >>  1 files changed, 9 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
> >> index 872c37f..cf7c08f 100644
> >> --- a/ksrc/nucleus/shadow.c
> >> +++ b/ksrc/nucleus/shadow.c
> >> @@ -332,6 +332,15 @@ static void rpi_clear_remote(struct xnthread 
> >> *thread)
> >>xnlock_get_irqsave(&rpi->rpilock, s);
> >>  
> >>/*
> >> +   * Re-check under lock. Someone may have invoked rpi_next and 
> >> cleared
> >> +   * rpi by now.
> >> +   */
> >> +  if (unlikely(!rpi_p(thread))) {
> >> +  xnlock_put_irqrestore(&rpi->rpilock, s);
> >> +  return;
> >> +  }
> >> +
> >> +  /*
> >> * The RPI slot - if present - is always valid, and won't
> >> * change since the thread is resuming on this CPU and cannot
> >> * migrate under our feet. We may grab the remote slot loc

Re: [Xenomai-core] [PATCH] nucleus: Plug race between rpi_clear_remote and rpi_next

2010-04-27, Jan Kiszka
Philippe Gerum wrote:
> On Tue, 2010-04-27 at 11:34 +0200, Jan Kiszka wrote:
>> Philippe Gerum wrote:
>>> On Tue, 2010-04-27 at 11:27 +0200, Jan Kiszka wrote:
 Philippe Gerum wrote:
> On Tue, 2010-04-27 at 10:25 +0200, Jan Kiszka wrote:
>> Philippe Gerum wrote:
>>> On Tue, 2010-04-27 at 08:46 +0200, Jan Kiszka wrote:
 Philippe Gerum wrote:
> On Mon, 2010-04-26 at 18:06 +0200, Jan Kiszka wrote:
>> Jan Kiszka wrote:
>>> Jan Kiszka wrote:
 Hi,

 I'm meditating over an oops in rpi_clear_remote. NULL pointer 
 deref, it
 /seems/ like thread->rpi is invalid. Looking at the code, I wonder 
 if
 this could explain the bug:


 static void rpi_clear_remote(struct xnthread *thread)
 {
 ...
 rpi = thread->rpi;
 if (unlikely(rpi == NULL))
 return;

 xnlock_get_irqsave(&rpi->rpilock, s);

 /*
  * The RPI slot - if present - is always valid, and won't
  * change since the thread is resuming on this CPU and 
 cannot
  * migrate under our feet. We may grab the remote slot lock
  * now.
  */
 xnsched_pop_rpi(thread);
 thread->rpi = NULL;

 ...

 So we deref (xnsched_pop_rpi) and clear thread->rpi under rpilock, 
 but
 we check for it without any protection? Sounds racy. I think 
 'thread' is
 not only pointing to the current thread but could refer to a 
 foreign one
 as well, right? Don't we need this:

 diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
 index 872c37f..1f995d6 100644
 --- a/ksrc/nucleus/shadow.c
 +++ b/ksrc/nucleus/shadow.c
 @@ -331,6 +331,12 @@ static void rpi_clear_remote(struct xnthread 
 *thread)
  
xnlock_get_irqsave(&rpi->rpilock, s);
  
 +  /* Re-check under lock, someone may have cleared rpi by now. */
 +  if (unlikely(thread->rpi == NULL)) {
 +  xnlock_put_irqrestore(&rpi->rpilock, s);
 +  return;
 +  }
 +
/*
 * The RPI slot - if present - is always valid, and won't
 * change since the thread is resuming on this CPU and cannot
>>> Another worry: Can thread->rpi become != rpi without being NULL? Or 
>>> can
>>> we really only race for clearance here?
>>>
>> I think so now, therefore I'm proposing this:
>>
>> --->
>>
>> Most RPI services work on the current task or the one to be 
>> scheduled in
>> next, thus are naturally serialized. But rpi_next is not, as it can 
>> walk
>> the chain of RPI requests for a CPU independently. In that case,
>> clearing RPI via rpi_clear_remote can race with rpi_next, and if the
>> former loses after checking thread->rpi for NULL, we will 
>> dereference a
>> NULL pointer in xnsched_pop_rpi().
>>
>> Signed-off-by: Jan Kiszka 
>> ---
>>  ksrc/nucleus/shadow.c |9 +
>>  1 files changed, 9 insertions(+), 0 deletions(-)
>>
>> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>> index 872c37f..cf7c08f 100644
>> --- a/ksrc/nucleus/shadow.c
>> +++ b/ksrc/nucleus/shadow.c
>> @@ -332,6 +332,15 @@ static void rpi_clear_remote(struct xnthread 
>> *thread)
>>  xnlock_get_irqsave(&rpi->rpilock, s);
>>  
>>  /*
>> + * Re-check under lock. Someone may have invoked rpi_next and 
>> cleared
>> + * rpi by now.
>> + */
>> +if (unlikely(!rpi_p(thread))) {
>> +xnlock_put_irqrestore(&rpi->rpilock, s);
>> +return;
>> +}
>> +
>> +/*
>>   * The RPI slot - if present - is always valid, and won't
>>   * change since the thread is resuming on this CPU and cannot
>>   * migrate under our feet. We may grab the remote slot lock
>>
> The suggested patch papers over the actual issue, which is that
> rpi_clear_remote() may not invoke rpi_next(), because it may only 
> affect
 I don't think that in our case rpi_clear_remote called rpi_next and
 therefore crashed. It should rather have been the scenario of both
 running in parallel on different CPUs, the

Re: [Xenomai-core] [PATCH] nucleus: Plug race between rpi_clear_remote and rpi_next

2010-04-27, Philippe Gerum
On Tue, 2010-04-27 at 11:34 +0200, Jan Kiszka wrote:
> Philippe Gerum wrote:
> > On Tue, 2010-04-27 at 11:27 +0200, Jan Kiszka wrote:
> >> Philippe Gerum wrote:
> >>> On Tue, 2010-04-27 at 10:25 +0200, Jan Kiszka wrote:
>  Philippe Gerum wrote:
> > On Tue, 2010-04-27 at 08:46 +0200, Jan Kiszka wrote:
> >> Philippe Gerum wrote:
> >>> On Mon, 2010-04-26 at 18:06 +0200, Jan Kiszka wrote:
>  Jan Kiszka wrote:
> > Jan Kiszka wrote:
> >> Hi,
> >>
> >> I'm meditating over an oops in rpi_clear_remote. NULL pointer 
> >> deref, it
> >> /seems/ like thread->rpi is invalid. Looking at the code, I wonder 
> >> if
> >> this could explain the bug:
> >>
> >>
> >> static void rpi_clear_remote(struct xnthread *thread)
> >> {
> >> ...
> >> rpi = thread->rpi;
> >> if (unlikely(rpi == NULL))
> >> return;
> >>
> >> xnlock_get_irqsave(&rpi->rpilock, s);
> >>
> >> /*
> >>  * The RPI slot - if present - is always valid, and won't
> >>  * change since the thread is resuming on this CPU and 
> >> cannot
> >>  * migrate under our feet. We may grab the remote slot lock
> >>  * now.
> >>  */
> >> xnsched_pop_rpi(thread);
> >> thread->rpi = NULL;
> >>
> >> ...
> >>
> >> So we deref (xnsched_pop_rpi) and clear thread->rpi under rpilock, 
> >> but
> >> we check for it without any protection? Sounds racy. I think 
> >> 'thread' is
> >> not only pointing to the current thread but could refer to a 
> >> foreign one
> >> as well, right? Don't we need this:
> >>
> >> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
> >> index 872c37f..1f995d6 100644
> >> --- a/ksrc/nucleus/shadow.c
> >> +++ b/ksrc/nucleus/shadow.c
> >> @@ -331,6 +331,12 @@ static void rpi_clear_remote(struct xnthread 
> >> *thread)
> >>  
> >>xnlock_get_irqsave(&rpi->rpilock, s);
> >>  
> >> +  /* Re-check under lock, someone may have cleared rpi by now. */
> >> +  if (unlikely(thread->rpi == NULL)) {
> >> +  xnlock_put_irqrestore(&rpi->rpilock, s);
> >> +  return;
> >> +  }
> >> +
> >>/*
> >> * The RPI slot - if present - is always valid, and won't
> >> * change since the thread is resuming on this CPU and cannot
> > Another worry: Can thread->rpi become != rpi without being NULL? Or 
> > can
> > we really only race for clearance here?
> >
>  I think so now, therefore I'm proposing this:
> 
>  --->
> 
>  Most RPI services work on the current task or the one to be 
>  scheduled in
>  next, thus are naturally serialized. But rpi_next is not, as it can 
>  walk
>  the chain of RPI requests for a CPU independently. In that case,
>  clearing RPI via rpi_clear_remote can race with rpi_next, and if the
>  former loses after checking thread->rpi for NULL, we will 
>  dereference a
>  NULL pointer in xnsched_pop_rpi().
> 
>  Signed-off-by: Jan Kiszka 
>  ---
>   ksrc/nucleus/shadow.c |9 +
>   1 files changed, 9 insertions(+), 0 deletions(-)
> 
>  diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>  index 872c37f..cf7c08f 100644
>  --- a/ksrc/nucleus/shadow.c
>  +++ b/ksrc/nucleus/shadow.c
>  @@ -332,6 +332,15 @@ static void rpi_clear_remote(struct xnthread 
>  *thread)
>   xnlock_get_irqsave(&rpi->rpilock, s);
>   
>   /*
>  + * Re-check under lock. Someone may have invoked rpi_next and 
>  cleared
>  + * rpi by now.
>  + */
>  +if (unlikely(!rpi_p(thread))) {
>  +xnlock_put_irqrestore(&rpi->rpilock, s);
>  +return;
>  +}
>  +
>  +/*
>    * The RPI slot - if present - is always valid, and won't
>    * change since the thread is resuming on this CPU and cannot
>    * migrate under our feet. We may grab the remote slot lock
> 
> >>> The suggested patch papers over the actual issue, which is that
> >>> rpi_clear_remote() may not invoke rpi_next(), because it may only 
> >>> affect
> >> I don't think that in our case rpi_clear_remote called rpi_next and
> >> therefore crashed. It should rather have been the scenario of both
> >> running in parallel on different CPUs, the former on behalf of a
> 

Re: [Xenomai-core] [PATCH] nucleus: Plug race between rpi_clear_remote and rpi_next

2010-04-27, Jan Kiszka
Philippe Gerum wrote:
> On Tue, 2010-04-27 at 11:27 +0200, Jan Kiszka wrote:
>> Philippe Gerum wrote:
>>> On Tue, 2010-04-27 at 10:25 +0200, Jan Kiszka wrote:
 Philippe Gerum wrote:
> On Tue, 2010-04-27 at 08:46 +0200, Jan Kiszka wrote:
>> Philippe Gerum wrote:
>>> On Mon, 2010-04-26 at 18:06 +0200, Jan Kiszka wrote:
 Jan Kiszka wrote:
> Jan Kiszka wrote:
>> Hi,
>>
>> I'm meditating over an oops in rpi_clear_remote. NULL pointer deref, 
>> it
>> /seems/ like thread->rpi is invalid. Looking at the code, I wonder if
>> this could explain the bug:
>>
>>
>> static void rpi_clear_remote(struct xnthread *thread)
>> {
>> ...
>> rpi = thread->rpi;
>> if (unlikely(rpi == NULL))
>> return;
>>
>> xnlock_get_irqsave(&rpi->rpilock, s);
>>
>> /*
>>  * The RPI slot - if present - is always valid, and won't
>>  * change since the thread is resuming on this CPU and cannot
>>  * migrate under our feet. We may grab the remote slot lock
>>  * now.
>>  */
>> xnsched_pop_rpi(thread);
>> thread->rpi = NULL;
>>
>> ...
>>
>> So we deref (xnsched_pop_rpi) and clear thread->rpi under rpilock, 
>> but
>> we check for it without any protection? Sounds racy. I think 
>> 'thread' is
>> not only pointing to the current thread but could refer to a foreign 
>> one
>> as well, right? Don't we need this:
>>
>> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>> index 872c37f..1f995d6 100644
>> --- a/ksrc/nucleus/shadow.c
>> +++ b/ksrc/nucleus/shadow.c
>> @@ -331,6 +331,12 @@ static void rpi_clear_remote(struct xnthread 
>> *thread)
>>  
>>  xnlock_get_irqsave(&rpi->rpilock, s);
>>  
>> +/* Re-check under lock, someone may have cleared rpi by now. */
>> +if (unlikely(thread->rpi == NULL)) {
>> +xnlock_put_irqrestore(&rpi->rpilock, s);
>> +return;
>> +}
>> +
>>  /*
>>   * The RPI slot - if present - is always valid, and won't
>>   * change since the thread is resuming on this CPU and cannot
> Another worry: Can thread->rpi become != rpi without being NULL? Or 
> can
> we really only race for clearance here?
>
 I think so now, therefore I'm proposing this:

 --->

 Most RPI services work on the current task or the one to be scheduled 
 in
 next, thus are naturally serialized. But rpi_next is not, as it can walk
 the chain of RPI requests for a CPU independently. In that case,
 clearing RPI via rpi_clear_remote can race with rpi_next, and if the
 former loses after checking thread->rpi for NULL, we will dereference a
 NULL pointer in xnsched_pop_rpi().

 Signed-off-by: Jan Kiszka 
 ---
  ksrc/nucleus/shadow.c |9 +
  1 files changed, 9 insertions(+), 0 deletions(-)

 diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
 index 872c37f..cf7c08f 100644
 --- a/ksrc/nucleus/shadow.c
 +++ b/ksrc/nucleus/shadow.c
 @@ -332,6 +332,15 @@ static void rpi_clear_remote(struct xnthread 
 *thread)
xnlock_get_irqsave(&rpi->rpilock, s);
  
/*
 +   * Re-check under lock. Someone may have invoked rpi_next and 
 cleared
 +   * rpi by now.
 +   */
 +  if (unlikely(!rpi_p(thread))) {
 +  xnlock_put_irqrestore(&rpi->rpilock, s);
 +  return;
 +  }
 +
 +  /*
 * The RPI slot - if present - is always valid, and won't
 * change since the thread is resuming on this CPU and cannot
 * migrate under our feet. We may grab the remote slot lock

>>> The suggested patch papers over the actual issue, which is that
>>> rpi_clear_remote() may not invoke rpi_next(), because it may only affect
>> I don't think that in our case rpi_clear_remote called rpi_next and
>> therefore crashed. It should rather have been the scenario of both
>> running in parallel on different CPUs, the former on behalf of a
>> migrated shadow that wants to clear its remainders on the remote CPU,
>> the latter on that CPU, picking a new top RPI after some other shadow
>> was removed from the queue. Is this a possible scenario, and would your
>> patch cover it?
>>
>>> the RPI state of the argument

Re: [Xenomai-core] [PATCH] nucleus: Plug race between rpi_clear_remote and rpi_next

2010-04-27, Philippe Gerum
On Tue, 2010-04-27 at 11:27 +0200, Jan Kiszka wrote:
> Philippe Gerum wrote:
> > On Tue, 2010-04-27 at 10:25 +0200, Jan Kiszka wrote:
> >> Philippe Gerum wrote:
> >>> On Tue, 2010-04-27 at 08:46 +0200, Jan Kiszka wrote:
>  Philippe Gerum wrote:
> > On Mon, 2010-04-26 at 18:06 +0200, Jan Kiszka wrote:
> >> Jan Kiszka wrote:
> >>> Jan Kiszka wrote:
>  Hi,
> 
>  I'm meditating over an oops in rpi_clear_remote. NULL pointer deref, 
>  it
>  /seems/ like thread->rpi is invalid. Looking at the code, I wonder if
>  this could explain the bug:
> 
> 
>  static void rpi_clear_remote(struct xnthread *thread)
>  {
>  ...
>  rpi = thread->rpi;
>  if (unlikely(rpi == NULL))
>  return;
> 
>  xnlock_get_irqsave(&rpi->rpilock, s);
> 
>  /*
>   * The RPI slot - if present - is always valid, and won't
>   * change since the thread is resuming on this CPU and cannot
>   * migrate under our feet. We may grab the remote slot lock
>   * now.
>   */
>  xnsched_pop_rpi(thread);
>  thread->rpi = NULL;
> 
>  ...
> 
>  So we deref (xnsched_pop_rpi) and clear thread->rpi under rpilock, 
>  but
>  we check for it without any protection? Sounds racy. I think 
>  'thread' is
>  not only pointing to the current thread but could refer to a foreign 
>  one
>  as well, right? Don't we need this:
> 
>  diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>  index 872c37f..1f995d6 100644
>  --- a/ksrc/nucleus/shadow.c
>  +++ b/ksrc/nucleus/shadow.c
>  @@ -331,6 +331,12 @@ static void rpi_clear_remote(struct xnthread 
>  *thread)
>   
>   xnlock_get_irqsave(&rpi->rpilock, s);
>   
>  +/* Re-check under lock, someone may have cleared rpi by now. */
>  +if (unlikely(thread->rpi == NULL)) {
>  +xnlock_put_irqrestore(&rpi->rpilock, s);
>  +return;
>  +}
>  +
>   /*
>    * The RPI slot - if present - is always valid, and won't
>    * change since the thread is resuming on this CPU and cannot
> >>> Another worry: Can thread->rpi become != rpi without being NULL? Or 
> >>> can
> >>> we really only race for clearance here?
> >>>
> >> I think so now, therefore I'm proposing this:
> >>
> >> --->
> >>
> >> Most RPI services work on the current task or the one to be scheduled 
> >> in
> >> next, thus are naturally serialized. But rpi_next is not, as it can walk
> >> the chain of RPI requests for a CPU independently. In that case,
> >> clearing RPI via rpi_clear_remote can race with rpi_next, and if the
> >> former loses after checking thread->rpi for NULL, we will dereference a
> >> NULL pointer in xnsched_pop_rpi().
> >>
> >> Signed-off-by: Jan Kiszka 
> >> ---
> >>  ksrc/nucleus/shadow.c |9 +
> >>  1 files changed, 9 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
> >> index 872c37f..cf7c08f 100644
> >> --- a/ksrc/nucleus/shadow.c
> >> +++ b/ksrc/nucleus/shadow.c
> >> @@ -332,6 +332,15 @@ static void rpi_clear_remote(struct xnthread 
> >> *thread)
> >>xnlock_get_irqsave(&rpi->rpilock, s);
> >>  
> >>/*
> >> +   * Re-check under lock. Someone may have invoked rpi_next and 
> >> cleared
> >> +   * rpi by now.
> >> +   */
> >> +  if (unlikely(!rpi_p(thread))) {
> >> +  xnlock_put_irqrestore(&rpi->rpilock, s);
> >> +  return;
> >> +  }
> >> +
> >> +  /*
> >> * The RPI slot - if present - is always valid, and won't
> >> * change since the thread is resuming on this CPU and cannot
> >> * migrate under our feet. We may grab the remote slot lock
> >>
> > The suggested patch papers over the actual issue, which is that
> > rpi_clear_remote() may not invoke rpi_next(), because it may only affect
>  I don't think that in our case rpi_clear_remote called rpi_next and
>  therefore crashed. It should rather have been the scenario of both
>  running in parallel on different CPUs, the former on behalf of a
>  migrated shadow that wants to clear its remainders on the remote CPU,
>  the latter on that CPU, picking a new top RPI after some other shadow
>  was removed from the queue. Is this a possible scenario, and would your
>  patch cover it?
> 
> > the RPI state of the argument thread which must be a l

Re: [Xenomai-core] [PATCH] nucleus: Plug race between rpi_clear_remote and rpi_next

2010-04-27, Jan Kiszka
Philippe Gerum wrote:
> On Tue, 2010-04-27 at 10:25 +0200, Jan Kiszka wrote:
>> Philippe Gerum wrote:
>>> On Tue, 2010-04-27 at 08:46 +0200, Jan Kiszka wrote:
 Philippe Gerum wrote:
> On Mon, 2010-04-26 at 18:06 +0200, Jan Kiszka wrote:
>> Jan Kiszka wrote:
>>> Jan Kiszka wrote:
 Hi,

 I'm meditating over an oops in rpi_clear_remote. NULL pointer deref, it
 /seems/ like thread->rpi is invalid. Looking at the code, I wonder if
 this could explain the bug:


 static void rpi_clear_remote(struct xnthread *thread)
 {
 ...
 rpi = thread->rpi;
 if (unlikely(rpi == NULL))
 return;

 xnlock_get_irqsave(&rpi->rpilock, s);

 /*
  * The RPI slot - if present - is always valid, and won't
  * change since the thread is resuming on this CPU and cannot
  * migrate under our feet. We may grab the remote slot lock
  * now.
  */
 xnsched_pop_rpi(thread);
 thread->rpi = NULL;

 ...

 So we deref (xnsched_pop_rpi) and clear thread->rpi under rpilock, but
 we check for it without any protection? Sounds racy. I think 'thread' 
 is
 not only pointing to the current thread but could refer to a foreign 
 one
 as well, right? Don't we need this:

 diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
 index 872c37f..1f995d6 100644
 --- a/ksrc/nucleus/shadow.c
 +++ b/ksrc/nucleus/shadow.c
 @@ -331,6 +331,12 @@ static void rpi_clear_remote(struct xnthread 
 *thread)
  
xnlock_get_irqsave(&rpi->rpilock, s);
  
 +  /* Re-check under lock, someone may have cleared rpi by now. */
 +  if (unlikely(thread->rpi == NULL)) {
 +  xnlock_put_irqrestore(&rpi->rpilock, s);
 +  return;
 +  }
 +
/*
 * The RPI slot - if present - is always valid, and won't
 * change since the thread is resuming on this CPU and cannot
>>> Another worry: Can thread->rpi become != rpi without being NULL? Or can
>>> we really only race for clearance here?
>>>
>> I think so now, therefore I'm proposing this:
>>
>> --->
>>
>> Most RPI services work on the current task or the one to be scheduled in
>> next, thus are naturally serialized. But rpi_next is not, as it can walk
>> the chain of RPI requests for a CPU independently. In that case,
>> clearing RPI via rpi_clear_remote can race with rpi_next, and if the
>> former loses after checking thread->rpi for NULL, we will dereference a
>> NULL pointer in xnsched_pop_rpi().
>>
>> Signed-off-by: Jan Kiszka 
>> ---
>>  ksrc/nucleus/shadow.c |9 +
>>  1 files changed, 9 insertions(+), 0 deletions(-)
>>
>> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>> index 872c37f..cf7c08f 100644
>> --- a/ksrc/nucleus/shadow.c
>> +++ b/ksrc/nucleus/shadow.c
>> @@ -332,6 +332,15 @@ static void rpi_clear_remote(struct xnthread 
>> *thread)
>>  xnlock_get_irqsave(&rpi->rpilock, s);
>>  
>>  /*
>> + * Re-check under lock. Someone may have invoked rpi_next and 
>> cleared
>> + * rpi by now.
>> + */
>> +if (unlikely(!rpi_p(thread))) {
>> +xnlock_put_irqrestore(&rpi->rpilock, s);
>> +return;
>> +}
>> +
>> +/*
>>   * The RPI slot - if present - is always valid, and won't
>>   * change since the thread is resuming on this CPU and cannot
>>   * migrate under our feet. We may grab the remote slot lock
>>
> The suggested patch papers over the actual issue, which is that
> rpi_clear_remote() may not invoke rpi_next(), because it may only affect
 I don't think that in our case rpi_clear_remote called rpi_next and
 therefore crashed. It should rather have been the scenario of both
 running in parallel on different CPUs, the former on behalf of a
 migrated shadow that wants to clear its remainders on the remote CPU,
 the latter on that CPU, picking a new top RPI after some other shadow
 was removed from the queue. Is this a possible scenario, and would your
 patch cover it?

> the RPI state of the argument thread which must be a local one, and not
> that of any random thread that happens to be linked to the remote RPI
> queue.
>
> By calling rpi_next(), rpi_clear_remote() shoots itself in the foot,
> allowing a concurrent invocation of itself on a remote CPU, to fiddle
> with the r

Re: [Xenomai-core] [PATCH] nucleus: Plug race between rpi_clear_remote and rpi_next

2010-04-27, Philippe Gerum
On Tue, 2010-04-27 at 10:25 +0200, Jan Kiszka wrote:
> Philippe Gerum wrote:
> > On Tue, 2010-04-27 at 08:46 +0200, Jan Kiszka wrote:
> >> Philippe Gerum wrote:
> >>> On Mon, 2010-04-26 at 18:06 +0200, Jan Kiszka wrote:
>  Jan Kiszka wrote:
> > Jan Kiszka wrote:
> >> Hi,
> >>
> >> I'm meditating over an oops in rpi_clear_remote. NULL pointer deref, it
> >> /seems/ like thread->rpi is invalid. Looking at the code, I wonder if
> >> this could explain the bug:
> >>
> >>
> >> static void rpi_clear_remote(struct xnthread *thread)
> >> {
> >> ...
> >> rpi = thread->rpi;
> >> if (unlikely(rpi == NULL))
> >> return;
> >>
> >> xnlock_get_irqsave(&rpi->rpilock, s);
> >>
> >> /*
> >>  * The RPI slot - if present - is always valid, and won't
> >>  * change since the thread is resuming on this CPU and cannot
> >>  * migrate under our feet. We may grab the remote slot lock
> >>  * now.
> >>  */
> >> xnsched_pop_rpi(thread);
> >> thread->rpi = NULL;
> >>
> >> ...
> >>
> >> So we deref (xnsched_pop_rpi) and clear thread->rpi under rpilock, but
> >> we check for it without any protection? Sounds racy. I think 'thread' 
> >> is
> >> not only pointing to the current thread but could refer to a foreign 
> >> one
> >> as well, right? Don't we need this:
> >>
> >> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
> >> index 872c37f..1f995d6 100644
> >> --- a/ksrc/nucleus/shadow.c
> >> +++ b/ksrc/nucleus/shadow.c
> >> @@ -331,6 +331,12 @@ static void rpi_clear_remote(struct xnthread 
> >> *thread)
> >>  
> >>xnlock_get_irqsave(&rpi->rpilock, s);
> >>  
> >> +  /* Re-check under lock, someone may have cleared rpi by now. */
> >> +  if (unlikely(thread->rpi == NULL)) {
> >> +  xnlock_put_irqrestore(&rpi->rpilock, s);
> >> +  return;
> >> +  }
> >> +
> >>/*
> >> * The RPI slot - if present - is always valid, and won't
> >> * change since the thread is resuming on this CPU and cannot
> > Another worry: Can thread->rpi become != rpi without being NULL? Or can
> > we really only race for clearance here?
> >
>  I think so now, therefore I'm proposing this:
> 
>  --->
> 
>  Most RPI services work on the current task or the one to be scheduled in
>  next, thus are naturally serialized. But rpi_next is not, as it can walk
>  the chain of RPI requests for a CPU independently. In that case,
>  clearing RPI via rpi_clear_remote can race with rpi_next, and if the
>  former loses after checking thread->rpi for NULL, we will dereference a
>  NULL pointer in xnsched_pop_rpi().
> 
>  Signed-off-by: Jan Kiszka 
>  ---
>   ksrc/nucleus/shadow.c |9 +
>   1 files changed, 9 insertions(+), 0 deletions(-)
> 
>  diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>  index 872c37f..cf7c08f 100644
>  --- a/ksrc/nucleus/shadow.c
>  +++ b/ksrc/nucleus/shadow.c
>  @@ -332,6 +332,15 @@ static void rpi_clear_remote(struct xnthread 
>  *thread)
>   xnlock_get_irqsave(&rpi->rpilock, s);
>   
>   /*
>  + * Re-check under lock. Someone may have invoked rpi_next and 
>  cleared
>  + * rpi by now.
>  + */
>  +if (unlikely(!rpi_p(thread))) {
>  +xnlock_put_irqrestore(&rpi->rpilock, s);
>  +return;
>  +}
>  +
>  +/*
>    * The RPI slot - if present - is always valid, and won't
>    * change since the thread is resuming on this CPU and cannot
>    * migrate under our feet. We may grab the remote slot lock
> 
> >>> The suggested patch papers over the actual issue, which is that
> >>> rpi_clear_remote() may not invoke rpi_next(), because it may only affect
> >> I don't think that in our case rpi_clear_remote called rpi_next and
> >> therefore crashed. It should rather have been the scenario of both
> >> running in parallel on different CPUs, the former on behalf of a
> >> migrated shadow that wants to clear its remainders on the remote CPU,
> >> the latter on that CPU, picking a new top RPI after some other shadow
> >> was removed from the queue. Is this a possible scenario, and would your
> >> patch cover it?
> >>
> >>> the RPI state of the argument thread which must be a local one, and not
> >>> that of any random thread that happens to be linked to the remote RPI
> >>> queue.
> >>>
> >>> By calling rpi_next(), rpi_clear_remote() shoots itself in the foot,
> >>> allowing a concurrent invocation of itself on a remote CPU, to fiddle
> >>> with the rpi backlink of a thread w

Re: [Xenomai-core] [PATCH] nucleus: Plug race between rpi_clear_remote and rpi_next

2010-04-27, Jan Kiszka
Philippe Gerum wrote:
> On Tue, 2010-04-27 at 08:46 +0200, Jan Kiszka wrote:
>> Philippe Gerum wrote:
>>> On Mon, 2010-04-26 at 18:06 +0200, Jan Kiszka wrote:
 Jan Kiszka wrote:
> Jan Kiszka wrote:
>> Hi,
>>
>> I'm meditating over an oops in rpi_clear_remote. NULL pointer deref, it
>> /seems/ like thread->rpi is invalid. Looking at the code, I wonder if
>> this could explain the bug:
>>
>>
>> static void rpi_clear_remote(struct xnthread *thread)
>> {
>> ...
>> rpi = thread->rpi;
>> if (unlikely(rpi == NULL))
>> return;
>>
>> xnlock_get_irqsave(&rpi->rpilock, s);
>>
>> /*
>>  * The RPI slot - if present - is always valid, and won't
>>  * change since the thread is resuming on this CPU and cannot
>>  * migrate under our feet. We may grab the remote slot lock
>>  * now.
>>  */
>> xnsched_pop_rpi(thread);
>> thread->rpi = NULL;
>>
>> ...
>>
>> So we deref (xnsched_pop_rpi) and clear thread->rpi under rpilock, but
>> we check for it without any protection? Sounds racy. I think 'thread' is
>> not only pointing to the current thread but could refer to a foreign one
>> as well, right? Don't we need this:
>>
>> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>> index 872c37f..1f995d6 100644
>> --- a/ksrc/nucleus/shadow.c
>> +++ b/ksrc/nucleus/shadow.c
>> @@ -331,6 +331,12 @@ static void rpi_clear_remote(struct xnthread 
>> *thread)
>>  
>>  xnlock_get_irqsave(&rpi->rpilock, s);
>>  
>> +/* Re-check under lock, someone may have cleared rpi by now. */
>> +if (unlikely(thread->rpi == NULL)) {
>> +xnlock_put_irqrestore(&rpi->rpilock, s);
>> +return;
>> +}
>> +
>>  /*
>>   * The RPI slot - if present - is always valid, and won't
>>   * change since the thread is resuming on this CPU and cannot
> Another worry: Can thread->rpi become != rpi without being NULL? Or can
> we really only race for clearance here?
>
 I think so now, therefore I'm proposing this:

 --->

 Most RPI services work on the current task or the one to be scheduled in
 next, thus are naturally serialized. But rpi_next is not, as it can walk
 the chain of RPI requests for a CPU independently. In that case,
 clearing RPI via rpi_clear_remote can race with rpi_next, and if the
 former loses after checking thread->rpi for NULL, we will dereference a
 NULL pointer in xnsched_pop_rpi().

 Signed-off-by: Jan Kiszka 
 ---
  ksrc/nucleus/shadow.c |9 +
  1 files changed, 9 insertions(+), 0 deletions(-)

 diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
 index 872c37f..cf7c08f 100644
 --- a/ksrc/nucleus/shadow.c
 +++ b/ksrc/nucleus/shadow.c
 @@ -332,6 +332,15 @@ static void rpi_clear_remote(struct xnthread *thread)
xnlock_get_irqsave(&rpi->rpilock, s);
  
/*
 +   * Re-check under lock. Someone may have invoked rpi_next and cleared
 +   * rpi by now.
 +   */
 +  if (unlikely(!rpi_p(thread))) {
 +  xnlock_put_irqrestore(&rpi->rpilock, s);
 +  return;
 +  }
 +
 +  /*
 * The RPI slot - if present - is always valid, and won't
 * change since the thread is resuming on this CPU and cannot
 * migrate under our feet. We may grab the remote slot lock

>>> The suggested patch papers over the actual issue, which is that
>>> rpi_clear_remote() may not invoke rpi_next(), because it may only affect
>> I don't think that in our case rpi_clear_remote called rpi_next and
>> therefore crashed. It should rather have been the scenario of both
>> running in parallel on different CPUs, the former on behalf of a
>> migrated shadow that wants to clear its remainders on the remote CPU,
>> the latter on that CPU, picking a new top RPI after some other shadow
>> was removed from the queue. Is this a possible scenario, and would your
>> patch cover it?
>>
>>> the RPI state of the argument thread which must be a local one, and not
>>> that of any random thread that happens to be linked to the remote RPI
>>> queue.
>>>
>>> By calling rpi_next(), rpi_clear_remote() shoots itself in the foot,
>>> allowing a concurrent invocation of itself on a remote CPU, to fiddle
>>> with the rpi backlink of a thread which is not active on the
>>> local/per-cpu Xenomai scheduler instance, which is the point where
>>> things start to hit the crapper.
>>>
>>> Now, unless I can't even synchronize the couple of neurons I have left
>>> at this hour, the following patch should better fix the issue, because
>>> it restores the two basic rules that apply to the whole RPI machinery,
>>> namely:

Re: [Xenomai-core] [PATCH] nucleus: Plug race between rpi_clear_remote and rpi_next

2010-04-27, Philippe Gerum
On Tue, 2010-04-27 at 08:46 +0200, Jan Kiszka wrote:
> Philippe Gerum wrote:
> > On Mon, 2010-04-26 at 18:06 +0200, Jan Kiszka wrote:
> >> Jan Kiszka wrote:
> >>> Jan Kiszka wrote:
>  Hi,
> 
>  I'm meditating over an oops in rpi_clear_remote. NULL pointer deref, it
>  /seems/ like thread->rpi is invalid. Looking at the code, I wonder if
>  this could explain the bug:
> 
> 
>  static void rpi_clear_remote(struct xnthread *thread)
>  {
>  ...
>  rpi = thread->rpi;
>  if (unlikely(rpi == NULL))
>  return;
> 
>  xnlock_get_irqsave(&rpi->rpilock, s);
> 
>  /*
>   * The RPI slot - if present - is always valid, and won't
>   * change since the thread is resuming on this CPU and cannot
>   * migrate under our feet. We may grab the remote slot lock
>   * now.
>   */
>  xnsched_pop_rpi(thread);
>  thread->rpi = NULL;
> 
>  ...
> 
>  So we deref (xnsched_pop_rpi) and clear thread->rpi under rpilock, but
>  we check for it without any protection? Sounds racy. I think 'thread' is
>  not only pointing to the current thread but could refer to a foreign one
>  as well, right? Don't we need this:
> 
>  diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>  index 872c37f..1f995d6 100644
>  --- a/ksrc/nucleus/shadow.c
>  +++ b/ksrc/nucleus/shadow.c
>  @@ -331,6 +331,12 @@ static void rpi_clear_remote(struct xnthread 
>  *thread)
>   
>   xnlock_get_irqsave(&rpi->rpilock, s);
>   
>  +/* Re-check under lock, someone may have cleared rpi by now. */
>  +if (unlikely(thread->rpi == NULL)) {
>  +xnlock_put_irqrestore(&rpi->rpilock, s);
>  +return;
>  +}
>  +
>   /*
>    * The RPI slot - if present - is always valid, and won't
>    * change since the thread is resuming on this CPU and cannot
> >>> Another worry: Can thread->rpi become != rpi without being NULL? Or can
> >>> we really only race for clearance here?
> >>>
> >> I think so now, therefore I'm proposing this:
> >>
> >> --->
> >>
> >> Most RPI services work on the current task or the one to be scheduled in
> >> next, thus are naturally serialized. But rpi_next is not, as it can walk
> >> the chain of RPI requests for a CPU independently. In that case,
> >> clearing RPI via rpi_clear_remote can race with rpi_next, and if the
> >> former loses after checking thread->rpi for NULL, we will dereference a
> >> NULL pointer in xnsched_pop_rpi().
> >>
> >> Signed-off-by: Jan Kiszka 
> >> ---
> >>  ksrc/nucleus/shadow.c |9 +
> >>  1 files changed, 9 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
> >> index 872c37f..cf7c08f 100644
> >> --- a/ksrc/nucleus/shadow.c
> >> +++ b/ksrc/nucleus/shadow.c
> >> @@ -332,6 +332,15 @@ static void rpi_clear_remote(struct xnthread *thread)
> >>xnlock_get_irqsave(&rpi->rpilock, s);
> >>  
> >>/*
> >> +   * Re-check under lock. Someone may have invoked rpi_next and cleared
> >> +   * rpi by now.
> >> +   */
> >> +  if (unlikely(!rpi_p(thread))) {
> >> +  xnlock_put_irqrestore(&rpi->rpilock, s);
> >> +  return;
> >> +  }
> >> +
> >> +  /*
> >> * The RPI slot - if present - is always valid, and won't
> >> * change since the thread is resuming on this CPU and cannot
> >> * migrate under our feet. We may grab the remote slot lock
> >>
> > 
> > The suggested patch papers over the actual issue, which is that
> > rpi_clear_remote() may not invoke rpi_next(), because it may only affect
> 
> I don't think that in our case rpi_clear_remote called rpi_next and
> therefore crashed. It should rather have been the scenario of both
> running in parallel on different CPUs, the former on behalf of a
> migrated shadow that wants to clear its remainders on the remote CPU,
> the latter on that CPU, picking a new top RPI after some other shadow
> was removed from the queue. Is this a possible scenario, and would your
> patch cover it?
> 
> > the RPI state of the argument thread which must be a local one, and not
> > that of any random thread that happens to be linked to the remote RPI
> > queue.
> > 
> > By calling rpi_next(), rpi_clear_remote() shoots itself in the foot,
> > allowing a concurrent invocation of itself on a remote CPU, to fiddle
> > with the rpi backlink of a thread which is not active on the
> > local/per-cpu Xenomai scheduler instance, which is the point where
> > things start to hit the crapper.
> > 
> > Now, unless I can't even synchronize the couple of neurons I have left
> > at this hour, the following patch should better fix the issue, because
> > it restores the two basic rules that apply to the whole RPI machinery,
> > namely:
> > 
> > - rpi_* calls

Re: [Xenomai-core] [PATCH] nucleus: Plug race between rpi_clear_remote and rpi_next

2010-04-26, Jan Kiszka
Philippe Gerum wrote:
> On Mon, 2010-04-26 at 18:06 +0200, Jan Kiszka wrote:
>> Jan Kiszka wrote:
>>> Jan Kiszka wrote:
 Hi,

 I'm meditating over an oops in rpi_clear_remote. NULL pointer deref, it
 /seems/ like thread->rpi is invalid. Looking at the code, I wonder if
 this could explain the bug:


 static void rpi_clear_remote(struct xnthread *thread)
 {
 ...
 rpi = thread->rpi;
 if (unlikely(rpi == NULL))
 return;

 xnlock_get_irqsave(&rpi->rpilock, s);

 /*
  * The RPI slot - if present - is always valid, and won't
  * change since the thread is resuming on this CPU and cannot
  * migrate under our feet. We may grab the remote slot lock
  * now.
  */
 xnsched_pop_rpi(thread);
 thread->rpi = NULL;

 ...

 So we deref (xnsched_pop_rpi) and clear thread->rpi under rpilock, but
 we check for it without any protection? Sounds racy. I think 'thread' is
 not only pointing to the current thread but could refer to a foreign one
 as well, right? Don't we need this:

 diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
 index 872c37f..1f995d6 100644
 --- a/ksrc/nucleus/shadow.c
 +++ b/ksrc/nucleus/shadow.c
 @@ -331,6 +331,12 @@ static void rpi_clear_remote(struct xnthread *thread)
  
xnlock_get_irqsave(&rpi->rpilock, s);
  
 +  /* Re-check under lock, someone may have cleared rpi by now. */
 +  if (unlikely(thread->rpi == NULL)) {
 +  xnlock_put_irqrestore(&rpi->rpilock, s);
 +  return;
 +  }
 +
/*
 * The RPI slot - if present - is always valid, and won't
 * change since the thread is resuming on this CPU and cannot
>>> Another worry: Can thread->rpi become != rpi without being NULL? Or can
>>> we really only race for clearance here?
>>>
>> I think so now, therefore I'm proposing this:
>>
>> --->
>>
>> Most RPI services work on the current task or the one to be scheduled in
>> next, thus are naturally serialized. But rpi_next is not, as it can walk
>> the chain of RPI requests for a CPU independently. In that case,
>> clearing RPI via rpi_clear_remote can race with rpi_next, and if the
>> former loses after checking thread->rpi for NULL, we will dereference a
>> NULL pointer in xnsched_pop_rpi().
>>
>> Signed-off-by: Jan Kiszka 
>> ---
>>  ksrc/nucleus/shadow.c |9 +
>>  1 files changed, 9 insertions(+), 0 deletions(-)
>>
>> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>> index 872c37f..cf7c08f 100644
>> --- a/ksrc/nucleus/shadow.c
>> +++ b/ksrc/nucleus/shadow.c
>> @@ -332,6 +332,15 @@ static void rpi_clear_remote(struct xnthread *thread)
>>  xnlock_get_irqsave(&rpi->rpilock, s);
>>  
>>  /*
>> + * Re-check under lock. Someone may have invoked rpi_next and cleared
>> + * rpi by now.
>> + */
>> +if (unlikely(!rpi_p(thread))) {
>> +xnlock_put_irqrestore(&rpi->rpilock, s);
>> +return;
>> +}
>> +
>> +/*
>>   * The RPI slot - if present - is always valid, and won't
>>   * change since the thread is resuming on this CPU and cannot
>>   * migrate under our feet. We may grab the remote slot lock
>>
> 
> The suggested patch papers over the actual issue, which is that
> rpi_clear_remote() may not invoke rpi_next(), because it may only affect

I don't think that in our case rpi_clear_remote called rpi_next and
therefore crashed. It should rather have been the scenario of both
running in parallel on different CPUs, the former on behalf of a
migrated shadow that wants to clear its remainders on the remote CPU,
the latter on that CPU, picking a new top RPI after some other shadow
was removed from the queue. Is this a possible scenario, and would your
patch cover it?
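
Concretely, the suspected interleaving is: CPU0 reads thread->rpi unlocked
and sees a valid slot; CPU1, holding the rpilock inside an rpi_next()-style
walk, clears thread->rpi; CPU0 then acquires the lock and xnsched_pop_rpi()
dereferences the now-NULL backlink. A minimal user-space model of that
pattern, in plain C with pthreads rather than the nucleus API (all names
below are invented for illustration; only the control flow mirrors
rpi_clear_remote()):

#include <pthread.h>

struct slot {
	pthread_mutex_t lock;		/* stands in for the per-CPU rpilock */
	int depth;			/* stands in for the RPI queue state */
};

struct task {
	struct slot *rpi;		/* backlink, like xnthread->rpi */
};

static struct slot slot0 = { PTHREAD_MUTEX_INITIALIZER, 0 };
static struct task task0;

/* Unpatched flow: check t->rpi unlocked, dereference it under the lock. */
static void clear_racy(struct task *t)
{
	struct slot *rpi = t->rpi;

	if (rpi == NULL)
		return;
	pthread_mutex_lock(&rpi->lock);
	/* A concurrent rpi_next()-like path may have cleared t->rpi
	 * between the check above and this point... */
	t->rpi->depth--;		/* ...turning this into a NULL deref */
	t->rpi = NULL;
	pthread_mutex_unlock(&rpi->lock);
}

/* Patched flow: identical, except for the re-check under the lock. */
static void clear_safe(struct task *t)
{
	struct slot *rpi = t->rpi;

	if (rpi == NULL)
		return;
	pthread_mutex_lock(&rpi->lock);
	if (t->rpi == NULL) {		/* lost the race: nothing to do */
		pthread_mutex_unlock(&rpi->lock);
		return;
	}
	t->rpi->depth--;
	t->rpi = NULL;
	pthread_mutex_unlock(&rpi->lock);
}

/* Plays the role of rpi_next() on the other CPU: under the slot lock,
 * clear the task's backlink while dropping it from the queue. */
static void *next_like(void *arg)
{
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&slot0.lock);
		task0.rpi = NULL;
		pthread_mutex_unlock(&slot0.lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid;
	long i;

	pthread_create(&tid, NULL, next_like, NULL);
	for (i = 0; i < 100000000L; i++) {
		/* unlocked producer: part of the modeled race */
		task0.rpi = &slot0;
		clear_racy(&task0);	/* swap in clear_safe() to survive */
	}
	return 0;
}

As with the original oops, the window is tiny, so the racy variant may need
many runs to fault; the point is only where the unlocked check goes stale.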

> the RPI state of the argument thread which must be a local one, and not
> that of any random thread that happens to be linked to the remote RPI
> queue.
> 
> By calling rpi_next(), rpi_clear_remote() shoots itself in the foot,
> allowing a concurrent invocation of itself on a remote CPU, to fiddle
> with the rpi backlink of a thread which is not active on the
> local/per-cpu Xenomai scheduler instance, which is the point where
> things start to hit the crapper.
> 
> Now, unless I can't even synchronize the couple of neurons I have left
> at this hour, the following patch should better fix the issue, because
> it restores the two basic rules that apply to the whole RPI machinery,
> namely:
> 
> - rpi_* calls may only alter the contents of the local scheduler's RPI
> queue, with the notable exception of rpi_clear_remote() which may remove
> the given _local_ thread only, from a remote RPI slot.
> 
> - rpi_* calls may only change the RPI state of threads which are
> controlled by the local Xenomai scheduler instance, except rpi_push()

Re: [Xenomai-core] [PATCH] nucleus: Plug race between rpi_clear_remote and rpi_next

2010-04-26, Philippe Gerum
On Mon, 2010-04-26 at 18:06 +0200, Jan Kiszka wrote:
> Jan Kiszka wrote:
> > Jan Kiszka wrote:
> >> Hi,
> >>
> >> I'm meditating over an oops in rpi_clear_remote. NULL pointer deref, it
> >> /seems/ like thread->rpi is invalid. Looking at the code, I wonder if
> >> this could explain the bug:
> >>
> >>
> >> static void rpi_clear_remote(struct xnthread *thread)
> >> {
> >> ...
> >> rpi = thread->rpi;
> >> if (unlikely(rpi == NULL))
> >> return;
> >>
> >> xnlock_get_irqsave(&rpi->rpilock, s);
> >>
> >> /*
> >>  * The RPI slot - if present - is always valid, and won't
> >>  * change since the thread is resuming on this CPU and cannot
> >>  * migrate under our feet. We may grab the remote slot lock
> >>  * now.
> >>  */
> >> xnsched_pop_rpi(thread);
> >> thread->rpi = NULL;
> >>
> >> ...
> >>
> >> So we deref (xnsched_pop_rpi) and clear thread->rpi under rpilock, but
> >> we check for it without any protection? Sounds racy. I think 'thread' is
> >> not only pointing to the current thread but could refer to a foreign one
> >> as well, right? Don't we need this:
> >>
> >> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
> >> index 872c37f..1f995d6 100644
> >> --- a/ksrc/nucleus/shadow.c
> >> +++ b/ksrc/nucleus/shadow.c
> >> @@ -331,6 +331,12 @@ static void rpi_clear_remote(struct xnthread *thread)
> >>  
> >>xnlock_get_irqsave(&rpi->rpilock, s);
> >>  
> >> +  /* Re-check under lock, someone may have cleared rpi by now. */
> >> +  if (unlikely(thread->rpi == NULL)) {
> >> +  xnlock_put_irqrestore(&rpi->rpilock, s);
> >> +  return;
> >> +  }
> >> +
> >>/*
> >> * The RPI slot - if present - is always valid, and won't
> >> * change since the thread is resuming on this CPU and cannot
> > 
> > Another worry: Can thread->rpi become != rpi without being NULL? Or can
> > we really only race for clearance here?
> > 
> 
> I think so now, therefore I'm proposing this:
> 
> --->
> 
> Most RPI services work on the current task or the one to be scheduled in
> next, thus are naturally serialized. But rpi_next is not, as it can walk
> the chain of RPI requests for a CPU independently. In that case,
> clearing RPI via rpi_clear_remote can race with rpi_next, and if the
> former loses after checking thread->rpi for NULL, we will dereference a
> NULL pointer in xnsched_pop_rpi().
> 
> Signed-off-by: Jan Kiszka 
> ---
>  ksrc/nucleus/shadow.c |9 +
>  1 files changed, 9 insertions(+), 0 deletions(-)
> 
> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
> index 872c37f..cf7c08f 100644
> --- a/ksrc/nucleus/shadow.c
> +++ b/ksrc/nucleus/shadow.c
> @@ -332,6 +332,15 @@ static void rpi_clear_remote(struct xnthread *thread)
>   xnlock_get_irqsave(&rpi->rpilock, s);
>  
>   /*
> +  * Re-check under lock. Someone may have invoked rpi_next and cleared
> +  * rpi by now.
> +  */
> + if (unlikely(!rpi_p(thread))) {
> + xnlock_put_irqrestore(&rpi->rpilock, s);
> + return;
> + }
> +
> + /*
>* The RPI slot - if present - is always valid, and won't
>* change since the thread is resuming on this CPU and cannot
>* migrate under our feet. We may grab the remote slot lock
> 

The suggested patch papers over the actual issue, which is that
rpi_clear_remote() may not invoke rpi_next(), because it may only affect
the RPI state of the argument thread which must be a local one, and not
that of any random thread that happens to be linked to the remote RPI
queue.

By calling rpi_next(), rpi_clear_remote() shoots itself in the foot,
allowing a concurrent invocation of itself on a remote CPU, to fiddle
with the rpi backlink of a thread which is not active on the
local/per-cpu Xenomai scheduler instance, which is the point where
things start to hit the crapper.

Now, unless I can't even synchronize the couple of neurons I have left
at this hour, the following patch should better fix the issue, because
it restores the two basic rules that apply to the whole RPI machinery,
namely:

- rpi_* calls may only alter the contents of the local scheduler's RPI
queue, with the notable exception of rpi_clear_remote() which may remove
the given _local_ thread only, from a remote RPI slot.

- rpi_* calls may only change the RPI state of threads which are
controlled by the local Xenomai scheduler instance, except rpi_push()
when called for setting up the RPI state of an emerging thread, in which
case this is a no-conflict zone.

That breakage was introduced in the early 2.5.1 timeframe, so 2.4.x
should be immune from this particular bug.

diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
index 872c37f..1397ed1 100644
--- a/ksrc/nucleus/shadow.c
+++ b/ksrc/nucleus/shadow.c
@@ -340,7 +340,12 @@ static void rpi_clear_remote(struct xnthread *thread)
xnsched_pop_rpi(thread);
thread->rpi = NULL;
 
-   if
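
Philippe's diff is cut short by the archive above. Purely as an
illustration of the two rules, not of the actual fix, the shape they
impose on rpi_clear_remote() can be sketched in user-space C (a pthread
mutex stands in for the rpilock; the queue layout and every name here
are invented):

#include <pthread.h>
#include <stddef.h>

struct task;				/* forward declaration */

struct rpi_slot {
	pthread_mutex_t lock;		/* stands in for the per-CPU rpilock */
	struct task *head;		/* that CPU's RPI queue */
};

struct task {
	struct rpi_slot *rpi;		/* backlink, like xnthread->rpi */
	struct task *next;
};

/* Runs on the thread's local CPU, for a local thread only.  If the two
 * rules above hold, no remote CPU ever writes t->rpi, so the unlocked
 * NULL check cannot go stale and needs no re-check under the lock. */
static void rpi_clear_remote_sketch(struct task *t)
{
	struct rpi_slot *slot = t->rpi;
	struct task **p;

	if (slot == NULL)
		return;

	pthread_mutex_lock(&slot->lock);
	for (p = &slot->head; *p != NULL; p = &(*p)->next) {
		if (*p == t) {
			*p = t->next;	/* unlink t, and only t */
			break;
		}
	}
	t->rpi = NULL;
	pthread_mutex_unlock(&slot->lock);
	/* Deliberately no rpi_next() here: recomputing the remote slot's
	 * top priority is left to code running on that CPU itself. */
}

The design point, per the rules above: with no rpi_next() call on this
path, no CPU ever writes the ->rpi backlink of another CPU's threads, so
the unlocked entry check can no longer be yanked from under us.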

Re: [Xenomai-core] [PATCH] nucleus: Plug race between rpi_clear_remote and rpi_next

2010-04-26, Jan Kiszka
Jan Kiszka wrote:
> Jan Kiszka wrote:
>> Jan Kiszka wrote:
>>> Hi,
>>>
>>> I'm meditating over an oops in rpi_clear_remote. NULL pointer deref, it
>>> /seems/ like thread->rpi is invalid. Looking at the code, I wonder if
>>> this could explain the bug:
>>>
>>>
>>> static void rpi_clear_remote(struct xnthread *thread)
>>> {
>>> ...
>>> rpi = thread->rpi;
>>> if (unlikely(rpi == NULL))
>>> return;
>>>
>>> xnlock_get_irqsave(&rpi->rpilock, s);
>>>
>>> /*
>>>  * The RPI slot - if present - is always valid, and won't
>>>  * change since the thread is resuming on this CPU and cannot
>>>  * migrate under our feet. We may grab the remote slot lock
>>>  * now.
>>>  */
>>> xnsched_pop_rpi(thread);
>>> thread->rpi = NULL;
>>>
>>> ...
>>>
>>> So we deref (xnsched_pop_rpi) and clear thread->rpi under rpilock, but
>>> we check for it without any protection? Sounds racy. I think 'thread' is
>>> not only pointing to the current thread but could refer to a foreign one
>>> as well, right? Don't we need this:
>>>
>>> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>>> index 872c37f..1f995d6 100644
>>> --- a/ksrc/nucleus/shadow.c
>>> +++ b/ksrc/nucleus/shadow.c
>>> @@ -331,6 +331,12 @@ static void rpi_clear_remote(struct xnthread *thread)
>>>  
>>> xnlock_get_irqsave(&rpi->rpilock, s);
>>>  
>>> +   /* Re-check under lock, someone may have cleared rpi by now. */
>>> +   if (unlikely(thread->rpi == NULL)) {
>>> +   xnlock_put_irqrestore(&rpi->rpilock, s);
>>> +   return;
>>> +   }
>>> +
>>> /*
>>>  * The RPI slot - if present - is always valid, and won't
>>>  * change since the thread is resuming on this CPU and cannot
>> Another worry: Can thread->rpi become != rpi without being NULL? Or can
>> we really only race for clearance here?
>>
> 
> I think so now, therefore I'm proposing this:
> 
> --->
> 
> Most RPI services work on the current task or the one to be scheduled in
> next, thus are naturally serialized. But rpi_next is not, as it can walk

...serialized w.r.t. changes of thread->rpi.

> the chain of RPI requests for a CPU independently. In that case,
> clearing RPI via rpi_clear_remote can race with rpi_next, and if the
> former loses after checking thread->rpi for NULL, we will dereference a
> NULL pointer in xnsched_pop_rpi().
> 
> Signed-off-by: Jan Kiszka 
> ---
>  ksrc/nucleus/shadow.c |9 +
>  1 files changed, 9 insertions(+), 0 deletions(-)
> 
> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
> index 872c37f..cf7c08f 100644
> --- a/ksrc/nucleus/shadow.c
> +++ b/ksrc/nucleus/shadow.c
> @@ -332,6 +332,15 @@ static void rpi_clear_remote(struct xnthread *thread)
>   xnlock_get_irqsave(&rpi->rpilock, s);
>  
>   /*
> +  * Re-check under lock. Someone may have invoked rpi_next and cleared
> +  * rpi by now.
> +  */
> + if (unlikely(!rpi_p(thread))) {
> + xnlock_put_irqrestore(&rpi->rpilock, s);
> + return;
> + }
> +
> + /*
>* The RPI slot - if present - is always valid, and won't
>* change since the thread is resuming on this CPU and cannot
>* migrate under our feet. We may grab the remote slot lock
> 
> 

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
