Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: > Am 11.11.2010 16:46, Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> I just hope we finally converge over a solution. Looks like all >>> possibilities have been explored now. A few more comments on this one: >>> >>> It probably makes sense to group the status bits accordingly (both their >>> values and definitions) and briefly document on which status field they >>> are supposed to be applied. >>> >>> I do not understand the split logic - or some bits are simply not yet >>> migrated. XNHDEFER, XNSWLOCK, XNKCOUT are all local-only as well, no? >>> Then better put them in the _local_ status field, that's more consistent >>> (and would help if we once wanted to optimize their cache line usage). >>> >>> The naming is unfortunate: status vs. lstatus. This is asking for >>> confusion and typos. They must be better distinguishable, e.g. >>> local_status. Or we need accessors that have debug checks built in, >>> catching wrong bits for their target fields. >>> >>> Good catch of the RPI breakage, Gilles! >> Hi Jan, >> >> I just pushed a modified version of this patch, including your remarks >> as well as the ones of Philippe. I suspect some of the cleanup patches >> you sent still make sense over this patch, would it be possible to >> rebase them over this pushed version? > > Just rebased and pushed my queue. One additional optimization was added > ("Optimize setting of XNRESCHED"), basic tests passed. Ah, I thought about this one, I wonder why I forgot it. Merged, thanks. -- Gilles. ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
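Jan's suggestion above of "accessors that have debug checks built in, catching wrong bits for their target fields" can be sketched in user space. This is not the actual Xenomai code: the struct, the accessor names, and the bit values are invented for illustration, and assert() stands in for a debug-build check such as XENO_BUGON().

```c
#include <assert.h>

/* Hypothetical bit layout: remote-visible bits in the low byte,
 * local-only bits in the high byte. Names are modeled on the thread;
 * the values are invented for this sketch. */
#define XNRESCHED   0x0001UL  /* remote-visible: set under nklock */
#define XNHTICK     0x0100UL  /* local-only */
#define XNINIRQ     0x0200UL  /* local-only */

#define XN_REMOTE_MASK 0x00ffUL
#define XN_LOCAL_MASK  0xff00UL

struct xnsched_sketch {
	unsigned long status;        /* remotely accessed bits */
	unsigned long local_status;  /* CPU-local bits */
};

/* Accessors that trap bits applied to the wrong field; in a debug
 * build of the nucleus the assert would be a XENO_BUGON()-style check,
 * compiled out otherwise. */
static inline void sched_set_status(struct xnsched_sketch *sched,
				    unsigned long bits)
{
	assert((bits & ~XN_REMOTE_MASK) == 0);
	sched->status |= bits;
}

static inline void sched_set_local_status(struct xnsched_sketch *sched,
					  unsigned long bits)
{
	assert((bits & ~XN_LOCAL_MASK) == 0);
	sched->local_status |= bits;
}
```

With such accessors, passing XNHTICK to sched_set_status() fails loudly in a debug build instead of silently corrupting the remotely accessed word.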
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 11.11.2010 16:46, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> I just hope we finally converge over a solution. Looks like all >> possibilities have been explored now. A few more comments on this one: >> >> It probably makes sense to group the status bits accordingly (both their >> values and definitions) and briefly document on which status field they >> are supposed to be applied. >> >> I do not understand the split logic - or some bits are simply not yet >> migrated. XNHDEFER, XNSWLOCK, XNKCOUT are all local-only as well, no? >> Then better put them in the _local_ status field, that's more consistent >> (and would help if we once wanted to optimize their cache line usage). >> >> The naming is unfortunate: status vs. lstatus. This is asking for >> confusion and typos. They must be better distinguishable, e.g. >> local_status. Or we need accessors that have debug checks built in, >> catching wrong bits for their target fields. >> >> Good catch of the RPI breakage, Gilles! > > Hi Jan, > > I just pushed a modified version of this patch, including your remarks > as well as the ones of Philippe. I suspect some of the cleanup patches > you sent still make sense over this patch, would it be possible to > rebase them over this pushed version? Just rebased and pushed my queue. One additional optimization was added ("Optimize setting of XNRESCHED"), basic tests passed. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: > I just hope we finally converge over a solution. Looks like all > possibilities have been explored now. A few more comments on this one: > > It probably makes sense to group the status bits accordingly (both their > values and definitions) and briefly document on which status field they > are supposed to be applied. > > I do not understand the split logic - or some bits are simply not yet > migrated. XNHDEFER, XNSWLOCK, XNKCOUT are all local-only as well, no? > Then better put them in the _local_ status field, that's more consistent > (and would help if we once wanted to optimize their cache line usage). > > The naming is unfortunate: status vs. lstatus. This is asking for > confusion and typos. They must be better distinguishable, e.g. > local_status. Or we need accessors that have debug checks built in, > catching wrong bits for their target fields. > > Good catch of the RPI breakage, Gilles! Hi Jan, I just pushed a modified version of this patch, including your remarks as well as the ones of Philippe. I suspect some of the cleanup patches you sent still make sense over this patch, would it be possible to rebase them over this pushed version? TIA, -- Gilles. ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] Potential problem with rt_eepro100
On Sun, 2010-11-07 at 11:14 +0100, Jan Kiszka wrote: > Am 07.11.2010 11:12, Gilles Chanteperdrix wrote: > > Jan Kiszka wrote: > >> Am 07.11.2010 11:03, Philippe Gerum wrote: > >>> On Sun, 2010-11-07 at 09:31 +0100, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: > >>> Anyway, after some thoughts, I think we are going to try and make the > >>> current situation work instead of going back to the old way. > >>> > >>> You can find the patch which attempts to do so here: > >>> http://sisyphus.hd.free.fr/~gilles/sched_status.txt > >> Ack. At last, this addresses the real issues without asking for > >> regression funkiness: fix the lack of barrier before testing XNSCHED in > > Check the kernel, we actually need it on both sides. Wherever the final > > barriers will be, we should leave a comment behind why they are there. > > Could be picked up from kernel/smp.c. > We have it on both sides: the non-local flags are modified while holding > the nklock. Unlocking the nklock implies a barrier. > >>> I think we may have an issue with this kind of construct: > >>> > >>> xnlock_get_irq*(&nklock) > >>> xnpod_resume/suspend/whatever_thread() > >>> xnlock_get_irq*(&nklock) > >>> ... > >>> xnlock_put_irq*(&nklock) > >>> xnpod_schedule() > >>> xnlock_get_irq*(&nklock) > >>> send_ipi > >>> => xnpod_schedule_handler on dest CPU > >>> xnlock_put_irq*(&nklock) > >>> xnlock_put_irq*(&nklock) > >>> > >>> The issue would be triggered by the use of recursive locking. In that > >>> case, the source CPU would only sync its cache when the lock is actually > >>> dropped by the outer xnlock_put_irq* call and the inner > >>> xnlock_get/put_irq* would not act as barriers, so the remote > >>> rescheduling handler won't always see the XNSCHED update done remotely, > >>> and may lead to a no-op. So we need a barrier before sending the IPI in > >>> __xnpod_test_resched(). > >> > >> That's what I said. > >> > >> And we need it on the reader side as an rmb(). 
> > > > This one we have, in xnpod_schedule_handler. > > > > Right, with your patch (the above sounded like we only need it on writer > side). C'mon... -- Philippe.
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 07.11.2010 11:12, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Am 07.11.2010 11:03, Philippe Gerum wrote: >>> On Sun, 2010-11-07 at 09:31 +0100, Gilles Chanteperdrix wrote: Jan Kiszka wrote: >>> Anyway, after some thoughts, I think we are going to try and make the >>> current situation work instead of going back to the old way. >>> >>> You can find the patch which attempts to do so here: >>> http://sisyphus.hd.free.fr/~gilles/sched_status.txt >> Ack. At last, this addresses the real issues without asking for >> regression funkiness: fix the lack of barrier before testing XNSCHED in > Check the kernel, we actually need it on both sides. Wherever the final > barriers will be, we should leave a comment behind why they are there. > Could be picked up from kernel/smp.c. We have it on both sides: the non-local flags are modified while holding the nklock. Unlocking the nklock implies a barrier. >>> I think we may have an issue with this kind of construct: >>> >>> xnlock_get_irq*(&nklock) >>> xnpod_resume/suspend/whatever_thread() >>> xnlock_get_irq*(&nklock) >>> ... >>> xnlock_put_irq*(&nklock) >>> xnpod_schedule() >>> xnlock_get_irq*(&nklock) >>> send_ipi >>> => xnpod_schedule_handler on dest CPU >>> xnlock_put_irq*(&nklock) >>> xnlock_put_irq*(&nklock) >>> >>> The issue would be triggered by the use of recursive locking. In that >>> case, the source CPU would only sync its cache when the lock is actually >>> dropped by the outer xnlock_put_irq* call and the inner >>> xnlock_get/put_irq* would not act as barriers, so the remote >>> rescheduling handler won't always see the XNSCHED update done remotely, >>> and may lead to a no-op. So we need a barrier before sending the IPI in >>> __xnpod_test_resched(). >> >> That's what I said. >> >> And we need it on the reader side as an rmb(). > > This one we have, in xnpod_schedule_handler. > Right, with your patch (the above sounded like we only need it on writer side). 
Jan
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: > Am 07.11.2010 11:03, Philippe Gerum wrote: >> On Sun, 2010-11-07 at 09:31 +0100, Gilles Chanteperdrix wrote: >>> Jan Kiszka wrote: >> Anyway, after some thoughts, I think we are going to try and make the >> current situation work instead of going back to the old way. >> >> You can find the patch which attempts to do so here: >> http://sisyphus.hd.free.fr/~gilles/sched_status.txt > Ack. At last, this addresses the real issues without asking for > regression funkiness: fix the lack of barrier before testing XNSCHED in Check the kernel, we actually need it on both sides. Wherever the final barriers will be, we should leave a comment behind why they are there. Could be picked up from kernel/smp.c. >>> We have it on both sides: the non-local flags are modified while holding >>> the nklock. Unlocking the nklock implies a barrier. >> I think we may have an issue with this kind of construct: >> >> xnlock_get_irq*(&nklock) >> xnpod_resume/suspend/whatever_thread() >> xnlock_get_irq*(&nklock) >> ... >> xnlock_put_irq*(&nklock) >> xnpod_schedule() >> xnlock_get_irq*(&nklock) >> send_ipi >> => xnpod_schedule_handler on dest CPU >> xnlock_put_irq*(&nklock) >> xnlock_put_irq*(&nklock) >> >> The issue would be triggered by the use of recursive locking. In that >> case, the source CPU would only sync its cache when the lock is actually >> dropped by the outer xnlock_put_irq* call and the inner >> xnlock_get/put_irq* would not act as barriers, so the remote >> rescheduling handler won't always see the XNSCHED update done remotely, >> and may lead to a no-op. So we need a barrier before sending the IPI in >> __xnpod_test_resched(). > > That's what I said. > > And we need it on the reader side as an rmb(). This one we have, in xnpod_schedule_handler. -- Gilles. ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 07.11.2010 11:03, Philippe Gerum wrote: > On Sun, 2010-11-07 at 09:31 +0100, Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: > Anyway, after some thoughts, I think we are going to try and make the > current situation work instead of going back to the old way. > > You can find the patch which attempts to do so here: > http://sisyphus.hd.free.fr/~gilles/sched_status.txt Ack. At last, this addresses the real issues without asking for regression funkiness: fix the lack of barrier before testing XNSCHED in >>> >>> Check the kernel, we actually need it on both sides. Wherever the final >>> barriers will be, we should leave a comment behind why they are there. >>> Could be picked up from kernel/smp.c. >> >> We have it on both sides: the non-local flags are modified while holding >> the nklock. Unlocking the nklock implies a barrier. > > I think we may have an issue with this kind of construct: > > xnlock_get_irq*(&nklock) > xnpod_resume/suspend/whatever_thread() > xnlock_get_irq*(&nklock) > ... > xnlock_put_irq*(&nklock) > xnpod_schedule() > xnlock_get_irq*(&nklock) > send_ipi > => xnpod_schedule_handler on dest CPU > xnlock_put_irq*(&nklock) > xnlock_put_irq*(&nklock) > > The issue would be triggered by the use of recursive locking. In that > case, the source CPU would only sync its cache when the lock is actually > dropped by the outer xnlock_put_irq* call and the inner > xnlock_get/put_irq* would not act as barriers, so the remote > rescheduling handler won't always see the XNSCHED update done remotely, > and may lead to a no-op. So we need a barrier before sending the IPI in > __xnpod_test_resched(). That's what I said. And we need it on the reader side as an rmb(). Jan signature.asc Description: OpenPGP digital signature ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 07.11.2010 10:57, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Am 07.11.2010 09:31, Gilles Chanteperdrix wrote: >>> Jan Kiszka wrote: >> Anyway, after some thoughts, I think we are going to try and make the >> current situation work instead of going back to the old way. >> >> You can find the patch which attempts to do so here: >> http://sisyphus.hd.free.fr/~gilles/sched_status.txt > Ack. At last, this addresses the real issues without asking for > regression funkiness: fix the lack of barrier before testing XNSCHED in Check the kernel, we actually need it on both sides. Wherever the final barriers will be, we should leave a comment behind why they are there. Could be picked up from kernel/smp.c. >>> We have it on both sides: the non-local flags are modified while holding >>> the nklock. Unlocking the nklock implies a barrier. >> >> I think this does not work if nklock is used nested, ie. hold across >> xnsched_set_resched and the next xnpod_schedule. > > Ok. There is a race here, even without nesting, we need a barrier before > sending the IPI. > I do not understand the split logic - or some bits are simply not yet migrated. XNHDEFER, XNSWLOCK, XNKCOUT are all local-only as well, no? Then better put them in the _local_ status field, that's more consistent (and would help if we once wanted to optimize their cache line usage). >>> Maybe the naming is not good the. ->status is everything which is >>> modified under nklock, ->lstatus is for XNINIRQ and XNHTICK which are >>> modified without holding the nklock. >> >> And this is not a good clustering IMHO as it makes no difference for a >> local-only flag if it is protected by nklock or not (as long as IRQs are >> off, of course). What makes a difference it the CPU using the related >> field. If we group according to local and remote usage, less cache line >> bouncing can be achieved (of both fields are also cache line aligned, >> but that is a second step that only helps if the first is made). 
> > The two status are in the same cache line. > Which can (and likely should) be changed - if the fields are logically separated first. Jan
Re: [Xenomai-core] Potential problem with rt_eepro100
On Sun, 2010-11-07 at 09:31 +0100, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: > >>> Anyway, after some thoughts, I think we are going to try and make the > >>> current situation work instead of going back to the old way. > >>> > >>> You can find the patch which attempts to do so here: > >>> http://sisyphus.hd.free.fr/~gilles/sched_status.txt > >> Ack. At last, this addresses the real issues without asking for > >> regression funkiness: fix the lack of barrier before testing XNSCHED in > > > > Check the kernel, we actually need it on both sides. Wherever the final > > barriers will be, we should leave a comment behind why they are there. > > Could be picked up from kernel/smp.c. > > We have it on both sides: the non-local flags are modified while holding > the nklock. Unlocking the nklock implies a barrier. I think we may have an issue with this kind of construct: xnlock_get_irq*(&nklock) xnpod_resume/suspend/whatever_thread() xnlock_get_irq*(&nklock) ... xnlock_put_irq*(&nklock) xnpod_schedule() xnlock_get_irq*(&nklock) send_ipi => xnpod_schedule_handler on dest CPU xnlock_put_irq*(&nklock) xnlock_put_irq*(&nklock) The issue would be triggered by the use of recursive locking. In that case, the source CPU would only sync its cache when the lock is actually dropped by the outer xnlock_put_irq* call and the inner xnlock_get/put_irq* would not act as barriers, so the remote rescheduling handler won't always see the XNSCHED update done remotely, and may lead to a no-op. So we need a barrier before sending the IPI in __xnpod_test_resched(). This could not happen if all schedule state changes where clearly isolated from rescheduling calls in different critical sections, but it's sometimes not an option not to group them for consistency reasons. > > > > >> the xnpod_schedule pre-test, and stop sched->status trashing due to > >> XNINIRQ/XNHTICK/XNRPICK ops done un-synced on nklock. 
> >> > >> In short, this patch looks like moving the local-only flags where they > >> belong, i.e. anywhere you want but *outside* of the status with remotely > >> accessed bits. XNRPICK seems to be handled differently, but it makes > >> sense to group it with other RPI data as you did, so fine with me. > > > > I just hope we finally converge over a solution. Looks like all > > possibilities have been explored now. A few more comments on this one: > > > > It probably makes sense to group the status bits accordingly (both their > > values and definitions) and briefly document on which status field they > > are supposed to be applied. > > Ok, but I wanted them to not use the same values, so that we can use the > sched->status | sched->lstatus trick in xnpod_schedule. Something is > lacking too: we probably need to use sched->status | sched->lstatus for > display in /proc. > > > > > I do not understand the split logic - or some bits are simply not yet > > migrated. XNHDEFER, XNSWLOCK, XNKCOUT are all local-only as well, no? > > Then better put them in the _local_ status field, that's more consistent > > (and would help if we once wanted to optimize their cache line usage). > > Maybe the naming is not good the. ->status is everything which is > modified under nklock, ->lstatus is for XNINIRQ and XNHTICK which are > modified without holding the nklock. > > > > > The naming is unfortunate: status vs. lstatus. This is asking for > > confusion and typos. They must be better distinguishable, e.g. > > local_status. Or we need accessors that have debug checks built in, > > catching wrong bits for their target fields. > > I agree. > > > > > Good catch of the RPI breakage, Gilles! > -- Philippe. ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: > Am 07.11.2010 09:31, Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: > Anyway, after some thoughts, I think we are going to try and make the > current situation work instead of going back to the old way. > > You can find the patch which attempts to do so here: > http://sisyphus.hd.free.fr/~gilles/sched_status.txt Ack. At last, this addresses the real issues without asking for regression funkiness: fix the lack of barrier before testing XNSCHED in >>> Check the kernel, we actually need it on both sides. Wherever the final >>> barriers will be, we should leave a comment behind why they are there. >>> Could be picked up from kernel/smp.c. >> We have it on both sides: the non-local flags are modified while holding >> the nklock. Unlocking the nklock implies a barrier. > > I think this does not work if nklock is used nested, ie. hold across > xnsched_set_resched and the next xnpod_schedule. Ok. There is a race here, even without nesting, we need a barrier before sending the IPI. >>> I do not understand the split logic - or some bits are simply not yet >>> migrated. XNHDEFER, XNSWLOCK, XNKCOUT are all local-only as well, no? >>> Then better put them in the _local_ status field, that's more consistent >>> (and would help if we once wanted to optimize their cache line usage). >> Maybe the naming is not good the. ->status is everything which is >> modified under nklock, ->lstatus is for XNINIRQ and XNHTICK which are >> modified without holding the nklock. > > And this is not a good clustering IMHO as it makes no difference for a > local-only flag if it is protected by nklock or not (as long as IRQs are > off, of course). What makes a difference it the CPU using the related > field. If we group according to local and remote usage, less cache line > bouncing can be achieved (of both fields are also cache line aligned, > but that is a second step that only helps if the first is made). The two status are in the same cache line. -- Gilles. 
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 07.11.2010 09:31, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: Anyway, after some thoughts, I think we are going to try and make the current situation work instead of going back to the old way. You can find the patch which attempts to do so here: http://sisyphus.hd.free.fr/~gilles/sched_status.txt >>> Ack. At last, this addresses the real issues without asking for >>> regression funkiness: fix the lack of barrier before testing XNSCHED in >> >> Check the kernel, we actually need it on both sides. Wherever the final >> barriers will be, we should leave a comment behind why they are there. >> Could be picked up from kernel/smp.c. > > We have it on both sides: the non-local flags are modified while holding > the nklock. Unlocking the nklock implies a barrier. I think this does not work if nklock is used nested, ie. hold across xnsched_set_resched and the next xnpod_schedule. > >> >>> the xnpod_schedule pre-test, and stop sched->status trashing due to >>> XNINIRQ/XNHTICK/XNRPICK ops done un-synced on nklock. >>> >>> In short, this patch looks like moving the local-only flags where they >>> belong, i.e. anywhere you want but *outside* of the status with remotely >>> accessed bits. XNRPICK seems to be handled differently, but it makes >>> sense to group it with other RPI data as you did, so fine with me. >> >> I just hope we finally converge over a solution. Looks like all >> possibilities have been explored now. A few more comments on this one: >> >> It probably makes sense to group the status bits accordingly (both their >> values and definitions) and briefly document on which status field they >> are supposed to be applied. > > Ok, but I wanted them to not use the same values, so that we can use the > sched->status | sched->lstatus trick in xnpod_schedule. Something is > lacking too: we probably need to use sched->status | sched->lstatus for > display in /proc. Nothing speaks against clustering the flag values so that they can be OR'ed. 
They should just be grouped according to their target field. > >> >> I do not understand the split logic - or some bits are simply not yet >> migrated. XNHDEFER, XNSWLOCK, XNKCOUT are all local-only as well, no? >> Then better put them in the _local_ status field, that's more consistent >> (and would help if we once wanted to optimize their cache line usage). > > Maybe the naming is not good then. ->status is everything which is > modified under nklock, ->lstatus is for XNINIRQ and XNHTICK which are > modified without holding the nklock. And this is not a good clustering IMHO as it makes no difference for a local-only flag if it is protected by nklock or not (as long as IRQs are off, of course). What makes a difference is the CPU using the related field. If we group according to local and remote usage, less cache line bouncing can be achieved (if both fields are also cache line aligned, but that is a second step that only helps if the first is made). Jan
Re: [Xenomai-core] Potential problem with rt_eepro100
On Sun, 2010-11-07 at 02:00 +0100, Jan Kiszka wrote: > Am 06.11.2010 23:49, Philippe Gerum wrote: > > On Sat, 2010-11-06 at 21:37 +0100, Gilles Chanteperdrix wrote: > >> Anders Blomdell wrote: > >>> Gilles Chanteperdrix wrote: > Anders Blomdell wrote: > > Gilles Chanteperdrix wrote: > >> Jan Kiszka wrote: > >>> Am 05.11.2010 00:24, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: > > Am 04.11.2010 23:06, Gilles Chanteperdrix wrote: > >> Jan Kiszka wrote: > > At first sight, here you are more breaking things than cleaning > > them. > Still, it has the SMP record for my test program, still runs > with ftrace > on (after 2 hours, where it previously failed after maximum 23 > minutes). > >>> My version was indeed still buggy, I'm reworking it ATM. > >>> > If I get the gist of Jan's changes, they are (using the IPI to > transfer > one bit of information: your cpu needs to reschedule): > > xnsched_set_resched: > - setbits((__sched__)->status, XNRESCHED); > > xnpod_schedule_handler: > +xnsched_set_resched(sched); > > If you (we?) decide to keep the debug checks, under what > circumstances > would the current check trigger (in laymans language, that I'll > be able > to understand)? > >>> That's actually what /me is wondering as well. I do not see yet > >>> how you > >>> can reliably detect a missed reschedule reliably (that was the > >>> purpose > >>> of the debug check) given the racy nature between signaling > >>> resched and > >>> processing the resched hints. > >> The purpose of the debugging change is to detect a change of the > >> scheduler state which was not followed by setting the XNRESCHED > >> bit. > > But that is nucleus business, nothing skins can screw up (as long as > > they do not misuse APIs). > Yes, but it happens that we modify the nucleus from time to time. 
> > >> Getting it to work is relatively simple: we add a "scheduler > >> change set > >> remotely" bit to the sched structure which is NOT in the status > >> bit, set > >> this bit when changing a remote sched (under nklock). In the debug > >> check > >> code, if the scheduler state changed, and the XNRESCHED bit is not > >> set, > >> only consider this a but if this new bit is not set. All this is > >> compiled out if the debug is not enabled. > > I still see no benefit in this check. Where to you want to place > > the bit > > set? Aren't that just the same locations where > > xnsched_set_[self_]resched already is today? > Well no, that would be another bit in the sched structure which would > allow us to manipulate the status bits from the local cpu. That > supplementary bit would only be changed from a distant CPU, and > serve to > detect the race which causes the false positive. The resched bits are > set on the local cpu to get xnpod_schedule to trigger a rescheduling > on > the distance cpu. That bit would be set on the remote cpu's sched. > Only > when debugging is enabled. > > > But maybe you can provide some motivating bug scenarios, real ones > > of > > the past or realistic ones of the future. > Of course. The bug is anything which changes the scheduler state but > does not set the XNRESCHED bit. This happened when we started the SMP > port. New scheduling policies would be good candidates for a revival > of > this bug. > > >>> You don't gain any worthwhile check if you cannot make the > >>> instrumentation required for a stable detection simpler than the > >>> proper > >>> problem solution itself. And this is what I'm still skeptical of. > >> The solution is simple, but finding the problem without the > >> instrumentation is way harder than with the instrumentation, so the > >> instrumentation is worth something. > >> > >> Reproducing the false positive is surprisingly easy with a simple > >> dual-cpu semaphore ping-pong test. 
So, here is the (tested) patch, > >> using a ridiculous long variable name to illustrate what I was > >> thinking about: > >> > >> diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h > >> index cf4..454b8e8 100644 > >> --- a/include/nucleus/sched.h > >> +++ b/include/nucleus/sched.h > >> @@ -108,6 +108,9 @@ typedef struct xnsched { > >> struct xnthre
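The debug logic Gilles describes (and that his truncated patch implements with `debug_resched_from_remote`) can be condensed into one predicate. A sketch under stated assumptions: the struct and function below are invented stand-ins for the real nucleus fields, and boil down to "a scheduler state change without XNRESCHED is a bug only if no remote CPU left its marker".

```c
#include <assert.h>

struct sched_dbg {
	int resched_set;               /* XNRESCHED pending on this sched */
	int debug_resched_from_remote; /* set only by remote CPUs, under
					* nklock, in debug builds */
};

/* Returns 1 if this looks like a genuinely missed XNRESCHED:
 * the run queue changed, the bit was not set, and the change is not
 * explained by a remote update racing with the local check. */
static int missed_resched_bug(const struct sched_dbg *sched,
			      int state_changed)
{
	if (!state_changed || sched->resched_set)
		return 0;
	return !sched->debug_resched_from_remote;
}
```

The remote marker is what kills the false positive seen in the dual-CPU semaphore ping-pong test, while still catching a new scheduling policy that forgets to raise XNRESCHED.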
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: >>> Anyway, after some thoughts, I think we are going to try and make the >>> current situation work instead of going back to the old way. >>> >>> You can find the patch which attempts to do so here: >>> http://sisyphus.hd.free.fr/~gilles/sched_status.txt >> Ack. At last, this addresses the real issues without asking for >> regression funkiness: fix the lack of barrier before testing XNSCHED in > > Check the kernel, we actually need it on both sides. Wherever the final > barriers will be, we should leave a comment behind why they are there. > Could be picked up from kernel/smp.c. We have it on both sides: the non-local flags are modified while holding the nklock. Unlocking the nklock implies a barrier. > >> the xnpod_schedule pre-test, and stop sched->status trashing due to >> XNINIRQ/XNHTICK/XNRPICK ops done un-synced on nklock. >> >> In short, this patch looks like moving the local-only flags where they >> belong, i.e. anywhere you want but *outside* of the status with remotely >> accessed bits. XNRPICK seems to be handled differently, but it makes >> sense to group it with other RPI data as you did, so fine with me. > > I just hope we finally converge over a solution. Looks like all > possibilities have been explored now. A few more comments on this one: > > It probably makes sense to group the status bits accordingly (both their > values and definitions) and briefly document on which status field they > are supposed to be applied. Ok, but I wanted them to not use the same values, so that we can use the sched->status | sched->lstatus trick in xnpod_schedule. Something is lacking too: we probably need to use sched->status | sched->lstatus for display in /proc. > > I do not understand the split logic - or some bits are simply not yet > migrated. XNHDEFER, XNSWLOCK, XNKCOUT are all local-only as well, no? 
> Then better put them in the _local_ status field, that's more consistent > (and would help if we once wanted to optimize their cache line usage). Maybe the naming is not good then. ->status is everything which is modified under nklock, ->lstatus is for XNINIRQ and XNHTICK which are modified without holding the nklock. > > The naming is unfortunate: status vs. lstatus. This is asking for > confusion and typos. They must be better distinguishable, e.g. > local_status. Or we need accessors that have debug checks built in, > catching wrong bits for their target fields. I agree. > > Good catch of the RPI breakage, Gilles! -- Gilles.
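The "sched->status | sched->lstatus trick" Gilles mentions only works if the two fields never reuse a bit value, which is why he wants them grouped but disjoint. A minimal sketch, with bit values invented for illustration (they are not the real Xenomai values):

```c
#include <assert.h>

/* ->status bits: modified under nklock. */
#define XNKCOUT  0x0001UL
#define XNSCHED  0x0002UL
/* ->lstatus bits: modified without holding nklock. */
#define XNINIRQ  0x0100UL
#define XNHTICK  0x0200UL

/* Because the value ranges are disjoint, xnpod_schedule() (or a
 * /proc dump) can test the union of both words in a single OR
 * without one field's bits aliasing the other's. */
static unsigned long combined(unsigned long status, unsigned long lstatus)
{
	return status | lstatus;
}
```

If the two fields shared values, the OR would make, say, a pending local tick indistinguishable from a remote reschedule request.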
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 06.11.2010 23:49, Philippe Gerum wrote: > On Sat, 2010-11-06 at 21:37 +0100, Gilles Chanteperdrix wrote: >> Anders Blomdell wrote: >>> Gilles Chanteperdrix wrote: Anders Blomdell wrote: > Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> Am 05.11.2010 00:24, Gilles Chanteperdrix wrote: Jan Kiszka wrote: > Am 04.11.2010 23:06, Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: > At first sight, here you are more breaking things than cleaning > them. Still, it has the SMP record for my test program, still runs with ftrace on (after 2 hours, where it previously failed after maximum 23 minutes). >>> My version was indeed still buggy, I'm reworking it ATM. >>> If I get the gist of Jan's changes, they are (using the IPI to transfer one bit of information: your cpu needs to reschedule): xnsched_set_resched: - setbits((__sched__)->status, XNRESCHED); xnpod_schedule_handler: + xnsched_set_resched(sched); If you (we?) decide to keep the debug checks, under what circumstances would the current check trigger (in laymans language, that I'll be able to understand)? >>> That's actually what /me is wondering as well. I do not see yet how >>> you >>> can reliably detect a missed reschedule reliably (that was the >>> purpose >>> of the debug check) given the racy nature between signaling resched >>> and >>> processing the resched hints. >> The purpose of the debugging change is to detect a change of the >> scheduler state which was not followed by setting the XNRESCHED bit. > But that is nucleus business, nothing skins can screw up (as long as > they do not misuse APIs). Yes, but it happens that we modify the nucleus from time to time. >> Getting it to work is relatively simple: we add a "scheduler change >> set >> remotely" bit to the sched structure which is NOT in the status bit, >> set >> this bit when changing a remote sched (under nklock). 
In the debug >> check >> code, if the scheduler state changed, and the XNRESCHED bit is not >> set, >> only consider this a bug if this new bit is not set. All this is >> compiled out if the debug is not enabled. > I still see no benefit in this check. Where do you want to place the > bit > set? Aren't those just the same locations where > xnsched_set_[self_]resched already is today? Well no, that would be another bit in the sched structure which would allow us to manipulate the status bits from the local cpu. That supplementary bit would only be changed from a distant CPU, and serve to detect the race which causes the false positive. The resched bits are set on the local cpu to get xnpod_schedule to trigger a rescheduling on the distant cpu. That bit would be set on the remote cpu's sched. Only when debugging is enabled. > But maybe you can provide some motivating bug scenarios, real ones of > the past or realistic ones of the future. Of course. The bug is anything which changes the scheduler state but does not set the XNRESCHED bit. This happened when we started the SMP port. New scheduling policies would be good candidates for a revival of this bug. >>> You don't gain any worthwhile check if you cannot make the >>> instrumentation required for a stable detection simpler than the proper >>> problem solution itself. And this is what I'm still skeptical of. >> The solution is simple, but finding the problem without the >> instrumentation is way harder than with the instrumentation, so the >> instrumentation is worth something. >> >> Reproducing the false positive is surprisingly easy with a simple >> dual-cpu semaphore ping-pong test. 
So, here is the (tested) patch, >> using a ridiculously long variable name to illustrate what I was >> thinking about: >> >> diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h >> index cf4..454b8e8 100644 >> --- a/include/nucleus/sched.h >> +++ b/include/nucleus/sched.h >> @@ -108,6 +108,9 @@ typedef struct xnsched { >> struct xnthread *gktarget; >> #endif >> >> +#ifdef CONFIG_XENO_OPT_DEBUG_NUCLEUS >> + int debug_resched_from_remote; >> +#endif >> } xnsched_t; >> >> union xnsched_policy_param; >> @@ -185,6 +188,8 @@ static inline int xnsched_resched_p(struct xnsched >> *sched) >>xnsched_t *
Re: [Xenomai-core] Potential problem with rt_eepro100
Anders Blomdell wrote: > Gilles Chanteperdrix wrote: >> Anders Blomdell wrote: >>> Gilles Chanteperdrix wrote: Jan Kiszka wrote: > Am 05.11.2010 00:24, Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> Am 04.11.2010 23:06, Gilles Chanteperdrix wrote: Jan Kiszka wrote: >>> At first sight, here you are more breaking things than cleaning >>> them. >> Still, it has the SMP record for my test program, still runs with >> ftrace >> on (after 2 hours, where it previously failed after a maximum of 23 >> minutes). > My version was indeed still buggy, I'm reworking it ATM. > >> If I get the gist of Jan's changes, they are (using the IPI to >> transfer >> one bit of information: your cpu needs to reschedule): >> >> xnsched_set_resched: >> - setbits((__sched__)->status, XNRESCHED); >> >> xnpod_schedule_handler: >> +xnsched_set_resched(sched); >> >> If you (we?) decide to keep the debug checks, under what >> circumstances >> would the current check trigger (in layman's language, that I'll be >> able >> to understand)? > That's actually what /me is wondering as well. I do not see yet how > you > can reliably detect a missed reschedule (that was the purpose > of the debug check) given the racy nature between signaling resched > and > processing the resched hints. The purpose of the debugging change is to detect a change of the scheduler state which was not followed by setting the XNRESCHED bit. >>> But that is nucleus business, nothing skins can screw up (as long as >>> they do not misuse APIs). >> Yes, but it happens that we modify the nucleus from time to time. >> Getting it to work is relatively simple: we add a "scheduler change set remotely" bit to the sched structure which is NOT in the status bits, set this bit when changing a remote sched (under nklock). In the debug check code, if the scheduler state changed, and the XNRESCHED bit is not set, only consider this a bug if this new bit is not set. All this is compiled out if the debug is not enabled. 
>>> I still see no benefit in this check. Where do you want to place the bit >>> set? Aren't those just the same locations where >>> xnsched_set_[self_]resched already is today? >> Well no, that would be another bit in the sched structure which would >> allow us to manipulate the status bits from the local cpu. That >> supplementary bit would only be changed from a distant CPU, and serve to >> detect the race which causes the false positive. The resched bits are >> set on the local cpu to get xnpod_schedule to trigger a rescheduling on >> the distant cpu. That bit would be set on the remote cpu's sched. Only >> when debugging is enabled. >> >>> But maybe you can provide some motivating bug scenarios, real ones of >>> the past or realistic ones of the future. >> Of course. The bug is anything which changes the scheduler state but >> does not set the XNRESCHED bit. This happened when we started the SMP >> port. New scheduling policies would be good candidates for a revival of >> this bug. >> > You don't gain any worthwhile check if you cannot make the > instrumentation required for a stable detection simpler than the proper > problem solution itself. And this is what I'm still skeptical of. The solution is simple, but finding the problem without the instrumentation is way harder than with the instrumentation, so the instrumentation is worth something. Reproducing the false positive is surprisingly easy with a simple dual-cpu semaphore ping-pong test. 
So, here is the (tested) patch, using a ridiculously long variable name to illustrate what I was thinking about: diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h index cf4..454b8e8 100644 --- a/include/nucleus/sched.h +++ b/include/nucleus/sched.h @@ -108,6 +108,9 @@ typedef struct xnsched { struct xnthread *gktarget; #endif +#ifdef CONFIG_XENO_OPT_DEBUG_NUCLEUS + int debug_resched_from_remote; +#endif } xnsched_t; union xnsched_policy_param; @@ -185,6 +188,8 @@ static inline int xnsched_resched_p(struct xnsched *sched) xnsched_t *current_sched = xnpod_current_sched();\ __setbits(current_sched->status, XNRESCHED); \ if (current_sched != (__sched__)){ \ + if (XENO_DEBUG(NUCLEUS)) \ + __sched__->debug_resched_from_remote = 1;
Re: [Xenomai-core] Potential problem with rt_eepro100
Gilles Chanteperdrix wrote: Anders Blomdell wrote: Gilles Chanteperdrix wrote: Jan Kiszka wrote: Am 05.11.2010 00:24, Gilles Chanteperdrix wrote: Jan Kiszka wrote: Am 04.11.2010 23:06, Gilles Chanteperdrix wrote: Jan Kiszka wrote: At first sight, here you are more breaking things than cleaning them. Still, it has the SMP record for my test program, still runs with ftrace on (after 2 hours, where it previously failed after a maximum of 23 minutes). My version was indeed still buggy, I'm reworking it ATM. If I get the gist of Jan's changes, they are (using the IPI to transfer one bit of information: your cpu needs to reschedule): xnsched_set_resched: - setbits((__sched__)->status, XNRESCHED); xnpod_schedule_handler: + xnsched_set_resched(sched); If you (we?) decide to keep the debug checks, under what circumstances would the current check trigger (in layman's language, that I'll be able to understand)? That's actually what /me is wondering as well. I do not see yet how you can reliably detect a missed reschedule (that was the purpose of the debug check) given the racy nature between signaling resched and processing the resched hints. The purpose of the debugging change is to detect a change of the scheduler state which was not followed by setting the XNRESCHED bit. But that is nucleus business, nothing skins can screw up (as long as they do not misuse APIs). Yes, but it happens that we modify the nucleus from time to time. Getting it to work is relatively simple: we add a "scheduler change set remotely" bit to the sched structure which is NOT in the status bits, set this bit when changing a remote sched (under nklock). In the debug check code, if the scheduler state changed, and the XNRESCHED bit is not set, only consider this a bug if this new bit is not set. All this is compiled out if the debug is not enabled. I still see no benefit in this check. Where do you want to place the bit set? 
Aren't those just the same locations where xnsched_set_[self_]resched already is today? Well no, that would be another bit in the sched structure which would allow us to manipulate the status bits from the local cpu. That supplementary bit would only be changed from a distant CPU, and serve to detect the race which causes the false positive. The resched bits are set on the local cpu to get xnpod_schedule to trigger a rescheduling on the distant cpu. That bit would be set on the remote cpu's sched. Only when debugging is enabled. But maybe you can provide some motivating bug scenarios, real ones of the past or realistic ones of the future. Of course. The bug is anything which changes the scheduler state but does not set the XNRESCHED bit. This happened when we started the SMP port. New scheduling policies would be good candidates for a revival of this bug. You don't gain any worthwhile check if you cannot make the instrumentation required for a stable detection simpler than the proper problem solution itself. And this is what I'm still skeptical of. The solution is simple, but finding the problem without the instrumentation is way harder than with the instrumentation, so the instrumentation is worth something. Reproducing the false positive is surprisingly easy with a simple dual-cpu semaphore ping-pong test. 
So, here is the (tested) patch, using a ridiculously long variable name to illustrate what I was thinking about: diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h index cf4..454b8e8 100644 --- a/include/nucleus/sched.h +++ b/include/nucleus/sched.h @@ -108,6 +108,9 @@ typedef struct xnsched { struct xnthread *gktarget; #endif +#ifdef CONFIG_XENO_OPT_DEBUG_NUCLEUS + int debug_resched_from_remote; +#endif } xnsched_t; union xnsched_policy_param; @@ -185,6 +188,8 @@ static inline int xnsched_resched_p(struct xnsched *sched) xnsched_t *current_sched = xnpod_current_sched();\ __setbits(current_sched->status, XNRESCHED); \ if (current_sched != (__sched__)){ \ + if (XENO_DEBUG(NUCLEUS)) \ + __sched__->debug_resched_from_remote = 1; \ xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched); \ }\ } while (0) diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c index 4cb707a..50b0f49 100644 --- a/ksrc/nucleus/pod.c +++ b/ksrc/nucleus/pod.c @@ -2177,6 +2177,10 @@ static inline int __xnpod_test_resched(struct xnsched *sched) xnarch_cpus_clear(sched->resched); } #endif + if (XENO_DEBUG(NUCLEUS) && sched->debug_resched_from_remote) { + sched->debug_resched_from_remote = 0; + resched = 1; + } clrbits(sched->status, XNRESCHED); return resched; } I am still uncertain. Will only work if all is done under nklock, otherwise two almost simultaneous xnsched_resched_p fro
Re: [Xenomai-core] Potential problem with rt_eepro100
Gilles Chanteperdrix wrote: Gilles Chanteperdrix wrote: Jan Kiszka wrote: Am 05.11.2010 00:25, Gilles Chanteperdrix wrote: Jan Kiszka wrote: Am 04.11.2010 23:08, Gilles Chanteperdrix wrote: Jan Kiszka wrote: rework. Safer for now is likely to revert 56ff4329ff, keeping nucleus debugging off. That is not enough. It is, I've reviewed the code today. The fallouts I am talking about are: 47dac49c71e89b684203e854d1b0172ecacbc555 Not related. 38f2ca83a8e63cc94eaa911ff1c0940c884b5078 An optimization. 5e7cfa5c25672e4478a721eadbd6f6c5b4f88a2f The fallout of that commit is fixed in my series. This commit was followed by several others to "fix the fix". You know how things are, someone proposes a fix, which fixes things for him, but it breaks in other people's configurations (one of the fallouts was a complete revamp of include/asm-arm/atomic.h for instance). I've pushed a series that reverts that commit, then fixes and cleans up on top of it. Just pushed if you want to take a look. We can find some alternative debugging mechanism independently (though I'm curious to see it - it still makes no sense to me). Since the fix is simply a modification to what we have currently, I would prefer if we did not remove it. In fact, I think it would be simpler if we started from what we currently have rather than reverting past patches. Look at the series, it goes step by step to an IMHO clean state. We can pull out the debugging check removal, though, if you prefer to work on top of the existing code. 
From my point of view, Anders is looking for something that works, so following the rule that the minimal set of changes minimizes the chances of introducing new bugs while cleaning, I would go for the minimal set of changes, such as: The tested one (on SMP, and UP with and without unlocked ctx switch): diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h index df56417..cf4 100644 --- a/include/nucleus/sched.h +++ b/include/nucleus/sched.h @@ -165,28 +165,27 @@ struct xnsched_class { #endif /* CONFIG_SMP */ /* Test all resched flags from the given scheduler mask. */ -static inline int xnsched_resched_p(struct xnsched *sched) +static inline int xnsched_remote_resched_p(struct xnsched *sched) { - return testbits(sched->status, XNRESCHED); + return !xnarch_cpus_empty(sched->resched); } -static inline int xnsched_self_resched_p(struct xnsched *sched) +static inline int xnsched_resched_p(struct xnsched *sched) { return testbits(sched->status, XNRESCHED); } /* Set self resched flag for the given scheduler. */ #define xnsched_set_self_resched(__sched__) do { \ - setbits((__sched__)->status, XNRESCHED); \ + __setbits((__sched__)->status, XNRESCHED);\ } while (0) /* Set specific resched flag into the local scheduler mask. */ #define xnsched_set_resched(__sched__) do {\ xnsched_t *current_sched = xnpod_current_sched();\ - setbits(current_sched->status, XNRESCHED);\ + __setbits(current_sched->status, XNRESCHED); \ if (current_sched != (__sched__)){ \ xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched); \ - setbits((__sched__)->status, XNRESCHED); \ }\ } while (0) diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c index 862838c..4cb707a 100644 --- a/ksrc/nucleus/pod.c +++ b/ksrc/nucleus/pod.c @@ -276,18 +276,16 @@ EXPORT_SYMBOL_GPL(xnpod_fatal_helper); void xnpod_schedule_handler(void) /* Called with hw interrupts off. 
*/ { - xnsched_t *sched; + xnsched_t *sched = xnpod_current_sched(); trace_mark(xn_nucleus, sched_remote, MARK_NOARGS); #if defined(CONFIG_SMP) && defined(CONFIG_XENO_OPT_PRIOCPL) - sched = xnpod_current_sched(); if (testbits(sched->status, XNRPICK)) { clrbits(sched->status, XNRPICK); xnshadow_rpi_check(); } -#else - (void)sched; #endif /* CONFIG_SMP && CONFIG_XENO_OPT_PRIOCPL */ + xnsched_set_self_resched(sched); xnpod_schedule(); } @@ -2174,7 +2172,7 @@ static inline int __xnpod_test_resched(struct xnsched *sched) int resched = testbits(sched->status, XNRESCHED); #ifdef CONFIG_SMP /* Send resched IPI to remote CPU(s). */ - if (unlikely(xnsched_resched_p(sched))) { + if (unlikely(xnsched_remote_resched_p(sched))) { xnarch_send_ipi(sched->resched); xnarch_cpus_clear(sched->resched); } diff --git a/ksrc/nucleus/timer.c b/ksrc/nucleus/timer.c index 1fe3331..a0ac627 100644 --- a/ksrc/nucleus/timer.c +++ b/ksrc/nucleus/timer.c @@ -97,7 +97,7 @@ void xntimer_next_local_shot(xnsched_t *sched) __clrbits(sched->status, XNHDEFER); timer = ap
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: > Am 05.11.2010 00:24, Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> Am 04.11.2010 23:06, Gilles Chanteperdrix wrote: Jan Kiszka wrote: >>> At first sight, here you are more breaking things than cleaning them. >> Still, it has the SMP record for my test program, still runs with ftrace >> on (after 2 hours, where it previously failed after maximum 23 minutes). > My version was indeed still buggy, I'm reworking it ATM. > >> If I get the gist of Jan's changes, they are (using the IPI to transfer >> one bit of information: your cpu needs to reschedule): >> >> xnsched_set_resched: >> - setbits((__sched__)->status, XNRESCHED); >> >> xnpod_schedule_handler: >> +xnsched_set_resched(sched); >> >> If you (we?) decide to keep the debug checks, under what circumstances >> would the current check trigger (in laymans language, that I'll be able >> to understand)? > That's actually what /me is wondering as well. I do not see yet how you > can reliably detect a missed reschedule reliably (that was the purpose > of the debug check) given the racy nature between signaling resched and > processing the resched hints. The purpose of the debugging change is to detect a change of the scheduler state which was not followed by setting the XNRESCHED bit. >>> But that is nucleus business, nothing skins can screw up (as long as >>> they do not misuse APIs). >> Yes, but it happens that we modify the nucleus from time to time. >> Getting it to work is relatively simple: we add a "scheduler change set remotely" bit to the sched structure which is NOT in the status bit, set this bit when changing a remote sched (under nklock). In the debug check code, if the scheduler state changed, and the XNRESCHED bit is not set, only consider this a but if this new bit is not set. All this is compiled out if the debug is not enabled. >>> I still see no benefit in this check. Where to you want to place the bit >>> set? 
Aren't that just the same locations where >>> xnsched_set_[self_]resched already is today? >> Well no, that would be another bit in the sched structure which would >> allow us to manipulate the status bits from the local cpu. That >> supplementary bit would only be changed from a distant CPU, and serve to >> detect the race which causes the false positive. The resched bits are >> set on the local cpu to get xnpod_schedule to trigger a rescheduling on >> the distance cpu. That bit would be set on the remote cpu's sched. Only >> when debugging is enabled. >> >>> But maybe you can provide some motivating bug scenarios, real ones of >>> the past or realistic ones of the future. >> Of course. The bug is anything which changes the scheduler state but >> does not set the XNRESCHED bit. This happened when we started the SMP >> port. New scheduling policies would be good candidates for a revival of >> this bug. >> > > You don't gain any worthwhile check if you cannot make the > instrumentation required for a stable detection simpler than the proper > problem solution itself. And this is what I'm still skeptical of. The solution is simple, but finding the problem without the instrumentation is way harder than with the instrumentation, so the instrumentation is worth something. Reproducing the false positive is surprisingly easy with a simple dual-cpu semaphore ping-pong test. 
So, here is the (tested) patch, using a ridiculous long variable name to illustrate what I was thinking about: diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h index cf4..454b8e8 100644 --- a/include/nucleus/sched.h +++ b/include/nucleus/sched.h @@ -108,6 +108,9 @@ typedef struct xnsched { struct xnthread *gktarget; #endif +#ifdef CONFIG_XENO_OPT_DEBUG_NUCLEUS + int debug_resched_from_remote; +#endif } xnsched_t; union xnsched_policy_param; @@ -185,6 +188,8 @@ static inline int xnsched_resched_p(struct xnsched *sched) xnsched_t *current_sched = xnpod_current_sched();\ __setbits(current_sched->status, XNRESCHED); \ if (current_sched != (__sched__)){ \ + if (XENO_DEBUG(NUCLEUS)) \ + __sched__->debug_resched_from_remote = 1; \ xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched); \ }\ } while (0) diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c index 4cb707a..50b0f49 100644 --- a/ksrc/nucleus/pod.c +++ b/ksrc/nucleus/pod.c @@ -2177,6 +2177,10 @@ static inline int __xnpod_test_resched(struct xnsched *sched) xnarch_cpus_clear(sched->resched); } #endif + if (XENO_DEBUG(NUCLEUS) && sched->debug_resched_from_remote) { + sched->debug_resched_from_remote = 0; + r
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: > Am 05.11.2010 00:46, Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> Am 05.11.2010 00:25, Gilles Chanteperdrix wrote: Jan Kiszka wrote: > Am 04.11.2010 23:08, Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> rework. Safer for now is likely to revert 56ff4329ff, keeping nucleus >>> debugging off. >> That is not enough. > It is, I've reviewed the code today. The fallouts I am talking about are: 47dac49c71e89b684203e854d1b0172ecacbc555 >>> Not related. >>> 38f2ca83a8e63cc94eaa911ff1c0940c884b5078 >>> An optimization. >>> 5e7cfa5c25672e4478a721eadbd6f6c5b4f88a2f >>> That fall out of that commit is fixed in my series. >>> >> This commit was followed by several others to "fix >> the fix". You know how things are, someone proposes a fix, which fixes >> things for him, but it breaks in the other people configurations (one of >> the fallouts was a complete revamp of include/asm-arm/atomic.h for >> instance). >> > I've pushed a series that reverts that commit, then fixes and cleans up > on top of it. Just pushed if you want to take a look. We can find some > alternative debugging mechanism independently (though I'm curious to see > it - it still makes no sense to me). Since the fix is simply a modification to what we have currently. I would prefer if we did not remove it. In fact, I think it would be simpler if we started from what we currently have than reverting past patches. >>> Look at the series, it goes step by step to an IMHO clean state. We can >>> pull out the debugging check removal, though, if you prefer to work on >>> top of the existing code. 
>> From my point of view, Anders looks for something that works, so >> following the rules that the minimal set of changes minimize the chances >> of introducing new bugs while cleaning, I would go for the minimal set >> of changes, such as: > > I don't mind where to start, you need to understand the code anyway to > asses any change, may it be a complete rewrite, revert and then rework, > or a combination like below. > >> diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h >> index df56417..8150510 100644 >> --- a/include/nucleus/sched.h >> +++ b/include/nucleus/sched.h >> @@ -165,28 +165,27 @@ struct xnsched_class { >> #endif /* CONFIG_SMP */ >> >> /* Test all resched flags from the given scheduler mask. */ >> -static inline int xnsched_resched_p(struct xnsched *sched) >> +static inline int xnsched_remote_resched_p(struct xnsched *sched) >> { >> -return testbits(sched->status, XNRESCHED); >> +return !xnarch_cpus_empty(current_sched->resched); >> } >> >> -static inline int xnsched_self_resched_p(struct xnsched *sched) >> +static inline int xnsched_resched_p(struct xnsched *sched) >> { >> return testbits(sched->status, XNRESCHED); >> } >> >> /* Set self resched flag for the given scheduler. */ >> #define xnsched_set_self_resched(__sched__) do {\ >> - setbits((__sched__)->status, XNRESCHED); \ >> + __setbits((__sched__)->status, XNRESCHED);\ >> } while (0) >> >> /* Set specific resched flag into the local scheduler mask. 
*/ >> #define xnsched_set_resched(__sched__) do { \ >>xnsched_t *current_sched = xnpod_current_sched(); \ >> - setbits(current_sched->status, XNRESCHED); >> \ >> + __setbits(current_sched->status, XNRESCHED); >> \ >>if (current_sched != (__sched__)) { \ >>xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched); >> \ >> - setbits((__sched__)->status, XNRESCHED); >> \ >>} \ >> } while (0) >> >> diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c >> index 862838c..4cb707a 100644 >> --- a/ksrc/nucleus/pod.c >> +++ b/ksrc/nucleus/pod.c >> @@ -276,18 +276,16 @@ EXPORT_SYMBOL_GPL(xnpod_fatal_helper); >> >> void xnpod_schedule_handler(void) /* Called with hw interrupts off. */ >> { >> -xnsched_t *sched; >> +xnsched_t *sched = xnpod_current_sched(); >> >> trace_mark(xn_nucleus, sched_remote, MARK_NOARGS); >> #if defined(CONFIG_SMP) && defined(CONFIG_XENO_OPT_PRIOCPL) >> -sched = xnpod_current_sched(); >> if (testbits(sched->status, XNRPICK)) { >> clrbits(sched->status, XNRPICK); >> xnshadow_rpi_check(); >> } >> -#else >> -(void)sched; >> #endif /* CONFIG_SMP && CONFIG_XENO_OPT_PRIOCPL */ >> +xnsched_set_self_resched(sched); >> xnpod_schedule(); >> } >> >> @@ -2174,7 +2172,7 @@ static inline int __xnpod_test_resched(struct xnsched >> *sched) >> int resched = testbits(sched->status, XNRESCHED); >> #ifdef CONFIG_SMP >> /* Send resched IPI to remote CPU(s). */ >> -if (u
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 05.11.2010 00:46, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Am 05.11.2010 00:25, Gilles Chanteperdrix wrote: >>> Jan Kiszka wrote: Am 04.11.2010 23:08, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> rework. Safer for now is likely to revert 56ff4329ff, keeping nucleus >> debugging off. > That is not enough. It is, I've reviewed the code today. >>> The fallouts I am talking about are: >>> 47dac49c71e89b684203e854d1b0172ecacbc555 >> >> Not related. >> >>> 38f2ca83a8e63cc94eaa911ff1c0940c884b5078 >> >> An optimization. >> >>> 5e7cfa5c25672e4478a721eadbd6f6c5b4f88a2f >> >> That fall out of that commit is fixed in my series. >> > This commit was followed by several others to "fix > the fix". You know how things are, someone proposes a fix, which fixes > things for him, but it breaks in the other people configurations (one of > the fallouts was a complete revamp of include/asm-arm/atomic.h for > instance). > I've pushed a series that reverts that commit, then fixes and cleans up on top of it. Just pushed if you want to take a look. We can find some alternative debugging mechanism independently (though I'm curious to see it - it still makes no sense to me). >>> Since the fix is simply a modification to what we have currently. I >>> would prefer if we did not remove it. In fact, I think it would be >>> simpler if we started from what we currently have than reverting past >>> patches. >> >> Look at the series, it goes step by step to an IMHO clean state. We can >> pull out the debugging check removal, though, if you prefer to work on >> top of the existing code. 
> > From my point of view, Anders looks for something that works, so > following the rule that the minimal set of changes minimizes the chances > of introducing new bugs while cleaning, I would go for the minimal set > of changes, such as: I don't mind where to start, you need to understand the code anyway to assess any change, be it a complete rewrite, revert and then rework, or a combination like below. > > diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h > index df56417..8150510 100644 > --- a/include/nucleus/sched.h > +++ b/include/nucleus/sched.h > @@ -165,28 +165,27 @@ struct xnsched_class { > #endif /* CONFIG_SMP */ > > /* Test all resched flags from the given scheduler mask. */ > -static inline int xnsched_resched_p(struct xnsched *sched) > +static inline int xnsched_remote_resched_p(struct xnsched *sched) > { > - return testbits(sched->status, XNRESCHED); > + return !xnarch_cpus_empty(current_sched->resched); > } > > -static inline int xnsched_self_resched_p(struct xnsched *sched) > +static inline int xnsched_resched_p(struct xnsched *sched) > { > return testbits(sched->status, XNRESCHED); > } > > /* Set self resched flag for the given scheduler. */ > #define xnsched_set_self_resched(__sched__) do { \ > - setbits((__sched__)->status, XNRESCHED); \ > + __setbits((__sched__)->status, XNRESCHED); \ > } while (0) > > /* Set specific resched flag into the local scheduler mask. 
*/ > #define xnsched_set_resched(__sched__) do { \ >xnsched_t *current_sched = xnpod_current_sched(); \ > - setbits(current_sched->status, XNRESCHED); \ > + __setbits(current_sched->status, XNRESCHED); > \ >if (current_sched != (__sched__)) { \ >xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched); > \ > - setbits((__sched__)->status, XNRESCHED); > \ >} \ > } while (0) > > diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c > index 862838c..4cb707a 100644 > --- a/ksrc/nucleus/pod.c > +++ b/ksrc/nucleus/pod.c > @@ -276,18 +276,16 @@ EXPORT_SYMBOL_GPL(xnpod_fatal_helper); > > void xnpod_schedule_handler(void) /* Called with hw interrupts off. */ > { > - xnsched_t *sched; > + xnsched_t *sched = xnpod_current_sched(); > > trace_mark(xn_nucleus, sched_remote, MARK_NOARGS); > #if defined(CONFIG_SMP) && defined(CONFIG_XENO_OPT_PRIOCPL) > - sched = xnpod_current_sched(); > if (testbits(sched->status, XNRPICK)) { > clrbits(sched->status, XNRPICK); > xnshadow_rpi_check(); > } > -#else > - (void)sched; > #endif /* CONFIG_SMP && CONFIG_XENO_OPT_PRIOCPL */ > + xnsched_set_self_resched(sched); > xnpod_schedule(); > } > > @@ -2174,7 +2172,7 @@ static inline int __xnpod_test_resched(struct xnsched > *sched) > int resched = testbits(sched->status, XNRESCHED); > #ifdef CONFIG_SMP > /* Send resched IPI to remote CPU(s). */ > - if (unlikely(xnsched_resched_p(sched))) { > + if (unlikely(xnsched_remote_resched_p(sched))) { > xnarch
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: > Am 05.11.2010 00:25, Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> Am 04.11.2010 23:08, Gilles Chanteperdrix wrote: Jan Kiszka wrote: > rework. Safer for now is likely to revert 56ff4329ff, keeping nucleus > debugging off. That is not enough. >>> It is, I've reviewed the code today. >> The fallouts I am talking about are: >> 47dac49c71e89b684203e854d1b0172ecacbc555 > > Not related. > >> 38f2ca83a8e63cc94eaa911ff1c0940c884b5078 > > An optimization. > >> 5e7cfa5c25672e4478a721eadbd6f6c5b4f88a2f > > That fall out of that commit is fixed in my series. > This commit was followed by several others to "fix the fix". You know how things are, someone proposes a fix, which fixes things for him, but it breaks in the other people configurations (one of the fallouts was a complete revamp of include/asm-arm/atomic.h for instance). >>> I've pushed a series that reverts that commit, then fixes and cleans up >>> on top of it. Just pushed if you want to take a look. We can find some >>> alternative debugging mechanism independently (though I'm curious to see >>> it - it still makes no sense to me). >> Since the fix is simply a modification to what we have currently. I >> would prefer if we did not remove it. In fact, I think it would be >> simpler if we started from what we currently have than reverting past >> patches. > > Look at the series, it goes step by step to an IMHO clean state. We can > pull out the debugging check removal, though, if you prefer to work on > top of the existing code. 
>From my point of view, Anders looks for something that works, so following the rules that the minimal set of changes minimize the chances of introducing new bugs while cleaning, I would go for the minimal set of changes, such as: diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h index df56417..8150510 100644 --- a/include/nucleus/sched.h +++ b/include/nucleus/sched.h @@ -165,28 +165,27 @@ struct xnsched_class { #endif /* CONFIG_SMP */ /* Test all resched flags from the given scheduler mask. */ -static inline int xnsched_resched_p(struct xnsched *sched) +static inline int xnsched_remote_resched_p(struct xnsched *sched) { - return testbits(sched->status, XNRESCHED); + return !xnarch_cpus_empty(current_sched->resched); } -static inline int xnsched_self_resched_p(struct xnsched *sched) +static inline int xnsched_resched_p(struct xnsched *sched) { return testbits(sched->status, XNRESCHED); } /* Set self resched flag for the given scheduler. */ #define xnsched_set_self_resched(__sched__) do { \ - setbits((__sched__)->status, XNRESCHED); \ + __setbits((__sched__)->status, XNRESCHED); \ } while (0) /* Set specific resched flag into the local scheduler mask. */ #define xnsched_set_resched(__sched__) do {\ xnsched_t *current_sched = xnpod_current_sched();\ - setbits(current_sched->status, XNRESCHED); \ + __setbits(current_sched->status, XNRESCHED); \ if (current_sched != (__sched__)){ \ xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched); \ - setbits((__sched__)->status, XNRESCHED); \ }\ } while (0) diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c index 862838c..4cb707a 100644 --- a/ksrc/nucleus/pod.c +++ b/ksrc/nucleus/pod.c @@ -276,18 +276,16 @@ EXPORT_SYMBOL_GPL(xnpod_fatal_helper); void xnpod_schedule_handler(void) /* Called with hw interrupts off. 
*/ { - xnsched_t *sched; + xnsched_t *sched = xnpod_current_sched(); trace_mark(xn_nucleus, sched_remote, MARK_NOARGS); #if defined(CONFIG_SMP) && defined(CONFIG_XENO_OPT_PRIOCPL) - sched = xnpod_current_sched(); if (testbits(sched->status, XNRPICK)) { clrbits(sched->status, XNRPICK); xnshadow_rpi_check(); } -#else - (void)sched; #endif /* CONFIG_SMP && CONFIG_XENO_OPT_PRIOCPL */ + xnsched_set_self_resched(sched); xnpod_schedule(); } @@ -2174,7 +2172,7 @@ static inline int __xnpod_test_resched(struct xnsched *sched) int resched = testbits(sched->status, XNRESCHED); #ifdef CONFIG_SMP /* Send resched IPI to remote CPU(s). */ - if (unlikely(xnsched_resched_p(sched))) { + if (unlikely(xnsched_remote_resched_p(sched))) { xnarch_send_ipi(sched->resched); xnarch_cpus_clear(sched->resched); } diff --git a/ksrc/nucleus/timer.c b/ksrc/nucleus/timer.c index 1fe3331..a0ac627 100644 --- a/ksrc/nucleus/timer.c +++ b/ksrc/nucleus/timer.c @@ -97,7 +97,7 @@ void xntimer_next_local_shot(xnsched_t *sched) __clrbits(sched->status, XNHDEFER); timer = aplink2timer(h); if (
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 05.11.2010 00:24, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Am 04.11.2010 23:06, Gilles Chanteperdrix wrote: >>> Jan Kiszka wrote: >> At first sight, here you are more breaking things than cleaning them. > Still, it has the SMP record for my test program, still runs with ftrace > on (after 2 hours, where it previously failed after maximum 23 minutes). My version was indeed still buggy, I'm reworking it ATM. > If I get the gist of Jan's changes, they are (using the IPI to transfer > one bit of information: your cpu needs to reschedule): > > xnsched_set_resched: > - setbits((__sched__)->status, XNRESCHED); > > xnpod_schedule_handler: > + xnsched_set_resched(sched); > > If you (we?) decide to keep the debug checks, under what circumstances > would the current check trigger (in laymans language, that I'll be able > to understand)? That's actually what /me is wondering as well. I do not see yet how you can reliably detect a missed reschedule reliably (that was the purpose of the debug check) given the racy nature between signaling resched and processing the resched hints. >>> The purpose of the debugging change is to detect a change of the >>> scheduler state which was not followed by setting the XNRESCHED bit. >> >> But that is nucleus business, nothing skins can screw up (as long as >> they do not misuse APIs). > > Yes, but it happens that we modify the nucleus from time to time. > >> >>> Getting it to work is relatively simple: we add a "scheduler change set >>> remotely" bit to the sched structure which is NOT in the status bit, set >>> this bit when changing a remote sched (under nklock). In the debug check >>> code, if the scheduler state changed, and the XNRESCHED bit is not set, >>> only consider this a but if this new bit is not set. All this is >>> compiled out if the debug is not enabled. >> >> I still see no benefit in this check. Where to you want to place the bit >> set? 
Aren't those just the same locations where >> xnsched_set_[self_]resched already is today? > > Well no, that would be another bit in the sched structure which would > allow us to manipulate the status bits from the local cpu. That > supplementary bit would only be changed from a distant CPU, and serve to > detect the race which causes the false positive. The resched bits are > set on the local cpu to get xnpod_schedule to trigger a rescheduling on > the distant cpu. That bit would be set on the remote cpu's sched. Only > when debugging is enabled. > >> >> But maybe you can provide some motivating bug scenarios, real ones of >> the past or realistic ones of the future. > > Of course. The bug is anything which changes the scheduler state but > does not set the XNRESCHED bit. This happened when we started the SMP > port. New scheduling policies would be good candidates for a revival of > this bug. > You don't gain any worthwhile check if you cannot make the instrumentation required for a stable detection simpler than the proper problem solution itself. And this is what I'm still skeptical of. Jan
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 05.11.2010 00:25, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Am 04.11.2010 23:08, Gilles Chanteperdrix wrote: >>> Jan Kiszka wrote: rework. Safer for now is likely to revert 56ff4329ff, keeping nucleus debugging off. >>> That is not enough. >> >> It is, I've reviewed the code today. > > The fallouts I am talking about are: > 47dac49c71e89b684203e854d1b0172ecacbc555 Not related. > 38f2ca83a8e63cc94eaa911ff1c0940c884b5078 An optimization. > 5e7cfa5c25672e4478a721eadbd6f6c5b4f88a2f The fallout of that commit is fixed in my series. > >> >>> This commit was followed by several others to "fix >>> the fix". You know how things are, someone proposes a fix, which fixes >>> things for him, but it breaks in other people's configurations (one of >>> the fallouts was a complete revamp of include/asm-arm/atomic.h for >>> instance). >>> >> >> I've pushed a series that reverts that commit, then fixes and cleans up >> on top of it. Just pushed if you want to take a look. We can find some >> alternative debugging mechanism independently (though I'm curious to see >> it - it still makes no sense to me). > > Since the fix is simply a modification to what we have currently, I > would prefer if we did not remove it. In fact, I think it would be > simpler if we started from what we currently have than reverting past > patches. Look at the series, it goes step by step to an IMHO clean state. We can pull out the debugging check removal, though, if you prefer to work on top of the existing code. Jan
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: > Am 04.11.2010 23:08, Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> rework. Safer for now is likely to revert 56ff4329ff, keeping nucleus >>> debugging off. >> That is not enough. > > It is, I've reviewed the code today. The fallouts I am talking about are: 47dac49c71e89b684203e854d1b0172ecacbc555 38f2ca83a8e63cc94eaa911ff1c0940c884b5078 5e7cfa5c25672e4478a721eadbd6f6c5b4f88a2f > >> This commit was followed by several others to "fix >> the fix". You know how things are, someone proposes a fix, which fixes >> things for him, but it breaks in other people's configurations (one of >> the fallouts was a complete revamp of include/asm-arm/atomic.h for >> instance). >> > > I've pushed a series that reverts that commit, then fixes and cleans up > on top of it. Just pushed if you want to take a look. We can find some > alternative debugging mechanism independently (though I'm curious to see > it - it still makes no sense to me). Since the fix is simply a modification to what we have currently, I would prefer if we did not remove it. In fact, I think it would be simpler if we started from what we currently have than reverting past patches. -- Gilles.
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: > Am 04.11.2010 23:06, Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: > At first sight, here you are more breaking things than cleaning them. Still, it has the SMP record for my test program, still runs with ftrace on (after 2 hours, where it previously failed after maximum 23 minutes). >>> My version was indeed still buggy, I'm reworking it ATM. >>> If I get the gist of Jan's changes, they are (using the IPI to transfer one bit of information: your cpu needs to reschedule): xnsched_set_resched: - setbits((__sched__)->status, XNRESCHED); xnpod_schedule_handler: + xnsched_set_resched(sched); If you (we?) decide to keep the debug checks, under what circumstances would the current check trigger (in laymans language, that I'll be able to understand)? >>> That's actually what /me is wondering as well. I do not see yet how you >>> can reliably detect a missed reschedule reliably (that was the purpose >>> of the debug check) given the racy nature between signaling resched and >>> processing the resched hints. >> The purpose of the debugging change is to detect a change of the >> scheduler state which was not followed by setting the XNRESCHED bit. > > But that is nucleus business, nothing skins can screw up (as long as > they do not misuse APIs). Yes, but it happens that we modify the nucleus from time to time. > >> Getting it to work is relatively simple: we add a "scheduler change set >> remotely" bit to the sched structure which is NOT in the status bit, set >> this bit when changing a remote sched (under nklock). In the debug check >> code, if the scheduler state changed, and the XNRESCHED bit is not set, >> only consider this a but if this new bit is not set. All this is >> compiled out if the debug is not enabled. > > I still see no benefit in this check. Where to you want to place the bit > set? Aren't that just the same locations where > xnsched_set_[self_]resched already is today? 
Well no, that would be another bit in the sched structure which would allow us to manipulate the status bits from the local cpu. That supplementary bit would only be changed from a distant CPU, and serve to detect the race which causes the false positive. The resched bits are set on the local cpu to get xnpod_schedule to trigger a rescheduling on the distant cpu. That bit would be set on the remote cpu's sched. Only when debugging is enabled. > > But maybe you can provide some motivating bug scenarios, real ones of > the past or realistic ones of the future. Of course. The bug is anything which changes the scheduler state but does not set the XNRESCHED bit. This happened when we started the SMP port. New scheduling policies would be good candidates for a revival of this bug. -- Gilles.
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 04.11.2010 23:06, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: At first sight, here you are more breaking things than cleaning them. >>> Still, it has the SMP record for my test program, still runs with ftrace >>> on (after 2 hours, where it previously failed after maximum 23 minutes). >> >> My version was indeed still buggy, I'm reworking it ATM. >> >>> If I get the gist of Jan's changes, they are (using the IPI to transfer >>> one bit of information: your cpu needs to reschedule): >>> >>> xnsched_set_resched: >>> - setbits((__sched__)->status, XNRESCHED); >>> >>> xnpod_schedule_handler: >>> + xnsched_set_resched(sched); >>> >>> If you (we?) decide to keep the debug checks, under what circumstances >>> would the current check trigger (in layman's language, that I'll be able >>> to understand)? >> >> That's actually what /me is wondering as well. I do not see yet how you >> can reliably detect a missed reschedule (that was the purpose >> of the debug check) given the racy nature between signaling resched and >> processing the resched hints. > > The purpose of the debugging change is to detect a change of the > scheduler state which was not followed by setting the XNRESCHED bit. But that is nucleus business, nothing skins can screw up (as long as they do not misuse APIs). > Getting it to work is relatively simple: we add a "scheduler change set > remotely" bit to the sched structure which is NOT in the status bit, set > this bit when changing a remote sched (under nklock). In the debug check > code, if the scheduler state changed, and the XNRESCHED bit is not set, > only consider this a bug if this new bit is not set. All this is > compiled out if the debug is not enabled. I still see no benefit in this check. Where do you want to place the bit set? Aren't those just the same locations where xnsched_set_[self_]resched already is today? But maybe you can provide some motivating bug scenarios, real ones of the past or realistic ones of the future. 
Jan
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 04.11.2010 23:08, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> rework. Safer for now is likely to revert 56ff4329ff, keeping nucleus >> debugging off. > > That is not enough. It is, I've reviewed the code today. > This commit was followed by several others to "fix > the fix". You know how things are, someone proposes a fix, which fixes > things for him, but it breaks in other people's configurations (one of > the fallouts was a complete revamp of include/asm-arm/atomic.h for > instance). > I've pushed a series that reverts that commit, then fixes and cleans up on top of it. Just pushed if you want to take a look. We can find some alternative debugging mechanism independently (though I'm curious to see it - it still makes no sense to me). Jan
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: > rework. Safer for now is likely to revert 56ff4329ff, keeping nucleus > debugging off. That is not enough. This commit was followed by several others to "fix the fix". You know how things are, someone proposes a fix, which fixes things for him, but it breaks in other people's configurations (one of the fallouts was a complete revamp of include/asm-arm/atomic.h for instance). -- Gilles.
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: >>> At first sight, here you are more breaking things than cleaning them. >> Still, it has the SMP record for my test program, still runs with ftrace >> on (after 2 hours, where it previously failed after maximum 23 minutes). > > My version was indeed still buggy, I'm reworking it ATM. > >> If I get the gist of Jan's changes, they are (using the IPI to transfer >> one bit of information: your cpu needs to reschedule): >> >> xnsched_set_resched: >> - setbits((__sched__)->status, XNRESCHED); >> >> xnpod_schedule_handler: >> +xnsched_set_resched(sched); >> >> If you (we?) decide to keep the debug checks, under what circumstances >> would the current check trigger (in layman's language, that I'll be able >> to understand)? > > That's actually what /me is wondering as well. I do not see yet how you > can reliably detect a missed reschedule (that was the purpose > of the debug check) given the racy nature between signaling resched and > processing the resched hints. The purpose of the debugging change is to detect a change of the scheduler state which was not followed by setting the XNRESCHED bit. Getting it to work is relatively simple: we add a "scheduler change set remotely" bit to the sched structure which is NOT in the status bit, set this bit when changing a remote sched (under nklock). In the debug check code, if the scheduler state changed, and the XNRESCHED bit is not set, only consider this a bug if this new bit is not set. All this is compiled out if the debug is not enabled. I will try and work on this this weekend. -- Gilles.
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 04.11.2010 15:53, Anders Blomdell wrote: > Jan Kiszka wrote: >> Am 04.11.2010 14:18, Anders Blomdell wrote: >>> Gilles Chanteperdrix wrote: Jan Kiszka wrote: > Am 04.11.2010 10:26, Jan Kiszka wrote: >> Am 04.11.2010 10:16, Gilles Chanteperdrix wrote: >>> Jan Kiszka wrote: Take a step back and look at the root cause for this issue again. Unlocked if need-resched __xnpod_schedule is inherently racy and will always be (not only for the remote reschedule case BTW). >>> Ok, let us examine what may happen with this code if we only set the >>> XNRESCHED bit on the local cpu. First, other bits than XNRESCHED do not >>> matter, because they can not change under our feet. So, we have two >>> cases for this race: >>> 1- we see the XNRESCHED bit, but it has been cleared once nklock is >>> locked in __xnpod_schedule. >>> 2- we do not see the XNRESCHED bit, but it get set right after we test >>> it. >>> >>> 1 is not a problem. >> Yes, as long as we remove the debug check from the scheduler code (or >> fix it somehow). The scheduling code already catches this race. >> >>> 2 is not a problem, because anything which sets the XNRESCHED (it may >>> only be an interrupt in fact) bit will cause xnpod_schedule to be called >>> right after that. >>> >>> So no, no race here provided that we only set the XNRESCHED bit on the >>> local cpu. >>> >>> So we either have to accept this and remove the debugging check from the scheduler or push the check back to __xnpod_schedule where it once came from. When this it cleaned up, we can look into the remote resched protocol again. >>> The problem of the debug check is that it checks whether the scheduler >>> state is modified without the XNRESCHED bit being set. And this is the >>> problem, because yes, in that case, we have a race: the scheduler state >>> may be modified before the XNRESCHED bit is set by an IPI. 
>>> >>> If we want to fix the debug check, we have to have a special bit, on in >>> the sched->status flag, only for the purpose of debugging. Or remove the >>> debug check. >> Exactly my point. Is there any benefit in keeping the debug check? The >> code to make it work may end up as "complex" as the logic it verifies, >> at least that's my current feeling. >> > This would be the radical approach of removing the check (and cleaning > up some bits). If it's acceptable, I would split it up properly. This debug check saved our asses when debugging SMP issues, and I suspect it may help debugging skin issues. So, I think we should try and keep it. At first sight, here you are more breaking things than cleaning them. >>> Still, it has the SMP record for my test program, still runs with ftrace >>> on (after 2 hours, where it previously failed after maximum 23 minutes). >> >> My version was indeed still buggy, I'm reworking it ATM. > Any reason why the two changes below would fail (I need to get things > working real soon now). Maybe they don't fail but definitely cause reschedules where we do not need them. I stopped thinking about those details after starting the rework. Safer for now is likely to revert 56ff4329ff, keeping nucleus debugging off. >> >>> If I get the gist of Jan's changes, they are (using the IPI to transfer >>> one bit of information: your cpu needs to reschedule): >>> >>> xnsched_set_resched: >>> - setbits((__sched__)->status, XNRESCHED); >>> >>> xnpod_schedule_handler: >>> + xnsched_set_resched(sched); >>> >>> If you (we?) decide to keep the debug checks, under what circumstances >>> would the current check trigger (in laymans language, that I'll be able >>> to understand)? >> >> That's actually what /me is wondering as well. I do not see yet how you >> can reliably detect a missed reschedule reliably (that was the purpose >> of the debug check) given the racy nature between signaling resched and >> processing the resched hints. 
> The only thing I can think of are atomic set/clear on an independent > variable. And the set has to be confined to a few central locations - otherwise it will be simpler to check for the proper pairing of runqueue manipulation and xnsched_set_resched manually - if that was the fault pattern the test was supposed to catch (which would have had nucleus scope, no skins involved). Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
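In outline, the independent variable with atomic set/clear that Anders suggests could look like this. A hedged user-space sketch using C11 atomics; `resched_pending` and both helper names are invented for illustration, not Xenomai API.

```c
#include <stdatomic.h>
#include <assert.h>

/* The independent variable: never part of the status word, and only
 * ever touched with atomic operations. */
static atomic_int resched_pending;

/* The few central places that manipulate the runqueue set the flag... */
static void enqueue_thread_model(void)
{
	atomic_store(&resched_pending, 1);
}

/* ...and the scheduler consumes it with an atomic exchange, so a set
 * racing with the clear can never be lost: either the exchange sees it,
 * or the setter's subsequent xnpod_schedule call will. */
static int test_and_clear_resched(void)
{
	return atomic_exchange(&resched_pending, 0);
}
```

Keeping the flag out of the (partly non-atomically updated) status word is what makes the atomic set/clear cheap: no other bits share the cache line semantics.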
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: Am 04.11.2010 14:18, Anders Blomdell wrote: Gilles Chanteperdrix wrote: Jan Kiszka wrote: Am 04.11.2010 10:26, Jan Kiszka wrote: Am 04.11.2010 10:16, Gilles Chanteperdrix wrote: Jan Kiszka wrote: Take a step back and look at the root cause for this issue again. Unlocked if need-resched __xnpod_schedule is inherently racy and will always be (not only for the remote reschedule case BTW). Ok, let us examine what may happen with this code if we only set the XNRESCHED bit on the local cpu. First, other bits than XNRESCHED do not matter, because they can not change under our feet. So, we have two cases for this race: 1- we see the XNRESCHED bit, but it has been cleared once nklock is locked in __xnpod_schedule. 2- we do not see the XNRESCHED bit, but it get set right after we test it. 1 is not a problem. Yes, as long as we remove the debug check from the scheduler code (or fix it somehow). The scheduling code already catches this race. 2 is not a problem, because anything which sets the XNRESCHED (it may only be an interrupt in fact) bit will cause xnpod_schedule to be called right after that. So no, no race here provided that we only set the XNRESCHED bit on the local cpu. So we either have to accept this and remove the debugging check from the scheduler or push the check back to __xnpod_schedule where it once came from. When this it cleaned up, we can look into the remote resched protocol again. The problem of the debug check is that it checks whether the scheduler state is modified without the XNRESCHED bit being set. And this is the problem, because yes, in that case, we have a race: the scheduler state may be modified before the XNRESCHED bit is set by an IPI. If we want to fix the debug check, we have to have a special bit, on in the sched->status flag, only for the purpose of debugging. Or remove the debug check. Exactly my point. Is there any benefit in keeping the debug check? 
The code to make it work may end up as "complex" as the logic it verifies, at least that's my current feeling. This would be the radical approach of removing the check (and cleaning up some bits). If it's acceptable, I would split it up properly. This debug check saved our asses when debugging SMP issues, and I suspect it may help debugging skin issues. So, I think we should try and keep it. At first sight, here you are more breaking things than cleaning them. Still, it has the SMP record for my test program, still runs with ftrace on (after 2 hours, where it previously failed after maximum 23 minutes). My version was indeed still buggy, I'm reworking it ATM. Any reason why the two changes below would fail (I need to get things working real soon now). If I get the gist of Jan's changes, they are (using the IPI to transfer one bit of information: your cpu needs to reschedule): xnsched_set_resched: - setbits((__sched__)->status, XNRESCHED); xnpod_schedule_handler: + xnsched_set_resched(sched); If you (we?) decide to keep the debug checks, under what circumstances would the current check trigger (in laymans language, that I'll be able to understand)? That's actually what /me is wondering as well. I do not see yet how you can reliably detect a missed reschedule reliably (that was the purpose of the debug check) given the racy nature between signaling resched and processing the resched hints. The only thing I can think of are atomic set/clear on an independent variable. /Anders ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 04.11.2010 14:18, Anders Blomdell wrote: > Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> Am 04.11.2010 10:26, Jan Kiszka wrote: Am 04.11.2010 10:16, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Take a step back and look at the root cause for this issue again. >> Unlocked >> >> if need-resched >> __xnpod_schedule >> >> is inherently racy and will always be (not only for the remote >> reschedule case BTW). > Ok, let us examine what may happen with this code if we only set the > XNRESCHED bit on the local cpu. First, other bits than XNRESCHED do not > matter, because they can not change under our feet. So, we have two > cases for this race: > 1- we see the XNRESCHED bit, but it has been cleared once nklock is > locked in __xnpod_schedule. > 2- we do not see the XNRESCHED bit, but it get set right after we test it. > > 1 is not a problem. Yes, as long as we remove the debug check from the scheduler code (or fix it somehow). The scheduling code already catches this race. > 2 is not a problem, because anything which sets the XNRESCHED (it may > only be an interrupt in fact) bit will cause xnpod_schedule to be called > right after that. > > So no, no race here provided that we only set the XNRESCHED bit on the > local cpu. > > So we either have to accept this and remove the >> debugging check from the scheduler or push the check back to >> __xnpod_schedule where it once came from. When this it cleaned up, we >> can look into the remote resched protocol again. > The problem of the debug check is that it checks whether the scheduler > state is modified without the XNRESCHED bit being set. And this is the > problem, because yes, in that case, we have a race: the scheduler state > may be modified before the XNRESCHED bit is set by an IPI. > > If we want to fix the debug check, we have to have a special bit, on in > the sched->status flag, only for the purpose of debugging. Or remove the > debug check. Exactly my point. 
Is there any benefit in keeping the debug check? The code to make it work may end up as "complex" as the logic it verifies, at least that's my current feeling. >>> This would be the radical approach of removing the check (and cleaning >>> up some bits). If it's acceptable, I would split it up properly. >> >> This debug check saved our asses when debugging SMP issues, and I >> suspect it may help debugging skin issues. So, I think we should try and >> keep it. >> >> >> At first sight, here you are more breaking things than cleaning them. > Still, it has the SMP record for my test program, still runs with ftrace > on (after 2 hours, where it previously failed after maximum 23 minutes). My version was indeed still buggy, I'm reworking it ATM. > > If I get the gist of Jan's changes, they are (using the IPI to transfer > one bit of information: your cpu needs to reschedule): > > xnsched_set_resched: > - setbits((__sched__)->status, XNRESCHED); > > xnpod_schedule_handler: > + xnsched_set_resched(sched); > > If you (we?) decide to keep the debug checks, under what circumstances > would the current check trigger (in laymans language, that I'll be able > to understand)? That's actually what /me is wondering as well. I do not see yet how you can reliably detect a missed reschedule reliably (that was the purpose of the debug check) given the racy nature between signaling resched and processing the resched hints. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] Potential problem with rt_eepro100
Gilles Chanteperdrix wrote: Jan Kiszka wrote: Am 04.11.2010 10:26, Jan Kiszka wrote: Am 04.11.2010 10:16, Gilles Chanteperdrix wrote: Jan Kiszka wrote: Take a step back and look at the root cause for this issue again. Unlocked if need-resched __xnpod_schedule is inherently racy and will always be (not only for the remote reschedule case BTW). Ok, let us examine what may happen with this code if we only set the XNRESCHED bit on the local cpu. First, other bits than XNRESCHED do not matter, because they can not change under our feet. So, we have two cases for this race: 1- we see the XNRESCHED bit, but it has been cleared once nklock is locked in __xnpod_schedule. 2- we do not see the XNRESCHED bit, but it get set right after we test it. 1 is not a problem. Yes, as long as we remove the debug check from the scheduler code (or fix it somehow). The scheduling code already catches this race. 2 is not a problem, because anything which sets the XNRESCHED (it may only be an interrupt in fact) bit will cause xnpod_schedule to be called right after that. So no, no race here provided that we only set the XNRESCHED bit on the local cpu. So we either have to accept this and remove the debugging check from the scheduler or push the check back to __xnpod_schedule where it once came from. When this it cleaned up, we can look into the remote resched protocol again. The problem of the debug check is that it checks whether the scheduler state is modified without the XNRESCHED bit being set. And this is the problem, because yes, in that case, we have a race: the scheduler state may be modified before the XNRESCHED bit is set by an IPI. If we want to fix the debug check, we have to have a special bit, on in the sched->status flag, only for the purpose of debugging. Or remove the debug check. Exactly my point. Is there any benefit in keeping the debug check? The code to make it work may end up as "complex" as the logic it verifies, at least that's my current feeling. 
This would be the radical approach of removing the check (and cleaning up some bits). If it's acceptable, I would split it up properly. This debug check saved our asses when debugging SMP issues, and I suspect it may help debugging skin issues. So, I think we should try and keep it. At first sight, here you are more breaking things than cleaning them. Still, it has the SMP record for my test program, still runs with ftrace on (after 2 hours, where it previously failed after maximum 23 minutes). If I get the gist of Jan's changes, they are (using the IPI to transfer one bit of information: your cpu needs to reschedule): xnsched_set_resched: - setbits((__sched__)->status, XNRESCHED); xnpod_schedule_handler: + xnsched_set_resched(sched); If you (we?) decide to keep the debug checks, under what circumstances would the current check trigger (in laymans language, that I'll be able to understand)? /Anders ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: > Am 04.11.2010 10:26, Jan Kiszka wrote: >> Am 04.11.2010 10:16, Gilles Chanteperdrix wrote: >>> Jan Kiszka wrote: Take a step back and look at the root cause for this issue again. Unlocked if need-resched __xnpod_schedule is inherently racy and will always be (not only for the remote reschedule case BTW). >>> Ok, let us examine what may happen with this code if we only set the >>> XNRESCHED bit on the local cpu. First, other bits than XNRESCHED do not >>> matter, because they can not change under our feet. So, we have two >>> cases for this race: >>> 1- we see the XNRESCHED bit, but it has been cleared once nklock is >>> locked in __xnpod_schedule. >>> 2- we do not see the XNRESCHED bit, but it get set right after we test it. >>> >>> 1 is not a problem. >> Yes, as long as we remove the debug check from the scheduler code (or >> fix it somehow). The scheduling code already catches this race. >> >>> 2 is not a problem, because anything which sets the XNRESCHED (it may >>> only be an interrupt in fact) bit will cause xnpod_schedule to be called >>> right after that. >>> >>> So no, no race here provided that we only set the XNRESCHED bit on the >>> local cpu. >>> >>> So we either have to accept this and remove the debugging check from the scheduler or push the check back to __xnpod_schedule where it once came from. When this it cleaned up, we can look into the remote resched protocol again. >>> The problem of the debug check is that it checks whether the scheduler >>> state is modified without the XNRESCHED bit being set. And this is the >>> problem, because yes, in that case, we have a race: the scheduler state >>> may be modified before the XNRESCHED bit is set by an IPI. >>> >>> If we want to fix the debug check, we have to have a special bit, on in >>> the sched->status flag, only for the purpose of debugging. Or remove the >>> debug check. >> Exactly my point. Is there any benefit in keeping the debug check? 
The >> code to make it work may end up as "complex" as the logic it verifies, >> at least that's my current feeling. >> > > This would be the radical approach of removing the check (and cleaning > up some bits). If it's acceptable, I would split it up properly. This debug check saved our asses when debugging SMP issues, and I suspect it may help debugging skin issues. So, I think we should try and keep it.

> @@ -2162,21 +2163,21 @@ static inline void xnpod_switch_to(xnsched_t *sched,
>
>  static inline int __xnpod_test_resched(struct xnsched *sched)
>  {
> -	int resched = testbits(sched->status, XNRESCHED);
> +	int resched = xnsched_resched_p(sched);
>  #ifdef CONFIG_SMP
>  	/* Send resched IPI to remote CPU(s). */
> -	if (unlikely(xnsched_resched_p(sched))) {
> +	if (unlikely(resched)) {

At first sight, here you are more breaking things than cleaning them. -- Gilles.
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: Am 04.11.2010 10:26, Jan Kiszka wrote: Am 04.11.2010 10:16, Gilles Chanteperdrix wrote: Jan Kiszka wrote: Take a step back and look at the root cause for this issue again. Unlocked if need-resched __xnpod_schedule is inherently racy and will always be (not only for the remote reschedule case BTW). Ok, let us examine what may happen with this code if we only set the XNRESCHED bit on the local cpu. First, other bits than XNRESCHED do not matter, because they can not change under our feet. So, we have two cases for this race: 1- we see the XNRESCHED bit, but it has been cleared once nklock is locked in __xnpod_schedule. 2- we do not see the XNRESCHED bit, but it get set right after we test it. 1 is not a problem. Yes, as long as we remove the debug check from the scheduler code (or fix it somehow). The scheduling code already catches this race. 2 is not a problem, because anything which sets the XNRESCHED (it may only be an interrupt in fact) bit will cause xnpod_schedule to be called right after that. So no, no race here provided that we only set the XNRESCHED bit on the local cpu. So we either have to accept this and remove the debugging check from the scheduler or push the check back to __xnpod_schedule where it once came from. When this it cleaned up, we can look into the remote resched protocol again. The problem of the debug check is that it checks whether the scheduler state is modified without the XNRESCHED bit being set. And this is the problem, because yes, in that case, we have a race: the scheduler state may be modified before the XNRESCHED bit is set by an IPI. If we want to fix the debug check, we have to have a special bit, on in the sched->status flag, only for the purpose of debugging. Or remove the debug check. Exactly my point. Is there any benefit in keeping the debug check? The code to make it work may end up as "complex" as the logic it verifies, at least that's my current feeling. 
This would be the radical approach of removing the check (and cleaning up some bits). If it's acceptable, I would split it up properly. diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h index 01ff0a7..71f8311 100644 --- a/include/nucleus/pod.h +++ b/include/nucleus/pod.h @@ -277,14 +277,9 @@ static inline void xnpod_schedule(void) * context is active, or if we are caught in the middle of a * unlocked context switch. */ -#if XENO_DEBUG(NUCLEUS) - if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK)) - return; -#else /* !XENO_DEBUG(NUCLEUS) */ if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED) return; -#endif /* !XENO_DEBUG(NUCLEUS) */ __xnpod_schedule(sched); } diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h index df56417..c832b91 100644 --- a/include/nucleus/sched.h +++ b/include/nucleus/sched.h @@ -177,17 +177,16 @@ static inline int xnsched_self_resched_p(struct xnsched *sched) /* Set self resched flag for the given scheduler. */ #define xnsched_set_self_resched(__sched__) do { \ - setbits((__sched__)->status, XNRESCHED); \ + __setbits((__sched__)->status, XNRESCHED); \ } while (0) /* Set specific resched flag into the local scheduler mask. 
*/ #define xnsched_set_resched(__sched__) do {\ - xnsched_t *current_sched = xnpod_current_sched();\ - setbits(current_sched->status, XNRESCHED);\ - if (current_sched != (__sched__)){ \ - xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched); \ - setbits((__sched__)->status, XNRESCHED); \ - }\ + xnsched_t *current_sched = xnpod_current_sched(); \ + __setbits(current_sched->status, XNRESCHED); \ + if (current_sched != (__sched__)) \ + xnarch_cpu_set(xnsched_cpu(__sched__), \ + current_sched->resched); \ } while (0) void xnsched_zombie_hooks(struct xnthread *thread); diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c index 9e135f3..87dc136 100644 --- a/ksrc/nucleus/pod.c +++ b/ksrc/nucleus/pod.c @@ -284,10 +284,11 @@ void xnpod_schedule_handler(void) /* Called with hw interrupts off. */ trace_xn_nucleus_sched_remote(sched); #if defined(CONFIG_SMP) && defined(CONFIG_XENO_OPT_PRIOCPL) if (testbits(sched->status, XNRPICK)) { - clrbits(sched->status, XNRPICK); + __clrbits(sched->status, XNRPICK); xnshadow_rpi_check(); } #endif /* CONFIG_SMP && CONFIG_XENO_OPT_PRIOCPL */ + xnsched_set_resched(sched); xnpod_schedule(); } @@ -
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 04.11.2010 10:26, Jan Kiszka wrote: > Am 04.11.2010 10:16, Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> Take a step back and look at the root cause for this issue again. Unlocked >>> >>> if need-resched >>> __xnpod_schedule >>> >>> is inherently racy and will always be (not only for the remote >>> reschedule case BTW). >> >> Ok, let us examine what may happen with this code if we only set the >> XNRESCHED bit on the local cpu. First, other bits than XNRESCHED do not >> matter, because they can not change under our feet. So, we have two >> cases for this race: >> 1- we see the XNRESCHED bit, but it has been cleared once nklock is >> locked in __xnpod_schedule. >> 2- we do not see the XNRESCHED bit, but it get set right after we test it. >> >> 1 is not a problem. > > Yes, as long as we remove the debug check from the scheduler code (or > fix it somehow). The scheduling code already catches this race. > >> 2 is not a problem, because anything which sets the XNRESCHED (it may >> only be an interrupt in fact) bit will cause xnpod_schedule to be called >> right after that. >> >> So no, no race here provided that we only set the XNRESCHED bit on the >> local cpu. >> >> So we either have to accept this and remove the >>> debugging check from the scheduler or push the check back to >>> __xnpod_schedule where it once came from. When this it cleaned up, we >>> can look into the remote resched protocol again. >> >> The problem of the debug check is that it checks whether the scheduler >> state is modified without the XNRESCHED bit being set. And this is the >> problem, because yes, in that case, we have a race: the scheduler state >> may be modified before the XNRESCHED bit is set by an IPI. >> >> If we want to fix the debug check, we have to have a special bit, on in >> the sched->status flag, only for the purpose of debugging. Or remove the >> debug check. > > Exactly my point. Is there any benefit in keeping the debug check? 
The > code to make it work may end up as "complex" as the logic it verifies, > at least that's my current feeling. > This would be the radical approach of removing the check (and cleaning up some bits). If it's acceptable, I would split it up properly.

diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h
index 01ff0a7..71f8311 100644
--- a/include/nucleus/pod.h
+++ b/include/nucleus/pod.h
@@ -277,14 +277,9 @@ static inline void xnpod_schedule(void)
	 * context is active, or if we are caught in the middle of a
	 * unlocked context switch.
	 */
-#if XENO_DEBUG(NUCLEUS)
-	if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK))
-		return;
-#else /* !XENO_DEBUG(NUCLEUS) */
	if (testbits(sched->status,
		     XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED)
		return;
-#endif /* !XENO_DEBUG(NUCLEUS) */

	__xnpod_schedule(sched);
}

diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h
index df56417..c832b91 100644
--- a/include/nucleus/sched.h
+++ b/include/nucleus/sched.h
@@ -177,17 +177,16 @@ static inline int xnsched_self_resched_p(struct xnsched *sched)

 /* Set self resched flag for the given scheduler. */
 #define xnsched_set_self_resched(__sched__) do {		\
-	setbits((__sched__)->status, XNRESCHED);		\
+	__setbits((__sched__)->status, XNRESCHED);		\
 } while (0)

 /* Set specific resched flag into the local scheduler mask. */
 #define xnsched_set_resched(__sched__) do {			\
-	xnsched_t *current_sched = xnpod_current_sched();	\
-	setbits(current_sched->status, XNRESCHED);		\
-	if (current_sched != (__sched__)) {			\
-		xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched); \
-		setbits((__sched__)->status, XNRESCHED);	\
-	}							\
+	xnsched_t *current_sched = xnpod_current_sched();	\
+	__setbits(current_sched->status, XNRESCHED);		\
+	if (current_sched != (__sched__))			\
+		xnarch_cpu_set(xnsched_cpu(__sched__),		\
+			       current_sched->resched);		\
 } while (0)

 void xnsched_zombie_hooks(struct xnthread *thread);

diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c
index 9e135f3..87dc136 100644
--- a/ksrc/nucleus/pod.c
+++ b/ksrc/nucleus/pod.c
@@ -284,10 +284,11 @@ void xnpod_schedule_handler(void) /* Called with hw interrupts off. */
 	trace_xn_nucleus_sched_remote(sched);
 #if defined(CONFIG_SMP) && defined(CONFIG_XENO_OPT_PRIOCPL)
 	if (testbits(sched->status, XNRPICK)) {
-		clrbits(sched->status, XNRPICK);
+		__clrbits(sched->status, XNRPICK);
 		xnshadow_rpi_check();
 	}
 #endif /* CONFIG_SMP && CONFI
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 04.11.2010 10:16, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Take a step back and look at the root cause for this issue again. Unlocked >> >> if need-resched >> __xnpod_schedule >> >> is inherently racy and will always be (not only for the remote >> reschedule case BTW). > > Ok, let us examine what may happen with this code if we only set the > XNRESCHED bit on the local cpu. First, other bits than XNRESCHED do not > matter, because they can not change under our feet. So, we have two > cases for this race: > 1- we see the XNRESCHED bit, but it has been cleared once nklock is > locked in __xnpod_schedule. > 2- we do not see the XNRESCHED bit, but it get set right after we test it. > > 1 is not a problem. Yes, as long as we remove the debug check from the scheduler code (or fix it somehow). The scheduling code already catches this race. > 2 is not a problem, because anything which sets the XNRESCHED (it may > only be an interrupt in fact) bit will cause xnpod_schedule to be called > right after that. > > So no, no race here provided that we only set the XNRESCHED bit on the > local cpu. > > So we either have to accept this and remove the >> debugging check from the scheduler or push the check back to >> __xnpod_schedule where it once came from. When this it cleaned up, we >> can look into the remote resched protocol again. > > The problem of the debug check is that it checks whether the scheduler > state is modified without the XNRESCHED bit being set. And this is the > problem, because yes, in that case, we have a race: the scheduler state > may be modified before the XNRESCHED bit is set by an IPI. > > If we want to fix the debug check, we have to have a special bit, on in > the sched->status flag, only for the purpose of debugging. Or remove the > debug check. Exactly my point. Is there any benefit in keeping the debug check? The code to make it work may end up as "complex" as the logic it verifies, at least that's my current feeling. 
Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Xenomai-core] Potential problem with rt_eepro100
Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Take a step back and look at the root cause for this issue again. Unlocked >> >> if need-resched >> __xnpod_schedule >> >> is inherently racy and will always be (not only for the remote >> reschedule case BTW). > > Ok, let us examine what may happen with this code if we only set the > XNRESCHED bit on the local cpu. First, other bits than XNRESCHED do not > matter, because they can not change under our feet. So, we have two > cases for this race: > 1- we see the XNRESCHED bit, but it has been cleared once nklock is > locked in __xnpod_schedule. > 2- we do not see the XNRESCHED bit, but it gets set right after we test it. > > 1 is not a problem. > 2 is not a problem, because anything which sets the XNRESCHED bit (it may > only be an interrupt in fact) will cause xnpod_schedule to be called > right after that. > > So no, no race here provided that we only set the XNRESCHED bit on the > local cpu. > > So we either have to accept this and remove the >> debugging check from the scheduler or push the check back to >> __xnpod_schedule where it once came from. When this is cleaned up, we >> can look into the remote resched protocol again. > > The problem of the debug check is that it checks whether the scheduler > state is modified without the XNRESCHED bit being set. And this is the > problem, because yes, in that case, we have a race: the scheduler state > may be modified before the XNRESCHED bit is set by an IPI. > > If we want to fix the debug check, we have to have a special bit, on in NOT > the sched->status flag, only for the purpose of debugging. Or remove the > debug check. > -- Gilles.
Re: [Xenomai-core] Potential problem with rt_eepro100
Anders Blomdell wrote: > Probably being daft here; why not stop fiddling with remote CPU status > bits and always do a reschedule on IPI irqs? That is what we had been doing for a long time; we stopped between 2.5.4 and 2.5.5, and this is my point: maybe that was not such a good idea. -- Gilles.
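The pre-2.5.5 protocol Gilles alludes to, where the IPI itself carries the single bit of information "you need to reschedule", can be modelled like this. A sketch with made-up names, not the actual nucleus code.

```c
#include <assert.h>

#define XNRESCHED 0x1

struct cpu_model {
	unsigned long status;	/* only ever touched by its own CPU */
	int schedule_calls;
};

/* Local scheduling entry point: acts only on the local bit. */
static void schedule_model(struct cpu_model *cpu)
{
	if (cpu->status & XNRESCHED) {
		cpu->status &= ~XNRESCHED;
		cpu->schedule_calls++;
	}
}

/* IPI handler on the target CPU: no remote status word to inspect,
 * the IPI's arrival is the message, so just set the local bit and
 * run the scheduler. */
static void schedule_ipi_handler(struct cpu_model *cpu)
{
	cpu->status |= XNRESCHED;
	schedule_model(cpu);
}
```

With this shape, the sender never writes the target CPU's status word at all, which is exactly what removes the cross-CPU bit-fiddling Anders is asking about.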
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: > Take a step back and look at the root cause for this issue again. Unlocked > > if need-resched > __xnpod_schedule > > is inherently racy and will always be (not only for the remote > reschedule case BTW). Ok, let us examine what may happen with this code if we only set the XNRESCHED bit on the local cpu. First, other bits than XNRESCHED do not matter, because they can not change under our feet. So, we have two cases for this race: 1- we see the XNRESCHED bit, but it has been cleared once nklock is locked in __xnpod_schedule. 2- we do not see the XNRESCHED bit, but it gets set right after we test it. 1 is not a problem. 2 is not a problem, because anything which sets the XNRESCHED bit (it may only be an interrupt, in fact) will cause xnpod_schedule to be called right after that. So no, no race here provided that we only set the XNRESCHED bit on the local cpu. So we either have to accept this and remove the > debugging check from the scheduler or push the check back to > __xnpod_schedule where it once came from. When this is cleaned up, we > can look into the remote resched protocol again. The problem of the debug check is that it checks whether the scheduler state is modified without the XNRESCHED bit being set. And this is the problem, because yes, in that case, we have a race: the scheduler state may be modified before the XNRESCHED bit is set by an IPI. If we want to fix the debug check, we have to have a special bit, on in the sched->status flag, only for the purpose of debugging. Or remove the debug check. -- Gilles.
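Gilles' two-case argument can be made concrete with a minimal model of the unlocked fast-path test followed by the locked re-test. Names are illustrative; in the real code the re-test in __xnpod_schedule happens under nklock.

```c
#include <assert.h>

#define XNRESCHED 0x1

static unsigned long status;
static int switches;

/* __xnpod_schedule: re-tests the bit (under nklock in the real code).
 * Case 1 -- we saw a stale XNRESCHED that was cleared meanwhile -- only
 * costs a harmless early return here. */
static void __schedule_model(void)
{
	if (!(status & XNRESCHED))
		return;		/* case 1: already consumed, no harm */
	status &= ~XNRESCHED;
	switches++;
}

/* xnpod_schedule: unlocked test. Case 2 -- the bit gets set right after
 * this test reads 0 -- is also safe, because whatever sets the bit (an
 * interrupt) calls xnpod_schedule again afterwards. */
static void schedule_model(void)
{
	if (status & XNRESCHED)
		__schedule_model();
}

/* Any setter of XNRESCHED is followed by a schedule call, which is the
 * property case 2 relies on. */
static void interrupt_sets_resched(void)
{
	status |= XNRESCHED;
	schedule_model();
}
```

Both races thus degrade to either a spurious no-op call or a slightly delayed (but never lost) reschedule, matching the "no race, provided the bit is only set locally" conclusion above.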
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 04.11.2010 09:45, Anders Blomdell wrote: > Jan Kiszka wrote: >> Am 04.11.2010 01:13, Gilles Chanteperdrix wrote: >>> Jan Kiszka wrote: Am 04.11.2010 00:56, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Am 04.11.2010 00:44, Gilles Chanteperdrix wrote: >>> Jan Kiszka wrote: Am 04.11.2010 00:18, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Am 04.11.2010 00:11, Gilles Chanteperdrix wrote: >>> Jan Kiszka wrote: Am 03.11.2010 23:11, Jan Kiszka wrote: > Am 03.11.2010 23:03, Jan Kiszka wrote: >> But we not not always use atomic ops for manipulating >> status bits (but >> we do in other cases where this is no need - different >> story). This may >> fix the race: > Err, nonsense. As we manipulate xnsched::status also > outside of nklock > protection, we must _always_ use atomic ops. > > This screams for a cleanup: local-only bits like XNHTICK or > XNINIRQ > should be pushed in a separate status word that can then be > safely > modified non-atomically. Second try to fix and clean up the sched status bits. Anders, please test. Jan diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h index 01ff0a7..5987a1f 100644 --- a/include/nucleus/pod.h +++ b/include/nucleus/pod.h @@ -277,12 +277,10 @@ static inline void xnpod_schedule(void) * context is active, or if we are caught in the middle of a * unlocked context switch. */ -#if XENO_DEBUG(NUCLEUS) if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK)) return; -#else /* !XENO_DEBUG(NUCLEUS) */ -if (testbits(sched->status, - XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED) +#if !XENO_DEBUG(NUCLEUS) +if (!sched->resched) return; #endif /* !XENO_DEBUG(NUCLEUS) */ >>> Having only one test was really nice here, maybe we simply >>> read a >>> barrier before reading the status? >>> >> I agree - but the alternative is letting all modifications of >> xnsched::status use atomic bitops (that's required when >> folding all bits >> into a single word). And that should be much more costly, >> specifically >> on SMP. 
> What about issuing a barrier before testing the status? > The problem is not about reading but writing the status concurrently, thus it's not about the code you see above. >>> The bits are modified under nklock, which implies a barrier when >>> unlocked. Furthermore, an IPI is guaranteed to be received on the >>> remote >>> CPU after this barrier, so, a barrier should be enough to see the >>> modifications which have been made remotely. >> Check nucleus/intr.c for tons of unprotected status modifications. > Ok. Then maybe, we should reconsider the original decision to start > fiddling with the XNRESCHED bit remotely. ...which removed complexity and fixed a race? Let's better review the checks done in xnpod_schedule vs. its callers, I bet there is more to save (IOW: remove the need to test for sched->resched). >>> Not that much complexitiy... and the race was a false positive in debug >>> code, no big deal. At least it worked, and it has done so for a long >>> time. No atomic needed, no barrier, only one test in xnpod_schedule. And >>> a nice invariant: sched->status is always accessed on the local cpu. >>> What else? >> >> Take a step back and look at the root cause for this issue again. >> Unlocked >> >> if need-resched >> __xnpod_schedule >> >> is inherently racy and will always be (not only for the remote >> reschedule case BTW). So we either have to accept this and remove the >> debugging check from the scheduler or push the check back to >> __xnpod_schedule where it once came from. When this it cleaned up, we >> can look into the remote resched protocol again. > Probably being daft here; why not stop fiddling with remote CPU status > bits and always do a reschedule on IPI irq's? Yes that's what we did before. But the snippet above was and still is the issue to be resolved first. It motivated the change in the remote reschedule to avoid one of the possible races around this check. 
Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: Am 04.11.2010 01:13, Gilles Chanteperdrix wrote: Jan Kiszka wrote: Am 04.11.2010 00:56, Gilles Chanteperdrix wrote: Jan Kiszka wrote: Am 04.11.2010 00:44, Gilles Chanteperdrix wrote: Jan Kiszka wrote: Am 04.11.2010 00:18, Gilles Chanteperdrix wrote: Jan Kiszka wrote: Am 04.11.2010 00:11, Gilles Chanteperdrix wrote: Jan Kiszka wrote: Am 03.11.2010 23:11, Jan Kiszka wrote: Am 03.11.2010 23:03, Jan Kiszka wrote: But we not not always use atomic ops for manipulating status bits (but we do in other cases where this is no need - different story). This may fix the race: Err, nonsense. As we manipulate xnsched::status also outside of nklock protection, we must _always_ use atomic ops. This screams for a cleanup: local-only bits like XNHTICK or XNINIRQ should be pushed in a separate status word that can then be safely modified non-atomically. Second try to fix and clean up the sched status bits. Anders, please test. Jan diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h index 01ff0a7..5987a1f 100644 --- a/include/nucleus/pod.h +++ b/include/nucleus/pod.h @@ -277,12 +277,10 @@ static inline void xnpod_schedule(void) * context is active, or if we are caught in the middle of a * unlocked context switch. */ -#if XENO_DEBUG(NUCLEUS) if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK)) return; -#else /* !XENO_DEBUG(NUCLEUS) */ - if (testbits(sched->status, -XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED) +#if !XENO_DEBUG(NUCLEUS) + if (!sched->resched) return; #endif /* !XENO_DEBUG(NUCLEUS) */ Having only one test was really nice here, maybe we simply read a barrier before reading the status? I agree - but the alternative is letting all modifications of xnsched::status use atomic bitops (that's required when folding all bits into a single word). And that should be much more costly, specifically on SMP. What about issuing a barrier before testing the status? 
The problem is not about reading but writing the status concurrently, thus it's not about the code you see above. The bits are modified under nklock, which implies a barrier when unlocked. Furthermore, an IPI is guaranteed to be received on the remote CPU after this barrier, so, a barrier should be enough to see the modifications which have been made remotely. Check nucleus/intr.c for tons of unprotected status modifications. Ok. Then maybe, we should reconsider the original decision to start fiddling with the XNRESCHED bit remotely. ...which removed complexity and fixed a race? Let's better review the checks done in xnpod_schedule vs. its callers, I bet there is more to save (IOW: remove the need to test for sched->resched). Not that much complexitiy... and the race was a false positive in debug code, no big deal. At least it worked, and it has done so for a long time. No atomic needed, no barrier, only one test in xnpod_schedule. And a nice invariant: sched->status is always accessed on the local cpu. What else? Take a step back and look at the root cause for this issue again. Unlocked if need-resched __xnpod_schedule is inherently racy and will always be (not only for the remote reschedule case BTW). So we either have to accept this and remove the debugging check from the scheduler or push the check back to __xnpod_schedule where it once came from. When this it cleaned up, we can look into the remote resched protocol again. Probably being daft here; why not stop fiddling with remote CPU status bits and always do a reschedule on IPI irq's? /Anders ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 04.11.2010 01:13, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Am 04.11.2010 00:56, Gilles Chanteperdrix wrote: >>> Jan Kiszka wrote: Am 04.11.2010 00:44, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Am 04.11.2010 00:18, Gilles Chanteperdrix wrote: >>> Jan Kiszka wrote: Am 04.11.2010 00:11, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Am 03.11.2010 23:11, Jan Kiszka wrote: >>> Am 03.11.2010 23:03, Jan Kiszka wrote: But we not not always use atomic ops for manipulating status bits (but we do in other cases where this is no need - different story). This may fix the race: >>> Err, nonsense. As we manipulate xnsched::status also outside of >>> nklock >>> protection, we must _always_ use atomic ops. >>> >>> This screams for a cleanup: local-only bits like XNHTICK or XNINIRQ >>> should be pushed in a separate status word that can then be safely >>> modified non-atomically. >> Second try to fix and clean up the sched status bits. Anders, please >> test. >> >> Jan >> >> diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h >> index 01ff0a7..5987a1f 100644 >> --- a/include/nucleus/pod.h >> +++ b/include/nucleus/pod.h >> @@ -277,12 +277,10 @@ static inline void xnpod_schedule(void) >> * context is active, or if we are caught in the middle of a >> * unlocked context switch. >> */ >> -#if XENO_DEBUG(NUCLEUS) >> if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK)) >> return; >> -#else /* !XENO_DEBUG(NUCLEUS) */ >> -if (testbits(sched->status, >> - XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED) >> +#if !XENO_DEBUG(NUCLEUS) >> +if (!sched->resched) >> return; >> #endif /* !XENO_DEBUG(NUCLEUS) */ > Having only one test was really nice here, maybe we simply read a > barrier before reading the status? > I agree - but the alternative is letting all modifications of xnsched::status use atomic bitops (that's required when folding all bits into a single word). And that should be much more costly, specifically on SMP. 
>>> What about issuing a barrier before testing the status? >>> >> The problem is not about reading but writing the status concurrently, >> thus it's not about the code you see above. > The bits are modified under nklock, which implies a barrier when > unlocked. Furthermore, an IPI is guaranteed to be received on the remote > CPU after this barrier, so, a barrier should be enough to see the > modifications which have been made remotely. Check nucleus/intr.c for tons of unprotected status modifications. >>> Ok. Then maybe, we should reconsider the original decision to start >>> fiddling with the XNRESCHED bit remotely. >> >> ...which removed complexity and fixed a race? Let's better review the >> checks done in xnpod_schedule vs. its callers, I bet there is more to >> save (IOW: remove the need to test for sched->resched). > > Not that much complexitiy... and the race was a false positive in debug > code, no big deal. At least it worked, and it has done so for a long > time. No atomic needed, no barrier, only one test in xnpod_schedule. And > a nice invariant: sched->status is always accessed on the local cpu. > What else? Take a step back and look at the root cause for this issue again. Unlocked if need-resched __xnpod_schedule is inherently racy and will always be (not only for the remote reschedule case BTW). So we either have to accept this and remove the debugging check from the scheduler or push the check back to __xnpod_schedule where it once came from. When this it cleaned up, we can look into the remote resched protocol again. Jan signature.asc Description: OpenPGP digital signature ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: > Am 04.11.2010 00:56, Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> Am 04.11.2010 00:44, Gilles Chanteperdrix wrote: Jan Kiszka wrote: > Am 04.11.2010 00:18, Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> Am 04.11.2010 00:11, Gilles Chanteperdrix wrote: Jan Kiszka wrote: > Am 03.11.2010 23:11, Jan Kiszka wrote: >> Am 03.11.2010 23:03, Jan Kiszka wrote: >>> But we not not always use atomic ops for manipulating status bits >>> (but >>> we do in other cases where this is no need - different story). This >>> may >>> fix the race: >> Err, nonsense. As we manipulate xnsched::status also outside of >> nklock >> protection, we must _always_ use atomic ops. >> >> This screams for a cleanup: local-only bits like XNHTICK or XNINIRQ >> should be pushed in a separate status word that can then be safely >> modified non-atomically. > Second try to fix and clean up the sched status bits. Anders, please > test. > > Jan > > diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h > index 01ff0a7..5987a1f 100644 > --- a/include/nucleus/pod.h > +++ b/include/nucleus/pod.h > @@ -277,12 +277,10 @@ static inline void xnpod_schedule(void) >* context is active, or if we are caught in the middle of a >* unlocked context switch. >*/ > -#if XENO_DEBUG(NUCLEUS) > if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK)) > return; > -#else /* !XENO_DEBUG(NUCLEUS) */ > - if (testbits(sched->status, > - XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED) > +#if !XENO_DEBUG(NUCLEUS) > + if (!sched->resched) > return; > #endif /* !XENO_DEBUG(NUCLEUS) */ Having only one test was really nice here, maybe we simply read a barrier before reading the status? >>> I agree - but the alternative is letting all modifications of >>> xnsched::status use atomic bitops (that's required when folding all bits >>> into a single word). And that should be much more costly, specifically >>> on SMP. >> What about issuing a barrier before testing the status? 
>> > The problem is not about reading but writing the status concurrently, > thus it's not about the code you see above. The bits are modified under nklock, which implies a barrier when unlocked. Furthermore, an IPI is guaranteed to be received on the remote CPU after this barrier, so, a barrier should be enough to see the modifications which have been made remotely. >>> Check nucleus/intr.c for tons of unprotected status modifications. >> Ok. Then maybe, we should reconsider the original decision to start >> fiddling with the XNRESCHED bit remotely. > > ...which removed complexity and fixed a race? Let's better review the > checks done in xnpod_schedule vs. its callers, I bet there is more to > save (IOW: remove the need to test for sched->resched). Not that much complexitiy... and the race was a false positive in debug code, no big deal. At least it worked, and it has done so for a long time. No atomic needed, no barrier, only one test in xnpod_schedule. And a nice invariant: sched->status is always accessed on the local cpu. What else? -- Gilles. ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 04.11.2010 00:56, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Am 04.11.2010 00:44, Gilles Chanteperdrix wrote: >>> Jan Kiszka wrote: Am 04.11.2010 00:18, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Am 04.11.2010 00:11, Gilles Chanteperdrix wrote: >>> Jan Kiszka wrote: Am 03.11.2010 23:11, Jan Kiszka wrote: > Am 03.11.2010 23:03, Jan Kiszka wrote: >> But we not not always use atomic ops for manipulating status bits >> (but >> we do in other cases where this is no need - different story). This >> may >> fix the race: > Err, nonsense. As we manipulate xnsched::status also outside of nklock > protection, we must _always_ use atomic ops. > > This screams for a cleanup: local-only bits like XNHTICK or XNINIRQ > should be pushed in a separate status word that can then be safely > modified non-atomically. Second try to fix and clean up the sched status bits. Anders, please test. Jan diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h index 01ff0a7..5987a1f 100644 --- a/include/nucleus/pod.h +++ b/include/nucleus/pod.h @@ -277,12 +277,10 @@ static inline void xnpod_schedule(void) * context is active, or if we are caught in the middle of a * unlocked context switch. */ -#if XENO_DEBUG(NUCLEUS) if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK)) return; -#else /* !XENO_DEBUG(NUCLEUS) */ - if (testbits(sched->status, - XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED) +#if !XENO_DEBUG(NUCLEUS) + if (!sched->resched) return; #endif /* !XENO_DEBUG(NUCLEUS) */ >>> Having only one test was really nice here, maybe we simply read a >>> barrier before reading the status? >>> >> I agree - but the alternative is letting all modifications of >> xnsched::status use atomic bitops (that's required when folding all bits >> into a single word). And that should be much more costly, specifically >> on SMP. > What about issuing a barrier before testing the status? 
> The problem is not about reading but writing the status concurrently, thus it's not about the code you see above. >>> The bits are modified under nklock, which implies a barrier when >>> unlocked. Furthermore, an IPI is guaranteed to be received on the remote >>> CPU after this barrier, so, a barrier should be enough to see the >>> modifications which have been made remotely. >> >> Check nucleus/intr.c for tons of unprotected status modifications. > > Ok. Then maybe, we should reconsider the original decision to start > fiddling with the XNRESCHED bit remotely. ...which removed complexity and fixed a race? Let's better review the checks done in xnpod_schedule vs. its callers, I bet there is more to save (IOW: remove the need to test for sched->resched). Jan ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: > Am 04.11.2010 00:44, Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> Am 04.11.2010 00:18, Gilles Chanteperdrix wrote: Jan Kiszka wrote: > Am 04.11.2010 00:11, Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> Am 03.11.2010 23:11, Jan Kiszka wrote: Am 03.11.2010 23:03, Jan Kiszka wrote: > But we not not always use atomic ops for manipulating status bits (but > we do in other cases where this is no need - different story). This > may > fix the race: Err, nonsense. As we manipulate xnsched::status also outside of nklock protection, we must _always_ use atomic ops. This screams for a cleanup: local-only bits like XNHTICK or XNINIRQ should be pushed in a separate status word that can then be safely modified non-atomically. >>> Second try to fix and clean up the sched status bits. Anders, please >>> test. >>> >>> Jan >>> >>> diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h >>> index 01ff0a7..5987a1f 100644 >>> --- a/include/nucleus/pod.h >>> +++ b/include/nucleus/pod.h >>> @@ -277,12 +277,10 @@ static inline void xnpod_schedule(void) >>> * context is active, or if we are caught in the middle of a >>> * unlocked context switch. >>> */ >>> -#if XENO_DEBUG(NUCLEUS) >>> if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK)) >>> return; >>> -#else /* !XENO_DEBUG(NUCLEUS) */ >>> - if (testbits(sched->status, >>> -XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED) >>> +#if !XENO_DEBUG(NUCLEUS) >>> + if (!sched->resched) >>> return; >>> #endif /* !XENO_DEBUG(NUCLEUS) */ >> Having only one test was really nice here, maybe we simply read a >> barrier before reading the status? >> > I agree - but the alternative is letting all modifications of > xnsched::status use atomic bitops (that's required when folding all bits > into a single word). And that should be much more costly, specifically > on SMP. What about issuing a barrier before testing the status? 
>>> The problem is not about reading but writing the status concurrently, >>> thus it's not about the code you see above. >> The bits are modified under nklock, which implies a barrier when >> unlocked. Furthermore, an IPI is guaranteed to be received on the remote >> CPU after this barrier, so, a barrier should be enough to see the >> modifications which have been made remotely. > > Check nucleus/intr.c for tons of unprotected status modifications. Ok. Then maybe, we should reconsider the original decision to start fiddling with the XNRESCHED bit remotely. -- Gilles. ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 04.11.2010 00:44, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Am 04.11.2010 00:18, Gilles Chanteperdrix wrote: >>> Jan Kiszka wrote: Am 04.11.2010 00:11, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Am 03.11.2010 23:11, Jan Kiszka wrote: >>> Am 03.11.2010 23:03, Jan Kiszka wrote: But we not not always use atomic ops for manipulating status bits (but we do in other cases where this is no need - different story). This may fix the race: >>> Err, nonsense. As we manipulate xnsched::status also outside of nklock >>> protection, we must _always_ use atomic ops. >>> >>> This screams for a cleanup: local-only bits like XNHTICK or XNINIRQ >>> should be pushed in a separate status word that can then be safely >>> modified non-atomically. >> Second try to fix and clean up the sched status bits. Anders, please >> test. >> >> Jan >> >> diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h >> index 01ff0a7..5987a1f 100644 >> --- a/include/nucleus/pod.h >> +++ b/include/nucleus/pod.h >> @@ -277,12 +277,10 @@ static inline void xnpod_schedule(void) >> * context is active, or if we are caught in the middle of a >> * unlocked context switch. >> */ >> -#if XENO_DEBUG(NUCLEUS) >> if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK)) >> return; >> -#else /* !XENO_DEBUG(NUCLEUS) */ >> -if (testbits(sched->status, >> - XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED) >> +#if !XENO_DEBUG(NUCLEUS) >> +if (!sched->resched) >> return; >> #endif /* !XENO_DEBUG(NUCLEUS) */ > Having only one test was really nice here, maybe we simply read a > barrier before reading the status? > I agree - but the alternative is letting all modifications of xnsched::status use atomic bitops (that's required when folding all bits into a single word). And that should be much more costly, specifically on SMP. >>> What about issuing a barrier before testing the status? >>> >> >> The problem is not about reading but writing the status concurrently, >> thus it's not about the code you see above. 
> > The bits are modified under nklock, which implies a barrier when > unlocked. Furthermore, an IPI is guaranteed to be received on the remote > CPU after this barrier, so, a barrier should be enough to see the > modifications which have been made remotely. Check nucleus/intr.c for tons of unprotected status modifications. Jan ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: > Am 04.11.2010 00:18, Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> Am 04.11.2010 00:11, Gilles Chanteperdrix wrote: Jan Kiszka wrote: > Am 03.11.2010 23:11, Jan Kiszka wrote: >> Am 03.11.2010 23:03, Jan Kiszka wrote: >>> But we not not always use atomic ops for manipulating status bits (but >>> we do in other cases where this is no need - different story). This may >>> fix the race: >> Err, nonsense. As we manipulate xnsched::status also outside of nklock >> protection, we must _always_ use atomic ops. >> >> This screams for a cleanup: local-only bits like XNHTICK or XNINIRQ >> should be pushed in a separate status word that can then be safely >> modified non-atomically. > Second try to fix and clean up the sched status bits. Anders, please > test. > > Jan > > diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h > index 01ff0a7..5987a1f 100644 > --- a/include/nucleus/pod.h > +++ b/include/nucleus/pod.h > @@ -277,12 +277,10 @@ static inline void xnpod_schedule(void) >* context is active, or if we are caught in the middle of a >* unlocked context switch. >*/ > -#if XENO_DEBUG(NUCLEUS) > if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK)) > return; > -#else /* !XENO_DEBUG(NUCLEUS) */ > - if (testbits(sched->status, > - XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED) > +#if !XENO_DEBUG(NUCLEUS) > + if (!sched->resched) > return; > #endif /* !XENO_DEBUG(NUCLEUS) */ Having only one test was really nice here, maybe we simply read a barrier before reading the status? >>> I agree - but the alternative is letting all modifications of >>> xnsched::status use atomic bitops (that's required when folding all bits >>> into a single word). And that should be much more costly, specifically >>> on SMP. >> What about issuing a barrier before testing the status? >> > > The problem is not about reading but writing the status concurrently, > thus it's not about the code you see above. 
The bits are modified under nklock, which implies a barrier when unlocked. Furthermore, an IPI is guaranteed to be received on the remote CPU after this barrier, so, a barrier should be enough to see the modifications which have been made remotely. -- Gilles. ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 04.11.2010 00:18, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Am 04.11.2010 00:11, Gilles Chanteperdrix wrote: >>> Jan Kiszka wrote: Am 03.11.2010 23:11, Jan Kiszka wrote: > Am 03.11.2010 23:03, Jan Kiszka wrote: >> But we not not always use atomic ops for manipulating status bits (but >> we do in other cases where this is no need - different story). This may >> fix the race: > Err, nonsense. As we manipulate xnsched::status also outside of nklock > protection, we must _always_ use atomic ops. > > This screams for a cleanup: local-only bits like XNHTICK or XNINIRQ > should be pushed in a separate status word that can then be safely > modified non-atomically. Second try to fix and clean up the sched status bits. Anders, please test. Jan diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h index 01ff0a7..5987a1f 100644 --- a/include/nucleus/pod.h +++ b/include/nucleus/pod.h @@ -277,12 +277,10 @@ static inline void xnpod_schedule(void) * context is active, or if we are caught in the middle of a * unlocked context switch. */ -#if XENO_DEBUG(NUCLEUS) if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK)) return; -#else /* !XENO_DEBUG(NUCLEUS) */ - if (testbits(sched->status, - XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED) +#if !XENO_DEBUG(NUCLEUS) + if (!sched->resched) return; #endif /* !XENO_DEBUG(NUCLEUS) */ >>> Having only one test was really nice here, maybe we simply read a >>> barrier before reading the status? >>> >> >> I agree - but the alternative is letting all modifications of >> xnsched::status use atomic bitops (that's required when folding all bits >> into a single word). And that should be much more costly, specifically >> on SMP. > > What about issuing a barrier before testing the status? > The problem is not about reading but writing the status concurrently, thus it's not about the code you see above. 
Jan ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: > Am 04.11.2010 00:11, Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> Am 03.11.2010 23:11, Jan Kiszka wrote: Am 03.11.2010 23:03, Jan Kiszka wrote: > But we not not always use atomic ops for manipulating status bits (but > we do in other cases where this is no need - different story). This may > fix the race: Err, nonsense. As we manipulate xnsched::status also outside of nklock protection, we must _always_ use atomic ops. This screams for a cleanup: local-only bits like XNHTICK or XNINIRQ should be pushed in a separate status word that can then be safely modified non-atomically. >>> Second try to fix and clean up the sched status bits. Anders, please >>> test. >>> >>> Jan >>> >>> diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h >>> index 01ff0a7..5987a1f 100644 >>> --- a/include/nucleus/pod.h >>> +++ b/include/nucleus/pod.h >>> @@ -277,12 +277,10 @@ static inline void xnpod_schedule(void) >>> * context is active, or if we are caught in the middle of a >>> * unlocked context switch. >>> */ >>> -#if XENO_DEBUG(NUCLEUS) >>> if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK)) >>> return; >>> -#else /* !XENO_DEBUG(NUCLEUS) */ >>> - if (testbits(sched->status, >>> -XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED) >>> +#if !XENO_DEBUG(NUCLEUS) >>> + if (!sched->resched) >>> return; >>> #endif /* !XENO_DEBUG(NUCLEUS) */ >> Having only one test was really nice here, maybe we simply read a >> barrier before reading the status? >> > > I agree - but the alternative is letting all modifications of > xnsched::status use atomic bitops (that's required when folding all bits > into a single word). And that should be much more costly, specifically > on SMP. What about issuing a barrier before testing the status? -- Gilles. ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 04.11.2010 00:11, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Am 03.11.2010 23:11, Jan Kiszka wrote: >>> Am 03.11.2010 23:03, Jan Kiszka wrote: But we not not always use atomic ops for manipulating status bits (but we do in other cases where this is no need - different story). This may fix the race: >>> Err, nonsense. As we manipulate xnsched::status also outside of nklock >>> protection, we must _always_ use atomic ops. >>> >>> This screams for a cleanup: local-only bits like XNHTICK or XNINIRQ >>> should be pushed in a separate status word that can then be safely >>> modified non-atomically. >> >> Second try to fix and clean up the sched status bits. Anders, please >> test. >> >> Jan >> >> diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h >> index 01ff0a7..5987a1f 100644 >> --- a/include/nucleus/pod.h >> +++ b/include/nucleus/pod.h >> @@ -277,12 +277,10 @@ static inline void xnpod_schedule(void) >> * context is active, or if we are caught in the middle of a >> * unlocked context switch. >> */ >> -#if XENO_DEBUG(NUCLEUS) >> if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK)) >> return; >> -#else /* !XENO_DEBUG(NUCLEUS) */ >> -if (testbits(sched->status, >> - XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED) >> +#if !XENO_DEBUG(NUCLEUS) >> +if (!sched->resched) >> return; >> #endif /* !XENO_DEBUG(NUCLEUS) */ > > Having only one test was really nice here, maybe we simply read a > barrier before reading the status? > I agree - but the alternative is letting all modifications of xnsched::status use atomic bitops (that's required when folding all bits into a single word). And that should be much more costly, specifically on SMP. Jan signature.asc Description: OpenPGP digital signature ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: > Am 03.11.2010 23:11, Jan Kiszka wrote: >> Am 03.11.2010 23:03, Jan Kiszka wrote: >>> But we not not always use atomic ops for manipulating status bits (but >>> we do in other cases where this is no need - different story). This may >>> fix the race: >> Err, nonsense. As we manipulate xnsched::status also outside of nklock >> protection, we must _always_ use atomic ops. >> >> This screams for a cleanup: local-only bits like XNHTICK or XNINIRQ >> should be pushed in a separate status word that can then be safely >> modified non-atomically. > > Second try to fix and clean up the sched status bits. Anders, please > test. > > Jan > > diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h > index 01ff0a7..5987a1f 100644 > --- a/include/nucleus/pod.h > +++ b/include/nucleus/pod.h > @@ -277,12 +277,10 @@ static inline void xnpod_schedule(void) >* context is active, or if we are caught in the middle of a >* unlocked context switch. >*/ > -#if XENO_DEBUG(NUCLEUS) > if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK)) > return; > -#else /* !XENO_DEBUG(NUCLEUS) */ > - if (testbits(sched->status, > - XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED) > +#if !XENO_DEBUG(NUCLEUS) > + if (!sched->resched) > return; > #endif /* !XENO_DEBUG(NUCLEUS) */ Having only one test was really nice here, maybe we simply read a barrier before reading the status? -- Gilles. ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 03.11.2010 23:11, Jan Kiszka wrote: > Am 03.11.2010 23:03, Jan Kiszka wrote: >> But we not not always use atomic ops for manipulating status bits (but >> we do in other cases where this is no need - different story). This may >> fix the race: > > Err, nonsense. As we manipulate xnsched::status also outside of nklock > protection, we must _always_ use atomic ops. > > This screams for a cleanup: local-only bits like XNHTICK or XNINIRQ > should be pushed in a separate status word that can then be safely > modified non-atomically. Second try to fix and clean up the sched status bits. Anders, please test. Jan diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h index 01ff0a7..5987a1f 100644 --- a/include/nucleus/pod.h +++ b/include/nucleus/pod.h @@ -277,12 +277,10 @@ static inline void xnpod_schedule(void) * context is active, or if we are caught in the middle of a * unlocked context switch. */ -#if XENO_DEBUG(NUCLEUS) if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK)) return; -#else /* !XENO_DEBUG(NUCLEUS) */ - if (testbits(sched->status, -XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED) +#if !XENO_DEBUG(NUCLEUS) + if (!sched->resched) return; #endif /* !XENO_DEBUG(NUCLEUS) */ diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h index df56417..1850208 100644 --- a/include/nucleus/sched.h +++ b/include/nucleus/sched.h @@ -44,7 +44,6 @@ #define XNINTCK0x1000 /* In master tick handler context */ #define XNINIRQ0x0800 /* In IRQ handling context */ #define XNSWLOCK 0x0400 /* In context switch */ -#define XNRESCHED 0x0200 /* Needs rescheduling */ #define XNHDEFER 0x0100 /* Host tick deferred */ struct xnsched_rt { @@ -63,7 +62,8 @@ typedef struct xnsched { xnflags_t status; /*!< Scheduler specific status bitmask. */ int cpu; struct xnthread *curr; /*!< Current thread. */ - xnarch_cpumask_t resched; /*!< Mask of CPUs needing rescheduling. */ + xnarch_cpumask_t remote_resched; /*!< Mask of CPUs needing rescheduling. 
*/ + int resched;/*!< Rescheduling needed. */ struct xnsched_rt rt; /*!< Context of built-in real-time class. */ #ifdef CONFIG_XENO_OPT_SCHED_TP @@ -164,30 +164,21 @@ struct xnsched_class { #define xnsched_cpu(__sched__) ({ (void)__sched__; 0; }) #endif /* CONFIG_SMP */ -/* Test all resched flags from the given scheduler mask. */ -static inline int xnsched_resched_p(struct xnsched *sched) -{ - return testbits(sched->status, XNRESCHED); -} - -static inline int xnsched_self_resched_p(struct xnsched *sched) -{ - return testbits(sched->status, XNRESCHED); -} - /* Set self resched flag for the given scheduler. */ #define xnsched_set_self_resched(__sched__) do { \ - setbits((__sched__)->status, XNRESCHED); \ + (__sched__)->resched = 1; \ } while (0) /* Set specific resched flag into the local scheduler mask. */ #define xnsched_set_resched(__sched__) do {\ - xnsched_t *current_sched = xnpod_current_sched();\ - setbits(current_sched->status, XNRESCHED); \ - if (current_sched != (__sched__)){ \ - xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched); \ - setbits((__sched__)->status, XNRESCHED); \ - }\ + xnsched_t *current_sched = xnpod_current_sched(); \ + current_sched->resched = 1; \ + if (current_sched != (__sched__)) { \ + xnarch_cpu_set(xnsched_cpu(__sched__), \ + current_sched->remote_resched); \ + (__sched__)->resched = 1; \ + xnarch_memory_barrier();\ + } \ } while (0) void xnsched_zombie_hooks(struct xnthread *thread); @@ -209,7 +200,7 @@ struct xnsched *xnsched_finish_unlocked_switch(struct xnsched *sched); static inline int xnsched_maybe_resched_after_unlocked_switch(struct xnsched *sched) { - return testbits(sched->status, XNRESCHED); + return sched->resched; } #else /* !CONFIG_XENO_HW_UNLOCKED_SWITCH */ diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c index 9e135f3..f7f8b2c 100644 --- a/ksrc/nucleus/pod.c +++ b/ksrc/nucleus/pod.c @@ -284,7 +284,7 @@ void xnpod_schedule_handler(void) /* Called with hw interrupts off. 
*/ trace_xn_nucleus_sched_remote(sched); #if defined(CONFIG_SMP) && defined(CON
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 03.11.2010 23:03, Jan Kiszka wrote: > Am 03.11.2010 21:41, Philippe Gerum wrote: >> On Wed, 2010-11-03 at 20:38 +0100, Anders Blomdell wrote: >>> Jan Kiszka wrote: Am 03.11.2010 17:46, Anders Blomdell wrote: > Anders Blomdell wrote: >> Anders Blomdell wrote: >>> Jan Kiszka wrote: additional barrier. Can you check this? diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h index df56417..66b52ad 100644 --- a/include/nucleus/sched.h +++ b/include/nucleus/sched.h @@ -187,6 +187,7 @@ static inline int xnsched_self_resched_p(struct xnsched *sched) if (current_sched != (__sched__)){\ xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched);\ setbits((__sched__)->status, XNRESCHED);\ + xnarch_memory_barrier();\ }\ } while (0) >>> In progress, if nothing breaks before, I'll report status tomorrow >>> morning. >> It still breaks (in approximately the same way). I'm currently putting a >> barrier in the other macro doing a RESCHED, also adding some tracing to >> see if a read barrier is needed. > Nope, no luck there either. Will start interesting tracepoint > adding/conversion :-( Strange. But it was too easy anyway... > Any reason why xn_nucleus_sched_remote should ever report status = 0? Really don't know yet. You could trigger on this state and call ftrace_stop() then. Provided you had the functions tracer enabled, that should give a nice pictures of what happened before. >>> >>> Isn't there a race betweeen these two (still waiting for compilation to >>> be finished)? >> >> We always hold the nklock in both contexts. >> > > But we not not always use atomic ops for manipulating status bits (but > we do in other cases where this is no need - different story). This may > fix the race: Err, nonsense. As we manipulate xnsched::status also outside of nklock protection, we must _always_ use atomic ops. This screams for a cleanup: local-only bits like XNHTICK or XNINIRQ should be pushed in a separate status word that can then be safely modified non-atomically. 
Jan
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 03.11.2010 21:41, Philippe Gerum wrote:
> On Wed, 2010-11-03 at 20:38 +0100, Anders Blomdell wrote:
>> Jan Kiszka wrote:
>>> Am 03.11.2010 17:46, Anders Blomdell wrote:
>>>> Anders Blomdell wrote:
>>>>> Anders Blomdell wrote:
>>>>>> Jan Kiszka wrote:
>>>>>>> additional barrier. Can you check this?
>>>>>>>
>>>>>>> diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h
>>>>>>> index df56417..66b52ad 100644
>>>>>>> --- a/include/nucleus/sched.h
>>>>>>> +++ b/include/nucleus/sched.h
>>>>>>> @@ -187,6 +187,7 @@ static inline int xnsched_self_resched_p(struct xnsched *sched)
>>>>>>>  	if (current_sched != (__sched__)){	\
>>>>>>>  		xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched);	\
>>>>>>>  		setbits((__sched__)->status, XNRESCHED);	\
>>>>>>> +		xnarch_memory_barrier();	\
>>>>>>>  	}	\
>>>>>>>  } while (0)
>>>>>> In progress, if nothing breaks before, I'll report status tomorrow
>>>>>> morning.
>>>>> It still breaks (in approximately the same way). I'm currently putting a
>>>>> barrier in the other macro doing a RESCHED, also adding some tracing to
>>>>> see if a read barrier is needed.
>>>> Nope, no luck there either. Will start interesting tracepoint
>>>> adding/conversion :-(
>>>
>>> Strange. But it was too easy anyway...
>>>
>>>> Any reason why xn_nucleus_sched_remote should ever report status = 0?
>>>
>>> Really don't know yet. You could trigger on this state and call
>>> ftrace_stop() then. Provided you had the function tracer enabled, that
>>> should give a nice picture of what happened before.
>>
>> Isn't there a race between these two (still waiting for compilation to
>> be finished)?
>
> We always hold the nklock in both contexts.

But we do not always use atomic ops for manipulating status bits (though we do in other cases where there is no need - different story).
This may fix the race: diff --git a/ksrc/nucleus/intr.c b/ksrc/nucleus/intr.c index d7a772f..af8ebeb 100644 --- a/ksrc/nucleus/intr.c +++ b/ksrc/nucleus/intr.c @@ -85,7 +85,7 @@ static void xnintr_irq_handler(unsigned irq, void *cookie); void xnintr_host_tick(struct xnsched *sched) /* Interrupts off. */ { - __clrbits(sched->status, XNHTICK); + clrbits(sched->status, XNHTICK); xnarch_relay_tick(); } @@ -105,11 +105,13 @@ void xnintr_clock_handler(void) trace_mark(xn_nucleus, irq_enter, "irq %u", XNARCH_TIMER_IRQ); trace_mark(xn_nucleus, tbase_tick, "base %s", nktbase.name); + xnlock_get(&nklock); + ++sched->inesting; __setbits(sched->status, XNINIRQ); - xnlock_get(&nklock); xntimer_tick_aperiodic(); + xnlock_put(&nklock); xnstat_counter_inc(&nkclock.stat[xnsched_cpu(sched)].hits); @@ -117,7 +119,7 @@ void xnintr_clock_handler(void) &nkclock.stat[xnsched_cpu(sched)].account, start); if (--sched->inesting == 0) { - __clrbits(sched->status, XNINIRQ); + clrbits(sched->status, XNINIRQ); xnpod_schedule(); } /* @@ -178,7 +180,7 @@ static void xnintr_shirq_handler(unsigned irq, void *cookie) trace_mark(xn_nucleus, irq_enter, "irq %u", irq); ++sched->inesting; - __setbits(sched->status, XNINIRQ); + setbits(sched->status, XNINIRQ); xnlock_get(&shirq->lock); intr = shirq->handlers; @@ -220,7 +222,7 @@ static void xnintr_shirq_handler(unsigned irq, void *cookie) xnarch_end_irq(irq); if (--sched->inesting == 0) { - __clrbits(sched->status, XNINIRQ); + clrbits(sched->status, XNINIRQ); xnpod_schedule(); } @@ -247,7 +249,7 @@ static void xnintr_edge_shirq_handler(unsigned irq, void *cookie) trace_mark(xn_nucleus, irq_enter, "irq %u", irq); ++sched->inesting; - __setbits(sched->status, XNINIRQ); + setbits(sched->status, XNINIRQ); xnlock_get(&shirq->lock); intr = shirq->handlers; @@ -303,7 +305,7 @@ static void xnintr_edge_shirq_handler(unsigned irq, void *cookie) xnarch_end_irq(irq); if (--sched->inesting == 0) { - __clrbits(sched->status, XNINIRQ); + clrbits(sched->status, 
XNINIRQ); xnpod_schedule(); } trace_mark(xn_nucleus, irq_exit, "irq %u", irq); @@ -446,7 +448,7 @@ static void xnintr_irq_handler(unsigned irq, void *cookie) trace_mark(xn_nucleus, irq_enter, "irq %u", irq); ++sched->inesting; - __setbits(sched->status, XNINIRQ); + setbits(sched->status, XNINIRQ); xnlock_get(&xnirqs[irq].lock); @@ -493,7 +495,7 @@ static void xnintr_irq_handler(unsigned irq, void *cookie) xnarch_end_irq(irq); if (--sched->inesting == 0) { - __clrbits(sched->status, XNINIRQ); +
Re: [Xenomai-core] Potential problem with rt_eepro100
On Wed, 2010-11-03 at 20:38 +0100, Anders Blomdell wrote:
> Jan Kiszka wrote:
>> Am 03.11.2010 17:46, Anders Blomdell wrote:
>>> Anders Blomdell wrote:
>>>> Anders Blomdell wrote:
>>>>> Jan Kiszka wrote:
>>>>>> additional barrier. Can you check this?
>>>>>>
>>>>>> diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h
>>>>>> index df56417..66b52ad 100644
>>>>>> --- a/include/nucleus/sched.h
>>>>>> +++ b/include/nucleus/sched.h
>>>>>> @@ -187,6 +187,7 @@ static inline int xnsched_self_resched_p(struct xnsched *sched)
>>>>>>  	if (current_sched != (__sched__)){	\
>>>>>>  		xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched);	\
>>>>>>  		setbits((__sched__)->status, XNRESCHED);	\
>>>>>> +		xnarch_memory_barrier();	\
>>>>>>  	}	\
>>>>>>  } while (0)
>>>>> In progress, if nothing breaks before, I'll report status tomorrow
>>>>> morning.
>>>> It still breaks (in approximately the same way). I'm currently putting a
>>>> barrier in the other macro doing a RESCHED, also adding some tracing to
>>>> see if a read barrier is needed.
>>> Nope, no luck there either. Will start interesting tracepoint
>>> adding/conversion :-(
>>
>> Strange. But it was too easy anyway...
>>
>>> Any reason why xn_nucleus_sched_remote should ever report status = 0?
>>
>> Really don't know yet. You could trigger on this state and call
>> ftrace_stop() then. Provided you had the function tracer enabled, that
>> should give a nice picture of what happened before.
>
> Isn't there a race between these two (still waiting for compilation to
> be finished)?

We always hold the nklock in both contexts.

> static inline int __xnpod_test_resched(struct xnsched *sched)
> {
> 	int resched = testbits(sched->status, XNRESCHED);
> #ifdef CONFIG_SMP
> 	/* Send resched IPI to remote CPU(s). */
> 	if (unlikely(xnsched_resched_p(sched))) {
> 		xnarch_send_ipi(sched->resched);
> 		xnarch_cpus_clear(sched->resched);
> 	}
> #endif
> 	clrbits(sched->status, XNRESCHED);
> 	return resched;
> }
>
> #define xnsched_set_resched(__sched__) do {	\
> 	xnsched_t *current_sched = xnpod_current_sched();	\
> 	setbits(current_sched->status, XNRESCHED);	\
> 	if (current_sched != (__sched__)) {	\
> 		xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched);	\
> 		setbits((__sched__)->status, XNRESCHED);	\
> 		xnarch_memory_barrier();	\
> 	}	\
> } while (0)
>
> I would suggest (if I have got all the macros right):
>
> static inline int __xnpod_test_resched(struct xnsched *sched)
> {
> 	int resched = testbits(sched->status, XNRESCHED);
> 	if (unlikely(resched)) {
> #ifdef CONFIG_SMP
> 		/* Send resched IPI to remote CPU(s). */
> 		xnarch_send_ipi(sched->resched);
> 		xnarch_cpus_clear(sched->resched);
> #endif
> 		clrbits(sched->status, XNRESCHED);
> 	}
> 	return resched;
> }
>
> /Anders

-- 
Philippe.
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote:
> Am 03.11.2010 17:46, Anders Blomdell wrote:
>> Anders Blomdell wrote:
>>> Anders Blomdell wrote:
>>>> Jan Kiszka wrote:
>>>>> additional barrier. Can you check this?
>>>>>
>>>>> diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h
>>>>> index df56417..66b52ad 100644
>>>>> --- a/include/nucleus/sched.h
>>>>> +++ b/include/nucleus/sched.h
>>>>> @@ -187,6 +187,7 @@ static inline int xnsched_self_resched_p(struct xnsched *sched)
>>>>>  	if (current_sched != (__sched__)){	\
>>>>>  		xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched);	\
>>>>>  		setbits((__sched__)->status, XNRESCHED);	\
>>>>> +		xnarch_memory_barrier();	\
>>>>>  	}	\
>>>>>  } while (0)
>>>> In progress, if nothing breaks before, I'll report status tomorrow
>>>> morning.
>>> It still breaks (in approximately the same way). I'm currently putting a
>>> barrier in the other macro doing a RESCHED, also adding some tracing to
>>> see if a read barrier is needed.
>> Nope, no luck there either. Will start interesting tracepoint
>> adding/conversion :-(
>
> Strange. But it was too easy anyway...
>
>> Any reason why xn_nucleus_sched_remote should ever report status = 0?
>
> Really don't know yet. You could trigger on this state and call
> ftrace_stop() then. Provided you had the function tracer enabled, that
> should give a nice picture of what happened before.

Isn't there a race between these two (still waiting for compilation to
be finished)?

static inline int __xnpod_test_resched(struct xnsched *sched)
{
	int resched = testbits(sched->status, XNRESCHED);
#ifdef CONFIG_SMP
	/* Send resched IPI to remote CPU(s). */
	if (unlikely(xnsched_resched_p(sched))) {
		xnarch_send_ipi(sched->resched);
		xnarch_cpus_clear(sched->resched);
	}
#endif
	clrbits(sched->status, XNRESCHED);
	return resched;
}

#define xnsched_set_resched(__sched__) do {	\
	xnsched_t *current_sched = xnpod_current_sched();	\
	setbits(current_sched->status, XNRESCHED);	\
	if (current_sched != (__sched__)) {	\
		xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched);	\
		setbits((__sched__)->status, XNRESCHED);	\
		xnarch_memory_barrier();	\
	}	\
} while (0)

I would suggest (if I have got all the macros right):

static inline int __xnpod_test_resched(struct xnsched *sched)
{
	int resched = testbits(sched->status, XNRESCHED);
	if (unlikely(resched)) {
#ifdef CONFIG_SMP
		/* Send resched IPI to remote CPU(s). */
		xnarch_send_ipi(sched->resched);
		xnarch_cpus_clear(sched->resched);
#endif
		clrbits(sched->status, XNRESCHED);
	}
	return resched;
}

/Anders
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 03.11.2010 17:46, Anders Blomdell wrote:
> Anders Blomdell wrote:
>> Anders Blomdell wrote:
>>> Jan Kiszka wrote:
>>>> additional barrier. Can you check this?
>>>>
>>>> diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h
>>>> index df56417..66b52ad 100644
>>>> --- a/include/nucleus/sched.h
>>>> +++ b/include/nucleus/sched.h
>>>> @@ -187,6 +187,7 @@ static inline int xnsched_self_resched_p(struct xnsched *sched)
>>>>  	if (current_sched != (__sched__)){	\
>>>>  		xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched);	\
>>>>  		setbits((__sched__)->status, XNRESCHED);	\
>>>> +		xnarch_memory_barrier();	\
>>>>  	}	\
>>>>  } while (0)
>>>
>>> In progress, if nothing breaks before, I'll report status tomorrow
>>> morning.
>> It still breaks (in approximately the same way). I'm currently putting a
>> barrier in the other macro doing a RESCHED, also adding some tracing to
>> see if a read barrier is needed.
> Nope, no luck there either. Will start interesting tracepoint
> adding/conversion :-(

Strange. But it was too easy anyway...

> Any reason why xn_nucleus_sched_remote should ever report status = 0?

Really don't know yet. You could trigger on this state and call
ftrace_stop() then. Provided you had the function tracer enabled, that
should give a nice picture of what happened before.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
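The capture Jan describes can be set up through the tracefs control files. A configuration sketch, assuming debugfs is mounted at /sys/kernel/debug and the kernel was built with CONFIG_FUNCTION_TRACER; the trigger point (spotting the bad state) is whatever hook the debugger adds:

```shell
cd /sys/kernel/debug/tracing
echo function > current_tracer
echo 1 > tracing_on
# ... reproduce until xn_nucleus_sched_remote reports status=0 ...
echo 0 > tracing_on         # user-space equivalent of stopping the trace
cat trace > /tmp/trace.log  # per-CPU history for post-mortem analysis
```

From inside the kernel, calling tracing_off() at the detection site freezes the ring buffer at exactly the right moment, preserving the function-call history leading up to the failure on all CPUs.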
Re: [Xenomai-core] Potential problem with rt_eepro100
Anders Blomdell wrote:
> Anders Blomdell wrote:
>> Jan Kiszka wrote:
>>> additional barrier. Can you check this?
>>>
>>> diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h
>>> index df56417..66b52ad 100644
>>> --- a/include/nucleus/sched.h
>>> +++ b/include/nucleus/sched.h
>>> @@ -187,6 +187,7 @@ static inline int xnsched_self_resched_p(struct xnsched *sched)
>>>  	if (current_sched != (__sched__)){	\
>>>  		xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched);	\
>>>  		setbits((__sched__)->status, XNRESCHED);	\
>>> +		xnarch_memory_barrier();	\
>>>  	}	\
>>>  } while (0)
>> In progress, if nothing breaks before, I'll report status tomorrow
>> morning.
> It still breaks (in approximately the same way). I'm currently putting a
> barrier in the other macro doing a RESCHED, also adding some tracing to
> see if a read barrier is needed.

Nope, no luck there either. Will start interesting tracepoint
adding/conversion :-(

Any reason why xn_nucleus_sched_remote should ever report status = 0?

/Anders
Re: [Xenomai-core] Potential problem with rt_eepro100
Anders Blomdell wrote:
> Jan Kiszka wrote:
>> additional barrier. Can you check this?
>>
>> diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h
>> index df56417..66b52ad 100644
>> --- a/include/nucleus/sched.h
>> +++ b/include/nucleus/sched.h
>> @@ -187,6 +187,7 @@ static inline int xnsched_self_resched_p(struct xnsched *sched)
>>  	if (current_sched != (__sched__)){	\
>>  		xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched);	\
>>  		setbits((__sched__)->status, XNRESCHED);	\
>> +		xnarch_memory_barrier();	\
>>  	}	\
>>  } while (0)
> In progress, if nothing breaks before, I'll report status tomorrow
> morning.

It still breaks (in approximately the same way). I'm currently putting a
barrier in the other macro doing a RESCHED, also adding some tracing to
see if a read barrier is needed.

Interesting side note: hard disk accesses seem to get really slow after
the error has occurred (kernel installs progress at 2-3 modules per
second), while lots of idle time is reported on all CPUs. Weird...

/Anders
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote:
> additional barrier. Can you check this?
>
> diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h
> index df56417..66b52ad 100644
> --- a/include/nucleus/sched.h
> +++ b/include/nucleus/sched.h
> @@ -187,6 +187,7 @@ static inline int xnsched_self_resched_p(struct xnsched *sched)
>  	if (current_sched != (__sched__)){	\
>  		xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched);	\
>  		setbits((__sched__)->status, XNRESCHED);	\
> +		xnarch_memory_barrier();	\
>  	}	\
>  } while (0)

In progress, if nothing breaks before, I'll report status tomorrow
morning.

Mmmh -- not everything. The inlined XNRESCHED entry test in
xnpod_schedule runs outside nklock. But doesn't releasing nklock imply a
memory write barrier? Let me meditate...

Wouldn't we need a read barrier then (but maybe the irq-handling takes
care of that, not familiar with the code yet)?

A read barrier is not required here as we do not need to order load
operations w.r.t. each other in the reschedule IRQ handler.

Only if taking the interrupt is equivalent to:

	read interrupt status
	memory_read_barrier
	execute handler

Processor manuals should have the answer to this (or it might already be
in the code)...

You can always help: there is a lot of boring^Winteresting tracepoint
conversion waiting in Xenomai, see the few already converted nucleus
tracepoints.

As soon as I have my system running, I'll put some effort into this.

/Anders
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 03.11.2010 13:07, Anders Blomdell wrote: > On 2010-11-03 12.55, Jan Kiszka wrote: >> Am 03.11.2010 12:50, Jan Kiszka wrote: >>> Am 03.11.2010 12:44, Anders Blomdell wrote: Anders Blomdell wrote: > Jan Kiszka wrote: >> Am 01.11.2010 17:55, Anders Blomdell wrote: >>> Jan Kiszka wrote: Am 28.10.2010 11:34, Anders Blomdell wrote: > Jan Kiszka wrote: >> Am 28.10.2010 09:34, Anders Blomdell wrote: >>> Anders Blomdell wrote: Anders Blomdell wrote: > Hi, > > I'm trying to use rt_eepro100, for sending raw ethernet packets, > but I'm > experincing occasionally weird behaviour. > > Versions of things: > > linux-2.6.34.5 > xenomai-2.5.5.2 > rtnet-39f7fcf > > The testprogram runs on two computers with "Intel Corporation > 82557/8/9/0/1 Ethernet Pro 100 (rev 08)" controller, where one > computer > acts as a mirror sending back packets received from the ethernet > (only > those two computers on the network), and the other sends > packets and > measures roundtrip time. Most packets comes back in approximately > 100 > us, but occasionally the reception times out (once in about > 10 > packets or more), but the packets gets immediately received when > reception is retried, which might indicate a race between > rt_dev_recvmsg > and interrupt, but I might miss something obvious. Changing one of the ethernet cards to a "Intel Corporation 82541PI Gigabit Ethernet Controller (rev 05)", while keeping everything else constant, changes behavior somewhat; after receiving a few 10 packets, reception stops entirely (-EAGAIN is returned), while transmission proceeds as it should (and mirror returns packets). Any suggestions on what to try? >>> Since the problem disappears with 'maxcpus=1', I suspect I have >>> a SMP >>> issue (machine is a Core2 Quad), so I'll move to xenomai-core. 
>>> (original message can be found at >>> http://sourceforge.net/mailarchive/message.php?msg_name=4CC82C8D.3080808%40control.lth.se >>> >>> >>> ) >>> >>> Xenomai-core gurus: which is the corrrect way to debug SMP issues? >>> Can I run I-pipe-tracer and expect to be able save at least 150 >>> us of >>> traces for all cpus? Any hints/suggestions/insigths are welcome... >> The i-pipe tracer unfortunately only saves traces for a the CPU that >> triggered the freeze. To have a full pictures, you may want to >> try my >> ftrace port I posted recently for 2.6.35. > 2.6.35.7 ? > Exactly. >>> Finally managed to get the ftrace to work >>> (one possible bug: had to manually copy >>> include/xenomai/trace/xn_nucleus.h to >>> include/xenomai/trace/events/xn_nucleus.h), and it looks like it can be >>> very useful... >>> >>> But I don't think it will give much info at the moment, since no >>> xenomai/ipipe interrupt activity shows up, and adding that is far above >>> my league :-( >> >> You could use the function tracer, provided you are able to stop the >> trace quickly enough on error. >> >>> My current theory is that the problem occurs when something like this >>> takes place: >>> >>> CPU-iCPU-jCPU-kCPU-l >>> >>> rt_dev_sendmsg >>> xmit_irq >>> rt_dev_recvmsgrecv_irq >> >> Can't follow. When races here, and what will go wrong then? > Thats the good question. Find attached: > > 1. .config (so you can check for stupid mistakes) > 2. console log > 3. latest version of test program > 4. tail of ftrace dump > > These are the xenomai tasks running when the test program is active: > > CPU PIDCLASS PRI TIMEOUT TIMEBASE STAT NAME > 0 0 idle-1 - master R ROOT/0 > 1 0 idle-1 - master R ROOT/1 > 2 0 idle-1 - master R ROOT/2 > 3 0 idle-1 - master R ROOT/3 > 0 0 rt 98 - master W rtnet-stack > 0 0 rt 0 - master W rtnet-rtpc > 0 29901 rt 50 - masterraw_test > 0 29906 rt 0 - master X reporter > > > > The lines of interest from the trace are probably: > > [003] 2061.347855: xn_nucle
Re: [Xenomai-core] Potential problem with rt_eepro100
On 2010-11-03 12.55, Jan Kiszka wrote: > Am 03.11.2010 12:50, Jan Kiszka wrote: >> Am 03.11.2010 12:44, Anders Blomdell wrote: >>> Anders Blomdell wrote: Jan Kiszka wrote: > Am 01.11.2010 17:55, Anders Blomdell wrote: >> Jan Kiszka wrote: >>> Am 28.10.2010 11:34, Anders Blomdell wrote: Jan Kiszka wrote: > Am 28.10.2010 09:34, Anders Blomdell wrote: >> Anders Blomdell wrote: >>> Anders Blomdell wrote: Hi, I'm trying to use rt_eepro100, for sending raw ethernet packets, but I'm experincing occasionally weird behaviour. Versions of things: linux-2.6.34.5 xenomai-2.5.5.2 rtnet-39f7fcf The testprogram runs on two computers with "Intel Corporation 82557/8/9/0/1 Ethernet Pro 100 (rev 08)" controller, where one computer acts as a mirror sending back packets received from the ethernet (only those two computers on the network), and the other sends packets and measures roundtrip time. Most packets comes back in approximately 100 us, but occasionally the reception times out (once in about 10 packets or more), but the packets gets immediately received when reception is retried, which might indicate a race between rt_dev_recvmsg and interrupt, but I might miss something obvious. >>> Changing one of the ethernet cards to a "Intel Corporation 82541PI >>> Gigabit Ethernet Controller (rev 05)", while keeping everything >>> else >>> constant, changes behavior somewhat; after receiving a few 10 >>> packets, reception stops entirely (-EAGAIN is returned), while >>> transmission proceeds as it should (and mirror returns packets). >>> >>> Any suggestions on what to try? >> Since the problem disappears with 'maxcpus=1', I suspect I have >> a SMP >> issue (machine is a Core2 Quad), so I'll move to xenomai-core. >> (original message can be found at >> http://sourceforge.net/mailarchive/message.php?msg_name=4CC82C8D.3080808%40control.lth.se >> >> >> ) >> >> Xenomai-core gurus: which is the corrrect way to debug SMP issues? 
>> Can I run I-pipe-tracer and expect to be able save at least 150 >> us of >> traces for all cpus? Any hints/suggestions/insigths are welcome... > The i-pipe tracer unfortunately only saves traces for a the CPU that > triggered the freeze. To have a full pictures, you may want to > try my > ftrace port I posted recently for 2.6.35. 2.6.35.7 ? >>> Exactly. >> Finally managed to get the ftrace to work >> (one possible bug: had to manually copy >> include/xenomai/trace/xn_nucleus.h to >> include/xenomai/trace/events/xn_nucleus.h), and it looks like it can be >> very useful... >> >> But I don't think it will give much info at the moment, since no >> xenomai/ipipe interrupt activity shows up, and adding that is far above >> my league :-( > > You could use the function tracer, provided you are able to stop the > trace quickly enough on error. > >> My current theory is that the problem occurs when something like this >> takes place: >> >> CPU-iCPU-jCPU-kCPU-l >> >> rt_dev_sendmsg >> xmit_irq >> rt_dev_recvmsgrecv_irq > > Can't follow. When races here, and what will go wrong then? Thats the good question. Find attached: 1. .config (so you can check for stupid mistakes) 2. console log 3. latest version of test program 4. tail of ftrace dump These are the xenomai tasks running when the test program is active: CPU PIDCLASS PRI TIMEOUT TIMEBASE STAT NAME 0 0 idle-1 - master R ROOT/0 1 0 idle-1 - master R ROOT/1 2 0 idle-1 - master R ROOT/2 3 0 idle-1 - master R ROOT/3 0 0 rt 98 - master W rtnet-stack 0 0 rt 0 - master W rtnet-rtpc 0 29901 rt 50 - masterraw_test 0 29906 rt 0 - master X reporter The lines of interest from the trace are probably: [003] 2061.347855: xn_nucleus_thread_resume: thread=f9bf7b00 thread_name=rtnet-stack mask=2 [003] 2061.347862: xn_nucleus_sched: status=200 [000]
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 03.11.2010 12:50, Jan Kiszka wrote: > Am 03.11.2010 12:44, Anders Blomdell wrote: >> Anders Blomdell wrote: >>> Jan Kiszka wrote: Am 01.11.2010 17:55, Anders Blomdell wrote: > Jan Kiszka wrote: >> Am 28.10.2010 11:34, Anders Blomdell wrote: >>> Jan Kiszka wrote: Am 28.10.2010 09:34, Anders Blomdell wrote: > Anders Blomdell wrote: >> Anders Blomdell wrote: >>> Hi, >>> >>> I'm trying to use rt_eepro100, for sending raw ethernet packets, >>> but I'm >>> experincing occasionally weird behaviour. >>> >>> Versions of things: >>> >>> linux-2.6.34.5 >>> xenomai-2.5.5.2 >>> rtnet-39f7fcf >>> >>> The testprogram runs on two computers with "Intel Corporation >>> 82557/8/9/0/1 Ethernet Pro 100 (rev 08)" controller, where one >>> computer >>> acts as a mirror sending back packets received from the ethernet >>> (only >>> those two computers on the network), and the other sends >>> packets and >>> measures roundtrip time. Most packets comes back in approximately >>> 100 >>> us, but occasionally the reception times out (once in about >>> 10 >>> packets or more), but the packets gets immediately received when >>> reception is retried, which might indicate a race between >>> rt_dev_recvmsg >>> and interrupt, but I might miss something obvious. >> Changing one of the ethernet cards to a "Intel Corporation 82541PI >> Gigabit Ethernet Controller (rev 05)", while keeping everything >> else >> constant, changes behavior somewhat; after receiving a few 10 >> packets, reception stops entirely (-EAGAIN is returned), while >> transmission proceeds as it should (and mirror returns packets). >> >> Any suggestions on what to try? > Since the problem disappears with 'maxcpus=1', I suspect I have > a SMP > issue (machine is a Core2 Quad), so I'll move to xenomai-core. > (original message can be found at > http://sourceforge.net/mailarchive/message.php?msg_name=4CC82C8D.3080808%40control.lth.se > > > ) > > Xenomai-core gurus: which is the corrrect way to debug SMP issues? 
> Can I run I-pipe-tracer and expect to be able save at least 150 > us of > traces for all cpus? Any hints/suggestions/insigths are welcome... The i-pipe tracer unfortunately only saves traces for a the CPU that triggered the freeze. To have a full pictures, you may want to try my ftrace port I posted recently for 2.6.35. >>> 2.6.35.7 ? >>> >> Exactly. > Finally managed to get the ftrace to work > (one possible bug: had to manually copy > include/xenomai/trace/xn_nucleus.h to > include/xenomai/trace/events/xn_nucleus.h), and it looks like it can be > very useful... > > But I don't think it will give much info at the moment, since no > xenomai/ipipe interrupt activity shows up, and adding that is far above > my league :-( You could use the function tracer, provided you are able to stop the trace quickly enough on error. > My current theory is that the problem occurs when something like this > takes place: > > CPU-iCPU-jCPU-kCPU-l > > rt_dev_sendmsg > xmit_irq > rt_dev_recvmsgrecv_irq Can't follow. When races here, and what will go wrong then? >>> Thats the good question. Find attached: >>> >>> 1. .config (so you can check for stupid mistakes) >>> 2. console log >>> 3. latest version of test program >>> 4. 
tail of ftrace dump
>>>
>>> These are the Xenomai tasks running when the test program is active:
>>>
>>> CPU  PID    CLASS  PRI  TIMEOUT  TIMEBASE  STAT  NAME
>>>   0      0  idle    -1        -   master   R     ROOT/0
>>>   1      0  idle    -1        -   master   R     ROOT/1
>>>   2      0  idle    -1        -   master   R     ROOT/2
>>>   3      0  idle    -1        -   master   R     ROOT/3
>>>   0      0  rt      98        -   master   W     rtnet-stack
>>>   0      0  rt       0        -   master   W     rtnet-rtpc
>>>   0  29901  rt      50        -   master         raw_test
>>>   0  29906  rt       0        -   master   X     reporter
>>>
>>> The lines of interest from the trace are probably:
>>>
>>> [003] 2061.347855: xn_nucleus_thread_resume: thread=f9bf7b00 thread_name=rtnet-stack mask=2
>>> [003] 2061.347862: xn_nucleus_sched: status=200
>>> [000] 2061.347866: xn_nucleus_sched_remote: status=0
>>>
>>> since this is the only place where a packet gets delayed, and the only place in the trace where sched_remote reports a status=0
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 03.11.2010 12:44, Anders Blomdell wrote: > Anders Blomdell wrote: >> Jan Kiszka wrote: >>> Am 01.11.2010 17:55, Anders Blomdell wrote: Jan Kiszka wrote: > Am 28.10.2010 11:34, Anders Blomdell wrote: >> Jan Kiszka wrote: >>> Am 28.10.2010 09:34, Anders Blomdell wrote: Anders Blomdell wrote: > Anders Blomdell wrote: >> Hi, >> >> I'm trying to use rt_eepro100, for sending raw ethernet packets, >> but I'm >> experincing occasionally weird behaviour. >> >> Versions of things: >> >> linux-2.6.34.5 >> xenomai-2.5.5.2 >> rtnet-39f7fcf >> >> The testprogram runs on two computers with "Intel Corporation >> 82557/8/9/0/1 Ethernet Pro 100 (rev 08)" controller, where one >> computer >> acts as a mirror sending back packets received from the ethernet >> (only >> those two computers on the network), and the other sends >> packets and >> measures roundtrip time. Most packets comes back in approximately >> 100 >> us, but occasionally the reception times out (once in about >> 10 >> packets or more), but the packets gets immediately received when >> reception is retried, which might indicate a race between >> rt_dev_recvmsg >> and interrupt, but I might miss something obvious. > Changing one of the ethernet cards to a "Intel Corporation 82541PI > Gigabit Ethernet Controller (rev 05)", while keeping everything > else > constant, changes behavior somewhat; after receiving a few 10 > packets, reception stops entirely (-EAGAIN is returned), while > transmission proceeds as it should (and mirror returns packets). > > Any suggestions on what to try? Since the problem disappears with 'maxcpus=1', I suspect I have a SMP issue (machine is a Core2 Quad), so I'll move to xenomai-core. (original message can be found at http://sourceforge.net/mailarchive/message.php?msg_name=4CC82C8D.3080808%40control.lth.se ) Xenomai-core gurus: which is the corrrect way to debug SMP issues? Can I run I-pipe-tracer and expect to be able save at least 150 us of traces for all cpus? 
Any hints/suggestions/insigths are welcome... >>> The i-pipe tracer unfortunately only saves traces for a the CPU that >>> triggered the freeze. To have a full pictures, you may want to >>> try my >>> ftrace port I posted recently for 2.6.35. >> 2.6.35.7 ? >> > Exactly. Finally managed to get the ftrace to work (one possible bug: had to manually copy include/xenomai/trace/xn_nucleus.h to include/xenomai/trace/events/xn_nucleus.h), and it looks like it can be very useful... But I don't think it will give much info at the moment, since no xenomai/ipipe interrupt activity shows up, and adding that is far above my league :-( >>> >>> You could use the function tracer, provided you are able to stop the >>> trace quickly enough on error. >>> My current theory is that the problem occurs when something like this takes place: CPU-iCPU-jCPU-kCPU-l rt_dev_sendmsg xmit_irq rt_dev_recvmsgrecv_irq >>> >>> Can't follow. When races here, and what will go wrong then? >> Thats the good question. Find attached: >> >> 1. .config (so you can check for stupid mistakes) >> 2. console log >> 3. latest version of test program >> 4. 
tail of ftrace dump
>>
>> These are the Xenomai tasks running when the test program is active:
>>
>> CPU  PID    CLASS  PRI  TIMEOUT  TIMEBASE  STAT  NAME
>>   0      0  idle    -1        -   master   R     ROOT/0
>>   1      0  idle    -1        -   master   R     ROOT/1
>>   2      0  idle    -1        -   master   R     ROOT/2
>>   3      0  idle    -1        -   master   R     ROOT/3
>>   0      0  rt      98        -   master   W     rtnet-stack
>>   0      0  rt       0        -   master   W     rtnet-rtpc
>>   0  29901  rt      50        -   master         raw_test
>>   0  29906  rt       0        -   master   X     reporter
>>
>> The lines of interest from the trace are probably:
>>
>> [003] 2061.347855: xn_nucleus_thread_resume: thread=f9bf7b00 thread_name=rtnet-stack mask=2
>> [003] 2061.347862: xn_nucleus_sched: status=200
>> [000] 2061.347866: xn_nucleus_sched_remote: status=0
>>
>> since this is the only place where a packet gets delayed, and the only place in the trace where sched_remote reports a status=0
> Since the CPU that has rtnet-stack and hence should be resumed is doing
> heavy I/O at the time of fault; could it be that send_ipi/schedule_handler
> needs barriers to make sure that decisions are made on the right status?
Re: [Xenomai-core] Potential problem with rt_eepro100
Anders Blomdell wrote: Jan Kiszka wrote: Am 01.11.2010 17:55, Anders Blomdell wrote: Jan Kiszka wrote: Am 28.10.2010 11:34, Anders Blomdell wrote: Jan Kiszka wrote: Am 28.10.2010 09:34, Anders Blomdell wrote: Anders Blomdell wrote: Anders Blomdell wrote: Hi, I'm trying to use rt_eepro100, for sending raw ethernet packets, but I'm experincing occasionally weird behaviour. Versions of things: linux-2.6.34.5 xenomai-2.5.5.2 rtnet-39f7fcf The testprogram runs on two computers with "Intel Corporation 82557/8/9/0/1 Ethernet Pro 100 (rev 08)" controller, where one computer acts as a mirror sending back packets received from the ethernet (only those two computers on the network), and the other sends packets and measures roundtrip time. Most packets comes back in approximately 100 us, but occasionally the reception times out (once in about 10 packets or more), but the packets gets immediately received when reception is retried, which might indicate a race between rt_dev_recvmsg and interrupt, but I might miss something obvious. Changing one of the ethernet cards to a "Intel Corporation 82541PI Gigabit Ethernet Controller (rev 05)", while keeping everything else constant, changes behavior somewhat; after receiving a few 10 packets, reception stops entirely (-EAGAIN is returned), while transmission proceeds as it should (and mirror returns packets). Any suggestions on what to try? Since the problem disappears with 'maxcpus=1', I suspect I have a SMP issue (machine is a Core2 Quad), so I'll move to xenomai-core. (original message can be found at http://sourceforge.net/mailarchive/message.php?msg_name=4CC82C8D.3080808%40control.lth.se ) Xenomai-core gurus: which is the corrrect way to debug SMP issues? Can I run I-pipe-tracer and expect to be able save at least 150 us of traces for all cpus? Any hints/suggestions/insigths are welcome... The i-pipe tracer unfortunately only saves traces for a the CPU that triggered the freeze. 
To have a full picture, you may want to try my ftrace port I posted recently for 2.6.35. 2.6.35.7 ? Exactly. Finally managed to get the ftrace to work (one possible bug: had to manually copy include/xenomai/trace/xn_nucleus.h to include/xenomai/trace/events/xn_nucleus.h), and it looks like it can be very useful... But I don't think it will give much info at the moment, since no xenomai/ipipe interrupt activity shows up, and adding that is far above my league :-( You could use the function tracer, provided you are able to stop the trace quickly enough on error. My current theory is that the problem occurs when something like this takes place:

CPU-i            CPU-j      CPU-k            CPU-l

rt_dev_sendmsg
                 xmit_irq
                            rt_dev_recvmsg
                                             recv_irq

Can't follow. Who races here, and what will go wrong then? That's the good question. Find attached:

1. .config (so you can check for stupid mistakes)
2. console log
3. latest version of test program
4. tail of ftrace dump

These are the Xenomai tasks running when the test program is active:

CPU  PID    CLASS  PRI  TIMEOUT  TIMEBASE  STAT  NAME
  0      0  idle    -1        -   master   R     ROOT/0
  1      0  idle    -1        -   master   R     ROOT/1
  2      0  idle    -1        -   master   R     ROOT/2
  3      0  idle    -1        -   master   R     ROOT/3
  0      0  rt      98        -   master   W     rtnet-stack
  0      0  rt       0        -   master   W     rtnet-rtpc
  0  29901  rt      50        -   master         raw_test
  0  29906  rt       0        -   master   X     reporter

The lines of interest from the trace are probably:

[003] 2061.347855: xn_nucleus_thread_resume: thread=f9bf7b00 thread_name=rtnet-stack mask=2
[003] 2061.347862: xn_nucleus_sched: status=200
[000] 2061.347866: xn_nucleus_sched_remote: status=0

since this is the only place where a packet gets delayed, and the only place in the trace where sched_remote reports a status=0. Since the CPU that has rtnet-stack and hence should be resumed is doing heavy I/O at the time of fault; could it be that send_ipi/schedule_handler needs barriers to make sure that decisions are made on the right status? /Anders ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 01.11.2010 17:55, Anders Blomdell wrote: > Jan Kiszka wrote: >> Am 28.10.2010 11:34, Anders Blomdell wrote: >>> Jan Kiszka wrote: Am 28.10.2010 09:34, Anders Blomdell wrote: > Anders Blomdell wrote: >> Anders Blomdell wrote: >>> Hi, >>> >>> I'm trying to use rt_eepro100, for sending raw ethernet packets, >>> but I'm >>> experincing occasionally weird behaviour. >>> >>> Versions of things: >>> >>> linux-2.6.34.5 >>> xenomai-2.5.5.2 >>> rtnet-39f7fcf >>> >>> The testprogram runs on two computers with "Intel Corporation >>> 82557/8/9/0/1 Ethernet Pro 100 (rev 08)" controller, where one >>> computer >>> acts as a mirror sending back packets received from the ethernet >>> (only >>> those two computers on the network), and the other sends packets and >>> measures roundtrip time. Most packets comes back in approximately >>> 100 >>> us, but occasionally the reception times out (once in about 10 >>> packets or more), but the packets gets immediately received when >>> reception is retried, which might indicate a race between >>> rt_dev_recvmsg >>> and interrupt, but I might miss something obvious. >> Changing one of the ethernet cards to a "Intel Corporation 82541PI >> Gigabit Ethernet Controller (rev 05)", while keeping everything else >> constant, changes behavior somewhat; after receiving a few 10 >> packets, reception stops entirely (-EAGAIN is returned), while >> transmission proceeds as it should (and mirror returns packets). >> >> Any suggestions on what to try? > Since the problem disappears with 'maxcpus=1', I suspect I have a SMP > issue (machine is a Core2 Quad), so I'll move to xenomai-core. > (original message can be found at > http://sourceforge.net/mailarchive/message.php?msg_name=4CC82C8D.3080808%40control.lth.se > > ) > > Xenomai-core gurus: which is the corrrect way to debug SMP issues? > Can I run I-pipe-tracer and expect to be able save at least 150 us of > traces for all cpus? Any hints/suggestions/insigths are welcome... 
The i-pipe tracer unfortunately only saves traces for the CPU that triggered the freeze. To have a full picture, you may want to try my ftrace port I posted recently for 2.6.35. >>> 2.6.35.7 ? >>> >> >> Exactly. > Finally managed to get the ftrace to work > (one possible bug: had to manually copy > include/xenomai/trace/xn_nucleus.h to > include/xenomai/trace/events/xn_nucleus.h), and it looks like it can be > very useful... > > But I don't think it will give much info at the moment, since no > xenomai/ipipe interrupt activity shows up, and adding that is far above > my league :-( You could use the function tracer, provided you are able to stop the trace quickly enough on error. > > My current theory is that the problem occurs when something like this > takes place: > > CPU-i    CPU-j    CPU-k    CPU-l > > rt_dev_sendmsg > xmit_irq > rt_dev_recvmsg    recv_irq Can't follow. Who races here, and what will go wrong then? > > So now I'll try to instrument the code to see if the assumption holds. > Stay tuned... > > Regards > > Anders > > Jan
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: Am 28.10.2010 11:34, Anders Blomdell wrote: Jan Kiszka wrote: Am 28.10.2010 09:34, Anders Blomdell wrote: Anders Blomdell wrote: Anders Blomdell wrote: Hi, I'm trying to use rt_eepro100, for sending raw ethernet packets, but I'm experincing occasionally weird behaviour. Versions of things: linux-2.6.34.5 xenomai-2.5.5.2 rtnet-39f7fcf The testprogram runs on two computers with "Intel Corporation 82557/8/9/0/1 Ethernet Pro 100 (rev 08)" controller, where one computer acts as a mirror sending back packets received from the ethernet (only those two computers on the network), and the other sends packets and measures roundtrip time. Most packets comes back in approximately 100 us, but occasionally the reception times out (once in about 10 packets or more), but the packets gets immediately received when reception is retried, which might indicate a race between rt_dev_recvmsg and interrupt, but I might miss something obvious. Changing one of the ethernet cards to a "Intel Corporation 82541PI Gigabit Ethernet Controller (rev 05)", while keeping everything else constant, changes behavior somewhat; after receiving a few 10 packets, reception stops entirely (-EAGAIN is returned), while transmission proceeds as it should (and mirror returns packets). Any suggestions on what to try? Since the problem disappears with 'maxcpus=1', I suspect I have a SMP issue (machine is a Core2 Quad), so I'll move to xenomai-core. (original message can be found at http://sourceforge.net/mailarchive/message.php?msg_name=4CC82C8D.3080808%40control.lth.se ) Xenomai-core gurus: which is the corrrect way to debug SMP issues? Can I run I-pipe-tracer and expect to be able save at least 150 us of traces for all cpus? Any hints/suggestions/insigths are welcome... The i-pipe tracer unfortunately only saves traces for a the CPU that triggered the freeze. To have a full pictures, you may want to try my ftrace port I posted recently for 2.6.35. 2.6.35.7 ? Exactly. 
Finally managed to get the ftrace to work (one possible bug: had to manually copy include/xenomai/trace/xn_nucleus.h to include/xenomai/trace/events/xn_nucleus.h), and it looks like it can be very useful... But I don't think it will give much info at the moment, since no xenomai/ipipe interrupt activity shows up, and adding that is far above my league :-( My current theory is that the problem occurs when something like this takes place:

CPU-i            CPU-j      CPU-k            CPU-l
rt_dev_sendmsg
                 xmit_irq
                            rt_dev_recvmsg
                                             recv_irq

So now I'll try to instrument the code to see if the assumption holds. Stay tuned... Regards Anders
Re: [Xenomai-core] Potential problem with rt_eepro100
On 2010-10-29 20.06, Jan Kiszka wrote: > Am 29.10.2010 19:42, Anders Blomdell wrote: >> Jan Kiszka wrote: >> > Please provide the full kernel log, ideally also with the I-pipe tracer > (with panic tracing) enabled. Will reconfigure/recompile and do that, with full kernel log do you mean all bootup info? >>> >>> That's best to avoid missing some detail or doing Q&A ping-pong. >> >> Full trace attached (finally...) >> > > You have to switch off CONFIG_DMA_API_DEBUG, it's incompatible with Xenomai. Thanks, will continue with this on Monday (build in progress). With your ftrace port, how does one freeze all CPUs at the same time? Regards Anders -- Anders Blomdell Email: anders.blomd...@control.lth.se Department of Automatic Control Lund University Phone: +46 46 222 4625 P.O. Box 118 Fax: +46 46 138118 SE-221 00 Lund, Sweden
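On the "freeze all CPUs" question: mainline ftrace keeps one global ring buffer, so flipping the `tracing_on` switch off stops recording on every CPU at once. A sketch of a wrapper around the test run — the debugfs mount point is the standard `/sys/kernel/debug`, and `raw_test` exiting nonzero on a lost packet is an assumption about the test program, not something stated in the thread:

```shell
# Freeze the global (all-CPU) ftrace ring buffer the moment the test
# fails, then snapshot it. Assumes debugfs at /sys/kernel/debug and
# that raw_test exits nonzero when it detects the receive timeout.
TRACING=/sys/kernel/debug/tracing

echo 1 > "$TRACING/tracing_on"        # arm tracing before the run
if ! ./raw_test; then
    echo 0 > "$TRACING/tracing_on"    # stop every CPU's buffer together
    cat "$TRACING/trace" > /tmp/xeno-trace.txt
fi
```

From kernel or instrumentation code the same switch is reachable via `tracing_off()`, which is the usual way to "stop the trace quickly enough on error".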
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 29.10.2010 19:42, Anders Blomdell wrote: > Jan Kiszka wrote: > Please provide the full kernel log, ideally also with the I-pipe tracer (with panic tracing) enabled. >>> Will reconfigure/recompile and do that, with full kernel log do you >>> mean all >>> bootup info? >> >> That's best to avoid missing some detail or doing Q&A ping-pong. > > Full trace attached (finally...) > You have to switch off CONFIG_DMA_API_DEBUG, it's incompatible with Xenomai. Jan
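For readers hitting the same trap: the switch Jan refers to is the kernel's DMA API debugging option. The corresponding .config fragment, in standard kconfig notation, looks like this (the comment about why is this thread's claim, restated):

```
# DMA API debugging instruments the dma_map/unmap paths; per the report
# above it is incompatible with Xenomai and must stay disabled.
# CONFIG_DMA_API_DEBUG is not set
```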
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 28.10.2010 17:18, Anders Blomdell wrote: > On 2010-10-28 17.09, Jan Kiszka wrote: >> Am 28.10.2010 17:05, Anders Blomdell wrote: >>> Current results: >>> >>> 1. 2.6.35.7, maxcpus=1; a few thousand rounds, freeze and this after >>> some time: >>> >>> BUG: spinlock lockup on CPU#0, raw_test/2924, c0a1b540 >>> Process raw_test (pid: 2924, ti=f18bc000 task=f1bdab00 task.ti=f18bc000) >>> I-pipe domain Xenomai >>> Stack: >>> Call Trace: >>> Code: d0 85 d8 74 0f ba 71 00 00 00 b8 78 a7 8f c0 e8 3b 03 02 00 a1 28 >>> f6 9f c0 89 f2 8b 48 20 89 d8 e8 ad fd ff ff 57 9d 8d 65 f4 5b <5e> 5f >>> 5d c3 90 55 89 e5 0f 1f 44 00 00 8b 0d 28 f6 9f c0 ba 00 >>> >> >> Please provide the full kernel log, ideally also with the I-pipe tracer >> (with panic tracing) enabled. > Will reconfigure/recompile and do that, with full kernel log do you mean all > bootup info? That's best to avoid missing some detail or doing Q&A ping-pong. >> >>> 2. 2.6.35.7, maxcpus=4; no packets sent, this on console: >>> >>> e1000: rteth0: e1000_clean_tx_irq: Detected Tx Unit Hang >> >> Err, is this another NIC used on this box? If yes and when used as RTnet >> NIC instead, does it trigger the same issue? > Switched the eepro100 to a e1000, and got same user program issues (indicating > ipipe/rtdm/rtnet_stack issues and not a specific driver), have not switched > back > yet, will do if you rather want that. As both NICs are apparently affected, just stick with one. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Xenomai-core] Potential problem with rt_eepro100
On 2010-10-28 17.09, Jan Kiszka wrote: > Am 28.10.2010 17:05, Anders Blomdell wrote: >> Current results: >> >> 1. 2.6.35.7, maxcpus=1; a few thousand rounds, freeze and this after >> some time: >> >> BUG: spinlock lockup on CPU#0, raw_test/2924, c0a1b540 >> Process raw_test (pid: 2924, ti=f18bc000 task=f1bdab00 task.ti=f18bc000) >> I-pipe domain Xenomai >> Stack: >> Call Trace: >> Code: d0 85 d8 74 0f ba 71 00 00 00 b8 78 a7 8f c0 e8 3b 03 02 00 a1 28 >> f6 9f c0 89 f2 8b 48 20 89 d8 e8 ad fd ff ff 57 9d 8d 65 f4 5b <5e> 5f >> 5d c3 90 55 89 e5 0f 1f 44 00 00 8b 0d 28 f6 9f c0 ba 00 >> > > Please provide the full kernel log, ideally also with the I-pipe tracer > (with panic tracing) enabled. Will reconfigure/recompile and do that, with full kernel log do you mean all bootup info? > >> 2. 2.6.35.7, maxcpus=4; no packets sent, this on console: >> >> e1000: rteth0: e1000_clean_tx_irq: Detected Tx Unit Hang > > Err, is this another NIC used on this box? If yes and when used as RTnet > NIC instead, does it trigger the same issue? Switched the eepro100 to a e1000, and got same user program issues (indicating ipipe/rtdm/rtnet_stack issues and not a specific driver), have not switched back yet, will do if you rather want that. >
>> Tx Queue              <0>
>> TDH                   <0>
>> TDT                   <10>
>> next_to_use           <10>
>> next_to_clean         <0>
>> buffer_info[next_to_clean]
>>   time_stamp          <362d2>
>>   next_to_watch       <0>
>>   jiffies             <368f8>
>>   next_to_watch.status <0>
/Anders
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 28.10.2010 17:05, Anders Blomdell wrote: > Current results: > > 1. 2.6.35.7, maxcpus=1; a few thousand rounds, freeze and this after > some time: > > BUG: spinlock lockup on CPU#0, raw_test/2924, c0a1b540 > Process raw_test (pid: 2924, ti=f18bc000 task=f1bdab00 task.ti=f18bc000) > I-pipe domain Xenomai > Stack: > Call Trace: > Code: d0 85 d8 74 0f ba 71 00 00 00 b8 78 a7 8f c0 e8 3b 03 02 00 a1 28 > f6 9f c0 89 f2 8b 48 20 89 d8 e8 ad fd ff ff 57 9d 8d 65 f4 5b <5e> 5f > 5d c3 90 55 89 e5 0f 1f 44 00 00 8b 0d 28 f6 9f c0 ba 00 > Please provide the full kernel log, ideally also with the I-pipe tracer (with panic tracing) enabled. > 2. 2.6.35.7, maxcpus=4; no packets sent, this on console: > > e1000: rteth0: e1000_clean_tx_irq: Detected Tx Unit Hang Err, is this another NIC used on this box? If yes and when used as RTnet NIC instead, does it trigger the same issue?
> Tx Queue              <0>
> TDH                   <0>
> TDT                   <10>
> next_to_use           <10>
> next_to_clean         <0>
> buffer_info[next_to_clean]
>   time_stamp          <362d2>
>   next_to_watch       <0>
>   jiffies             <368f8>
>   next_to_watch.status <0>
> > Anders Jan
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote: > Am 28.10.2010 11:34, Anders Blomdell wrote: >> Jan Kiszka wrote: >>> Am 28.10.2010 09:34, Anders Blomdell wrote: Anders Blomdell wrote: > Anders Blomdell wrote: >> Hi, >> >> I'm trying to use rt_eepro100, for sending raw ethernet packets, >> but I'm >> experincing occasionally weird behaviour. >> >> Versions of things: >> >> linux-2.6.34.5 >> xenomai-2.5.5.2 >> rtnet-39f7fcf >> >> The testprogram runs on two computers with "Intel Corporation >> 82557/8/9/0/1 Ethernet Pro 100 (rev 08)" controller, where one >> computer >> acts as a mirror sending back packets received from the ethernet (only >> those two computers on the network), and the other sends packets and >> measures roundtrip time. Most packets comes back in approximately 100 >> us, but occasionally the reception times out (once in about 10 >> packets or more), but the packets gets immediately received when >> reception is retried, which might indicate a race between >> rt_dev_recvmsg >> and interrupt, but I might miss something obvious. > Changing one of the ethernet cards to a "Intel Corporation 82541PI > Gigabit Ethernet Controller (rev 05)", while keeping everything else > constant, changes behavior somewhat; after receiving a few 10 > packets, reception stops entirely (-EAGAIN is returned), while > transmission proceeds as it should (and mirror returns packets). > > Any suggestions on what to try? Since the problem disappears with 'maxcpus=1', I suspect I have a SMP issue (machine is a Core2 Quad), so I'll move to xenomai-core. (original message can be found at http://sourceforge.net/mailarchive/message.php?msg_name=4CC82C8D.3080808%40control.lth.se ) Xenomai-core gurus: which is the corrrect way to debug SMP issues? Can I run I-pipe-tracer and expect to be able save at least 150 us of traces for all cpus? Any hints/suggestions/insigths are welcome... >>> The i-pipe tracer unfortunately only saves traces for a the CPU that >>> triggered the freeze. 
To have a full picture, you may want to try my >>> ftrace port I posted recently for 2.6.35. >> 2.6.35.7 ? Well, 2.6.35.7/xenomai/rtnet without ftrace patch freezes after approx 8000 rounds (16000 packets). Time to freshen up my serial port console debugging, I guess (under the assumption that this is the same bug, but easier to reproduce). /Anders