Re: [Xenomai-core] High latencies on ARM.

2008-01-30 Thread Gilles Chanteperdrix
Gilles Chanteperdrix wrote:
 > On Jan 17, 2008 3:22 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
 > >
 > > Gilles Chanteperdrix wrote:
 > > > On Jan 17, 2008 3:16 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
 > > >> Gilles Chanteperdrix wrote:
 > > >>> On Jan 17, 2008 12:55 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
 > >  Gilles Chanteperdrix wrote:
 > > > On Jan 17, 2008 11:42 AM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
 > > >> Gilles Chanteperdrix wrote:
 > > >>> Hi,
 > > >>>
 > > >>> after some (unsuccessful) time trying to instrument the code in a 
 > > >>> way
 > > >>> that does not change the latency results completely, I found the
 > > >>> reason for the high latency with latency -t 1 and latency -t 2 on 
 > > >>> ARM.
 > > >>> So, here comes an update on this issue. The culprit is the 
 > > >>> user-space
 > > >>> context switch, which flushes the processor cache with the nklock
 > > >>> locked, irqs off.
 > > >>>
 > > >>> There are two things we could do:
 > > >>> - arrange for the ARM cache flush to happen with the nklock 
 > > >>> unlocked
 > > >>> and irqs enabled. This will improve interrupt latency (latency -t 
 > > >>> 2)
 > > >>> but obviously not scheduling latency (latency -t 1). If we go that
 > > >>> way, there are several problems we should solve:
 > > >>>
 > > >>> we do not want interrupt handlers to reenter xnpod_schedule(), for
 > > >>> this we can use the XNLOCK bit, set on whatever is
 > > >>> xnpod_current_thread() when the cache flush occurs
 > > >>>
 > > >>> since the interrupt handler may modify the rescheduling bits, we 
 > > >>> need
 > > >>> to test these bits in xnpod_schedule() epilogue and restart
 > > >>> xnpod_schedule() if need be
 > > >>>
 > > >>> we do not want xnpod_delete_thread() to delete one of the two 
 > > >>> threads
 > > >>> involved in the context switch, for this the only solution I found 
 > > >>> is
 > > >>> to add a bit to the thread mask meaning that the thread is 
 > > >>> currently
 > > >>> switching, and to (re)test the XNZOMBIE bit in xnpod_schedule 
 > > >>> epilogue
 > > >>> to delete whatever thread was marked for deletion
 > > >>>
 > > >>> in case of migration with xnpod_migrate_thread, we do not want
 > > >>> xnpod_schedule() on the target CPU to switch to the migrated thread
 > > >>> before the context switch on the source CPU is finished, for this 
 > > >>> we
 > > >>> can avoid setting the resched bit in xnpod_migrate_thread(), detect
 > > >>> the condition in xnpod_schedule() epilogue and set the rescheduling
 > > >>> bits so that xnpod_schedule is restarted and send the IPI to the
 > > >>> target CPU.
 > > >>>
 > > >>> - avoid using user-space real-time tasks when running latency
 > > >>> kernel-space benches, i.e. at least in the latency -t 1 and 
 > > >>> latency -t
 > > >>> 2 case. This means that we should change the timerbench driver. 
 > > >>> There
 > > >>> are at least two ways of doing this:
 > > >>> use an rt_pipe
 > > >>>  modify the timerbench driver to implement only the nrt ioctl, 
 > > >>> using
 > > >>> vanilla linux services such as wait_event and wake_up.
 > > >> [As you reminded me of this unanswered question:]
 > > >> One may consider adding further modes _besides_ current kernel tests
 > > >> that do not rely on RTDM & native userland support (e.g. when
 > > >> CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are 
 > > >> valid
 > > >> scenarios as well that must not be killed by such a change.
 > > > I think the current test scenarios for latency -t 1 and latency -t 2
 > > > are a bit misleading: they measure kernel-space latencies in the presence
 > > > of user-space real-time tasks. When one runs latency -t 1 or latency
 > > > -t 2, one would expect that there are only kernel-space real-time
 > > > tasks.
 > >  Whether they are misleading depends on your perspective. In fact, they are
 > >  measuring in-kernel scenarios over the standard Xenomai setup, which
 > >  includes userland RT task activity these days. Those scenarios are mainly
 > >  targeting driver use cases, not pure kernel-space applications.
 > > 
 > >  But I agree that, for !CONFIG_XENO_OPT_PERVASIVE-like scenarios, we
 > >  would benefit from an additional set of test cases.
 > > >>> Ok, I will not touch timerbench then, and implement another kernel 
 > > >>> module.
 > > >>>
 > > >> [Without considering all details]
 > > >> To achieve this independence of user-space RT threads, it should suffice
 > > >> to implement a kernel-based frontend for timerbench. This frontend would
 > > >> then either dump to syslog or open some pipe to tell userland about the
 > > >> benchmark results. What do you think?
 > > >
 > > > My intent was to imple

Re: [Xenomai-core] High latencies on ARM.

2008-01-28 Thread Gilles Chanteperdrix
Gilles Chanteperdrix wrote:
 > On Jan 28, 2008 12:34 AM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
 > >
 > > Gilles Chanteperdrix wrote:
 > > > Philippe Gerum wrote:
 > > >  > Gilles Chanteperdrix wrote:
 > > >  > > Philippe Gerum wrote:
 > > >  > >  > Gilles Chanteperdrix wrote:
 > > >  > >  > > On Jan 23, 2008 7:34 PM, Philippe Gerum <[EMAIL PROTECTED]> 
 > > > wrote:
 > > >  > >  > >> Gilles Chanteperdrix wrote:
 > > >  > >  > >>> On Jan 23, 2008 6:48 PM, Philippe Gerum <[EMAIL PROTECTED]> 
 > > > wrote:
 > > >  > >  >  Gilles Chanteperdrix wrote:
 > > >  > >  > > Gilles Chanteperdrix wrote:
 > > >  > >  > >  > Please find attached a patch implementing these ideas. 
 > > > This adds some
 > > >  > >  > >  > clutter, which I would be happy to reduce. Better ideas 
 > > > are welcome.
 > > >  > >  > >  >
 > > >  > >  > >
 > > >  > >  > > Ok. New version of the patch, this time split in two 
 > > > parts, should
 > > >  > >  > > hopefully make it more readable.
 > > >  > >  > >
 > > >  > >  >  Ack. I'd suggest the following:
 > > >  > >  > 
 > > >  > >  >  - let's have a rate limiter when walking the zombie queue in
 > > >  > >  >  __xnpod_finalize_zombies. We hold the superlock here, and 
 > > > what the patch
 > > >  > >  >  also introduces is the potential for flushing more than a 
 > > > single TCB at
 > > >  > >  >  a time, which might not always be a cheap operation, 
 > > > depending on which
 > > >  > >  >  cra^H^Hode runs on behalf of the deletion hooks for 
 > > > instance. We may
 > > >  > >  >  take for granted that no sane code would continuously 
 > > > create more
 > > >  > >  >  threads than we would be able to finalize in a given time 
 > > > frame anyway.
 > > >  > >  > >>> The maximum number of zombies in the queue is
 > > >  > >  > >>> 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to 
 > > > the queue
 > > >  > >  > >>> only if a deleted thread is xnpod_current_thread(), or if 
 > > > the XNLOCKSW
 > > >  > >  > >>> bit is armed.
 > > >  > >  > >> Ack. rate_limit = 1? I'm really reluctant to increase the 
 > > > WCET here,
 > > >  > >  > >> thread deletion isn't cheap already.
 > > >  > >  > >
 > > >  > >  > > I am not sure that holding the nklock while we run the thread 
 > > > deletion
 > > >  > >  > > hooks is really needed.
 > > >  > >  > >
 > > >  > >  >
 > > >  > >  > Deletion hooks may currently rely on the following assumptions 
 > > > when running:
 > > >  > >  >
 > > >  > >  > - rescheduling is locked
 > > >  > >  > - nklock is held, interrupts are off
 > > >  > >  > - they run on behalf of the deletor context
 > > >  > >  >
 > > >  > >  > The self-delete refactoring currently kills #3 because we now 
 > > > run the
 > > >  > >  > hooks after the context switch, and would also kill #2 if we did 
 > > > not
 > > >  > >  > hold the nklock (btw, enabling the nucleus debug while running 
 > > > with this
 > > >  > >  > patch should raise an abort, from xnshadow_unmap, due to the 
 > > > second
 > > >  > >  > assertion).
 > > >  > >  >
 > > >  >
 > > >  > Forget about this; shadows are always exited in secondary mode, so
 > > >  > that's fine, i.e. xnpod_current_thread() != deleted thread, hence we
 > > >  > should always run the deletion hooks immediately on behalf of the 
 > > > caller.
 > > >
 > > > What happens if the watchdog kills a user-space thread which is
 > > > currently running in primary mode ? If I read xnpod_delete_thread
 > > > correctly, the SIGKILL signal is sent to the target thread only if it is
 > > > not the current thread.
 > > >
 > >
 > > I'd say: zombie queuing from xnpod_delete, then shadow unmap on behalf
 > > of the next switched context which would trigger the lo-stage unmap
 > > request -> wake_up_process against the Linux side and asbestos underwear
 > > provided by the relax epilogue, which would eventually reap the guy
 > > through do_exit(). As a matter of fact, we would still have the
 > > unmap-over-non-current issue, that's true.
 > >
 > > Ok, could we try coding a damn Tetris instead? Pong, maybe? Gasp...
 > 
 > Games for mobile phones then, because I am afraid games for consoles
 > or PCs are too complicated for me.
 > 
 > No, seriously, how do we solve this ? Maybe we could relax from
 > xnpod_delete_thread ?

This will not work: xnpod_schedule() will not let xnshadow_relax() suspend
the current thread while in interrupt context.

-- 


Gilles Chanteperdrix.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-28 Thread Gilles Chanteperdrix
On Jan 28, 2008 12:34 AM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>
> Gilles Chanteperdrix wrote:
> > Philippe Gerum wrote:
> >  > Gilles Chanteperdrix wrote:
> >  > > Philippe Gerum wrote:
> >  > >  > Gilles Chanteperdrix wrote:
> >  > >  > > On Jan 23, 2008 7:34 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
> >  > >  > >> Gilles Chanteperdrix wrote:
> >  > >  > >>> On Jan 23, 2008 6:48 PM, Philippe Gerum <[EMAIL PROTECTED]> 
> > wrote:
> >  > >  >  Gilles Chanteperdrix wrote:
> >  > >  > > Gilles Chanteperdrix wrote:
> >  > >  > >  > Please find attached a patch implementing these ideas. 
> > This adds some
> >  > >  > >  > clutter, which I would be happy to reduce. Better ideas 
> > are welcome.
> >  > >  > >  >
> >  > >  > >
> >  > >  > > Ok. New version of the patch, this time split in two parts, 
> > should
> >  > >  > > hopefully make it more readable.
> >  > >  > >
> >  > >  >  Ack. I'd suggest the following:
> >  > >  > 
> >  > >  >  - let's have a rate limiter when walking the zombie queue in
> >  > >  >  __xnpod_finalize_zombies. We hold the superlock here, and what 
> > the patch
> >  > >  >  also introduces is the potential for flushing more than a 
> > single TCB at
> >  > >  >  a time, which might not always be a cheap operation, depending 
> > on which
> >  > >  >  cra^H^Hode runs on behalf of the deletion hooks for instance. 
> > We may
> >  > >  >  take for granted that no sane code would continuously create 
> > more
> >  > >  >  threads than we would be able to finalize in a given time 
> > frame anyway.
> >  > >  > >>> The maximum number of zombies in the queue is
> >  > >  > >>> 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the 
> > queue
> >  > >  > >>> only if a deleted thread is xnpod_current_thread(), or if the 
> > XNLOCKSW
> >  > >  > >>> bit is armed.
> >  > >  > >> Ack. rate_limit = 1? I'm really reluctant to increase the WCET 
> > here,
> >  > >  > >> thread deletion isn't cheap already.
> >  > >  > >
> >  > >  > > I am not sure that holding the nklock while we run the thread 
> > deletion
> >  > >  > > hooks is really needed.
> >  > >  > >
> >  > >  >
> >  > >  > Deletion hooks may currently rely on the following assumptions when 
> > running:
> >  > >  >
> >  > >  > - rescheduling is locked
> >  > >  > - nklock is held, interrupts are off
> >  > >  > - they run on behalf of the deletor context
> >  > >  >
> >  > >  > The self-delete refactoring currently kills #3 because we now run 
> > the
> >  > >  > hooks after the context switch, and would also kill #2 if we did not
> >  > >  > hold the nklock (btw, enabling the nucleus debug while running with 
> > this
> >  > >  > patch should raise an abort, from xnshadow_unmap, due to the second
> >  > >  > assertion).
> >  > >  >
> >  >
> >  > Forget about this; shadows are always exited in secondary mode, so
> >  > that's fine, i.e. xnpod_current_thread() != deleted thread, hence we
> >  > should always run the deletion hooks immediately on behalf of the caller.
> >
> > What happens if the watchdog kills a user-space thread which is
> > currently running in primary mode ? If I read xnpod_delete_thread
> > correctly, the SIGKILL signal is sent to the target thread only if it is
> > not the current thread.
> >
>
> I'd say: zombie queuing from xnpod_delete, then shadow unmap on behalf
> of the next switched context which would trigger the lo-stage unmap
> request -> wake_up_process against the Linux side and asbestos underwear
> provided by the relax epilogue, which would eventually reap the guy
> through do_exit(). As a matter of fact, we would still have the
> unmap-over-non-current issue, that's true.
>
> Ok, could we try coding a damn Tetris instead? Pong, maybe? Gasp...

Games for mobile phones then, because I am afraid games for consoles
or PCs are too complicated for me.

No, seriously, how do we solve this ? Maybe we could relax from
xnpod_delete_thread ?


-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-27 Thread Philippe Gerum
Gilles Chanteperdrix wrote:
> Philippe Gerum wrote:
>  > Gilles Chanteperdrix wrote:
>  > > Philippe Gerum wrote:
>  > >  > Gilles Chanteperdrix wrote:
>  > >  > > On Jan 23, 2008 7:34 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>  > >  > >> Gilles Chanteperdrix wrote:
>  > >  > >>> On Jan 23, 2008 6:48 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>  > >  >  Gilles Chanteperdrix wrote:
>  > >  > > Gilles Chanteperdrix wrote:
>  > >  > >  > Please find attached a patch implementing these ideas. This 
> adds some
>  > >  > >  > clutter, which I would be happy to reduce. Better ideas are 
> welcome.
>  > >  > >  >
>  > >  > >
>  > >  > > Ok. New version of the patch, this time split in two parts, 
> should
>  > >  > > hopefully make it more readable.
>  > >  > >
>  > >  >  Ack. I'd suggest the following:
>  > >  > 
>  > >  >  - let's have a rate limiter when walking the zombie queue in
>  > >  >  __xnpod_finalize_zombies. We hold the superlock here, and what 
> the patch
>  > >  >  also introduces is the potential for flushing more than a single 
> TCB at
>  > >  >  a time, which might not always be a cheap operation, depending 
> on which
>  > >  >  cra^H^Hode runs on behalf of the deletion hooks for instance. We 
> may
>  > >  >  take for granted that no sane code would continuously create more
>  > >  >  threads than we would be able to finalize in a given time frame 
> anyway.
>  > >  > >>> The maximum number of zombies in the queue is
>  > >  > >>> 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the 
> queue
>  > >  > >>> only if a deleted thread is xnpod_current_thread(), or if the 
> XNLOCKSW
>  > >  > >>> bit is armed.
>  > >  > >> Ack. rate_limit = 1? I'm really reluctant to increase the WCET 
> here,
>  > >  > >> thread deletion isn't cheap already.
>  > >  > > 
>  > >  > > I am not sure that holding the nklock while we run the thread 
> deletion
>  > >  > > hooks is really needed.
>  > >  > > 
>  > >  > 
>  > >  > Deletion hooks may currently rely on the following assumptions when 
> running:
>  > >  > 
>  > >  > - rescheduling is locked
>  > >  > - nklock is held, interrupts are off
>  > >  > - they run on behalf of the deletor context
>  > >  > 
>  > >  > The self-delete refactoring currently kills #3 because we now run the
>  > >  > hooks after the context switch, and would also kill #2 if we did not
>  > >  > hold the nklock (btw, enabling the nucleus debug while running with 
> this
>  > >  > patch should raise an abort, from xnshadow_unmap, due to the second
>  > >  > assertion).
>  > >  > 
>  > 
>  > Forget about this; shadows are always exited in secondary mode, so
>  > that's fine, i.e. xnpod_current_thread() != deleted thread, hence we
>  > should always run the deletion hooks immediately on behalf of the caller.
> 
> What happens if the watchdog kills a user-space thread which is
> currently running in primary mode ? If I read xnpod_delete_thread
> correctly, the SIGKILL signal is sent to the target thread only if it is
> not the current thread.
> 

I'd say: zombie queuing from xnpod_delete, then shadow unmap on behalf
of the next switched context which would trigger the lo-stage unmap
request -> wake_up_process against the Linux side and asbestos underwear
provided by the relax epilogue, which would eventually reap the guy
through do_exit(). As a matter of fact, we would still have the
unmap-over-non-current issue, that's true.

Ok, could we try coding a damn Tetris instead? Pong, maybe? Gasp...

-- 
Philippe.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-26 Thread Philippe Gerum
Gilles Chanteperdrix wrote:
> Philippe Gerum wrote:
>  > Gilles Chanteperdrix wrote:
>  > > On Jan 23, 2008 7:34 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>  > >> Gilles Chanteperdrix wrote:
>  > >>> On Jan 23, 2008 6:48 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>  >  Gilles Chanteperdrix wrote:
>  > > Gilles Chanteperdrix wrote:
>  > >  > Please find attached a patch implementing these ideas. This adds 
> some
>  > >  > clutter, which I would be happy to reduce. Better ideas are 
> welcome.
>  > >  >
>  > >
>  > > Ok. New version of the patch, this time split in two parts, should
>  > > hopefully make it more readable.
>  > >
>  >  Ack. I'd suggest the following:
>  > 
>  >  - let's have a rate limiter when walking the zombie queue in
>  >  __xnpod_finalize_zombies. We hold the superlock here, and what the 
> patch
>  >  also introduces is the potential for flushing more than a single TCB 
> at
>  >  a time, which might not always be a cheap operation, depending on 
> which
>  >  cra^H^Hode runs on behalf of the deletion hooks for instance. We may
>  >  take for granted that no sane code would continuously create more
>  >  threads than we would be able to finalize in a given time frame 
> anyway.
>  > >>> The maximum number of zombies in the queue is
>  > >>> 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue
>  > >>> only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW
>  > >>> bit is armed.
>  > >> Ack. rate_limit = 1? I'm really reluctant to increase the WCET here,
>  > >> thread deletion isn't cheap already.
>  > > 
>  > > I am not sure that holding the nklock while we run the thread deletion
>  > > hooks is really needed.
>  > > 
>  > 
>  > Deletion hooks may currently rely on the following assumptions when 
> running:
>  > 
>  > - rescheduling is locked
>  > - nklock is held, interrupts are off
>  > - they run on behalf of the deletor context
>  > 
>  > The self-delete refactoring currently kills #3 because we now run the
>  > hooks after the context switch, and would also kill #2 if we did not
>  > hold the nklock (btw, enabling the nucleus debug while running with this
>  > patch should raise an abort, from xnshadow_unmap, due to the second
>  > assertion).
>  > 

Forget about this; shadows are always exited in secondary mode, so
that's fine, i.e. xnpod_current_thread() != deleted thread, hence we
should always run the deletion hooks immediately on behalf of the caller.

>  > It should be possible to get rid of #3 for xnshadow_unmap (serious
>  > testing needed here), but we would have to grab the nklock from this
>  > routine anyway.
> 
> Since the unmapped task is no longer running on the current CPU, isn't
> there any chance that it gets run on another CPU by the time we get to
> xnshadow_unmap?
> 

The unmapped task is actually still running, and do_exit() may reschedule
quite late, until kernel preemption is eventually disabled, which happens
long after the I-pipe notifier is fired. We would need the nklock to
protect the RPI management too.

-- 
Philippe.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-26 Thread Gilles Chanteperdrix
Philippe Gerum wrote:
 > Gilles Chanteperdrix wrote:
 > > On Jan 23, 2008 7:34 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
 > >> Gilles Chanteperdrix wrote:
 > >>> On Jan 23, 2008 6:48 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
 >  Gilles Chanteperdrix wrote:
 > > Gilles Chanteperdrix wrote:
 > >  > Please find attached a patch implementing these ideas. This adds 
 > > some
 > >  > clutter, which I would be happy to reduce. Better ideas are welcome.
 > >  >
 > >
 > > Ok. New version of the patch, this time split in two parts, should
 > > hopefully make it more readable.
 > >
 >  Ack. I'd suggest the following:
 > 
 >  - let's have a rate limiter when walking the zombie queue in
 >  __xnpod_finalize_zombies. We hold the superlock here, and what the patch
 >  also introduces is the potential for flushing more than a single TCB at
 >  a time, which might not always be a cheap operation, depending on which
 >  cra^H^Hode runs on behalf of the deletion hooks for instance. We may
 >  take for granted that no sane code would continuously create more
 >  threads than we would be able to finalize in a given time frame anyway.
 > >>> The maximum number of zombies in the queue is
 > >>> 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue
 > >>> only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW
 > >>> bit is armed.
 > >> Ack. rate_limit = 1? I'm really reluctant to increase the WCET here,
 > >> thread deletion isn't cheap already.
 > > 
 > > I am not sure that holding the nklock while we run the thread deletion
 > > hooks is really needed.
 > > 
 > 
 > Deletion hooks may currently rely on the following assumptions when running:
 > 
 > - rescheduling is locked
 > - nklock is held, interrupts are off
 > - they run on behalf of the deletor context
 > 
 > The self-delete refactoring currently kills #3 because we now run the
 > hooks after the context switch, and would also kill #2 if we did not
 > hold the nklock (btw, enabling the nucleus debug while running with this
 > patch should raise an abort, from xnshadow_unmap, due to the second
 > assertion).
 > 
 > It should be possible to get rid of #3 for xnshadow_unmap (serious
 > testing needed here), but we would have to grab the nklock from this
 > routine anyway.

Since the unmapped task is no longer running on the current CPU, isn't
there any chance that it gets run on another CPU by the time we get to
xnshadow_unmap?

-- 


Gilles Chanteperdrix.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-26 Thread Philippe Gerum
Gilles Chanteperdrix wrote:
> On Jan 23, 2008 7:34 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>> Gilles Chanteperdrix wrote:
>>> On Jan 23, 2008 6:48 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
 Gilles Chanteperdrix wrote:
> Gilles Chanteperdrix wrote:
>  > Please find attached a patch implementing these ideas. This adds some
>  > clutter, which I would be happy to reduce. Better ideas are welcome.
>  >
>
> Ok. New version of the patch, this time split in two parts, should
> hopefully make it more readable.
>
 Ack. I'd suggest the following:

 - let's have a rate limiter when walking the zombie queue in
 __xnpod_finalize_zombies. We hold the superlock here, and what the patch
 also introduces is the potential for flushing more than a single TCB at
 a time, which might not always be a cheap operation, depending on which
 cra^H^Hode runs on behalf of the deletion hooks for instance. We may
 take for granted that no sane code would continuously create more
 threads than we would be able to finalize in a given time frame anyway.
>>> The maximum number of zombies in the queue is
>>> 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue
>>> only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW
>>> bit is armed.
>> Ack. rate_limit = 1? I'm really reluctant to increase the WCET here,
>> thread deletion isn't cheap already.
> 
> I am not sure that holding the nklock while we run the thread deletion
> hooks is really needed.
> 

Deletion hooks may currently rely on the following assumptions when running:

- rescheduling is locked
- nklock is held, interrupts are off
- they run on behalf of the deletor context

The self-delete refactoring currently kills #3 because we now run the
hooks after the context switch, and would also kill #2 if we did not
hold the nklock (btw, enabling the nucleus debug while running with this
patch should raise an abort, from xnshadow_unmap, due to the second
assertion).

It should be possible to get rid of #3 for xnshadow_unmap (serious
testing needed here), but we would have to grab the nklock from this
routine anyway.
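
For reference, a minimal sketch of a hook relying on that contract, registered
through the existing xnpod_add_hook()/XNHOOK_THREAD_DELETE interface (the
counter, the hook name and the module wrapper are purely illustrative, not
part of any skin):

#include <linux/module.h>
#include <nucleus/pod.h>
#include <nucleus/thread.h>

static unsigned long deleted_shadows;	/* illustrative bookkeeping only */

static void demo_delete_hook(xnthread_t *thread)
{
	/* Per the assumptions above: nklock held, irqs off, rescheduling
	   locked, and we run on behalf of the deletor context, so plain
	   updates of nucleus-side data need no extra locking -- but the
	   hook must neither block nor reschedule. */
	if (xnthread_test_state(thread, XNSHADOW))
		deleted_shadows++;
}

static int __init demo_init(void)
{
	return xnpod_add_hook(XNHOOK_THREAD_DELETE, &demo_delete_hook);
}

static void __exit demo_exit(void)
{
	xnpod_remove_hook(XNHOOK_THREAD_DELETE, &demo_delete_hook);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");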

-- 
Philippe.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-24 Thread Gilles Chanteperdrix
On Jan 23, 2008 7:34 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
> Gilles Chanteperdrix wrote:
> > On Jan 23, 2008 6:48 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
> >> Gilles Chanteperdrix wrote:
> >>> Gilles Chanteperdrix wrote:
> >>>  > Please find attached a patch implementing these ideas. This adds some
> >>>  > clutter, which I would be happy to reduce. Better ideas are welcome.
> >>>  >
> >>>
> >>> Ok. New version of the patch, this time split in two parts, should
> >>> hopefully make it more readable.
> >>>
> >> Ack. I'd suggest the following:
> >>
> >> - let's have a rate limiter when walking the zombie queue in
> >> __xnpod_finalize_zombies. We hold the superlock here, and what the patch
> >> also introduces is the potential for flushing more than a single TCB at
> >> a time, which might not always be a cheap operation, depending on which
> >> cra^H^Hode runs on behalf of the deletion hooks for instance. We may
> >> take for granted that no sane code would continuously create more
> >> threads than we would be able to finalize in a given time frame anyway.
> >
> > The maximum number of zombies in the queue is
> > 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue
> > only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW
> > bit is armed.
>
> Ack. rate_limit = 1? I'm really reluctant to increase the WCET here,
> thread deletion isn't cheap already.

I am not sure that holding the nklock while we run the thread deletion
hooks is really needed.

-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-23 Thread Gilles Chanteperdrix
Philippe Gerum wrote:
 > Gilles Chanteperdrix wrote:
 > > On Jan 23, 2008 6:48 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
 > >> Gilles Chanteperdrix wrote:
 > >>> Gilles Chanteperdrix wrote:
 > >>>  > Please find attached a patch implementing these ideas. This adds some
 > >>>  > clutter, which I would be happy to reduce. Better ideas are welcome.
 > >>>  >
 > >>>
 > >>> Ok. New version of the patch, this time split in two parts, should
 > >>> hopefully make it more readable.
 > >>>
 > >> Ack. I'd suggest the following:
 > >>
 > >> - let's have a rate limiter when walking the zombie queue in
 > >> __xnpod_finalize_zombies. We hold the superlock here, and what the patch
 > >> also introduces is the potential for flushing more than a single TCB at
 > >> a time, which might not always be a cheap operation, depending on which
 > >> cra^H^Hode runs on behalf of the deletion hooks for instance. We may
 > >> take for granted that no sane code would continuously create more
 > >> threads than we would be able to finalize in a given time frame anyway.
 > > 
 > > The maximum number of zombies in the queue is
 > > 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue
 > > only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW
 > > bit is armed.
 > 
 > Ack. rate_limit = 1? I'm really reluctant to increase the WCET here,
 > thread deletion isn't cheap already.

Here come new patches.

-- 


Gilles Chanteperdrix.
Index: include/asm-ia64/bits/pod.h
===
--- include/asm-ia64/bits/pod.h (revision 3441)
+++ include/asm-ia64/bits/pod.h (working copy)
@@ -100,12 +100,6 @@ static inline void xnarch_switch_to(xnar
}
 }
 
-static inline void xnarch_finalize_and_switch(xnarchtcb_t * dead_tcb,
- xnarchtcb_t * next_tcb)
-{
-   xnarch_switch_to(dead_tcb, next_tcb);
-}
-
 static inline void xnarch_finalize_no_switch(xnarchtcb_t * dead_tcb)
 {
/* Empty */
Index: include/asm-blackfin/bits/pod.h
===
--- include/asm-blackfin/bits/pod.h (revision 3441)
+++ include/asm-blackfin/bits/pod.h (working copy)
@@ -67,12 +67,6 @@ static inline void xnarch_switch_to(xnar
rthal_thread_switch(out_tcb->tsp, in_tcb->tsp);
 }
 
-static inline void xnarch_finalize_and_switch(xnarchtcb_t * dead_tcb,
- xnarchtcb_t * next_tcb)
-{
-   xnarch_switch_to(dead_tcb, next_tcb);
-}
-
 static inline void xnarch_finalize_no_switch(xnarchtcb_t * dead_tcb)
 {
/* Empty */
Index: include/asm-arm/bits/pod.h
===
--- include/asm-arm/bits/pod.h  (revision 3441)
+++ include/asm-arm/bits/pod.h  (working copy)
@@ -96,12 +96,6 @@ static inline void xnarch_switch_to(xnar
rthal_thread_switch(prev, out_tcb->tip, in_tcb->tip);
 }
 
-static inline void xnarch_finalize_and_switch(xnarchtcb_t * dead_tcb,
- xnarchtcb_t * next_tcb)
-{
-   xnarch_switch_to(dead_tcb, next_tcb);
-}
-
 static inline void xnarch_finalize_no_switch(xnarchtcb_t * dead_tcb)
 {
/* Empty */
Index: include/asm-powerpc/bits/pod.h
===
--- include/asm-powerpc/bits/pod.h  (revision 3441)
+++ include/asm-powerpc/bits/pod.h  (working copy)
@@ -106,12 +106,6 @@ static inline void xnarch_switch_to(xnar
barrier();
 }
 
-static inline void xnarch_finalize_and_switch(xnarchtcb_t * dead_tcb,
- xnarchtcb_t * next_tcb)
-{
-   xnarch_switch_to(dead_tcb, next_tcb);
-}
-
 static inline void xnarch_finalize_no_switch(xnarchtcb_t * dead_tcb)
 {
/* Empty */
Index: include/asm-x86/bits/pod_64.h
===
--- include/asm-x86/bits/pod_64.h   (revision 3441)
+++ include/asm-x86/bits/pod_64.h   (working copy)
@@ -96,12 +96,6 @@ static inline void xnarch_switch_to(xnar
stts();
 }
 
-static inline void xnarch_finalize_and_switch(xnarchtcb_t * dead_tcb,
- xnarchtcb_t * next_tcb)
-{
-   xnarch_switch_to(dead_tcb, next_tcb);
-}
-
 static inline void xnarch_finalize_no_switch(xnarchtcb_t * dead_tcb)
 {
/* Empty */
Index: include/asm-x86/bits/pod_32.h
===
--- include/asm-x86/bits/pod_32.h   (revision 3441)
+++ include/asm-x86/bits/pod_32.h   (working copy)
@@ -123,12 +123,6 @@ static inline void xnarch_switch_to(xnar
stts();
 }
 
-static inline void xnarch_finalize_and_switch(xnarchtcb_t * dead_tcb,
- xnarchtcb_t * next_tcb)
-{
-   xnarch_switch_to(dead_tcb, next_tcb);
-}
-
 static inline void xnarch

Re: [Xenomai-core] High latencies on ARM.

2008-01-23 Thread Gilles Chanteperdrix
On Jan 23, 2008 7:34 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
> Gilles Chanteperdrix wrote:
> > On Jan 23, 2008 6:48 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
> >> Gilles Chanteperdrix wrote:
> >>> Gilles Chanteperdrix wrote:
> >>>  > Please find attached a patch implementing these ideas. This adds some
> >>>  > clutter, which I would be happy to reduce. Better ideas are welcome.
> >>>  >
> >>>
> >>> Ok. New version of the patch, this time split in two parts, should
> >>> hopefully make it more readable.
> >>>
> >> Ack. I'd suggest the following:
> >>
> >> - let's have a rate limiter when walking the zombie queue in
> >> __xnpod_finalize_zombies. We hold the superlock here, and what the patch
> >> also introduces is the potential for flushing more than a single TCB at
> >> a time, which might not always be a cheap operation, depending on which
> >> cra^H^Hode runs on behalf of the deletion hooks for instance. We may
> >> take for granted that no sane code would continuously create more
> >> threads than we would be able to finalize in a given time frame anyway.
> >
> > The maximum number of zombies in the queue is
> > 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue
> > only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW
> > bit is armed.
>
> Ack. rate_limit = 1? I'm really reluctant to increase the WCET here,
> thread deletion isn't cheap already.

Ok, as you wish.

-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-23 Thread Philippe Gerum
Gilles Chanteperdrix wrote:
> On Jan 23, 2008 6:48 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
>> Gilles Chanteperdrix wrote:
>>> Gilles Chanteperdrix wrote:
>>>  > Please find attached a patch implementing these ideas. This adds some
>>>  > clutter, which I would be happy to reduce. Better ideas are welcome.
>>>  >
>>>
>>> Ok. New version of the patch, this time split in two parts, should
>>> hopefully make it more readable.
>>>
>> Ack. I'd suggest the following:
>>
>> - let's have a rate limiter when walking the zombie queue in
>> __xnpod_finalize_zombies. We hold the superlock here, and what the patch
>> also introduces is the potential for flushing more than a single TCB at
>> a time, which might not always be a cheap operation, depending on which
>> cra^H^Hode runs on behalf of the deletion hooks for instance. We may
>> take for granted that no sane code would continuously create more
>> threads than we would be able to finalize in a given time frame anyway.
> 
> The maximum number of zombies in the queue is
> 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue
> only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW
> bit is armed.

Ack. rate_limit = 1? I'm really reluctant to increase the WCET here,
thread deletion isn't cheap already.
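
Concretely, a rate-limited walk could be as simple as the following sketch,
derived from the __xnpod_finalize_zombies() hunk posted elsewhere in this
thread (ZOMBIE_RATE_LIMIT is a made-up tunable; the trace_mark()
instrumentation is omitted):

#define ZOMBIE_RATE_LIMIT 1	/* hypothetical tunable, 1 as suggested above */

void __xnpod_finalize_zombies(xnsched_t *sched)
{
	xnholder_t *holder;
	int count = 0;

	/* Must be called with nklock locked, interrupts off. Anything left
	   over once the limit is hit stays queued and is reaped at the next
	   scheduling tail, since the queue is bounded anyway. */
	while (count++ < ZOMBIE_RATE_LIMIT
	       && (holder = getq(&sched->zombies)) != NULL) {
		xnthread_t *thread = link2thread(holder, glink);

		if (!emptyq_p(&nkpod->tdeleteq)
		    && !xnthread_test_state(thread, XNROOT))
			/* Run the registered deletion hooks. */
			xnpod_fire_callouts(&nkpod->tdeleteq, thread);

		/* Release the TCB resources. */
		xnthread_cleanup_tcb(thread);
	}
}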

-- 
Philippe.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-23 Thread Gilles Chanteperdrix
On Jan 23, 2008 6:48 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote:
> Gilles Chanteperdrix wrote:
> > Gilles Chanteperdrix wrote:
> >  > Please find attached a patch implementing these ideas. This adds some
> >  > clutter, which I would be happy to reduce. Better ideas are welcome.
> >  >
> >
> > Ok. New version of the patch, this time split in two parts, should
> > hopefully make it more readable.
> >
>
> Ack. I'd suggest the following:
>
> - let's have a rate limiter when walking the zombie queue in
> __xnpod_finalize_zombies. We hold the superlock here, and what the patch
> also introduces is the potential for flushing more than a single TCB at
> a time, which might not always be a cheap operation, depending on which
> cra^H^Hode runs on behalf of the deletion hooks for instance. We may
> take for granted that no sane code would continuously create more
> threads than we would be able to finalize in a given time frame anyway.

The maximum number of zombies in the queue is
1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue
only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW
bit is armed.

>
> - We could move most of the code depending on XNARCH_WANT_UNLOCKED_CTXSW
>  to conditional inlines in pod.h. This would reduce the visual pollution
> a lot.

Ok, will try that, especially since the code added to the 4 places
where a scheduling tail takes place is pretty repetitive.
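
Something along these lines in pod.h would hide most of the #ifdef noise from
those four call sites (a sketch only; the helper name and the XNSWLOCK
"currently switching" bit are assumptions, and the re-test of the rescheduling
bits raised by interrupt handlers would sit right next to it):

#ifdef XNARCH_WANT_UNLOCKED_CTXSW
static inline void xnpod_switch_epilogue(xnsched_t *sched, xnthread_t *curr)
{
	/* The context switch ran with the nklock released, so a thread
	   deleted meanwhile may have been queued as a zombie; clear the
	   hypothetical "switching" marker on the resuming thread and reap
	   the queue now that the nklock is held again. */
	xnthread_clear_state(curr, XNSWLOCK);
	xnpod_finalize_zombies(sched);
}
#else /* !XNARCH_WANT_UNLOCKED_CTXSW */
static inline void xnpod_switch_epilogue(xnsched_t *sched, xnthread_t *curr)
{
	/* Locked switch: nothing to undo, just reap any pending zombie. */
	xnpod_finalize_zombies(sched);
}
#endif /* XNARCH_WANT_UNLOCKED_CTXSW */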

-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-23 Thread Philippe Gerum
Gilles Chanteperdrix wrote:
> Gilles Chanteperdrix wrote:
>  > Please find attached a patch implementing these ideas. This adds some
>  > clutter, which I would be happy to reduce. Better ideas are welcome.
>  > 
> 
> Ok. New version of the patch, this time split in two parts, should
> hopefully make it more readable.
> 

Ack. I'd suggest the following:

- let's have a rate limiter when walking the zombie queue in
__xnpod_finalize_zombies. We hold the superlock here, and what the patch
also introduces is the potential for flushing more than a single TCB at
a time, which might not always be a cheap operation, depending on which
cra^H^Hode runs on behalf of the deletion hooks for instance. We may
take for granted that no sane code would continuously create more
threads than we would be able to finalize in a given time frame anyway.

- We could move most of the code depending on XNARCH_WANT_UNLOCKED_CTXSW
 to conditional inlines in pod.h. This would reduce the visual pollution
a lot.

-- 
Philippe.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-22 Thread Gilles Chanteperdrix
Jan Kiszka wrote:
 > Does the patch improve ARM latencies already?

Yes, it does. The (interrupt) latency goes from above 100us to
80us. This is not yet 50us, though.

-- 


Gilles Chanteperdrix.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-22 Thread Gilles Chanteperdrix
Jan Kiszka wrote:
 > Gilles Chanteperdrix wrote:
 > > Gilles Chanteperdrix wrote:
 > >  > Hi,
 > >  > 
 > >  > after some (unsuccessful) time trying to instrument the code in a way
 > >  > that does not change the latency results completely, I found the
 > >  > reason for the high latency with latency -t 1 and latency -t 2 on ARM.
 > >  > So, here comes an update on this issue. The culprit is the user-space
 > >  > context switch, which flushes the processor cache with the nklock
 > >  > locked, irqs off.
 > >  > 
 > >  > There are two things we could do:
 > >  > - arrange for the ARM cache flush to happen with the nklock unlocked
 > >  > and irqs enabled. This will improve interrupt latency (latency -t 2)
 > >  > but obviously not scheduling latency (latency -t 1). If we go that
 > >  > way, there are several problems we should solve:
 > >  > 
 > >  > we do not want interrupt handlers to reenter xnpod_schedule(), for
 > >  > this we can use the XNLOCK bit, set on whatever is
 > >  > xnpod_current_thread() when the cache flush occurs
 > >  > 
 > >  > since the interrupt handler may modify the rescheduling bits, we need
 > >  > to test these bits in xnpod_schedule() epilogue and restart
 > >  > xnpod_schedule() if need be
 > >  > 
 > >  > we do not want xnpod_delete_thread() to delete one of the two threads
 > >  > involved in the context switch, for this the only solution I found is
 > >  > to add a bit to the thread mask meaning that the thread is currently
 > >  > switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue
 > >  > to delete whatever thread was marked for deletion
 > >  > 
 > >  > in case of migration with xnpod_migrate_thread, we do not want
 > >  > xnpod_schedule() on the target CPU to switch to the migrated thread
 > >  > before the context switch on the source CPU is finished, for this we
 > >  > can avoid setting the resched bit in xnpod_migrate_thread(), detect
 > >  > the condition in xnpod_schedule() epilogue and set the rescheduling
 > >  > bits so that xnpod_schedule is restarted and send the IPI to the
 > >  > target CPU.
 > > 
 > > Please find attached a patch implementing these ideas. This adds some
 > > clutter, which I would be happy to reduce. Better ideas are welcome.
 > > 
 > 
 > I tried to cross-read the patch (-p would have been nice) but failed - 
 > this needs to be applied on some tree. Does the patch improve ARM 
 > latencies already?

I split the patch in two parts in another post; this should make it
easier to read.

 > 
 > > 
 > >  > 
 > >  > - avoid using user-space real-time tasks when running latency
 > >  > kernel-space benches, i.e. at least in the latency -t 1 and latency -t
 > >  > 2 case. This means that we should change the timerbench driver. There
 > >  > are at least two ways of doing this:
 > >  > use an rt_pipe
 > >  >  modify the timerbench driver to implement only the nrt ioctl, using
 > >  > vanilla linux services such as wait_event and wake_up.
 > >  > 
 > >  > What do you think ?
 > > 
 > > So, what do you think is the best way to change the timerbench driver:
 > > * use an rt_pipe? Pros: allows running latency -t 1 and latency -t 2 even
 > >  if Xenomai is compiled with CONFIG_XENO_OPT_PERVASIVE off; cons: makes
 > >  the timerbench non-portable to other implementations of RTDM, e.g. RTDM
 > >  over RTAI or the version of RTDM which runs over vanilla Linux
 > > * modify the timerbench driver to implement only nrt ioctls? Pros:
 > >   better driver portability; cons: latency would still need
 > >   CONFIG_XENO_OPT_PERVASIVE to run latency -t 1 and latency -t 2.
 > 
 > I'm still voting for my third approach:
 > 
 >   -> Write latency as kernel application (klatency) against the
 >  timerbench device
 >   -> Call NRT IOCTLs of timerbench during module init/cleanup
 >   -> Use module parameters for customization
 >   -> Setup a low-prio kernel-based RT task to issue the RT IOCTLs
 >   -> Format the results nicely (similar to userland latency) in that RT
 >  task and stuff them into some rtpipe
 >   -> Use "cat /dev/rtpipeX" to display the results

Sorry, this mail is older than your last reply to my question. I had
problems with my MTA, so I resent all the mails which were not sent. I
hoped they would be sent with their original dates preserved, but
unfortunately this is not the case.

Now, to answer your suggestion, I think that formatting the results
belongs to user-space, not to kernel-space. Besides, emitting NRT ioctls
from module initialization and cleanup routines makes this klatency
module quite inflexible. I was rather thinking about implementing the RT
versions of the ioctls so that they could be called from a kernel-space
real-time task.

-- 


Gilles Chanteperdrix.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-22 Thread Jan Kiszka

Gilles Chanteperdrix wrote:

Gilles Chanteperdrix wrote:
 > Hi,
 > 
 > after some (unsuccessful) time trying to instrument the code in a way

 > that does not change the latency results completely, I found the
 > reason for the high latency with latency -t 1 and latency -t 2 on ARM.
 > So, here comes an update on this issue. The culprit is the user-space
 > context switch, which flushes the processor cache with the nklock
 > locked, irqs off.
 > 
 > There are two things we could do:

 > - arrange for the ARM cache flush to happen with the nklock unlocked
 > and irqs enabled. This will improve interrupt latency (latency -t 2)
 > but obviously not scheduling latency (latency -t 1). If we go that
 > way, there are several problems we should solve:
 > 
 > we do not want interrupt handlers to reenter xnpod_schedule(), for

 > this we can use the XNLOCK bit, set on whatever is
 > xnpod_current_thread() when the cache flush occurs
 > 
 > since the interrupt handler may modify the rescheduling bits, we need

 > to test these bits in xnpod_schedule() epilogue and restart
 > xnpod_schedule() if need be
 > 
 > we do not want xnpod_delete_thread() to delete one of the two threads

 > involved in the context switch, for this the only solution I found is
 > to add a bit to the thread mask meaning that the thread is currently
 > switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue
 > to delete whatever thread was marked for deletion
 > 
 > in case of migration with xnpod_migrate_thread, we do not want

 > xnpod_schedule() on the target CPU to switch to the migrated thread
 > before the context switch on the source CPU is finished, for this we
 > can avoid setting the resched bit in xnpod_migrate_thread(), detect
 > the condition in xnpod_schedule() epilogue and set the rescheduling
 > bits so that xnpod_schedule is restarted and send the IPI to the
 > target CPU.

Please find attached a patch implementing these ideas. This adds some
clutter, which I would be happy to reduce. Better ideas are welcome.



I tried to cross-read the patch (-p would have been nice) but failed - 
this needs to be applied on some tree. Does the patch improve ARM 
latencies already?




 > 
 > - avoid using user-space real-time tasks when running latency

 > kernel-space benches, i.e. at least in the latency -t 1 and latency -t
 > 2 case. This means that we should change the timerbench driver. There
 > are at least two ways of doing this:
 > use an rt_pipe
 >  modify the timerbench driver to implement only the nrt ioctl, using
 > vanilla linux services such as wait_event and wake_up.
 > 
 > What do you think ?


So, what do you think is the best way to change the timerbench driver:
* use an rt_pipe? Pros: allows running latency -t 1 and latency -t 2 even
 if Xenomai is compiled with CONFIG_XENO_OPT_PERVASIVE off; cons: makes
 the timerbench non-portable to other implementations of RTDM, e.g. RTDM
 over RTAI or the version of RTDM which runs over vanilla Linux
* modify the timerbench driver to implement only nrt ioctls? Pros:
  better driver portability; cons: latency would still need
  CONFIG_XENO_OPT_PERVASIVE to run latency -t 1 and latency -t 2.


I'm still voting for my third approach:

 -> Write latency as kernel application (klatency) against the
timerbench device
 -> Call NRT IOCTLs of timerbench during module init/cleanup
 -> Use module parameters for customization
 -> Setup a low-prio kernel-based RT task to issue the RT IOCTLs
 -> Format the results nicely (similar to userland latency) in that RT
task and stuff them into some rtpipe
 -> Use "cat /dev/rtpipeX" to display the results
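
A bare-bones sketch of the last three steps, using the native skin's kernel
task and pipe services (task priority, period, pipe name and the jitter
computation are arbitrary illustration choices, the timerbench ioctl plumbing
of the earlier steps is left out, and the exact native-skin signatures should
be checked against the Xenomai release in use):

#include <linux/module.h>
#include <linux/kernel.h>
#include <native/task.h>
#include <native/timer.h>
#include <native/pipe.h>

#define KLAT_PERIOD_NS 100000ULL	/* 100 us sampling period, arbitrary */

static RT_TASK klat_task;
static RT_PIPE klat_pipe;

static void klat_loop(void *cookie)
{
	RTIME last;
	char buf[64];

	/* Assumes a nanosecond-based (oneshot) time base. */
	rt_task_set_periodic(NULL, TM_NOW, KLAT_PERIOD_NS);
	last = rt_timer_read();

	for (;;) {
		RTIME now;
		long jitter, len;

		if (rt_task_wait_period(NULL))
			break;

		now = rt_timer_read();
		/* Wake-up jitter relative to the nominal period, in ns. */
		jitter = (long)(now - last) - (long)KLAT_PERIOD_NS;
		last = now;

		/* One formatted sample per line, read from user space with
		   something like "cat /dev/rtp<minor>". */
		len = snprintf(buf, sizeof(buf), "%ld\n", jitter);
		rt_pipe_write(&klat_pipe, buf, len, P_NORMAL);
	}
}

static int __init klat_init(void)
{
	int err = rt_pipe_create(&klat_pipe, "klatency", P_MINOR_AUTO, 0);

	if (err)
		return err;

	err = rt_task_create(&klat_task, "klatency", 0, 1, 0);
	if (!err)
		err = rt_task_start(&klat_task, &klat_loop, NULL);
	if (err)
		rt_pipe_delete(&klat_pipe);

	return err;
}

static void __exit klat_exit(void)
{
	rt_task_delete(&klat_task);
	rt_pipe_delete(&klat_pipe);
}

module_init(klat_init);
module_exit(klat_exit);
MODULE_LICENSE("GPL");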

Jan



___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-22 Thread Gilles Chanteperdrix
Gilles Chanteperdrix wrote:
 > Hi,
 > 
 > after some (unsuccessful) time trying to instrument the code in a way
 > that does not change the latency results completely, I found the
 > reason for the high latency with latency -t 1 and latency -t 2 on ARM.
 > So, here comes an update on this issue. The culprit is the user-space
 > context switch, which flushes the processor cache with the nklock
 > locked, irqs off.
 > 
 > There are two things we could do:
 > - arrange for the ARM cache flush to happen with the nklock unlocked
 > and irqs enabled. This will improve interrupt latency (latency -t 2)
 > but obviously not scheduling latency (latency -t 1). If we go that
 > way, there are several problems we should solve:
 > 
 > we do not want interrupt handlers to reenter xnpod_schedule(), for
 > this we can use the XNLOCK bit, set on whatever is
 > xnpod_current_thread() when the cache flush occurs
 > 
 > since the interrupt handler may modify the rescheduling bits, we need
 > to test these bits in xnpod_schedule() epilogue and restart
 > xnpod_schedule() if need be
 > 
 > we do not want xnpod_delete_thread() to delete one of the two threads
 > involved in the context switch, for this the only solution I found is
 > to add a bit to the thread mask meaning that the thread is currently
 > switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue
 > to delete whatever thread was marked for deletion
 > 
 > in case of migration with xnpod_migrate_thread, we do not want
 > xnpod_schedule() on the target CPU to switch to the migrated thread
 > before the context switch on the source CPU is finished, for this we
 > can avoid setting the resched bit in xnpod_migrate_thread(), detect
 > the condition in xnpod_schedule() epilogue and set the rescheduling
 > bits so that xnpod_schedule is restarted and send the IPI to the
 > target CPU.

Please find attached a patch implementing these ideas. This adds some
clutter, which I would be happy to reduce. Better ideas are welcome.


 > 
 > - avoid using user-space real-time tasks when running latency
 > kernel-space benches, i.e. at least in the latency -t 1 and latency -t
 > 2 case. This means that we should change the timerbench driver. There
 > are at least two ways of doing this:
 > use an rt_pipe
 >  modify the timerbench driver to implement only the nrt ioctl, using
 > vanilla linux services such as wait_event and wake_up.
 > 
 > What do you think ?

So, what do you think is the best way to change the timerbench driver:
* use an rt_pipe? Pros: allows running latency -t 1 and latency -t 2 even
 if Xenomai is compiled with CONFIG_XENO_OPT_PERVASIVE off; cons: makes
 the timerbench non-portable to other implementations of RTDM, e.g. RTDM
 over RTAI or the version of RTDM which runs over vanilla Linux
* modify the timerbench driver to implement only nrt ioctls? Pros:
  better driver portability; cons: latency would still need
  CONFIG_XENO_OPT_PERVASIVE to run latency -t 1 and latency -t 2.

-- 


Gilles Chanteperdrix.
Index: include/asm-arm/bits/pod.h
===
--- include/asm-arm/bits/pod.h  (revision 3405)
+++ include/asm-arm/bits/pod.h  (working copy)
@@ -67,41 +67,41 @@
 #endif /* TIF_MMSWITCH_INT */
 }
 
-static inline void xnarch_switch_to(xnarchtcb_t * out_tcb, xnarchtcb_t * 
in_tcb)
-{
-   struct task_struct *prev = out_tcb->active_task;
-   struct mm_struct *prev_mm = out_tcb->active_mm;
-   struct task_struct *next = in_tcb->user_task;
-
-
-   if (likely(next != NULL)) {
-   in_tcb->active_task = next;
-   in_tcb->active_mm = in_tcb->mm;
-   rthal_clear_foreign_stack(&rthal_domain);
-   } else {
-   in_tcb->active_task = prev;
-   in_tcb->active_mm = prev_mm;
-   rthal_set_foreign_stack(&rthal_domain);
-   }
-
-   if (prev_mm != in_tcb->active_mm) {
-   /* Switch to new user-space thread? */
-   if (in_tcb->active_mm)
-   switch_mm(prev_mm, in_tcb->active_mm, next);
-   if (!next->mm)
-   enter_lazy_tlb(prev_mm, next);
-   }
-
-   /* Kernel-to-kernel context switch. */
-   rthal_thread_switch(prev, out_tcb->tip, in_tcb->tip);
+#define xnarch_switch_to(_out_tcb, _in_tcb, lock)  \
+{  \
+   xnarchtcb_t *in_tcb = (_in_tcb);\
+   xnarchtcb_t *out_tcb = (_out_tcb);  \
+   struct task_struct *prev = out_tcb->active_task;\
+   struct mm_struct *prev_mm = out_tcb->active_mm; \
+   struct task_struct *next = in_tcb->user_task;   \
+   \
+  

Re: [Xenomai-core] High latencies on ARM.

2008-01-22 Thread Gilles Chanteperdrix
Gilles Chanteperdrix wrote:
 > Please find attached a patch implementing these ideas. This adds some
 > clutter, which I would be happy to reduce. Better ideas are welcome.
 > 

Ok. New version of the patch, this time split in two parts, should
hopefully make it more readable.

 > 
 >  > 
 >  > - avoid using user-space real-time tasks when running latency
 >  > kernel-space benches, i.e. at least in the latency -t 1 and latency -t
 >  > 2 case. This means that we should change the timerbench driver. There
 >  > are at least two ways of doing this:
 >  > use an rt_pipe
 >  >  modify the timerbench driver to implement only the nrt ioctl, using
 >  > vanilla linux services such as wait_event and wake_up.
 >  > 
 >  > What do you think ?
 > 
 >  > So, what do you think is the best way to change the timerbench driver:
 >  > * use an rt_pipe? Pros: allows running latency -t 1 and latency -t 2 even
 >  >  if Xenomai is compiled with CONFIG_XENO_OPT_PERVASIVE off; cons: makes
 >  >  the timerbench non-portable to other implementations of RTDM, e.g. RTDM
 >  >  over RTAI or the version of RTDM which runs over vanilla Linux
 >  > * modify the timerbench driver to implement only nrt ioctls? Pros:
 >  >   better driver portability; cons: latency would still need
 >  >   CONFIG_XENO_OPT_PERVASIVE to run latency -t 1 and latency -t 2.

-- 


Gilles Chanteperdrix.
Index: include/nucleus/pod.h
===
--- include/nucleus/pod.h   (revision 3405)
+++ include/nucleus/pod.h   (working copy)
@@ -139,6 +139,7 @@
 
xntimer_t htimer;   /*!< Host timer. */
 
+   xnqueue_t zombies;
 } xnsched_t;
 
 #define nkpod (&nkpod_struct)
@@ -238,6 +239,14 @@
 }
 #endif /* CONFIG_XENO_OPT_WATCHDOG */
 
+void __xnpod_finalize_zombies(xnsched_t *sched);
+
+static inline void xnpod_finalize_zombies(xnsched_t *sched)
+{
+   if (!emptyq_p(&sched->zombies))
+   __xnpod_finalize_zombies(sched);
+}
+
/* -- Beginning of the exported interface */
 
 #define xnpod_sched_slot(cpu) \
Index: ksrc/nucleus/pod.c
===
--- ksrc/nucleus/pod.c  (revision 3415)
+++ ksrc/nucleus/pod.c  (working copy)
@@ -292,6 +292,7 @@
 #endif /* CONFIG_SMP */
xntimer_set_name(&sched->htimer, htimer_name);
xntimer_set_sched(&sched->htimer, sched);
+   initq(&sched->zombies);
}
 
xnlock_put_irqrestore(&nklock, s);
@@ -545,63 +546,28 @@
__clrbits(sched->status, XNKCOUT);
 }
 
-static inline void xnpod_switch_zombie(xnthread_t *threadout,
-  xnthread_t *threadin)
+void __xnpod_finalize_zombies(xnsched_t *sched)
 {
-   /* Must be called with nklock locked, interrupts off. */
-   xnsched_t *sched = xnpod_current_sched();
-#ifdef CONFIG_XENO_OPT_PERVASIVE
-   int shadow = xnthread_test_state(threadout, XNSHADOW);
-#endif /* CONFIG_XENO_OPT_PERVASIVE */
+   xnholder_t *holder;
 
-   trace_mark(xn_nucleus_sched_finalize,
-  "thread_out %p thread_out_name %s "
-  "thread_in %p thread_in_name %s",
-  threadout, xnthread_name(threadout),
-  threadin, xnthread_name(threadin));
+   while ((holder = getq(&sched->zombies))) {
+   xnthread_t *thread = link2thread(holder, glink);
 
-   if (!emptyq_p(&nkpod->tdeleteq) && !xnthread_test_state(threadout, 
XNROOT)) {
-   trace_mark(xn_nucleus_thread_callout,
-  "thread %p thread_name %s hook %s",
-  threadout, xnthread_name(threadout), "DELETE");
-   xnpod_fire_callouts(&nkpod->tdeleteq, threadout);
-   }
+   /* Must be called with nklock locked, interrupts off. */
+   trace_mark(xn_nucleus_sched_finalize,
+  "thread_out %p thread_out_name %s",
+  thread, xnthread_name(thread));
 
-   sched->runthread = threadin;
+   if (!emptyq_p(&nkpod->tdeleteq)
+   && !xnthread_test_state(thread, XNROOT)) {
+   trace_mark(xn_nucleus_thread_callout,
+  "thread %p thread_name %s hook %s",
+  thread, xnthread_name(thread), "DELETE");
+   xnpod_fire_callouts(&nkpod->tdeleteq, thread);
+   }
 
-   if (xnthread_test_state(threadin, XNROOT)) {
-   xnpod_reset_watchdog(sched);
-   xnfreesync();
-   xnarch_enter_root(xnthread_archtcb(threadin));
+   xnthread_cleanup_tcb(thread);
}
-
-   /* FIXME: Catch 22 here, whether we choose to run on an invalid
-  stack (cleanup then hooks), or to access the TCB space shortly
-  after it has been freed while non-preemptible (hooks then
-  cleanup)... Option #2 is curr

Re: [Xenomai-core] High latencies on ARM.

2008-01-21 Thread Gilles Chanteperdrix
Jan Kiszka wrote:
 > Gilles Chanteperdrix wrote:
 > > On Jan 17, 2008 3:16 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
 > >> Gilles Chanteperdrix wrote:
 > >>> On Jan 17, 2008 12:55 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
 >  Gilles Chanteperdrix wrote:
 > > On Jan 17, 2008 11:42 AM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
 > >> Gilles Chanteperdrix wrote:
 > >>> Hi,
 > >>>
 > >>> after some (unsuccessful) time trying to instrument the code in a way
 > >>> that does not change the latency results completely, I found the
 > >>> reason for the high latency with latency -t 1 and latency -t 2 on 
 > >>> ARM.
 > >>> So, here comes an update on this issue. The culprit is the user-space
 > >>> context switch, which flushes the processor cache with the nklock
 > >>> locked, irqs off.
 > >>>
 > >>> There are two things we could do:
 > >>> - arrange for the ARM cache flush to happen with the nklock unlocked
 > >>> and irqs enabled. This will improve interrupt latency (latency -t 2)
 > >>> but obviously not scheduling latency (latency -t 1). If we go that
 > >>> way, there are several problems we should solve:
 > >>>
 > >>> we do not want interrupt handlers to reenter xnpod_schedule(), for
 > >>> this we can use the XNLOCK bit, set on whatever is
 > >>> xnpod_current_thread() when the cache flush occurs
 > >>>
 > >>> since the interrupt handler may modify the rescheduling bits, we need
 > >>> to test these bits in xnpod_schedule() epilogue and restart
 > >>> xnpod_schedule() if need be
 > >>>
 > >>> we do not want xnpod_delete_thread() to delete one of the two threads
 > >>> involved in the context switch, for this the only solution I found is
 > >>> to add a bit to the thread mask meaning that the thread is currently
 > >>> switching, and to (re)test the XNZOMBIE bit in xnpod_schedule 
 > >>> epilogue
 > >>> to delete whatever thread was marked for deletion
 > >>>
 > >>> in case of migration with xnpod_migrate_thread, we do not want
 > >>> xnpod_schedule() on the target CPU to switch to the migrated thread
 > >>> before the context switch on the source CPU is finished, for this we
 > >>> can avoid setting the resched bit in xnpod_migrate_thread(), detect
 > >>> the condition in xnpod_schedule() epilogue and set the rescheduling
 > >>> bits so that xnpod_schedule is restarted and send the IPI to the
 > >>> target CPU.
 > >>>
 > >>> - avoid using user-space real-time tasks when running latency
 > >>> kernel-space benches, i.e. at least in the latency -t 1 and latency 
 > >>> -t
 > >>> 2 case. This means that we should change the timerbench driver. There
 > >>> are at least two ways of doing this:
 > >>> use an rt_pipe
 > >>>  modify the timerbench driver to implement only the nrt ioctl, using
 > >>> vanilla linux services such as wait_event and wake_up.
 > >> [As you reminded me of this unanswered question:]
 > >> One may consider adding further modes _besides_ current kernel tests
 > >> that do not rely on RTDM & native userland support (e.g. when
 > >> CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are 
 > >> valid
 > >> scenarios as well that must not be killed by such a change.
 > > I think the current test scenario for latency -t 1 and latency -t 2
 > > are a bit misleading: they measure kernel-space latencies in presence
 > > of user-space real-time tasks. When one runs latency -t 1 or latency
 > > -t 2, one would expect that there are only kernel-space real-time
 > > tasks.
 >  If they are misleading, depends on your perspective. In fact, they are
 >  measuring in-kernel scenarios over the standard Xenomai setup, which
 >  includes userland RT task activity these day. Those scenarios are mainly
 >  targeting driver use cases, not pure kernel-space applications.
 > 
 >  But I agree that, for !CONFIG_XENO_OPT_PERVASIVE-like scenarios, we
 >  would benefit from an additional set of test cases.
 > >>> Ok, I will not touch timerbench then, and implement another kernel 
 > >>> module.
 > >>>
 > >> [Without considering all details]
 > >> To achieve this independence of user space RT thread, it should suffice
 > >> to implement a kernel-based frontend for timerbench. This frontent would
 > >> then either dump to syslog or open some pipe to tell userland about the
 > >> benchmark results. What do yo think?
 > > 
 > > My intent was to implement a protocol similar to the one of
 > > timerbench, but using an rt-pipe, and continue to use the latency
 > > test, adding new options such as -t 3 and t 4. But there may be
 > > problems with this approach: if we are compiling without
 > > CONFIG_XENO_OPT_PERVASIVE, latency will not run at all. So, it is
 > > probably simpler to implement a klatency that just reads from the
 > > rt-pipe.
 > 
 > But th

Re: [Xenomai-core] High latencies on ARM.

2008-01-17 Thread Gilles Chanteperdrix
On Jan 17, 2008 3:22 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>
> Gilles Chanteperdrix wrote:
> > On Jan 17, 2008 3:16 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
> >> Gilles Chanteperdrix wrote:
> >>> On Jan 17, 2008 12:55 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>  Gilles Chanteperdrix wrote:
> > On Jan 17, 2008 11:42 AM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
> >> Gilles Chanteperdrix wrote:
> >>> Hi,
> >>>
> >>> after some (unsuccessful) time trying to instrument the code in a way
> >>> that does not change the latency results completely, I found the
> >>> reason for the high latency with latency -t 1 and latency -t 2 on ARM.
> >>> So, here comes an update on this issue. The culprit is the user-space
> >>> context switch, which flushes the processor cache with the nklock
> >>> locked, irqs off.
> >>>
> >>> There are two things we could do:
> >>> - arrange for the ARM cache flush to happen with the nklock unlocked
> >>> and irqs enabled. This will improve interrupt latency (latency -t 2)
> >>> but obviously not scheduling latency (latency -t 1). If we go that
> >>> way, there are several problems we should solve:
> >>>
> >>> we do not want interrupt handlers to reenter xnpod_schedule(), for
> >>> this we can use the XNLOCK bit, set on whatever is
> >>> xnpod_current_thread() when the cache flush occurs
> >>>
> >>> since the interrupt handler may modify the rescheduling bits, we need
> >>> to test these bits in xnpod_schedule() epilogue and restart
> >>> xnpod_schedule() if need be
> >>>
> >>> we do not want xnpod_delete_thread() to delete one of the two threads
> >>> involved in the context switch, for this the only solution I found is
> >>> to add a bit to the thread mask meaning that the thread is currently
> >>> switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue
> >>> to delete whatever thread was marked for deletion
> >>>
> >>> in case of migration with xnpod_migrate_thread, we do not want
> >>> xnpod_schedule() on the target CPU to switch to the migrated thread
> >>> before the context switch on the source CPU is finished, for this we
> >>> can avoid setting the resched bit in xnpod_migrate_thread(), detect
> >>> the condition in xnpod_schedule() epilogue and set the rescheduling
> >>> bits so that xnpod_schedule is restarted and send the IPI to the
> >>> target CPU.
> >>>
> >>> - avoid using user-space real-time tasks when running latency
> >>> kernel-space benches, i.e. at least in the latency -t 1 and latency -t
> >>> 2 case. This means that we should change the timerbench driver. There
> >>> are at least two ways of doing this:
> >>> use an rt_pipe
> >>>  modify the timerbench driver to implement only the nrt ioctl, using
> >>> vanilla linux services such as wait_event and wake_up.
> >> [As you reminded me of this unanswered question:]
> >> One may consider adding further modes _besides_ current kernel tests
> >> that do not rely on RTDM & native userland support (e.g. when
> >> CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are valid
> >> scenarios as well that must not be killed by such a change.
> > I think the current test scenario for latency -t 1 and latency -t 2
> > are a bit misleading: they measure kernel-space latencies in presence
> > of user-space real-time tasks. When one runs latency -t 1 or latency
> > -t 2, one would expect that there are only kernel-space real-time
> > tasks.
>  If they are misleading, depends on your perspective. In fact, they are
>  measuring in-kernel scenarios over the standard Xenomai setup, which
>  includes userland RT task activity these day. Those scenarios are mainly
>  targeting driver use cases, not pure kernel-space applications.
> 
>  But I agree that, for !CONFIG_XENO_OPT_PERVASIVE-like scenarios, we
>  would benefit from an additional set of test cases.
> >>> Ok, I will not touch timerbench then, and implement another kernel module.
> >>>
> >> [Without considering all details]
> >> To achieve this independence of user space RT thread, it should suffice
> >> to implement a kernel-based frontend for timerbench. This frontent would
> >> then either dump to syslog or open some pipe to tell userland about the
> >> benchmark results. What do yo think?
> >
> > My intent was to implement a protocol similar to the one of
> > timerbench, but using an rt-pipe, and continue to use the latency
> > test, adding new options such as -t 3 and t 4. But there may be
> > problems with this approach: if we are compiling without
> > CONFIG_XENO_OPT_PERVASIVE, latency will not run at all. So, it is
> > probably simpler to implement a klatency that just reads from the
> > rt-pipe.
>
> But that klantency could perfectly reuse what timerbench already
> provides, without code changes to 

Re: [Xenomai-core] High latencies on ARM.

2008-01-17 Thread Gilles Chanteperdrix
On Jan 17, 2008 3:16 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>
> Gilles Chanteperdrix wrote:
> > On Jan 17, 2008 12:55 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
> >> Gilles Chanteperdrix wrote:
> >>> On Jan 17, 2008 11:42 AM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>  Gilles Chanteperdrix wrote:
> > Hi,
> >
> > after some (unsuccessful) time trying to instrument the code in a way
> > that does not change the latency results completely, I found the
> > reason for the high latency with latency -t 1 and latency -t 2 on ARM.
> > So, here comes an update on this issue. The culprit is the user-space
> > context switch, which flushes the processor cache with the nklock
> > locked, irqs off.
> >
> > There are two things we could do:
> > - arrange for the ARM cache flush to happen with the nklock unlocked
> > and irqs enabled. This will improve interrupt latency (latency -t 2)
> > but obviously not scheduling latency (latency -t 1). If we go that
> > way, there are several problems we should solve:
> >
> > we do not want interrupt handlers to reenter xnpod_schedule(), for
> > this we can use the XNLOCK bit, set on whatever is
> > xnpod_current_thread() when the cache flush occurs
> >
> > since the interrupt handler may modify the rescheduling bits, we need
> > to test these bits in xnpod_schedule() epilogue and restart
> > xnpod_schedule() if need be
> >
> > we do not want xnpod_delete_thread() to delete one of the two threads
> > involved in the context switch, for this the only solution I found is
> > to add a bit to the thread mask meaning that the thread is currently
> > switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue
> > to delete whatever thread was marked for deletion
> >
> > in case of migration with xnpod_migrate_thread, we do not want
> > xnpod_schedule() on the target CPU to switch to the migrated thread
> > before the context switch on the source CPU is finished, for this we
> > can avoid setting the resched bit in xnpod_migrate_thread(), detect
> > the condition in xnpod_schedule() epilogue and set the rescheduling
> > bits so that xnpod_schedule is restarted and send the IPI to the
> > target CPU.
> >
> > - avoid using user-space real-time tasks when running latency
> > kernel-space benches, i.e. at least in the latency -t 1 and latency -t
> > 2 case. This means that we should change the timerbench driver. There
> > are at least two ways of doing this:
> > use an rt_pipe
> >  modify the timerbench driver to implement only the nrt ioctl, using
> > vanilla linux services such as wait_event and wake_up.
>  [As you reminded me of this unanswered question:]
>  One may consider adding further modes _besides_ current kernel tests
>  that do not rely on RTDM & native userland support (e.g. when
>  CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are valid
>  scenarios as well that must not be killed by such a change.
> >>> I think the current test scenario for latency -t 1 and latency -t 2
> >>> are a bit misleading: they measure kernel-space latencies in presence
> >>> of user-space real-time tasks. When one runs latency -t 1 or latency
> >>> -t 2, one would expect that there are only kernel-space real-time
> >>> tasks.
> >> If they are misleading, depends on your perspective. In fact, they are
> >> measuring in-kernel scenarios over the standard Xenomai setup, which
> >> includes userland RT task activity these day. Those scenarios are mainly
> >> targeting driver use cases, not pure kernel-space applications.
> >>
> >> But I agree that, for !CONFIG_XENO_OPT_PERVASIVE-like scenarios, we
> >> would benefit from an additional set of test cases.
> >
> > Ok, I will not touch timerbench then, and implement another kernel module.
> >
>
> [Without considering all details]
> To achieve this independence of user space RT thread, it should suffice
> to implement a kernel-based frontend for timerbench. This frontent would
> then either dump to syslog or open some pipe to tell userland about the
> benchmark results. What do yo think?

My intent was to implement a protocol similar to that of timerbench,
but using an rt-pipe, and continue to use the latency test, adding new
options such as -t 3 and -t 4. But there may be problems with this
approach: if we are compiling without CONFIG_XENO_OPT_PERVASIVE,
latency will not run at all. So, it is probably simpler to implement a
klatency that just reads from the rt-pipe.
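
For illustration, a minimal sketch of the kernel-side half of such a
klatency module, assuming the Xenomai 2.4-era native kernel API
(rt_task, rt_pipe, rt_timer); the names, the priority and the 1 ms
period below are made up, and this is not the module being discussed:

#include <linux/module.h>
#include <native/task.h>
#include <native/timer.h>
#include <native/pipe.h>

#define SAMPLE_PERIOD_NS 1000000ULL     /* 1 ms, arbitrary */

static RT_TASK klat_task;
static RT_PIPE klat_pipe;

static void klat_loop(void *cookie)
{
    RTIME expected, now;
    long long delta_ns;

    (void)cookie;

    /* Period given in nanoseconds (oneshot timer mode assumed). */
    rt_task_set_periodic(NULL, TM_NOW, SAMPLE_PERIOD_NS);
    expected = rt_timer_read();

    for (;;) {
        if (rt_task_wait_period(NULL))
            break;                      /* deleted, or bail out on error */

        expected += SAMPLE_PERIOD_NS;   /* approximate ideal wakeup time */
        now = rt_timer_read();
        delta_ns = (long long)now - (long long)expected;

        /* Push one sample towards the /dev/rtpN reader. */
        rt_pipe_write(&klat_pipe, &delta_ns, sizeof(delta_ns), P_NORMAL);
    }
}

static int __init klat_init(void)
{
    int err;

    /* Pool size 0: messages come from the global system heap. */
    err = rt_pipe_create(&klat_pipe, "klatency", P_MINOR_AUTO, 0);
    if (err)
        return err;

    err = rt_task_create(&klat_task, "klatency", 0, 99, 0);
    if (!err)
        err = rt_task_start(&klat_task, klat_loop, NULL);
    if (err)
        rt_pipe_delete(&klat_pipe);

    return err;
}

static void __exit klat_exit(void)
{
    rt_task_delete(&klat_task);
    rt_pipe_delete(&klat_pipe);
}

module_init(klat_init);
module_exit(klat_exit);
MODULE_LICENSE("GPL");

Any plain Linux process could then consume the samples with a regular
read() on the corresponding /dev/rtp<minor> node, so no user-space
Xenomai task would be involved at all.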

-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-17 Thread Jan Kiszka
Gilles Chanteperdrix wrote:
> On Jan 17, 2008 3:16 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>> Gilles Chanteperdrix wrote:
>>> On Jan 17, 2008 12:55 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
 Gilles Chanteperdrix wrote:
> On Jan 17, 2008 11:42 AM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>> Gilles Chanteperdrix wrote:
>>> Hi,
>>>
>>> after some (unsuccessful) time trying to instrument the code in a way
>>> that does not change the latency results completely, I found the
>>> reason for the high latency with latency -t 1 and latency -t 2 on ARM.
>>> So, here comes an update on this issue. The culprit is the user-space
>>> context switch, which flushes the processor cache with the nklock
>>> locked, irqs off.
>>>
>>> There are two things we could do:
>>> - arrange for the ARM cache flush to happen with the nklock unlocked
>>> and irqs enabled. This will improve interrupt latency (latency -t 2)
>>> but obviously not scheduling latency (latency -t 1). If we go that
>>> way, there are several problems we should solve:
>>>
>>> we do not want interrupt handlers to reenter xnpod_schedule(), for
>>> this we can use the XNLOCK bit, set on whatever is
>>> xnpod_current_thread() when the cache flush occurs
>>>
>>> since the interrupt handler may modify the rescheduling bits, we need
>>> to test these bits in xnpod_schedule() epilogue and restart
>>> xnpod_schedule() if need be
>>>
>>> we do not want xnpod_delete_thread() to delete one of the two threads
>>> involved in the context switch, for this the only solution I found is
>>> to add a bit to the thread mask meaning that the thread is currently
>>> switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue
>>> to delete whatever thread was marked for deletion
>>>
>>> in case of migration with xnpod_migrate_thread, we do not want
>>> xnpod_schedule() on the target CPU to switch to the migrated thread
>>> before the context switch on the source CPU is finished, for this we
>>> can avoid setting the resched bit in xnpod_migrate_thread(), detect
>>> the condition in xnpod_schedule() epilogue and set the rescheduling
>>> bits so that xnpod_schedule is restarted and send the IPI to the
>>> target CPU.
>>>
>>> - avoid using user-space real-time tasks when running latency
>>> kernel-space benches, i.e. at least in the latency -t 1 and latency -t
>>> 2 case. This means that we should change the timerbench driver. There
>>> are at least two ways of doing this:
>>> use an rt_pipe
>>>  modify the timerbench driver to implement only the nrt ioctl, using
>>> vanilla linux services such as wait_event and wake_up.
>> [As you reminded me of this unanswered question:]
>> One may consider adding further modes _besides_ current kernel tests
>> that do not rely on RTDM & native userland support (e.g. when
>> CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are valid
>> scenarios as well that must not be killed by such a change.
> I think the current test scenario for latency -t 1 and latency -t 2
> are a bit misleading: they measure kernel-space latencies in presence
> of user-space real-time tasks. When one runs latency -t 1 or latency
> -t 2, one would expect that there are only kernel-space real-time
> tasks.
 If they are misleading, depends on your perspective. In fact, they are
 measuring in-kernel scenarios over the standard Xenomai setup, which
 includes userland RT task activity these day. Those scenarios are mainly
 targeting driver use cases, not pure kernel-space applications.

 But I agree that, for !CONFIG_XENO_OPT_PERVASIVE-like scenarios, we
 would benefit from an additional set of test cases.
>>> Ok, I will not touch timerbench then, and implement another kernel module.
>>>
>> [Without considering all details]
>> To achieve this independence of user space RT thread, it should suffice
>> to implement a kernel-based frontend for timerbench. This frontent would
>> then either dump to syslog or open some pipe to tell userland about the
>> benchmark results. What do yo think?
> 
> My intent was to implement a protocol similar to the one of
> timerbench, but using an rt-pipe, and continue to use the latency
> test, adding new options such as -t 3 and t 4. But there may be
> problems with this approach: if we are compiling without
> CONFIG_XENO_OPT_PERVASIVE, latency will not run at all. So, it is
> probably simpler to implement a klatency that just reads from the
> rt-pipe.

But that klatency could perfectly reuse what timerbench already
provides, without code changes to the latter, in theory.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-17 Thread Jan Kiszka
Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
>> On Jan 17, 2008 12:55 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>>> Gilles Chanteperdrix wrote:
 On Jan 17, 2008 11:42 AM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
> Gilles Chanteperdrix wrote:
>> Hi,
>>
>> after some (unsuccessful) time trying to instrument the code in a way
>> that does not change the latency results completely, I found the
>> reason for the high latency with latency -t 1 and latency -t 2 on ARM.
>> So, here comes an update on this issue. The culprit is the user-space
>> context switch, which flushes the processor cache with the nklock
>> locked, irqs off.
>>
>> There are two things we could do:
>> - arrange for the ARM cache flush to happen with the nklock unlocked
>> and irqs enabled. This will improve interrupt latency (latency -t 2)
>> but obviously not scheduling latency (latency -t 1). If we go that
>> way, there are several problems we should solve:
>>
>> we do not want interrupt handlers to reenter xnpod_schedule(), for
>> this we can use the XNLOCK bit, set on whatever is
>> xnpod_current_thread() when the cache flush occurs
>>
>> since the interrupt handler may modify the rescheduling bits, we need
>> to test these bits in xnpod_schedule() epilogue and restart
>> xnpod_schedule() if need be
>>
>> we do not want xnpod_delete_thread() to delete one of the two threads
>> involved in the context switch, for this the only solution I found is
>> to add a bit to the thread mask meaning that the thread is currently
>> switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue
>> to delete whatever thread was marked for deletion
>>
>> in case of migration with xnpod_migrate_thread, we do not want
>> xnpod_schedule() on the target CPU to switch to the migrated thread
>> before the context switch on the source CPU is finished, for this we
>> can avoid setting the resched bit in xnpod_migrate_thread(), detect
>> the condition in xnpod_schedule() epilogue and set the rescheduling
>> bits so that xnpod_schedule is restarted and send the IPI to the
>> target CPU.
>>
>> - avoid using user-space real-time tasks when running latency
>> kernel-space benches, i.e. at least in the latency -t 1 and latency -t
>> 2 case. This means that we should change the timerbench driver. There
>> are at least two ways of doing this:
>> use an rt_pipe
>>  modify the timerbench driver to implement only the nrt ioctl, using
>> vanilla linux services such as wait_event and wake_up.
> [As you reminded me of this unanswered question:]
> One may consider adding further modes _besides_ current kernel tests
> that do not rely on RTDM & native userland support (e.g. when
> CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are valid
> scenarios as well that must not be killed by such a change.
 I think the current test scenario for latency -t 1 and latency -t 2
 are a bit misleading: they measure kernel-space latencies in presence
 of user-space real-time tasks. When one runs latency -t 1 or latency
 -t 2, one would expect that there are only kernel-space real-time
 tasks.
>>> If they are misleading, depends on your perspective. In fact, they are
>>> measuring in-kernel scenarios over the standard Xenomai setup, which
>>> includes userland RT task activity these day. Those scenarios are mainly
>>> targeting driver use cases, not pure kernel-space applications.
>>>
>>> But I agree that, for !CONFIG_XENO_OPT_PERVASIVE-like scenarios, we
>>> would benefit from an additional set of test cases.
>> Ok, I will not touch timerbench then, and implement another kernel module.
>>
> 
> [Without considering all details]
> To achieve this independence of user space RT thread, it should suffice
> to implement a kernel-based frontend for timerbench. This frontent would
> then either dump to syslog or open some pipe to tell userland about the
> benchmark results. What do yo think?
> 

(That only applies in case you meant "reimplementing timerbench" by
"implement another kernel module". Just write a kernel-hosted RTDM user
of timerbench.)
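
A very rough sketch of what such a kernel-hosted RTDM user might look
like; the rtdm_open()/rtdm_close() inter-driver calls, the "rttest0"
device name and the place where the timerbench ioctls declared in
<rtdm/rttesting.h> would be issued are assumptions to be checked
against the tree, not verified code:

#include <linux/module.h>
#include <linux/fcntl.h>
#include <rtdm/rtdm_driver.h>

static char *device = "rttest0";        /* assumed timerbench device name */
module_param(device, charp, 0444);

static int bench_fd = -1;

static int __init kbench_init(void)
{
    bench_fd = rtdm_open(device, O_RDWR);
    if (bench_fd < 0) {
        printk(KERN_ERR "kbench: cannot open %s (%d)\n", device, bench_fd);
        return bench_fd;
    }

    /*
     * Here the module would configure and start the benchmark and
     * periodically fetch intermediate results with the timerbench
     * ioctls from <rtdm/rttesting.h>, dumping them to syslog or
     * pushing them through a pipe, as discussed in this thread.
     */
    return 0;
}

static void __exit kbench_exit(void)
{
    if (bench_fd >= 0)
        rtdm_close(bench_fd);
}

module_init(kbench_init);
module_exit(kbench_exit);
MODULE_LICENSE("GPL");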

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-17 Thread Jan Kiszka
Gilles Chanteperdrix wrote:
> On Jan 17, 2008 12:55 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>> Gilles Chanteperdrix wrote:
>>> On Jan 17, 2008 11:42 AM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
 Gilles Chanteperdrix wrote:
> Hi,
>
> after some (unsuccessful) time trying to instrument the code in a way
> that does not change the latency results completely, I found the
> reason for the high latency with latency -t 1 and latency -t 2 on ARM.
> So, here comes an update on this issue. The culprit is the user-space
> context switch, which flushes the processor cache with the nklock
> locked, irqs off.
>
> There are two things we could do:
> - arrange for the ARM cache flush to happen with the nklock unlocked
> and irqs enabled. This will improve interrupt latency (latency -t 2)
> but obviously not scheduling latency (latency -t 1). If we go that
> way, there are several problems we should solve:
>
> we do not want interrupt handlers to reenter xnpod_schedule(), for
> this we can use the XNLOCK bit, set on whatever is
> xnpod_current_thread() when the cache flush occurs
>
> since the interrupt handler may modify the rescheduling bits, we need
> to test these bits in xnpod_schedule() epilogue and restart
> xnpod_schedule() if need be
>
> we do not want xnpod_delete_thread() to delete one of the two threads
> involved in the context switch, for this the only solution I found is
> to add a bit to the thread mask meaning that the thread is currently
> switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue
> to delete whatever thread was marked for deletion
>
> in case of migration with xnpod_migrate_thread, we do not want
> xnpod_schedule() on the target CPU to switch to the migrated thread
> before the context switch on the source CPU is finished, for this we
> can avoid setting the resched bit in xnpod_migrate_thread(), detect
> the condition in xnpod_schedule() epilogue and set the rescheduling
> bits so that xnpod_schedule is restarted and send the IPI to the
> target CPU.
>
> - avoid using user-space real-time tasks when running latency
> kernel-space benches, i.e. at least in the latency -t 1 and latency -t
> 2 case. This means that we should change the timerbench driver. There
> are at least two ways of doing this:
> use an rt_pipe
>  modify the timerbench driver to implement only the nrt ioctl, using
> vanilla linux services such as wait_event and wake_up.
 [As you reminded me of this unanswered question:]
 One may consider adding further modes _besides_ current kernel tests
 that do not rely on RTDM & native userland support (e.g. when
 CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are valid
 scenarios as well that must not be killed by such a change.
>>> I think the current test scenario for latency -t 1 and latency -t 2
>>> are a bit misleading: they measure kernel-space latencies in presence
>>> of user-space real-time tasks. When one runs latency -t 1 or latency
>>> -t 2, one would expect that there are only kernel-space real-time
>>> tasks.
>> If they are misleading, depends on your perspective. In fact, they are
>> measuring in-kernel scenarios over the standard Xenomai setup, which
>> includes userland RT task activity these day. Those scenarios are mainly
>> targeting driver use cases, not pure kernel-space applications.
>>
>> But I agree that, for !CONFIG_XENO_OPT_PERVASIVE-like scenarios, we
>> would benefit from an additional set of test cases.
> 
> Ok, I will not touch timerbench then, and implement another kernel module.
> 

[Without considering all details]
To achieve this independence from user-space RT threads, it should
suffice to implement a kernel-based frontend for timerbench. This
frontend would then either dump to syslog or open some pipe to tell
userland about the benchmark results. What do you think?

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-17 Thread Gilles Chanteperdrix
On Jan 17, 2008 12:55 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>
> Gilles Chanteperdrix wrote:
> > On Jan 17, 2008 11:42 AM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
> >> Gilles Chanteperdrix wrote:
> >>> Hi,
> >>>
> >>> after some (unsuccessful) time trying to instrument the code in a way
> >>> that does not change the latency results completely, I found the
> >>> reason for the high latency with latency -t 1 and latency -t 2 on ARM.
> >>> So, here comes an update on this issue. The culprit is the user-space
> >>> context switch, which flushes the processor cache with the nklock
> >>> locked, irqs off.
> >>>
> >>> There are two things we could do:
> >>> - arrange for the ARM cache flush to happen with the nklock unlocked
> >>> and irqs enabled. This will improve interrupt latency (latency -t 2)
> >>> but obviously not scheduling latency (latency -t 1). If we go that
> >>> way, there are several problems we should solve:
> >>>
> >>> we do not want interrupt handlers to reenter xnpod_schedule(), for
> >>> this we can use the XNLOCK bit, set on whatever is
> >>> xnpod_current_thread() when the cache flush occurs
> >>>
> >>> since the interrupt handler may modify the rescheduling bits, we need
> >>> to test these bits in xnpod_schedule() epilogue and restart
> >>> xnpod_schedule() if need be
> >>>
> >>> we do not want xnpod_delete_thread() to delete one of the two threads
> >>> involved in the context switch, for this the only solution I found is
> >>> to add a bit to the thread mask meaning that the thread is currently
> >>> switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue
> >>> to delete whatever thread was marked for deletion
> >>>
> >>> in case of migration with xnpod_migrate_thread, we do not want
> >>> xnpod_schedule() on the target CPU to switch to the migrated thread
> >>> before the context switch on the source CPU is finished, for this we
> >>> can avoid setting the resched bit in xnpod_migrate_thread(), detect
> >>> the condition in xnpod_schedule() epilogue and set the rescheduling
> >>> bits so that xnpod_schedule is restarted and send the IPI to the
> >>> target CPU.
> >>>
> >>> - avoid using user-space real-time tasks when running latency
> >>> kernel-space benches, i.e. at least in the latency -t 1 and latency -t
> >>> 2 case. This means that we should change the timerbench driver. There
> >>> are at least two ways of doing this:
> >>> use an rt_pipe
> >>>  modify the timerbench driver to implement only the nrt ioctl, using
> >>> vanilla linux services such as wait_event and wake_up.
> >> [As you reminded me of this unanswered question:]
> >> One may consider adding further modes _besides_ current kernel tests
> >> that do not rely on RTDM & native userland support (e.g. when
> >> CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are valid
> >> scenarios as well that must not be killed by such a change.
> >
> > I think the current test scenario for latency -t 1 and latency -t 2
> > are a bit misleading: they measure kernel-space latencies in presence
> > of user-space real-time tasks. When one runs latency -t 1 or latency
> > -t 2, one would expect that there are only kernel-space real-time
> > tasks.
>
> If they are misleading, depends on your perspective. In fact, they are
> measuring in-kernel scenarios over the standard Xenomai setup, which
> includes userland RT task activity these day. Those scenarios are mainly
> targeting driver use cases, not pure kernel-space applications.
>
> But I agree that, for !CONFIG_XENO_OPT_PERVASIVE-like scenarios, we
> would benefit from an additional set of test cases.

Ok, I will not touch timerbench then, and implement another kernel module.

-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-17 Thread Jan Kiszka
Gilles Chanteperdrix wrote:
> On Jan 17, 2008 11:42 AM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>> Gilles Chanteperdrix wrote:
>>> Hi,
>>>
>>> after some (unsuccessful) time trying to instrument the code in a way
>>> that does not change the latency results completely, I found the
>>> reason for the high latency with latency -t 1 and latency -t 2 on ARM.
>>> So, here comes an update on this issue. The culprit is the user-space
>>> context switch, which flushes the processor cache with the nklock
>>> locked, irqs off.
>>>
>>> There are two things we could do:
>>> - arrange for the ARM cache flush to happen with the nklock unlocked
>>> and irqs enabled. This will improve interrupt latency (latency -t 2)
>>> but obviously not scheduling latency (latency -t 1). If we go that
>>> way, there are several problems we should solve:
>>>
>>> we do not want interrupt handlers to reenter xnpod_schedule(), for
>>> this we can use the XNLOCK bit, set on whatever is
>>> xnpod_current_thread() when the cache flush occurs
>>>
>>> since the interrupt handler may modify the rescheduling bits, we need
>>> to test these bits in xnpod_schedule() epilogue and restart
>>> xnpod_schedule() if need be
>>>
>>> we do not want xnpod_delete_thread() to delete one of the two threads
>>> involved in the context switch, for this the only solution I found is
>>> to add a bit to the thread mask meaning that the thread is currently
>>> switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue
>>> to delete whatever thread was marked for deletion
>>>
>>> in case of migration with xnpod_migrate_thread, we do not want
>>> xnpod_schedule() on the target CPU to switch to the migrated thread
>>> before the context switch on the source CPU is finished, for this we
>>> can avoid setting the resched bit in xnpod_migrate_thread(), detect
>>> the condition in xnpod_schedule() epilogue and set the rescheduling
>>> bits so that xnpod_schedule is restarted and send the IPI to the
>>> target CPU.
>>>
>>> - avoid using user-space real-time tasks when running latency
>>> kernel-space benches, i.e. at least in the latency -t 1 and latency -t
>>> 2 case. This means that we should change the timerbench driver. There
>>> are at least two ways of doing this:
>>> use an rt_pipe
>>>  modify the timerbench driver to implement only the nrt ioctl, using
>>> vanilla linux services such as wait_event and wake_up.
>> [As you reminded me of this unanswered question:]
>> One may consider adding further modes _besides_ current kernel tests
>> that do not rely on RTDM & native userland support (e.g. when
>> CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are valid
>> scenarios as well that must not be killed by such a change.
> 
> I think the current test scenario for latency -t 1 and latency -t 2
> are a bit misleading: they measure kernel-space latencies in presence
> of user-space real-time tasks. When one runs latency -t 1 or latency
> -t 2, one would expect that there are only kernel-space real-time
> tasks.

Whether they are misleading depends on your perspective. In fact, they
are measuring in-kernel scenarios over the standard Xenomai setup,
which includes userland RT task activity these days. Those scenarios
mainly target driver use cases, not pure kernel-space applications.

But I agree that, for !CONFIG_XENO_OPT_PERVASIVE-like scenarios, we
would benefit from an additional set of test cases.

Jan
-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-17 Thread Gilles Chanteperdrix
On Jan 17, 2008 11:42 AM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>
> Gilles Chanteperdrix wrote:
> > Hi,
> >
> > after some (unsuccessful) time trying to instrument the code in a way
> > that does not change the latency results completely, I found the
> > reason for the high latency with latency -t 1 and latency -t 2 on ARM.
> > So, here comes an update on this issue. The culprit is the user-space
> > context switch, which flushes the processor cache with the nklock
> > locked, irqs off.
> >
> > There are two things we could do:
> > - arrange for the ARM cache flush to happen with the nklock unlocked
> > and irqs enabled. This will improve interrupt latency (latency -t 2)
> > but obviously not scheduling latency (latency -t 1). If we go that
> > way, there are several problems we should solve:
> >
> > we do not want interrupt handlers to reenter xnpod_schedule(), for
> > this we can use the XNLOCK bit, set on whatever is
> > xnpod_current_thread() when the cache flush occurs
> >
> > since the interrupt handler may modify the rescheduling bits, we need
> > to test these bits in xnpod_schedule() epilogue and restart
> > xnpod_schedule() if need be
> >
> > we do not want xnpod_delete_thread() to delete one of the two threads
> > involved in the context switch, for this the only solution I found is
> > to add a bit to the thread mask meaning that the thread is currently
> > switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue
> > to delete whatever thread was marked for deletion
> >
> > in case of migration with xnpod_migrate_thread, we do not want
> > xnpod_schedule() on the target CPU to switch to the migrated thread
> > before the context switch on the source CPU is finished, for this we
> > can avoid setting the resched bit in xnpod_migrate_thread(), detect
> > the condition in xnpod_schedule() epilogue and set the rescheduling
> > bits so that xnpod_schedule is restarted and send the IPI to the
> > target CPU.
> >
> > - avoid using user-space real-time tasks when running latency
> > kernel-space benches, i.e. at least in the latency -t 1 and latency -t
> > 2 case. This means that we should change the timerbench driver. There
> > are at least two ways of doing this:
> > use an rt_pipe
> >  modify the timerbench driver to implement only the nrt ioctl, using
> > vanilla linux services such as wait_event and wake_up.
>
> [As you reminded me of this unanswered question:]
> One may consider adding further modes _besides_ current kernel tests
> that do not rely on RTDM & native userland support (e.g. when
> CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are valid
> scenarios as well that must not be killed by such a change.

I think the current test scenarios for latency -t 1 and latency -t 2
are a bit misleading: they measure kernel-space latencies in the
presence of user-space real-time tasks. When one runs latency -t 1 or
latency -t 2, one would expect that there are only kernel-space
real-time tasks.

-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-17 Thread Jan Kiszka
Gilles Chanteperdrix wrote:
> Hi,
> 
> after some (unsuccessful) time trying to instrument the code in a way
> that does not change the latency results completely, I found the
> reason for the high latency with latency -t 1 and latency -t 2 on ARM.
> So, here comes an update on this issue. The culprit is the user-space
> context switch, which flushes the processor cache with the nklock
> locked, irqs off.
> 
> There are two things we could do:
> - arrange for the ARM cache flush to happen with the nklock unlocked
> and irqs enabled. This will improve interrupt latency (latency -t 2)
> but obviously not scheduling latency (latency -t 1). If we go that
> way, there are several problems we should solve:
> 
> we do not want interrupt handlers to reenter xnpod_schedule(), for
> this we can use the XNLOCK bit, set on whatever is
> xnpod_current_thread() when the cache flush occurs
> 
> since the interrupt handler may modify the rescheduling bits, we need
> to test these bits in xnpod_schedule() epilogue and restart
> xnpod_schedule() if need be
> 
> we do not want xnpod_delete_thread() to delete one of the two threads
> involved in the context switch, for this the only solution I found is
> to add a bit to the thread mask meaning that the thread is currently
> switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue
> to delete whatever thread was marked for deletion
> 
> in case of migration with xnpod_migrate_thread, we do not want
> xnpod_schedule() on the target CPU to switch to the migrated thread
> before the context switch on the source CPU is finished, for this we
> can avoid setting the resched bit in xnpod_migrate_thread(), detect
> the condition in xnpod_schedule() epilogue and set the rescheduling
> bits so that xnpod_schedule is restarted and send the IPI to the
> target CPU.
> 
> - avoid using user-space real-time tasks when running latency
> kernel-space benches, i.e. at least in the latency -t 1 and latency -t
> 2 case. This means that we should change the timerbench driver. There
> are at least two ways of doing this:
> use an rt_pipe
>  modify the timerbench driver to implement only the nrt ioctl, using
> vanilla linux services such as wait_event and wake_up.

[As you reminded me of this unanswered question:]
One may consider adding further modes _besides_ the current kernel
tests, modes that do not rely on RTDM & native userland support (e.g.
when CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are
valid scenarios as well and must not be killed by such a change.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


[Xenomai-core] High latencies on ARM.

2008-01-02 Thread Gilles Chanteperdrix
Hi,

after some (unsuccessful) time trying to instrument the code in a way
that does not change the latency results completely, I found the
reason for the high latency with latency -t 1 and latency -t 2 on ARM.
So, here comes an update on this issue. The culprit is the user-space
context switch, which flushes the processor cache with the nklock
locked, irqs off.

There are two things we could do:
- arrange for the ARM cache flush to happen with the nklock unlocked
and irqs enabled. This will improve interrupt latency (latency -t 2)
but obviously not scheduling latency (latency -t 1). If we go that
way, there are several problems we should solve:

we do not want interrupt handlers to reenter xnpod_schedule(); for
this, we can use the XNLOCK bit, set on whatever xnpod_current_thread()
returns when the cache flush occurs;

since the interrupt handler may modify the rescheduling bits, we need
to test these bits in the xnpod_schedule() epilogue and restart
xnpod_schedule() if need be;

we do not want xnpod_delete_thread() to delete one of the two threads
involved in the context switch; for this, the only solution I found is
to add a bit to the thread mask meaning that the thread is currently
switching, and to (re)test the XNZOMBIE bit in the xnpod_schedule()
epilogue to delete whatever thread was marked for deletion;

in case of migration with xnpod_migrate_thread(), we do not want
xnpod_schedule() on the target CPU to switch to the migrated thread
before the context switch on the source CPU is finished; for this, we
can avoid setting the resched bit in xnpod_migrate_thread(), detect the
condition in the xnpod_schedule() epilogue, set the rescheduling bits
so that xnpod_schedule() is restarted, and send the IPI to the target
CPU.

- avoid using user-space real-time tasks when running the kernel-space
latency benches, i.e. at least in the latency -t 1 and latency -t 2
cases. This means that we should change the timerbench driver. There
are at least two ways of doing this:
  - use an rt_pipe;
  - modify the timerbench driver to implement only the nrt ioctl, using
    vanilla Linux services such as wait_event and wake_up.
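
A minimal sketch of the wait_event/wake_up plumbing behind that second
option; this is not the timerbench code, and in a real Xenomai driver
the wake_up() call would have to be deferred to Linux (non-real-time)
context rather than issued from the real-time domain:

#include <linux/module.h>
#include <linux/wait.h>
#include <linux/spinlock.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(sample_wait);
static DEFINE_SPINLOCK(sample_lock);
static long long last_sample_ns;
static int sample_ready;

/*
 * Producer side: store one latency sample and wake any sleeping
 * reader. Must run from Linux context in a real driver.
 */
void push_sample(long long delta_ns)
{
    unsigned long flags;

    spin_lock_irqsave(&sample_lock, flags);
    last_sample_ns = delta_ns;
    sample_ready = 1;
    spin_unlock_irqrestore(&sample_lock, flags);

    wake_up_interruptible(&sample_wait);
}

/*
 * Consumer side, e.g. the body of an nrt ioctl handler: sleep until a
 * sample is available, then hand it back to the caller.
 */
int wait_for_sample(long long *delta_ns)
{
    unsigned long flags;
    int ret;

    ret = wait_event_interruptible(sample_wait, sample_ready);
    if (ret)
        return ret;                     /* interrupted by a signal */

    spin_lock_irqsave(&sample_lock, flags);
    *delta_ns = last_sample_ns;
    sample_ready = 0;
    spin_unlock_irqrestore(&sample_lock, flags);

    return 0;
}

MODULE_LICENSE("GPL");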

What do you think?

-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core