Re: [Xenomai-core] High latencies on ARM.

2008-01-28 Thread Gilles Chanteperdrix
On Jan 28, 2008 12:34 AM, Philippe Gerum [EMAIL PROTECTED] wrote:

 Gilles Chanteperdrix wrote:
  Philippe Gerum wrote:
Gilles Chanteperdrix wrote:
 Philippe Gerum wrote:
   Gilles Chanteperdrix wrote:
On Jan 23, 2008 7:34 PM, Philippe Gerum [EMAIL PROTECTED] wrote:
Gilles Chanteperdrix wrote:
On Jan 23, 2008 6:48 PM, Philippe Gerum [EMAIL PROTECTED] 
  wrote:
Gilles Chanteperdrix wrote:
Gilles Chanteperdrix wrote:
  Please find attached a patch implementing these ideas. 
  This adds some
  clutter, which I would be happy to reduce. Better ideas 
  are welcome.
 
   
Ok. New version of the patch, this time split in two parts, 
  should
hopefully make it more readable.
   
Ack. I'd suggest the following:
   
- let's have a rate limiter when walking the zombie queue in
__xnpod_finalize_zombies. We hold the superlock here, and what 
  the patch
also introduces is the potential for flushing more than a 
  single TCB at
a time, which might not always be a cheap operation, depending 
  on which
cra^H^Hode runs on behalf of the deletion hooks for instance. 
  We may
take for granted that no sane code would continuously create 
  more
threads than we would be able to finalize in a given time 
  frame anyway.
The maximum number of zombies in the queue is
1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the 
  queue
only if a deleted thread is xnpod_current_thread(), or if the 
  XNLOCKSW
bit is armed.
Ack. rate_limit = 1? I'm really reluctant to increase the WCET 
  here,
thread deletion isn't cheap already.
   
I am not sure that holding the nklock while we run the thread 
  deletion
hooks is really needed.
   
  
   Deletion hooks may currently rely on the following assumptions when 
  running:
  
   - rescheduling is locked
   - nklock is held, interrupts are off
   - they run on behalf of the deletor context
  
   The self-delete refactoring currently kills #3 because we now run 
  the
   hooks after the context switch, and would also kill #2 if we did not
   hold the nklock (btw, enabling the nucleus debug while running with 
  this
   patch should raise an abort, from xnshadow_unmap, due to the second
   assertion).
  
   
Forget about this; shadows are always exited in secondary mode, so
that's fine, i.e. xnpod_current_thread() != deleted thread, hence we
should always run the deletion hooks immediately on behalf of the caller.
 
  What happens if the watchdog kills a user-space thread which is
  currently running in primary mode ? If I read xnpod_delete_thread
  correctly, the SIGKILL signal is sent to the target thread only if it is
  not the current thread.
 

 I'd say: zombie queuing from xnpod_delete, then shadow unmap on behalf
 of the next switched context which would trigger the lo-stage unmap
 request - wake_up_process against the Linux side and asbestos underwear
 provided by the relax epilogue, which would eventually reap the guy
 through do_exit(). As a matter of fact, we would still have the
 unmap-over-non-current issue, that's true.

 Ok, could we try coding a damn Tetris instead? Pong, maybe? Gasp...

Games for mobile phones then, because I am afraid games for consoles
or PCs are too complicated for me.

No, seriously, how do we solve this ? Maybe we could relax from
xnpod_delete_thread ?


-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-28 Thread Gilles Chanteperdrix
Gilles Chanteperdrix wrote:
  On Jan 28, 2008 12:34 AM, Philippe Gerum [EMAIL PROTECTED] wrote:
  
   Gilles Chanteperdrix wrote:
Philippe Gerum wrote:
  Gilles Chanteperdrix wrote:
   Philippe Gerum wrote:
 Gilles Chanteperdrix wrote:
  On Jan 23, 2008 7:34 PM, Philippe Gerum [EMAIL PROTECTED] 
wrote:
  Gilles Chanteperdrix wrote:
  On Jan 23, 2008 6:48 PM, Philippe Gerum [EMAIL PROTECTED] 
wrote:
  Gilles Chanteperdrix wrote:
  Gilles Chanteperdrix wrote:
Please find attached a patch implementing these ideas. 
This adds some
clutter, which I would be happy to reduce. Better ideas 
are welcome.
   
 
  Ok. New version of the patch, this time split in two 
parts, should
  hopefully make it more readable.
 
  Ack. I'd suggest the following:
 
  - let's have a rate limiter when walking the zombie queue in
  __xnpod_finalize_zombies. We hold the superlock here, and 
what the patch
  also introduces is the potential for flushing more than a 
single TCB at
  a time, which might not always be a cheap operation, 
depending on which
  cra^H^Hode runs on behalf of the deletion hooks for 
instance. We may
  take for granted that no sane code would continuously 
create more
  threads than we would be able to finalize in a given time 
frame anyway.
  The maximum number of zombies in the queue is
  1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to 
the queue
  only if a deleted thread is xnpod_current_thread(), or if 
the XNLOCKSW
  bit is armed.
  Ack. rate_limit = 1? I'm really reluctant to increase the 
WCET here,
  thread deletion isn't cheap already.
 
  I am not sure that holding the nklock while we run the thread 
deletion
  hooks is really needed.
 

 Deletion hooks may currently rely on the following assumptions 
when running:

 - rescheduling is locked
 - nklock is held, interrupts are off
 - they run on behalf of the deletor context

 The self-delete refactoring currently kills #3 because we now 
run the
 hooks after the context switch, and would also kill #2 if we did 
not
 hold the nklock (btw, enabling the nucleus debug while running 
with this
 patch should raise an abort, from xnshadow_unmap, due to the 
second
 assertion).

 
  Forget about this; shadows are always exited in secondary mode, so
  that's fine, i.e. xnpod_current_thread() != deleted thread, hence we
  should always run the deletion hooks immediately on behalf of the 
caller.
   
What happens if the watchdog kills a user-space thread which is
currently running in primary mode ? If I read xnpod_delete_thread
correctly, the SIGKILL signal is sent to the target thread only if it is
not the current thread.
   
  
   I'd say: zombie queuing from xnpod_delete, then shadow unmap on behalf
   of the next switched context which would trigger the lo-stage unmap
   request - wake_up_process against the Linux side and asbestos underwear
   provided by the relax epilogue, which would eventually reap the guy
   through do_exit(). As a matter of fact, we would still have the
   unmap-over-non-current issue, that's true.
  
   Ok, could we try coding a damn Tetris instead? Pong, maybe? Gasp...
  
  Games for mobile phones then, because I am afraid games for consoles
  or PCs are too complicated for me.
  
  No, seriously, how do we solve this ? Maybe we could relax from
  xnpod_delete_thread ?

This will not work: xnpod_schedule will not let xnshadow_relax suspend
the current thread while in interrupt context.

-- 


Gilles Chanteperdrix.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-27 Thread Philippe Gerum
Gilles Chanteperdrix wrote:
 Philippe Gerum wrote:
   Gilles Chanteperdrix wrote:
Philippe Gerum wrote:
  Gilles Chanteperdrix wrote:
   On Jan 23, 2008 7:34 PM, Philippe Gerum [EMAIL PROTECTED] wrote:
   Gilles Chanteperdrix wrote:
   On Jan 23, 2008 6:48 PM, Philippe Gerum [EMAIL PROTECTED] wrote:
   Gilles Chanteperdrix wrote:
   Gilles Chanteperdrix wrote:
 Please find attached a patch implementing these ideas. This 
 adds some
 clutter, which I would be happy to reduce. Better ideas are 
 welcome.

  
   Ok. New version of the patch, this time split in two parts, 
 should
   hopefully make it more readable.
  
   Ack. I'd suggest the following:
  
   - let's have a rate limiter when walking the zombie queue in
   __xnpod_finalize_zombies. We hold the superlock here, and what 
 the patch
   also introduces is the potential for flushing more than a single 
 TCB at
   a time, which might not always be a cheap operation, depending 
 on which
   cra^H^Hode runs on behalf of the deletion hooks for instance. We 
 may
   take for granted that no sane code would continuously create more
   threads than we would be able to finalize in a given time frame 
 anyway.
   The maximum number of zombies in the queue is
   1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the 
 queue
   only if a deleted thread is xnpod_current_thread(), or if the 
 XNLOCKSW
   bit is armed.
   Ack. rate_limit = 1? I'm really reluctant to increase the WCET 
 here,
   thread deletion isn't cheap already.
   
   I am not sure that holding the nklock while we run the thread 
 deletion
   hooks is really needed.
   
  
  Deletion hooks may currently rely on the following assumptions when 
 running:
  
  - rescheduling is locked
  - nklock is held, interrupts are off
  - they run on behalf of the deletor context
  
  The self-delete refactoring currently kills #3 because we now run the
  hooks after the context switch, and would also kill #2 if we did not
  hold the nklock (btw, enabling the nucleus debug while running with 
 this
  patch should raise an abort, from xnshadow_unmap, due to the second
  assertion).
  
   
   Forget about this; shadows are always exited in secondary mode, so
   that's fine, i.e. xnpod_current_thread() != deleted thread, hence we
   should always run the deletion hooks immediately on behalf of the caller.
 
 What happens if the watchdog kills a user-space thread which is
 currently running in primary mode ? If I read xnpod_delete_thread
 correctly, the SIGKILL signal is sent to the target thread only if it is
 not the current thread.
 

I'd say: zombie queuing from xnpod_delete, then shadow unmap on behalf
of the next switched context which would trigger the lo-stage unmap
request - wake_up_process against the Linux side and asbestos underwear
provided by the relax epilogue, which would eventually reap the guy
through do_exit(). As a matter of fact, we would still have the
unmap-over-non-current issue, that's true.

Ok, could we try coding a damn Tetris instead? Pong, maybe? Gasp...

-- 
Philippe.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-26 Thread Philippe Gerum
Gilles Chanteperdrix wrote:
 On Jan 23, 2008 7:34 PM, Philippe Gerum [EMAIL PROTECTED] wrote:
 Gilles Chanteperdrix wrote:
 On Jan 23, 2008 6:48 PM, Philippe Gerum [EMAIL PROTECTED] wrote:
 Gilles Chanteperdrix wrote:
 Gilles Chanteperdrix wrote:
   Please find attached a patch implementing these ideas. This adds some
   clutter, which I would be happy to reduce. Better ideas are welcome.
  

 Ok. New version of the patch, this time split in two parts, should
 hopefully make it more readable.

 Ack. I'd suggest the following:

 - let's have a rate limiter when walking the zombie queue in
 __xnpod_finalize_zombies. We hold the superlock here, and what the patch
 also introduces is the potential for flushing more than a single TCB at
 a time, which might not always be a cheap operation, depending on which
 cra^H^Hode runs on behalf of the deletion hooks for instance. We may
 take for granted that no sane code would continuously create more
 threads than we would be able to finalize in a given time frame anyway.
 The maximum number of zombies in the queue is
 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue
 only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW
 bit is armed.
 Ack. rate_limit = 1? I'm really reluctant to increase the WCET here,
 thread deletion isn't cheap already.
 
 I am not sure that holding the nklock while we run the thread deletion
 hooks is really needed.
 

Deletion hooks may currently rely on the following assumptions when running:

- rescheduling is locked
- nklock is held, interrupts are off
- they run on behalf of the deletor context

The self-delete refactoring currently kills #3 because we now run the
hooks after the context switch, and would also kill #2 if we did not
hold the nklock (btw, enabling the nucleus debug while running with this
patch should raise an abort, from xnshadow_unmap, due to the second
assertion).

It should be possible to get rid of #3 for xnshadow_unmap (serious
testing needed here), but we would have to grab the nklock from this
routine anyway.
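
For illustration, "grab the nklock from this routine" amounts to the usual
nucleus locking pattern already used in pod.c; the hook body below is only
an editorial placeholder sketch, not the actual xnshadow_unmap() code:

/* Sketch: if assumption #2 (nklock held by the caller) is dropped, the
 * deletion hook has to take the superlock itself around whatever
 * nucleus/RPI state it touches. */
static void unmap_hook_sketch(xnthread_t *thread)
{
	spl_t s;

	xnlock_get_irqsave(&nklock, s);
	/* ... touch the state that requires the superlock ... */
	xnlock_put_irqrestore(&nklock, s);
}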

-- 
Philippe.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-26 Thread Gilles Chanteperdrix
Philippe Gerum wrote:
  Gilles Chanteperdrix wrote:
   On Jan 23, 2008 7:34 PM, Philippe Gerum [EMAIL PROTECTED] wrote:
   Gilles Chanteperdrix wrote:
   On Jan 23, 2008 6:48 PM, Philippe Gerum [EMAIL PROTECTED] wrote:
   Gilles Chanteperdrix wrote:
   Gilles Chanteperdrix wrote:
 Please find attached a patch implementing these ideas. This adds 
   some
 clutter, which I would be happy to reduce. Better ideas are welcome.

  
   Ok. New version of the patch, this time split in two parts, should
   hopefully make it more readable.
  
   Ack. I'd suggest the following:
  
   - let's have a rate limiter when walking the zombie queue in
   __xnpod_finalize_zombies. We hold the superlock here, and what the patch
   also introduces is the potential for flushing more than a single TCB at
   a time, which might not always be a cheap operation, depending on which
   cra^H^Hode runs on behalf of the deletion hooks for instance. We may
   take for granted that no sane code would continuously create more
   threads than we would be able to finalize in a given time frame anyway.
   The maximum number of zombies in the queue is
   1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue
   only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW
   bit is armed.
   Ack. rate_limit = 1? I'm really reluctant to increase the WCET here,
   thread deletion isn't cheap already.
   
   I am not sure that holding the nklock while we run the thread deletion
   hooks is really needed.
   
  
  Deletion hooks may currently rely on the following assumptions when running:
  
  - rescheduling is locked
  - nklock is held, interrupts are off
  - they run on behalf of the deletor context
  
  The self-delete refactoring currently kills #3 because we now run the
  hooks after the context switch, and would also kill #2 if we did not
  hold the nklock (btw, enabling the nucleus debug while running with this
  patch should raise an abort, from xnshadow_unmap, due to the second
  assertion).
  
  It should be possible to get rid of #3 for xnshadow_unmap (serious
  testing needed here), but we would have to grab the nklock from this
  routine anyway.

Since the unmapped task is no longer running on the current CPU, isn't
there any chance that it is run on another CPU by the time we get to
xnshadow_unmap ?

-- 


Gilles Chanteperdrix.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-26 Thread Philippe Gerum
Gilles Chanteperdrix wrote:
 Philippe Gerum wrote:
   Gilles Chanteperdrix wrote:
On Jan 23, 2008 7:34 PM, Philippe Gerum [EMAIL PROTECTED] wrote:
Gilles Chanteperdrix wrote:
On Jan 23, 2008 6:48 PM, Philippe Gerum [EMAIL PROTECTED] wrote:
Gilles Chanteperdrix wrote:
Gilles Chanteperdrix wrote:
  Please find attached a patch implementing these ideas. This adds 
 some
  clutter, which I would be happy to reduce. Better ideas are 
 welcome.
 
   
Ok. New version of the patch, this time split in two parts, should
hopefully make it more readable.
   
Ack. I'd suggest the following:
   
- let's have a rate limiter when walking the zombie queue in
__xnpod_finalize_zombies. We hold the superlock here, and what the 
 patch
also introduces is the potential for flushing more than a single TCB 
 at
a time, which might not always be a cheap operation, depending on 
 which
cra^H^Hode runs on behalf of the deletion hooks for instance. We may
take for granted that no sane code would continuously create more
threads than we would be able to finalize in a given time frame 
 anyway.
The maximum number of zombies in the queue is
1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue
only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW
bit is armed.
Ack. rate_limit = 1? I'm really reluctant to increase the WCET here,
thread deletion isn't cheap already.

I am not sure that holding the nklock while we run the thread deletion
hooks is really needed.

   
   Deletion hooks may currently rely on the following assumptions when 
 running:
   
   - rescheduling is locked
   - nklock is held, interrupts are off
   - they run on behalf of the deletor context
   
   The self-delete refactoring currently kills #3 because we now run the
   hooks after the context switch, and would also kill #2 if we did not
   hold the nklock (btw, enabling the nucleus debug while running with this
   patch should raise an abort, from xnshadow_unmap, due to the second
   assertion).
   

Forget about this; shadows are always exited in secondary mode, so
that's fine, i.e. xnpod_current_thread() != deleted thread, hence we
should always run the deletion hooks immediately on behalf of the caller.

   It should be possible to get rid of #3 for xnshadow_unmap (serious
   testing needed here), but we would have to grab the nklock from this
   routine anyway.
 
 Since the unmapped task is no longer running on the current CPU, isn't
 there any chance that it is run on another CPU by the time we get to
 xnshadow_unmap ?
 

The unmapped task is actually still running, and do_exit() may reschedule
quite late, until kernel preemption is eventually disabled, which happens
long after the I-pipe notifier is fired. We would need the nklock to
protect the RPI management too.

-- 
Philippe.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-24 Thread Gilles Chanteperdrix
On Jan 23, 2008 7:34 PM, Philippe Gerum [EMAIL PROTECTED] wrote:
 Gilles Chanteperdrix wrote:
  On Jan 23, 2008 6:48 PM, Philippe Gerum [EMAIL PROTECTED] wrote:
  Gilles Chanteperdrix wrote:
  Gilles Chanteperdrix wrote:
Please find attached a patch implementing these ideas. This adds some
clutter, which I would be happy to reduce. Better ideas are welcome.
   
 
  Ok. New version of the patch, this time split in two parts, should
  hopefully make it more readable.
 
  Ack. I'd suggest the following:
 
  - let's have a rate limiter when walking the zombie queue in
  __xnpod_finalize_zombies. We hold the superlock here, and what the patch
  also introduces is the potential for flushing more than a single TCB at
  a time, which might not always be a cheap operation, depending on which
  cra^H^Hode runs on behalf of the deletion hooks for instance. We may
  take for granted that no sane code would continuously create more
  threads than we would be able to finalize in a given time frame anyway.
 
  The maximum number of zombies in the queue is
  1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue
  only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW
  bit is armed.

 Ack. rate_limit = 1? I'm really reluctant to increase the WCET here,
 thread deletion isn't cheap already.

I am not sure that holding the nklock while we run the thread deletion
hooks is really needed.

-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-23 Thread Philippe Gerum
Gilles Chanteperdrix wrote:
 Gilles Chanteperdrix wrote:
   Please find attached a patch implementing these ideas. This adds some
   clutter, which I would be happy to reduce. Better ideas are welcome.
   
 
 Ok. New version of the patch, this time split in two parts, should
 hopefully make it more readable.
 

Ack. I'd suggest the following:

- let's have a rate limiter when walking the zombie queue in
__xnpod_finalize_zombies. We hold the superlock here, and what the patch
also introduces is the potential for flushing more than a single TCB at
a time, which might not always be a cheap operation, depending on which
cra^H^Hode runs on behalf of the deletion hooks for instance. We may
take for granted that no sane code would continuously create more
threads than we would be able to finalize in a given time frame anyway.

- We could move most of the code depending on XNARCH_WANT_UNLOCKED_CTXSW
 to conditional inlines in pod.h. This would reduce the visual pollution
a lot.
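
As an illustration of that second point, the repetitive scheduling-tail code
could be hidden behind conditional inlines in pod.h along these lines; the
helper name xnsched_finalize_tail() is made up for the example, this is not
the actual patch:

#ifdef XNARCH_WANT_UNLOCKED_CTXSW
static inline void xnsched_finalize_tail(xnsched_t *sched)
{
	/* Reap any thread queued as a zombie during the unlocked switch. */
	xnpod_finalize_zombies(sched);
}
#else /* !XNARCH_WANT_UNLOCKED_CTXSW */
static inline void xnsched_finalize_tail(xnsched_t *sched)
{
	/* Nothing to do when context switches run with the nklock held. */
}
#endif /* XNARCH_WANT_UNLOCKED_CTXSW */

Each of the scheduling tails would then reduce to a single call to this
helper.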

-- 
Philippe.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-23 Thread Gilles Chanteperdrix
On Jan 23, 2008 6:48 PM, Philippe Gerum [EMAIL PROTECTED] wrote:
 Gilles Chanteperdrix wrote:
  Gilles Chanteperdrix wrote:
Please find attached a patch implementing these ideas. This adds some
clutter, which I would be happy to reduce. Better ideas are welcome.
   
 
  Ok. New version of the patch, this time split in two parts, should
  hopefully make it more readable.
 

 Ack. I'd suggest the following:

 - let's have a rate limiter when walking the zombie queue in
 __xnpod_finalize_zombies. We hold the superlock here, and what the patch
 also introduces is the potential for flushing more than a single TCB at
 a time, which might not always be a cheap operation, depending on which
 cra^H^Hode runs on behalf of the deletion hooks for instance. We may
 take for granted that no sane code would continuously create more
 threads than we would be able to finalize in a given time frame anyway.

The maximum number of zombies in the queue is
1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue
only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW
bit is armed.


 - We could move most of the code depending on XNARCH_WANT_UNLOCKED_CTXSW
  to conditional inlines in pod.h. This would reduce the visual pollution
 a lot.

Ok, will try that, especially since the code added to the 4 places
where a scheduling tail takes place is pretty repetitive.
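
For reference, the enqueue condition described above could be sketched as
follows; this is an editorial reconstruction based on the description, not
the patch itself (names as used in this thread):

	/* In the deletion path, "thread" is the thread being deleted. */
	if (thread == xnpod_current_thread()
	    || xnthread_test_state(thread, XNLOCKSW)) {
		/* Defer: reaped later by __xnpod_finalize_zombies(), so at
		 * most 1 + XNARCH_WANT_UNLOCKED_CTXSW entries ever queue up. */
		appendq(&sched->zombies, &thread->glink);
	} else {
		/* Otherwise the deletion hooks and the TCB cleanup can run
		 * immediately, on behalf of the deletor. */
	}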

-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-23 Thread Philippe Gerum
Gilles Chanteperdrix wrote:
 On Jan 23, 2008 6:48 PM, Philippe Gerum [EMAIL PROTECTED] wrote:
 Gilles Chanteperdrix wrote:
 Gilles Chanteperdrix wrote:
   Please find attached a patch implementing these ideas. This adds some
   clutter, which I would be happy to reduce. Better ideas are welcome.
  

 Ok. New version of the patch, this time split in two parts, should
 hopefully make it more readable.

 Ack. I'd suggest the following:

 - let's have a rate limiter when walking the zombie queue in
 __xnpod_finalize_zombies. We hold the superlock here, and what the patch
 also introduces is the potential for flushing more than a single TCB at
 a time, which might not always be a cheap operation, depending on which
 cra^H^Hode runs on behalf of the deletion hooks for instance. We may
 take for granted that no sane code would continuously create more
 threads than we would be able to finalize in a given time frame anyway.
 
 The maximum number of zombies in the queue is
 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue
 only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW
 bit is armed.

Ack. rate_limit = 1? I'm really reluctant to increase the WCET here,
thread deletion isn't cheap already.
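
A rate limiter on top of the patch's __xnpod_finalize_zombies() loop could be
as small as the sketch below; XNPOD_ZOMBIE_RATE_LIMIT is an invented name, and
per the remark above it would simply be 1 (editorial sketch, not the committed
code):

#define XNPOD_ZOMBIE_RATE_LIMIT 1	/* hypothetical knob, see above */

void __xnpod_finalize_zombies(xnsched_t *sched)
{
	xnholder_t *holder;
	int reaped = 0;

	/* Must be called with nklock locked, interrupts off. */
	while (reaped++ < XNPOD_ZOMBIE_RATE_LIMIT
	       && (holder = getq(&sched->zombies)) != NULL) {
		xnthread_t *thread = link2thread(holder, glink);

		/* Run the deletion hooks, then drop the TCB, as in the patch. */
		if (!emptyq_p(&nkpod->tdeleteq)
		    && !xnthread_test_state(thread, XNROOT))
			xnpod_fire_callouts(&nkpod->tdeleteq, thread);

		xnthread_cleanup_tcb(thread);
	}
}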

-- 
Philippe.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-23 Thread Gilles Chanteperdrix
On Jan 23, 2008 7:34 PM, Philippe Gerum [EMAIL PROTECTED] wrote:
 Gilles Chanteperdrix wrote:
  On Jan 23, 2008 6:48 PM, Philippe Gerum [EMAIL PROTECTED] wrote:
  Gilles Chanteperdrix wrote:
  Gilles Chanteperdrix wrote:
Please find attached a patch implementing these ideas. This adds some
clutter, which I would be happy to reduce. Better ideas are welcome.
   
 
  Ok. New version of the patch, this time split in two parts, should
  hopefully make it more readable.
 
  Ack. I'd suggest the following:
 
  - let's have a rate limiter when walking the zombie queue in
  __xnpod_finalize_zombies. We hold the superlock here, and what the patch
  also introduces is the potential for flushing more than a single TCB at
  a time, which might not always be a cheap operation, depending on which
  cra^H^Hode runs on behalf of the deletion hooks for instance. We may
  take for granted that no sane code would continuously create more
  threads than we would be able to finalize in a given time frame anyway.
 
  The maximum number of zombies in the queue is
  1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue
  only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW
  bit is armed.

 Ack. rate_limit = 1? I'm really reluctant to increase the WCET here,
 thread deletion isn't cheap already.

Ok, as you wish.

-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-23 Thread Gilles Chanteperdrix
Philippe Gerum wrote:
  Gilles Chanteperdrix wrote:
   On Jan 23, 2008 6:48 PM, Philippe Gerum [EMAIL PROTECTED] wrote:
   Gilles Chanteperdrix wrote:
   Gilles Chanteperdrix wrote:
 Please find attached a patch implementing these ideas. This adds some
 clutter, which I would be happy to reduce. Better ideas are welcome.

  
   Ok. New version of the patch, this time split in two parts, should
   hopefully make it more readable.
  
   Ack. I'd suggest the following:
  
   - let's have a rate limiter when walking the zombie queue in
   __xnpod_finalize_zombies. We hold the superlock here, and what the patch
   also introduces is the potential for flushing more than a single TCB at
   a time, which might not always be a cheap operation, depending on which
   cra^H^Hode runs on behalf of the deletion hooks for instance. We may
   take for granted that no sane code would continuously create more
   threads than we would be able to finalize in a given time frame anyway.
   
   The maximum number of zombies in the queue is
   1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue
   only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW
   bit is armed.
  
  Ack. rate_limit = 1? I'm really reluctant to increase the WCET here,
  thread deletion isn't cheap already.

Here come new patches.

-- 


Gilles Chanteperdrix.
Index: include/asm-ia64/bits/pod.h
===
--- include/asm-ia64/bits/pod.h (revision 3441)
+++ include/asm-ia64/bits/pod.h (working copy)
@@ -100,12 +100,6 @@ static inline void xnarch_switch_to(xnar
}
 }
 
-static inline void xnarch_finalize_and_switch(xnarchtcb_t * dead_tcb,
- xnarchtcb_t * next_tcb)
-{
-   xnarch_switch_to(dead_tcb, next_tcb);
-}
-
 static inline void xnarch_finalize_no_switch(xnarchtcb_t * dead_tcb)
 {
/* Empty */
Index: include/asm-blackfin/bits/pod.h
===
--- include/asm-blackfin/bits/pod.h (revision 3441)
+++ include/asm-blackfin/bits/pod.h (working copy)
@@ -67,12 +67,6 @@ static inline void xnarch_switch_to(xnar
 rthal_thread_switch(out_tcb->tsp, in_tcb->tsp);
 }
 
-static inline void xnarch_finalize_and_switch(xnarchtcb_t * dead_tcb,
- xnarchtcb_t * next_tcb)
-{
-   xnarch_switch_to(dead_tcb, next_tcb);
-}
-
 static inline void xnarch_finalize_no_switch(xnarchtcb_t * dead_tcb)
 {
/* Empty */
Index: include/asm-arm/bits/pod.h
===
--- include/asm-arm/bits/pod.h  (revision 3441)
+++ include/asm-arm/bits/pod.h  (working copy)
@@ -96,12 +96,6 @@ static inline void xnarch_switch_to(xnar
 rthal_thread_switch(prev, out_tcb->tip, in_tcb->tip);
 }
 
-static inline void xnarch_finalize_and_switch(xnarchtcb_t * dead_tcb,
- xnarchtcb_t * next_tcb)
-{
-   xnarch_switch_to(dead_tcb, next_tcb);
-}
-
 static inline void xnarch_finalize_no_switch(xnarchtcb_t * dead_tcb)
 {
/* Empty */
Index: include/asm-powerpc/bits/pod.h
===
--- include/asm-powerpc/bits/pod.h  (revision 3441)
+++ include/asm-powerpc/bits/pod.h  (working copy)
@@ -106,12 +106,6 @@ static inline void xnarch_switch_to(xnar
barrier();
 }
 
-static inline void xnarch_finalize_and_switch(xnarchtcb_t * dead_tcb,
- xnarchtcb_t * next_tcb)
-{
-   xnarch_switch_to(dead_tcb, next_tcb);
-}
-
 static inline void xnarch_finalize_no_switch(xnarchtcb_t * dead_tcb)
 {
/* Empty */
Index: include/asm-x86/bits/pod_64.h
===
--- include/asm-x86/bits/pod_64.h   (revision 3441)
+++ include/asm-x86/bits/pod_64.h   (working copy)
@@ -96,12 +96,6 @@ static inline void xnarch_switch_to(xnar
stts();
 }
 
-static inline void xnarch_finalize_and_switch(xnarchtcb_t * dead_tcb,
- xnarchtcb_t * next_tcb)
-{
-   xnarch_switch_to(dead_tcb, next_tcb);
-}
-
 static inline void xnarch_finalize_no_switch(xnarchtcb_t * dead_tcb)
 {
/* Empty */
Index: include/asm-x86/bits/pod_32.h
===
--- include/asm-x86/bits/pod_32.h   (revision 3441)
+++ include/asm-x86/bits/pod_32.h   (working copy)
@@ -123,12 +123,6 @@ static inline void xnarch_switch_to(xnar
stts();
 }
 
-static inline void xnarch_finalize_and_switch(xnarchtcb_t * dead_tcb,
- xnarchtcb_t * next_tcb)
-{
-   xnarch_switch_to(dead_tcb, next_tcb);
-}
-
 static inline void xnarch_finalize_no_switch(xnarchtcb_t * dead_tcb)
 {
/* Empty */
Index: 

Re: [Xenomai-core] High latencies on ARM.

2008-01-22 Thread Gilles Chanteperdrix
Gilles Chanteperdrix wrote:
  Please find attached a patch implementing these ideas. This adds some
  clutter, which I would be happy to reduce. Better ideas are welcome.
  

Ok. New version of the patch, this time split in two parts, should
hopefully make it more readable.

  

- avoid using user-space real-time tasks when running latency
kernel-space benches, i.e. at least in the latency -t 1 and latency -t
2 case. This means that we should change the timerbench driver. There
are at least two ways of doing this:
use an rt_pipe
 modify the timerbench driver to implement only the nrt ioctl, using
vanilla linux services such as wait_event and wake_up.

What do you think ?
  
  So, what do you think is the best way to change the timerbench driver,
  * use an rt_pipe ? Pros: allows to run latency -t 1 and latency -t 2 even
   if Xenomai is compiled with CONFIG_XENO_OPT_PERVASIVE off; cons: make
   the timerbench non portable on other implementations of rtdm, eg. rtdm
   over rtai or the version of rtdm which runs over vanilla linux
  * modify the timerbench driver to implement only nrt ioctls ? Pros:
better driver portability; cons: latency would still need
CONFIG_XENO_OPT_PERVASIVE to run latency -t 1 and latency -t 2.
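
To make the second option concrete, the nrt-only variant boils down to
standard Linux wait queues inside the driver; the structure and function
names below are illustrative, not the actual timerbench code:

#include <linux/wait.h>
#include <linux/sched.h>

/* Illustrative driver-side state (init_waitqueue_head() must be called
 * on result_wait at setup time). */
struct bench_state {
	wait_queue_head_t result_wait;
	int result_ready;
	long last_latency_ns;
};

/* Producer side: must run from a context where wake_up() is legal,
 * i.e. the Linux domain -- which is precisely the constraint that
 * makes this variant nrt-only. */
static void bench_post_result(struct bench_state *st, long latency_ns)
{
	st->last_latency_ns = latency_ns;
	st->result_ready = 1;
	wake_up(&st->result_wait);
}

/* Consumer side: called from the nrt ioctl handler, in plain Linux task
 * context, so wait_event_interruptible() is fine here. */
static int bench_wait_result(struct bench_state *st, long *latency_ns)
{
	int ret = wait_event_interruptible(st->result_wait, st->result_ready);

	if (ret)
		return ret;	/* interrupted by a signal */

	st->result_ready = 0;
	*latency_ns = st->last_latency_ns;
	return 0;
}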

-- 


Gilles Chanteperdrix.
Index: include/nucleus/pod.h
===
--- include/nucleus/pod.h   (revision 3405)
+++ include/nucleus/pod.h   (working copy)
@@ -139,6 +139,7 @@
 
 xntimer_t htimer;   /*!< Host timer. */
 
+   xnqueue_t zombies;
 } xnsched_t;
 
 #define nkpod (&nkpod_struct)
@@ -238,6 +239,14 @@
 }
 #endif /* CONFIG_XENO_OPT_WATCHDOG */
 
+void __xnpod_finalize_zombies(xnsched_t *sched);
+
+static inline void xnpod_finalize_zombies(xnsched_t *sched)
+{
+   if (!emptyq_p(&sched->zombies))
+   __xnpod_finalize_zombies(sched);
+}
+
/* -- Beginning of the exported interface */
 
 #define xnpod_sched_slot(cpu) \
Index: ksrc/nucleus/pod.c
===
--- ksrc/nucleus/pod.c  (revision 3415)
+++ ksrc/nucleus/pod.c  (working copy)
@@ -292,6 +292,7 @@
 #endif /* CONFIG_SMP */
 xntimer_set_name(&sched->htimer, htimer_name);
 xntimer_set_sched(&sched->htimer, sched);
+   initq(&sched->zombies);
}
 
 xnlock_put_irqrestore(&nklock, s);
@@ -545,63 +546,28 @@
 __clrbits(sched->status, XNKCOUT);
 }
 
-static inline void xnpod_switch_zombie(xnthread_t *threadout,
-  xnthread_t *threadin)
+void __xnpod_finalize_zombies(xnsched_t *sched)
 {
-   /* Must be called with nklock locked, interrupts off. */
-   xnsched_t *sched = xnpod_current_sched();
-#ifdef CONFIG_XENO_OPT_PERVASIVE
-   int shadow = xnthread_test_state(threadout, XNSHADOW);
-#endif /* CONFIG_XENO_OPT_PERVASIVE */
+   xnholder_t *holder;
 
-   trace_mark(xn_nucleus_sched_finalize,
-  "thread_out %p thread_out_name %s "
-  "thread_in %p thread_in_name %s",
-  threadout, xnthread_name(threadout),
-  threadin, xnthread_name(threadin));
+   while ((holder = getq(&sched->zombies))) {
+   xnthread_t *thread = link2thread(holder, glink);
 
-   if (!emptyq_p(&nkpod->tdeleteq) && !xnthread_test_state(threadout, XNROOT)) {
-   trace_mark(xn_nucleus_thread_callout,
-  "thread %p thread_name %s hook %s",
-  threadout, xnthread_name(threadout), "DELETE");
-   xnpod_fire_callouts(&nkpod->tdeleteq, threadout);
-   }
+   /* Must be called with nklock locked, interrupts off. */
+   trace_mark(xn_nucleus_sched_finalize,
+  "thread_out %p thread_out_name %s",
+  thread, xnthread_name(thread));
 
-   sched->runthread = threadin;
+   if (!emptyq_p(&nkpod->tdeleteq)
+   && !xnthread_test_state(thread, XNROOT)) {
+   trace_mark(xn_nucleus_thread_callout,
+  "thread %p thread_name %s hook %s",
+  thread, xnthread_name(thread), "DELETE");
+   xnpod_fire_callouts(&nkpod->tdeleteq, thread);
+   }
 
-   if (xnthread_test_state(threadin, XNROOT)) {
-   xnpod_reset_watchdog(sched);
-   xnfreesync();
-   xnarch_enter_root(xnthread_archtcb(threadin));
+   xnthread_cleanup_tcb(thread);
}
-
-   /* FIXME: Catch 22 here, whether we choose to run on an invalid
-  stack (cleanup then hooks), or to access the TCB space shortly
-  after it has been freed while non-preemptible (hooks then
-  cleanup)... Option #2 is current. */
-
-   xnthread_cleanup_tcb(threadout);
-
-   

Re: [Xenomai-core] High latencies on ARM.

2008-01-22 Thread Gilles Chanteperdrix
Gilles Chanteperdrix wrote:
  Hi,
  
  after some (unsuccessful) time trying to instrument the code in a way
  that does not change the latency results completely, I found the
  reason for the high latency with latency -t 1 and latency -t 2 on ARM.
  So, here comes an update on this issue. The culprit is the user-space
  context switch, which flushes the processor cache with the nklock
  locked, irqs off.
  
  There are two things we could do:
  - arrange for the ARM cache flush to happen with the nklock unlocked
  and irqs enabled. This will improve interrupt latency (latency -t 2)
  but obviously not scheduling latency (latency -t 1). If we go that
  way, there are several problems we should solve:
  
  we do not want interrupt handlers to reenter xnpod_schedule(), for
  this we can use the XNLOCK bit, set on whatever is
  xnpod_current_thread() when the cache flush occurs
  
  since the interrupt handler may modify the rescheduling bits, we need
  to test these bits in xnpod_schedule() epilogue and restart
  xnpod_schedule() if need be
  
  we do not want xnpod_delete_thread() to delete one of the two threads
  involved in the context switch, for this the only solution I found is
  to add a bit to the thread mask meaning that the thread is currently
  switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue
  to delete whatever thread was marked for deletion
  
  in case of migration with xnpod_migrate_thread, we do not want
  xnpod_schedule() on the target CPU to switch to the migrated thread
  before the context switch on the source CPU is finished, for this we
  can avoid setting the resched bit in xnpod_migrate_thread(), detect
  the condition in xnpod_schedule() epilogue and set the rescheduling
  bits so that xnpod_schedule is restarted and send the IPI to the
  target CPU.

Please find attached a patch implementing these ideas. This adds some
clutter, which I would be happy to reduce. Better ideas are welcome.
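
As a side note, the epilogue handling sketched in the quoted points above
could look roughly like this; the "last" variable and the resched test are
placeholders for whatever the patch actually uses (editorial illustration
only, not the attached patch):

	/* xnpod_schedule() epilogue, after the unlocked cache flush and
	 * context switch: re-check what interrupt handlers and
	 * xnpod_delete_thread() may have done meanwhile. */
	xnlock_get_irqsave(&nklock, s);

	/* The switched-out thread may have been marked for deletion
	 * while the switch was in progress. */
	if (xnthread_test_state(last, XNZOMBIE))
		xnpod_finalize_zombies(sched);

	/* Interrupt handlers may have set the rescheduling bits again;
	 * if so, restart the whole scheduling procedure. */
	if (resched_bits_set(sched))	/* placeholder for the actual test */
		goto reschedule;

	xnlock_put_irqrestore(&nklock, s);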


  
  - avoid using user-space real-time tasks when running latency
  kernel-space benches, i.e. at least in the latency -t 1 and latency -t
  2 case. This means that we should change the timerbench driver. There
  are at least two ways of doing this:
  use an rt_pipe
   modify the timerbench driver to implement only the nrt ioctl, using
  vanilla linux services such as wait_event and wake_up.
  
  What do you think ?

So, what do you think is the best way to change the timerbench driver,
* use an rt_pipe ? Pros: allows to run latency -t 1 and latency -t 2 even
 if Xenomai is compiled with CONFIG_XENO_OPT_PERVASIVE off; cons: make
 the timerbench non portable on other implementations of rtdm, eg. rtdm
 over rtai or the version of rtdm which runs over vanilla linux
* modify the timerbench driver to implement only nrt ioctls ? Pros:
  better driver portability; cons: latency would still need
  CONFIG_XENO_OPT_PERVASIVE to run latency -t 1 and latency -t 2.

-- 


Gilles Chanteperdrix.
Index: include/asm-arm/bits/pod.h
===
--- include/asm-arm/bits/pod.h  (revision 3405)
+++ include/asm-arm/bits/pod.h  (working copy)
@@ -67,41 +67,41 @@
 #endif /* TIF_MMSWITCH_INT */
 }
 
-static inline void xnarch_switch_to(xnarchtcb_t * out_tcb, xnarchtcb_t * in_tcb)
-{
-   struct task_struct *prev = out_tcb->active_task;
-   struct mm_struct *prev_mm = out_tcb->active_mm;
-   struct task_struct *next = in_tcb->user_task;
-
-
-   if (likely(next != NULL)) {
-   in_tcb->active_task = next;
-   in_tcb->active_mm = in_tcb->mm;
-   rthal_clear_foreign_stack(&rthal_domain);
-   } else {
-   in_tcb->active_task = prev;
-   in_tcb->active_mm = prev_mm;
-   rthal_set_foreign_stack(&rthal_domain);
-   }
-
-   if (prev_mm != in_tcb->active_mm) {
-   /* Switch to new user-space thread? */
-   if (in_tcb->active_mm)
-   switch_mm(prev_mm, in_tcb->active_mm, next);
-   if (!next->mm)
-   enter_lazy_tlb(prev_mm, next);
-   }
-
-   /* Kernel-to-kernel context switch. */
-   rthal_thread_switch(prev, out_tcb->tip, in_tcb->tip);
+#define xnarch_switch_to(_out_tcb, _in_tcb, lock)  \
+{  \
+   xnarchtcb_t *in_tcb = (_in_tcb);\
+   xnarchtcb_t *out_tcb = (_out_tcb);  \
+   struct task_struct *prev = out_tcb->active_task;\
+   struct mm_struct *prev_mm = out_tcb->active_mm; \
+   struct task_struct *next = in_tcb->user_task;   \
+   \
+   \
+   if (likely(next != NULL)) {

Re: [Xenomai-core] High latencies on ARM.

2008-01-22 Thread Jan Kiszka

Gilles Chanteperdrix wrote:

Gilles Chanteperdrix wrote:
  Hi,
  
  after some (unsuccessful) time trying to instrument the code in a way

  that does not change the latency results completely, I found the
  reason for the high latency with latency -t 1 and latency -t 2 on ARM.
  So, here comes an update on this issue. The culprit is the user-space
  context switch, which flushes the processor cache with the nklock
  locked, irqs off.
  
  There are two things we could do:

  - arrange for the ARM cache flush to happen with the nklock unlocked
  and irqs enabled. This will improve interrupt latency (latency -t 2)
  but obviously not scheduling latency (latency -t 1). If we go that
  way, there are several problems we should solve:
  
  we do not want interrupt handlers to reenter xnpod_schedule(), for

  this we can use the XNLOCK bit, set on whatever is
  xnpod_current_thread() when the cache flush occurs
  
  since the interrupt handler may modify the rescheduling bits, we need

  to test these bits in xnpod_schedule() epilogue and restart
  xnpod_schedule() if need be
  
  we do not want xnpod_delete_thread() to delete one of the two threads

  involved in the context switch, for this the only solution I found is
  to add a bit to the thread mask meaning that the thread is currently
  switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue
  to delete whatever thread was marked for deletion
  
  in case of migration with xnpod_migrate_thread, we do not want

  xnpod_schedule() on the target CPU to switch to the migrated thread
  before the context switch on the source CPU is finished, for this we
  can avoid setting the resched bit in xnpod_migrate_thread(), detect
  the condition in xnpod_schedule() epilogue and set the rescheduling
  bits so that xnpod_schedule is restarted and send the IPI to the
  target CPU.

Please find attached a patch implementing these ideas. This adds some
clutter, which I would be happy to reduce. Better ideas are welcome.



I tried to cross-read the patch (-p would have been nice) but failed - 
this needs to be applied on some tree. Does the patch improve ARM 
latencies already?




  
  - avoid using user-space real-time tasks when running latency

  kernel-space benches, i.e. at least in the latency -t 1 and latency -t
  2 case. This means that we should change the timerbench driver. There
  are at least two ways of doing this:
  use an rt_pipe
   modify the timerbench driver to implement only the nrt ioctl, using
  vanilla linux services such as wait_event and wake_up.
  
  What do you think ?


So, what do you think is the best way to change the timerbench driver,
* use an rt_pipe ? Pros: allows to run latency -t 1 and latency -t 2 even
 if Xenomai is compiled with CONFIG_XENO_OPT_PERVASIVE off; cons: make
 the timerbench non portable on other implementations of rtdm, eg. rtdm
 over rtai or the version of rtdm which runs over vanilla linux
* modify the timerbench driver to implement only nrt ioctls ? Pros:
  better driver portability; cons: latency would still need
  CONFIG_XENO_OPT_PERVASIVE to run latency -t 1 and latency -t 2.


I'm still voting for my third approach:

 - Write latency as kernel application (klatency) against the
timerbench device
 - Call NRT IOCTLs of timerbench during module init/cleanup
 - Use module parameters for customization
 - Setup a low-prio kernel-based RT task to issue the RT IOCTLs
 - Format the results nicely (similar to userland latency) in that RT
task and stuff them into some rtpipe
 - Use cat /dev/rtpipeX to display the results
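
For the record, a skeleton of such a klatency module might look roughly as
follows; everything here (task parameters, pipe usage, the omitted ioctl
plumbing) is illustrative only, and the native/RTDM kernel prototypes should
be checked against the Xenomai release in use:

/* Editorial sketch of the proposal above -- not a working module. */
#include <linux/module.h>
#include <native/task.h>
#include <native/pipe.h>

static RT_TASK sampler;
static RT_PIPE results;		/* read from userland via cat /dev/rtpN */

static void sampler_task(void *arg)
{
	char line[128];
	int len;

	for (;;) {
		/* Issue the RT IOCTLs of timerbench here to collect one
		 * round of measurements (ioctl codes omitted on purpose). */

		/* Format the results similarly to the userland latency
		 * tool, then push the line down the message pipe. */
		len = snprintf(line, sizeof(line),
			       "lat min/avg/max ... (sketch)\n");
		rt_pipe_write(&results, line, len, P_NORMAL);
	}
}

static int __init klatency_init(void)
{
	/* NRT setup: open the timerbench device and call its NRT IOCTLs,
	 * create the pipe and a low-prio RT task, then start the task.
	 * Creation calls are omitted; check the rt_pipe_create() and
	 * rt_task_create() prototypes of the target release. */
	return 0;
}

static void __exit klatency_exit(void)
{
	/* Tear everything down in reverse order. */
}

module_init(klatency_init);
module_exit(klatency_exit);
MODULE_LICENSE("GPL");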

Jan



signature.asc
Description: OpenPGP digital signature
___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-22 Thread Gilles Chanteperdrix
Jan Kiszka wrote:
  Gilles Chanteperdrix wrote:
   Gilles Chanteperdrix wrote:
 Hi,
 
 after some (unsuccessful) time trying to instrument the code in a way
 that does not change the latency results completely, I found the
 reason for the high latency with latency -t 1 and latency -t 2 on ARM.
 So, here comes an update on this issue. The culprit is the user-space
 context switch, which flushes the processor cache with the nklock
 locked, irqs off.
 
 There are two things we could do:
 - arrange for the ARM cache flush to happen with the nklock unlocked
 and irqs enabled. This will improve interrupt latency (latency -t 2)
 but obviously not scheduling latency (latency -t 1). If we go that
 way, there are several problems we should solve:
 
 we do not want interrupt handlers to reenter xnpod_schedule(), for
 this we can use the XNLOCK bit, set on whatever is
 xnpod_current_thread() when the cache flush occurs
 
 since the interrupt handler may modify the rescheduling bits, we need
 to test these bits in xnpod_schedule() epilogue and restart
 xnpod_schedule() if need be
 
 we do not want xnpod_delete_thread() to delete one of the two threads
 involved in the context switch, for this the only solution I found is
 to add a bit to the thread mask meaning that the thread is currently
 switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue
 to delete whatever thread was marked for deletion
 
 in case of migration with xnpod_migrate_thread, we do not want
 xnpod_schedule() on the target CPU to switch to the migrated thread
 before the context switch on the source CPU is finished, for this we
 can avoid setting the resched bit in xnpod_migrate_thread(), detect
 the condition in xnpod_schedule() epilogue and set the rescheduling
 bits so that xnpod_schedule is restarted and send the IPI to the
 target CPU.
   
   Please find attached a patch implementing these ideas. This adds some
   clutter, which I would be happy to reduce. Better ideas are welcome.
   
  
  I tried to cross-read the patch (-p would have been nice) but failed - 
  this needs to be applied on some tree. Does the patch improve ARM 
  latencies already?

I split the patch in two parts in another post; this should make it
easier to read.

  
   
 
 - avoid using user-space real-time tasks when running latency
 kernel-space benches, i.e. at least in the latency -t 1 and latency -t
 2 case. This means that we should change the timerbench driver. There
 are at least two ways of doing this:
 use an rt_pipe
  modify the timerbench driver to implement only the nrt ioctl, using
 vanilla linux services such as wait_event and wake_up.
 
 What do you think ?
   
   So, what do you think is the best way to change the timerbench driver,
   * use an rt_pipe ? Pros: allows to run latency -t 1 and latency -t 2 even
if Xenomai is compiled with CONFIG_XENO_OPT_PERVASIVE off; cons: make
the timerbench non portable on other implementations of rtdm, eg. rtdm
over rtai or the version of rtdm which runs over vanilla linux
   * modify the timerbench driver to implement only nrt ioctls ? Pros:
 better driver portability; cons: latency would still need
 CONFIG_XENO_OPT_PERVASIVE to run latency -t 1 and latency -t 2.
  
  I'm still voting for my third approach:
  
- Write latency as kernel application (klatency) against the
   timerbench device
- Call NRT IOCTLs of timerbench during module init/cleanup
- Use module parameters for customization
- Setup a low-prio kernel-based RT task to issue the RT IOCTLs
- Format the results nicely (similar to userland latency) in that RT
   task and stuff them into some rtpipe
- Use cat /dev/rtpipeX to display the results

Sorry, this mail is older than your last reply to my question. I had
problems with my MTA, so I resent all the mails which had not been sent;
I hoped they would go out with their original date preserved, but
unfortunately this is not the case.

Now, to answer your suggestion, I think that formatting the results
belongs to user-space, not to kernel-space. Besides, emitting NRT ioctls
from module initialization and cleanup routines makes this klatency
module quite inflexible. I was rather thinking about implementing the RT
versions of the IOCTLs so that they could be called from a kernel-space
real-time task.

-- 


Gilles Chanteperdrix.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-22 Thread Gilles Chanteperdrix
Jan Kiszka wrote:
  Does the patch improve ARM latencies already?

Yes, it does. The (interrupt) latency goes from above 100us to
80us. This is not yet 50us, though.

-- 


Gilles Chanteperdrix.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-21 Thread Gilles Chanteperdrix
Jan Kiszka wrote:
  Gilles Chanteperdrix wrote:
   On Jan 17, 2008 3:16 PM, Jan Kiszka [EMAIL PROTECTED] wrote:
   Gilles Chanteperdrix wrote:
   On Jan 17, 2008 12:55 PM, Jan Kiszka [EMAIL PROTECTED] wrote:
   Gilles Chanteperdrix wrote:
   On Jan 17, 2008 11:42 AM, Jan Kiszka [EMAIL PROTECTED] wrote:
   Gilles Chanteperdrix wrote:
   Hi,
  
   after some (unsuccessful) time trying to instrument the code in a way
   that does not change the latency results completely, I found the
   reason for the high latency with latency -t 1 and latency -t 2 on 
   ARM.
   So, here comes an update on this issue. The culprit is the user-space
   context switch, which flushes the processor cache with the nklock
   locked, irqs off.
  
   There are two things we could do:
   - arrange for the ARM cache flush to happen with the nklock unlocked
   and irqs enabled. This will improve interrupt latency (latency -t 2)
   but obviously not scheduling latency (latency -t 1). If we go that
   way, there are several problems we should solve:
  
   we do not want interrupt handlers to reenter xnpod_schedule(), for
   this we can use the XNLOCK bit, set on whatever is
   xnpod_current_thread() when the cache flush occurs
  
   since the interrupt handler may modify the rescheduling bits, we need
   to test these bits in xnpod_schedule() epilogue and restart
   xnpod_schedule() if need be
  
   we do not want xnpod_delete_thread() to delete one of the two threads
   involved in the context switch, for this the only solution I found is
   to add a bit to the thread mask meaning that the thread is currently
   switching, and to (re)test the XNZOMBIE bit in xnpod_schedule 
   epilogue
   to delete whatever thread was marked for deletion
  
   in case of migration with xnpod_migrate_thread, we do not want
   xnpod_schedule() on the target CPU to switch to the migrated thread
   before the context switch on the source CPU is finished, for this we
   can avoid setting the resched bit in xnpod_migrate_thread(), detect
   the condition in xnpod_schedule() epilogue and set the rescheduling
   bits so that xnpod_schedule is restarted and send the IPI to the
   target CPU.
  
   - avoid using user-space real-time tasks when running latency
   kernel-space benches, i.e. at least in the latency -t 1 and latency 
   -t
   2 case. This means that we should change the timerbench driver. There
   are at least two ways of doing this:
   use an rt_pipe
modify the timerbench driver to implement only the nrt ioctl, using
   vanilla linux services such as wait_event and wake_up.
   [As you reminded me of this unanswered question:]
   One may consider adding further modes _besides_ current kernel tests
   that do not rely on RTDM & native userland support (e.g. when
   CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are 
   valid
   scenarios as well that must not be killed by such a change.
   I think the current test scenarios for latency -t 1 and latency -t 2
   are a bit misleading: they measure kernel-space latencies in the presence
   of user-space real-time tasks. When one runs latency -t 1 or latency
   -t 2, one would expect that there are only kernel-space real-time
   tasks.
   Whether they are misleading depends on your perspective. In fact, they are
   measuring in-kernel scenarios over the standard Xenomai setup, which
   includes userland RT task activity these days. Those scenarios are mainly
   targeting driver use cases, not pure kernel-space applications.
  
   But I agree that, for !CONFIG_XENO_OPT_PERVASIVE-like scenarios, we
   would benefit from an additional set of test cases.
   Ok, I will not touch timerbench then, and implement another kernel 
   module.
  
   [Without considering all details]
   To achieve this independence of user-space RT threads, it should suffice
   to implement a kernel-based frontend for timerbench. This frontend would
   then either dump to syslog or open some pipe to tell userland about the
   benchmark results. What do you think?
   
   My intent was to implement a protocol similar to the one of
   timerbench, but using an rt-pipe, and continue to use the latency
   test, adding new options such as -t 3 and -t 4. But there may be
   problems with this approach: if we are compiling without
   CONFIG_XENO_OPT_PERVASIVE, latency will not run at all. So, it is
   probably simpler to implement a klatency that just reads from the
   rt-pipe.
  
  But that klatency could perfectly reuse what timerbench already
  provides, without code changes to the latter, in theory.

In theory yes, but in practice timerbench's non-real-time ioctls use some
Linux services, so they cannot be called from the context of a
kernel-space task listening on a real-time pipe.

-- 


Gilles Chanteperdrix.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-17 Thread Jan Kiszka
Gilles Chanteperdrix wrote:
 Hi,
 
 after some (unsuccessful) time trying to instrument the code in a way
 that does not change the latency results completely, I found the
 reason for the high latency with latency -t 1 and latency -t 2 on ARM.
 So, here comes an update on this issue. The culprit is the user-space
 context switch, which flushes the processor cache with the nklock
 locked, irqs off.
 
 There are two things we could do:
 - arrange for the ARM cache flush to happen with the nklock unlocked
 and irqs enabled. This will improve interrupt latency (latency -t 2)
 but obviously not scheduling latency (latency -t 1). If we go that
 way, there are several problems we should solve:
 
 we do not want interrupt handlers to reenter xnpod_schedule(), for
 this we can use the XNLOCK bit, set on whatever is
 xnpod_current_thread() when the cache flush occurs
 
 since the interrupt handler may modify the rescheduling bits, we need
 to test these bits in xnpod_schedule() epilogue and restart
 xnpod_schedule() if need be
 
 we do not want xnpod_delete_thread() to delete one of the two threads
 involved in the context switch, for this the only solution I found is
 to add a bit to the thread mask meaning that the thread is currently
 switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue
 to delete whatever thread was marked for deletion
 
 in case of migration with xnpod_migrate_thread, we do not want
 xnpod_schedule() on the target CPU to switch to the migrated thread
 before the context switch on the source CPU is finished, for this we
 can avoid setting the resched bit in xnpod_migrate_thread(), detect
 the condition in xnpod_schedule() epilogue and set the rescheduling
 bits so that xnpod_schedule is restarted and send the IPI to the
 target CPU.
 
 - avoid using user-space real-time tasks when running latency
 kernel-space benches, i.e. at least in the latency -t 1 and latency -t
 2 case. This means that we should change the timerbench driver. There
 are at least two ways of doing this:
 use an rt_pipe
  modify the timerbench driver to implement only the nrt ioctl, using
 vanilla linux services such as wait_event and wake_up.

[As you reminded me of this unanswered question:]
One may consider adding further modes _besides_ current kernel tests
that do not rely on RTDM & native userland support (e.g. when
CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are valid
scenarios as well that must not be killed by such a change.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-17 Thread Gilles Chanteperdrix
On Jan 17, 2008 11:42 AM, Jan Kiszka [EMAIL PROTECTED] wrote:

 Gilles Chanteperdrix wrote:
  Hi,
 
  after some (unsuccessful) time trying to instrument the code in a way
  that does not change the latency results completely, I found the
  reason for the high latency with latency -t 1 and latency -t 2 on ARM.
  So, here comes an update on this issue. The culprit is the user-space
  context switch, which flushes the processor cache with the nklock
  locked, irqs off.
 
  There are two things we could do:
  - arrange for the ARM cache flush to happen with the nklock unlocked
  and irqs enabled. This will improve interrupt latency (latency -t 2)
  but obviously not scheduling latency (latency -t 1). If we go that
  way, there are several problems we should solve:
 
  we do not want interrupt handlers to reenter xnpod_schedule(), for
  this we can use the XNLOCK bit, set on whatever is
  xnpod_current_thread() when the cache flush occurs
 
  since the interrupt handler may modify the rescheduling bits, we need
  to test these bits in xnpod_schedule() epilogue and restart
  xnpod_schedule() if need be
 
  we do not want xnpod_delete_thread() to delete one of the two threads
  involved in the context switch, for this the only solution I found is
  to add a bit to the thread mask meaning that the thread is currently
  switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue
  to delete whatever thread was marked for deletion
 
  [...]

 [As you reminded me of this unanswered question:]
 One may consider adding further modes _besides_ the current kernel tests,
 modes that do not rely on RTDM & native userland support (e.g. when
 CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are valid
 scenarios as well and must not be killed by such a change.

I think the current test scenarios for latency -t 1 and latency -t 2
are a bit misleading: they measure kernel-space latencies in the presence
of user-space real-time tasks. When one runs latency -t 1 or latency
-t 2, one would expect that there are only kernel-space real-time
tasks.

-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-17 Thread Jan Kiszka
Gilles Chanteperdrix wrote:
 [...]
 
 I think the current test scenarios for latency -t 1 and latency -t 2
 are a bit misleading: they measure kernel-space latencies in the presence
 of user-space real-time tasks. When one runs latency -t 1 or latency
 -t 2, one would expect that there are only kernel-space real-time
 tasks.

Whether they are misleading depends on your perspective. In fact, they are
measuring in-kernel scenarios over the standard Xenomai setup, which
includes userland RT task activity these days. Those scenarios are mainly
targeting driver use cases, not pure kernel-space applications.

But I agree that, for !CONFIG_XENO_OPT_PERVASIVE-like scenarios, we
would benefit from an additional set of test cases.

Jan
-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-17 Thread Gilles Chanteperdrix
On Jan 17, 2008 3:16 PM, Jan Kiszka [EMAIL PROTECTED] wrote:

 Gilles Chanteperdrix wrote:
  [...]
 
  But I agree that, for !CONFIG_XENO_OPT_PERVASIVE-like scenarios, we
  would benefit from an additional set of test cases.
 
  Ok, I will not touch timerbench then, and implement another kernel module.
 

 [Without considering all details]
 To achieve this independence of user-space RT threads, it should suffice
 to implement a kernel-based frontend for timerbench. This frontend would
 then either dump to syslog or open some pipe to tell userland about the
 benchmark results. What do you think?

My intent was to implement a protocol similar to that of
timerbench, but using an rt-pipe, and to continue using the latency
test, adding new options such as -t 3 and -t 4. But there may be
problems with this approach: if we compile without
CONFIG_XENO_OPT_PERVASIVE, latency will not run at all. So, it is
probably simpler to implement a klatency test that just reads from the
rt-pipe.
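
To make that concrete, here is a rough, untested sketch of what such a
klatency module could look like, assuming the 2.x native skin kernel API
(rt_task_create/rt_task_set_periodic/rt_task_wait_period, rt_pipe_create
with the 4-argument signature, rt_pipe_write); the pipe minor, period,
priority and sample layout are arbitrary choices for the example:

/* Sketch only, not a tested module. */
#include <linux/module.h>
#include <native/task.h>
#include <native/timer.h>
#include <native/pipe.h>

#define KLAT_PIPE_MINOR  0              /* shows up as /dev/rtp0 */
#define KLAT_PERIOD_NS   1000000ULL     /* 1 ms sampling period */

struct klat_sample {
    long long delay_ns;                 /* measured wakeup latency */
};

static RT_TASK klat_task;
static RT_PIPE klat_pipe;

static void klat_loop(void *arg)
{
    RTIME expected = rt_timer_read() + KLAT_PERIOD_NS;
    struct klat_sample s;

    rt_task_set_periodic(NULL, TM_NOW, KLAT_PERIOD_NS);

    for (;;) {
        rt_task_wait_period(NULL);
        /* latency = actual wakeup date - expected wakeup date */
        s.delay_ns = (long long)(rt_timer_read() - expected);
        expected += KLAT_PERIOD_NS;
        /* push the sample to userland over the message pipe */
        rt_pipe_write(&klat_pipe, &s, sizeof(s), P_NORMAL);
    }
}

static int __init klat_init(void)
{
    int err = rt_pipe_create(&klat_pipe, "klat", KLAT_PIPE_MINOR, 0);

    if (err)
        return err;

    err = rt_task_create(&klat_task, "klat", 0, 99, 0);
    if (!err) {
        err = rt_task_start(&klat_task, klat_loop, NULL);
        if (err)
            rt_task_delete(&klat_task);
    }
    if (err)
        rt_pipe_delete(&klat_pipe);
    return err;
}

static void __exit klat_exit(void)
{
    rt_task_delete(&klat_task);
    rt_pipe_delete(&klat_pipe);
}

module_init(klat_init);
module_exit(klat_exit);
MODULE_LICENSE("GPL");

Userland would then just read fixed-size samples from /dev/rtp0; no
Xenomai userland support is needed for that.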

-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-17 Thread Jan Kiszka
Gilles Chanteperdrix wrote:
 [...]
 
 My intent was to implement a protocol similar to that of
 timerbench, but using an rt-pipe, and to continue using the latency
 test, adding new options such as -t 3 and -t 4. But there may be
 problems with this approach: if we compile without
 CONFIG_XENO_OPT_PERVASIVE, latency will not run at all. So, it is
 probably simpler to implement a klatency test that just reads from the
 rt-pipe.

But that klatency test could perfectly well reuse what timerbench already
provides, without code changes to the latter, in theory.

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-17 Thread Gilles Chanteperdrix
On Jan 17, 2008 12:55 PM, Jan Kiszka [EMAIL PROTECTED] wrote:

 Gilles Chanteperdrix wrote:
  [...]
 
  I think the current test scenarios for latency -t 1 and latency -t 2
  are a bit misleading: they measure kernel-space latencies in the presence
  of user-space real-time tasks. When one runs latency -t 1 or latency
  -t 2, one would expect that there are only kernel-space real-time
  tasks.

 Whether they are misleading depends on your perspective. In fact, they are
 measuring in-kernel scenarios over the standard Xenomai setup, which
 includes userland RT task activity these days. Those scenarios are mainly
 targeting driver use cases, not pure kernel-space applications.

 But I agree that, for !CONFIG_XENO_OPT_PERVASIVE-like scenarios, we
 would benefit from an additional set of test cases.

Ok, I will not touch timerbench then, and implement another kernel module.

-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-17 Thread Jan Kiszka
Jan Kiszka wrote:
 Gilles Chanteperdrix wrote:
 [...]
 Ok, I will not touch timerbench then, and implement another kernel module.

 
 [Without considering all details]
 To achieve this independence of user-space RT threads, it should suffice
 to implement a kernel-based frontend for timerbench. This frontend would
 then either dump to syslog or open some pipe to tell userland about the
 benchmark results. What do you think?
 

(That is only in case you meant reimplementing timerbench with
"implement another kernel module". Just write a kernel-hosted RTDM user
of timerbench.)
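
Roughly, such a kernel-hosted frontend could look like the untested
sketch below. It assumes the kernel-side RTDM calls rtdm_open(),
rtdm_ioctl() and rtdm_close(), plus the RTTST_RTIOC_TMBENCH_* interface
from <rtdm/rttesting.h>; the device name "rttest0" and the config field
names are from memory and would need to be checked against the actual
headers:

#include <linux/module.h>
#include <linux/fcntl.h>
#include <linux/string.h>
#include <rtdm/rtdm.h>
#include <rtdm/rttesting.h>

static int bench_fd = -1;

static int __init kbench_init(void)
{
    struct rttst_tmbench_config cfg;
    int err;

    bench_fd = rtdm_open("rttest0", O_RDWR);
    if (bench_fd < 0)
        return bench_fd;

    memset(&cfg, 0, sizeof(cfg));
    cfg.mode = RTTST_TMBENCH_TASK;      /* sample from a kernel RT task */
    cfg.priority = 99;
    cfg.period = 100000;                /* 100 us, in ns */

    err = rtdm_ioctl(bench_fd, RTTST_RTIOC_TMBENCH_START, &cfg);
    if (err) {
        rtdm_close(bench_fd);
        bench_fd = -1;
    }
    return err;
}

static void __exit kbench_exit(void)
{
    struct rttst_overall_bench_res res;

    if (bench_fd >= 0) {
        /* the overall min/avg/max figures land in res; dump them to
           syslog or push them down a pipe from here */
        rtdm_ioctl(bench_fd, RTTST_RTIOC_TMBENCH_STOP, &res);
        rtdm_close(bench_fd);
    }
}

module_init(kbench_init);
module_exit(kbench_exit);
MODULE_LICENSE("GPL");

A real frontend would also poll for intermediate results (the
RTTST_RTIOC_INTERM_BENCH_RES ioctl, if memory serves) and report them
periodically instead of only at module removal.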

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] High latencies on ARM.

2008-01-17 Thread Gilles Chanteperdrix
On Jan 17, 2008 3:22 PM, Jan Kiszka [EMAIL PROTECTED] wrote:

 Gilles Chanteperdrix wrote:
  [...]
 
  My intent was to implement a protocol similar to that of
  timerbench, but using an rt-pipe, and to continue using the latency
  test, adding new options such as -t 3 and -t 4. But there may be
  problems with this approach: if we compile without
  CONFIG_XENO_OPT_PERVASIVE, latency will not run at all. So, it is
  probably simpler to implement a klatency test that just reads from the
  rt-pipe.

 But that klatency test could perfectly well reuse what timerbench already
 provides, without code changes to the latter, in theory.

That would be a kernel module then, but I also need some user-space
piece of software to do the computations and print the results.
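
Nothing fancy should be needed on that side; a minimal reader along these
lines would do, assuming the sample layout and the /dev/rtp0 node used by
the kernel-side sketch earlier in this thread (plain libc only, so it also
runs when CONFIG_XENO_OPT_PERVASIVE is off):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct klat_sample {
    long long delay_ns;
};

int main(void)
{
    struct klat_sample s;
    long long min = 0, max = 0, sum = 0;
    long count = 0;
    int fd = open("/dev/rtp0", O_RDONLY);

    if (fd < 0) {
        perror("open(/dev/rtp0)");
        return EXIT_FAILURE;
    }

    /* each read returns one fixed-size sample pushed by the kernel side */
    while (read(fd, &s, sizeof(s)) == sizeof(s)) {
        if (count == 0 || s.delay_ns < min)
            min = s.delay_ns;
        if (count == 0 || s.delay_ns > max)
            max = s.delay_ns;
        sum += s.delay_ns;
        count++;

        if (count % 1000 == 0)  /* print a summary every 1000 samples */
            printf("min %lld ns, avg %lld ns, max %lld ns\n",
                   min, sum / count, max);
    }

    close(fd);
    return EXIT_SUCCESS;
}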

-- 
   Gilles Chanteperdrix

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core