Re: [Xenomai-core] High latencies on ARM.
Gilles Chanteperdrix wrote: > On Jan 17, 2008 3:22 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote: > > > > Gilles Chanteperdrix wrote: > > > On Jan 17, 2008 3:16 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote: > > >> Gilles Chanteperdrix wrote: > > >>> On Jan 17, 2008 12:55 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote: > > Gilles Chanteperdrix wrote: > > > On Jan 17, 2008 11:42 AM, Jan Kiszka <[EMAIL PROTECTED]> wrote: > > >> Gilles Chanteperdrix wrote: > > >>> Hi, > > >>> > > >>> after some (unsuccessful) time trying to instrument the code in a > > >>> way > > >>> that does not change the latency results completely, I found the > > >>> reason for the high latency with latency -t 1 and latency -t 2 on > > >>> ARM. > > >>> So, here comes an update on this issue. The culprit is the > > >>> user-space > > >>> context switch, which flushes the processor cache with the nklock > > >>> locked, irqs off. > > >>> > > >>> There are two things we could do: > > >>> - arrange for the ARM cache flush to happen with the nklock > > >>> unlocked > > >>> and irqs enabled. This will improve interrupt latency (latency -t > > >>> 2) > > >>> but obviously not scheduling latency (latency -t 1). If we go that > > >>> way, there are several problems we should solve: > > >>> > > >>> we do not want interrupt handlers to reenter xnpod_schedule(), for > > >>> this we can use the XNLOCK bit, set on whatever is > > >>> xnpod_current_thread() when the cache flush occurs > > >>> > > >>> since the interrupt handler may modify the rescheduling bits, we > > >>> need > > >>> to test these bits in xnpod_schedule() epilogue and restart > > >>> xnpod_schedule() if need be > > >>> > > >>> we do not want xnpod_delete_thread() to delete one of the two > > >>> threads > > >>> involved in the context switch, for this the only solution I found > > >>> is > > >>> to add a bit to the thread mask meaning that the thread is > > >>> currently > > >>> switching, and to (re)test the XNZOMBIE bit in xnpod_schedule > > >>> epilogue > > >>> to delete whatever thread was marked for deletion > > >>> > > >>> in case of migration with xnpod_migrate_thread, we do not want > > >>> xnpod_schedule() on the target CPU to switch to the migrated thread > > >>> before the context switch on the source CPU is finished, for this > > >>> we > > >>> can avoid setting the resched bit in xnpod_migrate_thread(), detect > > >>> the condition in xnpod_schedule() epilogue and set the rescheduling > > >>> bits so that xnpod_schedule is restarted and send the IPI to the > > >>> target CPU. > > >>> > > >>> - avoid using user-space real-time tasks when running latency > > >>> kernel-space benches, i.e. at least in the latency -t 1 and > > >>> latency -t > > >>> 2 case. This means that we should change the timerbench driver. > > >>> There > > >>> are at least two ways of doing this: > > >>> use an rt_pipe > > >>> modify the timerbench driver to implement only the nrt ioctl, > > >>> using > > >>> vanilla linux services such as wait_event and wake_up. > > >> [As you reminded me of this unanswered question:] > > >> One may consider adding further modes _besides_ current kernel tests > > >> that do not rely on RTDM & native userland support (e.g. when > > >> CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are > > >> valid > > >> scenarios as well that must not be killed by such a change. > > > I think the current test scenario for latency -t 1 and latency -t 2 > > > are a bit misleading: they measure kernel-space latencies in presence > > > of user-space real-time tasks. 
When one runs latency -t 1 or latency > > > -t 2, one would expect that there are only kernel-space real-time > > > tasks. > > If they are misleading, depends on your perspective. In fact, they are > > measuring in-kernel scenarios over the standard Xenomai setup, which > > includes userland RT task activity these day. Those scenarios are > > mainly > > targeting driver use cases, not pure kernel-space applications. > > > > But I agree that, for !CONFIG_XENO_OPT_PERVASIVE-like scenarios, we > > would benefit from an additional set of test cases. > > >>> Ok, I will not touch timerbench then, and implement another kernel > > >>> module. > > >>> > > >> [Without considering all details] > > >> To achieve this independence of user space RT thread, it should suffice > > >> to implement a kernel-based frontend for timerbench. This frontent would > > >> then either dump to syslog or open some pipe to tell userland about the > > >> benchmark results. What do yo think? > > > > > > My intent was to imple
Re: [Xenomai-core] High latencies on ARM.
Gilles Chanteperdrix wrote: > On Jan 28, 2008 12:34 AM, Philippe Gerum <[EMAIL PROTECTED]> wrote: > > > > Gilles Chanteperdrix wrote: > > > Philippe Gerum wrote: > > > > Gilles Chanteperdrix wrote: > > > > > Philippe Gerum wrote: > > > > > > Gilles Chanteperdrix wrote: > > > > > > > On Jan 23, 2008 7:34 PM, Philippe Gerum <[EMAIL PROTECTED]> > > > wrote: > > > > > > >> Gilles Chanteperdrix wrote: > > > > > > >>> On Jan 23, 2008 6:48 PM, Philippe Gerum <[EMAIL PROTECTED]> > > > wrote: > > > > > > Gilles Chanteperdrix wrote: > > > > > > > Gilles Chanteperdrix wrote: > > > > > > > > Please find attached a patch implementing these ideas. > > > This adds some > > > > > > > > clutter, which I would be happy to reduce. Better ideas > > > are welcome. > > > > > > > > > > > > > > > > > > > > > > Ok. New version of the patch, this time split in two > > > parts, should > > > > > > > hopefully make it more readable. > > > > > > > > > > > > > Ack. I'd suggest the following: > > > > > > > > > > > > - let's have a rate limiter when walking the zombie queue in > > > > > > __xnpod_finalize_zombies. We hold the superlock here, and > > > what the patch > > > > > > also introduces is the potential for flushing more than a > > > single TCB at > > > > > > a time, which might not always be a cheap operation, > > > depending on which > > > > > > cra^H^Hode runs on behalf of the deletion hooks for > > > instance. We may > > > > > > take for granted that no sane code would continuously > > > create more > > > > > > threads than we would be able to finalize in a given time > > > frame anyway. > > > > > > >>> The maximum number of zombies in the queue is > > > > > > >>> 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to > > > the queue > > > > > > >>> only if a deleted thread is xnpod_current_thread(), or if > > > the XNLOCKSW > > > > > > >>> bit is armed. > > > > > > >> Ack. rate_limit = 1? I'm really reluctant to increase the > > > WCET here, > > > > > > >> thread deletion isn't cheap already. > > > > > > > > > > > > > > I am not sure that holding the nklock while we run the thread > > > deletion > > > > > > > hooks is really needed. > > > > > > > > > > > > > > > > > > > Deletion hooks may currently rely on the following assumptions > > > when running: > > > > > > > > > > > > - rescheduling is locked > > > > > > - nklock is held, interrupts are off > > > > > > - they run on behalf of the deletor context > > > > > > > > > > > > The self-delete refactoring currently kills #3 because we now > > > run the > > > > > > hooks after the context switch, and would also kill #2 if we did > > > not > > > > > > hold the nklock (btw, enabling the nucleus debug while running > > > with this > > > > > > patch should raise an abort, from xnshadow_unmap, due to the > > > second > > > > > > assertion). > > > > > > > > > > > > > > Forget about this; shadows are always exited in secondary mode, so > > > > that's fine, i.e. xnpod_current_thread() != deleted thread, hence we > > > > should always run the deletion hooks immediately on behalf of the > > > caller. > > > > > > What happens if the watchdog kills a user-space thread which is > > > currently running in primary mode ? If I read xnpod_delete_thread > > > correctly, the SIGKILL signal is sent to the target thread only if it is > > > not the current thread. 
> > > > > > > I'd say: zombie queuing from xnpod_delete, then shadow unmap on behalf > > of the next switched context which would trigger the lo-stage unmap > > request -> wake_up_process against the Linux side and asbestos underwear > > provided by the relax epilogue, which would eventually reap the guy > > through do_exit(). As a matter of fact, we would still have the > > unmap-over-non-current issue, that's true. > > > > Ok, could we try coding a damn Tetris instead? Pong, maybe? Gasp... > > Games for mobile phones then, because I am afraid games for consoles > or PCs are too complicated for me. > > No, seriously, how do we solve this ? Maybe we could relax from > xnpod_delete_thread ? This will not work, xnpod_schedule will not let xnshadow_relax suspend the current thread while in interrupt context. -- Gilles Chanteperdrix. ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] High latencies on ARM.
On Jan 28, 2008 12:34 AM, Philippe Gerum <[EMAIL PROTECTED]> wrote: > > Gilles Chanteperdrix wrote: > > Philippe Gerum wrote: > > > Gilles Chanteperdrix wrote: > > > > Philippe Gerum wrote: > > > > > Gilles Chanteperdrix wrote: > > > > > > On Jan 23, 2008 7:34 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote: > > > > > >> Gilles Chanteperdrix wrote: > > > > > >>> On Jan 23, 2008 6:48 PM, Philippe Gerum <[EMAIL PROTECTED]> > > wrote: > > > > > Gilles Chanteperdrix wrote: > > > > > > Gilles Chanteperdrix wrote: > > > > > > > Please find attached a patch implementing these ideas. > > This adds some > > > > > > > clutter, which I would be happy to reduce. Better ideas > > are welcome. > > > > > > > > > > > > > > > > > > > Ok. New version of the patch, this time split in two parts, > > should > > > > > > hopefully make it more readable. > > > > > > > > > > > Ack. I'd suggest the following: > > > > > > > > > > - let's have a rate limiter when walking the zombie queue in > > > > > __xnpod_finalize_zombies. We hold the superlock here, and what > > the patch > > > > > also introduces is the potential for flushing more than a > > single TCB at > > > > > a time, which might not always be a cheap operation, depending > > on which > > > > > cra^H^Hode runs on behalf of the deletion hooks for instance. > > We may > > > > > take for granted that no sane code would continuously create > > more > > > > > threads than we would be able to finalize in a given time > > frame anyway. > > > > > >>> The maximum number of zombies in the queue is > > > > > >>> 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the > > queue > > > > > >>> only if a deleted thread is xnpod_current_thread(), or if the > > XNLOCKSW > > > > > >>> bit is armed. > > > > > >> Ack. rate_limit = 1? I'm really reluctant to increase the WCET > > here, > > > > > >> thread deletion isn't cheap already. > > > > > > > > > > > > I am not sure that holding the nklock while we run the thread > > deletion > > > > > > hooks is really needed. > > > > > > > > > > > > > > > > Deletion hooks may currently rely on the following assumptions when > > running: > > > > > > > > > > - rescheduling is locked > > > > > - nklock is held, interrupts are off > > > > > - they run on behalf of the deletor context > > > > > > > > > > The self-delete refactoring currently kills #3 because we now run > > the > > > > > hooks after the context switch, and would also kill #2 if we did not > > > > > hold the nklock (btw, enabling the nucleus debug while running with > > this > > > > > patch should raise an abort, from xnshadow_unmap, due to the second > > > > > assertion). > > > > > > > > > > > Forget about this; shadows are always exited in secondary mode, so > > > that's fine, i.e. xnpod_current_thread() != deleted thread, hence we > > > should always run the deletion hooks immediately on behalf of the caller. > > > > What happens if the watchdog kills a user-space thread which is > > currently running in primary mode ? If I read xnpod_delete_thread > > correctly, the SIGKILL signal is sent to the target thread only if it is > > not the current thread. > > > > I'd say: zombie queuing from xnpod_delete, then shadow unmap on behalf > of the next switched context which would trigger the lo-stage unmap > request -> wake_up_process against the Linux side and asbestos underwear > provided by the relax epilogue, which would eventually reap the guy > through do_exit(). As a matter of fact, we would still have the > unmap-over-non-current issue, that's true. 
> > Ok, could we try coding a damn Tetris instead? Pong, maybe? Gasp... Games for mobile phones then, because I am afraid games for consoles or PCs are too complicated for me. No, seriously, how do we solve this ? Maybe we could relax from xnpod_delete_thread ? -- Gilles Chanteperdrix ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] High latencies on ARM.
Gilles Chanteperdrix wrote: > Philippe Gerum wrote: > > Gilles Chanteperdrix wrote: > > > Philippe Gerum wrote: > > > > Gilles Chanteperdrix wrote: > > > > > On Jan 23, 2008 7:34 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote: > > > > >> Gilles Chanteperdrix wrote: > > > > >>> On Jan 23, 2008 6:48 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote: > > > > Gilles Chanteperdrix wrote: > > > > > Gilles Chanteperdrix wrote: > > > > > > Please find attached a patch implementing these ideas. This > adds some > > > > > > clutter, which I would be happy to reduce. Better ideas are > welcome. > > > > > > > > > > > > > > > > Ok. New version of the patch, this time split in two parts, > should > > > > > hopefully make it more readable. > > > > > > > > > Ack. I'd suggest the following: > > > > > > > > - let's have a rate limiter when walking the zombie queue in > > > > __xnpod_finalize_zombies. We hold the superlock here, and what > the patch > > > > also introduces is the potential for flushing more than a single > TCB at > > > > a time, which might not always be a cheap operation, depending > on which > > > > cra^H^Hode runs on behalf of the deletion hooks for instance. We > may > > > > take for granted that no sane code would continuously create more > > > > threads than we would be able to finalize in a given time frame > anyway. > > > > >>> The maximum number of zombies in the queue is > > > > >>> 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the > queue > > > > >>> only if a deleted thread is xnpod_current_thread(), or if the > XNLOCKSW > > > > >>> bit is armed. > > > > >> Ack. rate_limit = 1? I'm really reluctant to increase the WCET > here, > > > > >> thread deletion isn't cheap already. > > > > > > > > > > I am not sure that holding the nklock while we run the thread > deletion > > > > > hooks is really needed. > > > > > > > > > > > > > Deletion hooks may currently rely on the following assumptions when > running: > > > > > > > > - rescheduling is locked > > > > - nklock is held, interrupts are off > > > > - they run on behalf of the deletor context > > > > > > > > The self-delete refactoring currently kills #3 because we now run the > > > > hooks after the context switch, and would also kill #2 if we did not > > > > hold the nklock (btw, enabling the nucleus debug while running with > this > > > > patch should raise an abort, from xnshadow_unmap, due to the second > > > > assertion). > > > > > > > > Forget about this; shadows are always exited in secondary mode, so > > that's fine, i.e. xnpod_current_thread() != deleted thread, hence we > > should always run the deletion hooks immediately on behalf of the caller. > > What happens if the watchdog kills a user-space thread which is > currently running in primary mode ? If I read xnpod_delete_thread > correctly, the SIGKILL signal is sent to the target thread only if it is > not the current thread. > I'd say: zombie queuing from xnpod_delete, then shadow unmap on behalf of the next switched context which would trigger the lo-stage unmap request -> wake_up_process against the Linux side and asbestos underwear provided by the relax epilogue, which would eventually reap the guy through do_exit(). As a matter of fact, we would still have the unmap-over-non-current issue, that's true. Ok, could we try coding a damn Tetris instead? Pong, maybe? Gasp... -- Philippe. ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] High latencies on ARM.
Gilles Chanteperdrix wrote: > Philippe Gerum wrote: > > Gilles Chanteperdrix wrote: > > > On Jan 23, 2008 7:34 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote: > > >> Gilles Chanteperdrix wrote: > > >>> On Jan 23, 2008 6:48 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote: > > Gilles Chanteperdrix wrote: > > > Gilles Chanteperdrix wrote: > > > > Please find attached a patch implementing these ideas. This adds > some > > > > clutter, which I would be happy to reduce. Better ideas are > welcome. > > > > > > > > > > Ok. New version of the patch, this time split in two parts, should > > > hopefully make it more readable. > > > > > Ack. I'd suggest the following: > > > > - let's have a rate limiter when walking the zombie queue in > > __xnpod_finalize_zombies. We hold the superlock here, and what the > patch > > also introduces is the potential for flushing more than a single TCB > at > > a time, which might not always be a cheap operation, depending on > which > > cra^H^Hode runs on behalf of the deletion hooks for instance. We may > > take for granted that no sane code would continuously create more > > threads than we would be able to finalize in a given time frame > anyway. > > >>> The maximum number of zombies in the queue is > > >>> 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue > > >>> only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW > > >>> bit is armed. > > >> Ack. rate_limit = 1? I'm really reluctant to increase the WCET here, > > >> thread deletion isn't cheap already. > > > > > > I am not sure that holding the nklock while we run the thread deletion > > > hooks is really needed. > > > > > > > Deletion hooks may currently rely on the following assumptions when > running: > > > > - rescheduling is locked > > - nklock is held, interrupts are off > > - they run on behalf of the deletor context > > > > The self-delete refactoring currently kills #3 because we now run the > > hooks after the context switch, and would also kill #2 if we did not > > hold the nklock (btw, enabling the nucleus debug while running with this > > patch should raise an abort, from xnshadow_unmap, due to the second > > assertion). > > Forget about this; shadows are always exited in secondary mode, so that's fine, i.e. xnpod_current_thread() != deleted thread, hence we should always run the deletion hooks immediately on behalf of the caller. > > It should be possible to get rid of #3 for xnshadow_unmap (serious > > testing needed here), but we would have to grab the nklock from this > > routine anyway. > > Since the unmapped task is no longer running on the current CPU, is no > there any chance that it is run on another CPU by the time we get to > xnshadow_unmap ? > The unmapped task is running actually, and do_exit() may reschedule quite late until kernel preemption is eventually disabled, which happens long after the I-pipe notifier is fired. We would need the nklock to protect the RPI management too. -- Philippe. ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] High latencies on ARM.
Philippe Gerum wrote: > Gilles Chanteperdrix wrote: > > On Jan 23, 2008 7:34 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote: > >> Gilles Chanteperdrix wrote: > >>> On Jan 23, 2008 6:48 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote: > Gilles Chanteperdrix wrote: > > Gilles Chanteperdrix wrote: > > > Please find attached a patch implementing these ideas. This adds > > some > > > clutter, which I would be happy to reduce. Better ideas are welcome. > > > > > > > > Ok. New version of the patch, this time split in two parts, should > > hopefully make it more readable. > > > Ack. I'd suggest the following: > > - let's have a rate limiter when walking the zombie queue in > __xnpod_finalize_zombies. We hold the superlock here, and what the patch > also introduces is the potential for flushing more than a single TCB at > a time, which might not always be a cheap operation, depending on which > cra^H^Hode runs on behalf of the deletion hooks for instance. We may > take for granted that no sane code would continuously create more > threads than we would be able to finalize in a given time frame anyway. > >>> The maximum number of zombies in the queue is > >>> 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue > >>> only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW > >>> bit is armed. > >> Ack. rate_limit = 1? I'm really reluctant to increase the WCET here, > >> thread deletion isn't cheap already. > > > > I am not sure that holding the nklock while we run the thread deletion > > hooks is really needed. > > > > Deletion hooks may currently rely on the following assumptions when running: > > - rescheduling is locked > - nklock is held, interrupts are off > - they run on behalf of the deletor context > > The self-delete refactoring currently kills #3 because we now run the > hooks after the context switch, and would also kill #2 if we did not > hold the nklock (btw, enabling the nucleus debug while running with this > patch should raise an abort, from xnshadow_unmap, due to the second > assertion). > > It should be possible to get rid of #3 for xnshadow_unmap (serious > testing needed here), but we would have to grab the nklock from this > routine anyway. Since the unmapped task is no longer running on the current CPU, is there not any chance that it is run on another CPU by the time we get to xnshadow_unmap ? -- Gilles Chanteperdrix. ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] High latencies on ARM.
Gilles Chanteperdrix wrote: > On Jan 23, 2008 7:34 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote: >> Gilles Chanteperdrix wrote: >>> On Jan 23, 2008 6:48 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote: Gilles Chanteperdrix wrote: > Gilles Chanteperdrix wrote: > > Please find attached a patch implementing these ideas. This adds some > > clutter, which I would be happy to reduce. Better ideas are welcome. > > > > Ok. New version of the patch, this time split in two parts, should > hopefully make it more readable. > Ack. I'd suggest the following: - let's have a rate limiter when walking the zombie queue in __xnpod_finalize_zombies. We hold the superlock here, and what the patch also introduces is the potential for flushing more than a single TCB at a time, which might not always be a cheap operation, depending on which cra^H^Hode runs on behalf of the deletion hooks for instance. We may take for granted that no sane code would continuously create more threads than we would be able to finalize in a given time frame anyway. >>> The maximum number of zombies in the queue is >>> 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue >>> only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW >>> bit is armed. >> Ack. rate_limit = 1? I'm really reluctant to increase the WCET here, >> thread deletion isn't cheap already. > > I am not sure that holding the nklock while we run the thread deletion > hooks is really needed. > Deletion hooks may currently rely on the following assumptions when running: - rescheduling is locked - nklock is held, interrupts are off - they run on behalf of the deletor context The self-delete refactoring currently kills #3 because we now run the hooks after the context switch, and would also kill #2 if we did not hold the nklock (btw, enabling the nucleus debug while running with this patch should raise an abort, from xnshadow_unmap, due to the second assertion). It should be possible to get rid of #3 for xnshadow_unmap (serious testing needed here), but we would have to grab the nklock from this routine anyway. -- Philippe. ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
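A rough sketch of the invariant change discussed above: once the self-delete refactoring runs the deletion hooks after the context switch, a hook such as xnshadow_unmap can no longer rely on assumptions #2/#3 and would have to take the nklock on its own. The hook name and body below are hypothetical; only nklock, spl_t and the xnlock_get_irqsave()/xnlock_put_irqrestore() pair come from the nucleus code visible in the thread.

#include <nucleus/pod.h>

/* Hypothetical deletion hook, for illustration only: it no longer
 * inherits the nklock from xnpod_delete_thread(), so it grabs the
 * lock itself before touching shared nucleus state. */
static void example_delete_hook(xnthread_t *thread)
{
	spl_t s;

	xnlock_get_irqsave(&nklock, s);
	/* ... unmap bookkeeping, RPI management, etc. (elided) ... */
	xnlock_put_irqrestore(&nklock, s);
}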
Re: [Xenomai-core] High latencies on ARM.
On Jan 23, 2008 7:34 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote: > Gilles Chanteperdrix wrote: > > On Jan 23, 2008 6:48 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote: > >> Gilles Chanteperdrix wrote: > >>> Gilles Chanteperdrix wrote: > >>> > Please find attached a patch implementing these ideas. This adds some > >>> > clutter, which I would be happy to reduce. Better ideas are welcome. > >>> > > >>> > >>> Ok. New version of the patch, this time split in two parts, should > >>> hopefully make it more readable. > >>> > >> Ack. I'd suggest the following: > >> > >> - let's have a rate limiter when walking the zombie queue in > >> __xnpod_finalize_zombies. We hold the superlock here, and what the patch > >> also introduces is the potential for flushing more than a single TCB at > >> a time, which might not always be a cheap operation, depending on which > >> cra^H^Hode runs on behalf of the deletion hooks for instance. We may > >> take for granted that no sane code would continuously create more > >> threads than we would be able to finalize in a given time frame anyway. > > > > The maximum number of zombies in the queue is > > 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue > > only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW > > bit is armed. > > Ack. rate_limit = 1? I'm really reluctant to increase the WCET here, > thread deletion isn't cheap already. I am not sure that holding the nklock while we run the thread deletion hooks is really needed. -- Gilles Chanteperdrix ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] High latencies on ARM.
Philippe Gerum wrote: > Gilles Chanteperdrix wrote: > > On Jan 23, 2008 6:48 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote: > >> Gilles Chanteperdrix wrote: > >>> Gilles Chanteperdrix wrote: > >>> > Please find attached a patch implementing these ideas. This adds some > >>> > clutter, which I would be happy to reduce. Better ideas are welcome. > >>> > > >>> > >>> Ok. New version of the patch, this time split in two parts, should > >>> hopefully make it more readable. > >>> > >> Ack. I'd suggest the following: > >> > >> - let's have a rate limiter when walking the zombie queue in > >> __xnpod_finalize_zombies. We hold the superlock here, and what the patch > >> also introduces is the potential for flushing more than a single TCB at > >> a time, which might not always be a cheap operation, depending on which > >> cra^H^Hode runs on behalf of the deletion hooks for instance. We may > >> take for granted that no sane code would continuously create more > >> threads than we would be able to finalize in a given time frame anyway. > > > > The maximum number of zombies in the queue is > > 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue > > only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW > > bit is armed. > > Ack. rate_limit = 1? I'm really reluctant to increase the WCET here, > thread deletion isn't cheap already. Here come new patches. -- Gilles Chanteperdrix. Index: include/asm-ia64/bits/pod.h === --- include/asm-ia64/bits/pod.h (revision 3441) +++ include/asm-ia64/bits/pod.h (working copy) @@ -100,12 +100,6 @@ static inline void xnarch_switch_to(xnar } } -static inline void xnarch_finalize_and_switch(xnarchtcb_t * dead_tcb, - xnarchtcb_t * next_tcb) -{ - xnarch_switch_to(dead_tcb, next_tcb); -} - static inline void xnarch_finalize_no_switch(xnarchtcb_t * dead_tcb) { /* Empty */ Index: include/asm-blackfin/bits/pod.h === --- include/asm-blackfin/bits/pod.h (revision 3441) +++ include/asm-blackfin/bits/pod.h (working copy) @@ -67,12 +67,6 @@ static inline void xnarch_switch_to(xnar rthal_thread_switch(out_tcb->tsp, in_tcb->tsp); } -static inline void xnarch_finalize_and_switch(xnarchtcb_t * dead_tcb, - xnarchtcb_t * next_tcb) -{ - xnarch_switch_to(dead_tcb, next_tcb); -} - static inline void xnarch_finalize_no_switch(xnarchtcb_t * dead_tcb) { /* Empty */ Index: include/asm-arm/bits/pod.h === --- include/asm-arm/bits/pod.h (revision 3441) +++ include/asm-arm/bits/pod.h (working copy) @@ -96,12 +96,6 @@ static inline void xnarch_switch_to(xnar rthal_thread_switch(prev, out_tcb->tip, in_tcb->tip); } -static inline void xnarch_finalize_and_switch(xnarchtcb_t * dead_tcb, - xnarchtcb_t * next_tcb) -{ - xnarch_switch_to(dead_tcb, next_tcb); -} - static inline void xnarch_finalize_no_switch(xnarchtcb_t * dead_tcb) { /* Empty */ Index: include/asm-powerpc/bits/pod.h === --- include/asm-powerpc/bits/pod.h (revision 3441) +++ include/asm-powerpc/bits/pod.h (working copy) @@ -106,12 +106,6 @@ static inline void xnarch_switch_to(xnar barrier(); } -static inline void xnarch_finalize_and_switch(xnarchtcb_t * dead_tcb, - xnarchtcb_t * next_tcb) -{ - xnarch_switch_to(dead_tcb, next_tcb); -} - static inline void xnarch_finalize_no_switch(xnarchtcb_t * dead_tcb) { /* Empty */ Index: include/asm-x86/bits/pod_64.h === --- include/asm-x86/bits/pod_64.h (revision 3441) +++ include/asm-x86/bits/pod_64.h (working copy) @@ -96,12 +96,6 @@ static inline void xnarch_switch_to(xnar stts(); } -static inline void xnarch_finalize_and_switch(xnarchtcb_t * dead_tcb, - xnarchtcb_t * 
next_tcb) -{ - xnarch_switch_to(dead_tcb, next_tcb); -} - static inline void xnarch_finalize_no_switch(xnarchtcb_t * dead_tcb) { /* Empty */ Index: include/asm-x86/bits/pod_32.h === --- include/asm-x86/bits/pod_32.h (revision 3441) +++ include/asm-x86/bits/pod_32.h (working copy) @@ -123,12 +123,6 @@ static inline void xnarch_switch_to(xnar stts(); } -static inline void xnarch_finalize_and_switch(xnarchtcb_t * dead_tcb, - xnarchtcb_t * next_tcb) -{ - xnarch_switch_to(dead_tcb, next_tcb); -} - static inline void xnarch
Re: [Xenomai-core] High latencies on ARM.
On Jan 23, 2008 7:34 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote: > Gilles Chanteperdrix wrote: > > On Jan 23, 2008 6:48 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote: > >> Gilles Chanteperdrix wrote: > >>> Gilles Chanteperdrix wrote: > >>> > Please find attached a patch implementing these ideas. This adds some > >>> > clutter, which I would be happy to reduce. Better ideas are welcome. > >>> > > >>> > >>> Ok. New version of the patch, this time split in two parts, should > >>> hopefully make it more readable. > >>> > >> Ack. I'd suggest the following: > >> > >> - let's have a rate limiter when walking the zombie queue in > >> __xnpod_finalize_zombies. We hold the superlock here, and what the patch > >> also introduces is the potential for flushing more than a single TCB at > >> a time, which might not always be a cheap operation, depending on which > >> cra^H^Hode runs on behalf of the deletion hooks for instance. We may > >> take for granted that no sane code would continuously create more > >> threads than we would be able to finalize in a given time frame anyway. > > > > The maximum number of zombies in the queue is > > 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue > > only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW > > bit is armed. > > Ack. rate_limit = 1? I'm really reluctant to increase the WCET here, > thread deletion isn't cheap already. Ok, as you wish. -- Gilles Chanteperdrix ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] High latencies on ARM.
Gilles Chanteperdrix wrote: > On Jan 23, 2008 6:48 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote: >> Gilles Chanteperdrix wrote: >>> Gilles Chanteperdrix wrote: >>> > Please find attached a patch implementing these ideas. This adds some >>> > clutter, which I would be happy to reduce. Better ideas are welcome. >>> > >>> >>> Ok. New version of the patch, this time split in two parts, should >>> hopefully make it more readable. >>> >> Ack. I'd suggest the following: >> >> - let's have a rate limiter when walking the zombie queue in >> __xnpod_finalize_zombies. We hold the superlock here, and what the patch >> also introduces is the potential for flushing more than a single TCB at >> a time, which might not always be a cheap operation, depending on which >> cra^H^Hode runs on behalf of the deletion hooks for instance. We may >> take for granted that no sane code would continuously create more >> threads than we would be able to finalize in a given time frame anyway. > > The maximum number of zombies in the queue is > 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue > only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW > bit is armed. Ack. rate_limit = 1? I'm really reluctant to increase the WCET here, thread deletion isn't cheap already. -- Philippe. ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] High latencies on ARM.
On Jan 23, 2008 6:48 PM, Philippe Gerum <[EMAIL PROTECTED]> wrote: > Gilles Chanteperdrix wrote: > > Gilles Chanteperdrix wrote: > > > Please find attached a patch implementing these ideas. This adds some > > > clutter, which I would be happy to reduce. Better ideas are welcome. > > > > > > > Ok. New version of the patch, this time split in two parts, should > > hopefully make it more readable. > > > > Ack. I'd suggest the following: > > - let's have a rate limiter when walking the zombie queue in > __xnpod_finalize_zombies. We hold the superlock here, and what the patch > also introduces is the potential for flushing more than a single TCB at > a time, which might not always be a cheap operation, depending on which > cra^H^Hode runs on behalf of the deletion hooks for instance. We may > take for granted that no sane code would continuously create more > threads than we would be able to finalize in a given time frame anyway. The maximum number of zombies in the queue is 1 + XNARCH_WANT_UNLOCKED_CTXSW, since a zombie is added to the queue only if a deleted thread is xnpod_current_thread(), or if the XNLOCKSW bit is armed. > > - We could move most of the code depending on XNARCH_WANT_UNLOCKED_CTXSW > to conditional inlines in pod.h. This would reduce the visual pollution > a lot. Ok, will try that, especially since the code added to the 4 places where a scheduling tail takes place is pretty repetitive. -- Gilles Chanteperdrix ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
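The "conditional inlines in pod.h" idea could look roughly like the sketch below: a single helper collecting the epilogue work added to the four scheduling-tail sites, compiled away when unlocked context switches are not used. The helper name xnpod_finish_unlocked_switch() and the resched_pending() test are assumptions, not code from the posted patch; xnpod_finalize_zombies() is the inline introduced by the patch.

#include <nucleus/pod.h>

#ifdef XNARCH_WANT_UNLOCKED_CTXSW
/* Reap zombies queued while the switch ran outside the nklock, and
 * report whether an interrupt handler set the rescheduling bits
 * meanwhile so that the caller restarts xnpod_schedule(). */
static inline int xnpod_finish_unlocked_switch(xnsched_t *sched)
{
	xnpod_finalize_zombies(sched);
	return resched_pending(sched);	/* placeholder test */
}
#else /* !XNARCH_WANT_UNLOCKED_CTXSW */
static inline int xnpod_finish_unlocked_switch(xnsched_t *sched)
{
	return 0;	/* switch ran entirely under nklock, nothing pending */
}
#endif /* XNARCH_WANT_UNLOCKED_CTXSW */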
Re: [Xenomai-core] High latencies on ARM.
Gilles Chanteperdrix wrote: > Gilles Chanteperdrix wrote: > > Please find attached a patch implementing these ideas. This adds some > > clutter, which I would be happy to reduce. Better ideas are welcome. > > > > Ok. New version of the patch, this time split in two parts, should > hopefully make it more readable. > Ack. I'd suggest the following: - let's have a rate limiter when walking the zombie queue in __xnpod_finalize_zombies. We hold the superlock here, and what the patch also introduces is the potential for flushing more than a single TCB at a time, which might not always be a cheap operation, depending on which cra^H^Hode runs on behalf of the deletion hooks for instance. We may take for granted that no sane code would continuously create more threads than we would be able to finalize in a given time frame anyway. - We could move most of the code depending on XNARCH_WANT_UNLOCKED_CTXSW to conditional inlines in pod.h. This would reduce the visual pollution a lot. -- Philippe. ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
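A rough illustration of the rate limiter suggested above, written against the queue helpers visible in the posted patch (getq(), link2thread(), xnthread_cleanup_tcb(), nkpod->tdeleteq); the budget value and the early-exit policy are assumptions, and the trace points of the real code are omitted.

#include <nucleus/pod.h>
#include <nucleus/thread.h>

/* Sketch: bound the number of TCBs finalized per pass so the superlock
 * is never held across an unbounded walk.  Given the bound of
 * 1 + XNARCH_WANT_UNLOCKED_CTXSW zombies discussed in the thread, a
 * budget of 1 merely defers the second entry to the next pass. */
void __xnpod_finalize_zombies(xnsched_t *sched)
{
	int budget = 1;		/* rate limit suggested above */
	xnholder_t *holder;

	/* Must be called with nklock locked, interrupts off. */
	while (budget-- > 0 && (holder = getq(&sched->zombies)) != NULL) {
		xnthread_t *thread = link2thread(holder, glink);

		if (!emptyq_p(&nkpod->tdeleteq)
		    && !xnthread_test_state(thread, XNROOT))
			xnpod_fire_callouts(&nkpod->tdeleteq, thread);

		xnthread_cleanup_tcb(thread);
	}
}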
Re: [Xenomai-core] High latencies on ARM.
Jan Kiszka wrote: > Does the patch improve ARM latencies already? Yes, it does. The (interrupt) latency goes from above 100us to 80us. This is not yet 50us, though. -- Gilles Chanteperdrix. ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] High latencies on ARM.
Jan Kiszka wrote: > Gilles Chanteperdrix wrote: > > Gilles Chanteperdrix wrote: > > > Hi, > > > > > > after some (unsuccessful) time trying to instrument the code in a way > > > that does not change the latency results completely, I found the > > > reason for the high latency with latency -t 1 and latency -t 2 on ARM. > > > So, here comes an update on this issue. The culprit is the user-space > > > context switch, which flushes the processor cache with the nklock > > > locked, irqs off. > > > > > > There are two things we could do: > > > - arrange for the ARM cache flush to happen with the nklock unlocked > > > and irqs enabled. This will improve interrupt latency (latency -t 2) > > > but obviously not scheduling latency (latency -t 1). If we go that > > > way, there are several problems we should solve: > > > > > > we do not want interrupt handlers to reenter xnpod_schedule(), for > > > this we can use the XNLOCK bit, set on whatever is > > > xnpod_current_thread() when the cache flush occurs > > > > > > since the interrupt handler may modify the rescheduling bits, we need > > > to test these bits in xnpod_schedule() epilogue and restart > > > xnpod_schedule() if need be > > > > > > we do not want xnpod_delete_thread() to delete one of the two threads > > > involved in the context switch, for this the only solution I found is > > > to add a bit to the thread mask meaning that the thread is currently > > > switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue > > > to delete whatever thread was marked for deletion > > > > > > in case of migration with xnpod_migrate_thread, we do not want > > > xnpod_schedule() on the target CPU to switch to the migrated thread > > > before the context switch on the source CPU is finished, for this we > > > can avoid setting the resched bit in xnpod_migrate_thread(), detect > > > the condition in xnpod_schedule() epilogue and set the rescheduling > > > bits so that xnpod_schedule is restarted and send the IPI to the > > > target CPU. > > > > Please find attached a patch implementing these ideas. This adds some > > clutter, which I would be happy to reduce. Better ideas are welcome. > > > > I tried to cross-read the patch (-p would have been nice) but failed - > this needs to be applied on some tree. Does the patch improve ARM > latencies already? I split the patch in two parts in another post, this should make it easier to read. > > > > > > > > > - avoid using user-space real-time tasks when running latency > > > kernel-space benches, i.e. at least in the latency -t 1 and latency -t > > > 2 case. This means that we should change the timerbench driver. There > > > are at least two ways of doing this: > > > use an rt_pipe > > > modify the timerbench driver to implement only the nrt ioctl, using > > > vanilla linux services such as wait_event and wake_up. > > > > > > What do you think ? > > > > So, what do you thing is the best way to change the timerbench driver, > > * use an rt_pipe ? Pros: allows to run latency -t 1 and latency -t 2 even > > if Xenomai is compiled with CONFIG_XENO_OPT_PERVASIVE off; cons: make > > the timerbench non portable on other implementations of rtdm, eg. rtdm > > over rtai or the version of rtdm which runs over vanilla linux > > * modify the timerbecn driver to implement only nrt ioctls ? Pros: > > better driver portability; cons: latency would still need > > CONFIG_XENO_OPT_PERVASIVE to run latency -t 1 and latency -t 2. 
> > I'm still voting for my third approach:
> > -> Write latency as kernel application (klatency) against the timerbench device
> > -> Call NRT IOCTLs of timerbench during module init/cleanup
> > -> Use module parameters for customization
> > -> Setup a low-prio kernel-based RT task to issue the RT IOCTLs
> > -> Format the results nicely (similar to userland latency) in that RT task and stuff them into some rtpipe
> > -> Use "cat /dev/rtpipeX" to display the results
Sorry this mail is older than your last reply to my question. I had problems with my MTA, so I resent all the mails which were not sent, I hoped they would be sent with their original date preserved, but unfortunately, this is not the case. Now, to answer your suggestion, I think that formatting the results belongs to user-space, not to kernel-space. Besides, emitting NRT ioctls from module initialization and cleanup routines makes this klatency module quite inflexible. I was rather thinking about implementing the RT versions of the IOCTLs so that they could be called from a kernel-space real-time task. -- Gilles Chanteperdrix. ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
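A minimal sketch of the approach stated above (a kernel-space real-time task driving the benchmark device through RT ioctls), assuming the kernel-side RTDM calls rt_dev_open()/rt_dev_ioctl()/rt_dev_close() and the native rt_task_spawn(); the device name "rttest0" and the ioctl request are placeholders, since the RT variants of the timerbench ioctls do not exist yet.

#include <linux/module.h>
#include <rtdm/rtdm.h>
#include <native/task.h>

static RT_TASK klat_task;

static void klat_sampler(void *cookie)
{
	int fd = rt_dev_open("rttest0", 0);	/* placeholder device name */

	if (fd < 0)
		return;

	for (;;) {
		/* Placeholder request code; the structure carrying the
		 * benchmark configuration/results is elided here. */
		if (rt_dev_ioctl(fd, 0 /* RTTST_RTIOC_... (assumed) */, NULL) < 0)
			break;
	}

	rt_dev_close(fd);
}

static int __init klat_init(void)
{
	/* Low RT priority (1), matching the "low-prio RT task" idea. */
	return rt_task_spawn(&klat_task, "klatency", 0, 1, 0,
			     klat_sampler, NULL);
}
module_init(klat_init);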
Re: [Xenomai-core] High latencies on ARM.
Gilles Chanteperdrix wrote: Gilles Chanteperdrix wrote: > Hi, > > after some (unsuccessful) time trying to instrument the code in a way > that does not change the latency results completely, I found the > reason for the high latency with latency -t 1 and latency -t 2 on ARM. > So, here comes an update on this issue. The culprit is the user-space > context switch, which flushes the processor cache with the nklock > locked, irqs off. > > There are two things we could do: > - arrange for the ARM cache flush to happen with the nklock unlocked > and irqs enabled. This will improve interrupt latency (latency -t 2) > but obviously not scheduling latency (latency -t 1). If we go that > way, there are several problems we should solve: > > we do not want interrupt handlers to reenter xnpod_schedule(), for > this we can use the XNLOCK bit, set on whatever is > xnpod_current_thread() when the cache flush occurs > > since the interrupt handler may modify the rescheduling bits, we need > to test these bits in xnpod_schedule() epilogue and restart > xnpod_schedule() if need be > > we do not want xnpod_delete_thread() to delete one of the two threads > involved in the context switch, for this the only solution I found is > to add a bit to the thread mask meaning that the thread is currently > switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue > to delete whatever thread was marked for deletion > > in case of migration with xnpod_migrate_thread, we do not want > xnpod_schedule() on the target CPU to switch to the migrated thread > before the context switch on the source CPU is finished, for this we > can avoid setting the resched bit in xnpod_migrate_thread(), detect > the condition in xnpod_schedule() epilogue and set the rescheduling > bits so that xnpod_schedule is restarted and send the IPI to the > target CPU. Please find attached a patch implementing these ideas. This adds some clutter, which I would be happy to reduce. Better ideas are welcome. I tried to cross-read the patch (-p would have been nice) but failed - this needs to be applied on some tree. Does the patch improve ARM latencies already? > > - avoid using user-space real-time tasks when running latency > kernel-space benches, i.e. at least in the latency -t 1 and latency -t > 2 case. This means that we should change the timerbench driver. There > are at least two ways of doing this: > use an rt_pipe > modify the timerbench driver to implement only the nrt ioctl, using > vanilla linux services such as wait_event and wake_up. > > What do you think ? So, what do you thing is the best way to change the timerbench driver, * use an rt_pipe ? Pros: allows to run latency -t 1 and latency -t 2 even if Xenomai is compiled with CONFIG_XENO_OPT_PERVASIVE off; cons: make the timerbench non portable on other implementations of rtdm, eg. rtdm over rtai or the version of rtdm which runs over vanilla linux * modify the timerbecn driver to implement only nrt ioctls ? Pros: better driver portability; cons: latency would still need CONFIG_XENO_OPT_PERVASIVE to run latency -t 1 and latency -t 2. 
I'm still voting for my third approach:
-> Write latency as kernel application (klatency) against the timerbench device
-> Call NRT IOCTLs of timerbench during module init/cleanup
-> Use module parameters for customization
-> Setup a low-prio kernel-based RT task to issue the RT IOCTLs
-> Format the results nicely (similar to userland latency) in that RT task and stuff them into some rtpipe
-> Use "cat /dev/rtpipeX" to display the results
Jan ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
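The "stuff them into some rtpipe" step of the outline above might look like the following sketch, assuming the native kernel-space rt_pipe API (rt_pipe_create()/rt_pipe_write()) keeps its usual signature; the minor number, pool size and message layout are arbitrary choices here.

#include <linux/kernel.h>
#include <native/pipe.h>

static RT_PIPE klat_pipe;

/* Export one result line per message so the user-space end of the
 * pipe can be read with a plain cat. */
static int klat_pipe_setup(void)
{
	/* name, minor 0, 4 KiB message pool (assumed signature). */
	return rt_pipe_create(&klat_pipe, "klatency", 0, 4096);
}

static void klat_push_sample(long min_ns, long avg_ns, long max_ns)
{
	char buf[64];
	int len;

	len = snprintf(buf, sizeof(buf), "%ld|%ld|%ld\n",
		       min_ns, avg_ns, max_ns);
	/* P_NORMAL: the message is copied through the pipe's pool. */
	rt_pipe_write(&klat_pipe, buf, len, P_NORMAL);
}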
Re: [Xenomai-core] High latencies on ARM.
Gilles Chanteperdrix wrote: > Hi, > > after some (unsuccessful) time trying to instrument the code in a way > that does not change the latency results completely, I found the > reason for the high latency with latency -t 1 and latency -t 2 on ARM. > So, here comes an update on this issue. The culprit is the user-space > context switch, which flushes the processor cache with the nklock > locked, irqs off. > > There are two things we could do: > - arrange for the ARM cache flush to happen with the nklock unlocked > and irqs enabled. This will improve interrupt latency (latency -t 2) > but obviously not scheduling latency (latency -t 1). If we go that > way, there are several problems we should solve: > > we do not want interrupt handlers to reenter xnpod_schedule(), for > this we can use the XNLOCK bit, set on whatever is > xnpod_current_thread() when the cache flush occurs > > since the interrupt handler may modify the rescheduling bits, we need > to test these bits in xnpod_schedule() epilogue and restart > xnpod_schedule() if need be > > we do not want xnpod_delete_thread() to delete one of the two threads > involved in the context switch, for this the only solution I found is > to add a bit to the thread mask meaning that the thread is currently > switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue > to delete whatever thread was marked for deletion > > in case of migration with xnpod_migrate_thread, we do not want > xnpod_schedule() on the target CPU to switch to the migrated thread > before the context switch on the source CPU is finished, for this we > can avoid setting the resched bit in xnpod_migrate_thread(), detect > the condition in xnpod_schedule() epilogue and set the rescheduling > bits so that xnpod_schedule is restarted and send the IPI to the > target CPU. Please find attached a patch implementing these ideas. This adds some clutter, which I would be happy to reduce. Better ideas are welcome. > > - avoid using user-space real-time tasks when running latency > kernel-space benches, i.e. at least in the latency -t 1 and latency -t > 2 case. This means that we should change the timerbench driver. There > are at least two ways of doing this: > use an rt_pipe > modify the timerbench driver to implement only the nrt ioctl, using > vanilla linux services such as wait_event and wake_up. > > What do you think ? So, what do you think is the best way to change the timerbench driver, * use an rt_pipe ? Pros: allows to run latency -t 1 and latency -t 2 even if Xenomai is compiled with CONFIG_XENO_OPT_PERVASIVE off; cons: make the timerbench non portable on other implementations of rtdm, eg. rtdm over rtai or the version of rtdm which runs over vanilla linux * modify the timerbench driver to implement only nrt ioctls ? Pros: better driver portability; cons: latency would still need CONFIG_XENO_OPT_PERVASIVE to run latency -t 1 and latency -t 2. -- Gilles Chanteperdrix. 
Index: include/asm-arm/bits/pod.h === --- include/asm-arm/bits/pod.h (revision 3405) +++ include/asm-arm/bits/pod.h (working copy) @@ -67,41 +67,41 @@ #endif /* TIF_MMSWITCH_INT */ } -static inline void xnarch_switch_to(xnarchtcb_t * out_tcb, xnarchtcb_t * in_tcb) -{ - struct task_struct *prev = out_tcb->active_task; - struct mm_struct *prev_mm = out_tcb->active_mm; - struct task_struct *next = in_tcb->user_task; - - - if (likely(next != NULL)) { - in_tcb->active_task = next; - in_tcb->active_mm = in_tcb->mm; - rthal_clear_foreign_stack(&rthal_domain); - } else { - in_tcb->active_task = prev; - in_tcb->active_mm = prev_mm; - rthal_set_foreign_stack(&rthal_domain); - } - - if (prev_mm != in_tcb->active_mm) { - /* Switch to new user-space thread? */ - if (in_tcb->active_mm) - switch_mm(prev_mm, in_tcb->active_mm, next); - if (!next->mm) - enter_lazy_tlb(prev_mm, next); - } - - /* Kernel-to-kernel context switch. */ - rthal_thread_switch(prev, out_tcb->tip, in_tcb->tip); +#define xnarch_switch_to(_out_tcb, _in_tcb, lock) \ +{ \ + xnarchtcb_t *in_tcb = (_in_tcb);\ + xnarchtcb_t *out_tcb = (_out_tcb); \ + struct task_struct *prev = out_tcb->active_task;\ + struct mm_struct *prev_mm = out_tcb->active_mm; \ + struct task_struct *next = in_tcb->user_task; \ + \ +
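A flattened sketch of the xnpod_schedule() epilogue described in the bullet list of the message above, once the ARM cache flush runs with the nklock released and irqs on. do_unlocked_switch() and resched_pending() are placeholders for whatever the real code uses; the posted patch instead threads the lock through xnarch_switch_to() itself (third parameter), so this is only an illustration of the control flow, not the patch.

#include <nucleus/pod.h>

static void schedule_tail_sketch(xnsched_t *sched, spl_t s)
{
	for (;;) {
		xnlock_put_irqrestore(&nklock, s);

		/* Cache flush + register switch happen here, with the
		 * outgoing thread protected by the XNLOCK/"switching"
		 * bits so IRQ handlers neither reenter the scheduler
		 * nor delete either thread under our feet. */
		do_unlocked_switch(sched);

		xnlock_get_irqsave(&nklock, s);

		/* Threads deleted meanwhile were only marked XNZOMBIE
		 * and queued; reap them now. */
		xnpod_finalize_zombies(sched);

		/* IRQ handlers may have set the rescheduling bits
		 * while irqs were on; if so, elect a new thread and
		 * switch again. */
		if (!resched_pending(sched))
			break;

		/* ... re-run the election before looping (elided) ... */
	}

	xnlock_put_irqrestore(&nklock, s);
}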
Re: [Xenomai-core] High latencies on ARM.
Gilles Chanteperdrix wrote: > Please find attached a patch implementing these ideas. This adds some > clutter, which I would be happy to reduce. Better ideas are welcome. > Ok. New version of the patch, this time split in two parts, should hopefully make it more readable. > > > > > - avoid using user-space real-time tasks when running latency > > kernel-space benches, i.e. at least in the latency -t 1 and latency -t > > 2 case. This means that we should change the timerbench driver. There > > are at least two ways of doing this: > > use an rt_pipe > > modify the timerbench driver to implement only the nrt ioctl, using > > vanilla linux services such as wait_event and wake_up. > > > > What do you think ? > > So, what do you thing is the best way to change the timerbench driver, > * use an rt_pipe ? Pros: allows to run latency -t 1 and latency -t 2 even > if Xenomai is compiled with CONFIG_XENO_OPT_PERVASIVE off; cons: make > the timerbench non portable on other implementations of rtdm, eg. rtdm > over rtai or the version of rtdm which runs over vanilla linux > * modify the timerbecn driver to implement only nrt ioctls ? Pros: > better driver portability; cons: latency would still need > CONFIG_XENO_OPT_PERVASIVE to run latency -t 1 and latency -t 2. -- Gilles Chanteperdrix. Index: include/nucleus/pod.h === --- include/nucleus/pod.h (revision 3405) +++ include/nucleus/pod.h (working copy) @@ -139,6 +139,7 @@ xntimer_t htimer; /*!< Host timer. */ + xnqueue_t zombies; } xnsched_t; #define nkpod (&nkpod_struct) @@ -238,6 +239,14 @@ } #endif /* CONFIG_XENO_OPT_WATCHDOG */ +void __xnpod_finalize_zombies(xnsched_t *sched); + +static inline void xnpod_finalize_zombies(xnsched_t *sched) +{ + if (!emptyq_p(&sched->zombies)) + __xnpod_finalize_zombies(sched); +} + /* -- Beginning of the exported interface */ #define xnpod_sched_slot(cpu) \ Index: ksrc/nucleus/pod.c === --- ksrc/nucleus/pod.c (revision 3415) +++ ksrc/nucleus/pod.c (working copy) @@ -292,6 +292,7 @@ #endif /* CONFIG_SMP */ xntimer_set_name(&sched->htimer, htimer_name); xntimer_set_sched(&sched->htimer, sched); + initq(&sched->zombies); } xnlock_put_irqrestore(&nklock, s); @@ -545,63 +546,28 @@ __clrbits(sched->status, XNKCOUT); } -static inline void xnpod_switch_zombie(xnthread_t *threadout, - xnthread_t *threadin) +void __xnpod_finalize_zombies(xnsched_t *sched) { - /* Must be called with nklock locked, interrupts off. */ - xnsched_t *sched = xnpod_current_sched(); -#ifdef CONFIG_XENO_OPT_PERVASIVE - int shadow = xnthread_test_state(threadout, XNSHADOW); -#endif /* CONFIG_XENO_OPT_PERVASIVE */ + xnholder_t *holder; - trace_mark(xn_nucleus_sched_finalize, - "thread_out %p thread_out_name %s " - "thread_in %p thread_in_name %s", - threadout, xnthread_name(threadout), - threadin, xnthread_name(threadin)); + while ((holder = getq(&sched->zombies))) { + xnthread_t *thread = link2thread(holder, glink); - if (!emptyq_p(&nkpod->tdeleteq) && !xnthread_test_state(threadout, XNROOT)) { - trace_mark(xn_nucleus_thread_callout, - "thread %p thread_name %s hook %s", - threadout, xnthread_name(threadout), "DELETE"); - xnpod_fire_callouts(&nkpod->tdeleteq, threadout); - } + /* Must be called with nklock locked, interrupts off. 
*/ + trace_mark(xn_nucleus_sched_finalize, + "thread_out %p thread_out_name %s", + thread, xnthread_name(thread)); - sched->runthread = threadin; + if (!emptyq_p(&nkpod->tdeleteq) + && !xnthread_test_state(thread, XNROOT)) { + trace_mark(xn_nucleus_thread_callout, + "thread %p thread_name %s hook %s", + thread, xnthread_name(thread), "DELETE"); + xnpod_fire_callouts(&nkpod->tdeleteq, thread); + } - if (xnthread_test_state(threadin, XNROOT)) { - xnpod_reset_watchdog(sched); - xnfreesync(); - xnarch_enter_root(xnthread_archtcb(threadin)); + xnthread_cleanup_tcb(thread); } - - /* FIXME: Catch 22 here, whether we choose to run on an invalid - stack (cleanup then hooks), or to access the TCB space shortly - after it has been freed while non-preemptible (hooks then - cleanup)... Option #2 is curr
Re: [Xenomai-core] High latencies on ARM.
Jan Kiszka wrote: > Gilles Chanteperdrix wrote: > > On Jan 17, 2008 3:16 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote: > >> Gilles Chanteperdrix wrote: > >>> On Jan 17, 2008 12:55 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote: > Gilles Chanteperdrix wrote: > > On Jan 17, 2008 11:42 AM, Jan Kiszka <[EMAIL PROTECTED]> wrote: > >> Gilles Chanteperdrix wrote: > >>> Hi, > >>> > >>> after some (unsuccessful) time trying to instrument the code in a way > >>> that does not change the latency results completely, I found the > >>> reason for the high latency with latency -t 1 and latency -t 2 on > >>> ARM. > >>> So, here comes an update on this issue. The culprit is the user-space > >>> context switch, which flushes the processor cache with the nklock > >>> locked, irqs off. > >>> > >>> There are two things we could do: > >>> - arrange for the ARM cache flush to happen with the nklock unlocked > >>> and irqs enabled. This will improve interrupt latency (latency -t 2) > >>> but obviously not scheduling latency (latency -t 1). If we go that > >>> way, there are several problems we should solve: > >>> > >>> we do not want interrupt handlers to reenter xnpod_schedule(), for > >>> this we can use the XNLOCK bit, set on whatever is > >>> xnpod_current_thread() when the cache flush occurs > >>> > >>> since the interrupt handler may modify the rescheduling bits, we need > >>> to test these bits in xnpod_schedule() epilogue and restart > >>> xnpod_schedule() if need be > >>> > >>> we do not want xnpod_delete_thread() to delete one of the two threads > >>> involved in the context switch, for this the only solution I found is > >>> to add a bit to the thread mask meaning that the thread is currently > >>> switching, and to (re)test the XNZOMBIE bit in xnpod_schedule > >>> epilogue > >>> to delete whatever thread was marked for deletion > >>> > >>> in case of migration with xnpod_migrate_thread, we do not want > >>> xnpod_schedule() on the target CPU to switch to the migrated thread > >>> before the context switch on the source CPU is finished, for this we > >>> can avoid setting the resched bit in xnpod_migrate_thread(), detect > >>> the condition in xnpod_schedule() epilogue and set the rescheduling > >>> bits so that xnpod_schedule is restarted and send the IPI to the > >>> target CPU. > >>> > >>> - avoid using user-space real-time tasks when running latency > >>> kernel-space benches, i.e. at least in the latency -t 1 and latency > >>> -t > >>> 2 case. This means that we should change the timerbench driver. There > >>> are at least two ways of doing this: > >>> use an rt_pipe > >>> modify the timerbench driver to implement only the nrt ioctl, using > >>> vanilla linux services such as wait_event and wake_up. > >> [As you reminded me of this unanswered question:] > >> One may consider adding further modes _besides_ current kernel tests > >> that do not rely on RTDM & native userland support (e.g. when > >> CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are > >> valid > >> scenarios as well that must not be killed by such a change. > > I think the current test scenario for latency -t 1 and latency -t 2 > > are a bit misleading: they measure kernel-space latencies in presence > > of user-space real-time tasks. When one runs latency -t 1 or latency > > -t 2, one would expect that there are only kernel-space real-time > > tasks. > If they are misleading, depends on your perspective. 
In fact, they are > measuring in-kernel scenarios over the standard Xenomai setup, which > includes userland RT task activity these day. Those scenarios are mainly > targeting driver use cases, not pure kernel-space applications. > > But I agree that, for !CONFIG_XENO_OPT_PERVASIVE-like scenarios, we > would benefit from an additional set of test cases. > >>> Ok, I will not touch timerbench then, and implement another kernel > >>> module. > >>> > >> [Without considering all details] > >> To achieve this independence of user space RT thread, it should suffice > >> to implement a kernel-based frontend for timerbench. This frontent would > >> then either dump to syslog or open some pipe to tell userland about the > >> benchmark results. What do yo think? > > > > My intent was to implement a protocol similar to the one of > > timerbench, but using an rt-pipe, and continue to use the latency > > test, adding new options such as -t 3 and t 4. But there may be > > problems with this approach: if we are compiling without > > CONFIG_XENO_OPT_PERVASIVE, latency will not run at all. So, it is > > probably simpler to implement a klatency that just reads from the > > rt-pipe. > > But th
Re: [Xenomai-core] High latencies on ARM.
On Jan 17, 2008 3:22 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote: > > Gilles Chanteperdrix wrote: > > On Jan 17, 2008 3:16 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote: > >> Gilles Chanteperdrix wrote: > >>> On Jan 17, 2008 12:55 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote: > Gilles Chanteperdrix wrote: > > On Jan 17, 2008 11:42 AM, Jan Kiszka <[EMAIL PROTECTED]> wrote: > >> Gilles Chanteperdrix wrote: > >>> Hi, > >>> > >>> after some (unsuccessful) time trying to instrument the code in a way > >>> that does not change the latency results completely, I found the > >>> reason for the high latency with latency -t 1 and latency -t 2 on ARM. > >>> So, here comes an update on this issue. The culprit is the user-space > >>> context switch, which flushes the processor cache with the nklock > >>> locked, irqs off. > >>> > >>> There are two things we could do: > >>> - arrange for the ARM cache flush to happen with the nklock unlocked > >>> and irqs enabled. This will improve interrupt latency (latency -t 2) > >>> but obviously not scheduling latency (latency -t 1). If we go that > >>> way, there are several problems we should solve: > >>> > >>> we do not want interrupt handlers to reenter xnpod_schedule(), for > >>> this we can use the XNLOCK bit, set on whatever is > >>> xnpod_current_thread() when the cache flush occurs > >>> > >>> since the interrupt handler may modify the rescheduling bits, we need > >>> to test these bits in xnpod_schedule() epilogue and restart > >>> xnpod_schedule() if need be > >>> > >>> we do not want xnpod_delete_thread() to delete one of the two threads > >>> involved in the context switch, for this the only solution I found is > >>> to add a bit to the thread mask meaning that the thread is currently > >>> switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue > >>> to delete whatever thread was marked for deletion > >>> > >>> in case of migration with xnpod_migrate_thread, we do not want > >>> xnpod_schedule() on the target CPU to switch to the migrated thread > >>> before the context switch on the source CPU is finished, for this we > >>> can avoid setting the resched bit in xnpod_migrate_thread(), detect > >>> the condition in xnpod_schedule() epilogue and set the rescheduling > >>> bits so that xnpod_schedule is restarted and send the IPI to the > >>> target CPU. > >>> > >>> - avoid using user-space real-time tasks when running latency > >>> kernel-space benches, i.e. at least in the latency -t 1 and latency -t > >>> 2 case. This means that we should change the timerbench driver. There > >>> are at least two ways of doing this: > >>> use an rt_pipe > >>> modify the timerbench driver to implement only the nrt ioctl, using > >>> vanilla linux services such as wait_event and wake_up. > >> [As you reminded me of this unanswered question:] > >> One may consider adding further modes _besides_ current kernel tests > >> that do not rely on RTDM & native userland support (e.g. when > >> CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are valid > >> scenarios as well that must not be killed by such a change. > > I think the current test scenario for latency -t 1 and latency -t 2 > > are a bit misleading: they measure kernel-space latencies in presence > > of user-space real-time tasks. When one runs latency -t 1 or latency > > -t 2, one would expect that there are only kernel-space real-time > > tasks. > If they are misleading, depends on your perspective. 
In fact, they are > measuring in-kernel scenarios over the standard Xenomai setup, which > includes userland RT task activity these days. Those scenarios are mainly > targeting driver use cases, not pure kernel-space applications. > > But I agree that, for !CONFIG_XENO_OPT_PERVASIVE-like scenarios, we > would benefit from an additional set of test cases. > >>> Ok, I will not touch timerbench then, and implement another kernel module. > >>> > >> [Without considering all details] > >> To achieve this independence of user-space RT threads, it should suffice > >> to implement a kernel-based frontend for timerbench. This frontend would > >> then either dump to syslog or open some pipe to tell userland about the > >> benchmark results. What do you think? > > > > My intent was to implement a protocol similar to that of > > timerbench, but using an rt-pipe, and continue to use the latency > > test, adding new options such as -t 3 and -t 4. But there may be > > problems with this approach: if we are compiling without > > CONFIG_XENO_OPT_PERVASIVE, latency will not run at all. So, it is > > probably simpler to implement a klatency that just reads from the > > rt-pipe. > > But that klatency could perfectly reuse what timerbench already > provides, without code changes to the latter, in theory.
Re: [Xenomai-core] High latencies on ARM.
On Jan 17, 2008 3:16 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote: > > Gilles Chanteperdrix wrote: > > On Jan 17, 2008 12:55 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote: > >> Gilles Chanteperdrix wrote: > >>> On Jan 17, 2008 11:42 AM, Jan Kiszka <[EMAIL PROTECTED]> wrote: > Gilles Chanteperdrix wrote: > > Hi, > > > > after some (unsuccessful) time trying to instrument the code in a way > > that does not change the latency results completely, I found the > > reason for the high latency with latency -t 1 and latency -t 2 on ARM. > > So, here comes an update on this issue. The culprit is the user-space > > context switch, which flushes the processor cache with the nklock > > locked, irqs off. > > > > There are two things we could do: > > - arrange for the ARM cache flush to happen with the nklock unlocked > > and irqs enabled. This will improve interrupt latency (latency -t 2) > > but obviously not scheduling latency (latency -t 1). If we go that > > way, there are several problems we should solve: > > > > we do not want interrupt handlers to reenter xnpod_schedule(), for > > this we can use the XNLOCK bit, set on whatever is > > xnpod_current_thread() when the cache flush occurs > > > > since the interrupt handler may modify the rescheduling bits, we need > > to test these bits in xnpod_schedule() epilogue and restart > > xnpod_schedule() if need be > > > > we do not want xnpod_delete_thread() to delete one of the two threads > > involved in the context switch, for this the only solution I found is > > to add a bit to the thread mask meaning that the thread is currently > > switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue > > to delete whatever thread was marked for deletion > > > > in case of migration with xnpod_migrate_thread, we do not want > > xnpod_schedule() on the target CPU to switch to the migrated thread > > before the context switch on the source CPU is finished, for this we > > can avoid setting the resched bit in xnpod_migrate_thread(), detect > > the condition in xnpod_schedule() epilogue and set the rescheduling > > bits so that xnpod_schedule is restarted and send the IPI to the > > target CPU. > > > > - avoid using user-space real-time tasks when running latency > > kernel-space benches, i.e. at least in the latency -t 1 and latency -t > > 2 case. This means that we should change the timerbench driver. There > > are at least two ways of doing this: > > use an rt_pipe > > modify the timerbench driver to implement only the nrt ioctl, using > > vanilla linux services such as wait_event and wake_up. > [As you reminded me of this unanswered question:] > One may consider adding further modes _besides_ current kernel tests > that do not rely on RTDM & native userland support (e.g. when > CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are valid > scenarios as well that must not be killed by such a change. > >>> I think the current test scenario for latency -t 1 and latency -t 2 > >>> are a bit misleading: they measure kernel-space latencies in presence > >>> of user-space real-time tasks. When one runs latency -t 1 or latency > >>> -t 2, one would expect that there are only kernel-space real-time > >>> tasks. > >> If they are misleading, depends on your perspective. In fact, they are > >> measuring in-kernel scenarios over the standard Xenomai setup, which > >> includes userland RT task activity these day. Those scenarios are mainly > >> targeting driver use cases, not pure kernel-space applications. 
> >> > >> But I agree that, for !CONFIG_XENO_OPT_PERVASIVE-like scenarios, we > >> would benefit from an additional set of test cases. > > > > Ok, I will not touch timerbench then, and implement another kernel module. > > > > [Without considering all details] > To achieve this independence of user-space RT threads, it should suffice > to implement a kernel-based frontend for timerbench. This frontend would > then either dump to syslog or open some pipe to tell userland about the > benchmark results. What do you think? My intent was to implement a protocol similar to that of timerbench, but using an rt-pipe, and continue to use the latency test, adding new options such as -t 3 and -t 4. But there may be problems with this approach: if we are compiling without CONFIG_XENO_OPT_PERVASIVE, latency will not run at all. So, it is probably simpler to implement a klatency that just reads from the rt-pipe. -- Gilles Chanteperdrix ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
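A minimal sketch of such a klatency module, assuming the kernel-side native skin API (rt_task, rt_timer, rt_pipe); the names, the 100 us period and the raw sample format below are illustrative assumptions, not existing code:

/*
 * Sketch of a kernel-only latency sampler reporting through an rt_pipe.
 * No user-space RT task is involved; Linux just reads the pipe device.
 */
#include <linux/module.h>
#include <native/task.h>
#include <native/timer.h>
#include <native/pipe.h>

#define KLAT_PERIOD_NS 100000ULL	/* 100 us sampling period (arbitrary) */

static RT_TASK klat_task;
static RT_PIPE klat_pipe;

static void klat_sample(void *cookie)
{
	RTIME expected = rt_timer_read() + KLAT_PERIOD_NS;
	long long delay;

	rt_task_set_periodic(NULL, expected, KLAT_PERIOD_NS);

	for (;;) {
		rt_task_wait_period(NULL);
		/* Wake-up jitter of this kernel-space task only. */
		delay = (long long)(rt_timer_read() - expected);
		expected += KLAT_PERIOD_NS;

		/* Push the raw sample to the Linux side, non-blocking. */
		rt_pipe_write(&klat_pipe, &delay, sizeof(delay), P_NORMAL);
	}
}

static int __init klat_init(void)
{
	int err = rt_pipe_create(&klat_pipe, "klatency", P_MINOR_AUTO, 0);

	if (err)
		return err;

	err = rt_task_create(&klat_task, "klatency", 0, 99, 0);
	if (!err)
		err = rt_task_start(&klat_task, klat_sample, NULL);
	if (err)
		rt_pipe_delete(&klat_pipe);
	return err;
}

static void __exit klat_exit(void)
{
	rt_task_delete(&klat_task);
	rt_pipe_delete(&klat_pipe);
}

module_init(klat_init);
module_exit(klat_exit);
MODULE_LICENSE("GPL");

A plain Linux reader, with no Xenomai library and no CONFIG_XENO_OPT_PERVASIVE dependency, would then simply read() the stream of samples from the corresponding /dev/rtpN node and compute min/avg/max, which is what makes this usable as a klatency front-end.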
Re: [Xenomai-core] High latencies on ARM.
Gilles Chanteperdrix wrote: > On Jan 17, 2008 3:16 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote: >> Gilles Chanteperdrix wrote: >>> On Jan 17, 2008 12:55 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote: Gilles Chanteperdrix wrote: > On Jan 17, 2008 11:42 AM, Jan Kiszka <[EMAIL PROTECTED]> wrote: >> Gilles Chanteperdrix wrote: >>> Hi, >>> >>> after some (unsuccessful) time trying to instrument the code in a way >>> that does not change the latency results completely, I found the >>> reason for the high latency with latency -t 1 and latency -t 2 on ARM. >>> So, here comes an update on this issue. The culprit is the user-space >>> context switch, which flushes the processor cache with the nklock >>> locked, irqs off. >>> >>> There are two things we could do: >>> - arrange for the ARM cache flush to happen with the nklock unlocked >>> and irqs enabled. This will improve interrupt latency (latency -t 2) >>> but obviously not scheduling latency (latency -t 1). If we go that >>> way, there are several problems we should solve: >>> >>> we do not want interrupt handlers to reenter xnpod_schedule(), for >>> this we can use the XNLOCK bit, set on whatever is >>> xnpod_current_thread() when the cache flush occurs >>> >>> since the interrupt handler may modify the rescheduling bits, we need >>> to test these bits in xnpod_schedule() epilogue and restart >>> xnpod_schedule() if need be >>> >>> we do not want xnpod_delete_thread() to delete one of the two threads >>> involved in the context switch, for this the only solution I found is >>> to add a bit to the thread mask meaning that the thread is currently >>> switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue >>> to delete whatever thread was marked for deletion >>> >>> in case of migration with xnpod_migrate_thread, we do not want >>> xnpod_schedule() on the target CPU to switch to the migrated thread >>> before the context switch on the source CPU is finished, for this we >>> can avoid setting the resched bit in xnpod_migrate_thread(), detect >>> the condition in xnpod_schedule() epilogue and set the rescheduling >>> bits so that xnpod_schedule is restarted and send the IPI to the >>> target CPU. >>> >>> - avoid using user-space real-time tasks when running latency >>> kernel-space benches, i.e. at least in the latency -t 1 and latency -t >>> 2 case. This means that we should change the timerbench driver. There >>> are at least two ways of doing this: >>> use an rt_pipe >>> modify the timerbench driver to implement only the nrt ioctl, using >>> vanilla linux services such as wait_event and wake_up. >> [As you reminded me of this unanswered question:] >> One may consider adding further modes _besides_ current kernel tests >> that do not rely on RTDM & native userland support (e.g. when >> CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are valid >> scenarios as well that must not be killed by such a change. > I think the current test scenario for latency -t 1 and latency -t 2 > are a bit misleading: they measure kernel-space latencies in presence > of user-space real-time tasks. When one runs latency -t 1 or latency > -t 2, one would expect that there are only kernel-space real-time > tasks. If they are misleading, depends on your perspective. In fact, they are measuring in-kernel scenarios over the standard Xenomai setup, which includes userland RT task activity these day. Those scenarios are mainly targeting driver use cases, not pure kernel-space applications. 
But I agree that, for !CONFIG_XENO_OPT_PERVASIVE-like scenarios, we would benefit from an additional set of test cases. >>> Ok, I will not touch timerbench then, and implement another kernel module. >>> >> [Without considering all details] >> To achieve this independence of user-space RT threads, it should suffice >> to implement a kernel-based frontend for timerbench. This frontend would >> then either dump to syslog or open some pipe to tell userland about the >> benchmark results. What do you think? > > My intent was to implement a protocol similar to that of > timerbench, but using an rt-pipe, and continue to use the latency > test, adding new options such as -t 3 and -t 4. But there may be > problems with this approach: if we are compiling without > CONFIG_XENO_OPT_PERVASIVE, latency will not run at all. So, it is > probably simpler to implement a klatency that just reads from the > rt-pipe. But that klatency could perfectly reuse what timerbench already provides, without code changes to the latter, in theory. ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] High latencies on ARM.
Jan Kiszka wrote: > Gilles Chanteperdrix wrote: >> On Jan 17, 2008 12:55 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote: >>> Gilles Chanteperdrix wrote: On Jan 17, 2008 11:42 AM, Jan Kiszka <[EMAIL PROTECTED]> wrote: > Gilles Chanteperdrix wrote: >> Hi, >> >> after some (unsuccessful) time trying to instrument the code in a way >> that does not change the latency results completely, I found the >> reason for the high latency with latency -t 1 and latency -t 2 on ARM. >> So, here comes an update on this issue. The culprit is the user-space >> context switch, which flushes the processor cache with the nklock >> locked, irqs off. >> >> There are two things we could do: >> - arrange for the ARM cache flush to happen with the nklock unlocked >> and irqs enabled. This will improve interrupt latency (latency -t 2) >> but obviously not scheduling latency (latency -t 1). If we go that >> way, there are several problems we should solve: >> >> we do not want interrupt handlers to reenter xnpod_schedule(), for >> this we can use the XNLOCK bit, set on whatever is >> xnpod_current_thread() when the cache flush occurs >> >> since the interrupt handler may modify the rescheduling bits, we need >> to test these bits in xnpod_schedule() epilogue and restart >> xnpod_schedule() if need be >> >> we do not want xnpod_delete_thread() to delete one of the two threads >> involved in the context switch, for this the only solution I found is >> to add a bit to the thread mask meaning that the thread is currently >> switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue >> to delete whatever thread was marked for deletion >> >> in case of migration with xnpod_migrate_thread, we do not want >> xnpod_schedule() on the target CPU to switch to the migrated thread >> before the context switch on the source CPU is finished, for this we >> can avoid setting the resched bit in xnpod_migrate_thread(), detect >> the condition in xnpod_schedule() epilogue and set the rescheduling >> bits so that xnpod_schedule is restarted and send the IPI to the >> target CPU. >> >> - avoid using user-space real-time tasks when running latency >> kernel-space benches, i.e. at least in the latency -t 1 and latency -t >> 2 case. This means that we should change the timerbench driver. There >> are at least two ways of doing this: >> use an rt_pipe >> modify the timerbench driver to implement only the nrt ioctl, using >> vanilla linux services such as wait_event and wake_up. > [As you reminded me of this unanswered question:] > One may consider adding further modes _besides_ current kernel tests > that do not rely on RTDM & native userland support (e.g. when > CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are valid > scenarios as well that must not be killed by such a change. I think the current test scenario for latency -t 1 and latency -t 2 are a bit misleading: they measure kernel-space latencies in presence of user-space real-time tasks. When one runs latency -t 1 or latency -t 2, one would expect that there are only kernel-space real-time tasks. >>> If they are misleading, depends on your perspective. In fact, they are >>> measuring in-kernel scenarios over the standard Xenomai setup, which >>> includes userland RT task activity these day. Those scenarios are mainly >>> targeting driver use cases, not pure kernel-space applications. >>> >>> But I agree that, for !CONFIG_XENO_OPT_PERVASIVE-like scenarios, we >>> would benefit from an additional set of test cases. 
>> Ok, I will not touch timerbench then, and implement another kernel module. >> > > [Without considering all details] > To achieve this independence of user-space RT threads, it should suffice > to implement a kernel-based frontend for timerbench. This frontend would > then either dump to syslog or open some pipe to tell userland about the > benchmark results. What do you think? > (That is only in case you meant "reimplementing timerbench" by "implement another kernel module". Just write a kernel-hosted RTDM user of timerbench.) ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] High latencies on ARM.
Gilles Chanteperdrix wrote: > On Jan 17, 2008 12:55 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote: >> Gilles Chanteperdrix wrote: >>> On Jan 17, 2008 11:42 AM, Jan Kiszka <[EMAIL PROTECTED]> wrote: Gilles Chanteperdrix wrote: > Hi, > > after some (unsuccessful) time trying to instrument the code in a way > that does not change the latency results completely, I found the > reason for the high latency with latency -t 1 and latency -t 2 on ARM. > So, here comes an update on this issue. The culprit is the user-space > context switch, which flushes the processor cache with the nklock > locked, irqs off. > > There are two things we could do: > - arrange for the ARM cache flush to happen with the nklock unlocked > and irqs enabled. This will improve interrupt latency (latency -t 2) > but obviously not scheduling latency (latency -t 1). If we go that > way, there are several problems we should solve: > > we do not want interrupt handlers to reenter xnpod_schedule(), for > this we can use the XNLOCK bit, set on whatever is > xnpod_current_thread() when the cache flush occurs > > since the interrupt handler may modify the rescheduling bits, we need > to test these bits in xnpod_schedule() epilogue and restart > xnpod_schedule() if need be > > we do not want xnpod_delete_thread() to delete one of the two threads > involved in the context switch, for this the only solution I found is > to add a bit to the thread mask meaning that the thread is currently > switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue > to delete whatever thread was marked for deletion > > in case of migration with xnpod_migrate_thread, we do not want > xnpod_schedule() on the target CPU to switch to the migrated thread > before the context switch on the source CPU is finished, for this we > can avoid setting the resched bit in xnpod_migrate_thread(), detect > the condition in xnpod_schedule() epilogue and set the rescheduling > bits so that xnpod_schedule is restarted and send the IPI to the > target CPU. > > - avoid using user-space real-time tasks when running latency > kernel-space benches, i.e. at least in the latency -t 1 and latency -t > 2 case. This means that we should change the timerbench driver. There > are at least two ways of doing this: > use an rt_pipe > modify the timerbench driver to implement only the nrt ioctl, using > vanilla linux services such as wait_event and wake_up. [As you reminded me of this unanswered question:] One may consider adding further modes _besides_ current kernel tests that do not rely on RTDM & native userland support (e.g. when CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are valid scenarios as well that must not be killed by such a change. >>> I think the current test scenario for latency -t 1 and latency -t 2 >>> are a bit misleading: they measure kernel-space latencies in presence >>> of user-space real-time tasks. When one runs latency -t 1 or latency >>> -t 2, one would expect that there are only kernel-space real-time >>> tasks. >> If they are misleading, depends on your perspective. In fact, they are >> measuring in-kernel scenarios over the standard Xenomai setup, which >> includes userland RT task activity these day. Those scenarios are mainly >> targeting driver use cases, not pure kernel-space applications. >> >> But I agree that, for !CONFIG_XENO_OPT_PERVASIVE-like scenarios, we >> would benefit from an additional set of test cases. > > Ok, I will not touch timerbench then, and implement another kernel module. 
> [Without considering all details] To achieve this independence of user-space RT threads, it should suffice to implement a kernel-based frontend for timerbench. This frontend would then either dump to syslog or open some pipe to tell userland about the benchmark results. What do you think? Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
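For illustration, such a frontend could be little more than a kernel-hosted RTDM client of timerbench. The sketch below is an assumption-laden skeleton: the device name ("rttest0"), the RTTST_* requests and the config structure are recalled from rtdm/rttesting.h and must be checked against the actual tree, and the loop fetching interim results is omitted:

/*
 * Skeleton only: open the timerbench device from kernel space through the
 * RTDM inter-driver API and start a benchmark run.  Periodically querying
 * interim results and printk'ing them is left out.
 */
#include <linux/module.h>
#include <linux/string.h>
#include <linux/fcntl.h>
#include <rtdm/rtdm_driver.h>
#include <rtdm/rttesting.h>

static int bench_fd = -1;

static int __init tmbench_frontend_init(void)
{
	struct rttst_tmbench_config cfg;	/* layout: see rtdm/rttesting.h */

	bench_fd = rtdm_open("rttest0", O_RDWR);	/* device name: assumption */
	if (bench_fd < 0)
		return bench_fd;

	memset(&cfg, 0, sizeof(cfg));
	/* Fill in mode, period, priority, histogram setup as rttesting.h defines them. */

	return rtdm_ioctl(bench_fd, RTTST_RTIOC_TMBENCH_START, &cfg);
}

static void __exit tmbench_frontend_exit(void)
{
	if (bench_fd >= 0)
		rtdm_close(bench_fd);	/* stopping/result collection omitted */
}

module_init(tmbench_frontend_init);
module_exit(tmbench_frontend_exit);
MODULE_LICENSE("GPL");

Whether the results then go to syslog or through some pipe is only a matter of what the omitted polling loop does with each interim result; either way, no user-space RT thread is involved.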
Re: [Xenomai-core] High latencies on ARM.
On Jan 17, 2008 12:55 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote: > > Gilles Chanteperdrix wrote: > > On Jan 17, 2008 11:42 AM, Jan Kiszka <[EMAIL PROTECTED]> wrote: > >> Gilles Chanteperdrix wrote: > >>> Hi, > >>> > >>> after some (unsuccessful) time trying to instrument the code in a way > >>> that does not change the latency results completely, I found the > >>> reason for the high latency with latency -t 1 and latency -t 2 on ARM. > >>> So, here comes an update on this issue. The culprit is the user-space > >>> context switch, which flushes the processor cache with the nklock > >>> locked, irqs off. > >>> > >>> There are two things we could do: > >>> - arrange for the ARM cache flush to happen with the nklock unlocked > >>> and irqs enabled. This will improve interrupt latency (latency -t 2) > >>> but obviously not scheduling latency (latency -t 1). If we go that > >>> way, there are several problems we should solve: > >>> > >>> we do not want interrupt handlers to reenter xnpod_schedule(), for > >>> this we can use the XNLOCK bit, set on whatever is > >>> xnpod_current_thread() when the cache flush occurs > >>> > >>> since the interrupt handler may modify the rescheduling bits, we need > >>> to test these bits in xnpod_schedule() epilogue and restart > >>> xnpod_schedule() if need be > >>> > >>> we do not want xnpod_delete_thread() to delete one of the two threads > >>> involved in the context switch, for this the only solution I found is > >>> to add a bit to the thread mask meaning that the thread is currently > >>> switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue > >>> to delete whatever thread was marked for deletion > >>> > >>> in case of migration with xnpod_migrate_thread, we do not want > >>> xnpod_schedule() on the target CPU to switch to the migrated thread > >>> before the context switch on the source CPU is finished, for this we > >>> can avoid setting the resched bit in xnpod_migrate_thread(), detect > >>> the condition in xnpod_schedule() epilogue and set the rescheduling > >>> bits so that xnpod_schedule is restarted and send the IPI to the > >>> target CPU. > >>> > >>> - avoid using user-space real-time tasks when running latency > >>> kernel-space benches, i.e. at least in the latency -t 1 and latency -t > >>> 2 case. This means that we should change the timerbench driver. There > >>> are at least two ways of doing this: > >>> use an rt_pipe > >>> modify the timerbench driver to implement only the nrt ioctl, using > >>> vanilla linux services such as wait_event and wake_up. > >> [As you reminded me of this unanswered question:] > >> One may consider adding further modes _besides_ current kernel tests > >> that do not rely on RTDM & native userland support (e.g. when > >> CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are valid > >> scenarios as well that must not be killed by such a change. > > > > I think the current test scenario for latency -t 1 and latency -t 2 > > are a bit misleading: they measure kernel-space latencies in presence > > of user-space real-time tasks. When one runs latency -t 1 or latency > > -t 2, one would expect that there are only kernel-space real-time > > tasks. > > If they are misleading, depends on your perspective. In fact, they are > measuring in-kernel scenarios over the standard Xenomai setup, which > includes userland RT task activity these day. Those scenarios are mainly > targeting driver use cases, not pure kernel-space applications. 
> > But I agree that, for !CONFIG_XENO_OPT_PERVASIVE-like scenarios, we > would benefit from an additional set of test cases. Ok, I will not touch timerbench then, and implement another kernel module. -- Gilles Chanteperdrix ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] High latencies on ARM.
Gilles Chanteperdrix wrote: > On Jan 17, 2008 11:42 AM, Jan Kiszka <[EMAIL PROTECTED]> wrote: >> Gilles Chanteperdrix wrote: >>> Hi, >>> >>> after some (unsuccessful) time trying to instrument the code in a way >>> that does not change the latency results completely, I found the >>> reason for the high latency with latency -t 1 and latency -t 2 on ARM. >>> So, here comes an update on this issue. The culprit is the user-space >>> context switch, which flushes the processor cache with the nklock >>> locked, irqs off. >>> >>> There are two things we could do: >>> - arrange for the ARM cache flush to happen with the nklock unlocked >>> and irqs enabled. This will improve interrupt latency (latency -t 2) >>> but obviously not scheduling latency (latency -t 1). If we go that >>> way, there are several problems we should solve: >>> >>> we do not want interrupt handlers to reenter xnpod_schedule(), for >>> this we can use the XNLOCK bit, set on whatever is >>> xnpod_current_thread() when the cache flush occurs >>> >>> since the interrupt handler may modify the rescheduling bits, we need >>> to test these bits in xnpod_schedule() epilogue and restart >>> xnpod_schedule() if need be >>> >>> we do not want xnpod_delete_thread() to delete one of the two threads >>> involved in the context switch, for this the only solution I found is >>> to add a bit to the thread mask meaning that the thread is currently >>> switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue >>> to delete whatever thread was marked for deletion >>> >>> in case of migration with xnpod_migrate_thread, we do not want >>> xnpod_schedule() on the target CPU to switch to the migrated thread >>> before the context switch on the source CPU is finished, for this we >>> can avoid setting the resched bit in xnpod_migrate_thread(), detect >>> the condition in xnpod_schedule() epilogue and set the rescheduling >>> bits so that xnpod_schedule is restarted and send the IPI to the >>> target CPU. >>> >>> - avoid using user-space real-time tasks when running latency >>> kernel-space benches, i.e. at least in the latency -t 1 and latency -t >>> 2 case. This means that we should change the timerbench driver. There >>> are at least two ways of doing this: >>> use an rt_pipe >>> modify the timerbench driver to implement only the nrt ioctl, using >>> vanilla linux services such as wait_event and wake_up. >> [As you reminded me of this unanswered question:] >> One may consider adding further modes _besides_ current kernel tests >> that do not rely on RTDM & native userland support (e.g. when >> CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are valid >> scenarios as well that must not be killed by such a change. > > I think the current test scenario for latency -t 1 and latency -t 2 > are a bit misleading: they measure kernel-space latencies in presence > of user-space real-time tasks. When one runs latency -t 1 or latency > -t 2, one would expect that there are only kernel-space real-time > tasks. If they are misleading, depends on your perspective. In fact, they are measuring in-kernel scenarios over the standard Xenomai setup, which includes userland RT task activity these day. Those scenarios are mainly targeting driver use cases, not pure kernel-space applications. But I agree that, for !CONFIG_XENO_OPT_PERVASIVE-like scenarios, we would benefit from an additional set of test cases. 
Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] High latencies on ARM.
On Jan 17, 2008 11:42 AM, Jan Kiszka <[EMAIL PROTECTED]> wrote: > > Gilles Chanteperdrix wrote: > > Hi, > > > > after some (unsuccessful) time trying to instrument the code in a way > > that does not change the latency results completely, I found the > > reason for the high latency with latency -t 1 and latency -t 2 on ARM. > > So, here comes an update on this issue. The culprit is the user-space > > context switch, which flushes the processor cache with the nklock > > locked, irqs off. > > > > There are two things we could do: > > - arrange for the ARM cache flush to happen with the nklock unlocked > > and irqs enabled. This will improve interrupt latency (latency -t 2) > > but obviously not scheduling latency (latency -t 1). If we go that > > way, there are several problems we should solve: > > > > we do not want interrupt handlers to reenter xnpod_schedule(), for > > this we can use the XNLOCK bit, set on whatever is > > xnpod_current_thread() when the cache flush occurs > > > > since the interrupt handler may modify the rescheduling bits, we need > > to test these bits in xnpod_schedule() epilogue and restart > > xnpod_schedule() if need be > > > > we do not want xnpod_delete_thread() to delete one of the two threads > > involved in the context switch, for this the only solution I found is > > to add a bit to the thread mask meaning that the thread is currently > > switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue > > to delete whatever thread was marked for deletion > > > > in case of migration with xnpod_migrate_thread, we do not want > > xnpod_schedule() on the target CPU to switch to the migrated thread > > before the context switch on the source CPU is finished, for this we > > can avoid setting the resched bit in xnpod_migrate_thread(), detect > > the condition in xnpod_schedule() epilogue and set the rescheduling > > bits so that xnpod_schedule is restarted and send the IPI to the > > target CPU. > > > > - avoid using user-space real-time tasks when running latency > > kernel-space benches, i.e. at least in the latency -t 1 and latency -t > > 2 case. This means that we should change the timerbench driver. There > > are at least two ways of doing this: > > use an rt_pipe > > modify the timerbench driver to implement only the nrt ioctl, using > > vanilla linux services such as wait_event and wake_up. > > [As you reminded me of this unanswered question:] > One may consider adding further modes _besides_ current kernel tests > that do not rely on RTDM & native userland support (e.g. when > CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are valid > scenarios as well that must not be killed by such a change. I think the current test scenario for latency -t 1 and latency -t 2 are a bit misleading: they measure kernel-space latencies in presence of user-space real-time tasks. When one runs latency -t 1 or latency -t 2, one would expect that there are only kernel-space real-time tasks. -- Gilles Chanteperdrix ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
Re: [Xenomai-core] High latencies on ARM.
Gilles Chanteperdrix wrote: > Hi, > > after some (unsuccessful) time trying to instrument the code in a way > that does not change the latency results completely, I found the > reason for the high latency with latency -t 1 and latency -t 2 on ARM. > So, here comes an update on this issue. The culprit is the user-space > context switch, which flushes the processor cache with the nklock > locked, irqs off. > > There are two things we could do: > - arrange for the ARM cache flush to happen with the nklock unlocked > and irqs enabled. This will improve interrupt latency (latency -t 2) > but obviously not scheduling latency (latency -t 1). If we go that > way, there are several problems we should solve: > > we do not want interrupt handlers to reenter xnpod_schedule(), for > this we can use the XNLOCK bit, set on whatever is > xnpod_current_thread() when the cache flush occurs > > since the interrupt handler may modify the rescheduling bits, we need > to test these bits in xnpod_schedule() epilogue and restart > xnpod_schedule() if need be > > we do not want xnpod_delete_thread() to delete one of the two threads > involved in the context switch, for this the only solution I found is > to add a bit to the thread mask meaning that the thread is currently > switching, and to (re)test the XNZOMBIE bit in xnpod_schedule epilogue > to delete whatever thread was marked for deletion > > in case of migration with xnpod_migrate_thread, we do not want > xnpod_schedule() on the target CPU to switch to the migrated thread > before the context switch on the source CPU is finished, for this we > can avoid setting the resched bit in xnpod_migrate_thread(), detect > the condition in xnpod_schedule() epilogue and set the rescheduling > bits so that xnpod_schedule is restarted and send the IPI to the > target CPU. > > - avoid using user-space real-time tasks when running latency > kernel-space benches, i.e. at least in the latency -t 1 and latency -t > 2 case. This means that we should change the timerbench driver. There > are at least two ways of doing this: > use an rt_pipe > modify the timerbench driver to implement only the nrt ioctl, using > vanilla linux services such as wait_event and wake_up. [As you reminded me of this unanswered question:] One may consider adding further modes _besides_ current kernel tests that do not rely on RTDM & native userland support (e.g. when CONFIG_XENO_OPT_PERVASIVE is disabled). But the current tests are valid scenarios as well that must not be killed by such a change. Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
[Xenomai-core] High latencies on ARM.
Hi,

after some (unsuccessful) time trying to instrument the code in a way that does not change the latency results completely, I found the reason for the high latency with latency -t 1 and latency -t 2 on ARM. So, here comes an update on this issue. The culprit is the user-space context switch, which flushes the processor cache with the nklock locked, irqs off.

There are two things we could do:

- arrange for the ARM cache flush to happen with the nklock unlocked and irqs enabled. This will improve interrupt latency (latency -t 2) but obviously not scheduling latency (latency -t 1). If we go that way, there are several problems we should solve:

  we do not want interrupt handlers to reenter xnpod_schedule(); for this we can use the XNLOCK bit, set on whatever is xnpod_current_thread() when the cache flush occurs;

  since the interrupt handler may modify the rescheduling bits, we need to test these bits in the xnpod_schedule() epilogue and restart xnpod_schedule() if need be;

  we do not want xnpod_delete_thread() to delete one of the two threads involved in the context switch; for this the only solution I found is to add a bit to the thread mask meaning that the thread is currently switching, and to (re)test the XNZOMBIE bit in the xnpod_schedule() epilogue to delete whatever thread was marked for deletion;

  in case of migration with xnpod_migrate_thread(), we do not want xnpod_schedule() on the target CPU to switch to the migrated thread before the context switch on the source CPU is finished; for this we can avoid setting the resched bit in xnpod_migrate_thread(), detect the condition in the xnpod_schedule() epilogue, set the rescheduling bits so that xnpod_schedule() is restarted, and send the IPI to the target CPU.

- avoid using user-space real-time tasks when running the kernel-space latency benches, i.e. at least in the latency -t 1 and latency -t 2 cases. This means that we should change the timerbench driver. There are at least two ways of doing this:

  use an rt_pipe;

  modify the timerbench driver to implement only the nrt ioctl, using vanilla Linux services such as wait_event and wake_up.

What do you think?

-- Gilles Chanteperdrix ___ Xenomai-core mailing list Xenomai-core@gna.org https://mail.gna.org/listinfo/xenomai-core
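To make the first option more concrete, a rough shape for the reworked xnpod_schedule() epilogue follows. XNLOCK and XNZOMBIE are the existing status bits discussed above; XNSWITCHING, switching_to_uspace(), flush_user_cache(), xnpod_resched_p(), finalize_zombie() and the xnthread_*_state() helpers are used loosely as illustrative placeholders for whatever the real patch would introduce, and the migration case is left aside:

/* Schematic only -- not the actual nucleus code. */
void xnpod_schedule(void)
{
	spl_t s;

	xnlock_get_irqsave(&nklock, s);
reschedule:
	/* ... elect the next thread and switch the kernel context, as today;
	   prev/next stand for the outgoing and incoming threads ... */

	if (switching_to_uspace(next)) {
		/* Keep IRQ handlers from re-entering us during the flush. */
		xnthread_set_state(xnpod_current_thread(), XNLOCK);
		/* Tell xnpod_delete_thread() both threads are still in flight. */
		xnthread_set_state(prev, XNSWITCHING);

		xnlock_put_irqrestore(&nklock, s);
		flush_user_cache(next);		/* the long part, irqs on */
		xnlock_get_irqsave(&nklock, s);

		xnthread_clear_state(prev, XNSWITCHING);
		xnthread_clear_state(xnpod_current_thread(), XNLOCK);

		/* An IRQ may have set the rescheduling bits meanwhile. */
		if (xnpod_resched_p())
			goto reschedule;

		/* A deferred xnpod_delete_thread() left a zombie behind. */
		if (xnthread_test_state(prev, XNZOMBIE))
			finalize_zombie(prev);
	}

	xnlock_put_irqrestore(&nklock, s);
}

The same epilogue is also where the migration case would be caught: if xnpod_migrate_thread() left the resched bits unset, set them here once the local switch has fully completed and send the IPI to the target CPU.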