On Thu, 2010-06-24 at 17:05 +0530, Nero Fernandez wrote:
> Thanks for your response, Philippe.
>
> The concerns while carrying out my experiments were to:
> - compare xenomai co-kernel overheads (timer and context switch latencies) in xenomai-space vs similar native-linux overheads. These are presented in the first two sheets.
> - find out how the addition of xenomai and xenomai+adeos affects the native kernel's performance. Here, lmbench was used on the native linux side to estimate the changes to standard linux services.
How can you reasonably estimate the overhead of co-kernel services
without running any co-kernel services? Interrupt pipelining is not a
co-kernel service. You do nothing with interrupt pipelining except
enable co-kernel services to be implemented with a real-time response
guarantee.
> Regarding the addition of latency measurements in the sys-timer handler, I performed a similar measurement from xnintr_clock_handler(), and the results were similar to the ones reported from the sys-timer handler in xenomai-enabled linux.
If your benchmark is about Xenomai, then at least make sure to provide
results for Xenomai services, used in a relevant application and
platform context. Pretending that you instrumented
xnintr_clock_handler() at some point and got some results, but then
illustrating your benchmark with other, similar results obtained from
totally unrelated instrumentation code, does not help in considering
the figures relevant.
Btw, hooking xnintr_clock_handler() is not correct. Again, benchmarking
interrupt latency with Xenomai has to measure the entire code path, from
the moment the interrupt is taken by the CPU, until it is delivered to
the Xenomai service user. By instrumenting directly in
xnintr_clock_handler(), your test bypasses the Xenomai timer handling
code which delivers the timer tick to the user code, and the
rescheduling procedure as well, so your figures are optimistically wrong
for any normal use case based on real-time tasks.
> While trying to make both these measurements, I tried to take care that delay-value logging is done at the end of the handler routines, but the __ipipe_mach_tsc value is recorded at the beginning of the routine (a patch for this is included in the worksheet itself).
This patch is hopelessly useless and misleading. Unless your intent is
to have your application directly embodied into low-level interrupt
handlers, you are not measuring the actual overhead.
Latency is not solely a matter of interrupt masking, but also a matter
of I/D cache misses, particularly on ARM - you have to traverse the
actual code until delivery to exhibit the latter.
This is exactly what the latency tests shipped with Xenomai are for:
- /usr/xenomai/bin/latency -t0/1/2
- /usr/xenomai/bin/klatency
- /usr/xenomai/bin/irqbench
If your system involves user-space tasks, then you should benchmark
user-space response time using latency [-t0]. If you plan to use
kernel-based tasks such as RTDM tasks, then latency -t1 and klatency
tests will provide correct results for your benchmark.
If you are interested only in interrupt latency, then latency -t2 will
help.
If you do think that those tests do not measure what you seem to be
interested in, then you may want to explain why on this list, so that we
eventually understand what you are after.
> Regarding the system, changing the kernel version would invalidate my results, as the system is a released CE device and there are no plans to upgrade the kernel.
Ok. But that makes your benchmark 100% irrelevant with respect to
assessing the real performance of a decent co-kernel on your setup.
> AFAIK, enabling FCSE would limit the number of concurrent processes, hence becoming unviable in my scenario.
Ditto. Besides, FCSE as implemented in recent I-pipe patches has a
best-effort mode which lifts those limitations, at the expense of
voiding the latency guarantee, but on the average, that would still be
much better than always suffering the VIVT cache insanity without FCSE.
Quoting a previous mail of yours, regarding your target:
> Processor : ARM926EJ-S rev 5 (v5l)
The latency hit induced by VIVT caching on arm926 is typically in the
180-200 us range under load in user-space, and 100-120 us in kernel
space. So, without FCSE, this would bite at each Xenomai __and__ linux
process context switch. Since your application requires that more than
95 processes be available in the system, you will likely get quite a few
switches in any given period of time, unless most of them always sleep,
of course.
Ok, so let me do some wild guesses here: you told us this is a CE-based
application; maybe it exists already? maybe it has to be put on steroids
for gaining decent real-time guarantees it doesn't have yet? and perhaps
the design of that application involves many processes undergoing
periodic activities, so lots of context switches with address space
changes during normal operations?
And, you