Re: [Xenomai-core] irq0 usage

2009-03-26 Thread Steven Seeger
I forgot to mention: in order to keep the TSC as the clocksource, I disabled
the code in the kernel that removes it as the clocksource.

Steven

On Mar 26, 2009, at 1:25 PM, Steven Seeger wrote:

 Using the TSC really brought our numbers down. I don't know why the timekeeper
 says tsc is unstable. We ran our system for an 8 minute cycle and
 timed it with a stopwatch, and it was accurate to the second.

 On our test irq0 usage dropped from 19% to 13%. Thanks for the help,
 guys.

 Steven






Re: [Xenomai-core] irq0 usage

2009-03-26 Thread Gilles Chanteperdrix
Steven Seeger wrote:
 I forgot to mention. In order to keep tsc as the clockdev I disabled  
 the code in the kernel that removes it as the clocksource.

Does the kernel disable the tsc as clocksource even with idle=poll or nohlt?
Also note that Xenomai does not care whether Linux uses the tsc as its
clocksource: Xenomai uses the tsc regardless.

-- 
 Gilles.



Re: [Xenomai-core] irq0 usage

2009-03-24 Thread Philippe Gerum
On Mon, 2009-03-23 at 19:32 -0400, Steven Seeger wrote:
  Ok, so we will agree that the 20%/60% ratios can't be compared, in  
  fact.
 
 Do you mean that this is not a fair comparison or that I should not be  
 this slow compared to RTAI?
 

I mean that you were comparing apples to oranges. If you really want to
compare them in order to figure out if a significant loss of performance
happened, then run your application in an RTAI/LXRT context in userland.

  The fact that the GX still has to use a crappy 8253 PIT for timing and
  must emulate the TSC using one of the PIT channels is not helping at
  all. Emulating the TSC costs 1 x time_of(outb) + 2 x time_of(inb),  
  each
  time a timestamp is read via the rdtsc emulation code. That is costly.
 
 Do you agree that if I build with TSC on and disable suspend on halt
 (or use idle=poll), Xenomai will use rdtsc?

Xenomai will use rdtsc as soon as the kernel wants to use it. And the
kernel will do that as soon as the CPU model you picked in your setup
exhibits TSC support. This is not a matter of Xenomai choosing to
ignore TSC support when it is available to the kernel; that never happens.
I seem to remember that your target has a bad TSC and loses time, unless
idle=poll is given; at the same time, we don't handle the SCx200 hires
timer that is Geode-specific, so there is likely no fallback option to
this issue but using idle=poll.

 
  It switches to supervisor mode using an interrupt (0x80); that logic  
  is
  really costly compared to the SEP entry. I'd say ~800ns-1us vs 200ns  
  on
  average for your target.
 
 This is bad, but since our fastest userspace period is 500 us it is not
 a dealbreaker. Just rt_task_wait_next_period() and one mutex lock/unlock
 is too much for it.
 

2.4.x will issue 3 syscalls there, 2.5.x only 1 most of the time.
If you really want to understand what is going on on your system, you
should definitely enable the I-pipe tracer and have a look at the
processing that takes place.

In any case, 3 syscalls over a 2 kHz loop are no big deal on sane hardware;
the problem I see is that your target accumulates a lot of issues:
buggy TSC, no SEP, sluggish ISA bus, no local APIC, braindamaged C3
state. It's as if that hardware were trying to prevent you from using it
in real-time mode.

Again, the best way to know what is going on is to get a trace snapshot
from the I-pipe tracer. You would get detailed timing information for
kernel space activity, on a per-routine basis.

  Btw, did you fix your driver code regarding the unprotected usage of  
  FPU
  in pure Linux kernel context?
 
 Yes in fact the new driver does not use floats at all. It's purely  
 integer math.
 
  Eh, no. TSC is always preferred when available.
 
 I was looking at rthal_timer_program_shot().
 

This is used to program the next aperiodic shot and this should not
happen more than once per sample. OTOH, getting the CPU time via the TSC
emulation occurs a few times per sample.

  Frankly, those figures are really surprising. rdtsc() is about
  100-200ns, running rthal_get_8254_tsc() is a lot, lot more.
 
 I asked above if what we did would really use the TSC or not. What do  
 you think?
 

Do you have CONFIG_X86_TSC enabled in your kernel config? If so, then
you do use TSC with Xenomai as well.

  No, when _your_ test runs.
 
 So we should run latency -p and then our test and look at the output?
 

Run latency -p 500 under the same load conditions as your app, and while
this is running:

- dump /proc/xenomai/timerstat; we will find out what timers are
outstanding. 
- dump /proc/xenomai/stat a few times; we will find out the typical CPU
consumption of the timer tick.

Then, do the same with your application, and send the outputs.

 Thanks,
 Steven
 
-- 
Philippe.





Re: [Xenomai-core] irq0 usage

2009-03-23 Thread Philippe Gerum
On Mon, 2009-03-23 at 15:59 -0400, Steven Seeger wrote:
 We are still running into issues where irq0 is using a lot of CPU
 time. The same threads on an RTAI system on the same hardware used
 about 13% of the CPU but are using closer to 60% on Xenomai.

What are you comparing, I mean, exactly?
All kernel RTAI vs all userland Xenomai?

The timer handler is charged for the callbacks it runs, so it really
boils down to what code is attached to Xenomai timers, aside from the
built-in scheduler tick.

When you measure that load, what does /proc/xenomai/timerstat say?

  I know  
 there is some overhead with userspace calls but the irq0 handler alone
 accounts for 20% of it. Are there any options that can speed things up?
 

Yeah, but you won't like it: buy a Geode that has SEP support for
syscalls and a working TSC, then switch on --enable-x86-sep. Ok,
granted, that is _not_ funny.

What would be interesting is to get the value reported for the timer
interrupt when the standard latency test runs at the same frequency as
your application does (use the -p option).

 We've tried both one-shot and periodic modes. I confirmed that the ISA
 I/O timing is 1.3 us per outb, as expected.
 
 Steven
 
 
-- 
Philippe.





Re: [Xenomai-core] irq0 usage

2009-03-23 Thread Steven Seeger
 What are you comparing, I mean, exactly?
 All kernel RTAI vs all userland Xenomai?

Yes.



 The timer handler is charged for the callbacks it runs, so it really
 boils down to what code is attached to Xenomai timers, aside from the
 built-in scheduler tick.

In this case we have only a single RTDM timer that fires every 125 us
and does nothing (as a test). It will be easy to remove this and
compare how much CPU the irq0 handler uses without it. I know it'll
be at least 14 or 15%.
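
Roughly, that test timer boils down to something like the sketch below (a
stripped-down reconstruction rather than our actual driver code; the module
and handler names are placeholders, the rtdm_timer_* calls are the stock
RTDM driver API):

#include <linux/module.h>
#include <rtdm/rtdm_driver.h>

static rtdm_timer_t test_timer;

/* Runs in primary domain every 125 us and returns immediately. */
static void test_timer_handler(rtdm_timer_t *timer)
{
	/* intentionally empty -- load measurement only */
}

static int __init test_init(void)
{
	int err = rtdm_timer_init(&test_timer, test_timer_handler, "irq0-test");
	if (err)
		return err;
	/* first shot 125 us from now, then periodic every 125 us */
	return rtdm_timer_start(&test_timer, 125000, 125000,
				RTDM_TIMERMODE_RELATIVE);
}

static void __exit test_exit(void)
{
	rtdm_timer_destroy(&test_timer);
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");

Every expiry of that timer is charged to the irq0 handler, which is why
removing it should make the comparison meaningful.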



 When you measure that load, what does /proc/xenomai/timerstat say?

 I know
 there is some overhead with userspace calls but the irq0 handler alone
 accounts for 20% of it. Are there any options that can speed things up?


 Yeah, but you won't like it: buy a Geode that has SEP support for
 syscalls and a working TSC, then switch on --enable-x86-sep. Ok,
 granted, that is _not_ funny.

We have a new Geode that has SEP and yes, things are faster. Just how
much overhead does a syscall create? Is there no better option than
SEP? If we could have kernel threads work without corrupting userland
FPU contexts then we could use our two higher-priority drivers in a
kernel module to save overhead.

Is TSC really going to make that much of a difference? It seems that  
xenomai uses PIT anyway. We can build with TSC if we disable suspend  
on halt and it works. If we do this the usage stays the same. It may  
drop a couple tenths of a percent.

 What would be interesting is to get the value reported for the timer
 interrupt when the standard latency test runs at the same frequency as
 your application does (use the -p option).

So you mean cat /proc/xenomai/stat while running the latency test with -p?

Steven




Re: [Xenomai-core] irq0 usage

2009-03-23 Thread Philippe Gerum
On Mon, 2009-03-23 at 19:03 -0400, Steven Seeger wrote:
  What are you comparing, I mean, exactly?
  All kernel RTAI vs all userland Xenomai?
 
 Yes.
 

Ok, so we will agree that the 20%/60% ratios can't be compared, in fact.

 
 
  The timer handler is charged for the callbacks it runs, so it really
  boils down to what code is attached to Xenomai timers, aside from the
  built-in scheduler tick.
 
 In this case we have only a single RTDM timer that fires every 125 us
 and does nothing (as a test). It will be easy to remove this and
 compare how much CPU the irq0 handler uses without it. I know it'll
 be at least 14 or 15%.

Let's check this anyway.

The fact that the GX still has to use a crappy 8253 PIT for timing and
must emulate the TSC using one of the PIT channels is not helping at
all. Emulating the TSC costs 1 x time_of(outb) + 2 x time_of(inb), each
time a timestamp is read via the rdtsc emulation code. That is costly.
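
To make that concrete, each emulated timestamp has to latch and read back
PIT channel 0 over the ISA bus, i.e. something like this sketch (just the
hardware access pattern, not the actual rthal_get_8254_tsc() code):

#include <asm/io.h>

/* One latch command (outb) plus two count reads (inb) per timestamp.
 * At roughly 1 us per ISA access, that is several microseconds each
 * time the emulated TSC is sampled. */
static inline unsigned int pit_read_count0(void)
{
	unsigned int lo, hi;

	outb(0x00, 0x43);	/* latch the current count of channel 0 */
	lo = inb(0x40);		/* low byte of the latched count */
	hi = inb(0x40);		/* high byte of the latched count */

	return (hi << 8) | lo;
}

On top of that access pattern, the real code still has to extend the 16-bit
down-counter into a monotonic 64-bit value, under a lock.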

 
 
 
  When you measure that load, what does /proc/xenomai/timerstat say?
 
  I know
  there is some overhead with userspace calls but the irq0 handler alone
  accounts for 20% of it. Are there any options that can speed things up?
 
 
  Yeah, but you won't like it: buy a Geode that has SEP support for
  syscalls and a working TSC, then switch on --enable-x86-sep. Ok,
  granted, that is _not_ funny.
 
 We have a new Geode that has SEP and yes, things are faster. Just how
 much overhead does a syscall create?

It switches to supervisor mode using an interrupt (0x80); that logic is
really costly compared to the SEP entry. I'd say ~800ns-1us vs 200ns on
average for your target.
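
If you want to put a number on it yourself rather than trust my estimate, a
crude userland loop like this one will do (my sketch, not an existing test;
it measures plain Linux syscalls, but they go through the same int 0x80 /
sysenter entry path we are discussing):

/* gcc -O2 -o syscost syscost.c -lrt */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

#define LOOPS 1000000

int main(void)
{
	struct timespec t0, t1;
	double ns;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < LOOPS; i++)
		syscall(SYS_getpid);	/* forces a real kernel entry each time */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("%.0f ns per syscall\n", ns / LOOPS);

	return 0;
}

On the old GX you should land somewhere near the 1 us figure, and much lower
on a SEP-capable part.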

 Is there no better option than SEP? If we could have kernel threads
 work without corrupting userland FPU contexts then we could use our
 two higher-priority drivers in a kernel module to save overhead.

Btw, did you fix your driver code regarding the unprotected usage of FPU
in pure Linux kernel context? 

 
 Is TSC really going to make that much of a difference? It seems that  
 xenomai uses PIT anyway.

Eh, no. TSC is always preferred when available.

  We can build with TSC if we disable suspend  
 on halt and it works. If we do this the usage stays the same. It may  
 drop a couple tenths of a percent.

Frankly, those figures are really surprising. rdtsc() is about
100-200ns, running rthal_get_8254_tsc() is a lot, lot more.
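
For comparison, reading a real TSC is a single unprivileged instruction; the
usual helper is nothing more than this (the standard x86 idiom, not
Xenomai-specific code):

static inline unsigned long long rdtsc(void)
{
	unsigned int lo, hi;

	/* RDTSC returns the 64-bit timestamp counter in EDX:EAX */
	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));

	return ((unsigned long long)hi << 32) | lo;
}

No port I/O, no locking, which is why it stays in the 100-200 ns range on
your class of hardware.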

 
  What would be interesting is to get the value reported for the timer
  interrupt when the standard latency test runs at the same frequency as
  your application does (use the -p option).
 
 So you mean cat /proc/xenomai/stat while running the latency test with -p?
 

No, when _your_ test runs.

 Steven
 
-- 
Philippe.





Re: [Xenomai-core] irq0 usage

2009-03-23 Thread Steven Seeger
 Ok, so we will agree that the 20%/60% ratios can't be compared, in  
 fact.

Do you mean that this is not a fair comparison or that I should not be  
this slow compared to RTAI?

 The fact that the GX still has to use a crappy 8253 PIT for timing and
 must emulate the TSC using one of the PIT channels is not helping at
 all. Emulating the TSC costs 1 x time_of(outb) + 2 x time_of(inb),  
 each
 time a timestamp is read via the rdtsc emulation code. That is costly.

Do you agree that if I build with TSC on and disable suspend on halt
(or use idle=poll), Xenomai will use rdtsc?

 It switches to supervisor mode using an interrupt (0x80); that logic  
 is
 really costly compared to the SEP entry. I'd say ~800ns-1us vs 200ns  
 on
 average for your target.

This is bad, but since our fastest userspace period is 500 us it is not
a dealbreaker. Just rt_task_wait_next_period() and one mutex lock/unlock
is too much for it.
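
Roughly, the per-cycle body is just this (a simplified sketch rather than
our actual code, using the native-skin names, i.e. rt_task_wait_period()
and friends):

#include <sys/mman.h>
#include <native/task.h>
#include <native/mutex.h>
#include <native/timer.h>

static RT_TASK loop_task;
static RT_MUTEX data_lock;

static void loop_body(void *arg)
{
	/* 500 us period, expressed in nanoseconds on an aperiodic core timer */
	rt_task_set_periodic(NULL, TM_NOW, 500000);

	for (;;) {
		rt_task_wait_period(NULL);			/* syscall 1 */

		rt_mutex_acquire(&data_lock, TM_INFINITE);	/* syscall 2 */
		/* ... exchange data with the rest of the system ... */
		rt_mutex_release(&data_lock);			/* syscall 3 */
	}
}

int main(void)
{
	mlockall(MCL_CURRENT | MCL_FUTURE);

	rt_mutex_create(&data_lock, "data_lock");
	rt_task_create(&loop_task, "loop", 0, 50, T_JOINABLE);
	rt_task_start(&loop_task, &loop_body, NULL);
	rt_task_join(&loop_task);

	return 0;
}

So per cycle it is one period wait plus one mutex acquire/release, nothing
more.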

 Btw, did you fix your driver code regarding the unprotected usage of  
 FPU
 in pure Linux kernel context?

Yes in fact the new driver does not use floats at all. It's purely  
integer math.

 Eh, no. TSC is always preferred when available.

I was looking at rthal_timer_program_shot().

 Frankly, those figures are really surprising. rdtsc() is about
 100-200ns, running rthal_get_8254_tsc() is a lot, lot more.

I asked above if what we did would really use the TSC or not. What do  
you think?

 No, when _your_ test runs.

So we should run latency -p and then our test and look at the output?

Thanks,
Steven
