Re: [Xenomai-core] ns vs. tsc as internal timer base

2006-06-14 Thread Jim Cromie

Philippe Gerum wrote:

Gilles Chanteperdrix wrote:

Philippe Gerum wrote:
  Redone the check here on a Centrino 1.6 GHz, and still have roughly x20
improvement (a bit better actually). I'm using Debian/sarge gcc 3.3.5.


I think I remember that Pentium M has a much shorter mull instruction
than other processors of the family.



That would explain. Anyway, as John Stultz put it:
"math is hard, let's go shopping!"



Heh.  Appropriate that his name (Stultz) comes up here, as his
generic-time (GTOD) patchset looks headed for 2.6.18, bringing with it a
full re-working of Linux timers / timeofday. In this new world, time is
kept on free-running counters.


I've been running this patchset on my Soekris for some time, since GTOD
detects that the TSC counts slowly, calls it insane, and does timing
with the PIT.

With GTOD, writing a new clocksource driver is easy; enough so that I
could do it.

My clocksource patch uses the 27 MHz timer on the Geode CPU.
Once the TSC is de-rated, mine becomes the best clocksource, and GTOD
switches to it.
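
For reference, registering a clocksource with the 2.6.18-era GTOD core
looks roughly like the sketch below; the names (geode27m,
read_geode_27mhz_counter, the rating value) are illustrative
placeholders, not my actual patch:

#include <linux/clocksource.h>

/* Illustrative sketch only.  A free-running counter is exposed to GTOD
 * by filling in a struct clocksource and registering it; GTOD then
 * switches to the highest-rated usable source. */

#define GEODE_TIMER_HZ	27000000	/* the 27 MHz Geode counter */

static cycle_t geode_cs_read(void)
{
	return (cycle_t)read_geode_27mhz_counter();	/* hypothetical helper */
}

static struct clocksource clocksource_geode = {
	.name	= "geode27m",
	.rating	= 300,			/* above a de-rated TSC */
	.read	= geode_cs_read,
	.mask	= CLOCKSOURCE_MASK(32),
	.shift	= 20,
};

static int __init geode_clocksource_init(void)
{
	clocksource_geode.mult =
		clocksource_hz2mult(GEODE_TIMER_HZ, clocksource_geode.shift);
	return clocksource_register(&clocksource_geode);
}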


All of which is to say: new mainline code is coming. Should this current
rework notion wait, given that it will all need to be revisited again
later?



Re: [Xenomai-core] ns vs. tsc as internal timer base

2006-06-14 Thread Jan Kiszka
Philippe Gerum wrote:
 Jim Cromie wrote:
 Philippe Gerum wrote:

 Gilles Chanteperdrix wrote:

 Philippe Gerum wrote:
 Redone the check here on a Centrino 1.6 GHz, and still have
 roughly x20 improvement (a bit better actually). I'm using
 Debian/sarge gcc 3.3.5.

 I think I remember that Pentium M has a much shorter mull instruction
 than other processors of the family.


 That would explain. Anyway, as John Stultz put it:
 "math is hard, let's go shopping!"


 Heh.  Appropriate that his name (Stultz) comes up here, as his
 generic-time (GTOD) patchset looks headed for 2.6.18, bringing with it
 a full re-working of Linux timers / timeofday. In this new world, time
 is kept on free-running counters.

 I've been running this patchset on my Soekris for some time, since
 GTOD detects that the TSC counts slowly, calls it insane, and does
 timing with the PIT.

 With GTOD, writing a new clocksource driver is easy; enough so that I
 could do it.
 My clocksource patch uses the 27 MHz timer on the Geode CPU.
 Once the TSC is de-rated, mine becomes the best clocksource, and GTOD
 switches to it.

 All of which is to say: new mainline code is coming. Should this
 current rework notion wait, given that it will all need to be revisited
 again later?

 
 Clearly yes, since this is going to impact Adeos too. GTOD is going to
 fiddle with the PIT channels in a way Adeos needs to be aware of, in
 order for the client RTOS to reuse such timer. Added to the flow of
 other core changes planned for 2.6.18, this is likely going to be funky.
 
 Find wall. Beat head against same.
 

May not be required: the GTOD and clocksource abstractions could provide
a clean way to register some virtual, Adeos- or RTOS-provided clock with
Linux. And that clock may even lose ticks without Linux losing its
system time! So much for the theory; practice may still require walls...

Jan





Re: [Xenomai-core] ns vs. tsc as internal timer base

2006-06-13 Thread Philippe Gerum

Jan Kiszka wrote:

Hi,

between some football half-times of the last days ;), I played a bit
with a hand-optimised xnarch_tsc_to_ns() for x86. Using scaled math, I
achieved between 3 (P-I 133 MHz) and 4 times (P-M 1.3 GHz) faster
conversions than with the current variant. While this optimisation only
saves a few tens of nanoseconds on high-end hardware, slow processors
can gain several hundred nanos per conversion (my P-133: -600 ns).



I did exactly the same a few weeks ago, based on Anzinger's scaled math 
from i386/kernel/timers/timer_tsc.c. And indeed, I had x20 performance 
improvements in some cases.



This does not come for free: accuracy of very large values is slightly
worse, but that's likely negligible compared to the clock accuracy of
TSCs (does anyone have any real numbers on the latter, BTW?).



We do start losing significant precision for 2 ms delays and above, 
IIRC. This could be an issue for some events in aperiodic mode, although 
we could use a plain divide for those. The cost of conditionally doing 
this remains to be evaluated, though.



As we lose some bits one way, converting back still requires a real
division (i.e. the use of the existing, slower xnarch_ns_to_tsc).
Otherwise, we would get significant errors already for small intervals.

To avoid losing the optimisation again in ns_to_tsc, I thought about
basing the whole internal timer arithmetic on nanoseconds instead of
TSCs as it is now. Although I dug quite a lot into the current timer
subsystem over the last weeks, I may still overlook aspects, and I'm
x86-biased. Therefore my question before thinking or even patching
further this way: what was the motivation to choose TSCs as the internal
time base?


TSCs are not the whole nucleus time base, but only the timer management 
one. The motivation to use TSCs in nucleus/timer.c was to pick a unit 
which would not require any conversion beyond the initial one in 
xntimer_start.
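
In other words (a rough sketch with made-up names, not the actual
nucleus code), the ns-to-tsc conversion happens once when the timer is
armed, and everything downstream compares raw TSC values:

/* Illustrative sketch only: xntimer_sketch_t, enqueue_timer_by_date and
 * the TSC readout helper are hypothetical; xnarch_ns_to_tsc() is the
 * conversion discussed in this thread. */
void xntimer_start_sketch(xntimer_sketch_t *timer,
			  unsigned long long delay_ns)
{
	/* single conversion at arming time... */
	timer->date = read_cpu_tsc() + xnarch_ns_to_tsc(delay_ns);

	/* ...after which the expiry queue and the tick handler only
	 * compare TSC values (now >= timer->date), with no further
	 * conversions on the hot path. */
	enqueue_timer_by_date(timer);
}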



Any pitfalls down the road (except introducing regressions)?


Well, pitfalls expected from changing the core idea of time of the timer 
management code... :o




Jan


PS: All this would be 2.3-stuff, for sure.








--

Philippe.



Re: [Xenomai-core] ns vs. tsc as internal timer base

2006-06-13 Thread Jan Kiszka
Philippe Gerum wrote:
 Jan Kiszka wrote:
 Hi,

 between some football half-times of the last days ;), I played a bit
 with a hand-optimised xnarch_tsc_to_ns() for x86. Using scaled math, I
 achieved between 3 (P-I 133 MHz) to 4 times (P-M 1.3 GHz) faster
 conversions than with the current variant. While this optimisation only
 saves a few ten nanoseconds on high-end, slow processors can gain
 several hundreds of nanos per conversion (my P-133: -600 ns).

 
 I did exactly the same a few weeks ago, based on Anzinger's scaled math

:) We should coordinate better.

 from i386/kernel/timers/timer_tsc.c. And indeed, I had x 20 performance
 improvements in some cases.

Oops, that sounds like a rather extreme optimisation. Is the original
version varying that much? I didn't observe this.

Here is my current version, BTW:

long tsc_scale;
unsigned int tsc_shift = 31;

static inline long long fast_tsc_to_ns(long long ts)
{
long long ret;

__asm__ (
/* HI = HIWORD(ts) * tsc_scale */
"mov  %%eax,%%ebx\n\t"
"mov  %%edx,%%eax\n\t"
"imull %2\n\t"
"mov  %%eax,%%esi\n\t"
"mov  %%edx,%%edi\n\t"

/* LO = LOWORD(ts) * tsc_scale */
"mov  %%ebx,%%eax\n\t"
"mull %2\n\t"

/* ret = (HI << 32) + LO */
"add  %%esi,%%edx\n\t"
"adc  $0,%%edi\n\t"

/* ret = ret >> tsc_shift */
"shrd %%cl,%%edx,%%eax\n\t"
"shrd %%cl,%%edi,%%edx\n\t"
: "=A" (ret)
: "A" (ts), "m" (tsc_scale), "c" (tsc_shift)
: "ebx", "esi", "edi");

return ret;
}

void init_tsc(unsigned long cpu_freq)
{
unsigned long long scale;

while (1) {
scale = 1000000000LL << tsc_shift;
do_div(scale, cpu_freq); /* scale = (10^9 << tsc_shift) / cpu_freq */
if (scale <= 0x7FFFFFFF)
break;
tsc_shift--;
}
tsc_scale = scale;
}

This version will use 31 (GHz cpu_freq) to 26 (~32 MHz) shifts, i.e. a
bit more than the Linux kernel's 22 bits.
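
In plain C, the asm computes roughly the following (a portable sketch of
the same scaled math, splitting the 64x32-bit product into halves so its
high bits are not lost; it matches the asm whenever the result fits into
64 bits, e.g. for a 1.6 GHz TSC, tsc_scale = 10^9 * 2^31 / 1.6e9 =
1342177280 with tsc_shift = 31):

static inline unsigned long long c_tsc_to_ns(unsigned long long ts)
{
	unsigned long long lo = (ts & 0xffffffffULL) * tsc_scale;
	unsigned long long hi = (ts >> 32) * tsc_scale;

	/* ((hi << 32) + lo) >> tsc_shift, without needing a 96-bit type */
	return (hi << (32 - tsc_shift)) + (lo >> tsc_shift);
}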

 
 This does not come for free: accuracy of very large values is slightly
 worse, but that's likely negligible compared to the clock accuracy of
 TSCs (does anyone have any real numbers on the latter, BTW?).

 
 We do start losing significant precision for 2 ms delays and above,
 IIRC. This could be an issue for some events in aperiodic mode, albeit
 we could use a plain divide for those. The cost of conditionally doing
 this remains to be evaluated though.

Maybe I tested (not calculated - math is too hard for me :o)) the wrong
values, but I didn't see such high regressions.

 
 As we lose some bits one way, converting back still requires a real
 division (i.e. the use of the existing, slower xnarch_ns_to_tsc).
 Otherwise, we would get significant errors already for small intervals.

 To avoid losing the optimisation again in ns_to_tsc, I thought about
 basing the whole internal timer arithmetic on nanoseconds instead of
 TSCs as it is now. Although I dug quite a lot into the current timer
 subsystem over the last weeks, I may still overlook aspects, and I'm
 x86-biased. Therefore my question before thinking or even patching
 further this way: what was the motivation to choose TSCs as the
 internal time base?
 
 TSC are not the whole nucleus time base, but only the timer management
 one. The motivation to use TSCs in nucleus/timer.c was to pick a unit
 which would not require any conversion beyond the initial one in
 xntimer_start.

That helps strictly periodic application timers, not aperiodic ones like
timeouts.

 
 Any pitfalls down the road (except introducing regressions)?
 
 Well, pitfalls expected from changing the core idea of time of the timer
 management code... :o
 

You mean turning

rthal_timer_program_shot(rthal_imuldiv(delay,RTHAL_TIMER_FREQ,RTHAL_CPU_FREQ));

into

rthal_timer_program_shot(rthal_imuldiv(delay,RTHAL_TIMER_FREQ,1000000000));

e.g. ?

Jan





Re: [Xenomai-core] ns vs. tsc as internal timer base

2006-06-13 Thread Anders Blomdell

Jan Kiszka wrote:

Hi,

To avoid loosing the optimisation again in ns_to_tsc, I thought about
basing the whole internal timer arithmetics on nanoseconds instead of
TSCs as it is now. 
Good idea; it makes it simpler to adapt to laptop frequency scaling and deep ACPI 
sleep, i.e. to sync Xenomai time to the ACPI timer.


/Anders

--
Anders Blomdell  Email: [EMAIL PROTECTED]
Department of Automatic Control
Lund University  Phone:+46 46 222 4625
P.O. Box 118 Fax:  +46 46 138118
SE-221 00 Lund, Sweden



Re: [Xenomai-core] ns vs. tsc as internal timer base

2006-06-13 Thread Gilles Chanteperdrix
Jan Kiszka wrote:
  Hi,
  
  between some football half-times of the last days ;), I played a bit
  with a hand-optimised xnarch_tsc_to_ns() for x86. Using scaled math, I
  achieved between 3 (P-I 133 MHz) to 4 times (P-M 1.3 GHz) faster
  conversions than with the current variant. While this optimisation only
  saves a few ten nanoseconds on high-end, slow processors can gain
  several hundreds of nanos per conversion (my P-133: -600 ns).

Some time ago, I also did some experiments on avoiding divisions. I came
to a solution that precomputes fractions using a real division, and then
only uses additions, multiplications and shifts in imuldiv and ullimd. I
thought there would be no loss in accuracy, but well, sometimes the last
bit is wrong.

Anyway, here is the code if you want to benchmark it; div96by32 and
u64(to|from)u32 are defined in asm-i386/hal.h or asm-generic/hal.h:

typedef struct {
unsigned long long frac;/* Fractional part. */
unsigned long integ;/* Integer part. */
} u32frac_t;

/* m/d == integ + frac / 2^64 */
void precalc(u32frac_t *const f,
 const unsigned long m,
 const unsigned long d)
{
f->integ = m >= d ? m / d : 0;
f->frac = div96by32(u64fromu32(m % d, 0), 0, d, NULL);
}

inline unsigned long nodiv_imuldiv(unsigned long op, u32frac_t f)
{
const unsigned long tmp = (ullmul(op, f.frac >> 32)) >> 32;

if(f.integ)
return tmp + op * f.integ;

return tmp;
}

#define add64and32(h, l, s) do {\
__asm__ ("addl %2, %1\n\t"  \
	 "adcl $0, %0"  \
	 : "+r"(h), "+r"(l) \
	 : "r"(s)); \
} while(0)

#define add96and64(l0, l1, l2, s0, s1) do { \
__asm__ ("addl %4, %2\n\t"  \
	 "adcl %3, %1\n\t"  \
	 "adcl $0, %0\n\t"  \
	 : "+r"(l0), "+r"(l1), "+r"(l2) \
	 : "r"(s0), "r"(s1));   \
} while(0)

inline unsigned long long mul64by64_high(const unsigned long long op,
  const unsigned long long m)
{
/* Compute high 64 bits of multiplication 64 bits x 64 bits. */
unsigned long long t1, t2, t3;
u_long oph, opl, mh, ml, t0, t1h, t1l, t2h, t2l, t3h, t3l;

u64tou32(op, oph, opl);
u64tou32(m, mh, ml);
t0 = ullmul(opl, ml) >> 32;
t1 = ullmul(oph, ml); u64tou32(t1, t1h, t1l);
add64and32(t1h, t1l, t0);
t2 = ullmul(opl, mh); u64tou32(t2, t2h, t2l);
t3 = ullmul(oph, mh); u64tou32(t3, t3h, t3l);
add64and32(t3h, t3l, t2h);
add96and64(t3h, t3l, t2l, t1h, t1l);

return u64fromu32(t3h, t3l);
}

inline unsigned long long nodiv_ullimd(const unsigned long long op,
   const u32frac_t f)
{
const unsigned long long tmp = mul64by64_high(op, f.frac);

if(f.integ)
return tmp + op * f.integ;

return tmp;
}
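
For context, the intended usage (my reading of it, with hypothetical
names; the code above stops at the primitives) would be to precompute
the ratio once and then convert without any runtime division:

/* Hypothetical usage sketch: express the tsc->ns ratio as
 * integ + frac/2^64 once, then convert using only multiplications
 * and shifts. */
static u32frac_t tsc_to_ns_ratio;

void init_tsc_to_ns(unsigned long cpu_freq)
{
	precalc(&tsc_to_ns_ratio, 1000000000UL, cpu_freq); /* 10^9 / cpu_freq */
}

static inline unsigned long long nodiv_tsc_to_ns(unsigned long long tsc)
{
	return nodiv_ullimd(tsc, tsc_to_ns_ratio);
}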

-- 


Gilles Chanteperdrix.



Re: [Xenomai-core] ns vs. tsc as internal timer base

2006-06-13 Thread Philippe Gerum

Jan Kiszka wrote:

Philippe Gerum wrote:


Jan Kiszka wrote:


Hi,

between some football half-times of the last days ;), I played a bit
with a hand-optimised xnarch_tsc_to_ns() for x86. Using scaled math, I
achieved between 3 (P-I 133 MHz) to 4 times (P-M 1.3 GHz) faster
conversions than with the current variant. While this optimisation only
saves a few ten nanoseconds on high-end, slow processors can gain
several hundreds of nanos per conversion (my P-133: -600 ns).



I did exactely the same a few weeks ago, based on Anzinger's scaled math



:) We should coordinate better.



The answer is a published roadmap + todo list, but this requires some 
organisation we have not been able to set up yet.





from i386/kernel/timers/timer_tsc.c. And indeed, I had x 20 performance
improvements in some cases.



Oops, that sounds like a bit too extreme optimisations. Is the original
version varying that much? I didn't observe this.

Here is my current version, BTW:

long tsc_scale;
unsigned int tsc_shift = 31;

static inline long long fast_tsc_to_ns(long long ts)
{
long long ret;

__asm__ (
/* HI = HIWORD(ts) * tsc_scale */
"mov  %%eax,%%ebx\n\t"
"mov  %%edx,%%eax\n\t"
"imull %2\n\t"
"mov  %%eax,%%esi\n\t"
"mov  %%edx,%%edi\n\t"

/* LO = LOWORD(ts) * tsc_scale */
"mov  %%ebx,%%eax\n\t"
"mull %2\n\t"

/* ret = (HI << 32) + LO */
"add  %%esi,%%edx\n\t"
"adc  $0,%%edi\n\t"

/* ret = ret >> tsc_shift */
"shrd %%cl,%%edx,%%eax\n\t"
"shrd %%cl,%%edi,%%edx\n\t"
: "=A" (ret)
: "A" (ts), "m" (tsc_scale), "c" (tsc_shift)
: "ebx", "esi", "edi");

return ret;
}

void init_tsc(unsigned long cpu_freq)
{
unsigned long long scale;

while (1) {
scale = 1000000000LL << tsc_shift;
do_div(scale, cpu_freq); /* scale = (10^9 << tsc_shift) / cpu_freq */
if (scale <= 0x7FFFFFFF)
break;
tsc_shift--;
}
tsc_scale = scale;
}

This version will use 31 (GHz cpu_freq) to 26 (~32 MHz) shifts, i.e. a
bit more than the Linux kernel's 22 bits.



Here is likely why we have different levels of accuracy and performance:
firstly, my version is bluntly based on the kHz freq; secondly, it
calculates the other way around, i.e. ns2tsc, so that TSCs are kept in
the inner code, but more efficiently converted from the ns counts passed
to the outer interface:


static unsigned long ns2cyc_scale;
#define NS2CYC_SCALE_FACTOR 10 /* 2^10, carefully chosen */

static inline void set_ns2cyc_scale(unsigned long cpu_khz)
{
ns2cyc_scale = (cpu_khz << NS2CYC_SCALE_FACTOR) / 1000000;
}

static inline unsigned long long ns_2_cycles(unsigned long long ns)
{
return ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;
}
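
To put rough numbers on the accuracy trade-off (my own arithmetic,
assuming cpu_khz = 1600000, i.e. a 1.6 GHz CPU): ns2cyc_scale =
(1600000 << 10) / 1000000 = 1638, while the exact ratio is 1638.4. A
1 ms delay then converts to 1000000 * 1638 >> 10 = 1599609 cycles
instead of 1600000, i.e. the shot is programmed roughly 244 ns early,
and the error grows linearly (about 0.5 us at 2 ms), which is in the
ballpark of the 2 ms concern raised earlier in the thread.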



TSC are not the whole nucleus time base, but only the timer management
one. The motivation to use TSCs in nucleus/timer.c was to pick a unit
which would not require any conversion beyond the initial one in
xntimer_start.



That helps strictly periodic application timers, not aperiodic ones like
timeouts.



It depends: periodic timers usually exhibit larger delays, so the gain 
is more significant with oneshot timings, which incur smaller delays and 
hence a higher number of calculations.





Any pitfalls down the road (except introducing regressions)?


Well, pitfalls expected from changing the core idea of time of the timer
management code... :o



You mean turning

rthal_timer_program_shot(rthal_imuldiv(delay,RTHAL_TIMER_FREQ,RTHAL_CPU_FREQ));

into

rthal_timer_program_shot(rthal_imuldiv(delay,RTHAL_TIMER_FREQ,1000000000));



Not really, it was a general remark about changing code that might have 
some assumptions about using TSCs. Additionally, only x86 needs to 
rescale TSC values to the timer frequency; other archs use the same unit 
on both sides, and such a unit might even have nothing to do with any 
CPU accounting (e.g. blackfin uses a free-running timer, ppc uses the 
internal timebase, etc.).


This said, it should not have that many assumptions, and in any case, 
they should be confined to nucleus/timers.c. I think we should give this 
kind of optimization a try.


--

Philippe.



Re: [Xenomai-core] ns vs. tsc as internal timer base

2006-06-13 Thread Gilles Chanteperdrix
Philippe Gerum wrote:
  static inline unsigned long long ns_2_cycles(unsigned long long ns)
  {
   return ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;

This multiplication is 64 bits * 32 bits; the intermediate result may
need more than 64 bits, so you should compute it the same way as the
beginning of ullimd. Something like:

static inline unsigned long long ns_2_cycles(unsigned long long ns)
{
unsigned nsh, nsl, tlh, tll;
unsigned long long th, tl;

__rthal_u64tou32(ns, nsh, nsl);
tl = rthal_ullmul(nsl, ns2cyc_scale);
__rthal_u64tou32(tl, tlh, tll);
th = rthal_ullmul(nsh, ns2cyc_scale);
th += tlh;

tll = (unsigned) th << (32 - NS2CYC_SCALE_FACTOR) | tll >> NS2CYC_SCALE_FACTOR;
th >>= NS2CYC_SCALE_FACTOR;
return __rthal_u64fromu32(th, tll);
}


-- 


Gilles Chanteperdrix.



Re: [Xenomai-core] ns vs. tsc as internal timer base

2006-06-13 Thread Philippe Gerum

Gilles Chanteperdrix wrote:

Philippe Gerum wrote:
  static inline unsigned long long ns_2_cycles(unsigned long long ns)
  {
   return ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;

This multiplication is 64 bits * 32 bits, the intermediate result may
need more than 64 bits, so you should compute it the same way as the
beginning of ullimd. Something like:


Sure, but the point is that if we were to use such code, we should bound 
the 64-bit operand and would not use it beyond the tolerable loss of 
accuracy on the output (e.g. 2 ms). This would require breaking longer 
shots into several smaller ones, relying on the internal timer management 
logic to redo the shot until it has actually elapsed (which should be a 
rare case for oneshot timing), a bit like we are currently doing in 
bounding the values to 2^32-1 right now. Going for an ullimd-like 
implementation somehow impedes the overall effort of reducing the CPU 
footprint, I guess. This said, I still have no clue whether the gain in 
computation cycles is worth the additional overhead of dealing with 
possibly early shots; I tend to think it would be better on average, though.
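
A minimal sketch of that bounding idea (hypothetical threshold and
naming, not code from this thread):

/* Convert with the fast scaled path only up to a bound where its error
 * stays tolerable; longer shots are truncated, fire early, and get
 * re-armed by the timer management logic with the remaining delay. */
#define FAST_CONV_MAX_NS  2000000ULL	/* ~2 ms accuracy window */

static inline unsigned long long shot_to_cycles(unsigned long long delay_ns)
{
	if (delay_ns > FAST_CONV_MAX_NS)
		delay_ns = FAST_CONV_MAX_NS;	/* early shot, re-armed later */

	return ns_2_cycles(delay_ns);		/* fast scaled-math path */
}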




static inline unsigned long long ns_2_cycles(unsigned long long ns)
{
unsigned nsh, nsl, tlh, tll;
unsigned long long th, tl;

__rthal_u64tou32(ns, nsh, nsl);
tl = rthal_ullmul(nsl, ns2cyc_scale);
__rthal_u64tou32(tl, tlh, tll);
th = rthal_ullmul(nsh, ns2cyc_scale);
th += tlh;

tll = (unsigned) th << (32 - NS2CYC_SCALE_FACTOR) | tll >> NS2CYC_SCALE_FACTOR;
th >>= NS2CYC_SCALE_FACTOR;
return __rthal_u64fromu32(th, tll);
}





--

Philippe.



Re: [Xenomai-core] ns vs. tsc as internal timer base

2006-06-13 Thread Jan Kiszka
Philippe Gerum wrote:
 Jan Kiszka wrote:
 Philippe Gerum wrote:
 from i386/kernel/timers/timer_tsc.c. And indeed, I had x 20 performance
 improvements in some cases.

 Oops, that sounds like a bit too extreme optimisations. Is the original
 version varying that much? I didn't observe this.

 Here is my current version, BTW:

 long tsc_scale;
 unsigned int tsc_shift = 31;

 static inline long long fast_tsc_to_ns(long long ts)
 {
 long long ret;

 __asm__ (
 /* HI = HIWORD(ts) * tsc_scale */
 "mov  %%eax,%%ebx\n\t"
 "mov  %%edx,%%eax\n\t"
 "imull %2\n\t"
 "mov  %%eax,%%esi\n\t"
 "mov  %%edx,%%edi\n\t"

 /* LO = LOWORD(ts) * tsc_scale */
 "mov  %%ebx,%%eax\n\t"
 "mull %2\n\t"

 /* ret = (HI << 32) + LO */
 "add  %%esi,%%edx\n\t"
 "adc  $0,%%edi\n\t"

 /* ret = ret >> tsc_shift */
 "shrd %%cl,%%edx,%%eax\n\t"
 "shrd %%cl,%%edi,%%edx\n\t"
 : "=A" (ret)
 : "A" (ts), "m" (tsc_scale), "c" (tsc_shift)
 : "ebx", "esi", "edi");

 return ret;
 }

 void init_tsc(unsigned long cpu_freq)
 {
 unsigned long long scale;

 while (1) {
 scale = 1000000000LL << tsc_shift;
 do_div(scale, cpu_freq); /* scale = (10^9 << tsc_shift) / cpu_freq */
 if (scale <= 0x7FFFFFFF)
 break;
 tsc_shift--;
 }
 tsc_scale = scale;
 }

 This version will use 31 (GHz cpu_freq) to 26 (~32 MHz) shifts, i.e. a
 bit more than the Linux kernel's 22 bits.

 
 Here is likely why we have different levels of accuracy and performance,
  firstly my version is bluntly based on the khz freq, secondly it
 calculates the other way around, i.e. ns2tsc, so that tsc are keep in
 the inner code, but more efficiently converted from ns counts passed to
 the outer interface:
 
 static unsigned long ns2cyc_scale;
 #define NS2CYC_SCALE_FACTOR 10 /* 2^10, carefully chosen */

Linux only uses 10 bits for scheduling time calculation, which is
tick-based (low-res) anyway. The TSC clocksource uses 22 bits. The
latter overflows after an hour or so, because they drop all bits above
64 after the multiplication, which is only insignificantly faster when
using optimised code anyway.
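
(Rough arithmetic behind that "hour or so", my own estimate: with shift
= 22 and a ~1 GHz TSC, mult is about 2^22, so the 64-bit product
cycles * mult wraps once the cycle count exceeds 2^64 / 2^22 = 2^42,
i.e. after roughly 4400 seconds, a bit over an hour.)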

 
 static inline void set_ns2cyc_scale(unsigned long cpu_khz)
 {
 ns2cyc_scale = (cpu_khz << NS2CYC_SCALE_FACTOR) / 1000000;
 }
 
 static inline unsigned long long ns_2_cycles(unsigned long long ns)
 {
 return ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;
 }
 

 TSC are not the whole nucleus time base, but only the timer management
 one. The motivation to use TSCs in nucleus/timer.c was to pick a unit
 which would not require any conversion beyond the initial one in
 xntimer_start.


 That helps strictly periodic application timers, not aperiodic ones like
 timeouts.

 
 It depends, periodic timers usually exhibit larger delays, so the gain
 is more significant with oneshot timings incurring smaller delays, hence
 a higher number of calculations.
 

 Any pitfalls down the road (except introducing regressions)?

 Well, pitfalls expected from changing the core idea of time of the timer
 management code... :o


 You mean turning

 rthal_timer_program_shot(rthal_imuldiv(delay,RTHAL_TIMER_FREQ,RTHAL_CPU_FREQ));


 into

 rthal_timer_program_shot(rthal_imuldiv(delay,RTHAL_TIMER_FREQ,1000000000));


 
 Not really, it was a general remark about changing a code that might
 have some assumtions on using TSCs. Additionally, only x86 needs to
 rescale TSC values to the timer frequency, other archs use the same unit
 on both sides, and such unit might even have nothing to do with any CPU
 accounting (e.g. blackfin uses a free running timer, ppc uses the
 internal timebase, etc).

Ok, an interesting aspect I already assumed but didn't check in detail
yet. That makes dealing with TSCs interesting again on != x86. In
contrast, on x86 there is the aspect of frequency scaling that Anders
brought up, which would speak for nanos.

 
 This said, it should not have that many assumptions, and in any case,
 they should be confined to nucleus/timers.c. I think we should give this
 kind of optimization a try.
 

Yep, it just needs some more brain cycles on how to do this precisely.

Jan





Re: [Xenomai-core] ns vs. tsc as internal timer base

2006-06-13 Thread Gilles Chanteperdrix
Philippe Gerum wrote:
  Gilles Chanteperdrix wrote:
   Philippe Gerum wrote:
 static inline unsigned long long ns_2_cycles(unsigned long long ns)
 {
  return ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;
   
   This multiplication is 64 bits * 32 bits, the intermediate result may
   need more than 64 bits, so you should compute it the same way as the
   beginning of ullimd. Something like:
  
  Sure, but the point is that if we were to use such code, we should bound 
  the 64bit operand and would not use it beyond the tolerable loss of 
  accuracy on output (e.g. 2ms).  This would require to break longer shots 
  in several smaller ones, relying on the internal timer management logic 
  to redo the shot until it has actually elapsed (which should be a rare 
  case for oneshot timing), a bit like we are currently doing in bounding 
  the values to 2^32-1 right now. Going for ullimd alike implementation 
  somehow impedes the overall effort in reducing the CPU footprint, I 
  guess. This said, I have still no clue if the gain in computation cycles 
  is worth the additional overhead of dealing with possibly early shots - 
  I tend to think it would be better on average though.

Ok, we could then write:

static inline unsigned long long ns_2_cycles(unsigned ns)
{
return (unsigned long long) ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;
}

-- 


Gilles Chanteperdrix.



Re: [Xenomai-core] ns vs. tsc as internal timer base

2006-06-13 Thread Philippe Gerum

Gilles Chanteperdrix wrote:

Philippe Gerum wrote:
  Gilles Chanteperdrix wrote:
   Philippe Gerum wrote:
 static inline unsigned long long ns_2_cycles(unsigned long long ns)
 {
  return ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;
   
   This multiplication is 64 bits * 32 bits, the intermediate result may

   need more than 64 bits, so you should compute it the same way as the
   beginning of ullimd. Something like:
  
  Sure, but the point is that if we were to use such code, we should bound 
  the 64bit operand and would not use it beyond the tolerable loss of 
  accuracy on output (e.g. 2ms).  This would require to break longer shots 
  in several smaller ones, relying on the internal timer management logic 
  to redo the shot until it has actually elapsed (which should be a rare 
  case for oneshot timing), a bit like we are currently doing in bounding 
  the values to 2^32-1 right now. Going for ullimd alike implementation 
  somehow impedes the overall effort in reducing the CPU footprint, I 
  guess. This said, I have still no clue if the gain in computation cycles 
  is worth the additional overhead of dealing with possibly early shots - 
  I tend to think it would be better on average though.


Ok, we could then write:

static inline unsigned long long ns_2_cycles(unsigned ns)
{
return (unsigned long long) ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;
}



Yep.

--

Philippe.



Re: [Xenomai-core] ns vs. tsc as internal timer base

2006-06-13 Thread Philippe Gerum

Jan Kiszka wrote:

Philippe Gerum wrote:


Jan Kiszka wrote:


Philippe Gerum wrote:


from i386/kernel/timers/timer_tsc.c. And indeed, I had x 20 performance
improvements in some cases.


Oops, that sounds like a bit too extreme optimisations. Is the original
version varying that much? I didn't observe this.

Here is my current version, BTW:

long tsc_scale;
unsigned int tsc_shift = 31;

static inline long long fast_tsc_to_ns(long long ts)
{
   long long ret;

   __asm__ (
   /* HI = HIWORD(ts) * tsc_scale */
    "mov  %%eax,%%ebx\n\t"
    "mov  %%edx,%%eax\n\t"
    "imull %2\n\t"
    "mov  %%eax,%%esi\n\t"
    "mov  %%edx,%%edi\n\t"

   /* LO = LOWORD(ts) * tsc_scale */
    "mov  %%ebx,%%eax\n\t"
    "mull %2\n\t"

   /* ret = (HI << 32) + LO */
    "add  %%esi,%%edx\n\t"
    "adc  $0,%%edi\n\t"

   /* ret = ret >> tsc_shift */
    "shrd %%cl,%%edx,%%eax\n\t"
    "shrd %%cl,%%edi,%%edx\n\t"
   : "=A" (ret)
   : "A" (ts), "m" (tsc_scale), "c" (tsc_shift)
   : "ebx", "esi", "edi");

   return ret;
}

void init_tsc(unsigned long cpu_freq)
{
   unsigned long long scale;

   while (1) {
   scale = 1000000000LL << tsc_shift;
   do_div(scale, cpu_freq); /* scale = (10^9 << tsc_shift) / cpu_freq */
   if (scale <= 0x7FFFFFFF)
   break;
   tsc_shift--;
   }
   tsc_scale = scale;
}

This version will use 31 (GHz cpu_freq) to 26 (~32 MHz) shifts, i.e. a
bit more than the Linux kernel's 22 bits.



Here is likely why we have different levels of accuracy and performance,
firstly my version is bluntly based on the khz freq, secondly it
calculates the other way around, i.e. ns2tsc, so that tsc are keep in
the inner code, but more efficiently converted from ns counts passed to
the outer interface:

static unsigned long ns2cyc_scale;
#define NS2CYC_SCALE_FACTOR 10 /* 2^10, carefully chosen */



Linux only uses 10 bits for scheduling time calculation, which is
tick-based (low-res) anyway.


This code is rather used to compute TSC offsets within a tick, so the 
max operand is short, bounded and known by design. Hence the scale 
factor, AFAICS.


 The tsc clock_source uses 22 bits. The

latter overflows after an hour or so, because they drop all bits above 64
after the multiplication - insignificantly faster when using optimised
code anyway.



This path to optimization is about computing reasonably short delays this 
way, so roll-over and precision would not be a key factor.





static inline void set_ns2cyc_scale(unsigned long cpu_khz)
{
   ns2cyc_scale = (cpu_khz << NS2CYC_SCALE_FACTOR) / 1000000;
}

static inline unsigned long long ns_2_cycles(unsigned long long ns)
{
   return ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;
}



TSC are not the whole nucleus time base, but only the timer management
one. The motivation to use TSCs in nucleus/timer.c was to pick a unit
which would not require any conversion beyond the initial one in
xntimer_start.



That helps strictly periodic application timers, not aperiodic ones like
timeouts.



It depends, periodic timers usually exhibit larger delays, so the gain
is more significant with oneshot timings incurring smaller delays, hence
a higher number of calculations.



Any pitfalls down the road (except introducing regressions)?


Well, pitfalls expected from changing the core idea of time of the timer
management code... :o


You mean turning

rthal_timer_program_shot(rthal_imuldiv(delay,RTHAL_TIMER_FREQ,RTHAL_CPU_FREQ));


into

rthal_timer_program_shot(rthal_imuldiv(delay,RTHAL_TIMER_FREQ,1000000000));




Not really, it was a general remark about changing a code that might
have some assumtions on using TSCs. Additionally, only x86 needs to
rescale TSC values to the timer frequency, other archs use the same unit
on both sides, and such unit might even have nothing to do with any CPU
accounting (e.g. blackfin uses a free running timer, ppc uses the
internal timebase, etc).



Ok, an interesting aspect I already assumed but didn't check in details
yet. That makes dealing with TSCs interesting again on != x86. In
contrast, on x86, there is the aspect of frequency scaling that Anders
brought up and which would speak pro nanos.



This said, it should not have that many assumptions, and in any case,
they should be confined to nucleus/timers.c. I think we should give this
kind of optimization a try.




Yep, it just needs some more brain cycles how to do this precisely.

Jan




--

Philippe.



Re: [Xenomai-core] ns vs. tsc as internal timer base

2006-06-13 Thread Jan Kiszka
Philippe Gerum wrote:
 Here is likely why we have different levels of accuracy and performance,
  firstly my version is bluntly based on the khz freq, secondly it
 calculates the other way around, i.e. ns2tsc, so that tsc are keep in
 the inner code, but more efficiently converted from ns counts passed to
 the outer interface:
 
 static unsigned long ns2cyc_scale;
 #define NS2CYC_SCALE_FACTOR 10 /* 2^10, carefully chosen */
 
 static inline void set_ns2cyc_scale(unsigned long cpu_khz)
 {
  ns2cyc_scale = (cpu_khz << NS2CYC_SCALE_FACTOR) / 1000000;
 }
 
 static inline unsigned long long ns_2_cycles(unsigned long long ns)
 {
  return ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;
 }

Your version performs ~50% better than mine (outperforming the original
version by a factor of 7 on a 1 GHz box, vs. 4.8). I think you compared
non-optimised code, didn't you? Without -O2, I see 15 times better
performance.

[Gilles' variant still refuses to get benchmarked.]

Jan





Re: [Xenomai-core] ns vs. tsc as internal timer base

2006-06-13 Thread Gilles Chanteperdrix
Jan Kiszka wrote:
  Philippe Gerum wrote:
   Here is likely why we have different levels of accuracy and performance,
firstly my version is bluntly based on the khz freq, secondly it
   calculates the other way around, i.e. ns2tsc, so that tsc are keep in
   the inner code, but more efficiently converted from ns counts passed to
   the outer interface:
   
   static unsigned long ns2cyc_scale;
   #define NS2CYC_SCALE_FACTOR 10 /* 2^10, carefully chosen */
   
   static inline void set_ns2cyc_scale(unsigned long cpu_khz)
   {
   ns2cyc_scale = (cpu_khz << NS2CYC_SCALE_FACTOR) / 1000000;
   }
   
   static inline unsigned long long ns_2_cycles(unsigned long long ns)
   {
   return ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;
   }
  
  Your version performs ~50% better than mine (outperforming the original
  version by factor 7 on a 1 GHz box, vs. 4.8). I think you compared
  non-optimised code, didn't you? Without -O2, I see 15 times better
  performance.
  
  [Gilles' variant still refuses to get benchmarked.]

Since we accept a smaller range, I think you should benchmark
nodiv_imuldiv instead of nodiv_ullimd. It should also perform better,
since it uses 32-bit shifts, which are not real shifts.
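
For completeness, the kind of user-space harness such numbers can be
obtained with looks roughly like this (a hypothetical sketch, not the
benchmark actually used in this thread; each variant would be wrapped to
a common unsigned long long -> unsigned long long signature):

/* Average cost in cycles of one conversion call, measured with rdtsc. */
static inline unsigned long long rdtsc(void)
{
	unsigned int lo, hi;
	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return ((unsigned long long)hi << 32) | lo;
}

#define LOOPS 1000000

unsigned long long bench(unsigned long long (*conv)(unsigned long long))
{
	volatile unsigned long long sink = 0;
	unsigned long long t0, t1;
	unsigned int i;

	t0 = rdtsc();
	for (i = 0; i < LOOPS; i++)
		sink += conv(1000000ULL + i);	/* ~1 ms operands */
	t1 = rdtsc();
	(void)sink;

	return (t1 - t0) / LOOPS;	/* rough cycles per conversion */
}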

-- 


Gilles Chanteperdrix.



Re: [Xenomai-core] ns vs. tsc as internal timer base

2006-06-13 Thread Philippe Gerum

Jan Kiszka wrote:

Philippe Gerum wrote:


Here is likely why we have different levels of accuracy and performance,
firstly my version is bluntly based on the khz freq, secondly it
calculates the other way around, i.e. ns2tsc, so that tsc are keep in
the inner code, but more efficiently converted from ns counts passed to
the outer interface:

static unsigned long ns2cyc_scale;
#define NS2CYC_SCALE_FACTOR 10 /* 2^10, carefully chosen */

static inline void set_ns2cyc_scale(unsigned long cpu_khz)
{
   ns2cyc_scale = (cpu_khz  NS2CYC_SCALE_FACTOR) / 100;
}

static inline unsigned long long ns_2_cycles(unsigned long long ns)
{
   return ns * ns2cyc_scale  NS2CYC_SCALE_FACTOR;
}



Your version performs ~50% better than mine (outperforming the original
version by factor 7 on a 1 GHz box, vs. 4.8). I think you compared
non-optimised code, didn't you?


Nah, I'm not that drunk!

 Without -O2, I see 15 times better

performance.


Redone the check here on a Centrino 1.6 GHz, and still have roughly x20 
improvement (a bit better actually). I'm using Debian/sarge gcc 3.3.5.




[Gilles' variant still refuses to get benchmarked.]

Jan




--

Philippe.



Re: [Xenomai-core] ns vs. tsc as internal timer base

2006-06-13 Thread Gilles Chanteperdrix
Philippe Gerum wrote:
  Redone the check here on a Centrino 1.6 GHz, and still have roughly x20 
  improvement (a bit better actually). I'm using Debian/sarge gcc 3.3.5.

I think I remember that Pentium M has a much shorter mull instruction
than other processors of the family.

-- 


Gilles Chanteperdrix.



Re: [Xenomai-core] ns vs. tsc as internal timer base

2006-06-13 Thread Philippe Gerum

Gilles Chanteperdrix wrote:

Philippe Gerum wrote:
  Redone the check here on a Centrino 1.6 GHz, and still have roughly x20 
  improvement (a bit better actually). I'm using Debian/sarge gcc 3.3.5.


I think I remember that Pentium M has a much shorter mull instruction
than other processors of the family.



That would explain. Anyway, as John Stultz put it:
"math is hard, let's go shopping!"

--

Philippe.
