Philippe Gerum wrote:
> Jan Kiszka wrote:
>> Hi,
>> between some football half-times of the last days ;), I played a bit
>> with a hand-optimised xnarch_tsc_to_ns() for x86. Using scaled math, I
>> achieved between 3 (P-I 133 MHz) to 4 times (P-M 1.3 GHz) faster
>> conversions than with the current variant. While this optimisation only
>> saves a few tens of nanoseconds on high-end machines, slow processors can
>> gain several hundred nanoseconds per conversion (my P-133: -600 ns).
> I did exactly the same a few weeks ago, based on Anzinger's scaled math

:) We should coordinate better.

> from i386/kernel/timers/timer_tsc.c. And indeed, I had 20x performance
> improvements in some cases.

Oops, that sounds like a rather extreme improvement. Does the original
version really vary that much? I didn't observe this.

Here is my current version, BTW:

long tsc_scale;
unsigned int tsc_shift = 31;

static inline long long fast_tsc_to_ns(long long ts)
{
    long long ret;

    __asm__ (
        /* HI = HIWORD(ts) * tsc_scale */
        "mov  %%eax,%%ebx\n\t"
        "mov  %%edx,%%eax\n\t"
        "imull %2\n\t"
        "mov  %%eax,%%esi\n\t"
        "mov  %%edx,%%edi\n\t"

        /* LO = LOWORD(ts) * tsc_scale */
        "mov  %%ebx,%%eax\n\t"
        "mull %2\n\t"

        /* ret = (HI << 32) + LO */
        "add  %%esi,%%edx\n\t"
        "adc  $0,%%edi\n\t"

        /* ret = ret >> tsc_shift */
        "shrd %%cl,%%edx,%%eax\n\t"
        "shrd %%cl,%%edi,%%edx\n\t"
        : "=A"(ret)
        : "A" (ts), "m" (tsc_scale), "c" (tsc_shift)
        : "ebx", "esi", "edi");

    return ret;
}

void init_tsc(unsigned long cpu_freq)
{
    unsigned long long scale;

    while (1) {
        scale = 1000000000ULL << tsc_shift;
        do_div(scale, cpu_freq); /* Linux do_div divides in place */
        if (scale <= 0x7FFFFFFF)
            break;
        tsc_shift--;
    }
    tsc_scale = scale;
}

This version uses shifts from 31 (GHz-range cpu_freq) down to 26 (~32 MHz),
i.e. a bit more than the Linux kernel's 22 bits.

>> This does not come for free: accuracy of very large values is slightly
>> worse, but that's likely negligible compared to the clock accuracy of
>> TSCs (does anyone have any real numbers on the latter, BTW?).
> We do start losing significant precision for 2 ms delays and above,
> IIRC. This could be an issue for some events in aperiodic mode, albeit
> we could use a plain divide for those. The cost of conditionally doing
> this remains to be evaluated though.

Maybe I tested (not calculated - math is too hard for me :o)) the wrong
values, but I didn't see such high regressions.

>> As we lose some bits one way, converting back still requires a "real"
>> division (i.e. the use of the existing, slower xnarch_ns_to_tsc).
>> Otherwise, we would already get significant errors for small intervals.
>> To avoid losing the optimisation again in ns_to_tsc, I thought about
>> basing the whole internal timer arithmetic on nanoseconds instead of
>> TSCs, as it is now. Although I have dug quite a lot into the current
>> timer subsystem in the last weeks, I may still overlook aspects, and
>> I'm x86-biased. Hence my question before thinking or even patching
>> further in this direction: what was the motivation to choose TSCs as
>> the internal time base?
> TSC are not the whole nucleus time base, but only the timer management
> one. The motivation to use TSCs in nucleus/timer.c was to pick a unit
> which would not require any conversion beyond the initial one in
> xntimer_start.

That helps strictly periodic application timers, not aperiodic ones like

>> Any pitfalls down the road (except introducing regressions)?
> Well, pitfalls expected from changing the core idea of time of the timer
> management code... :o>

You mean turning




e.g. ?


Xenomai-core mailing list
