Philippe Gerum wrote:
> Jan Kiszka wrote:
>> Philippe Gerum wrote:
>>> from i386/kernel/timers/timer_tsc.c. And indeed, I saw 20x performance
>>> improvements in some cases.
>>
>> Oops, that sounds like a rather extreme optimisation. Does the original
>> version vary that much? I didn't observe this.
>>
>> Here is my current version, BTW:
>>
>> long tsc_scale;
>> unsigned int tsc_shift = 31;
>>
>> static inline long long fast_tsc_to_ns(long long ts)
>> {
>>     long long ret;
>>
>>     __asm__ (
>>         /* HI = HIWORD(ts) * tsc_scale */
>>         "mov  %%eax,%%ebx\n\t"
>>         "mov  %%edx,%%eax\n\t"
>>         "imull %2\n\t"
>>         "mov  %%eax,%%esi\n\t"
>>         "mov  %%edx,%%edi\n\t"
>>
>>         /* LO = LOWORD(ts) * tsc_scale */
>>         "mov  %%ebx,%%eax\n\t"
>>         "mull %2\n\t"
>>
>>         /* ret = (HI << 32) + LO */
>>         "add  %%esi,%%edx\n\t"
>>         "adc  $0,%%edi\n\t"
>>
>>         /* ret = ret >> tsc_shift */
>>         "shrd %%cl,%%edx,%%eax\n\t"
>>         "shrd %%cl,%%edi,%%edx\n\t"
>>         : "=A"(ret)
>>         : "A" (ts), "m" (tsc_scale), "c" (tsc_shift)
>>         : "ebx", "esi", "edi");
>>
>>     return ret;
>> }
>>
>> void init_tsc(unsigned long cpu_freq)
>> {
>>     unsigned long long scale;
>>
>>     while (1) {
>>         /* do_div() divides its first argument in place and returns the remainder */
>>         scale = 1000000000ULL << tsc_shift;
>>         do_div(scale, cpu_freq);
>>         if (scale <= 0x7FFFFFFF)
>>             break;
>>         tsc_shift--;
>>     }
>>     tsc_scale = scale;
>> }
>>
>> This version will use shifts from 31 (GHz-range cpu_freq) down to 26
>> (~32 MHz), i.e. a bit more than the Linux kernel's 22 bits.
>>
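For comparison, here is roughly the same scheme as a portable C sketch. It is not the code above: it is unsigned where the asm is signed, the struct and function names (tsc_conv, tsc_conv_init, tsc_to_ns) are made up for illustration, and it assumes cpu_freq_hz is non-zero and that the final result fits in 64 bits.

```c
#include <stdint.h>

/* Hypothetical container for the scale/shift pair (not a Xenomai type). */
struct tsc_conv {
    uint32_t scale;   /* ns per cycle, multiplied by 2^shift */
    unsigned shift;
};

/* Pick the largest shift <= 31 whose scale still fits in 31 bits. */
static struct tsc_conv tsc_conv_init(uint32_t cpu_freq_hz)
{
    struct tsc_conv c = { 0, 31 };

    for (;;) {
        uint64_t scale = (UINT64_C(1000000000) << c.shift) / cpu_freq_hz;
        if (scale <= 0x7FFFFFFF) {
            c.scale = (uint32_t)scale;
            return c;
        }
        c.shift--;
    }
}

/* ns = (tsc * scale) >> shift, without losing the upper product bits. */
static uint64_t tsc_to_ns(const struct tsc_conv *c, uint64_t tsc)
{
    uint64_t lo = (tsc & 0xFFFFFFFFu) * c->scale;  /* product bits 0..63  */
    uint64_t hi = (tsc >> 32) * c->scale;          /* product bits 32..95 */

    /* (hi * 2^32 + lo) >> shift, valid because shift <= 31 */
    return (hi << (32 - c->shift)) + (lo >> c->shift);
}
```

With a 2 GHz clock this picks shift 31, and 2000000000 cycles convert to exactly 1000000000 ns.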
> 
> This is likely why we see different levels of accuracy and performance.
> Firstly, my version is bluntly based on the kHz frequency. Secondly, it
> calculates the other way around, i.e. ns2tsc, so that TSCs are kept in
> the inner code, and ns counts passed to the outer interface are
> converted more efficiently:
> 
> static unsigned long ns2cyc_scale;
> #define NS2CYC_SCALE_FACTOR 10 /* 2^10, carefully chosen */

Linux only uses 10 bits for scheduling time calculation, which is
tick-based (low-res) anyway. The tsc clock_source uses 22 bits. The
latter overflows after an hour or so, because it drops all bits above
bit 64 after the multiplication. Dropping them is only insignificantly
faster than keeping them when using optimised code anyway.
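
To put a number on "an hour or so", here is a back-of-the-envelope sketch. It assumes the multiplier is computed as (10^9 << shift) / freq and that the product is truncated to 64 bits; overflow_seconds is a made-up helper, not a kernel function.

```c
#include <stdint.h>

/*
 * Seconds of cycle count that can be converted before
 * cycles * mult no longer fits in 64 bits.
 */
static uint64_t overflow_seconds(uint64_t freq_hz, unsigned shift)
{
    uint64_t mult = (UINT64_C(1000000000) << shift) / freq_hz;
    uint64_t max_cycles = UINT64_MAX / mult;  /* largest safe cycle count */

    return max_cycles / freq_hz;
}
```

For shift = 22 this gives 4398 s, i.e. about 73 minutes, at both 1 GHz and 2 GHz: the horizon is essentially 2^(64-22) ns, independent of the clock rate.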

> 
> static inline void set_ns2cyc_scale(unsigned long cpu_khz)
> {
>     ns2cyc_scale = (cpu_khz << NS2CYC_SCALE_FACTOR) / 1000000;
> }
> 
> static inline unsigned long long ns_2_cycles(unsigned long long ns)
> {
>     return ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;
> }
> 
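To get a feeling for the accuracy of a 10-bit scale, here is a quick sketch of my own (not the code quoted above). The helper names are invented, the 2.4 GHz figure is an arbitrary example, and the 64-bit intermediate in the scale computation is my addition to sidestep overflow of a 32-bit unsigned long.

```c
#include <stdint.h>

#define NS2CYC_SHIFT 10  /* same 2^10 scaling as in the quoted code */

/* Scale factor: cycles per ns, multiplied by 2^10 and truncated. */
static uint32_t ns2cyc_scale_for(uint32_t cpu_khz)
{
    return (uint32_t)(((uint64_t)cpu_khz << NS2CYC_SHIFT) / 1000000u);
}

/* Fast path: one multiplication and one shift. */
static uint64_t ns_to_cycles_scaled(uint64_t ns, uint32_t scale)
{
    return (ns * scale) >> NS2CYC_SHIFT;
}

/* Reference: full division; only safe while ns * cpu_khz fits in 64 bits. */
static uint64_t ns_to_cycles_exact(uint64_t ns, uint32_t cpu_khz)
{
    return ns * cpu_khz / 1000000u;
}
```

At cpu_khz = 2400000 the scale truncates from 2457.6 to 2457, so a 1 ms delay becomes 2399414 cycles instead of 2400000, i.e. 586 cycles (about 0.02 %) short.
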
>>>
>>> TSCs are not the whole nucleus time base, but only the timer management
>>> one. The motivation to use TSCs in nucleus/timer.c was to pick a unit
>>> which would not require any conversion beyond the initial one in
>>> xntimer_start.
>>
>>
>> That helps strictly periodic application timers, not aperiodic ones like
>> timeouts.
>>
> 
> It depends. Periodic timers usually exhibit larger delays, so the gain
> is more significant with oneshot timings, which incur smaller delays and
> hence a higher number of calculations.
> 
>>
>>>> Any pitfalls down the road (except introducing regressions)?
>>>
>>> Well, pitfalls expected from changing the core idea of time of the timer
>>> management code... :o>
>>>
>>
>> You mean turning
>>
>> rthal_timer_program_shot(rthal_imuldiv(delay,RTHAL_TIMER_FREQ,RTHAL_CPU_FREQ));
>>
>>
>> into
>>
>> rthal_timer_program_shot(rthal_imuldiv(delay,RTHAL_TIMER_FREQ,1000000000));
>>
>>
> 
> Not really, it was a general remark about changing code that might
> have some assumptions about using TSCs. Additionally, only x86 needs to
> rescale TSC values to the timer frequency, other archs use the same unit
> on both sides, and such unit might even have nothing to do with any CPU
> accounting (e.g. blackfin uses a free running timer, ppc uses the
> internal timebase, etc).

Ok, an interesting aspect that I had already assumed but not yet checked
in detail. That makes dealing with TSCs interesting again on non-x86
archs. In contrast, on x86, there is the aspect of frequency scaling
that Anders brought up, which would speak in favour of nanoseconds.

> 
> This said, it should not have that many assumptions, and in any case,
> they should be confined to nucleus/timer.c. I think we should give this
> kind of optimization a try.
> 

Yep, it just needs some more brain cycles on how to do this precisely.

Jan

_______________________________________________
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core
