Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Gilles Chanteperdrix wrote:
>>
>>> Jan Kiszka wrote:
>>>
>>>> Gilles Chanteperdrix wrote:
>>>>
>>>>
>>>>> Jan Kiszka wrote:
>>>>>
>>>>>> Jan Kiszka wrote:
>>>>>> ...
>>>>>>
>>>>>>> fast-tsc-to-ns-v2.patch
>>>>>>>
>>>>>>>    [Rebased, improved rounding of least significant digit]
>>>>>> Rounding in the fast path for the sake of the last digit was silly.
>>>>>> Instead, I'm now addressing the ugly interval printing via
>>>>>> xnarch_precise_tsc_to_ns when converting the timer interval back into
>>>>>> nanos. -v3 incorporating this has just been uploaded.
>>>>> Hi,
>>>>>
>>>>> I had a look at the fast-tsc-to-ns implementation, here is how I would
>>>>> rewrite it:
>>>>>
>>>>> static inline void xnarch_init_llmulshft(const unsigned m_in,
>>>>>                                    const unsigned d_in,
>>>>>                                    unsigned *m_out,
>>>>>                                    unsigned *s_out)
>>>>> {
>>>>>   unsigned long long mult;
>>>>>
>>>>>   *s_out = 31;
>>>>>   while (1) {
>>>>>           mult = ((unsigned long long)m_in) << *s_out;
>>>>>           do_div(mult, d_in);
>>>>>           if (mult <= INT_MAX)
>>>>>                   break;
>>>>>           (*s_out)--;
>>>>>   }
>>>>>   *m_out = (unsigned)mult;
>>>>> }
>>>>>
>>>>> /* Non x86. Assumes 1 <= s <= 31. */
>>>>> #define __rthal_u96shift(h, m, l, s) ({           \
>>>>>   unsigned _l = (l);                      \
>>>>>   unsigned _m = (m);                      \
>>>>>   unsigned _s = (s);                      \
>>>>>   _l >>= _s;                              \
>>>>>   _l |= (_m << (32 - _s));                \
>>>>>   _m >>= _s;                              \
>>>>>   _m |= ((h) << (32 - _s));               \
>>>>>   __rthal_u64fromu32(_m, _l);             \
>>>>> })
>>>>>
>>>>> /* x86 */
>>>>> #define __rthal_u96shift(h, m, l, s) ({           \
>>>>>   unsigned _l = (l);                      \
>>>>>   unsigned _m = (m);                      \
>>>>>   unsigned _s = (s);                      \
>>>>>   asm ("shrdl\t%%cl,%1,%0"                \
>>>>>        : "+r,?m"(_l)                      \
>>>>>        : "r,r"(_m), "c,c"(_s));           \
>>>>>   asm ("shrdl\t%%cl,%1,%0"                \
>>>>>        : "+r,?m"(_m)                      \
>>>>>        : "r,r"(h), "c,c"(_s));            \
>>>>>   __rthal_u64fromu32(_m, _l);             \
>>>>> })
>>>>>
>>>>> static inline long long rthal_llmi(int i, int j)
>>>>> {
>>>>>       /* Signed fast 32x32->64 multiplication */
>>>>>   return (long long) i * j;
>>>>> }
>>>>>
>>>>> static inline long long gilles_llmulshft(const long long op,
>>>>>                                    const unsigned m,
>>>>>                                    const unsigned s)
>>>>> {
>>>>>   unsigned oph, opl, tlh, tll, thh, thl;
>>>>>   unsigned long long th, tl;
>>>>>
>>>>>   __rthal_u64tou32(op, oph, opl);
>>>>>   tl = rthal_ullmul(opl, m);
>>>>>   __rthal_u64tou32(tl, tlh, tll);
>>>>>   th = rthal_llmi(oph, m);
>>>>>   th += tlh;
>>>>>   __rthal_u64tou32(th, thh, thl);
>>>>>   
>>>>>   return __rthal_u96shift(thh, thl, tll, s);
>>>>> }
>>>>>
>>>>>
>>>> Thanks for your suggestion.
>>>>
>>>> While your generic version produces comparable code, the x86 variant is
>>>> about twice as large as the full-assembly version. And code size
>>>> translates into I-cache occupation, which may have latency costs.
>>>>
>>>> [gcc 4.1, i386]
>>>> -O2 -mregparm=3 -fomit-frame-pointer:
>>>>    63: 08048490   119 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>>>    68: 08048510   121 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>>>    77: 08048450    57 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>>>    78: 080483c0   135 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>>>
>>>> -Os -mregparm=3 -fomit-frame-pointer:
>>>>    63: 0804843b    93 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>>>    68: 08048498    97 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>>>    77: 08048410    43 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>>>    78: 080483b4    92 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>>>
>>>> -O2:
>>>>    63: 08048480   120 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>>>    68: 08048500   105 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>>>    77: 08048440    60 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>>>    78: 080483c0   117 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>>>
>>>> -Os:
>>>>    63: 08048438   104 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>>>    68: 080484a0    83 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>>>    77: 0804840b    45 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>>>    78: 080483b4    87 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>>>
>>>> I'm not arguing we should turn each and every piece of Xenomai arch
>>>> code into pure assembly. But in this case it has already happened, it
>>>> is less scattered source-wise, and it is more compact object-wise. So
>>>> I would prefer to keep it as is.
>>> I would say the advantages of having a C version outweigh those of the
>>> full-assembly version. C is really easier to understand and debug.
>>
>> Personally, I prefer the clear (and commented) assembly over the nested
>> macros and inlines.
> 
> Not when the macros and inlines bear names that are easy to understand.
> If you do not find the names easy to understand, then change them (I do
> not like rthal_llmul either, but I could not find a better name). To
> make the assembly fully understandable, you would need to comment every
> statement. And now, run the assembly code in gdb and try to print the
> value of a 64-bit intermediate result: you can't.

No question, this is a matter of taste.

> 
>>
>>> The differences between the two versions are some register moves, which
>>> cost almost nothing, especially since each operation in the assembly
>>
>> Cycle-wise, you are right. But what bites us more in the worst case are
>> memory accesses, specifically when they are not cached. Code size
>> matters more according to my experience.
>>
>>
>>> version depends on the result of the previous one, which means lots of
>>> pipeline stalls; the register moves will just feed the pipeline.
>>> I do not think they really matter. Look at the assembly produced for
>>> gilles_llmulshft on ARM, a low end architecture where each instruction
>>> really costs:
>>> gilles_llmulshft:
>>>        @ args = 0, pretend = 0, frame = 0
>>>        @ frame_needed = 0, uses_anonymous_args = 0
>>>        @ link register save eliminated.
>>>        stmfd   sp!, {r4, r5, r6, r7}
>>>        umull   r6, r7, r0, r2
>>>        mov     r4, r7
>>>        mov     r5, #0
>>>        smlal   r4, r5, r2, r1
>>>        rsb     ip, r3, #32
>>>        mov     r2, r4, lsr r3
>>>        orr     r1, r2, r5, asl ip
>>>        mov     r2, r2, asl ip
>>>        orr     r0, r2, r6, lsr r3
>>>        @ lr needed for prologue
>>>        ldmfd   sp!, {r4, r5, r6, r7}
>>>        mov     pc, lr
>>>
>>> pretty minimal, no ?
>>
>> OK, your version can perfectly well go into the ARM arch. But i386 is
>> different: fewer registers, thus easily a lot of variable shuffling...
> 
> variable shuffling which does not really matter, that is my point,
> otherwise the x86 family would not be as fast as it is.

Think of the *code size*...

> 
>>
>>> The full assembly version has another big drawback: it is a big block
>>> that the optimizer cannot split, whereas with a C version the optimizer
>>> can decide to interleave the surrounding code. So a C version will
>>> inline better.
>>
>> We are not inlining that service anymore, at least not for its primary
>> usage, tsc-to-ns. Inlining costs object size, and thus increases the
>> latency (although it saves us a few cycles).
> 
> it *is* inlined, in tsc_to/from_ns. Another question that I forgot in my

xnarch_tsc_to_ns uninlines this service, and I don't see other, larger
users so far.

> previous mails: why not use llmulshft for both services?

See below, and see my original post on all the conversion approaches:
scaled math is inaccurate; doing it in both directions may cause
noticeable errors when comparing calculated vs. measured time stamps
over, granted, fairly long periods.

> 
>>
>>> There is one thing I do not like about llmulshft (any implementation):
>>> the rounding policy towards minus infinity. llmulshft(-1, 2/3)
>>> returns -1 whereas llimd would return 0.
>>
>> See other postings: rounding of the last digit doesn't matter with
>> scaled math, it's already inaccurate by nature. That's also why we have
>> it only one-way.
> 
> When returning -1 instead of 0, it is not the last digit that is wrong,
> but the first (and only) one.

So this is about -1 nanoseconds vs. 0 nanoseconds. Well, does this error
matter in real life? :->



_______________________________________________
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core
