Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
> 
>>Jan Kiszka wrote:
>>
>>>Gilles Chanteperdrix wrote:
>>>
>>>
>>>>Jan Kiszka wrote:
>>>>
>>>>>Jan Kiszka wrote:
>>>>>...
>>>>>
>>>>>>fast-tsc-to-ns-v2.patch
>>>>>>
>>>>>>    [Rebased, improved rounding of least significant digit]
>>>>>
>>>>>Rounding in the fast path for the sake of the last digit was silly.
>>>>>Instead, I'm now addressing the ugly interval printing via
>>>>>xnarch_precise_tsc_to_ns when converting the timer interval back into
>>>>>nanos. -v3 incorporating this has just been uploaded.
>>>>
>>>>Hi,
>>>>
>>>>I had a look at the fast-tsc-to-ns implementation, here is how I would
>>>>rewrite it:
>>>>
>>>>static inline void xnarch_init_llmulshft(const unsigned m_in,
>>>>                                     const unsigned d_in,
>>>>                                     unsigned *m_out,
>>>>                                     unsigned *s_out)
>>>>{
>>>>    unsigned long long mult;
>>>>
>>>>    *s_out = 31;
>>>>    while (1) {
>>>>            mult = ((unsigned long long)m_in) << *s_out;
>>>>            do_div(mult, d_in);
>>>>            if (mult <= INT_MAX)
>>>>                    break;
>>>>            (*s_out)--;
>>>>    }
>>>>    *m_out = (unsigned)mult;
>>>>}
>>>>
>>>>/* Non x86: shift the 96-bit value h:m:l right by s,
>>>>   return the low 64 bits of the result. */
>>>>#define __rthal_u96shift(h, m, l, s) ({             \
>>>>    unsigned _l = (l);                      \
>>>>    unsigned _m = (m);                      \
>>>>    unsigned _s = (s);                      \
>>>>    _l >>= _s;                              \
>>>>    _l |= (_m << (32 - _s));                \
>>>>    _m >>= _s;                              \
>>>>    _m |= ((h) << (32 - _s));               \
>>>>    __rthal_u64fromu32(_m, _l);             \
>>>>})
>>>>
>>>>/* x86 */
>>>>#define __rthal_u96shift(h, m, l, s) ({             \
>>>>    unsigned _l = (l);                      \
>>>>    unsigned _m = (m);                      \
>>>>    unsigned _s = (s);                      \
>>>>    asm ("shrdl\t%%cl,%1,%0"                \
>>>>         : "+r,?m"(_l)                      \
>>>>         : "r,r"(_m), "c,c"(_s));           \
>>>>    asm ("shrdl\t%%cl,%1,%0"                \
>>>>         : "+r,?m"(_m)                      \
>>>>         : "r,r"(h), "c,c"(_s));            \
>>>>    __rthal_u64fromu32(_m, _l);             \
>>>>})
>>>>
>>>>static inline long long rthal_llmi(int i, int j)
>>>>{
>>>>       /* Signed fast 32x32->64 multiplication */
>>>>    return (long long) i * j;
>>>>}
>>>>
>>>>static inline long long gilles_llmulshft(const long long op,
>>>>                                     const unsigned m,
>>>>                                     const unsigned s)
>>>>{
>>>>    unsigned oph, opl, tlh, tll, thh, thl;
>>>>    unsigned long long th, tl;
>>>>
>>>>    __rthal_u64tou32(op, oph, opl);
>>>>    tl = rthal_ullmul(opl, m);
>>>>    __rthal_u64tou32(tl, tlh, tll);
>>>>    th = rthal_llmi(oph, m);
>>>>    th += tlh;
>>>>    __rthal_u64tou32(th, thh, thl);
>>>>    
>>>>    return __rthal_u96shift(thh, thl, tll, s);
>>>>}
>>>>
>>>>
>>>
>>>Thanks for your suggestion.
>>>
>>>While your generic version produces comparable code, the x86 variant is
>>>about twice as large as the full-assembly version. And code size
>>>translates into I-cache occupation, which may have latency costs.
>>>
>>>[gcc 4.1, i386]
>>>-O2 -mregparm=3 -fomit-frame-pointer:
>>>    63: 08048490   119 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>>    68: 08048510   121 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>>    77: 08048450    57 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>>    78: 080483c0   135 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>>
>>>-Os -mregparm=3 -fomit-frame-pointer:
>>>    63: 0804843b    93 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>>    68: 08048498    97 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>>    77: 08048410    43 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>>    78: 080483b4    92 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>>
>>>-O2:
>>>    63: 08048480   120 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>>    68: 08048500   105 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>>    77: 08048440    60 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>>    78: 080483c0   117 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>>
>>>-Os:
>>>    63: 08048438   104 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>>    68: 080484a0    83 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>>    77: 0804840b    45 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>>    78: 080483b4    87 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>>
>>>I'm not arguing we should turn each and every Xenomai arch code into
>>>pure assembly. But in this case it already happened, it is less
>>>scattered source-wise, and it is more compact object-wise. So I would
>>>prefer to keep it as is.
>>
>>I would say the advantages of having a C version outweigh the
>>advantages of the full assembly version. C is really easier to
>>understand and debug.
> 
> 
> Personally, I prefer the clear (and commented) assembly over the nested
> macros and inlines.

Not when the macros and inlines bear names that are easy to understand.
If you do not find the names easy to understand, then change them (I do
not like rthal_llmul either, but I could not find a better name). To
make the assembly fully understandable, you would need to comment every
statement. And now, run the assembly code in gdb and try to print the
value of a 64-bit intermediate result: you can't.

> 
> 
>>The differences between the two versions are some register moves, which
>>cost almost nothing, especially since each operation in the assembly
> 
> 
> Cycle-wise, you are right. But what bites us more in the worst case are
> memory accesses, specifically when they are not cached. Code size
> matters more according to my experience.
> 
> 
>>version depends on the result of the previous operation, which means
>>lots of pipeline stalls; the register moves will just feed the
>>pipeline. I do not think they really matter. Look at the assembly
>>produced for gilles_llmulshft on ARM, a low-end architecture where
>>each instruction really costs:
>>gilles_llmulshft:
>>        @ args = 0, pretend = 0, frame = 0
>>        @ frame_needed = 0, uses_anonymous_args = 0
>>        @ link register save eliminated.
>>        stmfd   sp!, {r4, r5, r6, r7}
>>        umull   r6, r7, r0, r2
>>        mov     r4, r7
>>        mov     r5, #0
>>        smlal   r4, r5, r2, r1
>>        rsb     ip, r3, #32
>>        mov     r2, r4, lsr r3
>>        orr     r1, r2, r5, asl ip
>>        mov     r2, r2, asl ip
>>        orr     r0, r2, r6, lsr r3
>>        @ lr needed for prologue
>>        ldmfd   sp!, {r4, r5, r6, r7}
>>        mov     pc, lr
>>
>>Pretty minimal, no?
> 
> 
> OK, your version can perfectly go into the ARM arch. But i386 is
> different: fewer registers, thus easily a lot of variable shuffling...

Variable shuffling which does not really matter, that is my point;
otherwise the x86 family would not be as fast as it is.

> 
> 
>>The full assembly version has another big drawback: it is a big block
>>that the optimizer cannot split, whereas with a C version, the
>>optimizer can decide to interleave the surrounding code. So a C
>>version will inline better.
> 
> 
> We are not inlining that service anymore, at least not for its primary
> usage tsc-to-ns. Inlining costs object size, thus increases the latency
> (although it saves us a few cycles).

It *is* inlined, in tsc_to/from_ns. Another question that I forgot in my
previous mails: why not use llmulshft for both services?

> 
> 
>>There is one thing I do not like with llmulshft (any implementation):
>>the rounding policy towards minus infinity. llmulshft(-1, 2/3)
>>returns -1 whereas llimd would return 0.
> 
> 
> See other postings: rounding of the last digit doesn't matter with
> scaled math; it's already inaccurate by nature. That's also why we
> have it only one-way.

When returning -1 instead of 0, it is not the last digit that is wrong,
but the first (and only) one.

-- 
                                                 Gilles Chanteperdrix

_______________________________________________
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core
