Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Gilles Chanteperdrix wrote:
>>> Jan Kiszka wrote:
>>>> Gilles Chanteperdrix wrote:
>>>>> Hi Jan,
>>>>> I see that the implementation of rthal_llmulshft seems to account for
>>>>> the sign of its first argument. Does it work? Namely, in the generic
>>>>> implementation, will __rthal_u96shift propagate the sign bit?
>>>> Yes, this works (given there is no overflow, of course). If you consider
>>>> a high word of 0xfffffff0 and a (right) shift of 8, we effectively cut
>>>> off all the leading 1s: high << (32-8) = 0xf0000000. But this only works
>>>> because we replace a right shift with a left shift (plus some OR'ing
>>>> later on). If we had to do a real right shift, we would also have to
>>>> take signed vs. unsigned into account (i.e. shift in zeros or the sign
>>>> bit from the left?).
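
To make the shift argument concrete, here is a minimal sketch of a
multiply-then-shift that handles a signed first operand the way described
above. It is not the Xenomai code: mulshft_sketch and its variable names
are hypothetical, it assumes 0 < s < 32 and a result that fits in 64 bits,
and it relies on the usual two's complement conversions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a generic llmulshft: (op * m) >> s,
 * assuming 0 < s < 32 and no overflow of the 64-bit result. */
static inline int64_t mulshft_sketch(int64_t op, uint32_t m, unsigned s)
{
	uint32_t opl = (uint32_t)op;                  /* low word, unsigned */
	int32_t oph = (int32_t)((uint64_t)op >> 32);  /* high word, carries the sign */

	uint64_t tl = (uint64_t)opl * m;              /* unsigned partial product */
	int64_t th = (int64_t)oph * m                 /* signed partial product */
		   + (int64_t)(tl >> 32);             /* plus carry from the low part */

	/* The 96-bit product, split into three 32-bit words. */
	uint32_t h = (uint32_t)((uint64_t)th >> 32);  /* bits 64..95 */
	uint32_t mid = (uint32_t)th;                  /* bits 32..63 */
	uint32_t l = (uint32_t)tl;                    /* bits 0..31 */

	/* Right shift by s, written as left shifts of the upper words.
	 * The leading sign bits of h are cut off by the left shift, so no
	 * signed/unsigned distinction is needed at this point. */
	uint32_t rl = (l >> s) | (mid << (32 - s));
	uint32_t rh = (mid >> s) | (h << (32 - s));

	return (int64_t)(((uint64_t)rh << 32) | rl);
}

/* Small self-check on values whose results fit in 64 bits. */
static inline void mulshft_selfcheck(void)
{
	assert(mulshft_sketch(-1000, 3, 1) == -1500);
	assert(mulshft_sketch(1000, 3, 1) == 1500);
}
```

Only the multiplication of the high word has to be signed; the shift
itself treats all three words as plain bit patterns.
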
>>>>> If so, do you see a way llimd could be made to work the same way? That
>>>>> way we would avoid inlining ullimd twice in the llimd code.
>>>> As the basic building block here is a multiplication, we cannot get
>>>> around telling apart signed from unsigned (or converting signed into
>>>> unsigned): the underlying multiplication logic is different.
>>>> But what about this approach:
>>>> static inline __attribute__((__const__)) long long
>>>> __rthal_generic_llimd (long long op, unsigned m, unsigned d)
>>>> {
>>>>    int sign = 0;
>>>>    long long ret;
>>>>    if (op < 0LL) {
>>>>            op = -op;
>>>>            sign = 1;
>>>>    }
>>>>    ret = __rthal_generic_ullimd(op, m, d);
>>>>    return sign ? -ret : ret;
>>>> }
>>>> However, I guess writing this in assembly for archs that suffer should
>>>> be more efficient.
>>> Hi Jan,
>>> You may have noticed that we played a bit with arithmetic operations
>>> (namely, we use an llimd without division to make the reverse of
>>> llmulshft), and it pays off on slow machines, such as ARM, where the
>>> division is done in software.
>>> While at it, I looked at the code generated by this solution, and I am
>>> not sure that it is better: on ARM, and I suspect this is true on other
>>> architectures, the operations needed to negate a long long clobber the
>>> condition codes, which means we cannot make these operations
>>> conditional without a conditional jump, so hand-coded assembler is
>>> no better than what the compiler does: it uses two conditional jumps
>>> whereas the original solution uses only one. Of course we could set sign
>>> to -1 or 1, and multiply by sign at the end, but the multiplication is
>>> probably even heavier than a conditional jump.
>> Yes, on the archs that matter here (32-bit).
>>> So, do you have any idea for a better solution?
>> In an assembly version, one could save 'sign' in the form of a jump
>> target that should be taken after __rthal_generic_ullimd (i.e. jump to
>> the negation, or jump over it). Especially when that address is kept in
>> a register, I think smart branch prediction units will be able to make
>> the right forecast.
> Good idea, there is even a gcc extension (labels as values) which allows
> doing this in the generic section:
> static inline __attribute__((__const__)) long long
> __rthal_generic_llimd (long long op, unsigned m, unsigned d)
> {
>       void *epilogue;
>       long long ret;
>       if (op < 0LL) {
>               op = -op;
>               epilogue = &&ret_neg;
>       } else
>               epilogue = &&ret_unchanged;
>       ret = __rthal_generic_ullimd(op, m, d);
>       goto *epilogue;
> ret_unchanged:
>       return ret;
> ret_neg:
>       return -ret;
> }

This works as expected on ARM, however, gcc 4.0 on x86 generates two
calls to __rthal_generic_ullimd with the indirect jump after each one.
It seems it has stopped half-way when "optimizing"...
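
For anyone who wants to compare the generated code on their own compiler,
here is a minimal self-contained sketch of both variants. The names
ullimd_sketch, llimd_flag and llimd_goto are made up for this example, and
ullimd_sketch is only a simple stand-in for __rthal_generic_ullimd (it
assumes d != 0 and a result that fits in 64 bits). Like the versions quoted
above, both variants assume op is not the most negative long long. The
second variant needs the gcc labels-as-values extension.

```c
#include <assert.h>

/* Stand-in for __rthal_generic_ullimd: op * m / d without overflowing the
 * intermediate product. With op = q * d + r, op * m / d equals
 * q * m + (r * m) / d, and r * m fits in 64 bits since r < d < 2^32. */
static inline unsigned long long
ullimd_sketch(unsigned long long op, unsigned m, unsigned d)
{
	unsigned long long q = op / d;
	unsigned long long r = op % d;

	return q * m + r * m / d;
}

/* Variant 1: remember the sign in a flag, test it again at the end. */
static inline long long llimd_flag(long long op, unsigned m, unsigned d)
{
	int sign = 0;
	long long ret;

	if (op < 0LL) {
		op = -op;
		sign = 1;
	}
	ret = (long long)ullimd_sketch((unsigned long long)op, m, d);
	return sign ? -ret : ret;
}

/* Variant 2: remember the sign as the address of the epilogue to run
 * (gcc extension: &&label and goto *pointer). */
static inline long long llimd_goto(long long op, unsigned m, unsigned d)
{
	void *epilogue;
	long long ret;

	if (op < 0LL) {
		op = -op;
		epilogue = &&ret_neg;
	} else
		epilogue = &&ret_unchanged;
	ret = (long long)ullimd_sketch((unsigned long long)op, m, d);
	goto *epilogue;
ret_unchanged:
	return ret;
ret_neg:
	return -ret;
}

/* Both variants must agree; the quotient is truncated toward zero. */
static inline void llimd_selfcheck(void)
{
	assert(llimd_flag(-10, 3, 2) == -15);
	assert(llimd_goto(-10, 3, 2) == -15);
}
```

Compiling this with -O2 -S and looking at the number of calls (or inlined
copies) of ullimd_sketch in each variant should show whether a given
compiler duplicates the unsigned path.
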


Xenomai-core mailing list
