Gilles Chanteperdrix wrote:
> Gilles Chanteperdrix wrote:
>> Gilles Chanteperdrix wrote:
>>> Jan Kiszka wrote:
>>>> Gilles Chanteperdrix wrote:
>>>>> Jan Kiszka wrote:
>>>>>> Gilles Chanteperdrix wrote:
>>>>>>> Hi Jan,
>>>>>>> I see that the implementation of rthal_llmulshft seems to account for
>>>>>>> the sign of its first argument. Does it work? Namely, in the generic
>>>>>>> implementation, will __rthal_u96shift propagate the sign bit?
>>>>>> Yes, this works (given there is no overflow, of course). If you consider
>>>>>> a high word of 0xfffffff0 and a (right) shift of 8, we effectively cut
>>>>>> off all the leading 1s: high << (32-8) = 0xf0000000. But this only works
>>>>>> because we replace a right shift with a left shift (plus some OR'ing
>>>>>> later on). If we had to do a real right shift, we would also have to
>>>>>> take signed vs. unsigned into account (ie. shift in zeros or the sign
>>>>>> bit from the left?).
>>>>>>> If yes, do you see a way llimd could be made to work the same way? This
>>>>>>> way we would avoid inlining ullimd twice in the llimd code.
>>>>>> As the basic building block here is a multiplication, we cannot get
>>>>>> around telling apart signed from unsigned (or converting signed into
>>>>>> unsigned): the underlying multiplication logic is different.
>>>>>> But what about this approach:
>>>>>> static inline __attribute__((__const__)) long long
>>>>>> __rthal_generic_llimd (long long op, unsigned m, unsigned d)
>>>>>> {
>>>>>>  int sign = 0;	/* 'signed' is a C keyword, so use 'sign' */
>>>>>>  long long ret;
>>>>>>  if (op < 0LL) {
>>>>>>          op = -op;
>>>>>>          sign = 1;
>>>>>>  }
>>>>>>  ret = __rthal_generic_ullimd(op, m, d);
>>>>>>  return sign ? -ret : ret;
>>>>>> }
>>>>>> However, I guess writing this in assembly for archs that suffer should
>>>>>> be more efficient.
>>>>> Hi Jan,
>>>>> You may have noticed that we played a bit with arithmetic operations
>>>>> (namely, we use an llimd without division to make the reverse of
>>>>> llmulshft), and it pays off on slow machines, such as ARM, where the
>>>>> division is done in software.
>>>>> On this occasion, I looked at the code generated by this solution, and I am
>>>>> not sure that it is better: on ARM, and I suspect this is true on other
>>>>> architectures, the operations needed to negate a long long clobber the
>>>>> condition codes, which means we cannot make these operations
>>>>> conditional without a conditional jump, so the hand-coded assembler is
>>>>> not better than what the compiler does: it uses two conditional jumps
>>>>> whereas the original solution uses only one. Of course we could set sign
>>>>> to -1 or 1, and multiply by sign at the end, but the multiplication is
>>>>> probably even heavier than a conditional jump.
>>>> Yes, on the archs that matter here (32-bit).
>>>>> So, would you have any idea of a better solution?
>>>> In an assembly version, one could save 'sign' in form of a jump target
>>>> that should be taken after __rthal_generic_ullimd (ie. jump to the
>>>> negation, or jump over it). Specifically when that address is kept in a
>>>> register, I think smart branch prediction units will be able to do the
>>>> right forecast.
>>> Good idea, there is even a gcc extension which allows doing this in the
>>> generic section:
>>> static inline __attribute__((__const__)) long long
>>> __rthal_generic_llimd (long long op, unsigned m, unsigned d)
>>> {
>>>     void *epilogue;
>>>     long long ret;
>>>     if (op < 0LL) {
>>>             op = -op;
>>>             epilogue = &&ret_neg;
>>>     } else
>>>             epilogue = &&ret_unchanged;
>>>     ret = __rthal_generic_ullimd(op, m, d);
>>>     goto *epilogue;
>>> ret_unchanged:
>>>     return ret;
>>> ret_neg:
>>>     return -ret;
>>> }
>> This works as expected on ARM, however, gcc 4.0 on x86 generates two
>> calls to __rthal_generic_ullimd with the indirect jump after each one.
>> It seems it has stopped half-way when "optimizing"...
> Actually, gcc does the right thing if the implementation of
> __rthal_generic_ullimd is not trivial.

I think in that case un-inlining __rthal_generic_ullimd, keeping only
the two different call paths inlined, should be better anyway.
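
A rough sketch of what I mean, not tested against the real tree: the
names ullimd_outofline and llimd_sketch are made up for illustration,
the body of the unsigned helper is only my assumption of how a generic
op * m / d could look (it is not the actual __rthal_generic_ullimd),
and the result is assumed to fit in 64 bits. Only the sign handling and
the two call sites stay inline; the heavy part exists once.

```c
/* Hypothetical out-of-line helper: computes op * m / d through a
 * 96-bit intermediate product. Assumes d != 0 and that the quotient
 * fits in 64 bits. */
static __attribute__((noinline)) unsigned long long
ullimd_outofline(unsigned long long op, unsigned m, unsigned d)
{
	unsigned long long oph = op >> 32;
	unsigned long long opl = op & 0xffffffffULL;
	unsigned long long tl = opl * m;		/* low partial product */
	unsigned long long th = oph * m + (tl >> 32);	/* upper 64 bits of the product */
	unsigned long long qh, rh, ql;

	tl &= 0xffffffffULL;				/* lower 32 bits of the product */

	qh = th / d;					/* high part of the quotient */
	rh = th % d;					/* remainder, always < d */
	ql = ((rh << 32) | tl) / d;			/* low part, fits in 32 bits */

	return (qh << 32) + ql;
}

/* Inline wrapper: two call paths, one per sign, and no sign flag to
 * test after the call. The negation is done on the unsigned type so
 * that the most negative long long does not overflow. */
static inline long long
llimd_sketch(long long op, unsigned m, unsigned d)
{
	if (op < 0)
		return -(long long)ullimd_outofline(-(unsigned long long)op, m, d);
	return (long long)ullimd_outofline((unsigned long long)op, m, d);
}
```

With this, llimd_sketch(-10, 3, 4) yields -7 and llimd_sketch(10, 3, 4)
yields 7, i.e. the result is truncated towards zero in both cases.
Whether the extra call overhead beats the duplicated inline code would
still have to be measured per arch.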


Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux

Xenomai-core mailing list
