https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88255

            Bug ID: 88255
           Summary: Thumb-1: GCC too aggressive on mul->lsl/sub/add
                    optimization
           Product: gcc
           Version: 8.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: husseydevin at gmail dot com
  Target Milestone: ---

I might be wrong, but it appears that GCC is too aggressive in its conversion
from multiplication to shift+add sequences when targeting Thumb-1.

It is true that, for example, the Cortex-M0 can be built with the small
(32-cycle) multiplier, in which case a 16-cycle shift sequence would be faster.
However, I was targeting the ARM7TDMI (-march=armv4t -mthumb -O3
-mtune=arm7tdmi), which, if I am not mistaken, terminates multiplies early and
takes one cycle for every 8 significant bits of the operand:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0234b/i102180.html
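
As a sanity check, here is that early-termination rule as I understand it, as a
small C sketch (my own illustration based on the datasheet, not code from GCC
or ARM):

  /* ARM7TDMI MUL, as I read the datasheet: 1 + m cycles, where m is the
     number of 8-bit chunks needed to hold the second operand, so at most
     1 + 4 = 5 cycles even for a full 32-bit value.  */
  static int arm7tdmi_mul_cycles(unsigned rs)
  {
      int m = 1;
      while (rs >>= 8)  /* one extra internal cycle per additional byte */
          m++;
      return 1 + m;
  }

For the constant below, that gives 5 cycles for the mul itself, or roughly 7
once the ldr from the literal pool is included.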

However, looking at the source code, I notice that the cost loop divides by 4
on each iteration, so a full 32-bit constant ends up with a weight of 16 + 2 =
18 cycles. I think this is a bug: the multiplication in the code below, which I
believe would take about 7 cycles on the ARM7TDMI (ldr plus early-terminated
mul), is costed at 18, so GCC replaces it with the long shift/add sequence
instead.

https://github.com/gcc-mirror/gcc/blob/master/gcc/config/arm/arm.c#L8959
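
If I am reading it correctly, the cost works out roughly like this (a
paraphrase of what I see in arm.c, not the literal GCC source):

  /* Thumb-1 cost of multiplying by a constant, as I read arm.c: shift
     the constant right by 2 bits per "cycle", plus a base cost of 2.
     A full 32-bit constant needs 16 iterations, giving 16 + 2 = 18.  */
  static int thumb1_mult_const_cost(unsigned c)
  {
      int cycles = 0;
      while (c)
      {
          c >>= 2;
          cycles++;
      }
      return 2 + cycles;
  }

Shifting by 8 per iteration would seem to match the one-cycle-per-byte
behaviour of the ARM7TDMI much more closely.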

I could be wrong, but one of the things I noticed is that very old versions of
GCC (2.95) do not emit this many shifts, and that Clang, when given the
shift/add output transpiled back into C and targeted at the same platform,
actually converts it back into an ldr/mul.
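
For reference, this is the transpiled C I mean (my own hand transliteration of
the shift/add output shown below; it computes the same val * 2246822519U):

  unsigned MultiplyByPrimeTranspiled(unsigned val)
  {
      unsigned ret = val << 7;
      ret -= val;
      ret <<= 5;
      ret -= val;
      ret <<= 2;
      ret += val;
      unsigned tmp = ret << 3;
      ret += tmp;
      ret <<= 1;
      ret += val;
      ret <<= 6;
      ret += val;
      tmp = ret << 4;
      ret = tmp - ret;
      ret <<= 3;
      ret -= val;
      return ret;
  }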

However, even when targeting cortex-m0plus.small-multiply, where the shift
sequence would be the right choice, Clang still turns it back into a
multiplication.

Code example: 

  unsigned MultiplyByPrime(unsigned val)
  {
      return val * 2246822519U;
  }

  MultiplyByPrime:
     lsls    r3, r0, #7 @ unsigned ret = val << 7;
     subs    r3, r3, r0 @ ret -= val;
     lsls    r3, r3, #5 @ ret <<= 5;
     subs    r3, r3, r0 @ ret -= val;
     lsls    r3, r3, #2 @ ret <<= 2;
     adds    r3, r3, r0 @ ret += val;
     lsls    r2, r3, #3 @ unsigned tmp = ret << 3;
     adds    r3, r3, r2 @ ret += tmp;
     lsls    r3, r3, #1 @ ret <<= 1;
     adds    r3, r3, r0 @ ret += val;
     lsls    r3, r3, #6 @ ret <<= 6;
     adds    r3, r3, r0 @ ret += val;
     lsls    r2, r3, #4 @ tmp = ret << 4;
     subs    r3, r2, r3 @ ret = tmp - ret;
     lsls    r3, r3, #3 @ ret <<= 3;
     subs    r0, r3, r0 @ ret -= val;
     bx      lr         @ return ret;
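
For comparison, the sequence I would have expected (and roughly what Clang
emits here) is just a literal-pool load and a multiply; this is my sketch, not
actual compiler output:

  MultiplyByPrime:
     ldr     r3, .Lprime @ load 2246822519 from the literal pool
     muls    r0, r3      @ return val * 2246822519U;
     bx      lr
     .align  2
  .Lprime:
     .word   2246822519

By my count that is about 7 cycles on the ARM7TDMI, against 16 or so for the
shift/add version above.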
