https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88255
Bug ID: 88255 Summary: Thumb-1: GCC too aggressive on mul->lsl/sub/add optimization Product: gcc Version: 8.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: husseydevin at gmail dot com Target Milestone: --- I might be wrong, but it appears that GCC is too aggressive in its conversion from multiplication to shift+add when targeting Thumb-1 It is true that, for example, the Cortex-M0 can have the small multiplier and a 16 cycle shift sequence would be faster. However, I was targeting arm7tdmi (-march=armv4t -mthumb -O3 -mtune=arm7tdmi) which, if I am not mistaken, uses one cycle for every 8 bits. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0234b/i102180.html However, looking in the source code, I notice that the loop is dividing by 4. I think it might be a bug that is causing the otherwise 7 (I think) cycle sequence in the code below to be considered as having a weight of 18 cycles. https://github.com/gcc-mirror/gcc/blob/master/gcc/config/arm/arm.c#L8959 I could be wrong, but one of the things I noticed is that very old versions of GCC (2.95) will not perform this many shifts, and that Clang, when given the transpiled output in C and targeted for the same platform, will actually convert it back into a ldr/mul. However, when targeting cortex-m0plus.small-multiply, it will still turn it into multiplication. Code example: unsigned MultiplyByPrime(unsigned val) { return val * 2246822519U; } MultiplyByPrime: lsls r3, r0, #7 @ unsigned ret = val << 7; subs r3, r3, r0 @ ret -= val; lsls r3, r3, #5 @ ret <<= 5; subs r3, r3, r0 @ ret -= val; lsls r3, r3, #2 @ ret <<= 2; adds r3, r3, r0 @ ret += val; lsls r2, r3, #3 @ unsigned tmp = ret << 3; adds r3, r3, r2 @ ret += tmp; lsls r3, r3, #1 @ ret <<= 1; adds r3, r3, r0 @ ret += val; lsls r3, r3, #6 @ ret <<= 6; adds r3, r3, r0 @ ret += val; lsls r2, r3, #4 @ tmp = ret << 4; subs r3, r2, r3 @ ret = tmp - ret; lsls r3, r3, #3 @ ret <<= 3; subs r0, r3, r0 @ ret -= val; bx lr @ return ret;