Vorbis author Timothy Terriberry was complaining about gcc inefficiency on ARM, so I asked him to write it up in case it would be of use to gcc devs. Is this useful?
His description follows:

You asked me at the summit to describe some of the problems gcc has on ARM. I'll start with the function oc_huff_token_decode from libtheora, which is the last function in http://svn.xiph.org/experimental/derf/theora-ptalarbvorm/lib/huffdec.c

This is the primary function for bitstream decoding, and may get called a few hundred thousand times per frame. Unlike most of the other time-critical functions in a video codec, it can't really be accelerated with SIMD asm, though given how badly gcc does, regular asm might be worth it.

This is the result of using arm-none-linux-gnueabi-gcc (Gentoo 4.5.0 p1.1) 4.5.0 with -O3 -fomit-frame-pointer -funroll-loops (the last of these makes a fairly big difference on x86, and so is part of the default flags for the library).

oc_huff_token_decode:
        @ Function supports interworking.
        @ args = 0, pretend = 0, frame = 8
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        stmfd   sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
        sub     sp, sp, #8

So, the next two lines already start off badly:

        str     r0, [sp, #4]
        ldr     r4, [sp, #4]

Spilling the register to the stack might be okay, but then loading it back from the stack when it's still in r0?

        ldr     ip, [r0, #4]
        ldr     r3, [r0, #8]
        ldr     r2, [r4, #12]
        ldr     r0, [r0, #0]

And then continuing to use _both_ the copy in r0 and the one in r4? What did we even make the copy for? r0 is now clobbered, and r4 gets clobbered before it is used again. Maybe I'm missing some subtlety here:

        mov     fp, r1

but it seems to me this mov could easily be eliminated by swapping the roles of fp and r1 everywhere below (that may not be the best solution, but it's certainly _a_ solution). This is one of the larger problems with gcc on ARM: it emits way too many moves for something that's supposed to be compiling for a three-address ISA.

        mov     r7, #0
.L418:
        mov     r1, r7, asl #1
        ldrsh   r5, [fp, r1]
        cmp     r2, r5

The "fast path" here takes this branch instead of falling through, but that's probably my fault for not using PGO or __builtin_expect():

        bge     .L414
        cmp     ip, r0
        bcs     .L419
        rsb     r1, r2, #24
        ldrb    r6, [ip], #1    @ zero_extendqisi2
        subs    r4, r1, #8
        add     r2, r2, #8
        orr     r3, r3, r6, asl r1
        bmi     .L414

So here we're checking whether ptr - stop has 4-byte alignment so that we can do four iterations at once without testing the if-block inside the loop. This would all be well and good, except that the loop can only iterate three times: all of the unrolling done to get the alignment right would already have been enough to execute the entire loop.

gcc continues to generate code like this even when n and available are declared as unsigned (so that overflow would be required to get "shift" larger than 24, instead of just loading a negative value for available) and when -funsafe-loop-optimizations is used (though I don't think that's what that flag actually controls). I'd be happy to hear other suggestions for rewriting this so that gcc can recognize the iteration limit, as I doubt that's just an ARM problem.
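For reference, the loop in question looks roughly like this (a sketch pieced together from the names above and the dump itself, not the exact huffdec.c source; window is assumed to be a 32-bit unsigned, and ptr/stop to be byte pointers):

    /*Refill sketch: with a 32-bit window, shift starts at no more than 24,
       so the body can execute at most four times, and only three once gcc
       peels the first iteration off as above.  EOF handling is omitted.*/
    shift=24-available;
    do{
      if(ptr>=stop)break;
      window|=(unsigned)*ptr++<<shift;
      available+=8;
      shift-=8;
    }
    while(shift>=0);

One hypothetical rewrite (untested) that hands the compiler an explicit trip count would be to drive the loop with a byte counter instead of waiting for shift to go negative:

    /*Hypothetical: nbytes is provably at most 4, which the unroller ought
       to be able to see.*/
    int nbytes;
    nbytes=(shift+8)>>3;
    while(nbytes-->0&&ptr<stop){
      window|=(unsigned)*ptr++<<shift;
      available+=8;
      shift-=8;
    }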
        rsb     r6, ip, r0
        ands    r6, r6, #3
        mov     r1, ip
        beq     .L434
        cmp     r0, ip
        bls     .L419

So here we... copy ip into r1, then immediately clobber the original ip, then put r1 _back_ into ip. Two more wasted instructions:

        mov     r1, ip
        ldrb    ip, [r1], #1    @ zero_extendqisi2
        add     r2, r2, #8
        orr     r3, r3, ip, asl r4
        subs    r4, r4, #8
        mov     ip, r1
        bmi     .L414
        cmp     r6, #1
        beq     .L434
        cmp     r6, #2
        beq     .L429
        ldrb    ip, [r1], #1    @ zero_extendqisi2
        add     r2, r2, #8
        orr     r3, r3, ip, asl r4
        subs    r4, r4, #8
        mov     ip, r1
        bmi     .L414
.L429:
        ldrb    ip, [r1], #1    @ zero_extendqisi2
        add     r2, r2, #8
        orr     r3, r3, ip, asl r4
        subs    r4, r4, #8
        mov     ip, r1
        bmi     .L414

Here's another minor disaster of excess movs: r7 is copied into r9 so that r7 can be used inside the loop... except the loop is self-contained and could just as well have used r9 directly.

.L434:
        mov     r9, r7
.L415:

Also, since we just branched here so we could compare ptr to stop only once every four (of three total) iterations, the first thing we do is... compare ptr against stop. So in the best case this "optimization" actually does three comparisons instead of two if the loop iterates twice (including the alignment test), and it only breaks even if we iterate three times (the maximum) and happen to have the correct alignment (a 1 in 4 chance).

        cmp     r0, r1

What is actually put in r7? Why, one of _two_ copies of r1. r1 really contains the value of ip, which we uselessly copied into it above, and using it here means we needed to add _another_ useless copy of ip into r1 on the other path to this label:

        mov     r7, r1
        add     r2, r2, #8

This is the one that really takes the cake: just in case we branch here, we copy r1 back into ip (note that ip already contains r1), and then if we don't branch we immediately clobber it. It does this every time:

        mov     ip, r1
        bls     .L437
        ldrb    ip, [r7], #1    @ zero_extendqisi2
        subs    r6, r4, #8
        orr     r3, r3, ip, asl r4
        mov     r8, r2
        mov     ip, r7

This one is also pretty fantastic. This time we're going to do the load using r1 instead of r7 like we did above (why do we have two copies? I still don't know! actually three, if you count ip). That means we need to increment r7 manually instead of doing it as part of the load. And just for giggles, we do that _before_ the branch here, which goes to an instruction that immediately clobbers r7:

        add     r7, r7, #1
        bmi     .L436
        ldrb    ip, [r1, #1]    @ zero_extendqisi2
        subs    sl, r6, #8
        orr     r3, r3, ip, asl r6
        add     r2, r2, #8
        mov     ip, r7
        bmi     .L436
        ldrb    r6, [r7, #0]    @ zero_extendqisi2
        subs    r7, r4, #24
        add     ip, r1, #3
        add     r2, r8, #16
        orr     r3, r3, r6, asl sl
        bmi     .L436
        ldrb    ip, [r1, #3]    @ zero_extendqisi2
        subs    r4, r4, #32
        add     r1, r1, #4
        orr     r3, r3, ip, asl r7
        add     r2, r8, #24

Yeah, better copy r1 into ip again so we can... copy r1 into ip again after the jump. Fortunately the iteration limit means we can never actually get here:

        mov     ip, r1
        bpl     .L415
.L436:
        mov     r7, r9
.L414:
        rsb     r1, r5, #32
        add     r1, r7, r3, lsr r1
        add     r7, fp, r1, asl #1
        ldrsh   r7, [r7, #2]
        cmp     r7, #0

Again, the fast path takes this branch:

        ble     .L417
.L438:
        mov     r3, r3, asl r5
        rsb     r2, r5, r2
        b       .L418
.L437:
        mov     r7, r9
.L419:
        rsb     r1, r5, #32
        add     r1, r7, r3, lsr r1
        add     r7, fp, r1, asl #1
        ldrsh   r7, [r7, #2]
        mov     r2, #1073741824
        cmp     r7, #0
        bgt     .L438
.L417:
        rsb     r7, r7, #0
        mov     r0, r7, asr #8
        mov     r3, r3, asl r0

Yeah! We can finally load back that value we stashed on the stack! Maybe if we hadn't made three or four copies of ptr, we might have been able to keep it in a register:

        ldr     r1, [sp, #4]
        rsb     r2, r0, r2
        str     ip, [r1, #4]
        and     r0, r7, #255
        str     r3, [r1, #8]
        str     r2, [r1, #12]
        add     sp, sp, #8
        ldmfd   sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
        bx      lr

Things are a little better without -funroll-loops, mostly because the function is much shorter, but there are still obvious problems, e.g., in the epilogue:

        and     r8, r8, #255
        stmib   r0, {r3, ip}    @ phole stm
        str     r2, [r0, #12]
        mov     r0, r8

If the and were reordered, it could take the place of the mov (the -funroll-loops version gets this right). Or the prologue:

        ldmib   r0, {r3, ip}    @ phole ldm
        ldr     r4, [r0, #0]
        ldr     r2, [r0, #12]

It isn't clear to me why it picked registers that were out of order, when it could have loaded all four with one ldm. But hey, at least that's better than the -funroll-loops version, which did four individual loads! Fixing this would also likely have allowed combining the stmib/str in the epilogue.
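For what it's worth, the four words being loaded and stored here are the whole decoder state, something like the following (field names are guesses; the order follows the offsets in the dumps above):

    /*Rough layout inferred from the offsets in the dump; the real struct
       in huffdec.c may use different names.*/
    typedef struct{
      const unsigned char *stop;      /*[r0, #0]:  end of input*/
      const unsigned char *ptr;       /*[r0, #4]:  read pointer*/
      unsigned             window;    /*[r0, #8]:  32-bit bit buffer*/
      int                  available; /*[r0, #12]: valid bits in window*/
    }state_sketch;

Since ldm always transfers the lowest-numbered register to the lowest address, loading all four words with a single ldmia only requires choosing destination registers in ascending register-number order (e.g., r2, r3, r4, ip), which the allocator was apparently free to do here.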
--
Summary: Suboptimal code generation on arm
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: tglek at mozilla dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45622