Vorbis author Timothy Terriberry was complaining about gcc inefficiency on ARM, so I asked him to write it up in case it would be of use to gcc devs. Is this useful?
His description follows:

You asked me at the summit to describe some of the problems gcc has on ARM. I'll start with the function oc_huff_token_decode from libtheora, which is the last function in http://svn.xiph.org/experimental/derf/theora-ptalarbvorm/lib/huffdec.c

This is the primary function for bitstream decoding, and may get called a few hundred thousand times per frame. Unlike most of the other time-critical functions in a video codec, it can't really be accelerated with SIMD asm, though given how badly gcc does, regular asm might be worth it.

This is the result of using arm-none-linux-gnueabi-gcc (Gentoo 4.5.0 p1.1) 4.5.0 with -O3 -fomit-frame-pointer -funroll-loops (the last of these makes a fairly big difference on x86, and so is part of the default flags for the library).

oc_huff_token_decode:
        @ Function supports interworking.
        @ args = 0, pretend = 0, frame = 8
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        stmfd   sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
        sub     sp, sp, #8

So, the next two lines already start off badly:

        str     r0, [sp, #4]
        ldr     r4, [sp, #4]

Spilling the register to the stack might be okay, but then loading it back from the stack when it's still in r0?

        ldr     ip, [r0, #4]
        ldr     r3, [r0, #8]
        ldr     r2, [r4, #12]
        ldr     r0, [r0, #0]

And then continuing to use _both_ the copy in r0 and the one in r4? What did we even make the copy for? r0 is now clobbered, and r4 gets clobbered before it is used again. Maybe I'm missing some subtlety here:

        mov     fp, r1

but it seems to me this mov could easily be eliminated by swapping the roles of fp and r1 everywhere below (that may not be the best solution, but it's certainly _a_ solution). This is one of the larger problems with gcc on ARM: it emits way too many moves for something that's supposed to be compiling for a three-address ISA.

        mov     r7, #0
.L418:
        mov     r1, r7, asl #1
        ldrsh   r5, [fp, r1]
        cmp     r2, r5

The "fast path" here takes this branch instead of falling through, but that's probably my fault for not using PGO or __builtin_expect():

        bge     .L414
        cmp     ip, r0
        bcs     .L419
        rsb     r1, r2, #24
        ldrb    r6, [ip], #1    @ zero_extendqisi2
        subs    r4, r1, #8
        add     r2, r2, #8
        orr     r3, r3, r6, asl r1
        bmi     .L414

So here we're checking whether ptr - stop has 4-byte alignment so that we can do four iterations at once without testing the if-block inside the loop. This would all be well and good, except that the loop can only iterate three times: all of the unrolling done to get the alignment right would already have been enough to execute the entire loop.

gcc continues to generate code like this even when n and available are declared as unsigned (so that overflow would be required to get "shift" larger than 24, instead of just loading a negative value for available) and when -funsafe-loop-optimizations is used (though I don't think that's what that flag actually controls). I'd be happy to hear other suggestions for rewriting this so that gcc can recognize the iteration limit, as I doubt that's just an ARM problem.
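For reference, the loop in question looks roughly like this (a sketch pieced together from the names above and the dump itself, not the exact huffdec.c source; window is assumed to be a 32-bit unsigned, and ptr/stop to be byte pointers):

    /*Refill sketch: with a 32-bit window, shift starts at no more than 24,
       so the body can execute at most four times, and only three once gcc
       peels the first iteration off as above.  EOF handling is omitted.*/
    shift=24-available;
    do{
      if(ptr>=stop)break;
      window|=(unsigned)*ptr++<<shift;
      available+=8;
      shift-=8;
    }
    while(shift>=0);

One hypothetical rewrite (untested) that hands the compiler an explicit trip count would be to drive the loop with a byte counter instead of waiting for shift to go negative:

    /*Hypothetical: nbytes is provably at most 4, which the unroller ought
       to be able to see.*/
    int nbytes;
    nbytes=(shift+8)>>3;
    while(nbytes-->0&&ptr<stop){
      window|=(unsigned)*ptr++<<shift;
      available+=8;
      shift-=8;
    }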
        rsb     r6, ip, r0
        ands    r6, r6, #3
        mov     r1, ip
        beq     .L434
        cmp     r0, ip
        bls     .L419

So here we... copy ip into r1, then immediately clobber the original ip, then put r1 _back_ into ip. Two more wasted instructions:

        mov     r1, ip
        ldrb    ip, [r1], #1    @ zero_extendqisi2
        add     r2, r2, #8
        orr     r3, r3, ip, asl r4
        subs    r4, r4, #8
        mov     ip, r1
        bmi     .L414
        cmp     r6, #1
        beq     .L434
        cmp     r6, #2
        beq     .L429
        ldrb    ip, [r1], #1    @ zero_extendqisi2
        add     r2, r2, #8
        orr     r3, r3, ip, asl r4
        subs    r4, r4, #8
        mov     ip, r1
        bmi     .L414
.L429:
        ldrb    ip, [r1], #1    @ zero_extendqisi2
        add     r2, r2, #8
        orr     r3, r3, ip, asl r4
        subs    r4, r4, #8
        mov     ip, r1
        bmi     .L414

Here's another minor disaster of excess movs: r7 is copied into r9 so that r7 can be used inside the loop... except the loop is self-contained and could just as well have used r9 directly.

.L434:
        mov     r9, r7
.L415:

Also, since we just branched here so we could compare ptr to stop only once every four (of three total) iterations, the first thing we do is... compare ptr against stop. So in the best case this "optimization" actually does three comparisons instead of two if the loop iterates twice (including the alignment test), and it only breaks even if we iterate three times (the maximum) and happen to have the correct alignment (a 1 in 4 chance).

        cmp     r0, r1

What is actually put in r7? Why, one of _two_ copies of r1. r1 really contains the value of ip, which we uselessly copied into it above, and using it here means we needed to add _another_ useless copy of ip into r1 on the other path to this label:

        mov     r7, r1
        add     r2, r2, #8

This is the one that really takes the cake: just in case we branch here, we copy r1 back into ip (note that ip already contains r1), and then if we don't branch we immediately clobber it. It does this every time:

        mov     ip, r1
        bls     .L437
        ldrb    ip, [r7], #1    @ zero_extendqisi2
        subs    r6, r4, #8
        orr     r3, r3, ip, asl r4
        mov     r8, r2
        mov     ip, r7

This one is also pretty fantastic. This time we're going to do the load using r1 instead of r7 like we did above (why do we have two copies? I still don't know! actually three, if you count ip). That means we need to increment r7 manually instead of doing it as part of the load. And just for giggles, we do that _before_ the branch here, which goes to an instruction that immediately clobbers r7:

        add     r7, r7, #1
        bmi     .L436
        ldrb    ip, [r1, #1]    @ zero_extendqisi2
        subs    sl, r6, #8
        orr     r3, r3, ip, asl r6
        add     r2, r2, #8
        mov     ip, r7
        bmi     .L436
        ldrb    r6, [r7, #0]    @ zero_extendqisi2
        subs    r7, r4, #24
        add     ip, r1, #3
        add     r2, r8, #16
        orr     r3, r3, r6, asl sl
        bmi     .L436
        ldrb    ip, [r1, #3]    @ zero_extendqisi2
        subs    r4, r4, #32
        add     r1, r1, #4
        orr     r3, r3, ip, asl r7
        add     r2, r8, #24

Yeah, better copy r1 into ip again so we can... copy r1 into ip again after the jump. Fortunately the iteration limit means we can never actually get here:

        mov     ip, r1
        bpl     .L415
.L436:
        mov     r7, r9
.L414:
        rsb     r1, r5, #32
        add     r1, r7, r3, lsr r1
        add     r7, fp, r1, asl #1
        ldrsh   r7, [r7, #2]
        cmp     r7, #0

Again, the fast path takes this branch:

        ble     .L417
.L438:
        mov     r3, r3, asl r5
        rsb     r2, r5, r2
        b       .L418
.L437:
        mov     r7, r9
.L419:
        rsb     r1, r5, #32
        add     r1, r7, r3, lsr r1
        add     r7, fp, r1, asl #1
        ldrsh   r7, [r7, #2]
        mov     r2, #1073741824
        cmp     r7, #0
        bgt     .L438
.L417:
        rsb     r7, r7, #0
        mov     r0, r7, asr #8
        mov     r3, r3, asl r0

Yeah! We can finally load back that value we stashed on the stack! Maybe if we hadn't made three or four copies of ptr, we might have been able to keep it in a register:

        ldr     r1, [sp, #4]
        rsb     r2, r0, r2
        str     ip, [r1, #4]
        and     r0, r7, #255
        str     r3, [r1, #8]
        str     r2, [r1, #12]
        add     sp, sp, #8
        ldmfd   sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
        bx      lr

Things are a little better without -funroll-loops, mostly because the function is much shorter, but there are still obvious problems, e.g., in the epilogue:

        and     r8, r8, #255
        stmib   r0, {r3, ip}    @ phole stm
        str     r2, [r0, #12]
        mov     r0, r8

If the and were reordered, it could take the place of the mov (the -funroll-loops version gets this right). Or the prologue:

        ldmib   r0, {r3, ip}    @ phole ldm
        ldr     r4, [r0, #0]
        ldr     r2, [r0, #12]

It isn't clear to me why it picked registers that were out of order, when it could have loaded all four with one ldm. But hey, at least that's better than the -funroll-loops version, which did four individual loads! Fixing this would also likely have allowed combining the stmib/str in the epilogue.
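For what it's worth, the four words being loaded and stored here are the whole decoder state, something like the following (field names are guesses; the order follows the offsets in the dumps above):

    /*Rough layout inferred from the offsets in the dump; the real struct
       in huffdec.c may use different names.*/
    typedef struct{
      const unsigned char *stop;      /*[r0, #0]:  end of input*/
      const unsigned char *ptr;       /*[r0, #4]:  read pointer*/
      unsigned             window;    /*[r0, #8]:  32-bit bit buffer*/
      int                  available; /*[r0, #12]: valid bits in window*/
    }state_sketch;

Since ldm always transfers the lowest-numbered register to the lowest address, loading all four words with a single ldmia only requires choosing destination registers in ascending register-number order (e.g., r2, r3, r4, ip), which the allocator was apparently free to do here.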
--
Summary: Suboptimal code generation on arm
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: tglek at mozilla dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45622