https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62062

            Bug ID: 62062
           Summary: Missed optimization: write ptr reloaded in each loop
                    iteration
           Product: gcc
           Version: 4.9.2
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: petschy at gmail dot com

When developing a binary I/O library, I ran into some performance degradation in
the writer functions. My investigation revealed that the write pointer was
loaded and stored again in each loop iteration. This can be dodged by
hand-tuning the code with local variables kept in registers, but the resulting
code is longer, less clear, and harder to maintain.

For this report I recompiled with 4.9.2, but earlier 4.x versions give the
same results. The test box is an AMD FX running Debian Jessie.

gcc compiled from git commit f964b16:
g++-4.9.2 -v
Using built-in specs.
COLLECT_GCC=g++-4.9.2
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-unknown-linux-gnu/4.9.2/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ./configure --enable-languages=c,c++ --disable-multilib
--program-suffix=-4.9.2
Thread model: posix
gcc version 4.9.2 20140808 (prerelease) (GCC)

compiler/linker flags used:
g++-4.9.2 -c 20140725-reg_vs_mem.cpp -g -std=c++11 -Wall -Wextra -Werror
-Wundef -Wshadow -O3 -fno-tree-vectorize
g++-4.9.2 -o 20140725-reg_vs_mem 20140725-reg_vs_mem.o -g

I tried lower optimization levels, too, but it made no difference.

-fno-tree-vectorize was used because otherwise the code generated for the
encoder functions used the vector registers, which resulted in serious code
size bloat and ~2-3x runtimes, as opposed to the ~1.5x runtime increase caused
by the redundant loads/stores.

For the tests, I made two versions of each function: one that expects a ptr
ref (the baseline version), and another that expects a buffer object (with
beg/end ptrs for more advanced functionality, not used here). As expected, the
same code is generated for the two.
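For reference, the two flavours look roughly like this (simplified):

struct OutBuf
{
    unsigned* beg;   // current write position
    unsigned* end;   // end of the buffer (not used in these tests)
};

void write_run_ptr_ref(unsigned*& p, int value, int n);  // baseline: bare ptr ref
void write_run_buf(OutBuf& buf, int value, int n);       // same operation via OutBuf::beg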

I ruled out the aliasing rules first: I use int ptrs in this test, so that the
'char* aliases everything' rule is dodged. To prove my case:

write_run_char_ptr_ref() loads and stores the char ptr in each iteration,
since, as far as the compiler knows, the character just written could overwrite
the pointer itself.
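In source form, the writer is essentially just this (simplified):

void write_run_char_ptr_ref(char*& p, int value, int n)
{
    // the char store below may, as far as the compiler knows, alias the
    // char*& object itself, hence the reload/store of p in each iteration
    for (int i = 0; i < n; ++i)
        *p++ = (char)value;
}

The generated code: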
   0x0000000000400570 <+0>:     sub    $0x1,%edx
   0x0000000000400573 <+3>:     js     0x40058d <write_run_char_ptr_ref(char*&,
int, int)+29>
   0x0000000000400575 <+5>:     nopl   (%rax)
   0x0000000000400578 <+8>:     mov    (%rdi),%rax      # loop beg, load ptr
   0x000000000040057b <+11>:    sub    $0x1,%edx
   0x000000000040057e <+14>:    cmp    $0xffffffff,%edx
   0x0000000000400581 <+17>:    lea    0x1(%rax),%rcx
   0x0000000000400585 <+21>:    mov    %rcx,(%rdi)      # store updated ptr
   0x0000000000400588 <+24>:    mov    %sil,(%rax)
   0x000000000040058b <+27>:    jne    0x400578 <write_run_char_ptr_ref(char*&,
int, int)+8>
   0x000000000040058d <+29>:    repz retq 

write_run_char_ptr_ref_unaliased() keeps the ptr in a register, but this is
achieved with some platform-dependent asm trickery, see unaliased_storeb()
[btw, a platform-independent builtin for this would be nice]. I use this in the
production code, which writes bytes at the lowest level.
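For illustration, such a helper can be written roughly along these lines
(x86-only, and admittedly a hack):

static inline void unaliased_storeb(char* dst, char value)
{
    // The byte store happens inside the asm and is deliberately not declared
    // as a memory output, so the compiler never sees a char store that could
    // alias the char*& object behind the reference. Fragile, but effective.
    asm volatile("movb %1, (%0)" : : "r" (dst), "q" (value));
}

The generated code for the run writer built on top of it: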
   0x0000000000400590 <+0>:     test   %edx,%edx
   0x0000000000400592 <+2>:     jle    0x4005af
<write_run_char_ptr_ref_unaliased(char*&, int, int)+31>
   0x0000000000400594 <+4>:     mov    (%rdi),%rax           # load ptr
   0x0000000000400597 <+7>:     sub    $0x1,%edx
   0x000000000040059a <+10>:    lea    0x1(%rax,%rdx,1),%rdx # end ptr
   0x000000000040059f <+15>:    nop
   0x00000000004005a0 <+16>:    mov    %sil,(%rax)           # loop body, no
ptr load/store
   0x00000000004005a3 <+19>:    add    $0x1,%rax
   0x00000000004005a7 <+23>:    cmp    %rdx,%rax
   0x00000000004005aa <+26>:    jne    0x4005a0
<write_run_char_ptr_ref_unaliased(char*&, int, int)+16>
   0x00000000004005ac <+28>:    mov    %rax,(%rdi)           # store ptr after
the loop
   0x00000000004005af <+31>:    repz retq 

write_run_ptr_ref() uses an int ptr, and the ptr is kept in a register for the
loop, without any trickery. The disassembly is the same as
write_run_char_ptr_ref_unaliased() above, except that ints are written.
write_run_buf() is exactly the same. So far so good.
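For reference, the int flavour in source form (simplified):

void write_run_ptr_ref(unsigned*& p, int value, int n)
{
    // unsigned stores provably cannot alias the unsigned*& object behind the
    // reference, so p can stay in a register for the whole loop
    for (int i = 0; i < n; ++i)
        *p++ = value;
}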

The next step is a variable-length encoder, encode_ptr_ref(). This is not a
real working encoding, just a demonstration. Since the probabilities are
implied by the encoding lengths, the if conditions are peppered with
builtin_expects. Again, the ptr ref and buf object versions are exactly the
same. However, I noticed that the ptr is written back multiple times, depending
on which 'if' becomes true. For the most likely case it doesn't matter: only
one write is performed, though it happens before the conditional jump. For the
last two less likely cases, two redundant writes are performed. Moreover, for
the 3rd 'if' block the actually useful ptr write is performed inside the block,
and not before the conditional jump as with the 1st and 2nd ifs. Since the last
two ifs are not likely, the performance impact of the redundant writes probably
doesn't matter much, but since this ticket is about redundant loads/stores, it
might have something to do with the core issue.
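In source form, the encoder is essentially this (simplified; the thresholds and
the zero padding words are only for the demo):

void encode_ptr_ref(unsigned*& p, int v)
{
    if (__builtin_expect(v <= 0xff, 1))     { *p++ = v; return; }
    *p++ = 0;
    if (__builtin_expect(v <= 0xffff, 1))   { *p++ = v; return; }
    *p++ = 0;
    if (__builtin_expect(v <= 0xffffff, 1)) { *p++ = v; return; }
    *p++ = 0;
    *p++ = v;
}

The generated code (encode_buf(), which is identical):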
   0x0000000000400720 <+0>:     mov    (%rdi),%rax      # ptr load
   0x0000000000400723 <+3>:     cmp    $0xff,%esi
   0x0000000000400729 <+9>:     lea    0x4(%rax),%rdx
   0x000000000040072d <+13>:    mov    %rdx,(%rdi)      # store before the cond
   0x0000000000400730 <+16>:    jg     0x400738 <encode_buf(OutBuf&, int)+24>
   0x0000000000400732 <+18>:    mov    %esi,(%rax)      # most likely branch
   0x0000000000400734 <+20>:    retq   
   0x0000000000400735 <+21>:    nopl   (%rax)
   0x0000000000400738 <+24>:    lea    0x8(%rax),%rdx
   0x000000000040073c <+28>:    cmp    $0xffff,%esi
   0x0000000000400742 <+34>:    movl   $0x0,(%rax)
   0x0000000000400748 <+40>:    mov    %rdx,(%rdi)      # store again
   0x000000000040074b <+43>:    jg     0x400751 <encode_buf(OutBuf&, int)+49>
   0x000000000040074d <+45>:    mov    %esi,0x4(%rax)   # 2nd most likely br
   0x0000000000400750 <+48>:    retq   
   0x0000000000400751 <+49>:    cmp    $0xffffff,%esi
   0x0000000000400757 <+55>:    movl   $0x0,0x4(%rax)
   0x000000000040075e <+62>:    jg     0x40076b <encode_buf(OutBuf&, int)+75>
   0x0000000000400760 <+64>:    lea    0xc(%rax),%rdx   # 3rd most likely br
   0x0000000000400764 <+68>:    mov    %rdx,(%rdi)      # ptr stored after the
cond
   0x0000000000400767 <+71>:    mov    %esi,0x8(%rax)
   0x000000000040076a <+74>:    retq   
   0x000000000040076b <+75>:    lea    0x10(%rax),%rdx  # least likely br
   0x000000000040076f <+79>:    movl   $0x0,0x8(%rax)
   0x0000000000400776 <+86>:    mov    %rdx,(%rdi)      # store
   0x0000000000400779 <+89>:    mov    %esi,0xc(%rax)
   0x000000000040077c <+92>:    retq   

encode_ptr_ref_tmp() is the same as encode_ptr_ref(), but the ptr is loaded
into a local variable by hand and written back to the reference in each branch
before returning. The generated code is effectively the same, same size; only
the insn order differs: there are no redundant writes, and all ptr writes are
inside the corresponding if block.
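In source form (simplified):

void encode_ptr_ref_tmp(unsigned*& pref, int v)
{
    unsigned* p = pref;                     // load the ptr into a local by hand
    if (__builtin_expect(v <= 0xff, 1))     { *p++ = v; pref = p; return; }
    *p++ = 0;
    if (__builtin_expect(v <= 0xffff, 1))   { *p++ = v; pref = p; return; }
    *p++ = 0;
    if (__builtin_expect(v <= 0xffffff, 1)) { *p++ = v; pref = p; return; }
    *p++ = 0;
    *p++ = v;
    pref = p;                               // write back before each return
}

The generated code: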
   0x00000000004006c0 <+0>:     cmp    $0xff,%esi
   0x00000000004006c6 <+6>:     mov    (%rdi),%rax      # ptr load
   0x00000000004006c9 <+9>:     jg     0x4006d8 <encode_ptr_ref_tmp(unsigned
int*&, int)+24>
   0x00000000004006cb <+11>:    mov    %esi,(%rax)
   0x00000000004006cd <+13>:    add    $0x4,%rax
   0x00000000004006d1 <+17>:    mov    %rax,(%rdi)      # ptr store
   0x00000000004006d4 <+20>:    retq   
   0x00000000004006d5 <+21>:    nopl   (%rax)
   0x00000000004006d8 <+24>:    cmp    $0xffff,%esi
   0x00000000004006de <+30>:    movl   $0x0,(%rax)
   0x00000000004006e4 <+36>:    jg     0x4006f1 <encode_ptr_ref_tmp(unsigned
int*&, int)+49>
   0x00000000004006e6 <+38>:    mov    %esi,0x4(%rax)
   0x00000000004006e9 <+41>:    add    $0x8,%rax
   0x00000000004006ed <+45>:    mov    %rax,(%rdi)      # ptr store
   0x00000000004006f0 <+48>:    retq   
   0x00000000004006f1 <+49>:    cmp    $0xffffff,%esi
   0x00000000004006f7 <+55>:    movl   $0x0,0x4(%rax)
   0x00000000004006fe <+62>:    jg     0x40070b <encode_ptr_ref_tmp(unsigned
int*&, int)+75>
   0x0000000000400700 <+64>:    mov    %esi,0x8(%rax)
   0x0000000000400703 <+67>:    add    $0xc,%rax
   0x0000000000400707 <+71>:    mov    %rax,(%rdi)      # ptr store
   0x000000000040070a <+74>:    retq   
   0x000000000040070b <+75>:    movl   $0x0,0x8(%rax)
   0x0000000000400712 <+82>:    mov    %esi,0xc(%rax)
   0x0000000000400715 <+85>:    add    $0x10,%rax
   0x0000000000400719 <+89>:    mov    %rax,(%rdi)      # ptr store
   0x000000000040071c <+92>:    retq   

Again, the ptr ref and buf object versions are exactly the same (encode_buf(),
encode_buf_tmp()).

Now we arrive at the core issue: test_encode_ptr_ref() calls the above
encode_ptr_ref() in a loop and encodes an array of ints. encode_ptr_ref() is
not actually called but inlined, and no other calls are made, so the
compiler/optimizer should be able to optimize away the loads/stores.
Unfortunately, it doesn't: in every loop iteration, the write ptr is loaded and
stored.
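The driver is just this (simplified):

void test_encode_ptr_ref(unsigned*& p, const int* src, int n)
{
    // encode_ptr_ref() is inlined here; ideally p would be kept in a register
    // across iterations and stored back to the reference once, after the loop
    for (int i = 0; i < n; ++i)
        encode_ptr_ref(p, src[i]);
}

The generated code: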
   0x00000000004008f0 <+0>:     jmp    0x4008fa <test_encode_ptr_ref(unsigned
int*&, int const*, int)+10>
   0x00000000004008f2 <+2>:     nopw   0x0(%rax,%rax,1)
   0x00000000004008f8 <+8>:     mov    %eax,(%rcx)     # inner loop head: store
value
   0x00000000004008fa <+10>:    sub    $0x1,%edx
   0x00000000004008fd <+13>:    js     0x400938 <test_encode_ptr_ref(unsigned
int*&, int const*, int)+72>
   0x00000000004008ff <+15>:    add    $0x4,%rsi
   0x0000000000400903 <+19>:    mov    (%rdi),%rcx     # load ptr
   0x0000000000400906 <+22>:    mov    -0x4(%rsi),%eax # load value
   0x0000000000400909 <+25>:    lea    0x4(%rcx),%r8
   0x000000000040090d <+29>:    cmp    $0xff,%eax
   0x0000000000400912 <+34>:    mov    %r8,(%rdi)      # store ptr
   0x0000000000400915 <+37>:    jle    0x4008f8 <test_encode_ptr_ref(unsigned
int*&, int const*, int)+8>
   0x0000000000400917 <+39>:    lea    0x8(%rcx),%r8
   0x000000000040091b <+43>:    cmp    $0xffff,%eax
   0x0000000000400920 <+48>:    movl   $0x0,(%rcx)
   0x0000000000400926 <+54>:    mov    %r8,(%rdi)      # store ptr
   0x0000000000400929 <+57>:    jg     0x40093a <test_encode_ptr_ref(unsigned
int*&, int const*, int)+74>
   0x000000000040092b <+59>:    sub    $0x1,%edx
   0x000000000040092e <+62>:    mov    %eax,0x4(%rcx)
   0x0000000000400931 <+65>:    jns    0x4008ff <test_encode_ptr_ref(unsigned
int*&, int const*, int)+15>
   0x0000000000400933 <+67>:    nopl   0x0(%rax,%rax,1)
   0x0000000000400938 <+72>:    repz retq 
   0x000000000040093a <+74>:    cmp    $0xffffff,%eax
   0x000000000040093f <+79>:    movl   $0x0,0x4(%rcx)
   0x0000000000400946 <+86>:    jg     0x400954 <test_encode_ptr_ref(unsigned
int*&, int const*, int)+100>
   0x0000000000400948 <+88>:    lea    0xc(%rcx),%r8
   0x000000000040094c <+92>:    mov    %r8,(%rdi)    # store ptr
   0x000000000040094f <+95>:    mov    %eax,0x8(%rcx)
   0x0000000000400952 <+98>:    jmp    0x4008fa <test_encode_ptr_ref(unsigned
int*&, int const*, int)+10>
   0x0000000000400954 <+100>:   lea    0x10(%rcx),%r8
   0x0000000000400958 <+104>:   movl   $0x0,0x8(%rcx)
   0x000000000040095f <+111>:   mov    %r8,(%rdi)    # store ptr
   0x0000000000400962 <+114>:   mov    %eax,0xc(%rcx)
   0x0000000000400965 <+117>:   jmp    0x4008fa <test_encode_ptr_ref(unsigned
int*&, int const*, int)+10>

The generated code for write_run_ptr_ref() loads the ptr before the loop, uses
the register inside the loop, and stores it back after the loop. Why can't the
same be done for test_encode_ptr_ref()?

I had a hunch that the 'if' blocks might be related, so I wrote a simple fn
that copies n ints from a src ptr to a dst ptr passed as a ref: copy_ptr_ref()
   0x00000000004005e0 <+0>:     test   %edx,%edx
   0x00000000004005e2 <+2>:     jle    0x400607 <copy_ptr_ref(unsigned int*&,
int const*, int)+39>
   0x00000000004005e4 <+4>:     mov    (%rdi),%r8          # load dst
   0x00000000004005e7 <+7>:     sub    $0x1,%edx
   0x00000000004005ea <+10>:    xor    %eax,%eax
   0x00000000004005ec <+12>:    add    $0x1,%rdx
   0x00000000004005f0 <+16>:    mov    (%rsi,%rax,4),%ecx  # load int
   0x00000000004005f3 <+19>:    mov    %ecx,(%r8,%rax,4)   # store int
   0x00000000004005f7 <+23>:    add    $0x1,%rax
   0x00000000004005fb <+27>:    cmp    %rdx,%rax
   0x00000000004005fe <+30>:    jne    0x4005f0 <copy_ptr_ref(unsigned int*&,
int const*, int)+16>
   0x0000000000400600 <+32>:    lea    (%r8,%rax,4),%rax
   0x0000000000400604 <+36>:    mov    %rax,(%rdi)         # store dst
   0x0000000000400607 <+39>:    repz retq 

No redundant loads/stores in the loop. The next step was to introduce a
condition, copy_ptr_ref_pred(): an int is copied only if it is not divisible by
a given number. GOTCHA! The dst load/store appeared in the loop:
   0x0000000000400610 <+0>:     mov    %edx,%r9d
   0x0000000000400613 <+3>:     nopl   0x0(%rax,%rax,1)
   0x0000000000400618 <+8>:     sub    $0x1,%r9d        # loop beg
   0x000000000040061c <+12>:    js     0x400643 <copy_ptr_ref_pred(unsigned
int*&, int const*, int, int)+51>
   0x000000000040061e <+14>:    add    $0x4,%rsi
   0x0000000000400622 <+18>:    mov    -0x4(%rsi),%r8d  # load int
   0x0000000000400626 <+22>:    mov    %r8d,%eax
   0x0000000000400629 <+25>:    cltd   
   0x000000000040062a <+26>:    idiv   %ecx
   0x000000000040062c <+28>:    test   %edx,%edx
   0x000000000040062e <+30>:    je     0x400618 <copy_ptr_ref_pred(unsigned
int*&, int const*, int, int)+8>
   0x0000000000400630 <+32>:    mov    (%rdi),%rax      # load dst
   0x0000000000400633 <+35>:    sub    $0x1,%r9d
   0x0000000000400637 <+39>:    lea    0x4(%rax),%rdx
   0x000000000040063b <+43>:    mov    %rdx,(%rdi)      # store dst
   0x000000000040063e <+46>:    mov    %r8d,(%rax)      # store int
   0x0000000000400641 <+49>:    jns    0x40061e <copy_ptr_ref_pred(unsigned
int*&, int const*, int, int)+14>
   0x0000000000400643 <+51>:    repz retq 
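For reference, the predicated copy in source form (simplified):

void copy_ptr_ref_pred(unsigned*& dst, const int* src, int n, int d)
{
    for (int i = 0; i < n; ++i)
        if (src[i] % d != 0)        // the only difference from copy_ptr_ref()
            *dst++ = src[i];
}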

If I load the dst into a local variable, it gets a register allocated, and the
load/store in each iteration is eliminated. However, the compiler should do
this automatically, imho. copy_ptr_ref_pred_tmp():
   0x0000000000400650 <+0>:     test   %edx,%edx
   0x0000000000400652 <+2>:     jle    0x400686 <copy_ptr_ref_pred_tmp(unsigned
int*&, int const*, int, int)+54>
   0x0000000000400654 <+4>:     movslq %edx,%rdx
   0x0000000000400657 <+7>:     mov    (%rdi),%r9       # load dst once
   0x000000000040065a <+10>:    lea    (%rsi,%rdx,4),%r10
   0x000000000040065e <+14>:    xchg   %ax,%ax
   0x0000000000400660 <+16>:    cmp    %r10,%rsi
   0x0000000000400663 <+19>:    je     0x400683 <copy_ptr_ref_pred_tmp(unsigned
int*&, int const*, int, int)+51>
   0x0000000000400665 <+21>:    add    $0x4,%rsi
   0x0000000000400669 <+25>:    mov    -0x4(%rsi),%r8d  # load int
   0x000000000040066d <+29>:    mov    %r8d,%eax
   0x0000000000400670 <+32>:    cltd   
   0x0000000000400671 <+33>:    idiv   %ecx
   0x0000000000400673 <+35>:    test   %edx,%edx
   0x0000000000400675 <+37>:    je     0x400660 <copy_ptr_ref_pred_tmp(unsigned
int*&, int const*, int, int)+16>
   0x0000000000400677 <+39>:    mov    %r8d,(%r9)       # store int
   0x000000000040067a <+42>:    add    $0x4,%r9
   0x000000000040067e <+46>:    cmp    %r10,%rsi
   0x0000000000400681 <+49>:    jne    0x400665 <copy_ptr_ref_pred_tmp(unsigned
int*&, int const*, int, int)+21>
   0x0000000000400683 <+51>:    mov    %r9,(%rdi)       # store dst once
   0x0000000000400686 <+54>:    repz retq 
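In source form (simplified):

void copy_ptr_ref_pred_tmp(unsigned*& dst_ref, const int* src, int n, int d)
{
    unsigned* dst = dst_ref;        // cache the write ptr in a local
    for (int i = 0; i < n; ++i)
        if (src[i] % d != 0)
            *dst++ = src[i];
    dst_ref = dst;                  // store it back once, after the loop
}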

There are more complex cases, where e.g. the encoder fn is split into two
parts: the most likely case, which should be inlined, and the rest, which
should be called out of line. See test_encode_ptr_ref_inline(). For this fn,
the ideal code would load the dst ptr into a register and use the reg for the
inlined encoder code; if the out-of-line part has to be called, it would store
the reg back to the ref, call the ool part, then load the reg back from the
ref. This way the most likely parts of the basic encoders could be inlined into
the main encoder loop, and the most likely path would run without unnecessary
loads/stores. A hand-written approximation of this pattern is sketched below.
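Something like this (hypothetical names, only a sketch):

void encode_rest(unsigned*& p, int v)            // out-of-line part: unlikely cases
{
    *p++ = 0;
    *p++ = v;                                    // simplified, not the full encoding
}

inline void encode_head(unsigned*& pref, unsigned*& p, int v)
{
    if (__builtin_expect(v <= 0xff, 1)) { *p++ = v; return; }  // hot path: reg only
    pref = p;                 // spill the cached ptr back to the ref
    encode_rest(pref, v);     // the out-of-line call updates it through memory
    p = pref;                 // reload the cache afterwards
}

void test_encode_hand_spilled(unsigned*& pref, const int* src, int n)
{
    unsigned* p = pref;       // load the write ptr into a local once
    for (int i = 0; i < n; ++i)
        encode_head(pref, p, src[i]);
    pref = p;                 // store it back once, after the loop
}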

Regards, Peter
