https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62062
Bug ID: 62062
Summary: Missed optimization: write ptr reloaded in each loop iteration
Product: gcc
Version: 4.9.2
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: petschy at gmail dot com

When developing a binary i/o lib, I ran into some performance degradation in the writer functions. My investigation revealed that the write pointer was loaded/stored in each loop iteration. Although this can be dodged by hand-tuning the code via local variables kept in registers, the resulting code is longer, less clear, harder to maintain, etc.

For this report, I recompiled w/ 4.9.2, but earlier versions in 4.x give the same results. The test box is an AMD FX w/ Debian Jessie; gcc was compiled from git commit f964b16:

g++-4.9.2 -v
Using built-in specs.
COLLECT_GCC=g++-4.9.2
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-unknown-linux-gnu/4.9.2/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ./configure --enable-languages=c,c++ --disable-multilib --program-suffix=-4.9.2
Thread model: posix
gcc version 4.9.2 20140808 (prerelease) (GCC)

Compiler/linker flags used:

g++-4.9.2 -c 20140725-reg_vs_mem.cpp -g -std=c++11 -Wall -Wextra -Werror -Wundef -Wshadow -O3 -fno-tree-vectorize
g++-4.9.2 -o 20140725-reg_vs_mem 20140725-reg_vs_mem.o -g

I tried with less optimization, too, but it made no difference. -fno-tree-vectorize was used because otherwise the code generated for the encoder functions used the vector registers, which resulted in serious code size bloat and ~2-3x runtimes, as opposed to the ~1.5x runtime increase due to the redundant loads/stores.

For the tests, I made two versions of each function: one that expects a ptr ref (the baseline version), and another that expects a buffer object (w/ beg/end ptrs for advanced functionality, which is not used here). As expected, the same code is generated for these two.
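For reference, the two flavours might be sketched like this. This is a hypothetical reconstruction (the real source is in the attached 20140725-reg_vs_mem.cpp); the names follow the symbols in the disassembly below, and the OutBuf layout is assumed:

```cpp
// Hypothetical reconstruction of the two flavours compared in this report.
struct OutBuf
{
    unsigned int* cur;  // write pointer
    unsigned int* end;  // for advanced functionality, not used here
};

// Baseline: the write pointer is passed as a reference.
void write_run_ptr_ref(unsigned int*& p, unsigned int v, int n)
{
    for (int i = 0; i < n; ++i)
        *p++ = v;
}

// The same operation on a buffer object; generates identical code.
void write_run_buf(OutBuf& b, unsigned int v, int n)
{
    for (int i = 0; i < n; ++i)
        *b.cur++ = v;
}
```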
I ruled out the aliasing rules first: I use int ptrs in this test, so the 'char* aliases everything' rule is dodged. To prove my case: write_run_char_ptr_ref() loads/stores the char ptr in each loop iteration, since the character written may alias the pointer itself:

0x0000000000400570 <+0>:  sub    $0x1,%edx
0x0000000000400573 <+3>:  js     0x40058d <write_run_char_ptr_ref(char*&, int, int)+29>
0x0000000000400575 <+5>:  nopl   (%rax)
0x0000000000400578 <+8>:  mov    (%rdi),%rax        # loop beg, load ptr
0x000000000040057b <+11>: sub    $0x1,%edx
0x000000000040057e <+14>: cmp    $0xffffffff,%edx
0x0000000000400581 <+17>: lea    0x1(%rax),%rcx
0x0000000000400585 <+21>: mov    %rcx,(%rdi)        # store updated ptr
0x0000000000400588 <+24>: mov    %sil,(%rax)
0x000000000040058b <+27>: jne    0x400578 <write_run_char_ptr_ref(char*&, int, int)+8>
0x000000000040058d <+29>: repz retq

write_run_char_ptr_ref_unaliased() keeps the ptr in a register, but this was achieved w/ some platform dependent asm trickery; see unaliased_storeb() [btw, some platform independent builtin would be nice for this]. I use this in the production code, which writes bytes at the lowest level.

0x0000000000400590 <+0>:  test   %edx,%edx
0x0000000000400592 <+2>:  jle    0x4005af <write_run_char_ptr_ref_unaliased(char*&, int, int)+31>
0x0000000000400594 <+4>:  mov    (%rdi),%rax        # load ptr
0x0000000000400597 <+7>:  sub    $0x1,%edx
0x000000000040059a <+10>: lea    0x1(%rax,%rdx,1),%rdx  # end ptr
0x000000000040059f <+15>: nop
0x00000000004005a0 <+16>: mov    %sil,(%rax)        # loop body, no ptr load/store
0x00000000004005a3 <+19>: add    $0x1,%rax
0x00000000004005a7 <+23>: cmp    %rdx,%rax
0x00000000004005aa <+26>: jne    0x4005a0 <write_run_char_ptr_ref_unaliased(char*&, int, int)+16>
0x00000000004005ac <+28>: mov    %rax,(%rdi)        # store ptr after the loop
0x00000000004005af <+31>: repz retq

write_run_ptr_ref() uses an int ptr, and the ptr is kept in a register for the loop, without any trickery.
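For comparison, the char-ptr baseline might be sketched like this (a hedged reconstruction; the platform-dependent unaliased_storeb() asm trick is deliberately omitted):

```cpp
// Hypothetical reconstruction of the char-ptr baseline. Because a char
// store may legally alias anything, including the pointer slot behind
// pp itself, GCC reloads and re-stores pp on every iteration.
void write_run_char_ptr_ref(char*& pp, int c, int n)
{
    for (int i = 0; i < n; ++i)
        *pp++ = (char)c;
}
```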
The disassembly is the same as write_run_char_ptr_ref_unaliased() above, except that ints are written. write_run_buf() is exactly the same. So far so good.

The next step is a variable length encoder, encode_ptr_ref(). This is not a real working encoding, just a demonstration. Since the probabilities are implied by the encoding lengths, the if conditions are peppered with builtin_expect's. Again, the ptr ref and buf object versions are exactly the same.

However, I noticed that the ptr is written back multiple times, depending on which 'if' becomes true. For the most likely case it doesn't matter: only one write is performed, but before the conditional jump. For the last two less likely cases, two redundant writes are performed. Moreover, for the 3rd 'if' block the actual useful ptr write is performed inside the block, and not before the conditional jump as with the 1st and 2nd ifs. Since the last two ifs are not likely, the performance impact of the redundant writes probably doesn't matter much, but since this ticket is about redundant loads/stores, it might have something to do with the core issue.
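The encoder might be reconstructed like this (a hedged sketch inferred from the disassembly below; as noted, not a real working encoding, and the exact limits are illustrative):

```cpp
// Hypothetical reconstruction of the demonstration encoder: one word
// written per "level", with zero words marking the skipped levels. The
// expect hints mirror the probabilities implied by the encoding lengths.
void encode_ptr_ref(unsigned int*& p, int v)
{
    if (__builtin_expect(v <= 0xff, 1))     { *p++ = (unsigned)v; return; }
    *p++ = 0;
    if (__builtin_expect(v <= 0xffff, 1))   { *p++ = (unsigned)v; return; }
    *p++ = 0;
    if (__builtin_expect(v <= 0xffffff, 1)) { *p++ = (unsigned)v; return; }
    *p++ = 0;
    *p++ = (unsigned)v;
}
```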
0x0000000000400720 <+0>:  mov    (%rdi),%rax        # ptr load
0x0000000000400723 <+3>:  cmp    $0xff,%esi
0x0000000000400729 <+9>:  lea    0x4(%rax),%rdx
0x000000000040072d <+13>: mov    %rdx,(%rdi)        # store before the cond
0x0000000000400730 <+16>: jg     0x400738 <encode_buf(OutBuf&, int)+24>
0x0000000000400732 <+18>: mov    %esi,(%rax)        # most likely branch
0x0000000000400734 <+20>: retq
0x0000000000400735 <+21>: nopl   (%rax)
0x0000000000400738 <+24>: lea    0x8(%rax),%rdx
0x000000000040073c <+28>: cmp    $0xffff,%esi
0x0000000000400742 <+34>: movl   $0x0,(%rax)
0x0000000000400748 <+40>: mov    %rdx,(%rdi)        # store again
0x000000000040074b <+43>: jg     0x400751 <encode_buf(OutBuf&, int)+49>
0x000000000040074d <+45>: mov    %esi,0x4(%rax)     # 2nd most likely br
0x0000000000400750 <+48>: retq
0x0000000000400751 <+49>: cmp    $0xffffff,%esi
0x0000000000400757 <+55>: movl   $0x0,0x4(%rax)
0x000000000040075e <+62>: jg     0x40076b <encode_buf(OutBuf&, int)+75>
0x0000000000400760 <+64>: lea    0xc(%rax),%rdx     # 3rd most likely br
0x0000000000400764 <+68>: mov    %rdx,(%rdi)        # ptr stored after the cond
0x0000000000400767 <+71>: mov    %esi,0x8(%rax)
0x000000000040076a <+74>: retq
0x000000000040076b <+75>: lea    0x10(%rax),%rdx    # least likely br
0x000000000040076f <+79>: movl   $0x0,0x8(%rax)
0x0000000000400776 <+86>: mov    %rdx,(%rdi)        # store
0x0000000000400779 <+89>: mov    %esi,0xc(%rax)
0x000000000040077c <+92>: retq

encode_ptr_ref_tmp() is the same as encode_ptr_ref(), but the ptr is loaded into a local variable by hand and written back to the reference in each branch before return. The generated code is effectively the same, same size, only the insn order is different: no redundant writes, and all ptr writes are inside the corresponding if block.
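The hand-tuned variant might look like this (again a hedged reconstruction, matching the symbol in the disassembly below):

```cpp
// Hypothetical reconstruction of the hand-tuned variant: the ref is
// loaded into a local once, and written back in each branch before return.
void encode_ptr_ref_tmp(unsigned int*& pp, int v)
{
    unsigned int* p = pp;  // load the ref into a local by hand
    if (__builtin_expect(v <= 0xff, 1))     { *p++ = (unsigned)v; pp = p; return; }
    *p++ = 0;
    if (__builtin_expect(v <= 0xffff, 1))   { *p++ = (unsigned)v; pp = p; return; }
    *p++ = 0;
    if (__builtin_expect(v <= 0xffffff, 1)) { *p++ = (unsigned)v; pp = p; return; }
    *p++ = 0;
    *p++ = (unsigned)v;
    pp = p;                // write back before return
}
```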
0x00000000004006c0 <+0>:  cmp    $0xff,%esi
0x00000000004006c6 <+6>:  mov    (%rdi),%rax        # ptr load
0x00000000004006c9 <+9>:  jg     0x4006d8 <encode_ptr_ref_tmp(unsigned int*&, int)+24>
0x00000000004006cb <+11>: mov    %esi,(%rax)
0x00000000004006cd <+13>: add    $0x4,%rax
0x00000000004006d1 <+17>: mov    %rax,(%rdi)        # ptr store
0x00000000004006d4 <+20>: retq
0x00000000004006d5 <+21>: nopl   (%rax)
0x00000000004006d8 <+24>: cmp    $0xffff,%esi
0x00000000004006de <+30>: movl   $0x0,(%rax)
0x00000000004006e4 <+36>: jg     0x4006f1 <encode_ptr_ref_tmp(unsigned int*&, int)+49>
0x00000000004006e6 <+38>: mov    %esi,0x4(%rax)
0x00000000004006e9 <+41>: add    $0x8,%rax
0x00000000004006ed <+45>: mov    %rax,(%rdi)        # ptr store
0x00000000004006f0 <+48>: retq
0x00000000004006f1 <+49>: cmp    $0xffffff,%esi
0x00000000004006f7 <+55>: movl   $0x0,0x4(%rax)
0x00000000004006fe <+62>: jg     0x40070b <encode_ptr_ref_tmp(unsigned int*&, int)+75>
0x0000000000400700 <+64>: mov    %esi,0x8(%rax)
0x0000000000400703 <+67>: add    $0xc,%rax
0x0000000000400707 <+71>: mov    %rax,(%rdi)        # ptr store
0x000000000040070a <+74>: retq
0x000000000040070b <+75>: movl   $0x0,0x8(%rax)
0x0000000000400712 <+82>: mov    %esi,0xc(%rax)
0x0000000000400715 <+85>: add    $0x10,%rax
0x0000000000400719 <+89>: mov    %rax,(%rdi)        # ptr store
0x000000000040071c <+92>: retq

Again, the ptr ref and buf object versions are exactly the same (encode_buf(), encode_buf_tmp()).

Now we arrive at the core issue: test_encode_ptr_ref() calls the above encode_ptr_ref() in a loop and encodes an array of ints. encode_ptr_ref() is not called but inlined, and no other calls are made, so the compiler/optimizer should be able to optimize away the loads/stores, but unfortunately it doesn't: in every loop iteration, the write ptr is loaded/stored.
0x00000000004008f0 <+0>:   jmp    0x4008fa <test_encode_ptr_ref(unsigned int*&, int const*, int)+10>
0x00000000004008f2 <+2>:   nopw   0x0(%rax,%rax,1)
0x00000000004008f8 <+8>:   mov    %eax,(%rcx)        # inner loop head: store value
0x00000000004008fa <+10>:  sub    $0x1,%edx
0x00000000004008fd <+13>:  js     0x400938 <test_encode_ptr_ref(unsigned int*&, int const*, int)+72>
0x00000000004008ff <+15>:  add    $0x4,%rsi
0x0000000000400903 <+19>:  mov    (%rdi),%rcx        # load ptr
0x0000000000400906 <+22>:  mov    -0x4(%rsi),%eax    # load value
0x0000000000400909 <+25>:  lea    0x4(%rcx),%r8
0x000000000040090d <+29>:  cmp    $0xff,%eax
0x0000000000400912 <+34>:  mov    %r8,(%rdi)         # store ptr
0x0000000000400915 <+37>:  jle    0x4008f8 <test_encode_ptr_ref(unsigned int*&, int const*, int)+8>
0x0000000000400917 <+39>:  lea    0x8(%rcx),%r8
0x000000000040091b <+43>:  cmp    $0xffff,%eax
0x0000000000400920 <+48>:  movl   $0x0,(%rcx)
0x0000000000400926 <+54>:  mov    %r8,(%rdi)         # store ptr
0x0000000000400929 <+57>:  jg     0x40093a <test_encode_ptr_ref(unsigned int*&, int const*, int)+74>
0x000000000040092b <+59>:  sub    $0x1,%edx
0x000000000040092e <+62>:  mov    %eax,0x4(%rcx)
0x0000000000400931 <+65>:  jns    0x4008ff <test_encode_ptr_ref(unsigned int*&, int const*, int)+15>
0x0000000000400933 <+67>:  nopl   0x0(%rax,%rax,1)
0x0000000000400938 <+72>:  repz retq
0x000000000040093a <+74>:  cmp    $0xffffff,%eax
0x000000000040093f <+79>:  movl   $0x0,0x4(%rcx)
0x0000000000400946 <+86>:  jg     0x400954 <test_encode_ptr_ref(unsigned int*&, int const*, int)+100>
0x0000000000400948 <+88>:  lea    0xc(%rcx),%r8
0x000000000040094c <+92>:  mov    %r8,(%rdi)         # store ptr
0x000000000040094f <+95>:  mov    %eax,0x8(%rcx)
0x0000000000400952 <+98>:  jmp    0x4008fa <test_encode_ptr_ref(unsigned int*&, int const*, int)+10>
0x0000000000400954 <+100>: lea    0x10(%rcx),%r8
0x0000000000400958 <+104>: movl   $0x0,0x8(%rcx)
0x000000000040095f <+111>: mov    %r8,(%rdi)         # store ptr
0x0000000000400962 <+114>: mov    %eax,0xc(%rcx)
0x0000000000400965 <+117>: jmp    0x4008fa <test_encode_ptr_ref(unsigned int*&, int const*, int)+10>
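The driver might be sketched as follows; a hypothetical reconstruction of the encoder is repeated here so the sketch is self-contained:

```cpp
// Hypothetical reconstruction of the demonstration encoder: one word
// per level, zero words for the skipped levels.
static void encode_ptr_ref(unsigned int*& p, int v)
{
    if (__builtin_expect(v <= 0xff, 1))     { *p++ = (unsigned)v; return; }
    *p++ = 0;
    if (__builtin_expect(v <= 0xffff, 1))   { *p++ = (unsigned)v; return; }
    *p++ = 0;
    if (__builtin_expect(v <= 0xffffff, 1)) { *p++ = (unsigned)v; return; }
    *p++ = 0;
    *p++ = (unsigned)v;
}

// The driver: encode_ptr_ref() is fully inlined here and no other calls
// are made, yet the ptr is still loaded/stored through the ref each
// iteration in the generated code.
void test_encode_ptr_ref(unsigned int*& p, const int* src, int n)
{
    for (int i = 0; i < n; ++i)
        encode_ptr_ref(p, src[i]);
}
```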
The generated code for write_run_ptr_ref() loaded the ptr before the loop, used the register inside the loop, and stored the ptr back after the loop. Why can't the same be done for test_encode_ptr_ref()? I had a hunch that maybe the 'if' blocks are related, so I wrote a simple fn that copies n ints from a src ptr to a dst ptr passed as ref, copy_ptr_ref():

0x00000000004005e0 <+0>:  test   %edx,%edx
0x00000000004005e2 <+2>:  jle    0x400607 <copy_ptr_ref(unsigned int*&, int const*, int)+39>
0x00000000004005e4 <+4>:  mov    (%rdi),%r8         # load dst
0x00000000004005e7 <+7>:  sub    $0x1,%edx
0x00000000004005ea <+10>: xor    %eax,%eax
0x00000000004005ec <+12>: add    $0x1,%rdx
0x00000000004005f0 <+16>: mov    (%rsi,%rax,4),%ecx # load int
0x00000000004005f3 <+19>: mov    %ecx,(%r8,%rax,4)  # store int
0x00000000004005f7 <+23>: add    $0x1,%rax
0x00000000004005fb <+27>: cmp    %rdx,%rax
0x00000000004005fe <+30>: jne    0x4005f0 <copy_ptr_ref(unsigned int*&, int const*, int)+16>
0x0000000000400600 <+32>: lea    (%r8,%rax,4),%rax
0x0000000000400604 <+36>: mov    %rax,(%rdi)        # store dst
0x0000000000400607 <+39>: repz retq

No redundant loads/stores in the loop. The next step was to introduce a condition: the int is copied only if it is not divisible by a given number, copy_ptr_ref_pred(). GOTCHA!
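The predicated copy might be reconstructed as follows (a hedged sketch matching the signature in the disassembly):

```cpp
// Hypothetical reconstruction: copy only the ints not divisible by d.
// The dst pointer is advanced through the reference itself, which is
// what triggers the per-iteration load/store in the generated code.
void copy_ptr_ref_pred(unsigned int*& dst, const int* src, int n, int d)
{
    for (int i = 0; i < n; ++i)
        if (src[i] % d != 0)
            *dst++ = (unsigned)src[i];
}
```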
The dst load/store appeared in the loop:

0x0000000000400610 <+0>:  mov    %edx,%r9d
0x0000000000400613 <+3>:  nopl   0x0(%rax,%rax,1)
0x0000000000400618 <+8>:  sub    $0x1,%r9d          # loop beg
0x000000000040061c <+12>: js     0x400643 <copy_ptr_ref_pred(unsigned int*&, int const*, int, int)+51>
0x000000000040061e <+14>: add    $0x4,%rsi
0x0000000000400622 <+18>: mov    -0x4(%rsi),%r8d    # load int
0x0000000000400626 <+22>: mov    %r8d,%eax
0x0000000000400629 <+25>: cltd
0x000000000040062a <+26>: idiv   %ecx
0x000000000040062c <+28>: test   %edx,%edx
0x000000000040062e <+30>: je     0x400618 <copy_ptr_ref_pred(unsigned int*&, int const*, int, int)+8>
0x0000000000400630 <+32>: mov    (%rdi),%rax        # load dst
0x0000000000400633 <+35>: sub    $0x1,%r9d
0x0000000000400637 <+39>: lea    0x4(%rax),%rdx
0x000000000040063b <+43>: mov    %rdx,(%rdi)        # store dst
0x000000000040063e <+46>: mov    %r8d,(%rax)        # store int
0x0000000000400641 <+49>: jns    0x40061e <copy_ptr_ref_pred(unsigned int*&, int const*, int, int)+14>
0x0000000000400643 <+51>: repz retq

If I load the dst into a local variable, it gets a register allocated, and the load/store in each iteration is eliminated. However, the compiler should do this automatically, imho.
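The hand-cached version might look like this (hedged reconstruction):

```cpp
// Hypothetical reconstruction of the workaround: cache the ref in a
// local so the pointer can live in a register for the whole loop, and
// write it back exactly once at the end.
void copy_ptr_ref_pred_tmp(unsigned int*& dst, const int* src, int n, int d)
{
    unsigned int* p = dst;          // load dst once
    for (int i = 0; i < n; ++i)
        if (src[i] % d != 0)
            *p++ = (unsigned)src[i];
    dst = p;                        // store dst once
}
```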
copy_ptr_ref_pred_tmp():

0x0000000000400650 <+0>:  test   %edx,%edx
0x0000000000400652 <+2>:  jle    0x400686 <copy_ptr_ref_pred_tmp(unsigned int*&, int const*, int, int)+54>
0x0000000000400654 <+4>:  movslq %edx,%rdx
0x0000000000400657 <+7>:  mov    (%rdi),%r9         # load dst once
0x000000000040065a <+10>: lea    (%rsi,%rdx,4),%r10
0x000000000040065e <+14>: xchg   %ax,%ax
0x0000000000400660 <+16>: cmp    %r10,%rsi
0x0000000000400663 <+19>: je     0x400683 <copy_ptr_ref_pred_tmp(unsigned int*&, int const*, int, int)+51>
0x0000000000400665 <+21>: add    $0x4,%rsi
0x0000000000400669 <+25>: mov    -0x4(%rsi),%r8d    # load int
0x000000000040066d <+29>: mov    %r8d,%eax
0x0000000000400670 <+32>: cltd
0x0000000000400671 <+33>: idiv   %ecx
0x0000000000400673 <+35>: test   %edx,%edx
0x0000000000400675 <+37>: je     0x400660 <copy_ptr_ref_pred_tmp(unsigned int*&, int const*, int, int)+16>
0x0000000000400677 <+39>: mov    %r8d,(%r9)         # store int
0x000000000040067a <+42>: add    $0x4,%r9
0x000000000040067e <+46>: cmp    %r10,%rsi
0x0000000000400681 <+49>: jne    0x400665 <copy_ptr_ref_pred_tmp(unsigned int*&, int const*, int, int)+21>
0x0000000000400683 <+51>: mov    %r9,(%rdi)         # store dst once
0x0000000000400686 <+54>: repz retq

There are more complex cases, where eg the encoder fn is split into two parts: the most likely case, which should be inlined, and the rest, which should be called; see test_encode_ptr_ref_inline(). For this fn, the ideal code would load the dst ptr into a register and use the reg for the inlined encoder code; if the out-of-line part must be called, it would store the reg back to the ref, call the ool part, then reload the reg from the ref. This way the most likely parts of the basic encoders could be inlined into the main encoder loop, and the most likely path would run without unnecessary loads/stores.

Regards,
Peter