[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #112 from bergner at gcc dot gnu dot org 2009-10-03 01:39 ---
Subject: Bug 33928

Author: bergner
Date: Sat Oct 3 01:39:14 2009
New Revision: 152430

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=152430
Log:
Backport from mainline.

2009-08-30  Alan Modra  amo...@bigpond.net.au

    PR target/41081
    * fwprop.c (get_reg_use_in): Delete.
    (free_load_extend): New function.
    (forward_propagate_subreg): Use it.

2009-08-23  Alan Modra  amo...@bigpond.net.au

    PR target/41081
    * fwprop.c (try_fwprop_subst): Allow multiple sets.
    (get_reg_use_in): New function.
    (forward_propagate_subreg): Propagate through subreg of zero_extend
    or sign_extend.

2009-05-08  Paolo Bonzini  bonz...@gnu.org

    PR rtl-optimization/33928
    PR 26854
    * fwprop.c (use_def_ref, get_def_for_use, bitmap_only_bit_bitween,
    process_uses, build_single_def_use_links): New.
    (update_df): Update use_def_ref.
    (forward_propagate_into): Use get_def_for_use instead of use-def chains.
    (fwprop_init): Call build_single_def_use_links and let it initialize
    dataflow.
    (fwprop_done): Free use_def_ref.
    (fwprop_addr): Eliminate duplicate call to df_set_flags.
    * df-problems.c (df_rd_simulate_artificial_defs_at_top,
    df_rd_simulate_one_insn): New.
    (df_rd_bb_local_compute_process_def): Update head comment.
    (df_chain_create_bb): Use the new RD simulation functions.
    * df.h (df_rd_simulate_artificial_defs_at_top,
    df_rd_simulate_one_insn): New.
    * opts.c (decode_options): Enable fwprop at -O1.
    * doc/invoke.texi (-fforward-propagate): Document this.

Modified:
    branches/ibm/gcc-4_3-branch/gcc/ChangeLog.ibm
    branches/ibm/gcc-4_3-branch/gcc/REVISION
    branches/ibm/gcc-4_3-branch/gcc/df-problems.c
    branches/ibm/gcc-4_3-branch/gcc/df.h
    branches/ibm/gcc-4_3-branch/gcc/doc/invoke.texi
    branches/ibm/gcc-4_3-branch/gcc/fwprop.c
    branches/ibm/gcc-4_3-branch/gcc/opts.c

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #111 from lucier at math dot purdue dot edu 2009-08-27 17:02 ---
I can compile gambit 4.1.2 with -fschedule-insns except for the function noted in PR41164.

On

model name : Intel(R) Core(TM)2 Quad CPU Q8200 @ 2.33GHz

with

gcc version 4.5.0 20090803 (experimental) [trunk revision 150373] (GCC)

the times with -fschedule-insns are

(time (direct-fft-recursive-4 a table))
    144 ms cpu time (144 user, 0 system)
(time (inverse-fft-recursive-4 a table))
    136 ms cpu time (136 user, 0 system)

and the times without -fschedule-insns are

(time (direct-fft-recursive-4 a table))
    168 ms cpu time (168 user, 0 system)
(time (inverse-fft-recursive-4 a table))
    172 ms cpu time (172 user, 0 system)

That's a pretty big improvement.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #108 from lucier at math dot purdue dot edu 2009-08-27 01:18 ---
direct.c contains a direct FFT; I've compiled the direct and inverse FFT and ran them on arrays with 2^23 double-precision complex elements, using this compiler:

heine:~/programs/gcc/objdirs/bench-mainline-on-fft> /pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --enable-checking=release --prefix=/pkgs/gcc-mainline --enable-languages=c,c++ -enable-stage1-languages=c,c++
Thread model: posix
gcc version 4.5.0 20090803 (experimental) [trunk revision 150373] (GCC)

The compile options were

/pkgs/gcc-mainline/bin/gcc -save-temps -c -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -rdynamic -shared -fschedule-insns

and the same without -fschedule-insns.

The runtime for direct+inverse FFT with instruction scheduling was 1.264 seconds and the time for direct+inverse FFT without -fschedule-insns was 1.444 seconds, which is a 14% speedup for that one compiler option. This is on a 2.33GHz Core 2 quad machine.

I'll attach the inner loops of direct.c with and without -fschedule-insns.

I haven't been able to compile the complete Gambit runtime with -fschedule-insns on either x86-64 or ppc64; I've filed PR41164 and PR41176 for those two different failures.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #109 from lucier at math dot purdue dot edu 2009-08-27 01:22 --- Created an attachment (id=18432) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18432&action=view) inner loop of direct.c with -fschedule-insns -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #110 from lucier at math dot purdue dot edu 2009-08-27 01:22 --- Created an attachment (id=18433) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18433&action=view) inner loop of direct.c without -fschedule-insns -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #107 from rguenth at gcc dot gnu dot org 2009-08-04 12:28 ---
GCC 4.3.4 is being released, adjusting target milestone.

-- 
rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.3.4                       |4.3.5

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #104 from bonzini at gnu dot org 2009-06-16 06:47 --- I understood that with -frename-registers the regression is fixed. As I said, without a pre-regalloc scheduling pass and without register renaming, the scheduling quality you get is more or less random. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #105 from bonzini at gnu dot org 2009-06-16 07:01 --- Marking PR39157 as a duplicate of PR26854 is not exact (only the fwprop part is a duplicate, because we were getting large compile times because of building large data structures; the CFG Cleanup part is not exactly a duplicate) but I don't think it's important because anyway we have a patch for the fwprop issue. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #106 from lucier at math dot purdue dot edu 2009-06-16 07:24 ---
This machine has 4ms ticks, so we're getting down to a few ticks difference with a benchmark of this size.

It's 156ms with 4.2.4, 168ms with 4.5.0, and 164 ms when -frename-registers is added to the command line. It's not just scheduling, there are more memory accesses with 4.5.0.

With a problem roughly 10 times as large, the times are

4.2.4: 2912ms
4.5.0: 3204ms
4.5.0: 3120ms (adding -frename-registers)

So there's a 7% difference with -frename-registers.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
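The percentages in comment #106 can be re-derived from the quoted timings. The following is only a small arithmetic sketch; the function and variable names are made up for illustration and are not part of the bug report.

```python
def slowdown_percent(baseline_ms, measured_ms):
    """Relative slowdown of measured_ms versus baseline_ms, in percent."""
    return (measured_ms / baseline_ms - 1.0) * 100.0


# Timings quoted in comment #106 for the larger problem, in milliseconds.
T_424 = 2912          # gcc 4.2.4
T_450 = 3204          # gcc 4.5.0
T_450_RENAME = 3120   # gcc 4.5.0 with -frename-registers

plain = slowdown_percent(T_424, T_450)
renamed = slowdown_percent(T_424, T_450_RENAME)

# Small benchmark: 168 ms versus 156 ms on a machine with 4 ms ticks.
ticks_apart = (168 - 156) / 4

print(f"4.5.0 vs 4.2.4: {plain:.1f}% slower")
print(f"4.5.0 -frename-registers vs 4.2.4: {renamed:.1f}% slower")
print(f"small benchmark gap: {ticks_apart:.0f} ticks")
```

The small benchmark differs by only three timer ticks, which is why the larger problem size gives the more trustworthy ratio.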
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #97 from bonzini at gnu dot org 2009-06-15 15:14 --- Brad, could you try to time compiler.i with and without -ftime-report to see how much of the tree stmt walking timevar is just accounting overhead? -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #98 from lucier at math dot purdue dot edu 2009-06-15 16:11 ---
I don't quite understand how you would like me to configure and run the test.

First, I've applied your patches to speed up computing DF to my tree; do you want them included in the test, or should I use a pristine mainline?

Second, when configuring mainline, should I include, or not include

1. --enable-gather-detailed-mem-stats
2. --enable-checking=release

After that, I think you just want to run two compiles with and without -ftime-report, is that right? (Nothing about -fmem-report.)

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #99 from paolo dot bonzini at gmail dot com 2009-06-15 16:20 ---
Subject: Re: [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475

> First, I've applied your patches to speed up computing DF to my tree; do you
> want them included in the test, or should I use a pristine mainline?

It doesn't matter, but yes, use them.

> Second, when configuring mainline, should I include, or not include
>
> 1. --enable-gather-detailed-mem-stats
> 2. --enable-checking=release

Again it shouldn't matter, but use only --enable-checking=release.

> After that, I think you just want to run two compiles with and without
> -ftime-report, is that right? (Nothing about -fmem-report.)

Yes, and the output of -ftime-report is not needed. Just the time ./cc1 ... output for the two.

Thanks!

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #103 from lucier at math dot purdue dot edu 2009-06-15 20:21 ---
Regarding comment #101 ...

With

heine:~/programs/gcc/objdirs/gsc-fft-tests/gambc-v4_1_2> /pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --prefix=/pkgs/gcc-mainline --enable-languages=c --disable-multilib --enable-checking=release
Thread model: posix
gcc version 4.5.0 20090608 (experimental) [trunk revision 148276] (GCC)

(and including Paolo's patch to speed up DF), the routine in direct.c takes

    168 ms cpu time (168 user, 0 system)

As reported here

http://www.math.purdue.edu/~lucier/bugzilla/9/

with gcc-4.2.4, this routine takes 156 ms on the same machine.

Comment #9 gives the code that 4.2.4 generates at the start of the main loop; the start of the main loop with the version of 4.5.0 I gave above is:

.L2938:
        movq    %rcx, %rdx
        addq    8(%rax), %rdx
        leaq    4(%rcx), %rbx
        movq    %rdx, -8(%rax)
        leaq    4(%rdx), %rdi
        addq    8(%rax), %rdx
        movq    %rdi, -16(%rax)
        movq    %rdx, -24(%rax)
        leaq    4(%rdx), %rdi
        addq    8(%rax), %rdx
        movq    %rdi, -32(%rax)
        movq    %rdx, -40(%rax)
        leaq    4(%rdx), %rdi
        movq    40(%rax), %rdx
        movq    %rdi, -48(%rax)
        movsd   7(%rdx,%rdi,2), %xmm7
        movq    -40(%rax), %rdi
        leaq    7(%rdx,%rcx,2), %r8
        addq    $8, %rcx
        movsd   (%r8), %xmm4
        cmpq    %rcx, %r13
        movsd   7(%rdx,%rdi,2), %xmm10
        movq    -32(%rax), %rdi
        movsd   7(%rdx,%rdi,2), %xmm5
        movq    -24(%rax), %rdi
        movsd   7(%rdx,%rdi,2), %xmm6
        movq    -16(%rax), %rdi
        movsd   7(%rdx,%rdi,2), %xmm13
        movq    -8(%rax), %rdi
        movsd   7(%rdx,%rdi,2), %xmm11
        leaq    (%rbx,%rbx), %rdi
        movsd   7(%rdi,%rdx), %xmm9
        movq    24(%rax), %rdx
        movapd  %xmm11, %xmm14
        movsd   15(%rdx), %xmm1
        movsd   7(%rdx), %xmm2
        movapd  %xmm1, %xmm8
        movsd   31(%rdx), %xmm3
        movapd  %xmm2, %xmm12
        mulsd   %xmm10, %xmm8
        mulsd   %xmm7, %xmm12
        mulsd   %xmm2, %xmm10
        mulsd   %xmm1, %xmm7
        movsd   23(%rdx), %xmm0

So, to my mind, this is still a 4.5 regression, as there is still a slow-down and the code is still much less optimized by 4.5.0 than by 4.2.4.

168/156 ~ 1.08, so if you want to change the Summary of this bug to 8% regression, or some other things, that's fine, but I've changed this PR back to being a 4.5 regression.

I was not really thrilled when Richard marked PR 39157 as a duplicate of this PR. To my mind, there are three more or less independent things---run time of Gambit-generated code, compile time of the code, and the space required to compile the code. This PR is about run time; PR 39157 was about space needed by the compiler; PR 26854 is about compile time. They seem to have all been mushed together.

-- 
lucier at math dot purdue dot edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to work|4.5.0                       |
            Summary|[4.3/4.4 Regression] 30%    |[4.3/4.4/4.5 Regression] 30%
                   |performance slowdown in     |performance slowdown in
                   |floating-point code caused  |floating-point code caused
                   |by r118475                  |by r118475

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #95 from lucier at math dot purdue dot edu 2009-06-14 14:59 --- The test case is compiler.i.gz -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #96 from lucier at math dot purdue dot edu 2009-06-14 15:02 --- Sorry, the gcc options are in comment 87 (the -fforward-propagate is now redundant), and without Paolo's recently proposed patch it requires about 9GB of memory to compile. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #93 from rguenth at gcc dot gnu dot org 2009-06-13 14:18 ---
I would say that was the new SRA.

-- 
rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mjambor at suse dot cz

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #94 from jamborm at gcc dot gnu dot org 2009-06-14 04:43 ---
(In reply to comment #92)
> In the meanwhile something caused tree incremental SSA to jump up from 10s to
> 26s. Sob.

(In reply to comment #93)
> I would say that was the new SRA.

OK, I'll try to investigate. Which of the various attachments to this bug is the one to look at?

Martin

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #92 from bonzini at gnu dot org 2009-06-12 14:50 --- In the meanwhile something caused tree incremental SSA to jump up from 10s to 26s. Sob. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #88 from bonzini at gnu dot org 2009-06-08 08:40 --- Created an attachment (id=17963) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17963&action=view) patch I'm testing

Here is a patch I'm testing that completes the rewrite of fwprop's dataflow. This should make it much faster and less memory hungry. It should also keep the generated code fast (with -frename-registers of course); if not, it's a bug in the patch.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #89 from bonzini at gnu dot org 2009-06-08 08:59 --- Created an attachment (id=17964) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17964&action=view) correct version

oops, the previous one didn't work at -O1 even though it bootstrapped :-)

-- 
bonzini at gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #17963|0                           |1
        is obsolete|                            |

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #90 from bonzini at gnu dot org 2009-06-08 16:35 --- Yo, with the patch the time to compile compiler.i with the given options is 331s on my machine (with a checking compiler). Fwprop takes only 1% (including computation of the new dataflow problem). I'd estimate around 250s with your nonchecking build. I'll split it and post it tomorrow. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #91 from lucier at math dot purdue dot edu 2009-06-08 18:19 --- Created an attachment (id=17968) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17968&action=view) time and memory report for compiler.i after Paolo's patch

The patch cut the total bitmaps used compiling compiler.i from 60GB to 3GB; maximum memory (just from top) was 1631MB.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #84 from bonzini at gnu dot org 2009-05-15 10:35 ---
Ok, I am working on a patch to add a multiple-definitions DF problem and use that together with a domwalk to find the single definitions (instead of reaching-definitions, which is the remaining slow part).

The new problem has a bitvector sized by the number of registers rather than the number of defs (that is, sized like the bitvectors for liveness), which means it will be fast. It is defined as follows:

MDkill (B) = regs that have a def in B
MDinit (B) = (union of MDkill (P) for every P : B \in DomFrontier(P)) \cap LRin(B)
MDin (B)   = MDinit (B) \cup (union of MDout (P) for every predecessor P of B)
MDout (B)  = MDin (B) - MDkill (B)

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
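The equations in comment #84 can be tried out on a toy control-flow graph. The following is only a sketch under stated assumptions: it is not GCC's implementation, the function name solve_multiple_defs and all data structures are hypothetical, and the dominance frontiers and live-in sets (LRin) are written out by hand for a four-block diamond rather than computed.

```python
def solve_multiple_defs(blocks, preds, dom_frontier, md_kill, lr_in):
    """Iterate the MD equations from comment #84 to a fixed point.

    All arguments are plain dicts keyed by block id (hypothetical layout):
    preds[b] is a list of predecessors, dom_frontier[p] a set of blocks,
    md_kill[b] and lr_in[b] sets of register names.
    """
    # MDinit(B) = (union of MDkill(P) for P with B in DomFrontier(P)) & LRin(B)
    md_init = {}
    for b in blocks:
        merged = set()
        for p in blocks:
            if b in dom_frontier[p]:
                merged |= md_kill[p]
        md_init[b] = merged & lr_in[b]

    md_in = {b: set(md_init[b]) for b in blocks}
    md_out = {b: md_in[b] - md_kill[b] for b in blocks}

    changed = True
    while changed:
        changed = False
        for b in blocks:
            # MDin(B) = MDinit(B) | union of MDout(P) over predecessors P
            new_in = set(md_init[b])
            for p in preds[b]:
                new_in |= md_out[p]
            # MDout(B) = MDin(B) - MDkill(B)
            new_out = new_in - md_kill[b]
            if new_in != md_in[b] or new_out != md_out[b]:
                md_in[b], md_out[b] = new_in, new_out
                changed = True
    return md_in, md_out


# Diamond CFG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
# Register "r" is set in blocks 0, 1 and 2; register "x" only in block 0.
blocks = [0, 1, 2, 3]
preds = {0: [], 1: [0], 2: [0], 3: [1, 2]}
dom_frontier = {0: set(), 1: {3}, 2: {3}, 3: set()}
md_kill = {0: {"r", "x"}, 1: {"r"}, 2: {"r"}, 3: set()}
lr_in = {0: set(), 1: {"x"}, 2: {"x"}, 3: {"r", "x"}}

md_in, md_out = solve_multiple_defs(blocks, preds, dom_frontier, md_kill, lr_in)
print(md_in[3])
```

In this toy graph only "r" ends up in MDin of the join block, since it has two competing definitions there, while "x" keeps a single definition and would remain a candidate for forward propagation.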
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #85 from lucier at math dot purdue dot edu 2009-05-16 00:20 --- Created an attachment (id=17878) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17878&action=view) Large test file for testing time and memory usage

This is the file compiler.i used in the previous tests.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #78 from bonzini at gnu dot org 2009-05-08 06:51 ---
Subject: Bug 33928

Author: bonzini
Date: Fri May 8 06:51:12 2009
New Revision: 147270

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147270
Log:
2009-05-08  Paolo Bonzini  bonz...@gnu.org

    PR rtl-optimization/33928
    * loop-invariant.c (struct use): Add addr_use_p.
    (struct def): Add n_addr_uses.
    (struct invariant): Add cheap_address.
    (create_new_invariant): Set cheap_address.
    (record_use): Accept df_ref.  Set addr_use_p and update n_addr_uses.
    (record_uses): Pass df_ref to record_use.
    (get_inv_cost): Do not add inv->cost to comp_cost for cheap addresses
    used only as such.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/loop-invariant.c

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #79 from bonzini at gnu dot org 2009-05-08 07:18 --- I'm cobbling up the DIY dataflow patch and it is all but ugly, actually. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #80 from bonzini at gnu dot org 2009-05-08 07:51 ---
Subject: Bug 33928

Author: bonzini
Date: Fri May 8 07:51:46 2009
New Revision: 147274

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147274
Log:
2009-05-08  Paolo Bonzini  bonz...@gnu.org

    PR rtl-optimization/33928
    * loop-invariant.c (record_use): Fix && vs. || mishap.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/loop-invariant.c

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #81 from bonzini at gnu dot org 2009-05-08 07:55 --- Created an attachment (id=17825) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17825&action=view) speed up fwprop and enable it at -O1

Here is a patch I'm bootstrapping to remove fwprop's usage of UD chains. It does not affect the assembly output at all; it just changes the data structure that is used.

compiler.i is probably too big for me, but I tried slatex.i and fwprop was ~2% of compilation time with this patch.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #82 from bonzini at gnu dot org 2009-05-08 09:41 --- Hm, looking at the time reports the patch will save about 30-40% of the fwprop execution time, and should fix the memory hog problem, but will still leave in the 70s needed to compute reaching definitions. I guess it's a step forward for -O2 but borderline for -O1. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #83 from bonzini at gnu dot org 2009-05-08 12:22 ---
Subject: Bug 33928

Author: bonzini
Date: Fri May 8 12:22:30 2009
New Revision: 147282

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147282
Log:
2009-05-08  Paolo Bonzini  bonz...@gnu.org

    PR rtl-optimization/33928
    PR 26854
    * fwprop.c (use_def_ref, get_def_for_use, bitmap_only_bit_bitween,
    process_uses, build_single_def_use_links): New.
    (update_df): Update use_def_ref.
    (forward_propagate_into): Use get_def_for_use instead of use-def chains.
    (fwprop_init): Call build_single_def_use_links and let it initialize
    dataflow.
    (fwprop_done): Free use_def_ref.
    (fwprop_addr): Eliminate duplicate call to df_set_flags.
    * df-problems.c (df_rd_simulate_artificial_defs_at_top,
    df_rd_simulate_one_insn): New.
    (df_rd_bb_local_compute_process_def): Update head comment.
    (df_chain_create_bb): Use the new RD simulation functions.
    * df.h (df_rd_simulate_artificial_defs_at_top,
    df_rd_simulate_one_insn): New.
    * opts.c (decode_options): Enable fwprop at -O1.
    * doc/invoke.texi (-fforward-propagate): Document this.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/df-problems.c
    trunk/gcc/df.h
    trunk/gcc/doc/invoke.texi
    trunk/gcc/fwprop.c
    trunk/gcc/opts.c

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #67 from bonzini at gnu dot org 2009-05-07 13:40 --- I'm thinking of enabling -frename-registers on x86; since it does not enable the first scheduling pass, the live ranges will be shorter and the register allocator may reuse the same register over and over with no freedom on schedule-insns2. This would leave only the bug with RTL loop invariant motion. Brad, you are the one who's regularly producing insane testcases, can you measure the slowdown from -O1 to -O1 -frename-registers? It is a local pass, so it should not be that much, but I'd rather check before (I'll check on a bootstrap instead). -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
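The concern in comment #67, that reusing one register over and over leaves the post-regalloc scheduler little freedom, can be illustrated with a toy model. This is a sketch only and not how GCC's pass works: the helper names are hypothetical, instructions are reduced to (destination, sources) pairs, and renaming is assumed to have an unlimited supply of fresh names, which real hardware does not.

```python
def false_dependencies(insns):
    """Count instruction pairs ordered only by register reuse.

    A later instruction that overwrites a register which an earlier one
    reads (write-after-read) or writes (write-after-write) cannot be
    moved above it, even though no value flows between the two.
    """
    count = 0
    for j, (dest_j, _) in enumerate(insns):
        for i in range(j):
            dest_i, srcs_i = insns[i]
            if dest_j == dest_i or dest_j in srcs_i:
                count += 1
    return count


def rename_registers(insns):
    """Give every write a fresh name and rewrite later reads to match."""
    current = {}   # original register name -> most recent fresh name
    renamed = []
    for n, (dest, srcs) in enumerate(insns):
        new_srcs = tuple(current.get(s, s) for s in srcs)
        fresh = f"{dest}.{n}"
        current[dest] = fresh
        renamed.append((fresh, new_srcs))
    return renamed


# Two independent load/use pairs that both go through register "r1".
original = [
    ("r1", ("a",)),    # r1 <- a
    ("x0", ("r1",)),   # x0 <- r1
    ("r1", ("b",)),    # r1 <- b   (reuses r1)
    ("x1", ("r1",)),   # x1 <- r1
]

before = false_dependencies(original)
after = false_dependencies(rename_registers(original))
print(before, after)
```

With the shared register the second load is pinned behind both earlier instructions; after renaming, the two pairs are independent and a scheduler would be free to interleave them.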
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #71 from lucier at math dot purdue dot edu 2009-05-07 16:02 --- Created an attachment (id=17820) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17820&action=view) time for 31957, with rename-registers -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #74 from bonzini at gnu dot org 2009-05-07 16:21 ---
Ok. One step at a time. :-)

To recap, here is the situation:

- the CSE optimization you mention was *not* removed, it was moved to fwprop, so it does not run at -O1.

- once this was done, the way to go is to tune new optimizations, not to reintroduce old ones

- for example, fwprop in turn triggered a bad choice in loop invariant motion, for which a patch has been posted. This patch will remove the need for -fno-move-loop-invariants on this testcase (this is a deficiency in LIM that is not specific to machine-generated code, OTOH the presence of many fp[N] accesses helps triggering it).

- that scheduling is necessary now and not in 4.2.x is probably just a matter of luck

- why renaming registers is necessary now and not in 4.2.x is still a mystery; but there is an explanation as to why it helps (it prolongs live ranges, something that on non-x86 archs is done by the pre-regalloc scheduling)

- at least we have a set of options providing good performance on this testcase, and guidance towards better tuning of the various problematic optimizations

To conclude, nobody is underestimating the significance of this PR, it's just a matter of priorities. Near the end of the release cycle, you tend to look at PRs with small testcases to minimize the time spent understanding the code; near the beginning, you hope that new features magically fix the PRs and concentrate on wrong-code bugs and so on. Complex P2s such as this one unfortunately tend to stay in limbo.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #75 from lucier at math dot purdue dot edu 2009-05-07 16:31 ---
Subject: Re: [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475

On May 7, 2009, at 12:21 PM, bonzini at gnu dot org wrote:

> --- Comment #74 from bonzini at gnu dot org 2009-05-07 16:21 ---
> Ok. One step at a time. :-)
>
> To recap, here is the situation:
>
> - that scheduling is necessary now and not in 4.2.x, probably is just a
> matter of luck

If you mean -fschedule-insns2, it has always been part of the options list.

> - at least we have a set of options providing good performance on this
> testcase, and guidance towards better tuning of the various problematic
> optimizations

OK, but -fforward-propagate is not viable in general for these machine-generated codes.

Brad

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #76 from bonzini at gnu dot org 2009-05-07 16:37 --- It should be possible to modify fwprop to avoid excessive memory usage (doing its own dataflow, basically, instead of using UD chains) -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #77 from steven at gcc dot gnu dot org 2009-05-07 17:50 --- Re. comment #75: Just the fact that an option is enabled in both releases doesn't mean the pass behind it is doing the same thing in both releases. What the scheduler does, depends heavily on the code you feed it. Sometimes it is pure (good or bad) luck that changes the behavior of a pass in the compiler. The interactions between all the pieces are just very complicated (which is why, IMHO, retargetable-compiler engineering is so difficult: controlling the pipeline is undoable). Re. comment #76: Sad as it may be, I think this is the best short-term solution. Alternatively we could re-work fwprop to work on regions and use the partial-CFG dataflow stuff, similar to what the RTL loop optimizers (like loop-invariant) do. To be honest, I'd much prefer the latter, but the DIY-fwprop thing is probably easier in the short term. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #61 from jakub at gcc dot gnu dot org 2009-05-06 13:05 --- Also see PR39871, maybe that's related (though on ARM). -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #62 from bonzini at gnu dot org 2009-05-06 15:07 --- No, totally unrelated to PR39871 -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #63 from lucier at math dot purdue dot edu 2009-05-06 19:57 ---
Was the patch in comment 55 meant for me to bootstrap and test with today's mainline?

It crashes at the gcc_assert at

/* Subroutine of canon_reg.  Pass *XLOC through canon_reg, and validate
   the result if necessary.  INSN is as for canon_reg.  */

static void
validate_canon_reg (rtx *xloc, rtx insn)
{
  if (*xloc)
    {
      rtx new_rtx = canon_reg (*xloc, insn);

      /* If replacing pseudo with hard reg or vice versa, ensure the
         insn remains valid.  Likewise if the insn has MATCH_DUPs.  */
      gcc_assert (insn && new_rtx);
      validate_change (insn, xloc, new_rtx, 1);
    }
}

when building libgcc:

/tmp/lucier/gcc/objdirs/mainline/./gcc/xgcc -B/tmp/lucier/gcc/objdirs/mainline/./gcc/ -B/pkgs/gcc-mainline/x86_64-unknown-linux-gnu/bin/ -B/pkgs/gcc-mainline/x86_64-unknown-linux-gnu/lib/ -isystem /pkgs/gcc-mainline/x86_64-unknown-linux-gnu/include -isystem /pkgs/gcc-mainline/x86_64-unknown-linux-gnu/sys-include -g -O2 -m32 -O2 -g -O2 -DIN_GCC -W -Wall -Wwrite-strings -Wstrict-prototypes -Wmissing-prototypes -Wcast-qual -Wold-style-definition -isystem ./include -fPIC -g -DHAVE_GTHR_DEFAULT -DIN_LIBGCC2 -D__GCC_FLOAT_NOT_NEEDED -I. -I. -I../../.././gcc -I../../../../../mainline/libgcc -I../../../../../mainline/libgcc/. -I../../../../../mainline/libgcc/../gcc -I../../../../../mainline/libgcc/../include -I../../../../../mainline/libgcc/config/libbid -DENABLE_DECIMAL_BID_FORMAT -DHAVE_CC_TLS -DUSE_TLS -o _moddi3.o -MT _moddi3.o -MD -MP -MF _moddi3.dep -DL_moddi3 -c ../../../../../mainline/libgcc/../gcc/libgcc2.c \
  -fexceptions -fnon-call-exceptions -fvisibility=hidden -DHIDE_EXPORTS
../../../../../mainline/libgcc/../gcc/libgcc2.c: In function:
../../../../../mainline/libgcc/../gcc/libgcc2.c:1121: internal compiler error: in validate_canon_reg, at cse.c:2730

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #64 from lucier at math dot purdue dot edu 2009-05-06 20:43 ---
In answer to comment 60, here's the command line where I added -fforward-propagate -fno-move-loop-invariants:

/pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -fforward-propagate -fno-move-loop-invariants -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -D___GAMBCDIR=\/usr/local/Gambit-C/v4.1.2\ -D___SYS_TYPE_CPU=\x86_64\ -D___SYS_TYPE_VENDOR=\unknown\ -D___SYS_TYPE_OS=\linux-gnu\ -c _num.c

here's the compiler:

/pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: /tmp/lucier/gcc/mainline/configure --enable-checking=release --prefix=/pkgs/gcc-mainline --enable-languages=c
Thread model: posix
gcc version 4.5.0 20090506 (experimental) [trunk revision 147199] (GCC)

and the runtime didn't change (substantially)

    132 ms cpu time (132 user, 0 system)

and the loop looks pretty much just as bad (it's 117 instructions long, by my count):

.L2752:
        movq    %rcx, %rdx
        addq    8(%rax), %rdx
        leaq    4(%rcx), %rdi
        movq    %rdx, -8(%rax)
        leaq    4(%rdx), %rbx
        addq    8(%rax), %rdx
        movq    %rbx, -16(%rax)
        movq    %rdx, -24(%rax)
        leaq    4(%rdx), %rbx
        addq    8(%rax), %rdx
        movq    %rbx, -32(%rax)
        movq    %rdx, -40(%rax)
        leaq    4(%rdx), %rbx
        movq    40(%rax), %rdx
        movq    %rbx, -48(%rax)
        movsd   7(%rdx,%rbx,2), %xmm9
        movq    -40(%rax), %rbx
        leaq    7(%rdx,%rcx,2), %r8
        addq    $8, %rcx
        movsd   (%r8), %xmm4
        cmpq    %rcx, %r13
        movsd   7(%rdx,%rbx,2), %xmm11
        movq    -32(%rax), %rbx
        movsd   7(%rdx,%rbx,2), %xmm5
        movq    -24(%rax), %rbx
        movsd   7(%rdx,%rbx,2), %xmm7
        movq    -16(%rax), %rbx
        movsd   7(%rdx,%rbx,2), %xmm14
        movq    -8(%rax), %rbx
        movsd   7(%rdx,%rbx,2), %xmm6
        leaq    (%rdi,%rdi), %rbx
        movsd   7(%rbx,%rdx), %xmm8
        movq    24(%rax), %rdx
        movapd  %xmm6, %xmm13
        movsd   15(%rdx), %xmm1
        movsd   7(%rdx), %xmm2
        movapd  %xmm1, %xmm10
        movsd   31(%rdx), %xmm3
        movapd  %xmm2, %xmm12
        mulsd   %xmm11, %xmm10
        mulsd   %xmm9, %xmm12
        mulsd   %xmm2, %xmm11
        mulsd   %xmm1, %xmm9
        movsd   23(%rdx), %xmm0
        addsd   %xmm12, %xmm10
        movapd  %xmm2, %xmm12
        mulsd   %xmm7, %xmm2
        subsd   %xmm9, %xmm11
        movapd  %xmm1, %xmm9
        mulsd   %xmm5, %xmm12
        mulsd   %xmm5, %xmm1
        movapd  %xmm8, %xmm5
        mulsd   %xmm7, %xmm9
        movapd  %xmm4, %xmm7
        subsd   %xmm11, %xmm13
        addsd   %xmm6, %xmm11
        movsd   .LC5(%rip), %xmm6
        subsd   %xmm1, %xmm2
        movapd  %xmm0, %xmm1
        addsd   %xmm12, %xmm9
        movapd  %xmm14, %xmm12
        xorpd   %xmm3, %xmm6
        subsd   %xmm10, %xmm12
        mulsd   %xmm13, %xmm1
        subsd   %xmm2, %xmm7
        addsd   %xmm4, %xmm2
        movapd  %xmm6, %xmm4
        addsd   %xmm14, %xmm10
        mulsd   %xmm13, %xmm6
        mulsd   %xmm12, %xmm4
        subsd   %xmm9, %xmm5
        mulsd   %xmm0, %xmm12
        addsd   %xmm8, %xmm9
        movapd  %xmm0, %xmm8
        mulsd   %xmm11, %xmm0
        addsd   %xmm1, %xmm4
        movapd  %xmm3, %xmm1
        mulsd   %xmm10, %xmm3
        subsd   %xmm12, %xmm6
        mulsd   %xmm11, %xmm1
        mulsd   %xmm10, %xmm8
        subsd   %xmm3, %xmm0
        addsd   %xmm1, %xmm8
        movapd  %xmm2, %xmm1
        addsd   %xmm0, %xmm1
        subsd   %xmm0, %xmm2
        movapd  %xmm7, %xmm0
        subsd   %xmm6, %xmm7
        addsd   %xmm6, %xmm0
        movsd   %xmm1, (%r8)
        movapd  %xmm9, %xmm1
        movq    40(%rax), %rdx
        subsd   %xmm8, %xmm9
        addsd   %xmm8, %xmm1
        movsd   %xmm1, 7(%rbx,%rdx)
        movq    -8(%rax), %rbx
        movq    40(%rax), %rdx
        movsd   %xmm2, 7(%rdx,%rbx,2)
        movq    -16(%rax), %rbx
        movq    40(%rax), %rdx
        movsd   %xmm9, 7(%rdx,%rbx,2)
        movq    -24(%rax), %rbx
        movq    40(%rax), %rdx
        movsd   %xmm0, 7(%rdx,%rbx,2)
        movapd  %xmm5, %xmm0
        movq    -32(%rax), %rbx
        movq    40(%rax), %rdx
        subsd   %xmm4, %xmm5
        addsd   %xmm4, %xmm0
        movsd   %xmm0, 7(%rdx,%rbx,2)
        movq    -40(%rax), %rbx
        movq    40(%rax), %rdx
        movsd   %xmm7, 7(%rdx,%rbx,2)
        movq    -48(%rax), %rbx
        movq    40(%rax), %rdx
        movsd   %xmm5, 7(%rdx,%rbx,2)
        jg      .L2752
        movq    %rdi, %r13
.L2751:

-- 
lucier at math dot purdue dot edu changed:

           What    |Removed                     |Added
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #65 from bonzini at gnu dot org 2009-05-07 05:03 ---
Subject: Re: [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475

lucier at math dot purdue dot edu wrote:
> --- Comment #64 from lucier at math dot purdue dot edu 2009-05-06 20:43 ---
> In answer to comment 60, here's the command line where I added
> -fforward-propagate -fno-move-loop-invariants:

Hmm, can you try adding -frename-registers *or* -fweb (i.e. together they get no benefit) too?

> and the loop looks pretty much just as bad (it's 117 instructions long, by my count):

116 actually: the movq here is outside the loop (that's how I made all the instruction counts).

>         movsd   %xmm5, 7(%rdx,%rbx,2)
>         jg      .L2752
>         movq    %rdi, %r13
> .L2751:

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
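Since the two counts above differ only in where the loop is taken to end, here is a minimal sketch of one way to make such counts reproducible. It is not the method either poster used. It assumes GNU assembler output with one instruction per line (as in the .s file left by -save-temps), and it treats the loop as running from its label through the first jump back to that label. The function name count_loop_instructions, the SAMPLE text, and the labels .L1 and .L2 are hypothetical and chosen only for illustration.

```python
def count_loop_instructions(asm: str, label: str) -> int:
    """Count instructions from `label:` through the first jump back to `label`.

    The closing jump itself is counted (a convention chosen for this sketch);
    anything after it is treated as outside the loop.
    """
    count = 0
    inside = False
    for raw in asm.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == label + ":":
            inside = True  # the loop body starts right after its label
            continue
        if not inside or line.endswith(":") or line.startswith("."):
            continue  # skip text before the loop, other labels, and directives
        count += 1
        fields = line.split()
        if fields[0].startswith("j") and fields[-1] == label:
            return count  # the backward jump closes the loop
    raise ValueError(f"no loop closed by a jump to {label!r} was found")


# Hypothetical miniature loop, shaped like the tail of the listing above.
SAMPLE = """\
.L1:
        movq    %rcx, %rdx
        addq    $8, %rcx
        cmpq    %rcx, %r13
        jg      .L1
        movq    %rdi, %r13
.L2:
"""

print(count_loop_instructions(SAMPLE, ".L1"))
```

In the sample, the movq that follows the jg is not counted, which matches the point that an instruction after the backward jump is outside the loop.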
[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475
--- Comment #66 from lucier at math dot purdue dot edu 2009-05-07 05:27 ---
Adding -frename-registers gives a significant speedup (sometimes as fast as 4.1.2 on this shared machine, i.e., it sometimes hits 108 ms instead of 132-140 ms). The command line with -fforward-propagate -fno-move-loop-invariants -frename-registers is

/pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -fforward-propagate -fno-move-loop-invariants -frename-registers -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -D___GAMBCDIR=\/usr/local/Gambit-C/v4.1.2\ -D___SYS_TYPE_CPU=\x86_64\ -D___SYS_TYPE_VENDOR=\unknown\ -D___SYS_TYPE_OS=\linux-gnu\ -c _num.c

and the loop is

.L2752:
        movq    %rcx, %r12
        addq    8(%rax), %r12
        leaq    4(%rcx), %rdi
        movq    %r12, -8(%rax)
        leaq    4(%r12), %r8
        addq    8(%rax), %r12
        movq    %r8, -16(%rax)
        movq    -8(%rax), %r8
        movq    -16(%rax), %rdx
        movq    %r12, -24(%rax)
        leaq    4(%r12), %rbx
        addq    8(%rax), %r12
        movq    -24(%rax), %r9
        movq    %rbx, -32(%rax)
        movq    24(%rax), %rbx
        movq    -32(%rax), %r10
        leaq    4(%r12), %r11
        movq    %r12, -40(%rax)
        movq    40(%rax), %r12
        movq    -40(%rax), %r14
        movq    %r11, -48(%rax)
        movsd   15(%rbx), %xmm1
        movsd   7(%rbx), %xmm2
        movsd   7(%r12,%r11,2), %xmm9
        movapd  %xmm1, %xmm3
        movsd   7(%r12,%r14,2), %xmm11
        leaq    7(%r12,%rcx,2), %r11
        movapd  %xmm2, %xmm10
        leaq    (%rdi,%rdi), %r14
        mulsd   %xmm11, %xmm3
        movapd  %xmm2, %xmm12
        mulsd   %xmm9, %xmm10
        addq    $8, %rcx
        mulsd   %xmm1, %xmm9
        cmpq    %rcx, %r13
        mulsd   %xmm2, %xmm11
        movsd   7(%r12,%r10,2), %xmm5
        movsd   7(%r12,%r9,2), %xmm7
        addsd   %xmm10, %xmm3
        movsd   7(%r12,%r8,2), %xmm6
        subsd   %xmm9, %xmm11
        mulsd   %xmm7, %xmm2
        movapd  %xmm1, %xmm9
        mulsd   %xmm5, %xmm1
        movapd  %xmm6, %xmm13
        movsd   7(%r12,%rdx,2), %xmm14
        mulsd   %xmm5, %xmm12
        mulsd   %xmm7, %xmm9
        subsd   %xmm11, %xmm13
        movsd   31(%rbx), %xmm0
        addsd   %xmm6, %xmm11
        movsd   .LC5(%rip), %xmm6
        subsd   %xmm1, %xmm2
        movsd   (%r11), %xmm4
        movapd  %xmm14, %xmm10
        xorpd   %xmm0, %xmm6
        addsd   %xmm12, %xmm9
        movsd   7(%r14,%r12), %xmm8
        subsd   %xmm3, %xmm10
        movapd  %xmm4, %xmm7
        addsd   %xmm14, %xmm3
        movsd   23(%rbx), %xmm15
        subsd   %xmm2, %xmm7
        movapd  %xmm8, %xmm5
        addsd   %xmm4, %xmm2
        movapd  %xmm6, %xmm4
        subsd   %xmm9, %xmm5
        movapd  %xmm15, %xmm14
        addsd   %xmm8, %xmm9
        mulsd   %xmm10, %xmm4
        movapd  %xmm15, %xmm8
        mulsd   %xmm15, %xmm10
        movapd  %xmm0, %xmm12
        mulsd   %xmm11, %xmm15
        mulsd   %xmm3, %xmm0
        movapd  %xmm7, %xmm1
        mulsd   %xmm13, %xmm6
        mulsd   %xmm3, %xmm8
        movapd  %xmm9, %xmm3
        mulsd   %xmm11, %xmm12
        subsd   %xmm0, %xmm15
        mulsd   %xmm13, %xmm14
        subsd   %xmm10, %xmm6
        movapd  %xmm2, %xmm10
        movapd  %xmm5, %xmm0
        addsd   %xmm12, %xmm8
        addsd   %xmm15, %xmm10
        subsd   %xmm15, %xmm2
        addsd   %xmm14, %xmm4
        addsd   %xmm8, %xmm3
        movsd   %xmm10, (%r11)
        movq    40(%rax), %r10
        subsd   %xmm8, %xmm9
        addsd   %xmm6, %xmm1
        addsd   %xmm4, %xmm0
        movsd   %xmm3, 7(%r14,%r10)
        movq    -8(%rax), %r9
        movq    40(%rax), %rdx
        subsd   %xmm6, %xmm7
        subsd   %xmm4, %xmm5
        movsd   %xmm2, 7(%rdx,%r9,2)
        movq    -16(%rax), %r8
        movq    40(%rax), %r12
        movsd   %xmm9, 7(%r12,%r8,2)
        movq    -24(%rax), %rbx
        movq    40(%rax), %r11
        movsd   %xmm1, 7(%r11,%rbx,2)
        movq    -32(%rax), %r14
        movq    40(%rax), %r10
        movsd   %xmm0, 7(%r10,%r14,2)
        movq    -40(%rax), %r9
        movq    40(%rax), %rdx
        movsd   %xmm7, 7(%rdx,%r9,2)
        movq    -48(%rax), %r8
        movq    40(%rax), %r12
        movsd   %xmm5, 7(%r12,%r8,2)
        jg      .L2752

Adding -fforward-propagate -fno-move-loop-invariants -fweb instead of -fforward-propagate -fno-move-loop-invariants -frename-registers, so the compile line is

/pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -fforward-propagate -fno-move-loop-invariants -fweb -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -D___GAMBCDIR=\/usr/local/Gambit-C/v4.1.2\ -D___SYS_TYPE_CPU=\x86_64\