[Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475

2009-10-02 Thread bergner at gcc dot gnu dot org


--- Comment #112 from bergner at gcc dot gnu dot org  2009-10-03 01:39 
---
Subject: Bug 33928

Author: bergner
Date: Sat Oct  3 01:39:14 2009
New Revision: 152430

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=152430
Log:
Backport from mainline.

2009-08-30  Alan Modra  <amo...@bigpond.net.au>

PR target/41081
* fwprop.c (get_reg_use_in): Delete.
(free_load_extend): New function.
(forward_propagate_subreg): Use it.

2009-08-23  Alan Modra  <amo...@bigpond.net.au>

PR target/41081
* fwprop.c (try_fwprop_subst): Allow multiple sets.
(get_reg_use_in): New function.
(forward_propagate_subreg): Propagate through subreg of zero_extend
or sign_extend.

2009-05-08  Paolo Bonzini  <bonz...@gnu.org>

PR rtl-optimization/33928
PR 26854
* fwprop.c (use_def_ref, get_def_for_use, bitmap_only_bit_bitween,
process_uses, build_single_def_use_links): New.
(update_df): Update use_def_ref.
(forward_propagate_into): Use get_def_for_use instead of use-def
chains.
(fwprop_init): Call build_single_def_use_links and let it initialize
dataflow.
(fwprop_done): Free use_def_ref.
(fwprop_addr): Eliminate duplicate call to df_set_flags.
* df-problems.c (df_rd_simulate_artificial_defs_at_top,
df_rd_simulate_one_insn): New.
(df_rd_bb_local_compute_process_def): Update head comment.
(df_chain_create_bb): Use the new RD simulation functions.
* df.h (df_rd_simulate_artificial_defs_at_top,
df_rd_simulate_one_insn): New.
* opts.c (decode_options): Enable fwprop at -O1.
* doc/invoke.texi (-fforward-propagate): Document this.

Modified:
branches/ibm/gcc-4_3-branch/gcc/ChangeLog.ibm
branches/ibm/gcc-4_3-branch/gcc/REVISION
branches/ibm/gcc-4_3-branch/gcc/df-problems.c
branches/ibm/gcc-4_3-branch/gcc/df.h
branches/ibm/gcc-4_3-branch/gcc/doc/invoke.texi
branches/ibm/gcc-4_3-branch/gcc/fwprop.c
branches/ibm/gcc-4_3-branch/gcc/opts.c


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928




2009-08-27 Thread lucier at math dot purdue dot edu


--- Comment #111 from lucier at math dot purdue dot edu  2009-08-27 17:02 
---
I can compile gambit 4.1.2 with -fschedule-insns except for the function noted
in PR41164.

On

model name  : Intel(R) Core(TM)2 Quad  CPU   Q8200  @ 2.33GHz

with

gcc version 4.5.0 20090803 (experimental) [trunk revision 150373] (GCC) 

the times with -fschedule-insns are

(time (direct-fft-recursive-4 a table))
144 ms cpu time (144 user, 0 system)
(time (inverse-fft-recursive-4 a table))
136 ms cpu time (136 user, 0 system)

and the times without -fschedule-insns are

(time (direct-fft-recursive-4 a table))
168 ms cpu time (168 user, 0 system)
(time (inverse-fft-recursive-4 a table))
172 ms cpu time (172 user, 0 system)

That's a pretty big improvement.



2009-08-26 Thread lucier at math dot purdue dot edu


--- Comment #108 from lucier at math dot purdue dot edu  2009-08-27 01:18 
---
direct.c contains a direct FFT; I've compiled the direct and inverse FFT and
ran them on arrays with 2^23 double-precision complex elements, using this
compiler:

heine:~/programs/gcc/objdirs/bench-mainline-on-fft> /pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --enable-checking=release
--prefix=/pkgs/gcc-mainline --enable-languages=c,c++
-enable-stage1-languages=c,c++
Thread model: posix
gcc version 4.5.0 20090803 (experimental) [trunk revision 150373] (GCC) 

The compile options were

/pkgs/gcc-mainline/bin/gcc -save-temps -c -Wno-unused -O1 -fno-math-errno
-fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv
-fomit-frame-pointer -fPIC -fno-common -mieee-fp -rdynamic -shared
-fschedule-insns

and the same without -fschedule-insns.

The runtime for direct+inverse FFT with instruction scheduling was 1.264
seconds and the time for direct+inverse FFT without -fschedule-insns was 1.444
seconds, which is a 14% speedup for that one compiler option.  This is on a
2.33GHz Core 2 quad machine.
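
For reference, the 14% figure is the ratio of the two run times quoted above.
A minimal C sketch of that arithmetic follows; the helper name speedup_percent
is made up for this illustration and is not part of the benchmark.

```c
#include <assert.h>

/* Hypothetical helper: percentage by which the slower run time exceeds
   the faster one, i.e. (slow / fast - 1) * 100.  */
double
speedup_percent (double slow_seconds, double fast_seconds)
{
  assert (fast_seconds > 0.0);
  return (slow_seconds / fast_seconds - 1.0) * 100.0;
}
```

With the timings above, speedup_percent (1.444, 1.264) comes to roughly 14.2.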

I'll attach the inner loops of direct.c with and without -fschedule-insns.

I haven't been able to compile the complete Gambit runtime with
-fschedule-insns on either x86-64 or ppc64; I've filed PR41164 and PR41176 for
those two different failures.



2009-08-26 Thread lucier at math dot purdue dot edu


--- Comment #109 from lucier at math dot purdue dot edu  2009-08-27 01:22 
---
Created an attachment (id=18432)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18432&action=view)
inner loop of direct.c with -fschedule-insns



2009-08-26 Thread lucier at math dot purdue dot edu


--- Comment #110 from lucier at math dot purdue dot edu  2009-08-27 01:22 
---
Created an attachment (id=18433)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18433&action=view)
inner loop of direct.c without -fschedule-insns



2009-08-04 Thread rguenth at gcc dot gnu dot org


--- Comment #107 from rguenth at gcc dot gnu dot org  2009-08-04 12:28 
---
GCC 4.3.4 is being released, adjusting target milestone.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.3.4                       |4.3.5



2009-06-16 Thread bonzini at gnu dot org


--- Comment #104 from bonzini at gnu dot org  2009-06-16 06:47 ---
I understood that with -frename-registers the regression is fixed.  As I said,
without a pre-regalloc scheduling pass and without register renaming, the
scheduling quality you get is more or less random.



2009-06-16 Thread bonzini at gnu dot org


--- Comment #105 from bonzini at gnu dot org  2009-06-16 07:01 ---
Marking PR39157 as a duplicate of PR26854 is not exact (only the fwprop part is
a duplicate, because we were getting large compile times because of building
large data structures; the CFG Cleanup part is not exactly a duplicate) but I
don't think it's important because anyway we have a patch for the fwprop issue.



2009-06-16 Thread lucier at math dot purdue dot edu


--- Comment #106 from lucier at math dot purdue dot edu  2009-06-16 07:24 
---
This machine has 4ms ticks, so we're getting down to a few ticks difference
with a benchmark of this size.  It's 156ms with 4.2.4, 168ms with 4.5.0, and
164 ms when -frename-registers is added to the command line.

It's not just scheduling, there are more memory accesses with 4.5.0.

With a problem roughly 10 times as large, the times are

4.2.4:  2912ms
4.5.0:  3204ms
4.5.0:  3120ms (adding -frename-registers)

So there's a 7% difference with -frename-registers.
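
To make the tick argument concrete, here is a small C sketch that expresses
the quoted timings in 4 ms ticks and computes the slowdown on the larger
problem with integer arithmetic.  The names TICK_MS, ticks_between and
slowdown_permille are assumptions made for this illustration.

```c
#include <assert.h>

/* Timer resolution mentioned above: 4 ms per tick.  */
enum { TICK_MS = 4 };

/* Number of whole timer ticks separating two CPU-time readings.  */
int
ticks_between (int a_ms, int b_ms)
{
  int diff = a_ms > b_ms ? a_ms - b_ms : b_ms - a_ms;
  return diff / TICK_MS;
}

/* Slowdown of NEW_MS relative to OLD_MS in tenths of a percent,
   truncated toward zero.  */
int
slowdown_permille (int new_ms, int old_ms)
{
  assert (old_ms > 0);
  return (new_ms - old_ms) * 1000 / old_ms;
}
```

ticks_between (168, 156) is 3 and ticks_between (168, 164) is 1, so the small
benchmark can barely resolve the effect of -frename-registers, while
slowdown_permille (3120, 2912) is 71, i.e. about 7%.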



2009-06-15 Thread bonzini at gnu dot org


--- Comment #97 from bonzini at gnu dot org  2009-06-15 15:14 ---
Brad, could you try to time compiler.i with and without -ftime-report to see
how much of the tree stmt walking timevar is just accounting overhead?



2009-06-15 Thread lucier at math dot purdue dot edu


--- Comment #98 from lucier at math dot purdue dot edu  2009-06-15 16:11 
---
I don't quite understand how you would like me to configure and run the test.

First, I've applied your patches to speed up computing DF to my tree; do you
want them included in the test, or should I use a pristine mainline?

Second, when configuring mainline, should I include, or not include

1.  --enable-gather-detailed-mem-stats
2.  --enable-checking=release

After that, I think you just want to run two compiles with and without
-ftime-report, is that right?  (Nothing about -fmem-report.)



2009-06-15 Thread paolo dot bonzini at gmail dot com


--- Comment #99 from paolo dot bonzini at gmail dot com  2009-06-15 16:20 
---
Subject: Re:  [4.3/4.4/4.5 Regression] 30% performance
 slowdown in floating-point code caused by  r118475

> First, I've applied your patches to speed up computing DF to my tree; do you
> want them included in the test, or should I use a pristine mainline?

It doesn't matter, but yes, use them.

> Second, when configuring mainline, should I include, or not include
>
> 1.  --enable-gather-detailed-mem-stats
> 2.  --enable-checking=release

Again it shouldn't matter, but use only --enable-checking=release.

> After that, I think you just want to run two compiles with and without
> -ftime-report, is that right?  (Nothing about -fmem-report.)

Yes, and the output of -ftime-report is not needed.  Just the output of
"time ./cc1 ..." for the two.  Thanks!



2009-06-15 Thread lucier at math dot purdue dot edu


--- Comment #103 from lucier at math dot purdue dot edu  2009-06-15 20:21 
---
Regarding comment #101 ...

With

heine:~/programs/gcc/objdirs/gsc-fft-tests/gambc-v4_1_2> /pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --prefix=/pkgs/gcc-mainline
--enable-languages=c --disable-multilib --enable-checking=release
Thread model: posix
gcc version 4.5.0 20090608 (experimental) [trunk revision 148276] (GCC) 

(and including Paolo's patch to speed up DF), the routine in direct.c takes

168 ms cpu time (168 user, 0 system)

As reported here

http://www.math.purdue.edu/~lucier/bugzilla/9/

with gcc-4.2.4, this routine takes 156 ms on the same machine.

Comment #9 gives the code that 4.2.4 generates at the start of the main loop; 
the start of the main loop with the version of 4.5.0 I gave above is:

.L2938:
        movq    %rcx, %rdx
        addq    8(%rax), %rdx
        leaq    4(%rcx), %rbx
        movq    %rdx, -8(%rax)
        leaq    4(%rdx), %rdi
        addq    8(%rax), %rdx
        movq    %rdi, -16(%rax)
        movq    %rdx, -24(%rax)
        leaq    4(%rdx), %rdi
        addq    8(%rax), %rdx
        movq    %rdi, -32(%rax)
        movq    %rdx, -40(%rax)
        leaq    4(%rdx), %rdi
        movq    40(%rax), %rdx
        movq    %rdi, -48(%rax)
        movsd   7(%rdx,%rdi,2), %xmm7
        movq    -40(%rax), %rdi
        leaq    7(%rdx,%rcx,2), %r8
        addq    $8, %rcx
        movsd   (%r8), %xmm4
        cmpq    %rcx, %r13
        movsd   7(%rdx,%rdi,2), %xmm10
        movq    -32(%rax), %rdi
        movsd   7(%rdx,%rdi,2), %xmm5
        movq    -24(%rax), %rdi
        movsd   7(%rdx,%rdi,2), %xmm6
        movq    -16(%rax), %rdi
        movsd   7(%rdx,%rdi,2), %xmm13
        movq    -8(%rax), %rdi
        movsd   7(%rdx,%rdi,2), %xmm11
        leaq    (%rbx,%rbx), %rdi
        movsd   7(%rdi,%rdx), %xmm9
        movq    24(%rax), %rdx
        movapd  %xmm11, %xmm14
        movsd   15(%rdx), %xmm1
        movsd   7(%rdx), %xmm2
        movapd  %xmm1, %xmm8
        movsd   31(%rdx), %xmm3
        movapd  %xmm2, %xmm12
        mulsd   %xmm10, %xmm8
        mulsd   %xmm7, %xmm12
        mulsd   %xmm2, %xmm10
        mulsd   %xmm1, %xmm7
        movsd   23(%rdx), %xmm0

So, to my mind, this is still a 4.5 regression, as there is still a slow-down
and the code is still much less optimized by 4.5.0 than by 4.2.4.  168/156 ~
1.08, so if you want to change the Summary of this bug to 8% regression, or
something similar, that's fine, but I've changed this PR back to being a 4.5
regression.

I was not really thrilled when Richard marked PR 39157 as a duplicate of this
PR.  To my mind, there are three more or less independent things---run time of
Gambit-generated code, compile time of the code, and the space required to
compile the code.  This PR is about run time; PR 39157 was about space needed
by the compiler; PR 26854 is about compile time.  They seem to have all been
mushed together.


-- 

lucier at math dot purdue dot edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to work|4.5.0                       |
            Summary|[4.3/4.4 Regression] 30%    |[4.3/4.4/4.5 Regression] 30%
                   |performance slowdown in     |performance slowdown in
                   |floating-point code caused  |floating-point code caused
                   |by  r118475                 |by  r118475



2009-06-14 Thread lucier at math dot purdue dot edu


--- Comment #95 from lucier at math dot purdue dot edu  2009-06-14 14:59 
---
The test case is compiler.i.gz



2009-06-14 Thread lucier at math dot purdue dot edu


--- Comment #96 from lucier at math dot purdue dot edu  2009-06-14 15:02 
---
Sorry, the gcc options are in comment 87 (the -fforward-propagate is now
redundant), and without Paolo's recently proposed patch it requires about 9GB
of memory to compile.



2009-06-13 Thread rguenth at gcc dot gnu dot org


--- Comment #93 from rguenth at gcc dot gnu dot org  2009-06-13 14:18 
---
I would say that was the new SRA.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mjambor at suse dot cz



2009-06-13 Thread jamborm at gcc dot gnu dot org


--- Comment #94 from jamborm at gcc dot gnu dot org  2009-06-14 04:43 
---
(In reply to comment #92)
> In the meanwhile something caused tree incremental SSA to jump up from 10s to
> 26s.  Sob.

(In reply to comment #93)
> I would say that was the new SRA.


OK, I'll try to investigate.  Which of the various attachments to this
bug is the one to look at?

Martin



2009-06-12 Thread bonzini at gnu dot org


--- Comment #92 from bonzini at gnu dot org  2009-06-12 14:50 ---
In the meanwhile something caused tree incremental SSA to jump up from 10s to
26s.  Sob.



2009-06-08 Thread bonzini at gnu dot org


--- Comment #88 from bonzini at gnu dot org  2009-06-08 08:40 ---
Created an attachment (id=17963)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17963&action=view)
patch I'm testing

Here is a patch I'm testing that completes the rewrite of fwprop's dataflow.
This should make it much faster and less memory hungry.  It should also keep
the generated code fast (with -frename-registers of course); if not, it's a bug
in the patch.



2009-06-08 Thread bonzini at gnu dot org


--- Comment #89 from bonzini at gnu dot org  2009-06-08 08:59 ---
Created an attachment (id=17964)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17964&action=view)
correct version

oops, the previous one didn't work at -O1 even though it bootstrapped :-)


-- 

bonzini at gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #17963|0                           |1
        is obsolete|                            |



2009-06-08 Thread bonzini at gnu dot org


--- Comment #90 from bonzini at gnu dot org  2009-06-08 16:35 ---
Yo, with the patch the time to compile compiler.i with the given options is
331s on my machine (with a checking compiler).  Fwprop takes only 1% (including
computation of the new dataflow problem).  I'd estimate around 250s with your
nonchecking build.  I'll split it and post it tomorrow.



2009-06-08 Thread lucier at math dot purdue dot edu


--- Comment #91 from lucier at math dot purdue dot edu  2009-06-08 18:19 
---
Created an attachment (id=17968)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17968&action=view)
time and memory report for compiler.i after Paolo's patch

The patch cut the total bitmaps used compiling compiler.i from  60GB to 3GB;
maximum memory (just from top) was 1631MB.



2009-05-15 Thread bonzini at gnu dot org


--- Comment #84 from bonzini at gnu dot org  2009-05-15 10:35 ---
Ok, I am working on a patch to add a multiple-definitions DF problem and use
that together with a domwalk to find the single definitions (instead of
reaching-definitions, which is the remaining slow part).  The new problem has a
bitvector sized by the number of registers rather than the number of defs (that
is, sized like the bitvectors for liveness), which means it will be fast.  It is
defined as follows:

MDkill (B) = regs that have a def in B
MDinit (B) = (union of MDkill (P) for every P : B \in DomFrontier(P)) \cap LRin(B)
MDin (B) = MDinit (B) \cup (union of MDout (P) for every predecessor P of B)
MDout (B) = MDin (B) - MDkill (B)
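
The equations above can be solved by a plain fixed-point iteration.  The
following C sketch is only an illustration under stated assumptions: registers
are bits of a 32-bit mask, MDinit is supplied by the caller rather than derived
from dominance frontiers and liveness, and the names md_block, md_solve and
md_demo_in are hypothetical and do not come from GCC's df code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { MAX_PREDS = 4 };

/* One basic block.  Each bit of a mask stands for one register.  */
struct md_block
{
  uint32_t kill;          /* MDkill: registers defined in this block.  */
  uint32_t init;          /* MDinit: assumed precomputed by the caller.  */
  uint32_t in, out;       /* MDin and MDout, filled in by md_solve.  */
  int n_preds;
  int preds[MAX_PREDS];   /* Indices of the predecessor blocks.  */
};

/* Iterate MDin/MDout to a fixed point; returns the number of passes.  */
int
md_solve (struct md_block *bb, int n_blocks)
{
  int passes = 0;
  bool changed;

  for (int i = 0; i < n_blocks; i++)
    bb[i].in = bb[i].out = 0;

  do
    {
      changed = false;
      for (int i = 0; i < n_blocks; i++)
        {
          uint32_t in = bb[i].init;
          for (int p = 0; p < bb[i].n_preds; p++)
            in |= bb[bb[i].preds[p]].out;
          uint32_t out = in & ~bb[i].kill;
          if (in != bb[i].in || out != bb[i].out)
            {
              bb[i].in = in;
              bb[i].out = out;
              changed = true;
            }
        }
      passes++;
    }
  while (changed);
  return passes;
}

/* Diamond CFG 0 -> {1, 2} -> 3.  Register 0 is defined in blocks 1 and 2,
   register 1 only in block 0.  Block 3 lies in the dominance frontier of
   blocks 1 and 2, so (assuming register 0 is live there) MDinit (3) = {r0}.  */
uint32_t
md_demo_in (int block)
{
  struct md_block bb[4] = {
    { .kill = 1u << 1 },
    { .kill = 1u << 0, .n_preds = 1, .preds = { 0 } },
    { .kill = 1u << 0, .n_preds = 1, .preds = { 0 } },
    { .init = 1u << 0, .n_preds = 2, .preds = { 1, 2 } },
  };

  assert (block >= 0 && block < 4);
  md_solve (bb, 4);
  return bb[block].in;
}
```

In this example md_demo_in (3) has only bit 0 set: register 0 reaches the join
block with multiple definitions, whereas register 1 keeps a single definition
everywhere.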



2009-05-15 Thread lucier at math dot purdue dot edu


--- Comment #85 from lucier at math dot purdue dot edu  2009-05-16 00:20 
---
Created an attachment (id=17878)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17878&action=view)
Large test file for testing time and memory usage

This is the file compiler.i used in the previous tests.



2009-05-08 Thread bonzini at gcc dot gnu dot org


--- Comment #78 from bonzini at gnu dot org  2009-05-08 06:51 ---
Subject: Bug 33928

Author: bonzini
Date: Fri May  8 06:51:12 2009
New Revision: 147270

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147270
Log:
2009-05-08  Paolo Bonzini  <bonz...@gnu.org>

PR rtl-optimization/33928
* loop-invariant.c (struct use): Add addr_use_p.
(struct def): Add n_addr_uses.
(struct invariant): Add cheap_address.
(create_new_invariant): Set cheap_address.
(record_use): Accept df_ref.  Set addr_use_p and update n_addr_uses.
(record_uses): Pass df_ref to record_use.
(get_inv_cost): Do not add inv->cost to comp_cost for cheap addresses used
only as such.


Modified:
trunk/gcc/ChangeLog
trunk/gcc/loop-invariant.c
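
The get_inv_cost entry in the log above describes a simple rule: an invariant
that is a cheap address and is used only inside addresses should not add to the
cost total.  A minimal C sketch of that rule follows.  The structures are
simplified, hypothetical stand-ins that borrow the field names from the log;
the actual code in loop-invariant.c may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the structures named in the log.  */
struct def_sketch
{
  unsigned n_uses;        /* All uses of the defined register.  */
  unsigned n_addr_uses;   /* Uses that occur inside a memory address.  */
};

struct invariant_sketch
{
  unsigned cost;          /* Cost of computing the invariant.  */
  bool cheap_address;     /* Set when the invariant is a cheap address.  */
  struct def_sketch def;
};

/* Cost this invariant adds to the running total: zero for a cheap address
   whose every use is an address use, otherwise its full cost.  */
unsigned
counted_cost (const struct invariant_sketch *inv)
{
  assert (inv->def.n_addr_uses <= inv->def.n_uses);
  if (inv->cheap_address && inv->def.n_addr_uses == inv->def.n_uses)
    return 0;
  return inv->cost;
}

/* Convenience wrapper so the rule can be tried with plain numbers.  */
unsigned
counted_cost_of (unsigned cost, bool cheap, unsigned n_uses,
                 unsigned n_addr_uses)
{
  struct invariant_sketch inv = { cost, cheap, { n_uses, n_addr_uses } };
  return counted_cost (&inv);
}
```

For instance, counted_cost_of (3, true, 4, 4) is 0, while the same invariant
with only two of its four uses inside addresses contributes its full cost of 3.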



2009-05-08 Thread bonzini at gnu dot org


--- Comment #79 from bonzini at gnu dot org  2009-05-08 07:18 ---
I'm cobbling up the DIY dataflow patch and it is anything but ugly, actually.



2009-05-08 Thread bonzini at gcc dot gnu dot org


--- Comment #80 from bonzini at gnu dot org  2009-05-08 07:51 ---
Subject: Bug 33928

Author: bonzini
Date: Fri May  8 07:51:46 2009
New Revision: 147274

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147274
Log:
2009-05-08  Paolo Bonzini  <bonz...@gnu.org>

PR rtl-optimization/33928
* loop-invariant.c (record_use): Fix && vs. || mishap.


Modified:
trunk/gcc/ChangeLog
trunk/gcc/loop-invariant.c



2009-05-08 Thread bonzini at gnu dot org


--- Comment #81 from bonzini at gnu dot org  2009-05-08 07:55 ---
Created an attachment (id=17825)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17825&action=view)
speed up fwprop and enable it at -O1

Here is a patch I'm bootstrapping to remove fwprop's usage of UD chains.  It
does not affect the assembly output at all; it just changes the data structure
that is used.

compiler.i is probably too big for me, but I tried slatex.i and fwprop was ~2%
of compilation time with this patch.
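
To illustrate the kind of data structure change described here, the C sketch
below replaces a per-use chain of definitions with a flat array that records a
definition only when it is the single one reaching the use.  This is only an
illustration of the idea: the names only_def and build_single_def_links and the
bitmask representation are assumptions, not the code in the attached patch.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the index of the only set bit in MASK, or -1 if zero or several
   bits are set.  Each bit stands for one definition reaching a use.  */
int
only_def (uint32_t mask)
{
  if (mask == 0 || (mask & (mask - 1)) != 0)
    return -1;

  int idx = 0;
  while ((mask >>= 1) != 0)
    idx++;
  return idx;
}

/* Fill USE_DEF[i] with the single definition reaching use i, or -1.
   REACHING[i] is the (hypothetical) set of definitions reaching use i.  */
void
build_single_def_links (const uint32_t *reaching, int *use_def, int n_uses)
{
  assert (n_uses >= 0);
  for (int i = 0; i < n_uses; i++)
    use_def[i] = only_def (reaching[i]);
}
```

A lookup for a use is then a single array access, and uses with several
reaching definitions are simply marked with -1 and skipped.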



2009-05-08 Thread bonzini at gnu dot org


--- Comment #82 from bonzini at gnu dot org  2009-05-08 09:41 ---
Hm, looking at the time reports the patch will save about 30-40% of the fwprop
execution time, and should fix the memory hog problem, but will still leave in
the 70s needed to compute reaching definitions.  I guess it's a step forward
for -O2 but borderline for -O1.



2009-05-08 Thread bonzini at gcc dot gnu dot org


--- Comment #83 from bonzini at gnu dot org  2009-05-08 12:22 ---
Subject: Bug 33928

Author: bonzini
Date: Fri May  8 12:22:30 2009
New Revision: 147282

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147282
Log:
2009-05-08  Paolo Bonzini  <bonz...@gnu.org>

PR rtl-optimization/33928
PR 26854
* fwprop.c (use_def_ref, get_def_for_use, bitmap_only_bit_bitween,
process_uses, build_single_def_use_links): New.
(update_df): Update use_def_ref.
(forward_propagate_into): Use get_def_for_use instead of use-def
chains.
(fwprop_init): Call build_single_def_use_links and let it initialize
dataflow.
(fwprop_done): Free use_def_ref.
(fwprop_addr): Eliminate duplicate call to df_set_flags.
* df-problems.c (df_rd_simulate_artificial_defs_at_top, 
df_rd_simulate_one_insn): New.
(df_rd_bb_local_compute_process_def): Update head comment.
(df_chain_create_bb): Use the new RD simulation functions.
* df.h (df_rd_simulate_artificial_defs_at_top, 
df_rd_simulate_one_insn): New.
* opts.c (decode_options): Enable fwprop at -O1.
* doc/invoke.texi (-fforward-propagate): Document this.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/df-problems.c
trunk/gcc/df.h
trunk/gcc/doc/invoke.texi
trunk/gcc/fwprop.c
trunk/gcc/opts.c



2009-05-07 Thread bonzini at gnu dot org


--- Comment #67 from bonzini at gnu dot org  2009-05-07 13:40 ---
I'm thinking of enabling -frename-registers on x86; since x86 does not enable
the first scheduling pass, the live ranges will be shorter and the register
allocator may reuse the same register over and over, leaving no freedom to
schedule-insns2.

This would leave only the bug with RTL loop invariant motion.

Brad, you are the one who's regularly producing insane testcases, can you
measure the slowdown from -O1 to -O1 -frename-registers?  It is a local pass,
so it should not be that much, but I'd rather check before (I'll check on a
bootstrap instead).



2009-05-07 Thread lucier at math dot purdue dot edu


--- Comment #71 from lucier at math dot purdue dot edu  2009-05-07 16:02 
---
Created an attachment (id=17820)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17820&action=view)
time for 31957, with rename-registers



2009-05-07 Thread bonzini at gnu dot org


--- Comment #74 from bonzini at gnu dot org  2009-05-07 16:21 ---
Ok.  One step at a time. :-)  To recap, here is the situation:

- the CSE optimization you mention was *not* removed, it was moved to fwprop,
so it does not run at -O1.

- once this was done, the way to go is to tune new optimizations, not to
reintroduce old ones

- for example, fwprop in turn triggered a bad choice in loop invariant motion,
for which a patch has been posted.  This patch will remove the need for
-fno-move-loop-invariants on this testcase (this is a deficiency in LIM that is
not specific to machine-generated code; OTOH the presence of many fp[N]
accesses helps trigger it).

- that scheduling is necessary now and not in 4.2.x, probably is just a matter
of luck

- why renaming registers is necessary now and not in 4.2.x is still a mystery;
but, there is an explanation as to why it helps (it prolongs live ranges,
something that on non-x86 archs is done by the pre-regalloc scheduling)

- at least we have a set of options providing good performance on this
testcase, and guidance towards better tuning of the various problematic
optimizations

To conclude, nobody is underestimating the significance of this PR, it's just a
matter of priorities.  Near the end of the release cycle, you tend to look at
PRs with small testcases to minimize the time spent understanding the code;
near the beginning, you hope that new features magically fix the PRs and
concentrate on wrong-code bugs and so on.  Complex P2s such as this one
unfortunately tend to stay in a limbo.



2009-05-07 Thread lucier at math dot purdue dot edu


--- Comment #75 from lucier at math dot purdue dot edu  2009-05-07 16:31 
---
Subject: Re:  [4.3/4.4/4.5 Regression] 30% performance slowdown in
floating-point code caused by  r118475


On May 7, 2009, at 12:21 PM, bonzini at gnu dot org wrote:

> --- Comment #74 from bonzini at gnu dot org  2009-05-07 16:21 ---
> Ok.  One step at a time. :-)  To recap, here is the situation:
>
> - that scheduling is necessary now and not in 4.2.x, probably is just a
> matter of luck

If you mean -fschedule-insns2, it has always been part of the options list.

> - at least we have a set of options providing good performance on this
> testcase, and guidance towards better tuning of the various problematic
> optimizations

OK, but -fforward-propagate is not viable in general for these
machine-generated codes.


Brad



2009-05-07 Thread bonzini at gnu dot org


--- Comment #76 from bonzini at gnu dot org  2009-05-07 16:37 ---
It should be possible to modify fwprop to avoid excessive memory usage (doing
its own dataflow, basically, instead of using UD chains)



2009-05-07 Thread steven at gcc dot gnu dot org


--- Comment #77 from steven at gcc dot gnu dot org  2009-05-07 17:50 ---
Re. comment #75: Just the fact that an option is enabled in both releases
doesn't mean the pass behind it is doing the same thing in both releases. What
the scheduler does, depends heavily on the code you feed it.  Sometimes it is
pure (good or bad) luck that changes the behavior of a pass in the compiler.
The interactions between all the pieces are just very complicated (which is
why, IMHO, retargetable-compiler engineering is so difficult: controlling the
pipeline is not doable).

Re. comment #76:
Sad as it may be, I think this is the best short-term solution.
Alternatively we could re-work fwprop to work on regions and use the
partial-CFG dataflow stuff, similar to what the RTL loop optimizers (like
loop-invariant) do.  To be honest, I'd much prefer the latter, but the
DIY-fwprop thing is probably easier in the short term.



2009-05-06 Thread jakub at gcc dot gnu dot org


--- Comment #61 from jakub at gcc dot gnu dot org  2009-05-06 13:05 ---
Also see PR39871, maybe that's related (though on ARM).



2009-05-06 Thread bonzini at gnu dot org


--- Comment #62 from bonzini at gnu dot org  2009-05-06 15:07 ---
No, totally unrelated to PR39871



2009-05-06 Thread lucier at math dot purdue dot edu


--- Comment #63 from lucier at math dot purdue dot edu  2009-05-06 19:57 
---
Was the patch in comment 55 meant for me to bootstrap and test with today's
mainline?  It crashes at the gcc_assert at

/* Subroutine of canon_reg.  Pass *XLOC through canon_reg, and validate
   the result if necessary.  INSN is as for canon_reg.  */

static void
validate_canon_reg (rtx *xloc, rtx insn)
{
  if (*xloc)
    {
      rtx new_rtx = canon_reg (*xloc, insn);

      /* If replacing pseudo with hard reg or vice versa, ensure the
         insn remains valid.  Likewise if the insn has MATCH_DUPs.  */
      gcc_assert (insn && new_rtx);
      validate_change (insn, xloc, new_rtx, 1);
    }
}

when building libgcc:

/tmp/lucier/gcc/objdirs/mainline/./gcc/xgcc
-B/tmp/lucier/gcc/objdirs/mainline/./gcc/
-B/pkgs/gcc-mainline/x86_64-unknown-linux-gnu/bin/
-B/pkgs/gcc-mainline/x86_64-unknown-linux-gnu/lib/ -isystem
/pkgs/gcc-mainline/x86_64-unknown-linux-gnu/include -isystem
/pkgs/gcc-mainline/x86_64-unknown-linux-gnu/sys-include -g -O2 -m32 -O2  -g -O2
-DIN_GCC   -W -Wall -Wwrite-strings -Wstrict-prototypes -Wmissing-prototypes
-Wcast-qual -Wold-style-definition  -isystem ./include  -fPIC -g
-DHAVE_GTHR_DEFAULT -DIN_LIBGCC2 -D__GCC_FLOAT_NOT_NEEDED   -I. -I.
-I../../.././gcc -I../../../../../mainline/libgcc
-I../../../../../mainline/libgcc/. -I../../../../../mainline/libgcc/../gcc
-I../../../../../mainline/libgcc/../include
-I../../../../../mainline/libgcc/config/libbid -DENABLE_DECIMAL_BID_FORMAT
-DHAVE_CC_TLS -DUSE_TLS -o _moddi3.o -MT _moddi3.o -MD -MP -MF _moddi3.dep
-DL_moddi3 -c ../../../../../mainline/libgcc/../gcc/libgcc2.c \
  -fexceptions -fnon-call-exceptions -fvisibility=hidden -DHIDE_EXPORTS
../../../../../mainline/libgcc/../gcc/libgcc2.c: In function '__moddi3':
../../../../../mainline/libgcc/../gcc/libgcc2.c:1121: internal compiler error:
in validate_canon_reg, at cse.c:2730






2009-05-06 Thread lucier at math dot purdue dot edu


--- Comment #64 from lucier at math dot purdue dot edu  2009-05-06 20:43 ---
In answer to comment 60, here's the command line where I added
-fforward-propagate -fno-move-loop-invariants:

/pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused
-O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing
-fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -fforward-propagate
-fno-move-loop-invariants -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY
-D___GAMBCDIR="\"/usr/local/Gambit-C/v4.1.2\"" -D___SYS_TYPE_CPU="\"x86_64\""
-D___SYS_TYPE_VENDOR="\"unknown\"" -D___SYS_TYPE_OS="\"linux-gnu\"" -c _num.c

here's the compiler:

/pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: /tmp/lucier/gcc/mainline/configure --enable-checking=release
--prefix=/pkgs/gcc-mainline --enable-languages=c
Thread model: posix
gcc version 4.5.0 20090506 (experimental) [trunk revision 147199] (GCC) 

and the runtime didn't change (substantially)

132 ms cpu time (132 user, 0 system)

and the loop looks pretty much just as bad (it's 117 instructions long, by my
count):

.L2752:
movq    %rcx, %rdx
addq    8(%rax), %rdx
leaq    4(%rcx), %rdi
movq    %rdx, -8(%rax)
leaq    4(%rdx), %rbx
addq    8(%rax), %rdx
movq    %rbx, -16(%rax)
movq    %rdx, -24(%rax)
leaq    4(%rdx), %rbx
addq    8(%rax), %rdx
movq    %rbx, -32(%rax)
movq    %rdx, -40(%rax)
leaq    4(%rdx), %rbx
movq    40(%rax), %rdx
movq    %rbx, -48(%rax)
movsd   7(%rdx,%rbx,2), %xmm9
movq    -40(%rax), %rbx
leaq    7(%rdx,%rcx,2), %r8
addq    $8, %rcx
movsd   (%r8), %xmm4
cmpq    %rcx, %r13
movsd   7(%rdx,%rbx,2), %xmm11
movq    -32(%rax), %rbx
movsd   7(%rdx,%rbx,2), %xmm5
movq    -24(%rax), %rbx
movsd   7(%rdx,%rbx,2), %xmm7
movq    -16(%rax), %rbx
movsd   7(%rdx,%rbx,2), %xmm14
movq    -8(%rax), %rbx
movsd   7(%rdx,%rbx,2), %xmm6
leaq    (%rdi,%rdi), %rbx
movsd   7(%rbx,%rdx), %xmm8
movq    24(%rax), %rdx
movapd  %xmm6, %xmm13
movsd   15(%rdx), %xmm1
movsd   7(%rdx), %xmm2
movapd  %xmm1, %xmm10
movsd   31(%rdx), %xmm3
movapd  %xmm2, %xmm12
mulsd   %xmm11, %xmm10
mulsd   %xmm9, %xmm12
mulsd   %xmm2, %xmm11
mulsd   %xmm1, %xmm9
movsd   23(%rdx), %xmm0
addsd   %xmm12, %xmm10
movapd  %xmm2, %xmm12
mulsd   %xmm7, %xmm2
subsd   %xmm9, %xmm11
movapd  %xmm1, %xmm9
mulsd   %xmm5, %xmm12
mulsd   %xmm5, %xmm1
movapd  %xmm8, %xmm5
mulsd   %xmm7, %xmm9
movapd  %xmm4, %xmm7
subsd   %xmm11, %xmm13
addsd   %xmm6, %xmm11
movsd   .LC5(%rip), %xmm6
subsd   %xmm1, %xmm2
movapd  %xmm0, %xmm1
addsd   %xmm12, %xmm9
movapd  %xmm14, %xmm12
xorpd   %xmm3, %xmm6
subsd   %xmm10, %xmm12
mulsd   %xmm13, %xmm1
subsd   %xmm2, %xmm7
addsd   %xmm4, %xmm2
movapd  %xmm6, %xmm4
addsd   %xmm14, %xmm10
mulsd   %xmm13, %xmm6
mulsd   %xmm12, %xmm4
subsd   %xmm9, %xmm5
mulsd   %xmm0, %xmm12
addsd   %xmm8, %xmm9
movapd  %xmm0, %xmm8
mulsd   %xmm11, %xmm0
addsd   %xmm1, %xmm4
movapd  %xmm3, %xmm1
mulsd   %xmm10, %xmm3
subsd   %xmm12, %xmm6
mulsd   %xmm11, %xmm1
mulsd   %xmm10, %xmm8
subsd   %xmm3, %xmm0
addsd   %xmm1, %xmm8
movapd  %xmm2, %xmm1
addsd   %xmm0, %xmm1
subsd   %xmm0, %xmm2
movapd  %xmm7, %xmm0
subsd   %xmm6, %xmm7
addsd   %xmm6, %xmm0
movsd   %xmm1, (%r8)
movapd  %xmm9, %xmm1
movq    40(%rax), %rdx
subsd   %xmm8, %xmm9
addsd   %xmm8, %xmm1
movsd   %xmm1, 7(%rbx,%rdx)
movq    -8(%rax), %rbx
movq    40(%rax), %rdx
movsd   %xmm2, 7(%rdx,%rbx,2)
movq    -16(%rax), %rbx
movq    40(%rax), %rdx
movsd   %xmm9, 7(%rdx,%rbx,2)
movq    -24(%rax), %rbx
movq    40(%rax), %rdx
movsd   %xmm0, 7(%rdx,%rbx,2)
movapd  %xmm5, %xmm0
movq    -32(%rax), %rbx
movq    40(%rax), %rdx
subsd   %xmm4, %xmm5
addsd   %xmm4, %xmm0
movsd   %xmm0, 7(%rdx,%rbx,2)
movq    -40(%rax), %rbx
movq    40(%rax), %rdx
movsd   %xmm7, 7(%rdx,%rbx,2)
movq    -48(%rax), %rbx
movq    40(%rax), %rdx
movsd   %xmm5, 7(%rdx,%rbx,2)
jg      .L2752
movq    %rdi, %r13
.L2751:



2009-05-06 Thread bonzini at gnu dot org


--- Comment #65 from bonzini at gnu dot org  2009-05-07 05:03 ---
Subject: Re:  [4.3/4.4/4.5 Regression] 30% performance
 slowdown in floating-point code caused by  r118475

lucier at math dot purdue dot edu wrote:
> --- Comment #64 from lucier at math dot purdue dot edu  2009-05-06 20:43 ---
> In answer to comment 60, here's the command line where I added
> -fforward-propagate -fno-move-loop-invariants:

Hmm, can you try adding -frename-registers *or* -fweb too (one at a
time; together they give no extra benefit)?

> and the loop looks pretty much just as bad (it's 117 instructions long, by my
> count):

116 actually: the movq here is outside the loop (that's how I made all
the instruction counts).

> movsd   %xmm5, 7(%rdx,%rbx,2)
> jg      .L2752
> movq    %rdi, %r13
> .L2751:
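
The counting rule used here (everything from the loop label through the
jump back to it, and nothing after that jump) can be sketched in Python.
This is only an illustration: count_loop_instructions is a hypothetical
helper, not part of GCC or of anything posted in this bug, and it assumes
one instruction per line with the loop closed by a single backward jump.

```python
def count_loop_instructions(asm: str, label: str) -> int:
    """Count instructions from `label:` through the first jump back to it.

    Assumption: one instruction per line; labels end with ':' and
    assembler directives start with '.'; neither is counted.
    """
    count = 0
    inside = False
    for raw in asm.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == f"{label}:":
            inside = True          # start counting after the loop label
            continue
        if not inside:
            continue
        if line.endswith(":") or line.startswith("."):
            continue               # other labels / directives
        count += 1
        parts = line.split()
        if parts[0].startswith("j") and parts[-1] == label:
            break                  # backward jump closes the loop
    return count


# Tail of the loop quoted above: the movq after the jg is not counted.
sample = """
.L2752:
movsd   %xmm5, 7(%rdx,%rbx,2)
jg      .L2752
movq    %rdi, %r13
.L2751:
"""
print(count_loop_instructions(sample, ".L2752"))  # prints 2
```

Counting every non-label line in that fragment would give 3; stopping at
the jg gives 2, which is the same off-by-one as 117 versus 116.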






2009-05-06 Thread lucier at math dot purdue dot edu


--- Comment #66 from lucier at math dot purdue dot edu  2009-05-07 05:27 ---
Adding -frename-registers gives a significant speedup (sometimes as fast as
4.1.2 on this shared machine, i.e., it sometimes hits 108 ms instead of
132-140 ms). The command line with -fforward-propagate
-fno-move-loop-invariants -frename-registers is

/pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused
-O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing
-fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -fforward-propagate
-fno-move-loop-invariants -frename-registers -DHAVE_CONFIG_H -D___PRIMAL
-D___LIBRARY -D___GAMBCDIR="\"/usr/local/Gambit-C/v4.1.2\""
-D___SYS_TYPE_CPU="\"x86_64\"" -D___SYS_TYPE_VENDOR="\"unknown\""
-D___SYS_TYPE_OS="\"linux-gnu\"" -c _num.c

and the loop is

.L2752:
movq    %rcx, %r12
addq    8(%rax), %r12
leaq    4(%rcx), %rdi
movq    %r12, -8(%rax)
leaq    4(%r12), %r8
addq    8(%rax), %r12
movq    %r8, -16(%rax)
movq    -8(%rax), %r8
movq    -16(%rax), %rdx
movq    %r12, -24(%rax)
leaq    4(%r12), %rbx
addq    8(%rax), %r12
movq    -24(%rax), %r9
movq    %rbx, -32(%rax)
movq    24(%rax), %rbx
movq    -32(%rax), %r10
leaq    4(%r12), %r11
movq    %r12, -40(%rax)
movq    40(%rax), %r12
movq    -40(%rax), %r14
movq    %r11, -48(%rax)
movsd   15(%rbx), %xmm1
movsd   7(%rbx), %xmm2
movsd   7(%r12,%r11,2), %xmm9
movapd  %xmm1, %xmm3
movsd   7(%r12,%r14,2), %xmm11
leaq    7(%r12,%rcx,2), %r11
movapd  %xmm2, %xmm10
leaq    (%rdi,%rdi), %r14
mulsd   %xmm11, %xmm3
movapd  %xmm2, %xmm12
mulsd   %xmm9, %xmm10
addq    $8, %rcx
mulsd   %xmm1, %xmm9
cmpq    %rcx, %r13
mulsd   %xmm2, %xmm11
movsd   7(%r12,%r10,2), %xmm5
movsd   7(%r12,%r9,2), %xmm7
addsd   %xmm10, %xmm3
movsd   7(%r12,%r8,2), %xmm6
subsd   %xmm9, %xmm11
mulsd   %xmm7, %xmm2
movapd  %xmm1, %xmm9
mulsd   %xmm5, %xmm1
movapd  %xmm6, %xmm13
movsd   7(%r12,%rdx,2), %xmm14
mulsd   %xmm5, %xmm12
mulsd   %xmm7, %xmm9
subsd   %xmm11, %xmm13
movsd   31(%rbx), %xmm0
addsd   %xmm6, %xmm11
movsd   .LC5(%rip), %xmm6
subsd   %xmm1, %xmm2
movsd   (%r11), %xmm4
movapd  %xmm14, %xmm10
xorpd   %xmm0, %xmm6
addsd   %xmm12, %xmm9
movsd   7(%r14,%r12), %xmm8
subsd   %xmm3, %xmm10
movapd  %xmm4, %xmm7
addsd   %xmm14, %xmm3
movsd   23(%rbx), %xmm15
subsd   %xmm2, %xmm7
movapd  %xmm8, %xmm5
addsd   %xmm4, %xmm2
movapd  %xmm6, %xmm4
subsd   %xmm9, %xmm5
movapd  %xmm15, %xmm14
addsd   %xmm8, %xmm9
mulsd   %xmm10, %xmm4
movapd  %xmm15, %xmm8
mulsd   %xmm15, %xmm10
movapd  %xmm0, %xmm12
mulsd   %xmm11, %xmm15
mulsd   %xmm3, %xmm0
movapd  %xmm7, %xmm1
mulsd   %xmm13, %xmm6
mulsd   %xmm3, %xmm8
movapd  %xmm9, %xmm3
mulsd   %xmm11, %xmm12
subsd   %xmm0, %xmm15
mulsd   %xmm13, %xmm14
subsd   %xmm10, %xmm6
movapd  %xmm2, %xmm10
movapd  %xmm5, %xmm0
addsd   %xmm12, %xmm8
addsd   %xmm15, %xmm10
subsd   %xmm15, %xmm2
addsd   %xmm14, %xmm4
addsd   %xmm8, %xmm3
movsd   %xmm10, (%r11)
movq    40(%rax), %r10
subsd   %xmm8, %xmm9
addsd   %xmm6, %xmm1
addsd   %xmm4, %xmm0
movsd   %xmm3, 7(%r14,%r10)
movq    -8(%rax), %r9
movq    40(%rax), %rdx
subsd   %xmm6, %xmm7
subsd   %xmm4, %xmm5
movsd   %xmm2, 7(%rdx,%r9,2)
movq    -16(%rax), %r8
movq    40(%rax), %r12
movsd   %xmm9, 7(%r12,%r8,2)
movq    -24(%rax), %rbx
movq    40(%rax), %r11
movsd   %xmm1, 7(%r11,%rbx,2)
movq    -32(%rax), %r14
movq    40(%rax), %r10
movsd   %xmm0, 7(%r10,%r14,2)
movq    -40(%rax), %r9
movq    40(%rax), %rdx
movsd   %xmm7, 7(%rdx,%r9,2)
movq    -48(%rax), %r8
movq    40(%rax), %r12
movsd   %xmm5, 7(%r12,%r8,2)
jg      .L2752
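
As a rough check on the timings quoted earlier in this comment (108 ms
with -frename-registers against the usual 132-140 ms), the relative
slowdown can be worked out as below. This is a back-of-the-envelope
sketch: slowdown_percent is a hypothetical helper, and it assumes the
runs are directly comparable, which is uncertain on a shared machine.

```python
def slowdown_percent(slow_ms: float, fast_ms: float) -> float:
    """How much longer the slow run takes, as a percentage of the fast run."""
    return (slow_ms / fast_ms - 1.0) * 100.0


fast = 108  # ms, best time seen with -frename-registers
for slow in (132, 140):  # ms, range seen without it
    print(f"{slow} ms vs {fast} ms: {slowdown_percent(slow, fast):.1f}% slower")
# 132 ms vs 108 ms: 22.2% slower
# 140 ms vs 108 ms: 29.6% slower
```

The upper end of that range is in line with the roughly 30% slowdown
named in the bug title.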

Adding -fforward-propagate -fno-move-loop-invariants -fweb instead of
-fforward-propagate -fno-move-loop-invariants -frename-registers, so the
compile line is

/pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused
-O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing
-fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -fforward-propagate
-fno-move-loop-invariants -fweb -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY
-D___GAMBCDIR="\"/usr/local/Gambit-C/v4.1.2\"" -D___SYS_TYPE_CPU="\"x86_64\""