[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2011-09-26 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

Richard Guenther rguenth at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution||FIXED

--- Comment #27 from Richard Guenther rguenth at gcc dot gnu.org 2011-09-26 
10:16:20 UTC ---
Yes, I think I analyzed the reason for this at some point (IPA profile) and
fixed it.


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2011-09-22 Thread dominiq at lps dot ens.fr
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #26 from Dominique d'Humieres dominiq at lps dot ens.fr 
2011-09-22 15:25:48 UTC ---
AFAICT this PR has been fixed for some time. Here are the results I get on
x86_64-apple-darwin10 (Core2Duo 2.53 GHz, 3 MB cache, 4 GB RAM) at revision
179079:

Compile options : -fprotect-parens -Ofast -funroll-loops -fwhole-program

                   without -flto                   with -flto

Benchmark   Compile  Executable  Ave Run    Compile  Executable  Ave Run
Name         (secs)     (bytes)   (secs)     (secs)     (bytes)   (secs)
---------   -------  ----------  -------    -------  ----------  -------
ac             3.28       54936     8.81       6.64       54968     8.81
aermod        75.46     1184280    18.65     131.50     1212648    18.20
air           11.24      106336     7.26      22.38      106904     7.39
capacita       3.87       77152    41.29       7.36       77200    41.31
channel        1.25       34744     3.03       2.39       34864     3.03
doduc         12.40      200016    28.02      22.47      200496    27.69
fatigue        4.06       77400     4.83       8.17       77488     4.84
gas_dyn        9.32      119256     4.92      16.64      119816     4.92
induct         7.37      148840    13.83      14.76      153224    13.84
linpk          0.70       26024    21.64       1.93       26064    21.64
mdbx           3.77       80864    12.46       7.21       81040    12.46
nf             4.08       71848    19.34       8.07       71896    19.35
protein       15.17      131304    35.30      26.05      127224    35.48
rnflow        12.58      130888    28.25      23.76      131000    26.92
test_fpu       4.78       92968    10.63      13.35       93024    10.64
tfft           0.74       22352     3.28       1.98       22432     3.28

Geometric Mean Execution Time = 12.23 secs (without -flto), 12.18 secs (with -flto)

Compile options : -fprotect-parens -Ofast -funroll-loops -ftree-loop-linear 
-fomit-frame-pointer --param max-inline-insns-auto=200 -fwhole-program

                   without -flto                   with -flto

Benchmark   Compile  Executable  Ave Run    Compile  Executable  Ave Run
Name         (secs)     (bytes)   (secs)     (secs)     (bytes)   (secs)
---------   -------  ----------  -------    -------  ----------  -------
ac             4.05       54904     8.11       8.18       54920     8.11
aermod       101.55     1494688    18.17     169.63     1527120    18.12
air           14.46      114328     7.05      30.35      114912     7.04
capacita       5.39       97552    40.24      10.80       97584    40.21
channel        1.68       38792     2.91       3.17           3     2.91
doduc         12.98      208112    27.47      25.77      208584    27.52
fatigue        4.84       81440     2.95      10.27       81504     2.93
gas_dyn       13.55      143776     4.86      25.03      144392     4.86
induct        12.95      189872    13.78      24.32      190176    13.96
linpk          0.73       21856    21.69       2.44       21888    21.69
mdbx           4.32       84928    12.45       9.39       85104    12.54
nf             7.41       92248    18.93      17.82       92272    18.91
protein       17.26      160040    35.51      31.08      155984    35.47
rnflow        15.16      138880    28.27      27.28      139040    26.85
test_fpu       5.05       92872    10.65      14.65       92928    10.65
tfft           0.75       22352     3.28       1.72       22432     3.28

Geometric Mean Execution Time = 11.67 secs (without -flto), 11.64 secs (with -flto)

The option -flto improves the run time for rnflow.f90 by ~5% without slowdown
for the other tests. Could these results be checked on other platforms and this
PR closed if they agree with mine?
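
For anyone who wants to re-check the summary figures, here is a minimal Python
sketch that recomputes the geometric means and the rnflow change from the "Ave
Run" columns of the first table (-fprotect-parens -Ofast -funroll-loops
-fwhole-program). The helper names geometric_mean and relative_change are
illustrative only; they are not part of pbharness or GCC.

```python
import math

# Average run times in seconds, copied from the first table above.
# Each entry is (without -flto, with -flto).
RUN_TIMES = {
    "ac": (8.81, 8.81),
    "aermod": (18.65, 18.20),
    "air": (7.26, 7.39),
    "capacita": (41.29, 41.31),
    "channel": (3.03, 3.03),
    "doduc": (28.02, 27.69),
    "fatigue": (4.83, 4.84),
    "gas_dyn": (4.92, 4.92),
    "induct": (13.83, 13.84),
    "linpk": (21.64, 21.64),
    "mdbx": (12.46, 12.46),
    "nf": (19.34, 19.35),
    "protein": (35.30, 35.48),
    "rnflow": (28.25, 26.92),
    "test_fpu": (10.63, 10.64),
    "tfft": (3.28, 3.28),
}


def geometric_mean(values):
    """Geometric mean, computed through logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


def relative_change(before, after):
    """Signed change of `after` relative to `before`, in percent."""
    return 100.0 * (after - before) / before


gm_without = geometric_mean(t[0] for t in RUN_TIMES.values())
gm_with = geometric_mean(t[1] for t in RUN_TIMES.values())
rnflow_change = relative_change(*RUN_TIMES["rnflow"])

if __name__ == "__main__":
    print(f"geometric mean without -flto: {gm_without:.2f} s")
    print(f"geometric mean with -flto:    {gm_with:.2f} s")
    print(f"rnflow change with -flto:     {rnflow_change:+.1f} %")
```

A negative rnflow_change means the -flto build runs faster.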


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2011-02-16 Thread dominiq at lps dot ens.fr
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #25 from Dominique d'Humieres dominiq at lps dot ens.fr 
2011-02-16 18:38:19 UTC ---
AFAICT the patch in http://gcc.gnu.org/ml/gcc-patches/2011-02/msg00973.html
seems to fix most of the fatigue.f90 problems:

At revision 170178 without the patch, I get

[macbook] lin/test% gfcp -Ofast fatigue.f90
[macbook] lin/test% time a.out > /dev/null
8.903u 0.005s 0:08.91 99.8% 0+0k 0+2io 0pf+0w
[macbook] lin/test% gfcp -Ofast -fwhole-program fatigue.f90
[macbook] lin/test% time a.out > /dev/null
6.392u 0.002s 0:06.39 100.0% 0+0k 0+0io 0pf+0w
[macbook] lin/test% gfcp -Ofast -finline-limit=322 -fwhole-program fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.653u 0.002s 0:04.65 100.0% 0+0k 0+1io 0pf+0w
[macbook] lin/test% gfcp -Ofast -finline-limit=322 -fwhole-program -flto fatigue.f90
[macbook] lin/test% time a.out > /dev/null
8.212u 0.004s 0:08.22 99.8% 0+0k 0+2io 0pf+0w
[macbook] lin/test% gfcp -Ofast -finline-limit=322 --param large-function-growth=132 -fwhole-program -flto fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.526u 0.004s 0:04.53 99.7% 0+0k 0+1io 0pf+0w

At revision 170212 with the patch, I get

[macbook] lin/test% gfc -Ofast fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.628u 0.002s 0:04.63 99.7% 0+0k 0+0io 0pf+0w
[macbook] lin/test% gfc -Ofast -fwhole-program fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.654u 0.002s 0:04.65 100.0% 0+0k 0+1io 0pf+0w
[macbook] lin/test% gfc -Ofast -finline-limit=322 -fwhole-program fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.657u 0.002s 0:04.66 99.7% 0+0k 0+1io 0pf+0w
[macbook] lin/test% gfc -Ofast -finline-limit=322 -fwhole-program -flto fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.715u 0.003s 0:04.72 99.7% 0+0k 0+1io 0pf+0w
[macbook] lin/test% gfc -Ofast -finline-limit=322 --param large-function-growth=132 -fwhole-program -flto fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.713u 0.003s 0:04.71 100.0% 0+0k 0+1io 0pf+0w
[macbook] lin/test% gfc -Ofast -finline-limit=322 --param large-function-growth=137 -fwhole-program -flto fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.524u 0.003s 0:04.52 100.0% 0+0k 0+1io 0pf+0w
[macbook] lin/test% gfc -Ofast --param large-function-growth=137 -fwhole-program -flto fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.564u 0.003s 0:04.57 99.7% 0+0k 0+1io 0pf+0w
[macbook] lin/test% gfc -Ofast --param large-function-growth=137 -fwhole-program fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.479u 0.003s 0:04.48 99.7% 0+0k 0+2io 0pf+0w

A quick check of the other tests does not show any obvious slowdown with the
patch.


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2011-01-24 Thread dominiq at lps dot ens.fr
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #21 from Dominique d'Humieres dominiq at lps dot ens.fr 
2011-01-24 09:29:00 UTC ---
I have regtested my working tree (with other patches) with the patch in comment
#15 and got 180 new failures (likely 90 each for -m32 and -m64, but I have not
checked that carefully).

Among them, 124 are of the kind "scan-tree-dump-times fre *: dump file does not
exist" and seem to be due to the extra pass producing fre1 and fre2. I can
adjust the tests for, say, fre2 and see what happens.

Then I see

FAIL: gcc.dg/ipa/ipa-pta-14.c scan-ipa-dump pta foo.result = { NULL a[^ ]* a[^
]* c[^ ]* }

FAIL: gcc.dg/matrix/matrix-1.c scan-ipa-dump-times matrix-reorg Flattened 3
dimensions 1
FAIL: gcc.dg/matrix/matrix-2.c scan-ipa-dump-times matrix-reorg Flattened 2
dimensions 1
FAIL: gcc.dg/matrix/matrix-3.c scan-ipa-dump-times matrix-reorg Flattened 2
dimensions 1
FAIL: gcc.dg/matrix/matrix-6.c scan-ipa-dump-times matrix-reorg Flattened 2
dimensions 1
FAIL: gcc.dg/matrix/transpose-1.c scan-ipa-dump-times matrix-reorg Flattened 3
dimensions 1
FAIL: gcc.dg/matrix/transpose-1.c scan-ipa-dump-times matrix-reorg Transposed
3
FAIL: gcc.dg/matrix/transpose-2.c scan-ipa-dump-times matrix-reorg Flattened 3
dimensions 1
FAIL: gcc.dg/matrix/transpose-3.c scan-ipa-dump-times matrix-reorg Flattened 2
dimensions 1
FAIL: gcc.dg/matrix/transpose-3.c scan-ipa-dump-times matrix-reorg Transposed
2
FAIL: gcc.dg/matrix/transpose-4.c scan-ipa-dump-times matrix-reorg Flattened 3
dimensions 1
FAIL: gcc.dg/matrix/transpose-4.c scan-ipa-dump-times matrix-reorg Transposed
2
FAIL: gcc.dg/matrix/transpose-5.c scan-ipa-dump-times matrix-reorg Flattened 3
dimensions 1
FAIL: gcc.dg/matrix/transpose-6.c scan-ipa-dump-times matrix-reorg Flattened 3
dimensions 1

FAIL: gcc.dg/torture/pta-structcopy-1.c  -O2  scan-tree-dump alias points-to
vars: { i }
FAIL: gcc.dg/torture/pta-structcopy-1.c  -O3 -fomit-frame-pointer 
scan-tree-dump alias points-to vars: { i }
FAIL: gcc.dg/torture/pta-structcopy-1.c  -O3 -g  scan-tree-dump alias
points-to vars: { i }
FAIL: gcc.dg/torture/pta-structcopy-1.c  -Os  scan-tree-dump alias points-to
vars: { i }
FAIL: gcc.dg/torture/pta-structcopy-1.c  -O2 -flto -flto-partition=none 
scan-tree-dump alias points-to vars: { i }
FAIL: gcc.dg/torture/pta-structcopy-1.c  -O2 -flto  scan-tree-dump alias
points-to vars: { i }

FAIL: gcc.dg/tree-ssa/pta-ptrarith-1.c scan-tree-dump ealias q_., points-to
vars: { k }
FAIL: gcc.dg/tree-ssa/sra-9.c scan-tree-dump-times optimized = s.b 0
FAIL: gcc.dg/tree-ssa/ssa-dce-4.c scan-tree-dump-times cddce1 a\[[^

FAIL: gcc.dg/tree-ssa/stdarg-2.c scan-tree-dump stdarg f6: va_list escapes 0,
needs to save (3|12|24) GPR units
FAIL: gcc.dg/tree-ssa/stdarg-2.c scan-tree-dump stdarg f11: va_list escapes 0,
needs to save (3|12|24) GPR units
FAIL: gcc.dg/tree-ssa/stdarg-2.c scan-tree-dump stdarg f12: va_list escapes 0,
needs to save [1-9][0-9]* GPR units
FAIL: gcc.dg/tree-ssa/stdarg-2.c scan-tree-dump stdarg f13: va_list escapes 0,
needs to save [1-9][0-9]* GPR units
FAIL: gcc.dg/tree-ssa/stdarg-2.c scan-tree-dump stdarg f14: va_list escapes 0,
needs to save [1-9][0-9]* GPR units

FAIL: g++.dg/ipa/iinline-1.C scan-ipa-dump inline String::funcOne[^\n]*inline
copy in int main
FAIL: g++.dg/ipa/iinline-2.C scan-ipa-dump inline String::funcOne[^\n]*inline
copy in int main

So far I have only looked at gcc.dg/ipa/ipa-pta-14.c, for which grepping
foo.result yields

p_1 = foo.result
foo.result = foo.arg1
Equivalence classes for Direct node node id 15:foo.result are pointer: 8,
location:0
Unifying foo.result to foo.arg0
foo.result = { a.0+32 } same as foo.arg0

instead of

p_1 = foo.result
foo.result = D.2736_3
Equivalence classes for Direct node node id 15:foo.result are pointer: 13,
location:0
Unifying foo.result to p_1
foo.result = { NULL a.0+32 a.64+64 c.0+32 } same as p_1

Is it a missed optimization or wrong-code?


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2011-01-24 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #22 from Richard Guenther rguenth at gcc dot gnu.org 2011-01-24 
14:07:14 UTC ---
(In reply to comment #15)
> Enabling early FRE
> Index: passes.c
> ===
> --- passes.c (revision 169136)
> +++ passes.c (working copy)
> @@ -760,6 +760,7 @@
>    NEXT_PASS (pass_remove_cgraph_callee_edges);
>    NEXT_PASS (pass_rename_ssa_copies);
>    NEXT_PASS (pass_ccp);
> +  NEXT_PASS (pass_fre);
>    NEXT_PASS (pass_forwprop);
>    /* pass_build_ealias is a dummy pass that ensures that we
>       execute TODO_rebuild_alias at this point.  Re-building
> @@ -782,7 +783,7 @@
>
> reduces perdida size estimate to 694 (so by about 30%) and hookes law to 141
> (by 11%). Not enough to make inlining happen, still.

That FRE pass should be after pass_sra_early (certainly after
pass_build_ealias).


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2011-01-24 Thread howarth at nitro dot med.uc.edu
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

Jack Howarth howarth at nitro dot med.uc.edu changed:

   What|Removed |Added

 CC||howarth at nitro dot
   ||med.uc.edu

--- Comment #23 from Jack Howarth howarth at nitro dot med.uc.edu 2011-01-24 
17:58:00 UTC ---
(In reply to comment #22)

> That FRE pass should be after pass_sra_early (certainly after
> pass_build_ealias).

Index: gcc/passes.c
===
--- gcc/passes.c(revision 169145)
+++ gcc/passes.c(working copy)
@@ -767,6 +767,7 @@ init_optimization_passes (void)
  locals into SSA form if possible.  */
   NEXT_PASS (pass_build_ealias);
   NEXT_PASS (pass_sra_early);
+  NEXT_PASS (pass_fre);
   NEXT_PASS (pass_copy_prop);
   NEXT_PASS (pass_merge_phi);
   NEXT_PASS (pass_cd_dce);

gives Elapsed CPU time  = 8.43600E+00 for

gfortran -O3 -ffast-math -funroll-loops -flto -fwhole-program fatigue.f90 -o
fatigue

and Elapsed CPU time  = 4.16600E+00 for

gfortran -O3 -ffast-math -funroll-loops -finline-limit=250 --param
large-function-growth=250 -flto -fwhole-program fatigue.f90 -o fatigue


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2011-01-24 Thread dominiq at lps dot ens.fr
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #24 from Dominique d'Humieres dominiq at lps dot ens.fr 
2011-01-24 18:16:47 UTC ---
(In reply to comment #22)
> That FRE pass should be after pass_sra_early (certainly after
> pass_build_ealias).

Moving pass_fre after pass_sra_early does not fix the failures in the test
suite reported in comment #21.


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2011-01-23 Thread hubicka at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

Jan Hubicka hubicka at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2011.01.23 15:59:30
 Ever Confirmed|0   |1

--- Comment #12 from Jan Hubicka hubicka at gcc dot gnu.org 2011-01-23 
15:59:30 UTC ---
Reproduces for me.

perdida is a function called once. With default settings, perdida is not
considered an inline candidate for small-function inlining (it is estimated
at over 700 instructions, so it is huge).

Later we try to inline it as a function called once, but hit the
large-function-growth limit. Compiling with --param large-function-growth=100
solves that problem, but it does not make the testcase faster.
So the problem is elsewhere.


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2011-01-23 Thread hubicka at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #13 from Jan Hubicka hubicka at gcc dot gnu.org 2011-01-23 
16:45:23 UTC ---
OK, the slowdown goes away when both hookes_law and perdida are inlined.
The first needs -finline-limit=380; the second needs large-function-growth=1000
(or a large increase of the inline limit, so that perdida is considered a small
function and inlined before iztaccihuatl grows that much).

Without large-function-growth we fail at:
Considering perdida size 1056.
 Called once from iztaccihuatl 6151 insns.
 Not inlining: --param large-function-growth limit reached.

This is because inlining of functions called once first processes read_input:
Considering read_input size 3099.
 Called once from iztaccihuatl 3128 insns.
 Inlined into iztaccihuatl which now has 6151 size for a net change of -76
size.

that makes it too large.

large-function-insns is 2700 and large-function-growth is 100%, so iztaccihuatl
can't grow past 3128*2 insns.

We might increase large-function-growth (I will give it a try on our
benchmarks), or we might convince the inliner to inline perdida first rather
than read_input, because perdida is smaller...

Honza
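
The arithmetic in this comment can be sketched in a few lines of Python. This
is only a simplified model of the limit as described above, not the actual
check in GCC's inliner (which estimates the post-inlining size more carefully);
the function names growth_limit and may_inline_called_once are made up for the
illustration. The sizes are the ones printed in the inliner dump.

```python
def growth_limit(original_size, growth_percent):
    """Largest size the caller may reach: its original size plus the
    percentage allowed by --param large-function-growth."""
    return original_size + original_size * growth_percent // 100


def may_inline_called_once(caller_size, callee_size, limit, large_function_insns):
    """Simplified model: reject only when the combined size is both
    'large' (above large-function-insns) and above the growth limit."""
    new_size = caller_size + callee_size
    return not (new_size > large_function_insns and new_size > limit)


LARGE_FUNCTION_INSNS = 2700      # --param large-function-insns
LARGE_FUNCTION_GROWTH = 100      # --param large-function-growth, in percent
ORIGINAL_SIZE = 3128             # iztaccihuatl before inlining read_input
SIZE_AFTER_READ_INPUT = 6151     # iztaccihuatl after inlining read_input
PERDIDA_SIZE = 1056

limit = growth_limit(ORIGINAL_SIZE, LARGE_FUNCTION_GROWTH)

# Default parameters, read_input already inlined: perdida no longer fits.
default_ok = may_inline_called_once(
    SIZE_AFTER_READ_INPUT, PERDIDA_SIZE, limit, LARGE_FUNCTION_INSNS)

# With large-function-growth=1000 the limit is far away and perdida fits.
raised_ok = may_inline_called_once(
    SIZE_AFTER_READ_INPUT, PERDIDA_SIZE,
    growth_limit(ORIGINAL_SIZE, 1000), LARGE_FUNCTION_INSNS)

# If perdida were considered before read_input, it would fit as well.
perdida_first_ok = may_inline_called_once(
    ORIGINAL_SIZE, PERDIDA_SIZE, limit, LARGE_FUNCTION_INSNS)

if __name__ == "__main__":
    print("growth limit:", limit)
    print("perdida after read_input, default params:", default_ok)
    print("perdida after read_input, growth=1000:", raised_ok)
    print("perdida before read_input, default params:", perdida_first_ok)
```

Under this model the order in which the called-once candidates are processed
decides which one is rejected, which is the point of the last suggestion above.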


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2011-01-23 Thread dominiq at lps dot ens.fr
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #14 from Dominique d'Humieres dominiq at lps dot ens.fr 
2011-01-23 17:04:07 UTC ---
After removing the comments, generalized_hookes_law reads

  function generalized_hookes_law (strain_tensor, lambda, mu) result (stress_tensor)
!
  real (kind = LONGreal), dimension(:,:), intent(in) :: strain_tensor
  real (kind = LONGreal), intent(in) :: lambda, mu
  real (kind = LONGreal), dimension(3,3) :: stress_tensor
  real (kind = LONGreal), dimension(6) :: generalized_strain_vector, &
                                          generalized_stress_vector
  real (kind = LONGreal), dimension(6,6) :: generalized_constitutive_tensor
  integer :: i
!
  generalized_constitutive_tensor(:,:) = 0.0_LONGreal
  generalized_constitutive_tensor(1,1) = lambda + 2.0_LONGreal * mu
  generalized_constitutive_tensor(1,2) = lambda
  generalized_constitutive_tensor(1,3) = lambda
  generalized_constitutive_tensor(2,1) = lambda
  generalized_constitutive_tensor(2,2) = lambda + 2.0_LONGreal * mu
  generalized_constitutive_tensor(2,3) = lambda
  generalized_constitutive_tensor(3,1) = lambda
  generalized_constitutive_tensor(3,2) = lambda
  generalized_constitutive_tensor(3,3) = lambda + 2.0_LONGreal * mu
  generalized_constitutive_tensor(4,4) = mu
  generalized_constitutive_tensor(5,5) = mu
  generalized_constitutive_tensor(6,6) = mu
!
  generalized_strain_vector(1) = strain_tensor(1,1)
  generalized_strain_vector(2) = strain_tensor(2,2)
  generalized_strain_vector(3) = strain_tensor(3,3)
  generalized_strain_vector(4) = strain_tensor(2,3)
  generalized_strain_vector(5) = strain_tensor(1,3)
  generalized_strain_vector(6) = strain_tensor(1,2)
!
  do i = 1, 6
      generalized_stress_vector(i) = &
          dot_product(generalized_constitutive_tensor(i,:), &
                      generalized_strain_vector(:))
  end do
!
  stress_tensor(1,1) = generalized_stress_vector(1)
  stress_tensor(2,2) = generalized_stress_vector(2)
  stress_tensor(3,3) = generalized_stress_vector(3)
  stress_tensor(2,3) = generalized_stress_vector(4)
  stress_tensor(1,3) = generalized_stress_vector(5)
  stress_tensor(1,2) = generalized_stress_vector(6)
  stress_tensor(3,2) = stress_tensor(2,3)
  stress_tensor(3,1) = stress_tensor(1,3)
  stress_tensor(2,1) = stress_tensor(1,2)
!
  end function generalized_hookes_law

Note that 24 of the 36 elements of generalized_constitutive_tensor are zero.
Using that, the function can be replaced with

  function generalized_hookes_law (strain_tensor, lambda, mu) result (stress_tensor)
!
  real (kind = LONGreal), dimension(:,:), intent(in) :: strain_tensor
  real (kind = LONGreal), intent(in) :: lambda, mu
  real (kind = LONGreal), dimension(3,3) :: stress_tensor
  real (kind = LONGreal) :: tmp
!
  stress_tensor(:,:) = mu * strain_tensor(:,:)
  tmp = lambda * (strain_tensor(1,1) + strain_tensor(2,2) + &
                  strain_tensor(3,3))
  stress_tensor(1,1) = tmp + 2.0_LONGreal * stress_tensor(1,1)
  stress_tensor(2,2) = tmp + 2.0_LONGreal * stress_tensor(2,2)
  stress_tensor(3,3) = tmp + 2.0_LONGreal * stress_tensor(3,3)
!
  end function generalized_hookes_law

end module perdida_m

which is inlined at -finline-limit=320.
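
As a sanity check of the rewrite, here is a small pure-Python sketch that
implements both formulations and compares them. It uses 0-based indices, and
the names constitutive_matrix, hookes_law_full and hookes_law_reduced are
illustrative only. One assumption needs stating: the two versions agree on the
off-diagonal terms only when strain_tensor is symmetric, because the original
reads strain(2,3), strain(1,3) and strain(1,2) and mirrors the result, while
the reduced version scales every element independently.

```python
def constitutive_matrix(lam, mu):
    """6x6 matrix filled the same way as generalized_constitutive_tensor."""
    c = [[0.0] * 6 for _ in range(6)]
    for i in range(3):
        for j in range(3):
            c[i][j] = lam
        c[i][i] = lam + 2.0 * mu
    for i in range(3, 6):
        c[i][i] = mu
    return c


def hookes_law_full(strain, lam, mu):
    """Original formulation: 6x6 matrix times the 6-element strain vector."""
    c = constitutive_matrix(lam, mu)
    e = [strain[0][0], strain[1][1], strain[2][2],
         strain[1][2], strain[0][2], strain[0][1]]
    s = [sum(c[i][k] * e[k] for k in range(6)) for i in range(6)]
    stress = [[0.0] * 3 for _ in range(3)]
    stress[0][0], stress[1][1], stress[2][2] = s[0], s[1], s[2]
    stress[1][2] = stress[2][1] = s[3]
    stress[0][2] = stress[2][0] = s[4]
    stress[0][1] = stress[1][0] = s[5]
    return stress


def hookes_law_reduced(strain, lam, mu):
    """Reduced formulation: scale by mu, then fix up the diagonal."""
    stress = [[mu * strain[i][j] for j in range(3)] for i in range(3)]
    tmp = lam * (strain[0][0] + strain[1][1] + strain[2][2])
    for i in range(3):
        stress[i][i] = tmp + 2.0 * stress[i][i]
    return stress


# Arbitrary symmetric test input (values chosen only for the illustration).
STRAIN = [[1.0, 0.2, 0.3],
          [0.2, 2.0, 0.4],
          [0.3, 0.4, 3.0]]
LAM, MU = 1.5, 0.75

zero_count = sum(1 for row in constitutive_matrix(LAM, MU) for v in row if v == 0.0)
full = hookes_law_full(STRAIN, LAM, MU)
reduced = hookes_law_reduced(STRAIN, LAM, MU)
max_diff = max(abs(full[i][j] - reduced[i][j]) for i in range(3) for j in range(3))

if __name__ == "__main__":
    print("zero entries in the 6x6 matrix:", zero_count)
    print("largest difference between the two versions:", max_diff)
```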


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2011-01-23 Thread hubicka at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #15 from Jan Hubicka hubicka at gcc dot gnu.org 2011-01-23 
17:56:31 UTC ---
Enabling early FRE
Index: passes.c
===
--- passes.c(revision 169136)
+++ passes.c(working copy)
@@ -760,6 +760,7 @@
  NEXT_PASS (pass_remove_cgraph_callee_edges);
  NEXT_PASS (pass_rename_ssa_copies);
  NEXT_PASS (pass_ccp);
+  NEXT_PASS (pass_fre);
  NEXT_PASS (pass_forwprop);
  /* pass_build_ealias is a dummy pass that ensures that we
 execute TODO_rebuild_alias at this point.  Re-building
@@ -782,7 +783,7 @@

reduces the perdida size estimate to 694 (so by about 30%) and hookes_law to
141 (by 11%). Still not enough to make inlining happen.


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2011-01-23 Thread hubicka at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #16 from Jan Hubicka hubicka at gcc dot gnu.org 2011-01-23 
17:57:58 UTC ---
Also, without inlining hookes_law but with inlining perdida (by using only the
large-function-growth parameter and the patch above), I get a 30% speedup,
not 50% as with inlining both. So it seems that, without early FRE, we miss
some optimization that is independent of inlining.


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2011-01-23 Thread dominiq at lps dot ens.fr
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #17 from Dominique d'Humieres dominiq at lps dot ens.fr 
2011-01-23 19:38:30 UTC ---
With the patch in comment #15 and -finline-limit=300, I get


Date & Time     : 23 Jan 2011 20:18:02
Test Name       : pbharness
Compile Command : gfcp %n.f90 -Ofast -funroll-loops -ftree-loop-linear
                  -fomit-frame-pointer -finline-limit=300 -fwhole-program -flto -o %n
Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct
                  linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   :  300.0
Target Error %  :  0.200
Minimum Repeats : 2
Maximum Repeats : 5

Benchmark   Compile  Executable  Ave Run   Number    Estim
Name         (secs)     (bytes)   (secs)  Repeats    Err %
---------   -------  ----------  -------  -------   ------
ac             3.55       54576     8.12        2   0.0062
aermod       103.51     1595448    18.87        2   0.0079
air            8.87       90048     6.89        2   0.0798
capacita       5.84       89056    40.27        2   0.0199
channel        1.62       34448     2.98        2   0.0168
doduc         14.30      203936    27.79        2   0.0162
fatigue        4.89       89264     4.74        2   0.0106
gas_dyn       11.72      148176     4.64        5   0.0535
induct        10.87      205976    14.00        2   0.0036
linpk          1.58       21536    21.71        2   0.0415
mdbx           5.60       84752    12.56        2   0.1871
nf             7.24       83712    29.23        5   0.0744
protein       11.81      163760    35.10        2   0.0342
rnflow        14.86      171392    26.91        2   0.0223
test_fpu      11.35      145848    11.03        2   0.0952
tfft           1.10       22072     3.30        2   0.1817

Geometric Mean Execution Time =  12.36 seconds

to be compared to the lowest Geometric Mean I have got so far (most of the
difference is due to nf, which depends a lot on the mood of my laptop)


Date & Time     : 22 Dec 2010 10:33:08
Test Name       : pbharness
Compile Command : gfc %n.f90 -Ofast -funroll-loops -ftree-loop-linear
                  -fomit-frame-pointer -finline-limit=600 --param hot-bb-frequency-fraction=2000
                  -fwhole-program -flto -o %n
Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct
                  linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   :  300.0
Target Error %  :  0.200
Minimum Repeats : 2
Maximum Repeats : 5

Benchmark   Compile  Executable  Ave Run   Number    Estim
Name         (secs)     (bytes)   (secs)  Repeats    Err %
---------   -------  ----------  -------  -------   ------
ac            11.55       58672     8.11        2   0.0123
aermod       164.78     1522240    19.11        2   0.1151
air           20.73       85984     6.87        5   0.1914
capacita      14.66      105472    40.22        2   0.0584
channel        3.22       34448     2.92        4   0.1714
doduc         24.70      212360    27.81        2   0.1025
fatigue        9.81       85144     4.70        3   0.1862
gas_dyn       24.13      144240     4.66        5   0.4507
induct        22.50      214136    13.69        2   0.1096
linpk          2.56       21536    21.68        2   0.0231
mdbx           8.93       84744    12.52        2   0.0080
nf            22.61      104136    27.63        2   0.0778
protein       26.19      155768    35.51        2   0.0127
rnflow        30.99      163200    26.15        2   0.0248
test_fpu      18.79      145848    10.98        2   0.0182
tfft           1.92       22072     3.29        2   0.0304

Geometric Mean Execution Time =  12.27 seconds


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2011-01-23 Thread hubicka at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

Jan Hubicka hubicka at gcc dot gnu.org changed:

   What|Removed |Added

   Last reconfirmed|2011-01-23 15:59:30 |
 CC||rguenther at suse dot de

--- Comment #18 from Jan Hubicka hubicka at gcc dot gnu.org 2011-01-23 
20:00:23 UTC ---
We produce very lousy code for the out-of-line copy of
__perdida_m_MOD_generalized_hookes_law. This seems to be the reason why we
inline it.

The code is a bit better with early FRE, but we still get the following in
vect_pgeneralized_constitutive_tensor (optimized dump):

  generalized_constitutive_tensor = {};
  D.4502_45 = *lambda_44(D);
  D.4503_47 = *mu_46(D);
  D.4504_48 = D.4503_47 * 2.0e+0;
  D.4505_49 = D.4504_48 + D.4502_45;
  generalized_constitutive_tensor[0] = D.4505_49;
  generalized_constitutive_tensor[6] = D.4502_45;
  generalized_constitutive_tensor[12] = D.4502_45;
  generalized_constitutive_tensor[1] = D.4502_45;
  generalized_constitutive_tensor[7] = D.4505_49;
  generalized_constitutive_tensor[13] = D.4502_45;
  generalized_constitutive_tensor[2] = D.4502_45;
  generalized_constitutive_tensor[8] = D.4502_45;
  generalized_constitutive_tensor[14] = D.4505_49;
  generalized_constitutive_tensor[21] = D.4503_47;
  generalized_constitutive_tensor[28] = D.4503_47;
  generalized_constitutive_tensor[35] = D.4503_47;

This initializes the array with mostly zeros, and then we use it in a vectorized loop:

  vect_cst_.855_301 = {D.4508_69, D.4508_69};
  vect_cst_.862_295 = {D.4511_73, D.4511_73};
  vect_cst_.870_288 = {D.4514_77, D.4514_77};
  vect_cst_.878_323 = {D.4519_82, D.4519_82};
  vect_cst_.886_330 = {D.4522_86, D.4522_86};
  vect_cst_.894_337 = {D.4526_90, D.4526_90};
  vect_var_.853_205 = MEM[(real(kind=8)[36] *)generalized_constitutive_tensor];
  vect_var_.854_210 = vect_var_.853_205 * vect_cst_.855_301;
  vect_var_.860_211 = MEM[(real(kind=8)[36] *)generalized_constitutive_tensor + 48B];
  vect_var_.861_214 = vect_var_.860_211 * vect_cst_.862_295;
  vect_var_.863_215 = vect_var_.861_214 + vect_var_.854_210;
  vect_var_.868_220 = MEM[(real(kind=8)[36] *)generalized_constitutive_tensor + 96B];
  vect_var_.869_221 = vect_var_.868_220 * vect_cst_.870_288;
  vect_var_.871_224 = vect_var_.863_215 + vect_var_.869_221;
  vect_var_.876_225 = MEM[(real(kind=8)[36] *)generalized_constitutive_tensor + 144B];

We would do better to unroll this and optimize away the zero terms.
Without -ftree-vectorize, however, we still don't do this transform. We end up with:

  generalized_constitutive_tensor = {};
  D.4502_45 = *lambda_44(D);
  D.4503_47 = *mu_46(D);
  D.4504_48 = D.4503_47 * 2.0e+0;
  D.4505_49 = D.4504_48 + D.4502_45;
  generalized_constitutive_tensor[1] = D.4502_45;
  generalized_constitutive_tensor[7] = D.4505_49;
  generalized_constitutive_tensor[13] = D.4502_45;
  generalized_constitutive_tensor[2] = D.4502_45;
  generalized_constitutive_tensor[8] = D.4502_45;
  generalized_constitutive_tensor[14] = D.4505_49;
  generalized_constitutive_tensor[21] = D.4503_47;
  generalized_constitutive_tensor[28] = D.4503_47;
  generalized_constitutive_tensor[35] = D.4503_47;

 pretmp.827_334 = generalized_constitutive_tensor[1];
  pretmp.830_336 = generalized_constitutive_tensor[7];
  pretmp.832_338 = generalized_constitutive_tensor[13];
  pretmp.834_340 = generalized_constitutive_tensor[19];
  pretmp.836_342 = generalized_constitutive_tensor[25];
  pretmp.838_344 = generalized_constitutive_tensor[31];

So copy propagation and SRA are missing. Moreover, we can't figure out that
generalized_constitutive_tensor[31] is 0.

So it is quite a good testcase for optimization queue ordering.
Honza
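
To relate the flat offsets in these dumps to the Fortran source: the 6x6 array
is stored column-major, so element (i,j) with 1-based indices lands at offset
(i-1) + 6*(j-1). The Python sketch below (the names flat_index, STORED and so
on are illustrative only) lists which offsets the source stores to and which
offsets the loads for row 2 touch. The loads at offsets that are never stored
to can only see the zeros from the initial clearing of the array.

```python
N = 6  # generalized_constitutive_tensor is declared dimension(6,6)


def flat_index(i, j, n=N):
    """0-based flat offset of Fortran element (i, j) (1-based, column-major)."""
    return (i - 1) + n * (j - 1)


# Elements assigned explicitly in generalized_hookes_law, in source order.
STORED = [(1, 1), (1, 2), (1, 3),
          (2, 1), (2, 2), (2, 3),
          (3, 1), (3, 2), (3, 3),
          (4, 4), (5, 5), (6, 6)]

stored_offsets = [flat_index(i, j) for i, j in STORED]

# Offsets read by dot_product(generalized_constitutive_tensor(2,:), ...).
row2_offsets = [flat_index(2, j) for j in range(1, N + 1)]

# Row-2 loads whose offset is never written after the initial clearing.
row2_known_zero = [k for k in row2_offsets if k not in stored_offsets]

if __name__ == "__main__":
    print("stored offsets:      ", stored_offsets)
    print("row 2 load offsets:  ", row2_offsets)
    print("row 2 loads of zeros:", row2_known_zero)
```

The stored offsets match the twelve stores in the first dump, and the row-2
offsets match the pretmp loads in the second one.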


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2011-01-23 Thread hubicka at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

Jan Hubicka hubicka at gcc dot gnu.org changed:

   What|Removed |Added

   Last reconfirmed||2011-01-23 15:59:30

--- Comment #19 from Jan Hubicka hubicka at gcc dot gnu.org 2011-01-23 
21:05:51 UTC ---
This adds enough passes that we generate sane code for hookes_law
(and we do that before inlining).
Index: passes.c
===
--- passes.c(revision 169136)
+++ passes.c(working copy)
@@ -775,6 +775,14 @@
  NEXT_PASS (pass_convert_switch);
   NEXT_PASS (pass_cleanup_eh);
   NEXT_PASS (pass_profile);
+ NEXT_PASS (pass_tree_loop_init);
+ NEXT_PASS (pass_complete_unroll);
+ NEXT_PASS (pass_tree_loop_done);
+  NEXT_PASS (pass_ccp);
+  NEXT_PASS (pass_fre);
+  NEXT_PASS (pass_dse);
+  NEXT_PASS (pass_fre);
+  NEXT_PASS (pass_cd_dce);
   NEXT_PASS (pass_local_pure_const);
  /* Split functions creates parts that are not run through
 early optimizations again.  It is thus good idea to do this
@@ -782,7 +790,7 @@

We need to unroll the loop, run ccp to get constant array indexes, and run FRE
to propagate through memory accesses. For some reason FRE is needed twice, or
the loads from the temporary array are not copy propagated.

I haven't tested whether DSE is really needed or whether cd_dce gets rid of the
dead store into the array. Still, a lot of copyprop opportunity is left.

This makes the hookes_law estimate 91 instructions, so -finline-limit=183
should be enough.


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2011-01-23 Thread dominiq at lps dot ens.fr
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #20 from Dominique d'Humieres dominiq at lps dot ens.fr 
2011-01-23 23:20:34 UTC ---
> This makes hookes_law estimate to be 91 instructions, so -finline-limit=183
> should be enough.

With the patch in comment #19, I rather find a threshold of -finline-limit=256.
On top of that, as shown by the timings below, the patch increases the threshold
for ac.f90 and breaks the vectorization for induct.f90.

Would the patch in comment #15 and an increase of the default value for
-finline-limit to 300 be acceptable at this stage (with the usual bells and
whistles: SPEC, ...)?


Date & Time     : 23 Jan 2011 23:18:23
Test Name       : pbharness
Compile Command : gfcp %n.f90 -Ofast -funroll-loops -ftree-loop-linear
                  -fomit-frame-pointer -finline-limit=300 -fwhole-program -flto -o %n
Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct
                  linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   :  300.0
Target Error %  :  0.200
Minimum Repeats : 2
Maximum Repeats : 5

Benchmark   Compile  Executable  Ave Run   Number    Estim
Name         (secs)     (bytes)   (secs)  Repeats    Err %
---------   -------  ----------  -------  -------   ------
ac             3.15       50536     9.58        2   0.0156
aermod       104.98     1652280    18.79        2   0.1011
air            8.83       90048     6.99        5   0.7334
capacita       5.95       89056    40.21        2   0.0174
channel        1.65       34448     2.99        2   0.0502
doduc         14.59      208056    27.91        2   0.0036
fatigue        4.80       89264     4.72        2   0.0212
gas_dyn       11.65      148176     4.66        5   0.4391
induct        11.20      205976    22.34        2   0.0672
linpk          1.59       21536    21.70        2   0.0299
mdbx           5.78       84760    12.58        2   0.0119
nf             7.60       83712    29.53        5   0.3854
protein       11.69      163760    35.18        2   0.1109
rnflow        15.23      167296    26.97        2   0.0890
test_fpu      11.33      145848    11.06        5   0.3715
tfft           1.13       22072     3.30        2   0.0607

Geometric Mean Execution Time =  12.89 seconds


Date & Time     : 23 Jan 2011 23:54:28
Test Name       : pbharness
Compile Command : gfcp %n.f90 -Ofast -funroll-loops -ftree-loop-linear
                  -fomit-frame-pointer -finline-limit=600 -fwhole-program -flto -o %n
Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct
                  linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   :  300.0
Target Error %  :  0.200
Minimum Repeats : 2
Maximum Repeats : 5

Benchmark   Compile  Executable  Ave Run   Number    Estim
Name         (secs)     (bytes)   (secs)  Repeats    Err %
---------   -------  ----------  -------  -------   ------
ac             3.59       54576     8.10        2   0.0062
aermod       103.73     1558344    18.91        2   0.0238
air           10.47       89992     6.77        5   0.1563
capacita       7.47      101344    40.08        2   0.0137
channel        1.65       34448     2.97        5   0.5872
doduc         15.82      216376    27.61        2   0.
fatigue        5.10       89264     4.73        2   0.
gas_dyn       12.09      152264     4.69        5   0.6428
induct        11.10      205976    22.33        2   0.0403
linpk          1.59       21536    21.72        2   0.0368
mdbx           5.85       84760    12.58        2   0.0517
nf            11.34      108280    28.98        2   0.1087
protein       11.65      163760    35.18        3   0.1422
rnflow        17.39      183696    26.71        2   0.0243
test_fpu      11.49      145816    11.02        2   0.1226
tfft           1.43       22072     3.29        2   0.0911

Geometric Mean Execution Time =  12.70 seconds


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2011-01-08 Thread hubicka at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #11 from Jan Hubicka hubicka at gcc dot gnu.org 2011-01-08 
20:08:26 UTC ---
Does --param hot-bb-frequency-fraction=10 work here?

> This is weird!-( I have done the following profiling and it shows that -flto
> prevents the inlining of __perdida_m_MOD_perdida, while -fno-inline-functions
> restores it. This contradicts what the manual says:
>
> -finline-functions
> Integrate all simple functions into their callers. The compiler heuristically
> decides which functions are simple enough to be worth integrating in this way.

Disabling autoinlining of small functions can allow other inlining (inlining
functions called once or inlining for size), so this is not completely
unexpected.


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2010-09-30 Thread dominiq at lps dot ens.fr
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #10 from Dominique d'Humieres dominiq at lps dot ens.fr 
2010-09-30 17:28:19 UTC ---
(In reply to comment #8)
> Using -fno-inline-functions, the program recovers the speed of the no-LTO
> version.

This does not work on powerpc-apple-darwin9:

[karma] lin/test% gfc -Ofast -funroll-loops -fwhole-program -g fatigue.f90
[karma] lin/test% time a.out > /dev/null
15.942u 0.052s 0:16.54 96.6% 0+0k 2+1io 40pf+0w
[karma] lin/test% gfc -Ofast -funroll-loops -fwhole-program -g -flto fatigue.f90
[karma] lin/test% time a.out > /dev/null
20.330u 0.063s 0:21.06 96.8% 0+0k 0+2io 0pf+0w
[karma] lin/test% gfc -Ofast -funroll-loops -fwhole-program -g -flto -fno-inline-functions fatigue.f90
[karma] lin/test% time a.out > /dev/null
20.678u 0.063s 0:21.33 97.1% 0+0k 0+2io 0pf+0w
[karma] lin/test% gfc -Ofast -funroll-loops -fwhole-program -g -flto -finline-limit=600 fatigue.f90
[karma] lin/test% time a.out > /dev/null
10.903u 0.036s 0:11.30 96.7% 0+0k 0+2io 0pf+0w
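
To put the four timings above on a common scale, the user CPU times can be
compared against the non-LTO baseline. This is a small Python sketch added for
illustration, not something from the original report; the helper names are
hypothetical and the time lines are copied from the session above.

```python
import re

# Leading fields of the tcsh `time` lines from the powerpc-apple-darwin9 runs,
# keyed by the options added to the baseline command line.
BASELINE = "baseline (no -flto)"
runs = {
    BASELINE: "15.942u 0.052s 0:16.54 96.6%",
    "-flto": "20.330u 0.063s 0:21.06 96.8%",
    "-flto -fno-inline-functions": "20.678u 0.063s 0:21.33 97.1%",
    "-flto -finline-limit=600": "10.903u 0.036s 0:11.30 96.7%",
}


def user_seconds(time_line):
    """Extract the user CPU time, i.e. the leading '<seconds>u' field."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)u\b", time_line)
    if match is None:
        raise ValueError(f"not a tcsh time line: {time_line!r}")
    return float(match.group(1))


def relative_change(time_line, baseline_line):
    """Fractional change in user time versus the baseline (positive = slower)."""
    return user_seconds(time_line) / user_seconds(baseline_line) - 1.0


for options, line in runs.items():
    change = relative_change(line, runs[BASELINE])
    print(f"{options:30s} {user_seconds(line):7.3f}s  {change:+.1%}")
```

Only the user time is compared here; the system times in these runs are small
enough that including them would not change the picture.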


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2010-09-29 Thread dominiq at lps dot ens.fr
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #9 from Dominique d'Humieres dominiq at lps dot ens.fr 2010-09-29 
20:27:36 UTC ---
(In reply to comment #8)
> Using -fno-inline-functions, the program recovers the speed of the no-LTO
> version.

This is weird!-( I have done the following profiling and it shows that -flto
prevents the inlining of __perdida_m_MOD_perdida, while -fno-inline-functions
restores it. This contradicts what the manual says:

-finline-functions
Integrate all simple functions into their callers. The compiler heuristically
decides which functions are simple enough to be worth integrating in this way.

Note also that in order to inline __perdida_m_MOD_generalized_hookes_law one
needs -finline-limit=600 (actually some number between 300 and 400).


[macbook] lin/test% gfc -Ofast -funroll-loops -fwhole-program -g fatigue.f90
[macbook] lin/test% time a.out > /dev/null
6.547u 0.024s 0:06.57 99.8% 0+0k 0+2io 0pf+0w

+ 70.8%, MAIN__, a.out
| + 10.1%, free, libSystem.B.dylib
| |   7.9%, szone_size, libSystem.B.dylib
| + 8.0%, malloc, libSystem.B.dylib
| | + 6.4%, malloc_zone_malloc, libSystem.B.dylib
| | |   4.4%, szone_malloc_should_clear, libSystem.B.dylib
| | |   0.4%, szone_malloc, libSystem.B.dylib
| |   0.4%, dyld_stub_malloc_zone_malloc, libSystem.B.dylib
| |   0.1%, szone_malloc_should_clear, libSystem.B.dylib
|   4.1%, szone_free_definite_size, libSystem.B.dylib
|   2.4%, cosisin, libSystem.B.dylib
| + 0.7%, cexp, libSystem.B.dylib
| |   0.1%, exp$fenv_access_off, libSystem.B.dylib
| |   0.0%, dyld_stub_exp, libSystem.B.dylib
  27.2%, __perdida_m_MOD_generalized_hookes_law, a.out
  0.5%, dyld_stub_malloc, a.out
  0.4%, free, libSystem.B.dylib
  0.4%, dyld_stub_free, a.out
  0.4%, szone_free_definite_size, libSystem.B.dylib
  0.2%, malloc, libSystem.B.dylib
  0.1%, dyld_stub_cexp, a.out
  0.0%, cexp, libSystem.B.dylib

[macbook] lin/test% gfc -Ofast -funroll-loops -fwhole-program -flto fatigue.f90
[macbook] lin/test% time a.out > /dev/null
9.013u 0.027s 0:09.04 99.8% 0+0k 0+2io 0pf+0w

+ 64.8%, __perdida_m_MOD_perdida, a.out <---
| + 6.8%, free, libSystem.B.dylib
| |   4.9%, szone_size, libSystem.B.dylib
| + 5.2%, malloc, libSystem.B.dylib
| | + 4.1%, malloc_zone_malloc, libSystem.B.dylib
| | |   2.5%, szone_malloc_should_clear, libSystem.B.dylib
| | |   0.5%, szone_malloc, libSystem.B.dylib
| |   0.3%, dyld_stub_malloc_zone_malloc, libSystem.B.dylib
|   3.1%, szone_free_definite_size, libSystem.B.dylib
  19.3%, __perdida_m_MOD_generalized_hookes_law, a.out
+ 14.6%, MAIN__.2130, a.out
|   1.8%, cosisin, libSystem.B.dylib
| + 0.4%, cexp, libSystem.B.dylib
| |   0.1%, exp$fenv_access_off, libSystem.B.dylib
| |   0.0%, dyld_stub_exp, libSystem.B.dylib
| |   0.0%, cosisin, libSystem.B.dylib
  0.3%, szone_free_definite_size, libSystem.B.dylib
  0.3%, dyld_stub_malloc, a.out
  0.3%, dyld_stub_free, a.out
  0.2%, free, libSystem.B.dylib
  0.2%, malloc, libSystem.B.dylib
  0.0%, cexp, libSystem.B.dylib
  0.0%, data_transfer_init, libgfortran.3.dylib

[macbook] lin/test% gfc -Ofast -funroll-loops -fwhole-program -flto -fno-inline-functions fatigue.f90
[macbook] lin/test% time a.out > /dev/null
6.575u 0.021s 0:06.61 99.6% 0+0k 0+2io 0pf+0w

+ 71.0%, MAIN__.2130, a.out
| + 8.9%, free, libSystem.B.dylib
| |   6.6%, szone_size, libSystem.B.dylib
| + 8.1%, malloc, libSystem.B.dylib
| | + 6.4%, malloc_zone_malloc, libSystem.B.dylib
| | |   4.5%, szone_malloc_should_clear, libSystem.B.dylib
| | |   0.6%, szone_malloc, libSystem.B.dylib
| |   0.4%, dyld_stub_malloc_zone_malloc, libSystem.B.dylib
| |   0.2%, szone_malloc_should_clear, libSystem.B.dylib
|   4.4%, szone_free_definite_size, libSystem.B.dylib
|   1.9%, cosisin, libSystem.B.dylib
| + 1.0%, cexp, libSystem.B.dylib
| |   0.1%, exp$fenv_access_off, libSystem.B.dylib
| |   0.1%, cosisin, libSystem.B.dylib
| |   0.0%, dyld_stub_exp, libSystem.B.dylib
  27.3%, __perdida_m_MOD_generalized_hookes_law, a.out
  0.4%, free, libSystem.B.dylib
  0.3%, dyld_stub_malloc, a.out
  0.3%, dyld_stub_free, a.out
  0.3%, szone_free_definite_size, libSystem.B.dylib
  0.2%, malloc, libSystem.B.dylib
  0.1%, dyld_stub_cexp, a.out
  0.0%, cexp, libSystem.B.dylib

[macbook] lin/test% gfc -Ofast -funroll-loops -fwhole-program -flto -finline-limit=600 fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.768u 0.018s 0:04.79 99.5% 0+0k 0+1io 0pf+0w

+ 97.5%, MAIN__.2133, a.out
| + 15.4%, free, libSystem.B.dylib
| |   10.6%, szone_size, libSystem.B.dylib
| + 11.4%, malloc, libSystem.B.dylib
| | + 9.6%, malloc_zone_malloc, libSystem.B.dylib
| | |   4.9%, szone_malloc_should_clear, libSystem.B.dylib
| | |   0.9%, szone_malloc, libSystem.B.dylib
| |   0.4%, dyld_stub_malloc_zone_malloc, libSystem.B.dylib
|   6.4%, szone_free_definite_size, libSystem.B.dylib
|   2.7%, cosisin, libSystem.B.dylib
| + 0.8%, cexp, libSystem.B.dylib
| |   0.1%, exp$fenv_access_off, libSystem.B.dylib
| |   0.1%, cosisin, libSystem.B.dylib

[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2010-09-28 Thread burnus at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

Tobias Burnus burnus at gcc dot gnu.org changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #3 from Tobias Burnus burnus at gcc dot gnu.org 2010-09-28 
12:23:06 UTC ---
(In reply to comment #2)
> For single-file programs -fwhole-program and -flto should be basically
> equivalent if the Frontend provides correctly merged decls.  I suppose
> it does not and thus we do less inlining with -fwhole-program compared
> to -flto.

It might well be the reason that one does less inlining without LTO - but
that's then not only a FE bug (not correctly merged decls) but also a ME/target
bug as the LTO program is _slower_.


Cf. also PR 44334, which is about a -fwhole-program slowdown (w/ and w/o
-flto). For the latter program, it helped to use --param
hot-bb-frequency-fraction=2000. However, for this PR, the option does not seem
to help.


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2010-09-28 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #4 from Richard Guenther rguenth at gcc dot gnu.org 2010-09-28 
13:38:58 UTC ---
(In reply to comment #3)
> (In reply to comment #2)
> > For single-file programs -fwhole-program and -flto should be basically
> > equivalent if the Frontend provides correctly merged decls.  I suppose
> > it does not and thus we do less inlining with -fwhole-program compared
> > to -flto.
>
> It might well be the reason that one does less inlining without LTO - but

more inlining with LTO.  You read my stmt wrong.

> that's then not only a FE bug (not correctly merged decls) but also a ME/target
> bug as the LTO program is _slower_.

Sure.  As with all performance related bugs this needs analysis and is
unlikely an LTO problem - LTO does not (not-)optimize, optimization
passes do.

 
> Cf. also PR 44334, which is about a -fwhole-program slowdown (w/ and w/o
> -flto). For the latter program, it helped to use --param
> hot-bb-frequency-fraction=2000. However, for this PR, the option does not seem
> to help.


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2010-09-28 Thread Joost.VandeVondele at pci dot uzh.ch
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #5 from Joost VandeVondele Joost.VandeVondele at pci dot uzh.ch 
2010-09-28 13:58:18 UTC ---
(In reply to comment #4)
> Sure.  As with all performance related bugs this needs analysis and is
> unlikely an LTO problem - LTO does not (not-)optimize, optimization
> passes do.

I'm wondering if there is any description on how to do this. For example, how
do I get the assembly of a function and the -fdump-tree-all files from a gold
based linking that goes as:

rm -f test.s test2.s test.o test2.o ;
gfortran -c -flto test.f90 ; 
gfortran -c -flto test2.f90 ;  
gfortran -O3 -march=native -fuse-linker-plugin -fwhopr=2 test.o test2.o

just using -S or -fdump-tree-all doesn't work. 

Is 'objdump -d' the only tool ?


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2010-09-28 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #6 from Richard Guenther rguenth at gcc dot gnu.org 2010-09-28 
14:07:54 UTC ---
(In reply to comment #5)
> (In reply to comment #4)
> > Sure.  As with all performance related bugs this needs analysis and is
> > unlikely an LTO problem - LTO does not (not-)optimize, optimization
> > passes do.
>
> I'm wondering if there is any description on how to do this. For example, how
> do I get the assembly of a function and the -fdump-tree-all files from a gold
> based linking that goes as:
>
> rm -f test.s test2.s test.o test2.o ;
> gfortran -c -flto test.f90 ;
> gfortran -c -flto test2.f90 ;
> gfortran -O3 -march=native -fuse-linker-plugin -fwhopr=2 test.o test2.o
>
> just using -S or -fdump-tree-all doesn't work.
>
> Is 'objdump -d' the only tool ?

No, -fdump-tree-all works; it just uses base names that may be unintuitive.
Append -v to see them. For -fwhopr the base name should be the output file
specified with -o. You left -o out, so the base is not a.out but some
temporary file in /tmp. With -o t I get t.ltrans[01].147t.optimized, etc.
With -flto it's just t.147t.optimized.
To retain the assembler output you have to use -save-temps, which keeps
t.ltrans[01].s; with -flto it keeps t1.s (using the base name of the first
object file).
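
The naming scheme described above can be turned into a small helper that
gathers the files after a link done with -o <base> -save-temps -fdump-tree-all.
This is only a sketch in Python; the function name, the dictionary keys and the
glob patterns are assumptions derived from the example names in this comment.
The pass number (147 here) depends on the GCC version, so the patterns accept
any number.

```python
import glob
import os


def find_lto_outputs(output_base, directory="."):
    """Collect optimized dumps and assembler files named after the -o base.

    Patterns follow the examples in the comment above:
      <base>.ltrans<N>.<pass>t.optimized   (-fwhopr dumps)
      <base>.<pass>t.optimized             (-flto dump)
      <base>.ltrans<N>.s                   (-fwhopr assembler with -save-temps)
    The -flto assembler file is named after the first object file instead,
    so it is not covered here.
    """
    base = glob.escape(output_base)
    root = glob.escape(directory)

    def match(pattern):
        return sorted(glob.glob(os.path.join(root, pattern)))

    return {
        "ltrans_dumps": match(f"{base}.ltrans[0-9]*.[0-9]*t.optimized"),
        "flto_dumps": match(f"{base}.[0-9]*t.optimized"),
        "ltrans_asm": match(f"{base}.ltrans[0-9]*.s"),
    }


if __name__ == "__main__":
    # Example: inspect the current directory after linking with "-o t".
    for kind, paths in find_lto_outputs("t").items():
        print(kind, paths)
```

If -o is omitted, the base name is a temporary file, so the same search would
have to be pointed at /tmp with whatever base name -v reports.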


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2010-09-28 Thread Joost.VandeVondele at pci dot uzh.ch
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #7 from Joost VandeVondele Joost.VandeVondele at pci dot uzh.ch 
2010-09-28 14:19:38 UTC ---
(In reply to comment #6)
> No, -fdump-tree-all works

great... I forgot to look in /tmp, and -save-temps also works fine.


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2010-09-28 Thread burnus at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #8 from Tobias Burnus burnus at gcc dot gnu.org 2010-09-28 
14:57:34 UTC ---
Using -fno-inline-functions, the program recovers the speed of the no-LTO
version.

Notes from #gcc:
(dominiq) For fatigue the key for speed-up is inlining of
generalized_hookes_law and you need -finline-limit=400
(richi) Considering inline candidate generalized_hookes_law. / Inlining
failed: --param max-inline-insns-auto limit reached


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2010-09-27 Thread Joost.VandeVondele at pci dot uzh.ch
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

Joost VandeVondele Joost.VandeVondele at pci dot uzh.ch changed:

   What|Removed |Added

 CC||Joost.VandeVondele at pci
   ||dot uzh.ch

--- Comment #1 from Joost VandeVondele Joost.VandeVondele at pci dot uzh.ch 
2010-09-27 10:39:05 UTC ---
I have observed a similar 40% slowdown in CP2K as a result of LTO. I haven't
investigated it yet.


[Bug lto/45810] 40% slowdown when using LTO for a single-file program

2010-09-27 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #2 from Richard Guenther rguenth at gcc dot gnu.org 2010-09-27 
10:48:33 UTC ---
For single-file programs -fwhole-program and -flto should be basically
equivalent if the Frontend provides correctly merged decls.  I suppose
it does not and thus we do less inlining with -fwhole-program compared
to -flto.