[Bug lto/45810] 40% slowdown when using LTO for a single-file program
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What       |Removed      |Added
           --------------------------------------
           Status     |NEW          |RESOLVED
           Resolution |             |FIXED

--- Comment #27 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-09-26 10:16:20 UTC ---
Yes, I think I analyzed the reason for this at some point (IPA profile) and fixed it.
--- Comment #26 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-09-22 15:25:48 UTC ---
AFAICT this PR has been fixed for some time. Here are the results I get on x86_64-apple-darwin10 (Core2Duo 2.53GHz, 3MB cache, 4GB RAM) at revision 179079:

Compile options: -fprotect-parens -Ofast -funroll-loops -fwhole-program

                      without -flto                     with -flto
Benchmark   Compile  Executable  Ave Run   Compile  Executable  Ave Run
Name         (secs)     (bytes)   (secs)    (secs)     (bytes)   (secs)
---------   -------  ----------  -------   -------  ----------  -------
ac             3.28       54936     8.81      6.64       54968     8.81
aermod        75.46     1184280    18.65    131.50     1212648    18.20
air           11.24      106336     7.26     22.38      106904     7.39
capacita       3.87       77152    41.29      7.36       77200    41.31
channel        1.25       34744     3.03      2.39       34864     3.03
doduc         12.40      200016    28.02     22.47      200496    27.69
fatigue        4.06       77400     4.83      8.17       77488     4.84
gas_dyn        9.32      119256     4.92     16.64      119816     4.92
induct         7.37      148840    13.83     14.76      153224    13.84
linpk          0.70       26024    21.64      1.93       26064    21.64
mdbx           3.77       80864    12.46      7.21       81040    12.46
nf             4.08       71848    19.34      8.07       71896    19.35
protein       15.17      131304    35.30     26.05      127224    35.48
rnflow        12.58      130888    28.25     23.76      131000    26.92
test_fpu       4.78       92968    10.63     13.35       93024    10.64
tfft           0.74       22352     3.28      1.98       22432     3.28

Geometric Mean Execution Time = 12.23 secs (without -flto), 12.18 secs (with -flto)

Compile options: -fprotect-parens -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer --param max-inline-insns-auto=200 -fwhole-program

                      without -flto                     with -flto
Benchmark   Compile  Executable  Ave Run   Compile  Executable  Ave Run
Name         (secs)     (bytes)   (secs)    (secs)     (bytes)   (secs)
---------   -------  ----------  -------   -------  ----------  -------
ac             4.05       54904     8.11      8.18       54920     8.11
aermod       101.55     1494688    18.17    169.63     1527120    18.12
air           14.46      114328     7.05     30.35      114912     7.04
capacita       5.39       97552    40.24     10.80       97584    40.21
channel        1.68       38792     2.91      3.17           3     2.91
doduc         12.98      208112    27.47     25.77      208584    27.52
fatigue        4.84       81440     2.95     10.27       81504     2.93
gas_dyn       13.55      143776     4.86     25.03      144392     4.86
induct        12.95      189872    13.78     24.32      190176    13.96
linpk          0.73       21856    21.69      2.44       21888    21.69
mdbx           4.32       84928    12.45      9.39       85104    12.54
nf             7.41       92248    18.93     17.82       92272    18.91
protein       17.26      160040    35.51     31.08      155984    35.47
rnflow        15.16      138880    28.27     27.28      139040    26.85
test_fpu       5.05       92872    10.65     14.65       92928    10.65
tfft           0.75       22352     3.28      1.72       22432     3.28

Geometric Mean Execution Time = 11.67 secs (without -flto), 11.64 secs (with -flto)

The option -flto improves the run time for rnflow.f90 by ~5% without any slowdown for the other tests. Could these results be checked on other platforms, and this PR closed if they agree with mine?
--- Comment #25 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-02-16 18:38:19 UTC ---
AFAICT the patch in http://gcc.gnu.org/ml/gcc-patches/2011-02/msg00973.html seems to fix most of the fatigue.f90 problems.

At revision 170178 without the patch, I get

[macbook] lin/test% gfcp -Ofast fatigue.f90
[macbook] lin/test% time a.out > /dev/null
8.903u 0.005s 0:08.91 99.8% 0+0k 0+2io 0pf+0w
[macbook] lin/test% gfcp -Ofast -fwhole-program fatigue.f90
[macbook] lin/test% time a.out > /dev/null
6.392u 0.002s 0:06.39 100.0% 0+0k 0+0io 0pf+0w
[macbook] lin/test% gfcp -Ofast -finline-limit=322 -fwhole-program fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.653u 0.002s 0:04.65 100.0% 0+0k 0+1io 0pf+0w
[macbook] lin/test% gfcp -Ofast -finline-limit=322 -fwhole-program -flto fatigue.f90
[macbook] lin/test% time a.out > /dev/null
8.212u 0.004s 0:08.22 99.8% 0+0k 0+2io 0pf+0w
[macbook] lin/test% gfcp -Ofast -finline-limit=322 --param large-function-growth=132 -fwhole-program -flto fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.526u 0.004s 0:04.53 99.7% 0+0k 0+1io 0pf+0w

At revision 170212 with the patch, I get

[macbook] lin/test% gfc -Ofast fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.628u 0.002s 0:04.63 99.7% 0+0k 0+0io 0pf+0w
[macbook] lin/test% gfc -Ofast -fwhole-program fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.654u 0.002s 0:04.65 100.0% 0+0k 0+1io 0pf+0w
[macbook] lin/test% gfc -Ofast -finline-limit=322 -fwhole-program fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.657u 0.002s 0:04.66 99.7% 0+0k 0+1io 0pf+0w
[macbook] lin/test% gfc -Ofast -finline-limit=322 -fwhole-program -flto fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.715u 0.003s 0:04.72 99.7% 0+0k 0+1io 0pf+0w
[macbook] lin/test% gfc -Ofast -finline-limit=322 --param large-function-growth=132 -fwhole-program -flto fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.713u 0.003s 0:04.71 100.0% 0+0k 0+1io 0pf+0w
[macbook] lin/test% gfc -Ofast -finline-limit=322 --param large-function-growth=137 -fwhole-program -flto fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.524u 0.003s 0:04.52 100.0% 0+0k 0+1io 0pf+0w
[macbook] lin/test% gfc -Ofast --param large-function-growth=137 -fwhole-program -flto fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.564u 0.003s 0:04.57 99.7% 0+0k 0+1io 0pf+0w
[macbook] lin/test% gfc -Ofast --param large-function-growth=137 -fwhole-program fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.479u 0.003s 0:04.48 99.7% 0+0k 0+2io 0pf+0w

A quick check of the other tests does not show any obvious slowdown with the patch.
--- Comment #21 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-01-24 09:29:00 UTC ---
I have regtested my working tree (with other patches) with the patch in comment #15 and got 180 new failures (likely 90 each for -m32 and -m64, but I have not checked that carefully). Among them, 124 are of the kind

  scan-tree-dump-times fre *: dump file does not exist

and seem to be due to the extra pass producing fre1 and fre2. I can adjust the tests to, say, fre2 and see what happens. Then I see

FAIL: gcc.dg/ipa/ipa-pta-14.c scan-ipa-dump pta foo.result = { NULL a[^ ]* a[^ ]* c[^ ]* }
FAIL: gcc.dg/matrix/matrix-1.c scan-ipa-dump-times matrix-reorg Flattened 3 dimensions 1
FAIL: gcc.dg/matrix/matrix-2.c scan-ipa-dump-times matrix-reorg Flattened 2 dimensions 1
FAIL: gcc.dg/matrix/matrix-3.c scan-ipa-dump-times matrix-reorg Flattened 2 dimensions 1
FAIL: gcc.dg/matrix/matrix-6.c scan-ipa-dump-times matrix-reorg Flattened 2 dimensions 1
FAIL: gcc.dg/matrix/transpose-1.c scan-ipa-dump-times matrix-reorg Flattened 3 dimensions 1
FAIL: gcc.dg/matrix/transpose-1.c scan-ipa-dump-times matrix-reorg Transposed 3
FAIL: gcc.dg/matrix/transpose-2.c scan-ipa-dump-times matrix-reorg Flattened 3 dimensions 1
FAIL: gcc.dg/matrix/transpose-3.c scan-ipa-dump-times matrix-reorg Flattened 2 dimensions 1
FAIL: gcc.dg/matrix/transpose-3.c scan-ipa-dump-times matrix-reorg Transposed 2
FAIL: gcc.dg/matrix/transpose-4.c scan-ipa-dump-times matrix-reorg Flattened 3 dimensions 1
FAIL: gcc.dg/matrix/transpose-4.c scan-ipa-dump-times matrix-reorg Transposed 2
FAIL: gcc.dg/matrix/transpose-5.c scan-ipa-dump-times matrix-reorg Flattened 3 dimensions 1
FAIL: gcc.dg/matrix/transpose-6.c scan-ipa-dump-times matrix-reorg Flattened 3 dimensions 1
FAIL: gcc.dg/torture/pta-structcopy-1.c -O2 scan-tree-dump alias points-to vars: { i }
FAIL: gcc.dg/torture/pta-structcopy-1.c -O3 -fomit-frame-pointer scan-tree-dump alias points-to vars: { i }
FAIL: gcc.dg/torture/pta-structcopy-1.c -O3 -g scan-tree-dump alias points-to vars: { i }
FAIL: gcc.dg/torture/pta-structcopy-1.c -Os scan-tree-dump alias points-to vars: { i }
FAIL: gcc.dg/torture/pta-structcopy-1.c -O2 -flto -flto-partition=none scan-tree-dump alias points-to vars: { i }
FAIL: gcc.dg/torture/pta-structcopy-1.c -O2 -flto scan-tree-dump alias points-to vars: { i }
FAIL: gcc.dg/tree-ssa/pta-ptrarith-1.c scan-tree-dump ealias q_., points-to vars: { k }
FAIL: gcc.dg/tree-ssa/sra-9.c scan-tree-dump-times optimized = s.b 0
FAIL: gcc.dg/tree-ssa/ssa-dce-4.c scan-tree-dump-times cddce1 a\[[^
FAIL: gcc.dg/tree-ssa/stdarg-2.c scan-tree-dump stdarg f6: va_list escapes 0, needs to save (3|12|24) GPR units
FAIL: gcc.dg/tree-ssa/stdarg-2.c scan-tree-dump stdarg f11: va_list escapes 0, needs to save (3|12|24) GPR units
FAIL: gcc.dg/tree-ssa/stdarg-2.c scan-tree-dump stdarg f12: va_list escapes 0, needs to save [1-9][0-9]* GPR units
FAIL: gcc.dg/tree-ssa/stdarg-2.c scan-tree-dump stdarg f13: va_list escapes 0, needs to save [1-9][0-9]* GPR units
FAIL: gcc.dg/tree-ssa/stdarg-2.c scan-tree-dump stdarg f14: va_list escapes 0, needs to save [1-9][0-9]* GPR units
FAIL: g++.dg/ipa/iinline-1.C scan-ipa-dump inline String::funcOne[^\n]*inline copy in int main
FAIL: g++.dg/ipa/iinline-2.C scan-ipa-dump inline String::funcOne[^\n]*inline copy in int main

So far I have only looked at gcc.dg/ipa/ipa-pta-14.c, for which grepping foo.result yields

  p_1 = foo.result
  foo.result = foo.arg1
  Equivalence classes for Direct node node id 15:foo.result are pointer: 8, location:0
  Unifying foo.result to foo.arg0
  foo.result = { a.0+32 } same as foo.arg0

instead of

  p_1 = foo.result
  foo.result = D.2736_3
  Equivalence classes for Direct node node id 15:foo.result are pointer: 13, location:0
  Unifying foo.result to p_1
  foo.result = { NULL a.0+32 a.64+64 c.0+32 } same as p_1

Is it a missed optimization or wrong code?
--- Comment #22 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-01-24 14:07:14 UTC ---
(In reply to comment #15)
> Enabling early FRE
>
> Index: passes.c
> ===================================================================
> --- passes.c    (revision 169136)
> +++ passes.c    (working copy)
> @@ -760,6 +760,7 @@
>        NEXT_PASS (pass_remove_cgraph_callee_edges);
>        NEXT_PASS (pass_rename_ssa_copies);
>        NEXT_PASS (pass_ccp);
> +      NEXT_PASS (pass_fre);
>        NEXT_PASS (pass_forwprop);
>        /* pass_build_ealias is a dummy pass that ensures that we
>           execute TODO_rebuild_alias at this point.  Re-building
> @@ -782,7 +783,7 @@
>
> reduces the perdida size estimate to 694 (so by about 30%) and hookes_law to
> 141 (by 11%). Not enough to make inlining happen, still.

That FRE pass should be after pass_sra_early (certainly after pass_build_ealias).
Jack Howarth <howarth at nitro dot med.uc.edu> changed:

           What |Removed |Added
           ----------------------------------------
           CC   |        |howarth at nitro dot med.uc.edu

--- Comment #23 from Jack Howarth <howarth at nitro dot med.uc.edu> 2011-01-24 17:58:00 UTC ---
(In reply to comment #22)
> That FRE pass should be after pass_sra_early (certainly after pass_build_ealias).

Index: gcc/passes.c
===================================================================
--- gcc/passes.c    (revision 169145)
+++ gcc/passes.c    (working copy)
@@ -767,6 +767,7 @@ init_optimization_passes (void)
          locals into SSA form if possible.  */
       NEXT_PASS (pass_build_ealias);
       NEXT_PASS (pass_sra_early);
+      NEXT_PASS (pass_fre);
       NEXT_PASS (pass_copy_prop);
       NEXT_PASS (pass_merge_phi);
       NEXT_PASS (pass_cd_dce);

gives

  Elapsed CPU time = 8.43600E+00

for

  gfortran -O3 -ffast-math -funroll-loops -flto -fwhole-program fatigue.f90 -o fatigue

and

  Elapsed CPU time = 4.16600E+00

for

  gfortran -O3 -ffast-math -funroll-loops -finline-limit=250 --param large-function-growth=250 -flto -fwhole-program fatigue.f90 -o fatigue
--- Comment #24 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-01-24 18:16:47 UTC ---
(In reply to comment #22)
> That FRE pass should be after pass_sra_early (certainly after pass_build_ealias).

Moving pass_fre after pass_sra_early does not fix the failures in the test suite reported in comment #21.
Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What             |Removed     |Added
           -------------------------------------------------
           Status           |UNCONFIRMED |NEW
           Last reconfirmed |            |2011.01.23 15:59:30
           Ever Confirmed   |0           |1

--- Comment #12 from Jan Hubicka <hubicka at gcc dot gnu.org> 2011-01-23 15:59:30 UTC ---
Reproduces for me. perdida is a function called once. What happens with default settings is that perdida is not considered an inline candidate for small-function inlining (it is estimated at over 700 instructions, so it is huge); later we try to inline it as a function called once, but hit the large-function-growth limit. Compiling with --param large-function-growth=100 solves that problem, but it does not make the testcase faster. So the problem is elsewhere.
--- Comment #13 from Jan Hubicka <hubicka at gcc dot gnu.org> 2011-01-23 16:45:23 UTC ---
OK, the slowdown goes away when both hookes_law and perdida are inlined. The first needs -finline-limit=380, the second needs large-function-growth=1000 (or a large increase of the inline limit, to make perdida be considered a small function and inlined before iztaccihuatl grows that much). Without large-function-growth we fail at:

  Considering perdida size 1056.
  Called once from iztaccihuatl 6151 insns.
  Not inlining: --param large-function-growth limit reached.

This is because inlining for functions called once first processes read_input:

  Considering read_input size 3099.
  Called once from iztaccihuatl 3128 insns.
  Inlined into iztaccihuatl which now has 6151 size for a net change of -76 size.

and that makes iztaccihuatl too large. large-function-insns is 2700 and large-function-growth is 100%, so iztaccihuatl can't grow past 3128*2 insns. We might increase large-function-growth (I will give it a try on our benchmarks) or we might convince the inliner to inline perdida first rather than read_input, because perdida is smaller...

Honza
--- Comment #14 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-01-23 17:04:07 UTC ---
After removing the comments, generalized_hookes_law reads

function generalized_hookes_law (strain_tensor, lambda, mu) result (stress_tensor)
!
   real (kind = LONGreal), dimension(:,:), intent(in) :: strain_tensor
   real (kind = LONGreal), intent(in) :: lambda, mu
   real (kind = LONGreal), dimension(3,3) :: stress_tensor
   real (kind = LONGreal), dimension(6) :: generalized_strain_vector, generalized_stress_vector
   real (kind = LONGreal), dimension(6,6) :: generalized_constitutive_tensor
   integer :: i
!
   generalized_constitutive_tensor(:,:) = 0.0_LONGreal
   generalized_constitutive_tensor(1,1) = lambda + 2.0_LONGreal * mu
   generalized_constitutive_tensor(1,2) = lambda
   generalized_constitutive_tensor(1,3) = lambda
   generalized_constitutive_tensor(2,1) = lambda
   generalized_constitutive_tensor(2,2) = lambda + 2.0_LONGreal * mu
   generalized_constitutive_tensor(2,3) = lambda
   generalized_constitutive_tensor(3,1) = lambda
   generalized_constitutive_tensor(3,2) = lambda
   generalized_constitutive_tensor(3,3) = lambda + 2.0_LONGreal * mu
   generalized_constitutive_tensor(4,4) = mu
   generalized_constitutive_tensor(5,5) = mu
   generalized_constitutive_tensor(6,6) = mu
!
   generalized_strain_vector(1) = strain_tensor(1,1)
   generalized_strain_vector(2) = strain_tensor(2,2)
   generalized_strain_vector(3) = strain_tensor(3,3)
   generalized_strain_vector(4) = strain_tensor(2,3)
   generalized_strain_vector(5) = strain_tensor(1,3)
   generalized_strain_vector(6) = strain_tensor(1,2)
!
   do i = 1, 6
      generalized_stress_vector(i) = dot_product(generalized_constitutive_tensor(i,:), generalized_strain_vector(:))
   end do
!
   stress_tensor(1,1) = generalized_stress_vector(1)
   stress_tensor(2,2) = generalized_stress_vector(2)
   stress_tensor(3,3) = generalized_stress_vector(3)
   stress_tensor(2,3) = generalized_stress_vector(4)
   stress_tensor(1,3) = generalized_stress_vector(5)
   stress_tensor(1,2) = generalized_stress_vector(6)
   stress_tensor(3,2) = stress_tensor(2,3)
   stress_tensor(3,1) = stress_tensor(1,3)
   stress_tensor(2,1) = stress_tensor(1,2)
!
end function generalized_hookes_law

Note that 24 of the 36 elements of generalized_constitutive_tensor are zero. Using that, the function can be replaced with

function generalized_hookes_law (strain_tensor, lambda, mu) result (stress_tensor)
!
   real (kind = LONGreal), dimension(:,:), intent(in) :: strain_tensor
   real (kind = LONGreal), intent(in) :: lambda, mu
   real (kind = LONGreal), dimension(3,3) :: stress_tensor
   real (kind = LONGreal) :: tmp
!
   stress_tensor(:,:) = mu * strain_tensor(:,:)
   tmp = lambda * (strain_tensor(1,1) + strain_tensor(2,2) + strain_tensor(3,3))
   stress_tensor(1,1) = tmp + 2.0_LONGreal * stress_tensor(1,1)
   stress_tensor(2,2) = tmp + 2.0_LONGreal * stress_tensor(2,2)
   stress_tensor(3,3) = tmp + 2.0_LONGreal * stress_tensor(3,3)
!
end function generalized_hookes_law
end module perdida_m

which is inlined at -finline-limit=320.
--- Comment #15 from Jan Hubicka <hubicka at gcc dot gnu.org> 2011-01-23 17:56:31 UTC ---
Enabling early FRE

Index: passes.c
===================================================================
--- passes.c    (revision 169136)
+++ passes.c    (working copy)
@@ -760,6 +760,7 @@
       NEXT_PASS (pass_remove_cgraph_callee_edges);
       NEXT_PASS (pass_rename_ssa_copies);
       NEXT_PASS (pass_ccp);
+      NEXT_PASS (pass_fre);
       NEXT_PASS (pass_forwprop);
       /* pass_build_ealias is a dummy pass that ensures that we
          execute TODO_rebuild_alias at this point.  Re-building
@@ -782,7 +783,7 @@

reduces the perdida size estimate to 694 (so by about 30%) and hookes_law to 141 (by 11%). Not enough to make inlining happen, still.
--- Comment #16 from Jan Hubicka <hubicka at gcc dot gnu.org> 2011-01-23 17:57:58 UTC ---
Also without inlining hookes_law but with inlining perdida (by using the large-function-growth parameter only and the patch above), I get a 30% speedup. That is not the 50% we get by inlining both, but it seems that without early FRE we miss some optimization that is independent of inlining.
--- Comment #17 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-01-23 19:38:30 UTC ---
With the patch in comment #15 and -finline-limit=300, I get

Date Time       : 23 Jan 2011 20:18:02
Test Name       : pbharness
Compile Command : gfcp %n.f90 -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=300 -fwhole-program -flto -o %n
Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   : 300.0
Target Error %  : 0.200
Minimum Repeats : 2
Maximum Repeats : 5

Benchmark   Compile  Executable  Ave Run  Number   Estim
Name         (secs)     (bytes)   (secs)  Repeats  Err %
---------   -------  ----------  -------  -------  ------
ac             3.55       54576     8.12     2     0.0062
aermod       103.51     1595448    18.87     2     0.0079
air            8.87       90048     6.89     2     0.0798
capacita       5.84       89056    40.27     2     0.0199
channel        1.62       34448     2.98     2     0.0168
doduc         14.30      203936    27.79     2     0.0162
fatigue        4.89       89264     4.74     2     0.0106
gas_dyn       11.72      148176     4.64     5     0.0535
induct        10.87      205976    14.00     2     0.0036
linpk          1.58       21536    21.71     2     0.0415
mdbx           5.60       84752    12.56     2     0.1871
nf             7.24       83712    29.23     5     0.0744
protein       11.81      163760    35.10     2     0.0342
rnflow        14.86      171392    26.91     2     0.0223
test_fpu      11.35      145848    11.03     2     0.0952
tfft           1.10       22072     3.30     2     0.1817

Geometric Mean Execution Time = 12.36 seconds

to be compared with the lowest Geometric Mean I have got so far (most of the difference is due to nf, which depends a lot on the mood of my laptop):

Date Time       : 22 Dec 2010 10:33:08
Test Name       : pbharness
Compile Command : gfc %n.f90 -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600 --param hot-bb-frequency-fraction=2000 -fwhole-program -flto -o %n
Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   : 300.0
Target Error %  : 0.200
Minimum Repeats : 2
Maximum Repeats : 5

Benchmark   Compile  Executable  Ave Run  Number   Estim
Name         (secs)     (bytes)   (secs)  Repeats  Err %
---------   -------  ----------  -------  -------  ------
ac            11.55       58672     8.11     2     0.0123
aermod       164.78     1522240    19.11     2     0.1151
air           20.73       85984     6.87     5     0.1914
capacita      14.66      105472    40.22     2     0.0584
channel        3.22       34448     2.92     4     0.1714
doduc         24.70      212360    27.81     2     0.1025
fatigue        9.81       85144     4.70     3     0.1862
gas_dyn       24.13      144240     4.66     5     0.4507
induct        22.50      214136    13.69     2     0.1096
linpk          2.56       21536    21.68     2     0.0231
mdbx           8.93       84744    12.52     2     0.0080
nf            22.61      104136    27.63     2     0.0778
protein       26.19      155768    35.51     2     0.0127
rnflow        30.99      163200    26.15     2     0.0248
test_fpu      18.79      145848    10.98     2     0.0182
tfft           1.92       22072     3.29     2     0.0304

Geometric Mean Execution Time = 12.27 seconds
Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What             |Removed             |Added
           ---------------------------------------------------------
           Last reconfirmed |2011-01-23 15:59:30 |
           CC               |                    |rguenther at suse dot de

--- Comment #18 from Jan Hubicka <hubicka at gcc dot gnu.org> 2011-01-23 20:00:23 UTC ---
We produce very lousy code for the out-of-line copy of __perdida_m_MOD_generalized_hookes_law. This seems to be the reason why we need to inline it. The code is a bit better with early FRE, but in the optimized dump we still get

  generalized_constitutive_tensor = {};
  D.4502_45 = *lambda_44(D);
  D.4503_47 = *mu_46(D);
  D.4504_48 = D.4503_47 * 2.0e+0;
  D.4505_49 = D.4504_48 + D.4502_45;
  generalized_constitutive_tensor[0] = D.4505_49;
  generalized_constitutive_tensor[6] = D.4502_45;
  generalized_constitutive_tensor[12] = D.4502_45;
  generalized_constitutive_tensor[1] = D.4502_45;
  generalized_constitutive_tensor[7] = D.4505_49;
  generalized_constitutive_tensor[13] = D.4502_45;
  generalized_constitutive_tensor[2] = D.4502_45;
  generalized_constitutive_tensor[8] = D.4502_45;
  generalized_constitutive_tensor[14] = D.4505_49;
  generalized_constitutive_tensor[21] = D.4503_47;
  generalized_constitutive_tensor[28] = D.4503_47;
  generalized_constitutive_tensor[35] = D.4503_47;

i.e. we initialize the array with mostly zeros, and then use it in a vectorized loop:

  vect_cst_.855_301 = {D.4508_69, D.4508_69};
  vect_cst_.862_295 = {D.4511_73, D.4511_73};
  vect_cst_.870_288 = {D.4514_77, D.4514_77};
  vect_cst_.878_323 = {D.4519_82, D.4519_82};
  vect_cst_.886_330 = {D.4522_86, D.4522_86};
  vect_cst_.894_337 = {D.4526_90, D.4526_90};
  vect_var_.853_205 = MEM[(real(kind=8)[36] *)generalized_constitutive_tensor];
  vect_var_.854_210 = vect_var_.853_205 * vect_cst_.855_301;
  vect_var_.860_211 = MEM[(real(kind=8)[36] *)generalized_constitutive_tensor + 48B];
  vect_var_.861_214 = vect_var_.860_211 * vect_cst_.862_295;
  vect_var_.863_215 = vect_var_.861_214 + vect_var_.854_210;
  vect_var_.868_220 = MEM[(real(kind=8)[36] *)generalized_constitutive_tensor + 96B];
  vect_var_.869_221 = vect_var_.868_220 * vect_cst_.870_288;
  vect_var_.871_224 = vect_var_.863_215 + vect_var_.869_221;
  vect_var_.876_225 = MEM[(real(kind=8)[36] *)generalized_constitutive_tensor + 144B];

We would do better to unroll this and optimize away the zero terms. Without -ftree-vectorize, however, we still don't do this transform; we end up with:

  generalized_constitutive_tensor = {};
  D.4502_45 = *lambda_44(D);
  D.4503_47 = *mu_46(D);
  D.4504_48 = D.4503_47 * 2.0e+0;
  D.4505_49 = D.4504_48 + D.4502_45;
  generalized_constitutive_tensor[1] = D.4502_45;
  generalized_constitutive_tensor[7] = D.4505_49;
  generalized_constitutive_tensor[13] = D.4502_45;
  generalized_constitutive_tensor[2] = D.4502_45;
  generalized_constitutive_tensor[8] = D.4502_45;
  generalized_constitutive_tensor[14] = D.4505_49;
  generalized_constitutive_tensor[21] = D.4503_47;
  generalized_constitutive_tensor[28] = D.4503_47;
  generalized_constitutive_tensor[35] = D.4503_47;
  pretmp.827_334 = generalized_constitutive_tensor[1];
  pretmp.830_336 = generalized_constitutive_tensor[7];
  pretmp.832_338 = generalized_constitutive_tensor[13];
  pretmp.834_340 = generalized_constitutive_tensor[19];
  pretmp.836_342 = generalized_constitutive_tensor[25];
  pretmp.838_344 = generalized_constitutive_tensor[31];

So copy propagation and SRA are missing. Moreover we can't figure out that generalized_constitutive_tensor[31] is 0. It is quite a good testcase for optimization-queue ordering.

Honza
Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What             |Removed |Added
           -------------------------------------------
           Last reconfirmed |        |2011-01-23 15:59:30

--- Comment #19 from Jan Hubicka <hubicka at gcc dot gnu.org> 2011-01-23 21:05:51 UTC ---
This adds enough passes so that we generate sane code for hookes_law (and we do that before inlining):

Index: passes.c
===================================================================
--- passes.c    (revision 169136)
+++ passes.c    (working copy)
@@ -775,6 +775,14 @@
       NEXT_PASS (pass_convert_switch);
       NEXT_PASS (pass_cleanup_eh);
       NEXT_PASS (pass_profile);
+      NEXT_PASS (pass_tree_loop_init);
+      NEXT_PASS (pass_complete_unroll);
+      NEXT_PASS (pass_tree_loop_done);
+      NEXT_PASS (pass_ccp);
+      NEXT_PASS (pass_fre);
+      NEXT_PASS (pass_dse);
+      NEXT_PASS (pass_fre);
+      NEXT_PASS (pass_cd_dce);
       NEXT_PASS (pass_local_pure_const);
       /* Split functions creates parts that are not run through early
          optimizations again.  It is thus good idea to do this
@@ -782,7 +790,7 @@

We need to unroll the loop, then do CCP to get constant array indexes, then FRE to propagate through memory accesses. For some reason FRE is needed twice, or the loads from the temporary array are not copy propagated. I didn't test whether DSE is really needed or whether cd_dce gets rid of the dead store into the array. Still a lot of copy-propagation opportunity is left. This makes the hookes_law estimate 91 instructions, so -finline-limit=183 should be enough.
--- Comment #20 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-01-23 23:20:34 UTC ---
> This makes the hookes_law estimate 91 instructions, so -finline-limit=183 should be enough.

With the patch in comment #19, I rather find a threshold of -finline-limit=256. On top of that, as shown by the timings below, the patch increases the threshold for ac.f90 and breaks the vectorization for induct.f90. Would the patch in comment #15 and an increase of the default value of -finline-limit to 300 be acceptable at this stage (with the usual bells and whistles: SPEC, ...)?

Date Time       : 23 Jan 2011 23:18:23
Test Name       : pbharness
Compile Command : gfcp %n.f90 -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=300 -fwhole-program -flto -o %n
Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   : 300.0
Target Error %  : 0.200
Minimum Repeats : 2
Maximum Repeats : 5

Benchmark   Compile  Executable  Ave Run  Number   Estim
Name         (secs)     (bytes)   (secs)  Repeats  Err %
---------   -------  ----------  -------  -------  ------
ac             3.15       50536     9.58     2     0.0156
aermod       104.98     1652280    18.79     2     0.1011
air            8.83       90048     6.99     5     0.7334
capacita       5.95       89056    40.21     2     0.0174
channel        1.65       34448     2.99     2     0.0502
doduc         14.59      208056    27.91     2     0.0036
fatigue        4.80       89264     4.72     2     0.0212
gas_dyn       11.65      148176     4.66     5     0.4391
induct        11.20      205976    22.34     2     0.0672
linpk          1.59       21536    21.70     2     0.0299
mdbx           5.78       84760    12.58     2     0.0119
nf             7.60       83712    29.53     5     0.3854
protein       11.69      163760    35.18     2     0.1109
rnflow        15.23      167296    26.97     2     0.0890
test_fpu      11.33      145848    11.06     5     0.3715
tfft           1.13       22072     3.30     2     0.0607

Geometric Mean Execution Time = 12.89 seconds

Date Time       : 23 Jan 2011 23:54:28
Test Name       : pbharness
Compile Command : gfcp %n.f90 -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600 -fwhole-program -flto -o %n
Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   : 300.0
Target Error %  : 0.200
Minimum Repeats : 2
Maximum Repeats : 5

Benchmark   Compile  Executable  Ave Run  Number   Estim
Name         (secs)     (bytes)   (secs)  Repeats  Err %
---------   -------  ----------  -------  -------  ------
ac             3.59       54576     8.10     2     0.0062
aermod       103.73     1558344    18.91     2     0.0238
air           10.47       89992     6.77     5     0.1563
capacita       7.47      101344    40.08     2     0.0137
channel        1.65       34448     2.97     5     0.5872
doduc         15.82      216376    27.61     2     0.
fatigue        5.10       89264     4.73     2     0.
gas_dyn       12.09      152264     4.69     5     0.6428
induct        11.10      205976    22.33     2     0.0403
linpk          1.59       21536    21.72     2     0.0368
mdbx           5.85       84760    12.58     2     0.0517
nf            11.34      108280    28.98     2     0.1087
protein       11.65      163760    35.18     3     0.1422
rnflow        17.39      183696    26.71     2     0.0243
test_fpu      11.49      145816    11.02     2     0.1226
tfft           1.43       22072     3.29     2     0.0911

Geometric Mean Execution Time = 12.70 seconds
--- Comment #11 from Jan Hubicka <hubicka at gcc dot gnu.org> 2011-01-08 20:08:26 UTC ---
Does --param hot-bb-frequency-fraction=10 work here?

> This is weird!-( I have done the following profiling and it shows that -flto
> prevents the inlining of __perdida_m_MOD_perdida, while -fno-inline-functions
> restores it. This contradicts what the manual says:
>
>   -finline-functions
>     Integrate all simple functions into their callers. The compiler
>     heuristically decides which functions are simple enough to be worth
>     integrating in this way.

Disabling autoinlining of small functions can allow other inlining (inlining functions called once, or inlining for size), so this is not completely unexpected.
--- Comment #10 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2010-09-30 17:28:19 UTC ---
(In reply to comment #8)
> Using -fno-inline-functions, the program recovers the speed of the no-LTO version.

This does not work on powerpc-apple-darwin9:

[karma] lin/test% gfc -Ofast -funroll-loops -fwhole-program -g fatigue.f90
[karma] lin/test% time a.out > /dev/null
15.942u 0.052s 0:16.54 96.6% 0+0k 2+1io 40pf+0w
[karma] lin/test% gfc -Ofast -funroll-loops -fwhole-program -g -flto fatigue.f90
[karma] lin/test% time a.out > /dev/null
20.330u 0.063s 0:21.06 96.8% 0+0k 0+2io 0pf+0w
[karma] lin/test% gfc -Ofast -funroll-loops -fwhole-program -g -flto -fno-inline-functions fatigue.f90
[karma] lin/test% time a.out > /dev/null
20.678u 0.063s 0:21.33 97.1% 0+0k 0+2io 0pf+0w
[karma] lin/test% gfc -Ofast -funroll-loops -fwhole-program -g -flto -finline-limit=600 fatigue.f90
[karma] lin/test% time a.out > /dev/null
10.903u 0.036s 0:11.30 96.7% 0+0k 0+2io 0pf+0w
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #9 from Dominique d'Humieres dominiq at lps dot ens.fr 2010-09-29 20:27:36 UTC ---
(In reply to comment #8)
> Using -fno-inline-functions, the program recovers the speed of the no-LTO
> version.

This is weird!-( I have done the following profiling and it shows that -flto
prevents the inlining of __perdida_m_MOD_perdida, while -fno-inline-functions
restores it. This contradicts what the manual says:

  -finline-functions
      Integrate all simple functions into their callers. The compiler
      heuristically decides which functions are simple enough to be worth
      integrating in this way.

Note also that in order to inline __perdida_m_MOD_generalized_hookes_law one
needs -finline-limit=600 (actually some number between 300 and 400).

[macbook] lin/test% gfc -Ofast -funroll-loops -fwhole-program -g fatigue.f90
[macbook] lin/test% time a.out > /dev/null
6.547u 0.024s 0:06.57 99.8% 0+0k 0+2io 0pf+0w

+ 70.8%, MAIN__, a.out
| + 10.1%, free, libSystem.B.dylib
| |   7.9%, szone_size, libSystem.B.dylib
| + 8.0%, malloc, libSystem.B.dylib
| | + 6.4%, malloc_zone_malloc, libSystem.B.dylib
| | |   4.4%, szone_malloc_should_clear, libSystem.B.dylib
| | |   0.4%, szone_malloc, libSystem.B.dylib
| |   0.4%, dyld_stub_malloc_zone_malloc, libSystem.B.dylib
| |   0.1%, szone_malloc_should_clear, libSystem.B.dylib
|   4.1%, szone_free_definite_size, libSystem.B.dylib
|   2.4%, cosisin, libSystem.B.dylib
| + 0.7%, cexp, libSystem.B.dylib
| |   0.1%, exp$fenv_access_off, libSystem.B.dylib
| |   0.0%, dyld_stub_exp, libSystem.B.dylib
 27.2%, __perdida_m_MOD_generalized_hookes_law, a.out
  0.5%, dyld_stub_malloc, a.out
  0.4%, free, libSystem.B.dylib
  0.4%, dyld_stub_free, a.out
  0.4%, szone_free_definite_size, libSystem.B.dylib
  0.2%, malloc, libSystem.B.dylib
  0.1%, dyld_stub_cexp, a.out
  0.0%, cexp, libSystem.B.dylib

[macbook] lin/test% gfc -Ofast -funroll-loops -fwhole-program -flto fatigue.f90
[macbook] lin/test% time a.out > /dev/null
9.013u 0.027s 0:09.04 99.8% 0+0k 0+2io 0pf+0w

+ 64.8%, __perdida_m_MOD_perdida, a.out
| + 6.8%, free, libSystem.B.dylib
| |   4.9%, szone_size, libSystem.B.dylib
| + 5.2%, malloc, libSystem.B.dylib
| | + 4.1%, malloc_zone_malloc, libSystem.B.dylib
| | |   2.5%, szone_malloc_should_clear, libSystem.B.dylib
| | |   0.5%, szone_malloc, libSystem.B.dylib
| |   0.3%, dyld_stub_malloc_zone_malloc, libSystem.B.dylib
|   3.1%, szone_free_definite_size, libSystem.B.dylib
 19.3%, __perdida_m_MOD_generalized_hookes_law, a.out
+ 14.6%, MAIN__.2130, a.out
|   1.8%, cosisin, libSystem.B.dylib
| + 0.4%, cexp, libSystem.B.dylib
| |   0.1%, exp$fenv_access_off, libSystem.B.dylib
| |   0.0%, dyld_stub_exp, libSystem.B.dylib
| |   0.0%, cosisin, libSystem.B.dylib
  0.3%, szone_free_definite_size, libSystem.B.dylib
  0.3%, dyld_stub_malloc, a.out
  0.3%, dyld_stub_free, a.out
  0.2%, free, libSystem.B.dylib
  0.2%, malloc, libSystem.B.dylib
  0.0%, cexp, libSystem.B.dylib
  0.0%, data_transfer_init, libgfortran.3.dylib

[macbook] lin/test% gfc -Ofast -funroll-loops -fwhole-program -flto -fno-inline-functions fatigue.f90
[macbook] lin/test% time a.out > /dev/null
6.575u 0.021s 0:06.61 99.6% 0+0k 0+2io 0pf+0w

+ 71.0%, MAIN__.2130, a.out
| + 8.9%, free, libSystem.B.dylib
| |   6.6%, szone_size, libSystem.B.dylib
| + 8.1%, malloc, libSystem.B.dylib
| | + 6.4%, malloc_zone_malloc, libSystem.B.dylib
| | |   4.5%, szone_malloc_should_clear, libSystem.B.dylib
| | |   0.6%, szone_malloc, libSystem.B.dylib
| |   0.4%, dyld_stub_malloc_zone_malloc, libSystem.B.dylib
| |   0.2%, szone_malloc_should_clear, libSystem.B.dylib
|   4.4%, szone_free_definite_size, libSystem.B.dylib
|   1.9%, cosisin, libSystem.B.dylib
| + 1.0%, cexp, libSystem.B.dylib
| |   0.1%, exp$fenv_access_off, libSystem.B.dylib
| |   0.1%, cosisin, libSystem.B.dylib
| |   0.0%, dyld_stub_exp, libSystem.B.dylib
 27.3%, __perdida_m_MOD_generalized_hookes_law, a.out
  0.4%, free, libSystem.B.dylib
  0.3%, dyld_stub_malloc, a.out
  0.3%, dyld_stub_free, a.out
  0.3%, szone_free_definite_size, libSystem.B.dylib
  0.2%, malloc, libSystem.B.dylib
  0.1%, dyld_stub_cexp, a.out
  0.0%, cexp, libSystem.B.dylib

[macbook] lin/test% gfc -Ofast -funroll-loops -fwhole-program -flto -finline-limit=600 fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.768u 0.018s 0:04.79 99.5% 0+0k 0+1io 0pf+0w

+ 97.5%, MAIN__.2133, a.out
| + 15.4%, free, libSystem.B.dylib
| |  10.6%, szone_size, libSystem.B.dylib
| + 11.4%, malloc, libSystem.B.dylib
| | + 9.6%, malloc_zone_malloc, libSystem.B.dylib
| | |   4.9%, szone_malloc_should_clear, libSystem.B.dylib
| | |   0.9%, szone_malloc, libSystem.B.dylib
| |   0.4%, dyld_stub_malloc_zone_malloc, libSystem.B.dylib
|   6.4%, szone_free_definite_size, libSystem.B.dylib
|   2.7%, cosisin, libSystem.B.dylib
| + 0.8%, cexp, libSystem.B.dylib
| |   0.1%, exp$fenv_access_off, libSystem.B.dylib
| |   0.1%, cosisin, libSystem.B.dylib
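The inline-limit experiment above can be reproduced in miniature without the Polyhedron sources. The sketch below is illustrative only: it substitutes a hypothetical tiny C file for fatigue.f90 and uses gcc instead of gfortran, just to show the flags under discussion (-O2, -finline-limit, -Winline) and how to check the result in the generated assembler.

```shell
# Illustration only: demo.c is a made-up stand-in for fatigue.f90; the flag
# names are the ones discussed in the thread, the values are arbitrary here.
cat > demo.c <<'EOF'
static int square(int x) { return x * x; }  /* small, normally auto-inlined */
int main(void) { return square(3) - 9; }    /* exits 0 if square() works */
EOF
if command -v gcc >/dev/null 2>&1; then
  # -finline-limit scales the inliner's size caps (comment #9 needed 300-600);
  # -Winline warns when an inline candidate is rejected by the heuristics.
  gcc -O2 -finline-limit=600 -Winline -S demo.c -o demo.s
  # After inlining, main should contain no call instruction on most targets.
  grep -c 'call' demo.s || true
else
  echo "gcc not available; flags shown for reference only"
fi
```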
[Bug lto/45810] 40% slowdown when using LTO for a single-file program
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

Tobias Burnus burnus at gcc dot gnu.org changed:

           What    |Removed |Added
----------------------------------------------------
                 CC|        |hubicka at gcc dot gnu.org

--- Comment #3 from Tobias Burnus burnus at gcc dot gnu.org 2010-09-28 12:23:06 UTC ---
(In reply to comment #2)
> For single-file programs -fwhole-program and -flto should be basically
> equivalent if the Frontend provides correctly merged decls. I suppose it
> does not and thus we do less inlining with -fwhole-program compared to
> -flto.

It might well be the reason that one does less inlining without LTO - but
that's then not only a FE bug (not correctly merged decls) but also a
ME/target bug as the LTO program is _slower_.

Cf. also PR 44334, which is about a -fwhole-program slowdown (w/ and w/o
-flto). For the latter program, it helped to use
--param hot-bb-frequency-fraction=2000. However, for this PR, the option does
not seem to help.
[Bug lto/45810] 40% slowdown when using LTO for a single-file program
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #4 from Richard Guenther rguenth at gcc dot gnu.org 2010-09-28 13:38:58 UTC ---
(In reply to comment #3)
> (In reply to comment #2)
> > For single-file programs -fwhole-program and -flto should be basically
> > equivalent if the Frontend provides correctly merged decls. I suppose it
> > does not and thus we do less inlining with -fwhole-program compared to
> > -flto.
>
> It might well be the reason that one does less inlining without LTO -

but more inlining with LTO. You read my stmt wrong.

> that's then not only a FE bug (not correctly merged decls) but also a
> ME/target bug as the LTO program is _slower_.

Sure. As with all performance related bugs this needs analysis and is
unlikely an LTO problem - LTO does not (not-)optimize, optimization passes
do.

> Cf. also PR 44334, which is about a -fwhole-program slowdown (w/ and w/o
> -flto). For the latter program, it helped to use
> --param hot-bb-frequency-fraction=2000. However, for this PR, the option
> does not seem to help.
[Bug lto/45810] 40% slowdown when using LTO for a single-file program
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #5 from Joost VandeVondele Joost.VandeVondele at pci dot uzh.ch 2010-09-28 13:58:18 UTC ---
(In reply to comment #4)
> Sure. As with all performance related bugs this needs analysis and is
> unlikely an LTO problem - LTO does not (not-)optimize, optimization passes
> do.

I'm wondering if there is any description on how to do this. For example, how
do I get the assembly of a function and the -fdump-tree-all files from a
gold-based linking that goes as:

rm -f test.s test2.s test.o test2.o ; gfortran -c -flto test.f90 ; \
gfortran -c -flto test2.f90 ; \
gfortran -O3 -march=native -fuse-linker-plugin -fwhopr=2 test.o test2.o

just using -S or -fdump-tree-all doesn't work. Is 'objdump -d' the only tool?
[Bug lto/45810] 40% slowdown when using LTO for a single-file program
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #6 from Richard Guenther rguenth at gcc dot gnu.org 2010-09-28 14:07:54 UTC ---
(In reply to comment #5)
> I'm wondering if there is any description on how to do this. For example,
> how do I get the assembly of a function and the -fdump-tree-all files from
> a gold-based linking that goes as:
>
> rm -f test.s test2.s test.o test2.o ; gfortran -c -flto test.f90 ;
> gfortran -c -flto test2.f90 ;
> gfortran -O3 -march=native -fuse-linker-plugin -fwhopr=2 test.o test2.o
>
> just using -S or -fdump-tree-all doesn't work. Is 'objdump -d' the only
> tool?

No, -fdump-tree-all works; it just uses maybe un-intuitive base-names. Append
-v to see them. For -fwhopr the base should be the output file specified with
-o (which you leave out, which causes us to use not a.out but some temporary
file in /tmp); with -o t I get t.ltrans[01].147t.optimized, etc. With -flto
it's just t.147t.optimized. To retain the assembler you have to use
-save-temps, which retains t.ltrans[01].s; with -flto it retains t1.s (using
the base of the first object file).
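Comment #6's recipe can be collected into one self-contained sequence. The sketch below uses two stub Fortran files standing in for the reporter's sources; note that the dump names quoted above (t.ltrans[01].147t.optimized, t1.s) are GCC 4.5-era, later releases number the passes differently, and -fwhopr was subsequently merged into -flto, which is what the sketch uses.

```shell
# Stub sources standing in for the test.f90/test2.f90 of comment #5.
cat > test2.f90 <<'EOF'
subroutine hello            ! stub callee living in the second file
  print *, 'hello'
end subroutine hello
EOF
cat > test.f90 <<'EOF'
program p                   ! stub main program in the first file
  call hello
end program p
EOF
if command -v gfortran >/dev/null 2>&1; then
  gfortran -c -flto test.f90
  gfortran -c -flto test2.f90
  # Give the link step an explicit -o so the LTO dumps get stable base names
  # (without -o they land under a temporary name, possibly in /tmp);
  # -save-temps retains the post-LTO assembler next to them.
  gfortran -O3 -flto -fuse-linker-plugin -fdump-tree-optimized -save-temps \
           -o t test.o test2.o || echo "link step failed on this setup"
  ls t* 2>/dev/null || true
else
  echo "gfortran not available; commands shown for reference only"
fi
```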
[Bug lto/45810] 40% slowdown when using LTO for a single-file program
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #7 from Joost VandeVondele Joost.VandeVondele at pci dot uzh.ch 2010-09-28 14:19:38 UTC ---
(In reply to comment #6)
> No, -fdump-tree-all works

great... I forgot to look in /tmp, and -save-temps also works fine.
[Bug lto/45810] 40% slowdown when using LTO for a single-file program
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #8 from Tobias Burnus burnus at gcc dot gnu.org 2010-09-28 14:57:34 UTC ---
Using -fno-inline-functions, the program recovers the speed of the no-LTO
version.

Notes from #gcc:
(dominiq) For fatigue the key for speed-up is inlining of
          generalized_hookes_law and you need -finline-limit=400
(richi)   Considering inline candidate generalized_hookes_law. /
          Inlining failed: --param max-inline-insns-auto limit reached
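The "limit reached" note above names the concrete knob. A hedged sketch of the implied workaround follows; the value 400 comes from the #gcc notes and is specific to fatigue.f90 (the Polyhedron benchmark source, assumed to be present), and comment #9 reports that -finline-limit=600 has the same effect for this case.

```shell
# Workaround sketch for the "--param max-inline-insns-auto limit reached" note.
# The parameter value is the one reported on #gcc for fatigue.f90 only.
set -- gfortran -Ofast -funroll-loops -fwhole-program -flto \
       --param max-inline-insns-auto=400 fatigue.f90
echo "suggested: $*" | tee suggested-cmd.txt
if command -v gfortran >/dev/null 2>&1 && [ -f fatigue.f90 ]; then
  "$@"   # only run when both the compiler and the benchmark source exist
fi
```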
[Bug lto/45810] 40% slowdown when using LTO for a single-file program
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

Joost VandeVondele Joost.VandeVondele at pci dot uzh.ch changed:

           What    |Removed |Added
----------------------------------------------------
                 CC|        |Joost.VandeVondele at pci
                   |        |dot uzh.ch

--- Comment #1 from Joost VandeVondele Joost.VandeVondele at pci dot uzh.ch 2010-09-27 10:39:05 UTC ---
I have observed a similar 40% slowdown in CP2K as a result of LTO. I haven't
yet investigated.
[Bug lto/45810] 40% slowdown when using LTO for a single-file program
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810 --- Comment #2 from Richard Guenther rguenth at gcc dot gnu.org 2010-09-27 10:48:33 UTC --- For single-file programs -fwhole-program and -flto should be basically equivalent if the Frontend provides correctly merged decls. I suppose it does not and thus we do less inlining with -fwhole-program compared to -flto.