[Bug middle-end/63155] [6/7/8 Regression] memory hog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #50 from Rogério de Souza Moraes ---
Created attachment 44848
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44848=edit
GCC 6.3.0 consolidated patch based on Richard's patches

The attached patch is a backport of Richard's patches to GCC v6.3.0.
If there are any issues, please let me know.

Regards,
--
Rogerio
[Bug middle-end/63155] [6/7/8 Regression] memory hog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #49 from Richard Biener ---
Author: rguenth
Date: Wed Oct 17 08:49:00 2018
New Revision: 265235

URL: https://gcc.gnu.org/viewcvs?rev=265235=gcc=rev
Log:
2018-10-16  Richard Biener

    Backport from mainline
    2018-10-08  Richard Biener

    PR tree-optimization/63155
    * tree-ssa-propagate.c (add_ssa_edge): Do cheap check first.
    (ssa_propagation_engine::ssa_propagate): Remove redundant
    bitmap bit clearing.

    2018-10-05  Richard Biener

    PR tree-optimization/63155
    * tree-ssa-ccp.c (ccp_propagate::visit_phi): Avoid excess
    vertical space in dumpfiles.
    * tree-ssa-propagate.h
    (ssa_propagation_engine::process_ssa_edge_worklist): Remove.
    * tree-ssa-propagate.c (cfg_blocks_back): New global.
    (ssa_edge_worklist_back): Likewise.
    (curr_order): Likewise.
    (cfg_blocks_get): Remove abstraction.
    (cfg_blocks_add): Likewise.
    (cfg_blocks_empty_p): Likewise.
    (add_ssa_edge): Add to current or next worklist based on
    RPO index.
    (add_control_edge): Likewise.
    (ssa_propagation_engine::process_ssa_edge_worklist): Fold
    into ...
    (ssa_propagation_engine::ssa_propagate): ... here.  Unify
    iteration from CFG and SSA edge worklist so we process
    everything in RPO order, prioritizing forward progress
    over iteration.
    (ssa_prop_init): Allocate new worklists, do not dump
    immediate uses.
    (ssa_prop_fini): Free new worklists.

    2018-09-24  Richard Biener

    PR tree-optimization/63155
    * tree-ssa-propagate.c (add_ssa_edge): Avoid adding PHIs to
    the worklist when the edge of the respective argument isn't
    executable.

Modified:
    branches/gcc-8-branch/gcc/tree-ssa-ccp.c
    branches/gcc-8-branch/gcc/tree-ssa-propagate.h
[Bug middle-end/63155] [6/7/8 Regression] memory hog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #48 from Richard Biener ---
Author: rguenth
Date: Wed Oct 17 07:01:28 2018
New Revision: 265231

URL: https://gcc.gnu.org/viewcvs?rev=265231=gcc=rev
Log:
2018-10-17  Richard Biener

    Backport from mainline
    2018-10-08  Richard Sandiford

    PR middle-end/63155
    * gimple-ssa-backprop.c (backprop::intersect_uses): Use
    FOR_EACH_IMM_USE_FAST instead of FOR_EACH_IMM_USE_STMT.

Modified:
    branches/gcc-8-branch/gcc/ChangeLog
    branches/gcc-8-branch/gcc/gimple-ssa-backprop.c
[Bug middle-end/63155] [6/7/8 Regression] memory hog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #47 from Richard Biener ---
Author: rguenth
Date: Tue Oct 16 13:23:56 2018
New Revision: 265193

URL: https://gcc.gnu.org/viewcvs?rev=265193=gcc=rev
Log:
2018-10-16  Richard Biener

    Backport from mainline
    2018-10-08  Richard Biener

    PR tree-optimization/63155
    * tree-ssa-propagate.c (add_ssa_edge): Do cheap check first.
    (ssa_propagation_engine::ssa_propagate): Remove redundant
    bitmap bit clearing.

    2018-10-05  Richard Biener

    PR tree-optimization/63155
    * tree-ssa-ccp.c (ccp_propagate::visit_phi): Avoid excess
    vertical space in dumpfiles.
    * tree-ssa-propagate.h
    (ssa_propagation_engine::process_ssa_edge_worklist): Remove.
    * tree-ssa-propagate.c (cfg_blocks_back): New global.
    (ssa_edge_worklist_back): Likewise.
    (curr_order): Likewise.
    (cfg_blocks_get): Remove abstraction.
    (cfg_blocks_add): Likewise.
    (cfg_blocks_empty_p): Likewise.
    (add_ssa_edge): Add to current or next worklist based on
    RPO index.
    (add_control_edge): Likewise.
    (ssa_propagation_engine::process_ssa_edge_worklist): Fold
    into ...
    (ssa_propagation_engine::ssa_propagate): ... here.  Unify
    iteration from CFG and SSA edge worklist so we process
    everything in RPO order, prioritizing forward progress
    over iteration.
    (ssa_prop_init): Allocate new worklists, do not dump
    immediate uses.
    (ssa_prop_fini): Free new worklists.

    2018-09-24  Richard Biener

    PR tree-optimization/63155
    * tree-ssa-propagate.c (add_ssa_edge): Avoid adding PHIs to
    the worklist when the edge of the respective argument isn't
    executable.

Modified:
    branches/gcc-8-branch/gcc/ChangeLog
    branches/gcc-8-branch/gcc/tree-ssa-propagate.c
[Bug middle-end/63155] [6/7/8 Regression] memory hog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #46 from Richard Biener ---
Author: rguenth
Date: Tue Oct 16 11:23:22 2018
New Revision: 265189

URL: https://gcc.gnu.org/viewcvs?rev=265189=gcc=rev
Log:
2018-10-16  Richard Biener

    Backport from mainline
    2018-09-18  Richard Biener

    PR middle-end/63155
    * tree-ssa-coalesce.c (tree_int_map_hasher): Remove.
    (compute_samebase_partition_bases): Likewise.
    (coalesce_ssa_name): Always use compute_optimized_partition_bases.
    (gimple_can_coalesce_p): Simplify.

Modified:
    branches/gcc-8-branch/gcc/ChangeLog
    branches/gcc-8-branch/gcc/tree-ssa-coalesce.c
[Bug middle-end/63155] [6/7/8 Regression] memory hog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #45 from Richard Biener ---
Author: rguenth
Date: Tue Oct  9 11:43:46 2018
New Revision: 264956

URL: https://gcc.gnu.org/viewcvs?rev=264956=gcc=rev
Log:
2018-10-09  Richard Biener

    PR tree-optimization/63155
    * tree-ssa-structalias.c: Include tree-ssa.h.
    (get_constraint_for_ssa_var): For undefs return nothing_id.
    (find_func_aliases): Cleanup PHI handling.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-structalias.c
[Bug middle-end/63155] [6/7/8 Regression] memory hog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #44 from Richard Biener ---
(In reply to Richard Biener from comment #43)
> This makes CCP the main
> offender again but as said the rectification would probably mean pulling
> back the SSA SCC discovery code from SCCVN and use that in the SSA
> propagator somehow.

I take that back.  SCC processing is quite fundamentally incompatible with
the way SSA propagation works.  But what would be possible is to add
a non-optimistic mode to the SSA propagator removing the need to iterate
at all.  That's some non-trivial work though, possibly better spent
teaching value-numbering the bits of CCP that it doesn't do (bit-value
tracking, UNDEF handling) and then kill off CCP altogether.
[Bug middle-end/63155] [6/7/8 Regression] memory hog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #43 from Richard Biener ---
We're now down to

 tree PTA                :   3.92 ( 16%)   0.12 ( 36%)   4.02 ( 16%)   12445 kB (  2%)
 tree CCP                :   7.43 ( 30%)   0.02 (  6%)   7.44 ( 29%)     646 kB (  0%)
 tree FRE                :   2.34 (  9%)   0.00 (  0%)   2.35 (  9%)     116 kB (  0%)
 tree backward propagate :   0.62 (  2%)   0.00 (  0%)   0.62 (  2%)       0 kB (  0%)
 out of ssa              :   3.01 ( 12%)   0.00 (  0%)   3.01 ( 12%)       0 kB (  0%)
 TOTAL                   :  24.91          0.33         25.26          573769 kB

Notice the tree backward propagate improvement.  This makes CCP the main
offender again but as said the rectification would probably mean pulling
back the SSA SCC discovery code from SCCVN and use that in the SSA
propagator somehow.

The out of SSA time is what was originally the topic of this bug.

The tree PTA time is "new" and related to the number of PHI nodes and edges.
You can disable PTA via -fno-tree-pta.

The tree FRE time is PHI lookups/inserts, some refactoring can speed this
up a bit.
[Bug middle-end/63155] [6/7/8 Regression] memory hog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #42 from rsandifo at gcc dot gnu.org ---
Author: rsandifo
Date: Mon Oct  8 18:58:59 2018
New Revision: 264941

URL: https://gcc.gnu.org/viewcvs?rev=264941=gcc=rev
Log:
Use FOR_EACH_IMM_USE_FAST in gimple-ssa-backprop.c

As pointed out by Richard in PR63155.  It speeds up the testcase a few %.

2018-10-08  Richard Sandiford

gcc/
    PR middle-end/63155
    * gimple-ssa-backprop.c (backprop::intersect_uses): Use
    FOR_EACH_IMM_USE_FAST instead of FOR_EACH_IMM_USE_STMT.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/gimple-ssa-backprop.c
[Bug middle-end/63155] [6/7/8 Regression] memory hog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #41 from rsandifo at gcc dot gnu.org ---
(In reply to Richard Biener from comment #39)
> Oh, and backprop is really intersect_uses () with
>
>   FOR_EACH_IMM_USE_STMT (stmt, iter, var)
>     {
>
> being quadratic due to its stupid implementation (we really have many uses
> of vars).

Ouch, hadn't realised the difference between them was that severe.

> If the pass can deal with duplicate stmt uses just fine using
> FOR_EACH_IMM_USE_FAST is going to be faster.

Yeah, should be fine here, since the function is just gathering
information.  Testing a patch...
[Bug middle-end/63155] [6/7/8 Regression] memory hog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #40 from Richard Biener ---
Author: rguenth
Date: Mon Oct  8 07:16:28 2018
New Revision: 264912

URL: https://gcc.gnu.org/viewcvs?rev=264912=gcc=rev
Log:
2018-10-08  Richard Biener

    PR tree-optimization/63155
    * tree-ssa-propagate.c (add_ssa_edge): Do cheap check first.
    (ssa_propagation_engine::ssa_propagate): Remove redundant
    bitmap bit clearing.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-propagate.c
[Bug middle-end/63155] [6/7/8 Regression] memory hog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #39 from Richard Biener ---
Oh, and backprop is really intersect_uses () with

  FOR_EACH_IMM_USE_STMT (stmt, iter, var)
    {

being quadratic due to its stupid implementation (we really have many uses
of vars).

If the pass can deal with duplicate stmt uses just fine using
FOR_EACH_IMM_USE_FAST is going to be faster.
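The cost difference described in comment #39 can be illustrated with a toy model. The sketch below is not GCC's implementation: it only assumes that a statement-grouped walk rescans the rest of the use list for every statement it visits, which is the kind of behaviour that turns quadratic when a variable has many uses. All names (`UseList`, `make_distinct_uses`, `walk_fast`, `walk_grouped`) are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <vector>

// Toy model only: an "immediate use" is just the id of the statement
// that contains it.  None of these names are GCC's.
using UseList = std::list<int>;

// Builds a list of n uses, each in a different statement.
UseList make_distinct_uses(int n) {
  assert(n >= 0);
  UseList uses;
  for (int i = 0; i < n; ++i)
    uses.push_back(i);
  return uses;
}

// Models a per-use walk: one step per use, so a statement with
// several uses is reported several times.
std::size_t walk_fast(const UseList &uses, std::vector<int> &visited) {
  std::size_t steps = 0;
  for (int stmt : uses) {
    visited.push_back(stmt);
    ++steps;
  }
  return steps;
}

// Models a per-statement walk: each statement is reported once, but
// every remaining use is inspected to collect that statement's other uses.
std::size_t walk_grouped(UseList uses, std::vector<int> &visited) {
  std::size_t steps = 0;
  while (!uses.empty()) {
    int stmt = uses.front();
    uses.pop_front();
    visited.push_back(stmt);
    ++steps;
    for (auto it = uses.begin(); it != uses.end();) {
      ++steps;  // inspect one remaining use
      it = (*it == stmt) ? uses.erase(it) : std::next(it);
    }
  }
  return steps;
}
```

In this model, 1000 uses in distinct statements cost 1000 steps with `walk_fast` and 500500 steps with `walk_grouped`. The price of the per-use walk is that a statement with several uses is reported more than once, which is acceptable when the caller only gathers information, as comment #41 notes for `intersect_uses`.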
[Bug middle-end/63155] [6/7/8 Regression] memory hog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #38 from Richard Biener ---
For the last testcase the compile-time on trunk is now 25s at -O1:

 tree PTA                :   3.37 ( 13%)   0.10 ( 30%)   3.46 ( 13%)   12445 kB (  2%)
 tree CCP                :   4.61 ( 18%)   0.00 (  0%)   4.62 ( 18%)     646 kB (  0%)
 tree FRE                :   2.21 (  9%)   0.01 (  3%)   2.21 (  9%)     116 kB (  0%)
 tree backward propagate :   5.03 ( 20%)   0.00 (  0%)   5.04 ( 20%)       0 kB (  0%)
 out of ssa              :   3.05 ( 12%)   0.00 (  0%)   3.05 ( 12%)       0 kB (  0%)
 TOTAL                   :  25.39          0.33         25.72          573954 kB

and perf:

Samples: 9K of event 'instructions', Event count (approx.): 107285199390
Overhead  Samples  Command  Shared Object  Symbol
  18.06%     1195  cc1      cc1            [.] (anonymous namespace)::backprop::process_var
   5.58%      560  cc1      cc1            [.] visit_phi
   5.21%      476  cc1      cc1            [.] inchash::add_expr
   5.21%      671  cc1      cc1            [.] VN_INFO
   5.14%      493  cc1      cc1            [.] bitmap_set_bit
   3.13%      296  cc1      cc1            [.] hash_table::find_with_hash
   2.99%      287  cc1      cc1            [.] vn_phi_lookup
   2.39%      229  cc1      cc1            [.] bitmap_ior_into
   1.77%      165  cc1      cc1            [.] do_rpo_vn
[Bug middle-end/63155] [6/7/8 Regression] memory hog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #37 from Richard Biener ---
Author: rguenth
Date: Fri Oct  5 12:54:51 2018
New Revision: 264869

URL: https://gcc.gnu.org/viewcvs?rev=264869=gcc=rev
Log:
2018-10-05  Richard Biener

    PR tree-optimization/63155
    * tree-ssa-ccp.c (ccp_propagate::visit_phi): Avoid excess
    vertical space in dumpfiles.
    * tree-ssa-propagate.h
    (ssa_propagation_engine::process_ssa_edge_worklist): Remove.
    * tree-ssa-propagate.c (cfg_blocks_back): New global.
    (ssa_edge_worklist_back): Likewise.
    (curr_order): Likewise.
    (cfg_blocks_get): Remove abstraction.
    (cfg_blocks_add): Likewise.
    (cfg_blocks_empty_p): Likewise.
    (add_ssa_edge): Add to current or next worklist based on
    RPO index.
    (add_control_edge): Likewise.
    (ssa_propagation_engine::process_ssa_edge_worklist): Fold
    into ...
    (ssa_propagation_engine::ssa_propagate): ... here.  Unify
    iteration from CFG and SSA edge worklist so we process
    everything in RPO order, prioritizing forward progress
    over iteration.
    (ssa_prop_init): Allocate new worklists, do not dump
    immediate uses.
    (ssa_prop_fini): Free new worklists.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-ccp.c
    trunk/gcc/tree-ssa-propagate.c
    trunk/gcc/tree-ssa-propagate.h
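The "current or next worklist based on RPO index" idea from the ChangeLog above can be sketched as a small standalone engine. This is a hedged illustration and not the code in tree-ssa-propagate.c: it merges the CFG and SSA edge worklists into one, uses `std::set` where the real code may use other containers, and every name (`RpoWorklist`, `add`, `run`, `position`) is hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <utility>
#include <vector>

// Hypothetical toy engine; none of these names come from GCC.
// Items are identified by their reverse-postorder (RPO) index.
struct RpoWorklist {
  std::set<int> current;  // items still ahead of us in this sweep
  std::set<int> next;     // items that have to wait for the next sweep
  int position = -1;      // RPO index being processed right now

  void add(int rpo_index) {
    assert(rpo_index >= 0);
    if (rpo_index > position)
      current.insert(rpo_index);  // forward: handle in this sweep
    else
      next.insert(rpo_index);     // backward: defer to the next sweep
  }

  // Calls visit(item) in RPO order; visit returns the items to queue.
  // Returns the order in which items were processed.
  std::vector<int> run(const std::function<std::vector<int>(int)> &visit) {
    std::vector<int> order;
    while (!current.empty() || !next.empty()) {
      if (current.empty())
        std::swap(current, next);  // start a new sweep
      position = *current.begin();
      current.erase(current.begin());
      order.push_back(position);
      for (int succ : visit(position))
        add(succ);
    }
    return order;
  }
};
```

For a chain 0, 1, 2, 3 in which item 2 re-queues item 1 once, this engine processes 0, 1, 2, 3 and only then 1, 2, 3 again: the sweep finishes its forward work before it iterates, which is how the sketch reads "prioritizing forward progress over iteration".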
[Bug middle-end/63155] [6/7/8 Regression] memory hog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #36 from rguenther at suse dot de ---
On Thu, 4 Oct 2018, rogerio.souza at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155
>
> --- Comment #35 from Rogério de Souza Moraes ---
> Created attachment 44791
>   --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44791=edit
> Small testcase more similar to original environment
>
> Hi Richard,
>
> this is a new testcase, based on another file in the original environment.
> It’s quite small (7000 lines, 240 setjmp calls).
>
> This code with a little complex but still simplified control structure
> represents state machine implementation, which is very widely used by our
> customers. Another new factor is the nested setjmp calls. Of course, original
> testcase is more complex and takes even more time with more difference.
>
> You can run it using the following commands:
>
> time gcc -DGCC -DLINUX_C -D_GLIBCXX_USE_CXX11_ABI=0 -m32 -m32 -w -c -O0
> -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4
> gcc_2nd_synth_pure_c_item.c -o gcc_2nd_synth_pure_c_item.o
>
> time gcc -DGCC -DLINUX_C -D_GLIBCXX_USE_CXX11_ABI=0 -m32 -m32 -w -c -O
> -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4
> gcc_2nd_synth_pure_c_item.c -o gcc_2nd_synth_pure_c_item.o
>
> Results :
>
> GCC: 4.8.5 (From RHEL 7.5)
>
> real    0m0.349s
> user    0m0.255s
> sys     0m0.083s
>
> real    0m0.193s
> user    0m0.163s
> sys     0m0.023s
>
> GCC: 6.3.0 (GCC 6.3.0 with Revision 264523 backported and applied to it)
>
> real    0m32.235s
> user    0m30.486s
> sys     0m1.622s
>
> real    3m34.203s
> user    3m33.726s
> sys     0m0.292s
>
> The performance difference is relevant in this test.

Thanks for the more realistic testcase.

I can confirm the above and I also see a slowdown in GCC 9 compared
to GCC 8 at -O1:

> /usr//bin/time gcc-8 -S t.c -O -fwrapv -mstackrealign -mpreferred-stack-boundary=4 -m32
157.48user 0.24system 2:37.78elapsed 99%CPU (0avgtext+0avgdata 888036maxresident)k
47704inputs+152outputs (8major+240936minor)pagefaults 0swaps
> /usr//bin/time gcc-9 -S t.c -O -fwrapv -mstackrealign -mpreferred-stack-boundary=4 -m32
197.61user 0.39system 3:18.08elapsed 99%CPU (0avgtext+0avgdata 890628maxresident)k
0inputs+184outputs (0major+259016minor)pagefaults 0swaps

Somehow it's still CCP that makes things slow:

 tree CCP                : 178.52 ( 89%)   0.01 (  2%) 178.55 ( 89%)     646 kB (  0%)

perf tells me it's

-   96.33%    29.55%  14801  cc1  cc1  [.] ccp_propagate::visit_phi
     ccp_propagate::visit_phi
   - ssa_propagation_engine::simulate_stmt
      + 49.51% ssa_propagation_engine::simulate_block
      + 46.82% ssa_propagation_engine::ssa_propagate
-   37.06%    28.98%  12421  cc1  cc1  [.] ccp_lattice_meet
   - ccp_lattice_meet
      + 37.02% ccp_propagate::visit_phi
      + 0.03% set_lattice_value
-    5.17%     5.17%   1949  cc1  cc1  [.] wi::bit_or >, generic_w
     wi::bit_or >, generic_wide_int > >
   - ccp_lattice_meet
      + 5.16% ccp_propagate::visit_phi
      + 0.01% set_lattice_value
-    4.02%     4.02%   1509  cc1  cc1  [.] canonicalize_value
   - canonicalize_value
      + 4.02% get_value_for_expr
      + 0.00% ccp_folder::get_value
-    2.90%     2.89%   1083  cc1  cc1  [.] wi::eq_p >, int>
     wi::eq_p >, int>
   - ccp_lattice_meet
      + 2.89% ccp_propagate::visit_phi
      + 0.00% set_lattice_value

As said, thanks for the testcase.
[Bug middle-end/63155] [6/7/8 Regression] memory hog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #35 from Rogério de Souza Moraes ---
Created attachment 44791
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44791=edit
Small testcase more similar to original environment

Hi Richard,

this is a new testcase, based on another file in the original environment.
It’s quite small (7000 lines, 240 setjmp calls).

This code with a little complex but still simplified control structure
represents state machine implementation, which is very widely used by our
customers. Another new factor is the nested setjmp calls. Of course, original
testcase is more complex and takes even more time with more difference.

You can run it using the following commands:

time gcc -DGCC -DLINUX_C -D_GLIBCXX_USE_CXX11_ABI=0 -m32 -m32 -w -c -O0 -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4 gcc_2nd_synth_pure_c_item.c -o gcc_2nd_synth_pure_c_item.o

time gcc -DGCC -DLINUX_C -D_GLIBCXX_USE_CXX11_ABI=0 -m32 -m32 -w -c -O -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4 gcc_2nd_synth_pure_c_item.c -o gcc_2nd_synth_pure_c_item.o

Results :

GCC: 4.8.5 (From RHEL 7.5)

real    0m0.349s
user    0m0.255s
sys     0m0.083s

real    0m0.193s
user    0m0.163s
sys     0m0.023s

GCC: 6.3.0 (GCC 6.3.0 with Revision 264523 backported and applied to it)

real    0m32.235s
user    0m30.486s
sys     0m1.622s

real    3m34.203s
user    3m33.726s
sys     0m0.292s

The performance difference is relevant in this test.

Regards,
--
Rogerio
[Bug middle-end/63155] [6/7/8 Regression] memory hog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #34 from David ---
My primary concern in 87316 was about memory usage and this patch definitely
helps a lot with that.  Thanks!

Using -ftree-coalesce-vars helps on >= 4.9 versions and does not seem to have
an adverse effect on test coverage.
[Bug middle-end/63155] [6/7/8 Regression] memory hog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to work|                            |9.0
            Summary|[6/7/8/9 Regression] memory |[6/7/8 Regression] memory
                   |hog                         |hog

--- Comment #33 from Richard Biener ---
So on trunk for the original testcase I now see

> /usr/bin/time ./cc1 -quiet testunity_Runner.i -std=c99
2.70user 0.16system 0:02.86elapsed 100%CPU (0avgtext+0avgdata 427672maxresident)k
0inputs+504outputs (0major+106295minor)pagefaults 0swaps

while on the same machine using GCC 4.8:

> /usr/bin/time /space/rguenther/install/gcc-4.8.5/bin/gcc -S testunity_Runner.i -std=c99
0.24user 0.01system 0:00.60elapsed 41%CPU (0avgtext+0avgdata 39424maxresident)k
30960inputs+504outputs (37major+8516minor)pagefaults 0swaps

so we've come a long way but still regressed which is somehow not avoidable
because of the correctness fix that started this.

For reference GCC 8.2 numbers are

> /usr/bin/time /space/rguenther/install/gcc-8.2/bin/gcc -S testunity_Runner.i -std=c99
94.31user 2.46system 1:36.79elapsed 99%CPU (0avgtext+0avgdata 10172916maxresident)k
0inputs+504outputs (0major+2535422minor)pagefaults 0swaps

So overall I consider this issue fixed for trunk.