[Bug rtl-optimization/80960] [7/8/9 Regression] Huge memory use when compiling a very large test case
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80960

--- Comment #20 from Dominique d'Humieres ---
Updated timings:

% gfc6 -c pr80960.f90 -fdefault-integer-8 -O2 -ftime-report
...
 combiner : 81.12 (55%) usr 1.17 (41%) sys 82.31 (54%) wall 2700699 kB (60%) ggc
...
 TOTAL    : 148.83  2.87  151.88  4503753 kB

% gfc7 -c pr80960.f90 -fdefault-integer-8 -O2 -ftime-report
...
 combiner : 153.49 (67%) usr 1.64 (44%) sys 155.24 (67%) wall 2700699 kB (60%) ggc
...
 TOTAL    : 228.22  3.77  232.32  4508399 kB

% gfc8 -c pr80960.f90 -fdefault-integer-8 -O2 -ftime-report
...
 combiner : 427.57 ( 85%)  2.17 ( 53%)  430.01 ( 84%)  2700709 kB ( 60%)
...
 TOTAL    : 505.59  4.06  510.13  4499127 kB

% gfcp -c pr80960.f90 -fdefault-integer-8 -O2 -ftime-report
...
 combiner : 426.78 ( 85%)  1.55 ( 47%)  428.43 ( 85%)  2700713 kB ( 60%)
...
 TOTAL    : 502.36  3.31  505.81  4501068 kB
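For anyone comparing more of these reports across compiler versions, the per-pass lines can be scraped mechanically. Below is a small illustrative Python helper; the line format is inferred from the reports quoted above, and real `-ftime-report` layout varies somewhat between GCC releases:

```python
import re

def parse_combiner_line(line):
    """Extract (user_seconds, user_percent) from a -ftime-report 'combiner' line.

    Format assumed from the reports quoted above, e.g.
    ' combiner : 427.57 ( 85%)  2.17 ( 53%) ...'; actual GCC output
    may differ slightly between versions.  Returns None on no match.
    """
    m = re.search(r'combiner\s*:\s*([0-9.]+)\s*\(\s*(\d+)%\)', line)
    if not m:
        return None
    return float(m.group(1)), int(m.group(2))
```

With the gfc8 line above this yields `(427.57, 85)`, making it easy to tabulate how the combiner's share grew from gfc6 to gfc8.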
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80960

--- Comment #19 from Segher Boessenkool ---
(In reply to rguent...@suse.de from comment #18)
> Hmm, so if we'd have numbered stmts in an EBB we could check the
> distance between set and use and not combine when that gets too big?

Yeah.  Or we could even not make a LOG_LINK in the first place between
statements that are too far apart.

> > Combine also makes garbage for every try, and none of that is cleaned
> > up during combine.  Maybe we should change that?  (I can try next week).
>
> Not sure how easy that is but yes, it might help quite a bit due
> to less churn on the cache.  Just ggc_free()ing the "toplevel"
> RTX of failed attempts might already help a bit.  It's of course
> kind-of a hack then but with an appropriate comment it would be
> fine I guess (recursively ggc_free()ing might run into sharing
> issues so that probably won't work).

Combine does not change anything *between* combination attempts, and all
attempts go via the same function (try_combine), so calling ggc_collect
should be fine.  Manually ggc_free()ing things would be a hack alright.
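The "don't make a LOG_LINK between far-apart insns" idea can be sketched abstractly. The following is an illustrative Python model, not GCC code: insns are (LUID, defs, uses) triples, and `max_dist` is a purely hypothetical cutoff parameter:

```python
def build_log_links(insns, max_dist):
    """Link each register use to its most recent def in the block,
    but skip the link when def and use are more than max_dist apart
    (distance measured in LUIDs, i.e. positions in the insn stream).

    insns: list of (luid, set_of_defined_regs, set_of_used_regs).
    Returns a list of (def_luid, use_luid, reg) links.
    """
    last_def = {}   # reg -> luid of its most recent def
    links = []
    for luid, defs, uses in insns:
        for reg in sorted(uses):
            if reg in last_def and luid - last_def[reg] <= max_dist:
                links.append((last_def[reg], luid, reg))
        for reg in defs:
            last_def[reg] = luid
    return links
```

Dropping the link up front means try_combine never even considers the far-apart pair, so no garbage is generated for attempts that were unlikely to succeed anyway.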
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80960

--- Comment #18 from rguenther at suse dot de ---
On Thu, 4 Apr 2019, segher at gcc dot gnu.org wrote:

> --- Comment #17 from Segher Boessenkool ---
> (In reply to rguent...@suse.de from comment #16)
> > Actually it already does two walks over the whole function in
> > combine_instructions it seems, so recording # insns per EBB should
> > be possible?  (if that's really the key metric causing the issue)
>
> The average distance between a set and its first use is the key metric.
> The numbers make it feel like that is pretty constrained here still
> (I haven't run numbers on it), but 100 is already a lot if there are
> 1M insns in the block (or whatever).  None of the numbers is terrible
> by itself, but combined it takes up quite a chunk of time.

Hmm, so if we'd have numbered stmts in an EBB we could check the
distance between set and use and not combine when that gets too big?

> Combine also makes garbage for every try, and none of that is cleaned
> up during combine.  Maybe we should change that?  (I can try next week).

Not sure how easy that is but yes, it might help quite a bit due
to less churn on the cache.  Just ggc_free()ing the "toplevel"
RTX of failed attempts might already help a bit.  It's of course
kind-of a hack then but with an appropriate comment it would be
fine I guess (recursively ggc_free()ing might run into sharing
issues so that probably won't work).
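The sharing hazard with recursive freeing can be shown with a toy model (Python, not GGC and not real RTL): freeing only the toplevel node of a failed attempt is safe, while freeing the whole tree would also release subexpressions still shared with live expressions:

```python
class Node:
    """Toy stand-in for a GC-allocated RTX node; children may be shared."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def free_toplevel(node, freed):
    """Free only the outermost node -- safe even with shared subexpressions."""
    freed.add(id(node))

def free_recursive(node, freed):
    """Free a whole tree -- unsafe when a child is shared with live RTL."""
    freed.add(id(node))
    for child in node.children:
        free_recursive(child, freed)

# A failed attempt ('dead') and a live expression share a subexpression:
shared = Node('reg')
dead = Node('plus', [shared])
live = Node('minus', [shared])
```

Freeing only `dead`'s toplevel leaves `shared` (and hence `live`) intact; freeing `dead` recursively would reclaim `shared` out from under `live`, which is exactly the sharing issue anticipated above.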
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80960

--- Comment #17 from Segher Boessenkool ---
(In reply to rguent...@suse.de from comment #16)
> I would guess so.  I wonder if the combiner could restrict itself
> here?  Maybe LUID "distances" are an approximation?  Doesn't the
> combiner use DF uses so the number of combinations shouldn't increase
> with basic-block size but only with the number of uses?  Of course
> since we don't have SSA the uses probably include those that cross
> other defs...

Combine doesn't try too many pairs: it tries every def only with its
first use, so that is linear in # insns.  But the check whether a
combination is valid uses reg_used_between_p, which is no good for
insns a big distance apart.

> That said, algorithmically first building up a kind-of-SSA to
> restrict things combine will try might help to avoid this kind
> of issues.

Yup.  Not going to happen in stage4, of course :-/

There are a few other things which aren't linear, but this is the worst
one (the rest only happens occasionally, or only on successful
combinations).

> Since combine does a df_analyze we should have a way to record
> the number of insns in a block without another IL walk, it could
> also fall back to 2->1 and 2->2 insn combinations after visiting
> a new PARAM max-combine-bb-insns-3-3 number of insns in an EBB.

The 3->1 (or 3->2) isn't really the problem; there just are many more
to try than 2->[12].

> Actually it already does two walks over the whole function in
> combine_instructions it seems, so recording # insns per EBB should
> be possible?  (if that's really the key metric causing the issue)

The average distance between a set and its first use is the key metric.
The numbers make it feel like that is pretty constrained here still
(I haven't run numbers on it), but 100 is already a lot if there are
1M insns in the block (or whatever).  None of the numbers is terrible
by itself, but combined it takes up quite a chunk of time.

Combine also makes garbage for every try, and none of that is cleaned
up during combine.  Maybe we should change that?  (I can try next week).
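The two cost components described above — a linear number of candidate pairs, but a distance-proportional validity check — can be separated in a toy model. Here is an illustrative Python analogue of a reg_used_between_p-style scan (this is not GCC's implementation, just its cost shape):

```python
def reg_used_between(insns, reg, start, end):
    """Linear scan analogous in spirit to GCC's reg_used_between_p:
    walk every insn strictly between positions start and end and ask
    whether it mentions reg.  Cost is proportional to end - start,
    so validity checks get expensive for far-apart def/use pairs.

    insns: list of (luid, set_of_defined_regs, set_of_used_regs).
    """
    return any(reg in defs or reg in uses
               for _, defs, uses in insns[start + 1:end])
```

With one such check per candidate pair and distances that grow with block size, the per-check cost dominates even though the number of attempts stays linear.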
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80960

--- Comment #16 from rguenther at suse dot de ---
On Tue, 2 Apr 2019, segher at gcc dot gnu.org wrote:

> --- Comment #15 from Segher Boessenkool ---
> It seems to be that this happens for huge basic blocks, and the combiner
> tries to combine pairs of instructions that are far apart.  This is unlikely
> to work often, and the cost is quadratic in # insns, the way checking where
> a register is used works.
>
> The param to do only 2->1 (and 2->2) combinations should help a lot, make
> combine not take longer than the rest of the compiler does.  Does it?

I would guess so.  I wonder if the combiner could restrict itself
here?  Maybe LUID "distances" are an approximation?  Doesn't the
combiner use DF uses so the number of combinations shouldn't increase
with basic-block size but only with the number of uses?  Of course
since we don't have SSA the uses probably include those that cross
other defs...

That said, algorithmically first building up a kind-of-SSA to
restrict things combine will try might help to avoid this kind
of issues.

Since combine does a df_analyze we should have a way to record
the number of insns in a block without another IL walk, it could
also fall back to 2->1 and 2->2 insn combinations after visiting
a new PARAM max-combine-bb-insns-3-3 number of insns in an EBB.

Actually it already does two walks over the whole function in
combine_instructions it seems, so recording # insns per EBB should
be possible?  (if that's really the key metric causing the issue)
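The suggested fallback could look roughly like this, sketched in Python rather than GCC code; the threshold below is purely hypothetical (the proposed max-combine-bb-insns-3-3 PARAM does not exist at this point), and only the decision logic is shown:

```python
def max_combine_insns(insn_count_in_ebb, max_3insn_threshold=1000):
    """Sketch of the fallback suggested above: allow 3-insn (and larger)
    combinations only while the EBB is small; past a hypothetical
    PARAM-style threshold, restrict combine to 2->1 / 2->2 attempts.
    The threshold name and default are illustrative, not GCC's."""
    return 3 if insn_count_in_ebb <= max_3insn_threshold else 2
```

Since df_analyze already walks the IL, the insn count per EBB would be available essentially for free, as the comment notes.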
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80960

--- Comment #15 from Segher Boessenkool ---
It seems to be that this happens for huge basic blocks, and the combiner
tries to combine pairs of instructions that are far apart.  This is unlikely
to work often, and the cost is quadratic in # insns, the way checking where
a register is used works.

The param to do only 2->1 (and 2->2) combinations should help a lot, make
combine not take longer than the rest of the compiler does.  Does it?
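The quadratic claim follows from a simple cost model: roughly one combination attempt per insn, each paying a validity check that scans the insns between def and use. A toy illustration (the assumptions — attempts proportional to block size, average def-to-use distance growing with block size — are stated, not measured):

```python
def combine_cost(n_attempts, avg_distance):
    """Toy cost model for the behaviour described above: each of
    n_attempts combination attempts pays a validity check that scans
    avg_distance insns between def and use."""
    return n_attempts * avg_distance

# When the average distance grows with block size (say n / 10),
# a 10x larger block costs 100x more:
small = combine_cost(1000, 1000 // 10)        # 100_000
large = combine_cost(10_000, 10_000 // 10)    # 10_000_000
```

This is why restricting either the pair distance or the combination size helps so much on huge blocks, while normal-sized blocks are unaffected.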
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80960

--- Comment #14 from rguenther at suse dot de ---
On Sun, 31 Mar 2019, tkoenig at gcc dot gnu.org wrote:

> --- Comment #13 from Thomas Koenig ---
> With -O2, the combiner takes up quite a lot of time:
>
> $ time gfortran -ftime-report -g0 -O2 -fdefault-integer-8 -c fe_objective.f90
>
> alias stmt walking      :  15.75 (  4%)   0.11 (  5%)   15.89 (  4%)        2 kB (  0%)
> dead store elim2        :  10.49 (  2%)   0.33 ( 16%)   10.82 (  3%)  1578727 kB ( 35%)
> combiner                : 346.20 ( 81%)   0.89 ( 44%)  347.17 ( 81%)  2701135 kB ( 60%)
> TOTAL                   : 428.68          2.01          430.83        4504484 kB
>
> With -O1, this now has as biggest consumers of cycles:
>
> alias stmt walking      :  11.80 ( 31%)   0.04 ( 13%)   11.78 ( 31%)        2 kB (  0%)
> integrated RA           :   5.61 ( 15%)   0.06 ( 20%)    5.67 ( 15%)    34896 kB ( 10%)
> LRA hard reg assignment :   4.69 ( 12%)   0.00 (  0%)    4.69 ( 12%)        0 kB (  0%)
> TOTAL                   :  37.68          0.30           38.00         364905 kB
>
> which does not look too bad (and memory consumption has remained constant).
>
> Note that this is with checking enabled.

You can mitigate enabled checking somewhat with -fno-checking.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80960

--- Comment #13 from Thomas Koenig ---
With -O2, the combiner takes up quite a lot of time:

$ time gfortran -ftime-report -g0 -O2 -fdefault-integer-8 -c fe_objective.f90

alias stmt walking      :  15.75 (  4%)   0.11 (  5%)   15.89 (  4%)        2 kB (  0%)
dead store elim2        :  10.49 (  2%)   0.33 ( 16%)   10.82 (  3%)  1578727 kB ( 35%)
combiner                : 346.20 ( 81%)   0.89 ( 44%)  347.17 ( 81%)  2701135 kB ( 60%)
TOTAL                   : 428.68          2.01          430.83        4504484 kB

With -O1, this now has as biggest consumers of cycles:

alias stmt walking      :  11.80 ( 31%)   0.04 ( 13%)   11.78 ( 31%)        2 kB (  0%)
integrated RA           :   5.61 ( 15%)   0.06 ( 20%)    5.67 ( 15%)    34896 kB ( 10%)
LRA hard reg assignment :   4.69 ( 12%)   0.00 (  0%)    4.69 ( 12%)        0 kB (  0%)
TOTAL                   :  37.68          0.30           38.00         364905 kB

which does not look too bad (and memory consumption has remained constant).

Note that this is with checking enabled.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80960

Richard Biener changed:

           What             |Removed |Added
----------------------------------------------
 Target Milestone           |7.4     |7.5