[Bug rtl-optimization/80960] [7/8/9 Regression] Huge memory use when compiling a very large test case

2019-04-06 Thread dominiq at lps dot ens.fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80960

--- Comment #20 from Dominique d'Humieres  ---
Updated timings

% gfc6 -c pr80960.f90 -fdefault-integer-8 -O2 -ftime-report
...
 combiner:  81.12 (55%) usr   1.17 (41%) sys  82.31 (54%) wall 2700699 kB (60%) ggc
...
 TOTAL : 148.83 2.87   151.88   4503753 kB
% gfc7 -c pr80960.f90 -fdefault-integer-8 -O2 -ftime-report
...
 combiner: 153.49 (67%) usr   1.64 (44%) sys 155.24 (67%) wall 2700699 kB (60%) ggc
...
 TOTAL : 228.22 3.77   232.32   4508399 kB
% gfc8 -c pr80960.f90 -fdefault-integer-8 -O2 -ftime-report
...
 combiner   : 427.57 ( 85%)   2.17 ( 53%) 430.01 ( 84%) 2700709 kB ( 60%)
...
 TOTAL  : 505.59   4.06   510.13   4499127 kB
% gfcp -c pr80960.f90 -fdefault-integer-8 -O2 -ftime-report
...
 combiner   : 426.78 ( 85%)   1.55 ( 47%) 428.43 ( 85%) 2700713 kB ( 60%)
...
 TOTAL  : 502.36   3.31   505.81   4501068 kB

[Bug rtl-optimization/80960] [7/8/9 Regression] Huge memory use when compiling a very large test case

2019-04-05 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80960

--- Comment #19 from Segher Boessenkool  ---
(In reply to rguent...@suse.de from comment #18)
> Hmm, so if we had numbered stmts in an EBB we could check the
> distance between set and use and not combine when that gets too big?

Yeah.  Or we could even not make a LOG_LINK in the first place between
statements that are too far apart.
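
In create_log_links terms that could be a simple cutoff on the DF
logical uids; a minimal sketch (MAX_COMBINE_DISTANCE is a hypothetical
cutoff, not an existing param, and def_insn/use_insn stand for the two
ends of the candidate link):

  /* Sketch: refuse to record a LOG_LINK when the def and the use are
     more than MAX_COMBINE_DISTANCE insns apart; combine then never
     considers the pair at all.  */
  if (DF_INSN_LUID (use_insn) - DF_INSN_LUID (def_insn)
      > MAX_COMBINE_DISTANCE)
    continue;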

> > Combine also makes garbage for every try, and none of that is cleaned
> > up during combine.  Maybe we should change that?  (I can try next week).
> 
> Not sure how easy that is but yes, it might help quite a bit due
> to less churn on the cache.  Just ggc_free()ing the "toplevel"
> RTX of failed attempts might already help a bit.  It's of course
> kind-of a hack then but with an appropriate comment it would be
> fine I guess (recursively ggc_free()ing might run into sharing
> issues so that probably won't work).

combine does not change anything *between* combination attempts, and
all attempts go via the same function (try_combine), so calling ggc_collect
should be fine.  Manually ggc_free'ing things would be a hack all right.
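
A minimal sketch of that (assuming the collection is triggered from
combine's main loop, and rate-limited since each ggc_collect walk has
its own cost; "attempts" is an illustrative counter, not existing code):

  /* Sketch: between attempts no in-flight pointer refers to the
     scratch RTL of failed tries, so collecting is safe.  Collect
     only every so often to bound the GC overhead.  */
  if (++attempts % 100000 == 0)
    ggc_collect ();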

[Bug rtl-optimization/80960] [7/8/9 Regression] Huge memory use when compiling a very large test case

2019-04-05 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80960

--- Comment #18 from rguenther at suse dot de  ---
On Thu, 4 Apr 2019, segher at gcc dot gnu.org wrote:

> (In reply to rguent...@suse.de from comment #16)
> 
> > Actually it already does two walks over the whole function in
> > combine_instructions it seems, so recording # insns per EBB should
> > be possible?  (if that's really the key metric causing the issue)
> 
> The average distance between a set and its first use is the key metric.
> The numbers make it feel like that distance is still pretty constrained
> here (I haven't run numbers on it), but even 100 is already a lot if
> there are 1M insns in the block (or whatever).  None of the numbers is
> terrible by itself, but combined it takes up quite a chunk of time.

Hmm, so if we had numbered stmts in an EBB we could check the
distance between set and use and not combine when that gets too big?

> Combine also makes garbage for every try, and none of that is cleaned
> up during combine.  Maybe we should change that?  (I can try next week).

Not sure how easy that is but yes, it might help quite a bit due
to less churn on the cache.  Just ggc_free()ing the "toplevel"
RTX of failed attempts might already help a bit.  It's of course
kind-of a hack then but with an appropriate comment it would be
fine I guess (recursively ggc_free()ing might run into sharing
issues so that probably won't work).
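
A sketch of that toplevel-only idea ("newpat" and "attempt_succeeded"
are illustrative names for the scratch pattern of a failed attempt):

  /* Sketch: hand the outermost rtx of a failed attempt back to the
     allocator.  Sub-rtxes may be shared with live insns, so a
     recursive free would be unsafe; only the toplevel is released.  */
  if (!attempt_succeeded && newpat)
    ggc_free (newpat);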

[Bug rtl-optimization/80960] [7/8/9 Regression] Huge memory use when compiling a very large test case

2019-04-03 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80960

--- Comment #17 from Segher Boessenkool  ---
(In reply to rguent...@suse.de from comment #16)

> I would guess so.  I wonder if the combiner could restrict itself
> here?  Maybe LUID "distances" are an approximation?  Doesn't the
> combiner use DF uses, so that the number of combinations shouldn't
> increase with basic-block size but only with the number of uses?  Of
> course since we don't have SSA the uses probably include those that
> cross other defs...

Combine doesn't try too many pairs: it tries every def only with its
first use, so that is linear in # insns.  But the check whether a
combination is valid uses reg_used_between_p, which is no good for
insns a large distance apart.
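
For reference, the core of reg_used_between_p is a linear walk over the
insn stream between the two candidates (a simplified sketch of the
rtlanal.c logic, omitting the handling of register uses inside calls):

  /* Simplified: scan every insn strictly between FROM_INSN and
     TO_INSN for a use of REG.  Each query costs O(distance), so
     querying O(n) set/use pairs in an n-insn block is O(n^2).  */
  static bool
  reg_used_between_sketch (const_rtx reg, rtx_insn *from_insn,
                           rtx_insn *to_insn)
  {
    for (rtx_insn *insn = NEXT_INSN (from_insn); insn != to_insn;
         insn = NEXT_INSN (insn))
      if (NONDEBUG_INSN_P (insn)
          && reg_overlap_mentioned_p (reg, PATTERN (insn)))
        return true;
    return false;
  }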

> That said, algorithmically first building up a kind-of-SSA to
> restrict the things combine will try might help to avoid this kind
> of issue.

Yup.  Not going to happen in stage4, of course :-/

There are a few other things which aren't linear, but this is the
worst one (the rest happen only occasionally, or only on successful
combinations).

> Since combine does a df_analyze we should have a way to record
> the number of insns in a block without another IL walk; it could
> also fall back to 2->1 and 2->2 insn combinations after visiting
> a new PARAM max-combine-bb-insns-3-3 number of insns in an EBB.

The 3->1 (or 3->2) isn't really the problem; there are just many more
of them to try than 2->[12].

> Actually it already does two walks over the whole function in
> combine_instructions it seems, so recording # insns per EBB should
> be possible?  (if that's really the key metric causing the issue)

The average distance between a set and its first use is the key metric.
The numbers make it feel like that distance is still pretty constrained
here (I haven't run numbers on it), but even 100 is already a lot if
there are 1M insns in the block (or whatever).  None of the numbers is
terrible by itself, but combined it takes up quite a chunk of time.

Combine also makes garbage for every try, and none of that is cleaned
up during combine.  Maybe we should change that?  (I can try next week).

[Bug rtl-optimization/80960] [7/8/9 Regression] Huge memory use when compiling a very large test case

2019-04-03 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80960

--- Comment #16 from rguenther at suse dot de  ---
On Tue, 2 Apr 2019, segher at gcc dot gnu.org wrote:

> It seems that this happens for huge basic blocks, where the combiner
> tries to combine pairs of instructions that are far apart.  This is
> unlikely to work often, and with the way checking where a register is
> used works, the cost is quadratic in # insns.
> 
> The param to do only 2->1 (and 2->2) combinations should help a lot,
> making combine not take longer than the rest of the compiler does.  Does it?

I would guess so.  I wonder if the combiner could restrict itself
here?  Maybe LUID "distances" are an approximation?  Doesn't the
combiner use DF uses, so that the number of combinations shouldn't
increase with basic-block size but only with the number of uses?  Of
course since we don't have SSA the uses probably include those that
cross other defs...

That said, algorithmically first building up a kind-of-SSA to
restrict the things combine will try might help to avoid this kind
of issue.

Since combine does a df_analyze we should have a way to record
the number of insns in a block without another IL walk; it could
also fall back to 2->1 and 2->2 insn combinations after visiting
a new PARAM max-combine-bb-insns-3-3 number of insns in an EBB.
Actually it already does two walks over the whole function in
combine_instructions it seems, so recording # insns per EBB should
be possible?  (if that's really the key metric causing the issue)
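
A sketch of that fallback (max-combine-bb-insns-3-3 is only a name
proposed here, not an existing param; it would be read through the
usual PARAM_VALUE machinery, and ebb_insn_count is the recorded count):

  /* Sketch: past the proposed per-EBB insn budget, stop attempting
     3->1/3->2 combinations and keep only the cheaper 2-insn ones.  */
  bool try_three_insn_combos
    = ebb_insn_count <= PARAM_VALUE (PARAM_MAX_COMBINE_BB_INSNS_3_3);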

[Bug rtl-optimization/80960] [7/8/9 Regression] Huge memory use when compiling a very large test case

2019-04-02 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80960

--- Comment #15 from Segher Boessenkool  ---
It seems that this happens for huge basic blocks, where the combiner
tries to combine pairs of instructions that are far apart.  This is
unlikely to work often, and with the way checking where a register is
used works, the cost is quadratic in # insns.

The param to do only 2->1 (and 2->2) combinations should help a lot,
making combine not take longer than the rest of the compiler does.  Does it?
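
Assuming that param is max-combine-insns (present since GCC 8, default
4; the value 2 restricts combine to two-insn combinations), the
suggested experiment would be something like:

% gfortran -c pr80960.f90 -fdefault-integer-8 -O2 -ftime-report \
    --param max-combine-insns=2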

[Bug rtl-optimization/80960] [7/8/9 Regression] Huge memory use when compiling a very large test case

2019-04-01 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80960

--- Comment #14 from rguenther at suse dot de  ---
On Sun, 31 Mar 2019, tkoenig at gcc dot gnu.org wrote:

> With -O2, the combiner takes up quite a lot of time:
> 
> $ time gfortran -ftime-report -g0 -O2 -fdefault-integer-8 -c fe_objective.f90
> 
>  alias stmt walking :  15.75 (  4%)   0.11 (  5%)  15.89 (  4%)   2 kB (  0%)
>  dead store elim2   :  10.49 (  2%)   0.33 ( 16%)  10.82 (  3%) 1578727 kB ( 35%)
>  combiner   : 346.20 ( 81%)   0.89 ( 44%) 347.17 ( 81%) 2701135 kB ( 60%)
>  TOTAL  : 428.68   2.01   430.83   4504484 kB
> 
> With -O1, the biggest consumers of cycles are now
> 
>  alias stmt walking :  11.80 ( 31%)   0.04 ( 13%)  11.78 ( 31%)   2 kB (  0%)
>  integrated RA  :   5.61 ( 15%)   0.06 ( 20%)   5.67 ( 15%)   34896 kB ( 10%)
>  LRA hard reg assignment:   4.69 ( 12%)   0.00 (  0%)   4.69 ( 12%)   0 kB (  0%)
>  TOTAL  :  37.68   0.30   38.00   364905 kB
> 
> 
> which does not look too bad (and memory consumption has remained constant).
> 
> Note that this is with checking enabled.

You can mitigate enabled checking somewhat with -fno-checking.
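
E.g., for the invocation from comment #13:

$ time gfortran -ftime-report -g0 -O2 -fdefault-integer-8 -fno-checking -c fe_objective.f90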

[Bug rtl-optimization/80960] [7/8/9 Regression] Huge memory use when compiling a very large test case

2019-03-31 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80960

--- Comment #13 from Thomas Koenig  ---
With -O2, the combiner takes up quite a lot of time:

$ time gfortran -ftime-report -g0 -O2 -fdefault-integer-8 -c fe_objective.f90

 alias stmt walking :  15.75 (  4%)   0.11 (  5%)  15.89 (  4%)   2 kB (  0%)
 dead store elim2   :  10.49 (  2%)   0.33 ( 16%)  10.82 (  3%) 1578727 kB ( 35%)
 combiner   : 346.20 ( 81%)   0.89 ( 44%) 347.17 ( 81%) 2701135 kB ( 60%)
 TOTAL  : 428.68   2.01   430.83   4504484 kB

With -O1, the biggest consumers of cycles are now

 alias stmt walking :  11.80 ( 31%)   0.04 ( 13%)  11.78 ( 31%)   2 kB (  0%)
 integrated RA  :   5.61 ( 15%)   0.06 ( 20%)   5.67 ( 15%)   34896 kB ( 10%)
 LRA hard reg assignment:   4.69 ( 12%)   0.00 (  0%)   4.69 ( 12%)   0 kB (  0%)
 TOTAL  :  37.68   0.30   38.00   364905 kB


which does not look too bad (and memory consumption has remained constant).

Note that this is with checking enabled.

[Bug rtl-optimization/80960] [7/8/9 Regression] Huge memory use when compiling a very large test case

2018-12-06 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80960

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|7.4                         |7.5