[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2021-04-29 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #101 from CVS Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:c57a8aea0c3ab8394f7dbfa417ee27b4613f63b7

commit r12-280-gc57a8aea0c3ab8394f7dbfa417ee27b4613f63b7
Author: Richard Biener 
Date:   Thu Apr 29 08:32:00 2021 +0200

middle-end/38474 - speedup PTA constraint solving

In testcases like PR38474 and PR99912 we're seeing very slow
PTA solving.  One can observe an excessive amount of forwarding,
mostly during sd constraint solving.  The way we solve the graph
does not avoid forwarding the same bits through multiple paths,
and especially when such alternate path involves ESCAPED as
intermediate this causes the ESCAPED solution to be expanded
in receivers.

The following adds heuristic to add_graph_edge which adds
forwarding edges but also guards the initial solution forwarding
(which is the expensive part) to detect the case of ESCAPED
receiving the same set and the destination already containing
ESCAPED.

This speeds up the PTA solving process by more than 50%.

2021-04-29  Richard Biener  

PR middle-end/38474
* tree-ssa-structalias.c (add_graph_edge): Avoid direct
forwarding when indirect forwarding through ESCAPED
alread happens.

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2021-02-16 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #100 from CVS Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:3f16a1678156035bbe73b217fbce4d9c27d1d559

commit r11-7254-g3f16a1678156035bbe73b217fbce4d9c27d1d559
Author: Richard Biener 
Date:   Tue Feb 16 12:42:26 2021 +0100

tree-optimization/38474 - improve PTA varinfo sorting

This improves a previous heuristic to sort address-taken variables
first (because those appear in points-to bitmaps) by tracking which
variables appear in ADDRESSOF constraints (there's also
graph->address_taken but that's computed only later).

This shaves off 30s worth of compile-time for the full testcase in
PR38474 (which then still takes 965s to compile at -O2).

2021-02-16  Richard Biener  

PR tree-optimization/38474
* tree-ssa-structalias.c (variable_info::address_taken): New.
(new_var_info): Initialize address_taken.
(process_constraint): Set address_taken.
(solve_constraints): Use the new address_taken flag rather
than is_reg_var for sorting variables.
(dump_constraint): Dump the variable number if the name
is just NULL.

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2021-02-12 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #99 from Richard Biener  ---
Just a short brain-dump for the PTA issue:

--param max-fields-for-field-sensitive=1 helps, so some magic limit and
auto-degrading might be a good idea.

Solver stats are not so bad:

 Total vars:   21507
Non-pointer vars:  16
Statically unified vars:  6800
Dynamically unified vars: 0
Iterations:   4
Number of edges:  43380
Number of implicit edges: 23056

but varmap "compression" happens before unifying those 6800 vars which
means bitmaps are less dense than possible.  That there's nothing
dynamically unified also says that likely iteration order is sub-optimal.
We don't have entries of the forward graph and so likely do multile
DFSs starting from somewhere inside for example.  Given we add both
succ and pred edges during solving itself makes itegrating the DFS
into the iteration itself look attractive eventually.

More stats are needed to judge iteration order tweaks.

We have IL like

  D.335748 = __result_mpfr_division_mp_mp;
  __result_mpfr_division_mp_mp ={v} {CLOBBER};
  D.76250 = D.335748;
  D.335748 ={v} {CLOBBER};
...
  mpfr_add (&__result_mpfr_addition_mp_mp, , , 0);

that just generates a lot of initial constraints and variables.  D.335748
becomes live here, so does D.76250.  This happens early also, like with

__attribute__((fn spec (". r r ")))
struct mpfr_type mpfr_division_mp_mp (struct mpfr_type & restrict a1, struct
mpfr_type & restrict a2)
{
  struct mpfr_type __result_mpfr_division_mp_mp;
  integer(kind=4) retval;

   :
  mpfr_init (&__result_mpfr_division_mp_mp);
  retval_6 = mpfr_div (&__result_mpfr_division_mp_mp, a1_3(D), a2_4(D), 0);
   = __result_mpfr_division_mp_mp;
  __result_mpfr_division_mp_mp ={v} {CLOBBER};
  return ;

having some early pass after inlining clean up the result would be nice
[simply renaming and eliding 1:1 copies here].  It takes until SRA to fix this
and the PTA pass after it (run as part of PRE) is fast:

 tree PTA   :  20.26 (  7%)

another thing to notice would be not splitting vars that just occur
"bare" in the IL but that would require some pre-scanning of the IL
to note interesting (and uninteresting) variables.  That's something
we might need anyway though for improved allocated array handling for
example.

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2021-02-12 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #98 from CVS Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:6cc886bf4279461b8931c4ca544185a85cd69f26

commit r11-7208-g6cc886bf4279461b8931c4ca544185a85cd69f26
Author: Richard Biener 
Date:   Fri Feb 12 11:13:36 2021 +0100

middle-end/38474 - fix alias walk budget accounting in IPA analysis

The walk_aliased_vdef calls do not update the walking budget until
it is hit by a single call (and then in one case it resumes with
no limit at all).  The following rectifies this in multiple places.
It also makes the updates more consistend and fixes
determine_known_aggregate_parts to account its own alias queries.

2021-02-12  Richard Biener  

PR middle-end/38474
* ipa-fnsummary.c (unmodified_parm_1): Only walk when
fbi->aa_walk_budget is bigger than zero.  Update
fbi->aa_walk_budget.
(param_change_prob): Likewise.
* ipa-prop.c (detect_type_change_from_memory_writes):
Properly account walk_aliased_vdefs.
(parm_preserved_before_stmt_p): Canonicalize updates.
(parm_ref_data_preserved_p): Likewise.
(parm_ref_data_pass_through_p): Likewise.
(determine_known_aggregate_parts): Account own alias queries.

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2021-02-12 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #97 from Richard Biener  ---
So fixing that makes GCC 11 compile the full testcase at -O1 -g in 18 seconds
using about 1GB of memory.

That leaves PTA at -O2+ as the biggest offender (it also shows up with the
reduced testcase).

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2021-02-12 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #96 from Richard Biener  ---
The full testcase on trunk (g:95d94b52ea8478334fb92cca545f0bd904bd0034) at -O0
-g
now takes 9s to compile and uses 1GB ram.

With -O1 -g we have

Time variable   usr   sys  wall
  GGC
 callgraph functions expansion  :  13.41 ( 12%)   0.21 ( 60%)  13.63 ( 12%)
  439M ( 73%)
 callgraph ipa passes   :  94.79 ( 86%)   0.13 ( 37%)  94.95 ( 86%)
   75M ( 13%)
 ipa function summary   :  91.46 ( 83%)   0.02 (  6%)  91.53 ( 83%)
   17M (  3%)
 tree PTA   :   5.78 (  5%)   0.05 ( 14%)   5.85 (  5%)
   23M (  4%)
 TOTAL  : 109.96  0.35110.37   
  597M
109.97user 0.37system 1:50.38elapsed 99%CPU (0avgtext+0avgdata
1110568maxresident)k
0inputs+0outputs (0major+350549minor)pagefaults 0swaps

where perf shows

Samples: 448K of event 'cycles:u', Event count (approx.): 483237005145  
Overhead   Samples  Command   Shared Object Symbol  
  17.26% 77187  f951  f951  [.] get_ref_base_and_extent
  #
   8.36% 37385  f951  f951  [.]
stmt_may_clobber_ref_p_1  #
   7.16% 32045  f951  f951  [.] default_binds_local_p_3
  #
   6.40% 28628  f951  f951  [.] bitmap_bit_p   
  #
   6.39% 28557  f951  f951  [.]
determine_known_aggregate_parts   #
   5.92% 26464  f951  f951  [.] pt_solution_includes_1 
  #
   4.66% 20834  f951  f951  [.]
call_may_clobber_ref_p_1  #
   3.44% 15406  f951  f951  [.] flags_from_decl_or_type
  #
   3.35% 14971  f951  f951  [.] refs_may_alias_p_1 
  #
   3.05% 13667  f951  f951  [.] gimple_call_flags  
  #
   2.55% 11387  f951  f951  [.]
cgraph_node::get_availability #
   2.40% 10739  f951  libc-2.26.so  [.] __strncmp_sse42
  #
   2.32% 10372  f951  f951  [.] check_fnspec   
  #
   1.89%  8411  f951  f951  [.] bitmap_set_bit 
  #
   1.71%  7635  f951  f951  [.]
private_lookup_attribute  #
   1.68%  7512  f951  f951  [.]
get_modref_function_summary   #
   1.52%  6805  f951  f951  [.]
decl_binds_to_current_def_p   #
   1.46%  6512  f951  f951  [.] gimple_call_fnspec 
  #
   1.26%  5582  f951  f951  [.] bitmap_clear_bit   
  #
   0.94%  4212  f951  f951  [.]
cgraph_node::function_or_virtual_thunk_symbol   

we need to do sth about the IPA fnsummary cost, it looks unreasonable compared
to all the rest, at least for -O1.  Cutting down --param ipa-max-aa-steps
doesn't seem to help but it looks accounting is simply broken.

And with -O2 or -O3 we have

Time variable   usr   sys  wall
  GGC
 callgraph functions expansion  : 201.23 ( 20%)   0.77 ( 46%) 202.05 ( 20%)
 1230M ( 82%)
 callgraph ipa passes   : 807.58 ( 80%)   0.86 ( 52%) 808.75 ( 80%)
  201M ( 13%)
 ipa inlining heuristics:  40.25 (  4%)   0.01 (  1%)  40.24 (  4%)
   41M (  3%)
 alias stmt walking :  21.48 (  2%)   0.20 ( 12%)  21.72 (  2%)
  601k (  0%)
 tree PTA   : 788.36 ( 78%)   0.76 ( 46%) 789.43 ( 78%)
  101M (  7%)
 tree slp vectorization :  13.97 (  1%)   0.04 (  2%)  14.01 (  1%)
  225M ( 15%)
 expand vars:  92.66 (  9%)   0.00 (  0%)  92.72 (  9%)
   63M (  4%)
 TOTAL  :1010.42  1.66   1012.46   
 1509M
1010.42user 1.73system 16:52.53elapsed 99%CPU (0avgtext+0avgdata
4764428maxresident)k
0inputs+0outputs (0major+1199966minor)pagefaults 0swaps

surprisingly the IPA fnsummary issue is -O1 only but maybe it's an accounting
issue.  perf with callgraph points to (if I interpret correctly) the
determine_known_aggregate_parts function 

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2021-02-12 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #95 from CVS Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:95d94b52ea8478334fb92cca545f0bd904bd0034

commit r11-7205-g95d94b52ea8478334fb92cca545f0bd904bd0034
Author: Richard Biener 
Date:   Thu Feb 11 11:13:47 2021 +0100

tree-optimization/38474 - fix store-merging compile-time regression

The following puts a limit on the number of alias tests we do in
terminate_all_aliasing_chains which is quadratic in the number of
overall stores currentrly tracked.  There is already a limit in
place on the maximum number of stores in a single chain so the
following adds a limit on the number of chains tracked.  The
worst number of overall stores tracked from the defaults (64 and 64)
is then 4096 which when imposed as the sole limit for the testcase
still causes

 store merging  :  71.65 ( 56%)

because the testcase is somewhat degenerate with most chains
consisting only of a single store (and 25% of exactly three stores).
The single stores are all CLOBBERs at the point variables go out of
scope.  Note unpatched we have

 store merging  : 308.60 ( 84%)

Limiting the number of chains to 64 brings this down to

 store merging  :   1.52 (  3%)

which is more reasonable.  There are ideas on how to make
terminate_all_aliasing_chains cheaper but for this degenerate case
they would not have any effect so I'll defer for GCC 12 for those.

I'm not sure we want to have both --params, just keeping the
more to-the-point max-stores-to-track works but makes the
degenerate case above slower.
I made the current default 1024 which for the testcasse
(without limiting chains) results in 25% compile time and 20s
putting it in the same ballpart as the next offender (which is PTA).

This is a regression on trunk and the GCC 10 branch btw.

2021-02-11  Richard Biener  

PR tree-optimization/38474
* params.opt (-param=max-store-chains-to-track=): New param.
(-param=max-stores-to-track=): Likewise.
* doc/invoke.texi (max-store-chains-to-track): Document.
(max-stores-to-track): Likewise.
* gimple-ssa-store-merging.c (pass_store_merging::m_n_chains):
New.
(pass_store_merging::m_n_stores): Likewise.
(pass_store_merging::terminate_and_process_chain): Update
m_n_stores and m_n_chains.
(pass_store_merging::process_store): Likewise.   Terminate
oldest chains if the number of stores or chains get too large.
(imm_store_chain_info::terminate_and_process_chain): Dump
chain length.

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2021-02-11 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #94 from Richard Biener  ---
Created attachment 50165
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50165=edit
patch experiment

(In reply to Jakub Jelinek from comment #93)
> I think I'd go for more chains by default, at least 64 or even 256, with a
> param and tracking on how many we have in a counter.  The class has a
> ctor/dtor, so the increment/decrement of the counter can be done there.
> And I think it is doubly-linked, so tail should be prev on the head.

So the list is not circular but adding a counter is easy and avoids the
linked list walk in the usual cases when we do not hit the limit.

Note that the number of alias queries we do is
(param_max_store_chains_to_track * param_max_stores_to_merge) squared
with a default of 64 that's already 64 ^ 4.  So a less restrictive option
might be to limit the product of both numbers.

Now, I was thinking that we eventually can get rid of most alias queries
by instead of ambiguating against all stores in each chain only
ambiguate against the whole base of each chain, effectively making it
param_max_store_chains_to_track squared then.  That might lose some odd
cases.  The profile also shows that the caching of ao_refs (and thus
get_ref_base_and_extent calls) is imperfect - for the store chains
we could record ao_refs for example (it effectively already records
all necessary info anyway).

A simple patch caching the ao_ref in store_immediate_info improves compile
time to

 store merging  : 265.07 ( 83%)   0.00 (  0%) 265.14 ( 82%)
 3858k (  1%)
 TOTAL  : 321.25  0.53321.88   
  557M

shaving off some 50s and thus possibly worth the extra memory cost here.

Btw, I've done some statistics and for this testcase we indeed mostly
have chains with a single store (the clobber) and thus the idea of
reducing the number of alias queries by globbing all stores of a chain
into a single effective ao_ref wouldn't help too much
(but it would save us from caching the ao_ref in each store to a single
ao_ref per chain):

  36422 Terminating chain with 1 stores
  10889 Terminating chain with 3 stores

and the largest number of active chains is 32661.

The idea of limiting the number of overall tracked stores instead of the
number of chains would be similarly simple and put a limit on the overall
alias queries (but then we have the limit on the number of stores in a
single chain, possibly because of non-linearities in chain processing).

Still limiting the number of tracked chain is more obvious in behavior
and thus IMHO superior for users (even if it means we have to put a
more conservative default on that).

Statistics on the param defaults:

 641.31 (  2%) 
1282.64 (  4%)
2564.86 (  8%)

but given the above statistics the testcase isn't a good example to tune
these with just either one or three stores on each chain.  Enabling
the ao_ref caching only marginally improves the testcase with the
limiting in place:

2564.28 (  7%)

so any comments?

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2021-02-10 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #93 from Jakub Jelinek  ---
I think I'd go for more chains by default, at least 64 or even 256, with a
param and tracking on how many we have in a counter.  The class has a
ctor/dtor, so the increment/decrement of the counter can be done there.
And I think it is doubly-linked, so tail should be prev on the head.

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2021-02-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #92 from Richard Biener  ---
Simple and stupid like the below works and does

 store merging  :   0.42 (  1%)   0.00 (  0%)   0.43 (  1%)
 3858k (  1%)
 TOTAL  :  56.86  0.56 57.45   
  557M

we have a limit of 64 stores in a single chain, so the product of both limits
limit the number of alias queries done in terminate_all_aliasing_chains.  I'll
polish it up tomorrow (and will refrain from trying to avoid the linear
walk here and keeping a counter or even a pointer to the last element).

diff --git a/gcc/gimple-ssa-store-merging.c b/gcc/gimple-ssa-store-merging.c
index f0f4a068de5..c6ec6b2cbce 100644
--- a/gcc/gimple-ssa-store-merging.c
+++ b/gcc/gimple-ssa-store-merging.c
@@ -5175,6 +5175,19 @@ pass_store_merging::process_store (gimple *stmt)

   /* Store aliases any existing chain?  */
   ret |= terminate_all_aliasing_chains (NULL, stmt);
+  unsigned cnt = 0;
+  imm_store_chain_info **e = _stores_head;
+  while (*e)
+if (++cnt > 16)
+  {
+   if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file, "Too many chains, terminating oldest before "
+  "starting a new chain.\n");
+   terminate_and_process_chain (*e);
+  }
+else
+  e = &(*e)->next;
+
   /* Start a new chain.  */
   class imm_store_chain_info *new_chain
 = new imm_store_chain_info (m_stores_head, base_addr);

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2021-02-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #91 from Richard Biener  ---
So the other simple idea I have is to limit the number of active store groups
and force-terminate in either a LRU or FIFO manner.

For the testcase at hand the decls we start the chain for are all only
used in full but knowing that would require some pre-analysis of the IL
similar to what SRA does for example (collecting all accesses).  It's
then also still easy to "break" such heuristic so limiting is in the end
the only (and I guess best) option.

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2021-02-10 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #90 from Jakub Jelinek  ---
Because it says that the whole range is uninitialized, so the store merging
code doesn't need to care about pre-existing content in any gaps between the
stored values.  So say when the whole var is clobbered and then the code stores
to every second bitfield, we don't need to read the old content, mask it, or
with the stored bits and store that, but can just put some suitable value into
the gaps (0 or all ones or whatever is best).
For quadratic behavior, I wonder if we just shouldn't see how many chains are
we tracking currently and if we have too many (some param), terminate all of
them.

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2021-02-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #89 from Richard Biener  ---
Fallout includes

FAIL: g++.dg/opt/store-merging-1.C  scan-tree-dump store-merging "New sequence
of [12] stores to replace old one of 2 stores"

which shows

Starting new chain with statement:
s ={v} {CLOBBER};
The base object is:

Recording immediate store from stmt:
s.a = 0;
Recording immediate store from stmt:
s.b = 0;
stmt causes chain termination:
foo (s);

and the CLOBBER allows us to use zeros for padding:

Store 0:
bitsize:64 bitpos:0 val:{CLOBBER}
Store 1:
bitsize:32 bitpos:0 val:0
Store 2:
bitsize:8 bitpos:32 val:0
After writing {CLOBBER} of size 64 at position 0
  the merged value contains 00 00 00 00 00 00 00 00
  the merged mask contains  00 00 00 00 00 00 00 00
After writing 0 of size 32 at position 0
  the merged value contains 00 00 00 00 00 00 00 00
  the merged mask contains  00 00 00 00 00 00 00 00
After writing 0 of size 8 at position 32
  the merged value contains 00 00 00 00 00 00 00 00
  the merged mask contains  00 00 00 00 00 00 00 00
Coalescing successful!
Merged into 1 stores
New sequence of 1 stores to replace old one of 2 stores
# .MEM_6 = VDEF <.MEM_5>
MEM  [(void *)] = 0;
Merging successful!

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2021-02-10 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #88 from rguenther at suse dot de  ---
On Wed, 10 Feb 2021, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474
> 
> --- Comment #87 from Jakub Jelinek  ---
> At least for PR92038 it is important to see CLOBBERs in the chain, including
> the first position in there.

Hmm, OK.  I'll look closer tomorrow but can you try to explain why
it's ever important at the first position?

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2021-02-10 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #87 from Jakub Jelinek  ---
At least for PR92038 it is important to see CLOBBERs in the chain, including
the first position in there.

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2021-02-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #86 from Richard Biener  ---
OK, so clobber handling was added as a fix for PR92038

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2021-02-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #85 from Richard Biener  ---
Starting new chain with statement:
D.31414 ={v} {CLOBBER};
The base object is:

Starting new chain with statement:
D.31415 ={v} {CLOBBER};
The base object is:

...

but those are all the last use of the base object so they just add up and
are never invalidated, but lengthening the m_stores_head list and thus
making terminate_all_aliasing_chains more expensive.

Jakub, were the clobbers ever supposed to _start_ a chain?

With

diff --git a/gcc/gimple-ssa-store-merging.c b/gcc/gimple-ssa-store-merging.c
index f0f4a068de5..fa9a092d544 100644
--- a/gcc/gimple-ssa-store-merging.c
+++ b/gcc/gimple-ssa-store-merging.c
@@ -5175,6 +5175,9 @@ pass_store_merging::process_store (gimple *stmt)

   /* Store aliases any existing chain?  */
   ret |= terminate_all_aliasing_chains (NULL, stmt);
+  /* Do not start a new chain from a CLOBBER.  */
+  if (gimple_clobber_p (stmt))
+return ret;
   /* Start a new chain.  */
   class imm_store_chain_info *new_chain
 = new imm_store_chain_info (m_stores_head, base_addr);

compile-time gets down to

 store merging  :   1.18 (  2%)   0.00 (  0%)   1.18 (  2%)
 3858k (  1%)
 TOTAL  :  59.84  0.57 60.43   
  557M

I'm checking if it has any testsuite fallout.

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2021-02-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #84 from Richard Biener  ---
So it's the usual (quadratic) culprit:

Samples: 1M of event 'cycles:u', Event count (approx.): 1675893461671   
Overhead   Samples  Command  Shared Object Symbol   
  20.61%316521  f951 f951  [.] get_ref_base_and_extent
  14.42%221221  f951 f951  [.] (anonymous
namespace)::pass_store_merging::terminate_all_aliasing_chains
   5.77% 88586  f951 f951  [.] special_function_p

I'll see whether I can do some surgery.

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2021-02-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

Richard Biener  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #83 from Richard Biener  ---
Meh.  On trunk (GCC 11) we now have for the reduced testcase

> ./f951 -quiet testcase_reduced.f90 -ffree-line-length-512 -ftime-report -O3

Time variable   usr   sys  wall
  GGC
...
 callgraph ipa passes   :  28.09 (  8%)   0.23 ( 38%)  28.33 (  8%)
   68M ( 12%)
 ipa inlining heuristics:   5.13 (  1%)   0.01 (  2%)   5.13 (  1%)
   14M (  3%)
 alias stmt walking :   7.03 (  2%)   0.09 ( 15%)   7.15 (  2%)
  277k (  0%)
 tree PTA   :  26.20 (  7%)   0.17 ( 28%)  26.39 (  7%)
   25M (  5%)
 store merging  : 308.60 ( 84%)   0.01 (  2%) 308.70 ( 84%)
 3858k (  1%)
 TOTAL  : 365.68  0.61366.42   
  557M

so store-merging goes bollocks.  I will try to dig into it a bit but I'm not
very familiar with the code.  GCC 10 behaves similar here but not as bad:

 store merging  : 232.10 ( 82%)   0.02 (  4%) 232.19 ( 82%)
   3837 kB (  1%)
 TOTAL  : 283.51  0.45284.05   
 582957 kB

while GCC 9 is sane:

 store merging  :   0.04 (  0%)   0.00 (  0%)   0.04 (  0%)
   2700 kB (  1%)
 TOTAL  :  88.59  0.70 89.34   
 521364 kB

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2018-11-06 Thread hubicka at ucw dot cz
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #82 from Jan Hubicka  ---
> > Yep, this is because they used to be arrays indexed by symbol UIDs which
> > Martin converted to hash tables.  Inliner happily calls summary_get each
> > time it needs the summary.  I have some patches to speed this up which I
> > will push out after the type changes (while they add bit of extra
> > functionality by teaching ipa-predicates abou value range I hope they
> > are OK for early stage3).
> 
> Awww.  I guess with no longer re-using UIDs we then get bad hashing
> behavior as well :/  I hope the hash function is _not_ simply the UID?

Yep, UID reuse was there precisely to make the arrays reasonably
compact.  hash is int_hash which IMO returns simply uid.

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2018-11-06 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #81 from rguenther at suse dot de  ---
On Tue, 6 Nov 2018, hubicka at ucw dot cz wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474
> 
> --- Comment #80 from Jan Hubicka  ---
> > 
> > flat perf profile:
> > 
> > Samples: 510K of event 'instructions:p', Event count (approx.): 
> > 715615147320
> > Overhead   Samples  Command  Shared Object Symbol   
> > 
> >8.08% 95243  f951 f951  [.] bitmap_ior_into
> >7.21% 25966  f951 f951  [.] sreal::operator*
> >5.43% 19353  f951 f951  [.]
> > hash_table >5.20% 23167  f951 f951  [.] 
> > get_ref_base_and_extent
> >4.93% 17947  f951 f951  [.]
> > profile_count::to_sreal_s
> >4.37% 15865  f951 f951  [.] sreal::operator/
> >3.45% 30532  f951 f951  [.] bitmap_set_bit
> >3.41% 12159  f951 f951  [.]
> > hash_table >3.08% 11034  f951 f951  [.] 
> > default_binds_local_p_3
> >3.08% 11146  f951 f951  [.]
> > hash_table >2.21%  7877  f951 f951  [.]
> > want_inline_small_functio
> >1.93%  6874  f951 f951  [.] edge_badness
> >1.87%  6675  f951 f951  [.]
> > compute_inlined_call_time
> > 
> > the ipa_fn_summary hash and edge_growth_cache / call_summary hashes are
> > oddly on top of the profile...
> 
> Yep, this is because they used to be arrays indexed by symbol UIDs which
> Martin converted to hash tables.  Inliner happily calls summary_get each
> time it needs the summary.  I have some patches to speed this up which I
> will push out after the type changes (while they add bit of extra
> functionality by teaching ipa-predicates abou value range I hope they
> are OK for early stage3).

Awww.  I guess with no longer re-using UIDs we then get bad hashing
behavior as well :/  I hope the hash function is _not_ simply the UID?

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2018-11-06 Thread hubicka at ucw dot cz
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #80 from Jan Hubicka  ---
> 
> flat perf profile:
> 
> Samples: 510K of event 'instructions:p', Event count (approx.): 715615147320  
>   
> Overhead   Samples  Command  Shared Object Symbol 
>   
>8.08% 95243  f951 f951  [.] bitmap_ior_into
>7.21% 25966  f951 f951  [.] sreal::operator*
>5.43% 19353  f951 f951  [.]
> hash_table5.20% 23167  f951 f951  [.] get_ref_base_and_extent
>4.93% 17947  f951 f951  [.]
> profile_count::to_sreal_s
>4.37% 15865  f951 f951  [.] sreal::operator/
>3.45% 30532  f951 f951  [.] bitmap_set_bit
>3.41% 12159  f951 f951  [.]
> hash_table3.08% 11034  f951 f951  [.] default_binds_local_p_3
>3.08% 11146  f951 f951  [.]
> hash_table2.21%  7877  f951 f951  [.]
> want_inline_small_functio
>1.93%  6874  f951 f951  [.] edge_badness
>1.87%  6675  f951 f951  [.]
> compute_inlined_call_time
> 
> the ipa_fn_summary hash and edge_growth_cache / call_summary hashes are
> oddly on top of the profile...

Yep, this is because they used to be arrays indexed by symbol UIDs which
Martin converted to hash tables.  Inliner happily calls summary_get each
time it needs the summary.  I have some patches to speed this up which I
will push out after the type changes (while they add bit of extra
functionality by teaching ipa-predicates abou value range I hope they
are OK for early stage3).

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2018-11-06 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

Richard Biener  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #79 from Richard Biener  ---
On the GCC 8 branch with -g -O0 (x86_64-linux) I get

 TOTAL  :   9.59  0.49 10.09   
 551543 kB
10.53user 0.55system 0:11.08elapsed 99%CPU (0avgtext+0avgdata
1230908maxresident)k
0inputs+68384outputs (0major+379474minor)pagefaults 0swaps

On trunk with mem-stats the biggest offender is (as reported)

df_scan mw_reg  alloc-pool.h:487 (df_scan_alloc)   
  1136 0 :  0.0%  483M   38M: 92.4%  32

going up to -Ofast makes compile-time explode again.  For the reduced testcase
it's still inlining and PTA:

 ipa inlining heuristics:  41.55 ( 36%)   0.01 (  1%)  41.60 ( 36%)
   6333 kB (  2%)
 alias stmt walking :  14.73 ( 13%)   0.14 ( 18%)  14.98 ( 13%)
164 kB (  0%)
 tree PTA   :  37.93 ( 33%)   0.23 ( 30%)  38.16 ( 33%)
  31921 kB (  8%)
 TOTAL  : 115.67  0.77116.52   
 411925 kB
115.83user 0.81system 1:56.71elapsed 99%CPU (0avgtext+0avgdata
1034364maxresident)k
1624inputs+12624outputs (1major+308881minor)pagefaults 0swaps

trunk seems to behave similar:

 ipa inlining heuristics:  53.74 ( 41%)   0.02 (  2%)  53.75 ( 40%)
   5428 kB (  1%)
 alias stmt walking :  14.98 ( 11%)   0.19 ( 23%)  15.16 ( 11%)
165 kB (  0%)
 tree PTA   :  39.70 ( 30%)   0.24 ( 29%)  39.92 ( 30%)
  31896 kB (  8%)
 TOTAL  : 132.01  0.83132.85   
 407617 kB
132.01user 0.86system 2:12.88elapsed 99%CPU (0avgtext+0avgdata
1034096maxresident)k
0inputs+8224outputs (0major+301553minor)pagefaults 0swaps

flat perf profile:

Samples: 510K of event 'instructions:p', Event count (approx.): 715615147320
Overhead   Samples  Command  Shared Object Symbol   
   8.08% 95243  f951 f951  [.] bitmap_ior_into
   7.21% 25966  f951 f951  [.] sreal::operator*
   5.43% 19353  f951 f951  [.]
hash_table

[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2013-12-10 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #78 from Richard Biener rguenth at gcc dot gnu.org ---
Author: rguenth
Date: Tue Dec 10 12:31:39 2013
New Revision: 205857

URL: http://gcc.gnu.org/viewcvs?rev=205857root=gccview=rev
Log:
2013-12-10  Richard Biener  rguent...@suse.de

PR middle-end/38474
* tree-ssa-structalias.c (solution_set_expand): Expand into
a different possibly cached bitmap and return the result.
(set_union_with_increment): Pass in a shared expanded bitmap
and adjust.
(do_sd_constraint): Likewise.
(do_ds_constraint): Likewise.
(do_complex_constraint): Likewise.
(solve_graph): Manage the shared expanded bitmap.

* gcc.dg/ipa/ipa-pta-14.c: Un-XFAIL.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.dg/ipa/ipa-pta-14.c
trunk/gcc/tree-ssa-structalias.c


[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2013-12-09 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #77 from Richard Biener rguenth at gcc dot gnu.org ---
Author: rguenth
Date: Mon Dec  9 15:13:07 2013
New Revision: 205808

URL: http://gcc.gnu.org/viewcvs?rev=205808root=gccview=rev
Log:
2013-12-09  Richard Biener  rguent...@suse.de

PR middle-end/38474
* tree-ssa-structalias.c (set_union_with_increment): Remove
unreachable code.
(do_complex_constraint): Call set_union_with_increment with
the solution delta, not the full solution.
(make_transitive_closure_constraints): Merge the two
constraints.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/tree-ssa-structalias.c


[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2013-12-06 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #75 from Richard Biener rguenth at gcc dot gnu.org ---
On trunk with the reduced testcase and -O2 (no -g):

 ipa inlining heuristics :   9.85 ( 5%) usr   0.00 ( 0%) sys   9.93 ( 5%) wall 
  1448 kB ( 0%) ggc
 tree PTA: 161.26 (78%) usr   0.30 (45%) sys 162.00 (78%) wall 
 42484 kB ( 8%) ggc
 expand vars :   3.06 ( 1%) usr   0.01 ( 2%) sys   3.06 ( 1%) wall 
 16074 kB ( 3%) ggc
 integrated RA   :   4.46 ( 2%) usr   0.11 (17%) sys   4.56 ( 2%) wall 
 87144 kB (17%) ggc
 TOTAL : 205.49 0.66   206.80
513245 kB

-O3 is in the same ballpark.  I'll look at this again.


[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2013-12-06 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474

--- Comment #76 from Richard Biener rguenth at gcc dot gnu.org ---
There are a lot of calls with fnspec, almost all constraints look like

D.12770.0+16 = allalltmp
D.12770.64+128 = allalltmp
D.12770.192+64 = allalltmp
callarg = READONLY
callarg = *callarg
callarg = callarg + UNKNOWN
CALLUSED = callarg
ESCAPED = NONLOCAL
allalltmp = CALLUSED
allalltmp = NONLOCAL
D.12771.0+16 = allalltmp
D.12771.64+128 = allalltmp
D.12771.192+64 = allalltmp
callarg = D.12770.0+16
callarg = *callarg
callarg = callarg + UNKNOWN

They get unified pretty quickly though.

Still we end up with many very large sets that include ESCAPED but also some
members of ESCAPED explicitely (that's redundant).  I have some idea on how
to mitigate this which eventually should speed things up (or at least reduce
memory usage).

Like

Index: tree-ssa-structalias.c
===
--- tree-ssa-structalias.c  (revision 205739)
+++ tree-ssa-structalias.c  (working copy)
@@ -1600,6 +1600,14 @@ do_sd_constraint (constraint_graph_t gra
   goto done;
 }

+  /* If the solution of Y contains escaped then filter all bits from
+ that from the delta to reduce work.  */
+  if (bitmap_bit_p (delta, escaped_id))
+{
+  bitmap_and_compl_into (delta, get_varinfo (find
(escaped_id))-solution);
+  flag |= bitmap_set_bit (sol, escaped_id);
+}
+  
   /* If we do not know at with offset the rhs is dereferenced compute
  the reachability set of DELTA, conservatively assuming it is
  dereferenced at all valid offsets.  */

will check next week.


[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2013-03-07 Thread rguenth at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474



Richard Biener rguenth at gcc dot gnu.org changed:



   What|Removed |Added



 Status|NEW |ASSIGNED

 Blocks||47344

 AssignedTo|unassigned at gcc dot   |rguenth at gcc dot gnu.org

   |gnu.org |



--- Comment #73 from Richard Biener rguenth at gcc dot gnu.org 2013-03-07 
10:31:53 UTC ---

On trunk with the reduced testcase I now see PTA taking 90% of compile-time ...



Argh.


[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2013-03-07 Thread rguenth at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474



--- Comment #74 from Richard Biener rguenth at gcc dot gnu.org 2013-03-07 
14:55:27 UTC ---

(In reply to comment #73)

 On trunk with the reduced testcase I now see PTA taking 90% of compile-time 
 ...

 

 Argh.



I can speed it up by



@@ -1631,7 +1619,20 @@ do_sd_constraint (constraint_graph_t gra

flag |= bitmap_set_bit (sol, escaped_id);

  else if (v-may_have_pointers

add_graph_edge (graph, lhs, t))

-   flag |= bitmap_ior_into (sol, get_varinfo (t)-solution);

+   {

+ /* For transitive closures, x = *(x + UNKNOWN), delay

+propagation of the solution across the added edges

+by marking sources as changed.  */

+ if (lhs == c-rhs.var)

+   {

+ bitmap_set_bit (changed, t);

+ flag |= true;

+   }

+ /* Else speedup solving by doing that here to save

+iterations.  */

+ else

+   flag |= bitmap_ior_into (sol, get_varinfo (t)-solution);

+   }



  /* If the variable is not exactly at the requested offset

 we have to include the next one.  */



as repeatedly walking all of 'sol' for 'sol = *(sol + UNKNOWN)' when

adding the solution of one of delta(sol)'s member is slow.  Using

a temporary bitmap to collect all changes doesn't speed it up though,

so the only effect of the above is that the forwarding is delayed until

the next solver iteration (where eventually we discover and eliminate

more indirect cycles).


[Bug middle-end/38474] compile time explosion in dataflow_set_preserve_mem_locs at -O3

2013-03-06 Thread steven at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474



Steven Bosscher steven at gcc dot gnu.org changed:



   What|Removed |Added



 Status|WAITING |NEW

URL|http://gcc.gnu.org/ml/gcc-p |

   |atches/2012-05/msg01813.htm |

   |l   |

 CC|hubicka at gcc dot gnu.org, |

   |jamborm at gcc dot gnu.org, |

   |matz at gcc dot gnu.org,|

   |vmakarov at redhat dot com  |

 AssignedTo|matz at gcc dot gnu.org |unassigned at gcc dot

   ||gnu.org

Summary|slow compilation at -O0 due |compile time explosion in

   |to expand's temp slot goo   |dataflow_set_preserve_mem_l

   ||ocs at -O3

  Known to fail||4.8.0