[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled

2019-02-22 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

Jakub Jelinek  changed:

   What|Removed |Added

   Target Milestone|8.3 |8.4

--- Comment #45 from Jakub Jelinek  ---
GCC 8.3 has been released.

[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled

2018-10-17 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #44 from Jeffrey A. Law  ---
I'd be very hesitant to make the cost model target specific.  It goes against
core design goals of gimple.

Conceptually I believe we should be optimizing as much as possible on gimple
and that issues such as register pressure should be addressed at the
gimple->rtl border and later, not by throttling the gimple optimizers.

You could potentially look at Cliff Click's work from '95.  It could probably
be repurposed as a general lifetime shrinking pass through statement scheduling
within blocks and across blocks.

You could also look at reviving Bernd's work which tries to do statement
scheduling near the gimple->rtl border.

[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled

2018-10-17 Thread prathamesh3492 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #43 from prathamesh3492 at gcc dot gnu.org ---
Sorry for duplications / formatting errors in previous comment. Is there a way
to edit posted comments ?

Thanks,
Prathamesh

[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled

2018-10-17 Thread prathamesh3492 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #42 from prathamesh3492 at gcc dot gnu.org ---
Hi,
This is another simpler approach I tried to apply "cost-model" on hoisting
before approaching a more general solution:
http://people.linaro.org/~prathamesh.kulkarni/hoist-change-order.diff
In this prototype patch, I changed order of hoisting such that instead hoisting
an expression in first candidate block, it hoists expression one dominator at a
time.

For pr77445-2.c test-case, str_225 + 1 gets hoisted in block 10 because it's
the first candidate block found from the top-down dom-tree walk, which leaves
little room for controlling hoisting.
The patch forces expressions to be inserted in immediate dominator at a time
instead of the first candidate block. With this change, the following series of
hoistings take place for str_225 + 1:

Inserting expression in block 15 for code
hoisting:{pointer_plus_expr,str_225,1} (0079)
Inserting expression in block 14 for code hoisting:
{pointer_plus_expr,str_225,1} (0079)
Inserting expression in block 11 for code hoisting:
{pointer_plus_expr,str_225,1} (0079)
Inserting expression in block 10 for code hoisting:
{pointer_plus_expr,str_225,1} (0079)
Inserting expression in block 53 for code hoisting:
{pointer_plus_expr,str_225,1} (0079)

str_225 + 1 originally appears in blocks 16 and 17. It is then first hoisted
into their predecessor block 15, then into block 14 and so on. The advantage I
see with this order of hoisting is, we can control hoisting after each
insertion in it's immediate dominator. So for instance if according to our cost
model, we reach "hoisting threshold" after say block 14, we can then prevent
further hoistings of str_225 + 1. Whereas with the current approach it gets
hoisted right up to block 10 initially. Alternatively we could try to "sink"
the expression down to dominated blocks. I didn't explore this option yet.

* Cost model for hoisting
The cost model would be entirely target specific defined by a target hook and
shouldn't affect other architectures that don't wish to use it. I suppose a
very simple cost model for hoisting could take following two factors:
a) Number of hoistings of a particular expression measured in terms of
dominator depth - This is recorded by expr_hoist_map which is map the former representing value number of pre_expr and latter
represents the count.
b) Number of insertions in basic block - This is recorded by map, the former representing block index and latter represents the count.

I didn't attempt to define the cost-model in the patch. I was wondering what
could be other potential factors that we can consider ?

* Issues with changing hoisting order
I am not entirely sure if the result of changing hoisting order can result in
correctness issues or missed optimizations ? For some confidence, I validated
the patch with bootstrap+test on x86_64, which worked.

There are two problems I see:
(1) Interference with statistics of hoisting, which is easy to fix.
(2) Does not honor the "expression should be available in at least one
successor" constraint, which leads to more aggressive hoisting for
architectures that will not use cost model. In example above, str_225 + 1 got
hoisted one block further upto block-53, while with current-order it's
restricted to block-10. I suppose we could fix this by recording which
expressions were originally available at end of block ?

The patch passes bootstrap+test on x86_64.

* Hoistings crossing loop boundary - One "peculiarity" I see with FMS function
in pr77445-2.c is that all the hoistings cross loop boundaries at one point,
while other tests have significantly lesser.
I did a quick test with SPEC2006 to collect some data:
(number-of-hoistings vs number-of-functions)
{2: 89, 1: 166, 3: 37, 4: 14, 5: 8, 6: 10, 7: 11, 8: 2, 13: 3, 10: 1, 11: 5, 9:
4, 17: 1, 15: 2, 27: 1, 12: 2, 18: 1, 21: 1, 26: 1}

It seems most of the functions have cross loop hoistings less than 5 with 166
functions having one hoisting inside loop and 89 functions having two hoistings
across loops. I was wondering if a hoisting into a block from it's successor
should have "extra penalty" if it crosses a loop boundary ? Or does hoisting
inside a loop have no effect on register pressure ?

Thanks,
Prathamesh

[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled

2018-08-01 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #41 from rguenther at suse dot de  ---
On Wed, 1 Aug 2018, prathamesh3492 at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155
> 
> --- Comment #40 from prathamesh3492 at gcc dot gnu.org ---
> ping https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155#c38
> 
> Thanks,
> Prathamesh

Well, all heuristics will have up and downsides.  In principle
re-ordering sinking and PRE makes sense but given it isn't
only "sinking" side-effects of this need to be watched.  I suppose
similar as to how PRE hoists code it should also sink, removing
the need of a separate (ad-hoc!) sinking pass.  Note that both
passes are confused by dead code.

But yes, in general a live-range shrinking pass is what would
improve the situation in the best possible way given the
exact situation created by PRE & friends can be created by
adjusting the testcase in source.

One complication with the idea of a live-shrinking pass is
that we have TER.  Bernd has done some kind of live-shrinking
pass before that wasn't merged.  I think I pointed you to it
at some point.

[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled

2018-08-01 Thread prathamesh3492 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #40 from prathamesh3492 at gcc dot gnu.org ---
ping https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155#c38

Thanks,
Prathamesh

[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled

2018-07-26 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

Jakub Jelinek  changed:

   What|Removed |Added

   Target Milestone|8.2 |8.3

--- Comment #39 from Jakub Jelinek  ---
GCC 8.2 has been released.

[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled

2018-07-19 Thread prathamesh3492 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #38 from prathamesh3492 at gcc dot gnu.org ---
Hi,
The issue can be reproduced exactly, with pr77445-2.c. I am testing with making
is_digit() noinline.

* Reordering SINK before PRE

SPEC2006 data for building SPEC2006 with sink before pre:
Number of statements sunk: +2677 (~ +14%)
Number of total PRE insertions: -3971 (~ -1%)
On the private embedded benchmark suite, there's overall no significant
difference.

Not sure if this is much helpful. Is there a way to get info about number of
registers spilled from lra dump or assembly ?
I would like to see the effect on spills by reordering passes.

Reordering sink before pre seems to regress no-scevccp-outer-22.c and
ssa-dom-thread-7.c, and several SVE tests on aarch64:
http://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/262002-sink-pre/aarch64-none-linux-gnu/diff-gcc-rh60-aarch64-none-linux-gnu-default-default-default.txt

Also there seems to be some interplay with hoisting and forwprop. Disabling
forwprop3 and forwprop4 seems to eliminate the spill too. However as Bin
pointed out on the list, forwprop is also helping to reduce register pressure
for this case by mem_ref folding (forward_propagate_addr_expr).

* Jump threading cost models

It seems jump-threading pass increases the size for this case from 38 to 79
blocks. Wondering if that adds up to "resource hog", eventually leading
to extra spill ? Disabling jump threading pass eliminates the spill.

I looked a bit into fine tuning jump threading cost models for cortex-m7.
Strangely, setting max-jump-thread-duplication-stmts to 20 and
fsm-scale-path-stmts to 3 not only removes the spill but also results in 9 more
hoistings! I am investigating why this resulted
in improved performance. However it regresses ssa-dom-thread-7.c:
http://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/262539-jump-thread-cost-models/aarch64-none-elf/diff-gcc-rh60-aarch64-none-elf-default-default-default.txt

* Stop-gap measure for hoisting ?

As a stop-gap measure, would it make sense to "localize" hoisting within
"large" loop (based on loop->num_nodes?) by refusing to hoist expressions
computed outside loop ?
My assumption is that hoisting will increase live range of expression which was
previously computed in a block outside loop but is brought inside the
loop due to hoisting since we'd now need to consider path along the loop as
well for estimating it's live-range ? I suppose a cheap way to test that would
be to check if block's post-dominator also lies within the same loop since it
would ensure all paths from block to EXIT would lie inside the loop ?
I created a patch for this
(http://people.linaro.org/~prathamesh.kulkarni/pdom.diff), which works to
remove the spill but regressed pr77445-2.c (which is how I stumbled on that
test). Although the underlying issue doesn't seem particularly relevant to
hoisting, so not sure if this "heuristic" makes much sense.

* Live range shrinking pass

There was some discussion about an inter-block live-range shrinking GIMPLE pass
on the list (https://gcc.gnu.org/ml/gcc/2018-05/msg00260.html), which will run
just before expand. I would be grateful for suggestions on how to get started
with it. I realize this'd be pretty hard, but would like to give a try. 

Thanks,
Prathamesh

[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled

2018-05-22 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #37 from bin cheng  ---
(In reply to rguent...@suse.de from comment #36)
> On Tue, 22 May 2018, amker at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155
> > 
> > bin cheng  changed:
> > 
> >What|Removed |Added
> > 
> >  CC||amker at gcc dot gnu.org
> > 
> > --- Comment #35 from bin cheng  ---
> > (In reply to prathamesh3492 from comment #33)
> > > Created attachment 42341 [details]
> > > Test-case to reproduce regression with cortex-m7
> > > 
> > > I have attached an artificial test-case that is fairly representative of 
> > > the
> > > regression we are seeing in a benchmark. The test-case mimics a
> > > deterministic finite automaton. With code-hoisting there's an additional
> > > spill of r5 near beginning of the function.
> > > 
> > ...
> > > 
> > > Without code-hoisting it is reusing r3 to store a + 1, while due to code
> > > hoisting it uses the extra register 'r2' to store the value of hoisted
> > > expression a + 1.
> > > 
> > > Would it be a good idea to somehow "limit" the distance (in terms of 
> > > number
> > > of basic blocks maybe?) between the definition of hoisted variable and 
> > > it's
> > > furthest use during PRE ? If that exceeds a certain threshold then PRE
> > > should choose not to hoist that expression. The threshold could be a param
> > > that can be set by backends.
> > > Does this analysis look reasonable ?
> > 
> > It might be more accurate to calculate register pressure and use that to 
> > guide
> > code hoisting.  I introduced register pressure hoisting for RTL under option
> > -fira-hoist-pressure, basically similar thing needs to be done here.
> > 
> > The proposed Tree-SSA register pressure patch set is still under review, but
> > please note it only does minimal now by only computing register pressure.  
> > To
> > make it useful in this case, it may need to be improved by
> > calculating/recording live range for statements (I did that in previous 
> > version
> > patch).  We would also need interfaces updating live range information in 
> > line
> > with code motion.
> 
> One important thing on GIMPLE is that stmt order (inside a BB at least)
> is quite arbitrary and thus LIVE should consider the stmts ordered
> in LIVE-optimal way to not introduce too much noise.  It might be that
> we should only consider live-through [loops] for heuristics in some 
> places plus the obvious changes in liveness that transforms induce.

Yes, I actually did experiments only counting live ranges at bb in/out when
computing max pressure for the current implementation.  It doesn't make much
difference for the only use in predcom, but could be important for case like
this one.

[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled

2018-05-22 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

--- Comment #36 from rguenther at suse dot de  ---
On Tue, 22 May 2018, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155
> 
> bin cheng  changed:
> 
>What|Removed |Added
> 
>  CC||amker at gcc dot gnu.org
> 
> --- Comment #35 from bin cheng  ---
> (In reply to prathamesh3492 from comment #33)
> > Created attachment 42341 [details]
> > Test-case to reproduce regression with cortex-m7
> > 
> > I have attached an artificial test-case that is fairly representative of the
> > regression we are seeing in a benchmark. The test-case mimics a
> > deterministic finite automaton. With code-hoisting there's an additional
> > spill of r5 near beginning of the function.
> > 
> ...
> > 
> > Without code-hoisting it is reusing r3 to store a + 1, while due to code
> > hoisting it uses the extra register 'r2' to store the value of hoisted
> > expression a + 1.
> > 
> > Would it be a good idea to somehow "limit" the distance (in terms of number
> > of basic blocks maybe?) between the definition of hoisted variable and it's
> > furthest use during PRE ? If that exceeds a certain threshold then PRE
> > should choose not to hoist that expression. The threshold could be a param
> > that can be set by backends.
> > Does this analysis look reasonable ?
> 
> It might be more accurate to calculate register pressure and use that to guide
> code hoisting.  I introduced register pressure hoisting for RTL under option
> -fira-hoist-pressure, basically similar thing needs to be done here.
> 
> The proposed Tree-SSA register pressure patch set is still under review, but
> please note it only does minimal now by only computing register pressure.  To
> make it useful in this case, it may need to be improved by
> calculating/recording live range for statements (I did that in previous 
> version
> patch).  We would also need interfaces updating live range information in line
> with code motion.

One important thing on GIMPLE is that stmt order (inside a BB at least)
is quite arbitrary and thus LIVE should consider the stmts ordered
in LIVE-optimal way to not introduce too much noise.  It might be that
we should only consider live-through [loops] for heuristics in some 
places plus the obvious changes in liveness that transforms induce.

[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled

2018-05-22 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

bin cheng  changed:

   What|Removed |Added

 CC||amker at gcc dot gnu.org

--- Comment #35 from bin cheng  ---
(In reply to prathamesh3492 from comment #33)
> Created attachment 42341 [details]
> Test-case to reproduce regression with cortex-m7
> 
> I have attached an artificial test-case that is fairly representative of the
> regression we are seeing in a benchmark. The test-case mimics a
> deterministic finite automaton. With code-hoisting there's an additional
> spill of r5 near beginning of the function.
> 
...
> 
> Without code-hoisting it is reusing r3 to store a + 1, while due to code
> hoisting it uses the extra register 'r2' to store the value of hoisted
> expression a + 1.
> 
> Would it be a good idea to somehow "limit" the distance (in terms of number
> of basic blocks maybe?) between the definition of hoisted variable and it's
> furthest use during PRE ? If that exceeds a certain threshold then PRE
> should choose not to hoist that expression. The threshold could be a param
> that can be set by backends.
> Does this analysis look reasonable ?

It might be more accurate to calculate register pressure and use that to guide
code hoisting.  I introduced register pressure hoisting for RTL under option
-fira-hoist-pressure, basically similar thing needs to be done here.

The proposed Tree-SSA register pressure patch set is still under review, but
please note it only does minimal now by only computing register pressure.  To
make it useful in this case, it may need to be improved by
calculating/recording live range for statements (I did that in previous version
patch).  We would also need interfaces updating live range information in line
with code motion.

I think this case is difficult also because it's hard to decide high register
pressure or not which could be the boundary case, and as we know, Tree-SSA
register pressure is in no way accurate.

Another difficulty is some kind of global information is needed.  Given stupid
example:

a = ...
b = ...
loop
  x = a + b;
  ...
  y = a * b;
  y = x + y;
  store y;
end

The expressions need to be considered together in order to understand register
pressure change.

Thanks,
bin
> 
> Thanks,
> Prathamesh

[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled

2018-05-02 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155

Jakub Jelinek  changed:

   What|Removed |Added

   Target Milestone|8.0 |8.2

--- Comment #34 from Jakub Jelinek  ---
GCC 8.1 has been released.