[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 Jakub Jelinek changed: What|Removed |Added Target Milestone|8.3 |8.4 --- Comment #45 from Jakub Jelinek --- GCC 8.3 has been released.
[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 --- Comment #44 from Jeffrey A. Law --- I'd be very hesitant to make the cost model target specific. It goes against core design goals of gimple. Conceptually I believe we should be optimizing as much as possible on gimple and that issues such as register pressure should be addressed at the gimple->rtl border and later, not by throttling the gimple optimizers. You could potentially look at Cliff Click's work from '95. It could probably be repurposed as a general lifetime shrinking pass through statement scheduling within blocks and across blocks. You could also look at reviving Bernd's work which tries to do statement scheduling near the gimple->rtl border.
[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 --- Comment #43 from prathamesh3492 at gcc dot gnu.org --- Sorry for duplications / formatting errors in previous comment. Is there a way to edit posted comments ? Thanks, Prathamesh
[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 --- Comment #42 from prathamesh3492 at gcc dot gnu.org --- Hi, This is another simpler approach I tried to apply "cost-model" on hoisting before approaching a more general solution: http://people.linaro.org/~prathamesh.kulkarni/hoist-change-order.diff In this prototype patch, I changed order of hoisting such that instead hoisting an expression in first candidate block, it hoists expression one dominator at a time. For pr77445-2.c test-case, str_225 + 1 gets hoisted in block 10 because it's the first candidate block found from the top-down dom-tree walk, which leaves little room for controlling hoisting. The patch forces expressions to be inserted in immediate dominator at a time instead of the first candidate block. With this change, the following series of hoistings take place for str_225 + 1: Inserting expression in block 15 for code hoisting:{pointer_plus_expr,str_225,1} (0079) Inserting expression in block 14 for code hoisting: {pointer_plus_expr,str_225,1} (0079) Inserting expression in block 11 for code hoisting: {pointer_plus_expr,str_225,1} (0079) Inserting expression in block 10 for code hoisting: {pointer_plus_expr,str_225,1} (0079) Inserting expression in block 53 for code hoisting: {pointer_plus_expr,str_225,1} (0079) str_225 + 1 originally appears in blocks 16 and 17. It is then first hoisted into their predecessor block 15, then into block 14 and so on. The advantage I see with this order of hoisting is, we can control hoisting after each insertion in it's immediate dominator. So for instance if according to our cost model, we reach "hoisting threshold" after say block 14, we can then prevent further hoistings of str_225 + 1. Whereas with the current approach it gets hoisted right up to block 10 initially. Alternatively we could try to "sink" the expression down to dominated blocks. I didn't explore this option yet. * Cost model for hoisting The cost model would be entirely target specific defined by a target hook and shouldn't affect other architectures that don't wish to use it. I suppose a very simple cost model for hoisting could take following two factors: a) Number of hoistings of a particular expression measured in terms of dominator depth - This is recorded by expr_hoist_map which is map the former representing value number of pre_expr and latter represents the count. b) Number of insertions in basic block - This is recorded by map, the former representing block index and latter represents the count. I didn't attempt to define the cost-model in the patch. I was wondering what could be other potential factors that we can consider ? * Issues with changing hoisting order I am not entirely sure if the result of changing hoisting order can result in correctness issues or missed optimizations ? For some confidence, I validated the patch with bootstrap+test on x86_64, which worked. There are two problems I see: (1) Interference with statistics of hoisting, which is easy to fix. (2) Does not honor the "expression should be available in at least one successor" constraint, which leads to more aggressive hoisting for architectures that will not use cost model. In example above, str_225 + 1 got hoisted one block further upto block-53, while with current-order it's restricted to block-10. I suppose we could fix this by recording which expressions were originally available at end of block ? The patch passes bootstrap+test on x86_64. * Hoistings crossing loop boundary - One "peculiarity" I see with FMS function in pr77445-2.c is that all the hoistings cross loop boundaries at one point, while other tests have significantly lesser. I did a quick test with SPEC2006 to collect some data: (number-of-hoistings vs number-of-functions) {2: 89, 1: 166, 3: 37, 4: 14, 5: 8, 6: 10, 7: 11, 8: 2, 13: 3, 10: 1, 11: 5, 9: 4, 17: 1, 15: 2, 27: 1, 12: 2, 18: 1, 21: 1, 26: 1} It seems most of the functions have cross loop hoistings less than 5 with 166 functions having one hoisting inside loop and 89 functions having two hoistings across loops. I was wondering if a hoisting into a block from it's successor should have "extra penalty" if it crosses a loop boundary ? Or does hoisting inside a loop have no effect on register pressure ? Thanks, Prathamesh
[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 --- Comment #41 from rguenther at suse dot de --- On Wed, 1 Aug 2018, prathamesh3492 at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 > > --- Comment #40 from prathamesh3492 at gcc dot gnu.org --- > ping https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155#c38 > > Thanks, > Prathamesh Well, all heuristics will have up and downsides. In principle re-ordering sinking and PRE makes sense but given it isn't only "sinking" side-effects of this need to be watched. I suppose similar as to how PRE hoists code it should also sink, removing the need of a separate (ad-hoc!) sinking pass. Note that both passes are confused by dead code. But yes, in general a live-range shrinking pass is what would improve the situation in the best possible way given the exact situation created by PRE & friends can be created by adjusting the testcase in source. One complication with the idea of a live-shrinking pass is that we have TER. Bernd has done some kind of live-shrinking pass before that wasn't merged. I think I pointed you to it at some point.
[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 --- Comment #40 from prathamesh3492 at gcc dot gnu.org --- ping https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155#c38 Thanks, Prathamesh
[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 Jakub Jelinek changed: What|Removed |Added Target Milestone|8.2 |8.3 --- Comment #39 from Jakub Jelinek --- GCC 8.2 has been released.
[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 --- Comment #38 from prathamesh3492 at gcc dot gnu.org --- Hi, The issue can be reproduced exactly, with pr77445-2.c. I am testing with making is_digit() noinline. * Reordering SINK before PRE SPEC2006 data for building SPEC2006 with sink before pre: Number of statements sunk: +2677 (~ +14%) Number of total PRE insertions: -3971 (~ -1%) On the private embedded benchmark suite, there's overall no significant difference. Not sure if this is much helpful. Is there a way to get info about number of registers spilled from lra dump or assembly ? I would like to see the effect on spills by reordering passes. Reordering sink before pre seems to regress no-scevccp-outer-22.c and ssa-dom-thread-7.c, and several SVE tests on aarch64: http://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/262002-sink-pre/aarch64-none-linux-gnu/diff-gcc-rh60-aarch64-none-linux-gnu-default-default-default.txt Also there seems to be some interplay with hoisting and forwprop. Disabling forwprop3 and forwprop4 seems to eliminate the spill too. However as Bin pointed out on the list, forwprop is also helping to reduce register pressure for this case by mem_ref folding (forward_propagate_addr_expr). * Jump threading cost models It seems jump-threading pass increases the size for this case from 38 to 79 blocks. Wondering if that adds up to "resource hog", eventually leading to extra spill ? Disabling jump threading pass eliminates the spill. I looked a bit into fine tuning jump threading cost models for cortex-m7. Strangely, setting max-jump-thread-duplication-stmts to 20 and fsm-scale-path-stmts to 3 not only removes the spill but also results in 9 more hoistings! I am investigating why this resulted in improved performance. However it regresses ssa-dom-thread-7.c: http://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/262539-jump-thread-cost-models/aarch64-none-elf/diff-gcc-rh60-aarch64-none-elf-default-default-default.txt * Stop-gap measure for hoisting ? As a stop-gap measure, would it make sense to "localize" hoisting within "large" loop (based on loop->num_nodes?) by refusing to hoist expressions computed outside loop ? My assumption is that hoisting will increase live range of expression which was previously computed in a block outside loop but is brought inside the loop due to hoisting since we'd now need to consider path along the loop as well for estimating it's live-range ? I suppose a cheap way to test that would be to check if block's post-dominator also lies within the same loop since it would ensure all paths from block to EXIT would lie inside the loop ? I created a patch for this (http://people.linaro.org/~prathamesh.kulkarni/pdom.diff), which works to remove the spill but regressed pr77445-2.c (which is how I stumbled on that test). Although the underlying issue doesn't seem particularly relevant to hoisting, so not sure if this "heuristic" makes much sense. * Live range shrinking pass There was some discussion about an inter-block live-range shrinking GIMPLE pass on the list (https://gcc.gnu.org/ml/gcc/2018-05/msg00260.html), which will run just before expand. I would be grateful for suggestions on how to get started with it. I realize this'd be pretty hard, but would like to give a try. Thanks, Prathamesh
[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 --- Comment #37 from bin cheng --- (In reply to rguent...@suse.de from comment #36) > On Tue, 22 May 2018, amker at gcc dot gnu.org wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 > > > > bin cheng changed: > > > >What|Removed |Added > > > > CC||amker at gcc dot gnu.org > > > > --- Comment #35 from bin cheng --- > > (In reply to prathamesh3492 from comment #33) > > > Created attachment 42341 [details] > > > Test-case to reproduce regression with cortex-m7 > > > > > > I have attached an artificial test-case that is fairly representative of > > > the > > > regression we are seeing in a benchmark. The test-case mimics a > > > deterministic finite automaton. With code-hoisting there's an additional > > > spill of r5 near beginning of the function. > > > > > ... > > > > > > Without code-hoisting it is reusing r3 to store a + 1, while due to code > > > hoisting it uses the extra register 'r2' to store the value of hoisted > > > expression a + 1. > > > > > > Would it be a good idea to somehow "limit" the distance (in terms of > > > number > > > of basic blocks maybe?) between the definition of hoisted variable and > > > it's > > > furthest use during PRE ? If that exceeds a certain threshold then PRE > > > should choose not to hoist that expression. The threshold could be a param > > > that can be set by backends. > > > Does this analysis look reasonable ? > > > > It might be more accurate to calculate register pressure and use that to > > guide > > code hoisting. I introduced register pressure hoisting for RTL under option > > -fira-hoist-pressure, basically similar thing needs to be done here. > > > > The proposed Tree-SSA register pressure patch set is still under review, but > > please note it only does minimal now by only computing register pressure. > > To > > make it useful in this case, it may need to be improved by > > calculating/recording live range for statements (I did that in previous > > version > > patch). We would also need interfaces updating live range information in > > line > > with code motion. > > One important thing on GIMPLE is that stmt order (inside a BB at least) > is quite arbitrary and thus LIVE should consider the stmts ordered > in LIVE-optimal way to not introduce too much noise. It might be that > we should only consider live-through [loops] for heuristics in some > places plus the obvious changes in liveness that transforms induce. Yes, I actually did experiments only counting live ranges at bb in/out when computing max pressure for the current implementation. It doesn't make much difference for the only use in predcom, but could be important for case like this one.
[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 --- Comment #36 from rguenther at suse dot de --- On Tue, 22 May 2018, amker at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 > > bin cheng changed: > >What|Removed |Added > > CC||amker at gcc dot gnu.org > > --- Comment #35 from bin cheng --- > (In reply to prathamesh3492 from comment #33) > > Created attachment 42341 [details] > > Test-case to reproduce regression with cortex-m7 > > > > I have attached an artificial test-case that is fairly representative of the > > regression we are seeing in a benchmark. The test-case mimics a > > deterministic finite automaton. With code-hoisting there's an additional > > spill of r5 near beginning of the function. > > > ... > > > > Without code-hoisting it is reusing r3 to store a + 1, while due to code > > hoisting it uses the extra register 'r2' to store the value of hoisted > > expression a + 1. > > > > Would it be a good idea to somehow "limit" the distance (in terms of number > > of basic blocks maybe?) between the definition of hoisted variable and it's > > furthest use during PRE ? If that exceeds a certain threshold then PRE > > should choose not to hoist that expression. The threshold could be a param > > that can be set by backends. > > Does this analysis look reasonable ? > > It might be more accurate to calculate register pressure and use that to guide > code hoisting. I introduced register pressure hoisting for RTL under option > -fira-hoist-pressure, basically similar thing needs to be done here. > > The proposed Tree-SSA register pressure patch set is still under review, but > please note it only does minimal now by only computing register pressure. To > make it useful in this case, it may need to be improved by > calculating/recording live range for statements (I did that in previous > version > patch). We would also need interfaces updating live range information in line > with code motion. One important thing on GIMPLE is that stmt order (inside a BB at least) is quite arbitrary and thus LIVE should consider the stmts ordered in LIVE-optimal way to not introduce too much noise. It might be that we should only consider live-through [loops] for heuristics in some places plus the obvious changes in liveness that transforms induce.
[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 bin cheng changed: What|Removed |Added CC||amker at gcc dot gnu.org --- Comment #35 from bin cheng --- (In reply to prathamesh3492 from comment #33) > Created attachment 42341 [details] > Test-case to reproduce regression with cortex-m7 > > I have attached an artificial test-case that is fairly representative of the > regression we are seeing in a benchmark. The test-case mimics a > deterministic finite automaton. With code-hoisting there's an additional > spill of r5 near beginning of the function. > ... > > Without code-hoisting it is reusing r3 to store a + 1, while due to code > hoisting it uses the extra register 'r2' to store the value of hoisted > expression a + 1. > > Would it be a good idea to somehow "limit" the distance (in terms of number > of basic blocks maybe?) between the definition of hoisted variable and it's > furthest use during PRE ? If that exceeds a certain threshold then PRE > should choose not to hoist that expression. The threshold could be a param > that can be set by backends. > Does this analysis look reasonable ? It might be more accurate to calculate register pressure and use that to guide code hoisting. I introduced register pressure hoisting for RTL under option -fira-hoist-pressure, basically similar thing needs to be done here. The proposed Tree-SSA register pressure patch set is still under review, but please note it only does minimal now by only computing register pressure. To make it useful in this case, it may need to be improved by calculating/recording live range for statements (I did that in previous version patch). We would also need interfaces updating live range information in line with code motion. I think this case is difficult also because it's hard to decide high register pressure or not which could be the boundary case, and as we know, Tree-SSA register pressure is in no way accurate. Another difficulty is some kind of global information is needed. Given stupid example: a = ... b = ... loop x = a + b; ... y = a * b; y = x + y; store y; end The expressions need to be considered together in order to understand register pressure change. Thanks, bin > > Thanks, > Prathamesh
[Bug tree-optimization/80155] [7/8/9 regression] Performance regression with code hoisting enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80155 Jakub Jelinek changed: What|Removed |Added Target Milestone|8.0 |8.2 --- Comment #34 from Jakub Jelinek --- GCC 8.1 has been released.