[Bug target/82862] [8 Regression] SPEC CPU2006 465.tonto performance regression with r253975 (up to 40% drop for particular loop)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82862

Richard Biener changed:

           What       |Removed   |Added
           Status     |WAITING   |RESOLVED
           Resolution |---       |FIXED

--- Comment #9 from Richard Biener ---
Fixed.
--- Comment #8 from Alexander Nesterovskiy ---
I'd say that it's not just fixed but improved, with an impressive gain. For the 465.tonto SPEC rate it is about +4% on HSW AVX2 and about +8% on SKX AVX512 after r257734 (compared to r257732). Compared to the reference r253973 it is about +2% on HSW AVX2 and +18% on SKX AVX512 (AVX512 was greatly improved in the last 3 months).
Richard Biener changed:

           What       |Removed |Added
           Status     |NEW     |WAITING

--- Comment #7 from Richard Biener ---
Possibly fixed now. Can you verify?
--- Comment #6 from Richard Biener ---
Author: rguenth
Date: Fri Feb 16 13:47:25 2018
New Revision: 257734

URL: https://gcc.gnu.org/viewcvs?rev=257734&root=gcc&view=rev

Log:
2018-02-16  Richard Biener

	PR tree-optimization/84037
	PR tree-optimization/84016
	PR target/82862
	* config/i386/i386.c (ix86_builtin_vectorization_cost):
	Adjust vec_construct for the fact we need additional higher latency
	128bit inserts for AVX256 and AVX512 vector builds.
	(ix86_add_stmt_cost): Scale vector construction cost for
	elementwise loads.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/config/i386/i386.c
Jan Hubicka changed:

           What     |Removed                       |Added
           Status   |ASSIGNED                      |NEW
           CC       |                              |amker.cheng at gmail dot com,
                    |                              |vmakarov at redhat dot com
           Assignee |hubicka at gcc dot gnu.org    |unassigned at gcc dot gnu.org

--- Comment #5 from Jan Hubicka ---
Adding Vladimir and Bin to the CC. Perhaps they will have some ideas. I think the stack store/restore is not too confusing for the RA (it would be nice to get rid of it completely and get the frame pointer back).
--- Comment #4 from Richard Biener ---
I don't have any good ideas here. Fortran with allocated arrays tends to use quite a few integer registers for all the IV setup and computation. One can experiment with less peeling of vector epilogues (--param max-completely-peel-times=1) as well as maybe adding another code-sinking pass. In the end it's intelligent rematerialization of expressions (during RA) that needs to be done, as I fully expect there are not enough integer registers to compute and keep everything live.

There seems to be missed invariant motion on the GIMPLE side, and also a stack allocation in an inner loop which we might be able to hoist. Maybe that (__builtin_stack_save/restore) confuses RA. Those builtins confuse LIM at least (a present memcpy does as well, and we expand that to a libcall). -fno-tree-loop-distribute-patterns helps for that. But even then we still spill a lot.

Thus, try -fno-tree-loop-distribute-patterns plus

Index: gcc/tree-ssa-loop-im.c
===================================================================
--- gcc/tree-ssa-loop-im.c      (revision 255051)
+++ gcc/tree-ssa-loop-im.c      (working copy)
@@ -1432,7 +1432,10 @@ gather_mem_refs_stmt (struct loop *loop,
   bool is_stored;
   unsigned id;
 
-  if (!gimple_vuse (stmt))
+  if (!gimple_vuse (stmt)
+      || gimple_call_builtin_p (stmt, BUILT_IN_STACK_SAVE)
+      || gimple_call_builtin_p (stmt, BUILT_IN_STACK_RESTORE)
+      || gimple_call_builtin_p (stmt, BUILT_IN_ALLOCA_WITH_ALIGN))
     return;
 
   mem = simple_mem_ref_in_stmt (stmt, &is_stored);
--- Comment #3 from Jan Hubicka ---
Created attachment 42680
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42680&action=edit
Assembly produced, showing register-pressure issues.
--- Comment #2 from Jan Hubicka ---
First of all, thanks a lot for the reproducer! Here are times with the vectorizer enabled, disabled, and with no cost model:

 Performance counter stats for './a.out-vect 21 100' (10 runs):

       4588.055614      task-clock:u (msec)   #    1.000 CPUs utilized    ( +- 0.49% )
                 0      context-switches:u    #    0.000 K/sec
                 0      cpu-migrations:u      #    0.000 K/sec
                88      page-faults:u         #    0.019 K/sec            ( +- 0.44% )
    14,911,755,271      cycles:u              #    3.250 GHz              ( +- 0.37% )
    52,564,741,152      instructions:u        #    3.53  insn per cycle   ( +- 0.00% )
     4,073,206,037      branches:u            #  887.785 M/sec            ( +- 0.00% )
        18,106,857      branch-misses:u       #    0.44% of all branches  ( +- 0.30% )

       4.589172192 seconds time elapsed                                   ( +- 0.50% )

jan@skylake:~/trunk/build/tonto> perf stat --repeat 10 ./a.out-novect 21 100

 Performance counter stats for './a.out-novect 21 100' (10 runs):

       3549.651576      task-clock:u (msec)   #    1.000 CPUs utilized    ( +- 0.65% )
                 0      context-switches:u    #    0.000 K/sec
                 0      cpu-migrations:u      #    0.000 K/sec
                88      page-faults:u         #    0.025 K/sec            ( +- 0.42% )
    11,563,811,687      cycles:u              #    3.258 GHz              ( +- 0.61% )
    39,259,740,624      instructions:u        #    3.40  insn per cycle   ( +- 0.00% )
     3,061,205,511      branches:u            #  862.396 M/sec            ( +- 0.00% )
        11,774,836      branch-misses:u       #    0.38% of all branches  ( +- 0.36% )

       3.550955730 seconds time elapsed                                   ( +- 0.65% )

jan@skylake:~/trunk/build/tonto> perf stat --repeat 10 ./a.out-nocost 21 100

 Performance counter stats for './a.out-nocost 21 100' (10 runs):

       4621.515923      task-clock:u (msec)   #    1.000 CPUs utilized    ( +- 0.31% )
                 0      context-switches:u    #    0.000 K/sec
                 0      cpu-migrations:u      #    0.000 K/sec
                87      page-faults:u         #    0.019 K/sec            ( +- 0.35% )
    14,965,340,896      cycles:u              #    3.238 GHz              ( +- 0.30% )
    52,817,740,929      instructions:u        #    3.53  insn per cycle   ( +- 0.00% )
     4,326,205,814      branches:u            #  936.101 M/sec            ( +- 0.00% )
        16,615,805      branch-misses:u       #    0.38% of all branches  ( +- 0.10% )

       4.622600700 seconds time elapsed                                   ( +- 0.31% )

So vectorization hurts in both time and instruction count. There are two loops to vectorize.
The first loop:

  _34 = _74 + S.2_106;
  _35 = _34 * _121;
  _36 = _35 + _124;
  _38 = _36 * _37;
  _39 = (sizetype) _38;
  _40 = _72 + _39;
  _41 = MEM[(real(kind=8)[0:] *)A.14_116][S.2_106];
  *_40 = _41;
  S.2_88 = S.2_106 + 1;
  if (_77 < S.2_88)
    goto ; [15.00%]
  else
    goto loopback; [85.00%]

Vector inside of loop cost: 76
Vector prologue cost: 24
Vector epilogue cost: 48
Scalar iteration cost: 24
Scalar outside cost: 24
Vector outside cost: 72
prologue iterations: 0
epilogue iterations: 2
Calculated minimum iters for profitability: 3
tonto.f90:26:0: note: Runtime profitability threshold = 4
tonto.f90:26:0: note: Static estimate profitability threshold = 11

and the second:

  _18 = S.2_105 + 1;
  _19 = _18 * _61;
  _2 = _19 - _61;
  _21 = _2 * _3;
  _22 = (sizetype) _21;
  _23 = _11 + _22;
  _24 = *_23;
  _25 = _70 + S.2_105;
  _26 = _25 * _117;
  _27 = _26 + _120;
  _29 = _27 * _28;
  _30 = (sizetype) _29;
  _31 = _68 + _30;
  _32 = *_31;
  _33 = _24 * _32;
  MEM[(real(kind=8)[0:] *)A.14_116][S.2_105] = _33;
  if (_18 > _77)
    goto ; [15.00%]
  else
    goto loopback; [85.00%]

Vector inside of loop cost: 176
Vector prologue cost: 24
Vector epilogue cost: 112
Scalar iteration cost: 48
Scalar outside cost: 24
Vector outside cost: 136
prologue iterations: 0
epilogue iterations: 2
Calculated minimum iters for profitability: 7
Static estimate profitability threshold = 18

Both loops iterate about 9 times, so the thresholds are close to the vectorized copies never being executed. The slowdown thus seems to be just collateral damage from adding a vectorized loop for something that does not execute enough iterations. This is how we handle the first loop:

  _34 = _74 + S.2_106;   irrelevant
  _35 = _34 * _121;      irrelevant
  _36 = _35 + _124;
Jan Hubicka changed:

           What             |Removed                        |Added
           Status           |UNCONFIRMED                    |ASSIGNED
           Last reconfirmed |                               |2017-11-19
           Assignee         |unassigned at gcc dot gnu.org  |hubicka at gcc dot gnu.org
           Ever confirmed   |0                              |1

--- Comment #1 from Jan Hubicka ---
I am aware of this regression. Will take a look into it now.
Andrew Pinski changed:

           What             |Removed            |Added
           Keywords         |                   |missed-optimization
           Target           |                   |x86_64-linux-gnu
           Component        |tree-optimization  |target
           Version          |unknown            |8.0
           Target Milestone |---                |8.0