[Bug target/82862] [8 Regression] SPEC CPU2006 465.tonto performance regression with r253975 (up to 40% drop for particular loop)

2018-02-26 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82862

Richard Biener  changed:

   What|Removed |Added

 Status|WAITING |RESOLVED
 Resolution|--- |FIXED

--- Comment #9 from Richard Biener  ---
Fixed.

[Bug target/82862] [8 Regression] SPEC CPU2006 465.tonto performance regression with r253975 (up to 40% drop for particular loop)

2018-02-19 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82862

--- Comment #8 from Alexander Nesterovskiy  ---
I'd say it's not just fixed but improved, with an impressive gain.

It is about +4% on HSW AVX2 and about +8% on SKX AVX512 after r257734 (compared
to r257732) for the 465.tonto SPEC rate.
Compared to the reference r253973 it is about +2% on HSW AVX2 and +18% on SKX
AVX512 (AVX512 has improved greatly over the last 3 months).

[Bug target/82862] [8 Regression] SPEC CPU2006 465.tonto performance regression with r253975 (up to 40% drop for particular loop)

2018-02-16 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82862

Richard Biener  changed:

   What|Removed |Added

 Status|NEW |WAITING

--- Comment #7 from Richard Biener  ---
Possibly fixed now.  Can you verify?

[Bug target/82862] [8 Regression] SPEC CPU2006 465.tonto performance regression with r253975 (up to 40% drop for particular loop)

2018-02-16 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82862

--- Comment #6 from Richard Biener  ---
Author: rguenth
Date: Fri Feb 16 13:47:25 2018
New Revision: 257734

URL: https://gcc.gnu.org/viewcvs?rev=257734&root=gcc&view=rev
Log:
2018-02-16  Richard Biener  

PR tree-optimization/84037
PR tree-optimization/84016
PR target/82862
* config/i386/i386.c (ix86_builtin_vectorization_cost):
Adjust vec_construct for the fact we need additional higher latency
128bit inserts for AVX256 and AVX512 vector builds.
(ix86_add_stmt_cost): Scale vector construction cost for
elementwise loads.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/i386.c
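
To illustrate the idea behind the vec_construct change (a sketch only; the
function and parameter names and the cost units below are made up, not the
actual ix86_builtin_vectorization_cost code): building a vector out of N
scalar elements costs N element inserts, plus extra, higher-latency 128-bit
lane inserts to assemble the final 256-bit or 512-bit vector.

  /* Illustrative sketch of the vec_construct costing idea (hypothetical
     names and cost parameters).  */
  static int
  sketch_vec_construct_cost (int nelements, int vector_bits,
                             int elem_insert_cost, int lane_insert_cost)
  {
    /* N scalar-to-vector element inserts into 128-bit pieces.  */
    int cost = nelements * elem_insert_cost;

    /* Combining the 128-bit pieces needs additional, higher-latency lane
       inserts: one for a 256-bit build, three (two 128->256 plus one
       256->512) for a 512-bit build.  */
    if (vector_bits == 256)
      cost += lane_insert_cost;
    else if (vector_bits == 512)
      cost += 3 * lane_insert_cost;

    return cost;
  }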

[Bug target/82862] [8 Regression] SPEC CPU2006 465.tonto performance regression with r253975 (up to 40% drop for particular loop)

2017-11-22 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82862

Jan Hubicka  changed:

   What|Removed |Added

 Status|ASSIGNED|NEW
 CC||amker.cheng at gmail dot com,
   ||vmakarov at redhat dot com
   Assignee|hubicka at gcc dot gnu.org |unassigned at gcc dot gnu.org

--- Comment #5 from Jan Hubicka  ---
Adding Vladimir and Bin to CC. Perhaps they will have some ideas.
I think stack store/restore is not too confusing for RA (it would be nice to
get rid of it completely and get the frame pointer back).

[Bug target/82862] [8 Regression] SPEC CPU2006 465.tonto performance regression with r253975 (up to 40% drop for particular loop)

2017-11-22 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82862

--- Comment #4 from Richard Biener  ---
I don't have any good ideas here.  Fortran with allocated arrays tends to use
quite a few integer registers for all the IV setup and computation.

One can experiment with less peeling of vector epilogues (--param
max-completely-peel-times=1) as well as maybe adding another code-sinking pass.
In the end what is needed is intelligent rematerialization of expressions
(during RA), as I fully expect there are not enough integer registers to
compute and keep everything live.

There seems to be missed invariant motion on the GIMPLE side, and also a
stack allocation in an inner loop which we might be able to hoist.  Maybe
that (__builtin_stack_save/restore) confuses RA.

Those builtins confuse LIM at least (the memcpy that is present does as well,
and we expand that to a libcall).  -fno-tree-loop-distribute-patterns helps
for that.

But even then we still spill a lot.  Thus, try
-fno-tree-loop-distribute-patterns plus

Index: gcc/tree-ssa-loop-im.c
===================================================================
--- gcc/tree-ssa-loop-im.c  (revision 255051)
+++ gcc/tree-ssa-loop-im.c  (working copy)
@@ -1432,7 +1432,10 @@ gather_mem_refs_stmt (struct loop *loop,
   bool is_stored;
   unsigned id;

-  if (!gimple_vuse (stmt))
+  if (!gimple_vuse (stmt)
+      || gimple_call_builtin_p (stmt, BUILT_IN_STACK_SAVE)
+      || gimple_call_builtin_p (stmt, BUILT_IN_STACK_RESTORE)
+      || gimple_call_builtin_p (stmt, BUILT_IN_ALLOCA_WITH_ALIGN))
     return;
 
   mem = simple_mem_ref_in_stmt (stmt, &is_stored);

[Bug target/82862] [8 Regression] SPEC CPU2006 465.tonto performance regression with r253975 (up to 40% drop for particular loop)

2017-11-22 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82862

--- Comment #3 from Jan Hubicka  ---
Created attachment 42680
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42680&action=edit
Assembly produced showing register pressure issues.

[Bug target/82862] [8 Regression] SPEC CPU2006 465.tonto performance regression with r253975 (up to 40% drop for particular loop)

2017-11-22 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82862

--- Comment #2 from Jan Hubicka  ---
First of all, thanks a lot for the reproducer!

Here are the times with the vectorizer enabled, disabled, and with no cost model:
 Performance counter stats for './a.out-vect 21 100' (10 runs):

       4588.055614  task-clock:u (msec)       #    1.000 CPUs utilized            ( +-  0.49% )
                 0  context-switches:u        #    0.000 K/sec
                 0  cpu-migrations:u          #    0.000 K/sec
                88  page-faults:u             #    0.019 K/sec                    ( +-  0.44% )
    14,911,755,271  cycles:u                  #    3.250 GHz                      ( +-  0.37% )
    52,564,741,152  instructions:u            #    3.53  insn per cycle           ( +-  0.00% )
     4,073,206,037  branches:u                #  887.785 M/sec                    ( +-  0.00% )
        18,106,857  branch-misses:u           #    0.44% of all branches          ( +-  0.30% )

       4.589172192 seconds time elapsed                                           ( +-  0.50% )

jan@skylake:~/trunk/build/tonto> perf stat --repeat 10 ./a.out-novect 21 100

 Performance counter stats for './a.out-novect 21 100' (10 runs):

       3549.651576  task-clock:u (msec)       #    1.000 CPUs utilized            ( +-  0.65% )
                 0  context-switches:u        #    0.000 K/sec
                 0  cpu-migrations:u          #    0.000 K/sec
                88  page-faults:u             #    0.025 K/sec                    ( +-  0.42% )
    11,563,811,687  cycles:u                  #    3.258 GHz                      ( +-  0.61% )
    39,259,740,624  instructions:u            #    3.40  insn per cycle           ( +-  0.00% )
     3,061,205,511  branches:u                #  862.396 M/sec                    ( +-  0.00% )
        11,774,836  branch-misses:u           #    0.38% of all branches          ( +-  0.36% )

       3.550955730 seconds time elapsed                                           ( +-  0.65% )

jan@skylake:~/trunk/build/tonto> perf stat --repeat 10 ./a.out-nocost 21 100

 Performance counter stats for './a.out-nocost 21 100' (10 runs):

       4621.515923  task-clock:u (msec)       #    1.000 CPUs utilized            ( +-  0.31% )
                 0  context-switches:u        #    0.000 K/sec
                 0  cpu-migrations:u          #    0.000 K/sec
                87  page-faults:u             #    0.019 K/sec                    ( +-  0.35% )
    14,965,340,896  cycles:u                  #    3.238 GHz                      ( +-  0.30% )
    52,817,740,929  instructions:u            #    3.53  insn per cycle           ( +-  0.00% )
     4,326,205,814  branches:u                #  936.101 M/sec                    ( +-  0.00% )
        16,615,805  branch-misses:u           #    0.38% of all branches          ( +-  0.10% )

       4.622600700 seconds time elapsed                                           ( +-  0.31% )

So vectorization hurts both execution time and instruction count (roughly 29%
more cycles and 34% more instructions than the non-vectorized build).


There are two loops to vectorize.

  _34 = _74 + S.2_106;
  _35 = _34 * _121;
  _36 = _35 + _124;
  _38 = _36 * _37;
  _39 = (sizetype) _38;
  _40 = _72 + _39;
  _41 = MEM[(real(kind=8)[0:] *)A.14_116][S.2_106];
  *_40 = _41;
  S.2_88 = S.2_106 + 1;
  if (_77 < S.2_88)
goto ; [15.00%]
  else
goto loopback; [85.00%]

  Vector inside of loop cost: 76
  Vector prologue cost: 24
  Vector epilogue cost: 48
  Scalar iteration cost: 24
  Scalar outside cost: 24
  Vector outside cost: 72
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 3
tonto.f90:26:0: note:   Runtime profitability threshold = 4
tonto.f90:26:0: note:   Static estimate profitability threshold = 11

and

  _18 = S.2_105 + 1;
  _19 = _18 * _61;
  _2 = _19 - _61;
  _21 = _2 * _3;
  _22 = (sizetype) _21;
  _23 = _11 + _22;
  _24 = *_23;
  _25 = _70 + S.2_105;
  _26 = _25 * _117;
  _27 = _26 + _120;
  _29 = _27 * _28;
  _30 = (sizetype) _29;
  _31 = _68 + _30;
  _32 = *_31;
  _33 = _24 * _32;
  MEM[(real(kind=8)[0:] *)A.14_116][S.2_105] = _33;
  if (_18 > _77)
goto ; [15.00%]
  else
goto loopback; [85.00%]


  Vector inside of loop cost: 176
  Vector prologue cost: 24
  Vector epilogue cost: 112
  Scalar iteration cost: 48
  Scalar outside cost: 24
  Vector outside cost: 136
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 7
  Static estimate profitability threshold = 18

Both loops iterate about 9 times, so the thresholds are close to never being
reached.  So the slowdown seems to be just collateral damage from adding a
vectorized loop for something that is not executed enough.
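
As a cross-check of the dumped thresholds, here is a small sketch that
reproduces the "Calculated minimum iters for profitability" values from the
costs above.  It roughly follows the formula used by the vectorizer's
vect_estimate_min_profitable_iters, under the assumption of a vectorization
factor of 4, no prologue peeling and 2 epilogue iterations (as in the dumps);
it is an illustration, not the actual GCC code.

  /* Sketch of the min-profitability computation (assumptions as above;
     function and parameter names are made up for illustration).  */
  #include <stdio.h>

  static int
  min_profitable_iters (int vec_inside, int vec_outside, int scalar_iter,
                        int scalar_outside, int vf, int epilogue_iters)
  {
    int iters = ((vec_outside - scalar_outside) * vf
                 - vec_inside * epilogue_iters)
                / (scalar_iter * vf - vec_inside);
    /* Bump by one if the vector variant is still not cheaper at that count.  */
    if (scalar_iter * vf * iters <= vec_inside * iters + vec_outside)
      iters++;
    return iters;
  }

  int
  main (void)
  {
    /* First loop: prints 3.  Second loop: prints 7.  */
    printf ("%d\n", min_profitable_iters (76, 72, 24, 24, 4, 2));
    printf ("%d\n", min_profitable_iters (176, 136, 48, 24, 4, 2));
    return 0;
  }

With only about 9 iterations per loop, both barely clear these thresholds, so
the vectorized version never gets a chance to pay for its setup and epilogue
overhead.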

This is how we handle the first loop:


  _34 = _74 + S.2_106;   irrelevant
  _35 = _34 * _121;      irrelevant
  _36 = _35 + _124;

[Bug target/82862] [8 Regression] SPEC CPU2006 465.tonto performance regression with r253975 (up to 40% drop for particular loop)

2017-11-19 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82862

Jan Hubicka  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2017-11-19
   Assignee|unassigned at gcc dot gnu.org  |hubicka at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #1 from Jan Hubicka  ---
I am aware of this regression and will take a look at it now.

[Bug target/82862] [8 Regression] SPEC CPU2006 465.tonto performance regression with r253975 (up to 40% drop for particular loop)

2017-11-06 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82862

Andrew Pinski  changed:

   What|Removed |Added

   Keywords||missed-optimization
 Target||x86_64-linux-gnu
  Component|tree-optimization   |target
Version|unknown |8.0
   Target Milestone|--- |8.0