[Bug target/91446] Wrong cost for scalar_load/scalar_store of vector type

2019-09-18 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91446

H.J. Lu  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED
   Target Milestone|--- |10.0

--- Comment #4 from H.J. Lu  ---
(In reply to Richard Biener from comment #2)
> On this ground the bug would be valid but not about costs (you may want
> to open a separate bug for this issue).

Fixed for GCC 10.  I opened PR 91811.

[Bug target/91446] Wrong cost for scalar_load/scalar_store of vector type

2019-09-18 Thread hjl at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91446

--- Comment #3 from hjl at gcc dot gnu.org  ---
Author: hjl
Date: Wed Sep 18 19:49:19 2019
New Revision: 275905

URL: https://gcc.gnu.org/viewcvs?rev=275905=gcc=rev
Log:
i386: Increase Skylake SImode pseudo register store cost

On Skylake, SImode store cost isn't less than half cost of 128-bit vector
store.  This patch increases Skylake SImode pseudo register store cost to
make it the same as QImode and HImode.

gcc/

PR target/91446
* config/i386/x86-tune-costs.h (skylake_cost): Increase SImode
pseudo register store cost from 3 to 6 to make it the same as
QImode and HImode.

gcc/testsuite/

PR target/91446
* gcc.target/i386/pr91446.c: New test.

Added:
trunk/gcc/testsuite/gcc.target/i386/pr91446.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/x86-tune-costs.h
trunk/gcc/testsuite/ChangeLog

[Bug target/91446] Wrong cost for scalar_load/scalar_store of vector type

2019-08-15 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91446

--- Comment #2 from Richard Biener  ---
Index: gcc/config/i386/x86-tune-costs.h
===
--- gcc/config/i386/x86-tune-costs.h(revision 274422)
+++ gcc/config/i386/x86-tune-costs.h(working copy)
@@ -1442,7 +1442,7 @@ struct processor_costs skylake_cost = {
   {4, 4, 4},   /* cost of loading integer registers
   in QImode, HImode and SImode.
   Relative to reg-reg move (2).  */
-  {6, 6, 3},   /* cost of storing integer registers */
+  {6, 6, 6},   /* cost of storing integer registers */
   2,   /* cost of reg,reg fld/fst */
   {6, 6, 8},   /* cost of loading fp registers
   in SFmode, DFmode and XFmode */

produces

foo:
.LFB0:
.cfi_startproc
vmovq   %rdi, %xmm1
subq$40, %rsp
.cfi_def_cfa_offset 48
vpinsrq $1, %rsi, %xmm1, %xmm0
vmovq   %rdx, %xmm2
vmovaps %xmm0, (%rsp)
movq%rsp, %rdi
vpinsrq $1, %rcx, %xmm2, %xmm0
vmovaps %xmm0, 16(%rsp)
callbar
addq$40, %rsp
.cfi_def_cfa_offset 8
ret

it may appear odd that we don't use AVX256, this is because of

t.i:17:3: note:   === vect_analyze_data_ref_accesses ===
t.i:17:3: note:   Detected interleaving store t.width and t.height
t.i:17:3: note:   Detected interleaving store t.x and t.y
t.i:17:3: note:   Detected interleaving store of size 2
t.i:17:3: note: t.width = width_2(D);
t.i:17:3: note: t.height = height_4(D);
t.i:17:3: note:   Detected interleaving store of size 2
t.i:17:3: note: t.x = x_6(D);
t.i:17:3: note: t.y = y_8(D);

and thus we are "confused" about the different sign of the fields which
ultimatively yields to different vector types which would make a difference
if there is sign-dependent arithmetic performed, but not in this particular
case.  On GIMPLE we'd also need nop-conversions to make the IL checker happy.
On this ground the bug would be valid but not about costs (you may want
to open a separate bug for this issue).

[Bug target/91446] Wrong cost for scalar_load/scalar_store of vector type

2019-08-15 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91446

Richard Biener  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org

--- Comment #1 from Richard Biener  ---
Not sure what you are after, we are costing for the scalar cost

0x4ad73f0 width_2(D) 1 times scalar_store costs 6 in body
0x4ad73f0 height_4(D) 1 times scalar_store costs 6 in body
0x4ad73f0 x_6(D) 1 times scalar_store costs 6 in body
0x4ad73f0 y_8(D) 1 times scalar_store costs 6 in body

and for the vectorized cost

0x4bd8330 width_2(D) 1 times vec_construct costs 8 in prologue
0x4bd8330 width_2(D) 1 times vector_store costs 16 in body
0x4bd8330 x_6(D) 1 times vec_construct costs 8 in prologue
0x4bd8330 x_6(D) 1 times vector_store costs 16 in body

t.i:17:3: note:  Cost model analysis:
  Vector inside of basic block cost: 32
  Vector prologue cost: 16
  Vector epilogue cost: 0
  Scalar cost of basic block: 24
t.i:17:3: missed:  not vectorized: vectorization is not profitable.

I assume skylake-avx512 uses skylake_cost.  The only issue I see is fixed
use of [0]/[2] (SImode/SFmode) also because the cost tables do not have
entries for DImode int_load/store.

Skylake costs are odd here:

  {4, 4, 4},/* cost of loading integer registers
   in QImode, HImode and SImode.
   Relative to reg-reg move (2).  */
  {6, 6, 3},/* cost of storing integer registers */

Why is SImode store cost 3?  (looks like "benchmark" random-number generator?)

 {6, 6, 6, 10, 20},/* cost of loading SSE registers
   in 32,64,128,256 and 512-bit */
  {6, 6, 6, 10, 20},/* cost of unaligned loads.  */
  {8, 8, 8, 12, 24},/* cost of storing SSE registers
   in 32,64,128,256 and 512-bit */
  {8, 8, 8, 8, 16}, /* cost of unaligned stores.  */

again, unaligned SSE stores are cheaper than aligned ones for 256 and 512
bits?!

In the end the scalar code is not vectorized because of vector construction
cost and because the vector store cost is higher than the scalar store cost
which oddly is too cheap (3 vs. expected 6).

So - INVALID?