[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast

2019-03-18 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

Richard Biener  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #16 from Richard Biener  ---
Fixed.  I'll open two enhancement PRs for this testcase.

[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast

2019-03-18 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #15 from Richard Biener  ---
Author: rguenth
Date: Mon Mar 18 09:17:43 2019
New Revision: 269754

URL: https://gcc.gnu.org/viewcvs?rev=269754&root=gcc&view=rev
Log:
2019-03-18  Richard Biener  

PR target/87561
* config/i386/i386.c (ix86_add_stmt_cost): Pessimize strided
loads and stores a bit more.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/i386.c

[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast

2019-03-18 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #14 from Richard Biener  ---
Author: rguenth
Date: Mon Mar 18 09:16:56 2019
New Revision: 269753

URL: https://gcc.gnu.org/viewcvs?rev=269753&root=gcc&view=rev
Log:
2019-03-18  Richard Biener  

PR target/87561
* config/i386/i386.c (ix86_add_stmt_cost): Apply strided
load pessimization to stores as well.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/i386.c

[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast

2019-03-15 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #13 from Richard Biener  ---
             Ref.  Time  Ratio         Ref.  Time  Ratio
433.milc     9180   336   27.4 *       9180   349   26.3 S
433.milc     9180   335   27.4 S       9180   340   27.0 *
433.milc     9180   344   26.7 S       9180   334   27.5 S
450.soplex   8340   225   37.1 *       8340   223   37.5 S
450.soplex   8340   226   36.9 S       8340   228   36.5 S
450.soplex   8340   223   37.4 S       8340   223   37.3 *
482.sphinx3 19490   386   50.5 *      19490   392   49.8 S
482.sphinx3 19490   384   50.7 S      19490   374   52.1 *
482.sphinx3 19490   394   49.5 S      19490   368   53.0 S

Comparing the fastest runtimes makes this a progression for both 433.milc
and 482.sphinx3, and shows no difference for 450.soplex.

I'll post the patch.

For GCC 10 we'd want to play with applying the cost model to the whole
loop nest instead.

[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast

2019-03-15 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #12 from Richard Biener  ---
So I tested this with a one-off run of SPEC CPU 2006 on a Haswell machine,
which shows the expected improvement on 416.gamess but also possible
regressions for 433.milc (340s -> 343s), 450.soplex (223s -> 226s)
and 482.sphinx3 (383s -> 391s).  Re-checking those with a 3-run now.

[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast

2019-03-14 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #11 from Richard Biener  ---
Btw, it is exactly the current pessimization of vector construction that makes
the AVX256 variant not profitable:

0x40e04e0 *co_99(D)[_53] 1 times vec_construct costs 112 in body

That's because we multiply the "real" cost (three inserts, 28) by
TYPE_VECTOR_SUBPARTS (four) in x86 add_stmt_cost.  For the SSE2 case
that results "only" in a factor of two.  Changing that "arbitrary"
scaling into * (TYPE_VECTOR_SUBPARTS + 1) doesn't help.  We can add
equal handling to catch strided stores, but that doesn't help either
on its own.  Doing both does make us not vectorize, though.
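
As a back-of-the-envelope check, here is a minimal sketch of that scaling
(illustrative only, not GCC code; the 28 and the subparts counts are the
ones from the dump and discussion above):

#include <stdio.h>

int
main (void)
{
  int construct_cost = 28;  /* the "real" cost: three inserts */
  int subparts = 4;         /* TYPE_VECTOR_SUBPARTS of a V4DF (AVX256) */

  /* The x86 add_stmt_cost hook scales the construction cost by the
     number of vector elements, giving the "costs 112 in body" line
     above; for SSE2 (V2DF) the same rule only doubles the cost.  */
  printf ("%d\n", construct_cost * subparts);  /* prints 112 */
  return 0;
}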

Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c  (revision 269683)
+++ gcc/config/i386/i386.c  (working copy)
@@ -50534,14 +50534,15 @@ ix86_add_stmt_cost (void *data, int coun
  latency and execution resources for the many scalar loads
  (AGU and load ports).  Try to account for this by scaling the
  construction cost by the number of elements involved.  */
-  if (kind == vec_construct
+  if ((kind == vec_construct || kind == vec_to_scalar)
   && stmt_info
-  && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
+  && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
+ || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
   && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
   && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info))) != INTEGER_CST)
 {
   stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
-  stmt_cost *= TYPE_VECTOR_SUBPARTS (vectype);
+  stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
 }
   if (stmt_cost == -1)
 stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);

[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast

2019-03-14 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #10 from Richard Biener  ---
(In reply to Michael Matz from comment #9)
> (In reply to Richard Biener from comment #8)
> > 
> > I'm out of ideas suitable for GCC 9 (besides reverting the patch, reverting
> > to bogus state).
> 
> Either that or some hack (e.g. artificially avoiding vectorization if
> runtime checks are necessary and the loop-nest isn't a box but a pyramid). 
> Whatever
> we do it's better to release GCC with internal bogus state than to release
> GCC with a known 10% performance regression (you could revert only on the
> release branch so that the regression stays in trunk).

So for example we cost 18 stmts in the scalar loop body and
32 stmts in the vector loop body.  That's unfortunately still a savings
of 4 compared to a vectorization-factor unrolled scalar body.

The ratio of vector builds from scalars to other stmts is 6 : 26, if you'd
factor in vector decompositions as well it's 8 : 24.
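
To spell out the body-cost arithmetic from above (a hedged sketch of the
comparison; in GCC the real decision is made in the vectorizer cost model,
vect_estimate_min_profitable_iters):

#include <assert.h>

int
main (void)
{
  int scalar_body = 18;  /* costed stmts in the scalar loop body */
  int vector_body = 32;  /* costed stmts in the vector loop body */
  int vf = 2;            /* assumed vectorization factor (SSE2, double) */

  /* A VF-times unrolled scalar body costs 36, so the vector body still
     looks 4 units cheaper, which is why vectorization goes ahead.  */
  assert (scalar_body * vf - vector_body == 4);
  return 0;
}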

Given we've had issues with too eagerly doing strided loads / stores in other
cases, I'd say a heuristic using that would make more sense than one based on
runtime alias checks and/or loop-nest structure.

Btw, I don't think avoiding a 10% regression in an obsolete benchmark
(SPEC 2006) is more important than not feeding garbage into the cost
model... (we've never assessed positive results from that change).

[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast

2019-03-12 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #9 from Michael Matz  ---
(In reply to Richard Biener from comment #8)
> 
> I'm out of ideas suitable for GCC 9 (besides reverting the patch, reverting
> to bogus state).

Either that or some hack (e.g. artificially avoiding vectorization if runtime
checks are necessary and the loop-nest isn't a box but a pyramid).  Whatever
we do it's better to release GCC with internal bogus state than to release
GCC with a known 10% performance regression (you could revert only on the
release branch so that the regression stays in trunk).

[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast

2019-03-11 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #8 from Richard Biener  ---
Re-checking today we reject AVX vectorization via the cost model but do
SSE vectorization.  With versioning for alias we could also SLP-vectorize
this, keeping the loop body smaller and avoiding an epilogue.  Esp. since
we're ending up without any vector load or store anyway.

Of course SLP analysis requires a grouped store, which we do not have since
we do not identify XPQKL(MPQ,MKL) and XPQKL(MRS,MKL) as such (they aren't one
when MPQ == MRS, but the runtime alias check ensures that's not the case).
That is, we miss "strided group" detection or, more generally, SLP forming
via a different mechanism.

That said, I have a hard time thinking of a heuristic aligning with reality
(it's of course possible to come up with a hack).

Generally we'd need to work towards doing the versioning / cost-model checks
on outer loops, but the better versioning condition would be a
prerequisite for this.

I'm out of ideas suitable for GCC 9 (besides reverting the patch, reverting
to bogus state).

Scalar inner loop assembly:

.L8:
vmulsd  (%rax,%rdi,8), %xmm3, %xmm0
incl    %ecx
vfmadd231sd (%rax), %xmm4, %xmm0
vfmadd213sd (%rdx), %xmm6, %xmm0
vmovsd  %xmm0, (%rdx)
vmulsd  (%rax,%r8,8), %xmm1, %xmm0
vfmadd231sd (%rax,%r10,8), %xmm2, %xmm0
addq    %r15, %rax
vfmadd213sd (%rdx,%rsi,8), %xmm5, %xmm0
vmovsd  %xmm0, (%rdx,%rsi,8)
addq    %rbp, %rdx
cmpl    %r9d, %ecx
jne .L8

vectorized inner loop assembly:

.L9:
vmovsd  (%r10,%rcx), %xmm13
vmovsd  (%rdx), %xmm0
incl    %r14d
vmovhpd (%r10,%rsi), %xmm13, %xmm13
vmovhpd (%rdx,%r13), %xmm0, %xmm14
vmovsd  (%rdi,%rcx), %xmm0
vmulpd  %xmm9, %xmm13, %xmm13
vmovhpd (%rdi,%rsi), %xmm0, %xmm0
vfmadd132pd %xmm10, %xmm13, %xmm0
vfmadd132pd %xmm12, %xmm14, %xmm0
vmovlpd %xmm0, (%rdx)
vmovhpd %xmm0, (%rdx,%r13)
vmovsd  (%r8,%rcx), %xmm13
vmovsd  (%rax), %xmm0
addq    %r11, %rdx
vmovhpd (%r8,%rsi), %xmm13, %xmm13
vmovhpd (%rax,%r13), %xmm0, %xmm14
vmovsd  (%r9,%rcx), %xmm0
addq    %rbx, %rcx
vmulpd  %xmm7, %xmm13, %xmm13
vmovhpd (%r9,%rsi), %xmm0, %xmm0
addq    %rbx, %rsi
vfmadd132pd %xmm8, %xmm13, %xmm0
vfmadd132pd %xmm11, %xmm14, %xmm0
vmovlpd %xmm0, (%rax)
vmovhpd %xmm0, (%rax,%r13)
addq    %r11, %rax
cmpl    %r14d, %r15d
jne .L9

Only outer-loop context and knowledge of the low trip count make this bad.

The cost modeling doesn't know that the scalar loop can execute as if
vectorized, given the CPU's plentiful resources (speculating
non-dependence), whereas the vector variant introduces more constraints
on the pipelining due to data dependences from using vectors.  But
even IACA doesn't tell us the differences are big.

[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast

2018-10-11 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #7 from Richard Biener  ---
(In reply to rsand...@gcc.gnu.org from comment #5)
> (In reply to Richard Biener from comment #4)
> > Another thing is the too complicated alias check where for
> > 
> > (gdb) p debug_data_reference (dr_a.dr)
> > #(Data Ref: 
> > #  bb: 14 
> > #  stmt: _28 = *xpqkl_172(D)[_27];
> > #  ref: *xpqkl_172(D)[_27];
> > #  base_object: *xpqkl_172(D);
> > #  Access function 0: {(((integer(kind=8)) mkl_203 + 1) * stride.33_148 +
> > offset.34_149) + _480, +, stride.33_148}_6
> > #)
> > $9 = void
> > (gdb) p debug_data_reference (dr_b.dr)
> > #(Data Ref: 
> > #  bb: 14 
> > #  stmt: *xpqkl_172(D)[_50] = _65;
> > #  ref: *xpqkl_172(D)[_50];
> > #  base_object: *xpqkl_172(D);
> > #  Access function 0: {(((integer(kind=8)) mkl_203 + 1) * stride.33_148 +
> > offset.34_149) + _486, +, stride.33_148}_6
> > #)
> > 
> > we generate
> > 
> > (ssizetype) (((sizetype) ((((integer(kind=8)) mkl_203 + 1) * stride.33_148 +
> > offset.34_149) + (integer(kind=8)) (_19 + jpack_161)) + (sizetype)
> > stride.33_148) * 8) < (ssizetype) ((sizetype) ((((integer(kind=8)) mkl_203 +
> > 1) * stride.33_148 + offset.34_149) + (integer(kind=8)) (_22 + lpack_164)) *
> > 8) || (ssizetype) (((sizetype) ((((integer(kind=8)) mkl_203 + 1) *
> > stride.33_148 + offset.34_149) + (integer(kind=8)) (_22 + lpack_164)) +
> > (sizetype) stride.33_148) * 8) < (ssizetype) ((sizetype) ((((integer(kind=8))
> > mkl_203 + 1) * stride.33_148 + offset.34_149) + (integer(kind=8)) (_19 +
> > jpack_161)) * 8)
> > 
> > instead of simply _480 != _486 (well, OK, not _that_ simple).
> > 
> > I guess we miss many of the "optimizations" we do when dealing with
> > alias checks for constant steps.  In this case sth obvious would be
> > to special-case DR_STEP (dra) == DR_STEP (drb).  Richard?
> Not sure that would help much with the existing optimisations.
> I think the closest we get is create_intersect_range_checks_index,
> but "all" that avoids is scaling the index by the element size
> and adding the common base.  I guess the expensive bit here is
> multiplying by the stride, but the index-based check would still
> do that.
> 
> That said, create_intersect_range_checks_index does feel like it
> might be a bit *too* conservative (but I'm not brave enough to relax it)

One thing I notice above is that we do

 (ssizetype) ((sizetype)X * 8) < (ssizetype) ((sizetype)Y * 8)

that is, we do a signed comparison but do the multiplication in a type
that allows wrapping.  I suppose this is an artifact of using
DR_OFFSET and friends.

If dependence analysis - which really looks at the access functions if the
bases are compatible - were able to return non-constant distance vectors,
it would return _231 - _225 as the distance, which we could runtime-check
against the vectorization factor.  I suppose that's a feasible trick to try
when code-generating the dependence check.
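
In concrete terms the generated guard would look roughly like this (a
hedged sketch with made-up names, not the actual hack below):

/* Decide at runtime whether the vector loop is safe, given the distance
   between the two access functions (the _231 - _225 value above) and the
   vectorization factor.  */
int
vector_loop_safe (long off_a, long off_b, long vf)
{
  long dist = off_a - off_b;
  if (dist < 0)
    dist = -dist;        /* the ABS_EXPR built in the hack below */
  return dist >= vf;     /* checked against the vectorization factor */
}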

Note that for 416.gamess it looks like NOC is just 5, and MPQ and MRS are
such that there is no runtime aliasing between iterations most of the time
(sometimes they are indeed equal).  The cost model check skips the
vector loop for MK == 2 and 3 and only executes it for MK == 4 and 5.
An alternative for this kind of loop nest would be to cost-model for
MK % 2 == 0, thus requiring no epilogue loop.

A hack for doing the above is sth like the following, which I think
would also work for more than one subscript by combining the tests
with ||.  I think we need to actually test against the vectorization
factor here, and we can ignore negative distances unless ddr_reversed, etc.
Unfortunately compute_affine_dependence frees the subscripts, so we
cannot compute the "variable" distance vector during dependence analysis
and store it away - thus "hack" ;)

diff --git a/gcc/tree-data-ref.c b/gcc/tree-data-ref.c
index 69c5f7b28ae..8973a4557d7 100644
--- a/gcc/tree-data-ref.c
+++ b/gcc/tree-data-ref.c
@@ -1823,6 +1823,30 @@ create_intersect_range_checks (struct loop *loop, tree *cond_expr,
   if (create_intersect_range_checks_index (loop, cond_expr, dr_a, dr_b))
 return;

+  auto_vec<loop_p> loop_nest;
+  bool res = find_loop_nest (loop, &loop_nest);
+  gcc_assert (res);
+  ddr_p ddr = initialize_data_dependence_relation (dr_a.dr, dr_b.dr,
+  loop_nest);
+  if (DDR_SUBSCRIPTS (ddr).length () == 1)
+{
+  tree fna = SUB_ACCESS_FN (DDR_SUBSCRIPTS (ddr)[0], 0);
+  tree fnb = SUB_ACCESS_FN (DDR_SUBSCRIPTS (ddr)[0], 1);
+  tree diff = chrec_fold_minus (TREE_TYPE (fna), fna, fnb);
+  if (!chrec_contains_undetermined (diff)
+ && !tree_contains_chrecs (diff, NULL))
+   {
+ free_dependence_relation (ddr);
+ if (TYPE_UNSIGNED (TREE_TYPE (diff)))
+   diff = fold_convert (signed_type_for (TREE_TYPE (diff)), diff);
+ *cond_expr = fold_build2 (GE_EXPR, boolean_type_node,
+   fold_build1 (ABS_EXPR,
+TREE_TYPE (diff), diff),
+  

[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast

2018-10-10 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #6 from Richard Biener  ---
Created attachment 44820
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44820&action=edit
reduced testcase

Reduced testcase.

[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast

2018-10-09 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #5 from rsandifo at gcc dot gnu.org ---
(In reply to Richard Biener from comment #4)
> Another thing is the too complicated alias check where for
> 
> (gdb) p debug_data_reference (dr_a.dr)
> #(Data Ref: 
> #  bb: 14 
> #  stmt: _28 = *xpqkl_172(D)[_27];
> #  ref: *xpqkl_172(D)[_27];
> #  base_object: *xpqkl_172(D);
> #  Access function 0: {(((integer(kind=8)) mkl_203 + 1) * stride.33_148 +
> offset.34_149) + _480, +, stride.33_148}_6
> #)
> $9 = void
> (gdb) p debug_data_reference (dr_b.dr)
> #(Data Ref: 
> #  bb: 14 
> #  stmt: *xpqkl_172(D)[_50] = _65;
> #  ref: *xpqkl_172(D)[_50];
> #  base_object: *xpqkl_172(D);
> #  Access function 0: {(((integer(kind=8)) mkl_203 + 1) * stride.33_148 +
> offset.34_149) + _486, +, stride.33_148}_6
> #)
> 
> we generate
> 
> (ssizetype) (((sizetype) ((((integer(kind=8)) mkl_203 + 1) * stride.33_148 +
> offset.34_149) + (integer(kind=8)) (_19 + jpack_161)) + (sizetype)
> stride.33_148) * 8) < (ssizetype) ((sizetype) ((((integer(kind=8)) mkl_203 +
> 1) * stride.33_148 + offset.34_149) + (integer(kind=8)) (_22 + lpack_164)) *
> 8) || (ssizetype) (((sizetype) ((((integer(kind=8)) mkl_203 + 1) *
> stride.33_148 + offset.34_149) + (integer(kind=8)) (_22 + lpack_164)) +
> (sizetype) stride.33_148) * 8) < (ssizetype) ((sizetype) ((((integer(kind=8))
> mkl_203 + 1) * stride.33_148 + offset.34_149) + (integer(kind=8)) (_19 +
> jpack_161)) * 8)
> 
> instead of simply _480 != _486 (well, OK, not _that_ simple).
> 
> I guess we miss many of the "optimizations" we do when dealing with
> alias checks for constant steps.  In this case sth obvious would be
> to special-case DR_STEP (dra) == DR_STEP (drb).  Richard?
Not sure that would help much with the existing optimisations.
I think the closest we get is create_intersect_range_checks_index,
but "all" that avoids is scaling the index by the element size
and adding the common base.  I guess the expensive bit here is
multiplying by the stride, but the index-based check would still
do that.

That said, create_intersect_range_checks_index does feel like it
might be a bit *too* conservative (but I'm not brave enough to relax it)

[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast

2018-10-09 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

Richard Biener  changed:

   What|Removed |Added

   Keywords||missed-optimization
 CC||rsandifo at gcc dot gnu.org

--- Comment #4 from Richard Biener  ---
Another thing is the too complicated alias check where for

(gdb) p debug_data_reference (dr_a.dr)
#(Data Ref: 
#  bb: 14 
#  stmt: _28 = *xpqkl_172(D)[_27];
#  ref: *xpqkl_172(D)[_27];
#  base_object: *xpqkl_172(D);
#  Access function 0: {(((integer(kind=8)) mkl_203 + 1) * stride.33_148 +
offset.34_149) + _480, +, stride.33_148}_6
#)
$9 = void
(gdb) p debug_data_reference (dr_b.dr)
#(Data Ref: 
#  bb: 14 
#  stmt: *xpqkl_172(D)[_50] = _65;
#  ref: *xpqkl_172(D)[_50];
#  base_object: *xpqkl_172(D);
#  Access function 0: {(((integer(kind=8)) mkl_203 + 1) * stride.33_148 +
offset.34_149) + _486, +, stride.33_148}_6
#)

we generate

(ssizetype) (((sizetype) ((((integer(kind=8)) mkl_203 + 1) * stride.33_148 +
offset.34_149) + (integer(kind=8)) (_19 + jpack_161)) + (sizetype)
stride.33_148) * 8) < (ssizetype) ((sizetype) ((((integer(kind=8)) mkl_203 +
1) * stride.33_148 + offset.34_149) + (integer(kind=8)) (_22 + lpack_164)) *
8) || (ssizetype) (((sizetype) ((((integer(kind=8)) mkl_203 + 1) *
stride.33_148 + offset.34_149) + (integer(kind=8)) (_22 + lpack_164)) +
(sizetype) stride.33_148) * 8) < (ssizetype) ((sizetype) ((((integer(kind=8))
mkl_203 + 1) * stride.33_148 + offset.34_149) + (integer(kind=8)) (_19 +
jpack_161)) * 8)

instead of simply _480 != _486 (well, OK, not _that_ simple).

I guess we miss many of the "optimizations" we do when dealing with
alias checks for constant steps.  In this case sth obvious would be
to special-case DR_STEP (dra) == DR_STEP (drb).  Richard?
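
For illustration, such a special case might look roughly like this (a
hedged sketch only, not a proposed patch, and as said above the real
condition is not quite this simple):

/* If both data-refs share base and step, the expensive segment-overlap
   test could degenerate into comparing the starting offsets, i.e. the
   _480 != _486 check mentioned above.  */
if (operand_equal_p (DR_BASE_ADDRESS (dr_a.dr), DR_BASE_ADDRESS (dr_b.dr), 0)
    && operand_equal_p (DR_STEP (dr_a.dr), DR_STEP (dr_b.dr), 0))
  *cond_expr = fold_build2 (NE_EXPR, boolean_type_node,
                            DR_OFFSET (dr_a.dr), DR_OFFSET (dr_b.dr));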

[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast

2018-10-09 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

Richard Biener  changed:

   What|Removed |Added

 CC||matz at gcc dot gnu.org

--- Comment #3 from Richard Biener  ---
OK, so re-running perf gives me a more reasonable result (-march=native on
Haswell):

Overhead   Samples  Command          Shared Object                   Symbol
  15.59%    754868  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] forms_
  15.55%    749452  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] forms_
  10.77%    496796  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] twotff_
   7.58%    377894  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] dirfck_
   7.57%    375587  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] dirfck_
   7.01%    328685  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] twotff_
   4.98%    243101  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] xyzint_
   4.03%    197815  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] xyzint_

with the already-noticed loop where there are apparently not enough
iterations to warrant the vectorization and the cost model check gets in
the way.

xyzint_ looks similar.

Note that

DO 30 MK=1,NOC
DO 30 ML=1,MK
   MKL = MKL+1
   XPQKL(MPQ,MKL) = XPQKL(MPQ,MKL) +
 *   VAL1*(CO(MS,MK)*CO(MR,ML)+CO(MS,ML)*CO(MR,MK))
   XPQKL(MRS,MKL) = XPQKL(MRS,MKL) +
 *   VAL3*(CO(MQ,MK)*CO(MP,ML)+CO(MQ,ML)*CO(MP,MK))
   30   CONTINUE

shows that the inner loop will first iterate once, then twice, then ...; that
makes hoisting the cost model check impossible and also makes the
alias check not invariant in the outer loop.  That would mean if we'd
code-generate the iteration cost-model check then loop splitting might get
the idea of splitting the outer loop ... (but loop splitting runs before
vectorization, of course).

So in this very case if we analyze the scalar evolution of the niter
of the loop we want to vectorize we get back {0, +, 1}_5 -- that's
certainly something we could factor in when computing the vectorization
cost.  It would increase the prologue/epilogue cost but it wouldn't
make vectorization never profitable (we know nothing about the upper bound
of the number of iterations).
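
As an aside, the triangular shape makes the average inner trip count easy
to reason about (an illustrative sketch, not from the PR; NOC == 5 per
comment #7):

#include <stdio.h>

int
main (void)
{
  int noc = 5;  /* NOC in the 416.gamess loop above */
  int total = 0;
  for (int mk = 1; mk <= noc; mk++)
    for (int ml = 1; ml <= mk; ml++)  /* iterates 1, 2, ..., NOC times */
      total++;
  /* NOC * (NOC + 1) / 2 inner iterations overall, i.e. an average inner
     trip count of (NOC + 1) / 2 = 3 - small enough that a vector loop
     plus epilogue rarely pays off.  */
  printf ("%d\n", total);  /* prints 15 */
  return 0;
}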

[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast

2018-10-09 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #2 from Richard Biener  ---
OK, so on Haswell I see (- is bad, + is good):

-0x2342ca0 _40 + _45 1 times scalar_stmt costs 12 in body
+0x2342ca0 _40 + _45 1 times scalar_stmt costs 4 in body

so a simple add changes cost from 4 to 12 with the patch.  Ah, so that
goes

  switch (subcode)
{
case PLUS_EXPR:
case POINTER_PLUS_EXPR:
case MINUS_EXPR:
  if (kind == scalar_stmt)
{
  if (SSE_FLOAT_MODE_P (mode) && TARGET_SSE_MATH)
stmt_cost = ix86_cost->addss;
  else if (X87_FLOAT_MODE_P (mode))
stmt_cost = ix86_cost->fadd;
  else
stmt_cost = ix86_cost->add;
}

where with kind == scalar_stmt we now run into the SSE_FLOAT_MODE_P case
(previously mode was sth like V2DFmode) and thus use ix86_cost->addss
instead of ix86_cost->add.  That's more correct.
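
To make the mode effect concrete, a hedged sketch (the predicate is
simplified from the i386 headers; the costs 4 and 12 are the ones from the
dump above, everything else is illustrative):

#include <stdio.h>

enum machine_mode_sketch { DFmode, V2DFmode };

/* Simplified stand-in for SSE_FLOAT_MODE_P: true for scalar SF/DFmode
   under SSE math, false for vector modes such as V2DFmode.  */
static int
sse_float_mode_p (enum machine_mode_sketch m)
{
  return m == DFmode;
}

int
main (void)
{
  int cost_add = 4;     /* ix86_cost->add, per the dump */
  int cost_addss = 12;  /* ix86_cost->addss, per the dump */
  /* Before: the hook saw V2DFmode for a scalar stmt and fell through to
     the integer add cost.  After: it sees DFmode and costs an addss.  */
  printf ("old: %d\n", sse_float_mode_p (V2DFmode) ? cost_addss : cost_add);
  printf ("new: %d\n", sse_float_mode_p (DFmode) ? cost_addss : cost_add);
  return 0;
}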

That causes us to (for example) now vectorize mccas.fppized.f:3160 where
we previously figured vectorization is never profitable.  The loop looks
like

DO 10 MK=1,NOC
DO 10 ML=1,MK
   MKL = MKL+1
   XPQKL(MPQ,MKL) = XPQKL(MPQ,MKL) +
 *   VAL1*(CO(MS,MK)*CO(MR,ML)+CO(MS,ML)*CO(MR,MK))
   XPQKL(MRS,MKL) = XPQKL(MRS,MKL) +
 *   VAL3*(CO(MQ,MK)*CO(MP,ML)+CO(MQ,ML)*CO(MP,MK))
   10   CONTINUE

and requires versioning for aliasing as well as strided loads and strided
stores.  We're too trigger-happy in doing that, it seems.  Also the
vector version isn't entered at all at runtime.

But that's not the 10%.  And the big offenders from looking at perf output
do not have any vectorization decision changes...  Very strange.

[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast

2018-10-09 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

Richard Biener  changed:

   What|Removed |Added

 Target||x86_64-*-*, i?86-*-*
 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2018-10-09
   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org
   Target Milestone|--- |9.0
 Ever confirmed|0   |1

--- Comment #1 from Richard Biener  ---
Confirmed.  I'll have a look.