[Bug tree-optimization/79245] [7 Regression] Inefficient loop distribution to memcpy

2017-01-27 Thread jgreenhalgh at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245

--- Comment #9 from James Greenhalgh  ---
> I'm curious how that benchmarks for you (with -ftree-loop-distribution vs.
> without).

Taking trunk as 100%, I see a 2% gain on trunk with
-fno-tree-loop-distribute-patterns, and a 1% gain with your patch applied.

I'm not sure where the other 1% is going. There are a few more examples of new
memcpy calls in MAIN__ and inital_, but they don't look hot, so it may just be
a second-order effect.

Thanks for the fix.

[Bug tree-optimization/79245] [7 Regression] Inefficient loop distribution to memcpy

2017-01-27 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245

--- Comment #8 from Richard Biener  ---
Author: rguenth
Date: Fri Jan 27 13:56:59 2017
New Revision: 244976

URL: https://gcc.gnu.org/viewcvs?rev=244976&root=gcc&view=rev
Log:
2017-01-27  Richard Biener  

PR tree-optimization/79245
* tree-loop-distribution.c (distribute_loop): Apply cost
modeling also to detected patterns.

* gcc.dg/tree-ssa/ldist-23.c: XFAIL.
* gcc.dg/tree-ssa/ldist-25.c: New testcase.

Added:
trunk/gcc/testsuite/gcc.dg/tree-ssa/ldist-25.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.dg/tree-ssa/ldist-23.c
trunk/gcc/tree-loop-distribution.c

[Bug tree-optimization/79245] [7 Regression] Inefficient loop distribution to memcpy

2017-01-27 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245

Richard Biener  changed:

           What           |Removed                      |Added
--------------------------------------------------------------------------
 Status                   |ASSIGNED                     |RESOLVED
 Resolution               |---                          |FIXED

--- Comment #7 from Richard Biener  ---
Fixed.

[Bug tree-optimization/79245] [7 Regression] Inefficient loop distribution to memcpy

2017-01-27 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245

--- Comment #6 from Richard Biener  ---
(In reply to rguent...@suse.de from comment #5)
> On Fri, 27 Jan 2017, jgreenhalgh at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245
> > 
> > --- Comment #4 from James Greenhalgh  ---
> > (In reply to Richard Biener from comment #3)
> > > Note the trivial fix will FAIL gcc.dg/tree-ssa/ldist-23.c which looks like
> > > 
> > >   int i;
> > >   for (i = 0; i < 128; ++i)
> > >     {
> > >       a[i] = a[i] + 1;
> > >       b[i] = d[i];
> > >       c[i] = a[i] / d[i];
> > >     }
> > > 
> > > where the testcase expects b[i] = d[i] to be split out as memcpy but
> > > the other two partitions to be fused.
> > > 
> > > Generally the cost model lacks computing the number of input/output streams
> > > of a partition and a target interface to query it about limits.  Usually
> > > store bandwidth is not equal to load bandwidth and not re-used store streams
> > > can benefit from non-temporal stores being used by libc.
> > > 
> > > In your testcase I wonder whether distributing to
> > > 
> > > for (int j = 0; j < x; j++)
> > >   {
> > >     for (int i = 0; i < y; i++)
> > >       {
> > >         c[j][i] = b[j][i] - a[j][i];
> > >       }
> > >   }
> > > memcpy (a, b, ...);
> > > 
> > > would be faster in the end (or even doing the memcpy first in this case).
> > > 
> > > Well, for now let's be more conservative given the cost model really is
> > > lacking.
> > 
> > The testcase is reduced from CALC3 in 171.swim. I've been seeing a 3%
> > regression for Cortex-A72 after r242038, and I can fix that with
> > -fno-tree-loop-distribute-patterns.
> > 
> > In that benchmark you've got 3 instances of the above pattern, so you end up
> > with 3 memcpy calls after:
> > 
> >   DO 300 J=1,N
> >   DO 300 I=1,M
> >   UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
> >   VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
> >   POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
> >   U(I,J) = UNEW(I,J)
> >   V(I,J) = VNEW(I,J)
> >   P(I,J) = PNEW(I,J)
> >   300 CONTINUE
> > 
> > 3 memcpy calls compared to 3 vector store instructions doesn't seem the right
> > tradeoff to me. Sorry if I reduced the testcase too far to make the balance
> > clear.
> 
> Itanic seems to like it though:
> 
> http://gcc.opensuse.org/SPEC/CFP/sb-terbium-head-64/171_swim_big.png
> 
> For Haswell I don't see any ups/downs; for AMD Fam15 I see a slowdown
> as well around that time.  I guess it depends on whether the CPU is
> already throttled by load/compute bandwidth here.  What should be
> profitable is to distribute the above to three loops (w/o memcpy), so
> after the patch doing -ftree-loop-distribution.  Patch being:
> 
> Index: gcc/tree-loop-distribution.c
> ===================================================================
> --- gcc/tree-loop-distribution.c    (revision 244963)
> +++ gcc/tree-loop-distribution.c    (working copy)
> @@ -1548,8 +1548,7 @@ distribute_loop (struct loop *loop, vec<
>        for (int j = i + 1;
>             partitions.iterate (j, &partition); ++j)
>          {
> -          if (!partition_builtin_p (partition)
> -              && similar_memory_accesses (rdg, into, partition))
> +          if (similar_memory_accesses (rdg, into, partition))
>              {
>                if (dump_file && (dump_flags & TDF_DETAILS))
>                  {

The distribution to three loops doesn't work because (fortran - eh) all
arrays are part of the same (unnamed) common block and thus the memory
accesses appear related -- in GIMPLE we see __BLNK__.uold[] and
__BLNK__.unew[], so they appear to be parts of the same structure.  The
cost model is really too simple (it neither considers dependence distance
nor treats non-dependent accesses as being of infinite distance and thus
unrelated -- of course we compute dependences only after applying the
cost model at the moment).  It works with some extra (partly redundant)
dependence checking:

Index: gcc/tree-loop-distribution.c
===================================================================
--- gcc/tree-loop-distribution.c    (revision 244963)
+++ gcc/tree-loop-distribution.c    (working copy)
@@ -1199,8 +1199,8 @@ ref_base_address (data_reference_p dr)
    accesses in RDG.  */
 
 static bool
-similar_memory_accesses (struct graph *rdg, partition *partition1,
-                         partition *partition2)
+similar_memory_accesses (struct graph *rdg, vec<loop_p> loops,
+                         partition *partition1, partition *partition2)
 {
   unsigned i, j, k, l;
   bitmap_iterator bi, bj;
@@ -1228,7 +1228,18 @@ similar_memory_accesses (struct graph *r
       if (base1)
         FOR_EACH_VEC_ELT (RDG_DATAREFS (rdg, j), l, ref2)
           if (base1 == ref_base_address (ref2))
-            return true;
+            {
+              ddr_p ddr = 
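
To illustrate why the accesses appear related, here is a minimal C sketch
of the common-block lowering (the struct layout, extents, and element type
are assumptions for illustration, not taken from swim):

  /* Hypothetical model of the blank COMMON block as GIMPLE sees it:
     every array is a field of one __BLNK__ object.  */
  #define N 512
  #define M 512
  struct blnk
  {
    float u[N][M], unew[N][M], uold[N][M];
    /* ... the V and P arrays likewise ... */
  };
  extern struct blnk __BLNK__;
  /* __BLNK__.uold[...] and __BLNK__.unew[...] share the same base object,
     so ref_base_address matches and similar_memory_accesses treats their
     partitions as related, which defeats the three-loop distribution.  */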

[Bug tree-optimization/79245] [7 Regression] Inefficient loop distribution to memcpy

2017-01-27 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245

--- Comment #5 from rguenther at suse dot de  ---
On Fri, 27 Jan 2017, jgreenhalgh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245
> 
> --- Comment #4 from James Greenhalgh  ---
> (In reply to Richard Biener from comment #3)
> > Note the trivial fix will FAIL gcc.dg/tree-ssa/ldist-23.c which looks like
> > 
> >   int i;
> >   for (i = 0; i < 128; ++i)
> >     {
> >       a[i] = a[i] + 1;
> >       b[i] = d[i];
> >       c[i] = a[i] / d[i];
> >     }
> > 
> > where the testcase expects b[i] = d[i] to be split out as memcpy but
> > the other two partitions to be fused.
> > 
> > Generally the cost model lacks computing the number of input/output streams
> > of a partition and a target interface to query it about limits.  Usually
> > store bandwidth is not equal to load bandwidth and not re-used store streams
> > can benefit from non-temporal stores being used by libc.
> > 
> > In your testcase I wonder whether distributing to
> > 
> > for (int j = 0; j < x; j++)
> >   {
> >     for (int i = 0; i < y; i++)
> >       {
> >         c[j][i] = b[j][i] - a[j][i];
> >       }
> >   }
> > memcpy (a, b, ...);
> > 
> > would be faster in the end (or even doing the memcpy first in this case).
> > 
> > Well, for now let's be more conservative given the cost model really is
> > lacking.
> 
> The testcase is reduced from CALC3 in 171.swim. I've been seeing a 3%
> regression for Cortex-A72 after r242038, and I can fix that with
> -fno-tree-loop-distribute-patterns.
> 
> In that benchmark you've got 3 instances of the above pattern, so you end up
> with 3 memcpy calls after:
> 
>   DO 300 J=1,N
>   DO 300 I=1,M
>   UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
>   VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
>   POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
>   U(I,J) = UNEW(I,J)
>   V(I,J) = VNEW(I,J)
>   P(I,J) = PNEW(I,J)
>   300 CONTINUE
> 
> 3 memcpy calls compared to 3 vector store instructions doesn't seem the right
> tradeoff to me. Sorry if I reduced the testcase too far to make the balance
> clear.

Itanic seems to like it though:

http://gcc.opensuse.org/SPEC/CFP/sb-terbium-head-64/171_swim_big.png

For Haswell I don't see any ups/downs; for AMD Fam15 I see a slowdown
as well around that time.  I guess it depends on whether the CPU is
already throttled by load/compute bandwidth here.  What should be
profitable is to distribute the above to three loops (w/o memcpy), so
after the patch doing -ftree-loop-distribution.  Patch being:

Index: gcc/tree-loop-distribution.c
===================================================================
--- gcc/tree-loop-distribution.c    (revision 244963)
+++ gcc/tree-loop-distribution.c    (working copy)
@@ -1548,8 +1548,7 @@ distribute_loop (struct loop *loop, vec<
       for (int j = i + 1;
            partitions.iterate (j, &partition); ++j)
         {
-          if (!partition_builtin_p (partition)
-              && similar_memory_accesses (rdg, into, partition))
+          if (similar_memory_accesses (rdg, into, partition))
             {
               if (dump_file && (dump_flags & TDF_DETAILS))
                 {
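
One plausible reading of the "three loops (w/o memcpy)" distribution,
sketched in C (the grouping, array shapes, and element type are
assumptions, not from the PR):

  /* Split the six-statement CALC3 body into three loops, one per
     U/V/P group, keeping each copy as a plain store.  */
  for (int j = 0; j < n; j++)
    for (int i = 0; i < m; i++)
      {
        uold[j][i] = u[j][i] + alpha * (unew[j][i] - 2.f * u[j][i] + uold[j][i]);
        u[j][i] = unew[j][i];
      }
  /* ... and loops of the same shape for the V and P groups.  */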

[Bug tree-optimization/79245] [7 Regression] Inefficient loop distribution to memcpy

2017-01-27 Thread jgreenhalgh at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245

--- Comment #4 from James Greenhalgh  ---
(In reply to Richard Biener from comment #3)
> Note the trivial fix will FAIL gcc.dg/tree-ssa/ldist-23.c which looks like
> 
>   int i;
>   for (i = 0; i < 128; ++i)
>     {
>       a[i] = a[i] + 1;
>       b[i] = d[i];
>       c[i] = a[i] / d[i];
>     }
> 
> where the testcase expects b[i] = d[i] to be split out as memcpy but
> the other two partitions to be fused.
> 
> Generally the cost model lacks computing the number of input/output streams
> of a partition and a target interface to query it about limits.  Usually
> store bandwidth is not equal to load bandwidth and not re-used store streams
> can benefit from non-temporal stores being used by libc.
> 
> In your testcase I wonder whether distributing to
> 
> for (int j = 0; j < x; j++)
>   {
>     for (int i = 0; i < y; i++)
>       {
>         c[j][i] = b[j][i] - a[j][i];
>       }
>   }
> memcpy (a, b, ...);
> 
> would be faster in the end (or even doing the memcpy first in this case).
> 
> Well, for now let's be more conservative given the cost model really is
> lacking.

The testcase is reduced from CALC3 in 171.swim. I've been seeing a 3%
regression for Cortex-A72 after r242038, and I can fix that with
-fno-tree-loop-distribute-patterns.

In that benchmark you've got 3 instances of the above pattern, so you end up
with 3 memcpy calls after:

  DO 300 J=1,N
  DO 300 I=1,M
  UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
  VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
  POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
  U(I,J) = UNEW(I,J)
  V(I,J) = VNEW(I,J)
  P(I,J) = PNEW(I,J)
  300 CONTINUE

3 memcpy calls compared to 3 vector store instructions doesn't seem the right
tradeoff to me. Sorry if I reduced the testcase too far to make the balance
clear.
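
For reference, a rough C rendering of what the distributed loop becomes
(a sketch; extents, element type, and memcpy sizes are assumptions):

  /* The compute partition keeps the three OLD-array updates ...  */
  for (int j = 0; j < n; j++)
    for (int i = 0; i < m; i++)
      {
        uold[j][i] = u[j][i] + alpha * (unew[j][i] - 2.f * u[j][i] + uold[j][i]);
        vold[j][i] = v[j][i] + alpha * (vnew[j][i] - 2.f * v[j][i] + vold[j][i]);
        pold[j][i] = p[j][i] + alpha * (pnew[j][i] - 2.f * p[j][i] + pold[j][i]);
      }
  /* ... while the three plain copies become the three memcpy calls that
     get weighed against three vector stores per iteration.  */
  memcpy (u, unew, sizeof (float) * n * m);
  memcpy (v, vnew, sizeof (float) * n * m);
  memcpy (p, pnew, sizeof (float) * n * m);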

[Bug tree-optimization/79245] [7 Regression] Inefficient loop distribution to memcpy

2017-01-27 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245

--- Comment #3 from Richard Biener  ---
Note the trivial fix will FAIL gcc.dg/tree-ssa/ldist-23.c which looks like

  int i;
  for (i = 0; i < 128; ++i)
    {
      a[i] = a[i] + 1;
      b[i] = d[i];
      c[i] = a[i] / d[i];
    }

where the testcase expects b[i] = d[i] to be split out as memcpy but
the other two partitions to be fused.

Generally the cost model lacks computing the number of input/output streams
of a partition and a target interface to query it about limits.  Usually
store bandwidth is not equal to load bandwidth and not re-used store streams
can benefit from non-temporal stores being used by libc.

In your testcase I wonder whether distributing to

for (int j = 0; j < x; j++)
  {
    for (int i = 0; i < y; i++)
      {
        c[j][i] = b[j][i] - a[j][i];
      }
  }
memcpy (a, b, ...);

would be faster in the end (or even doing the memcpy first in this case).

Well, for now let's be more conservative given the cost model really is
lacking.
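
A self-contained C version of that candidate distribution (the original
loop shape, extents, and types are assumptions for illustration):

  #include <string.h>

  #define X 512
  #define Y 512

  float a[X][Y], b[X][Y], c[X][Y];

  /* Before: one compute statement plus one plain copy per iteration.  */
  void before (void)
  {
    for (int j = 0; j < X; j++)
      for (int i = 0; i < Y; i++)
        {
          c[j][i] = b[j][i] - a[j][i];
          a[j][i] = b[j][i];
        }
  }

  /* After: the compute loop is kept and the copy becomes one memcpy.
     Each a[j][i] is read only before it is written in the same
     iteration, so splitting the copy out is safe.  */
  void after (void)
  {
    for (int j = 0; j < X; j++)
      for (int i = 0; i < Y; i++)
        c[j][i] = b[j][i] - a[j][i];
    memcpy (a, b, sizeof a);
  }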

[Bug tree-optimization/79245] [7 Regression] Inefficient loop distribution to memcpy

2017-01-27 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245

Richard Biener  changed:

           What           |Removed                      |Added
--------------------------------------------------------------------------
 Priority                 |P3                           |P1
 Status                   |UNCONFIRMED                  |ASSIGNED
 Last reconfirmed         |                             |2017-01-27
 Assignee                 |unassigned at gcc dot gnu.org|rguenth at gcc dot gnu.org
 Target Milestone         |---                          |7.0
 Ever confirmed           |0                            |1

--- Comment #2 from Richard Biener  ---
Probably needs enhancement here:

  /* Apply our simple cost model - fuse partitions with similar
     memory accesses.  */
  for (i = 0; partitions.iterate (i, &into); ++i)
    {
      bool changed = false;
      if (partition_builtin_p (into))
        continue;
      for (int j = i + 1;
           partitions.iterate (j, &partition); ++j)
        {
          if (!partition_builtin_p (partition)
              && similar_memory_accesses (rdg, into, partition))
            {
              if (dump_file && (dump_flags & TDF_DETAILS))
                {
                  fprintf (dump_file, "fusing partitions\n");
                  dump_bitmap (dump_file, into->stmts);
see how we do not let detected patterns participate in the cost modeling.
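
For illustration, a minimal example in the style of the ldist testcases
(hypothetical, not from the testsuite): the copy below is detected as a
memcpy pattern, so its partition used to bypass the fusion test above and
was always split out, regardless of cost:

  #define N 1024
  extern int a[N], b[N], d[N];

  void
  foo (void)
  {
    for (int i = 0; i < N; ++i)
      {
        a[i] = a[i] + 1; /* ordinary compute partition */
        b[i] = d[i];     /* detected builtin (memcpy) partition; the fix
                            makes it participate in the cost model too.  */
      }
  }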

[Bug tree-optimization/79245] [7 Regression] Inefficient loop distribution to memcpy

2017-01-26 Thread jgreenhalgh at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245

--- Comment #1 from James Greenhalgh  ---
Created attachment 40592
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40592&action=edit
Code generation for -Ofast -fomit-frame-pointer -mcpu=cortex-a72+crypto
-ftree-loop-distribute-patterns