[Bug tree-optimization/79245] [7 Regression] Inefficient loop distribution to memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245

--- Comment #9 from James Greenhalgh ---
> I'm curious how that benchmarks for you (with -ftree-loop-distribution vs.
> without).

Taking trunk as 100%, I see a 2% gain on trunk with
-fno-tree-loop-distribute-patterns, and a 1% gain with your patch applied.
I'm not sure where the other 1% is going. There are a few more examples of
new memcpy calls in MAIN_ and inital_, but they don't look hot, so it may
just be a second-order effect.

Thanks for the fix.
[Bug tree-optimization/79245] [7 Regression] Inefficient loop distribution to memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245

--- Comment #8 from Richard Biener ---
Author: rguenth
Date: Fri Jan 27 13:56:59 2017
New Revision: 244976

URL: https://gcc.gnu.org/viewcvs?rev=244976&root=gcc&view=rev
Log:
2017-01-27  Richard Biener  <rguenther@suse.de>

        PR tree-optimization/79245
        * tree-loop-distribution.c (distribute_loop): Apply cost modeling
        also to detected patterns.

        * gcc.dg/tree-ssa/ldist-23.c: XFAIL.
        * gcc.dg/tree-ssa/ldist-25.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ldist-25.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ldist-23.c
    trunk/gcc/tree-loop-distribution.c
[Bug tree-optimization/79245] [7 Regression] Inefficient loop distribution to memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Status |ASSIGNED                    |RESOLVED
        Resolution |---                         |FIXED

--- Comment #7 from Richard Biener ---
Fixed.
[Bug tree-optimization/79245] [7 Regression] Inefficient loop distribution to memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245

--- Comment #6 from Richard Biener ---
(In reply to rguent...@suse.de from comment #5)
> On Fri, 27 Jan 2017, jgreenhalgh at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245
> >
> > --- Comment #4 from James Greenhalgh ---
> > (In reply to Richard Biener from comment #3)
> > > Note the trivial fix will FAIL gcc.dg/tree-ssa/ldist-23.c which looks
> > > like
> > >
> > >   int i;
> > >   for (i = 0; i < 128; ++i)
> > >     {
> > >       a[i] = a[i] + 1;
> > >       b[i] = d[i];
> > >       c[i] = a[i] / d[i];
> > >     }
> > >
> > > where the testcase expects b[i] = d[i] to be split out as memcpy but
> > > the other two partitions to be fused.
> > >
> > > Generally the cost model lacks computing the number of input/output
> > > streams of a partition and a target interface to query it about
> > > limits. Usually store bandwidth is not equal to load bandwidth, and
> > > store streams that are not re-used can benefit from non-temporal
> > > stores being used by libc.
> > >
> > > In your testcase I wonder whether distributing to
> > >
> > >   for (int j = 0; j < x; j++)
> > >     {
> > >       for (int i = 0; i < y; i++)
> > >         {
> > >           c[j][i] = b[j][i] - a[j][i];
> > >         }
> > >     }
> > >   memcpy (a, b, ...);
> > >
> > > would be faster in the end (or even doing the memcpy first in this
> > > case).
> > >
> > > Well, for now let's be more conservative given the cost model really
> > > is lacking.
> >
> > The testcase is reduced from CALC3 in 171.swim. I've been seeing a 3%
> > regression for Cortex-A72 after r242038, and I can fix that with
> > -fno-tree-loop-distribute-patterns.
> >
> > In that benchmark you've got 3 instances of the above pattern, so you
> > end up with 3 memcpy calls after:
> >
> >       DO 300 J=1,N
> >       DO 300 I=1,M
> >       UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
> >       VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
> >       POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
> >       U(I,J) = UNEW(I,J)
> >       V(I,J) = VNEW(I,J)
> >       P(I,J) = PNEW(I,J)
> >   300 CONTINUE
> >
> > 3 memcpy calls compared to 3 vector store instructions doesn't seem the
> > right tradeoff to me. Sorry if I reduced the testcase too far to make
> > the balance clear.
>
> Itanic seems to like it though:
>
> http://gcc.opensuse.org/SPEC/CFP/sb-terbium-head-64/171_swim_big.png
>
> For Haswell I don't see any ups/downs; for AMD Fam15 I see a slowdown as
> well around that time. I guess it depends whether the CPU is already
> throttled by load/compute bandwidth here. What should be profitable is to
> distribute the above to three loops (w/o memcpy), so after the patch
> doing -ftree-loop-distribution. Patch being
>
> Index: gcc/tree-loop-distribution.c
> ===================================================================
> --- gcc/tree-loop-distribution.c    (revision 244963)
> +++ gcc/tree-loop-distribution.c    (working copy)
> @@ -1548,8 +1548,7 @@ distribute_loop (struct loop *loop, vec<
>        for (int j = i + 1;
>             partitions.iterate (j, &partition); ++j)
>          {
> -          if (!partition_builtin_p (partition)
> -              && similar_memory_accesses (rdg, into, partition))
> +          if (similar_memory_accesses (rdg, into, partition))
>             {
>               if (dump_file && (dump_flags & TDF_DETAILS))
>                 {

The distribution to three loops doesn't work because (fortran - eh) all
arrays are part of the same (unnamed) common block and thus the memory
accesses appear related -- in GIMPLE we see __BLNK__.uold[] and
__BLNK__.unew[] and thus they appear as part of the same structure.
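[Editorial note: to illustrate why the accesses appear related, a minimal C
analogue of the blank common block; the struct name, field names and array
size here are hypothetical stand-ins for what the Fortran frontend emits.]

  /* A Fortran blank common block is lowered to one global aggregate, so
     every array inside it shares a single base object.  */
  struct blnk
  {
    double uold[1024];
    double unew[1024];
  };

  struct blnk __blnk__;

  void
  copy_unew (int n)
  {
    /* Both references are component refs off __blnk__; a comparison of
       base addresses therefore sees one common base, the accesses count
       as "similar", and the two partitions are fused rather than
       distributed into separate loops.  */
    for (int i = 0; i < n; i++)
      __blnk__.uold[i] = __blnk__.unew[i];
  }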
The cost model is really too simple (it doesn't consider dependence
distance, nor treat non-dependent accesses as being of infinite distance
and thus unrelated - of course we compute dependences only after applying
the cost model at the moment). It works with some extra (partly redundant)
dependence checking:

Index: gcc/tree-loop-distribution.c
===================================================================
--- gcc/tree-loop-distribution.c    (revision 244963)
+++ gcc/tree-loop-distribution.c    (working copy)
@@ -1199,8 +1199,8 @@ ref_base_address (data_reference_p dr)
    accesses in RDG.  */

 static bool
-similar_memory_accesses (struct graph *rdg, partition *partition1,
-                         partition *partition2)
+similar_memory_accesses (struct graph *rdg, vec<loop_p> loops,
+                         partition *partition1, partition *partition2)
 {
   unsigned i, j, k, l;
   bitmap_iterator bi, bj;
@@ -1228,7 +1228,18 @@ similar_memory_accesses (struct graph *r
        if (base1)
          FOR_EACH_VEC_ELT (RDG_DATAREFS (rdg, j), l, ref2)
            if (base1 == ref_base_address (ref2))
-             return true;
+             {
+               ddr_p ddr =
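[Editorial note: the patch above is cut off. Purely as an illustration of
the idea, and not the actual continuation, the extra check amounts to
replacing the unconditional "return true" with a dependence test along
these lines, using GCC's data-reference API:]

      ddr_p ddr = initialize_data_dependence_relation (ref1, ref2, loops);
      compute_affine_dependences (ddr, loops[0]);
      /* chrec_known as the dependence result means "provably no
         dependence", i.e. the two refs are unrelated after all.  */
      bool independent = DDR_ARE_DEPENDENT (ddr) == chrec_known;
      free_dependence_relation (ddr);
      if (!independent)
        return true;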
[Bug tree-optimization/79245] [7 Regression] Inefficient loop distribution to memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245

--- Comment #5 from rguenther at suse dot de ---
On Fri, 27 Jan 2017, jgreenhalgh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245
>
> --- Comment #4 from James Greenhalgh ---
> (In reply to Richard Biener from comment #3)
> > Note the trivial fix will FAIL gcc.dg/tree-ssa/ldist-23.c which looks
> > like
> >
> >   int i;
> >   for (i = 0; i < 128; ++i)
> >     {
> >       a[i] = a[i] + 1;
> >       b[i] = d[i];
> >       c[i] = a[i] / d[i];
> >     }
> >
> > where the testcase expects b[i] = d[i] to be split out as memcpy but
> > the other two partitions to be fused.
> >
> > Generally the cost model lacks computing the number of input/output
> > streams of a partition and a target interface to query it about limits.
> > Usually store bandwidth is not equal to load bandwidth, and store
> > streams that are not re-used can benefit from non-temporal stores being
> > used by libc.
> >
> > In your testcase I wonder whether distributing to
> >
> >   for (int j = 0; j < x; j++)
> >     {
> >       for (int i = 0; i < y; i++)
> >         {
> >           c[j][i] = b[j][i] - a[j][i];
> >         }
> >     }
> >   memcpy (a, b, ...);
> >
> > would be faster in the end (or even doing the memcpy first in this
> > case).
> >
> > Well, for now let's be more conservative given the cost model really is
> > lacking.
>
> The testcase is reduced from CALC3 in 171.swim. I've been seeing a 3%
> regression for Cortex-A72 after r242038, and I can fix that with
> -fno-tree-loop-distribute-patterns.
>
> In that benchmark you've got 3 instances of the above pattern, so you end
> up with 3 memcpy calls after:
>
>       DO 300 J=1,N
>       DO 300 I=1,M
>       UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
>       VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
>       POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
>       U(I,J) = UNEW(I,J)
>       V(I,J) = VNEW(I,J)
>       P(I,J) = PNEW(I,J)
>   300 CONTINUE
>
> 3 memcpy calls compared to 3 vector store instructions doesn't seem the
> right tradeoff to me. Sorry if I reduced the testcase too far to make the
> balance clear.

Itanic seems to like it though:

http://gcc.opensuse.org/SPEC/CFP/sb-terbium-head-64/171_swim_big.png

For Haswell I don't see any ups/downs; for AMD Fam15 I see a slowdown as
well around that time. I guess it depends whether the CPU is already
throttled by load/compute bandwidth here. What should be profitable is to
distribute the above to three loops (w/o memcpy), so after the patch doing
-ftree-loop-distribution. Patch being

Index: gcc/tree-loop-distribution.c
===================================================================
--- gcc/tree-loop-distribution.c    (revision 244963)
+++ gcc/tree-loop-distribution.c    (working copy)
@@ -1548,8 +1548,7 @@ distribute_loop (struct loop *loop, vec<
       for (int j = i + 1;
            partitions.iterate (j, &partition); ++j)
         {
-          if (!partition_builtin_p (partition)
-              && similar_memory_accesses (rdg, into, partition))
+          if (similar_memory_accesses (rdg, into, partition))
            {
              if (dump_file && (dump_flags & TDF_DETAILS))
                {
[Bug tree-optimization/79245] [7 Regression] Inefficient loop distribution to memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245

--- Comment #4 from James Greenhalgh ---
(In reply to Richard Biener from comment #3)
> Note the trivial fix will FAIL gcc.dg/tree-ssa/ldist-23.c which looks like
>
>   int i;
>   for (i = 0; i < 128; ++i)
>     {
>       a[i] = a[i] + 1;
>       b[i] = d[i];
>       c[i] = a[i] / d[i];
>     }
>
> where the testcase expects b[i] = d[i] to be split out as memcpy but
> the other two partitions to be fused.
>
> Generally the cost model lacks computing the number of input/output
> streams of a partition and a target interface to query it about limits.
> Usually store bandwidth is not equal to load bandwidth, and store streams
> that are not re-used can benefit from non-temporal stores being used by
> libc.
>
> In your testcase I wonder whether distributing to
>
>   for (int j = 0; j < x; j++)
>     {
>       for (int i = 0; i < y; i++)
>         {
>           c[j][i] = b[j][i] - a[j][i];
>         }
>     }
>   memcpy (a, b, ...);
>
> would be faster in the end (or even doing the memcpy first in this case).
>
> Well, for now let's be more conservative given the cost model really is
> lacking.

The testcase is reduced from CALC3 in 171.swim. I've been seeing a 3%
regression for Cortex-A72 after r242038, and I can fix that with
-fno-tree-loop-distribute-patterns.

In that benchmark you've got 3 instances of the above pattern, so you end
up with 3 memcpy calls after:

      DO 300 J=1,N
      DO 300 I=1,M
      UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
      VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
      POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
      U(I,J) = UNEW(I,J)
      V(I,J) = VNEW(I,J)
      P(I,J) = PNEW(I,J)
  300 CONTINUE

3 memcpy calls compared to 3 vector store instructions doesn't seem the
right tradeoff to me. Sorry if I reduced the testcase too far to make the
balance clear.
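[Editorial note: to make the tradeoff concrete, a one-dimensional C sketch
of the two shapes. Names and bounds are illustrative, and only one of the
three U/V/P triples is shown; with all three, distribution produces the
three memcpy calls described above.]

  #include <string.h>

  #define M 512
  static const double alpha = 0.001;

  /* Shape of the CALC3 body: the recurrence update and a plain copy
     share one loop, i.e. one pass over the data.  */
  void
  calc3_fused (double *restrict uold, double *restrict u,
               const double *restrict unew)
  {
    for (int i = 0; i < M; i++)
      {
        uold[i] = u[i] + alpha * (unew[i] - 2. * u[i] + uold[i]);
        u[i] = unew[i];
      }
  }

  /* Shape after -ftree-loop-distribute-patterns: the copy partition is
     split out into a library call, trading a few extra vector stores in
     the loop for a second full pass over u/unew.  */
  void
  calc3_distributed (double *restrict uold, double *restrict u,
                     const double *restrict unew)
  {
    for (int i = 0; i < M; i++)
      uold[i] = u[i] + alpha * (unew[i] - 2. * u[i] + uold[i]);
    memcpy (u, unew, M * sizeof (double));
  }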
[Bug tree-optimization/79245] [7 Regression] Inefficient loop distribution to memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245

--- Comment #3 from Richard Biener ---
Note the trivial fix will FAIL gcc.dg/tree-ssa/ldist-23.c which looks like

  int i;
  for (i = 0; i < 128; ++i)
    {
      a[i] = a[i] + 1;
      b[i] = d[i];
      c[i] = a[i] / d[i];
    }

where the testcase expects b[i] = d[i] to be split out as memcpy but
the other two partitions to be fused.

Generally the cost model lacks computing the number of input/output streams
of a partition and a target interface to query it about limits. Usually
store bandwidth is not equal to load bandwidth, and store streams that are
not re-used can benefit from non-temporal stores being used by libc.

In your testcase I wonder whether distributing to

  for (int j = 0; j < x; j++)
    {
      for (int i = 0; i < y; i++)
        {
          c[j][i] = b[j][i] - a[j][i];
        }
    }
  memcpy (a, b, ...);

would be faster in the end (or even doing the memcpy first in this case).

Well, for now let's be more conservative given the cost model really is
lacking.
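[Editorial note: for reference, a C sketch of the distribution the testcase
expects; the global int arrays are assumed here and the actual testcase's
declarations may differ.]

  #include <string.h>

  int a[128], b[128], c[128], d[128];

  /* b[i] = d[i] is independent of the other two statements, so it can be
     split out as memcpy (first or last, as noted above); the two
     partitions that both touch a[] stay fused in a single loop.  */
  void
  expected_distribution (void)
  {
    memcpy (b, d, sizeof (b));
    for (int i = 0; i < 128; ++i)
      {
        a[i] = a[i] + 1;
        c[i] = a[i] / d[i];
      }
  }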
[Bug tree-optimization/79245] [7 Regression] Inefficient loop distribution to memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245

Richard Biener changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
          Priority |P3                            |P1
            Status |UNCONFIRMED                   |ASSIGNED
   Last reconfirmed|                              |2017-01-27
          Assignee |unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
  Target Milestone |---                           |7.0
    Ever confirmed |0                             |1

--- Comment #2 from Richard Biener ---
Probably needs enhancement here:

      /* Apply our simple cost model - fuse partitions with similar memory
         accesses.  */
      for (i = 0; partitions.iterate (i, &into); ++i)
        {
          bool changed = false;
          if (partition_builtin_p (into))
            continue;
          for (int j = i + 1;
               partitions.iterate (j, &partition); ++j)
            {
              if (!partition_builtin_p (partition)
                  && similar_memory_accesses (rdg, into, partition))
                {
                  if (dump_file && (dump_flags & TDF_DETAILS))
                    {
                      fprintf (dump_file, "fusing partitions\n");
                      dump_bitmap (dump_file, into->stmts);

see how we do not let detected patterns participate in the cost modeling.
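[Editorial note: a sketch of the loop's shape once detected patterns do
participate, mirroring the patch Richard posts in comment #5 above; the fix
actually committed as r244976 (comment #8) may differ in detail, and the
fusion body is elided here.]

  for (i = 0; partitions.iterate (i, &into); ++i)
    {
      if (partition_builtin_p (into))
        continue;
      for (int j = i + 1;
           partitions.iterate (j, &partition); ++j)
        {
          /* The !partition_builtin_p (partition) guard is gone: a
             partition already detected as memcpy/memset may now be fused
             back when its accesses are similar to partition i's.  */
          if (similar_memory_accesses (rdg, into, partition))
            {
              /* ... fuse partition j into partition i ...  */
            }
        }
    }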
[Bug tree-optimization/79245] [7 Regression] Inefficient loop distribution to memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245

--- Comment #1 from James Greenhalgh ---
Created attachment 40592
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40592&action=edit
Code generation for -Ofast -fomit-frame-pointer -mcpu=cortex-a72+crypto
-ftree-loop-distribute-patterns