[Bug tree-optimization/81038] [8 regression] test case g++.dg/vect/slp-pr56812.cc fails starting with r248678
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81038

Bill Schmidt changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #10 from Bill Schmidt ---
Richi instead committed the more elegant patch from
https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00397.html. Per Richi, fixed on
x86_64. I've observed a testresults cycle for powerpc64-linux-gnu where this
now passes, so looks fixed to me. Thanks!
[Bug tree-optimization/81038] [8 regression] test case g++.dg/vect/slp-pr56812.cc fails starting with r248678
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81038

--- Comment #9 from Bill Schmidt ---
Prospective patch posted at
https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00137.html.
[Bug tree-optimization/81038] [8 regression] test case g++.dg/vect/slp-pr56812.cc fails starting with r248678
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81038

--- Comment #8 from Bill Schmidt ---
The commentary for r248678 reads in part: "Compute costs for doing no peeling
at all, compare to the best peeling costs so far and avoid peeling if cheaper."

Indeed, if you look at the vect dump for r248677, you see that the vectorizer
decides to force alignment using peeling, even though the target processor has
efficient unaligned memory access. Peeling proved to be barely unprofitable:

/home/wschmidt/gcc/gcc-mainline-base/gcc/testsuite/g++.dg/vect/slp-pr56812.cc:16:18: note: Cost model analysis:
  Vector inside of loop cost: 1
  Vector prologue cost: 7
  Vector epilogue cost: 6
  Scalar iteration cost: 1
  Scalar outside cost: 0
  Vector outside cost: 13
  prologue iterations: 2
  epilogue iterations: 2
  Calculated minimum iters for profitability: 17
/home/wschmidt/gcc/gcc-mainline-base/gcc/testsuite/g++.dg/vect/slp-pr56812.cc:16:18: note:   Runtime profitability threshold = 16
/home/wschmidt/gcc/gcc-mainline-base/gcc/testsuite/g++.dg/vect/slp-pr56812.cc:16:18: note:   Static estimate profitability threshold = 16
/home/wschmidt/gcc/gcc-mainline-base/gcc/testsuite/g++.dg/vect/slp-pr56812.cc:16:18: note: not vectorized: vectorization not profitable.

In the vect dump for r248678, the vectorizer isn't overly focused on peeling,
and determines that it can use the efficient unaligned storage accesses. This
leads to the more reasonable cost calculation:

/home/wschmidt/gcc/gcc-mainline-test/gcc/testsuite/g++.dg/vect/slp-pr56812.cc:16:18: note: Cost model analysis:
  Vector inside of loop cost: 1
  Vector prologue cost: 1
  Vector epilogue cost: 0
  Scalar iteration cost: 1
  Scalar outside cost: 0
  Vector outside cost: 1
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 2
/home/wschmidt/gcc/gcc-mainline-test/gcc/testsuite/g++.dg/vect/slp-pr56812.cc:16:18: note:   Runtime profitability threshold = 3
/home/wschmidt/gcc/gcc-mainline-test/gcc/testsuite/g++.dg/vect/slp-pr56812.cc:16:18: note:   Static estimate profitability threshold = 3
/home/wschmidt/gcc/gcc-mainline-test/gcc/testsuite/g++.dg/vect/slp-pr56812.cc:16:18: note: loop vectorized

For this processor, we vectorized the code in "vect" rather than in "slp". For
other processors, the choice could be different because of cost model
differences. But I think in general we should always vectorize. In both cases
the "optimized" dump produces:

void mydata::Set(float) (struct mydata * const this, float x)
{
  vector(4) float vect_cst__10;

  [11.11%]:
  vect_cst__10 = {x_5(D), x_5(D), x_5(D), x_5(D)};
  MEM[(float *)this_4(D)] = vect_cst__10;
  MEM[(float *)this_4(D) + 16B] = vect_cst__10;
  return;

}

So I think perhaps it would be better to change the test to examine the
"optimized" dump for one definition and two uses of a vect_cst__*. The point
of the original complaint in PR56812 was that this test case was not
vectorized (by SLP at the time), but so long as it is vectorized, that should
be good enough for everyone.
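Editorial note: the figures in the two cost dumps of comment #8 can be re-derived with a little arithmetic. The sketch below is not taken from the thread or from the GCC sources. It assumes a vectorization factor of 4 (suggested by the `vector(4) float` type in the dump) and a simplified break-even formula modeled on how the vectorizer's cost model is generally understood to work; the struct and function names are hypothetical. Under those assumptions it reproduces both "minimum iters" values (17 and 2) and both runtime thresholds (16 and 3).

```cpp
#include <algorithm>

// Values as printed in a "Cost model analysis" dump.
// All names here are hypothetical, chosen for this illustration only.
struct VectCosts {
    int vec_inside;      // "Vector inside of loop cost"
    int vec_prologue;    // "Vector prologue cost"
    int vec_epilogue;    // "Vector epilogue cost"
    int scalar_iter;     // "Scalar iteration cost"
    int scalar_outside;  // "Scalar outside cost"
    int peel_prologue;   // "prologue iterations"
    int peel_epilogue;   // "epilogue iterations"
};

// "Vector outside cost" is the sum of the prologue and epilogue costs.
int vec_outside(const VectCosts& c) {
    return c.vec_prologue + c.vec_epilogue;
}

// ASSUMED break-even formula: the smallest iteration count at which
// vf scalar iterations cost more than one vector iteration plus the
// amortized outside cost. Returns -1 if the vector body is never cheaper.
int min_profitable_iters(const VectCosts& c, int vf) {
    const int outside_delta = vec_outside(c) - c.scalar_outside;
    const int denom = c.scalar_iter * vf - c.vec_inside;
    if (denom <= 0) return -1;
    int iters = (outside_delta * vf
                 - c.vec_inside * c.peel_prologue
                 - c.vec_inside * c.peel_epilogue) / denom;
    // Integer division rounds down; bump the result if the scalar loop
    // is still no more expensive at this iteration count.
    if (c.scalar_iter * vf * iters
        <= c.vec_inside * iters + outside_delta * vf) {
        ++iters;
    }
    return iters;
}

// ASSUMED threshold rule: the vector loop needs at least vf iterations
// plus the peeled prologue, and the runtime guard is "niters <= threshold".
int runtime_threshold(const VectCosts& c, int vf) {
    const int min_iters = min_profitable_iters(c, vf);
    if (min_iters < 0) return -1;
    return std::max(min_iters, vf + c.peel_prologue) - 1;
}

// The two dumps from comment #8.
const VectCosts kWithPeeling{1, 7, 6, 1, 0, 2, 2};     // r248677
const VectCosts kWithoutPeeling{1, 1, 0, 1, 0, 0, 0};  // r248678
```

With these inputs, `runtime_threshold(kWithPeeling, 4)` is 16 and `runtime_threshold(kWithoutPeeling, 4)` is 3. If the loop runs 8 times (an inference from the two 16-byte stores in the "optimized" dump, not something the thread states), 8 is below the first threshold and above the second, which matches "not vectorized" versus "loop vectorized".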
[Bug tree-optimization/81038] [8 regression] test case g++.dg/vect/slp-pr56812.cc fails starting with r248678
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81038

Bill Schmidt changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|acsawdey at gcc dot gnu.org |wschmidt at gcc dot gnu.org

--- Comment #7 from Bill Schmidt ---
I'm looking at this one.
[Bug tree-optimization/81038] [8 regression] test case g++.dg/vect/slp-pr56812.cc fails starting with r248678
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81038

ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|powerpc*-*-*, i?86-*-*,     |powerpc*-*-*, i?86-*-*,
                   |x86_64-*-*, aarch64-*-*     |x86_64-*-*, aarch64-*-*,
                   |                            |arm*-*-*
                 CC|                            |ktkachov at gcc dot gnu.org

--- Comment #6 from ktkachov at gcc dot gnu.org ---
I'm also seeing this FAIL on arm
[Bug tree-optimization/81038] [8 regression] test case g++.dg/vect/slp-pr56812.cc fails starting with r248678
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81038

Rainer Orth changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|powerpc*-*-*                |powerpc*-*-*, i?86-*-*,
                   |                            |x86_64-*-*, aarch64-*-*
                 CC|                            |ro at gcc dot gnu.org

--- Comment #5 from Rainer Orth ---
Just for the record, this also affects several x86 targets.
[Bug tree-optimization/81038] [8 regression] test case g++.dg/vect/slp-pr56812.cc fails starting with r248678
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81038

acsawdey at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
                 CC|                            |acsawdey at gcc dot gnu.org
           Assignee|unassigned at gcc dot gnu.org|acsawdey at gcc dot gnu.org

--- Comment #4 from acsawdey at gcc dot gnu.org ---
At present trunk vectorizes this in the vect pass, rather than unrolling it
and vectorizing in slp. Code generated for mydata::Set is:

_ZN6mydata3SetEf:
.LFB4:
        .cfi_startproc
        xscvdpspn 1,1
        li 9,16
        xxspltw 0,1,0
        stxvd2x 0,0,3
        stxvd2x 0,3,9
        blr

It seems like the test case should be looking for this alternative. I can't
see how a loop with a single stxvd2x that runs two iterations would be better.
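Editorial note: the thread never quotes the test source, so the following is a hypothetical stand-in, reconstructed only from what the dumps show (a member function that broadcasts one float into what appear to be eight consecutive floats). It puts the scalar loop next to a hand-written equivalent of the two 16-byte stores seen in the assembly above. The type name `MyData`, the element count, and the helper names are assumptions for illustration, not the contents of slp-pr56812.cc.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the class in the test case.
struct MyData {
    // ASSUMPTION: 8 floats, inferred from two 16-byte vector stores.
    static constexpr std::size_t kCount = 8;
    float data[kCount];

    // Scalar form: the kind of loop the vectorizer is expected to handle.
    void set_loop(float x) {
        for (std::size_t i = 0; i < kCount; ++i) {
            data[i] = x;
        }
    }

    // Hand-written equivalent of the generated code: build one 4-float
    // splat (compare xxspltw) and store it twice, at byte offsets 0 and 16
    // (compare the two stxvd2x instructions).
    void set_two_stores(float x) {
        const float splat[4] = {x, x, x, x};
        std::memcpy(data, splat, sizeof splat);
        std::memcpy(data + 4, splat, sizeof splat);
    }
};

// Returns true when both objects hold identical element values.
bool same_contents(const MyData& a, const MyData& b) {
    for (std::size_t i = 0; i < MyData::kCount; ++i) {
        if (a.data[i] != b.data[i]) return false;
    }
    return true;
}
```

Both member functions leave `data` in the same state. The point made in the comment is that the straight-line form needs no loop control at all, so a two-iteration vector loop has nothing to gain over it.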
[Bug tree-optimization/81038] [8 regression] test case g++.dg/vect/slp-pr56812.cc fails starting with r248678
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81038

Steve Ellcey changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-08-01
                 CC|                            |sje at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #3 from Steve Ellcey ---
Looking at the slp dump file on aarch64, where this also fails, I see these
messages:

slp-pr56812.cc:18:1: note: === vect_analyze_data_refs ===
slp-pr56812.cc:18:1: note: not vectorized: no vectype for stmt: MEM[(float *)this_4(D)] = vect_cst__10;
 scalar_type: vector(4) float
slp-pr56812.cc:18:1: note: not vectorized: no vectype for stmt: MEM[(float *)vectp_this.5_6] = vect_cst__10;
 scalar_type: vector(4) float
slp-pr56812.cc:18:1: note: === vect_analyze_data_ref_accesses ===
slp-pr56812.cc:18:1: note: not vectorized: no grouped stores in basic block.
[Bug tree-optimization/81038] [8 regression] test case g++.dg/vect/slp-pr56812.cc fails starting with r248678
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81038

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |8.0

--- Comment #2 from Richard Biener ---
Eventually the loop is no longer unrolled (was it?) and is now loop
vectorized? (and that bit is "fragile" because of -fvect-cost-model=dynamic?)

Just guessing.
[Bug tree-optimization/81038] [8 regression] test case g++.dg/vect/slp-pr56812.cc fails starting with r248678
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81038

seurer at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |powerpc*-*-*
                 CC|                            |krebbel at gcc dot gnu.org,
                   |                            |wschmidt at gcc dot gnu.org
               Host|                            |powerpc*-*-*
              Build|                            |powerpc*-*-*

--- Comment #1 from seurer at gcc dot gnu.org ---
Note: fails on powerpc64 both BE and LE.