[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

Jeffrey A. Law changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2
                 CC|                            |law at redhat dot com
[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

Andrew Senkevich changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andrew.n.senkevich at gmail dot com

--- Comment #15 from Andrew Senkevich ---
I will look at it.
[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #14 from Pat Haugen ---
(In reply to amker from comment #13)
> We should create another PR for additional copy instructions after my patch
> and close this one. IMHO they are two different issues.

Yes, I agree. Yuri, can you take care of that?

Additional info, it's really just one copy introduced, but becomes 4 after unrolling. This is the loop from the first testcase without -funroll-loops. Looks like we could get rid of the vmovaps by making zmm2 the dest on the vpermps (assuming I'm understanding the asm correctly).

.L26:
        vpermps (%rcx), %zmm10, %zmm1
        leal    1(%rsi), %esi
        vmovaps %zmm1, %zmm2
        vmaxps  (%r15,%rdx), %zmm3, %zmm1
        vfnmadd132ps    (%r12,%rdx), %zmm7, %zmm2
        cmpl    %esi, %r8d
        leaq    -64(%rcx), %rcx
        vmaxps  %zmm1, %zmm2, %zmm1
        vmovups %zmm1, (%rdi,%rdx)
        leaq    64(%rdx), %rdx
        ja      .L26
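
The observation about the redundant vmovaps can be checked with a small model. The following is only a sketch, not GCC output: it treats each zmm register as a single float lane, ignores the permutation itself and NaN handling, and uses hypothetical names (regs_t, body_with_copy, body_without_copy, check_equivalence) that do not appear in the testcase. It follows the AT&T operand order of the loop above and shows that writing the vpermps result straight into zmm2 yields the same stored value.

```c
#include <assert.h>

/* Single-lane stand-in for the two zmm registers that matter in .L26. */
typedef struct { float zmm1, zmm2; } regs_t;

/* vmaxps src2, src1, dest  ->  dest = src1 > src2 ? src1 : src2 */
static float max_ps(float src1, float src2) { return src1 > src2 ? src1 : src2; }

/* Loop body as emitted: vpermps writes zmm1, vmovaps copies it to zmm2. */
float body_with_copy(float perm, float m15, float m12, float zmm3, float zmm7)
{
    regs_t r;
    r.zmm1 = perm;                     /* vpermps (%rcx), %zmm10, %zmm1 */
    r.zmm2 = r.zmm1;                   /* vmovaps %zmm1, %zmm2 */
    r.zmm1 = max_ps(zmm3, m15);        /* vmaxps (%r15,%rdx), %zmm3, %zmm1 */
    r.zmm2 = -(r.zmm2 * m12) + zmm7;   /* vfnmadd132ps (%r12,%rdx), %zmm7, %zmm2 */
    r.zmm1 = max_ps(r.zmm2, r.zmm1);   /* vmaxps %zmm1, %zmm2, %zmm1 */
    return r.zmm1;                     /* vmovups %zmm1, (%rdi,%rdx) */
}

/* Suggested form: vpermps targets zmm2 directly, so no vmovaps is needed. */
float body_without_copy(float perm, float m15, float m12, float zmm3, float zmm7)
{
    regs_t r;
    r.zmm2 = perm;                     /* vpermps (%rcx), %zmm10, %zmm2 */
    r.zmm1 = max_ps(zmm3, m15);        /* vmaxps (%r15,%rdx), %zmm3, %zmm1 */
    r.zmm2 = -(r.zmm2 * m12) + zmm7;   /* vfnmadd132ps (%r12,%rdx), %zmm7, %zmm2 */
    r.zmm1 = max_ps(r.zmm2, r.zmm1);   /* vmaxps %zmm1, %zmm2, %zmm1 */
    return r.zmm1;                     /* vmovups %zmm1, (%rdi,%rdx) */
}

/* Assert that both forms store the same value for one set of inputs. */
void check_equivalence(float perm, float m15, float m12, float zmm3, float zmm7)
{
    assert(body_with_copy(perm, m15, m12, zmm3, zmm7) ==
           body_without_copy(perm, m15, m12, zmm3, zmm7));
}
```

In this model the value vpermps leaves in zmm1 is overwritten by the first vmaxps before anything else reads it, which is why the copy looks removable.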
[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #13 from amker at gcc dot gnu.org ---
We should create another PR for additional copy instructions after my patch and close this one. IMHO they are two different issues.
[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #12 from Pat Haugen ---
Now that pr77536 has been fixed, can this issue be closed? The first testcase no longer has any loads from the stack in the loop, but it does have 4 reg copies that seem like they could get cleaned up (RA/reload issue, as Bin mentioned?). The second testcase is totally cleaned up.
[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116
Bug 78116 depends on bug 77536, which changed state.

Bug 77536 Summary: Vectorizer not maintaining relationship of relative block frequencies in absence of real profile data
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77536

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED
[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

amker at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker at gcc dot gnu.org

--- Comment #11 from amker at gcc dot gnu.org ---
According to Pat's report, the proposed patch for PR77536 can resolve the regression in the second test, but it introduces 4 additional copy instructions.
https://gcc.gnu.org/ml/gcc-patches/2017-02/msg01081.html

I checked the tree dump information. The only difference is that the frequency counters are now fixed, but IRA/reload still makes a bad choice with "correct" profiling information.
[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #10 from Pat Haugen ---
(In reply to Jakub Jelinek from comment #9)
> Any progress on this?

Besides waiting for pr77536 to be fixed, I'm not sure what specifically can be done on this issue to fix the problem. I personally have not done anything with respect to fixing the vectorizer frequencies, since I'm unfamiliar with the middle-end.
[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

Jakub Jelinek changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #9 from Jakub Jelinek ---
Any progress on this?
[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

Pat Haugen changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |77536

--- Comment #8 from Pat Haugen ---
Marking this as depends on pr77536. Before vectorization the loops have a BB freq of 8500; after vectorization the vectorized versions of the loops have freq=536. It then appears that IRA/reload picks the loop block as a spill location because its frequency is less than that of the surrounding preheader/exit blocks. To test this theory I compiled the 2 testcases under gdb and modified the frequency of the loop block at the start of ira() to be slightly greater than the preheader's, and the loads no longer occurred in the loop.

For the first testcase, the loop is colder than the surrounding code even with r241172, but my change in r241173 altered the CFG by removing a peeled copy of the loop. So there's a difference in CFGs which leads to differences in BB frequencies that I suspect is not passing some threshold in the latter case, resulting in a spill in the loop.

For the second testcase, my patch in r241170 resulted in no changes to the CFG, only changes to the BB frequencies. After unrolling, the loop in r241169 ends up having a frequency greater than the surrounding code due to the incorrect code that was in loop-unroll.c. r241170 corrected that code and maintained the initial representation that the loop is colder than the surrounding code, which again leads to a spill in the loop. Even though a spill is not the desired result, I maintain that the changes in r241170 are the correct thing to do.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77536
[Bug 77536] Vectorizer not maintaining relationship of relative block frequencies in absence of real profile data
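
The frequency argument in comment #8 can be illustrated with a toy model. This is a sketch only and is not how IRA or reload are implemented: it assumes the cost of placing memory accesses in a block is simply the block frequency times the number of accesses, and that the cheapest block wins. The loop frequencies 8500 and 536 come from the comment; the preheader and exit frequency of 1000 and all names (block_t, spill_cost, cheapest_block, pick_with_loop_freq) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A basic block with its estimated execution frequency. */
typedef struct {
    const char *name;
    int freq;
} block_t;

/* Toy cost of putting n_mem_ops loads/stores in a block. */
long spill_cost(const block_t *bb, int n_mem_ops)
{
    return (long)bb->freq * n_mem_ops;
}

/* Index of the block where one memory access is cheapest (first one on ties). */
size_t cheapest_block(const block_t *bbs, size_t n)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (spill_cost(&bbs[i], 1) < spill_cost(&bbs[best], 1))
            best = i;
    return best;
}

/* Preheader (index 0), loop (index 1), exit (index 2); 1000 is a made-up value. */
size_t pick_with_loop_freq(int loop_freq)
{
    block_t bbs[3] = {
        { "preheader", 1000 },
        { "loop", loop_freq },
        { "exit", 1000 },
    };
    return cheapest_block(bbs, 3);
}
```

With the loop at 536 this model places the access in the loop block (index 1), while with the loop at 8500 it stays in the preheader (index 0), which mirrors the gdb experiment of raising the loop frequency just above the preheader's.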
[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #7 from Yuri Rumyantsev ---
The compiler was configured as follows:

Configured with: /configure --enable-languages=c,c++ --enable-clocale=gnu --enable-cloog-backend=isl --enable-shared --disable-libsanitizer --disable-bootstrap --disable-nls --with-system-zlib --with-demangler-in-ld --with-arch=corei7 --with-cpu=corei7 --with-fpmath=sse --prefix=/install
Thread model: posix
gcc version 7.0.0 20161014 (experimental) (GCC)

I assume that arch and cpu are not essential.
[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #6 from Pat Haugen ---
Can you post your configure command (or gcc -v output)?
[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #5 from Yuri Rumyantsev ---
Yes, some virtual registers are allocated on the stack, so we get more loads from the stack to fetch their values.
[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #4 from Pat Haugen ---
(In reply to Yuri Rumyantsev from comment #2)
> We also found a performance drop on another important benchmark with the
> same symptoms after r241170; namely, the loop marked with .L18 has 12 more
> fills from stack. The test-case will be attached.

I don't understand what "fills from stack" means. Are you referring to register spills?
[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #3 from Yuri Rumyantsev ---
Created attachment 39910
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39910&action=edit
another test-case

Must be compiled with "-Ofast -fopenmp -funroll-loops -march=knl".
[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #2 from Yuri Rumyantsev ---
We also found a performance drop on another important benchmark with the same symptoms after r241170; namely, the loop marked with .L18 has 12 more fills from stack. The test-case will be attached.
[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pthaugen at gcc dot gnu.org
   Target Milestone|---                         |7.0
[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #1 from Yuri Rumyantsev ---
Created attachment 39892
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39892&action=edit
test-case to reproduce

Must be compiled with "-Ofast -funroll-loops -march=knl" options.