[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target

2017-04-12 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

Jeffrey A. Law  changed:

   What|Removed |Added

   Priority|P3  |P2
 CC||law at redhat dot com

2017-04-12 Thread andrew.n.senkevich at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

Andrew Senkevich  changed:

   What|Removed |Added

 CC||andrew.n.senkevich at gmail dot com

--- Comment #15 from Andrew Senkevich  ---
I will look at it.

2017-03-01 Thread pthaugen at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #14 from Pat Haugen  ---
(In reply to amker from comment #13)
> We should create another PR for additional copy instructions after my patch
> and close this one.  IMHO they are two different issues.

Yes, I agree. Yuri, can you take care of that?

Additional info: only one copy is actually introduced, but it becomes 4 after
unrolling. This is the loop from the first testcase without -funroll-loops.
It looks like we could get rid of the vmovaps by making zmm2 the destination of
the vpermps (assuming I'm understanding the asm correctly).

.L26:
        vpermps (%rcx), %zmm10, %zmm1
        leal    1(%rsi), %esi
        vmovaps %zmm1, %zmm2
        vmaxps  (%r15,%rdx), %zmm3, %zmm1
        vfnmadd132ps    (%r12,%rdx), %zmm7, %zmm2
        cmpl    %esi, %r8d
        leaq    -64(%rcx), %rcx
        vmaxps  %zmm1, %zmm2, %zmm1
        vmovups %zmm1, (%rdi,%rdx)
        leaq    64(%rdx), %rdx
        ja      .L26
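
The copy removal suggested above can be sketched as a toy check over this loop body. This is only an illustration, not how GCC's register allocator or any RTL pass works: `forward_copy` and the tuple encoding are hypothetical, memory operands are reduced to the address registers they read, flags and sub-register aliasing (esi vs. rsi) are ignored, and liveness is only tracked inside the loop body.

```python
# Each instruction: (mnemonic, registers read, register written or None).
LOOP = [
    ("vpermps",      ["rcx", "zmm10"],               "zmm1"),
    ("leal",         ["rsi"],                        "esi"),
    ("vmovaps",      ["zmm1"],                       "zmm2"),
    ("vmaxps",       ["r15", "rdx", "zmm3"],         "zmm1"),
    ("vfnmadd132ps", ["r12", "rdx", "zmm7", "zmm2"], "zmm2"),
    ("cmpl",         ["esi", "r8d"],                 None),
    ("leaq",         ["rcx"],                        "rcx"),
    ("vmaxps",       ["zmm1", "zmm2"],               "zmm1"),
    ("vmovups",      ["zmm1", "rdi", "rdx"],         None),
    ("leaq",         ["rdx"],                        "rdx"),
    ("ja",           [],                             None),
]


def forward_copy(body, copy_op="vmovaps"):
    """Drop one reg-to-reg copy by retargeting the producer of its source."""
    n = len(body)
    for i, (op, reads, dst) in enumerate(body):
        if op != copy_op or len(reads) != 1 or dst is None:
            continue
        src = reads[0]
        # Nearest earlier definition of the copied register.
        defs = [j for j in range(i) if body[j][2] == src]
        if not defs:
            continue
        j = defs[-1]
        # Nothing between producer and copy may touch src or dst.
        if any(src in r or dst in r or d == dst for _, r, d in body[j + 1:i]):
            continue
        # src must be dead after the copy: walking forward (and around the
        # back edge) it has to be overwritten before it is read again.
        dead = False
        for k in range(1, n):
            _, r, d = body[(i + k) % n]
            if src in r:
                break
            if d == src:
                dead = True
                break
        if not dead:
            continue
        new_body = list(body)
        new_body[j] = (body[j][0], body[j][1], dst)  # producer writes dst directly
        del new_body[i]                              # the copy is now redundant
        return new_body
    return body


for insn in forward_copy(LOOP):
    print(insn)
```

Under these assumptions the vpermps ends up writing zmm2 and the vmovaps disappears, because zmm1 is overwritten by the first vmaxps before anything reads it again.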

2017-02-28 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #13 from amker at gcc dot gnu.org ---
We should create another PR for additional copy instructions after my patch and
close this one.  IMHO they are two different issues.

2017-02-28 Thread pthaugen at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #12 from Pat Haugen  ---
Now that pr77536 has been fixed can this issue be closed? The first testcase no
longer has any loads from the stack in the loop, but does have 4 reg copies
that seem like they could get cleaned up (RA/reload issue as Bin mentioned?).
The second testcase is totally cleaned up.

2017-02-28 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116
Bug 78116 depends on bug 77536, which changed state.

Bug 77536 Summary: Vectorizer not maintaining relationship of relative block 
frequencies in absence of real profile data
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77536

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

2017-02-17 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

amker at gcc dot gnu.org changed:

   What|Removed |Added

 CC||amker at gcc dot gnu.org

--- Comment #11 from amker at gcc dot gnu.org ---
According to Pat's report, the proposed patch for PR77536 can resolve the
second test regression, but introduces 4 additional copy instructions. 
https://gcc.gnu.org/ml/gcc-patches/2017-02/msg01081.html

I checked the tree dump information: the only difference is that the frequency
counters are now fixed, but IRA/reload still makes a bad choice even with
"correct" profiling information.

2017-01-04 Thread pthaugen at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #10 from Pat Haugen  ---
(In reply to Jakub Jelinek from comment #9)
> Any progress on this?

Besides waiting for pr77536 to be fixed, I'm not sure what specifically can be
done on this issue to fix the problem. I personally have not done anything wrt
trying to fix the vectorizer frequencies since I'm unfamiliar with the
middle-end.

2017-01-03 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #9 from Jakub Jelinek  ---
Any progress on this?

2016-11-18 Thread pthaugen at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

Pat Haugen  changed:

   What|Removed |Added

 Depends on||77536

--- Comment #8 from Pat Haugen  ---
Marking this as depends on pr77536. Before vectorization the loops have a BB
freq of 8500, after vectorization the vectorized version of the loops have
freq=536. It then appears IRA/reload pick the loop block as a spill location
because the frequency is less than the surrounding preheader/exit blocks. To
test this theory I compiled the 2 testcases under gdb and modified the
frequency of the loop block at the start of ira() to be slightly greater than
the preheader and the loads no longer occurred in the loop.

For the first testcase, the loop is colder than the surrounding code even with
r241172, but my change in r241173 made some changes in the CFG by removing a
peeled copy of the loop. So there's a difference in CFGs which leads to
differences in BB frequencies that I suspect is not passing some threshold in
the latter case and resulting in spill in the loop.

For the second testcase, my patch in r241170 resulted in no changes to the CFG,
only changes to the BB frequencies. After unrolling, the loop in r241169 ends
up having a frequency greater than the surrounding code due to the incorrect
code that was in loop-unroll.c. r241170 corrected that code and maintained the
initial representation that the loop is colder than the surrounding code, which
again leads to spill in the loop. Even though spill is not the desired result,
I maintain that the changes in r241170 are the correct thing to do.
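
The frequency argument above can be pictured with a toy model. It assumes, purely for illustration, that spill/reload code lands in whichever block has the lowest estimated frequency; the real IRA/reload cost computation is far more involved. `cheapest_block` and the preheader/exit values are hypothetical; only the loop frequency of 536 comes from the numbers quoted above.

```python
def cheapest_block(freqs):
    """Return the block with the lowest estimated frequency (toy cost model)."""
    return min(freqs, key=freqs.get)


# 536 is the post-vectorization loop frequency quoted in this comment.
# The preheader and exit frequencies are hypothetical placeholders.
as_compiled = {"preheader": 900, "loop": 536, "exit": 950}

# The gdb experiment: bump the loop frequency slightly above the preheader.
bumped = dict(as_compiled, loop=as_compiled["preheader"] + 1)

print(cheapest_block(as_compiled))
print(cheapest_block(bumped))
```

In this model the loop block is the preferred spill location with the frequencies as compiled, and the preheader becomes the preferred location once the loop frequency is raised above it, which mirrors the behavior seen in the experiment.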


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77536
[Bug 77536] Vectorizer not maintaining relationship of relative block
frequencies in absence of real profile data

2016-10-31 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #7 from Yuri Rumyantsev  ---
The compiler was configured with:

Configured with: /configure --enable-languages=c,c++
--enable-clocale=gnu --enable-cloog-backend=isl --enable-shared
--disable-libsanitizer --disable-bootstrap --disable-nls --with-system-zlib
--with-demangler-in-ld --with-arch=corei7 --with-cpu=corei7 --with-fpmath=sse
--prefix=/install
Thread model: posix
gcc version 7.0.0 20161014 (experimental) (GCC)

I assume that arch and cpu are not essential.

2016-10-27 Thread pthaugen at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #6 from Pat Haugen  ---
Can you post your configure command (or gcc -v output)?

2016-10-27 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #5 from Yuri Rumyantsev  ---
Yes, some virtual registers are allocated on the stack, so we get more loads
from the stack to fetch their values.

2016-10-27 Thread pthaugen at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #4 from Pat Haugen  ---
(In reply to Yuri Rumyantsev from comment #2)
> We also found a performance drop on another important benchmark with the
> same symptoms after r241170: the loop marked .L18 has 12 more fills
> from stack. The test-case will be attached.

I don't understand what "fills from stack" means. Are you referring to register
spills?

2016-10-27 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #3 from Yuri Rumyantsev  ---
Created attachment 39910
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39910&action=edit
another test-case

Must be compiled with "-Ofast -fopenmp -funroll-loops -march=knl" options.

2016-10-27 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #2 from Yuri Rumyantsev  ---
We also found a performance drop on another important benchmark with the same
symptoms after r241170: the loop marked .L18 has 12 more fills from
stack. The test-case will be attached.

2016-10-27 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

Richard Biener  changed:

   What|Removed |Added

 CC||pthaugen at gcc dot gnu.org
   Target Milestone|--- |7.0

2016-10-26 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #1 from Yuri Rumyantsev  ---
Created attachment 39892
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39892&action=edit
test-case to reproduce

Must be compiled with "-Ofast -funroll-loops -march=knl" options.