[Bug tree-optimization/63864] Missed late memory CSE

2021-12-12 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63864

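Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement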

[Bug tree-optimization/63864] Missed late memory CSE

2021-12-12 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63864

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Note that at -O3 on trunk, the SLP vectorizer triggers on test_slow but not
on test_ok. Anyway, I think the original problem was fully fixed in GCC 6.

[Bug tree-optimization/63864] Missed late memory CSE

2019-03-04 Thread steven at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63864

--- Comment #4 from Steven Bosscher <steven at gcc dot gnu.org> ---
Code looks pretty much the same for "test_ok" and "test_slow" since GCC 6 for
x86-64, and since GCC 7 for i686.

GCC 6.3 x86-64:
test_ok(float (*) [3], float, float, float, float, float):
mulss   %xmm3, %xmm0
movss   4(%rdi), %xmm6
mulss   %xmm3, %xmm1
mulss   %xmm3, %xmm2
movss   12(%rdi), %xmm3
movaps  %xmm0, %xmm5
addss   %xmm4, %xmm1
movss   (%rdi), %xmm0
addss   %xmm4, %xmm5
addss   %xmm4, %xmm2
mulss   %xmm1, %xmm3
mulss   %xmm5, %xmm0
mulss   %xmm5, %xmm6
mulss   8(%rdi), %xmm5
addss   %xmm3, %xmm0
movss   24(%rdi), %xmm3
mulss   %xmm2, %xmm3
addss   %xmm3, %xmm0
movss   16(%rdi), %xmm3
mulss   %xmm1, %xmm3
mulss   20(%rdi), %xmm1
addss   %xmm3, %xmm6
movss   28(%rdi), %xmm3
mulss   %xmm2, %xmm3
mulss   32(%rdi), %xmm2
addss   %xmm1, %xmm5
addss   %xmm3, %xmm6
addss   %xmm2, %xmm5
addss   %xmm6, %xmm0
addss   %xmm5, %xmm0
ret
test_slow(mat3&, float, float, float, float, float):
mulss   %xmm3, %xmm0
mulss   %xmm3, %xmm1
mulss   %xmm2, %xmm3
movss   16(%rdi), %xmm2
movaps  %xmm0, %xmm6
addss   %xmm4, %xmm1
movss   4(%rdi), %xmm0
addss   %xmm4, %xmm6
addss   %xmm3, %xmm4
movss   (%rdi), %xmm3
mulss   %xmm1, %xmm2
mulss   %xmm6, %xmm0
mulss   %xmm6, %xmm3
mulss   8(%rdi), %xmm6
addss   %xmm2, %xmm0
movss   28(%rdi), %xmm2
mulss   %xmm4, %xmm2
addss   %xmm2, %xmm0
movss   12(%rdi), %xmm2
mulss   %xmm1, %xmm2
mulss   20(%rdi), %xmm1
addss   %xmm2, %xmm3
movss   24(%rdi), %xmm2
mulss   %xmm4, %xmm2
mulss   32(%rdi), %xmm4
addss   %xmm6, %xmm1
addss   %xmm2, %xmm3
addss   %xmm4, %xmm1
addss   %xmm3, %xmm0
addss   %xmm1, %xmm0
ret


GCC 7.4 i686:
test_ok(float (*) [3], float, float, float, float, float):
flds    20(%esp)
flds    8(%esp)
fmul    %st(1), %st
movl    4(%esp), %eax
fadds   24(%esp)
flds    12(%esp)
fmul    %st(2), %st
fadds   24(%esp)
fxch    %st(2)
fmuls   16(%esp)
fadds   24(%esp)
flds    (%eax)
fmul    %st(2), %st
flds    12(%eax)
fmul    %st(4), %st
faddp   %st, %st(1)
flds    24(%eax)
fmul    %st(2), %st
faddp   %st, %st(1)
flds    4(%eax)
fmul    %st(3), %st
flds    16(%eax)
fmul    %st(5), %st
faddp   %st, %st(1)
flds    28(%eax)
fmul    %st(3), %st
faddp   %st, %st(1)
faddp   %st, %st(1)
fxch    %st(2)
fmuls   8(%eax)
fxch    %st(3)
fmuls   20(%eax)
faddp   %st, %st(3)
fmuls   32(%eax)
faddp   %st, %st(2)
faddp   %st, %st(1)
ret
test_slow(mat3&, float, float, float, float, float):
flds    20(%esp)
flds    8(%esp)
fmul    %st(1), %st
movl    4(%esp), %eax
fadds   24(%esp)
flds    12(%esp)
fmul    %st(2), %st
fadds   24(%esp)
fxch    %st(2)
fmuls   16(%esp)
fadds   24(%esp)
flds    4(%eax)
fmul    %st(2), %st
flds    16(%eax)
fmul    %st(4), %st
faddp   %st, %st(1)
flds    28(%eax)
fmul    %st(2), %st
faddp   %st, %st(1)
flds    (%eax)
fmul    %st(3), %st
flds    12(%eax)
fmul    %st(5), %st
faddp   %st, %st(1)
flds    24(%eax)
fmul    %st(3), %st
faddp   %st, %st(1)
faddp   %st, %st(1)
fxch    %st(2)
fmuls   8(%eax)
fxch    %st(3)
fmuls   20(%eax)
faddp   %st, %st(3)
fmuls   32(%eax)
faddp   %st, %st(2)
faddp   %st, %st(1)
ret

[Bug tree-optimization/63864] Missed late memory CSE

2014-11-20 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63864

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
DOM is now improved, but as I said, this testcase needs handling of aggregate
copies, which DOM doesn't handle (and I don't think we want to complicate it
with that).


[Bug tree-optimization/63864] Missed late memory CSE

2014-11-18 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63864

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 34025
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=34025&action=edit
candidate patch for DOM

OK, so I have a patch to teach DOM to do more memory CSE, but for this
testcase what remains is code like:

  MEM[(float )r].e[0] = _220;
  _228 = y_5(D) * s_8(D);
  MEM[(float )r].e[1] = _228;
  _21 = z_6(D) * s_8(D);
  MEM[(float )r].e[2] = _21;
  D.2621 = r;
  r ={v} {CLOBBER};
  D.2620 = D.2621;
  D.2459 = D.2620;
  _201 = D.2459.e[0];

Thus DOM still isn't able to look through aggregate copies (which wouldn't
fit well with how I implemented the patch).
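
For readers without the attachment, the shape of the problem in the dump above
can be sketched in C++. This is a hypothetical reduction, not the testcase
from the PR: the names vec3, scale and first_of_scaled are made up for
illustration. Scalar stores go into a local aggregate, the aggregate is copied
through temporaries, and a scalar is then loaded from the last copy.

```cpp
// Hypothetical reduction of the pattern in the GIMPLE dump; not the PR testcase.
struct vec3 {
    float e[3];
};

// Scalar stores into a local aggregate, which is then returned by value.
static vec3 scale(float x, float y, float z, float s) {
    vec3 r;
    r.e[0] = x * s;
    r.e[1] = y * s;
    r.e[2] = z * s;
    return r;  // aggregate copy out of the function
}

float first_of_scaled(float x, float y, float z, float s) {
    vec3 a = scale(x, y, z, s);  // aggregate copy of the returned value
    vec3 b = a;                  // second aggregate copy
    return b.e[0];               // ideally this is just x * s, with no memory traffic
}
```

To forward the stored value x * s to the final load, a memory CSE pass has to
follow the value through both whole-struct copies. Compiling a file like this
with -O2 -fdump-tree-optimized is one way to check whether a given GCC version
manages that; what the dump shows will depend on the version.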


[Bug tree-optimization/63864] Missed late memory CSE

2014-11-14 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63864

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2014-11-14
           Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org
            Summary|Missed optimization,        |Missed late memory CSE
                   |related to SRA(??)          |
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed - mine.  Note that SRA cannot decompose arrays, but I see a lot
of missed CSE opportunities here. That is because we unroll the loops
completely only very late, and the only memory CSE pass after that is
DOM, which is rather limited here...

I'll try to improve that.  Related to some other PR.
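
As an illustration of the kind of source this comment describes, here is a
minimal C++ sketch. It is an assumption about the general pattern, not the
code attached to the PR, and the names vec3, add and sum_of_add are
hypothetical. The struct wraps an array, so SRA cannot split it into scalars,
and the element accesses sit inside small loops, so their indices only become
constants once the loops have been completely unrolled.

```cpp
// Hypothetical illustration; not the PR testcase.
struct vec3 {
    float e[3];  // array member: not decomposed into separate scalars by SRA
};

// Element-wise add written as a loop; r.e[i] has a variable index until the
// loop is completely unrolled.
static vec3 add(const vec3& a, const vec3& b) {
    vec3 r;
    for (int i = 0; i < 3; ++i)
        r.e[i] = a.e[i] + b.e[i];
    return r;
}

float sum_of_add(const vec3& a, const vec3& b) {
    vec3 t = add(a, b);  // stores into t.e[0..2] via the inlined loop
    float s = 0.0f;
    for (int i = 0; i < 3; ++i)
        s += t.e[i];     // loads that could reuse the values just stored
    return s;
}
```

Once both loops are unrolled, every load of t.e[i] has a matching earlier
store, so whichever memory CSE pass runs after unrolling is the one that has
to remove the round trip through t.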