[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)

2021-08-25 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #13 from Hongtao.liu  ---
;; Function fn (fn, funcdef_no=5484, decl_uid=32317, cgraph_uid=5485,
symbol_order=5484)

int fn (const int * px, const int * py, const int * pz, const int * pw, const
int * pa, const int * pb, const int * pc, const int * pd)
{
  vector(16) short unsigned int _3;
  vector(16) short unsigned int _5;
  vector(16) short int _7;
  vector(16) short int _9;
  vector(32) char _12;
  vector(32) unsigned char _14;
  vector(16) short unsigned int _16;
  vector(16) short unsigned int _17;
  vector(16) short int _18;
  vector(16) short int _19;
  vector(32) char _20;
  vector(32) unsigned char _21;
  vector(16) short unsigned int _22;
  vector(16) short unsigned int _23;
  vector(16) short int _24;
  vector(16) short int _25;
  vector(32) char _26;
  vector(32) unsigned char _27;
  vector(16) short unsigned int _28;
  vector(16) short unsigned int _29;
  vector(16) short int _30;
  vector(16) short int _31;
  int _32;
  vector(4) int _33;
  vector(8) int _34;
  vector(32) unsigned char _35;
  vector(32) char _36;
  vector(16) short unsigned int _37;
  vector(16) short unsigned int _38;
  vector(16) short unsigned int _39;
  vector(16) short unsigned int _40;
  vector(16) short unsigned int _41;
  vector(16) short unsigned int _42;
  vector(16) short unsigned int _43;
  vector(16) short unsigned int _44;
  vector(16) short unsigned int _45;
  vector(16) short unsigned int _46;
  vector(16) short unsigned int _47;
  vector(16) short unsigned int _48;
  vector(16) short unsigned int _50;
  vector(16) short unsigned int _51;
  vector(16) short unsigned int _53;
  vector(16) short unsigned int _54;
  vector(16) short unsigned int _56;
  vector(16) short unsigned int _57;
  vector(16) short unsigned int _59;
  vector(16) short unsigned int _60;
  vector(16) short int _62;
  vector(16) short int _63;
  vector(16) short unsigned int _64;
  vector(16) short unsigned int _65;
  vector(32) unsigned char _66;
  vector(32) char _67;
  vector(16) short int _68;
  vector(16) short int _69;
  vector(16) short unsigned int _70;
  vector(16) short unsigned int _71;
  vector(32) unsigned char _72;
  vector(32) char _73;
  vector(16) short int _74;
  vector(16) short int _75;
  vector(16) short unsigned int _76;
  vector(16) short unsigned int _77;
  vector(32) unsigned char _78;
  vector(32) char _79;
  vector(16) short int _80;
  vector(16) short int _81;
  vector(16) short unsigned int _82;
  vector(16) short unsigned int _83;
  vector(32) unsigned char _84;
  vector(32) char _85;
  vector(16) short int _86;
  vector(16) short int _87;
  vector(16) short unsigned int _88;
  vector(16) short unsigned int _89;
  vector(32) unsigned char _90;
  vector(32) char _91;
  vector(4) long long int _92;
  vector(4) long long int _93;
  vector(4) long long int _94;
  vector(4) long long int _95;
  vector(4) long long int _96;
  vector(4) long long int _97;
  vector(4) long long int _98;
  vector(4) long long int _99;
  vector(4) long long int _100;
  vector(4) long long int _101;
  vector(16) short unsigned int _107;
  vector(16) short unsigned int _108;
  vector(16) short unsigned int _109;
  vector(16) short unsigned int _110;
  vector(16) short unsigned int _111;

  <bb 2> [local count: 1073741824]:
  _101 = MEM[(const __m256i_u * {ref-all})px_2(D)];
  _100 = MEM[(const __m256i_u * {ref-all})py_4(D)];
  _99 = MEM[(const __m256i_u * {ref-all})pz_6(D)];
  _98 = MEM[(const __m256i_u * {ref-all})pw_8(D)];
  _97 = MEM[(const __m256i_u * {ref-all})pa_10(D)];
  _96 = MEM[(const __m256i_u * {ref-all})pb_11(D)];
  _95 = MEM[(const __m256i_u * {ref-all})pc_13(D)];
  _94 = MEM[(const __m256i_u * {ref-all})pd_15(D)];
  _93 = MEM[(const __m256i_u * {ref-all})pc_13(D) + 32B];
  _92 = MEM[(const __m256i_u * {ref-all})pd_15(D) + 32B];
  _86 = VIEW_CONVERT_EXPR<vector(16) short int>(_96);
  _87 = VIEW_CONVERT_EXPR<vector(16) short int>(_101);
  _88 = (vector(16) short unsigned int) _87;
  _89 = (vector(16) short unsigned int) _86;
  _90 = VEC_PACK_SAT_EXPR <_88, _89>;
  _91 = (vector(32) char) _90;
  _80 = VIEW_CONVERT_EXPR<vector(16) short int>(_95);
  _81 = VIEW_CONVERT_EXPR<vector(16) short int>(_100);
  _82 = (vector(16) short unsigned int) _81;
  _83 = (vector(16) short unsigned int) _80;
  _84 = VEC_PACK_SAT_EXPR <_82, _83>;
  _85 = (vector(32) char) _84;
  _74 = VIEW_CONVERT_EXPR<vector(16) short int>(_94);
  _75 = VIEW_CONVERT_EXPR<vector(16) short int>(_99);
  _76 = (vector(16) short unsigned int) _75;
  _77 = (vector(16) short unsigned int) _74;
  _78 = VEC_PACK_SAT_EXPR <_76, _77>;
  _79 = (vector(32) char) _78;
  _68 = VIEW_CONVERT_EXPR<vector(16) short int>(_93);
  _69 = VIEW_CONVERT_EXPR<vector(16) short int>(_98);
  _70 = (vector(16) short unsigned int) _69;
  _71 = (vector(16) short unsigned int) _68;
  _72 = VEC_PACK_SAT_EXPR <_70, _71>;
  _73 = (vector(32) char) _72;
  _62 = VIEW_CONVERT_EXPR<vector(16) short int>(_92);
  _63 = VIEW_CONVERT_EXPR<vector(16) short int>(_97);
  _64 = (vector(16) short unsigned int) _63;
  _65 = (vector(16) short unsigned int) _62;
  _66 = VEC_PACK_SAT_EXPR <_64, _65>;
  _67 = (vector(32) char) _66;
  _59 = VIEW_CONVERT_EXPR<vector(16) short unsigned int>(_91);
  _60 = 
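
As a point of reference, here is a minimal sketch (an assumed example, not
the testcase attached to this bug) of intrinsic code that folds to the
VEC_PACK_SAT_EXPR form seen in the dump above:

#include <immintrin.h>

/* Assumed example: _mm256_packus_epi16 packs two vectors of 16-bit
   integers into one vector of 8-bit integers with unsigned saturation
   (per 128-bit lane); once the builtin is folded, GCC can represent it
   as a VEC_PACK_SAT_EXPR on vector(16) short unsigned int operands,
   as in the dump above.  */
__m256i
pack_example (__m256i a, __m256i b)
{
  return _mm256_packus_epi16 (a, b);
}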

[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)

2021-08-24 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #12 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #11)
> (In reply to Andrew Pinski from comment #10)
> > (In reply to Andrew Pinski from comment #9)
> > > (In reply to Hongtao.liu from comment #8)
> > > > Do we have IR for unsigned/signed saturation in gimple level?
> > > 
> > > Not yet. I was just looking for that today because of PR 51492.
> > 
> > But there is a RFC out for it:
> > https://gcc.gnu.org/pipermail/gcc/2021-May/236015.html
> 
> Oh, VEC_PACK_SAT_EXPR is exactly what I needed for _mm256_packus_epi16, thanks
> for the pointer.

And ‘vec_pack_ssat_m’/‘vec_pack_usat_m’ for the optabs.

[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)

2021-08-24 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #11 from Hongtao.liu  ---
(In reply to Andrew Pinski from comment #10)
> (In reply to Andrew Pinski from comment #9)
> > (In reply to Hongtao.liu from comment #8)
> > > Do we have IR for unsigned/signed saturation in gimple level?
> > 
> > Not yet. I was just looking for that today because of PR 51492.
> 
> But there is a RFC out for it:
> https://gcc.gnu.org/pipermail/gcc/2021-May/236015.html

Oh, VEC_PACK_SAT_EXPR is exactly what I needed for _mm256_packus_epi16, thanks
for the pointer.

[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)

2021-08-24 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #10 from Andrew Pinski  ---
(In reply to Andrew Pinski from comment #9)
> (In reply to Hongtao.liu from comment #8)
> > Do we have IR for unsigned/signed saturation in gimple level?
> 
> Not yet. I was just looking for that today because of PR 51492.

But there is a RFC out for it:
https://gcc.gnu.org/pipermail/gcc/2021-May/236015.html

[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)

2021-08-24 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #9 from Andrew Pinski  ---
(In reply to Hongtao.liu from comment #8)
> Do we have IR for unsigned/signed saturation in gimple level?

Not yet. I was just looking for that today because of PR 51492.
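
For illustration only (my own sketch, not code from this thread), this is
the kind of open-coded scalar saturation that such an IR, per PR 51492 and
the RFC linked in comment #10, would let us represent directly:

/* Assumed example: unsigned saturating add, open-coded.  Today this is
   caught (if at all) only by pattern matching; a dedicated saturation
   representation in gimple could express it as a single operation.  */
unsigned short
sat_add_u16 (unsigned short a, unsigned short b)
{
  unsigned short sum = (unsigned short) (a + b);
  return sum < a ? 0xFFFF : sum;  /* clamp to USHRT_MAX on wrap-around */
}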

[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)

2021-08-24 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #8 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #7)
> (In reply to Andrew Pinski from comment #5)
> > clang can now produce:
> > mov eax, dword ptr [esp + 16]
> > mov ecx, dword ptr [esp + 28]
> > vmovdqu xmm0, xmmword ptr [ecx + 32]
> > vmovdqu xmm1, xmmword ptr [eax]
> > vpackuswb   xmm2, xmm1, xmm0
> > vpsubw  xmm0, xmm1, xmm0
> > vpaddw  xmm0, xmm0, xmm2
> > vpackuswb   xmm0, xmm0, xmm0
> > vpackuswb   xmm0, xmm0, xmm0
> > vpextrd eax, xmm0, 1
> > ret
> > 
> > I suspect that if the back-end is able to "fold" the builtins into gimple at
> > the gimple level, GCC will do a much better job.
> > Currently we have stuff like:
> > _27 = __builtin_ia32_vextractf128_si256 (_28, 0);
> > _26 = __builtin_ia32_vec_ext_v4si (_27, 1); [tail call]
> > 
> > I think both are really just a BIT_FIELD_REF, and they can even be simplified
> > to a single bitfield extraction rather than what we do now:
> > vpackuswb   %ymm1, %ymm0, %ymm0
> > vpextrd $1, %xmm0, %eax
> > 
> > Plus it looks like with __builtin_ia32_vextractf128_si256 (_28, 0), clang is
> > able to remove half of the code due to only needing the 128-bit stuff :).
> 
> Yes, let me try this.


Do we have IR for unsigned/signed saturation in gimple level?

[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)

2021-08-24 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #7 from Hongtao.liu  ---
(In reply to Andrew Pinski from comment #5)
> clang can now produce:
> mov eax, dword ptr [esp + 16]
> mov ecx, dword ptr [esp + 28]
> vmovdqu xmm0, xmmword ptr [ecx + 32]
> vmovdqu xmm1, xmmword ptr [eax]
> vpackuswb   xmm2, xmm1, xmm0
> vpsubw  xmm0, xmm1, xmm0
> vpaddw  xmm0, xmm0, xmm2
> vpackuswb   xmm0, xmm0, xmm0
> vpackuswb   xmm0, xmm0, xmm0
> vpextrd eax, xmm0, 1
> ret
> 
> I suspect that if the back-end is able to "fold" the builtins into gimple at
> the gimple level, GCC will do a much better job.
> Currently we have stuff like:
> _27 = __builtin_ia32_vextractf128_si256 (_28, 0);
> _26 = __builtin_ia32_vec_ext_v4si (_27, 1); [tail call]
> 
> I think both are really just a BIT_FIELD_REF, and they can even be simplified
> to a single bitfield extraction rather than what we do now:
> vpackuswb   %ymm1, %ymm0, %ymm0
> vpextrd $1, %xmm0, %eax
> 
> Plus it looks like with __builtin_ia32_vextractf128_si256 (_28, 0), clang is
> able to remove half of the code due to only needing the 128-bit stuff :).

Yes, let me try this.

[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)

2021-08-24 Thread kobalicek.petr at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #6 from Petr  ---
Yes, the code is not really doing anything useful; I only wrote it to
demonstrate the spill problem. Clang actually outsmarted me by removing half
of the code :)

I think this issue can be closed; I cannot reproduce it with the newest GCC.

[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)

2021-08-13 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

Andrew Pinski  changed:

   What|Removed |Added

  Component|rtl-optimization|target

--- Comment #5 from Andrew Pinski  ---
clang can now produce:
mov eax, dword ptr [esp + 16]
mov ecx, dword ptr [esp + 28]
vmovdqu xmm0, xmmword ptr [ecx + 32]
vmovdqu xmm1, xmmword ptr [eax]
vpackuswb   xmm2, xmm1, xmm0
vpsubw  xmm0, xmm1, xmm0
vpaddw  xmm0, xmm0, xmm2
vpackuswb   xmm0, xmm0, xmm0
vpackuswb   xmm0, xmm0, xmm0
vpextrd eax, xmm0, 1
ret

I suspect that if the back-end is able to "fold" the builtins into gimple at
the gimple level, GCC will do a much better job.
Currently we have stuff like:
_27 = __builtin_ia32_vextractf128_si256 (_28, 0);
_26 = __builtin_ia32_vec_ext_v4si (_27, 1); [tail call]

I think both are really just a BIT_FIELD_REF, and they can even be simplified
to a single bitfield extraction rather than what we do now:
vpackuswb   %ymm1, %ymm0, %ymm0
vpextrd $1, %xmm0, %eax

Plus it looks like with __builtin_ia32_vextractf128_si256 (_28, 0), clang is
able to remove half of the code due to only needing the 128-bit stuff :).
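
To make that concrete, here is a sketch (my own, assuming the usual intrinsic
spellings that expand to the two builtins above; not the exact testcase) of
the extract sequence in question:

#include <immintrin.h>

/* Assumed example: both steps together are just one 32-bit element
   extract from the 256-bit vector, i.e. a single BIT_FIELD_REF that can
   become one vpextrd.  */
static inline int
extract_low_lane_elem1 (__m256i v)
{
  __m128i lo = _mm256_extractf128_si256 (v, 0); /* __builtin_ia32_vextractf128_si256 */
  return _mm_extract_epi32 (lo, 1);             /* __builtin_ia32_vec_ext_v4si */
}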

[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)

2016-08-18 Thread kobalicek.petr at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #2 from Petr  ---
With '-mtune=intel' the push/pop sequence is gone, but the YMM register
management remains the same: 24 more memory accesses than clang.

[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)

2016-08-18 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

Andrew Pinski  changed:

   What|Removed |Added

  Component|c++ |target

--- Comment #1 from Andrew Pinski  ---
Try adding -march=intel or -mtune=intel. The default tuning for GCC is
generic, which is a combination of Intel and AMD tuning. Because AMD tuning
prefers not to use GPRs and SIMD registers at the same time, spilling is
faster there, and the generic tuning accounts for that.
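
A hedged example invocation (the source file name and the -O2/-mavx2 flags
are assumptions, not taken from this report):

  g++ -O2 -mavx2 -mtune=intel -S testcase.cpp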