[Bug target/98167] [x86] Failure to optimize operation on identically shuffled operands into a shuffle of the result of the operation

2022-11-14 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #21 from CVS Commits  ---
The master branch has been updated by Hongyu Wang :

https://gcc.gnu.org/g:dc95e1e9702f2f6367bbc108c8d01169be1b66d2

commit r13-4044-gdc95e1e9702f2f6367bbc108c8d01169be1b66d2
Author: Hongyu Wang 
Date:   Mon Jan 17 13:01:51 2022 +0800

Optimize VEC_PERM_EXPR with same permutation index and operation

The sequence
 c1 = VEC_PERM_EXPR (a, a, mask)
 c2 = VEC_PERM_EXPR (b, b, mask)
 c3 = c1 op c2
can be optimized to
 c = a op b
 c3 = VEC_PERM_EXPR (c, c, mask)
for all integer vector operations, and for float operations with
full permutation.

gcc/ChangeLog:

PR target/98167
* match.pd: New perm + vector op patterns for int and fp vector.

gcc/testsuite/ChangeLog:

PR target/98167
* gcc.target/i386/pr98167.c: New test.
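
For illustration, a minimal C sketch of the shape this transform targets
(adapted from comment #16 in this thread; the committed test is
gcc.target/i386/pr98167.c, not this exact code):

typedef int v4si __attribute__ ((vector_size (16)));

v4si f (v4si a, v4si b)
{
  /* Both operands are permuted by the same mask...  */
  v4si a1 = __builtin_shufflevector (a, a, 2, 3, 1, 0);
  v4si b1 = __builtin_shufflevector (b, b, 2, 3, 1, 0);
  /* ...so this becomes one add followed by a single shuffle of the result.  */
  return a1 + b1;
}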

[Bug target/98167] [x86] Failure to optimize operation on identically shuffled operands into a shuffle of the result of the operation

2022-01-11 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #20 from Richard Biener  ---
-fno-trapping-math tells us we are not concerned about FP exception flags (so,
say, a spurious FP_INEXACT is OK); -fno-signaling-nans is needed as well, I guess.

Oh, and in practice performing the multiplication for elements that are
NaN or denormal might trigger very slow paths in the CPU, which means
the optimization could be a pessimization runtime-wise.  Eventually
zeroing the unused lanes in one(?) of the operands is enough to avoid that
(for denormals, that is; 0 * NaN is still NaN, I guess).
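
To make the hazard concrete, here is the testcase shape from comment #5 in
this thread: only element 0 of each operand reaches the result, so a
whole-vector multiply done before the shuffle would also operate on the
unused element 1.

#include <emmintrin.h>

__m128d f (__m128d a, __m128d b)
{
  /* If element 1 of a or b is a NaN or a denormal, multiplying the full
     vectors first may raise a spurious FP exception or hit a slow path.  */
  return _mm_mul_pd (_mm_shuffle_pd (a, a, 0), _mm_shuffle_pd (b, b, 0));
}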

[Bug target/98167] [x86] Failure to optimize operation on identically shuffled operands into a shuffle of the result of the operation

2021-08-27 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #19 from Andrew Pinski  ---
(In reply to Hongtao.liu from comment #18)
> For vector integers it should be ok?
> For vector floating point we can add the condition
> flag_unsafe_math_optimizations || !flag_trapping_math for the optimization.

I think only the flag_trapping_math check is really needed.
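
For reference, a hedged match.pd-style sketch of such a guard (illustrative
only; this is not the pattern that was eventually committed in r13-4044):

/* c1 = VEC_PERM_EXPR <a, a, mask>; c2 = VEC_PERM_EXPR <b, b, mask>;
   c1 * c2  ->  VEC_PERM_EXPR <a * b, a * b, mask>  */
(simplify
 (mult (vec_perm @0 @0 @2) (vec_perm @1 @1 @2))
 (if (ANY_INTEGRAL_TYPE_P (TREE_TYPE (@0)) || !flag_trapping_math)
  (vec_perm (mult @0 @1) (mult @0 @1) @2)))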

[Bug target/98167] [x86] Failure to optimize operation on identically shuffled operands into a shuffle of the result of the operation

2021-08-27 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #18 from Hongtao.liu  ---
(In reply to Andrew Pinski from comment #17)
> (In reply to Hongtao.liu from comment #16)
> > typedef int v4si __attribute__ ((vector_size(16)));
> > 
> > v4si f(v4si a, v4si b) {
> > v4si a1 = __builtin_shufflevector (a, a, 2, 3, 1, 0);
> > v4si b1 = __builtin_shufflevector (b, a, 2, 3, 1, 0);
> > return a1 * b1;
> > }
> > 
> > gcc generate 
> > 
> > f:
> > vpshufd xmm1, xmm1, 30
> > vpshufd xmm0, xmm0, 30
> > vpmulld xmm0, xmm0, xmm1
> > ret
> > 
> > llvm generate
> > 
> > f:  # @f
> > vpmulld xmm0, xmm1, xmm0
> > vpshufd xmm0, xmm0, 30  # xmm0 = xmm0[2,3,1,0]
> > ret
> 
> For the above, this is safe for -ftrapping-math as all elements are still
> used.  It is when there are elements that are not used that it might not be
> safe ...

For vector integers it should be ok?
For vector floating point we can add the condition flag_unsafe_math_optimizations
|| !flag_trapping_math for the optimization.

[Bug target/98167] [x86] Failure to optimize operation on identically shuffled operands into a shuffle of the result of the operation

2021-08-27 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #17 from Andrew Pinski  ---
(In reply to Hongtao.liu from comment #16)
> typedef int v4si __attribute__ ((vector_size(16)));
> 
> v4si f(v4si a, v4si b) {
> v4si a1 = __builtin_shufflevector (a, a, 2, 3, 1, 0);
> v4si b1 = __builtin_shufflevector (b, a, 2, 3, 1, 0);
> return a1 * b1;
> }
> 
> gcc generate 
> 
> f:
> vpshufd xmm1, xmm1, 30
> vpshufd xmm0, xmm0, 30
> vpmulld xmm0, xmm0, xmm1
> ret
> 
> llvm generate
> 
> f:  # @f
> vpmulld xmm0, xmm1, xmm0
> vpshufd xmm0, xmm0, 30  # xmm0 = xmm0[2,3,1,0]
> ret

For the above, this is safe for -ftrapping-math as all elements are still used.
It is when there are elements that are not used that it might not be safe ...

[Bug target/98167] [x86] Failure to optimize operation on identically shuffled operands into a shuffle of the result of the operation

2021-08-27 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #16 from Hongtao.liu  ---
typedef int v4si __attribute__ ((vector_size(16)));

v4si f(v4si a, v4si b) {
v4si a1 = __builtin_shufflevector (a, a, 2, 3, 1, 0);
v4si b1 = __builtin_shufflevector (b, a, 2, 3, 1, 0);
return a1 * b1;
}

gcc generate 

f:
vpshufd xmm1, xmm1, 30
vpshufd xmm0, xmm0, 30
vpmulld xmm0, xmm0, xmm1
ret

llvm generate

f:  # @f
vpmulld xmm0, xmm1, xmm0
vpshufd xmm0, xmm0, 30  # xmm0 = xmm0[2,3,1,0]
ret

[Bug target/98167] [x86] Failure to optimize operation on identically shuffled operands into a shuffle of the result of the operation

2021-08-26 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #15 from Hongtao.liu  ---
(In reply to Andrew Pinski from comment #14)
> (In reply to Hongtao.liu from comment #13)
> > fold shufps to vec_perm_expr, but still 2 shufps are generated.
> > 
> > __m128 f (__m128 a, __m128 b)
> > {
> >   vector(4) float _3;
> >   vector(4) float _5;
> >   vector(4) float _6;
> > 
> > ;;   basic block 2, loop depth 0
> > ;;pred:   ENTRY
> >   _3 = VEC_PERM_EXPR <b_2(D), b_2(D), { 0, 0, 0, 0 }>;
> >   _5 = VEC_PERM_EXPR <a_4(D), a_4(D), { 0, 0, 0, 0 }>;
> >   _6 = _3 * _5;
> >   return _6;
> > ;;succ:   EXIT
> > 
> > }
> 
> So this is a bit more complex as not all targets have good extract/dup
> functionality for scalars. So maybe this should be done as a define_insn for
> x86.

No need for extract/dup: if both perm indexes are the same, it can be c = a * b
and vec_perm_expr (c, c, index).  It seems a quite general optimization which
could apply to all other operations.
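
In GIMPLE terms, a sketch of that rewrite (SSA names and the mask here are
illustrative):

  ;; before
  _3 = VEC_PERM_EXPR <b_2(D), b_2(D), { 0, 0, 0, 0 }>;
  _5 = VEC_PERM_EXPR <a_4(D), a_4(D), { 0, 0, 0, 0 }>;
  _6 = _3 * _5;
  ;; after
  _7 = a_4(D) * b_2(D);
  _6 = VEC_PERM_EXPR <_7, _7, { 0, 0, 0, 0 }>;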

[Bug target/98167] [x86] Failure to optimize operation on identically shuffled operands into a shuffle of the result of the operation

2021-08-26 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #14 from Andrew Pinski  ---
(In reply to Hongtao.liu from comment #13)
> fold shufps to vec_perm_expr, but still 2 shufps are generated.
> 
> __m128 f (__m128 a, __m128 b)
> {
>   vector(4) float _3;
>   vector(4) float _5;
>   vector(4) float _6;
> 
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   _3 = VEC_PERM_EXPR <b_2(D), b_2(D), { 0, 0, 0, 0 }>;
>   _5 = VEC_PERM_EXPR <a_4(D), a_4(D), { 0, 0, 0, 0 }>;
>   _6 = _3 * _5;
>   return _6;
> ;;succ:   EXIT
> 
> }

So this is a bit more complex as not all targets have good extract/dup
functionality for scalars. So maybe this should be done as a define_insn for x86.

[Bug target/98167] [x86] Failure to optimize operation on identically shuffled operands into a shuffle of the result of the operation

2021-08-26 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #13 from Hongtao.liu  ---
fold shufps to vec_perm_expr, but still 2 shufps are generated.

__m128 f (__m128 a, __m128 b)
{
  vector(4) float _3;
  vector(4) float _5;
  vector(4) float _6;

;;   basic block 2, loop depth 0
;;pred:   ENTRY
  _3 = VEC_PERM_EXPR <b_2(D), b_2(D), { 0, 0, 0, 0 }>;
  _5 = VEC_PERM_EXPR <a_4(D), a_4(D), { 0, 0, 0, 0 }>;
  _6 = _3 * _5;
  return _6;
;;succ:   EXIT

}

[Bug target/98167] [x86] Failure to optimize operation on identically shuffled operands into a shuffle of the result of the operation

2021-08-26 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #12 from CVS Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:0fa4787bf34b173ce6f198e99b6f6dd8a3f98014

commit r12-3177-g0fa4787bf34b173ce6f198e99b6f6dd8a3f98014
Author: liuhongt 
Date:   Fri Dec 11 19:02:43 2020 +0800

Fold more shuffle builtins to VEC_PERM_EXPR.

A follow-up to
https://gcc.gnu.org/pipermail/gcc-patches/2019-May/521983.html

gcc/
PR target/98167
PR target/43147
* config/i386/i386.c (ix86_gimple_fold_builtin): Fold
IX86_BUILTIN_SHUFPD512, IX86_BUILTIN_SHUFPS512,
IX86_BUILTIN_SHUFPD256, IX86_BUILTIN_SHUFPS,
IX86_BUILTIN_SHUFPS256.
(ix86_masked_all_ones): New function.

gcc/testsuite/
* gcc.target/i386/avx512f-vshufpd-1.c: Adjust testcase.
* gcc.target/i386/avx512f-vshufps-1.c: Adjust testcase.
* gcc.target/i386/pr43147.c: New test.
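
For context, the GIMPLE dumps shown in comments #2 and #13 of this thread come
from compiling code of this shape (a sketch; the exact source and flags are not
given in the thread):

#include <xmmintrin.h>

__m128 f (__m128 a, __m128 b)
{
  /* _mm_shuffle_ps expands to __builtin_ia32_shufps; after this commit the
     constant-mask calls are folded to VEC_PERM_EXPR at the GIMPLE level.  */
  return _mm_mul_ps (_mm_shuffle_ps (a, a, 0), _mm_shuffle_ps (b, b, 0));
}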

[Bug target/98167] [x86] Failure to optimize operation on identically shuffled operands into a shuffle of the result of the operation

2021-01-14 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #11 from Jakub Jelinek  ---
Just would like to clarify: for some of the intrinsics, like these, that take an
immediate argument which encodes the permutation, the right thing is to fold
them in the gimple folder hook, because not all the immediate values might map
to __builtin_shuffle, and even if they did, expressing it in C code would be too
verbose.
And then there are intrinsics that don't have any immediate arguments nor
vector arguments containing the permutation; they just take one or two vectors
and apply one specific permutation.  Those IMHO should be done in the headers
using __builtin_shuffle.

[Bug target/98167] [x86] Failure to optimize operation on identically shuffled operands into a shuffle of the result of the operation

2021-01-14 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #10 from Hongtao.liu  ---
A patch is posted at
https://gcc.gnu.org/pipermail/gcc-patches/2020-December/561909.html

And to record Jakub's comments from another thread:

On Tue, Jan 12, 2021 at 11:47:48AM +0100, Jakub Jelinek via Gcc-patches wrote:
> > OTOH, perhaps some of the new testcases can be handled in x86
> > target_fold_builtin? In the long term, maybe target_fold_shuffle can
> > be introduced to map __builtin_shufle to various target builtins, so
> > the builtin can be processed further in target_fold_builtin. As
> > pointed out below, vector insn patterns can be quite complex, and push
> > RTL combiners to their limits, so perhaps they can be more efficiently
> > handled by tree passes.
>
> My primary motivation was to generate good code from __builtin_shuffle here
> and trying to find the best permutation and map it back from insns to
> builtins would be a nightmare.
> I'll see how many targets I need to modify to try the no middle-end
> force_reg for CONST0_RTX case...

For the folding, I think the best would be to change _mm_unpacklo_epi8
and all the similar intrinsics for hardcoded specific permutations
from using a builtin to just using __builtin_shuffle (together with
verification that we emit as good or better code from it for each case of
course), and keep __builtin_shuffle -> VEC_PERM_EXPR as the canonical
form (with which the GIMPLE code can do any optimizations it wants).
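
As a hedged sketch of that header approach for _mm_unpacklo_epi8 (names and
types here are illustrative; this is not the actual emmintrin.h code):

typedef long long __m128i __attribute__ ((__vector_size__ (16), __may_alias__));
typedef char __v16qi __attribute__ ((__vector_size__ (16)));

static __inline __m128i
my_mm_unpacklo_epi8 (__m128i __A, __m128i __B)   /* hypothetical name */
{
  /* Interleave the low 8 bytes of __A and __B; GCC represents
     __builtin_shuffle as VEC_PERM_EXPR in GIMPLE.  */
  return (__m128i) __builtin_shuffle ((__v16qi) __A, (__v16qi) __B,
                                      (__v16qi) { 0, 16, 1, 17, 2, 18, 3, 19,
                                                  4, 20, 5, 21, 6, 22, 7, 23 });
}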

[Bug target/98167] [x86] Failure to optimize operation on identically shuffled operands into a shuffle of the result of the operation

2020-12-09 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #9 from Hongtao.liu  ---
(In reply to Marc Glisse from comment #8)
> (In reply to Richard Biener from comment #4)
> > We already handle IX86_BUILTIN_SHUFPD there but not IX86_BUILTIN_SHUFPS for
> > some reason.
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2019-May/521983.html
> I was checking with just one builtin if this was the right approach, and
> never got to extend it to others, sorry. Handling shufps in a similar way
> seems good to me, if anyone has time to do it.

I'll take a look; I assume I also need to fold those shuffle builtins for
avx/avx512.

[Bug target/98167] [x86] Failure to optimize operation on identically shuffled operands into a shuffle of the result of the operation

2020-12-08 Thread glisse at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #8 from Marc Glisse  ---
(In reply to Richard Biener from comment #4)
> We already handle IX86_BUILTIN_SHUFPD there but not IX86_BUILTIN_SHUFPS for
> some reason.

https://gcc.gnu.org/pipermail/gcc-patches/2019-May/521983.html
I was checking with just one builtin if this was the right approach, and never
got to extend it to others, sorry. Handling shufps in a similar way seems good
to me, if anyone has time to do it.

[Bug target/98167] [x86] Failure to optimize operation on identically shuffled operands into a shuffle of the result of the operation

2020-12-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #7 from Richard Biener  ---
The transform with doubles on the [1] element would produce

unpckhpd%xmm1, %xmm1
unpckhpd%xmm0, %xmm0
mulsd   %xmm1, %xmm0
unpcklpd%xmm0, %xmm0

so that's not profitable.  Which vector element is cheap (free) to promote
to scalar possibly depends on the target (ppc/arm with BE lane order?).
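
Presumably that variant looks like this (a sketch; the exact source is not
shown in this thread):

#include <emmintrin.h>

__m128d g (__m128d a, __m128d b)
{
  /* Mask 3 broadcasts element [1] of each operand.  */
  return _mm_mul_pd (_mm_shuffle_pd (a, a, 3), _mm_shuffle_pd (b, b, 3));
}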

[Bug target/98167] [x86] Failure to optimize operation on identically shuffled operands into a shuffle of the result of the operation

2020-12-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #6 from Richard Biener  ---
Created attachment 49695
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49695&action=edit
vector lowering ssa_uniform_vector_p hack

[Bug target/98167] [x86] Failure to optimize operation on identically shuffled operands into a shuffle of the result of the operation

2020-12-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #5 from Richard Biener  ---
So

__m128d f(__m128d a, __m128d b) {
return _mm_mul_pd(_mm_shuffle_pd(a, a, 0), _mm_shuffle_pd(b, b, 0));
}

is expanded as

  _3 = VEC_PERM_EXPR <b_2(D), b_2(D), { 0, 0 }>;
  _5 = VEC_PERM_EXPR <a_4(D), a_4(D), { 0, 0 }>;
  _6 = _3 * _5;
  return _6;

but vector lowering ssa_uniform_vector_p doesn't yet handle VEC_PERM_EXPRs
with all-zero permute.  Hacking that in (not fixing the fallout) produces

  <bb 2> [local count: 1073741824]:
  _7 = BIT_FIELD_REF <b_2(D), 64, 0>;
  _8 = BIT_FIELD_REF <a_4(D), 64, 0>;
  _9 = _7 * _8;
  _6 = {_9, _9};

and

f:
.LFB534:
.cfi_startproc
mulsd   %xmm1, %xmm0
unpcklpd%xmm0, %xmm0
ret

[Bug target/98167] [x86] Failure to optimize operation on identically shuffled operands into a shuffle of the result of the operation

2020-12-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #4 from Richard Biener  ---
That works only for a single operation and doesn't really scale.  I think we want
to expose the permutes at the GIMPLE level via ix86_gimple_fold_builtin.  We
already handle IX86_BUILTIN_SHUFPD there but not IX86_BUILTIN_SHUFPS for
some reason.

[Bug target/98167] [x86] Failure to optimize operation on identically shuffled operands into a shuffle of the result of the operation

2020-12-07 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #3 from Hongtao.liu  ---
;; _3 = __builtin_ia32_shufps (b_2(D), b_2(D), 0);

(insn 7 6 8 (set (reg:V4SF 88)
(reg/v:V4SF 86 [ b ])) "./gcc/include/xmmintrin.h":746:19 -1
 (nil))

(insn 8 7 9 (set (reg:V4SF 89)
(reg/v:V4SF 86 [ b ])) "./gcc/include/xmmintrin.h":746:19 -1
 (nil))

(insn 9 8 10 (set (reg:V4SF 87)
(vec_select:V4SF (vec_concat:V8SF (reg:V4SF 88)
(reg:V4SF 89))
(parallel [
(const_int 0 [0]) repeated x2
(const_int 4 [0x4]) repeated x2
]))) "./gcc/include/xmmintrin.h":746:19 -1
 (nil))
;; _5 = __builtin_ia32_shufps (a_4(D), a_4(D), 0);

(insn 11 10 12 (set (reg:V4SF 91)
(reg/v:V4SF 85 [ a ])) "./gcc/include/xmmintrin.h":746:19 -1
 (nil))

(insn 12 11 13 (set (reg:V4SF 92)
(reg/v:V4SF 85 [ a ])) "./gcc/include/xmmintrin.h":746:19 -1
 (nil))

(insn 13 12 14 (set (reg:V4SF 90)
(vec_select:V4SF (vec_concat:V8SF (reg:V4SF 91)
(reg:V4SF 92))
(parallel [
(const_int 0 [0]) repeated x2
(const_int 4 [0x4]) repeated x2
]))) "./gcc/include/xmmintrin.h":746:19 -1
 (nil))


Simplify the above to

(vec_duplicate:V4SF
  (vec_select:SF (reg:V4SF 86)
    (parallel [(const_int 0)])))

then add a combine splitter to transform (mult (vec_dup op1) (vec_dup op2)) into
(vec_dup (mult op1 op2))?
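
A hedged sketch of such a splitter in machine-description syntax (modes,
predicates, and the condition are assumptions; this is not a committed pattern):

;; (mult (vec_duplicate x) (vec_duplicate y)) -> (vec_duplicate (mult x y))
(define_split
  [(set (match_operand:V4SF 0 "register_operand")
        (mult:V4SF
          (vec_duplicate:V4SF (match_operand:SF 1 "register_operand"))
          (vec_duplicate:V4SF (match_operand:SF 2 "register_operand"))))]
  "TARGET_SSE && can_create_pseudo_p ()"
  [(set (match_dup 3) (mult:SF (match_dup 1) (match_dup 2)))
   (set (match_dup 0) (vec_duplicate:V4SF (match_dup 3)))]
{
  operands[3] = gen_reg_rtx (SFmode);
})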

[Bug target/98167] [x86] Failure to optimize operation on identically shuffled operands into a shuffle of the result of the operation

2020-12-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

Richard Biener  changed:

   What|Removed |Added

 Target|x86_64 i?86 |x86_64-*-* i?86-*-*
   Keywords||missed-optimization
   Last reconfirmed||2020-12-07
 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW

--- Comment #2 from Richard Biener  ---
Btw, vector lowering performs this optimization.  But then in GIMPLE we have

__m128 f (__m128 a, __m128 b)
{
  vector(4) float _3;
  vector(4) float _5;
  vector(4) float _6;

  <bb 2> [local count: 1073741824]:
  _3 = __builtin_ia32_shufps (b_2(D), b_2(D), 0);
  _5 = __builtin_ia32_shufps (a_4(D), a_4(D), 0);
  _6 = _3 * _5;
  return _6;

so we don't actually see the operation.  To rectify this, the backend would
need to GIMPLE-fold those calls once the mask argument becomes constant.
Fold it to a VEC_PERM_EXPR of VIEW_CONVERTed operands, that is.

Vector lowering doesn't perform generic permute optimizations; the vectorizer
does, but it doesn't touch existing code.  I guess it could be done in some
new pass similar to backprop (but with dataflow the other way around).

[Bug target/98167] [x86] Failure to optimize operation on identically shuffled operands into a shuffle of the result of the operation

2020-12-06 Thread gabravier at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

Gabriel Ravier  changed:

   What|Removed |Added

Summary|[x86] Failure to optimize   |[x86] Failure to optimize
   |operation on identically    |operation on identically
   |shuffled operand into a |shuffled operands into a
   |shuffle of the result of|shuffle of the result of
   |the operation   |the operation

--- Comment #1 from Gabriel Ravier  ---
PS: I would expect it to be possible to expand this to a much more generic set
of operations, and maybe to shuffled operands, hence the summary.