[Bug target/98167] [x86] Failure to optimize operation on indentically shuffled operands into a shuffle of the result of the operation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #21 from CVS Commits ---
The master branch has been updated by Hongyu Wang:

https://gcc.gnu.org/g:dc95e1e9702f2f6367bbc108c8d01169be1b66d2

commit r13-4044-gdc95e1e9702f2f6367bbc108c8d01169be1b66d2
Author: Hongyu Wang
Date:   Mon Jan 17 13:01:51 2022 +0800

    Optimize VEC_PERM_EXPR with same permutation index and operation

    The sequence
        c1 = VEC_PERM_EXPR (a, a, mask)
        c2 = VEC_PERM_EXPR (b, b, mask)
        c3 = c1 op c2
    can be optimized to
        c = a op b
        c3 = VEC_PERM_EXPR (c, c, mask)
    for all integer vector operations, and for float operations with a
    full permutation.

    gcc/ChangeLog:

        PR target/98167
        * match.pd: New perm + vector op patterns for int and fp vector.

    gcc/testsuite/ChangeLog:

        PR target/98167
        * gcc.target/i386/pr98167.c: New test.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #20 from Richard Biener ---
-fno-trapping-math tells us we are not concerned about FP exception flags (so, say, a spurious FP_INEXACT is OK); -fno-signaling-nans is needed as well, I guess. Oh, and in practice, performing the multiplication for elements that are NaN or denormal might trigger very slow paths in the CPU, which means the optimization could be a pessimization runtime-wise. Eventually, zeroing the unused lanes in one(?) of the operands is enough to avoid that (for denormals; I guess 0 * NaN is still NaN).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #19 from Andrew Pinski ---
(In reply to Hongtao.liu from comment #18)
> For vector integers it should be OK?
> For vector floating point we can add the condition
> flag_unsafe_math_optimizations || !flag_trapping_math for the optimization.

I think only !flag_trapping_math is really needed.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #18 from Hongtao.liu ---
(In reply to Andrew Pinski from comment #17)
> (In reply to Hongtao.liu from comment #16)
> > typedef int v4si __attribute__ ((vector_size(16)));
> >
> > v4si f(v4si a, v4si b) {
> >   v4si a1 = __builtin_shufflevector (a, a, 2, 3, 1, 0);
> >   v4si b1 = __builtin_shufflevector (b, a, 2, 3, 1, 0);
> >   return a1 * b1;
> > }
> >
> > gcc generates
> >
> > f:
> >         vpshufd xmm1, xmm1, 30
> >         vpshufd xmm0, xmm0, 30
> >         vpmulld xmm0, xmm0, xmm1
> >         ret
> >
> > llvm generates
> >
> > f:                              # @f
> >         vpmulld xmm0, xmm1, xmm0
> >         vpshufd xmm0, xmm0, 30          # xmm0 = xmm0[2,3,1,0]
> >         ret
>
> For the above, this is safe for -ftrapping-math as all elements are still
> used. It is when there are elements that are not used that it might not be
> safe ...

For vector integers it should be OK?
For vector floating point we can add the condition
flag_unsafe_math_optimizations || !flag_trapping_math for the optimization.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #17 from Andrew Pinski ---
(In reply to Hongtao.liu from comment #16)
> typedef int v4si __attribute__ ((vector_size(16)));
>
> v4si f(v4si a, v4si b) {
>   v4si a1 = __builtin_shufflevector (a, a, 2, 3, 1, 0);
>   v4si b1 = __builtin_shufflevector (b, a, 2, 3, 1, 0);
>   return a1 * b1;
> }
>
> gcc generates
>
> f:
>         vpshufd xmm1, xmm1, 30
>         vpshufd xmm0, xmm0, 30
>         vpmulld xmm0, xmm0, xmm1
>         ret
>
> llvm generates
>
> f:                              # @f
>         vpmulld xmm0, xmm1, xmm0
>         vpshufd xmm0, xmm0, 30          # xmm0 = xmm0[2,3,1,0]
>         ret

For the above, this is safe for -ftrapping-math as all elements are still used. It is when there are elements that are not used that it might not be safe ...
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #16 from Hongtao.liu ---
typedef int v4si __attribute__ ((vector_size(16)));

v4si f(v4si a, v4si b) {
  v4si a1 = __builtin_shufflevector (a, a, 2, 3, 1, 0);
  v4si b1 = __builtin_shufflevector (b, a, 2, 3, 1, 0);
  return a1 * b1;
}

gcc generates

f:
        vpshufd xmm1, xmm1, 30
        vpshufd xmm0, xmm0, 30
        vpmulld xmm0, xmm0, xmm1
        ret

llvm generates

f:                              # @f
        vpmulld xmm0, xmm1, xmm0
        vpshufd xmm0, xmm0, 30          # xmm0 = xmm0[2,3,1,0]
        ret
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #15 from Hongtao.liu ---
(In reply to Andrew Pinski from comment #14)
> (In reply to Hongtao.liu from comment #13)
> > Folded shufps to VEC_PERM_EXPR, but still 2 shufps are generated.
> >
> > __m128 f (__m128 a, __m128 b)
> > {
> >   vector(4) float _3;
> >   vector(4) float _5;
> >   vector(4) float _6;
> >
> >   ;; basic block 2, loop depth 0
> >   ;;    pred:       ENTRY
> >   _3 = VEC_PERM_EXPR <...>;
> >   _5 = VEC_PERM_EXPR <...>;
> >   _6 = _3 * _5;
> >   return _6;
> >   ;;    succ:       EXIT
> >
> > }
>
> So this is a bit more complex as not all targets have good extract/dup
> functionality for scalars. So maybe this should be done as a define_insn for
> x86.

No need for extract/dup: if both perm indexes are the same, it can be c = a * b and VEC_PERM_EXPR (c, c, index). It seems a quite general optimization which could apply to all other operations.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #14 from Andrew Pinski ---
(In reply to Hongtao.liu from comment #13)
> Folded shufps to VEC_PERM_EXPR, but still 2 shufps are generated.
>
> __m128 f (__m128 a, __m128 b)
> {
>   vector(4) float _3;
>   vector(4) float _5;
>   vector(4) float _6;
>
>   ;; basic block 2, loop depth 0
>   ;;    pred:       ENTRY
>   _3 = VEC_PERM_EXPR <...>;
>   _5 = VEC_PERM_EXPR <...>;
>   _6 = _3 * _5;
>   return _6;
>   ;;    succ:       EXIT
>
> }

So this is a bit more complex as not all targets have good extract/dup functionality for scalars. So maybe this should be done as a define_insn for x86.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #13 from Hongtao.liu ---
Folded shufps to VEC_PERM_EXPR, but still 2 shufps are generated.

__m128 f (__m128 a, __m128 b)
{
  vector(4) float _3;
  vector(4) float _5;
  vector(4) float _6;

  ;; basic block 2, loop depth 0
  ;;    pred:       ENTRY
  _3 = VEC_PERM_EXPR <...>;
  _5 = VEC_PERM_EXPR <...>;
  _6 = _3 * _5;
  return _6;
  ;;    succ:       EXIT

}
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #12 from CVS Commits ---
The master branch has been updated by hongtao Liu:

https://gcc.gnu.org/g:0fa4787bf34b173ce6f198e99b6f6dd8a3f98014

commit r12-3177-g0fa4787bf34b173ce6f198e99b6f6dd8a3f98014
Author: liuhongt
Date:   Fri Dec 11 19:02:43 2020 +0800

    Fold more shuffle builtins to VEC_PERM_EXPR.

    A follow-up to
    https://gcc.gnu.org/pipermail/gcc-patches/2019-May/521983.html

    gcc/

        PR target/98167
        PR target/43147
        * config/i386/i386.c (ix86_gimple_fold_builtin): Fold
        IX86_BUILTIN_SHUFPD512, IX86_BUILTIN_SHUFPS512,
        IX86_BUILTIN_SHUFPD256, IX86_BUILTIN_SHUFPS,
        IX86_BUILTIN_SHUFPS256.
        (ix86_masked_all_ones): New function.

    gcc/testsuite/

        * gcc.target/i386/avx512f-vshufpd-1.c: Adjust testcase.
        * gcc.target/i386/avx512f-vshufps-1.c: Adjust testcase.
        * gcc.target/i386/pr43147.c: New test.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

Jakub Jelinek changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #11 from Jakub Jelinek ---
Just to clarify: for some of the intrinsics like these that take an immediate argument which encodes the permutation, the right thing is to fold them in the gimple folder hook, because not all the immediate values might map to __builtin_shuffle, and even if they did, expressing it in C code would be too large. And then there are intrinsics that don't have any immediate arguments nor vector arguments containing the permutation; they just take one or two vectors and apply one specific permutation. Those IMHO should be done in the headers using __builtin_shuffle.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #10 from Hongtao.liu ---
A patch is posted at
https://gcc.gnu.org/pipermail/gcc-patches/2020-December/561909.html

And to record Jakub's comments from another thread:

On Tue, Jan 12, 2021 at 11:47:48AM +0100, Jakub Jelinek via Gcc-patches wrote:
> > OTOH, perhaps some of the new testcases can be handled in x86
> > target_fold_builtin? In the long term, maybe target_fold_shuffle can
> > be introduced to map __builtin_shuffle to various target builtins, so
> > the builtin can be processed further in target_fold_builtin. As
> > pointed out below, vector insn patterns can be quite complex, and push
> > RTL combiners to their limits, so perhaps they can be more efficiently
> > handled by tree passes.
>
> My primary motivation was to generate good code from __builtin_shuffle here
> and trying to find the best permutation and map it back from insns to
> builtins would be a nightmare.
> I'll see how many targets I need to modify to try the no middle-end
> force_reg for CONST0_RTX case...

For the folding, I think best would be to change _mm_unpacklo_epi8 and all
the similar intrinsics for hardcoded specific permutations from using a
builtin to just using __builtin_shuffle (together with verification that we
emit as good or better code from it for each case, of course), and keep
__builtin_shuffle -> VEC_PERM_EXPR as the canonical form (with which the
GIMPLE code can do any optimizations it wants).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #9 from Hongtao.liu ---
(In reply to Marc Glisse from comment #8)
> (In reply to Richard Biener from comment #4)
> > We already handle IX86_BUILTIN_SHUFPD there but not IX86_BUILTIN_SHUFPS
> > for some reason.
>
> https://gcc.gnu.org/pipermail/gcc-patches/2019-May/521983.html
> I was checking with just one builtin if this was the right approach, and
> never got to extend it to others, sorry. Handling shufps in a similar way
> seems good to me, if anyone has time to do it.

I'll take a look; I assume I also need to fold those shuffle builtins for avx/avx512.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #8 from Marc Glisse ---
(In reply to Richard Biener from comment #4)
> We already handle IX86_BUILTIN_SHUFPD there but not IX86_BUILTIN_SHUFPS
> for some reason.

https://gcc.gnu.org/pipermail/gcc-patches/2019-May/521983.html
I was checking with just one builtin if this was the right approach, and never got to extend it to others, sorry. Handling shufps in a similar way seems good to me, if anyone has time to do it.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #7 from Richard Biener ---
The transform with doubles on the [1] element would produce

        unpckhpd %xmm1, %xmm1
        unpckhpd %xmm0, %xmm0
        mulsd    %xmm1, %xmm0
        unpcklpd %xmm0, %xmm0

so that's not profitable. Which vector element is cheap (free) to promote to scalar possibly depends on the target (ppc/arm with BE lane order?).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #6 from Richard Biener ---
Created attachment 49695
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49695&action=edit
vector lowering ssa_uniform_vector_p hack
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #5 from Richard Biener ---
So

__m128d f(__m128d a, __m128d b)
{
  return _mm_mul_pd(_mm_shuffle_pd(a, a, 0), _mm_shuffle_pd(b, b, 0));
}

is expanded as

  _3 = VEC_PERM_EXPR <...>;
  _5 = VEC_PERM_EXPR <...>;
  _6 = _3 * _5;
  return _6;

but vector lowering ssa_uniform_vector_p doesn't yet handle VEC_PERM_EXPRs with an all-zero permute. Hacking that in (not fixing the fallout) produces

  [local count: 1073741824]:
  _7 = BIT_FIELD_REF <...>;
  _8 = BIT_FIELD_REF <...>;
  _9 = _7 * _8;
  _6 = {_9, _9};

and

f:
.LFB534:
        .cfi_startproc
        mulsd    %xmm1, %xmm0
        unpcklpd %xmm0, %xmm0
        ret
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #4 from Richard Biener ---
That works only for a single operation and doesn't really scale. I think we want to expose the permutes at the GIMPLE level via ix86_gimple_fold_builtin. We already handle IX86_BUILTIN_SHUFPD there but not IX86_BUILTIN_SHUFPS, for some reason.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

--- Comment #3 from Hongtao.liu ---
;; _3 = __builtin_ia32_shufps (b_2(D), b_2(D), 0);

(insn 7 6 8 (set (reg:V4SF 88)
        (reg/v:V4SF 86 [ b ])) "./gcc/include/xmmintrin.h":746:19 -1
     (nil))
(insn 8 7 9 (set (reg:V4SF 89)
        (reg/v:V4SF 86 [ b ])) "./gcc/include/xmmintrin.h":746:19 -1
     (nil))
(insn 9 8 10 (set (reg:V4SF 87)
        (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 88)
                (reg:V4SF 89))
            (parallel [
                    (const_int 0 [0]) repeated x2
                    (const_int 4 [0x4]) repeated x2
                ]))) "./gcc/include/xmmintrin.h":746:19 -1
     (nil))

;; _5 = __builtin_ia32_shufps (a_4(D), a_4(D), 0);

(insn 11 10 12 (set (reg:V4SF 91)
        (reg/v:V4SF 85 [ a ])) "./gcc/include/xmmintrin.h":746:19 -1
     (nil))
(insn 12 11 13 (set (reg:V4SF 92)
        (reg/v:V4SF 85 [ a ])) "./gcc/include/xmmintrin.h":746:19 -1
     (nil))
(insn 13 12 14 (set (reg:V4SF 90)
        (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 91)
                (reg:V4SF 92))
            (parallel [
                    (const_int 0 [0]) repeated x2
                    (const_int 4 [0x4]) repeated x2
                ]))) "./gcc/include/xmmintrin.h":746:19 -1
     (nil))

Simplify the above to

(vec_duplicate:V4SF
    (vec_select:SF (reg:V4SF 86)
        (parallel [(const_int 0)])))

then add a combine splitter to transform
(mult (vec_duplicate op1) (vec_duplicate op2))
into
(vec_duplicate (mult op1 op2))?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|x86_64 i?86                 |x86_64-*-* i?86-*-*
           Keywords|                            |missed-optimization
   Last reconfirmed|                            |2020-12-07
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #2 from Richard Biener ---
Btw, vector lowering performs this optimization. But then in GIMPLE we have

__m128 f (__m128 a, __m128 b)
{
  vector(4) float _3;
  vector(4) float _5;
  vector(4) float _6;

  [local count: 1073741824]:
  _3 = __builtin_ia32_shufps (b_2(D), b_2(D), 0);
  _5 = __builtin_ia32_shufps (a_4(D), a_4(D), 0);
  _6 = _3 * _5;
  return _6;

so we don't actually see the operation. To rectify this the backend would need to GIMPLE-fold those calls once the MASK argument becomes constant. Fold them to VEC_PERM_EXPR of VIEW_CONVERTed operands, that is. Vector lowering doesn't perform generic permute optimizations; the vectorizer does, but it doesn't touch existing code. I guess it could be done in some new pass similar to backprop (but with dataflow the other way around).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167

Gabriel Ravier changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[x86] Failure to optimize   |[x86] Failure to optimize
                   |operation on indentically   |operation on indentically
                   |shuffled operand into a     |shuffled operands into a
                   |shuffle of the result of    |shuffle of the result of
                   |the operation               |the operation

--- Comment #1 from Gabriel Ravier ---
PS: I would expect it to be possible to expand this to a much more generic set of operations, and maybe shuffled operands, thus the summary.