[Bug target/105066] GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg

2022-03-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066

--- Comment #5 from Peter Cordes  ---
> pextrw requires sse4.1 for mem operands.

You're right! I didn't double-check the asm manual for PEXTRW when writing up
the initial report, and had never realized that PINSRW wasn't symmetric with
it.  I was really surprised to see that in
https://www.felixcloutier.com/x86/pextrw

So we do need to care about tuning for _mm_storeu_si16(p, v) without SSE4.1
(without the option of PEXTRW to memory).  PEXTRW to an integer register is
obviously bad; we should be doing

movd  %xmm0, %eax
mov   %ax, (%rdi)

instead of an inefficient  pextrw $0, %xmm0, %eax ; movw-store
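
For reference, the store side boils down to something like this (the wrapper
name here is just illustrative, not taken from the report):

#include <immintrin.h>

/* Store the low 16 bits of a vector.  With only -msse2 (no SSE4.1
   PEXTRW-to-memory), the sequence we want is
   movd %xmm0,%eax / mov %ax,(%rdi).  */
void store16(void *p, __m128i v)
{
    _mm_storeu_si16(p, v);
}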

Reported as PR105079, since the cause of the load missed-optimization was GCC
thinking the instruction wasn't available, rather than a wrong tuning choice
like this one.

[Bug target/105066] GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg

2022-03-28 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066

--- Comment #4 from Hongtao.liu  ---
Fixed in GCC 12.

[Bug target/105066] GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg

2022-03-28 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066

--- Comment #3 from CVS Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:e4352a0fee49441a32d12e8d8b98c425cfed4a86

commit r12-7841-ge4352a0fee49441a32d12e8d8b98c425cfed4a86
Author: liuhongt 
Date:   Mon Mar 28 11:12:37 2022 +0800

Fix typo in vec_setv8hi_0.

pinsrw is available for both reg and mem operand under sse2.
pextrw requires sse4.1 for mem operands.

The patch changes the "isa" attr for the pinsrw mem alternative from
sse4_noavx to noavx, which enables the optimization below.

-movzwl  (%rdi), %eax
 pxor%xmm1, %xmm1
-pinsrw  $0, %eax, %xmm1
+pinsrw  $0, (%rdi), %xmm1
 movdqa  %xmm1, %xmm0

gcc/ChangeLog:

PR target/105066
* config/i386/sse.md (vec_set<mode>_0): Change attr "isa" of
alternative 4 from sse4_noavx to noavx.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr105066.c: New test.
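
For context, a minimal C reproducer for the load case is presumably something
like this (the function name is illustrative; the exact testcase isn't quoted
here):

#include <immintrin.h>

/* With -msse2 (no SSE4.1/AVX), the fix lets GCC emit
   pinsrw $0, (%rdi), %xmm1 directly instead of bouncing the 16-bit
   load through an integer register (movzwl + pinsrw).  */
__m128i load16(const void *p)
{
    return _mm_loadu_si16(p);
}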

[Bug target/105066] GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg

2022-03-28 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066

Richard Biener  changed:

           What    |Removed     |Added
----------------------------------------------------------------
 Last reconfirmed  |            |2022-03-28
 Ever confirmed    |0           |1
 Status            |UNCONFIRMED |NEW

[Bug target/105066] GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg

2022-03-27 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066

--- Comment #2 from Hongtao.liu  ---

> That may be a separate bug, IDK
> 

Opened PR105072 for it.

[Bug target/105066] GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg

2022-03-27 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066

--- Comment #1 from Hongtao.liu  ---
pinsrw is available under SSE2 for both reg and mem operands, but pextrw
requires SSE4.1 for memory operands.

(define_insn "vec_set<mode>_0"
  [(set (match_operand:V8_128 0 "register_operand"
          "=v,v,v,x,x,Yr,*x,x,x,x,v,v")
        (vec_merge:V8_128
          (vec_duplicate:V8_128
            (match_operand:<ssescalarmode> 2 "nonimmediate_operand"
          " r,m,v,r,m,Yr,*x,r,m,x,r,m"))
          (match_operand:V8_128 1 "reg_or_0_operand"
          " C,C,v,0,0,0 ,0 ,x,x,x,v,v")
          (const_int 1)))]
  "TARGET_SSE2"
  "@
   vmovw\t{%k2, %0|%0, %k2}
   vmovw\t{%2, %0|%0, %2}
   vmovsh\t{%2, %1, %0|%0, %1, %2}
   pinsrw\t{$0, %k2, %0|%0, %k2, 0}
   pinsrw\t{$0, %2, %0|%0, %2, 0}
   pblendw\t{$1, %2, %0|%0, %2, 1}
   pblendw\t{$1, %2, %0|%0, %2, 1}
   vpinsrw\t{$0, %k2, %1, %0|%0, %1, %k2, 0}
   vpinsrw\t{$0, %2, %1, %0|%0, %1, %2, 0}
   vpblendw\t{$1, %2, %1, %0|%0, %1, %2, 1}
   vpinsrw\t{$0, %k2, %1, %0|%0, %1, %k2, 0}
   vpinsrw\t{$0, %2, %1, %0|%0, %1, %2, 0}"
  [(set (attr "isa")
        (cond [(eq_attr "alternative" "0,1,2")
                 (const_string "avx512fp16")
               (eq_attr "alternative" "3")
                 (const_string "noavx")
               (eq_attr "alternative" "4,5,6")
                 (const_string "sse4_noavx")

alternative 4 doesn't require sse4.


And for performance: pinsrw mem > vmovd reg > pinsrw reg.

And yes, it's a missed optimization for the case below.

pmovzxbq(void*):  # -O3 -msse4.1 -mtune=haswell
pxor%xmm0, %xmm0  # 1 uop
pinsrw  $0, (%rdi), %xmm0 # 2 uops, one for shuffle port
pmovzxbq%xmm0, %xmm0  # 1 uop for the same shuffle port
ret
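
For context, the source for that function is presumably along these lines
(reconstructed from the shape of the asm; the name is illustrative):

#include <immintrin.h>

/* Load 2 bytes, then zero-extend each byte to a 64-bit lane.
   Ideally the 16-bit load could fold into pmovzxbq's memory operand
   instead of going through pinsrw and the shuffle port twice.  */
__m128i pmovzxbq(void *p)
{
    return _mm_cvtepu8_epi64(_mm_loadu_si16(p));
}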