[Bug target/105066] GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066

--- Comment #5 from Peter Cordes ---
> pextrw requires sse4.1 for mem operands.

You're right! I didn't double-check the asm manual for PEXTRW when writing up the initial report, and had never realized that PINSRW wasn't symmetric with it. I was really surprised to see that in https://www.felixcloutier.com/x86/pextrw

So we do need to care about tuning for _mm_storeu_si16(p, v) without SSE4.1 (without the option of PEXTRW to memory). PEXTRW to an integer register is obviously bad; we should be doing

        movd    %xmm0, %eax
        mov     %ax, (%rdi)

instead of an inefficient pextrw $0, %xmm0, %eax ; movw-store

Reported as PR105079, since the cause of the load missed-opt was GCC thinking the instruction wasn't available, rather than a wrong tuning choice like this is.
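For illustration, the store sequence suggested above (movd to an integer register, then a 16-bit store) can be written with plain SSE2 intrinsics. This is only a sketch: the helper name store_low_u16_sse2 is hypothetical and does not come from the bug report, and the instructions a compiler actually emits for it depend on the compiler version and tuning options.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the bug report): store the low 16 bits
   of v to a possibly unaligned address using only SSE2 operations. */
static void store_low_u16_sse2(void *p, __m128i v)
{
    /* Move the low 32 bits to an integer; this usually maps to movd. */
    uint16_t lo = (uint16_t)_mm_cvtsi128_si32(v);
    /* memcpy keeps the unaligned 2-byte store well defined in C. */
    memcpy(p, &lo, sizeof lo);
}
```

Calling store_low_u16_sse2(p, v) should behave like _mm_storeu_si16(p, v) without relying on the memory-destination form of PEXTRW, which needs SSE4.1.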
[Bug target/105066] GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066

--- Comment #4 from Hongtao.liu ---
Fixed in GCC12.
[Bug target/105066] GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066

--- Comment #3 from CVS Commits ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:e4352a0fee49441a32d12e8d8b98c425cfed4a86

commit r12-7841-ge4352a0fee49441a32d12e8d8b98c425cfed4a86
Author: liuhongt
Date: Mon Mar 28 11:12:37 2022 +0800

    Fix typo in vec_setv8hi_0.

    pinsrw is available for both reg and mem operand under sse2.
    pextrw requires sse4.1 for mem operands.

    The patch change attr "isa" for pinsrw mem alternative from
    sse4_noavx to noavx, will enable below optimization.

    -       movzwl  (%rdi), %eax
            pxor    %xmm1, %xmm1
    -       pinsrw  $0, %eax, %xmm1
    +       pinsrw  $0, (%rdi), %xmm1
            movdqa  %xmm1, %xmm0

    gcc/ChangeLog:

            PR target/105066
            * config/i386/sse.md (vec_set_0): Change attr "isa" of
            alternative 4 from sse4_noavx to noavx.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr105066.c: New test.
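As a rough illustration of the source pattern this commit affects, here is a 16-bit load into the low element of a zeroed vector, written with SSE2 intrinsics only. The helper name load_low_u16_sse2 is hypothetical, and this is not the content of the new pr105066.c test. Whether a given compiler folds the load into pinsrw $0, (mem) is an assumption that depends on the GCC version and flags.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the commit): load 16 bits from a
   possibly unaligned address into element 0 of an otherwise zero vector. */
static __m128i load_low_u16_sse2(const void *p)
{
    uint16_t x;
    /* memcpy keeps the unaligned 2-byte load well defined in C. */
    memcpy(&x, p, sizeof x);
    /* Insert into word 0 of a zeroed register (pxor + pinsrw under SSE2). */
    return _mm_insert_epi16(_mm_setzero_si128(), x, 0);
}
```

The result should match what _mm_loadu_si16(p) returns: the loaded word in the low element and zeros in the other seven.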
[Bug target/105066] GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2022-03-28
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
[Bug target/105066] GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066

--- Comment #2 from Hongtao.liu ---
> That may be a separate bug, IDK

Open PR105072 for it.
[Bug target/105066] GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066

--- Comment #1 from Hongtao.liu ---
pinsrw is available under sse2 for both reg and mem operands, but pextrw is not: it requires sse4.1 for memory operands.

(define_insn "vec_set_0"
  [(set (match_operand:V8_128 0 "register_operand"
          "=v,v,v,x,x,Yr,*x,x,x,x,v,v")
        (vec_merge:V8_128
          (vec_duplicate:V8_128
            (match_operand: 2 "nonimmediate_operand"
          " r,m,v,r,m,Yr,*x,r,m,x,r,m"))
          (match_operand:V8_128 1 "reg_or_0_operand"
          " C,C,v,0,0,0 ,0 ,x,x,x,v,v")
          (const_int 1)))]
  "TARGET_SSE2"
  "@
   vmovw\t{%k2, %0|%0, %k2}
   vmovw\t{%2, %0|%0, %2}
   vmovsh\t{%2, %1, %0|%0, %1, %2}
   pinsrw\t{$0, %k2, %0|%0, %k2, 0}
   pinsrw\t{$0, %2, %0|%0, %2, 0}
   pblendw\t{$1, %2, %0|%0, %2, 1}
   pblendw\t{$1, %2, %0|%0, %2, 1}
   vpinsrw\t{$0, %k2, %1, %0|%0, %1, %k2, 0}
   vpinsrw\t{$0, %2, %1, %0|%0, %1, %2, 0}
   vpblendw\t{$1, %2, %1, %0|%0, %1, %2, 1}
   vpinsrw\t{$0, %k2, %1, %0|%0, %1, %k2, 0}
   vpinsrw\t{$0, %2, %1, %0|%0, %1, %2, 0}"
  [(set (attr "isa")
        (cond [(eq_attr "alternative" "0,1,2")
                 (const_string "avx512fp16")
               (eq_attr "alternative" "3")
                 (const_string "noavx")
               (eq_attr "alternative" "4,5,6")
                 (const_string "sse4_noavx")

Alternative 4 (pinsrw with a memory operand) doesn't require sse4.

And for performance: pinsrw mem > vmovd reg > pinsrw reg.

And yes, the code below is suboptimal:

pmovzxbq(void*):                        # -O3 -msse4.1 -mtune=haswell
        pxor    %xmm0, %xmm0            # 1 uop
        pinsrw  $0, (%rdi), %xmm0       # 2 uops, one for shuffle port
        pmovzxbq %xmm0, %xmm0           # 1 uop for the same shuffle port
        ret