https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82370
Bug ID: 82370
Summary: AVX512 can use a memory operand for immediate-count vpsrlw, but gcc doesn't.
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization, ssemmx
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

void pack_high8_alignhack(uint8_t *restrict dst, const uint8_t *restrict src,
                          size_t bytes)
{
    uint8_t *end_dst = dst + bytes;
    do {
        __m128i v0        = _mm_loadu_si128((__m128i*)src);
        __m128i v1_offset = _mm_loadu_si128(1 + (__m128i*)(src - 1));
        v0 = _mm_srli_epi16(v0, 8);
        __m128i v1   = _mm_and_si128(v1_offset, _mm_set1_epi16(0x00FF));
        __m128i pack = _mm_packus_epi16(v0, v1);
        _mm_storeu_si128((__m128i*)dst, pack);
        dst += 16;
        src += 32;           // 32 source bytes -> 16 dst bytes per iteration
    } while (dst < end_dst);
}

gcc compiles this to:

pack_high8_alignhack:
        vmovdqa64  .LC0(%rip), %xmm2        # pointless EVEX when VEX is shorter
        addq       %rdi, %rdx
.L18:
        vmovdqu64  (%rsi), %xmm0
        vpandq     15(%rsi), %xmm2, %xmm1   # pointless EVEX vs. VPAND
        addq       $16, %rdi
        addq       $32, %rsi
        vpsrlw     $8, %xmm0, %xmm0         # could use a memory source.
        vpackuswb  %xmm1, %xmm0, %xmm0
        vmovups    %xmm0, -16(%rdi)
        cmpq       %rdi, %rdx
        ja         .L18
        ret

There's no benefit to using VPANDQ (4-byte EVEX prefix) instead of VPAND (2-byte VEX prefix), and the same goes for VMOVDQA64. We should only use the AVX512 encoding when we need masking, ZMM register size, or xmm/ymm16-31. Or, as in this case, when the AVX512VL+AVX512BW form lets us fold the load into a memory operand:

    VPSRLW xmm1 {k1}{z}, xmm2/m128, imm8
    (https://hjlebbink.github.io/x86doc/html/PSRLW_PSRLD_PSRLQ.html)

IACA 2.3 says the memory-source form micro-fuses, so it's definitely worth it.

Clang gets everything right and emits:

pack_high8_alignhack:
        addq       %rdi, %rdx
        vmovdqa    .LCPI2_0(%rip), %xmm0    # Plain AVX (VEX prefix)
.LBB2_1:
        vpsrlw     $8, (%rsi), %xmm1        # load folded into AVX512BW version
        vpand      15(%rsi), %xmm0, %xmm2   # AVX-128 VEX encoding.
        vpackuswb  %xmm2, %xmm1, %xmm1
        vmovdqu    %xmm1, (%rdi)
        addq       $16, %rdi
        addq       $32, %rsi
        cmpq       %rdx, %rdi
        jb         .LBB2_1
        retq

vmovdqu is the same length as vmovups, so there's no benefit to the FP store here; but AFAIK there's also no downside on any CPU to always using FP stores for the results of vector-integer ALU instructions. (There isn't a separate mnemonic for the EVEX encoding of vmovups, so the assembler uses the VEX encoding whenever the instruction is encodeable that way. Or maybe, for medium-size displacements that are a multiple of the vector width, it could save a byte by using EVEX + disp8 instead of VEX + disp32.)
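
For reference, a concrete illustration of that last point (my own sketch, not output from either compiler, assuming the usual VEX / EVEX encodings of the 128-bit store): a 256-byte displacement doesn't fit in a signed disp8, but it is 16 * 16, so EVEX's disp8 scaled by the 16-byte vector width can encode it.

    vmovups  %xmm0, 256(%rdi)   # VEX:  2-byte prefix + opcode + ModRM + disp32       = 8 bytes
    vmovups  %xmm0, 256(%rdi)   # EVEX: 4-byte prefix + opcode + ModRM + disp8 (0x10) = 7 bytes

So the EVEX form is one byte shorter there, while for displacements in [-128, 127] (or ones that aren't a multiple of the vector width) VEX + disp8 is still shorter.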
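
And for contrast with the "only use EVEX when it's needed" point above, here's a minimal sketch (mine, not from the original testcase; it assumes the AVX512BW+AVX512VL masked-shift intrinsic _mm_maskz_srli_epi16) where the EVEX form of VPSRLW really is required, because zero-masking has no VEX encoding:

    #include <immintrin.h>

    // Needs AVX512BW + AVX512VL: the zero-masked shift only exists as the
    // EVEX-encoded  VPSRLW xmm1 {k1}{z}, xmm2/m128, imm8.
    __m128i high_bytes_masked(__m128i v, __mmask8 k) {
        return _mm_maskz_srli_epi16(k, v, 8);
    }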