https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82460
            Bug ID: 82460
           Summary: AVX512: choose between vpermi2d and vpermt2d to save
                    mov instructions.  Also, fails to optimize away shifts
                    before shuffle
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization, ssemmx
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

#include <immintrin.h>
// gcc 8.0.0 20171004, -O3 -march=skylake-avx512 -mavx512vbmi
// https://godbolt.org/g/fVt4Kb
__m512i vpermi2d(__m512i t1, __m512i control, char *src) {
    return _mm512_permutex2var_epi32(control, t1, _mm512_loadu_si512(src));
}

gcc emits:
        vpermt2d        (%rdi), %zmm0, %zmm1
        vmovdqa64       %zmm1, %zmm0
        ret

clang emits:
        vpermi2d        (%rdi), %zmm1, %zmm0

__m512i vpermi2b(__m512i t1, __m512i a, __m512i b) {
    return _mm512_permutex2var_epi8(a, t1, b);
}

gcc emits:
        vpermt2b        %zmm2, %zmm0, %zmm1
        vmovdqa64       %zmm1, %zmm0
        ret

clang emits:
        vpermi2b        %zmm2, %zmm1, %zmm0

This one compiles OK, though:

__m512i vpermt2d(__m512i t1, __m512i control, char *src) {
    return _mm512_permutex2var_epi32(t1, control, _mm512_loadu_si512(src));
}

        vpermt2d        (%rdi), %zmm1, %zmm0

---

But when auto-vectorizing this with AVX512VBMI (see bug 82459 for AVX512BW
missed optimizations), gcc uses vpermi2b when vpermt2b would be better:

#include <stdint.h>
#include <stddef.h>

void pack_high8_baseline(uint8_t *__restrict__ dst,
                         const uint16_t *__restrict__ src, size_t bytes) {
    uint8_t *end_dst = dst + bytes;
    do {
        *dst++ = *src++ >> 8;
    } while (dst < end_dst);
}

.L9:
        vmovdqa64       (%rsi,%rax,2), %zmm0
        vmovdqa64       64(%rsi,%rax,2), %zmm1
        vmovdqa64       %zmm2, %zmm3         # copy the index
        vpsrlw          $8, %zmm0, %zmm0
        vpsrlw          $8, %zmm1, %zmm1
        vpermi2b        %zmm1, %zmm0, %zmm3  # then destroy it
        vmovdqu8        %zmm3, (%rcx,%rax)   # extra uop according to Intel: bug 82459
        addq            $64, %rax
        cmpq            %rax, %rdi
        jne             .L9

Of course, the shifts are redundant when we have a full byte shuffle that
doesn't do any saturation:

        # different shuffle control in zmm1
.L9:
        vmovdqa64       (%rsi,%rax,2), %zmm0
        vpermt2b        64(%rsi,%rax,2), %zmm1, %zmm0
        vmovdqu64       %zmm0, (%rcx,%rax)
        addq            $64, %rax
        cmpq            %rax, %rdi
        jne             .L9

If unrolling, use pointer increments so the shuffle's memory operand can maybe
avoid un-lamination, although some multi-uop instructions don't micro-fuse in
the first place.

vpermt2w is 3 uops on Skylake-AVX512 (p0 + 2p5), so we should expect vpermt2b
to be at least that slow on the first CPUs that support it.  On a CPU where
vpermt2b is p0 + 2p5, this loop will run at about one store per 2 clocks, the
same as what you can achieve with 2x shift + vpackuswb + vpermq (bug 82459),
but with one fewer p0 uop.  With indexing from the end of the arrays to save
the CMP, this could also be 7 fused-domain uops for the front-end, assuming
the vpermt2b + load doesn't micro-fuse but the store does.
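
For illustration, here is roughly what the desired vectorization looks like
when written with intrinsics.  This is a sketch only, not part of the original
report: the function name pack_high8_vbmi is made up, it assumes bytes is a
non-zero multiple of 64, and it uses unaligned loads/stores for simplicity.
The shuffle control selects every odd-numbered byte (the high byte of each
16-bit element) from the 128-byte concatenation of the two source vectors,
which is why no vpsrlw shifts are needed:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Sketch: assumes bytes is a non-zero multiple of 64 (i.e. 32 uint16_t of
// input per 64 bytes of output) and uses unaligned loads/stores.
void pack_high8_vbmi(uint8_t *__restrict__ dst,
                     const uint16_t *__restrict__ src, size_t bytes)
{
    // Result byte i takes byte 2*i+1 of the concatenation of the two sources:
    // index values 1,3,...,63 pick the high bytes of the first vector,
    // 65,67,...,127 pick the high bytes of the second.
    uint8_t idx_bytes[64];
    for (int i = 0; i < 64; i++)
        idx_bytes[i] = (uint8_t)(2 * i + 1);
    __m512i idx = _mm512_loadu_si512(idx_bytes);

    uint8_t *end_dst = dst + bytes;
    do {
        __m512i lo = _mm512_loadu_si512(src);        // words 0..31
        __m512i hi = _mm512_loadu_si512(src + 32);   // words 32..63
        // One 2-source byte shuffle replaces 2x vpsrlw + shuffle + index copy.
        __m512i packed = _mm512_permutex2var_epi8(lo, idx, hi);
        _mm512_storeu_si512(dst, packed);
        src += 64;
        dst += 64;
    } while (dst < end_dst);
}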
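
Note that _mm512_permutex2var_epi8 doesn't say which input gets clobbered; the
compiler is free to emit either vpermi2b (overwriting the index) or vpermt2b
(overwriting a data source).  In a loop like this the index is loop-invariant
while the freshly loaded source is dead after the shuffle, so vpermt2b is
presumably the right pick, which is the register-allocation choice this report
is asking gcc to make.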