https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82460

            Bug ID: 82460
           Summary: AVX512: choose between vpermi2d and vpermt2d to save
                    mov instructions.  Also, fails to optimize away shifts
                    before shuffle
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization, ssemmx
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

#include <immintrin.h>

// gcc  -O3 -march=skylake-avx512 -mavx512vbmi    8.0.0 20171004
// https://godbolt.org/g/fVt4Kb

__m512i vpermi2d(__m512i t1, __m512i control, char *src) {
  return _mm512_permutex2var_epi32(control, t1, _mm512_loadu_si512(src));
}
        vpermt2d        (%rdi), %zmm0, %zmm1
        vmovdqa64       %zmm1, %zmm0
        ret

  clang emits  vpermi2d  (%rdi), %zmm1, %zmm0

__m512i vpermi2b(__m512i t1, __m512i a, __m512i b) {
  return _mm512_permutex2var_epi8(a, t1, b);
}
        vpermt2b        %zmm2, %zmm0, %zmm1
        vmovdqa64       %zmm1, %zmm0
        ret

  clang emits  vpermi2b  %zmm2, %zmm1, %zmm0


This one compiles ok, though:

__m512i vpermt2d(__m512i t1, __m512i control, char *src) {
  return _mm512_permutex2var_epi32(t1, control, _mm512_loadu_si512(src));
}
        vpermt2d        (%rdi), %zmm1, %zmm0
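
The intrinsic's result isn't tied to either input register, so either encoding
is legal.  A scalar model of the epi8 semantics shows why (my sketch, not taken
from gcc; permutex2var_epi8_model is a made-up name):

#include <stdint.h>

// dst[j] = (idx[j] bit 6) ? b[idx[j] & 63] : a[idx[j] & 63]
// vpermi2b overwrites the index register, vpermt2b overwrites the first data
// register; the compiler should pick whichever input is dead afterwards
// instead of always using vpermt2b and adding a vmovdqa64.
static void permutex2var_epi8_model(uint8_t dst[64], const uint8_t a[64],
                                    const uint8_t idx[64], const uint8_t b[64])
{
    for (int j = 0; j < 64; j++) {
        unsigned sel = idx[j] & 0x7F;   // low 7 bits index the 128-byte a:b
        dst[j] = (sel & 0x40) ? b[sel & 0x3F] : a[sel & 0x3F];
    }
}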


---


But when auto-vectorizing the loop below with AVX512VBMI (see bug 82459 for AVX512BW
missed optimizations), gcc uses vpermi2b where vpermt2b would be better:

#include <stdint.h>
#include <stddef.h>

void pack_high8_baseline(uint8_t *__restrict__ dst, const uint16_t *__restrict__ src,
                         size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  do{
     *dst++ = *src++ >> 8;
  } while(dst < end_dst);
}


.L9:
        vmovdqa64       (%rsi,%rax,2), %zmm0
        vmovdqa64       64(%rsi,%rax,2), %zmm1
        vmovdqa64       %zmm2, %zmm3             # copy the index
        vpsrlw  $8, %zmm0, %zmm0
        vpsrlw  $8, %zmm1, %zmm1
        vpermi2b        %zmm1, %zmm0, %zmm3      # then destroy it
        vmovdqu8        %zmm3, (%rcx,%rax)       # extra uop according to Intel: bug 82459
        addq    $64, %rax
        cmpq    %rax, %rdi
        jne     .L9

Of course, the shifts are redundant when we have a full byte shuffle that
doesn't do any saturation; it can grab the high bytes directly:

        # different shuffle control in zmm1
   .L9:
        vmovdqa64       (%rsi,%rax,2), %zmm0
        vpermt2b        64(%rsi,%rax,2), %zmm1, %zmm0
        vmovdqu64        %zmm0, (%rcx,%rax)
        addq    $64, %rax
        cmpq    %rax, %rdi
        jne     .L9
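
For reference, a hand-vectorized version along those lines (a sketch only:
pack_high8_vbmi is my name, the control-vector setup is just illustrative, and
it assumes bytes is a non-zero multiple of 64 like the asm loop above):

#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>

void pack_high8_vbmi(uint8_t *__restrict__ dst, const uint16_t *__restrict__ src,
                     size_t bytes) {
  // Result byte j is the high byte of the j-th uint16_t in the 128-byte
  // concatenation of two source vectors: byte index 2*j + 1.
  uint8_t idx[64];
  for (int i = 0; i < 64; i++)
    idx[i] = (uint8_t)(2*i + 1);
  __m512i control = _mm512_loadu_si512(idx);

  uint8_t *end_dst = dst + bytes;
  do {
    __m512i lo = _mm512_loadu_si512(src);        // first 32 uint16_t
    __m512i hi = _mm512_loadu_si512(src + 32);   // next 32 uint16_t
    _mm512_storeu_si512(dst, _mm512_permutex2var_epi8(lo, control, hi));
    dst += 64;
    src += 64;
  } while (dst < end_dst);
}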

If unrolling, use pointer increments instead of indexed addressing modes so the
shuffle's memory operand can maybe stay micro-fused (avoid un-lamination),
although some multi-uop instructions don't micro-fuse in the first place.

vpermt2w is 3 uops on Skylake-AVX512 (p0 + 2p5), so we should expect vpermt2b
to be at least that slow on the first CPUs that support it.  On a CPU where
vpermt2b is p0 + 2p5, this loop will run at about one store per 2 clocks, the
same as what you can achieve with 2x shift + vpackuswb + vpermq (bug 82459). 
But this has one fewer p0 uop.

With indexing from the end of the arrays to save the CMP, this could also be as
low as 7 fused-domain uops for the front-end (assuming the vpermt2b + load
doesn't micro-fuse, but the store does).
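
Something like this shows the indexing-from-the-end idea in C (a sketch;
pack_high8_from_end is my name, and this only illustrates the addressing shape:
the negative offset counts up toward zero, so the loop branch can reuse the
flags from the offset-increment ADD and the separate CMP goes away):

#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>

void pack_high8_from_end(uint8_t *__restrict__ dst, const uint16_t *__restrict__ src,
                         size_t bytes) {
  uint8_t idx[64];
  for (int i = 0; i < 64; i++)
    idx[i] = (uint8_t)(2*i + 1);
  __m512i control = _mm512_loadu_si512(idx);

  const uint8_t *src_end = (const uint8_t *)src + 2*bytes;
  uint8_t *dst_end = dst + bytes;
  ptrdiff_t off = -(ptrdiff_t)bytes;             // counts from -bytes up to 0
  do {
    __m512i lo = _mm512_loadu_si512(src_end + 2*off);
    __m512i hi = _mm512_loadu_si512(src_end + 2*off + 64);
    _mm512_storeu_si512(dst_end + off, _mm512_permutex2var_epi8(lo, control, hi));
    off += 64;                                   // ADD sets ZF, so no CMP needed
  } while (off != 0);
}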
