[Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459 --- Comment #4 from Peter Cordes ---

The VPAND instructions in the 256-bit version are a missed-optimization.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459 --- Comment #3 from Peter Cordes ---

I had another look at this with current trunk. Code-gen is similar to before with -march=skylake-avx512 -mprefer-vector-width=512. (If we improve code-gen for that choice, it will make 512-bit vectors a win in more cases.) https://godbolt.org/g/2dfkNV

Loads are folding into the shifts now, unlike with gcc 7.3. (But they can't micro-fuse because of the indexed addressing mode; a pointer increment might save 1 front-end uop even in the non-unrolled loop.) The separate integer loop counter is gone, replaced with a compare against an end-index.

But we're still doing 2x vpmovwb + vinserti64x4 instead of vpackuswb + vpermq, which would be fewer instructions and (more importantly) 1/3 the shuffle uops. GCC knows how to do this for the 256-bit version, so it's apparently a failure of the cost model that it doesn't for the 512-bit version. (Maybe requiring a shuffle-control vector instead of an immediate puts it off? Or maybe it's counting the cost of the useless vpand instructions for the pack / permq option, even though they're not part of the shuffle-throughput bottleneck?)

We do use vpackuswb + vpermq for 256-bit, but we have redundant AND instructions with set1_epi16(0x00FF) after a right shift already leaves the high byte zero.

---

Even if vmovdqu8 is not slower, it's larger than AVX vmovdqu. GCC should use the VEX encoding of an instruction whenever it does exactly the same thing. At least we didn't use the vpandd or vpandq EVEX instructions. (I haven't found any confirmation that vmovdqu8 costs an extra ALU uop as a store with no masking. Hopefully it's efficient.)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459 Andrew Senkevich changed: CC added: andrew.n.senkevich at gmail dot com --- Comment #2 from Andrew Senkevich ---

Currently -mprefer-avx256 is the default for SKX and the vzeroupper addition was fixed; the code generated is:

.L3:
        vpsrlw    $8, (%rsi,%rax,2), %ymm0
        vpsrlw    $8, 32(%rsi,%rax,2), %ymm1
        vpand     %ymm0, %ymm2, %ymm0
        vpand     %ymm1, %ymm2, %ymm1
        vpackuswb %ymm1, %ymm0, %ymm0
        vpermq    $216, %ymm0, %ymm0
        vmovdqu8  %ymm0, (%rdi,%rax)
        addq      $32, %rax
        cmpq      %rax, %rdx
        jne       .L3

vmovdqu8 remains, but I cannot confirm it is slower.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459 --- Comment #1 from Peter Cordes ---

BTW, if we *are* using vpmovwb, it supports a memory operand. It doesn't save any front-end uops on Skylake-avx512, just code-size. Unless it means less efficient packing in the uop cache (since all uops from one instruction have to go in the same line), it should be better to fold the stores than to use separate store instructions.

    vpmovwb %zmm0, (%rcx)
    vpmovwb %zmm1, 32(%rcx)

is 6 fused-domain uops (2 x 2 p5 shuffle uops, plus 2 micro-fused stores), according to IACA.

It's possible to coax gcc into emitting the memory-destination form with intrinsics, but only with a -1 mask:

    // https://godbolt.org/g/SBZX1W
    void vpmovwb(__m512i a, char *p) {
        _mm256_storeu_si256((__m256i *)p, _mm512_cvtepi16_epi8(a));
    }
        vpmovwb   %zmm0, %ymm0
        vmovdqu64 %ymm0, (%rdi)
        ret

    void vpmovwb_store(__m512i a, char *p) {
        _mm512_mask_cvtepi16_storeu_epi8(p, -1, a);
    }
        vpmovwb %zmm0, (%rdi)
        ret

clang is the same here: it doesn't use a memory destination unless you hand-hold it with a -1 mask. Also note the lack of vzeroupper here, and in the auto-vectorized function, even with an explicit -mvzeroupper.