[Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using

2018-08-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459

--- Comment #4 from Peter Cordes  ---
The VPAND instructions in the 256-bit version are a missed optimization.

[Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using

2018-08-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459

--- Comment #3 from Peter Cordes  ---
I had another look at this with current trunk.  Code-gen is similar to before
with -march=skylake-avx512 -mprefer-vector-width=512.  (If we improve code-gen
for that choice, it will make it a win in more cases.)

https://godbolt.org/g/2dfkNV

Loads are folding into the shifts now, unlike with gcc7.3.  (But they can't
micro-fuse because of the indexed addressing mode.  A pointer increment might
save 1 front-end uop even in the non-unrolled loop.)

The separate integer loop counter is gone, replaced with a compare against an
end-index.

But we're still doing 2x vpmovwb + vinserti64x4 instead of vpackuswb + vpermq,
which would be fewer instructions and (more importantly) 1/3 the shuffle uops.
GCC knows how to do this for the 256-bit version, so it's apparently a failure
of the cost model that it doesn't for the 512-bit version.  (Maybe requiring a
shuffle-control vector instead of an immediate puts it off?  Or maybe it's
counting the cost of the useless vpand instructions against the pack / permq
option, even though they're not part of the shuffle-throughput bottleneck?)
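
For reference, a minimal intrinsics sketch of the 512-bit pack + permute
strategy (my own illustration, not code from the bug; the function name and
memory layout are assumptions).  After the logical right shift each word is at
most 0x00FF, so vpackuswb's unsigned saturation is a no-op, and one
lane-crossing vpermq with a control vector fixes up the in-lane interleaving:

#include <immintrin.h>
#include <stdint.h>

/* Hypothetical example: take the high byte of 64 uint16_t elements with
   vpsrlw + vpackuswb + vpermq (2 shuffle uops per 64 output bytes) instead of
   2x vpmovwb + vinserti64x4.  Needs -mavx512bw. */
void pack_high8_zmm(uint8_t *dst, const uint16_t *src)
{
  __m512i a = _mm512_loadu_si512((const void *)src);        // elements 0..31
  __m512i b = _mm512_loadu_si512((const void *)(src + 32)); // elements 32..63
  a = _mm512_srli_epi16(a, 8);                    // vpsrlw: words become 0x00hh
  b = _mm512_srli_epi16(b, 8);
  __m512i packed = _mm512_packus_epi16(a, b);     // vpackuswb: packs within 128-bit lanes
  const __m512i fixup = _mm512_setr_epi64(0, 2, 4, 6, 1, 3, 5, 7);
  packed = _mm512_permutexvar_epi64(fixup, packed); // vpermq zmm: needs a control vector
  _mm512_storeu_si512((void *)dst, packed);
}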



We do use vpackuswb + vpermq for the 256-bit version, but we emit redundant AND
instructions with set1_epi16(0x00FF) even though the right shift already leaves
the high byte of each word zero.
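
The 256-bit sequence without the redundant AND would be just shift / pack /
permq, as in this sketch (again my own illustration; the name is invented).
The logical right shift by 8 already zero-extends each byte into its word, so
the set1_epi16(0x00FF) mask adds nothing:

#include <immintrin.h>
#include <stdint.h>

/* Hypothetical 256-bit version: the vpand with set1_epi16(0x00FF) is a no-op
   because vpsrlw $8 already leaves the high byte of each word zero. */
void pack_high8_ymm(uint8_t *dst, const uint16_t *src)
{
  __m256i a = _mm256_srli_epi16(_mm256_loadu_si256((const __m256i *)src), 8);
  __m256i b = _mm256_srli_epi16(_mm256_loadu_si256((const __m256i *)(src + 16)), 8);
  __m256i packed = _mm256_packus_epi16(a, b);        // vpackuswb
  packed = _mm256_permute4x64_epi64(packed, 0xD8);   // vpermq $216: fix lane order
  _mm256_storeu_si256((__m256i *)dst, packed);
}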

---

Even if vmovdqu8 is not slower, it's larger than AVX vmovdqu.  GCC should be
using the VEX encoding of an instruction whenever it does exactly the same
thing.  At least we didn't use vpandd or vpandq EVEX instructions.

(I haven't found any confirmation about vmovdqu8 costing an extra ALU uop as a
store with no masking.  Hopefully it's efficient.)

[Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using

2017-11-23 Thread andrew.n.senkevich at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459

Andrew Senkevich  changed:

   What|Removed |Added

 CC||andrew.n.senkevich at gmail dot com

--- Comment #2 from Andrew Senkevich  ---
Currently -mprefer-avx256 is the default for SKX and the vzeroupper addition
was fixed; the generated code is:

.L3:
vpsrlw  $8, (%rsi,%rax,2), %ymm0
vpsrlw  $8, 32(%rsi,%rax,2), %ymm1
vpand   %ymm0, %ymm2, %ymm0
vpand   %ymm1, %ymm2, %ymm1
vpackuswb   %ymm1, %ymm0, %ymm0
vpermq  $216, %ymm0, %ymm0
vmovdqu8    %ymm0, (%rdi,%rax)
addq    $32, %rax
cmpq    %rax, %rdx
jne .L3

vmovdqu8 remains, but I cannot confirm that it is slower.
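
For context, a scalar loop of roughly this shape (a guess at the reduced test
case; the function name and signature are mine) is what auto-vectorizes into
the code above:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical reduction of the test case: store the high byte of each
   uint16_t.  With -O3 -march=skylake-avx512 (256-bit vectors preferred) the
   vectorizer produces a vpsrlw / vpand / vpackuswb / vpermq / vmovdqu8 loop
   like the one shown above. */
void pack_high8(uint8_t *restrict dst, const uint16_t *restrict src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = src[i] >> 8;
}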

[Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using

2017-10-06 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459

--- Comment #1 from Peter Cordes  ---
BTW, if we *are* using vpmovwb, it supports a memory operand.  It doesn't save
any front-end uops on Skylake-avx512, just code size.  Unless it means less
efficient packing in the uop cache (since all uops from one instruction have to
go in the same line), it should be better to fold the stores into vpmovwb than
to use separate store instructions.

vpmovwb %zmm0,(%rcx)
vpmovwb %zmm1, 32(%rcx)

is 6 fused-domain uops (2 * 2 p5 shuffle uops, 2 micro-fused stores), according
to IACA.

It's possible to coax gcc into emitting it with intrinsics, but only with a -1
mask:

// https://godbolt.org/g/SBZX1W
#include <immintrin.h>

void vpmovwb(__m512i a, char *p) {
  // truncate in a register, then store: gcc won't use a memory-destination vpmovwb here
  _mm256_storeu_si256((__m256i *)p, _mm512_cvtepi16_epi8(a));
}
vpmovwb %zmm0, %ymm0
vmovdqu64   %ymm0, (%rdi)
ret

void vpmovwb_store(__m512i a, char *p) {
  // the -1 (all-ones) mask is what coaxes gcc into using the memory destination
  _mm512_mask_cvtepi16_storeu_epi8(p, -1, a);
}
vpmovwb %zmm0, (%rdi)
ret

clang is the same here, not using a memory destination unless you hand-hold it
with a -1 mask.


Also note the lack of vzeroupper here, and in the auto-vectorized function,
even with an explicit -mvzeroupper.