[Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using

2018-08-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459
--- Comment #4 from Peter Cordes ---
The VPAND instructions in the 256-bit version are a missed optimization. I had another look at this with current trunk. Code-gen is similar to before with -march=skylake-avx512 -mprefer-vector-width=512.
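The test case behind these comments is not included in this listing. As a rough illustration only, the sketch below shows the kind of loop the bug title concerns: each 16-bit element is narrowed to its low byte, which is the per-lane operation vpmovwb performs. The function name `narrow_u16_to_u8` is hypothetical, and which instructions a given GCC version actually emits for it is not claimed here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example, not taken from the bug report.
 * Narrows each 16-bit element to its low 8 bits (truncation, no saturation),
 * the same per-element result that vpmovwb produces.
 *
 * To inspect the code-gen with the options named in the comment above,
 * something like this could be used (file name is an assumption):
 *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 -S narrow.c
 */
void narrow_u16_to_u8(uint8_t *restrict dst, const uint16_t *restrict src,
                      size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)src[i];  /* keep the low byte, drop the high byte */
}
```

Comparing the assembly with and without -mprefer-vector-width=512 is one way to see whether the vectorizer picks vpmovwb or a shuffle/pack sequence for the narrowing step.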

2018-08-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459
--- Comment #3 from Peter Cordes ---
I had another look at this with current trunk. Code-gen is similar to before with -march=skylake-avx512 -mprefer-vector-width=512. (If we improve code-gen for that choice, it will make it a win in more

2017-11-23 Thread andrew.n.senkevich at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459
Andrew Senkevich changed:
What | Removed | Added
CC | | andrew.n.senkevich at gmail dot co

2017-10-06 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459
--- Comment #1 from Peter Cordes ---
BTW, if we *are* using vpmovwb, it supports a memory operand. It doesn't save any front-end uops on Skylake-avx512, just code-size. Unless it means less efficient packing in the uop cache (since all uops