[Bug target/94863] Failure to use blendps over mov when possible

2024-04-13 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94863

--- Comment #4 from Andrew Pinski  ---
(In reply to Gabriel Ravier from comment #1)
> Note: The given outputs for LLVM and GCC are when compiling with `-O3
> -msse4.1`

I think you have the opposite meaning with respect to `-msse4.1` here. They are
the same at -O3, but LLVM produces blendps with `-msse4.1`.
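
For reference, a minimal sketch of the operation under discussion, written with
GCC/Clang vector extensions. This is not necessarily the test case from the
report; the function names are made up for illustration, and the blend mask of
0x3 is my own reading of which `blendps` immediate matches `movsd xmm, xmm`.

```c
/* Vector types via the GCC/Clang vector_size extension (not ISO C). */
typedef double v2df __attribute__((vector_size(16)));
typedef float  v4sf __attribute__((vector_size(16)));

/* Semantics of `movsd xmm, xmm`: low double taken from b, high double kept
   from a. A compiler might emit either movsd or blendps for this. */
static v2df merge_low_double(v2df a, v2df b)
{
    v2df r = a;
    r[0] = b[0];
    return r;
}

/* Semantics of `blendps xmm, xmm, 0x3` (assumed mask): float lanes 0 and 1
   taken from b, lanes 2 and 3 kept from a. Those two float lanes cover the
   same 64 bits as the low double above. */
static v4sf blend_low_two_floats(v4sf a, v4sf b)
{
    v4sf r = a;
    r[0] = b[0];
    r[1] = b[1];
    return r;
}
```

Compiling `merge_low_double` with and without `-msse4.1` and comparing the
assembly is one way to see the difference between the two compilers.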

[Bug target/94863] Failure to use blendps over mov when possible

2021-04-25 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94863

Andrew Pinski changed:

   What    |Removed |Added
   ----------------------------
   Severity|normal  |enhancement

[Bug target/94863] Failure to use blendps over mov when possible

2020-04-30 Thread gabravier at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94863

--- Comment #3 from Gabriel Ravier  ---
For binary size, the `movsd` takes 4 bytes and the `blendps` takes 6 bytes.
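
As a sketch of where those sizes come from, here are the encodings written out
as byte arrays. The choice of xmm0/xmm1 and the immediate 3 are assumptions for
illustration only; using xmm8 or above would add a REX prefix byte to either
instruction.

```c
#include <stddef.h>

/* movsd xmm0, xmm1: F2 0F 10 /r (ModRM 0xC1 selects xmm0, xmm1) */
static const unsigned char movsd_xmm0_xmm1[] = { 0xF2, 0x0F, 0x10, 0xC1 };

/* blendps xmm0, xmm1, 3: 66 0F 3A 0C /r ib */
static const unsigned char blendps_xmm0_xmm1_3[] = {
    0x66, 0x0F, 0x3A, 0x0C, 0xC1, 0x03
};

/* Encoded length in bytes of each form. */
static size_t movsd_size(void)   { return sizeof movsd_xmm0_xmm1; }
static size_t blendps_size(void) { return sizeof blendps_xmm0_xmm1_3; }
```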

The port allocations for the instructions are as follows (same formatting as
for the throughputs):

Wolfdale: p5, p015
Nehalem: p5, p5
Westmere: p5, p5
Sandy Bridge: p05, p5
Ivy Bridge: p05, p5
Haswell: p015, p5
Broadwell: p015, p5
Skylake: p015, p5
Skylake-X: p015, p5
Kaby Lake: p015, p5
Coffee Lake: p015, p5
Cannon Lake: p015, p015
Ice Lake: p015, p015
Zen+: fp01, fp0123
Zen 2: fp013, fp0123

A notation such as "p015" means that the instruction can be executed on port 0,
1 or 5. Also, on all architectures both instructions take a single uop.
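
To make the notation concrete, a small sketch that turns a port string into a
bitmask. `port_mask` and `port_count` are hypothetical helpers, not part of any
existing tool; the sketch assumes each digit names one port and that any
leading letters ("p", "fp") are just a prefix.

```c
#include <ctype.h>

/* Hypothetical helper: "p015" -> bit N set for every port N listed.
   Non-digit characters such as the "p" or "fp" prefix are skipped. */
static unsigned port_mask(const char *s)
{
    unsigned mask = 0;
    for (; *s; ++s)
        if (isdigit((unsigned char)*s))
            mask |= 1u << (*s - '0');
    return mask;
}

/* Number of ports the instruction can be issued on. */
static int port_count(const char *s)
{
    int n = 0;
    for (unsigned m = port_mask(s); m; m >>= 1)
        n += (int)(m & 1u);
    return n;
}
```

With this, "p015" gives three candidate ports and "p5" only one, which is the
difference the table above is showing.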

The latency of both `blendps` and `movsd` is 1 cycle on every architecture I
could test.

Final note: the numbers are specifically for the `blendps xmm, xmm, imm8` and
the `movsd xmm, xmm` forms of those instructions.

[Bug target/94863] Failure to use blendps over mov when possible

2020-04-30 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94863

--- Comment #2 from Richard Biener  ---
Throughput aside, what are the port allocation and latency figures?  That said,
GCC usually sides with the smaller insn encoding variant when latency isn't
different. We usually don't look at throughput, since throughput figures
are quite useless if you look at single insns (and our decision is per
individual instruction).  Somehow modeling alternatives during scheduling
might make more sense, but that's not implemented.
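
The per-instruction rule described above could be sketched roughly as follows.
This is only an illustration of the stated preference (lower latency first,
then smaller encoding), not GCC's actual code; `struct insn_alt` and `prefer`
are hypothetical names.

```c
/* Hypothetical description of one instruction alternative. */
struct insn_alt {
    const char *name;
    int latency;     /* cycles */
    int size_bytes;  /* encoded length */
};

/* Pick the lower-latency alternative; when latency is equal, pick the
   smaller encoding. Throughput and port pressure are deliberately ignored,
   mirroring the per-instruction decision described above. */
static const struct insn_alt *prefer(const struct insn_alt *a,
                                     const struct insn_alt *b)
{
    if (a->latency != b->latency)
        return a->latency < b->latency ? a : b;
    return a->size_bytes <= b->size_bytes ? a : b;
}
```

With the figures from comment #3 (latency 1 for both, 4 versus 6 bytes), this
rule picks `movsd`.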