[Bug target/94863] Failure to use blendps over mov when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94863 --- Comment #4 from Andrew Pinski --- (In reply to Gabriel Ravier from comment #1) > Note: The given outputs for LLVM and GCC are when compiling with `-O3 > -msse4.1` I think you have the opposite meaning with respect to `-msse4.1` here: the two compilers produce the same output at plain -O3, but LLVM produces blendps with `-msse4.1`.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94863 Andrew Pinski changed: What|Removed |Added Severity|normal |enhancement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94863 --- Comment #3 from Gabriel Ravier ---
For binary size, the `movsd` takes 4 bytes and the `blendps` takes 6 bytes.

The port allocations for the instructions are as follows (same formatting as for the throughputs):
Wolfdale: p5, p015
Nehalem: p5, p5
Westmere: p5, p5
Sandy Bridge: p05, p5
Ivy Bridge: p05, p5
Haswell: p015, p5
Broadwell: p015, p5
Skylake: p015, p5
Skylake-X: p015, p5
Kaby Lake: p015, p5
Coffee Lake: p015, p5
Cannon Lake: p015, p015
Ice Lake: p015, p015
Zen+: fp01, fp0123
Zen 2: fp013, fp0123

Something like "p015" means that the instruction can be executed on port 0, 1 or 5. All of these architectures execute both instructions as a single uop, and the latency of both `blendps` and `movsd` is 1 on every architecture I could test.

Final note: the numbers are specifically for the `blendps xmm, xmm, imm8` and `movsd xmm, xmm` forms of those instructions.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94863 --- Comment #2 from Richard Biener --- Throughputs aside, how do the port allocation and latency figures look? That said, GCC usually sides with the smaller insn encoding when latency isn't different; we're usually not looking at throughput, since throughput figures are quite useless when looking at single insns (and our decision is per individual instruction). Somehow modeling the alternatives during scheduling might make more sense, but that's not implemented.