On Tue, 29 Apr 2025 at 15:22, Andrew Cooper <andrew.coop...@citrix.com> wrote:
>
> Oh, I didn't realise there was also a perf difference too, but Agner Fog
> agrees.

The perf difference is exactly because of the issue where the non-rep
one acts as a cmov, and has basically two inputs (the bits to test in
the source, and the old value of the result register).

I guess it's not "fundamental", but lzcnt is basically a bit simpler
for hardware to implement, and the non-rep legacy bsf instruction
basically has a dependency on the previous value of the result
register.

So even when it's a single uop for both cases, that single uop can be
slower for the bsf because of the (typically false) dependency and
extra pressure on the rename registers.

             Linus
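
To make the "two inputs" point concrete, here is a rough C model of the two behaviours. It is only a sketch of the semantics, not of how any CPU implements them. The names model_bsf and model_tzcnt are made up for illustration, and the zero-source behaviour of bsf is modelled as "destination unchanged", which is what the message above describes (the architecture manuals differ on whether that is guaranteed, so treat it as an assumption).

```c
#include <stdint.h>

/*
 * Hypothetical model of legacy (non-rep) BSF on a 32-bit operand.
 * Assumption: a zero source leaves the destination untouched, so the
 * result is a function of BOTH the source and the old destination.
 * That second input is the dependency discussed above.
 */
static uint32_t model_bsf(uint32_t src, uint32_t old_dst)
{
	uint32_t idx = 0;

	if (src == 0)
		return old_dst;		/* acts like a cmov: keep old value */

	while (!(src & 1)) {		/* find lowest set bit */
		src >>= 1;
		idx++;
	}
	return idx;
}

/*
 * Hypothetical model of REP BSF, i.e. TZCNT, on a 32-bit operand.
 * The result depends on the source alone: a zero source yields the
 * operand size, so the old destination is never needed.
 */
static uint32_t model_tzcnt(uint32_t src)
{
	uint32_t idx = 0;

	if (src == 0)
		return 32;

	while (!(src & 1)) {
		src >>= 1;
		idx++;
	}
	return idx;
}
```

For any non-zero source the two models return the same index. They only differ for a zero source, where model_bsf has to hand back whatever was in the destination before, and that is why the old register value has to be read at all.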