On 29/04/2025 11:04 pm, Linus Torvalds wrote:
> On Tue, 29 Apr 2025 at 14:59, Andrew Cooper <andrew.coop...@citrix.com> wrote:
>> do_variable_ffs() doesn't quite work.
>>
>> REP BSF is TZCNT, and unconditionally writes its output operand, and
>> defeats the attempt to preload with -1.
>>
>> Drop the REP prefix, and it should work as intended.
> Bah. That's what I get for just doing it blindly without actually
> looking at the kernel source. I just copied the __ffs() thing - and
> there the 'rep' is not for the zero case - which we don't care about -
> but because tzcnt performs better on newer CPUs.

Oh, I didn't realise there was a perf difference too, but Agner Fog
agrees.

Apparently in Zen4, BSF and friends have become a single uop with a
sensible latency.  Previously they were 6-8 uops with a latency to match.

Intel appear to have had them as a single uop since SandyBridge, so
quite a long time now.
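
For reference, the preload trick from earlier in the thread can be sketched
roughly as below.  This is only an illustration, not the actual
do_variable_ffs(): the name ffs_bsf() is made up, and it assumes GCC/Clang
inline asm on x86, with a builtin fallback elsewhere.  It also relies on
plain BSF leaving the destination untouched for a zero source, which AMD
documents and Intel CPUs do in practice.

```c
/*
 * Hypothetical helper for illustration only (not kernel code).
 * Returns the 1-based index of the lowest set bit, or 0 if x == 0.
 */
static inline int ffs_bsf(unsigned int x)
{
#if defined(__x86_64__) || defined(__i386__)
    int r = -1;     /* preloaded result for the x == 0 case */

    /*
     * Plain BSF, deliberately without a REP prefix.  For x == 0 it sets
     * ZF and is assumed to leave the destination alone, so r stays -1.
     * With the prefix this would decode as TZCNT on CPUs that support it,
     * which always writes the destination (32 for a zero input).
     * "+r" marks r as both read and written, so the preload is kept.
     */
    __asm__("bsfl %1, %0" : "+r" (r) : "rm" (x) : "cc");

    return r + 1;
#else
    /* Non-x86 fallback: __builtin_ctz() is undefined for 0, so guard it. */
    return x ? __builtin_ctz(x) + 1 : 0;
#endif
}
```

With the preload intact, ffs_bsf(0) gives 0 and ffs_bsf(8) gives 4.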

~Andrew
