On 29/04/2025 11:04 pm, Linus Torvalds wrote:
> On Tue, 29 Apr 2025 at 14:59, Andrew Cooper <andrew.coop...@citrix.com> wrote:
>> do_variable_ffs() doesn't quite work.
>>
>> REP BSF is LZCNT, and unconditionally writes it's output operand, and
>> defeats the attempt to preload with -1.
>>
>> Drop the REP prefix, and it should work as intended.
> Bah. That's what I get for just doing it blindly without actually
> looking at the kernel source. I just copied the __ffs() thing - and
> there the 'rep' is not for the zero case - which we don't care about -
> but because lzcnt performs better on newer CPUs.

Oh, I didn't realise there was a perf difference too, but Agner Fog
agrees.

Apparently in Zen4, BSF and friends have become a single uop with a
sensible latency. Previously they were 6-8 uops with a latency to
match. Intel appear to have had them as a single uop since SandyBridge,
so quite a long time now.

~Andrew
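
For reference, the suggestion quoted above (preload the output with -1
and use plain BSF without the REP prefix) can be sketched roughly as
follows. This is only an illustration, assuming GCC or Clang inline asm
on x86; sketch_ffs() is a hypothetical name, not the actual
do_variable_ffs() from the thread. It relies on BSF leaving its
destination unmodified when the source is zero, which AMD documents and
which Intel CPUs are understood to do in practice. On other
architectures the sketch falls back to the compiler builtin.

```c
/*
 * Hypothetical sketch, not the kernel's code: 1-based "find first set"
 * that returns 0 when no bit is set.
 */
static inline int sketch_ffs(unsigned int x)
{
#if defined(__x86_64__) || defined(__i386__)
    int r = -1;  /* preloaded result, kept if x == 0 */

    /*
     * Plain BSF, no REP prefix. Assumption: with a zero source, BSF
     * sets ZF and leaves the destination register as it was, so r
     * stays -1. An F3 (REP) prefix would make this TZCNT on CPUs with
     * BMI1, which always writes its destination and would clobber the
     * preloaded -1.
     */
    __asm__("bsfl %1, %0" : "+r"(r) : "rm"(x) : "cc");

    return r + 1;  /* -1 -> 0 for "no bit set", else bit index + 1 */
#else
    /* Fallback for non-x86 builds: the builtin has the same contract. */
    return __builtin_ffs((int)x);
#endif
}
```

With that shape, sketch_ffs(0) yields 0 and sketch_ffs(8) yields 4,
matching the usual ffs() convention.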