On 29/04/2025 7:05 pm, Linus Torvalds wrote:
> On Tue, 29 Apr 2025 at 07:38, Andrew Cooper <andrew.coop...@citrix.com> wrote:
>> I tried that.  (The thread started as a question around
>> __builtin_constant_p() but did grow to cover __builtin_ffs().)
> Maybe we could do something like
>
>     #define ffs(x) \
>         (statically_true((x) != 0) ? __ffs(x)+1 : __builtin_ffs(x))
>
> which uses our "statically_true()" helper that is actually fairly good
> at the whole "let the compiler tell us that it knows that value cannot
> be zero"
>
> I didn't check what code that generated, but I've seen gcc do well on
> that statically_true() thing in the past.
>
> Then we can just remove our current variable_ffs() thing entirely,
> because we now depend on our (good) __ffs() and the builtin being
> "good enough" for the bad case.
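(For reference, statically_true() is, as best I remember it, just:

    /* Roughly the Linux helper being referenced; the real definition
     * lives in include/linux/compiler.h. */
    #define statically_true(x) (__builtin_constant_p(x) && (x))

i.e. it is true only when the optimiser can prove the condition at
compile time, and falls back to false otherwise.)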
That would improve code generation for 32bit, but generally regress
64bit.  Preloading the destination register with -1 is better than the
CMOV form emitted by the builtin; BSF's habit of conditionally not
writing the destination register *is* a CMOV of sorts.

When I cleaned this up in Xen, there were several areas where I
thought improvements could be made.

Having both ffs() and __ffs(), where the latter is undefined in a
common case, is a trap waiting for an unwary programmer.  I have no
particular love for ffs() being off-by-one from normal, but it is well
defined for all inputs.  Also, leaving the constant folding to the
arch-optimised form means that it often gets forgotten.

Therefore, I rearranged everything to have this be common:

    static always_inline attr_const unsigned int ffs(unsigned int x)
    {
        if ( __builtin_constant_p(x) )
            return __builtin_ffs(x);

    #ifdef arch_ffs
        return arch_ffs(x);
    #else
        return generic_ffsl(x);
    #endif
    }

with most architectures implementing arch_ffs as:

    #define arch_ffs(x) ((x) ? 1 + __builtin_ctz(x) : 0)

and x86 as:

    static always_inline unsigned int arch_ffs(unsigned int x)
    {
        unsigned int r;

        if ( __builtin_constant_p(x > 0) && x > 0 )
        {
            /*
             * A common code pattern is:
             *
             *     while ( bits )
             *     {
             *         bit = ffs(bits);
             *         ...
             *
             * and the optimiser really can work with the knowledge of x
             * being non-zero without knowing its exact value, in which
             * case we don't need to compensate for BSF's corner cases.
             * Otherwise...
             */
            asm ( "bsf %[val], %[res]"
                  : [res] "=r" (r)
                  : [val] "rm" (x) );
        }
        else
        {
            /*
             * ... the AMD manual states that BSF won't modify the
             * destination register if x=0.  The Intel manual states that
             * the result is undefined, but the architects have said that
             * the register is written back with its old value (zero
             * extended as normal).
             */
            asm ( "bsf %[val], %[res]"
                  : [res] "=r" (r)
                  : [val] "rm" (x), "[res]" (-1) );
        }

        return r + 1;
    }
    #define arch_ffs arch_ffs

and finally, providing compatibility for the other forms as:

    #define __ffs(x) (ffs(x) - 1)

The end result is fewer APIs to implement in arch-specific code, and
the removal of undefined behaviour.

That said, I don't envy anyone wanting to try and untangle this in
Linux, even if consensus were to agree on it as an approach.

~Andrew
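P.S. To illustrate the CMOV point: a minimal comparison, assuming GCC
at -O2 on x86-64 (the function name is mine, and the exact instruction
selection varies by compiler version):

    unsigned int ffs_builtin(unsigned int x)
    {
        /* Builtin form: the compiler has to patch up x == 0 itself. */
        return __builtin_ffs(x);
    }

which tends to come out as something along the lines of:

    bsf   %edi, %eax    # %eax is not written if %edi == 0
    mov   $-1, %edx
    cmove %edx, %eax    # CMOV to fix up the x == 0 case
    add   $1, %eax

whereas the preloaded form is just:

    mov   $-1, %eax     # preload the x == 0 answer
    bsf   %edi, %eax    # leaves the -1 intact if %edi == 0
    add   $1, %eax

one instruction and one scratch register shorter, with no
flags-dependent move.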
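P.P.S. A sketch of the loop pattern from the comment in arch_ffs(),
using a hypothetical caller, to show where the non-compensating BSF is
intended to be selected:

    /* Hypothetical example: sum the (1-based) positions of the set bits. */
    unsigned int sum_of_bit_positions(unsigned int bits)
    {
        unsigned int total = 0;

        while ( bits )
        {
            total += ffs(bits);  /* bits is provably non-zero here */
            bits &= bits - 1;    /* clear the lowest set bit */
        }

        return total;
    }

After inlining, the optimiser knows bits != 0 inside the loop body, so
__builtin_constant_p(x > 0) folds to true and the first asm form is
used, with no compensation needed for BSF's x == 0 behaviour.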