Re: Popcount optimization for the slow-path lookups

2025-12-05 Thread Andrew Pogrebnoi
On Fri, Dec 5, 2025 at 5:40 PM Nathan Bossart wrote:
> I don't think the proposed improvements are relevant for either of the
> machines you used for your benchmarks. For x86, we've optimized our
> popcount code to use SSE4.2 or AVX-512, and for AArch64, we've optimized it
> to use Neon or SVE.

Re: Popcount optimization for the slow-path lookups

2025-12-05 Thread Nathan Bossart
On Fri, Dec 05, 2025 at 03:07:07PM +0200, Andrew Pogrebnoi wrote:
> I want to propose an optimization for pg_popcount32_slow() and
> pg_popcount64_slow() where lookups into pg_number_of_ones[] are made
> branchless. It shows speedup around 58% for uint64 and 35% for uint32 words
> compared to the c

Re: Popcount optimization for the slow-path lookups

2025-12-05 Thread Andrew Pogrebnoi
Hi David,
Thanks for looking at it!
> I would like to test if I can reproduce your results. Could you share
> your test program?
Here you go: https://gist.github.com/dAdAbird/1480ff15764f5a6301174806d8512a3a
> You also don't specify an optimization level. That means the default
> level -O0 is u

Re: Popcount optimization for the slow-path lookups

2025-12-05 Thread David Geier
Hi Andy!
On 05.12.2025 14:07, Andrew Pogrebnoi wrote:
> Hello hackers,
>
> I want to propose an optimization for pg_popcount32_slow() and
> pg_popcount64_slow() where lookups into pg_number_of_ones[] are made
> branchless. It shows speedup around 58% for uint64 and 35% for uint32 words
> compared

Popcount optimization for the slow-path lookups

2025-12-05 Thread Andrew Pogrebnoi
Hello hackers, I want to propose an optimization for pg_popcount32_slow() and pg_popcount64_slow() where lookups into pg_number_of_ones[] are made branchless. It shows speedup around 58% for uint64 and 35% for uint32 words compared to the current, looped version. This is on x86. It is much more si
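
For readers following the thread, here is a minimal sketch of the difference between a looped and a branchless table lookup. It is not the proposed patch, which is not shown in this snippet. The names number_of_ones, popcount32_looped, popcount32_unrolled and popcount64_unrolled are hypothetical stand-ins for pg_number_of_ones[] and the *_slow() functions, and the unrolled form is only one plausible reading of "branchless".

```c
#include <stdint.h>

/* Hypothetical stand-in for pg_number_of_ones[]: bit count of every byte value. */
#define B2(n) n, n + 1, n + 1, n + 2
#define B4(n) B2(n), B2(n + 1), B2(n + 1), B2(n + 2)
#define B6(n) B4(n), B4(n + 1), B4(n + 1), B4(n + 2)
static const uint8_t number_of_ones[256] = {B6(0), B6(1), B6(1), B6(2)};

/* Looped form: one lookup per byte, stopping once no set bits remain. */
static int
popcount32_looped(uint32_t word)
{
	int			result = 0;

	while (word != 0)
	{
		result += number_of_ones[word & 255];
		word >>= 8;
	}
	return result;
}

/* Assumed branchless form: always perform all four lookups, no loop test. */
static int
popcount32_unrolled(uint32_t word)
{
	return number_of_ones[word & 255] +
		number_of_ones[(word >> 8) & 255] +
		number_of_ones[(word >> 16) & 255] +
		number_of_ones[word >> 24];
}

/* 64-bit variant built from two 32-bit halves (also an assumption). */
static int
popcount64_unrolled(uint64_t word)
{
	return popcount32_unrolled((uint32_t) word) +
		popcount32_unrolled((uint32_t) (word >> 32));
}
```

The looped form can exit early for small values but pays a data-dependent branch on every iteration. The unrolled form does a fixed amount of work per word. Which one is faster depends on the compiler, the optimization level and the input distribution, as discussed in the replies.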