On Fri, Dec 5, 2025 at 5:40 PM Nathan Bossart wrote:
> I don't think the proposed improvements are relevant for either of the
> machines you used for your benchmarks. For x86, we've optimized our
> popcount code to use SSE4.2 or AVX-512, and for AArch64, we've optimized it
> to use Neon or SVE.
On Fri, Dec 05, 2025 at 03:07:07PM +0200, Andrew Pogrebnoi wrote:
> I want to propose an optimization for pg_popcount32_slow() and
> pg_popcount64_slow() where lookups into pg_number_of_ones[] are made
> branchless. It shows speedup around 58% for uint64 and 35% for uint32 words
> compared to the c
Hi David,
Thanks for looking at it!
> I would like to test if I can reproduce your results. Could you share
> your test program?
Here you go:
https://gist.github.com/dAdAbird/1480ff15764f5a6301174806d8512a3a
> You also don't specify an optimization level. That means the default
> level -O0 is u
Hi Andy!
On 05.12.2025 14:07, Andrew Pogrebnoi wrote:
> Hello hackers,
>
> I want to propose an optimization for pg_popcount32_slow() and
> pg_popcount64_slow() where lookups into pg_number_of_ones[] are made
> branchless. It shows speedup around 58% for uint64 and 35% for uint32 words
> compared
Hello hackers,
I want to propose an optimization for pg_popcount32_slow() and
pg_popcount64_slow() where lookups into pg_number_of_ones[] are made
branchless. It shows speedup around 58% for uint64 and 35% for uint32 words
compared to the current, looped version. This is on x86. It is much more
si
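
For readers following the thread, here is a minimal sketch of the idea under discussion. It is not the proposed patch. It assumes the looped variant consumes one byte of the word per iteration through a 256-entry table, and that the branchless variant does a fixed number of lookups instead. The names ones_table, init_ones_table, popcount32_looped, popcount32_unrolled and popcount64_unrolled are hypothetical stand-ins, not PostgreSQL identifiers.

```c
#include <stdint.h>

/* Hypothetical stand-in for pg_number_of_ones[]: set bits per byte value. */
static uint8_t ones_table[256];

/* Fill the table once; entry 0 is already zero (static storage). */
static void
init_ones_table(void)
{
    for (int i = 1; i < 256; i++)
        ones_table[i] = (uint8_t) ((i & 1) + ones_table[i >> 1]);
}

/* Looped form: the exit test depends on the data, so it is a branch. */
static int
popcount32_looped(uint32_t word)
{
    int result = 0;

    while (word != 0)
    {
        result += ones_table[word & 255];
        word >>= 8;
    }
    return result;
}

/* Unrolled form: always four lookups, no data-dependent branch. */
static int
popcount32_unrolled(uint32_t word)
{
    return ones_table[word & 255] +
           ones_table[(word >> 8) & 255] +
           ones_table[(word >> 16) & 255] +
           ones_table[word >> 24];
}

/* 64-bit form built from the two 32-bit halves (eight lookups in total). */
static int
popcount64_unrolled(uint64_t word)
{
    return popcount32_unrolled((uint32_t) word) +
           popcount32_unrolled((uint32_t) (word >> 32));
}
```

init_ones_table() must be called once before any of the popcount functions. The unrolled form does the same amount of work for every input, while the looped form can exit early on small values, so which one is faster likely depends on the input distribution and on the compiler's optimization level.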