ARM Neon popcount

2013-02-27 Thread Torbjorn Granlund
I decided to play a bit with Neon, but instead of doing something hard like addmul_k, I wrote an mpn_popcount. :-) The code runs well on A15 at about 0.56 c/l, but much worse on A9 at about 2.8 c/l. (The inner loop's hard whacking on q8 is a problem on A9; using q8 and q9 alternately shaves
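For readers unfamiliar with the operation, here is a minimal C sketch of what a limb-vector popcount computes. It is not the assembly posted in the thread. The names `limb_t` and `popcount_limbs` are hypothetical, and the 32-bit limb width is an assumption matching 32-bit ARM. When the compiler targets Neon, the loop uses the `vcnt` byte-count intrinsic followed by pairwise widening adds; otherwise only the portable fallback runs.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical stand-in for GMP's limb type on 32-bit ARM (assumption). */
typedef uint32_t limb_t;

/* Portable bit count of one limb: clears the lowest set bit on each pass. */
static unsigned limb_popcount(limb_t x)
{
    unsigned c = 0;
    while (x) {
        x &= x - 1;
        c++;
    }
    return c;
}

/* Count the set bits in the n limbs starting at p. */
uint64_t popcount_limbs(const limb_t *p, size_t n)
{
    uint64_t total = 0;
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint64x2_t acc = vdupq_n_u64(0);
    for (; i + 4 <= n; i += 4) {
        /* Load 16 bytes (four 32-bit limbs); no alignment is assumed. */
        uint8x16_t bytes = vld1q_u8((const uint8_t *)(p + i));
        /* Per-byte bit counts, then widen 8 -> 16 -> 32 -> 64 bits. */
        uint8x16_t cnt = vcntq_u8(bytes);
        acc = vaddq_u64(acc, vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(cnt))));
    }
    total = vgetq_lane_u64(acc, 0) + vgetq_lane_u64(acc, 1);
#endif

    /* Remaining limbs (or all of them when Neon is unavailable). */
    for (; i < n; i++)
        total += limb_popcount(p[i]);
    return total;
}
```

The c/l figures quoted above are cycles per limb, so a sketch like this says nothing about the scheduling issues on A9 that the message goes on to discuss.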

Re: ARM Neon popcount

2013-02-27 Thread Richard Henderson
On 2013-02-27 13:27, Torbjorn Granlund wrote: Specific questions: * I completely ignore alignment. Is that bad? I'm not sure about that. It's something that perhaps we should experiment with. As written, the code will work, as the chip will handle totally unaligned data. What I don't

Re: ARM Neon popcount

2013-02-27 Thread Richard Henderson
On 2013-02-27 14:33, Torbjorn Granlund wrote: vld1.32 { q1, q2 }, [r0@128]! As specified in section A.3.2.1, if you specify the alignment it will also be checked, so you'll get SIGBUS if it's not right. I wanted to experiment, but I cannot find any syntax which is accepted
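Since an `@128` qualifier asks the hardware to check for 128-bit (16-byte) alignment, code that wants to use it needs to know whether the pointer qualifies, or how many leading words to handle separately first. A minimal C sketch of that check follows. The helper names `is_aligned` and `words_until_aligned16` are hypothetical and do not come from the thread or from GMP.

```c
#include <stddef.h>
#include <stdint.h>

/* Return nonzero if p is aligned to `align` bytes.
   `align` must be a power of two. */
int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Number of leading 32-bit words to process one at a time before p
   reaches 16-byte alignment. Assumes p is at least 4-byte aligned. */
size_t words_until_aligned16(const uint32_t *p)
{
    size_t bytes = (size_t)((16 - ((uintptr_t)p & 15)) & 15);
    return bytes / sizeof(uint32_t);
}
```

A loop could peel off `words_until_aligned16(p)` words with scalar code and only then switch to loads that carry an alignment qualifier. Whether that pays off is exactly the kind of thing the thread suggests measuring.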

GMP symbol naming (and the history thereof)?

2013-02-27 Thread Richard Henderson
Several times over the past week as I debug my neon routines, it has become painfully apparent (as I accidentally single-step into the dynamic linker) that the shared libgmp could use some help in modernizing its internal linkage. Most important is arranging for calls within GMP to go through
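The snippet above is cut off before it says which mechanism is proposed, so the following is only an illustration of one common technique and not a description of what GMP does. With GCC or Clang on ELF targets, marking an internal function with hidden visibility keeps it out of the shared object's dynamic symbol table, which lets calls from inside the library bind directly without involving the dynamic linker. The function names are hypothetical.

```c
/* Hypothetical internal helper. Hidden visibility (a GCC/Clang attribute)
   keeps this symbol out of the dynamic symbol table, so calls from inside
   the same shared object can be resolved at link time. */
__attribute__((visibility("hidden")))
unsigned long demo_add_internal(unsigned long a, unsigned long b)
{
    return a + b;
}

/* Hypothetical public entry point, exported as usual. */
__attribute__((visibility("default")))
unsigned long demo_sum3(unsigned long a, unsigned long b, unsigned long c)
{
    /* These internal calls do not need a PLT entry. */
    return demo_add_internal(demo_add_internal(a, b), c);
}
```

Single-stepping through `demo_sum3` in a debugger would then stay inside the library for the internal calls.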

Re: ARM Neon popcount

2013-02-27 Thread Niels Möller
Richard Henderson r...@twiddle.net writes: On 2013-02-27 13:27, Torbjorn Granlund wrote: * Can one read four 128-bit values using just one insn (for inner loop)? No. We can only read 4 64-bit values. I didn't actually realize the assembler would accept Q registers in the list grammar