Re: ARM Neon popcount

2013-02-28 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: What about vldm? Like vldm up!, {q0,q1,q2,q3}. As far as I understand the manual, it supports a larger number of registers. The registers must be consecutive, but that's no problem here. I added a long list of things to try.

ARM Neon popcount

2013-02-27 Thread Torbjorn Granlund
I decided to play a bit with Neon, but instead of doing something hard like addmul_k, I wrote an mpn_popcount. :-) The code runs well on A15 at about 0.56 c/l, but much worse on A9 at about 2.8 c/l. (The inner loop's hard whacking on q8 is a problem on A9; using q8 and q9 alternately shaves

Re: ARM Neon popcount

2013-02-27 Thread Richard Henderson
On 2013-02-27 13:27, Torbjorn Granlund wrote: Specific questions: * I completely ignore alignment. Is that bad? I'm not sure about that. It's something that perhaps we should experiment with. As written, the code will work, as the chip will handle totally unaligned data. What I don't

Re: ARM Neon popcount

2013-02-27 Thread Richard Henderson
On 2013-02-27 14:33, Torbjorn Granlund wrote: vld1.32 { q1, q2 }, [r0@128]! As specified in section A.3.2.1, if you specify the alignment it will also be checked, so you'll get SIGBUS if it's not right. I wanted to experiment, but I cannot find any syntax which is accepted

Re: ARM Neon popcount

2013-02-27 Thread Niels Möller
Richard Henderson r...@twiddle.net writes: On 2013-02-27 13:27, Torbjorn Granlund wrote: * Can one read four 128-bit values using just one insn (for inner loop)? No. We can only read 4 64-bit values. I didn't actually realize the assembler would accept Q registers in the list grammar