Re: Neon addmul_8

2013-02-26 Thread Niels Möller
Richard Henderson r...@twiddle.net writes: It might be worth a try having two copies of this main loop, one in which there are more than 8 limbs remaining, I tried that, and I ended up with something *very* similar to your addmul_8 (after first writing addmul_4 and addmul_6). The following

Re: Neon addmul_8

2013-02-26 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Hmm, I tried changing all output registers to unique registers (only written once in the loop and never read, except that vmlal reads the output register before accumulating into it). Do you mean that I need to change the *input* registers of all

Re: Neon addmul_8

2013-02-26 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes: Maybe later, but for now, A9 is my target platform. But it seems you're right that Neon is almost useless there. I'm attaching the functions I've been testing, in case anyone else would like to play with them. /Niels dnl ARM neon mpn_addmul_4.
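For readers unfamiliar with the addmul_N family discussed in this thread, here is a minimal portable C sketch of the arithmetic such a routine performs. It is not the attached Neon code and not GMP's internal interface: the names limb_t, ref_addmul_1 and ref_addmul_k are hypothetical, 32-bit limbs are assumed (as on 32-bit ARM), and the convention that rp holds n+k valid limbs on entry is an assumption made for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t; /* assumption: 32-bit limbs */

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* 32x32 -> 64 product plus two 32-bit addends cannot overflow 64 bits */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* Hypothetical reference: {rp, n+k} += {up, n} * {vp, k}, done as k rows
   of ref_addmul_1. Returns the carry out of the top limb (0 or 1). */
limb_t ref_addmul_k(limb_t *rp, const limb_t *up, size_t n,
                    const limb_t *vp, size_t k)
{
    limb_t top = 0;
    for (size_t j = 0; j < k; j++) {
        limb_t c = ref_addmul_1(rp + j, up, n, vp[j]);
        /* propagate the row's carry into the remaining high limbs */
        for (size_t i = j + n; c != 0 && i < n + k; i++) {
            limb_t s = rp[i] + c;
            c = (limb_t)(s < c);
            rp[i] = s;
        }
        top += c;
    }
    return top;
}
```

An optimized addmul_4 or addmul_8 would process all rows in one pass over up; a row-by-row reference like this is mainly useful as something to compare results against.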

Re: Neon addmul_8

2013-02-26 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: I'm attaching the functions I've been testing, in case anyone else would like to play with them. May I innocently ask if the functions have survived the prescribed testing (tests/devel/addmul_N.c and/or tests/devel/try.c)? ;-) -- Torbjörn

Neon column-wise addmul_4

2013-02-26 Thread Richard Henderson
On 02/26/2013 05:14 AM, Niels Möller wrote: Untried tricks: One could try to use vuzp to separate high and low parts of the products. Then only the low parts need shifting around. I guess I'll try that with addmul_4 first, to see if it makes for any improvement. One could maybe use vaddw, to
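The idea of separating the high and low halves of the products can be illustrated in scalar C. The sketch below is only an assumption about the general column-wise approach, not the Neon code under discussion: colwise_addmul_4 is a hypothetical name, limbs are assumed to be 32 bits, and rp is assumed to hold n+4 valid limbs on entry. In each column the low 32 bits of every product are summed in place, while the high 32 bits are set aside and added one column later.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: {rp, n+4} += {up, n} * {vp, 4}, one column at a time.
   Returns the carry out of the top limb. */
uint32_t colwise_addmul_4(uint32_t *rp, const uint32_t *up, size_t n,
                          const uint32_t *vp)
{
    uint64_t acc = 0;    /* carry from the previous column */
    uint64_t hi_sum = 0; /* high product halves deferred from the previous column */

    for (size_t col = 0; col < n + 4; col++) {
        uint64_t lo_sum = 0, next_hi = 0;
        for (size_t j = 0; j < 4; j++) {
            if (col >= j && col - j < n) {
                uint64_t p = (uint64_t)up[col - j] * vp[j];
                lo_sum += (uint32_t)p; /* low half stays in this column */
                next_hi += p >> 32;    /* high half belongs to the next column */
            }
        }
        /* at most four low halves, four high halves, one limb and a small
           carry are added here, so the 64-bit accumulator cannot overflow */
        acc += lo_sum + hi_sum + rp[col];
        rp[col] = (uint32_t)acc;
        acc >>= 32;
        hi_sum = next_hi;
    }
    return (uint32_t)acc;
}
```

Whether this split pays off in Neon depends on the cost of the extra shuffles, which is exactly what the thread is trying to measure; the scalar version only shows that the arithmetic works out.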

Re: Neon addmul_8

2013-02-26 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes: May I innocently ask if the functions have survived the prescribed testing (tests/devel/addmul_N.c and/or tests/devel/try.c)? ;-) They have been subject to addmul_N.c testing. /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.

Re: Neon addmul_8

2013-02-26 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: Perhaps I got the methodology wrong here, but it sure appears as if vmlal does not require the addend input until the 4th cycle, producing full output on the 5th. This seems to be the easiest way to hide a lot of output latency. I measured a

Re: Neon addmul_8

2013-02-26 Thread Richard Henderson
On 02/26/2013 10:41 AM, Torbjorn Granlund wrote: I'm not sure quite what's going on with the 3/4 issue rates. I really would have expected to see either exactly 1, or very nearly 1/2, especially for vadd. I think you mean 4/3. But also that is an underestimate. With 8-way unrolling