Re: arm neon

2013-02-23 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: Down to 5.8 cyc/limb. Good, but not fantastic. I'm gonna try one more time with larger unrolling to make full use of the vector load insns, and less over-prefetching. Good improvement! Keep in mind that addmul_ will be used for smallish count

Re: arm neon

2013-02-23 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: On 2013-02-23 06:06, Niels Möller wrote: Not sure what the bottlenecks of your loop are though; instruction decoding, load/store, or the recurrence chain (but at least it shouldn't be multiplier throughput, right?). Yeah, neither am I. I

Re: arm neon

2013-02-23 Thread Richard Henderson
On 2013-02-23 06:06, Niels Möller wrote: Not sure what the bottlenecks of your loop are though; instruction decoding, load/store, or the recurrence chain (but at least it shouldn't be multiplier throughput, right?). Yeah, neither am I. I can't find any info on what latency of neon insns

Re: Extending the mpn interface

2013-02-23 Thread Marc Glisse
On Sat, 23 Feb 2013, Niels Möller wrote: ni...@lysator.liu.se (Niels Möller) writes: Another revision, this time as a patch. As you can see, I renamed things a bit more. Compile tested only... New patch below, including some testcases (no tests for mpz_roinit_n and MPZ_ROINIT_N though). I'd

Re: Extending the mpn interface

2013-02-23 Thread Niels Möller
Marc Glisse marc.gli...@inria.fr writes: Was the documentation in an earlier message (I couldn't find it)? There were more header comments in some of the earlier messages, most recently in http://gmplib.org/list-archives/gmp-devel/2013-February/002825.html. Should be added (as a new section?)

Re: arm neon

2013-02-23 Thread Richard Henderson
On 2013-02-23 05:31, Torbjorn Granlund wrote: Richard Henderson r...@twiddle.net writes: Down to 5.8 cyc/limb. Good, but not fantastic. I'm gonna try one more time with larger unrolling to make full use of the vector load insns, and less over-prefetching. Good improvement! Keep in

Neon addmul_4

2013-02-23 Thread Richard Henderson
Down to 2.8-3.0 cyc/limb. r~ dnl ARM neon mpn_addmul_4. dnl dnl Copyright 2013 Free Software Foundation, Inc. dnl dnl This file is part of the GNU MP Library. dnl dnl The GNU MP Library is free software; you can redistribute it and/or modify dnl it under the terms of the GNU Lesser General

Re: arm neon

2013-02-23 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: The reason for that is that we use Karatsuba's algorithm for counts of over about 20. Good to know. I won't base my tuning on tests/devel/addmul_N's default of ~600 limbs then. Using a larger count L1size might still be useful for
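For readers unfamiliar with why small operand counts matter here: below the Karatsuba threshold, a schoolbook product is built from one addmul row per limb of the second operand, so each addmul call only ever sees short operands. The following is a minimal portable sketch of that structure, not the code discussed in the thread. The names `limb_t`, `ref_addmul_1` and `ref_mul_basecase` are hypothetical, and 32-bit limbs are an assumption chosen so that a 64-bit intermediate holds each partial product.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-bit limbs, so a limb product plus two limbs fits in 64 bits. */
typedef uint32_t limb_t;

/* Hypothetical reference routine with the documented mpn_addmul_1 semantics:
   {rp,n} += {up,n} * v, returning the limb carried out of the top. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Cannot overflow: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* Hypothetical schoolbook multiply: {rp,un+vn} = {up,un} * {vp,vn}.
   One addmul row per limb of vp; rp must not overlap the inputs. */
static void ref_mul_basecase(limb_t *rp, const limb_t *up, size_t un,
                             const limb_t *vp, size_t vn)
{
    assert(un >= 1 && vn >= 1);
    for (size_t i = 0; i < un + vn; i++)
        rp[i] = 0;
    for (size_t j = 0; j < vn; j++)
        rp[j + un] = ref_addmul_1(rp + j, up, un, vp[j]);
}
```

With a threshold of about 20 limbs, each row in such a loop processes at most about 20 limbs, which is why per-call overhead and loop setup weigh more in practice than steady-state throughput measured at several hundred limbs.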

Re: Neon addmul_4

2013-02-23 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: Down to 2.8-3.0 cyc/limb. Good leaps towards 0.7 c/l, and not far from the current code. (On A9 it runs at 4.25 c/l.) -- Torbjörn

Neon addmul_8

2013-02-23 Thread Richard Henderson
gcc -O2 -g3 [...] addmul_N.c -DN=8 -DCLOCK=169400
$ ./t.out
mpn_addmul_8: 2845ms (1.782 cycles/limb) [973.59 Gb/s]
mpn_addmul_8: 2620ms (1.641 cycles/limb) [1057.20 Gb/s]
mpn_addmul_8: 2625ms (1.644 cycles/limb) [1055.19 Gb/s]
mpn_addmul_8: 2625ms (1.644 cycles/limb) [1055.19