Richard Henderson r...@twiddle.net writes:
Down to 5.8 cyc/limb. Good, but not fantastic. I'm gonna try one more time
with larger unrolling to make full use of the vector load insns, and less
over-prefetching.
Good improvement!
Keep in mind that addmul_ will be used for smallish count
Richard Henderson r...@twiddle.net writes:
On 2013-02-23 06:06, Niels Möller wrote:
Not sure what the bottlenecks of your loop are though; instruction
decoding, load/store, or the recurrency chain (but at least it shouldn't
be multiplier throughput, right?).
Yeah, neither am I. I can't find any info on what latency of neon insns
On Sat, 23 Feb 2013, Niels Möller wrote:
ni...@lysator.liu.se (Niels Möller) writes:
Another revision, this time as a patch. As you can see, I renamed things
a bit more. Compile tested only...
New patch below, including some testcases (no tests for mpz_roinit_n and
MPZ_ROINIT_N though). I'd
Marc Glisse marc.gli...@inria.fr writes:
Was the documentation in an earlier message (I couldn't find it)?
There were more header comments in some of the earlier messages, most
recently in
http://gmplib.org/list-archives/gmp-devel/2013-February/002825.html.
Should be added (as a new section?)
On 2013-02-23 05:31, Torbjorn Granlund wrote:
Richard Henderson r...@twiddle.net writes:
Down to 5.8 cyc/limb. Good, but not fantastic. I'm gonna try one more time
with larger unrolling to make full use of the vector load insns, and less
over-prefetching.
Good improvement!
Keep in
Down to 2.8-3.0 cyc/limb.
r~
dnl ARM neon mpn_addmul_4.
dnl
dnl Copyright 2013 Free Software Foundation, Inc.
dnl
dnl This file is part of the GNU MP Library.
dnl
dnl The GNU MP Library is free software; you can redistribute it and/or modify
dnl it under the terms of the GNU Lesser General
Richard Henderson r...@twiddle.net writes:
The reason for that is that we use Karatsuba's algorithm for counts of
over about 20.
Good to know. I won't base my tuning on tests/devel/addmul_N's
default of ~600 limbs then.
Using a larger count L1size might still be useful for
Richard Henderson r...@twiddle.net writes:
Down to 2.8-3.0 cyc/limb.
Good leaps towards 0.7 c/l, and not far from the current code.
(On A9 it runs at 4.25 c/l.)
--
Torbjörn
___
gmp-devel mailing list
gmp-devel@gmplib.org
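
For readers following the cycles/limb figures in this thread, here is a minimal portable C sketch of the operation being tuned: multiply a limb vector by a single limb and add it into a destination, returning the carry-out limb. It is an illustration only, not the NEON code posted above and not GMP's implementation. The names limb_t and ref_addmul_1 are hypothetical, and 32-bit limbs are assumed, as on 32-bit ARM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-bit limbs, so a 64-bit integer holds a full product. */
typedef uint32_t limb_t;

/* Hypothetical reference routine (not GMP's code):
   rp[0..n-1] += up[0..n-1] * v, returning the carry-out limb. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;

    assert(n == 0 || (rp != NULL && up != NULL));

    for (size_t i = 0; i < n; i++) {
        /* The largest possible value is (2^32-1)^2 + 2*(2^32-1) = 2^64-1,
           so the sum cannot overflow 64 bits. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;            /* low half stays in place */
        carry = (limb_t)(t >> 32);    /* high half feeds the next limb */
    }
    return carry;
}
```

Each iteration depends on the carry from the previous one, which is the kind of recurrence chain mentioned earlier in the thread. An addmul_N variant handles N multiplier limbs per pass over up, which reduces the loads and stores of rp per product limb.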
gcc -O2 -g3 [...] addmul_N.c -DN=8 -DCLOCK=169400
$ ./t.out
mpn_addmul_8: 2845ms (1.782 cycles/limb) [973.59 Gb/s]
mpn_addmul_8: 2620ms (1.641 cycles/limb) [1057.20 Gb/s]
mpn_addmul_8: 2625ms (1.644 cycles/limb) [1055.19 Gb/s]
mpn_addmul_8: 2625ms (1.644 cycles/limb) [1055.19