Dear Marco,
> Finally we can totally skip the first approximation and directly compute
>
> i' = T1[h(a)] + l(a) * T2[h(a)]
>
> using a piece-wise linear function.
yes, we noticed with Niels that we can replace most of the arithmetic in the
first Newton iteration by another table lookup.
"Marco Bodrato" writes:
> The constant (c) I added to the first iteration should be incorporated in
> T1, and of course it is not necessary to really use a constant. A
> different value can be used for each entry in the table...
In Newton iteration, one usually
Ciao,
Il Gio, 8 Giugno 2017 9:44 am, paul zimmermann ha scritto:
>Dear Marco,
> - in the 1-limb routine mpfr_sqrt1(), in the
> at the end we have u0*2^GMP_NUMB_BITS = r0^2 + rb*2^GMP_NUMB_BITS + sb
I'll look into it.
> - similarly in mpfr_sqrt2() in the else branch we have
> {np,
"Marco Bodrato" <bodr...@mail.dm.unipi.it> writes:
I got inspired by Adrien's code and by Paul's one, and I wrote another
candidate replacement for mpn_sqrtrem{1,2}, working for both 32 and 64
bits.
Nice!
My goals:
- new code should not be slower than the curren
Ciao,
Il Sab, 1 Aprile 2017 9:02 pm, Adrien Prost-Boucle ha scritto:
> The new diff is at the end of this message.
I got inspired by Adrien's code and by Paul's one, and I wrote another
candidate replacement for mpn_sqrtrem{1,2}, working for both 32 and 64
bits.
My goals:
- new code sho
On Sun, 2017-04-02 at 06:44 +0200, Marco Bodrato wrote:
>
> For ABI=32, can you please tell us the timings obtained with:
>
> make&&(cd tests/devel/;make sqrtrem_1_2& ./sqrtrem_1_2 x 1)
>
> before, and after your patch (maybe playing with the different flavours of
> the correction step :-)?
Hi,
On my side, I do observe a little improvement (64-bit).
With original GMP correction:
100x 100 values of 64 bits
Time: 4.77369 s
100x 100 values of 48 bits
Time: 5.07621 s
100x 100 values of 32 bits
Time: 5.07722 s
100x 100 values of 16 bits
Time: 5.24327 s
With my new
Ciao,
Il Sab, 1 Aprile 2017 9:02 pm, Adrien Prost-Boucle ha scritto:
> On Sat, 2017-04-01 at 18:15 +0200, Marco Bodrato wrote:
>> After the patch:
>> $ (cd tests/devel; make sqrtrem_1_2)& tests/devel/sqrtrem_1_2 c
>> Corner cases tested, up to bits:
>> \ 63
>> Values of a single limb, tested.
>>
On Sat, 2017-04-01 at 21:58 +0200, Marc Glisse wrote:
> Did you run "make distclean" between the 64-bit build and the 32-bit
> build? (doing the build out-of-tree avoids this kind of problem, since
> you can easily do the 32-bit build in a different directory)
I did clean but forgot about
On Sat, 1 Apr 2017, Adrien Prost-Boucle wrote:
On Sat, 2017-04-01 at 18:15 +0200, Marco Bodrato wrote:
Sorry, but even correcting the obvious typos, it doesn't pass the
tests.
I think I have found the error.
The final correction was wrong.
I hope it's OK now, but... I still can't compile
On Sat, 2017-04-01 at 18:15 +0200, Marco Bodrato wrote:
> Sorry, but even correcting the obvious typos, it doesn't pass the
> tests.
I think I have found the error.
The final correction was wrong.
I hope it's OK now, but... I still can't compile GMP with ABI=32.
Like you suggested I launched:
Ciao,
Il Mer, 29 Marzo 2017 8:38 am, Adrien Prost-Boucle ha scritto:
> On Wed, 2017-03-29 at 02:06 +0200, Marco Bodrato wrote:
>> You didn't try
>> ./configure ABI=32 && make && make check
>> did you?
> Standard copy/paste problem...
> Just replace vsh by a0 for now, I'll test ABI=32 when
Hi,
On Sat, 2017-03-25 at 21:34 +0100, Torbjörn Granlund wrote:
> The sqrtss and sqrtds are SIMD operations, right? That means that if we
> don't initialise all input fields with something, they might contain
> special values which triggers exceptional conditions.
The Intel docs say that
Ciao,
Il Mar, 28 Marzo 2017 9:18 pm, Adrien Prost-Boucle ha scritto:
> The diff is at the end of this message. Based on rev 17327.
> It's a bit ugly with interleaved ++ and -- ...
> Basically, remove previous mpn_sqrtrem1 and replace by my 2 versions, 32
> and 64b.
You didn't try
./configure
> The only branch is for the final correction at end of mpn_sqrtrem1.
> I tried with the previous mpn_sqrtrem1 version, which has a condition,
> and with an unconditional code that needs 2 multiplications.
>
> My version with unconditional correction had same speed.
> My version with current
Hi,
On Sat, 2017-03-25 at 15:38 +0100, Marco Bodrato wrote:
> Il Gio, 23 Marzo 2017 8:46 pm, Adrien Prost-Boucle ha scritto:
> > > About the pure C code, integer version that was working on,
> > > But... when I put that code in GMP code, that resulted in
> > > a noticeable slowdown /o\
> >
This is essentially what I propose:
unsigned int
isqrt (unsigned long a0)
{
double y;
__asm__ ("sqrtsd %1, %0" : "=x" (y) : "xm" (0.999 * a0));
unsigned int r = 1 + (unsigned int) y;
return r - ((unsigned long) r * r > a0);
}
If we decide to use FP for some subset of x86_64
On 2017-03-26 00:54:50 +0100, Adrien Prost-Boucle wrote:
> > I tested with a function that repeatedly sets all rounding modes.
> > The result is: 995413 calls to fesetround() per second on my laptop
> > That's extremely slow given the speed of the sqrt function!
>
> Ooops there was a typo in my
On 2017-03-16 14:45:58 +0100, Torbjörn Granlund wrote:
> ni...@lysator.liu.se (Niels Möller) writes:
>
> I think the "long double" type means an 80-bit floating point format
> which was implemented in the original 80387 floating point coprocessor.
> And which isn't supported by the later
> I tested with a function that repeatedly sets all rounding modes.
> The result is: 995413 calls to fesetround() per second on my laptop
> That's extremely slow given the speed of the sqrt function!
Ooops there was a typo in my code: division by 4 instead of
multiplication by 4 (the number of
On Sat, 25 Mar 2017, Torbjörn Granlund wrote:
The sqrtss and sqrtds are SIMD operations, right? That means that if we
don't initialise all input fields with something, they might contain
special values which triggers exceptional conditions.
I don't think so. sqrtsd computes sqrt of the lower
Adrien Prost-Boucle writes:
As far as I know, these instructions are affected bu rounding mode.
And no instruction specifies the rounding mode.
So, I have to assume worst-case and consider the user programs may
have set a rounding mode that I don't
Adrien Prost-Boucle writes:
This page, at the end, tends to discourage playing with rounding mode too
frequently:
https://www.gnu.org/software/libc/manual/html_node/Rounding.html
GMP should absolutely not play with FP rounding mode, not even for some
> Il Ven, 24 Marzo 2017 11:54 pm, Adrien Prost-Boucle ha scritto:
> > This page, at the end, tends to discourage playing with rounding mode too
> > frequently:
> > https://www.gnu.org/software/libc/manual/html_node/Rounding.html
>
> This page is about writing portable code using libc, here we
Ciao,
Il Gio, 23 Marzo 2017 8:46 pm, Adrien Prost-Boucle ha scritto:
>> About the pure C code, integer version that was working on,
>> But... when I put that code in GMP code, that resulted in
>> a noticeable slowdown /o\
> Problem solved.
> Branch prediction made GMP's sqrtrem1 appear faster
Ciao,
Il Ven, 24 Marzo 2017 11:54 pm, Adrien Prost-Boucle ha scritto:
> This page, at the end, tends to discourage playing with rounding mode too
> frequently:
> https://www.gnu.org/software/libc/manual/html_node/Rounding.html
This page is about writing portable code using libc, here we have a
This page, at the end, tends to discourage playing with rounding mode too
frequently:
https://www.gnu.org/software/libc/manual/html_node/Rounding.html
On Fri, 2017-03-24 at 15:56 +0100, Torbjörn Granlund wrote:
> I haven't followed this thread carefully, but now I have some questions:
>
> Are
As far as I know, these instructions are affected bu rounding mode.
And no instruction specifies the rounding mode.
So, I have to assume worst-case and consider the user programs may
have set a rounding mode that I don't expect.
That's why I propose - as a first proposal - a function that always
Hi,
I now have a working version of sqrtrem1 that uses floating-point sqrt
instruction on x86-64.
For a quick glance, here is the speedup on my two machines:
Laptop Core 2 Duo 2GHz
==
1) Time when using only rev 17327
2) Time when using rev 17327 + FP version for sqrtrem1
"Marco Bodrato" writes:
But it can be useful on some platform. Say mpn/x86_64 ? or
mpn/x86_64/fast{sse,avx} ?
When it's know to be supported and for CPUs where it is kmown to run
well, sure.
We should never use the old x86 FPU stack with 80-bit FP registers.
ni...@lysator.liu.se (Niels Möller) writes:
I think the "long double" type means an 80-bit floating point format
which was implemented in the original 80387 floating point coprocessor.
And which isn't supported by the later x86 SIMD instructions (and not
supported at all on on non-x86
Ciao,
Il Mer, 15 Marzo 2017 8:56 pm, Adrien Prost-Boucle ha scritto:
> Hi,
>
>> I miss a case: 32 bits; to fully evaluate the impact of the patch+FP on
>> one-limb operands in the range 1..62.
>
> Isn't 64-bit and 32-bit data identical, for one mpn_sqrtrem1 call on
> x86-64?
> I don't get why we
Hi,
> I miss a case: 32 bits; to fully evaluate the impact of the patch+FP on
> one-limb operands in the range 1..62.
Isn't 64-bit and 32-bit data identical, for one mpn_sqrtrem1 call on x86-64?
I don't get why we would see a difference.
Or, do you mean we should add another test (along with
Hi,
On Mon, 2017-03-13 at 22:11 +0100, Marco Bodrato wrote:
> Not even my change really gave 2x... but, before it, you could obtain
> 20-30%, now you measure 50-70%... that's the right direction :-)
I just reached 2.4x :-)
For that, I added macros to indicate whether sqrtrem{1,2} need normalized
Sorry for alignment in my previous email.
Here is a cleaner version.
Note that the first time value (left) is for GMP hg17327
And the second one (right) is with the FP functions
GMP 6.1.99.hg17327 modified so functions mpn_sqrtrem{1,2} are extern
Laptop Core 2 Duo 2GHz
All tests are repeated
Hi,
> I'll be grateful if you run your measures again,
> in particular for 100 and 128 bits.
I pulled the latest changes (hg rev 17327) and launched my tests again.
I cleaned up the values ad numbers of bits, just so it's a bit cleaner.
Reminder: I compiled GMP with functions mpn_sqrtr
"Marco Bodrato" writes:
> After the patch.
> real 3m47.247s
The time to compute and check 2^33 square roots, but it can be reduced.
Computing just 2^32 square roots would be another great speedup...
> Almost 2x speedup, awesome!
Yes, Adrien
Ciao,
Il Gio, 9 Marzo 2017 2:33 am, Torbjörn Granlund ha scritto:
> Marco Bodrato writes:
> After the patch.
> real3m47.247s
The time to compute and check 2^33 square roots, but it can be reduced.
> Almost 2x speedup, awesome!
Yes, Adrien spotted a real
Ciao Adrien,
Il 2017-02-16 22:45 Adrien Prost-Boucle ha scritto:
Here are the results, with vanilla GMP and with floating-point
mpn_sqrtrem{1,2}.
1000 times, 128 bits .. vanilla 1.044 s / FP 0.781 s
1000 times, 100 bits .. vanilla 1.373 s / FP 1.077 s
There is some
the correct answers for all values in between.
> There are too many uint64_t values to check programmatically, and I
As a contribution to this discussion, I pushed a piece of code for this
kind of tests: https://gmplib.org/repo/gmp/rev/11887dfdf7a6 .
Once you have changed mpn_sqrtrem{1,2
Ciao,
Il Gio, 16 Febbraio 2017 10:45 pm, Adrien Prost-Boucle ha scritto:
> Before doing heavy (for me) asm development,
> I first wanted to evaluate could be the impact, on the mpz_sqrt()
> function itself, of a 2-3x speedup on functions mpn_sqrtrem{1,2}.
> There is some noticeable
Hi all,
Before doing heavy (for me) asm development,
I first wanted to evaluate could be the impact, on the mpz_sqrt()
function itself, of a 2-3x speedup on functions mpn_sqrtrem{1,2}.
To avoid compile-time dependency on libm,
I simply recompiled GMP with exposed symbols mpn_sqrtrem{1,2
Adrien Prost-Boucle writes:
> Maybe the availability of SSE / AVX / NEON etc instruction sets can be
> checked at compilation time?
That's what configure (and its helper scripts) does.
With --enable-fat, we also have runtime detection on certain systems.
> The
Adrien Prost-Boucle writes:
So first I'd like to know,
what do GMP developers think about using FP there?
Making GMP dependent in libm is not OK.
Using time-critical floating-point features on a CPU-by-CPU basis is ok,
but needs to be done with care. Are
// Subtract to 2*r to check if the root is too low
> usub64x2(, , root >> 63, root << 1, remh, reml);
> // Get the full correction
> corr += remh >> 63;
> #endif
>
> // Apply the error correction
> //printf("%"PRIi64" ", c
t 14:12 +0100, Marco Bodrato wrote:
> Ciao,
>
> Il Ven, 23 Dicembre 2016 5:22 pm, Adrien Prost-Boucle ha scritto:
> > I now have the same API for my implementation and for the code I've taken
> > from GMP.
>
> As a contribution to this discussion, I did the same thing,
Ciao,
Il Ven, 23 Dicembre 2016 5:22 pm, Adrien Prost-Boucle ha scritto:
> I now have the same API for my implementation and for the code I've taken
> from GMP.
As a contribution to this discussion, I did the same thing, but the other
direction.
The attached code contains a copy of mpn_s
47 matches
Mail list logo