Re: mpn_sqrtrem{1,2} - patch for pure C implem

2017-06-09 Thread paul zimmermann
Dear Marco, > Finally we can totally skip the first approximation and directly compute > > i' = T1[h(a)] + l(a) * T2[h(a)] > > using a piece-wise linear function. yes, we noticed with Niels that we can replace most of the arithmetic in the first Newton iteration by another table lookup.

Re: mpn_sqrtrem{1,2} - patch for pure C implem

2017-06-09 Thread Niels Möller
"Marco Bodrato" writes: > The constant (c) I added to the first iteration should be incorporated in > T1, and of course it is not necessary to really use a constant. A > different value can be used for each entry in the table... In Newton iteration, one usually

Re: mpn_sqrtrem{1,2} - patch for pure C implem

2017-06-08 Thread Marco Bodrato
Ciao, Il Gio, 8 Giugno 2017 9:44 am, paul zimmermann ha scritto: >Dear Marco, > - in the 1-limb routine mpfr_sqrt1(), in the > at the end we have u0*2^GMP_NUMB_BITS = r0^2 + rb*2^GMP_NUMB_BITS + sb I'll look into it. > - similarly in mpfr_sqrt2() in the else branch we have > {np,

Re: mpn_sqrtrem{1,2} - patch for pure C implem

2017-06-06 Thread Torbjörn Granlund
"Marco Bodrato" <bodr...@mail.dm.unipi.it> writes: I got inspired by Adrien's code and by Paul's one, and I wrote another candidate replacement for mpn_sqrtrem{1,2}, working for both 32 and 64 bits. Nice! My goals: - new code should not be slower than the curren

Re: mpn_sqrtrem{1,2} - patch for pure C implem

2017-06-06 Thread Marco Bodrato
Ciao, Il Sab, 1 Aprile 2017 9:02 pm, Adrien Prost-Boucle ha scritto: > The new diff is at the end of this message. I got inspired by Adrien's code and by Paul's one, and I wrote another candidate replacement for mpn_sqrtrem{1,2}, working for both 32 and 64 bits. My goals: - new code sho

Re: mpn_sqrtrem{1,2} - patch for pure C implem

2017-04-02 Thread Adrien Prost-Boucle
On Sun, 2017-04-02 at 06:44 +0200, Marco Bodrato wrote: > > For ABI=32, can you please tell us the timings obtained with: > > make&&(cd tests/devel/;make sqrtrem_1_2& ./sqrtrem_1_2 x 1) > > before, and after your patch (maybe playing with the different flavours of > the correction step :-)?

Re: mpn_sqrtrem{1,2} - patch for pure C implem

2017-04-02 Thread Adrien Prost-Boucle
Hi, On my side, I do observe a little improvement (64-bit). With original GMP correction: 100x 100 values of 64 bits Time: 4.77369 s 100x 100 values of 48 bits Time: 5.07621 s 100x 100 values of 32 bits Time: 5.07722 s 100x 100 values of 16 bits Time: 5.24327 s With my new

Re: mpn_sqrtrem{1,2} - patch for pure C implem

2017-04-01 Thread Marco Bodrato
Ciao, Il Sab, 1 Aprile 2017 9:02 pm, Adrien Prost-Boucle ha scritto: > On Sat, 2017-04-01 at 18:15 +0200, Marco Bodrato wrote: >> After the patch: >> $ (cd tests/devel; make sqrtrem_1_2)& tests/devel/sqrtrem_1_2 c >> Corner cases tested, up to bits: >> \ 63 >> Values of a single limb, tested. >>

Re: mpn_sqrtrem{1,2} - patch for pure C implem

2017-04-01 Thread Adrien Prost-Boucle
On Sat, 2017-04-01 at 21:58 +0200, Marc Glisse wrote: > Did you run "make distclean" between the 64-bit build and the 32-bit > build? (doing the build out-of-tree avoids this kind of problem, since > you can easily do the 32-bit build in a different directory) I did clean but forgot about

Re: mpn_sqrtrem{1,2} - patch for pure C implem

2017-04-01 Thread Marc Glisse
On Sat, 1 Apr 2017, Adrien Prost-Boucle wrote: On Sat, 2017-04-01 at 18:15 +0200, Marco Bodrato wrote: Sorry, but even correcting the obvious typos, it doesn't pass the tests. I think I have found the error. The final correction was wrong. I hope it's OK now, but... I still can't compile

Re: mpn_sqrtrem{1,2} - patch for pure C implem

2017-04-01 Thread Adrien Prost-Boucle
On Sat, 2017-04-01 at 18:15 +0200, Marco Bodrato wrote: > Sorry, but even correcting the obvious typos, it doesn't pass the > tests. I think I have found the error. The final correction was wrong. I hope it's OK now, but... I still can't compile GMP with ABI=32. Like you suggested I launched:

Re: mpn_sqrtrem{1,2} - patch for pure C implem

2017-04-01 Thread Marco Bodrato
Ciao, Il Mer, 29 Marzo 2017 8:38 am, Adrien Prost-Boucle ha scritto: > On Wed, 2017-03-29 at 02:06 +0200, Marco Bodrato wrote: >> You didn't try >> ./configure ABI=32 && make && make check >> did you? > Standard copy/paste problem... > Just replace vsh by a0 for now, I'll test ABI=32 when

Re: mpn_sqrtrem{1,2} - floating-point sqrts{s,d}

2017-04-01 Thread Adrien Prost-Boucle
Hi, On Sat, 2017-03-25 at 21:34 +0100, Torbjörn Granlund wrote: > The sqrtss and sqrtds are SIMD operations, right?  That means that if we > don't initialise all input fields with something, they might contain > special values which triggers exceptional conditions. The Intel docs say that

Re: mpn_sqrtrem{1,2} - patch for pure C implem

2017-03-28 Thread Marco Bodrato
Ciao, Il Mar, 28 Marzo 2017 9:18 pm, Adrien Prost-Boucle ha scritto: > The diff is at the end of this message. Based on rev 17327. > It's a bit ugly with interleaved ++ and -- ... > Basically, remove previous mpn_sqrtrem1 and replace by my 2 versions, 32 > and 64b. You didn't try ./configure

Re: mpn_sqrtrem{1,2} - patch for pure C implem

2017-03-28 Thread Adrien Prost-Boucle
> The only branch is for the final correction at end of mpn_sqrtrem1. > I tried with the previous mpn_sqrtrem1 version, which has a condition, > and with an unconditional code that needs 2 multiplications. > > My version with unconditional correction had same speed. > My version with current

Re: mpn_sqrtrem{1,2} - patch for pure C implem

2017-03-28 Thread Adrien Prost-Boucle
Hi, On Sat, 2017-03-25 at 15:38 +0100, Marco Bodrato wrote: > Il Gio, 23 Marzo 2017 8:46 pm, Adrien Prost-Boucle ha scritto: > > > About the pure C code, integer version that was working on, > > > But... when I put that code in GMP code, that resulted in > > > a noticeable slowdown /o\ > >

Re: mpn_sqrtrem{1,2} - rounding mode - erratum

2017-03-25 Thread Torbjörn Granlund
This is essentially what I propose: unsigned int isqrt (unsigned long a0) { double y; __asm__ ("sqrtsd %1, %0" : "=x" (y) : "xm" (0.999 * a0)); unsigned int r = 1 + (unsigned int) y; return r - ((unsigned long) r * r > a0); } If we decide to use FP for some subset of x86_64

Re: mpn_sqrtrem{1,2} - rounding mode - erratum

2017-03-25 Thread Vincent Lefevre
On 2017-03-26 00:54:50 +0100, Adrien Prost-Boucle wrote: > > I tested with a function that repeatedly sets all rounding modes. > > The result is: 995413 calls to fesetround() per second on my laptop > > That's extremely slow given the speed of the sqrt function! > > Ooops there was a typo in my

Re: mpn_sqrtrem{1,2}

2017-03-25 Thread Vincent Lefevre
On 2017-03-16 14:45:58 +0100, Torbjörn Granlund wrote: > ni...@lysator.liu.se (Niels Möller) writes: > > I think the "long double" type means an 80-bit floating point format > which was implemented in the original 80387 floating point coprocessor. > And which isn't supported by the later

Re: mpn_sqrtrem{1,2} - rounding mode - erratum

2017-03-25 Thread Adrien Prost-Boucle
> I tested with a function that repeatedly sets all rounding modes. > The result is: 995413 calls to fesetround() per second on my laptop > That's extremely slow given the speed of the sqrt function! Ooops there was a typo in my code: division by 4 instead of multiplication by 4 (the number of

Re: mpn_sqrtrem{1,2}

2017-03-25 Thread Marc Glisse
On Sat, 25 Mar 2017, Torbjörn Granlund wrote: The sqrtss and sqrtds are SIMD operations, right? That means that if we don't initialise all input fields with something, they might contain special values which triggers exceptional conditions. I don't think so. sqrtsd computes sqrt of the lower

Re: mpn_sqrtrem{1,2}

2017-03-25 Thread Torbjörn Granlund
Adrien Prost-Boucle writes: As far as I know, these instructions are affected bu rounding mode. And no instruction specifies the rounding mode. So, I have to assume worst-case and consider the user programs may have set a rounding mode that I don't

Re: mpn_sqrtrem{1,2}

2017-03-25 Thread Torbjörn Granlund
Adrien Prost-Boucle writes: This page, at the end, tends to discourage playing with rounding mode too frequently: https://www.gnu.org/software/libc/manual/html_node/Rounding.html GMP should absolutely not play with FP rounding mode, not even for some

Re: mpn_sqrtrem{1,2} - rounding mode

2017-03-25 Thread Adrien Prost-Boucle
> Il Ven, 24 Marzo 2017 11:54 pm, Adrien Prost-Boucle ha scritto: > > This page, at the end, tends to discourage playing with rounding mode too > > frequently: > > https://www.gnu.org/software/libc/manual/html_node/Rounding.html > > This page is about writing portable code using libc, here we

Re: mpn_sqrtrem{1,2}

2017-03-25 Thread Marco Bodrato
Ciao, Il Gio, 23 Marzo 2017 8:46 pm, Adrien Prost-Boucle ha scritto: >> About the pure C code, integer version that was working on, >> But... when I put that code in GMP code, that resulted in >> a noticeable slowdown /o\ > Problem solved. > Branch prediction made GMP's sqrtrem1 appear faster

Re: mpn_sqrtrem{1,2}

2017-03-25 Thread Marco Bodrato
Ciao, Il Ven, 24 Marzo 2017 11:54 pm, Adrien Prost-Boucle ha scritto: > This page, at the end, tends to discourage playing with rounding mode too > frequently: > https://www.gnu.org/software/libc/manual/html_node/Rounding.html This page is about writing portable code using libc, here we have a

Re: mpn_sqrtrem{1,2}

2017-03-24 Thread Adrien Prost-Boucle
This page, at the end, tends to discourage playing with rounding mode too frequently: https://www.gnu.org/software/libc/manual/html_node/Rounding.html On Fri, 2017-03-24 at 15:56 +0100, Torbjörn Granlund wrote: > I haven't followed this thread carefully, but now I have some questions: > > Are

Re: mpn_sqrtrem{1,2}

2017-03-24 Thread Adrien Prost-Boucle
As far as I know, these instructions are affected bu rounding mode. And no instruction specifies the rounding mode. So, I have to assume worst-case and consider the user programs may have set a rounding mode that I don't expect. That's why I propose - as a first proposal - a function that always

Re: mpn_sqrtrem{1,2}

2017-03-22 Thread Adrien Prost-Boucle
Hi, I now have a working version of sqrtrem1 that uses floating-point sqrt instruction on x86-64. For a quick glance, here is the speedup on my two machines: Laptop Core 2 Duo 2GHz == 1) Time when using only rev 17327 2) Time when using rev 17327 + FP version for sqrtrem1

Re: mpn_sqrtrem{1,2}

2017-03-17 Thread Torbjörn Granlund
"Marco Bodrato" writes: But it can be useful on some platform. Say mpn/x86_64 ? or mpn/x86_64/fast{sse,avx} ? When it's know to be supported and for CPUs where it is kmown to run well, sure. We should never use the old x86 FPU stack with 80-bit FP registers.

Re: mpn_sqrtrem{1,2}

2017-03-16 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: I think the "long double" type means an 80-bit floating point format which was implemented in the original 80387 floating point coprocessor. And which isn't supported by the later x86 SIMD instructions (and not supported at all on on non-x86

Re: mpn_sqrtrem{1,2}

2017-03-15 Thread Marco Bodrato
Ciao, Il Mer, 15 Marzo 2017 8:56 pm, Adrien Prost-Boucle ha scritto: > Hi, > >> I miss a case: 32 bits; to fully evaluate the impact of the patch+FP on >> one-limb operands in the range 1..62. > > Isn't 64-bit and 32-bit data identical, for one mpn_sqrtrem1 call on > x86-64? > I don't get why we

Re: mpn_sqrtrem{1,2}

2017-03-15 Thread Adrien Prost-Boucle
Hi, > I miss a case: 32 bits; to fully evaluate the impact of the patch+FP on > one-limb operands in the range 1..62. Isn't 64-bit and 32-bit data identical, for one mpn_sqrtrem1 call on x86-64? I don't get why we would see a difference. Or, do you mean we should add another test (along with

Re: mpn_sqrtrem{1,2}

2017-03-14 Thread Adrien Prost-Boucle
Hi, On Mon, 2017-03-13 at 22:11 +0100, Marco Bodrato wrote: > Not even my change really gave 2x... but, before it, you could obtain > 20-30%, now you measure 50-70%... that's the right direction :-) I just reached 2.4x :-) For that, I added macros to indicate whether sqrtrem{1,2} need normalized

Re: Re: mpn_sqrtrem{1,2}

2017-03-09 Thread Adrien Prost-Boucle
Sorry for alignment in my previous email. Here is a cleaner version. Note that the first time value (left) is for GMP hg17327 And the second one (right) is with the FP functions GMP 6.1.99.hg17327 modified so functions mpn_sqrtrem{1,2} are extern Laptop Core 2 Duo 2GHz All tests are repeated

Re: mpn_sqrtrem{1,2}

2017-03-09 Thread Adrien Prost-Boucle
Hi, > I'll be grateful if you run your measures again, > in particular for 100 and 128 bits. I pulled the latest changes (hg rev 17327) and launched my tests again. I cleaned up the values ad numbers of bits, just so it's a bit cleaner. Reminder: I compiled GMP with functions mpn_sqrtr

Re: mpn_sqrtrem{1,2}

2017-03-09 Thread Torbjörn Granlund
"Marco Bodrato" writes: > After the patch. > real 3m47.247s The time to compute and check 2^33 square roots, but it can be reduced. Computing just 2^32 square roots would be another great speedup... > Almost 2x speedup, awesome! Yes, Adrien

Re: mpn_sqrtrem{1,2}

2017-03-09 Thread Marco Bodrato
Ciao, Il Gio, 9 Marzo 2017 2:33 am, Torbjörn Granlund ha scritto: > Marco Bodrato writes: > After the patch. > real3m47.247s The time to compute and check 2^33 square roots, but it can be reduced. > Almost 2x speedup, awesome! Yes, Adrien spotted a real

Re: mpn_sqrtrem{1,2}

2017-03-08 Thread Marco Bodrato
Ciao Adrien, Il 2017-02-16 22:45 Adrien Prost-Boucle ha scritto: Here are the results, with vanilla GMP and with floating-point mpn_sqrtrem{1,2}. 1000 times, 128 bits .. vanilla 1.044 s / FP 0.781 s 1000 times, 100 bits .. vanilla 1.373 s / FP 1.077 s There is some

Re: mpn_sqrtrem{1,2}

2017-02-28 Thread Marco Bodrato
the correct answers for all values in between. > There are too many uint64_t values to check programmatically, and I As a contribution to this discussion, I pushed a piece of code for this kind of tests: https://gmplib.org/repo/gmp/rev/11887dfdf7a6 . Once you have changed mpn_sqrtrem{1,2

Re: mpn_sqrtrem{1,2}

2017-02-17 Thread Marco Bodrato
Ciao, Il Gio, 16 Febbraio 2017 10:45 pm, Adrien Prost-Boucle ha scritto: > Before doing heavy (for me) asm development, > I first wanted to evaluate could be the impact, on the mpz_sqrt() > function itself, of a 2-3x speedup on functions mpn_sqrtrem{1,2}. > There is some noticeable

Re: mpn_sqrtrem{1,2}

2017-02-16 Thread Adrien Prost-Boucle
Hi all, Before doing heavy (for me) asm development, I first wanted to evaluate could be the impact, on the mpz_sqrt() function itself, of a 2-3x speedup on functions mpn_sqrtrem{1,2}. To avoid compile-time dependency on libm, I simply recompiled GMP with exposed symbols mpn_sqrtrem{1,2

Re: mpn_sqrtrem{1,2}

2017-02-01 Thread Niels Möller
Adrien Prost-Boucle writes: > Maybe the availability of SSE / AVX / NEON etc instruction sets can be > checked at compilation time? That's what configure (and its helper scripts) does. With --enable-fat, we also have runtime detection on certain systems. > The

Re: Re: mpn_sqrtrem{1,2}

2017-01-29 Thread Torbjörn Granlund
Adrien Prost-Boucle writes: So first I'd like to know, what do GMP developers think about using FP there? Making GMP dependent in libm is not OK. Using time-critical floating-point features on a CPU-by-CPU basis is ok, but needs to be done with care. Are

Re: Re: mpn_sqrtrem{1,2}

2017-01-29 Thread Adrien Prost-Boucle
// Subtract to 2*r to check if the root is too low > usub64x2(, , root >> 63, root << 1, remh, reml); > // Get the full correction > corr += remh >> 63; > #endif > > // Apply the error correction > //printf("%"PRIi64" ", c

Re: mpn_sqrtrem{1,2}

2017-01-28 Thread Adrien Prost-Boucle
t 14:12 +0100, Marco Bodrato wrote: > Ciao, > > Il Ven, 23 Dicembre 2016 5:22 pm, Adrien Prost-Boucle ha scritto: > > I now have the same API for my implementation and for the code I've taken > > from GMP. > > As a contribution to this discussion, I did the same thing,

mpn_sqrtrem{1,2}

2017-01-28 Thread Marco Bodrato
Ciao, Il Ven, 23 Dicembre 2016 5:22 pm, Adrien Prost-Boucle ha scritto: > I now have the same API for my implementation and for the code I've taken > from GMP. As a contribution to this discussion, I did the same thing, but the other direction. The attached code contains a copy of mpn_s