Re: hgcd1/2

2019-10-07 Thread Torbjörn Granlund
"Marco Bodrato" writes: Not bad, is it? Not at all bad! Must we use masks to obtain branchless code? Or should we let the compiler choose? My experience is that the compiler cannot be trusted in this area; it may or may not use cmov and its reasons for doing it either way are subtle.

Re: hgcd1/2

2019-10-04 Thread Marco Bodrato
Ciao Il Gio, 5 Settembre 2019 10:33 am, Torbjörn Granlund ha scritto: > One of my previous proposal looks like this, when put into its right form: > > mp_limb_t > div1 (mp_limb_t n0, mp_limb_t d0) > { > if (UNLIKELY ((d0 >> 61) != 0)) > return n0 / d0; > > if (UNLIKELY (n0 >= (d0 << 3)))

Re: hgcd1/2

2019-09-17 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Unit tests would be nice. I think the tests/mpz/t-gcd.c does exercises the large quotient cases. I cobbled together this: test-div2.c Description: Binary data BTW, I wonder if it makes sense with HGCD2_DIV2_METHOD == 3 similar to

Re: hgcd1/2

2019-09-17 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > What are the specs for div2? > > Surely n1 > 0 and d1 > 0. All variants need d1 > 0, method 2 also needs n1 > 0 (for count_leading_zeros). > Also N >= D? Method 2 needs clz(n1) <= clz(d1). Besides that, I think they can handle N < D, i.e., q == 0.

Re: hgcd1/2

2019-09-17 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > ni...@lysator.liu.se (Niels Möller) writes: > > I've made a quick try deleting it from the single-limb loop. See patch > below. Measurements are a bit noisy, but it looks like a slowdown when I > time it. With hgcd2 time increasing from 1220

Re: hgcd1/2

2019-09-17 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Ooops. Now it should be in. What are the specs for div2? Surely n1 > 0 and d1 > 0. Also N >= D? Does the new div2 always compute the "accurate" quotient, i.e., with the remainder R < D? I'm asking as I believe strongly in unit testing of these

Re: hgcd1/2

2019-09-17 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: I've made a quick try deleting it from the single-limb loop. See patch below. Measurements are a bit noisy, but it looks like a slowdown when I time it. With hgcd2 time increasing from 1220 cycles to 1290 (this time measured on broadwell), which

Re: hgcd1/2

2019-09-16 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes: > My guess is that special q=1 is often beneficial for the double-limb > loop, and rarely beneficial in the single-limb loop. But needs some > measurements. I've made a quick try deleting it from the single-limb loop. See patch below. Measurements are

Re: hgcd1/2

2019-09-16 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > ni...@lysator.liu.se (Niels Möller) writes: > > Pushed now, with minor changes and deletion of the "current" code > (method 2 above). > > It is not in the main repo yet. Ooops. Now it should be in. Regards, /Niels -- Niels Möller.

Re: hgcd1/2

2019-09-16 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Pushed now, with minor changes and deletion of the "current" code (method 2 above). It is not in the main repo yet. -- Torbjörn Please encrypt, key id 0xC8601622 ___ gmp-devel mailing list

Re: hgcd1/2

2019-09-16 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes: >> I'm appending another iteration of the patch to add div2 function based >> on div1 on the high limbs. Selected via HGCD2_DIV2_METHOD. Benchmarks: >> >> HGCD2_DIV2_METHOD mpn_hgcd2_1 mpn_hgcd2_2 mpn_hgcd2_3 >> 1#1504.47

Re: hgcd1/2

2019-09-16 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: > I've written several div1 in asm (arm v5 method 2, 64-bit arm v8 method > 2 and 3, 64-bit x86 method 2). Nice. x64-div_11-m2.asm Description: Binary data arm32-div_11-m2.asm Description: Binary data arm64-div_11-m2.asm Description:

Re: hgcd1/2

2019-09-16 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > Not exactly. I think running div1 like hgcd2 but without any of hgcd2 > bookkeeping would make some sense. I.e., feed div1 with the input of > Euclid's algorithm. To avoid skew from particular operands, perhaps > table 10 uniformly distrinuted

Re: hgcd1/2

2019-09-15 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: > So one could perhaps provide a set of uniformly distribted > random numbers for div1 and get adequate results? Are you saying that the quotient of two independently uniformly distributed numbers has a similar distribution as the euclidean

Re: hgcd1/2

2019-09-15 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > The distribution of operands are the same as for a plain, old Euclid, > isn't it? Should be. Or slightly different, if q = 1 is handeled specially. > So one could perhaps provide a set of uniformly distribted > random numbers for div1 and get

Re: hgcd1/2

2019-09-15 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: t...@gmplib.org (Torbjörn Granlund) writes: > Looking at when method 1 or 3 is faster than 2 is more interesting. > Method 1 and to some extent also method 3 would benefit from asm code, How would method 1 benefit from asm code? To use

Re: hgcd1/2

2019-09-15 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > Looking at when method 1 or 3 is faster than 2 is more interesting. > Method 1 and to some extent also method 3 would benefit from asm code, How would method 1 benefit from asm code? To use instructions providing both quotient and remainder? For

Re: hgcd1/2

2019-09-15 Thread Torbjörn Granlund
Two configs hesitantly chose HGCD2_DIV1_METHOD = 2. For k10/64, method 2 outperforms method 3 by 0.34%. For ARM Cortex-A8 method 2's advantage is 1.94%. Wow. Looking at when method 1 or 3 is faster than 2 is more interesting. Method 1 and to some extent also method 3 would benefit from asm

Re: hgcd1/2

2019-09-14 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Just pushed a rename, and a change of default to #define HGCD2_DIV1_METHOD 3 Will you update the threshold page accordingly? Done. There is no aliasing mechanism, so now the table is empty. -- Torbjörn Please encrypt, key id 0xC8601622

Re: hgcd1/2

2019-09-14 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > Next, I think we should go ahead with the rename HGCD2_METHOD to > DIV11_METHOD or possibly HGCD2_DIV1_METHOD. > > Feel free. Just pushed a rename, and a change of default to #define HGCD2_DIV1_METHOD 3 Will you update the threshold page

Re: hgcd1/2

2019-09-14 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: > I went ahead and committed that version, replacing the old > HGCD2_METHOD=2. I expect it is be the fastest method on some platform. Will be interesting to see results on thresholds. Nobody loves the new METHOD 2. :-( (Not many machines

Re: hgcd1/2

2019-09-13 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > t...@gmplib.org (Torbjörn Granlund) writes: > > I cooked a modern alternative: > > I went ahead and committed that version, replacing the old > HGCD2_METHOD=2. I expect it is be the fastest method on some platform. Will be interesting to see

Re: hgcd1/2

2019-09-13 Thread Torbjörn Granlund
t...@gmplib.org (Torbjörn Granlund) writes: I cooked a modern alternative: I went ahead and committed that version, replacing the old HGCD2_METHOD=2. I expect it is be the fastest method on some platform. (We might want to arrange for longlong.h to use lzcnt instead of bsr for modern AMD

Re: hgcd1/2

2019-09-11 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Sounds good. Give it a few days, and delete if it still looks slow everywhere. I cooked a modern alternative: static mp_double_limb_t div1 (mp_limb_t n0, mp_limb_t d0) { mp_double_limb_t res; int ncnt, dcnt, cnt; mp_limb_t q; mp_limb_t

Re: hgcd1/2

2019-09-10 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > Conclusion: The HGCD2_METHOD 2 code in its current form is not useful on > any of our platforms. (Perhaps we should wait a few days more as not > all hosts have reported results yet.) Sounds good. Give it a few days, and delete if it still looks

Re: hgcd1/2

2019-09-10 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Maybe I'm confused. And I think tradeoffs are different for the double-limb and the single-limb loops. Or I am confused. I have mostly thought of div1, less of div2. For the single-limb loop, I think special handling of q == 1 means we avoid

Re: hgcd1/2

2019-09-10 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > ni...@lysator.liu.se (Niels Möller) writes: > > > That should in particular be true if the q = 1 code is removed from > > hgcd2, for machines with good multiplication throughput. > > Can we remove that for everything but HGCD2_METHOD == 2? It

Re: hgcd1/2

2019-09-10 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: > I believe METHOD 1 could almost always be pointless if METHOD 3's > straight line code is made really efficient. I'd guess so too. But benchmarks are better than faith alone ;-) I have a lot of faith in benchmarking! :-) > That should in

Re: hgcd1/2

2019-09-09 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > I believe METHOD 1 could almost always be pointless if METHOD 3's > straight line code is made really efficient. I'd guess so too. But benchmarks are better than faith alone ;-) > That should in particular be true if the q = 1 code is removed from >

Re: hgcd1/2

2019-09-08 Thread Torbjörn Granlund
> Should we rename div1 to put it into the GMP name space? How about > mpn_div_11? (We could encode in its name that it is for use for small > q, but I'd suggest that we don't do that.) > > That would allow for some assembly experiments. If you think assembly will make a big

Re: hgcd1/2

2019-09-07 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > Should we rename div1 to put it into the GMP name space? How about > mpn_div_11? (We could encode in its name that it is for use for small > q, but I'd suggest that we don't do that.) > > That would allow for some assembly experiments. If you think

Re: hgcd1/2

2019-09-07 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Keep in mind that the optimizations to div1 so far only help the second, cheaper, half of hgcd2. The first half, working with double precision, still uses the div2 function, which is full of branches. I know, but the number I mainly look at is

Re: hgcd1/2

2019-09-07 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > I'm a bit surprised that the METHOD 3 div1 didn't yield more speedup. > The hgcd2 code is almost content with a 25 cycle / operation, which is > really, really odd to me. Keep in mind that the optimizations to div1 so far only help the second,

Re: hgcd1/2

2019-09-07 Thread Torbjörn Granlund
There was a bug in the "euclides 2-choice binary" function. Now, with the intended algorithm, we get almost 4 bits per iteration. Algorithm Iterations Iter/Bit Bit/Iter Max iter plain binary :378420598 0.6971.435 58 binary+-

Re: hgcd1/2

2019-09-07 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > ni...@lysator.liu.se (Niels Möller) writes: > > The current _METHOD 3 code starts with > > if (UNLIKELY ((d0 >> (GMP_LIMB_BITS - 3)) != 0) > || UNLIKELY (n0 >= (d0 << 3))) > > The first condition isn't that unlikely, though, and will

Re: hgcd1/2

2019-09-07 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: I don't know much about itanic performance. It's not bad for GMP, but its division *libcall* needs some 150 cycles. I played with a METHOD = 3 div1 for Itanic as its predicated execution comes in very handy. The result is that one can do one

Re: hgcd1/2

2019-09-07 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > We have now about half of hosts asking for HGCD_METHOD = 1 as for > HGCD_METHOD = 3. Itanic wants HGCD_METHOD 2, and so does beagle (Arm > a8), but the latter with a preference of just 0.05% (over method 3). I don't know much aboutitanic

Re: hgcd1/2

2019-09-06 Thread Torbjörn Granlund
We have now about half of hosts asking for HGCD_METHOD = 1 as for HGCD_METHOD = 3. Itanic wants HGCD_METHOD 2, and so does beagle (Arm a8), but the latter with a preference of just 0.05% (over method 3). _METHOD 1 benefits from the special handling of q = 1 before invoking DIV1. _METHOD 2 & 3

Re: hgcd1/2

2019-09-05 Thread Laurent Desnogues
On Tue, Sep 3, 2019 at 10:49 AM Torbjörn Granlund wrote: > > ni...@lysator.liu.se (Niels Möller) writes: > > And this on a laptop with an Intel U4100 (5 years old?), so I'd assume > it doesn't have a particularly fast div instruction. Should we just > delete div1 ? On which architectures

Re: hgcd1/2

2019-09-05 Thread Torbjörn Granlund
We have more values for HGCD2_METHOD now: https://gmplib.org/devel/thres/HGCD2_METHOD.html The low-end ARM cores are no longer alone in desiring HGCD2_METHOD = 2. -- Torbjörn Please encrypt, key id 0xC8601622 ___ gmp-devel mailing list

Re: hgcd1/2

2019-09-05 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > Can div2 be done sloppily, e.g., by extracting the top 64 bits of the > dividend and the corresponding (up to) 64 bits of the divisor, and > invoke DIV1 on these bits? I think so. Result will be at most one off, except in the unlikely case that q >

Re: hgcd1/2

2019-09-05 Thread Torbjörn Granlund
Can div2 be done sloppily, e.g., by extracting the top 64 bits of the dividend and the corresponding (up to) 64 bits of the divisor, and invoke DIV1 on these bits? OK, one needs to return a non-negative remainder, so a little bit of adjustment might be needed. But does the remainder need to be

Re: hgcd1/2

2019-09-05 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: For tuned *_METHOD values, it might be useful to let tuneup also add a comment with next best method and margin, something like #define HGCD2_METHOD 2 /* 3.2% faster than method 1 */ That'd be good! I did one variant with table lookup and an

Re: hgcd1/2

2019-09-05 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > The table generator program has now run. Also pi1 wants the old code. > > The next question is how badly they want it. :-) For tuned *_METHOD values, it might be useful to let tuneup also add a comment with next best method and margin, something

Re: hgcd1/2

2019-09-05 Thread Torbjörn Granlund
One of my previous proposal looks like this, when put into its right form: mp_limb_t div1 (mp_limb_t n0, mp_limb_t d0) { if (UNLIKELY ((d0 >> 61) != 0)) return n0 / d0; if (UNLIKELY (n0 >= (d0 << 3))) return n0 / d0; else { mp_limb_t q, mask; d0 <<= 2; mask =

Re: hgcd1/2

2019-09-05 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: I had a quick look at the machines that completed tuning this night. These three seem to prefer the old code (HGCD2_METHOD == 2): armcortexa7neon-unknown-linux-gnueabihf pi2.gmplib.org-stat armcortexa8neon-unknown-linux-gnueabihf

Re: hgcd1/2

2019-09-04 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes: > Excellent. I just pushed tuning support, with HGCD2_METHOD == 1 using > plain division for div1, and HGCD2_METHOD == 2 (default) using the old > method with the div1 function. I had a quick look at the machines that completed tuning this night. These

Re: hgcd1/2

2019-09-04 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > ni...@lysator.liu.se (Niels Möller) writes: > > Pushed now. Plan is to next add a tuned HGCD2_METHOD, initially > controlling which div1 to use. Torbjörn, can you add that to the > thresholds web page once it's ready? > > I went ahead and added

Re: hgcd1/2

2019-09-04 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Pushed now. Plan is to next add a tuned HGCD2_METHOD, initially controlling which div1 to use. Torbjörn, can you add that to the thresholds web page once it's ready? I went ahead and added it. When data comes in, the subpage will become more

Re: hgcd1/2

2019-09-04 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes: > First step is to add support to speed. Does the below look reasonable? I > modelled it a bit on SPEED_ROUTINE_MODLIMB_INVERT, which also measures a > fix size, but I don't quite understand all of struct speed_params. Pushed now. Plan is to next add a

Re: hgcd1/2

2019-09-03 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > We might want to teach tuneup to choose between small quotient division > primitives., like we do for Jacobi. If we think there's place for several hgcd2 variants, that's definitely needed. First step is to add support to speed. Does the below look

Re: hgcd1/2

2019-09-03 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Maybe it makes sense to check this in now, deleting the old div1 and div2? If it turns out to be a regression (is that easy to see from nightly tests?), we can consider putting it back under some #if DIV1_METHOD ..., or replace with some other

Re: hgcd1/2

2019-09-03 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > ni...@lysator.liu.se (Niels Möller) writes: > > Here's a div2 function that does that. Together with the deleted div1, > it gives a 25% (!) speed up of gcd in the range 3-10 limbs, benchmarked > on my broadwell machine. And should do well on all

Re: hgcd1/2

2019-09-03 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Here's a div2 function that does that. Together with the deleted div1, it gives a 25% (!) speed up of gcd in the range 3-10 limbs, benchmarked on my broadwell machine. And should do well on all machines with decent small-quotient division.

Re: hgcd1/2

2019-09-03 Thread Torbjörn Granlund
I tested newer Intel systems too (Haswell, Skylake) and they all need around 25 cycles for a division n/d = 1. Intel Goldmont Plus (a current low-end CPU) is better, it needs about 12 cycles. AMD CPUs from the last 10 years all perform OK. It is funny that x86 vendors give division so little

Re: hgcd1/2

2019-09-03 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > So I think plain / is the way to go for certain systems! Then we should use that for the double limb loop too! It gets a bit tricky, since we need special handling of large quotients, but something like this should work: mp_limb_t q = ah / bh;

Re: hgcd1/2

2019-09-03 Thread Torbjörn Granlund
t...@gmplib.org (Torbjörn Granlund) writes: ni...@lysator.liu.se (Niels Möller) writes: > In that case, not so surprising that the div1 function loses. Do other architectures also have decent performance for small-quotient division? > I don't have the full picture, I'm afraid. > I

Re: hgcd1/2

2019-09-03 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Is there any even easier way to find out on which machines (if any) div1 improves performance? It's still a bit awkward to add tuning of HGCD_DIV1_METHOD and wait for the nightly builds. https://gmplib.org/devel/testsystems.html :-) SU4100$ ssh

Re: hgcd1/2

2019-09-03 Thread Torbjörn Granlund
t...@gmplib.org (Torbjörn Granlund) writes: count_leading_zeros (ncnt, n0) count_leading_zeros (dcnt, d0) ni = ... extract high k bits from n0 ... di = ... extract high \ell bits from d0 ... q = qtab[ni<<\ell+di]; ... adjust ... Well, perhaps q = qtab[ni<<\ell+di] << (ncnt - dcnt)

Re: hgcd1/2

2019-09-03 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: In that case, not so surprising that the div1 function loses. Do other architectures also have decent performance for small-quotient division? Do you think table lookup on high bits should beat 16 cycles? It needs to give good enough accuracy

Re: hgcd1/2

2019-09-03 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > Intel ark doesn't seem to know any processor called "U4100" so I cannot > figure out what generation it belongs to. gmp's config.guess classifies it as core2. cat /proc/cpuinfo says processor : 0 vendor_id : GenuineIntel cpu family

Re: hgcd1/2

2019-09-03 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes: > But... I get even better numbers if I keep the old code and just replace > the div1 function with plain division q = a / b: Attaching deletion patch. Tried it also on my broadwell machine. Benchmarking is less reliable there, but seems to give about

Re: hgcd1/2

2019-09-03 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: And this on a laptop with an Intel U4100 (5 years old?), so I'd assume it doesn't have a particularly fast div instruction. Should we just delete div1 ? On which architectures can we expect it to be beneficial? It should be fairly easy to find

Re: hgcd1/2

2019-09-03 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes: > I think it's promising, and also a bit simpler than the current div1 > code. Some concerns: > > 1. I think the approximate q means that we will often reduce from a > b >to a close to but larger than b, so we need two iterations to get to >a <

hgcd1/2

2019-09-01 Thread Niels Möller
I've been plaing a bit with table based hgcd2, but to keep things simple, I started with hgcd1 (reducing full limbs to half limbs, with an associated matrix of half limb elements). Complete code below, the main loop is for (;;) { mp_limb_t a_hi; mp_limb_t b_hi; mp_limb_t