Re: GCD project status?
We now have almost complete results for the latest hgcd2.c tweaks. We have 5 div1 variants (as chosen by HGCD2_DIV1_METHOD). There are results and brief documentation here:

  https://gmplib.org/devel/thres/HGCD2_DIV1_METHOD.html

About half of our test machines prefer '/', i.e., whatever word division the compiler comes up with. About one quarter prefer method 3, which handles n/d < 8 with plain, branch-free code, but falls back to '/' for larger quotients. About one quarter prefer the table-driven inverse-and-multiply method, which is in practice branch-free. The even-numbered methods are clearly unpopular.

-- Torbjörn

Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel
Re: GCD project status?
ni...@lysator.liu.se (Niels Möller) writes:

> Hmm. Definitely worth a try. But if we need explicit loads and stores
> from the structs, we'll not save that many instructions.

x86 can handle operate-from-memory at almost the same cost as operate-from-register. But operate-to-memory is expensive. Load/store architectures typically reach their peak execution throughput only with a mixture of loads and operates.

> Each iteration needs to load all the values but store only half of
> them, so for each pair of values load + load + store, compared to mov,
> xor, and, xor, xor for conditionally swapping using a mask.

You might want to explicitly load some of the values into registers. (And perhaps use the "restrict" keyword for the swappable pointers... but I am afraid that we'd be stepping outside the vague semantics of restrict.)

>> Some measurements with method 4 and 5 are now in. Modern Intel CPUs
>> like method 5, as I had expected.
>
> Nice! With a few % margin over method 3.

8 configs now vouch for method 5. And method 4 got its first "honourable mention"; beagle thinks it is 2nd best. :-)

I don't think method 4 will see much use unless we find a way to radically improve the applicability of its large tables. I made table size 2048 the default, just for testing purposes. The next smaller table size is 512 bytes, which is a more reasonable size. It gets an 87% hit rate. I'd say we need that to get beyond 95% for method 4 to become viable.

-- Torbjörn
Re: GCD project status?
t...@gmplib.org (Torbjörn Granlund) writes:

> Perhaps keeping the to-be-swapped variables in two structs, and instead
> conditionally swap pointers to the structs?

Hmm. Definitely worth a try. But if we need explicit loads and stores from the structs, we'll not save that many instructions. Each iteration needs to load all the values but store only half of them, so for each pair of values load + load + store, compared to mov, xor, and, xor, xor for conditionally swapping using a mask.

> Some measurements with method 4 and 5 are now in. Modern Intel CPUs
> like method 5, as I had expected.

Nice! With a few % margin over method 3.

Regards, /Niels

-- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance.
Re: GCD project status?
ni...@lysator.liu.se (Niels Möller) writes:

> $ ./tune/speed -c -s1 -p10 mpn_hgcd2_1 mpn_hgcd2_2 mpn_hgcd2_3 mpn_hgcd2_4 mpn_hgcd2_5 mpn_hgcd2_binary
> overhead 6.02 cycles, precision 10 units of 8.33e-10 secs, CPU freq 1200.00 MHz
>        mpn_hgcd2_1  mpn_hgcd2_2  mpn_hgcd2_3  mpn_hgcd2_4  mpn_hgcd2_5  mpn_hgcd2_binary
> 1         #1668.90      1863.72      1670.73      1757.54      1738.50       2044.25
>
> Had a look at the disassembly for the binary algorithm. The
> double-precision loop needs 20 instructions for just the conditional
> swap logic, 23 for the clz + shift + subtract, 8 for the shift+add
> updates of the u matrix.

Perhaps keeping the to-be-swapped variables in two structs, and instead conditionally swap pointers to the structs?

Some measurements with method 4 and 5 are now in. Modern Intel CPUs like method 5, as I had expected.

-- Torbjörn
Re: GCD project status?
ni...@lysator.liu.se (Niels Möller) writes:

> Below is one variant, that seems to pass tests. I haven't done any
> benchmarks yet.

Just gave it a try with tune/speed, together with all the latest div1 variants. This one on my laptop, and not looking that great:

$ ./tune/speed -c -s1 -p10 mpn_hgcd2_1 mpn_hgcd2_2 mpn_hgcd2_3 mpn_hgcd2_4 mpn_hgcd2_5 mpn_hgcd2_binary
overhead 6.02 cycles, precision 10 units of 8.33e-10 secs, CPU freq 1200.00 MHz
       mpn_hgcd2_1  mpn_hgcd2_2  mpn_hgcd2_3  mpn_hgcd2_4  mpn_hgcd2_5  mpn_hgcd2_binary
1         #1668.90      1863.72      1670.73      1757.54      1738.50       2044.25

Had a look at the disassembly for the binary algorithm. The double-precision loop needs 20 instructions for just the conditional swap logic, 23 for the clz + shift + subtract, and 8 for the shift+add updates of the u matrix.

By doing another conditional subtraction in each iteration, I get down to 1865 cycles. See the version below. In the case acnt == bcnt, the condition for this is always false. But when acnt < bcnt, we have

  2^{k-1} <= a < 2^k
  2^{k-2} <= 2^s b < 2^{k-1}

so floor (a / (2^s b)) can be 1, 2, or 3. So one could consider yet another conditional subtraction (but then q = 3 2^s is no longer just a single shift and add).

If I also try the SIMD trick, I get down to 1790 cycles, but I'd need a special middle iteration to make it correct. As long as both a, b >= 2^96 (more than one and a half limbs), the matrix coefficients fit in half a limb. But we don't switch to single precision until a, b < 2^96. So we'll have one or more iterations with a >= 2^96 > b, and there u01 and u11 may exceed half a limb. It may still be worth adding that middle loop, for a 5% speedup.

Seems hard to beat method 1 on this hardware, but binary may well be a winner on machines with slow multiply and slow division but fast count_leading_zeros.
Regards, /Niels

#define SWAP_MASK(mask, x, y) do {			\
    mp_limb_t swap_xor = ((x) ^ (y)) & mask;	\
    (x) ^= swap_xor;				\
    (y) ^= swap_xor;				\
  } while (0)

int
mpn_hgcd2 (mp_limb_t ah, mp_limb_t al, mp_limb_t bh, mp_limb_t bl,
	   struct hgcd_matrix1 *M)
{
  mp_limb_t u00, u01, u10, u11, swapped;

  if (ah < 2 || bh < 2)
    return 0;

  if (ah > bh || (ah == bh && al > bl))
    {
      sub_ddmmss (ah, al, ah, al, bh, bl);
      if (ah < 2)
	return 0;

      u00 = u01 = u11 = 1;
      u10 = 0;
    }
  else
    {
      sub_ddmmss (bh, bl, bh, bl, ah, al);
      if (bh < 2)
	return 0;

      u00 = u10 = u11 = 1;
      u01 = 0;
    }

  swapped = 0;

  for (;;)
    {
      mp_limb_t mask, dh, dl;
      int acnt, bcnt, shift;

      if (ah == bh)
	goto done;

      mask = -(mp_limb_t) (ah < bh);
      swapped ^= mask;
      SWAP_MASK (mask, ah, bh);
      SWAP_MASK (mask, al, bl);
      SWAP_MASK (mask, u00, u01);
      SWAP_MASK (mask, u10, u11);

      ASSERT (ah > bh);

      if (ah < (CNST_LIMB(1) << (GMP_LIMB_BITS / 2)))
	{
	  ah = (ah << (GMP_LIMB_BITS / 2)) + (al >> (GMP_LIMB_BITS / 2));
	  bh = (bh << (GMP_LIMB_BITS / 2)) + (bl >> (GMP_LIMB_BITS / 2));
	  break;
	}

      count_leading_zeros (acnt, ah);
      count_leading_zeros (bcnt, bh);
      shift = bcnt - acnt;
      ASSERT (shift >= 0);

      /* Avoid subtraction underflow */
      shift -= (shift > 0);

      dh = (bh << shift) | ((bl >> 1) >> (GMP_LIMB_BITS - 1 - shift));
      dl = bl << shift;
      sub_ddmmss (ah, al, ah, al, dh, dl);

      mask = -(mp_limb_t) (ah > dh);
      shift -= mask;
      sub_ddmmss (ah, al, ah, al, mask & dh, mask & dl);

      if (ah < 2)
	{
	  /* A is too small, so decrement q. */
	  mp_limb_t q = ((mp_limb_t) 1 << shift) - 1;
	  u01 += q * u00;
	  u11 += q * u10;
	  goto done;
	}

      u01 += (u00 << shift);
      u11 += (u10 << shift);
    }

  /* Single-precision loop */
  for (;;)
    {
      mp_limb_t mask, dh;
      int acnt, bcnt, shift;

      if (ah == bh)
	break;

      mask = -(mp_limb_t) (ah < bh);
      swapped ^= mask;
      SWAP_MASK (mask, ah, bh);
      SWAP_MASK (mask, u00, u01);
      SWAP_MASK (mask, u10, u11);

      ASSERT (ah > bh);

      count_leading_zeros (acnt, ah);
      count_leading_zeros (bcnt, bh);
      shift = bcnt - acnt;
      ASSERT (shift >= 0);

      /* Avoid subtraction underflow */
      shift -= (shift > 0);

      dh = bh << shift;
      ah -= dh;
      mask = -(mp_limb_t) (ah > dh);
      shift -= mask;
      ah -= mask & dh;

      if (ah < (CNST_LIMB(1) << (GMP_LIMB_BITS / 2 + 1)))
	{
	  /* A is too small, so decrement q. */
	  mp_limb_t q = ((mp_limb_t) 1 << shift) - 1;
	  u01 += q * u00;
	  u11 += q * u10;
	  break;
	}

      u01 += (u00 << shift);
      u11 += (u10 << shift);
    }

 done:
  SWAP_MASK (swapped, u00, u01);
Re: GCD project status?
t...@gmplib.org (Torbjörn Granlund) writes:

> I went ahead and pushed a hgcd2.c with _METHOD = 4 and _METHOD 5. The
> former tables quotients and needs one multiply per div1 call and the
> latter tables inverses and needs 2 multiplies per div1 call.

I pushed improved versions. Instead of detecting potentially problematic operands ahead of time, both methods now try their division-free algorithm, carefully avoiding overflow. If the remainder is not in range, they fall back to division.

> The quotient table method isn't pretty. It can surely be improved. It
> uses a very large table and gets poor throughput. It also has only
> about 80% branch hit rate, which is not good enough.

Now 90%. But the table is still the same huge size.

> The inverse table method is more mature, I believe. It has much smaller
> tables and with the default table size (256 2-byte entries) it gets a
> branch hit rate of 97%. Doubling the table results in 98.5% rate.

Now 99.74% for the smaller table. The larger table is no longer useful and is gone. The default is now a 64-byte table with 99.5% division avoidance.

It is easier to see that the new variants are correct; if the few arithmetic operations do not overflow (which I claim to be true) and if the remainder is in range [0,d-1], then the result will be correct.

-- Torbjörn
Re: GCD project status?
ni...@lysator.liu.se (Niels Möller) writes:

> I see two ways. Either just do an inversion table, which will cost an
> additional multiplication when used. Then one could handle quotients of
> up to 7 or 8 bits with just 256 bytes of table.
>
> Or shift both ni and di, and get the quotient with something like
>
>   q = tabp[(ni << NBITS) + di] >> (NBITS - (dcnt - ncnt))
>
> Then one could handle quotients up to 5 bits with a table of 1024 bytes
> (indexing by 5 significant bits of each of n and d, excluding the most
> significant one bit).

I went ahead and pushed a hgcd2.c with _METHOD = 4 and _METHOD 5. The former tables quotients and needs one multiply per div1 call, and the latter tables inverses and needs 2 multiplies per div1 call.

The quotient table method isn't pretty. It can surely be improved. It uses a very large table and gets poor throughput. It also has only about 80% branch hit rate, which is not good enough. Niels suggested a variant with better progress (I get a quotient of at most NBITS/2 bits in the fast path while Niels expects NBITS) when extracting NBITS numerator and NBITS quotient bits. I would be happy to see a variant with NBITS progress.

The inverse table method is more mature, I believe. It has much smaller tables, and with the default table size (256 2-byte entries) it gets a branch hit rate of 97%. Doubling the table results in a 98.5% rate.

Benchmark? The inverse table method seems to be the fastest div1 thus far for some current Intel processors.

-- Torbjörn
Re: GCD project status?
t...@gmplib.org (Torbjörn Granlund) writes:

> I had not realised you meant left-to-right. I thought you talked about
> right-to-left. Interesting!

Below is one variant, that seems to pass tests. I haven't done any benchmarks yet.

All the SWAP_MASK calls look a bit expensive, should be 4 instructions each. Not sure if there's any way to do the swap in assembly that's much cheaper; the obvious alternative would be one mov and two cmov.

When doing this, I realized one could do a kind of SIMD optimization for the double-precision loop (not yet tried). In the double-precision loop, I think the matrix elements u00, u01, u10, u11 all fit in half a limb each. So one could let u00, u10 share one limb, say u0010, using the low and high halves, and let u01, u11 share a single limb u0111. Then

  u01 += q * u00;
  u11 += q * u10;

could be replaced with the single multiply

  u0111 += q * u0010;

And for the binary version below, we'd need only three SWAP_MASK per iteration, in both the double-limb and the single-limb loops.
Regards, /Niels

#define SWAP_MASK(mask, x, y) do {			\
    mp_limb_t swap_xor = ((x) ^ (y)) & mask;	\
    (x) ^= swap_xor;				\
    (y) ^= swap_xor;				\
  } while (0)

int
mpn_hgcd2 (mp_limb_t ah, mp_limb_t al, mp_limb_t bh, mp_limb_t bl,
	   struct hgcd_matrix1 *M)
{
  mp_limb_t u00, u01, u10, u11, swapped;

  if (ah < 2 || bh < 2)
    return 0;

  if (ah > bh || (ah == bh && al > bl))
    {
      sub_ddmmss (ah, al, ah, al, bh, bl);
      if (ah < 2)
	return 0;

      u00 = u01 = u11 = 1;
      u10 = 0;
    }
  else
    {
      sub_ddmmss (bh, bl, bh, bl, ah, al);
      if (bh < 2)
	return 0;

      u00 = u10 = u11 = 1;
      u01 = 0;
    }

  swapped = 0;

  for (;;)
    {
      mp_limb_t mask, dh, dl;
      int acnt, bcnt, shift;

      if (ah == bh)
	goto done;

      mask = -(mp_limb_t) (ah < bh);
      swapped ^= mask;
      SWAP_MASK (mask, ah, bh);
      SWAP_MASK (mask, al, bl);
      SWAP_MASK (mask, u00, u01);
      SWAP_MASK (mask, u10, u11);

      ASSERT_ALWAYS (ah > bh);

      if (ah < (CNST_LIMB(1) << (GMP_LIMB_BITS / 2)))
	{
	  ah = (ah << (GMP_LIMB_BITS / 2)) + (al >> (GMP_LIMB_BITS / 2));
	  bh = (bh << (GMP_LIMB_BITS / 2)) + (bl >> (GMP_LIMB_BITS / 2));
	  break;
	}

      count_leading_zeros (acnt, ah);
      count_leading_zeros (bcnt, bh);
      shift = bcnt - acnt;
      ASSERT_ALWAYS (shift >= 0);

      /* Avoid subtraction underflow */
      shift -= (shift > 0);

      dh = (bh << shift) | ((bl >> 1) >> (GMP_LIMB_BITS - 1 - shift));
      dl = bl << shift;
      sub_ddmmss (ah, al, ah, al, dh, dl);

      if (ah < 2)
	{
	  /* A is too small, so decrement q. */
	  mp_limb_t q = ((mp_limb_t) 1 << shift) - 1;
	  u01 += q * u00;
	  u11 += q * u10;
	  goto done;
	}

      u01 += (u00 << shift);
      u11 += (u10 << shift);
    }

  /* Single-precision loop */
  for (;;)
    {
      mp_limb_t mask;
      int acnt, bcnt, shift;

      if (ah == bh)
	break;

      mask = -(mp_limb_t) (ah < bh);
      swapped ^= mask;
      SWAP_MASK (mask, ah, bh);
      SWAP_MASK (mask, u00, u01);
      SWAP_MASK (mask, u10, u11);

      ASSERT_ALWAYS (ah > bh);

      count_leading_zeros (acnt, ah);
      count_leading_zeros (bcnt, bh);
      shift = bcnt - acnt;
      ASSERT_ALWAYS (shift >= 0);

      /* Avoid subtraction underflow */
      shift -= (shift > 0);

      ah -= bh << shift;

      if (ah < (CNST_LIMB(1) << (GMP_LIMB_BITS / 2 + 1)))
	{
	  /* A is too small, so decrement q. */
	  mp_limb_t q = ((mp_limb_t) 1 << shift) - 1;
	  u01 += q * u00;
	  u11 += q * u10;
	  break;
	}

      u01 += (u00 << shift);
      u11 += (u10 << shift);
    }

 done:
  SWAP_MASK (swapped, u00, u01);
  SWAP_MASK (swapped, u10, u11);

  ASSERT_ALWAYS (u00 * u11 - u01 * u10 == 1);

  M->u[0][0] = u00; M->u[0][1] = u01;
  M->u[1][0] = u10; M->u[1][1] = u11;

  return 1;
}

-- Niels Möller.
Re: GCD project status?
Nice!

-- Torbjörn
Re: GCD project status?
ni...@lysator.liu.se (Niels Möller) writes:

> Hmm. So the hgcd2-x-y.c would be there also for speed, and hgcd2.c would
> have something like
>
> #if TUNE_PROGRAM_BUILD
> int mpn_hgcd2(...)
> {
>   return (*hgcd2_method_pointer)(...);
> }

See below patch.

diff -r 3e04a9bbba13 gmp-impl.h
--- a/gmp-impl.h	Fri Sep 20 20:36:30 2019 +0200
+++ b/gmp-impl.h	Sun Sep 22 10:19:06 2019 +0200
@@ -4949,6 +4949,10 @@
 #define MATRIX22_STRASSEN_THRESHOLD matrix22_strassen_threshold
 extern mp_size_t matrix22_strassen_threshold;
 
+typedef int hgcd2_func_t (mp_limb_t, mp_limb_t, mp_limb_t, mp_limb_t,
+			  struct hgcd_matrix1 *);
+extern hgcd2_func_t *hgcd2_func;
+
 #undef HGCD_THRESHOLD
 #define HGCD_THRESHOLD hgcd_threshold
 extern mp_size_t hgcd_threshold;
diff -r 3e04a9bbba13 tune/Makefile.am
--- a/tune/Makefile.am	Fri Sep 20 20:36:30 2019 +0200
+++ b/tune/Makefile.am	Sun Sep 22 10:19:06 2019 +0200
@@ -96,7 +96,7 @@
 speed_ext_SOURCES = speed-ext.c
 speed_ext_LDFLAGS = $(STATIC)
 
-tuneup_SOURCES = tuneup.c
+tuneup_SOURCES = tuneup.c hgcd2.c
 nodist_tuneup_SOURCES = sqr_basecase.c fac_ui.c $(TUNE_MPN_SRCS)
 tuneup_DEPENDENCIES = $(TUNE_SQR_OBJ) libspeed.la
 tuneup_LDADD = $(tuneup_DEPENDENCIES) $(TUNE_LIBS)
diff -r 3e04a9bbba13 tune/hgcd2.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tune/hgcd2.c	Sun Sep 22 10:19:06 2019 +0200
@@ -0,0 +1,49 @@
+/* mpn/generic/hgcd2.c for tuning
+
+Copyright 2019 Free Software Foundation, Inc.
+
+This file is part of the GNU MP Library.
+
+The GNU MP Library is free software; you can redistribute it and/or modify
+it under the terms of either:
+
+  * the GNU Lesser General Public License as published by the Free
+    Software Foundation; either version 3 of the License, or (at your
+    option) any later version.
+
+or
+
+  * the GNU General Public License as published by the Free Software
+    Foundation; either version 2 of the License, or (at your option) any
+    later version.
+
+or both in parallel, as here.
+
+The GNU MP Library is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received copies of the GNU General Public License and the
+GNU Lesser General Public License along with the GNU MP Library.  If not,
+see https://www.gnu.org/licenses/.  */
+
+#define TUNE_PROGRAM_BUILD 1
+
+#include "gmp-impl.h"
+
+hgcd2_func_t mpn_hgcd2_default;
+
+hgcd2_func_t *hgcd2_func = &mpn_hgcd2_default;
+
+int
+mpn_hgcd2 (mp_limb_t ah, mp_limb_t al, mp_limb_t bh, mp_limb_t bl,
+	   struct hgcd_matrix1 *M)
+{
+  return hgcd2_func (ah, al, bh, bl, M);
+}
+
+#undef mpn_hgcd2
+#define mpn_hgcd2 mpn_hgcd2_default
+
+#include "mpn/generic/hgcd2.c"
diff -r 3e04a9bbba13 tune/tuneup.c
--- a/tune/tuneup.c	Fri Sep 20 20:36:30 2019 +0200
+++ b/tune/tuneup.c	Sun Sep 22 10:19:06 2019 +0200
@@ -716,8 +716,11 @@
    select the fastest.  Since *_METHOD defines start numbering from one,
    if functions[i] is fastest, the value of the define is i+1.  Also
    output a comment with speedup compared to the next fastest
-   function.  The NAME argument is used only for trace output.  */
-void
+   function.  The NAME argument is used only for trace output.
+
+   Returns the index of the fastest function.
+*/
+int
 one_method (int n, speed_function_t *functions,
             const char *name, const char *define,
             const struct param_t *param)
@@ -757,6 +760,7 @@
              t[method_runner_up] / t[method]);
 
   TMP_FREE;
+  return method;
 }
@@ -1958,15 +1962,17 @@
 tune_hgcd2 (void)
 {
   static struct param_t param;
-  speed_function_t f[3] =
-    {
-      speed_mpn_hgcd2_1,
-      speed_mpn_hgcd2_2,
-      speed_mpn_hgcd2_3,
-    };
+  hgcd2_func_t *f[3] =
+    { mpn_hgcd2_1, mpn_hgcd2_2, mpn_hgcd2_3 };
+  speed_function_t speed_f[3] =
+    { speed_mpn_hgcd2_1, speed_mpn_hgcd2_2, speed_mpn_hgcd2_3 };
+  int best;
 
   s.size = 1;
-  one_method (3, f, "mpn_hgcd2", "HGCD2_DIV1_METHOD", &param);
+  best = one_method (3, speed_f, "mpn_hgcd2", "HGCD2_DIV1_METHOD", &param);
+
+  /* Use selected function when tuning hgcd and gcd */
+  hgcd2_func = f[best];
 }
@@ -2236,9 +2242,6 @@
 void
 tune_div_qr_1 (void)
 {
-  static struct param_t param;
-  double t1, t2;
-
   if (!HAVE_NATIVE_mpn_div_qr_1n_pi1)
     {
       static struct param_t param;

-- Niels Möller.
Re: GCD project status?
t...@gmplib.org (Torbjörn Granlund) writes:

> With some thought, I am sure the table size could come down while the
> accuracy could improve.

I see two ways. Either just do an inversion table, which will cost an additional multiplication when used. Then one could handle quotients of up to 7 or 8 bits with just 256 bytes of table.

Or shift both ni and di, and get the quotient with something like

  q = tabp[(ni << NBITS) + di] >> (NBITS - (dcnt - ncnt))

Then one could handle quotients up to 5 bits with a table of 1024 bytes (indexing by 5 significant bits of each of n and d, excluding the most significant one bit).

> q0 = tabp[(ni << NBITS) + di];
> r0 = n0 - q0 * d0;
> mask = -(mp_limb_t) (r0 >= d0);
> q0 -= mask;
> r0 -= d0 & mask;

You round the tabulated q down, and adjust up. I guess one can save one cycle if one instead rounds it up and adjusts down,

  q0 = tabp[(ni << NBITS) + di];
  t = q0 * d0;
  r0 = n0 - t;
  mask = -(mp_limb_t) (n0 < t);
  q0 += mask;
  r0 += d0 & mask;

since the computation of mask and the initial r0 can then run in parallel (or be the same instruction, if the compiler is clever enough to use carry out). Or is there a risk of overflow in q0 * d0?

Regards, /Niels

-- Niels Möller.
Re: GCD project status?
> * Speed up div1, presumably with a table-based approach.

A quick-and-dirty but working variant below. The table size with NBITS = 8 is 32KiB. Far too big. It avoids branches in 83% of the cases, which is not that great considering the huge cost of mispredicted branches. With some thought, I am sure the table size could come down while the accuracy could improve.

#define NBITS 8

static unsigned char tab[1 << (2 * NBITS - 1)];
static unsigned char *tabp = tab - (1 << (NBITS - 1) << NBITS);

mp_double_limb_t
div1 (mp_limb_t n0, mp_limb_t d0)
{
  int ncnt;
  size_t ni, di;
  mp_limb_t q0;
  mp_limb_t r0;
  mp_limb_t mask;
  mp_double_limb_t res;

  count_leading_zeros (ncnt, n0);
  ni = n0 >> (64 - NBITS - ncnt);
  di = d0 >> (64 - NBITS - ncnt);

  if (UNLIKELY (di < (1 << NBITS/2)))
    {
      q0 = n0 / d0;
      r0 = n0 - q0 * d0;
    }
  else
    {
      q0 = tabp[(ni << NBITS) + di];
      r0 = n0 - q0 * d0;
      mask = -(mp_limb_t) (r0 >= d0);
      q0 -= mask;
      r0 -= d0 & mask;
    }

  res.d1 = q0;
  res.d0 = r0;
  return res;
}

void
init_tab ()
{
  size_t ni, di;
  for (ni = 1 << (NBITS - 1); ni < (1 << NBITS); ni++)
    {
      for (di = 0; di < (1 << NBITS); di++)
	{
	  tabp[(ni << NBITS) + di] =
	    (ni << NBITS) / ((di << NBITS) + ((1 << NBITS) - 1));
	}
    }
}

-- Torbjörn
Re: GCD project status?
t...@gmplib.org (Torbjörn Granlund) writes:

> Ideally, one would compile hgcd2.c in all possible variants (presumably
> through hand-crafted hgcd2-1-1.c, hgcd2-2-1.c, etc.), and then call the
> selected hgcd2's entry point through a function pointer for further
> measuring.

Hmm. So the hgcd2-x-y.c would be there also for speed, and hgcd2.c would have something like

  #if TUNE_PROGRAM_BUILD
  int mpn_hgcd2(...)
  {
    return (*hgcd2_method_pointer)(...);
  }

> I'm pretty confused about the algorithms used for gcdext. Is it at all
> helped by our hgcd2 improvements?

If you look at mpn_gcdext_lehmer_n, that's a loop around hgcd2, applying each matrix to the numbers being reduced using mpn_matrix22_mul1_inverse_vector, and to the cofactors using mpn_hgcd_mul_matrix1_vector. At the end, it calls mpn_gcdext_1. mpn_gcdext_22 is missing.

> What is suboptimal about gcdext for large operands?

Subquadratic gcd is a loop calling hgcd on the high part of the numbers (iirc, n/3 size), then applying the matrix also to the low part, repeating until the numbers are in Lehmer range. Current subquadratic gcdext does the same, but also builds up the cofactors along the way. So the numbers get smaller, and the cofactors larger.

For one, that makes the optimal splitting for the hgcd call complicated; one should take into account not only the size of the numbers, but also the size of the cofactors. That's done in an ad hoc way: different splitting for the very first iteration, then a fixed size ratio. More fundamentally, the multiplications used to build up the cofactors get more and more unbalanced for each iteration. That doesn't seem healthy for a divide-and-conquer algorithm.

I suspect a better way is to do it like this: Given A, B of size n, split off the high part, say size n/2 or so. Call hgcd, so we get reduced numbers a, b, with (A;B) = M (a;b), where M is of size n/4 and a, b of size 3n/4.

Recursively compute gcdext(a,b), to get g = s a + t b. Then construct one or both cofactors for g = S A + T B based on s, t and the matrix M. Then we'll reduce the numbers in size as we recurse down, and build up the cofactors as we return back, without the multiplications getting increasingly unbalanced. But I haven't tried it out.

>> * Try out left-to-right binary hgcd2.
>
> I had not realised you meant left-to-right. I thought you talked about
> right-to-left. Interesting!

There are a couple of different cases, I'm afraid. Say we have a of a_size bits and b of b_size bits.

If a_size == b_size, set {a, b} <-- {|a-b|, min(a,b)}. Like a binary step, but we don't divide out powers of two, and we eliminate at least one bit at the high end.

When a_size > b_size, then to avoid underflow I'm afraid we need

  a <-- a - 2^{a_size - b_size - 1} b

with progress of 0.5 bits or so. Or we could go halfway towards computing a quotient by using

  a <-- a - 2^{a_size - b_size} b

in case that is non-negative, which would give us better progress. And similarly for a_size < b_size.

So a fair amount of logic that needs to be branch-free. Updates of u0 and u1 are cheap, but they make any needed conditional swap more costly.

Regards, /Niels

-- Niels Möller.
Re: GCD project status?
ni...@lysator.liu.se (Niels Möller) writes:

> How much speedup have we achieved?

I don't know. I just observed the GCD_DC_THRESHOLD to change a lot, and that is a good sign.

> Looks like GCD_DC_THRESHOLD gets higher on many machines, but lower on
> a few? I now realize that the way tuneup currently works, tuning of
> HGCD_* and GCD_* does *not* take the tuned HGCD2_*_METHOD into
> account. To fix that, the tuneup build needs mpn_hgcd2 to jump via a
> function pointer. Or is there a better way to do that?

Dunno. Ideally, one would compile hgcd2.c in all possible variants (presumably through hand-crafted hgcd2-1-1.c, hgcd2-2-1.c, etc.), and then call the selected hgcd2's entry point through a function pointer for further measuring. Using a function for div1/div2 would disturb the measurements a lot, I think.

>> * Make gcdext great again (as great as gcd).
>
> That means asm gcdext_11 and gcdext_22? In the Lehmer range, I think
> it's as good as gcd. Then the subquadratic-range gcdext is suboptimal,
> but that's a different project.

Asm gcdext_11, _22, and perhaps some more should be done, I think.

I'm pretty confused about the algorithms used for gcdext. Is it at all helped by our hgcd2 improvements? What is suboptimal about gcdext for large operands?

> * Try out left-to-right binary hgcd2.

I had not realised you meant left-to-right. I thought you talked about right-to-left. Interesting!

> * Fix doc or implementation regarding mpn_gcd with even inputs.

OK.

-- Torbjörn
Re: GCD project status?
t...@gmplib.org (Torbjörn Granlund) writes:

> I feel we've achieved much of the possible speedup for gcd now.

How much speedup have we achieved?

> But what more can we do before we are completely done for now?
> Let me try to list it:
>
> * Add entry points for gcd_11 allowing even operand(s).
> * Add entry points for gcd_22 allowing even operand(s)?
> * Make generic/gcd_1.c call gcd_11 and get rid of */gcd_1.asm.

And review other call sites for mpn_gcd_1.

> * Consider adding gcd_33, etc, and also the basecase code Marco and
>   Torbjorn worked on.
>
> * Speed up div1, presumably with a table-based approach.
>
> * Copy the div1/div2 code to hgcd2_jacobi.c (or put it in its own file,
>   or in gmp-impl.h).
>
> * Tune HGCD_DIV2_METHOD?
>
> * Update lots of gmp-mparam.h files with HGCD_DIV{1,2}_METHOD and
>   GCD_DC_THRESHOLD (the latter has moved a lot!)

Looks like GCD_DC_THRESHOLD gets higher on many machines, but lower on a few?

I now realize that the way tuneup currently works, tuning of HGCD_* and GCD_* does *not* take the tuned HGCD2_*_METHOD into account. To fix that, the tuneup build needs mpn_hgcd2 to jump via a function pointer. Or is there a better way to do that?

> * Make gcdext great again (as great as gcd).

That means asm gcdext_11 and gcdext_22? In the Lehmer range, I think it's as good as gcd. Then the subquadratic-range gcdext is suboptimal, but that's a different project.

> Did I forget something?

* Try out left-to-right binary hgcd2.

* Fix doc or implementation regarding mpn_gcd with even inputs.

Regards, /Niels

-- Niels Möller.