sqrt algorithm (was: Re: Anomaly in mpn_sqrtrem and mpn_rottrem)
ni...@lysator.liu.se (Niels Möller) writes: So back to the drawing board. I had to redo the splitting a bit in sqrt_nm1, and write some special case code for n = 3. But now I have code that survives some testing. I'm attaching my current algorithm description and code. The code currently doesn't really exploit cancellation. Also, it computes divisions like floor (B^{k-1} E / 2H) by zero-padding, and calling mpn_tdiv_qr. To make this fast, we need some variant of divappr_q which don't require any of the uninteresting low limbs. Or alternatively, resurrect the notion of fraction limbs. Compare to bdiv_q functions, which computes N / D mod B^n, where both inputs and the output are n limbs. A divappr_q function ought to compute an approximation of N B^n / D, but possibly with a little mmore flexibility in the input and output sizes. The most flexible way is perhaps to mimic the fraction limbs-interface. Say we have inputs N = {np, nn}, D = {dp, dn}, and a scaling factor k, and compute an approximation of Q = floor (N B^k / D) of qn = nn + dn + 1 + k limbs. Or we could tie things together a bit more, by requiring that dn = nn = qn + 1 or so. For the divisions in the sqrt algorithms, I'm not sure of exactly how the sizes of the numerator E, the denominator H, and the quotient, relate, but they ought to all be pretty close to k. Regards, /Niels sqrt_nm1 (n, A): Input: A = {a_{n-1}, ..., a_0}, n = 2 Output: X = {x_{n-2}, ..., x_0} \approx sqrt(B^{n-2} A) if (n == 2) x_0 -- floor (sqrt (A)) return X = {x_0} if (n == 3) // Needs normalization c -- count_leading_zeros (a_2) ~1 A' -- 2^c A x_1 = floor (sqrt ({a'_2, a'_1})) E = A' - B x_1^2 return X = (B x_1 + floor (E / 2 x_1)) 2^{-c/2} k -- floor (n/2) H -- sqrt_nm1 (n - (k-1), {a_{n-1},...,a_{k-1}}) // n - (k-1) limbs if (n odd)// n = 2k+1 // We have a (k+1)-limb H, and produce k-1 low limbs. E -- B A - H^2 X -- B^{k-1} H + floor (B^{k-1} E / 2H) // XXX else // n = 2k // We have a k-limb H, and produce k-1 low limbs E -- A - H^2 X -- B^{k-1} H + floor (B^{k-1} E / 2H) return X sqrt_n (n, A): Input: A = {a_{n-1}, ..., a_0}, n = 1 Output: X = {x_{n-1}, ..., x_0} \approx sqrt(B^{n-1} A) if (n == 1) x_0 -- floor (sqrt (a_0)) return X = {x_0} if (n == 2) h -- floor (sqrt(A)) E -- A - h*h X -- sqrt(B) h + floor (sqrt(B) E / 2h) k -- floor (n/2) H -- sqrt_n(k+1, {a_{n-1},...,a_{n-1-k}}) // k+1 limbs if (n odd)// n = 2k+1 // We have (k+1)-limb H, and produce k low limbs E -- A - H^2 X -- B^k H + floor (B^k E / 2 H) else // n = 2k // We have (k+1)-limb H, and produce k-1 low limbs E -- B A - H^2 X -- B^{k-1} H + floor (B^{k-1} E / 2H) return X Size of remainder: Assume X = floor(sqrt(A)) = sqrt(A) - e, where A is n limbs. Define the remainder R = A - X^2. Then R = A - (sqrt(A) - e)^2 = 2 e sqrt(A) - e^2 = e (2 sqrt(A) - e) 2 sqrt(A) If n is even, n = 2k, then A - X^2 B^{k+1} If n is odd, n = 2k+1, then A - X^2 B^{k+1} So in all cases, R B^{floor(n/2) + 1}, and X B^{ceil(n/2)}, and with almost half a limb margin. Correctness test: 0 A - (X+1)^2 = A - X^2 - 2X - 1 = R - 2X - 1 R 2X + 1 R = 2X Small cases. Consider sqrtrem, size n. n = 1: sqrt_1 n = 2: sqrt_2 n = 3: sqrt_n(2) - sqrt_2 + special update n = 4: sqrt_nm1(3) - sqrt_nm1(2) - sqrt_2 n = 5: sqrt_n (3) - sqrt_n(2) - sqrt_2 + special update n = 6: sqrt_nm1 (4) - sqrt_nm1 (3) - sqrt_nm1(2) - sqrt_2 #include assert.h #include stdio.h #include stdlib.h #include gmp.h #if GMP_NUMB_BITS != 64 #error Unsupported limb size #endif #if !defined (__amd64__) #error Unsupported arch #endif /* From longlong.h */ typedef unsigned long int UDItype; #define add_ss(sh, sl, ah, al, bh, bl) \ __asm__ (addq %5,%q1\n\tadcq %3,%q0 \ : =r (sh), =r (sl) \ : 0 ((UDItype)(ah)), rme ((UDItype)(bh)), \ %1 ((UDItype)(al)), rme ((UDItype)(bl))) #define sub_ddmmss(sh, sl, ah, al, bh, bl) \ __asm__ (subq %5,%q1\n\tsbbq %3,%q0 \ : =r (sh), =r (sl) \ : 0 ((UDItype)(ah)), rme ((UDItype)(bh)), \ 1 ((UDItype)(al)), rme ((UDItype)(bl))) #define umul_ppmm(w1, w0, u, v) \ __asm__ (mulq %3 \ : =a (w0), =d (w1) \ : %0 ((UDItype)(u)), rm ((UDItype)(v))) #define udiv_qrnnd(q, r, n1, n0, dx) /* d renamed to dx avoiding =d */\ __asm__ (divq %4 /* stringification in KR C */ \ : =a (q), =d (r) \ : 0 ((UDItype)(n0)), 1 ((UDItype)(n1)), rm ((UDItype)(dx))) #define assert_no_carry(x) do { \ mp_limb_t __assert_cy = (x); \ assert (!__assert_cy); \ } while (0) #define assert_carry(x) do { \ mp_limb_t __assert_cy = (x); \ assert (__assert_cy); \ } while (0) static int mpn_zero_p (const mp_limb_t *xp, mp_size_t n) { mp_size_t i;
Re: Anomaly in mpn_sqrtrem and mpn_rottrem
bodr...@mail.dm.unipi.it writes: Ciao, Il Ven, 12 Giugno 2015 9:04 am, Torbjörn Granlund ha scritto: We might want to look into the plain division-free sqrt(A) = A*sqrt(1/A) approach before implementing a tricky division sqrt(A). We can try improving the current implementation, before implementing any other algorithm :-) You're wise. I wrote a simple patch (it touches very few lines) that allows skipping the final squaring when mpn_sqrtrem is called with a NULL argument and the approximation computed so far can not change. It exploits the spurious half-a-limb that current code adds to the result when the size is odd, i.e. it changes the timings only for odd sizes. Nice speedup! I suppose we really ought to add a limb also for even sizes (at least when computing with more than a few limbs) to allow for this type of speedup for even operand sizes too... -- Torbjörn Please encrypt, key id 0xC8601622 ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Anomaly in mpn_sqrtrem and mpn_rottrem
ni...@lysator.liu.se (Niels Möller) writes: Hmm. Or maybe this is stupid. I could stop insisting on using a full size inverse (so that A / x or E / x can be computed as a *single* mulhi), and instead work with a half-size inverse, so that the quotient is computed in two steps. Then inverse computation using a plain Newton-step per iteration should work fine. We indeed do that already when performing division for its own sake; compute an inverse which is not longer than half the quotient size. We might want to look into the plain division-free sqrt(A) = A*sqrt(1/A) approach before implementing a tricky division sqrt(A). -- Torbjörn Please encrypt, key id 0xC8601622 ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Anomaly in mpn_sqrtrem and mpn_rottrem
Ciao, Il Ven, 12 Giugno 2015 9:04 am, Torbjörn Granlund ha scritto: We might want to look into the plain division-free sqrt(A) = A*sqrt(1/A) approach before implementing a tricky division sqrt(A). We can try improving the current implementation, before implementing any other algorithm :-) I wrote a simple patch (it touches very few lines) that allows skipping the final squaring when mpn_sqrtrem is called with a NULL argument and the approximation computed so far can not change. It exploits the spurious half-a-limb that current code adds to the result when the size is odd, i.e. it changes the timings only for odd sizes. Before the patch: $ tune/speed -rp1 -s21-110 -f15 mpn_sqrt mpn_root.2 mpn_rootrem.2 mpn_sqrtrem overhead 0.2 secs, precision 1 units of 2.86e-10 secs, CPU freq 3500.08 MHz mpn_sqrtmpn_root.2 mpn_rootrem.2 mpn_sqrtrem 210.013801.61551.5569 #0.9996 315 #0.153741.12221.54261.1521 4725 0.000918521 #0.90001.24581.0004 70875 0.029574741 #0.96121.21921.0007 10631250.741147157 #0.97181.21470. Sometimes mpn_root.2 is faster than the other functios in a wide range With the patch applied: $ tune/speed.nsqrt -rp1 -s21-110 -f15 mpn_sqrt mpn_root.2 mpn_rootrem.2 mpn_sqrtrem overhead 0.2 secs, precision 1 units of 2.86e-10 secs, CPU freq 3500.08 MHz mpn_sqrtmpn_root.2 mpn_rootrem.2 mpn_sqrtrem 21 #0.013391.64961.60891.0244 315 #0.129241.33181.84321.3701 4725 #0.0007996071.03491.43121.1495 70875#0.0263216531.08041.36981.1253 1063125 #0.6563897051.09821.37361.1296 mpn_rot.2 is the faster -- http://bodrato.it/ ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Anomaly in mpn_sqrtrem and mpn_rottrem
Il Sab, 13 Giugno 2015 12:27 am, bodr...@mail.dm.unipi.it ha scritto: I wrote a simple patch (it touches very few lines) that allows skippingthe final I forgot the attachment... -- http://bodrato.it/ gmp.diff Description: plain/text ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Anomaly in mpn_sqrtrem and mpn_rottrem
t...@gmplib.org (Torbjörn Granlund) writes: ni...@lysator.liu.se (Niels Möller) writes: As far as I see, a plain Newton iteration on won't produce an x' inverse with 4n bits of accuracy. One would need aither two iterations, or some other trick, maybe mixing in some interpolation of x' - x. I don't think one should struggle in order to get n - 2n bits in a Newtonian iteration. It is enough to achieve n - 2n-k for some constant k. I don't think one can do that for any small k in this case. I'll try to describe the problem. In the Newton iteration for the square root, which extends the precision of x from n bits to (close to) 2n bits, I'd expect that we 2n bits of precision for all inputs, including the inverse of the input x. Do you agree? So as input, we need a 2n-bit approximative inverse of the n-bit input x. As output, we want the new 2n-bit x and a corresponding 4n-bit approximative inverse. A single newton iteration can produce a 4n-bit approximation of the *n*-bit x, but then we also have new x bits from the square root approximation to take into account. One way to get the needed inverse approximation is to 1. Adjust the 2n-bit inverse of n-bit x to a 2n-bit inverse of the new 2n-bit x. Could be done with a Newton step (which seems a bit overkill), or perhaps one can use some cheaper linear interpolation. 2. Do a Newton step to extend that inverse of the 2n-bit x to 4n bits of precision. Hmm. Or maybe this is stupid. I could stop insisting on using a full size inverse (so that A / x or E / x can be computed as a *single* mulhi), and instead work with a half-size inverse, so that the quotient is computed in two steps. Then inverse computation using a plain Newton-step per iteration should work fine. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Anomaly in mpn_sqrtrem and mpn_rottrem
t...@gmplib.org (Torbjörn Granlund) writes: Before I try to understand the rest of your reasoning: What is B? It's not the usual limb base, I presume, since then sqrt(B^{n-2}) = (B/2)^{n-2}, which I can compute completely without hard thinking (assuming the common case that B is a power of two)... Sorry, there was an A missing. The function sqrt_nm1 is intended to compute a close approximation of sqrt(B^{n-2} A), with n input limbs A and n-1 output limbs X. (Suggestions for better names are appreciated). And B is the usual limb base. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Anomaly in mpn_sqrtrem and mpn_rottrem
ni...@lysator.liu.se (Niels Möller) writes: Here's a sketch with some more details. I've tried to work out both sqrt(B^{n-1} A) and sqrt(B^{n-2}). To my surprise, they seem independent, not mutually recursive. Before I try to understand the rest of your reasoning: What is B? It's not the usual limb base, I presume, since then sqrt(B^{n-2}) = (B/2)^{n-2}, which I can compute completely without hard thinking (assuming the common case that B is a power of two)... -- Torbjörn Please encrypt, key id 0xC8601622 ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Anomaly in mpn_sqrtrem and mpn_rottrem
ni...@lysator.liu.se (Niels Möller) writes: I also had a quick look at the math, and I realized (some of you surely knew that already) that floor(sqrt(a)) is mostly independent of the lowest half of the bits of a. Some of us indeed knew that... And this generalises to kth roots. It is similar to the more familiar problem of an m-limb by n-limb division; the low n dividend limbs affect the quotient very little... The lowest half of the bits are needed only for computing the remainder, and a final adjustment-by-one step. I think the iteration should totally ignore the low half bits in each step. Would it make sense with an mpn_sqrt_appr which takes a n-limb normalized number A as input, and computes an n-limb square root approximation of B^{n-1} A, with some well defined error bound, of at most 2? (We might also need a function for the the (n-1)-limb sqrt of B^{n-2} A). And a corresponding mpn_root_appr would compute the kth root of B^{(k-1) n - 1} A, I think. I haven't looked enough at your propposal to have an informed opinion. I think mpn_rootrem's approach for when the remainder is not needed might be wise, perhaps wiser. It computes with one extra limb, then uses that to almost always return the correct result without any (partial) remainder computation. To speed sqrt (non-rem) we might want to do like root. But both functions should probably do that only for operands over say 10 limbs. Below that, some _appr approach will be better. And as written before, as k grows, we would benefit enormously from a pow_1_highpart. The average speedup potential is O(k). -- Torbjörn Please encrypt, key id 0xC8601622 ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Anomaly in mpn_sqrtrem and mpn_rottrem
t...@gmplib.org (Torbjörn Granlund) writes: I haven't looked enough at your propposal to have an informed opinion. Here's a sketch with some more details. I've tried to work out both sqrt(B^{n-1} A) and sqrt(B^{n-2}). To my surprise, they seem independent, not mutually recursive. sqrt_nm1 (n, A): Input: A = {a_{n-1}, ..., a_0}, n = 2 Output: X = {x_{n-2}, ..., x_0} \approx sqrt(B^{n-2} A) if (n == 2) x_0 -- floor (sqrt ({a_1, a_0})) return X = {x_0} k -- floor (n/2) H -- sqrt_nm1 (k+1, {a_{n-1},...,a_{n-1-k}}) // k limbs if (n odd)// n = 2k+1 E -- B H^2 - A X -- B^k H - floor (B^{k-1} E / 2H) else // n = 2k E -- H^2 - A X -- B^{k-1} H - floor (B^{k-1} E / 2H) return X sqrt_n (n, A): Input: A = {a_{n-1}, ..., a_0}, n = 1 Output: X = {x_{n-1}, ..., x_0} \approx sqrt(B^{n-1} A) if (n == 1) x_0 -- floor (sqrt (a_0)) return X = {x_0} k -- floor (n/2) H -- sqrt_n(k+1, {a_{n-1},...,a_{n-1-k}}) // k+1 limbs if (n odd)// n = 2k+1 E -- H^2 - A X -- H B^k - floor (B^k E / 2 H) else // n = 2k E -- H^2 - B A X -- B^{k-1} H - floor (B^{k-1} E / 2H) return X Note that all the computations of the error/residual E is subject to a lot of cancellation, so one can use mullo (for small sizes) or wraparound arithmetic using mulmod_bnm1 for larger sizes. And the divisions really ought to use divappr, rather than exact truncating division. I'm not sure how that interface works, but one ought be able to pass in numerators like, e.g., B^{k-1} E, without actually padding the numerator with zeros. I don't know how large errors one gets, from the above algorithms, but I would expect errors to be at most 2 or 3. For proper error analysis, one should also think about desired sign of the error. The newton iteration wants to converge from above, and on the other hand the inclusion of more bits of A wants X to converge from below. So maybe one should add some fudge factor into the rounding somewhere force monotonous convergence? Also sign and magnitude of the divappr errors must be taken into account. If we end up with an error E and corresponding X update of varying sign, that's no fundamental problem, but it will make the wraparound arithmetic a little more complex. Finally, I'm not sure if one can expect tricks to eliminate division to be a big win. Except for the smallest sizes, division, and in particular divappr, should only be a small constant slower than multiplication, right? The division free iteration for 1/sqrt(A) needs additional multiplications, and the trick mensioned the other day to compute an inverse incrementally, will also introduce additional multiplications. I'm thinking that an exact sqrt/sqrtrem for even size n will call sqrt_nm1 with n/2+1 input limbs producing an n-limb root. And for odd n, call sqrt_n with (n+1)/2 input limbs. In either case, we'll get at least one bit of margin, with the worst case being odd n and a[n-1] == 1. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Anomaly in mpn_sqrtrem and mpn_rottrem
bodr...@mail.dm.unipi.it writes: To compare them I wrote a quick and dirty specialization of the rootrem algorithm for the k==2 case (use sqr instead of mul, lshift instead of mul_1...) Cool! We should probably use sqr whenever possible, perhaps it is worth having a condition for this in rootrem.c, (I believe mpn_pow_1 gets it right, and uses sqr whenever possible, since I looked into that as an explanation.) $ tune/speed -p 1 -s 1-110 -f16 mpn_root.2 mpn_sqrt mpn_rootrem.2 mpn_sqrtrem overhead 0.2 secs, precision 1 units of 2.86e-10 secs, CPU freq 3500.08 MHz mpn_root.2 mpn_sqrt mpn_rootrem.2 mpn_sqrtrem 1 0.00518 0.00011 0.00516 #0.9 160.01103 0.00244 0.01117 #0.00244 256 0.09316 #0.08321 0.14106 0.08374 4096 #0.000633967 0.000659180 0.000894513 0.000654562 65536#0.026071234 0.026756061 0.033005667 0.026762939 1048576 0.732938835 0.734521311 0.962774086 #0.731488979 With the attached specialised function, rootrem.2 gains a 10% in speed and is faster than sqrt when the reminder is not needed and size grows. $ tune/speed -p 1 -s 1-110 -f16 mpn_root.2 mpn_sqrt mpn_rootrem.2 mpn_sqrtrem overhead 0.2 secs, precision 1 units of 2.86e-10 secs, CPU freq 3500.08 MHz mpn_root.2 mpn_sqrt mpn_rootrem.2 mpn_sqrtrem 1 0.00376 #0.00010 0.00383 0.00010 160.00884 #0.00241 0.00894 0.00243 256 #0.07743 0.08262 0.12138 0.08282 4096 #0.000563298 0.000657382 0.000816271 0.000650694 65536#0.023296001 0.026694774 0.030225570 0.026717529 1048576 #0.654117324 0.736489574 0.832002002 0.733319729 Your timing leaps hide some of the actual timing characteristics... Using tune/speed -r -p1000 -s16-1000 -f1.438 mpn_sqrt mpn_root.2 I get about 0.97 on average. And tune/speed -r -p1000 -s16-1000 -f1.438 mpn_sqrtrem mpn_rootrem.2 gives around 1.25. (In both cases without your rootrem hack.) I added some TRACE (printf...) in the code to show the primitives with a super-linear cost used in computation. The last functions called by mpn_sqrtrem when computing the root of a 2000 limb operand are: mpn_divrem nn=500/dn=250 - qn=250,rn=250 mpn_sqr an=250 ^ 2 - rn=500 mpn_divrem nn=1000/dn=500 - qn=500,rn=500 mpn_sqr an=500 ^ 2 - rn=1000 Similarly, for mpn_rootrem: mpn_div_q nn=500/dn=251 - qn=249 mpn_sqr an=501 ^ 2 - rn=1002 mpn_div_q nn=1000/dn=501 - qn=499 mpn_sqr an=1000 ^ 2 - rn=2000 Interesting indeed. I wonder why they need different multiply sizes. It uses div_q instead of divrem, but squares used by rootrem are doubled in size... The operation are similar if compared to sqrt (the order changes), probably div_q is faster than divrem. mpn_sqrt wins only when the huge speed-up given by the specialised code for the first limb is relevant... I am not sure that div_q vs divrem matters a whole lot for this usage (i.e., 2n/n sizes). -- Torbjörn Please encrypt, key id 0xC8601622 ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Anomaly in mpn_sqrtrem and mpn_rottrem
t...@gmplib.org (Torbjörn Granlund) writes: I am not sure that div_q vs divrem matters a whole lot for this usage (i.e., 2n/n sizes). Apparently wrong. The first column is n, division is 2n-limb by n-limb: mpn_divrem mpn_div_qratio 50.0009 0.00111.17 70.0012 0.00120.96 100.0014 0.00191.37 150.0028 0.00260.92 220.0056 0.00510.89 330.0099 0.00900.91 490.0215 0.01550.72 730.0444 0.03060.69 1090.0882 0.06270.71 1630.1767 0.12930.73 2440.3385 0.26160.77 3660.6530 0.50400.77 5490.00012581 0.94590.75 8230.00023138 0.000181170.78 12340.00042985 0.000336420.78 18510.00075588 0.000625480.83 27760.00135386 0.001071240.79 41640.00218069 0.001802420.83 62460.00375453 0.003008710.80 93690.00598697 0.004673050.78 140530.00964423 0.007633660.79 210790.01540152 0.012907300.84 316180.02312424 0.019592180.85 474270.03907867 0.032388670.83 711400.05960059 0.049842350.84 1067100.09717958 0.083781170.86 1600650.16891133 0.136253220.81 2400970.21165200 0.175965290.83 3601450.38509720 0.308413200.80 5402170.67585650 0.567553500.84 8103251.17571375 0.966970250.82 -- Torbjörn Please encrypt, key id 0xC8601622 ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Anomaly in mpn_sqrtrem and mpn_rottrem
Ciao, Il Mar, 9 Giugno 2015 7:02 pm, Torbjörn Granlund ha scritto: bodr...@mail.dm.unipi.it writes: We should probably use sqr whenever possible, perhaps it is worth having a condition for this in rootrem.c, (I believe mpn_pow_1 gets it right, and uses sqr whenever possible, since I looked into that as an mpn_pow_1 correctly uses sqr, the problem within rootrem arises only when k==2. The two lines of code: wn = mpn_pow_1 (wp, sp, sn, k - 1, qp); mpn_mul (qp, wp, wn, sp, sn); degenerate with k=2. mpn_pow_1 is used to perform an MPN_COPY so that the two operands passed to mpn_mul contain the same number... Maybe an initial condition if (k == 2) return mpn_sqrtrem (), but the specialised code for a single k should not get inside the generic rootrem function, in my opinion. I wrote the hack only to understand the algorithm... I'll clean up rootrem somehow, but I'll not insert special code. tune/speed -r -p1000 -s16-1000 -f1.438 mpn_sqrt mpn_root.2 I get about 0.97 on average. This can be healed using in sqrtrem the same strategy that rootrem uses to skip the last squaring if the reminder is not required: adding a couple of zero limbs and compute a slightly larger root... And tune/speed -r -p1000 -s16-1000 -f1.438 mpn_sqrtrem mpn_rootrem.2 gives around 1.25. This is not bad, a generic code can be slightly slower than a specialised one. The last functions called by mpn_sqrtrem when computing the root of a 2000 limb operand are: mpn_divrem nn=1000/dn=500 - qn=500,rn=500 mpn_sqr an=500 ^ 2 - rn=1000 Similarly, for mpn_rootrem: mpn_div_q nn=1000/dn=501 - qn=499 mpn_sqr an=1000 ^ 2 - rn=2000 Interesting indeed. I wonder why they need different multiply sizes. sqrtrem obtain the reminder with divrem and needs only the b^2 part of (ax+b)^2 to update it. rootrem rebuild the reminder from scratch every loop... because the generic (ax+b)^k has too many components. I am not sure that div_q vs divrem matters a whole lot for this usage (i.e., 2n/n sizes). The manual says, about mpn_divrem: This function is obsolete. Please call mpn_tdiv_qr instead for best performance. With a little scratch space passed to mpn_dc_sqrtrem, it's easy to stop using divrem, but... will the performance really benefit of this? Regards, m -- http://bodrato.it/toom-cook/ ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Anomaly in mpn_sqrtrem and mpn_rottrem
t...@gmplib.org (Torbjörn Granlund) writes: An alternative would be to keep the current algorithm, but use the fact that the previous divisor is a damn good staring approximation to the new one, and use Newton to maintain an inverse to it. I vaguely remember that idea. That would be a loop that maintains approximations of both x \approx sqrt(a) and v \approx 1/sqrt(a). It makes my head spin a bit to think about the precision and order of the needed updates. When we have a n bit approximation x, we need a 2n bit inverse v of x in order to compute the new x' with 2n bits of precision. For the next iteration, we then need a 4n-bit inverse v' of x'. As inputs for the v' update we have the old v (2n bits), the new square root approximation x' (2n bits), and we also have the old x (or the difference x' - x) available. As far as I see, a plain Newton iteration on won't produce an x' inverse with 4n bits of accuracy. One would need aither two iterations, or some other trick, maybe mixing in some interpolation of x' - x. I also had a quick look at the math, and I realized (some of you surely knew that already) that floor(sqrt(a)) is mostly independent of the lowest half of the bits of a. So its essentially an operation with n input bits and n output bits. Proof: sqrt(a+d) - sqrt(a) = d / (sqrt(a+d) + sqrt(a)) So if d = sqrt(a), then sqrt(a+d) - sqrt(a) 1/2. The lowest half of the bits are needed only for computing the remainder, and a final adjustment-by-one step. I think the iteration should totally ignore the low half bits in each step. Would it make sense with an mpn_sqrt_appr which takes a n-limb normalized number A as input, and computes an n-limb square root approximation of B^{n-1} A, with some well defined error bound, of at most 2? (We might also need a function for the the (n-1)-limb sqrt of B^{n-2} A). And a corresponding mpn_root_appr would compute the kth root of B^{(k-1) n - 1} A, I think. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Anomaly in mpn_sqrtrem and mpn_rottrem
Ciao, Il Lun, 8 Giugno 2015 10:15 am, Torbjörn Granlund ha scritto: ni...@lysator.liu.se (Niels Möller) writes: I've had a quick look. Both mpn_dc_sqrtrem and mpn_rootrem_internal (which seem to be the work horses for larger operands) do a division in the loop. The latter is organized as a newton iteration, the former isn't (but I guess the computation is more or less equivalent to a Newton iteration). To compare them I wrote a quick and dirty specialization of the rootrem algorithm for the k==2 case (use sqr instead of mul, lshift instead of mul_1...) Current speed (on shell) $ tune/speed -p 1 -s 1-110 -f16 mpn_root.2 mpn_sqrt mpn_rootrem.2 mpn_sqrtrem overhead 0.2 secs, precision 1 units of 2.86e-10 secs, CPU freq 3500.08 MHz mpn_root.2 mpn_sqrt mpn_rootrem.2 mpn_sqrtrem 1 0.00518 0.00011 0.00516 #0.9 160.01103 0.00244 0.01117 #0.00244 256 0.09316 #0.08321 0.14106 0.08374 4096 #0.000633967 0.000659180 0.000894513 0.000654562 65536#0.026071234 0.026756061 0.033005667 0.026762939 1048576 0.732938835 0.734521311 0.962774086 #0.731488979 With the attached specialised function, rootrem.2 gains a 10% in speed and is faster than sqrt when the reminder is not needed and size grows. $ tune/speed -p 1 -s 1-110 -f16 mpn_root.2 mpn_sqrt mpn_rootrem.2 mpn_sqrtrem overhead 0.2 secs, precision 1 units of 2.86e-10 secs, CPU freq 3500.08 MHz mpn_root.2 mpn_sqrt mpn_rootrem.2 mpn_sqrtrem 1 0.00376 #0.00010 0.00383 0.00010 160.00884 #0.00241 0.00894 0.00243 256 #0.07743 0.08262 0.12138 0.08282 4096 #0.000563298 0.000657382 0.000816271 0.000650694 65536#0.023296001 0.026694774 0.030225570 0.026717529 1048576 #0.654117324 0.736489574 0.832002002 0.733319729 I added some TRACE (printf...) in the code to show the primitives with a super-linear cost used in computation. The last functions called by mpn_sqrtrem when computing the root of a 2000 limb operand are: mpn_divrem nn=500/dn=250 - qn=250,rn=250 mpn_sqr an=250 ^ 2 - rn=500 mpn_divrem nn=1000/dn=500 - qn=500,rn=500 mpn_sqr an=500 ^ 2 - rn=1000 Similarly, for mpn_rootrem: mpn_div_q nn=500/dn=251 - qn=249 mpn_sqr an=501 ^ 2 - rn=1002 mpn_div_q nn=1000/dn=501 - qn=499 mpn_sqr an=1000 ^ 2 - rn=2000 It uses div_q instead of divrem, but squares used by rootrem are doubled in size... When we do not ask for the reminder mpn_rootrem (say mpn_root) skips the last squaring: [mpn_sqr an=251 ^ 2 - rn=502] mpn_div_q nn=501/dn=251 - qn=250 mpn_sqr an=501 ^ 2 - rn=1002 mpn_div_q nn=1002/dn=501 - qn=501 The operation are similar if compared to sqrt (the order changes), probably div_q is faster than divrem. mpn_sqrt wins only when the huge speed-up given by the specialised code for the first limb is relevant... Best regards, mb -- http://bodrato.it/papers//* mpn_rootrem(rootp,remp,ap,an,nth) -- Compute the nth root of {ap,an}, and store the truncated integer part at rootp and the remainder at remp. Contributed by Paul Zimmermann (algorithm) and Paul Zimmermann and Torbjorn Granlund (implementation). Marco Bodrato specialised mpn_sqrtrem_internal for square roots. THE FUNCTIONS IN THIS FILE ARE INTERNAL, AND HAVE MUTABLE INTERFACES. IT'S ONLY SAFE TO REACH THEM THROUGH DOCUMENTED INTERFACES. IN FACT, IT'S ALMOST GUARANTEED THAT THEY'LL CHANGE OR DISAPPEAR IN A FUTURE GNU MP RELEASE. Copyright 2002, 2005, 2009-2012, 2015 Free Software Foundation, Inc. This file is part of the GNU MP Library. The GNU MP Library is free software; you can redistribute it and/or modify it under the terms of: * the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version. The GNU MP Library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received copies of the GNU General Public License and the GNU Lesser General Public License along with the GNU MP Library. If not, see https://www.gnu.org/licenses/. */ /* FIXME: This implementation is not optimal when remp == NULL, since the complexity is M(n), whereas it should be M(n/k) on average. */ #include stdio.h /* for NULL */ #include gmp.h #include gmp-impl.h #include longlong.h #define TRACE(x) do {} while (0) static mp_size_t mpn_rootrem_internal (mp_ptr, mp_ptr, mp_srcptr, mp_size_t, mp_limb_t, int); static mp_size_t mpn_sqrtrem_internal (mp_ptr, mp_ptr, mp_srcptr, mp_size_t, int); #define MPN_RSHIFT(cy,rp,up,un,cnt) \ do { \ if ((cnt) != 0) \
Re: Anomaly in mpn_sqrtrem and mpn_rottrem
bodr...@mail.dm.unipi.it writes: The last functions called by mpn_sqrtrem when computing the root of a 2000 limb operand are: mpn_divrem nn=500/dn=250 - qn=250,rn=250 mpn_sqr an=250 ^ 2 - rn=500 mpn_divrem nn=1000/dn=500 - qn=500,rn=500 mpn_sqr an=500 ^ 2 - rn=1000 Similarly, for mpn_rootrem: mpn_div_q nn=500/dn=251 - qn=249 mpn_sqr an=501 ^ 2 - rn=1002 mpn_div_q nn=1000/dn=501 - qn=499 mpn_sqr an=1000 ^ 2 - rn=2000 It uses div_q instead of divrem, but squares used by rootrem are doubled in size... That explains why rootrem is a lot slower, right? Is that what needs explaining? When we do not ask for the reminder mpn_rootrem (say mpn_root) skips the last squaring: Which is where most of the slowdown was. probably div_q is faster than divrem. It should be. And if there's an adjustment step anyway, maybe it would be even better to use divappr? (speed seems to lack support for many of the interesting division functions). Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Anomaly in mpn_sqrtrem and mpn_rottrem
Ciao, Il Lun, 8 Giugno 2015 10:15 am, Torbjörn Granlund ha scritto: I played some years ago with a loop computing A^(-0.5), which naturally becomes division-free. An alternative would be to keep the current algorithm, but use the fact that the previous divisor is a damn good staring approximation to the new one, and use Newton to maintain an inverse to it. If we use the rootrem algorithm for sqrt, we could even try to update the squared value, we have s and s^2, we update s to sc+r, and we update the squared value (sc+r)^2 shifting s^2 and adding 2src + r^2; maybe using a karatsuba-like update sequence). Regards, mb -- http://bodrato.it/papers ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Anomaly in mpn_sqrtrem and mpn_rottrem
ni...@lysator.liu.se (Niels Möller) writes: t...@gmplib.org (Torbjörn Granlund) writes: Surely, we could make mpn_rootrem run faster, in particular for small arguments. But also for large arguments, 2x slowdown seems like a lot. I've had a quick look. Both mpn_dc_sqrtrem and mpn_rootrem_internal (which seem to be the work horses for larger operands) do a division in the loop. The latter is organized as a newton iteration, the former isn't (but I guess the computation is more or less equivalent to a Newton iteration). The code is a bit too complex for me to really compare them... Ahum, same sensation here... For larger operands, I imagine a rewrite to use a division-free Newton-iteration for an approximation of the inverse root would be faster. I played some years ago with a loop computing A^(-0.5), which naturally becomes division-free. An alternative would be to keep the current algorithm, but use the fact that the previous divisor is a damn good staring approximation to the new one, and use Newton to maintain an inverse to it. For mpn_rootrem, i.e., a^(1/k), I suppose we could make the work take O(M(log(a^(1/k as opposed to the current O(M(log(a))), where M(n) is the time for an n-bit multiply. We'd need to make use of mpn_pow_1_highpart. This will be a lot of work. -- Torbjörn Please encrypt, key id 0xC8601622 ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Anomaly in mpn_sqrtrem and mpn_rottrem
t...@gmplib.org (Torbjörn Granlund) writes: Surely, we could make mpn_rootrem run faster, in particular for small arguments. But also for large arguments, 2x slowdown seems like a lot. I've had a quick look. Both mpn_dc_sqrtrem and mpn_rootrem_internal (which seem to be the work horses for larger operands) do a division in the loop. The latter is organized as a newton iteration, the former isn't (but I guess the computation is more or less equivalent to a Newton iteration). The code is a bit too complex for me to really compare them... For larger operands, I imagine a rewrite to use a division-free Newton-iteration for an approximation of the inverse root would be faster. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Anomaly in mpn_sqrtrem and mpn_rottrem
t...@gmplib.org (Torbjörn Granlund) writes: For mpn_rootrem, i.e., a^(1/k), I suppose we could make the work take O(M(log(a^(1/k as opposed to the current O(M(log(a))), where M(n) is the time for an n-bit multiply. We'd need to make use of mpn_pow_1_highpart. This will be a lot of work. Note that O(M(log(a^(1/k = O(M(log(a)/k)), i.e., that we can speed mpn_rootrem up by a factor k (at least for large k, and large a). Perhaps this does not need to be super-hard:. An outline: (1) make the computation with log2(a)/k bits + eps extra bits (an extra limb would do at least if the limb are not too small), happily truncating intermediates. (2) at the end do two final exponentiations, one truncating towards 0 and one rounding towards +oo. These, keeping just log2(a)/k + eps high bits, should be at most 1 apart sort of. -- Torbjörn Please encrypt, key id 0xC8601622 ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Anomaly in mpn_sqrtrem and mpn_rottrem
Paul Zimmermann has pointed out to us that mpn_rootrem is sometimes faster than mpn_sqrtrem, specifically when not requesting the remainder. When requesting the remainder we have anomalous behaviour too, but in the other direction: mpn_sqrtrem mpn_rootrem.2 mpn_rootrem.3 1 #38.84 2161.80 2459.40 2 #125.42 2419.00 2878.00 3 #352.65 2721.25 2907.75 4 #301.31 2812.50 3454.33 5 #531.16 3080.75 3445.67 7 #595.18 3408.67 3851.00 9 #954.00 4050.33 3908.33 12#836.83 4125.00 4325.33 16 #1168.00 4586.00 5189.00 22 #1710.17 5333.00 5565.00 30 #2333.40 6166.00 6711.50 42 #3556.33 7698.50 8360.50 58 #4607.33 10684.00 12585.00 81 #7675.00 13746.00 16511.00 113 #10920.00 19760.00 24908.00 158 #16435.00 27297.00 35592.00 221 #26308.00 41151.00 54227.00 309 #42636.00 65637.00 85184.00 432 #71155.00 105516.00 136874.00 604#117906.00 175414.00 220812.00 845#205120.00 388529.00 363189.00 Surely, we could make mpn_rootrem run faster, in particular for small arguments. But also for large arguments, 2x slowdown seems like a lot. -- Torbjörn Please encrypt, key id 0xC8601622 ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel