sqrt algorithm (was: Re: Anomaly in mpn_sqrtrem and mpn_rottrem)

2015-07-12 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

 So back to the drawing board.

I had to redo the splitting a bit in sqrt_nm1, and write some special
case code for n = 3. But now I have code that survives some testing.

I'm attaching my current algorithm description and code. The code
currently doesn't really exploit cancellation. Also, it computes
divisions like 

  floor (B^{k-1} E / 2H)

by zero-padding, and calling mpn_tdiv_qr. 

To make this fast, we need some variant of divappr_q which don't require
any of the uninteresting low limbs. Or alternatively, resurrect the
notion of fraction limbs.

Compare to bdiv_q functions, which computes N / D mod B^n, where both
inputs and the output are n limbs. A divappr_q function ought to compute
an approximation of N B^n / D, but possibly with a little mmore
flexibility in the input and output sizes.

The most flexible way is perhaps to mimic the fraction
limbs-interface. Say we have inputs N = {np, nn}, D = {dp, dn}, and a
scaling factor k, and compute an approximation of

   Q = floor (N B^k / D)

of qn = nn + dn + 1 + k limbs. Or we could tie things together a bit
more, by requiring that dn = nn = qn + 1 or so.

For the divisions in the sqrt algorithms, I'm not sure of exactly how
the sizes of the numerator E, the denominator H, and the quotient,
relate, but they ought to all be pretty close to k.

Regards,
/Niels

sqrt_nm1 (n, A):

  Input: A = {a_{n-1}, ..., a_0}, n = 2
  Output: X = {x_{n-2}, ..., x_0} \approx sqrt(B^{n-2} A)

  if (n == 2)
x_0 -- floor (sqrt (A))
return X = {x_0}

  if (n == 3) // Needs normalization
c -- count_leading_zeros (a_2)  ~1
A' -- 2^c A
x_1 = floor (sqrt ({a'_2, a'_1}))
E = A' - B x_1^2
return X = (B x_1 + floor (E / 2 x_1)) 2^{-c/2}

  k -- floor (n/2)
  H -- sqrt_nm1 (n - (k-1), {a_{n-1},...,a_{k-1}}) // n - (k-1) limbs

  if (n odd)// n = 2k+1
// We have a (k+1)-limb H, and produce k-1 low limbs.
E -- B A - H^2
X -- B^{k-1} H + floor (B^{k-1} E / 2H) // XXX

  else  // n = 2k
// We have a k-limb H, and produce k-1 low limbs
E -- A - H^2
X -- B^{k-1} H + floor (B^{k-1} E / 2H)

  return X

sqrt_n (n, A):

  Input: A = {a_{n-1}, ..., a_0}, n = 1
  Output: X = {x_{n-1}, ..., x_0} \approx sqrt(B^{n-1} A)

  if (n == 1)
x_0 -- floor (sqrt (a_0))
return X = {x_0}

  if (n == 2)
h -- floor (sqrt(A))
E -- A - h*h
X -- sqrt(B) h + floor (sqrt(B) E / 2h)

  k -- floor (n/2)
  H -- sqrt_n(k+1, {a_{n-1},...,a_{n-1-k}}) // k+1 limbs

  if (n odd)// n = 2k+1
// We have (k+1)-limb H, and produce k low limbs
E -- A - H^2
X -- B^k H + floor (B^k E / 2 H)

  else  // n = 2k
// We have (k+1)-limb H, and produce k-1 low limbs
E -- B A - H^2
X -- B^{k-1} H + floor (B^{k-1} E / 2H)

  return X

Size of remainder:

  Assume X = floor(sqrt(A)) = sqrt(A) - e, where A is n limbs. Define
  the remainder R = A - X^2.

  Then R = A - (sqrt(A) - e)^2
 = 2 e sqrt(A) - e^2
 = e (2 sqrt(A) - e)  2 sqrt(A)

  If n is even, n = 2k, then A - X^2  B^{k+1}

  If n is odd, n = 2k+1, then A - X^2  B^{k+1}

  So in all cases, R  B^{floor(n/2) + 1}, and X  B^{ceil(n/2)}, and
  with almost half a limb margin.

Correctness test:

  0  A - (X+1)^2 = A - X^2 - 2X - 1 = R - 2X - 1

  R  2X + 1

  R = 2X

Small cases. Consider sqrtrem, size n.

  n = 1: sqrt_1
  n = 2: sqrt_2
  n = 3: sqrt_n(2)
 - sqrt_2 + special update
  n = 4: sqrt_nm1(3)
 - sqrt_nm1(2) - sqrt_2
  n = 5: sqrt_n (3)
 - sqrt_n(2)
 - sqrt_2 + special update
  n = 6: sqrt_nm1 (4)
 - sqrt_nm1 (3)
 - sqrt_nm1(2) - sqrt_2
#include assert.h
#include stdio.h
#include stdlib.h

#include gmp.h

#if GMP_NUMB_BITS != 64
#error Unsupported limb size
#endif

#if !defined (__amd64__)
#error Unsupported arch
#endif

/* From longlong.h */
typedef unsigned long int UDItype;

#define add_ss(sh, sl, ah, al, bh, bl) \
  __asm__ (addq %5,%q1\n\tadcq %3,%q0	\
	   : =r (sh), =r (sl)	\
	   : 0  ((UDItype)(ah)), rme ((UDItype)(bh)),		\
	 %1 ((UDItype)(al)), rme ((UDItype)(bl)))
#define sub_ddmmss(sh, sl, ah, al, bh, bl) \
  __asm__ (subq %5,%q1\n\tsbbq %3,%q0	\
	   : =r (sh), =r (sl)	\
	   : 0 ((UDItype)(ah)), rme ((UDItype)(bh)),		\
	 1 ((UDItype)(al)), rme ((UDItype)(bl)))
#define umul_ppmm(w1, w0, u, v) \
  __asm__ (mulq %3			\
	   : =a (w0), =d (w1)	\
	   : %0 ((UDItype)(u)), rm ((UDItype)(v)))
#define udiv_qrnnd(q, r, n1, n0, dx) /* d renamed to dx avoiding =d */\
  __asm__ (divq %4		 /* stringification in KR C */	\
	   : =a (q), =d (r)		\
	   : 0 ((UDItype)(n0)), 1 ((UDItype)(n1)), rm ((UDItype)(dx)))

#define assert_no_carry(x) do {			\
mp_limb_t __assert_cy = (x);		\
assert (!__assert_cy);			\
  } while (0)

#define assert_carry(x) do {			\
mp_limb_t __assert_cy = (x);		\
assert (__assert_cy);			\
  } while (0)

static int
mpn_zero_p (const mp_limb_t *xp, mp_size_t n)
{
  mp_size_t i;
 

Re: Anomaly in mpn_sqrtrem and mpn_rottrem

2015-06-13 Thread Torbjörn Granlund
bodr...@mail.dm.unipi.it writes:

  Ciao,
  
  Il Ven, 12 Giugno 2015 9:04 am, Torbjörn Granlund ha scritto:
   We might want to look into the plain division-free sqrt(A) = A*sqrt(1/A)
   approach before implementing a tricky division sqrt(A).
  
  We can try improving the current implementation, before implementing any
  other algorithm :-)
  
You're wise.

  I wrote a simple patch (it touches very few lines) that allows skipping
  the final squaring when mpn_sqrtrem is called with a NULL argument and the
  approximation computed so far can not change.
  It exploits the spurious half-a-limb that current code adds to the result
  when the size is odd, i.e. it changes the timings only for odd sizes.
  
Nice speedup!

I suppose we really ought to add a limb also for even sizes (at least
when computing with more than a few limbs) to allow for this type of
speedup for even operand sizes too...

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Anomaly in mpn_sqrtrem and mpn_rottrem

2015-06-12 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  Hmm. Or maybe this is stupid. I could stop insisting on using a full
  size inverse (so that A / x or E / x can be computed as a *single*
  mulhi), and instead work with a half-size inverse, so that the quotient
  is computed in two steps. Then inverse computation using a plain
  Newton-step per iteration should work fine.
  
We indeed do that already when performing division for its own sake;
compute an inverse which is not longer than half the quotient size.

We might want to look into the plain division-free sqrt(A) = A*sqrt(1/A)
approach before implementing a tricky division sqrt(A).


-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Anomaly in mpn_sqrtrem and mpn_rottrem

2015-06-12 Thread bodrato
Ciao,

Il Ven, 12 Giugno 2015 9:04 am, Torbjörn Granlund ha scritto:
 We might want to look into the plain division-free sqrt(A) = A*sqrt(1/A)
 approach before implementing a tricky division sqrt(A).

We can try improving the current implementation, before implementing any
other algorithm :-)

I wrote a simple patch (it touches very few lines) that allows skipping
the final squaring when mpn_sqrtrem is called with a NULL argument and the
approximation computed so far can not change.
It exploits the spurious half-a-limb that current code adds to the result
when the size is odd, i.e. it changes the timings only for odd sizes.

Before the patch:

$ tune/speed -rp1 -s21-110 -f15 mpn_sqrt mpn_root.2
mpn_rootrem.2 mpn_sqrtrem
overhead 0.2 secs, precision 1 units of 2.86e-10 secs, CPU
freq 3500.08 MHz
 mpn_sqrtmpn_root.2 mpn_rootrem.2   mpn_sqrtrem
210.013801.61551.5569   #0.9996
315  #0.153741.12221.54261.1521
4725  0.000918521   #0.90001.24581.0004
70875 0.029574741   #0.96121.21921.0007
10631250.741147157   #0.97181.21470.

Sometimes mpn_root.2 is faster than the other functios in a wide range

With the patch applied:
$ tune/speed.nsqrt -rp1 -s21-110 -f15 mpn_sqrt mpn_root.2
mpn_rootrem.2 mpn_sqrtrem
overhead 0.2 secs, precision 1 units of 2.86e-10 secs, CPU
freq 3500.08 MHz
 mpn_sqrtmpn_root.2 mpn_rootrem.2   mpn_sqrtrem
21   #0.013391.64961.60891.0244
315  #0.129241.33181.84321.3701
4725 #0.0007996071.03491.43121.1495
70875#0.0263216531.08041.36981.1253
1063125  #0.6563897051.09821.37361.1296

mpn_rot.2 is the faster

-- 
http://bodrato.it/

___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Anomaly in mpn_sqrtrem and mpn_rottrem

2015-06-12 Thread bodrato
Il Sab, 13 Giugno 2015 12:27 am, bodr...@mail.dm.unipi.it ha scritto:

 I wrote a simple patch (it touches very few lines) that allows
skippingthe final
I forgot the attachment...

-- 
http://bodrato.it/

gmp.diff
Description: plain/text
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Anomaly in mpn_sqrtrem and mpn_rottrem

2015-06-11 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes:

 ni...@lysator.liu.se (Niels Möller) writes:

   As far as I see, a plain Newton iteration on won't produce an x' inverse
   with 4n bits of accuracy. One would need aither two iterations, or some
   other trick, maybe mixing in some interpolation of x' - x.
   
 I don't think one should struggle in order to get n - 2n bits in a
 Newtonian iteration.  It is enough to achieve n - 2n-k for some
 constant k.

I don't think one can do that for any small k in this case. I'll try to
describe the problem.

In the Newton iteration for the square root, which extends the precision
of x from n bits to (close to) 2n bits, I'd expect that we 2n bits of
precision for all inputs, including the inverse of the input x. Do you
agree? So as input, we need a 2n-bit approximative inverse of the n-bit
input x.

As output, we want the new 2n-bit x and a corresponding 4n-bit
approximative inverse.

A single newton iteration can produce a 4n-bit approximation of the
*n*-bit x, but then we also have new x bits from the square root
approximation to take into account. One way to get the needed inverse
approximation is to

1. Adjust the 2n-bit inverse of n-bit x to a 2n-bit inverse of the new
   2n-bit x. Could be done with a Newton step (which seems a bit
   overkill), or perhaps one can use some cheaper linear interpolation.

2. Do a Newton step to extend that inverse of the 2n-bit x to 4n
   bits of precision.

Hmm. Or maybe this is stupid. I could stop insisting on using a full
size inverse (so that A / x or E / x can be computed as a *single*
mulhi), and instead work with a half-size inverse, so that the quotient
is computed in two steps. Then inverse computation using a plain
Newton-step per iteration should work fine.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Anomaly in mpn_sqrtrem and mpn_rottrem

2015-06-11 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes:

 Before I try to understand the rest of your reasoning: What is B?  It's
 not the usual limb base, I presume, since then sqrt(B^{n-2}) =
 (B/2)^{n-2}, which I can compute completely without hard thinking
 (assuming the common case that B is a power of two)...

Sorry, there was an A missing. The function sqrt_nm1 is intended to
compute a close approximation of sqrt(B^{n-2} A), with n input limbs A
and n-1 output limbs X. (Suggestions for better names are appreciated).

And B is the usual limb base.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Anomaly in mpn_sqrtrem and mpn_rottrem

2015-06-11 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  Here's a sketch with some more details. I've tried to work out both
  sqrt(B^{n-1} A) and sqrt(B^{n-2}). To my surprise, they seem
  independent, not mutually recursive.
  
Before I try to understand the rest of your reasoning: What is B?  It's
not the usual limb base, I presume, since then sqrt(B^{n-2}) =
(B/2)^{n-2}, which I can compute completely without hard thinking
(assuming the common case that B is a power of two)...

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Anomaly in mpn_sqrtrem and mpn_rottrem

2015-06-10 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  I also had a quick look at the math, and I realized (some of you surely
  knew that already) that floor(sqrt(a)) is mostly independent of the
  lowest half of the bits of a.

Some of us indeed knew that...  And this generalises to kth roots.  It
is similar to the more familiar problem of an m-limb by n-limb division;
the low n dividend limbs affect the quotient very little...

  The lowest half of the bits are needed only for computing the remainder,
  and a final adjustment-by-one step. I think the iteration should totally
  ignore the low half bits in each step.
  
  Would it make sense with an mpn_sqrt_appr which takes a n-limb
  normalized number A as input, and computes an n-limb square root
  approximation of B^{n-1} A, with some well defined error bound, of at
  most 2? (We might also need a function for the the (n-1)-limb sqrt of
  B^{n-2} A).
  
  And a corresponding mpn_root_appr would compute the kth root of
  B^{(k-1) n - 1} A, I think.

I haven't looked enough at your propposal to have an informed opinion.

I think mpn_rootrem's approach for when the remainder is not needed
might be wise, perhaps wiser.  It computes with one extra limb, then
uses that to almost always return the correct result without any
(partial) remainder computation.

To speed sqrt (non-rem) we might want to do like root.  But both
functions should probably do that only for operands over say 10 limbs.
Below that, some _appr approach will be better.

And as written before, as k grows, we would benefit enormously from a
pow_1_highpart.  The average speedup potential is O(k).

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Anomaly in mpn_sqrtrem and mpn_rottrem

2015-06-10 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes:

 I haven't looked enough at your propposal to have an informed opinion.

Here's a sketch with some more details. I've tried to work out both
sqrt(B^{n-1} A) and sqrt(B^{n-2}). To my surprise, they seem
independent, not mutually recursive.

sqrt_nm1 (n, A):

  Input: A = {a_{n-1}, ..., a_0}, n = 2
  Output: X = {x_{n-2}, ..., x_0} \approx sqrt(B^{n-2} A)

  if (n == 2)
x_0 -- floor (sqrt ({a_1, a_0}))
return X = {x_0}

  k -- floor (n/2)
  H -- sqrt_nm1 (k+1, {a_{n-1},...,a_{n-1-k}}) // k limbs

  if (n odd)// n = 2k+1
E -- B H^2 - A
X -- B^k H - floor (B^{k-1} E / 2H)

  else  // n = 2k
E -- H^2 - A
X -- B^{k-1} H - floor (B^{k-1} E / 2H)

  return X

sqrt_n (n, A):

  Input: A = {a_{n-1}, ..., a_0}, n = 1
  Output: X = {x_{n-1}, ..., x_0} \approx sqrt(B^{n-1} A)

  if (n == 1)
x_0 -- floor (sqrt (a_0))
return X = {x_0}

  k -- floor (n/2)
  H -- sqrt_n(k+1, {a_{n-1},...,a_{n-1-k}}) // k+1 limbs

  if (n odd)// n = 2k+1
E -- H^2 - A
X -- H B^k - floor (B^k E / 2 H)

  else  // n = 2k
E -- H^2 - B A
X -- B^{k-1} H - floor (B^{k-1} E / 2H)

  return X

Note that all the computations of the error/residual E is subject to a
lot of cancellation, so one can use mullo (for small sizes) or
wraparound arithmetic using mulmod_bnm1 for larger sizes.

And the divisions really ought to use divappr, rather than exact
truncating division. I'm not sure how that interface works, but one
ought be able to pass in numerators like, e.g., B^{k-1} E, without
actually padding the numerator with zeros.

I don't know how large errors one gets, from the above algorithms, but I
would expect errors to be at most 2 or 3.

For proper error analysis, one should also think about desired sign of
the error. The newton iteration wants to converge from above, and on the
other hand the inclusion of more bits of A wants X to converge from
below. So maybe one should add some fudge factor into the rounding
somewhere force monotonous convergence? Also sign and magnitude of the
divappr errors must be taken into account.

If we end up with an error E and corresponding X update of varying sign,
that's no fundamental problem, but it will make the wraparound
arithmetic a little more complex.

Finally, I'm not sure if one can expect tricks to eliminate division to
be a big win. Except for the smallest sizes, division, and in particular
divappr, should only be a small constant slower than multiplication,
right? The division free iteration for 1/sqrt(A) needs additional
multiplications, and the trick mensioned the other day to compute an
inverse incrementally, will also introduce additional multiplications.

I'm thinking that an exact sqrt/sqrtrem for even size n will call
sqrt_nm1 with n/2+1 input limbs producing an n-limb root. And for odd n,
call sqrt_n with (n+1)/2 input limbs. In either case, we'll get at least
one bit of margin, with the worst case being odd n and a[n-1] == 1.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Anomaly in mpn_sqrtrem and mpn_rottrem

2015-06-09 Thread Torbjörn Granlund
bodr...@mail.dm.unipi.it writes:

  To compare them I wrote a quick and dirty specialization of the rootrem
  algorithm for the k==2 case (use sqr instead of mul, lshift instead of
  mul_1...)

Cool!

We should probably use sqr whenever possible, perhaps it is worth having
a condition for this in rootrem.c, (I believe mpn_pow_1 gets it right,
and uses sqr whenever possible, since I looked into that as an
explanation.)
  
  $ tune/speed -p 1  -s 1-110 -f16 mpn_root.2 mpn_sqrt
  mpn_rootrem.2 mpn_sqrtrem
  overhead 0.2 secs, precision 1 units of 2.86e-10 secs, CPU
  freq 3500.08 MHz
 mpn_root.2  mpn_sqrt mpn_rootrem.2   mpn_sqrtrem
  1 0.00518   0.00011   0.00516  #0.9
  160.01103   0.00244   0.01117  #0.00244
  256   0.09316  #0.08321   0.14106   0.08374
  4096 #0.000633967   0.000659180   0.000894513   0.000654562
  65536#0.026071234   0.026756061   0.033005667   0.026762939
  1048576   0.732938835   0.734521311   0.962774086  #0.731488979
  
  With the attached specialised function, rootrem.2 gains a 10% in speed and
  is faster than sqrt when the reminder is not needed and size grows.
  
  $ tune/speed -p 1  -s 1-110 -f16 mpn_root.2 mpn_sqrt
  mpn_rootrem.2 mpn_sqrtrem
  overhead 0.2 secs, precision 1 units of 2.86e-10 secs, CPU
  freq 3500.08 MHz
 mpn_root.2  mpn_sqrt mpn_rootrem.2   mpn_sqrtrem
  1 0.00376  #0.00010   0.00383   0.00010
  160.00884  #0.00241   0.00894   0.00243
  256  #0.07743   0.08262   0.12138   0.08282
  4096 #0.000563298   0.000657382   0.000816271   0.000650694
  65536#0.023296001   0.026694774   0.030225570   0.026717529
  1048576  #0.654117324   0.736489574   0.832002002   0.733319729
  
Your timing leaps hide some of the actual timing characteristics...

Using
  tune/speed -r -p1000 -s16-1000 -f1.438  mpn_sqrt mpn_root.2
I get about 0.97 on average.

And
  tune/speed -r -p1000 -s16-1000 -f1.438  mpn_sqrtrem mpn_rootrem.2
gives around 1.25.

(In both cases without your rootrem hack.)

  I added some TRACE (printf...) in the code to show the primitives with a
  super-linear cost used in computation.
  The last functions called by mpn_sqrtrem when computing the root of a 2000
  limb operand are:
  
  mpn_divrem nn=500/dn=250 - qn=250,rn=250
  mpn_sqr an=250 ^ 2 - rn=500
  mpn_divrem nn=1000/dn=500 - qn=500,rn=500
  mpn_sqr an=500 ^ 2 - rn=1000
  
  Similarly, for mpn_rootrem:
  
  mpn_div_q nn=500/dn=251 - qn=249
  mpn_sqr an=501 ^ 2 - rn=1002
  mpn_div_q nn=1000/dn=501 - qn=499
  mpn_sqr an=1000 ^ 2 - rn=2000
  
Interesting indeed.  I wonder why they need different multiply sizes.

  It uses div_q instead of divrem, but squares used by rootrem are doubled
  in size...

  The operation are similar if compared to sqrt (the order changes),
  probably div_q is faster than divrem. mpn_sqrt wins only when the huge
  speed-up given by the specialised code for the first limb is relevant...
  
I am not sure that div_q vs divrem matters a whole lot for this usage
(i.e., 2n/n sizes).
  
-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Anomaly in mpn_sqrtrem and mpn_rottrem

2015-06-09 Thread Torbjörn Granlund
t...@gmplib.org (Torbjörn Granlund) writes:

  I am not sure that div_q vs divrem matters a whole lot for this usage
  (i.e., 2n/n sizes).

Apparently wrong.  The first column is n, division is 2n-limb by n-limb:

  mpn_divrem  mpn_div_qratio
 50.0009 0.00111.17
 70.0012 0.00120.96
100.0014 0.00191.37
150.0028 0.00260.92
220.0056 0.00510.89
330.0099 0.00900.91
490.0215 0.01550.72
730.0444 0.03060.69
   1090.0882 0.06270.71
   1630.1767 0.12930.73
   2440.3385 0.26160.77
   3660.6530 0.50400.77
   5490.00012581 0.94590.75
   8230.00023138 0.000181170.78
  12340.00042985 0.000336420.78
  18510.00075588 0.000625480.83
  27760.00135386 0.001071240.79
  41640.00218069 0.001802420.83
  62460.00375453 0.003008710.80
  93690.00598697 0.004673050.78
 140530.00964423 0.007633660.79
 210790.01540152 0.012907300.84
 316180.02312424 0.019592180.85
 474270.03907867 0.032388670.83
 711400.05960059 0.049842350.84
1067100.09717958 0.083781170.86
1600650.16891133 0.136253220.81
2400970.21165200 0.175965290.83
3601450.38509720 0.308413200.80
5402170.67585650 0.567553500.84
8103251.17571375 0.966970250.82

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Anomaly in mpn_sqrtrem and mpn_rottrem

2015-06-09 Thread bodrato
Ciao,

Il Mar, 9 Giugno 2015 7:02 pm, Torbjörn Granlund ha scritto:
 bodr...@mail.dm.unipi.it writes:

 We should probably use sqr whenever possible, perhaps it is worth having
 a condition for this in rootrem.c, (I believe mpn_pow_1 gets it right,
 and uses sqr whenever possible, since I looked into that as an

mpn_pow_1 correctly uses sqr, the problem within rootrem arises only when
k==2. The two lines of code:
  wn = mpn_pow_1 (wp, sp, sn, k - 1, qp);
  mpn_mul (qp, wp, wn, sp, sn);
degenerate with k=2. mpn_pow_1 is used to perform an MPN_COPY so that the
two operands passed to mpn_mul contain the same number...

Maybe an initial condition if (k == 2) return mpn_sqrtrem (), but the
specialised code for a single k should not get inside the generic rootrem
function, in my opinion.

I wrote the hack only to understand the algorithm... I'll clean up rootrem
somehow, but I'll not insert special code.

   tune/speed -r -p1000 -s16-1000 -f1.438  mpn_sqrt mpn_root.2
 I get about 0.97 on average.

This can be healed using in sqrtrem the same strategy that rootrem uses to
skip the last squaring if the reminder is not required: adding a couple of
zero limbs and compute a slightly larger root...

 And
   tune/speed -r -p1000 -s16-1000 -f1.438  mpn_sqrtrem
 mpn_rootrem.2
 gives around 1.25.

This is not bad, a generic code can be slightly slower than a specialised
one.

   The last functions called by mpn_sqrtrem when computing the root of a
 2000
   limb operand are:

   mpn_divrem nn=1000/dn=500 - qn=500,rn=500
   mpn_sqr an=500 ^ 2 - rn=1000

   Similarly, for mpn_rootrem:

   mpn_div_q nn=1000/dn=501 - qn=499
   mpn_sqr an=1000 ^ 2 - rn=2000

 Interesting indeed.  I wonder why they need different multiply sizes.

sqrtrem obtain the reminder with divrem and needs only the b^2 part of
(ax+b)^2 to update it.
rootrem rebuild the reminder from scratch every loop... because the
generic (ax+b)^k has too many components.

 I am not sure that div_q vs divrem matters a whole lot for this usage
 (i.e., 2n/n sizes).

The manual says, about mpn_divrem:
This function is obsolete. Please call mpn_tdiv_qr instead for best
performance.
With a little scratch space passed to mpn_dc_sqrtrem, it's easy to stop
using divrem, but... will the performance really benefit of this?

Regards,
m

-- 
http://bodrato.it/toom-cook/

___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Anomaly in mpn_sqrtrem and mpn_rottrem

2015-06-09 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes:

 An alternative would be to keep the current
 algorithm, but use the fact that the previous divisor is a damn good
 staring approximation to the new one, and use Newton to maintain an
 inverse to it.

I vaguely remember that idea. That would be a loop that maintains
approximations of both x \approx sqrt(a) and v \approx 1/sqrt(a). It
makes my head spin a bit to think about the precision and order of the
needed updates.

When we have a n bit approximation x,
we need a 2n bit inverse v of x in order to compute the new x' with 2n
bits of precision. For the next iteration, we then need a 4n-bit inverse
v' of x'.

As inputs for the v' update we have the old v (2n bits), the new square
root approximation x' (2n bits), and we also have the old x (or the
difference x' - x) available.

As far as I see, a plain Newton iteration on won't produce an x' inverse
with 4n bits of accuracy. One would need aither two iterations, or some
other trick, maybe mixing in some interpolation of x' - x.

I also had a quick look at the math, and I realized (some of you surely
knew that already) that floor(sqrt(a)) is mostly independent of the
lowest half of the bits of a. So its essentially an operation with n
input bits and n output bits. Proof:


  sqrt(a+d) - sqrt(a) = d / (sqrt(a+d) + sqrt(a))

So if d = sqrt(a), then

  sqrt(a+d) - sqrt(a)  1/2.

The lowest half of the bits are needed only for computing the remainder,
and a final adjustment-by-one step. I think the iteration should totally
ignore the low half bits in each step.

Would it make sense with an mpn_sqrt_appr which takes a n-limb
normalized number A as input, and computes an n-limb square root
approximation of B^{n-1} A, with some well defined error bound, of at
most 2? (We might also need a function for the the (n-1)-limb sqrt of
B^{n-2} A).

And a corresponding mpn_root_appr would compute the kth root of
B^{(k-1) n - 1} A, I think.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Anomaly in mpn_sqrtrem and mpn_rottrem

2015-06-09 Thread bodrato
Ciao,

Il Lun, 8 Giugno 2015 10:15 am, Torbjörn Granlund ha scritto:
 ni...@lysator.liu.se (Niels Möller) writes:

   I've had a quick look. Both mpn_dc_sqrtrem and mpn_rootrem_internal
   (which seem to be the work horses for larger operands) do a division in
   the loop. The latter is organized as a newton iteration, the former
   isn't (but I guess the computation is more or less equivalent to a
   Newton iteration).

To compare them I wrote a quick and dirty specialization of the rootrem
algorithm for the k==2 case (use sqr instead of mul, lshift instead of
mul_1...)

Current speed (on shell)

$ tune/speed -p 1  -s 1-110 -f16 mpn_root.2 mpn_sqrt
mpn_rootrem.2 mpn_sqrtrem
overhead 0.2 secs, precision 1 units of 2.86e-10 secs, CPU
freq 3500.08 MHz
   mpn_root.2  mpn_sqrt mpn_rootrem.2   mpn_sqrtrem
1 0.00518   0.00011   0.00516  #0.9
160.01103   0.00244   0.01117  #0.00244
256   0.09316  #0.08321   0.14106   0.08374
4096 #0.000633967   0.000659180   0.000894513   0.000654562
65536#0.026071234   0.026756061   0.033005667   0.026762939
1048576   0.732938835   0.734521311   0.962774086  #0.731488979

With the attached specialised function, rootrem.2 gains a 10% in speed and
is faster than sqrt when the reminder is not needed and size grows.

$ tune/speed -p 1  -s 1-110 -f16 mpn_root.2 mpn_sqrt
mpn_rootrem.2 mpn_sqrtrem
overhead 0.2 secs, precision 1 units of 2.86e-10 secs, CPU
freq 3500.08 MHz
   mpn_root.2  mpn_sqrt mpn_rootrem.2   mpn_sqrtrem
1 0.00376  #0.00010   0.00383   0.00010
160.00884  #0.00241   0.00894   0.00243
256  #0.07743   0.08262   0.12138   0.08282
4096 #0.000563298   0.000657382   0.000816271   0.000650694
65536#0.023296001   0.026694774   0.030225570   0.026717529
1048576  #0.654117324   0.736489574   0.832002002   0.733319729


I added some TRACE (printf...) in the code to show the primitives with a
super-linear cost used in computation.
The last functions called by mpn_sqrtrem when computing the root of a 2000
limb operand are:

mpn_divrem nn=500/dn=250 - qn=250,rn=250
mpn_sqr an=250 ^ 2 - rn=500
mpn_divrem nn=1000/dn=500 - qn=500,rn=500
mpn_sqr an=500 ^ 2 - rn=1000

Similarly, for mpn_rootrem:

mpn_div_q nn=500/dn=251 - qn=249
mpn_sqr an=501 ^ 2 - rn=1002
mpn_div_q nn=1000/dn=501 - qn=499
mpn_sqr an=1000 ^ 2 - rn=2000

It uses div_q instead of divrem, but squares used by rootrem are doubled
in size...

When we do not ask for the reminder mpn_rootrem (say mpn_root) skips the
last squaring:

[mpn_sqr an=251 ^ 2 - rn=502]
mpn_div_q nn=501/dn=251 - qn=250
mpn_sqr an=501 ^ 2 - rn=1002
mpn_div_q nn=1002/dn=501 - qn=501

The operation are similar if compared to sqrt (the order changes),
probably div_q is faster than divrem. mpn_sqrt wins only when the huge
speed-up given by the specialised code for the first limb is relevant...


Best regards,
mb

-- 
http://bodrato.it/papers//* mpn_rootrem(rootp,remp,ap,an,nth) -- Compute the nth root of {ap,an}, and
   store the truncated integer part at rootp and the remainder at remp.

   Contributed by Paul Zimmermann (algorithm) and
   Paul Zimmermann and Torbjorn Granlund (implementation).
   Marco Bodrato specialised mpn_sqrtrem_internal for square roots.

   THE FUNCTIONS IN THIS FILE ARE INTERNAL, AND HAVE MUTABLE INTERFACES.  IT'S
   ONLY SAFE TO REACH THEM THROUGH DOCUMENTED INTERFACES.  IN FACT, IT'S ALMOST
   GUARANTEED THAT THEY'LL CHANGE OR DISAPPEAR IN A FUTURE GNU MP RELEASE.

Copyright 2002, 2005, 2009-2012, 2015 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of:

  * the GNU General Public License as published by the Free Software
Foundation; either version 3 of the License, or (at your option) any
later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.  */

/* FIXME:
 This implementation is not optimal when remp == NULL, since the complexity
 is M(n), whereas it should be M(n/k) on average.
*/

#include stdio.h		/* for NULL */

#include gmp.h
#include gmp-impl.h
#include longlong.h

#define TRACE(x) do {} while (0)

static mp_size_t mpn_rootrem_internal (mp_ptr, mp_ptr, mp_srcptr, mp_size_t,
   mp_limb_t, int);
static mp_size_t mpn_sqrtrem_internal (mp_ptr, mp_ptr, mp_srcptr, mp_size_t,
   int);

#define MPN_RSHIFT(cy,rp,up,un,cnt) \
  do {	\
if ((cnt) != 0)			\
 

Re: Anomaly in mpn_sqrtrem and mpn_rottrem

2015-06-09 Thread Niels Möller
bodr...@mail.dm.unipi.it writes:

 The last functions called by mpn_sqrtrem when computing the root of a 2000
 limb operand are:

 mpn_divrem nn=500/dn=250 - qn=250,rn=250
 mpn_sqr an=250 ^ 2 - rn=500
 mpn_divrem nn=1000/dn=500 - qn=500,rn=500
 mpn_sqr an=500 ^ 2 - rn=1000

 Similarly, for mpn_rootrem:

 mpn_div_q nn=500/dn=251 - qn=249
 mpn_sqr an=501 ^ 2 - rn=1002
 mpn_div_q nn=1000/dn=501 - qn=499
 mpn_sqr an=1000 ^ 2 - rn=2000

 It uses div_q instead of divrem, but squares used by rootrem are doubled
 in size...

That explains why rootrem is a lot slower, right? Is that what needs
explaining?

 When we do not ask for the reminder mpn_rootrem (say mpn_root) skips the
 last squaring:

Which is where most of the slowdown was.

 probably div_q is faster than divrem.

It should be. And if there's an adjustment step anyway, maybe it would
be even better to use divappr? (speed seems to lack support for many of
the interesting division functions).

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Anomaly in mpn_sqrtrem and mpn_rottrem

2015-06-09 Thread bodrato
Ciao,

Il Lun, 8 Giugno 2015 10:15 am, Torbjörn Granlund ha scritto:
 I played some years ago with a loop computing A^(-0.5), which naturally
 becomes division-free.  An alternative would be to keep the current
 algorithm, but use the fact that the previous divisor is a damn good
 staring approximation to the new one, and use Newton to maintain an
 inverse to it.

If we use the rootrem algorithm for sqrt, we could even try to update the
squared value, we have s and s^2, we update s to sc+r, and we update the
squared value (sc+r)^2 shifting s^2 and adding 2src + r^2; maybe using
a karatsuba-like update sequence).

Regards,
mb

-- 
http://bodrato.it/papers

___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Anomaly in mpn_sqrtrem and mpn_rottrem

2015-06-08 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  t...@gmplib.org (Torbjörn Granlund) writes:
  
   Surely, we could make mpn_rootrem run faster, in particular for small
   arguments.  But also for large arguments, 2x slowdown seems like a lot.
  
  I've had a quick look. Both mpn_dc_sqrtrem and mpn_rootrem_internal
  (which seem to be the work horses for larger operands) do a division in
  the loop. The latter is organized as a newton iteration, the former
  isn't (but I guess the computation is more or less equivalent to a
  Newton iteration).
  
  The code is a bit too complex for me to really compare them...
  
Ahum, same sensation here...

  For larger operands, I imagine a rewrite to use a division-free
  Newton-iteration for an approximation of the inverse root would be
  faster.
  
I played some years ago with a loop computing A^(-0.5), which naturally
becomes division-free.  An alternative would be to keep the current
algorithm, but use the fact that the previous divisor is a damn good
staring approximation to the new one, and use Newton to maintain an
inverse to it.

For mpn_rootrem, i.e., a^(1/k), I suppose we could make the work take
O(M(log(a^(1/k as opposed to the current O(M(log(a))), where M(n) is
the time for an n-bit multiply.  We'd need to make use of
mpn_pow_1_highpart.  This will be a lot of work.



-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Anomaly in mpn_sqrtrem and mpn_rottrem

2015-06-08 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes:

 Surely, we could make mpn_rootrem run faster, in particular for small
 arguments.  But also for large arguments, 2x slowdown seems like a lot.

I've had a quick look. Both mpn_dc_sqrtrem and mpn_rootrem_internal
(which seem to be the work horses for larger operands) do a division in
the loop. The latter is organized as a newton iteration, the former
isn't (but I guess the computation is more or less equivalent to a
Newton iteration).

The code is a bit too complex for me to really compare them...

For larger operands, I imagine a rewrite to use a division-free
Newton-iteration for an approximation of the inverse root would be
faster.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Anomaly in mpn_sqrtrem and mpn_rottrem

2015-06-08 Thread Torbjörn Granlund
t...@gmplib.org (Torbjörn Granlund) writes:

  For mpn_rootrem, i.e., a^(1/k), I suppose we could make the work take
  O(M(log(a^(1/k as opposed to the current O(M(log(a))), where M(n) is
  the time for an n-bit multiply.  We'd need to make use of
  mpn_pow_1_highpart.  This will be a lot of work.

Note that O(M(log(a^(1/k = O(M(log(a)/k)), i.e., that we can speed
mpn_rootrem up by a factor k (at least for large k, and large a).

Perhaps this does not need to be super-hard:.  An outline:

(1) make the computation with log2(a)/k bits + eps extra bits (an extra
limb would do at least if the limb are not too small), happily
truncating intermediates.

(2) at the end do two final exponentiations, one truncating towards 0
and one rounding towards +oo.  These, keeping just log2(a)/k + eps high
bits, should be at most 1 apart sort of.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Anomaly in mpn_sqrtrem and mpn_rottrem

2015-05-28 Thread Torbjörn Granlund
Paul Zimmermann has pointed out to us that mpn_rootrem is sometimes
faster than mpn_sqrtrem, specifically when not requesting the remainder.

When requesting the remainder we have anomalous behaviour too, but in
the other direction:

  mpn_sqrtrem mpn_rootrem.2 mpn_rootrem.3
1  #38.84   2161.80   2459.40
2 #125.42   2419.00   2878.00
3 #352.65   2721.25   2907.75
4 #301.31   2812.50   3454.33
5 #531.16   3080.75   3445.67
7 #595.18   3408.67   3851.00
9 #954.00   4050.33   3908.33
12#836.83   4125.00   4325.33
16   #1168.00   4586.00   5189.00
22   #1710.17   5333.00   5565.00
30   #2333.40   6166.00   6711.50
42   #3556.33   7698.50   8360.50
58   #4607.33  10684.00  12585.00
81   #7675.00  13746.00  16511.00
113 #10920.00  19760.00  24908.00
158 #16435.00  27297.00  35592.00
221 #26308.00  41151.00  54227.00
309 #42636.00  65637.00  85184.00
432 #71155.00 105516.00 136874.00
604#117906.00 175414.00 220812.00
845#205120.00 388529.00 363189.00

Surely, we could make mpn_rootrem run faster, in particular for small
arguments.  But also for large arguments, 2x slowdown seems like a lot.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel