Re: GCD project status?

2019-09-27 Thread Torbjörn Granlund
We now have almost complete results for the latest hgcd2.c tweaks.

We have 5 div1 variants (as chosen by HGCD2_DIV1_METHOD).  There are
results and brief documentation here:

https://gmplib.org/devel/thres/HGCD2_DIV1_METHOD.html

About half of our test machines prefer '/', i.e., whatever word division
the compiler comes up with.  About one quarter prefers method 3, which
handles n / d < 8 with plain and branch-free code, but '/' for greater
divisors.  About one quarter prefers table-driven inverse-and-multiply,
which is in practice branch-free.
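
For a feel of method 3, here is a rough sketch of the small-quotient idea
(just an illustration, not the code as committed; the function name and the
exact fallback condition are made up here):

static mp_double_limb_t
div1_small_q (mp_limb_t n0, mp_limb_t d0)
{
  mp_double_limb_t res;

  /* Use plain '/' when the quotient might not fit in 3 bits, or when d0
     is so large that the shifts below could overflow.  */
  if ((d0 >> (GMP_LIMB_BITS - 3)) != 0 || n0 >= (d0 << 3))
    {
      res.d1 = n0 / d0;
      res.d0 = n0 - res.d1 * d0;
    }
  else
    {
      /* Build q = n0/d0 < 8 bit by bit with masked subtractions.  */
      mp_limb_t q = 0, mask;

      mask = -(mp_limb_t) (n0 >= (d0 << 2));
      q += mask & 4;
      n0 -= mask & (d0 << 2);
      mask = -(mp_limb_t) (n0 >= (d0 << 1));
      q += mask & 2;
      n0 -= mask & (d0 << 1);
      mask = -(mp_limb_t) (n0 >= d0);
      q += mask & 1;
      n0 -= mask & d0;

      res.d1 = q;
      res.d0 = n0;
    }
  return res;
}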

The even-numbered methods are clearly unpopular.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: GCD project status?

2019-09-24 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  Hmm. Definitely worth a try. But if we need explicit loads and stores
  from the structs, we'll not save that many instructions.

x86 can handle operate-from-memory at almost the same cost as
operate-from-register.  But operate-to-memory is expensive.

Load/store architectures typically reach their peak execution throughput
only with a mixture of loads and operates.

  Each iteration needs to load all the values but store only half of
  them, so for each pair of values load + load + store, compared to mov,
  xor, and, xor, xor for conditionally swapping using a mask.

You might want to explicitly load some of the values into registers.
(And perhaps use the "restrict" keyword for the swappable pointers...but
I am afraid that we'd be stepping outside the vague semantics of
restrict.)

  > Some measurements with method 4 and 5 are now in.  Modern Intel CPUs
  > like method 5, as I had expected.

  Nice! With a few % margin over method 3.

8 configs now vouch for method 5.

And method 4 got its first "honourable mention"; beagle thinks it is 2nd
best.  :-)

I don't think method 4 will see much use unless we find a way to
radically improve the applicability of its large tables.  I made table
size 2048 the default, just for testing purposes.  The next smaller
table size is 512 bytes, which is a more reasonable size.  It gets an 87%
hit rate.  I'd say we need that to get beyond 95% for method 4 to become
viable.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: GCD project status?

2019-09-23 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes:

> Perhaps keeping the to-be-swapped variables in two structs, and instead
> conditionally swap pointers to the structs?

Hmm. Definitely worth a try. But if we need explicit loads and stores
from the structs, we'll not save that many instructions. Each iteration
needs to load all the values but store only half of them, so for each
pair of values load + load + store, compared to mov, xor, and, xor, xor
for conditionally swapping using a mask.

> Some measurements with method 4 and 5 are now in.  Modern Intel CPUs
> like method 5, as I had expected.

Nice! With a few % margin over method 3.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: GCD project status?

2019-09-23 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  $ ./tune/speed -c -s1 -p10 mpn_hgcd2_1 mpn_hgcd2_2 mpn_hgcd2_3 mpn_hgcd2_4 mpn_hgcd2_5 mpn_hgcd2_binary
  overhead 6.02 cycles, precision 10 units of 8.33e-10 secs, CPU freq 1200.00 MHz
        mpn_hgcd2_1   mpn_hgcd2_2   mpn_hgcd2_3   mpn_hgcd2_4   mpn_hgcd2_5   mpn_hgcd2_binary
  1        #1668.90       1863.72       1670.73       1757.54       1738.50            2044.25

  Had a look at the disassembly for the binary algorithm. The
  double-precision loop needs 20 instructions for just the conditional
  swap logic, 23 for the clz + shift + subtract, and 8 for the shift+add
  updates of the u matrix.

Perhaps keeping the to-be-swapped variables in two structs, and instead
conditionally swap pointers to the structs?
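
Roughly like this (a hypothetical layout just to illustrate the pointer
swap; untested):

#include <stdint.h>

struct hgcd2_side { mp_limb_t h, l, u0, u1; };  /* one "column" of the state */

/* One masked xor of the two pointers instead of four SWAP_MASKs.  */
static inline void
cond_swap_sides (mp_limb_t mask, struct hgcd2_side **a, struct hgcd2_side **b)
{
  uintptr_t m = -(uintptr_t) (mask != 0);
  uintptr_t x = ((uintptr_t) *a ^ (uintptr_t) *b) & m;
  *a = (struct hgcd2_side *) ((uintptr_t) *a ^ x);
  *b = (struct hgcd2_side *) ((uintptr_t) *b ^ x);
}

Whether that actually beats the five-instruction masked value swaps then
comes down to how well the loads and stores through the two pointers
schedule.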

Some measurements with method 4 and 5 are now in.  Modern Intel CPUs
like method 5, as I had expected.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: GCD project status?

2019-09-23 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> Below is one variant that seems to pass tests. I haven't done any
> benchmarks yet.

Just gave it a try with tune/speed, together with all the latest div1
variants. This is on my laptop, and not looking that great:

$ ./tune/speed -c -s1 -p10 mpn_hgcd2_1 mpn_hgcd2_2 mpn_hgcd2_3 mpn_hgcd2_4 mpn_hgcd2_5 mpn_hgcd2_binary
overhead 6.02 cycles, precision 10 units of 8.33e-10 secs, CPU freq 1200.00 MHz
      mpn_hgcd2_1   mpn_hgcd2_2   mpn_hgcd2_3   mpn_hgcd2_4   mpn_hgcd2_5   mpn_hgcd2_binary
1        #1668.90       1863.72       1670.73       1757.54       1738.50            2044.25


Had a look at the disassembly for the binary algorithm. The
double-precision loop needs 20 instructions for just the conditional
swap logic, 23 for the clz + shift + subtract, and 8 for the shift+add
updates of the u matrix.

By doing another conditional subtraction in each iteration, I get down
to 1865 cycles; see the version below. In the case acnt == bcnt, the
condition for this is always false. But when acnt < bcnt, we have

  2^{k-1} <= a < 2^k
  2^{k-2} <= (2^s b) < 2^{k-1}

so floor (a / (2^s b)) can be 1, 2, or 3. So one could consider yet
another conditional subtraction (but then q = 3*2^s is no longer just a
single shift and add).
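
Spelled out for the acnt < bcnt case: a / (2^s b) < 2^k / 2^{k-2} = 4 and
a / (2^s b) >= 2^{k-1} / 2^{k-1} = 1, so q = floor (a / (2^s b)) is 1, 2
or 3; the unconditional sub_ddmmss covers q >= 1, the masked subtraction
covers q = 2, and only q = 3 would need that third, more expensive step.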

If I also try the SIMD trick, I get down to 1790 cycles, but I'd need a
special middle iteration to make it correct. As long as both a,b >= 2^96
(more than one and a half limb), matrix coefficients fit in a half limb.
But we don't switch to single precision until a,b < 2^96. So we'll have
one or more iterations with a >= 2^96 > b, and there u01 and u11 may
exceed half a limb. It may still be worth adding that middle loop, for a
5% speedup.

Seems hard to beat method 1 on this hardware, but binary may well be a
winner on machines with slow multiply, slow division, but fast
count_leading_zeros.

Regards,
/Niels

#define SWAP_MASK(mask, x, y) do {              \
    mp_limb_t swap_xor = ((x) ^ (y)) & mask;    \
    (x) ^= swap_xor;                            \
    (y) ^= swap_xor;                            \
  } while (0)

int
mpn_hgcd2 (mp_limb_t ah, mp_limb_t al, mp_limb_t bh, mp_limb_t bl,
   struct hgcd_matrix1 *M)
{
  mp_limb_t u00, u01, u10, u11, swapped;

  if (ah < 2 || bh < 2)
return 0;

  if (ah > bh || (ah == bh && al > bl))
{
  sub_ddmmss (ah, al, ah, al, bh, bl);
  if (ah < 2)
return 0;

  u00 = u01 = u11 = 1;
  u10 = 0;
}
  else
{
  sub_ddmmss (bh, bl, bh, bl, ah, al);
  if (bh < 2)
return 0;

  u00 = u10 = u11 = 1;
  u01 = 0;
}
  
  swapped = 0;

  for (;;)
{
  mp_limb_t mask, dh, dl;
  int acnt, bcnt, shift;

  if (ah == bh)
goto done;

  mask = -(mp_limb_t) (ah < bh);
  swapped ^= mask;
  SWAP_MASK (mask, ah, bh);
  SWAP_MASK (mask, al, bl);
  SWAP_MASK (mask, u00, u01);
  SWAP_MASK (mask, u10, u11);
  
  ASSERT (ah > bh);

  if (ah < (CNST_LIMB(1) << (GMP_LIMB_BITS / 2)))
{
  ah = (ah << (GMP_LIMB_BITS / 2) ) + (al >> (GMP_LIMB_BITS / 2));
  bh = (bh << (GMP_LIMB_BITS / 2) ) + (bl >> (GMP_LIMB_BITS / 2));

  break;
}

  count_leading_zeros (acnt, ah);
  count_leading_zeros (bcnt, bh);
  shift = bcnt - acnt;
  ASSERT (shift >= 0);
  /* Avoid subtraction underflow */
  shift -= (shift > 0);

  dh = (bh << shift) | ((bl >> 1) >> (GMP_LIMB_BITS - 1 - shift));
  dl = bl << shift;
  sub_ddmmss (ah, al, ah, al, dh, dl);

  mask = -(mp_limb_t) (ah > dh);
  shift -= mask;
  sub_ddmmss (ah, al, ah, al, mask & dh, mask & dl);

  if (ah < 2) 
{
  /* A is too small, so decrement q. */
  mp_limb_t q = ((mp_limb_t) 1 << shift) - 1;
  u01 += q * u00;
  u11 += q * u10;
  goto done;
}
  u01 += (u00 << shift);
  u11 += (u10 << shift);
}

  /* Single-precision loop */
  for (;;)
{
  mp_limb_t mask, dh;
  int acnt, bcnt, shift;
  if (ah == bh)
break;

  mask = -(mp_limb_t) (ah < bh);
  swapped ^= mask;
  SWAP_MASK (mask, ah, bh);
  SWAP_MASK (mask, u00, u01);
  SWAP_MASK (mask, u10, u11);
  
  ASSERT (ah > bh);
  
  count_leading_zeros (acnt, ah);
  count_leading_zeros (bcnt, bh);
  shift = bcnt - acnt;
  ASSERT (shift >= 0);
  /* Avoid subtraction underflow */
  shift -= (shift > 0);
  
  dh = bh << shift;
  ah -= dh;

  mask = - (mp_limb_t) (ah > dh);
  shift -= mask;
  ah -= mask & dh;

  if (ah < (CNST_LIMB(1) << (GMP_LIMB_BITS / 2 + 1)))
{
  /* A is too small, so decrement q. */
  mp_limb_t q = ((mp_limb_t) 1 << shift) - 1;
  u01 += q * u00;
  u11 += q * u10;
  break;
}

  u01 += (u00 << shift);
  u11 += (u10 << shift);
}

 done:
  SWAP_MASK (swapped, u00, u01);
  SWAP_MASK (swapped, u10, u11);
  ASSERT (u00 * u11 - u01 * u10 == 1);
  M->u[0][0] = u00; M->u[0][1] = u01;
  M->u[1][0] = u10; M->u[1][1] = u11;

  return 1;
}

Re: GCD project status?

2019-09-23 Thread Torbjörn Granlund
t...@gmplib.org (Torbjörn Granlund) writes:

  I went ahead and pushed a hgcd2.c with _METHOD = 4 and _METHOD = 5.  The
  former tables quotients and needs one multiply per div1 call and the
  latter tables inverses and needs 2 multiplies per div1 call.

I pushed improved versions.

Instead of detecting potentially problematic operands ahead of time, both
methods now try their division-free algorithm first, carefully avoiding
overflow.  If the remainder is not in range, they fall back to division.
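
Schematically, both methods now share this shape (a sketch of the
structure only, not the pushed code; estimate_q stands for whichever
division-free approximation the method uses, arranged so that it never
overestimates):

static mp_double_limb_t
div1_try_then_divide (mp_limb_t n0, mp_limb_t d0)
{
  mp_double_limb_t res;
  mp_limb_t q0 = estimate_q (n0, d0);   /* division-free, never too large */
  mp_limb_t r0 = n0 - q0 * d0;          /* exact, since q0 <= n0 / d0 */
  mp_limb_t mask = -(mp_limb_t) (r0 >= d0);

  q0 -= mask;                           /* one branch-free adjustment */
  r0 -= d0 & mask;

  if (UNLIKELY (r0 >= d0))              /* still out of range: rare, divide */
    {
      q0 = n0 / d0;
      r0 = n0 - q0 * d0;
    }
  res.d1 = q0;
  res.d0 = r0;
  return res;
}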

  The quotient table method isn't pretty.  It can surely be improved.  It
  uses a very large table and gets poor throughput.  It also has only
  about 80% branch hit rate, which is not good enough.

Now 90%.  But the table is still the same huge size.

  The inverse table method is more mature, I believe.  It has much smaller
  tables and with the default table size (256 2-byte entries) it gets a
  branch hit rate of 97%.  Doubling the table results in 98.5% rate.

Now 99.74% for the smaller table.  The larger table is no longer useful
and is gone.  The default is now a 64-byte table with 99.5% division
avoidance.

It is easier to see that the new variants are correct; if the few
arithmetic operations do not overflow (which I claim to be true) and if
the remainder is in range [0,d-1] then the result will be correct.


-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: GCD project status?

2019-09-22 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  I see two ways. Either just do an inversion table, which will cost an
  additional multiplication when used. Then one could handle quotients up
  to 7 or 8 bits with just 256 bytes of table.

  Or shift both ni and di, and get quotient with something like

q = tabp[(ni << NBITS) + di] >> (NBITS - (dcnt - ncnt))

  Then one could handle quotients up to 5 bits with a table of 1024 bytes
  (indexing by 5 significant bits of each of n and d, excluding the most
  significant one bit).

I went ahead and pushed a hgcd2.c with _METHOD = 4 and _METHOD = 5.  The
former tables quotients and needs one multiply per div1 call and the
latter tables inverses and needs 2 multiplies per div1 call.

The quotient table method isn't pretty.  It can surely be improved.  It
uses a very large table and gets poor throughput.  It also has only
about 80% branch hit rate, which is not good enough.  Niels suggested a
variant with better progress (I get a quotient of at most NBITS/2 bits
in the fast path while Niels expects NBITS) when extracting NBITS
numerator and NBITS quotient bits.  I would be happy to see a variant
with NBITS progress.

The inverse table method is more mature, I believe.  It has much smaller
tables and with the default table size (256 2-byte entries) it gets a
branch hit rate of 97%.  Doubling the table results in 98.5% rate.

Benchmark?  The inverse table method seems to be the fastest div1 thus
far for some current Intel processors.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: GCD project status?

2019-09-22 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes:

> I had not realised you meant left-to-right.  I thought you talked about
> right-to-left.  Interesting!

Below is one variant that seems to pass tests. I haven't done any
benchmarks yet. All the SWAP_MASK calls look a bit expensive; they should
be 4 instructions each. Not sure if there's any way to do the swap in
assembly that's much cheaper; the obvious alternative would be one mov
and two cmov.

When doing this, I realized one could do a kind of SIMD optimization for
the double-precision loop (not yet tried). There, I think the matrix
elements u00, u01, u10, u11 all fit in half a limb each. So one could let
u00, u10 share one limb, say u0010, using low and high half, and let u01,
u11 share a single limb u0111. Then

  u01 += q * u00;
  u11 += q * u10;

could be replaced with the single multiply

  u0111 += q * u0010;

And for the binary version below, we'd need only three SWAP_MASK per
iteration, in both the double-limb and the single-limb loops.
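
To make the packing concrete (sketch only, valid while all four
coefficients fit in half a limb so the halves never carry into each
other):

  mp_limb_t u0010 = u00 + (u10 << (GMP_LIMB_BITS / 2));
  mp_limb_t u0111 = u01 + (u11 << (GMP_LIMB_BITS / 2));

  /* u01 += q * u00;  u11 += q * u10;  becomes */
  u0111 += q * u0010;

  /* and likewise u01 += u00 << shift;  u11 += u10 << shift;  becomes */
  u0111 += u0010 << shift;

  /* unpack at the end */
  u01 = u0111 & ((CNST_LIMB(1) << (GMP_LIMB_BITS / 2)) - 1);
  u11 = u0111 >> (GMP_LIMB_BITS / 2);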

Regards,
/Niels

#define SWAP_MASK(mask, x, y) do {              \
    mp_limb_t swap_xor = ((x) ^ (y)) & mask;    \
    (x) ^= swap_xor;                            \
    (y) ^= swap_xor;                            \
  } while (0)

int
mpn_hgcd2 (mp_limb_t ah, mp_limb_t al, mp_limb_t bh, mp_limb_t bl,
   struct hgcd_matrix1 *M)
{
  mp_limb_t u00, u01, u10, u11, swapped;

  if (ah < 2 || bh < 2)
return 0;

  if (ah > bh || (ah == bh && al > bl))
{
  sub_ddmmss (ah, al, ah, al, bh, bl);
  if (ah < 2)
return 0;

  u00 = u01 = u11 = 1;
  u10 = 0;
}
  else
{
  sub_ddmmss (bh, bl, bh, bl, ah, al);
  if (bh < 2)
return 0;

  u00 = u10 = u11 = 1;
  u01 = 0;
}
  
  swapped = 0;

  for (;;)
{
  mp_limb_t mask, dh, dl;
  int acnt, bcnt, shift;

  if (ah == bh)
goto done;

  mask = -(mp_limb_t) (ah < bh);
  swapped ^= mask;
  SWAP_MASK (mask, ah, bh);
  SWAP_MASK (mask, al, bl);
  SWAP_MASK (mask, u00, u01);
  SWAP_MASK (mask, u10, u11);
  
  ASSERT_ALWAYS (ah > bh);

  if (ah < (CNST_LIMB(1) << (GMP_LIMB_BITS / 2)))
{
  ah = (ah << (GMP_LIMB_BITS / 2) ) + (al >> (GMP_LIMB_BITS / 2));
  bh = (bh << (GMP_LIMB_BITS / 2) ) + (bl >> (GMP_LIMB_BITS / 2));

  break;
}

  count_leading_zeros (acnt, ah);
  count_leading_zeros (bcnt, bh);
  shift = bcnt - acnt;
  ASSERT_ALWAYS (shift >= 0);
  /* Avoid subtraction underflow */
  shift -= (shift > 0);

  dh = (bh << shift) | ((bl >> 1) >> (GMP_LIMB_BITS - 1 - shift));
  dl = bl << shift;
  sub_ddmmss (ah, al, ah, al, dh, dl);

  if (ah < 2) 
{
  /* A is too small, so decrement q. */
  mp_limb_t q = ((mp_limb_t) 1 << shift) - 1;
  u01 += q * u00;
  u11 += q * u10;
  goto done;
}
  u01 += (u00 << shift);
  u11 += (u10 << shift);
}

  /* Single-precision loop */
  for (;;)
{
  mp_limb_t mask;
  int acnt, bcnt, shift;
  if (ah == bh)
break;

  mask = -(mp_limb_t) (ah < bh);
  swapped ^= mask;
  SWAP_MASK (mask, ah, bh);
  SWAP_MASK (mask, u00, u01);
  SWAP_MASK (mask, u10, u11);
  
  ASSERT_ALWAYS (ah > bh);
  
  count_leading_zeros (acnt, ah);
  count_leading_zeros (bcnt, bh);
  shift = bcnt - acnt;
  ASSERT_ALWAYS (shift >= 0);
  /* Avoid subtraction underflow */
  shift -= (shift > 0);

  ah -= bh << shift;
  if (ah < (CNST_LIMB(1) << (GMP_LIMB_BITS / 2 + 1)))
{
  /* A is too small, so decrement q. */
  mp_limb_t q = ((mp_limb_t) 1 << shift) - 1;
  u01 += q * u00;
  u11 += q * u10;
  break;
}

  u01 += (u00 << shift);
  u11 += (u10 << shift);
}

 done:
  SWAP_MASK (swapped, u00, u01);
  SWAP_MASK (swapped, u10, u11);
  ASSERT_ALWAYS (u00 * u11 - u01 * u10 == 1);
  M->u[0][0] = u00; M->u[0][1] = u01;
  M->u[1][0] = u10; M->u[1][1] = u11;

  return 1;
}


-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: GCD project status?

2019-09-22 Thread Torbjörn Granlund
Nice!

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: GCD project status?

2019-09-22 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> Hmm. So the hgcd2-x-y.c would be there also for speed, and hgcd2.c would
> have something like
>
> #if TUNE_PROGRAM_BUILD
> int mpn_hgcd2(...)
> {
>   return (*hgcd2_method_pointer)(...);
> }

See below patch. 

diff -r 3e04a9bbba13 gmp-impl.h
--- a/gmp-impl.h        Fri Sep 20 20:36:30 2019 +0200
+++ b/gmp-impl.h        Sun Sep 22 10:19:06 2019 +0200
@@ -4949,6 +4949,10 @@
 #define MATRIX22_STRASSEN_THRESHOLD   matrix22_strassen_threshold
 extern mp_size_t   matrix22_strassen_threshold;
 
+typedef int hgcd2_func_t (mp_limb_t, mp_limb_t, mp_limb_t, mp_limb_t,
+ struct hgcd_matrix1 *);
+extern hgcd2_func_t *hgcd2_func;
+
 #undef HGCD_THRESHOLD
 #define HGCD_THRESHOLD hgcd_threshold
 extern mp_size_t   hgcd_threshold;
diff -r 3e04a9bbba13 tune/Makefile.am
--- a/tune/Makefile.am  Fri Sep 20 20:36:30 2019 +0200
+++ b/tune/Makefile.am  Sun Sep 22 10:19:06 2019 +0200
@@ -96,7 +96,7 @@
 speed_ext_SOURCES = speed-ext.c
 speed_ext_LDFLAGS = $(STATIC)
 
-tuneup_SOURCES = tuneup.c
+tuneup_SOURCES = tuneup.c hgcd2.c
 nodist_tuneup_SOURCES = sqr_basecase.c fac_ui.c $(TUNE_MPN_SRCS)
 tuneup_DEPENDENCIES = $(TUNE_SQR_OBJ) libspeed.la
 tuneup_LDADD = $(tuneup_DEPENDENCIES) $(TUNE_LIBS)
diff -r 3e04a9bbba13 tune/hgcd2.c
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/tune/hgcd2.c  Sun Sep 22 10:19:06 2019 +0200
@@ -0,0 +1,49 @@
+/* mpn/generic/hgcd2.c for tuning
+
+Copyright 2019 Free Software Foundation, Inc.
+
+This file is part of the GNU MP Library.
+
+The GNU MP Library is free software; you can redistribute it and/or modify
+it under the terms of either:
+
+  * the GNU Lesser General Public License as published by the Free
+Software Foundation; either version 3 of the License, or (at your
+option) any later version.
+
+or
+
+  * the GNU General Public License as published by the Free Software
+Foundation; either version 2 of the License, or (at your option) any
+later version.
+
+or both in parallel, as here.
+
+The GNU MP Library is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received copies of the GNU General Public License and the
+GNU Lesser General Public License along with the GNU MP Library.  If not,
+see https://www.gnu.org/licenses/.  */
+
+#define TUNE_PROGRAM_BUILD 1
+
+#include "gmp-impl.h"
+
+hgcd2_func_t mpn_hgcd2_default;
+
+hgcd2_func_t *hgcd2_func = &mpn_hgcd2_default;
+
+int
+mpn_hgcd2 (mp_limb_t ah, mp_limb_t al, mp_limb_t bh, mp_limb_t bl,
+  struct hgcd_matrix1 *M)
+{
+  return hgcd2_func(ah, al, bh, bl, M);
+}
+
+#undef mpn_hgcd2
+#define mpn_hgcd2 mpn_hgcd2_default
+
+#include "mpn/generic/hgcd2.c"
diff -r 3e04a9bbba13 tune/tuneup.c
--- a/tune/tuneup.c Fri Sep 20 20:36:30 2019 +0200
+++ b/tune/tuneup.c Sun Sep 22 10:19:06 2019 +0200
@@ -716,8 +716,11 @@
select the fastest. Since *_METHOD defines start numbering from
one, if functions[i] is fastest, the value of the define is i+1.
Also output a comment with speedup compared to the next fastest
-   function. The NAME argument is used only for trace output.*/
-void
+   function. The NAME argument is used only for trace output.
+
+   Returns the index of the fastest function.
+*/
+int
 one_method (int n, speed_function_t *functions,
const char *name, const char *define,
const struct param_t *param)
@@ -757,6 +760,7 @@
 t[method_runner_up] / t[method]);
 
   TMP_FREE;
+  return method;
 }
 
 
@@ -1958,15 +1962,17 @@
 tune_hgcd2 (void)
 {
   static struct param_t  param;
-  speed_function_t f[3] =
-{
- speed_mpn_hgcd2_1,
- speed_mpn_hgcd2_2,
- speed_mpn_hgcd2_3,
-};
+  hgcd2_func_t *f[3] =
+{ mpn_hgcd2_1, mpn_hgcd2_2, mpn_hgcd2_3 };
+  speed_function_t speed_f[3] =
+{ speed_mpn_hgcd2_1, speed_mpn_hgcd2_2, speed_mpn_hgcd2_3 };
+  int best;
 
   s.size = 1;
-  one_method (3, f, "mpn_hgcd2", "HGCD2_DIV1_METHOD", &param);
+  best = one_method (3, speed_f, "mpn_hgcd2", "HGCD2_DIV1_METHOD", &param);
+
+  /* Use selected function when tuning hgcd and gcd */
+  hgcd2_func = f[best];
 }
 
 void
@@ -2236,9 +2242,6 @@
 void
 tune_div_qr_1 (void)
 {
-  static struct param_t  param;
-  double  t1, t2;
-
   if (!HAVE_NATIVE_mpn_div_qr_1n_pi1)
 {
   static struct param_t  param;


-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: GCD project status?

2019-09-19 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes:

> With some thought, I am sure the table size could come down while the
> accuracy could improve.

I see two ways. Either just do an inversion table, which will cost an
additional multiplication when used. Then one could handle quotients up
to 7 or 8 bits with just 256 bytes of table.

Or shift both ni and di, and get quotient with something like

  q = tabp[(ni << NBITS) + di] >> (NBITS - (dcnt - ncnt))

Then one could handle quotients up to 5 bits with a table of 1024 bytes
(indexing by 5 significant bits of each of n and d, excluding the most
significant one bit).

>   q0 = tabp[(ni << NBITS) + di];
>   r0 = n0 - q0 * d0;
>   mask = -(mp_limb_t) (r0 >= d0);
>   q0 -= mask;
>   r0 -= d0 & mask;

You round tabulated q down, and adjust up. I guess one can save one
cycle if one instead rounds it up and adjusts down,

q0 = tabp[(ni << NBITS) + di];
t = q0 * d0;
r0 = n0 - t;
mask = -(mp_limb_t) (n0 < t);
q0 += mask;
r0 += d0 & mask;

since computation of mask and initial r0 can then run in parallel (or
the same instruction, if the compiler is clever enough to use carry
out). Or is there a risk for overflow in q0 * d0?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: GCD project status?

2019-09-18 Thread Torbjörn Granlund

  > * Speed up div1, presumably with a table-based approach.

A quick-and-dirty but working variant below.

The table size with NBITS = 8 is 32KiB.  Far too big.  It avoids
branches in 83% of the cases, which is not that great considering the
huge cost of mispredicted branches.

With some thought, I am sure the table size could come down while the
accuracy could improve.

#define NBITS 8
static unsigned char tab[1 << (2 * NBITS - 1)];
static unsigned char *tabp = tab - (1 << (NBITS - 1) << NBITS);

mp_double_limb_t
div1 (mp_limb_t n0, mp_limb_t d0)
{
  int ncnt;
  size_t ni, di;
  mp_limb_t q0;
  mp_limb_t r0;
  mp_limb_t mask;
  mp_double_limb_t res;

  count_leading_zeros (ncnt, n0);
  ni = n0 >> (64 - NBITS - ncnt);
  di = d0 >> (64 - NBITS - ncnt);

  if (UNLIKELY (di < (1 << NBITS/2)))
{
  q0 = n0 / d0;
  r0 = n0 - q0 * d0;
}
  else
{
  q0 = tabp[(ni << NBITS) + di];
  r0 = n0 - q0 * d0;
  mask = -(mp_limb_t) (r0 >= d0);
  q0 -= mask;
  r0 -= d0 & mask;
}

  res.d1 = q0;
  res.d0 = r0;
  return res;
}

void
init_tab()
{
  size_t ni, di;

  for (ni = 1 << (NBITS - 1); ni < (1 << NBITS); ni++)
{
  for (di = 0; di < (1 << NBITS); di++)
{
  tabp[(ni << NBITS) + di] = (ni << NBITS) / ((di << NBITS) + ((1 << NBITS) - 1));
}
}
}


-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: GCD project status?

2019-09-17 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes:

> Ideally, one would compile hgcd2.c in all possible variants (presumably
> through hand-crafted hgcd2-1-1.c, hgcd2-2-1.c, etc.), and then call the
> selected hgcd2's entry point through a function pointer for further
> measuring.

Hmm. So the hgcd2-x-y.c would be there also for speed, and hgcd2.c would
have something like

#if TUNE_PROGRAM_BUILD
int mpn_hgcd2(...)
{
  return (*hgcd2_method_pointer)(...);
}

> I'm pretty confused about the algorithms used for gcdext.  Is it at all
> helped by our hgcd2 improvements?

If you look at mpn_gcdext_lehmer_n, that's a loop around hgcd2, applying
each matrix to the numbers being reduced using
mpn_matrix22_mul1_inverse_vector, and to the cofactors using
mpn_hgcd_mul_matrix1_vector.

At the end, it calls mpn_gcdext_1. mpn_gcdext_22 is missing.

> What is suboptimal about gcdext for large operands?

Subquadratic gcd is a loop calling hgcd on the high part of the numbers
(iirc, n/3 size), then applying the matrix also to the low part, and
repeating until the numbers are in Lehmer range.

Current subquadratic gcdext does the same, but also builds up cofactors
along the way. So the numbers get smaller, and the cofactors larger. For
one thing, that makes the optimal splitting for the hgcd call complicated:
one should take into account not only the size of the numbers, but also
the size of the cofactors. That's done in an ad hoc way, with a different
splitting for the very first iteration, then a fixed size ratio.

More fundamentally, the multiplications used to build up the cofactors
get more and more unbalanced for each iteration. That doesn't seem
healthy for a divide-and-conquer algorithm.

I suspect a better way is to do it like this: Given A, B of size n,
split off high part, say size n/2 or so. Call hgcd, so we get reduced
numbers a, b, with

 (A;B) = M (a;b)

where M is of size n/4 and a, b of size 3n/4.

Recursively compute gcdext(a,b), to get

  g = s a + t b

Then construct one or both cofactors for 

  g = S A + T B

based on s, t and the matrix M. Then we'll reduce the numbers in size as
we recurse down, and build up the cofactors as we return, without getting
increasingly unbalanced multiplications.

But I haven't tried it out.
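
Spelling out the reconstruction (assuming det M = 1): with

  M = (m00, m01; m10, m11),  m00 m11 - m01 m10 = 1

we have (a;b) = M^{-1} (A;B), M^{-1} = (m11, -m01; -m10, m00), so

  g = s a + t b = (s m11 - t m10) A + (t m00 - s m01) B

i.e. S = s m11 - t m10 and T = t m00 - s m01, four multiplications of the
size-n/4 matrix entries by the cofactors from the recursive call.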

>   * Try out left-to-right binary hgcd2.
>
> I had not realised you meant left-to-right.  I thought you talked about
> right-to-left.  Interesting!

There are a couple of different cases, I'm afraid. Say we have a of
a_size bits and b of b_size bits.

If a_size == b_size, set
  
  {a, b} <-- {|a-b|, min(a,b)}

Like a binary step, but we don't divide out powers of two, and we
eliminate at least one bit at the high end.

When a_size > b_size, then to avoid underflow I'm afraid we need 

  a <-- a - 2^{a_size - b_size - 1} b

with progress of 0.5 bits or so. Or we could go halfway towards
computing a quotient by using

  a <-- a - 2^{a_size - b_size} b

in case that is non-negative, which would give us better progress. And
similarly for a_size < b_size.

So a fair amount of logic that needs to be branch free. Updates of u0
and u1 are cheap, but they make any needed conditional swap more
costly.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: GCD project status?

2019-09-17 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  How much speedup have we achieved?

I don't know.  I just observed the GCD_DC_THRESHOLD to change a lot, and
that is a good sign.

  Looks like GCD_DC_THRESHOLD gets higher on many machines, but lower on a
  few? I now realize that the way tuneup currently works, tuning of HGCD_*
  and GCD_* do *not* take the tuned HGCD2_*_METHOD into account. To fix
  that, tuneup build needs mpn_hgcd2 to jump via a function pointer. Or is
  there a better way to do that?

Dunno.

Ideally, one would compile hgcd2.c in all possible variants (presumably
through hand-crafted hgcd2-1-1.c, hgcd2-2-1.c, etc.), and then call the
selected hgcd2's entry point through a function pointer for further
measuring.  Using a function for div1/div2 would disturb the
measurements a lot, I think.

  > * Make gcdext great again (as great as gcd).

  That means asm gcdext_11 and gcdext_22? In the Lehmer range, I think
  it's as good as gcd. In the subquadratic range, gcdext is suboptimal,
  but that's a different project.

Asm gcdext_11, _22, and perhaps some more should be done, I think.

I'm pretty confused about the algorithms used for gcdext.  Is it at all
helped by our hgcd2 improvements?

What is suboptimal about gcdext for large operands?

  * Try out left-to-right binary hgcd2.

I had not realised you meant left-to-right.  I thought you talked about
right-to-left.  Interesting!

  * Fix doc or implementation regarding mpn_gcd with even inputs.

OK.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: GCD project status?

2019-09-17 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes:

> I feel we've achieved much of the possible speedup for gcd now.

How much speedup have we achieved?

> But what more can we do before we are completely done for now?
> Let me try to list it:
>
>
> * Add entry points for gcd_11 allowing even operand(s).
> * Add entry points for gcd_22 allowing even operand(s)?
> * Make generic/gcd_1.c call gcd_11 and get rid of */gcd_1.asm.

And review other call sites for mpn_gcd_1.

> * Consider adding gcd_33, etc, and also the basecase code Marco and
>   Torbjorn worked on.
>
> * Speed up div1, presumably with a table-based approach.
>
> * Copy the div1/div2 code to hgcd2_jacobi.c (or put it in its own file,
>   or in gmp-impl.h).
>
> * Tune HGCD2_DIV2_METHOD?
>
> * Update lots of gmp-mparam.h files with HGCD2_DIV{1,2}_METHOD and
>   GCD_DC_THRESHOLD (the latter has moved a lot!)

Looks like GCD_DC_THRESHOLD gets higher on many machines, but lower on a
few? I now realize that the way tuneup currently works, tuning of HGCD_*
and GCD_* do *not* take the tuned HGCD2_*_METHOD into account. To fix
that, tuneup build needs mpn_hgcd2 to jump via a function pointer. Or is
there a better way to do that?

> * Make gcdext great again (as great as gcd).

That means asm gcdext_11 and gcdext_22? In the Lehmer range, I think
it's as good as gcd. In the subquadratic range, gcdext is suboptimal,
but that's a different project.

> Did I forget something?

* Try out left-to-right binary hgcd2.

* Fix doc or implementation regarding mpn_gcd with even inputs.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.