Re: [PATCH 2 of 3] Add MIPS32R2 RDHWR-based cycle counter support

2019-09-11 Thread Torbjörn Granlund
Mobile Stream  writes:

  T>   Add MIPS32R2 RDHWR-based cycle counter support.
  T> Does that work in user mode for all *nix ports?

  don't know.

  based on a quick grep:
  openbsd, freebsd seem to only allow ULR but never CC/CCRes;
  netbsd seems to enable everything in HWREna including CC, CCRes.

Wait! We're not MIPS system programmers around here!

Can your timing code be used under *nix?  I guess from your reply that it
cannot, generally.

  T> Does it not also work in r3, r4, r5?  And r6?

  the instruction and CP0 HWREna are defined in r3, r5, r6 too.

Then (assuming the cycle counter is usable from outside of the kernel)
I'd say that your suggested configuration pattern is a bit too
restrictive.


-- 
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: [PATCH 2 of 3] Add MIPS32R2 RDHWR-based cycle counter support

2019-09-11 Thread Mobile Stream
T>   Add MIPS32R2 RDHWR-based cycle counter support.
T> Does that work in user mode for all *nix ports?

don't know.

based on a quick grep:
openbsd, freebsd seem to only allow ULR but never CC/CCRes;
netbsd seems to enable everything in HWREna including CC, CCRes.
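
A minimal sketch of what the grep result above means, assuming the standard MIPS32 Release 2 layout in which bit n of CP0 HWREna enables user-mode RDHWR access to hardware register n (CC is register 2, CCRes is 3, ULR is 29). The function names and the example mask values are hypothetical; a real check would have to read the mask the kernel actually programs.

```c
#include <assert.h>
#include <stdint.h>

/* RDHWR register numbers.  HWREna bit n gates user-mode access to
   hardware register n.  */
enum
{
  HWR_CPUNUM = 0,
  HWR_SYNCI_STEP = 1,
  HWR_CC = 2,      /* cycle counter */
  HWR_CCRES = 3,   /* cycles per CC tick */
  HWR_ULR = 29     /* user-local register (thread pointer) */
};

/* Nonzero if the given HWREna mask lets user mode read register REG.  */
static int
hwrena_allows (uint32_t hwrena, unsigned reg)
{
  assert (reg < 32);
  return (int) ((hwrena >> reg) & 1u);
}

/* Nonzero if user-mode code could read the cycle counter (CC).  */
static int
cc_readable (uint32_t hwrena)
{
  return hwrena_allows (hwrena, HWR_CC);
}

/* Nonzero if user-mode code could also read the resolution (CCRes).  */
static int
ccres_readable (uint32_t hwrena)
{
  return hwrena_allows (hwrena, HWR_CCRES);
}
```

With a ULR-only mask (as described for openbsd and freebsd) both checks fail, and an RDHWR of CC from user mode would trap instead of returning a count.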


T> Does it not also work in r3, r4, r5?  And r6?

the instruction and CP0 HWREna are defined in r3, r5, r6 too.

there is no r4.


T> Is there a mipsisa64 counterpart?

mips arch specs list rdhwr as mips32r2+ only, where they would usually mention
mips64 too. however, the i6400 and p6600 (64-bit r6 cores) specs say rdhwr and
the corresponding hwrena bits are supported.

octeon cores have had this 32-bit counter since at least octeon plus, as well
as a custom 64-bit counter.



Re: [PATCH 2 of 3] Add MIPS32R2 RDHWR-based cycle counter support

2019-09-11 Thread Torbjörn Granlund
i...@mobile-stream.com writes:

  Add MIPS32R2 RDHWR-based cycle counter support.

Does that work in user mode for all *nix ports?

Does it not also work in r3, r4, r5?  And r6?

Is there a mipsisa64 counterpart?

-- 
Torbjörn
Please encrypt, key id 0xC8601622


[PATCH 3 of 3] Add MIPS32R1 MADDU-based *mul_1.asm functions

2019-09-11 Thread info
Add MIPS32R1 MADDU-based *mul_1.asm functions.

The code tries to keep the [accidental] property of MIPS-II counterparts:
constant-time operation on 32x16 MDUs as found on e.g. 4KEc and some low-
end MCUs. Even if that is unimportant, the performance cost is invisible.

It is faster on all tried MIPS32R1/R2/R5 CPUs (see the c/l table) and is
expected to be fast with any pipelined MDU. So-called Area-Efficient MDU
(optional on some MCUs) will run it *much* slower (~3x for addmul_1).

While the functions look similar (especially mul_1 and addmul_1), they are
kept separate due to corner-case (N=1,2,3) tweaks for P5600, which have no
ill effect on 4KEc or 24KEc at least.
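
For readers not fluent in MIPS assembly, here is a rough C model of the accumulate pattern the patch relies on: one MULTU followed by MADDU with a register holding 1, so that the 64-bit HI/LO pair absorbs both the destination limb and the incoming carry. This is only a sketch under the assumption of 32-bit limbs; `addmul_1_model` is a hypothetical name, and the loop ignores the software pipelining and the N=1,2,3 special cases of the real code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of mpn_addmul_1 on 32-bit limbs: {rp,n} += {up,n} * vl.
   Returns the carry-out limb.  */
static uint32_t
addmul_1_model (uint32_t *rp, const uint32_t *up, size_t n, uint32_t vl)
{
  uint32_t carry = 0;
  size_t i;

  assert (n >= 1);

  for (i = 0; i < n; i++)
    {
      uint64_t acc = (uint64_t) up[i] * vl;   /* multu up[i],vl */
      acc += rp[i];                           /* maddu rp[i],1 */
      acc += carry;                           /* maddu carry,1 */
      rp[i] = (uint32_t) acc;                 /* mflo */
      carry = (uint32_t) (acc >> 32);         /* mfhi */
    }
  return carry;
}
```

The accumulator cannot overflow: with B = 2^32 the largest possible value is (B-1)^2 + 2(B-1) = B^2 - 1, which still fits in HI/LO. That is why no separate carry-detection instructions (sltu) are needed.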

diff -r 6ab06c72027e -r 789677d6e8b2 configure.ac
--- a/configure.ac
+++ b/configure.ac
@@ -1040,6 +1040,10 @@
   mipsisa32r2*-*-*)
 SPEED_CYCLECOUNTER_OBJ=mips32r2.lo
 cyclecounter_size=1
+path="mips32/r1 mips32"
+   ;;
+  mipsisa32*-*-*)
+path="mips32/r1 mips32"
 ;;
 esac
 ;;
diff -r 6ab06c72027e -r 789677d6e8b2 mpn/mips32/r1/addmul_1.asm
--- /dev/null
+++ b/mpn/mips32/r1/addmul_1.asm
@@ -0,0 +1,79 @@
+include(`../config.m4')
+
+C              cycles/limb
+C 4KEc          9.68
+C 24Kc          9.52
+C 24KEc         9.55
+C P5600         7.80
+C XBurst       13.55
+
+C INPUT PARAMETERS
+C rp   $a0
+C up   $a1
+C n    $a2
+C vl   $a3
+
+ASM_START()
+   .set    noat
+PROLOGUE(mpn_addmul_1)
+   lw      $v1,0($a1)      C L0
+   ori     $at,$zero,1
+   multu   $v1,$a3         C M0
+   lw      $t0,0($a0)      C L1, 32x16 MDU stall
+   addiu   $t1,$a2,-2
+   beq     $at,$a2,1f
+    maddu  $t0,$at         C M0
+   mfhi    $v0             C M0 carry
+   lw      $t2,4($a1)      C L1
+   beqz    $t1,23f
+    lw     $v1,4($a0)      C L1
+   mflo    $t0             C M0
+   andi    $t3,$t1,1
+   sll     $a2,$t1,2
+   beqz    $t3,0f
+    addu   $a2,$a2,$a1
+   multu   $t2,$a3         C M1
+   lw      $t2,8($a1)      C L2, 32x16 MDU stall
+   maddu   $v1,$at         C M1
+   maddu   $v0,$at         C M1
+   mfhi    $v0             C M1 carry
+   lw      $v1,8($a0)      C L2
+   sw      $t0,0($a0)      C S0
+   beq     $at,$t1,23f
+    addiu  $a0,$a0,4
+   addiu   $a1,$a1,4
+   mflo    $t0             C M1
+0: addiu   $a1,$a1,8
+   addiu   $a0,$a0,8
+   multu   $t2,$a3         C M1
+   lw      $t3,0($a1)      C L2, 32x16 MDU stall
+   maddu   $v1,$at         C M1
+   lw      $t4,0($a0)      C L2
+   maddu   $v0,$at         C M1
+   mfhi    $v0             C M1 carry
+   sw      $t0,-8($a0)     C S0
+   mflo    $t1             C M1
+   multu   $t3,$a3         C M2
+   lw      $t2,4($a1)      C L3, 32x16 MDU stall
+   maddu   $t4,$at         C M2
+   lw      $v1,4($a0)      C L3
+   maddu   $v0,$at         C M2
+   mfhi    $v0             C M2 carry
+   sw      $t1,-4($a0)     C S1
+   bne     $a1,$a2,0b
+23: mflo   $t0             C M2
+   multu   $t2,$a3         C M3
+   nop                     C 32x16 MDU stall
+   maddu   $v1,$at         C M3
+   maddu   $v0,$at         C M3
+   mflo    $at             C M3
+   mfhi    $v0             C M3 carry
+   sw      $t0,0($a0)      C S2
+   jr      $ra
+    sw     $at,4($a0)      C S3
+1: mflo    $at
+   mfhi    $v0
+   jr      $ra
+    sw     $at,0($a0)
+EPILOGUE(mpn_addmul_1)
+ASM_END()
diff -r 6ab06c72027e -r 789677d6e8b2 mpn/mips32/r1/mul_1.asm
--- /dev/null
+++ b/mpn/mips32/r1/mul_1.asm
@@ -0,0 +1,69 @@
+include(`../config.m4')
+
+C              cycles/limb
+C 4KEc          7.66
+C 24Kc          7.54
+C 24KEc         7.55
+C P5600         7.04
+C XBurst       10.54
+
+C INPUT PARAMETERS
+C rp   $a0
+C up   $a1
+C n    $a2
+C vl   $a3
+
+ASM_START()
+   .set    noat
+PROLOGUE(mpn_mul_1)
+   lw      $v1,0($a1)      C L0
+   ori     $at,$zero,1
+   multu   $v1,$a3         C M0
+   beq     $at,$a2,1f      C 32x16 MDU stall
+    addiu  $t1,$a2,-2
+   mfhi    $v0             C M0 carry
+   beqz    $t1,23f
+    lw     $t2,4($a1)      C L1
+   mflo    $t0             C M0
+   andi    $t3,$t1,1
+   sll     $a2,$t1,2
+   beqz    $t3,0f
+    addu   $a2,$a2,$a1
+   multu   $t2,$a3         C M1
+   lw      $t2,8($a1)      C L2, 32x16 MDU stall
+   maddu   $v0,$at         C M1
+   mfhi    $v0             C M1 carry
+   sw      $t0,0($a0)      C S0
+   beq     $at,$t1,23f
+    addiu  $a0,$a0,4
+   addiu   $a1,$a1,4
+   mflo    $t0             C M1
+0: addiu   $a1,$a1,8
+   addiu   $a0,$a0,8
+   multu   $t2,$a3         C M1
+   lw      $t3,0($a1)      C L2, 32x16 MDU stall
+   maddu   $v0,$at         C M1
+   mfhi    $v0

[PATCH 2 of 3] Add MIPS32R2 RDHWR-based cycle counter support

2019-09-11 Thread info
Add MIPS32R2 RDHWR-based cycle counter support.

diff -r 0ba6f9f13912 -r 6ab06c72027e configure.ac
--- a/configure.ac
+++ b/configure.ac
@@ -1035,7 +1035,11 @@
path_64="mips64/hilo mips64"
;;
esac
-
+;;
+
+  mipsisa32r2*-*-*)
+SPEED_CYCLECOUNTER_OBJ=mips32r2.lo
+cyclecounter_size=1
 ;;
 esac
 ;;
diff -r 0ba6f9f13912 -r 6ab06c72027e tune/Makefile.am
--- a/tune/Makefile.am
+++ b/tune/Makefile.am
@@ -33,7 +33,7 @@
 AM_LDFLAGS = -no-install
 
 EXTRA_DIST = alpha.asm pentium.asm sparcv9.asm hppa.asm hppa2.asm hppa2w.asm \
-  ia64.asm powerpc.asm powerpc64.asm x86_64.asm many.pl
+  ia64.asm powerpc.asm powerpc64.asm x86_64.asm mips32r2.asm many.pl
 noinst_HEADERS = speed.h
 
 # Prefer -static on the speed and tune programs, since that can avoid
diff -r 0ba6f9f13912 -r 6ab06c72027e tune/mips32r2.asm
--- /dev/null
+++ b/tune/mips32r2.asm
@@ -0,0 +1,12 @@
+include(`../config.m4')
+
+ASM_START()
+PROLOGUE(speed_cyclecounter)
+   rdhwr   $2,$2
+   slti    $3,$2,0
+   sll     $2,$2,1         C save multiply and assume CCRes is 2
+   sw      $3,4($4)
+   jr      $ra
+    sw     $2,0($4)
+EPILOGUE(speed_cyclecounter)
+ASM_END()
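
As a reading aid, the arithmetic in speed_cyclecounter above can be modelled in C roughly as follows. This is a sketch, not part of the patch: `cc_to_cycles` and `cc_scale` are hypothetical names, and the CCRes value of 2 is the same assumption the assembly comment makes rather than something read from the hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Widen a 32-bit CC sample to two 32-bit words holding 2*cc,
   mirroring the sll/slti pair in the assembly.  */
static void
cc_to_cycles (uint32_t cc, uint32_t result[2])
{
  result[0] = cc << 1;    /* sll: low word of 2*cc */
  result[1] = cc >> 31;   /* slti ...,0: the bit shifted out of the low word */
}

/* Reference form with an explicit CCRes, for comparison.  */
static uint64_t
cc_scale (uint32_t cc, uint32_t ccres)
{
  assert (ccres != 0);
  return (uint64_t) cc * ccres;
}
```

For any sample, the two words written by `cc_to_cycles` equal the 64-bit value of `cc_scale (cc, 2)`. On a core whose CCRes is not 2 the shift would give the wrong scale.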





[PATCH 1 of 3] Provide c/l results on some MIPS32 CPUs

2019-09-11 Thread info
Provide c/l results on some MIPS32 CPUs.

diff -r ed6ddbb7a15b -r 0ba6f9f13912 mpn/mips32/addmul_1.asm
--- a/mpn/mips32/addmul_1.asm
+++ b/mpn/mips32/addmul_1.asm
@@ -31,6 +31,13 @@
 
 include(`../config.m4')
 
+C              cycles/limb
+C 4KEc         16.33
+C 24Kc         18.04
+C 24KEc        18.05
+C P5600         9.05
+C XBurst       17.05
+
 C INPUT PARAMETERS
 C res_ptr  $4
 C s1_ptr   $5
diff -r ed6ddbb7a15b -r 0ba6f9f13912 mpn/mips32/mul_1.asm
--- a/mpn/mips32/mul_1.asm
+++ b/mpn/mips32/mul_1.asm
@@ -31,6 +31,13 @@
 
 include(`../config.m4')
 
+C              cycles/limb
+C 4KEc         12.22
+C 24Kc         14.03
+C 24KEc        14.05
+C P5600         7.05
+C XBurst       13.05
+
 C INPUT PARAMETERS
 C res_ptr  $4
 C s1_ptr   $5
diff -r ed6ddbb7a15b -r 0ba6f9f13912 mpn/mips32/submul_1.asm
--- a/mpn/mips32/submul_1.asm
+++ b/mpn/mips32/submul_1.asm
@@ -31,6 +31,13 @@
 
 include(`../config.m4')
 
+C              cycles/limb
+C 4KEc         16.33
+C 24Kc         18.04
+C 24KEc        18.05
+C P5600         9.05
+C XBurst       17.05
+
 C INPUT PARAMETERS
 C res_ptr  $4
 C s1_ptr   $5





Re: Suggested tune/tuneup.c patch

2019-09-11 Thread Torbjörn Granlund
  And please take a screenshot of the affected parameters before and
  after this change, as a sanity check.

I added a history preservation feature to the .../devel/thres/ pages.  At
23:59 each night, all pages are copied to .../devel/thres/-MM-DD.

(There is no index, one needs to type in the wanted date manually.)

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: hgcd1/2

2019-09-11 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  Sounds good. Give it a few days, and delete if it still looks slow
  everywhere.

I cooked a modern alternative:

static mp_double_limb_t
div1 (mp_limb_t n0, mp_limb_t d0)
{
  mp_double_limb_t res;
  int ncnt, dcnt, cnt;
  mp_limb_t q;
  mp_limb_t mask;

  ASSERT (n0 >= d0);

  count_leading_zeros (ncnt, n0);
  count_leading_zeros (dcnt, d0);
  cnt = dcnt - ncnt;

  d0 <<= cnt;

  q = -(mp_limb_t) (n0 >= d0);
  n0 -= d0 & q;
  d0 >>= 1;
  q = -q;

  while (--cnt >= 0)
    {
      mask = -(mp_limb_t) (n0 >= d0);
      n0 -= d0 & mask;
      d0 >>= 1;
      q = (q << 1) - mask;
    }

  res.d0 = n0;
  res.d1 = q;
  return res;
}
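
For anyone who wants to try the routine outside the GMP tree, here is a self-contained sketch of the same branch-free shift-and-subtract loop. It assumes 32-bit limbs; `divres_t`, `clz32` and `div1_sketch` are hypothetical stand-ins for `mp_double_limb_t`, `count_leading_zeros` and `div1`, and the portable `clz32` loop is much slower than the real macro.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for mp_double_limb_t.  */
typedef struct
{
  uint32_t rem;   /* corresponds to res.d0 */
  uint32_t quo;   /* corresponds to res.d1 */
} divres_t;

/* Portable stand-in for count_leading_zeros; requires x != 0.  */
static int
clz32 (uint32_t x)
{
  int n = 0;
  while ((x & 0x80000000u) == 0)
    {
      x <<= 1;
      n++;
    }
  return n;
}

static divres_t
div1_sketch (uint32_t n0, uint32_t d0)
{
  divres_t res;
  uint32_t q, mask;
  int cnt;

  assert (d0 != 0 && n0 >= d0);

  /* Align the top bit of d0 with the top bit of n0.  */
  cnt = clz32 (d0) - clz32 (n0);
  d0 <<= cnt;

  /* mask is all ones when the subtraction applies, else zero.  */
  mask = -(uint32_t) (n0 >= d0);
  n0 -= d0 & mask;
  d0 >>= 1;
  q = -mask;   /* first quotient bit: 1 or 0 */

  while (--cnt >= 0)
    {
      mask = -(uint32_t) (n0 >= d0);
      n0 -= d0 & mask;
      d0 >>= 1;
      q = (q << 1) - mask;   /* subtracting all ones adds 1 */
    }

  res.rem = n0;
  res.quo = q;
  return res;
}
```

For example, `div1_sketch (100, 7)` runs one initial step plus four loop iterations and yields quotient 14 and remainder 2.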

  That should use the same choices for div1 and div2. Is it important to
  inline some or all of the div1/div2 code? If we move them out to
  separate .c files, it gets easier to share between hgcd2 and
  hgcd2_jacobi, and easier to experiment with assembly div1. But it gets a
  bit more complex to set things up for tuneup.

Not considering tuneup hardness, I suppose inlining makes some sense, at
least for _METHOD 1, which often ends up as a single instruction.

I've never been a big fan of inlining.  Small operands GCD in GMP is
special in many ways, as it is hundreds of times slower than
multiplication.  We really need to try hard to improve it.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Suggested tune/tuneup.c patch

2019-09-11 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  The patch below adds a helper function for tuning *_METHOD values,
  evaluated at some fixed size. What do you think?

It helps with some function comments, outlining what a function does,
and what its arguments mean.

Please consider adding that before committing.  And please take a
screenshot of the affected parameters before and after this change, as a
sanity check.


-- 
Torbjörn
Please encrypt, key id 0xC8601622