Re: Better tabselect

2013-04-12 Thread Torbjorn Granlund
I'd suggest using the loop below for sparc64. It limits `which' to be < 2^32 by creating the mask based on a 32-bit comparison. It would be possible to replace subcc o1,1,o1; subc ... by addcc o1,-1,o1; addxc ... for newer chips, but I think that's no use. I sincerely apologise for the odd number

Re: Better tabselect

2013-04-12 Thread David Miller
From: Torbjorn Granlund t...@gmplib.org Date: Fri, 12 Apr 2013 10:04:35 +0200 David Miller da...@davemloft.net writes: The existing C code approaches 6 cycles/limb on T4; the best I can do without pipelining with this new approach at 4-way unrolling is ~4.5 cycles/limb: This

Re: Better tabselect

2013-04-12 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: It isn't really conditional execution on sparc, the resources and timing required for the move instruction are constant whether the condition matches or not. That's not enough. It needs to have the same data-dependency behaviour too. And it

Re: Better tabselect

2013-04-12 Thread David Miller
From: Torbjorn Granlund t...@gmplib.org Date: Fri, 12 Apr 2013 17:14:41 +0200 I'd suggest using the loop below for sparc64. It limits `which' to be < 2^32 by creating the mask based on a 32-bit comparison. It would be possible to replace subcc o1,1,o1; subc ... by addcc o1,-1,o1; addxc ...

Re: Better tabselect

2013-04-12 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: I sincerely apologise for the odd number of insns in the loop. :-) Easily solved by using the pointer trick on 'tp' and making 'i' instead be 'i * stride'. That'll get us down to 16 instructions. I'll try to find time to play with this

Re: Better tabselect

2013-04-12 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: From: Torbjorn Granlund t...@gmplib.org Date: Fri, 12 Apr 2013 10:04:35 +0200 I am quite sure your code runs in the neighbourhood of 9/4 = 2.25 cycles per limb on T4, BTW. On US1-2 it might run at 7/4 c/l and on US3-4 it again probably

Re: Better tabselect

2013-04-11 Thread Torbjorn Granlund
I've written a few variants of tabselect using a different table traversal order. I think of this as horizontal, making the old one vertical. An ARM NEON variant, which I think has become nice thanks to NEON's elegance. It improves the A9 performance by ~100% and the A15 performance by ~30%

Re: Better tabselect

2013-04-11 Thread David Miller
From: Torbjorn Granlund t...@gmplib.org Date: Thu, 11 Apr 2013 23:55:18 +0200 I think we need to write a new tabselect also for ppc64, sparc64, and perhaps x86_32. The latter could use a variant of our x64-sse-horis-tabselect-w8.asm, at least on some Intel CPUs. I'll take a stab at sparc64.

Re: Better tabselect

2013-04-11 Thread David Miller
From: David Miller da...@davemloft.net Date: Thu, 11 Apr 2013 19:06:17 -0400 (EDT) From: Torbjorn Granlund t...@gmplib.org Date: Thu, 11 Apr 2013 23:55:18 +0200 I think we need to write a new tabselect also for ppc64, sparc64, and perhaps x86_32. The latter could use a variant of our

Better tabselect

2013-04-10 Thread Torbjorn Granlund
I think my original tabselect method is not the best, at least not if we implement it in assembly. The current method takes one full table vector entry at a time, and needs to perform two loads and one store per entry in the large table of vectors. It seems better to work in the opposite
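For readers following the thread, here is a minimal C sketch of the entry-at-a-time ("vertical") traversal described in this message. It is an illustration only, not the GMP source: the name tabselect_vertical, the limb_t typedef (a stand-in for mp_limb_t) and the row-major table layout (entry i starting at table + i * n) are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t; /* assumption: stand-in for GMP's mp_limb_t */

/* Hypothetical sketch of the "vertical" order: walk the table one entry
   at a time.  Each limb of each entry costs one load from the table,
   one load from rp and one store to rp. */
static void tabselect_vertical(limb_t *rp, const limb_t *table,
                               size_t n, size_t nelems, size_t which)
{
    assert(which < nelems);

    for (size_t j = 0; j < n; j++)
        rp[j] = 0;

    for (size_t i = 0; i < nelems; i++) {
        /* All ones when i == which, zero otherwise.  A compiler is free
           to implement this comparison in a data-dependent way. */
        limb_t mask = -(limb_t)(i == which);
        const limb_t *tp = table + i * n;

        for (size_t j = 0; j < n; j++)
            rp[j] |= tp[j] & mask; /* load tp[j], load rp[j], store rp[j] */
    }
}
```

Every table entry is read regardless of `which`, so the memory access pattern does not depend on the index being selected.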

Re: Better tabselect

2013-04-10 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: mp_limb_t rl; for (rl = 0, i = 0; i < nelems; i++) rl += table[...] & -(mp_limb_t) (i == k); rp[...] = rl; Reduces the number of stores from O(n^2) to O(n), and instead increases the mask creation from O(n) to O(n^2). Loads
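The loop quoted in this message can be fleshed out into a complete function roughly as follows. This is a sketch under assumptions: the index expressions elided as [...] in the quote are filled in here with a row-major layout (limb j of entry i at table[i * n + j]), and the name tabselect_horizontal and the limb_t typedef are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t; /* assumption: stand-in for GMP's mp_limb_t */

/* Hypothetical sketch of the "horizontal" order: for each result limb,
   scan the same limb position in every table entry and accumulate the
   selected one in a register, then store once. */
static void tabselect_horizontal(limb_t *rp, const limb_t *table,
                                 size_t n, size_t nelems, size_t k)
{
    assert(k < nelems);

    for (size_t j = 0; j < n; j++) {
        limb_t rl = 0;
        for (size_t i = 0; i < nelems; i++) {
            /* The mask is recomputed for every (i, j) pair in this naive
               form; at most one term of the sum is nonzero. */
            rl += table[i * n + j] & -(limb_t)(i == k);
        }
        rp[j] = rl; /* one store per result limb */
    }
}
```

Compared with the entry-at-a-time order, rp is never reloaded, and the price is that the mask is formed inside the inner loop.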

Re: Better tabselect

2013-04-10 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes: The multitude and pattern of mask computations make side channel leakage worse if the mask computation is made stupidly. I don't trust compilers here, since they might use a conditional move or other leaky method. One possible variant,
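One way to form an equality mask without writing a comparison in the source is plain arithmetic on the XOR of the two indices. The sketch below is not the variant proposed in this message (which is cut off here); eq_mask and limb_t are hypothetical names, and a C compiler may still rewrite the expression, which is the concern that motivates doing this in assembly.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t limb_t; /* assumption: stand-in for GMP's mp_limb_t */

/* Hypothetical helper: return all ones if a == b, zero otherwise,
   using only XOR, OR, negation, shift and subtraction. */
static limb_t eq_mask(limb_t a, limb_t b)
{
    limb_t x = a ^ b;                        /* zero exactly when a == b */
    limb_t nonzero = (x | (0 - x)) >> 63;    /* 1 if x != 0, else 0 */
    assert(nonzero <= 1);
    return (limb_t)0 - (nonzero ^ 1);        /* 0 - 1 wraps to all ones */
}
```

The step `x | (0 - x)` works because, for any nonzero x, either x or its two's complement negation has the top bit set.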