I'd suggest using the loop below for sparc64. It limits `which' to be
less than 2^32 by creating the mask based on a 32-bit comparison. It would be
possible to replace subcc o1,1,o1; subc ... by addcc o1,-1,o1; addxc
... for newer chips, but I think that's no use.
I sincerely apologise for the odd number
From: Torbjorn Granlund t...@gmplib.org
Date: Fri, 12 Apr 2013 10:04:35 +0200
David Miller da...@davemloft.net writes:
The existing C code approaches 6 cycles/limb on T4, the best I can do
without pipelining with this new approach at 4-way unrolling is ~4.5
cycles/limb:
This
David Miller da...@davemloft.net writes:
It isn't really conditional execution on sparc, the resources and
timing required for the move instruction are constant whether the
condition matches or not.
That's not enough.
It needs to have the same data-dependency behaviour too.
And it
From: Torbjorn Granlund t...@gmplib.org
Date: Fri, 12 Apr 2013 17:14:41 +0200
I'd suggest using the loop below for sparc64. It limits `which' to be
less than 2^32 by creating the mask based on a 32-bit comparison. It would be
possible to replace subcc o1,1,o1; subc ... by addcc o1,-1,o1; addxc
...
David Miller da...@davemloft.net writes:
I sincerely apologise for the odd number of insns in the loop. :-)
Easily solved by using the pointer trick on 'tp' and making 'i'
instead be 'i * stride'. That'll get us down to 16 instructions.
I'll try to find time to play with this
David Miller da...@davemloft.net writes:
From: Torbjorn Granlund t...@gmplib.org
Date: Fri, 12 Apr 2013 10:04:35 +0200
I am quite sure your code runs in the neighbourhood of 9/4 = 2.25 cycles
per limb on T4, BTW. On US1-2 it might run at 7/4 c/l and on US3-4 it
again probably
I've written a few variants of tabselect using a different table
traversal order. I think of this as horizontal, making the old one
vertical.
An ARM NEON variant which I think has become nice, thanks to NEON's
elegance. It improves the A9 performance by ~100% and the A15
performance by ~30%
From: Torbjorn Granlund t...@gmplib.org
Date: Thu, 11 Apr 2013 23:55:18 +0200
I think we need to write new tabselect also for ppc64, sparc64, and
perhaps x86_32. The latter could use a variant of our
x64-sse-horis-tabselect-w8.asm, at least on some Intel CPUs.
I'll take a stab at sparc64.
From: David Miller da...@davemloft.net
Date: Thu, 11 Apr 2013 19:06:17 -0400 (EDT)
From: Torbjorn Granlund t...@gmplib.org
Date: Thu, 11 Apr 2013 23:55:18 +0200
I think we need to write new tabselect also for ppc64, sparc64, and
perhaps x86_32. The latter could use a variant of our
I think my original tabselect method is not the best, at least not if we
implement it in assembly.
The current method takes one full table vector entry at a time, and needs
to perform two loads and one store per entry in the large table of
vectors.
It seems better to work in the opposite
ni...@lysator.liu.se (Niels Möller) writes:
mp_limb_t rl;
for (rl = 0, i = 0; i < nelems; i++)
  rl += table[...] & -(mp_limb_t) (i == k);
rp[...] = rl;
Reduces the number of stores from O(n^2) to O(n), and instead increases
the mask creation from O(n) to O(n^2). Loads
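To make the two traversal orders concrete, here is a minimal C sketch. It is not the GMP code: `limb_t`, `eq_mask`, `tabselect_vertical`, `tabselect_horizontal`, and the flat layout in which entry i occupies limbs `tab[i*n .. i*n+n-1]` are all assumptions made for illustration. The arithmetic mask assumes both indices are below 2^32, and plain C gives no guarantee that a compiler keeps it free of conditional moves or branches.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical stand-in for mp_limb_t */

/* All ones if i == k, otherwise zero, built from arithmetic only.
   Assumes i and k are both below 2^32. */
static limb_t eq_mask(size_t i, size_t k)
{
    uint64_t d = (uint64_t)i ^ (uint64_t)k;   /* zero exactly when i == k */
    /* d - 1 has bit 63 set only when d == 0, since d < 2^63 here */
    return (limb_t)0 - ((d - 1) >> 63);
}

/* "Vertical" order: walk the table one entry at a time and merge each
   entry into rp. One mask per entry, but rp is loaded and stored for
   every limb of every entry. */
static void tabselect_vertical(limb_t *rp, const limb_t *tab,
                               size_t n, size_t nelems, size_t k)
{
    for (size_t j = 0; j < n; j++)
        rp[j] = 0;
    for (size_t i = 0; i < nelems; i++) {
        limb_t mask = eq_mask(i, k);
        for (size_t j = 0; j < n; j++)
            rp[j] |= tab[i * n + j] & mask;
    }
}

/* "Horizontal" order: for each result limb, scan that limb position
   across all entries and accumulate in a register. One store per
   result limb, but a mask is formed for every (limb, entry) pair. */
static void tabselect_horizontal(limb_t *rp, const limb_t *tab,
                                 size_t n, size_t nelems, size_t k)
{
    for (size_t j = 0; j < n; j++) {
        limb_t rl = 0;
        for (size_t i = 0; i < nelems; i++)
            rl += tab[i * n + j] & eq_mask(i, k);
        rp[j] = rl;
    }
}
```

Both functions read every table entry regardless of k, so the memory access pattern does not depend on the selected index; they differ only in how the stores and mask computations are distributed.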
Torbjorn Granlund t...@gmplib.org writes:
The multitude and pattern of mask computations make side channel leakage
worse if the mask computation is made stupidly. I don't trust compilers
here, since they might use a conditional move or other leaky method.
One possible variant,
12 matches