Ciao,
Il Mer, 28 Agosto 2019 11:39 am, Torbjörn Granlund ha scritto:
> Marco Bodrato writes:
> or maybe simply a loop that knows that U will be smaller until also V
> will be one (some) limb shorter...
>
> ...and finally tail-call to the smaller fixed-size asm function?
>
> I think this
Marco Bodrato writes:
Maybe tail-call?
When up[0] == 0 but U is not zero and we can not return the
result... That's an unlikely case, but it means that one of the
operands gets much smaller than the other. Maybe a special strategy
can be used in this case? Maybe a division, maybe a
Ciao,
Il 2019-08-28 00:09 t...@gmplib.org ha scritto:
I broke out the unlikely up[0] code into a separate function, a
function
which can be common for any size. A function which we might call from
asm.
Maybe tail-call?
When up[0] == 0 but U is not zero and we can not return the result...
Marco Bodrato writes:
For a generic code with variable N, one may prefer a code that chooses
if a copy or a shorter shift is needed. But this means more code and
the shift could not be an in-lined fixed size version...
I broke out the unlikely up[0] code into a separate function, a
Il 2019-08-27 21:10 t...@gmplib.org ha scritto:
Marco Bodrato writes:
... and on some platform mpn_rshift may not support cnt==0.
That was taken care of in ny last version.
I wrote my message before, and did not realize, before sending it, that
you sent a new version :-)
I added a
Ciao,
Il 2019-08-27 16:35 t...@gmplib.org ha scritto:
I got something working. It runs quite well, and seems to beat the
Great!
static inline void
mpn_gcd_NN (mp_limb_t *rp, mp_limb_t *up, mp_limb_t *vp, size_t N)
I see that your idea is to obtain a N-loop-unrolled version...
if
Some cleanups and tweaks later. The gcd_33 based on this, compiled with
gcc 8.3, runs at 30 cycles per iteration. (Note, not cycles per bit!)
My best gcd_33 in assembly runs at 10 cycles per iteration.
The former uses memory based operands. The latter keeps everything in
registers.
If we
I got something working. It runs quite well, and seems to beat the
performance of mpn_gcd. Here is the code:
#include "gmp-impl.h"
#include "longlong.h"
#ifndef CMP_SWAP
#define CMP_SWAP(ap,bp,n) \
do {
ni...@lysator.liu.se (Niels Möller) writes:
And to make the loop work, it needs some condition to decrement N and
maintain non-zero high limbs (if both up[N-1] and vp[N-1] are zero,
comparison is no good). So that would be something like
Since N is my proposal is a constant, it is
Ciao,
Il Dom, 25 Agosto 2019 2:28 am, Torbjörn Granlund ha scritto:
> Now we have a nice set of x86_64 gcd_22. The code is not as well tuned
> as the gcd_11 code, but it runs somewhat fast.
So if I suggest to reorder some instructions in the loop, you will not
upset :-)
If we can change cmovc-s
"Marco Bodrato" writes:
> Some messages ago, Niels suggested that discarding the smallest number
> keeping the largest one in some unlikely case could be a good idea.
I think that is ok provided the numbers are close...
> I did not completely understand what he meant... But here I can see his
Ciao,
Il Lun, 26 Agosto 2019 4:36 pm, Torbjörn Granlund ha scritto:
>for (;;) {
> if (up[N-1] < vp[N-1])
> swap up,vp
> back:
>cy = mpn_sub_n (up, up, vp, N);
>if (UNLIKELY (cy)) { swap up,vp; goto back }
>if (UNLIKELY (up[0] == 0)) {
>
I wonder if not something like this,
for (;;) {
if (up[N-1] < vp[N-1])
swap up,vp
back:
cy = mpn_sub_n (up, up, vp, N);
if (UNLIKELY (cy)) { swap up,vp; goto back }
if (UNLIKELY (up[0] == 0)) {
// handle any number zeros
"Marco Bodrato" writes:
If I'm not reading this timings wrongly, this means that with the current
code (disregarding the overhead, for those 64-bits limbs)
the bits in the limb 1 require 4 cycles each;
the bits in the limb 2 require 8 cycles each;
the bits in the limb 3 require 54
Victor Shoup writes:
> Regarding the so-called doc bug, if I understand the issue correctly,
> I don’t think it’s a good idea to add more preconditions to the
> documentation. In fact, I think that would be a really bad idea.
I agree it's usually a bad idea, but may be ok under the
Ciao,
Il Dom, 25 Agosto 2019 9:34 am, Niels Möller ha scritto:
> t...@gmplib.org (Torbjörn Granlund) writes:
>> Should we have gcd_33, gcd_44, and gcd_55 also?
> I think it would be more beneficial to optimize hgcd2.
> E.g., 3-limb gcd should typically boil down to one hgcd2, a couple of
> 3:1
Regarding the so-called doc bug, if I understand the issue correctly, I don’t
think it’s a good idea to add more preconditions to the documentation. In fact,
I think that would be a really bad idea.
> On Aug 25, 2019, at 2:34 AM, niels mller
> wrote:
>
> t...@gmplib.org (Torbjörn Granlund)
t...@gmplib.org (Torbjörn Granlund) writes:
> Now what?
1. Update mpn_gcd.c and other callers of gcd_11 to call the new
functions (like in the patch I posted few days ago). Watch out for
performance regressions.
2. Consider alternative entry points. Allowing one even operand would
Now we have a nice set of x86_64 gcd_22. The code is not as well tuned
as the gcd_11 code, but it runs somewhat fast.
I haven't explored the table based variant which gives 3 bits of
progress per iteration. It might make the new code obsolete for
machines with fast multiply.
Now what? Should
ni...@lysator.liu.se (Niels Möller) writes:
The below implementation appears to pass tests, and give a modest
speedup of 0.2 cycles per input bit, or 2.5%. Benchmark, comparing C
implementations of gcd_11 and gcd_22:
Beware of "turbo" when counting cycles! (Relative measurements like
t...@gmplib.org (Torbjörn Granlund) writes:
> Why would it be faster? Because while cnd1 needs to wait for the
> previous iteration's double-limb shifting, cnd2 depends on just the
> single-limb shifting of the high U world.
The below implementation appears to pass tests, and give a modest
ni...@lysator.liu.se (Niels Möller) writes:
I take it the idea is to have an unlikely branch for this case;
attempting a branchless correction of vgtu would miss the point?
Yes.
--
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing
t...@gmplib.org (Torbjörn Granlund) writes:
> I think we might get more speed from this algorithm:
>
> while (highlimb(U) | highlimb(V)) != 0
> S = U - V
> T = V - U
> cnt = count_trailing_zeros(lowlimb(S))
> cnd1 = U < V
> cnd2 = highlimb(U) < highlimb(V)
> if (cnd2)
>
We now have the algorithm
while (highlimb(U) | highlimb(V)) != 0
S = U - V
T = V - U
cnt = count_trailing_zeros(lowlimb(S))
cnd1 = U < V
if (cnd1)
V = U
U = T
else
U = S
U >>= cnt
with some variation. The "if" statement is performed with masking
t...@gmplib.org (Torbjörn Granlund) writes:
Your gcd_22.c triggered a lingering bug in longlong.h:
ARM target which default to Thumb code cannot to the rsc instruction
(reverse subtract with carry). We will need to write separate
sub_ddmmss for Thumb.
The problem statement is
Your gcd_22.c triggered a lingering bug in longlong.h:
ARM target which default to Thumb code cannot to the rsc instruction
(reverse subtract with carry). We will need to write separate
sub_ddmmss for Thumb.
The problem statement is sub_ddmmss (vgtu, d0, 0, u0, 0, v0) which just
is there to
I've now tried the u + v variant for gcd_22. I've strived to minimize
the number of carry propagations, and I see no obvious
microptimizations. And it's slower, time increases from 7.8 cycles per
input bit to 9.8, when I measure;
Unclear if it's because the critical path gets longer, or if it's
t...@gmplib.org (Torbjörn Granlund) writes:
> ni...@lysator.liu.se (Niels Möller) writes:
>
> $ ./tune/speed -p 10 -s 1-64 -t 3 -C mpn_gcd_11 mpn_gcd_22
> overhead 4.01 cycles, precision 10 units of 5.08e-10 secs, CPU freq
> 1966.75 MHz
> mpn_gcd_11mpn_gcd_22
> 1
ni...@lysator.liu.se (Niels Möller) writes:
$ ./tune/speed -p 10 -s 1-64 -t 3 -C mpn_gcd_11 mpn_gcd_22
overhead 4.01 cycles, precision 10 units of 5.08e-10 secs, CPU freq
1966.75 MHz
mpn_gcd_11mpn_gcd_22
1 #7.0459 11.0729
4 #2.8721
ni...@lysator.liu.se (Niels Möller) writes:
> So the gcd_22 code with all the branches needs about 11 cycles per input
> bit. gcd_11 is coreihwl/gcd_11.asm in this build.
I had to do a quick try with the masking version before leaving for work:
$ ./tune/speed -p 10 -s 1-64 -t 3 -C
t...@gmplib.org (Torbjörn Granlund) writes:
> ni...@lysator.liu.se (Niels Möller) writes:
>
> Attached is a patch to move the current (slow) gcd_2 to a separate file,
> rename
> it mpn_gcd_22, change argument and return type, and add a basic test.
>
> Nice!
Pushed now.
> Speed support
ni...@lysator.liu.se (Niels Möller) writes:
Attached is a patch to move the current (slow) gcd_2 to a separate file,
rename
it mpn_gcd_22, change argument and return type, and add a basic test.
Nice!
Do you agree with the naming in
typedef struct
{
mp_limb_t d0, d1;
}
32 matches
Mail list logo