Re: gcd_22

2019-08-31 Thread Marco Bodrato
Ciao, Il Mer, 28 Agosto 2019 11:39 am, Torbjörn Granlund ha scritto: > Marco Bodrato writes: > or maybe simply a loop that knows that U will be smaller until also V > will be one (some) limb shorter... > > ...and finally tail-call to the smaller fixed-size asm function? > > I think this

Re: gcd_22

2019-08-28 Thread Torbjörn Granlund
Marco Bodrato writes: Maybe tail-call? When up[0] == 0 but U is not zero and we can not return the result... That's an unlikely case, but it means that one of the operands gets much smaller than the other. Maybe a special strategy can be used in this case? Maybe a division, maybe a

Re: gcd_22

2019-08-28 Thread Marco Bodrato
Ciao, Il 2019-08-28 00:09 t...@gmplib.org ha scritto: I broke out the unlikely up[0] code into a separate function, a function which can be common for any size. A function which we might call from asm. Maybe tail-call? When up[0] == 0 but U is not zero and we can not return the result...

Re: gcd_22

2019-08-27 Thread Torbjörn Granlund
Marco Bodrato writes: For a generic code with variable N, one may prefer a code that chooses if a copy or a shorter shift is needed. But this means more code and the shift could not be an in-lined fixed size version... I broke out the unlikely up[0] code into a separate function, a

Re: gcd_22

2019-08-27 Thread Marco Bodrato
Il 2019-08-27 21:10 t...@gmplib.org ha scritto: Marco Bodrato writes: ... and on some platform mpn_rshift may not support cnt==0. That was taken care of in ny last version. I wrote my message before, and did not realize, before sending it, that you sent a new version :-) I added a

Re: gcd_22

2019-08-27 Thread Marco Bodrato
Ciao, Il 2019-08-27 16:35 t...@gmplib.org ha scritto: I got something working. It runs quite well, and seems to beat the Great! static inline void mpn_gcd_NN (mp_limb_t *rp, mp_limb_t *up, mp_limb_t *vp, size_t N) I see that your idea is to obtain a N-loop-unrolled version... if

Re: gcd_22

2019-08-27 Thread Torbjörn Granlund
Some cleanups and tweaks later. The gcd_33 based on this, compiled with gcc 8.3, runs at 30 cycles per iteration. (Note, not cycles per bit!) My best gcd_33 in assembly runs at 10 cycles per iteration. The former uses memory based operands. The latter keeps everything in registers. If we

Re: gcd_22

2019-08-27 Thread Torbjörn Granlund
I got something working. It runs quite well, and seems to beat the performance of mpn_gcd. Here is the code: #include "gmp-impl.h" #include "longlong.h" #ifndef CMP_SWAP #define CMP_SWAP(ap,bp,n) \ do {

Re: gcd_22

2019-08-27 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: And to make the loop work, it needs some condition to decrement N and maintain non-zero high limbs (if both up[N-1] and vp[N-1] are zero, comparison is no good). So that would be something like Since N is my proposal is a constant, it is

Re: gcd_22

2019-08-27 Thread Marco Bodrato
Ciao, Il Dom, 25 Agosto 2019 2:28 am, Torbjörn Granlund ha scritto: > Now we have a nice set of x86_64 gcd_22. The code is not as well tuned > as the gcd_11 code, but it runs somewhat fast. So if I suggest to reorder some instructions in the loop, you will not upset :-) If we can change cmovc-s

Re: gcd_22

2019-08-26 Thread Niels Möller
"Marco Bodrato" writes: > Some messages ago, Niels suggested that discarding the smallest number > keeping the largest one in some unlikely case could be a good idea. I think that is ok provided the numbers are close... > I did not completely understand what he meant... But here I can see his

Re: gcd_22

2019-08-26 Thread Marco Bodrato
Ciao, Il Lun, 26 Agosto 2019 4:36 pm, Torbjörn Granlund ha scritto: >for (;;) { > if (up[N-1] < vp[N-1]) > swap up,vp > back: >cy = mpn_sub_n (up, up, vp, N); >if (UNLIKELY (cy)) { swap up,vp; goto back } >if (UNLIKELY (up[0] == 0)) { >

Re: gcd_22

2019-08-26 Thread Torbjörn Granlund
I wonder if not something like this, for (;;) { if (up[N-1] < vp[N-1]) swap up,vp back: cy = mpn_sub_n (up, up, vp, N); if (UNLIKELY (cy)) { swap up,vp; goto back } if (UNLIKELY (up[0] == 0)) { // handle any number zeros

Re: gcd_22

2019-08-26 Thread Torbjörn Granlund
"Marco Bodrato" writes: If I'm not reading this timings wrongly, this means that with the current code (disregarding the overhead, for those 64-bits limbs) the bits in the limb 1 require 4 cycles each; the bits in the limb 2 require 8 cycles each; the bits in the limb 3 require 54

Re: gcd_22

2019-08-25 Thread Niels Möller
Victor Shoup writes: > Regarding the so-called doc bug, if I understand the issue correctly, > I don’t think it’s a good idea to add more preconditions to the > documentation. In fact, I think that would be a really bad idea. I agree it's usually a bad idea, but may be ok under the

Re: gcd_22

2019-08-25 Thread Marco Bodrato
Ciao, Il Dom, 25 Agosto 2019 9:34 am, Niels Möller ha scritto: > t...@gmplib.org (Torbjörn Granlund) writes: >> Should we have gcd_33, gcd_44, and gcd_55 also? > I think it would be more beneficial to optimize hgcd2. > E.g., 3-limb gcd should typically boil down to one hgcd2, a couple of > 3:1

Re: gcd_22

2019-08-25 Thread Victor Shoup
Regarding the so-called doc bug, if I understand the issue correctly, I don’t think it’s a good idea to add more preconditions to the documentation. In fact, I think that would be a really bad idea. > On Aug 25, 2019, at 2:34 AM, niels mller > wrote: > > t...@gmplib.org (Torbjörn Granlund)

Re: gcd_22

2019-08-25 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > Now what? 1. Update mpn_gcd.c and other callers of gcd_11 to call the new functions (like in the patch I posted few days ago). Watch out for performance regressions. 2. Consider alternative entry points. Allowing one even operand would

Re: gcd_22

2019-08-24 Thread Torbjörn Granlund
Now we have a nice set of x86_64 gcd_22. The code is not as well tuned as the gcd_11 code, but it runs somewhat fast. I haven't explored the table based variant which gives 3 bits of progress per iteration. It might make the new code obsolete for machines with fast multiply. Now what? Should

Re: gcd_22

2019-08-23 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: The below implementation appears to pass tests, and give a modest speedup of 0.2 cycles per input bit, or 2.5%. Benchmark, comparing C implementations of gcd_11 and gcd_22: Beware of "turbo" when counting cycles! (Relative measurements like

Re: gcd_22

2019-08-20 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > Why would it be faster? Because while cnd1 needs to wait for the > previous iteration's double-limb shifting, cnd2 depends on just the > single-limb shifting of the high U world. The below implementation appears to pass tests, and give a modest

Re: gcd_22

2019-08-20 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: I take it the idea is to have an unlikely branch for this case; attempting a branchless correction of vgtu would miss the point? Yes. -- Torbjörn Please encrypt, key id 0xC8601622 ___ gmp-devel mailing

Re: gcd_22

2019-08-20 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > I think we might get more speed from this algorithm: > > while (highlimb(U) | highlimb(V)) != 0 > S = U - V > T = V - U > cnt = count_trailing_zeros(lowlimb(S)) > cnd1 = U < V > cnd2 = highlimb(U) < highlimb(V) > if (cnd2) >

Re: gcd_22

2019-08-19 Thread Torbjörn Granlund
We now have the algorithm while (highlimb(U) | highlimb(V)) != 0 S = U - V T = V - U cnt = count_trailing_zeros(lowlimb(S)) cnd1 = U < V if (cnd1) V = U U = T else U = S U >>= cnt with some variation. The "if" statement is performed with masking

Re: gcd_22

2019-08-19 Thread Torbjörn Granlund
t...@gmplib.org (Torbjörn Granlund) writes: Your gcd_22.c triggered a lingering bug in longlong.h: ARM target which default to Thumb code cannot to the rsc instruction (reverse subtract with carry). We will need to write separate sub_ddmmss for Thumb. The problem statement is

Re: gcd_22

2019-08-18 Thread Torbjörn Granlund
Your gcd_22.c triggered a lingering bug in longlong.h: ARM target which default to Thumb code cannot to the rsc instruction (reverse subtract with carry). We will need to write separate sub_ddmmss for Thumb. The problem statement is sub_ddmmss (vgtu, d0, 0, u0, 0, v0) which just is there to

Re: gcd_22

2019-08-17 Thread Niels Möller
I've now tried the u + v variant for gcd_22. I've strived to minimize the number of carry propagations, and I see no obvious microptimizations. And it's slower, time increases from 7.8 cycles per input bit to 9.8, when I measure; Unclear if it's because the critical path gets longer, or if it's

Re: gcd_22

2019-08-16 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > ni...@lysator.liu.se (Niels Möller) writes: > > $ ./tune/speed -p 10 -s 1-64 -t 3 -C mpn_gcd_11 mpn_gcd_22 > overhead 4.01 cycles, precision 10 units of 5.08e-10 secs, CPU freq > 1966.75 MHz > mpn_gcd_11mpn_gcd_22 > 1

Re: gcd_22

2019-08-16 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: $ ./tune/speed -p 10 -s 1-64 -t 3 -C mpn_gcd_11 mpn_gcd_22 overhead 4.01 cycles, precision 10 units of 5.08e-10 secs, CPU freq 1966.75 MHz mpn_gcd_11mpn_gcd_22 1 #7.0459 11.0729 4 #2.8721

Re: gcd_22

2019-08-16 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes: > So the gcd_22 code with all the branches needs about 11 cycles per input > bit. gcd_11 is coreihwl/gcd_11.asm in this build. I had to do a quick try with the masking version before leaving for work: $ ./tune/speed -p 10 -s 1-64 -t 3 -C

Re: gcd_22

2019-08-16 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > ni...@lysator.liu.se (Niels Möller) writes: > > Attached is a patch to move the current (slow) gcd_2 to a separate file, > rename > it mpn_gcd_22, change argument and return type, and add a basic test. > > Nice! Pushed now. > Speed support

Re: gcd_22

2019-08-15 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Attached is a patch to move the current (slow) gcd_2 to a separate file, rename it mpn_gcd_22, change argument and return type, and add a basic test. Nice! Do you agree with the naming in typedef struct { mp_limb_t d0, d1; }