Re: gcd_22

2019-08-27 Thread Marco Bodrato
Ciao, On Sun, 25 August 2019, 2:28 am, Torbjörn Granlund wrote: > Now we have a nice set of x86_64 gcd_22. The code is not as well tuned > as the gcd_11 code, but it runs somewhat fast. So if I suggest reordering some instructions in the loop, you will not be upset :-) If we can change cmovc-s
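For readers unfamiliar with the loop being tuned here, the following is a minimal portable sketch of a single-limb binary GCD on odd operands, written so that the min/abs-difference step uses a mask instead of a branch (the kind of selection a compiler may turn into cmov instructions). It is not the GMP gcd_11 or gcd_22 code; the names `gcd_11_sketch` and `ctz64` are hypothetical, and the limb is assumed to be 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of trailing zero bits of a nonzero word. */
static int ctz64(uint64_t x)
{
    int n = 0;
    while ((x & 1) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Sketch of a binary GCD for two ODD 64-bit words.
   Assumption: both inputs are odd, so the result is odd as well. */
static uint64_t gcd_11_sketch(uint64_t u, uint64_t v)
{
    assert((u & 1) && (v & 1));
    while (u != v) {
        uint64_t t = u - v;                  /* wraps around when u < v */
        uint64_t mask = -(uint64_t)(u < v);  /* all ones when u < v, else 0 */
        v = (u & mask) | (v & ~mask);        /* v = min(u, v), no branch */
        t = (t ^ mask) - mask;               /* t = |u - v| */
        u = t >> ctz64(t);                   /* t is even and nonzero here */
    }
    return u;
}
```

Whether a compiler emits conditional moves for the masked selection depends on the target and optimization level, which is one reason the tuned versions discussed in this thread are written in assembly.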

Re: gcd_22

2019-08-27 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: And to make the loop work, it needs some condition to decrement N and maintain non-zero high limbs (if both up[N-1] and vp[N-1] are zero, comparison is no good). So that would be something like Since N in my proposal is a constant, it is
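The normalization condition Niels describes for a variable N can be sketched as follows. This is only an illustration of the idea, not code from the thread: the function name `trim_common_high_zeros` is hypothetical, and limbs are assumed to be 64-bit words stored least significant first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: drop high limbs that are zero in BOTH operands, so that
   afterwards at least one of up[n-1], vp[n-1] is nonzero (or n == 0).
   Only then does looking at the top limbs say anything about which
   operand is larger. */
static size_t trim_common_high_zeros(const uint64_t *up,
                                     const uint64_t *vp, size_t n)
{
    while (n > 0 && (up[n - 1] | vp[n - 1]) == 0)
        n--;
    return n;
}
```

With a compile-time constant N, as in Torbjörn's proposal, this trimming step does not apply, since the operand size never changes inside the loop.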

Re: gcd_22

2019-08-27 Thread Torbjörn Granlund
Marco Bodrato writes: For generic code with variable N, one may prefer code that chooses whether a copy or a shorter shift is needed. But this means more code and the shift could not be an in-lined fixed size version... I broke out the unlikely up[0] code into a separate function, a

Re: gcd_22

2019-08-27 Thread Torbjörn Granlund
Some cleanups and tweaks later. The gcd_33 based on this, compiled with gcc 8.3, runs at 30 cycles per iteration. (Note, not cycles per bit!) My best gcd_33 in assembly runs at 10 cycles per iteration. The former uses memory based operands. The latter keeps everything in registers. If we

Re: gcd_22

2019-08-27 Thread Marco Bodrato
Ciao, On 2019-08-27 16:35, t...@gmplib.org wrote: I got something working. It runs quite well, and seems to beat the Great! static inline void mpn_gcd_NN (mp_limb_t *rp, mp_limb_t *up, mp_limb_t *vp, size_t N) I see that your idea is to obtain an N-loop-unrolled version... if

Re: gcd_22

2019-08-27 Thread Marco Bodrato
On 2019-08-27 21:10, t...@gmplib.org wrote: Marco Bodrato writes: ... and on some platform mpn_rshift may not support cnt==0. That was taken care of in my last version. I wrote my message before, and did not realize, before sending it, that you sent a new version :-) I added a
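The cnt==0 concern can be illustrated with a small wrapper that falls back to a plain copy when no shift is needed. This is a self-contained sketch on 64-bit words and does not call GMP; the name `rshift_or_copy` is hypothetical and it is not the fix that was actually posted in the thread. The GMP manual documents mpn_rshift as requiring a shift count of at least 1, which is the restriction being worked around.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: right shift an n-limb number by cnt bits (0 <= cnt < 64),
   limbs stored least significant first. A shift by 64 - cnt would be
   undefined for cnt == 0, so that case is handled as a copy.
   Works in place (rp == up) because up[i + 1] is read before it is
   overwritten. */
static void rshift_or_copy(uint64_t *rp, const uint64_t *up,
                           size_t n, unsigned cnt)
{
    assert(n > 0 && cnt < 64);
    if (cnt == 0) {
        for (size_t i = 0; i < n; i++)
            rp[i] = up[i];
        return;
    }
    for (size_t i = 0; i + 1 < n; i++)
        rp[i] = (up[i] >> cnt) | (up[i + 1] << (64 - cnt));
    rp[n - 1] = up[n - 1] >> cnt;
}
```

The extra test costs a branch on every call, which is the trade-off between code size and a fixed-size in-lined shift raised earlier in the thread.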

Re: gcd_22

2019-08-27 Thread Torbjörn Granlund
I got something working. It runs quite well, and seems to beat the performance of mpn_gcd. Here is the code:

#include "gmp-impl.h"
#include "longlong.h"

#ifndef CMP_SWAP
#define CMP_SWAP(ap,bp,n) \
  do {