Re: Small operands gcd improvements

2019-08-14 Thread Niels Möller
Another trick we've been discussing is to use u+v rather than |u-v|, in the case that u+v is divisible by 4. I've not yet tried it out for gcd_22 (and I don't yet have any timing code), but I've tried it for gcd_11. It's slower (not that surprising), but with some potential for gcd_22. We need ma

Re: Small operands gcd improvements

2019-08-14 Thread Torbjörn Granlund
"Marco Bodrato" writes: Moved the fix outside. Yes, I feel the same. In both cases %r9 is not initialised. I didn't get much right last night. The timing results were wrong, presumably due to the r9 initialisation issue. -- Torbjörn Please encrypt, key id 0xC8601622 _

Re: Small operands gcd improvements

2019-08-14 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > I believe an amalgamation of Niels code which uses an implicit lsb and > his present _22 code would be best. The present requirement for > sub_mddmmss is a portability inconvenience, and for subtract-with-carry > challenged processors a major hassle.

Re: Small operands gcd improvements

2019-08-13 Thread Marco Bodrato
Ciao, Il Mer, 14 Agosto 2019 1:46 am, Torbjörn Granlund ha scritto: > "Marco Bodrato" writes: > > > What's the purpose of this change? > > Failing tests :-) > Let's leave the loop(s) alone, fix things outside. Moved the fix outside. > The bd4 and zen2 code is also broken, I think. Yes, I

Re: Small operands gcd improvements

2019-08-13 Thread Torbjörn Granlund
"Marco Bodrato" writes: > What's the purpose of this change? Failing tests :-) Darn, my scripts must be buggy. Let's leave the loop(s) alone, fix things outside. The bd4 and zen2 code is also broken, I think. This bug might have made my timing tests run faster than they should. :o( --

Re: Small operands gcd improvements

2019-08-13 Thread Marco Bodrato
Ciao, Il Mer, 14 Agosto 2019 1:21 am, Torbjörn Granlund ha scritto: > I saw this change go in: > > diff -r 118627eed635 -r bb86e66536d5 mpn/x86_64/coreihwl/gcd_11.asm > --- a/mpn/x86_64/coreihwl/gcd_11.asm Tue Aug 13 22:20:06 2019 +0200 > +++ b/mpn/x86_64/coreihwl/gcd_11.asm Wed Aug 14 01:06:08

Re: Small operands gcd improvements

2019-08-13 Thread Torbjörn Granlund
"Marco Bodrato" writes: I saw this change go in: diff -r 118627eed635 -r bb86e66536d5 mpn/x86_64/coreihwl/gcd_11.asm --- a/mpn/x86_64/coreihwl/gcd_11.asmTue Aug 13 22:20:06 2019 +0200 +++ b/mpn/x86_64/coreihwl/gcd_11.asmWed Aug 14 01:06:08 2019 +0200 @@ -79,10 +79,10 @@ ALIGN(1

Re: Small operands gcd improvements

2019-08-13 Thread Torbjörn Granlund
"Marco Bodrato" writes: I'm happy to see that all of them start with the sequence mov v0, %rax sub u0, v0 You widely developed my small idea: to keep the value in %rax :-D Except the zen2 code, which bounces the value between v0 and rax. Now, I'd like to see if it

Re: Small operands gcd improvements

2019-08-13 Thread Marco Bodrato
Ciao, Il Mar, 13 Agosto 2019 10:38 pm, Torbjörn Granlund ha scritto: > I pushed a few more variants of gcd_11 with nice speed improvements for > several x86_64 CPUs. I am sure much more can be done. I'm happy to see that all of them start with the sequence mov v0, %rax sub

Re: Small operands gcd improvements

2019-08-13 Thread Torbjörn Granlund
I pushed a few more variants of gcd_11 with nice speed improvements for several x86_64 CPUs. I am sure much more can be done. I have a bunch of finished gcd_22 too; these are generic without any CPU-specific tweaks. I haven't timed them, they are just tested for correctness. It might be desirab

Re: Small operands gcd improvements

2019-08-13 Thread Torbjörn Granlund
"Marco Bodrato" writes: This means we are currently work on the _1o1o variants. Yep. But the other entry points will be one or to extra insns. May I propose a small latency-micro-optimisation for two x86_64 just proposed variants? The idea is not to use the register %r10 at all, and di

Re: Small operands gcd improvements

2019-08-13 Thread Torbjörn Granlund
"Marco Bodrato" writes: May I suggest to write: u0 = (u0 >> c) | (u1 << (GMP_LIMB_BITS - c)); instead of u0 = (u0 >> c) || (u1 << (GMP_LIMB_BITS - c)); You may. And it is indeed a great improvement! The generated code for Niels' C code is awful for x86. It consists of mainly regist

Re: Small operands gcd improvements

2019-08-12 Thread Marco Bodrato
Ciao, Il Lun, 12 Agosto 2019 6:52 pm, Torbjörn Granlund ha scritto: > I fixed a variable misspelling and defined the needed macro. I also > modified the return type's declaration. The code fails in most cases, > but occasionally gets things right. May I suggest to write: u0 = (u0 >> c) | (u1 <

Re: Small operands gcd improvements

2019-08-12 Thread Marco Bodrato
Ciao, Il Sab, 10 Agosto 2019 7:01 pm, Torbjörn Granlund ha scritto: > We might provide several gcc_11 function variants to accomodate the > internal uses you bring up. > > gcd_1o1o - two odd limbs > gcd_1o1 - one odd and one odd/even limb > gcd_11 - two odd/even limbs This would be a rich set

Re: Small operands gcd improvements

2019-08-12 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: BTW, below is one (untested) way to organize gcd_22. Wants an sub_mddmmss, with output carry as a mask, analogous to the add_mss defined in mod_1_1.c. I fixed a variable misspelling and defined the needed macro. I also modified the return typ

Re: Small operands gcd improvements

2019-08-10 Thread Torbjörn Granlund
We might provide several gcc_11 function variants to accomodate the internal uses you bring up. gcd_1o1o - two odd limbs gcd_1o1 - one odd and one odd/even limb gcd_11 - two odd/even limbs They would all be implemented in the same asm file. In the absense of asm implementation, C could provid

Re: Small operands gcd improvements

2019-08-10 Thread Marco Bodrato
Ciao, Il Mar, 6 Agosto 2019 9:08 pm, Torbjörn Granlund ha scritto: > ni...@lysator.liu.se (Niels Möller) writes: > Perhaps it would be worthwhile to do a tailcall for when the leftshift > is 0? I think most gcd calls, irrespective of operand size, will not > have any factor of two, let alone a c

Re: Small operands gcd improvements

2019-08-08 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: BTW, below is one (untested) way to organize gcd_22. Wants an sub_mddmmss, with output carry as a mask, analogous to the add_mss defined in mod_1_1.c. typedef struct { mp_limb_t d[2]; } mp_double_limb_t; I believe one should use separ

Re: Small operands gcd improvements

2019-08-08 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > ni...@lysator.liu.se (Niels Möller) writes: > > To not get multiple definitions in the case that there's some gcd_1.asm > with an mpn_gcd_11 entry point, but no gcd_11.asm. If we don't organize > asm code that way, the #if is useless and can be r

Re: Small operands gcd improvements

2019-08-08 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: To not get multiple definitions in the case that there's some gcd_1.asm with an mpn_gcd_11 entry point, but no gcd_11.asm. If we don't organize asm code that way, the #if is useless and can be removed. I believe configure handles that too. > I

Re: Small operands gcd improvements

2019-08-08 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > Why did you add HAVE_NATIVE_ around the C mpn_gcd_11 function? > > #if !HAVE_NATIVE_mpn_gcd_11 > mp_limb_t > mpn_gcd_11 (mp_limb_t u, mp_limb_t v) To not get multiple definitions in the case that there's some gcd_1.asm with an mpn_gcd_11 entry p

Re: Small operands gcd improvements

2019-08-08 Thread Torbjörn Granlund
Why did you add HAVE_NATIVE_ around the C mpn_gcd_11 function? #if !HAVE_NATIVE_mpn_gcd_11 mp_limb_t mpn_gcd_11 (mp_limb_t u, mp_limb_t v) . . . I would assume configure does its magic for this file just as with any other file, i.e., exclude it when there is a definition earlier in th

Re: Small operands gcd improvements

2019-08-08 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: I also see that there's no x86/gcd_1.asm. gcd_11.asm Description: Binary data -- Torbjörn Please encrypt, key id 0xC8601622 ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/lis

Re: Small operands gcd improvements

2019-08-08 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Does any other gmp functions do that? If we 1. Do that, 2. Arrange that g0 is put in the "primary" return register 3. Ensure that the register used for returning g1 isn't clobbered by gcd_11 assembly Or arrange for it to naturally become

Re: Small operands gcd improvements

2019-08-08 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > Should we brave to return struct{mp_limb_t g0; mp_limb_t g1;} instead of > having an rp parameter? Does any other gmp functions do that? If we 1. Do that, 2. Arrange that g0 is put in the "primary" return register 3. Ensure that the register used f

Re: Small operands gcd improvements

2019-08-08 Thread Torbjörn Granlund
Some random thoughts on our gcd code: I forgot to create all needed "grabber" files for gcd_11, so we got slowdown for many x86_64 CPUs. One thing which blurs measurements is that the gcd_1 files use a plethora of strategies for initial reduction given 1 x 1. Some check their relative sizes and

Re: Small operands gcd improvements

2019-08-07 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > I'm having problems with timing of the gcd_11 code. Unfortunately, the > nested macros of speed.h make things hard to read. Could yo > double-check that operands to gcd_11 are odd and full limbs? I'm fairly sure they are odd; I've ran speed on a --d

Re: Small operands gcd improvements

2019-08-07 Thread Torbjörn Granlund
I'm having problems with timing of the gcd_11 code. Unfortunately, the nested macros of speed.h make things hard to read. Could yo double-check that operands to gcd_11 are odd and full limbs? The odd thing is that gcd_1 seems to outperform gcd_11 in some 1 x 1 cases. That could happen I suppose

Re: Small operands gcd improvements

2019-08-07 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: > I might convert the former, but I am tempted to simply delete the k6 > gcd_1.asm. The k6 code uses a branch on the u - v sign? Might be slower than the current C code. What hardware is it used for? https://en.wikipedia.org/wiki/AMD_K6 https:

Re: Small operands gcd improvements

2019-08-07 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > Now pushed gcd_11.asm for all gcd_1.asm except the one for ia64 and the > one for AMD k6. Excellent! > I might convert the former, but I am tempted to simply delete the k6 > gcd_1.asm. The k6 code uses a branch on the u - v sign? Might be slower t

Re: Small operands gcd improvements

2019-08-07 Thread Torbjörn Granlund
I have made the trivial conversion of all gcd_1.asm code but x86-32. I might check it in soon. Removing gcd_1.asm can be as a separate check in. Now pushed gcd_11.asm for all gcd_1.asm except the one for ia64 and the one for AMD k6. I might convert the former, but I am tempted to simply d

Re: Small operands gcd improvements

2019-08-07 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > You already added tune/speed code, didn't you? How is it used? Size is in bits. So, e.g., ./speed -s 1-64 mpn_gcd_1 mpn_gcd_11 where the latter is expected to be a few cycles faster. And like gcd_1, .r argument can be used to fix the bit size of

Re: Small operands gcd improvements

2019-08-07 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Great! What additional testing do you think we need? mpn_gcd_11 unit tests and or devel/try support. The former is useful, in particular to avoid end-user miscompiles. The latter is good for asm development. You already added tune/speed code, didn'

Re: Small operands gcd improvements

2019-08-07 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > I have made the trivial conversion of all gcd_1.asm code but x86-32. I > might check it in soon. Removing gcd_1.asm can be as a separate check > in. Great! What additional testing do you think we need? mpn_gcd_11 unit tests and or devel/try support.

Re: Small operands gcd improvements

2019-08-07 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Just not sure what order to do things. That patch just adds the gcd_11 entrypoint for that arch, with nothing but speed using it. (So I realize it's not as tested as I thought). At least it is fast! If it works out to replace one foo/gcd_1.asm

Re: Small operands gcd improvements

2019-08-06 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > Ah, so you suggest that we keep asm gcd_1 after all. I was in the > editing process of removing them! :-) Just not sure what order to do things. That patch just adds the gcd_11 entrypoint for that arch, with nothing but speed using it. (So I realize

Re: Small operands gcd improvements

2019-08-06 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Right, mpn_gcd usually ends with a call to mpn_gcd_1 or gcd_2, and the latter also usually ends with a call to gcd_1. But I think it's easiest to leave that as is until we have good gcd_22. OK, but let's not forget about it! > I expect asm gcd_

Re: Small operands gcd improvements

2019-08-06 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > ni...@lysator.liu.se (Niels Möller) writes: > > Maybe easier to wait until asm files are updated so that > HAVE_NATIVE_mpn_gcd_1 implies HAVE_NATIVE_mpn_gcd_11. Or was it some > particular call site you had in mind? > > I have seen calls to gcd_1

Re: Small operands gcd improvements

2019-08-06 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Maybe easier to wait until asm files are updated so that HAVE_NATIVE_mpn_gcd_1 implies HAVE_NATIVE_mpn_gcd_11. Or was it some particular call site you had in mind? I have seen calls to gcd_1 which could use gcd_11, presumably from mpn_gcd (or its

Re: Small operands gcd improvements

2019-08-06 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > ni...@lysator.liu.se (Niels Möller) writes: > > Here's a patch to take out gcd_11 to it's own C file, and call it from > gcd_1.c. Needs documentation (if you agree it should be a public > function), and it would be nice with separate tests for th

Re: Small operands gcd improvements

2019-08-06 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Here's a patch to take out gcd_11 to it's own C file, and call it from gcd_1.c. Needs documentation (if you agree it should be a public function), and it would be nice with separate tests for this function. If you want, feel free to commit! But p

Re: Small operands gcd improvements

2019-08-06 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > We cannot enforce both to be zero. Just imagine that our great divisor > is greater than GMP_NUMB_MAX. :-) For that case, we'll get u1 = v1 > 0, u0 = v0, and will trigger the unlikely u0 == 0 check between subtraction and count_trailing_zeros. So pr

Re: Small operands gcd improvements

2019-08-06 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: But on second thought, maybe while (u1 | v1) or while (u1 || v1) is even simpler and efficient enough. Then we need not do anything special when the first operand gets small enough to fit in one limb. We cannot enforce both to be ze

Re: Small operands gcd improvements

2019-08-05 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > ni...@lysator.liu.se (Niels Möller) writes: > > So I see no strong reason to do it one way or the other. Maybe start > with requiring both u and v odd, and relax requirements later in case we > find any more tangible benefit from doing that? > >

Re: Small operands gcd improvements

2019-08-05 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > ni...@lysator.liu.se (Niels Möller) writes: > > > Checking u1 = 0 and v1 = 0 separately as you suggest is a different > > thing, and it might not have zero cost in the gcd_22 loop. > > I think only the shifted number should be checked, and as the

Re: Small operands gcd improvements

2019-08-05 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: > Checking u1 = 0 and v1 = 0 separately as you suggest is a different > thing, and it might not have zero cost in the gcd_22 loop. I think only the shifted number should be checked, and as the main loop exit condition. OK, so we agree there!

Re: Small operands gcd improvements

2019-08-04 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > I hope we don't we have to do that all at once, but can the mpn_gcd_11 > entry point be optional (via HAVE_NATIVE_*)? > > Yes, otherwise it'll never happen! It's not obvious to me how to organize that. Can we do it like this? * gcd_11.c and gcd_1

Re: Small operands gcd improvements

2019-08-04 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > Checking u1 = 0 and v1 = 0 separately as you suggest is a different > thing, and it might not have zero cost in the gcd_22 loop. I think only the shifted number should be checked, and as the main loop exit condition. If both u1 > 0 and v1 > 0 on entry

Re: Small operands gcd improvements

2019-07-31 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: The gcd_21 loop takes inputs (u1, u0, v), inputs odd, and all the limbs non-zero. It is a much simpler loop than gcd_22 do { {u1, u0} <-- {u1, u0} - v (always > 0) if (u0 == 0) { u0 = u1; break; } shift out traili

Re: Small operands gcd improvements

2019-07-31 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: We should run the gcd_22 main loop as long as all of u1, u0, v1, v0 are non-zero. After computing {t1, t0} <-- {v1,v0} - {u1,u0}, there are a couple of cases: My main perspective wasn't algorithmic but about code organisation. "Were to do what."

Re: Small operands gcd improvements

2019-07-31 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > Perhaps gcd_22 should (conditionally) invoke (or tail call) gcd_11, > possibly it should let its caller invoke gcd_11 to finish up an > unfinished job. We should run the gcd_22 main loop as long as all of u1, u0, v1, v0 are non-zero. After computing {

Re: Small operands gcd improvements

2019-07-30 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: t...@gmplib.org (Torbjörn Granlund) writes: > The gcd_2 function of mpn/generic/gcd.c uses a loop with an randomly > taken branch, and therefore is slow. I think I at some point attempted a version with masking tricks, similar to the C gcd_1,

Re: Small operands gcd improvements

2019-07-30 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > The gcd_2 function of mpn/generic/gcd.c uses a loop with an randomly > taken branch, and therefore is slow. I think I at some point attempted a version with masking tricks, similar to the C gcd_1, but I didn't manage to get any useful speedup. > What

Small operands gcd improvements

2019-07-26 Thread Torbjörn Granlund
GMP's gcd performance for two-limb operands is quite poor. The gcd_2 function of mpn/generic/gcd.c uses a loop with an randomly taken branch, and therefore is slow. This function is also not overridable in assembly code. It drops into mpn_gcd_1 except in unlikely cases, but mpn_gcd_1 is overkill