Another trick we've been discussing is to use u+v rather than |u-v|, in
the case that u+v is divisible by 4.
I've not yet tried it out for gcd_22 (and I don't yet have any timing
code), but I've tried it for gcd_11. It's slower (not that surprising),
but with some potential for gcd_22.
We need ma
"Marco Bodrato" writes:
Moved the fix outside.
Yes, I feel the same. In both cases %r9 is not initialised.
I didn't get much right last night. The timing results were wrong,
presumably due to the r9 initialisation issue.
--
Torbjörn
Please encrypt, key id 0xC8601622
_
t...@gmplib.org (Torbjörn Granlund) writes:
> I believe an amalgamation of Niels code which uses an implicit lsb and
> his present _22 code would be best. The present requirement for
> sub_mddmmss is a portability inconvenience, and for subtract-with-carry
> challenged processors a major hassle.
Ciao,
Il Mer, 14 Agosto 2019 1:46 am, Torbjörn Granlund ha scritto:
> "Marco Bodrato" writes:
>
> > What's the purpose of this change?
>
> Failing tests :-)
> Let's leave the loop(s) alone, fix things outside.
Moved the fix outside.
> The bd4 and zen2 code is also broken, I think.
Yes, I
"Marco Bodrato" writes:
> What's the purpose of this change?
Failing tests :-)
Darn, my scripts must be buggy.
Let's leave the loop(s) alone, fix things outside. The bd4 and zen2
code is also broken, I think.
This bug might have made my timing tests run faster than they should.
:o(
--
Ciao,
Il Mer, 14 Agosto 2019 1:21 am, Torbjörn Granlund ha scritto:
> I saw this change go in:
>
> diff -r 118627eed635 -r bb86e66536d5 mpn/x86_64/coreihwl/gcd_11.asm
> --- a/mpn/x86_64/coreihwl/gcd_11.asm Tue Aug 13 22:20:06 2019 +0200
> +++ b/mpn/x86_64/coreihwl/gcd_11.asm Wed Aug 14 01:06:08
"Marco Bodrato" writes:
I saw this change go in:
diff -r 118627eed635 -r bb86e66536d5 mpn/x86_64/coreihwl/gcd_11.asm
--- a/mpn/x86_64/coreihwl/gcd_11.asmTue Aug 13 22:20:06 2019 +0200
+++ b/mpn/x86_64/coreihwl/gcd_11.asmWed Aug 14 01:06:08 2019 +0200
@@ -79,10 +79,10 @@
ALIGN(1
"Marco Bodrato" writes:
I'm happy to see that all of them start with the sequence
mov v0, %rax
sub u0, v0
You widely developed my small idea: to keep the value in %rax :-D
Except the zen2 code, which bounces the value between v0 and rax.
Now, I'd like to see if it
Ciao,
Il Mar, 13 Agosto 2019 10:38 pm, Torbjörn Granlund ha scritto:
> I pushed a few more variants of gcd_11 with nice speed improvements for
> several x86_64 CPUs. I am sure much more can be done.
I'm happy to see that all of them start with the sequence
mov v0, %rax
sub
I pushed a few more variants of gcd_11 with nice speed improvements for
several x86_64 CPUs. I am sure much more can be done.
I have a bunch of finished gcd_22 too; these are generic without any
CPU-specific tweaks. I haven't timed them, they are just tested for
correctness. It might be desirab
"Marco Bodrato" writes:
This means we are currently work on the _1o1o variants.
Yep. But the other entry points will be one or to extra insns.
May I propose a small latency-micro-optimisation for two x86_64 just
proposed variants? The idea is not to use the register %r10 at all, and
di
"Marco Bodrato" writes:
May I suggest to write:
u0 = (u0 >> c) | (u1 << (GMP_LIMB_BITS - c));
instead of
u0 = (u0 >> c) || (u1 << (GMP_LIMB_BITS - c));
You may. And it is indeed a great improvement!
The generated code for Niels' C code is awful for x86. It consists of
mainly regist
Ciao,
Il Lun, 12 Agosto 2019 6:52 pm, Torbjörn Granlund ha scritto:
> I fixed a variable misspelling and defined the needed macro. I also
> modified the return type's declaration. The code fails in most cases,
> but occasionally gets things right.
May I suggest to write:
u0 = (u0 >> c) | (u1 <
Ciao,
Il Sab, 10 Agosto 2019 7:01 pm, Torbjörn Granlund ha scritto:
> We might provide several gcc_11 function variants to accomodate the
> internal uses you bring up.
>
> gcd_1o1o - two odd limbs
> gcd_1o1 - one odd and one odd/even limb
> gcd_11 - two odd/even limbs
This would be a rich set
ni...@lysator.liu.se (Niels Möller) writes:
BTW, below is one (untested) way to organize gcd_22. Wants an sub_mddmmss,
with output carry as a mask, analogous to the add_mss defined in
mod_1_1.c.
I fixed a variable misspelling and defined the needed macro. I also
modified the return typ
We might provide several gcc_11 function variants to accomodate the
internal uses you bring up.
gcd_1o1o - two odd limbs
gcd_1o1 - one odd and one odd/even limb
gcd_11 - two odd/even limbs
They would all be implemented in the same asm file. In the absense of
asm implementation, C could provid
Ciao,
Il Mar, 6 Agosto 2019 9:08 pm, Torbjörn Granlund ha scritto:
> ni...@lysator.liu.se (Niels Möller) writes:
> Perhaps it would be worthwhile to do a tailcall for when the leftshift
> is 0? I think most gcd calls, irrespective of operand size, will not
> have any factor of two, let alone a c
ni...@lysator.liu.se (Niels Möller) writes:
BTW, below is one (untested) way to organize gcd_22. Wants an sub_mddmmss,
with output carry as a mask, analogous to the add_mss defined in
mod_1_1.c.
typedef struct {
mp_limb_t d[2];
} mp_double_limb_t;
I believe one should use separ
t...@gmplib.org (Torbjörn Granlund) writes:
> ni...@lysator.liu.se (Niels Möller) writes:
>
> To not get multiple definitions in the case that there's some gcd_1.asm
> with an mpn_gcd_11 entry point, but no gcd_11.asm. If we don't organize
> asm code that way, the #if is useless and can be r
ni...@lysator.liu.se (Niels Möller) writes:
To not get multiple definitions in the case that there's some gcd_1.asm
with an mpn_gcd_11 entry point, but no gcd_11.asm. If we don't organize
asm code that way, the #if is useless and can be removed.
I believe configure handles that too.
> I
t...@gmplib.org (Torbjörn Granlund) writes:
> Why did you add HAVE_NATIVE_ around the C mpn_gcd_11 function?
>
> #if !HAVE_NATIVE_mpn_gcd_11
> mp_limb_t
> mpn_gcd_11 (mp_limb_t u, mp_limb_t v)
To not get multiple definitions in the case that there's some gcd_1.asm
with an mpn_gcd_11 entry p
Why did you add HAVE_NATIVE_ around the C mpn_gcd_11 function?
#if !HAVE_NATIVE_mpn_gcd_11
mp_limb_t
mpn_gcd_11 (mp_limb_t u, mp_limb_t v)
.
.
.
I would assume configure does its magic for this file just as with any
other file, i.e., exclude it when there is a definition earlier in th
ni...@lysator.liu.se (Niels Möller) writes:
I also see that there's no x86/gcd_1.asm.
gcd_11.asm
Description: Binary data
--
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/lis
ni...@lysator.liu.se (Niels Möller) writes:
Does any other gmp functions do that? If we
1. Do that,
2. Arrange that g0 is put in the "primary" return register
3. Ensure that the register used for returning g1 isn't clobbered by
gcd_11 assembly
Or arrange for it to naturally become
t...@gmplib.org (Torbjörn Granlund) writes:
> Should we brave to return struct{mp_limb_t g0; mp_limb_t g1;} instead of
> having an rp parameter?
Does any other gmp functions do that? If we
1. Do that,
2. Arrange that g0 is put in the "primary" return register
3. Ensure that the register used f
Some random thoughts on our gcd code:
I forgot to create all needed "grabber" files for gcd_11, so we got
slowdown for many x86_64 CPUs.
One thing which blurs measurements is that the gcd_1 files use a
plethora of strategies for initial reduction given 1 x 1. Some check
their relative sizes and
t...@gmplib.org (Torbjörn Granlund) writes:
> I'm having problems with timing of the gcd_11 code. Unfortunately, the
> nested macros of speed.h make things hard to read. Could yo
> double-check that operands to gcd_11 are odd and full limbs?
I'm fairly sure they are odd; I've ran speed on a --d
I'm having problems with timing of the gcd_11 code. Unfortunately, the
nested macros of speed.h make things hard to read. Could yo
double-check that operands to gcd_11 are odd and full limbs?
The odd thing is that gcd_1 seems to outperform gcd_11 in some 1 x 1
cases. That could happen I suppose
ni...@lysator.liu.se (Niels Möller) writes:
> I might convert the former, but I am tempted to simply delete the k6
> gcd_1.asm.
The k6 code uses a branch on the u - v sign? Might be slower than the
current C code. What hardware is it used for?
https://en.wikipedia.org/wiki/AMD_K6
https:
t...@gmplib.org (Torbjörn Granlund) writes:
> Now pushed gcd_11.asm for all gcd_1.asm except the one for ia64 and the
> one for AMD k6.
Excellent!
> I might convert the former, but I am tempted to simply delete the k6
> gcd_1.asm.
The k6 code uses a branch on the u - v sign? Might be slower t
I have made the trivial conversion of all gcd_1.asm code but x86-32. I
might check it in soon. Removing gcd_1.asm can be as a separate check
in.
Now pushed gcd_11.asm for all gcd_1.asm except the one for ia64 and the
one for AMD k6. I might convert the former, but I am tempted to simply
d
t...@gmplib.org (Torbjörn Granlund) writes:
> You already added tune/speed code, didn't you? How is it used?
Size is in bits. So, e.g.,
./speed -s 1-64 mpn_gcd_1 mpn_gcd_11
where the latter is expected to be a few cycles faster. And like gcd_1,
.r argument can be used to fix the bit size of
ni...@lysator.liu.se (Niels Möller) writes:
Great! What additional testing do you think we need? mpn_gcd_11 unit
tests and or devel/try support.
The former is useful, in particular to avoid end-user miscompiles.
The latter is good for asm development.
You already added tune/speed code, didn'
t...@gmplib.org (Torbjörn Granlund) writes:
> I have made the trivial conversion of all gcd_1.asm code but x86-32. I
> might check it in soon. Removing gcd_1.asm can be as a separate check
> in.
Great! What additional testing do you think we need? mpn_gcd_11 unit
tests and or devel/try support.
ni...@lysator.liu.se (Niels Möller) writes:
Just not sure what order to do things. That patch just adds the gcd_11
entrypoint for that arch, with nothing but speed using it. (So I realize
it's not as tested as I thought).
At least it is fast!
If it works out to replace one foo/gcd_1.asm
t...@gmplib.org (Torbjörn Granlund) writes:
> Ah, so you suggest that we keep asm gcd_1 after all. I was in the
> editing process of removing them! :-)
Just not sure what order to do things. That patch just adds the gcd_11
entrypoint for that arch, with nothing but speed using it. (So I realize
ni...@lysator.liu.se (Niels Möller) writes:
Right, mpn_gcd usually ends with a call to mpn_gcd_1 or gcd_2, and the
latter also usually ends with a call to gcd_1. But I think it's easiest
to leave that as is until we have good gcd_22.
OK, but let's not forget about it!
> I expect asm gcd_
t...@gmplib.org (Torbjörn Granlund) writes:
> ni...@lysator.liu.se (Niels Möller) writes:
>
> Maybe easier to wait until asm files are updated so that
> HAVE_NATIVE_mpn_gcd_1 implies HAVE_NATIVE_mpn_gcd_11. Or was it some
> particular call site you had in mind?
>
> I have seen calls to gcd_1
ni...@lysator.liu.se (Niels Möller) writes:
Maybe easier to wait until asm files are updated so that
HAVE_NATIVE_mpn_gcd_1 implies HAVE_NATIVE_mpn_gcd_11. Or was it some
particular call site you had in mind?
I have seen calls to gcd_1 which could use gcd_11, presumably from
mpn_gcd (or its
t...@gmplib.org (Torbjörn Granlund) writes:
> ni...@lysator.liu.se (Niels Möller) writes:
>
> Here's a patch to take out gcd_11 to it's own C file, and call it from
> gcd_1.c. Needs documentation (if you agree it should be a public
> function), and it would be nice with separate tests for th
ni...@lysator.liu.se (Niels Möller) writes:
Here's a patch to take out gcd_11 to it's own C file, and call it from
gcd_1.c. Needs documentation (if you agree it should be a public
function), and it would be nice with separate tests for this function.
If you want, feel free to commit!
But p
t...@gmplib.org (Torbjörn Granlund) writes:
> We cannot enforce both to be zero. Just imagine that our great divisor
> is greater than GMP_NUMB_MAX. :-)
For that case, we'll get u1 = v1 > 0, u0 = v0, and will trigger the
unlikely u0 == 0 check between subtraction and count_trailing_zeros. So
pr
ni...@lysator.liu.se (Niels Möller) writes:
But on second thought, maybe
while (u1 | v1)
or
while (u1 || v1)
is even simpler and efficient enough. Then we need not do anything
special when the first operand gets small enough to fit in one limb.
We cannot enforce both to be ze
t...@gmplib.org (Torbjörn Granlund) writes:
> ni...@lysator.liu.se (Niels Möller) writes:
>
> So I see no strong reason to do it one way or the other. Maybe start
> with requiring both u and v odd, and relax requirements later in case we
> find any more tangible benefit from doing that?
>
>
t...@gmplib.org (Torbjörn Granlund) writes:
> ni...@lysator.liu.se (Niels Möller) writes:
>
> > Checking u1 = 0 and v1 = 0 separately as you suggest is a different
> > thing, and it might not have zero cost in the gcd_22 loop.
>
> I think only the shifted number should be checked, and as the
ni...@lysator.liu.se (Niels Möller) writes:
> Checking u1 = 0 and v1 = 0 separately as you suggest is a different
> thing, and it might not have zero cost in the gcd_22 loop.
I think only the shifted number should be checked, and as the main loop
exit condition.
OK, so we agree there!
t...@gmplib.org (Torbjörn Granlund) writes:
> I hope we don't we have to do that all at once, but can the mpn_gcd_11
> entry point be optional (via HAVE_NATIVE_*)?
>
> Yes, otherwise it'll never happen!
It's not obvious to me how to organize that. Can we do it like this?
* gcd_11.c and gcd_1
t...@gmplib.org (Torbjörn Granlund) writes:
> Checking u1 = 0 and v1 = 0 separately as you suggest is a different
> thing, and it might not have zero cost in the gcd_22 loop.
I think only the shifted number should be checked, and as the main loop
exit condition. If both u1 > 0 and v1 > 0 on entry
ni...@lysator.liu.se (Niels Möller) writes:
The gcd_21 loop takes inputs (u1, u0, v), inputs odd, and all the limbs
non-zero. It is a much simpler loop than gcd_22
do
{
{u1, u0} <-- {u1, u0} - v (always > 0)
if (u0 == 0) { u0 = u1; break; }
shift out traili
ni...@lysator.liu.se (Niels Möller) writes:
We should run the gcd_22 main loop as long as all of u1, u0, v1, v0 are
non-zero. After computing {t1, t0} <-- {v1,v0} - {u1,u0}, there are a
couple of cases:
My main perspective wasn't algorithmic but about code organisation.
"Were to do what."
t...@gmplib.org (Torbjörn Granlund) writes:
> Perhaps gcd_22 should (conditionally) invoke (or tail call) gcd_11,
> possibly it should let its caller invoke gcd_11 to finish up an
> unfinished job.
We should run the gcd_22 main loop as long as all of u1, u0, v1, v0 are
non-zero. After computing {
ni...@lysator.liu.se (Niels Möller) writes:
t...@gmplib.org (Torbjörn Granlund) writes:
> The gcd_2 function of mpn/generic/gcd.c uses a loop with an randomly
> taken branch, and therefore is slow.
I think I at some point attempted a version with masking tricks, similar
to the C gcd_1,
t...@gmplib.org (Torbjörn Granlund) writes:
> The gcd_2 function of mpn/generic/gcd.c uses a loop with an randomly
> taken branch, and therefore is slow.
I think I at some point attempted a version with masking tricks, similar
to the C gcd_1, but I didn't manage to get any useful speedup.
> What
GMP's gcd performance for two-limb operands is quite poor.
The gcd_2 function of mpn/generic/gcd.c uses a loop with an randomly
taken branch, and therefore is slow. This function is also not
overridable in assembly code. It drops into mpn_gcd_1 except in
unlikely cases, but mpn_gcd_1 is overkill
54 matches
Mail list logo