We now have almost complete results for the latest hgcd2.c tweaks.
We have 5 div1 variants (as chosen by HGCD2_DIV1_METHOD). There are
results and brief documentation here:
https://gmplib.org/devel/thres/HGCD2_DIV1_METHOD.html
About half of our test machines prefer '/', i.e., whatever word
ni...@lysator.liu.se (Niels Möller) writes:
Hmm. Definitely worth a try. But if we need explicit loads and stores
from the structs, we'll not save that many instructions.
x86 can handle operate-from-memory at almost the same cost as
operate-from-register. But operate-to-memory is expensive.
t...@gmplib.org (Torbjörn Granlund) writes:
> Perhaps keeping the to-be-swapped variables in two structs, and instead
> conditionally swap pointers to the structs?
Hmm. Definitely worth a try. But if we need explicit loads and stores
from the structs, we'll not save that many instructions. Each
ni...@lysator.liu.se (Niels Möller) writes:
$ ./tune/speed -c -s1 -p10 mpn_hgcd2_1 mpn_hgcd2_2 mpn_hgcd2_3
mpn_hgcd2_4 mpn_hgcd2_5 mpn_hgcd2_binary
overhead 6.02 cycles, precision 10 units of 8.33e-10 secs, CPU freq
1200.00 MHz
mpn_hgcd2_1 mpn_hgcd2_2 mpn_hgcd2_3
ni...@lysator.liu.se (Niels Möller) writes:
> Below is one variant, that seems to pass tests. I haven't done any
> benchmarks yet.
Just gave it a try with tune/speed, together with all the latest div1
variants. This one on my laptop, and not looking that great:
$ ./tune/speed -c -s1 -p10
t...@gmplib.org (Torbjörn Granlund) writes:
I went ahead and pushed a hgcd2.c with _METHOD = 4 and _METHOD 5. The
former tables quotients and neeeds one multiply per div1 call and the
latter tables inverses and needs 2 multiplies per div1 call.
I pushed improved versions.
Instead of
ni...@lysator.liu.se (Niels Möller) writes:
I see two ways. Either just do an inversion table, which will cost an
additional multiplication when used. Than one could handle quotients up
to 7 or 8 bits bits with just 256 bytes of table.
Or shift both ni and di, and get quotient with
t...@gmplib.org (Torbjörn Granlund) writes:
> I had not realised you meant left-to-right. I thought you talked about
> right-to-left. Interesting!
Below is one variant, that seems to pass tests. I haven't done any
benchmarks yet. All the SWAP_MASK calls look a bit expensive, should be
4
Nice!
--
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel
ni...@lysator.liu.se (Niels Möller) writes:
> Hmm. So the hgcd2-x-y.c would be there als for speed, and hgcd2.c would
> have something like
>
> #if TUNE_PROGRAM_BUILD
> int mpn_hgcd2(...)
> {
> return (*hgcd2_method_pointer)(...);
> }
See below patch.
diff -r 3e04a9bbba13 gmp-impl.h
---
t...@gmplib.org (Torbjörn Granlund) writes:
> With some thought, I am sure the table size could come down while the
> accuracy could improve.
I see two ways. Either just do an inversion table, which will cost an
additional multiplication when used. Than one could handle quotients up
to 7 or 8
> * Speed up div1, presumably with a table-based approach.
A quick-and-dirty but working variant below.
The table size with NBITS = 8 is 32KiB. Far too big. It avoids
branches in 83% of the cases, which is not that great considering the
huge cost of mispredicted branches.
With some
t...@gmplib.org (Torbjörn Granlund) writes:
> Ideally, one would compile hgcd2.c in all possible variants (presumable
> through hand-crafted hgcd2-1-1.c, hgcd2-2-1.c, etc., and then call the
> selected hgcd2's entry point through a function pointer for further
> measuring.
Hmm. So the
ni...@lysator.liu.se (Niels Möller) writes:
How much speedup have we achieved?
I don't know. I just observed the GCD_DC_THRESHOLD to change a lot, and
that is a good sign.
Looks like GCD_DC_THRESHOLD gets higher on many machines, but lower on a
few? I now realize that the way tuneup
t...@gmplib.org (Torbjörn Granlund) writes:
> I feel we've achieved much of the possible speedup for gcd now.
How much speedup have we achieved?
> But what more can we do before we are completely done for now?
> Let me try to list it:
>
>
> * Add entry points for gcd_11 allowing even
15 matches
Mail list logo