"Marco Bodrato" writes:
Not bad, is it?
Not at all bad!
Must we use masks to obtain branchless code? Or should we let the compiler
choose?
My experience is that the compiler cannot be trusted in this area; it
may or may not use cmov and its reasons for doing it either way are
subtle.
Ciao
Il Gio, 5 Settembre 2019 10:33 am, Torbjörn Granlund ha scritto:
> One of my previous proposal looks like this, when put into its right form:
>
> mp_limb_t
> div1 (mp_limb_t n0, mp_limb_t d0)
> {
> if (UNLIKELY ((d0 >> 61) != 0))
> return n0 / d0;
>
> if (UNLIKELY (n0 >= (d0 << 3)))
ni...@lysator.liu.se (Niels Möller) writes:
Unit tests would be nice. I think the tests/mpz/t-gcd.c does exercises
the large quotient cases.
I cobbled together this:
test-div2.c
Description: Binary data
BTW, I wonder if it makes sense with HGCD2_DIV2_METHOD == 3 similar to
t...@gmplib.org (Torbjörn Granlund) writes:
> What are the specs for div2?
>
> Surely n1 > 0 and d1 > 0.
All variants need d1 > 0, method 2 also needs n1 > 0 (for
count_leading_zeros).
> Also N >= D?
Method 2 needs clz(n1) <= clz(d1). Besides that, I think they can handle
N < D, i.e., q == 0.
t...@gmplib.org (Torbjörn Granlund) writes:
> ni...@lysator.liu.se (Niels Möller) writes:
>
> I've made a quick try deleting it from the single-limb loop. See patch
> below. Measurements are a bit noisy, but it looks like a slowdown when I
> time it. With hgcd2 time increasing from 1220
ni...@lysator.liu.se (Niels Möller) writes:
Ooops. Now it should be in.
What are the specs for div2?
Surely n1 > 0 and d1 > 0.
Also N >= D?
Does the new div2 always compute the "accurate" quotient,
i.e., with the remainder R < D?
I'm asking as I believe strongly in unit testing of these
ni...@lysator.liu.se (Niels Möller) writes:
I've made a quick try deleting it from the single-limb loop. See patch
below. Measurements are a bit noisy, but it looks like a slowdown when I
time it. With hgcd2 time increasing from 1220 cycles to 1290 (this time
measured on broadwell), which
ni...@lysator.liu.se (Niels Möller) writes:
> My guess is that special q=1 is often beneficial for the double-limb
> loop, and rarely beneficial in the single-limb loop. But needs some
> measurements.
I've made a quick try deleting it from the single-limb loop. See patch
below. Measurements are
t...@gmplib.org (Torbjörn Granlund) writes:
> ni...@lysator.liu.se (Niels Möller) writes:
>
> Pushed now, with minor changes and deletion of the "current" code
> (method 2 above).
>
> It is not in the main repo yet.
Ooops. Now it should be in.
Regards,
/Niels
--
Niels Möller.
ni...@lysator.liu.se (Niels Möller) writes:
Pushed now, with minor changes and deletion of the "current" code
(method 2 above).
It is not in the main repo yet.
--
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
ni...@lysator.liu.se (Niels Möller) writes:
>> I'm appending another iteration of the patch to add div2 function based
>> on div1 on the high limbs. Selected via HGCD2_DIV2_METHOD. Benchmarks:
>>
>> HGCD2_DIV2_METHOD mpn_hgcd2_1 mpn_hgcd2_2 mpn_hgcd2_3
>> 1#1504.47
ni...@lysator.liu.se (Niels Möller) writes:
> I've written several div1 in asm (arm v5 method 2, 64-bit arm v8 method
> 2 and 3, 64-bit x86 method 2).
Nice.
x64-div_11-m2.asm
Description: Binary data
arm32-div_11-m2.asm
Description: Binary data
arm64-div_11-m2.asm
Description:
t...@gmplib.org (Torbjörn Granlund) writes:
> Not exactly. I think running div1 like hgcd2 but without any of hgcd2
> bookkeeping would make some sense. I.e., feed div1 with the input of
> Euclid's algorithm. To avoid skew from particular operands, perhaps
> table 10 uniformly distrinuted
ni...@lysator.liu.se (Niels Möller) writes:
> So one could perhaps provide a set of uniformly distribted
> random numbers for div1 and get adequate results?
Are you saying that the quotient of two independently uniformly
distributed numbers has a similar distribution as the euclidean
t...@gmplib.org (Torbjörn Granlund) writes:
> The distribution of operands are the same as for a plain, old Euclid,
> isn't it?
Should be. Or slightly different, if q = 1 is handeled specially.
> So one could perhaps provide a set of uniformly distribted
> random numbers for div1 and get
ni...@lysator.liu.se (Niels Möller) writes:
t...@gmplib.org (Torbjörn Granlund) writes:
> Looking at when method 1 or 3 is faster than 2 is more interesting.
> Method 1 and to some extent also method 3 would benefit from asm code,
How would method 1 benefit from asm code? To use
t...@gmplib.org (Torbjörn Granlund) writes:
> Looking at when method 1 or 3 is faster than 2 is more interesting.
> Method 1 and to some extent also method 3 would benefit from asm code,
How would method 1 benefit from asm code? To use instructions providing
both quotient and remainder? For
Two configs hesitantly chose HGCD2_DIV1_METHOD = 2. For k10/64, method
2 outperforms method 3 by 0.34%. For ARM Cortex-A8 method 2's advantage
is 1.94%. Wow.
Looking at when method 1 or 3 is faster than 2 is more interesting.
Method 1 and to some extent also method 3 would benefit from asm
ni...@lysator.liu.se (Niels Möller) writes:
Just pushed a rename, and a change of default to
#define HGCD2_DIV1_METHOD 3
Will you update the threshold page accordingly?
Done. There is no aliasing mechanism, so now the table is empty.
--
Torbjörn
Please encrypt, key id 0xC8601622
t...@gmplib.org (Torbjörn Granlund) writes:
> Next, I think we should go ahead with the rename HGCD2_METHOD to
> DIV11_METHOD or possibly HGCD2_DIV1_METHOD.
>
> Feel free.
Just pushed a rename, and a change of default to
#define HGCD2_DIV1_METHOD 3
Will you update the threshold page
ni...@lysator.liu.se (Niels Möller) writes:
> I went ahead and committed that version, replacing the old
> HGCD2_METHOD=2. I expect it is be the fastest method on some platform.
Will be interesting to see results on thresholds.
Nobody loves the new METHOD 2. :-(
(Not many machines
t...@gmplib.org (Torbjörn Granlund) writes:
> t...@gmplib.org (Torbjörn Granlund) writes:
>
> I cooked a modern alternative:
>
> I went ahead and committed that version, replacing the old
> HGCD2_METHOD=2. I expect it is be the fastest method on some platform.
Will be interesting to see
t...@gmplib.org (Torbjörn Granlund) writes:
I cooked a modern alternative:
I went ahead and committed that version, replacing the old
HGCD2_METHOD=2. I expect it is be the fastest method on some platform.
(We might want to arrange for longlong.h to use lzcnt instead of bsr for
modern AMD
ni...@lysator.liu.se (Niels Möller) writes:
Sounds good. Give it a few days, and delete if it still looks slow
everywhere.
I cooked a modern alternative:
static mp_double_limb_t
div1 (mp_limb_t n0, mp_limb_t d0)
{
mp_double_limb_t res;
int ncnt, dcnt, cnt;
mp_limb_t q;
mp_limb_t
t...@gmplib.org (Torbjörn Granlund) writes:
> Conclusion: The HGCD2_METHOD 2 code in its current form is not useful on
> any of our platforms. (Perhaps we should wait a few days more as not
> all hosts have reported results yet.)
Sounds good. Give it a few days, and delete if it still looks
ni...@lysator.liu.se (Niels Möller) writes:
Maybe I'm confused. And I think tradeoffs are different for the
double-limb and the single-limb loops.
Or I am confused.
I have mostly thought of div1, less of div2.
For the single-limb loop, I think special handling of q == 1 means we
avoid
t...@gmplib.org (Torbjörn Granlund) writes:
> ni...@lysator.liu.se (Niels Möller) writes:
>
> > That should in particular be true if the q = 1 code is removed from
> > hgcd2, for machines with good multiplication throughput.
>
> Can we remove that for everything but HGCD2_METHOD == 2? It
ni...@lysator.liu.se (Niels Möller) writes:
> I believe METHOD 1 could almost always be pointless if METHOD 3's
> straight line code is made really efficient.
I'd guess so too. But benchmarks are better than faith alone ;-)
I have a lot of faith in benchmarking! :-)
> That should in
t...@gmplib.org (Torbjörn Granlund) writes:
> I believe METHOD 1 could almost always be pointless if METHOD 3's
> straight line code is made really efficient.
I'd guess so too. But benchmarks are better than faith alone ;-)
> That should in particular be true if the q = 1 code is removed from
>
> Should we rename div1 to put it into the GMP name space? How about
> mpn_div_11? (We could encode in its name that it is for use for small
> q, but I'd suggest that we don't do that.)
>
> That would allow for some assembly experiments.
If you think assembly will make a big
t...@gmplib.org (Torbjörn Granlund) writes:
> Should we rename div1 to put it into the GMP name space? How about
> mpn_div_11? (We could encode in its name that it is for use for small
> q, but I'd suggest that we don't do that.)
>
> That would allow for some assembly experiments.
If you think
ni...@lysator.liu.se (Niels Möller) writes:
Keep in mind that the optimizations to div1 so far only help the second,
cheaper, half of hgcd2. The first half, working with double precision,
still uses the div2 function, which is full of branches.
I know, but the number I mainly look at is
t...@gmplib.org (Torbjörn Granlund) writes:
> I'm a bit surprised that the METHOD 3 div1 didn't yield more speedup.
> The hgcd2 code is almost content with a 25 cycle / operation, which is
> really, really odd to me.
Keep in mind that the optimizations to div1 so far only help the second,
There was a bug in the "euclides 2-choice binary" function. Now, with
the intended algorithm, we get almost 4 bits per iteration.
Algorithm Iterations Iter/Bit Bit/Iter Max iter
plain binary :378420598 0.6971.435 58
binary+-
t...@gmplib.org (Torbjörn Granlund) writes:
> ni...@lysator.liu.se (Niels Möller) writes:
>
> The current _METHOD 3 code starts with
>
> if (UNLIKELY ((d0 >> (GMP_LIMB_BITS - 3)) != 0)
> || UNLIKELY (n0 >= (d0 << 3)))
>
> The first condition isn't that unlikely, though, and will
ni...@lysator.liu.se (Niels Möller) writes:
I don't know much about itanic performance.
It's not bad for GMP, but its division *libcall* needs some 150 cycles.
I played with a METHOD = 3 div1 for Itanic as its predicated execution
comes in very handy. The result is that one can do one
t...@gmplib.org (Torbjörn Granlund) writes:
> We have now about half of hosts asking for HGCD_METHOD = 1 as for
> HGCD_METHOD = 3. Itanic wants HGCD_METHOD 2, and so does beagle (Arm
> a8), but the latter with a preference of just 0.05% (over method 3).
I don't know much aboutitanic
We have now about half of hosts asking for HGCD_METHOD = 1 as for
HGCD_METHOD = 3. Itanic wants HGCD_METHOD 2, and so does beagle (Arm
a8), but the latter with a preference of just 0.05% (over method 3).
_METHOD 1 benefits from the special handling of q = 1 before invoking
DIV1. _METHOD 2 & 3
On Tue, Sep 3, 2019 at 10:49 AM Torbjörn Granlund wrote:
>
> ni...@lysator.liu.se (Niels Möller) writes:
>
> And this on a laptop with an Intel U4100 (5 years old?), so I'd assume
> it doesn't have a particularly fast div instruction. Should we just
> delete div1 ? On which architectures
We have more values for HGCD2_METHOD now:
https://gmplib.org/devel/thres/HGCD2_METHOD.html
The low-end ARM cores are no longer alone in desiring HGCD2_METHOD = 2.
--
Torbjörn
Please encrypt, key id 0xC8601622
___
gmp-devel mailing list
t...@gmplib.org (Torbjörn Granlund) writes:
> Can div2 be done sloppily, e.g., by extracting the top 64 bits of the
> dividend and the corresponding (up to) 64 bits of the divisor, and
> invoke DIV1 on these bits?
I think so. Result will be at most one off, except in the unlikely case
that q >
Can div2 be done sloppily, e.g., by extracting the top 64 bits of the
dividend and the corresponding (up to) 64 bits of the divisor, and
invoke DIV1 on these bits?
OK, one needs to return a non-negative remainder, so a little bit of
adjustment might be needed. But does the remainder need to be
ni...@lysator.liu.se (Niels Möller) writes:
For tuned *_METHOD values, it might be useful to let tuneup also add a
comment with next best method and margin, something like
#define HGCD2_METHOD 2 /* 3.2% faster than method 1 */
That'd be good!
I did one variant with table lookup and an
t...@gmplib.org (Torbjörn Granlund) writes:
> The table generator program has now run. Also pi1 wants the old code.
>
> The next question is how badly they want it. :-)
For tuned *_METHOD values, it might be useful to let tuneup also add a
comment with next best method and margin, something
One of my previous proposal looks like this, when put into its right form:
mp_limb_t
div1 (mp_limb_t n0, mp_limb_t d0)
{
if (UNLIKELY ((d0 >> 61) != 0))
return n0 / d0;
if (UNLIKELY (n0 >= (d0 << 3)))
return n0 / d0;
else {
mp_limb_t q, mask;
d0 <<= 2;
mask =
ni...@lysator.liu.se (Niels Möller) writes:
I had a quick look at the machines that completed tuning this night.
These three seem to prefer the old code (HGCD2_METHOD == 2):
armcortexa7neon-unknown-linux-gnueabihf pi2.gmplib.org-stat
armcortexa8neon-unknown-linux-gnueabihf
ni...@lysator.liu.se (Niels Möller) writes:
> Excellent. I just pushed tuning support, with HGCD2_METHOD == 1 using
> plain division for div1, and HGCD2_METHOD == 2 (default) using the old
> method with the div1 function.
I had a quick look at the machines that completed tuning this night.
These
t...@gmplib.org (Torbjörn Granlund) writes:
> ni...@lysator.liu.se (Niels Möller) writes:
>
> Pushed now. Plan is to next add a tuned HGCD2_METHOD, initially
> controlling which div1 to use. Torbjörn, can you add that to the
> thresholds web page once it's ready?
>
> I went ahead and added
ni...@lysator.liu.se (Niels Möller) writes:
Pushed now. Plan is to next add a tuned HGCD2_METHOD, initially
controlling which div1 to use. Torbjörn, can you add that to the
thresholds web page once it's ready?
I went ahead and added it. When data comes in, the subpage will become
more
ni...@lysator.liu.se (Niels Möller) writes:
> First step is to add support to speed. Does the below look reasonable? I
> modelled it a bit on SPEED_ROUTINE_MODLIMB_INVERT, which also measures a
> fix size, but I don't quite understand all of struct speed_params.
Pushed now. Plan is to next add a
t...@gmplib.org (Torbjörn Granlund) writes:
> We might want to teach tuneup to choose between small quotient division
> primitives., like we do for Jacobi.
If we think there's place for several hgcd2 variants, that's definitely
needed.
First step is to add support to speed. Does the below look
ni...@lysator.liu.se (Niels Möller) writes:
Maybe it makes sense to check this in now, deleting the old div1 and
div2? If it turns out to be a regression (is that easy to see from
nightly tests?), we can consider putting it back under some #if
DIV1_METHOD ..., or replace with some other
t...@gmplib.org (Torbjörn Granlund) writes:
> ni...@lysator.liu.se (Niels Möller) writes:
>
> Here's a div2 function that does that. Together with the deleted div1,
> it gives a 25% (!) speed up of gcd in the range 3-10 limbs, benchmarked
> on my broadwell machine. And should do well on all
ni...@lysator.liu.se (Niels Möller) writes:
Here's a div2 function that does that. Together with the deleted div1,
it gives a 25% (!) speed up of gcd in the range 3-10 limbs, benchmarked
on my broadwell machine. And should do well on all machines with decent
small-quotient division.
I tested newer Intel systems too (Haswell, Skylake) and they all need
around 25 cycles for a division n/d = 1.
Intel Goldmont Plus (a current low-end CPU) is better, it needs about 12
cycles. AMD CPUs from the last 10 years all perform OK.
It is funny that x86 vendors give division so little
t...@gmplib.org (Torbjörn Granlund) writes:
> So I think plain / is the way to go for certain systems!
Then we should use that for the double limb loop too! It gets a bit
tricky, since we need special handling of large quotients, but something
like this should work:
mp_limb_t q = ah / bh;
t...@gmplib.org (Torbjörn Granlund) writes:
ni...@lysator.liu.se (Niels Möller) writes:
>
In that case, not so surprising that the div1 function loses. Do other
architectures also have decent performance for small-quotient division?
>
I don't have the full picture, I'm afraid.
>
I
ni...@lysator.liu.se (Niels Möller) writes:
Is there any even easier way to find out on which machines (if any) div1
improves performance? It's still a bit awkward to add tuning of
HGCD_DIV1_METHOD and wait for the nightly builds.
https://gmplib.org/devel/testsystems.html :-)
SU4100$ ssh
t...@gmplib.org (Torbjörn Granlund) writes:
count_leading_zeros (ncnt, n0)
count_leading_zeros (dcnt, d0)
ni = ... extract high k bits from n0 ...
di = ... extract high \ell bits from d0 ...
q = qtab[ni<<\ell+di];
... adjust ...
Well, perhaps
q = qtab[ni<<\ell+di] << (ncnt - dcnt)
ni...@lysator.liu.se (Niels Möller) writes:
In that case, not so surprising that the div1 function loses. Do other
architectures also have decent performance for small-quotient division?
Do you think table lookup on high bits should beat 16 cycles? It needs
to give good enough accuracy
t...@gmplib.org (Torbjörn Granlund) writes:
> Intel ark doesn't seem to know any processor called "U4100" so I cannot
> figure out what generation it belongs to.
gmp's config.guess classifies it as core2. cat /proc/cpuinfo says
processor : 0
vendor_id : GenuineIntel
cpu family
ni...@lysator.liu.se (Niels Möller) writes:
> But... I get even better numbers if I keep the old code and just replace
> the div1 function with plain division q = a / b:
Attaching deletion patch. Tried it also on my broadwell machine.
Benchmarking is less reliable there, but seems to give about
ni...@lysator.liu.se (Niels Möller) writes:
And this on a laptop with an Intel U4100 (5 years old?), so I'd assume
it doesn't have a particularly fast div instruction. Should we just
delete div1 ? On which architectures can we expect it to be beneficial?
It should be fairly easy to find
ni...@lysator.liu.se (Niels Möller) writes:
> I think it's promising, and also a bit simpler than the current div1
> code. Some concerns:
>
> 1. I think the approximate q means that we will often reduce from a > b
>to a close to but larger than b, so we need two iterations to get to
>a <
I've been plaing a bit with table based hgcd2, but to keep things
simple, I started with hgcd1 (reducing full limbs to half limbs, with an
associated matrix of half limb elements). Complete code below, the main
loop is
for (;;)
{
mp_limb_t a_hi;
mp_limb_t b_hi;
mp_limb_t
65 matches
Mail list logo