ni...@lysator.liu.se (Niels Möller) writes:
A long time ago, we choose an interface for sbpi1_div_qr which does
*not* store the most significant limb; instead it returns it. I think it
was the intention that a new top-level mpn_div_qr should follow that
convention, and not store the top
ni...@lysator.liu.se (Niels Möller) writes:
The interesting thing is that the next higher function, mpn_div_qr_1,
should return the high quotient limb separately.
I am not sure I agree. Please explain.
You're saying that en n-limb consecutive dividend should yield an
(n-1)-limb consecutive
Torbjorn Granlund t...@gmplib.org writes:
ni...@lysator.liu.se (Niels Möller) writes:
The interesting thing is that the next higher function, mpn_div_qr_1,
should return the high quotient limb separately.
I am not sure I agree. Please explain.
A long time ago, we choose an interface
Torbjorn Granlund t...@gmplib.org writes:
Basically qp = up won't work, but qp = up + k for any positive k will?
Does the C code share that property?
I think the C code has the same problem. Hmm, and it looks like in both
the C and asm code, it's only the first iteration, before the main loop,
ni...@lysator.liu.se (Niels Möller) writes:
Torbjorn Granlund t...@gmplib.org writes:
Basically qp = up won't work, but qp = up + k for any positive k will?
Does the C code share that property?
[...]
I think it would be good to fix that, since it is surely a common usage
scenario.
I
I pushed initial C versions of these functions:
mpn_div_qr_1n_pi2
mpn_div_qr_1u_pi2
I have had these for a long time, judging from the file time stamps.
These accept n-limb dividends in a single consecutive operand and
generate n-limb quotients also in a consecutive operand. I now
Torbjorn Granlund t...@gmplib.org writes:
* The code is no win for AMD k10/k8 (although close to 10 c/l might well be
possible)
I tried replacing one masking op by cmov, as you suggested. We then get
down to 11.25 c/l on K10. I put this modified version in the k10
subdirectory, since it was
ni...@lysator.liu.se (Niels Möller) writes:
Torbjorn Granlund t...@gmplib.org writes:
* The code is no win for AMD k10/k8 (although close to 10 c/l might well be
possible)
I tried replacing one masking op by cmov, as you suggested. We then get
down to 11.25 c/l on K10. I put
I turned out the code was a bit slower on k8.
This patch changes that. With it applied, things takes 11 c/l on both
pipelines. This is also a 2 c/l improvement for piledriver.
I have not tested that this is correct. If you like the patch, please
consider putting the result in the k8 subdir.
Torbjorn Granlund t...@gmplib.org writes:
I turned out the code was a bit slower on k8.
This patch changes that. With it applied, things takes 11 c/l on both
pipelines. This is also a 2 c/l improvement for piledriver.
Cool.
I have not tested that this is correct. If you like the patch,
I played more with the code, now trying to break the add-adc-sbb-cmov
chain, for the benefit of most Intel processors.
But I lack unit testing code for the function, making hacking quite
cumbersome. I don't feel safe hacking *any* GMP assembly code without
tests/devel/try.c's function and access
Torbjorn Granlund t...@gmplib.org writes:
But I lack unit testing code for the function, making hacking quite
cumbersome. I don't feel safe hacking *any* GMP assembly code without
tests/devel/try.c's function and access checks.
tests/mpn/t-div.c includes tests for mpn_div_qr_1, including
ni...@lysator.liu.se (Niels Möller) writes:
ni...@lysator.liu.se (Niels Möller) writes:
But sure, support also in try.c would be good.
Added now. Please have a look if it the changes are sane. I use the
second source for the uh input, and I added a DATA_DIV_QR_1 to get it in
the
ni...@lysator.liu.se (Niels Möller) writes:
ni...@lysator.liu.se (Niels Möller) writes:
But sure, support also in try.c would be good.
Added now.
And sure enough, it detects some bugs in the new assembly code. For size
n==1, there's a missing mov. I'll add that shortly. Then there's another
ni...@lysator.liu.se (Niels Möller) writes:
And sure enough, it detects some bugs in the new assembly code. For size
n==1, there's a missing mov. I'll add that shortly. Then there's another
problem with n==2, which needs a bit more debugging.
Good. So now you have debugged the new try.c
I added data for the new code at http://gmplib.org/devel/asm.html.
There is a line for div_qr_1u_pi1 as well, since that will also be
needed. It might actually be more common that the divisor is not
normalised.
I should try to wrap up div_qr_1n_pi2 and div_qr_1u_pi2 as well, and
then add
ni...@lysator.liu.se (Niels Möller) writes:
Will try that. I think one could also try to delay the quotient store
one iteration, keeping Q1 in a register until the next iteration. Then
one gets rid of the
adc Q2,8(QP, UN, 8)
in the loop, using only a single store per
Torbjorn Granlund t...@gmplib.org writes:
On Intel chips, op-to-mem is expensive. Even op-from-memory is often
slower than load+op. (I understand the register shortage problem.)
The following (untested) variant needs one register too many.
UP, QP, UN: Load, store, loop counter.
I looked at the logic following this:
sbb U2, U2 C 7 13
You negate the U2 copy in Q2. It seems that three adc by sbb
could avoid the neg.
I might also be possible to replace the early loop and stuff by cmov.
Note that the carry flag survives dec, although that causes a
Torbjorn Granlund t...@gmplib.org writes:
I looked at the logic following this:
sbb U2, U2 C 7 13
You negate the U2 copy in Q2. It seems that three adc by sbb
could avoid the neg.
The problem is the final use, where Q2 is added, with carry, to a
different register.
ni...@lysator.liu.se (Niels Möller) writes:
Torbjorn Granlund t...@gmplib.org writes:
I think x86-64, x86-32, arm32, arm64, powerpc-64, sparc-64 matter.
Unfortunately, powerpc-64 (and -32) return these types onto the stack
via an implicit pointer.
Ok, I think I'll stick to
ni...@lysator.liu.se (Niels Möller) writes:
I'm about to push the first step, with C implementations of mpn_div_qr_1
and mpn_div_qr_1n_pi1.
Done now, including some tuning code. It would be interesting to have
DIV_QR_1N_PI1_METHOD
DIV_QR_1_NORM_THRESHOLD
DIV_QR_1_UNNORM_THRESHOLD
added
Torbjorn Granlund t...@gmplib.org writes:
Which tail call?
In the normalized case, the checked in mpn_div_qr_1 does something like
*qh = ...;
...
return mpn_div_qr_1n_pi1(...);
Which is a nice tail call. With the struct-returning version one
gets instead
res.qh = ...;
...
res.r =
ni...@lysator.liu.se (Niels Möller) writes:
I'll try to get the x86_64 assembly for mpn_div_qr_1n_pi1 in soon.
Pushed first working version now, see
http://gmplib.org:8000/gmp/file/tip/mpn/x86_64/div_qr_1n_pi1.asm
On my core2 laptop:
$ ./speed -s 2-10,100,500 -C
ni...@lysator.liu.se (Niels Möller) writes:
On my core2 laptop:
$ ./speed -s 2-10,100,500 -C mpn_divrem_1.0x
mpn_div_qr_1.0x
overhead 6.13 cycles, precision 1 units of 8.33e-10 secs, CPU freq
1200.00 MHz
mpn_divrem_1.0x
ni...@lysator.liu.se (Niels Möller) writes:
ni...@lysator.liu.se (Niels Möller) writes:
(about using a small struct as return value)
If the caller is going to store the returned value directly in memory
anyway, there's little difference. And if the caller is going to operate
on
ni...@lysator.liu.se (Niels Möller) writes:
To get going, I've written C implementations of mpn_div_qr_1n_pi1 and
mpn_divf_qr1n_pi1, and made divrem_1 call them.
Below, also an mpn_div_qr_1, using these primitives (and with some
inspiration from divrem_1). For return value, I use the type
ni...@lysator.liu.se (Niels Möller) writes:
ni...@lysator.liu.se (Niels Möller) writes:
To get going, I've written C implementations of mpn_div_qr_1n_pi1 and
mpn_divf_qr1n_pi1, and made divrem_1 call them.
Below, also an mpn_div_qr_1, using these primitives (and with some
Torbjorn Granlund t...@gmplib.org writes:
I am not too enthusiastic about struct return types for critical
functions. I expect this to be returned via a stack slot everywhere o
almost everywhere.
As far as I understand, the most common ABIs for x86_64 and ARM (which
is pretty close to almost
ni...@lysator.liu.se (Niels Möller) writes:
Consider the test compilation unit
typedef struct {
unsigned long q; unsigned long r;
} qr_t;
qr_t divrem (unsigned long u, unsigned long d)
{
qr_t res;
res.q = u/d;
res.r = u - res.q*d;
return res;
}
On
Torbjörn reminded me of the work on more clever div_qr_1 we discussed a
few years ago, which seemed promising but never got completed.
At the time, I think one reason it stalled was the somewhat unwieldy
divrem_1 primitive. The plan (see https://gmplib.org/devel/) is to
replace divrem_1 by a
31 matches
Mail list logo