Re: div_qr_1 interface

2013-10-26 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: A long time ago, we choose an interface for sbpi1_div_qr which does *not* store the most significant limb; instead it returns it. I think it was the intention that a new top-level mpn_div_qr should follow that convention, and not store the top

Re: div_qr_1 interface

2013-10-25 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: The interesting thing is that the next higher function, mpn_div_qr_1, should return the high quotient limb separately. I am not sure I agree. Please explain. You're saying that en n-limb consecutive dividend should yield an (n-1)-limb consecutive

Re: div_qr_1 interface

2013-10-25 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes: ni...@lysator.liu.se (Niels Möller) writes: The interesting thing is that the next higher function, mpn_div_qr_1, should return the high quotient limb separately. I am not sure I agree. Please explain. A long time ago, we choose an interface

Re: div_qr_1 interface

2013-10-24 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes: Basically qp = up won't work, but qp = up + k for any positive k will? Does the C code share that property? I think the C code has the same problem. Hmm, and it looks like in both the C and asm code, it's only the first iteration, before the main loop,

Re: div_qr_1 interface

2013-10-24 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes: Torbjorn Granlund t...@gmplib.org writes: Basically qp = up won't work, but qp = up + k for any positive k will? Does the C code share that property? [...] I think it would be good to fix that, since it is surely a common usage scenario. I

Re: div_qr_1 interface

2013-10-24 Thread Torbjorn Granlund
I pushed initial C versions of these functions: mpn_div_qr_1n_pi2 mpn_div_qr_1u_pi2 I have had these for a long time, judging from the file time stamps. These accept n-limb dividends in a single consecutive operand and generate n-limb quotients also in a consecutive operand. I now

Re: div_qr_1 interface

2013-10-22 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes: * The code is no win for AMD k10/k8 (although close to 10 c/l might well be possible) I tried replacing one masking op by cmov, as you suggested. We then get down to 11.25 c/l on K10. I put this modified version in the k10 subdirectory, since it was

Re: div_qr_1 interface

2013-10-22 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Torbjorn Granlund t...@gmplib.org writes: * The code is no win for AMD k10/k8 (although close to 10 c/l might well be possible) I tried replacing one masking op by cmov, as you suggested. We then get down to 11.25 c/l on K10. I put

Re: div_qr_1 interface

2013-10-22 Thread Torbjorn Granlund
I turned out the code was a bit slower on k8. This patch changes that. With it applied, things takes 11 c/l on both pipelines. This is also a 2 c/l improvement for piledriver. I have not tested that this is correct. If you like the patch, please consider putting the result in the k8 subdir.

Re: div_qr_1 interface

2013-10-22 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes: I turned out the code was a bit slower on k8. This patch changes that. With it applied, things takes 11 c/l on both pipelines. This is also a 2 c/l improvement for piledriver. Cool. I have not tested that this is correct. If you like the patch,

Re: div_qr_1 interface

2013-10-22 Thread Torbjorn Granlund
I played more with the code, now trying to break the add-adc-sbb-cmov chain, for the benefit of most Intel processors. But I lack unit testing code for the function, making hacking quite cumbersome. I don't feel safe hacking *any* GMP assembly code without tests/devel/try.c's function and access

Re: div_qr_1 interface

2013-10-22 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes: But I lack unit testing code for the function, making hacking quite cumbersome. I don't feel safe hacking *any* GMP assembly code without tests/devel/try.c's function and access checks. tests/mpn/t-div.c includes tests for mpn_div_qr_1, including

Re: div_qr_1 interface

2013-10-22 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: ni...@lysator.liu.se (Niels Möller) writes: But sure, support also in try.c would be good. Added now. Please have a look if it the changes are sane. I use the second source for the uh input, and I added a DATA_DIV_QR_1 to get it in the

Re: div_qr_1 interface

2013-10-22 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes: ni...@lysator.liu.se (Niels Möller) writes: But sure, support also in try.c would be good. Added now. And sure enough, it detects some bugs in the new assembly code. For size n==1, there's a missing mov. I'll add that shortly. Then there's another

Re: div_qr_1 interface

2013-10-22 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: And sure enough, it detects some bugs in the new assembly code. For size n==1, there's a missing mov. I'll add that shortly. Then there's another problem with n==2, which needs a bit more debugging. Good. So now you have debugged the new try.c

Re: div_qr_1 interface

2013-10-22 Thread Torbjorn Granlund
I added data for the new code at http://gmplib.org/devel/asm.html. There is a line for div_qr_1u_pi1 as well, since that will also be needed. It might actually be more common that the divisor is not normalised. I should try to wrap up div_qr_1n_pi2 and div_qr_1u_pi2 as well, and then add

Re: div_qr_1 interface

2013-10-21 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Will try that. I think one could also try to delay the quotient store one iteration, keeping Q1 in a register until the next iteration. Then one gets rid of the adc Q2,8(QP, UN, 8) in the loop, using only a single store per

Re: div_qr_1 interface

2013-10-21 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes: On Intel chips, op-to-mem is expensive. Even op-from-memory is often slower than load+op. (I understand the register shortage problem.) The following (untested) variant needs one register too many. UP, QP, UN: Load, store, loop counter.

Re: div_qr_1 interface

2013-10-21 Thread Torbjorn Granlund
I looked at the logic following this: sbb U2, U2 C 7 13 You negate the U2 copy in Q2. It seems that three adc by sbb could avoid the neg. I might also be possible to replace the early loop and stuff by cmov. Note that the carry flag survives dec, although that causes a

Re: div_qr_1 interface

2013-10-21 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes: I looked at the logic following this: sbb U2, U2 C 7 13 You negate the U2 copy in Q2. It seems that three adc by sbb could avoid the neg. The problem is the final use, where Q2 is added, with carry, to a different register.

Re: div_qr_1 interface

2013-10-20 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Torbjorn Granlund t...@gmplib.org writes: I think x86-64, x86-32, arm32, arm64, powerpc-64, sparc-64 matter. Unfortunately, powerpc-64 (and -32) return these types onto the stack via an implicit pointer. Ok, I think I'll stick to

Re: div_qr_1 interface

2013-10-20 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes: I'm about to push the first step, with C implementations of mpn_div_qr_1 and mpn_div_qr_1n_pi1. Done now, including some tuning code. It would be interesting to have DIV_QR_1N_PI1_METHOD DIV_QR_1_NORM_THRESHOLD DIV_QR_1_UNNORM_THRESHOLD added

Re: div_qr_1 interface

2013-10-20 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes: Which tail call? In the normalized case, the checked in mpn_div_qr_1 does something like *qh = ...; ... return mpn_div_qr_1n_pi1(...); Which is a nice tail call. With the struct-returning version one gets instead res.qh = ...; ... res.r =

Re: div_qr_1 interface

2013-10-20 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes: I'll try to get the x86_64 assembly for mpn_div_qr_1n_pi1 in soon. Pushed first working version now, see http://gmplib.org:8000/gmp/file/tip/mpn/x86_64/div_qr_1n_pi1.asm On my core2 laptop: $ ./speed -s 2-10,100,500 -C

Re: div_qr_1 interface

2013-10-20 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: On my core2 laptop: $ ./speed -s 2-10,100,500 -C mpn_divrem_1.0x mpn_div_qr_1.0x overhead 6.13 cycles, precision 1 units of 8.33e-10 secs, CPU freq 1200.00 MHz mpn_divrem_1.0x

Re: div_qr_1 interface

2013-10-18 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: ni...@lysator.liu.se (Niels Möller) writes: (about using a small struct as return value) If the caller is going to store the returned value directly in memory anyway, there's little difference. And if the caller is going to operate on

Re: div_qr_1 interface

2013-10-17 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes: To get going, I've written C implementations of mpn_div_qr_1n_pi1 and mpn_divf_qr1n_pi1, and made divrem_1 call them. Below, also an mpn_div_qr_1, using these primitives (and with some inspiration from divrem_1). For return value, I use the type

Re: div_qr_1 interface

2013-10-17 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: ni...@lysator.liu.se (Niels Möller) writes: To get going, I've written C implementations of mpn_div_qr_1n_pi1 and mpn_divf_qr1n_pi1, and made divrem_1 call them. Below, also an mpn_div_qr_1, using these primitives (and with some

Re: div_qr_1 interface

2013-10-17 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes: I am not too enthusiastic about struct return types for critical functions. I expect this to be returned via a stack slot everywhere o almost everywhere. As far as I understand, the most common ABIs for x86_64 and ARM (which is pretty close to almost

Re: div_qr_1 interface

2013-10-17 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes: Consider the test compilation unit typedef struct { unsigned long q; unsigned long r; } qr_t; qr_t divrem (unsigned long u, unsigned long d) { qr_t res; res.q = u/d; res.r = u - res.q*d; return res; } On

div_qr_1 interface

2013-10-16 Thread Niels Möller
Torbjörn reminded me of the work on more clever div_qr_1 we discussed a few years ago, which seemed promising but never got completed. At the time, I think one reason it stalled was the somewhat unwieldy divrem_1 primitive. The plan (see https://gmplib.org/devel/) is to replace divrem_1 by a