Re: div_qr_1 interface

2013-10-21 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Will try that. I think one could also try to delay the quotient store one iteration, keeping Q1 in a register until the next iteration. Then one gets rid of the adc Q2,8(QP, UN, 8) in the loop, using only a single store per
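
The delayed-store idea in this message can be modelled in C. This is only a rough sketch, not the GMP loop itself: the function names and the arrays `q1`/`q2` are hypothetical stand-ins for the per-iteration values Q1 and Q2, and it assumes that each carry Q2 is 0 or 1 and affects only the limb produced by the previous iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: iteration i (running from the high limb down) yields
   a quotient limb q1[i] and a carry q2[i] that belongs to the limb above. */

/* Direct variant: store q1[i], then add the carry into the limb stored by
   the previous iteration (a read-modify-write on memory). */
uint64_t assemble_direct(uint64_t *qp, const uint64_t *q1,
                         const uint64_t *q2, size_t n)
{
    assert(n > 0);
    for (size_t i = n; i-- > 0;) {
        qp[i] = q1[i];
        if (i + 1 < n)
            qp[i + 1] += q2[i];   /* op-to-memory, like adc Q2, 8(QP,UN,8) */
    }
    return q2[n - 1];             /* carry out of the top limb */
}

/* Delayed variant: keep the previous q1 in a local ("register"), add the
   carry from the current iteration, and only then store it. */
uint64_t assemble_delayed(uint64_t *qp, const uint64_t *q1,
                          const uint64_t *q2, size_t n)
{
    assert(n > 0);
    uint64_t pending = q1[n - 1];
    for (size_t i = n - 1; i-- > 0;) {
        qp[i + 1] = pending + q2[i];  /* single plain store per iteration */
        pending = q1[i];
    }
    qp[0] = pending;                  /* flush the last pending limb */
    return q2[n - 1];
}
```

Both functions write the same limbs; the delayed one trades the memory read-modify-write for one extra live register, which is the register-pressure cost discussed later in the thread.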

Re: div_qr_1 interface

2013-10-21 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes: On Intel chips, op-to-mem is expensive. Even op-from-memory is often slower than load+op. (I understand the register shortage problem.) The following (untested) variant needs one register too many. UP, QP, UN: Load, store, loop counter.

Re: div_qr_1 interface

2013-10-21 Thread Torbjorn Granlund
I looked at the logic following this: sbb U2, U2 C 7 13 You negate the U2 copy in Q2. It seems that replacing three adc by sbb could avoid the neg. It might also be possible to replace the early loop and stuff by cmov. Note that the carry flag survives dec, although that causes a
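
The arithmetic behind the neg-avoidance remark can be checked with a small C model. This is a sketch of the value identity only, with hypothetical helper names; it does not model the carry flag, so it says nothing about how the replacement behaves inside an adc/sbb carry chain.

```c
#include <assert.h>
#include <stdint.h>

/* Models "sbb r, r": r - r - CF, which is 0 when CF is clear and
   all ones (that is, -1) when CF is set. */
uint64_t sbb_self(unsigned carry)
{
    assert(carry <= 1);
    return (uint64_t)0 - carry;
}

/* Models "neg r": two's complement negation, turning the mask 0/-1
   into the bit 0/1. */
uint64_t neg(uint64_t x)
{
    return (uint64_t)0 - x;
}

/* Adding the negated copy ... */
uint64_t add_negated(uint64_t a, uint64_t mask)
{
    return a + neg(mask);
}

/* ... gives the same value as subtracting the mask directly. */
uint64_t sub_mask(uint64_t a, uint64_t mask)
{
    return a - mask;
}
```

For any `a`, `add_negated(a, m)` and `sub_mask(a, m)` return the same 64-bit value. The flags are a different matter: an add of 1 carries only when `a` is all ones, whereas a subtract of -1 borrows for every other `a`, so a use that feeds the carry into a following instruction cannot be swapped blindly.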

Re: div_qr_1 interface

2013-10-21 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes:
> I looked at the logic following this: sbb U2, U2 C 7 13 You negate the U2 copy in Q2. It seems that three adc by sbb could avoid the neg.
The problem is the final use, where Q2 is added, with carry, to a different register.