t...@gmplib.org (Torbjörn Granlund) writes: > ni...@lysator.liu.se (Niels Möller) writes:
> we might also try doing addmul_2 using toom32, which > would save 1/3 of the mul instructions. Toom32 is nice because we can > use the four easiest evaluation points: 0, infinity, and +/-1. > > Give it a shot! A concern with any toom stuff is side-channel leakage. > We currently assume _basecase is non-leaky. So any low-level trickery > needs to be applied above the _basecase cutoff points (unless we split > out _basecase_sec functions). If we do this with toom32, we'd need a separate basecase_sec. I think there's little chance to gain any speedup if we enforce side-channel silence. Looks better for toom2, there I think it costs only a single instruction in the innerloop to handle all signs in a side-channel silent fashion. If we don't trust cmov to be side-channel silent, we'd also need to do conditional negation using a mask and xor + add instead of cmov, unclear what that means for performance. > Or addmul_3 using toom32, which has the additional advantage that more > of the evaluation work is loop-invariant, and we could also jump to > separate innerloops depending on the carry bits from evaluation. I've looked a bit more at this, writing some pseudo C code for the simplest case where both evaluated values v0 + v1 + v2 and v0 - v1 + v2 are even, non-negative, and fit in a single limb after division by two. One iteration of addmul_3 would then need roughly 5 alu instructions for evaluation, 4 multiplies, 8 instructions to combine products into odd and even terms (u0 v0 + u1 v1 + u0 v2) and (u0 v1 + u1 v0 + u1 v2), and another 14 for the largish final additions, including the three carry words from the previous iteration. If we divide by 6 (since this corresponds to the work of 6 iterations of addmul_1), that's 2/3 multiply instructions and 4.5 alu instructions. A sketch for addmul_2 using toom2 (and evaluating at the -1 point) gives 3 instructions for evaluation, 3 multiplies, 13 instructions for interpolation and addition. Divide by 4, to get 3/4 multiply and 4 alu instructions to do the same amount of work as a single addmul_1 iteration. To compare to x86_64/aorsmul_1.asm, which uses one multiply and 3 adc instructions per iteration, and at best runs at 2.5 cycles per limb, and x86_64/coreibwl/addmul_1.asm, which uses an iteration consisting of mulx, adox, adcx, and runs at about 1.66 cycles per limb. > Slow hardware multiplication helps. :-) At the list https://gmplib.org/devel/asm.html, we have the "Intel noc" (what's that?) with current addmul_1 running at 15 cycles per limb, and Intel Atom where it runs at 19 cycles / limb. Writing up an addmul_2 for intel atom using toom2 is probably the most promising step in this direction. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel