addmul_k with toom (was: Re: GMP's x86-32 performance)

2017-06-17 Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes:
> ni...@lysator.liu.se (Niels Möller) writes:
> we might also try doing addmul_2 using toom32, which
> would save 1/3 of the mul instructions. Toom32 is nice because we can
> use the four easiest evaluation points: 0, infinity, and +/-1.
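The "1/3 of the mul instructions" figure in the quoted remark can be checked with a small sketch. It assumes that toom32 here means splitting one operand into three pieces and the other into two, so the product polynomial has four coefficients and needs four evaluation points. The function names `toom32_coeffs` and `schoolbook_coeffs` are hypothetical and are not GMP functions; the pieces are plain Python integers rather than limbs, and carry handling and the accumulate step of addmul_2 are left out.

```python
def schoolbook_coeffs(a, b):
    """Coefficients of (a0 + a1*x + a2*x^2) * (b0 + b1*x): 6 multiplications."""
    c = [0, 0, 0, 0]
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c


def toom32_coeffs(a, b):
    """Same coefficients from 4 multiplications, one per evaluation point.

    Illustrative sketch only (assumed 3-by-2 split), not GMP's code.
    """
    a0, a1, a2 = a
    b0, b1 = b

    # Pointwise products at x = 0, infinity, +1 and -1.
    v0 = a0 * b0                          # C(0)   = c0
    vinf = a2 * b1                        # C(inf) = c3
    v1 = (a0 + a1 + a2) * (b0 + b1)       # C(1)   = c0 + c1 + c2 + c3
    vm1 = (a0 - a1 + a2) * (b0 - b1)      # C(-1)  = c0 - c1 + c2 - c3

    # Interpolation: both sums below are even, so the halving is exact.
    c0 = v0
    c3 = vinf
    c2 = (v1 + vm1) // 2 - c0
    c1 = (v1 - vm1) // 2 - c3
    return [c0, c1, c2, c3]


if __name__ == "__main__":
    a, b = (1, 2, 3), (4, 5)
    print(toom32_coeffs(a, b))
    print(schoolbook_coeffs(a, b))
```

Under these assumptions the schoolbook product uses six multiplications and the evaluation/interpolation form uses four, which matches the one-third saving mentioned in the message. The extra additions, subtractions and the halving are the price, and whether that trade pays off at limb level is exactly the question the thread raises.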

Re: GMP's x86-32 performance

2017-06-17 Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:
> Our latest batch of x86-32 code dates from 2011 (for the original Intel
> atom) but we have not done anything for high-end AMD and Intel CPUs
> (e.g., AMD k10, bulldozer, piledriver, steamroller, excavator, zen, or
> Intel penryn, nehalem,

Re: GMP's x86-32 performance

2017-06-17 Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes:
> Our latest batch of x86-32 code dates from 2011 (for the original Intel
> atom) but we have not done anything for high-end AMD and Intel CPUs
> (e.g., AMD k10, bulldozer, piledriver, steamroller, excavator, zen, or
> Intel penryn, nehalem,

GMP's x86-32 performance

2017-06-17 Torbjörn Granlund
The new measurement reporting pages have highlighted many improvement opportunities, and, as you might have seen, I've lately fixed a handful of the _basecase functions for x86-64. An aspect not directly covered by the new measurement reporting is that the 32-bit and 64-bit performance-per-limb is