On Sun, Jul 31, 2016 at 11:11:25PM +1000, Bruce Evans wrote:
> Misalignment of this loop made it almost twice as slow on old Turion2 with
> slow DDR2 memory. It made no difference on Haswell. I added an extra
> movnti, but that makes little or no differences. 2 more movnti's wouldn't
> fit in a 16-byte cache line so are slower unless even more care is taken
> with alignment (or with less care, 4 with misalignment are not less than
> twice as slow as 1 with alignment).
>
> I thought that alignment and unrolling didn't matter here, because movnti
> has to wait for memory and almost any loop runs fast enough to keep up.
> The timing on my old system is something like: CPUs at 2 GHz; main memory
> at 4 GB/sec; movnti is only 4 bytes wide on i386 (so this problem
> only affects i386, at least with slow memory). So sustaining 4 GB/sec
> requires 1 G movnti's/sec, so the loop needs to run at 2 cycles/iteration
> to keep up. But when it is misaligned, it runs at 3-4 cycles/iteration.
> Alignment makes it take about 2, and the extra movnti is for safety and
> to work with faster memory.
>
> On Haswell with CPUs at 4 GHz, 2 cycles/iteration gives 8 GB/sec on
> i386 and 16 GB/sec on amd64 with wider movnti. IIRC, 16 GB/sec is about
> the main memory speed so nothing better is possible but just 1 extra
> movnti gives more with faster memory. This is just worse than bzero()
What about a modern system with 120 GB/sec main memory speed?
_______________________________________________
svn-src-all@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"