On Tue, Jun 01, 2010 at 04:03:19PM +0300, Antti Kantee wrote:
> While reading the uvm page allocator code, I noticed it tries to allocate
> from percpu storage before falling back to global storage. However, even
> if allocation from local storage was possible, a global stats counter is
> incremented (e.g. "uvmexp.cpuhit++"). In my measurements I've observed
> this type of "cheap" statcounting has a huge impact on percpu algorithms,
> as you still need to load&store a globally contended memory address.
> Furthermore, uvmexp cache lines are probably more contended than the page
> queue, so theoretically you get less than half of the possible benefit.
>
> I don't expect anyone to remember what the benchmark used to justify
> the original percpu commit was, but if someone is going to work on it
> further, I'm curious as to how much gain the percpu allocator produced
> and how much more it would squeeze out if the global counter was left out.
>
> The above example of course applies more generally. When you're going
> all out with the bag of tricks, "i++" can be very expensive ...
I ran into the same issue with timecounters, and there it was a huge overhead: disabling the counter showed great performance improvements where the timecounter hardware was largely parallel and light on memory access (the CPU timestamp counter). With the UVM allocator it's less of an overhead, since the allocator is shielded by uvm_fpageqlock and often uvm_pageqlock (both globals), and the data structures are not intentionally organised/optimized with MP cache behaviour in mind.

This is an area that definitely needs improvement, and there's good potential there. Consider NUMA, or, as a middle ground, multiple sets of pagedaemon/allocator state corresponding to cache units (cores, chips, whatever) or execution units (threads, cores, whatever).

The per-CPU bit worked out to be a win on the build.sh benchmark. I implemented it to try to avoid cache writebacks to main memory for short-lived processes, due to activity within anonymous and COWed pages. It could be a win for some other mad reason, but I assume the witnessed speed-up is for the reason outlined.
