On 2016-10-11 14:42, Maxime Villard wrote:
Userland is pageable, so when mmap is called with one page, the kernel does not yet make the page officially available to the CPU. Rather, it waits for the page to fault, and at fault time it makes it valid for real. It means the kernel code path from the interrupt to the moment the page is entered needs to be fast.

All this to say that an optimization is possible in pmap_enter_ma on x86. In this function, new_pve and new_sparepve are always allocated but not always needed. It is done this way because preemption is disabled in the critical section, so the allocations obviously have to be performed earlier.

new_pve and new_sparepve are to be used in pmap_enter_pv. After adding atomic
counters there, a './build.sh tools' gives these numbers:

        PVE: used=36441394 unused=58955001
        SPAREPVE: used=1647254 unused=93749141

It means that 38088648 allocations were needed and performed, while 152704142 were performed but not used. In short, only 19% of the allocated buffers were needed. Actually, the real number may be even smaller than that, since I didn't take into account the case where there is no p->v tracking at all (in which case both
buffers are unused as well).

I have a patch which introduces two inline functions that can tell earlier whether these buffers are needed. One problem with this patch is that it makes the code harder to understand, even though I tried to explain clearly what it is doing. Another problem is that when both buffers are needed, my patch
introduces a little overhead (the cost of a few branches).

I don't know whether we care enough about things like that; if anyone here has
particular comments, feel free.

[1] https://nxr.netbsd.org/xref/src/sys/arch/x86/x86/pmap.c#4061
[2] http://m00nbsd.net/garbage/pmap/enter.diff

I would benchmark both (with and without the "overhead" introduced); a while back, when implementing PAE, I did not expect the paddr_t promotion from 32 to 64 bits to have that much of an impact on pmap performance, but the first attempt induced more than 5% overhead on a "cold" ./build.sh run.

Granted, you are not dealing with the same situation here, but pool caches make the allocation used/unused dance almost free (except for the slow path). When objects are in the pool cache but not yet obtained through the getter, they are still allocated but basically not used. It would be interesting to see whether the hit/miss ratio of the "pvpl" pool is affected by your optimization.

Jean-Yves Migeon
