On x86, if no direct map is implemented, pmap uses a set of four tmp VAs. When the kernel needs to read or write a physical page without caring about its associated VA, it enters the PA in question into one of these VAs and accesses the page through that mapping.
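To make the pattern concrete, here is a small userland analogy (not the actual pmap code - the slot array and function names are made up): "physical memory" is simulated by a byte array, and a fixed set of window slots plays the role of the reserved VAs.

```c
#include <stddef.h>
#include <string.h>

/*
 * Illustrative analogy of the tmp-VA window pattern. Real pmap code
 * would write a PTE and invalidate the TLB entry; here, "entering a
 * PA into a VA" is just storing a pointer into a window slot.
 */
#define PAGE_SIZE 4096
#define NWINDOWS  4
#define NPAGES    8

static unsigned char physmem[NPAGES * PAGE_SIZE]; /* fake physical memory */
static unsigned char *window[NWINDOWS];           /* the four "tmp VAs" */

/* Analogue of entering a PA into one of the reserved VAs. */
static void
window_enter(int slot, size_t pa)
{
	window[slot] = &physmem[pa];
}

/* Zero a physical page through a window, without knowing its real VA. */
static void
zero_phys_page(size_t pa)
{
	window_enter(0, pa);
	memset(window[0], 0, PAGE_SIZE);
	window[0] = NULL; /* analogue of removing the mapping */
}
```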
These tmp VAs are allocated at boot time in a global array, together with their PTEs. Each CPU then accesses this array with its cpu_number. There are two major issues with this array.

First, we don't know the number of CPUs available yet, so we are forced to allocate maxcpus entries; on i386, the array takes 2MB of memory, while it is unlikely that the machine has 32 CPUs.

Second, there is an issue with the alignment of the PTEs: each CPU has 8 entries in this array, so each set of PTEs is 64 bytes long (8 x 8); but the base of the array is not 64-byte aligned, so nothing guarantees we end up with the expected separation. We care about the alignment because these PTEs are used quite often in the pmap routines, and if they are not 64-byte aligned, one CPU can thrash the cache line of another CPU, which reduces performance (false sharing).

I have this patch [1], inspired a little by sparc. It switches to per-CPU VAs, embedded in cpu_info. At boot time, we manually allocate the VAs in pmap_bootstrap for cpu0; later, for the secondary CPUs, we call pmap_vpage_cpu_init from cpu_attach. At run time, each CPU takes its VAs from curcpu with preemption disabled.

This fixes the two aforementioned issues: the VAs are allocated proportionally to the number of CPUs attached, so we don't uselessly lose memory; and there is no false sharing, since the cpu_info structures are already cache-line-aligned (and so is their content).

I just wanted to know if someone had suggestions or whatever before I commit it.

Maxime

[1] http://m00nbsd.net/garbage/vpage/vpage.diff
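P.S. To make the alignment point concrete, here is a small sketch of the arithmetic. It assumes 8-byte PTEs and 64-byte cache lines (typical for x86); the function name is made up for illustration. With a 64-byte-aligned base, every per-CPU set of 8 PTEs occupies exactly one cache line; with an unaligned base, every set straddles two lines, so adjacent CPUs share a line.

```c
#include <stdint.h>

/*
 * With 8-byte PTEs and 8 entries per CPU, each per-CPU set is 64
 * bytes - one cache line if, and only if, the array base is 64-byte
 * aligned. Otherwise each set spans two lines, and the second half
 * of CPU N's set shares a line with the first half of CPU N+1's.
 */
#define PTE_SIZE	8
#define PTES_PER_CPU	8
#define CACHE_LINE	64

/* Number of cache lines touched by the PTE set of a given CPU. */
static int
lines_touched(uintptr_t base, int cpu)
{
	uintptr_t start = base + (uintptr_t)cpu * PTES_PER_CPU * PTE_SIZE;
	uintptr_t end = start + PTES_PER_CPU * PTE_SIZE - 1;

	return (int)(end / CACHE_LINE - start / CACHE_LINE) + 1;
}
```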
