John Levon wrote:
> On Mon, Apr 21, 2008 at 05:52:02AM -0700, Jürgen Keil wrote:
>
> > Joe Bonasera's blog might contain an explanation what is
> > happening, in the section "Spurious Page Faults":
> >
> > http://blogs.sun.com/JoeBonasera/entry/i_ve_got_spur_ious
>
> This doesn't affect us any more. This type of writable page table was
> removed, since it provided no performance benefit.
Ok...
Looking at the dtrace output for the 32-bit PV copy-on-write
test program, I see that x86pte_inval() does an INVLPG
through the hypervisor (MMUEXT_INVLPG_LOCAL)
when it removes a page mapping. Like this (this removes
the read-only COW stack page):
1 -> x86pte_inval
1 | x86pte_inval:entry entry 47, expect 1479b025
1 -> x86pte_access_pagetable
1 -> x86pte_mapin
1 -> pa_to_ma
1 -> pfn_to_mfn
1 <- pfn_to_mfn returns 12473
1 <- pa_to_ma returns 12473000
1 -> xen_map
1 -> HYPERVISOR_update_va_mapping
1 | HYPERVISOR_update_va_mapping:entry va cda02000, new_pte 8000000012473001, flags 2
1 <- HYPERVISOR_update_va_mapping returns 0
1 <- xen_map returns 0
1 <- x86pte_mapin returns cda02238
1 <- x86pte_access_pagetable returns cda02238
1 -> get_pte64
1 <- get_pte64 returns 1479b025
1 -> htable_e2va
1 <- htable_e2va returns 8047000
1 -> hat_tlb_inval
1 -> xen_flush_va
1 -> HYPERVISOR_mmuext_op
1 | HYPERVISOR_mmuext_op:entry req[0/1]: cmd 7, addr 8047000
1 <- HYPERVISOR_mmuext_op returns 0
1 <- xen_flush_va returns 0
1 <- hat_tlb_inval returns 1
1 -> x86pte_release_pagetable
1 -> x86pte_mapout
1 -> HYPERVISOR_update_va_mapping
1 | HYPERVISOR_update_va_mapping:entry va cda02000, new_pte 0, flags 2
1 <- HYPERVISOR_update_va_mapping returns 0
1 <- x86pte_mapout returns cf9df800
1 <- x86pte_release_pagetable returns cf9df800
1 <- x86pte_inval returns 1479b025
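The numbers in this trace fit together with simple page arithmetic. Here is a minimal sketch of that arithmetic, assuming the dtrace script prints everything in hex (so "entry 47" is 0x47) and that PAE-style 8-byte PTEs are in use. The helper names are made up for illustration and are not the kernel's functions, and the table base 0x8000000 is inferred from the htable_e2va result, not shown in the trace.

```c
#include <stdint.h>

#define PAGE_SHIFT 12u                 /* 4 KiB pages */
#define PTE_SIZE   8u                  /* 64-bit (PAE) page table entries */
#define PT_VALID   0x1ULL              /* x86 present bit */
#define PT_NX      (1ULL << 63)        /* x86 no-execute bit */

/* Machine frame number -> machine address (what pa_to_ma appears to return). */
static uint64_t mfn_to_ma(uint64_t mfn)
{
    return mfn << PAGE_SHIFT;
}

/* PTE used to map a page table page read-only into a kernel window. */
static uint64_t window_pte(uint64_t ma)
{
    return PT_NX | ma | PT_VALID;
}

/* Address of PTE number `entry` inside a page table mapped at `window_va`. */
static uint32_t pte_va(uint32_t window_va, uint32_t entry)
{
    return window_va + entry * PTE_SIZE;
}

/* Virtual address mapped by PTE number `entry` of a bottom-level table
 * whose first entry maps `base_va` (assumed base, see lead-in). */
static uint32_t entry_to_va(uint32_t base_va, uint32_t entry)
{
    return base_va + (entry << PAGE_SHIFT);
}
```

With the values from the trace: mfn 12473 gives machine address 12473000, the window PTE becomes 8000000012473001, entry 0x47 in the window at cda02000 sits at cda02238, and the same entry maps virtual address 8047000.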
Code in uts/i86pc/vm/htable.c function x86pte_inval() is this:
2222         /*
2223          * Note that the loop is needed to handle changes due to h/w updating
2224          * of PT_MOD/PT_REF.
2225          */
2226         do {
2227                 oldpte = GET_PTE(ptep);
2228                 if (expect != 0 && (oldpte & PT_PADDR) != (expect & PT_PADDR))
2229                         goto done;
2230                 XPV_ALLOW_PAGETABLE_UPDATES();
2231                 found = CAS_PTE(ptep, oldpte, 0);
2232                 XPV_DISALLOW_PAGETABLE_UPDATES();
2233         } while (found != oldpte);
2234         if (oldpte & (PT_REF | PT_MOD))
2235                 hat_tlb_inval(ht->ht_hat, htable_e2va(ht, entry));
The invalidated PTE had been accessed (the value returned by get_pte64
has the 0x20 bit, PT_REF, set), so hat_tlb_inval() on line 2235 is called,
which invalidates the TLB entry for that stack page.
Ok so far.
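To double-check the reading of that PTE value, here is a small sketch that decodes the low flag bits of 1479b025. The bit values are the standard x86 page table bits; the two helper functions are hypothetical names used only for this illustration, not functions from htable.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard x86 PTE flag bits. */
#define PT_VALID    0x001ULL   /* present */
#define PT_WRITABLE 0x002ULL   /* read/write */
#define PT_USER     0x004ULL   /* user accessible */
#define PT_REF      0x020ULL   /* accessed, set by the MMU */
#define PT_MOD      0x040ULL   /* dirty, set by the MMU */

/* Mirrors the test on line 2234: a demap is only done when the
 * hardware has marked the old PTE as referenced or modified. */
static bool inval_needs_demap(uint64_t oldpte)
{
    return (oldpte & (PT_REF | PT_MOD)) != 0;
}

/* True when the PTE allows writes. */
static bool pte_is_writable(uint64_t pte)
{
    return (pte & PT_WRITABLE) != 0;
}
```

For 1479b025 the low byte is 0x25 = PT_VALID | PT_USER | PT_REF, so the page is present, user accessible, referenced, not writable and not modified, which matches a read-only COW page that has been touched.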
Why doesn't x86pte_set() use INVLPG when it installs a
new PTE entry? The dtrace for my fork test case contains
this (this one installs the writable page after we got the COW
fault):
1 -> x86pte_set
1 | x86pte_set:entry entry 47, new bc39a007
1 -> htable_e2va
1 <- htable_e2va returns 8047000
1 -> x86pte_access_pagetable
1 -> x86pte_mapin
1 -> pa_to_ma
1 -> pfn_to_mfn
1 <- pfn_to_mfn returns 12473
1 <- pa_to_ma returns 12473000
1 -> xen_map
1 -> HYPERVISOR_update_va_mapping
1 | HYPERVISOR_update_va_mapping:entry va cda02000, new_pte 8000000012473001, flags 2
1 <- HYPERVISOR_update_va_mapping returns 0
1 <- xen_map returns 0
1 <- x86pte_mapin returns cda02238
1 <- x86pte_access_pagetable returns cda02238
1 -> get_pte64
1 <- get_pte64 returns 0
1 -> x86pte_release_pagetable
1 -> x86pte_mapout
1 -> HYPERVISOR_update_va_mapping
1 | HYPERVISOR_update_va_mapping:entry va cda02000, new_pte 0, flags 2
1 <- HYPERVISOR_update_va_mapping returns 0
1 <- x86pte_mapout returns cf9df800
1 <- x86pte_release_pagetable returns cf9df800
1 <- x86pte_set returns 0
The hypervisor is told to invalidate the page that contains the
PTE (via HYPERVISOR_update_va_mapping, va cda02000 flags 2),
but the CPU / MMU isn't told that the mapping for the virtual stack address
8047000 has changed. Isn't it possible that the CPU / MMU / TLB has
cached the information "virtual stack address 8047000 is not a valid address"
after the call to x86pte_inval()?
htable.c x86pte_set() does a TLB flush when the old PTE
referred to a referenced page, but it doesn't update the TLB when
an empty PTE was replaced with a new translation:
2090         /*
2091          * Do a TLB demap if needed, ie. the old pte was valid.
2092          *
2093          * Note that a stale TLB writeback to the PTE here either can't happen
2094          * or doesn't matter. The PFN can only change for NOSYNC|NOCONSIST
2095          * mappings, but they were created with REF and MOD already set, so
2096          * no stale writeback will happen.
2097          *
2098          * Segmap is the only place where remaps happen on the same pfn and for
2099          * that we want to preserve the stale REF/MOD bits.
2100          */
2101         if (old & PT_REF)
2102                 hat_tlb_inval(hat, addr);
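Put together, the COW sequence from the two traces can be modelled like this. It is only a single-threaded toy model of the flush decisions quoted above, with no CAS loop, no hypervisor and no real TLB; the names model_pte_inval, model_pte_set and tlb_action are invented for the sketch.

```c
#include <stdint.h>

#define PT_REF 0x020ULL   /* accessed bit */
#define PT_MOD 0x040ULL   /* dirty bit */

/* What the model would ask the TLB to do after a PTE update. */
enum tlb_action { TLB_NONE, TLB_INVLPG };

/* Clear the PTE; demap only if the old entry was referenced or modified
 * (the test on line 2234 of x86pte_inval). Returns the old PTE. */
static uint64_t model_pte_inval(uint64_t *ptep, enum tlb_action *act)
{
    uint64_t old = *ptep;

    *ptep = 0;
    *act = (old & (PT_REF | PT_MOD)) ? TLB_INVLPG : TLB_NONE;
    return old;
}

/* Install a new PTE; demap only if the old entry was referenced
 * (the test on line 2101 of x86pte_set). Returns the old PTE. */
static uint64_t model_pte_set(uint64_t *ptep, uint64_t new_pte,
    enum tlb_action *act)
{
    uint64_t old = *ptep;

    *ptep = new_pte;
    *act = (old & PT_REF) ? TLB_INVLPG : TLB_NONE;
    return old;
}
```

Starting from 1479b025, the inval step yields TLB_INVLPG and leaves the entry at 0; the following set of bc39a007 then sees old == 0 and yields TLB_NONE, which is exactly the case I'm asking about.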
Btw. I've been experimenting with this change to x86pte_set()
(lines 2103 ... 2111 added):
2090         /*
2091          * Do a TLB demap if needed, ie. the old pte was valid.
2092          *
2093          * Note that a stale TLB writeback to the PTE here either can't happen
2094          * or doesn't matter. The PFN can only change for NOSYNC|NOCONSIST
2095          * mappings, but they were created with REF and MOD already set, so
2096          * no stale writeback will happen.
2097          *
2098          * Segmap is the only place where remaps happen on the same pfn and for
2099          * that we want to preserve the stale REF/MOD bits.
2100          */
2101         if (old & PT_REF)
2102                 hat_tlb_inval(hat, addr);
2103 #if defined(__i386) && defined(__xpv)
2104         /* jk: ugly hack / experiment with PV spurious page faults */
2105         else if (old == 0 && addr < 0x8048000 && xpv_page_fault_hack) {
2106                 if (xpv_page_fault_hack == 1)
2107                         xen_flush_tlb();
2108                 else
2109                         xen_flush_va((caddr_t)addr);
2110         }
2111 #endif
With xpv_page_fault_hack := 0 I get the original code.
With xpv_page_fault_hack := 2 I try to do an INVLPG on the
newly installed translation. But that hasn't fixed the issue...
But with xpv_page_fault_hack := 1 the entire TLB gets flushed
when installing new stack pages, and now:
1. the libMicro-0.4.0 fork_100 test runs ~ 30x faster in a 32-bit PV domU !!
   800 seconds -> 28 seconds
2. ./boot/solaris/bin/create_ramdisk runs ~ 4x faster in a 32-bit PV domU !
   2 minutes -> 36 seconds
So it seems that there is an issue with the TLB in 32-bit xVM PV doms...
This message posted from opensolaris.org
_______________________________________________
xen-discuss mailing list