Mark wrote:
> Joe's response..
> > So it seems that there is an issue with the TLB in 32-bit xVM PV doms...
>
> The bug is probably in TLB flushing management in the Xen code itself.
> I know they've said in the past that they do all kinds of very crafty
> optimizations to avoid unnecessary invalidates in the hypervisor.
> I suspect they've got a bug.
The bug is in the Solaris 32-bit PAE xVM xm kernel code. It
doesn't do a full TLB flush when one of the four PDPTR entries
changes. Instead of a full TLB flush, Solaris tries to use INVLPG,
but Intel has documented that this doesn't work.
Intel has published an application note about the TLBs and their
invalidation:
http://www.intel.com/products/processor/manuals/
http://www.intel.com/design/processor/applnots/317080.pdf
And in that application note, the following is documented in section 8.1:
---------------------------------------------------
The processor does not maintain a PDP cache as described in Section 4.
The processor always caches information from the four page-directory-pointer-
table entries. These entries are not cached at the time of address translation.
Instead, they are always cached as part of the execution of the following
instructions:
o A MOV to CR3 that occurs with IA32_EFER.LMA = 0 and CR4.PAE = 1.
o A MOV to CR4 that results in CR4.PAE = 1, that occurs with IA32_EFER.LMA = 0
and CR0.PG = 1, and that modifies at least one of CR4.PAE, CR4.PGE,
and CR4.PSE.
o A MOV to CR0 that modifies CR0.PG and that occurs with IA32_EFER.LMA = 0 and
CR4.PAE = 1.
These instructions fault if they would load a PDPTR that sets any of the bits
that must be 0 (see above). These cached entries are not modified by any other
operations. In particular, executions of INVLPG do not affect these cached
entries.
---------------------------------------------------
Solaris implements this:
static void
unlink_ptp(htable_t *higher, htable_t *old, uintptr_t vaddr)
{
...
    /*
     * When a top level VLP page table entry changes, we must issue
     * a reload of cr3 on all processors.
     *
     * If we don't need do do that, then we still have to INVLPG against
     * an address covered by the inner page table, as the latest processors
     * have TLB-like caches for non-leaf page table entries.
     */
    if (!(hat->hat_flags & HAT_FREEING)) {
        hat_tlb_inval(hat, (higher->ht_flags & HTABLE_VLP) ?
            DEMAP_ALL_ADDR : old->ht_vaddr);
    }
and
static void
link_ptp(htable_t *higher, htable_t *new, uintptr_t vaddr)
{
...
    /*
     * When any top level VLP page table entry changes, we must issue
     * a reload of cr3 on all processors using it.
     * We also need to do this for the kernel hat on PAE 32 bit kernel.
     */
    if (
#ifdef __i386
        (higher->ht_hat == kas.a_hat && higher->ht_level == VLP_LEVEL) ||
#endif
        (higher->ht_flags & HTABLE_VLP))
        hat_tlb_inval(higher->ht_hat, DEMAP_ALL_ADDR);
That is, when PDPTR entries are removed in unlink_ptp, there is a
hat_tlb_inval(DEMAP_ALL_ADDR) (a TLB flush by CR3 reload), but only when
"higher->ht_flags & HTABLE_VLP" is true.
Under 32-bit PV xVM, the HTABLE_VLP flag isn't set, so the code does a
hat_tlb_inval(old->ht_vaddr), which results in an INVLPG. According to
section 8.1 in Intel's document, the INVLPG does *not* affect the
cached PDPTR entries.
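To make the section 8.1 behaviour concrete, here is a toy user-space model
in C. It is only a sketch: the toy_* names are made up for this illustration
and are not Solaris or Xen interfaces, and real hardware keeps far more state
than two arrays. It shows that a PDPTE written to memory only becomes visible
to the modelled CPU after a CR3 reload, not after an INVLPG.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NPDPTE 4    /* PAE mode has four PDPT entries */

/* Toy model of one CPU; hypothetical, not Solaris or Xen code. */
typedef struct {
    uint64_t mem_pdpte[TOY_NPDPTE];     /* the PDPT as it sits in memory */
    uint64_t cached_pdpte[TOY_NPDPTE];  /* copies latched inside the CPU */
} toy_cpu_t;

/* Models a MOV to CR3 in PAE mode: all four PDPTEs are re-read. */
static void
toy_reload_cr3(toy_cpu_t *cpu)
{
    for (int i = 0; i < TOY_NPDPTE; i++)
        cpu->cached_pdpte[i] = cpu->mem_pdpte[i];
}

/* Models INVLPG: it leaves the cached PDPTEs untouched. */
static void
toy_invlpg(toy_cpu_t *cpu, uint32_t va)
{
    (void) cpu;
    (void) va;
}

/* The OS updates a PDPTE in memory; the CPU copy is not refreshed. */
static void
toy_write_pdpte(toy_cpu_t *cpu, unsigned idx, uint64_t val)
{
    assert(idx < TOY_NPDPTE);
    cpu->mem_pdpte[idx] = val;
}

/* PDPTE the CPU would use for va; bits 31:30 of va select the entry. */
static uint64_t
toy_pdpte_used(const toy_cpu_t *cpu, uint32_t va)
{
    return (cpu->cached_pdpte[va >> 30]);
}
```

In this model, after toy_write_pdpte() on entry 1 followed by toy_invlpg()
for address 0x40000000, toy_pdpte_used() still returns the old value. Only
after toy_reload_cr3() does it return the new one.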
When PDPTR entries are added in link_ptp, there is a
hat_tlb_inval(DEMAP_ALL_ADDR). Again, this is only done when the
HTABLE_VLP flag is set, which it isn't under 32-bit PV xVM, or when
(higher->ht_hat == kas.a_hat && higher->ht_level == VLP_LEVEL). Since
this is a non-kernel hat mapping, the hat_tlb_inval() is skipped.
When I change the kernel to something like this, it seems to work
without getting lots of spurious page faults:
diff --git a/usr/src/uts/i86pc/vm/htable.c b/usr/src/uts/i86pc/vm/htable.c
--- a/usr/src/uts/i86pc/vm/htable.c
+++ b/usr/src/uts/i86pc/vm/htable.c
@@ -1073,8 +1079,11 @@ unlink_ptp(htable_t *higher, htable_t *o
      * have TLB-like caches for non-leaf page table entries.
      */
     if (!(hat->hat_flags & HAT_FREEING)) {
-        hat_tlb_inval(hat, (higher->ht_flags & HTABLE_VLP) ?
-            DEMAP_ALL_ADDR : old->ht_vaddr);
+        hat_tlb_inval(hat, (higher->ht_flags & HTABLE_VLP)
+#ifdef __i386
+            || (higher->ht_level == VLP_LEVEL)
+#endif
+            ? DEMAP_ALL_ADDR : old->ht_vaddr);
     }
     HTABLE_DEC(higher->ht_valid_cnt);
@@ -1108,7 +1117,7 @@ link_ptp(htable_t *higher, htable_t *new
      */
     if (
 #ifdef __i386
-        (higher->ht_hat == kas.a_hat && higher->ht_level == VLP_LEVEL) ||
+        (higher->ht_level == VLP_LEVEL) ||
 #endif
         (higher->ht_flags & HTABLE_VLP))
         hat_tlb_inval(higher->ht_hat, DEMAP_ALL_ADDR);
On bare metal this isn't a problem because the HTABLE_VLP flag is set on
the L2 htable, so the TLB gets flushed by CR3 reloads.
This message posted from opensolaris.org
_______________________________________________
xen-discuss mailing list