Mark wrote:
> Joe's response..

> > So it seems that there is an issue with the TLB in 32-bit xVM PV doms...
>
> The bug is probably in TLB flushing management in the Xen code itself.
> I know they've said in the past that they do all kinds of very crafty
> optimizations to avoid unnecessary invalidates in the hypervisor.
> I suspect they've got a bug.

The bug is in the Solaris 32-bit PAE xVM xm kernel code.  It
doesn't do a full TLB flush when one of the four PDPTR entries
changes.  Instead of a full TLB flush, Solaris tries to use INVLPG,
but Intel has documented that this doesn't work...

Intel has published an application note about the TLBs and their
invalidation:

    http://www.intel.com/products/processor/manuals/
    http://www.intel.com/design/processor/applnots/317080.pdf

And in that application note, the following is documented in section 8.1:
---------------------------------------------------
The processor does not maintain a PDP cache as described in Section 4.
The processor always caches information from the four page-directory-pointer-
table entries. These entries are not cached at the time of address translation.
Instead, they are always cached as part of the execution of the following
instructions:
o A MOV to CR3 that occurs with IA32_EFER.LMA = 0 and CR4.PAE = 1.
o A MOV to CR4 that results in CR4.PAE = 1, that occurs with IA32_EFER.LMA = 0
  and CR0.PG = 1, and that modifies at least one of CR4.PAE, CR4.PGE,
  and CR4.PSE.
o A MOV to CR0 that modifies CR0.PG and that occurs with IA32_EFER.LMA = 0 and
  CR4.PAE = 1.

These instructions fault if they would load a PDPTR that sets any of the bits
that must be 0 (see above). These cached entries are not modified by any other
operations. In particular, executions of INVLPG do not affect these cached
entries.
---------------------------------------------------
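
To make the quoted behavior concrete, here is a minimal toy model in C.
It is only a sketch of the rule described in section 8.1, not real
hardware or Solaris code; every name in it (toy_cpu_t, toy_mov_cr3,
toy_invlpg, toy_pdpte_for) is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NPDPTE 4

/* Hypothetical toy model of a PAE processor; not real hardware. */
typedef struct {
    uint64_t mem_pdpt[NPDPTE];     /* the PDPT as it sits in memory */
    uint64_t cached_pdpte[NPDPTE]; /* the processor's private copies */
} toy_cpu_t;

/* Models MOV to CR3 in PAE mode: all four PDPTEs are re-read from memory. */
static void toy_mov_cr3(toy_cpu_t *cpu)
{
    memcpy(cpu->cached_pdpte, cpu->mem_pdpt, sizeof (cpu->cached_pdpte));
}

/*
 * Models INVLPG: a real processor would drop TLB entries for vaddr,
 * but per section 8.1 the cached PDPTEs are left untouched.
 */
static void toy_invlpg(toy_cpu_t *cpu, uint32_t vaddr)
{
    (void) cpu;
    (void) vaddr;
}

/* PDPTE the model would use for vaddr; bits 31:30 select the entry. */
static uint64_t toy_pdpte_for(const toy_cpu_t *cpu, uint32_t vaddr)
{
    unsigned idx = vaddr >> 30;

    assert(idx < NPDPTE);
    return cpu->cached_pdpte[idx];
}
```

In this model, changing mem_pdpt[0] and then calling toy_invlpg() leaves
toy_pdpte_for() returning the old entry; only toy_mov_cr3() picks up the
new one.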


Solaris implements this:

    static void
    unlink_ptp(htable_t *higher, htable_t *old, uintptr_t vaddr)
    {
    ...
            /*
             * When a top level VLP page table entry changes, we must issue
             * a reload of cr3 on all processors.
             *
             * If we don't need do do that, then we still have to INVLPG against
             * an address covered by the inner page table, as the latest processors
             * have TLB-like caches for non-leaf page table entries.
             */
            if (!(hat->hat_flags & HAT_FREEING)) {
                    hat_tlb_inval(hat, (higher->ht_flags & HTABLE_VLP) ?
                        DEMAP_ALL_ADDR : old->ht_vaddr);
            }


and

    static void
    link_ptp(htable_t *higher, htable_t *new, uintptr_t vaddr)
    {
    ...
            /*
             * When any top level VLP page table entry changes, we must issue
             * a reload of cr3 on all processors using it.
             * We also need to do this for the kernel hat on PAE 32 bit kernel.
             */
            if (
    #ifdef __i386
                (higher->ht_hat == kas.a_hat && higher->ht_level == VLP_LEVEL) ||
    #endif
                (higher->ht_flags & HTABLE_VLP))
                    hat_tlb_inval(higher->ht_hat, DEMAP_ALL_ADDR);

That is, when we remove PDPTR entries in unlink_ptp there is a
hat_tlb_inval(DEMAP_ALL_ADDR) (flush TLB by CR3 reload), but only when
"higher->ht_flags & HTABLE_VLP" is true.
Under 32-bit PV xVM, the HTABLE_VLP flag isn't set, so the code does a
hat_tlb_inval(old->ht_vaddr), which results in an INVLPG.  According to
section 8.1 in Intel's document, the INVLPG does *not* affect the
cached PDPTR entries.


When PDPTR entries are added in link_ptp, there is a
hat_tlb_inval(DEMAP_ALL_ADDR).  Again, this is only done when the
HTABLE_VLP flag is set (it isn't under 32-bit PV xVM), or when
(higher->ht_hat == kas.a_hat && higher->ht_level == VLP_LEVEL).  In the
failing case the mapping belongs to a non-kernel hat, so neither
condition holds and the hat_tlb_inval() is skipped.


When I change the kernel to something like this, it seems to work
without getting lots of spurious page faults:

diff --git a/usr/src/uts/i86pc/vm/htable.c b/usr/src/uts/i86pc/vm/htable.c
--- a/usr/src/uts/i86pc/vm/htable.c
+++ b/usr/src/uts/i86pc/vm/htable.c
@@ -1073,8 +1079,11 @@ unlink_ptp(htable_t *higher, htable_t *o
         * have TLB-like caches for non-leaf page table entries.
         */
        if (!(hat->hat_flags & HAT_FREEING)) {
-               hat_tlb_inval(hat, (higher->ht_flags & HTABLE_VLP) ?
-                   DEMAP_ALL_ADDR : old->ht_vaddr);
+               hat_tlb_inval(hat, (higher->ht_flags & HTABLE_VLP)
+#ifdef __i386
+                   || (higher->ht_level == VLP_LEVEL)
+#endif
+                   ? DEMAP_ALL_ADDR : old->ht_vaddr);
        }

        HTABLE_DEC(higher->ht_valid_cnt);
@@ -1108,7 +1117,7 @@ link_ptp(htable_t *higher, htable_t *new
         */
        if (
 #ifdef __i386
-           (higher->ht_hat == kas.a_hat && higher->ht_level == VLP_LEVEL) ||
+           (higher->ht_level == VLP_LEVEL) ||
 #endif
            (higher->ht_flags & HTABLE_VLP))
                hat_tlb_inval(higher->ht_hat, DEMAP_ALL_ADDR);



On bare metal this isn't a problem because the HTABLE_VLP flag is set on
the L2 htable, so the TLB gets flushed by CR3 reloads.
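
The difference between the original and the patched unlink_ptp
condition can be sketched as two small C predicates.  This is only an
illustration of the logic described above: the toy_ names, the struct
and the constant values are assumptions made for the example, not the
real Solaris definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; the real values live in the Solaris headers. */
#define TOY_HTABLE_VLP 0x1
#define TOY_VLP_LEVEL  2

typedef struct {
    unsigned ht_flags;
    int ht_level;
} toy_htable_t;

/* Original condition: full flush (CR3 reload) only if HTABLE_VLP is set. */
static bool toy_full_flush_orig(const toy_htable_t *higher)
{
    return (higher->ht_flags & TOY_HTABLE_VLP) != 0;
}

/*
 * Patched 32-bit condition: also do a full flush whenever the higher
 * table is the top level, whether or not HTABLE_VLP is set.
 */
static bool toy_full_flush_patched(const toy_htable_t *higher)
{
    assert(higher->ht_level >= 0);
    return (higher->ht_flags & TOY_HTABLE_VLP) != 0 ||
        higher->ht_level == TOY_VLP_LEVEL;
}
```

With a top-level table whose flags are 0 (the 32-bit PV xVM case), the
original predicate is false, so only an INVLPG is issued, while the
patched predicate is true.  With TOY_HTABLE_VLP set (the bare metal
case) both are true.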
 
 
This message posted from opensolaris.org
_______________________________________________
xen-discuss mailing list