Re: [patch 09/10] KVM: MMU: out of sync shadow core v2

2008-09-23 Thread Avi Kivity

Marcelo Tosatti wrote:
  
  
I don't understand how the variables sp, child, and parent interact. You  
either need recursion or an explicit stack?



It restarts at the parent level whenever it finishes a child's entries:

+   if (i == PT64_ENT_PER_PAGE) {
+   sp->unsync_children = 0;
+   sp = parent;
+   }

Not efficient, though.

  


Oh okay.  'parent' is never assigned to.  Lack of concentration.


Yes. The next element saved by for_each_entry_safe could have been zapped.

  
  

Ouch. Ouch.

I hate doing this. Can see no alternative though.



Me neither.

  


Well.  But I don't see kvm_mmu_zap_page()'s return value used anywhere.

Actually, I think I see an alternative:  set the invalid flag on these 
pages and queue them in a list, like we do for roots in use.  Flush the 
list on some cleanup path.
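
A minimal sketch of that alternative (not from the patch): mark the page
invalid, unlink it, and move it to a hypothetical per-VM
deferred_free_pages list that a cleanup path flushes later. The list and
the two helper names below are assumptions for illustration; role.invalid,
sp->link and the unlink/free/flush helpers already exist in mmu.c.

	/* defer freeing so an in-progress hash-bucket walk stays valid */
	static void kvm_mmu_defer_zap(struct kvm *kvm, struct kvm_mmu_page *sp)
	{
		sp->role.invalid = 1;	/* walkers must skip invalid pages */
		kvm_mmu_page_unlink_children(kvm, sp);
		kvm_mmu_unlink_parents(kvm, sp);
		list_move(&sp->link, &kvm->arch.deferred_free_pages);
	}

	/* cleanup path: flush TLBs once, then free everything queued */
	static void kvm_mmu_commit_deferred_zaps(struct kvm *kvm)
	{
		struct kvm_mmu_page *sp, *n;

		kvm_flush_remote_tlbs(kvm);
		list_for_each_entry_safe(sp, n, &kvm->arch.deferred_free_pages, link)
			kvm_mmu_free_page(kvm, sp);
	}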



Windows 2008 64-bit does all sorts of sharing of a pagetable at multiple
levels too.

  
  

We still want to allow oos for the two quadrants of a nonpae shadow page.



Sure, can be an optimization step later?
  


I'd like to reexamine this from another angle: what if we allow oos of 
any level?


This will simplify the can_unsync path (always true) and remove a 
special case.  The cost is implementing invlpg and resync for non-leaf 
pages (invlpg has to resync the pte for every level).  Are there other 
problems with this?


--
error compiling committee.c: too many arguments to function



Re: [patch 09/10] KVM: MMU: out of sync shadow core v2

2008-09-23 Thread Marcelo Tosatti
On Tue, Sep 23, 2008 at 01:46:23PM +0300, Avi Kivity wrote:
 Marcelo Tosatti wrote:
 
 I don't understand how the variables sp, child, and parent interact. 
 You  either need recursion or an explicit stack?
 

 It restarts at the parent level whenever it finishes a child's entries:

 +   if (i == PT64_ENT_PER_PAGE) {
 +   sp->unsync_children = 0;
 +   sp = parent;
 +   }

 Not efficient, though.

   

 Oh okay.  'parent' is never assigned to.  Lack of concentration.

 Yes. The next element saved by for_each_entry_safe could have been zapped.

 
 Ouch. Ouch.

 I hate doing this. Can see no alternative though.
 

 Me neither.

   

 Well.  But I don't see kvm_mmu_zap_page()'s return value used anywhere.

It is. The list walk becomes unsafe otherwise.
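
For illustration, a hedged sketch of how a hash-bucket walker can use
that return value; the restart label is the illustrative addition here,
the rest follows the existing kvm_mmu_unprotect_page() shape:

	static int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
	{
		unsigned index;
		struct hlist_head *bucket;
		struct kvm_mmu_page *sp;
		struct hlist_node *node, *n;
		int r = 0;

		index = kvm_page_table_hashfn(gfn);
		bucket = &kvm->arch.mmu_page_hash[index];
	restart:
		hlist_for_each_entry_safe(sp, node, n, bucket, hash_link)
			if (sp->gfn == gfn && !sp->role.metaphysical) {
				r = 1;
				/* nonzero return: other pages were zapped too,
				 * so the "next" node saved by the iterator may
				 * be stale; start over from the bucket head */
				if (kvm_mmu_zap_page(kvm, sp))
					goto restart;
			}
		return r;
	}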

 Actually, I think I see an alternative:  set the invalid flag on these  
 pages and queue them in a list, like we do for roots in use.  Flush the  
 list on some cleanup path.

Yes, it is an alternative. But then you would have to test for the
invalid flag on all those paths that currently test kvm_mmu_zap_page's
return value. I'm not sure if that's any better.

 Windows 2008 64-bit does all sorts of sharing of a pagetable at multiple
 levels too.

 
 We still want to allow oos for the two quadrants of a nonpae shadow page.
 

 Sure, can be an optimization step later?
   

 I'd like to reexamine this from another angle: what if we allow oos of  
 any level?

 This will simplify the can_unsync path (always true) 

The can_unsync flag is there to prevent the resync path
(mmu_unsync_walk->kvm_sync_page) from unsyncing pages of the root being
synced. Say, if at every resync you end up unsyncing two pages (unlikely
but possible).

However, we can probably get rid of it with the bitmap walk (which won't
restart the walk from the beginning).
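
To make that concrete, a rough sketch of what a bitmap-driven walk could
look like; unsync_child_bitmap (one bit per shadow entry that leads to an
unsync page) and the recursion are assumptions here, not part of v2:

	static int mmu_unsync_walk_bitmap(struct kvm_mmu_page *sp,
					  mmu_unsync_fn fn, void *priv)
	{
		int i, ret;

		for (i = find_first_bit(sp->unsync_child_bitmap, PT64_ENT_PER_PAGE);
		     i < PT64_ENT_PER_PAGE;
		     i = find_next_bit(sp->unsync_child_bitmap,
				       PT64_ENT_PER_PAGE, i + 1)) {
			u64 ent = sp->spt[i];
			struct kvm_mmu_page *child;

			if (!is_shadow_present_pte(ent)) {
				__clear_bit(i, sp->unsync_child_bitmap);
				continue;
			}
			child = page_header(ent & PT64_BASE_ADDR_MASK);

			/* recurse instead of restarting from the parent */
			if (child->unsync_children) {
				ret = mmu_unsync_walk_bitmap(child, fn, priv);
				if (ret)
					return ret;
			}
			if (child->unsync) {
				ret = fn(child, priv);
				if (ret)
					return ret;
			}
			/* forget the slot only once nothing below is unsync */
			if (!child->unsync && !child->unsync_children)
				__clear_bit(i, sp->unsync_child_bitmap);
		}

		if (bitmap_empty(sp->unsync_child_bitmap, PT64_ENT_PER_PAGE))
			sp->unsync_children = 0;
		return 0;
	}

This visits only marked entries and never rescans a finished child, which
is what removes the restart-from-the-beginning behaviour.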

 and remove a special case. The cost is implementing invlpg and resync
 for non-leaf pages (invlpg has to resync the pte for every level). Are
 there other problems with this?

There is no gfn cache for non-leaf pages, so you either need to
introduce it or go for gfn_to_page_atomic-like functionality
(expensive).

I was hoping to leave non-leaf unsync as another later optimization
step, if it turns out to be worthwhile.




Re: [patch 09/10] KVM: MMU: out of sync shadow core v2

2008-09-22 Thread Marcelo Tosatti
On Mon, Sep 22, 2008 at 11:41:14PM +0300, Avi Kivity wrote:
 Marcelo Tosatti wrote:
 +  while (parent->unsync_children) {
 +  for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
 +  u64 ent = sp->spt[i];
 +
 +  if (is_shadow_present_pte(ent)) {
 +  struct kvm_mmu_page *child;
 +  child = page_header(ent & PT64_BASE_ADDR_MASK);
   
 What does this do?
 

 Walks all children of the given page, though not efficiently. It's
 replaced later by the bitmap version.

   

 I don't understand how the variables sp, child, and parent interact. You  
 either need recursion or an explicit stack?

It restarts at the parent level whenever it finishes a child's entries:

+   if (i == PT64_ENT_PER_PAGE) {
+   sp->unsync_children = 0;
+   sp = parent;
+   }

Not efficient, though.

 Yes. The next element saved by for_each_entry_safe could have been zapped.

   

 Ouch. Ouch.

 I hate doing this. Can see no alternative though.

Me neither.

 +  /* don't unsync if pagetable is shadowed with multiple roles */
 +  hlist_for_each_entry_safe(s, node, n, bucket, hash_link) {
 +  if (s->gfn != sp->gfn || s->role.metaphysical)
 +  continue;
 +  if (s->role.word != sp->role.word)
 +  return 1;
 +  }
 
 This will happen for nonpae paging.  But why not allow it?  Zap all   
 unsynced pages on mode switch.

 Oh, if a page is both a page directory and page table, yes.  

 Yes. 

   
 So to allow nonpae oos, check the level instead.
 

 Windows 2008 64-bit does all sorts of sharing of a pagetable at multiple
 levels too.

   

 We still want to allow oos for the two quadrants of a nonpae shadow page.

Sure, can be an optimization step later?



Re: [patch 09/10] KVM: MMU: out of sync shadow core v2

2008-09-22 Thread Marcelo Tosatti
On Mon, Sep 22, 2008 at 06:55:03PM -0300, Marcelo Tosatti wrote:
 On Mon, Sep 22, 2008 at 11:41:14PM +0300, Avi Kivity wrote:
  Marcelo Tosatti wrote:
  +while (parent->unsync_children) {
  +for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
  +u64 ent = sp->spt[i];
  +
  +if (is_shadow_present_pte(ent)) {
  +struct kvm_mmu_page *child;
  +child = page_header(ent & PT64_BASE_ADDR_MASK);

  What does this do?
  
 
  Walks all children of the given page, though not efficiently. It's
  replaced later by the bitmap version.
 

 
  I don't understand how the variables sp, child, and parent interact. You  
  either need recursion or an explicit stack?
 
 It restarts at the parent level whenever it finishes a child's entries:
 
 +   if (i == PT64_ENT_PER_PAGE) {
 +   sp->unsync_children = 0;
 +   sp = parent;
 +   }
 
 Not efficient, though.

Do you prefer a recursive version for this one too? 



Re: [patch 09/10] KVM: MMU: out of sync shadow core v2

2008-09-20 Thread Marcelo Tosatti
On Fri, Sep 19, 2008 at 06:22:52PM -0700, Avi Kivity wrote:
 Instead of private, have an object contain both callback and private  
 data, and use container_of().  Reduces the chance of type errors.

OK.
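
For reference, a small sketch of the container_of() shape being
suggested; the walker struct and field names below are illustrative only:

	struct mmu_unsync_walker {
		int (*fn)(struct mmu_unsync_walker *walker,
			  struct kvm_mmu_page *sp);
	};

	struct mmu_sync_walker {
		struct mmu_unsync_walker walker;	/* embedded callback */
		struct kvm_vcpu *vcpu;			/* typed "private" data */
	};

	static int mmu_sync_fn(struct mmu_unsync_walker *walker,
			       struct kvm_mmu_page *sp)
	{
		struct mmu_sync_walker *sync_walk =
			container_of(walker, struct mmu_sync_walker, walker);

		kvm_sync_page(sync_walk->vcpu, sp);
		return 0;
	}

mmu_unsync_walk() would then take a struct mmu_unsync_walker * instead of
the (fn, priv) pair, and the compiler catches mismatched contexts.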

 +while (parent->unsync_children) {
 +for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
 +u64 ent = sp->spt[i];
 +
 +if (is_shadow_present_pte(ent)) {
 +struct kvm_mmu_page *child;
 +child = page_header(ent & PT64_BASE_ADDR_MASK);

 What does this do?

Walks all children of the given page, though not efficiently. It's
replaced later by the bitmap version.

 +static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 +{
 +if (sp->role.glevels != vcpu->arch.mmu.root_level) {
 +kvm_mmu_zap_page(vcpu->kvm, sp);
 +return 1;
 +}
   

 Suppose we switch to real mode, touch a pte, switch back.  Is this handled?

The shadow page will go unsync on the pte touch and be resynced as soon
as it becomes visible again (after the return to paging).

Or, while still in real mode, it might be zapped by
kvm_mmu_get_page->kvm_sync_page.

Am I missing something?

 @@ -991,8 +1066,18 @@ static struct kvm_mmu_page *kvm_mmu_get_
   gfn, role.word);
  index = kvm_page_table_hashfn(gfn);
 bucket = &vcpu->kvm->arch.mmu_page_hash[index];
 -hlist_for_each_entry(sp, node, bucket, hash_link)
 -if (sp->gfn == gfn && sp->role.word == role.word) {
 +hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
 +if (sp->gfn == gfn) {
 +if (sp->unsync)
 +if (kvm_sync_page(vcpu, sp))
 +continue;
 +
 +if (sp->role.word != role.word)
 +continue;
 +
 +if (sp->unsync_children)
 +vcpu->arch.mmu.need_root_sync = 1;
   

 mmu_reload() maybe?

Hum, will think about it.
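
One possible reading of the suggestion, sketched with an assumed
mmu_sync_roots() helper built on mmu_sync_children(): instead of a
need_root_sync flag, resync the unsync children of the active roots
whenever the mmu is (re)loaded, before entering the guest.

	static void mmu_sync_roots(struct kvm_vcpu *vcpu)
	{
		int i;
		struct kvm_mmu_page *sp;

		if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
			return;
		if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL) {
			sp = page_header(vcpu->arch.mmu.root_hpa);
			mmu_sync_children(vcpu, sp);
			return;
		}
		for (i = 0; i < 4; ++i) {
			hpa_t root = vcpu->arch.mmu.pae_root[i];

			if (root && VALID_PAGE(root)) {
				root &= PT64_BASE_ADDR_MASK;
				sp = page_header(root);
				mmu_sync_children(vcpu, sp);
			}
		}
	}

kvm_mmu_load() (or a reload request) would call this under mmu_lock right
after mmu_alloc_roots(); kvm_mmu_get_page() would then only need to
trigger a reload rather than remember a flag.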

  static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 -return 0;
 +return ret;
  }
   

 Why does the caller care if zap also zapped some other random pages?  To  
 restart walking the list?

Yes. The next element saved by for_each_entry_safe could have been zapped.

 +/* don't unsync if pagetable is shadowed with multiple roles */
 +hlist_for_each_entry_safe(s, node, n, bucket, hash_link) {
 +if (s->gfn != sp->gfn || s->role.metaphysical)
 +continue;
 +if (s->role.word != sp->role.word)
 +return 1;
 +}
   

 This will happen for nonpae paging.  But why not allow it?  Zap all  
 unsynced pages on mode switch.

 Oh, if a page is both a page directory and page table, yes.  

Yes. 

 So to allow nonpae oos, check the level instead.

Windows 2008 64-bit does all sorts of sharing of a pagetable at multiple
levels too.



Re: [patch 09/10] KVM: MMU: out of sync shadow core v2

2008-09-19 Thread Avi Kivity

Marcelo Tosatti wrote:

 static struct kmem_cache *rmap_desc_cache;
@@ -942,6 +943,39 @@ static void nonpaging_invlpg(struct kvm_
 {
 }
 
+static int mmu_unsync_walk(struct kvm_mmu_page *parent, mmu_unsync_fn fn,
+			   void *priv)
  


Instead of private, have an object contain both callback and private 
data, and use container_of().  Reduces the chance of type errors.



+{
+   int i, ret;
+   struct kvm_mmu_page *sp = parent;
+
+	while (parent->unsync_children) {
+		for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
+			u64 ent = sp->spt[i];
+
+			if (is_shadow_present_pte(ent)) {
+				struct kvm_mmu_page *child;
+				child = page_header(ent & PT64_BASE_ADDR_MASK);
+
+				if (child->unsync_children) {
+					sp = child;
+					break;
+				}
+				if (child->unsync) {
+					ret = fn(child, priv);
+					if (ret)
+						return ret;
+				}
+			}
+		}
+		if (i == PT64_ENT_PER_PAGE) {
+			sp->unsync_children = 0;
+   sp = parent;
+   }
+   }
+   return 0;
+}
  


What does this do?


+static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+	if (sp->role.glevels != vcpu->arch.mmu.root_level) {
+		kvm_mmu_zap_page(vcpu->kvm, sp);
+   return 1;
+   }
  


Suppose we switch to real mode, touch a pte, switch back.  Is this handled?


@@ -991,8 +1066,18 @@ static struct kvm_mmu_page *kvm_mmu_get_
 gfn, role.word);
index = kvm_page_table_hashfn(gfn);
	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
-	hlist_for_each_entry(sp, node, bucket, hash_link)
-		if (sp->gfn == gfn && sp->role.word == role.word) {
+	hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
+		if (sp->gfn == gfn) {
+			if (sp->unsync)
+				if (kvm_sync_page(vcpu, sp))
+					continue;
+
+			if (sp->role.word != role.word)
+				continue;
+
+			if (sp->unsync_children)
+				vcpu->arch.mmu.need_root_sync = 1;
  


mmu_reload() maybe?


 static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
+   int ret;
 	++kvm->stat.mmu_shadow_zapped;
+	ret = mmu_zap_unsync_children(kvm, sp);
 	kvm_mmu_page_unlink_children(kvm, sp);
 	kvm_mmu_unlink_parents(kvm, sp);
 	kvm_flush_remote_tlbs(kvm);
 	if (!sp->role.invalid && !sp->role.metaphysical)
 		unaccount_shadowed(kvm, sp->gfn);
+	if (sp->unsync)
+		kvm_unlink_unsync_page(kvm, sp);
 	if (!sp->root_count) {
 		hlist_del(&sp->hash_link);
kvm_mmu_free_page(kvm, sp);
@@ -1129,7 +1245,7 @@ static int kvm_mmu_zap_page(struct kvm *
kvm_reload_remote_mmus(kvm);
}
kvm_mmu_reset_last_pte_updated(kvm);
-   return 0;
+   return ret;
 }
  


Why does the caller care if zap also zapped some other random pages?  To 
restart walking the list?


 
+

+static int kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+   unsigned index;
+   struct hlist_head *bucket;
+   struct kvm_mmu_page *s;
+   struct hlist_node *node, *n;
+
+	index = kvm_page_table_hashfn(sp->gfn);
+	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
+	/* don't unsync if pagetable is shadowed with multiple roles */
+	hlist_for_each_entry_safe(s, node, n, bucket, hash_link) {
+		if (s->gfn != sp->gfn || s->role.metaphysical)
+			continue;
+		if (s->role.word != sp->role.word)
+			return 1;
+   }
  


This will happen for nonpae paging.  But why not allow it?  Zap all 
unsynced pages on mode switch.


Oh, if a page is both a page directory and page table, yes.  So to allow 
nonpae oos, check the level instead.



--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



[patch 09/10] KVM: MMU: out of sync shadow core v2

2008-09-18 Thread Marcelo Tosatti
Allow guest pagetables to go out of sync.

Signed-off-by: Marcelo Tosatti [EMAIL PROTECTED]

Index: kvm/arch/x86/kvm/mmu.c
===
--- kvm.orig/arch/x86/kvm/mmu.c
+++ kvm/arch/x86/kvm/mmu.c
@@ -148,6 +148,7 @@ struct kvm_shadow_walk {
 };
 
 typedef int (*mmu_parent_walk_fn) (struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp);
+typedef int (*mmu_unsync_fn) (struct kvm_mmu_page *sp, void *priv);
 
 static struct kmem_cache *pte_chain_cache;
 static struct kmem_cache *rmap_desc_cache;
@@ -942,6 +943,39 @@ static void nonpaging_invlpg(struct kvm_
 {
 }
 
+static int mmu_unsync_walk(struct kvm_mmu_page *parent, mmu_unsync_fn fn,
+  void *priv)
+{
+   int i, ret;
+   struct kvm_mmu_page *sp = parent;
+
+	while (parent->unsync_children) {
+		for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
+			u64 ent = sp->spt[i];
+
+			if (is_shadow_present_pte(ent)) {
+				struct kvm_mmu_page *child;
+				child = page_header(ent & PT64_BASE_ADDR_MASK);
+
+				if (child->unsync_children) {
+					sp = child;
+					break;
+				}
+				if (child->unsync) {
+					ret = fn(child, priv);
+					if (ret)
+						return ret;
+				}
+			}
+		}
+		if (i == PT64_ENT_PER_PAGE) {
+			sp->unsync_children = 0;
+			sp = parent;
+		}
+   }
+   return 0;
+}
+
 static struct kvm_mmu_page *kvm_mmu_lookup_page(struct kvm *kvm, gfn_t gfn)
 {
unsigned index;
@@ -962,6 +996,47 @@ static struct kvm_mmu_page *kvm_mmu_look
return NULL;
 }
 
+static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+	WARN_ON(!sp->unsync);
+	sp->unsync = 0;
+	--kvm->stat.mmu_unsync;
+}
+
+static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+
+static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+	if (sp->role.glevels != vcpu->arch.mmu.root_level) {
+		kvm_mmu_zap_page(vcpu->kvm, sp);
+		return 1;
+	}
+
+	rmap_write_protect(vcpu->kvm, sp->gfn);
+	if (vcpu->arch.mmu.sync_page(vcpu, sp)) {
+		kvm_mmu_zap_page(vcpu->kvm, sp);
+		return 1;
+	}
+
+	kvm_mmu_flush_tlb(vcpu);
+	kvm_unlink_unsync_page(vcpu->kvm, sp);
+   return 0;
+}
+
+static int mmu_sync_fn(struct kvm_mmu_page *sp, void *priv)
+{
+   struct kvm_vcpu *vcpu = priv;
+
+   kvm_sync_page(vcpu, sp);
+	return (need_resched() || spin_needbreak(&vcpu->kvm->mmu_lock));
+}
+
+static void mmu_sync_children(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+	while (mmu_unsync_walk(sp, mmu_sync_fn, vcpu))
+		cond_resched_lock(&vcpu->kvm->mmu_lock);
+}
+
 static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 gfn_t gfn,
 gva_t gaddr,
@@ -975,7 +1050,7 @@ static struct kvm_mmu_page *kvm_mmu_get_
unsigned quadrant;
struct hlist_head *bucket;
struct kvm_mmu_page *sp;
-   struct hlist_node *node;
+   struct hlist_node *node, *tmp;
 
role.word = 0;
	role.glevels = vcpu->arch.mmu.root_level;
@@ -991,8 +1066,18 @@ static struct kvm_mmu_page *kvm_mmu_get_
 gfn, role.word);
index = kvm_page_table_hashfn(gfn);
 	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
-	hlist_for_each_entry(sp, node, bucket, hash_link)
-		if (sp->gfn == gfn && sp->role.word == role.word) {
+	hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
+		if (sp->gfn == gfn) {
+			if (sp->unsync)
+				if (kvm_sync_page(vcpu, sp))
+					continue;
+
+			if (sp->role.word != role.word)
+				continue;
+
+			if (sp->unsync_children)
+				vcpu->arch.mmu.need_root_sync = 1;
+
 			mmu_page_add_parent_pte(vcpu, sp, parent_pte);
 			pgprintk("%s: found\n", __func__);
return sp;
@@ -1112,14 +1197,45 @@ static void kvm_mmu_unlink_parents(struc
}
 }
 
+struct mmu_zap_walk {
+   struct kvm *kvm;
+   int zapped;
+};
+
+static int mmu_zap_fn(struct kvm_mmu_page *sp, void *private)
+{
+   struct mmu_zap_walk *zap_walk = private;
+
+	kvm_mmu_zap_page(zap_walk->kvm, sp);
+	zap_walk->zapped = 1;
+   return 0;
+}
+
+static int