Re: [PATCH v3 12/15] KVM: MMU: fast invalid all shadow pages

2013-04-18 Thread Marcelo Tosatti
On Thu, Apr 18, 2013 at 12:00:16PM +0800, Xiao Guangrong wrote:
> > 
> > What is the justification for this? 
> 
> We want the rmap of being deleted memslot is removed-only that is
> needed for unmapping rmap out of mmu-lock.
> 
> ==
> 1) do not corrupt the rmap
> 2) keep pte-list-descs available
> 3) keep shadow page available
> 
> Resolve 1):
> we make the invalid rmap be remove-only that means we only delete and
> clear spte from the rmap, no new sptes can be added to it.
> This is reasonable since kvm can not do address translation on invalid rmap
> (gfn_to_pfn is failed on invalid memslot) and all sptes on invalid rmap can
> not be reused (they belong to invalid shadow page).
> ==
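
For illustration only, the "remove-only" rule in Resolve 1) can be sketched as a small user-space model. This is not the patch's code: struct toy_rmap, TOY_RMAP_MAX and the toy_* helpers are hypothetical names, and a fixed array stands in for the real pte-list descriptors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_RMAP_MAX 8	/* hypothetical fixed capacity, for brevity */

/* Hypothetical stand-in for one rmap chain of a memslot. */
struct toy_rmap {
	unsigned long *sptes[TOY_RMAP_MAX];
	size_t nr;
	bool remove_only;	/* set once the memslot is being deleted */
};

/* Adding is refused once the rmap has been marked remove-only. */
static bool toy_rmap_add(struct toy_rmap *rmap, unsigned long *spte)
{
	assert(rmap && spte);
	if (rmap->remove_only || rmap->nr == TOY_RMAP_MAX)
		return false;
	rmap->sptes[rmap->nr++] = spte;
	return true;
}

/* Removing (and clearing the spte) is always allowed. */
static bool toy_rmap_remove(struct toy_rmap *rmap, unsigned long *spte)
{
	for (size_t i = 0; i < rmap->nr; i++) {
		if (rmap->sptes[i] != spte)
			continue;
		*spte = 0;
		/* Fill the hole with the last entry. */
		rmap->sptes[i] = rmap->sptes[--rmap->nr];
		return true;
	}
	return false;
}

static void toy_rmap_mark_invalid(struct toy_rmap *rmap)
{
	rmap->remove_only = true;
}
```

After toy_rmap_mark_invalid() the chain can only shrink, which is the property the quoted text relies on for walking it outside mmu-lock.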
> 
> clear_flush_young / test_young / change_pte of mmu-notify can rewrite
> rmap with the present-spte (P bit is set), we should umap rmap in
> these handlers.
> 
> > 
> >> +
> >> +  /*
> >> +   * To ensure that all vcpus and mmu-notify are not clearing
> >> +   * spte and rmap entry.
> >> +   */
> >> +  synchronize_srcu_expedited(&kvm->srcu);
> >> +}
> >> +
> >>  #ifdef MMU_DEBUG
> >>  static int is_empty_shadow_page(u64 *spt)
> >>  {
> >> @@ -2219,6 +2283,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
> >>__clear_sp_write_flooding_count(sp);
> >>  }
> >>  
> >> +static bool is_valid_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> >> +{
> >> +  return likely(sp->mmu_valid_gen == kvm->arch.mmu_valid_gen);
> >> +}
> >> +
> >>  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> >> gfn_t gfn,
> >> gva_t gaddr,
> >> @@ -2245,6 +2314,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
> >> kvm_vcpu *vcpu,
> >>role.quadrant = quadrant;
> >>}
> >>for_each_gfn_sp(vcpu->kvm, sp, gfn) {
> >> +  if (!is_valid_sp(vcpu->kvm, sp))
> >> +  continue;
> >> +
> >>if (!need_sync && sp->unsync)
> >>need_sync = true;
> >>  
> >> @@ -2281,6 +2353,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
> >> kvm_vcpu *vcpu,
> >>  
> >>account_shadowed(vcpu->kvm, gfn);
> >>}
> >> +  sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> >>init_shadow_page_table(sp);
> >>trace_kvm_mmu_get_page(sp, true);
> >>return sp;
> >> @@ -2451,8 +2524,12 @@ static int kvm_mmu_prepare_zap_page(struct kvm 
> >> *kvm, struct kvm_mmu_page *sp,
> >>ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
> >>kvm_mmu_page_unlink_children(kvm, sp);
> > 
> > The rmaps[] arrays linked to !is_valid_sp() shadow pages should not be
> > accessed (as they have been freed already).
> > 
> > I suppose the is_valid_sp() conditional below should be moved earlier,
> > before kvm_mmu_unlink_parents or any other rmap access.
> > 
> > This is fine: the !is_valid_sp() shadow pages are only reachable
> > by SLAB and the hypervisor itself.
> 
> Unfortunately we can not do this. :(
> 
> The sptes in shadow pape can linked to many slots, if the spte is linked
> to the rmap of being deleted memslot, it is ok, otherwise, the rmap of
> still used memslot is miss updated.
> 
> For example, slot 0 is being deleted, sp->spte[0] is linked on slot[0].rmap,
> sp->spte[1] is linked on slot[1].rmap. If we do not access rmap of this 'sp',
> the already-freed spte[1] is still linked on slot[1].rmap.
> 
> We can let kvm update the rmap for sp->spte[1] and do not unlink sp->spte[0].
> This is also not allowed since mmu-notify can access the invalid rmap before
> the memslot is destroyed, then mmu-notify will get already-freed spte on
> the rmap or page Access/Dirty is miss tracked (if let mmu-notify do not access
> the invalid rmap).
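
The slot 0 / slot 1 example above can be modelled roughly as follows. This is a simplified sketch under stated assumptions, not KVM code: each hypothetical toy_slot holds at most one rmap entry, and toy_sp has just two sptes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical memslot whose rmap holds at most one spte pointer. */
struct toy_slot {
	unsigned long *rmap;
};

/* Hypothetical shadow page with two sptes, each mapped through some slot. */
struct toy_sp {
	unsigned long spte[2];
	struct toy_slot *slot_of[2];
};

static void toy_link(struct toy_sp *sp, int idx, struct toy_slot *slot,
		     unsigned long val)
{
	assert(idx >= 0 && idx < 2);
	sp->spte[idx] = val;
	sp->slot_of[idx] = slot;
	slot->rmap = &sp->spte[idx];
}

/* Zap that skips all rmap access: the slots keep pointing into the page. */
static void toy_zap_skip_rmap(struct toy_sp *sp)
{
	for (int i = 0; i < 2; i++)
		sp->spte[i] = 0;
}

/* Zap that unlinks every spte from whichever slot's rmap holds it. */
static void toy_zap_unlink_rmap(struct toy_sp *sp)
{
	for (int i = 0; i < 2; i++) {
		struct toy_slot *slot = sp->slot_of[i];

		if (slot && slot->rmap == &sp->spte[i])
			slot->rmap = NULL;
		sp->spte[i] = 0;
		sp->slot_of[i] = NULL;
	}
}
```

With toy_zap_skip_rmap(), slot 1 (which is not being deleted) still points at spte[1]; once the page is freed that pointer dangles, which is the problem described in the example.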

Why not release all rmaps?

Subject: [PATCH v2 3/7] KVM: x86: introduce kvm_clear_all_gfn_page_info

This function is used to reset the rmaps and page info of all guest page
which will be used in later patch

Signed-off-by: Xiao Guangrong 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 12/15] KVM: MMU: fast invalid all shadow pages

2013-04-18 Thread Marcelo Tosatti
On Thu, Apr 18, 2013 at 10:03:06AM -0300, Marcelo Tosatti wrote:
> On Thu, Apr 18, 2013 at 12:00:16PM +0800, Xiao Guangrong wrote:
> > > 
> > > What is the justification for this? 
> > 
> > We want the rmap of being deleted memslot is removed-only that is
> > needed for unmapping rmap out of mmu-lock.
> > 
> > ==
> > 1) do not corrupt the rmap
> > 2) keep pte-list-descs available
> > 3) keep shadow page available
> > 
> > Resolve 1):
> > we make the invalid rmap be remove-only that means we only delete and
> > clear spte from the rmap, no new sptes can be added to it.
> > This is reasonable since kvm can not do address translation on invalid rmap
> > (gfn_to_pfn is failed on invalid memslot) and all sptes on invalid rmap can
> > not be reused (they belong to invalid shadow page).
> > ==
> > 
> > clear_flush_young / test_young / change_pte of mmu-notify can rewrite
> > rmap with the present-spte (P bit is set), we should umap rmap in
> > these handlers.
> > 
> > > 
> > >> +
> > >> +/*
> > >> + * To ensure that all vcpus and mmu-notify are not clearing
> > >> + * spte and rmap entry.
> > >> + */
> > >> +synchronize_srcu_expedited(&kvm->srcu);
> > >> +}
> > >> +
> > >>  #ifdef MMU_DEBUG
> > >>  static int is_empty_shadow_page(u64 *spt)
> > >>  {
> > >> @@ -2219,6 +2283,11 @@ static void clear_sp_write_flooding_count(u64 
> > >> *spte)
> > >>  __clear_sp_write_flooding_count(sp);
> > >>  }
> > >>  
> > >> +static bool is_valid_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> > >> +{
> > >> +return likely(sp->mmu_valid_gen == kvm->arch.mmu_valid_gen);
> > >> +}
> > >> +
> > >>  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > >>   gfn_t gfn,
> > >>   gva_t gaddr,
> > >> @@ -2245,6 +2314,9 @@ static struct kvm_mmu_page 
> > >> *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > >>  role.quadrant = quadrant;
> > >>  }
> > >>  for_each_gfn_sp(vcpu->kvm, sp, gfn) {
> > >> +if (!is_valid_sp(vcpu->kvm, sp))
> > >> +continue;
> > >> +
> > >>  if (!need_sync && sp->unsync)
> > >>  need_sync = true;
> > >>  
> > >> @@ -2281,6 +2353,7 @@ static struct kvm_mmu_page 
> > >> *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > >>  
> > >>  account_shadowed(vcpu->kvm, gfn);
> > >>  }
> > >> +sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> > >>  init_shadow_page_table(sp);
> > >>  trace_kvm_mmu_get_page(sp, true);
> > >>  return sp;
> > >> @@ -2451,8 +2524,12 @@ static int kvm_mmu_prepare_zap_page(struct kvm 
> > >> *kvm, struct kvm_mmu_page *sp,
> > >>  ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
> > >>  kvm_mmu_page_unlink_children(kvm, sp);
> > > 
> > > The rmaps[] arrays linked to !is_valid_sp() shadow pages should not be
> > > accessed (as they have been freed already).
> > > 
> > > I suppose the is_valid_sp() conditional below should be moved earlier,
> > > before kvm_mmu_unlink_parents or any other rmap access.
> > > 
> > > This is fine: the !is_valid_sp() shadow pages are only reachable
> > > by SLAB and the hypervisor itself.
> > 
> > Unfortunately we can not do this. :(
> > 
> > The sptes in shadow pape can linked to many slots, if the spte is linked
> > to the rmap of being deleted memslot, it is ok, otherwise, the rmap of
> > still used memslot is miss updated.
> > 
> > For example, slot 0 is being deleted, sp->spte[0] is linked on slot[0].rmap,
> > sp->spte[1] is linked on slot[1].rmap. If we do not access rmap of this 
> > 'sp',
> > the already-freed spte[1] is still linked on slot[1].rmap.
> > 
> > We can let kvm update the rmap for sp->spte[1] and do not unlink 
> > sp->spte[0].
> > This is also not allowed since mmu-notify can access the invalid rmap before
> > the memslot is destroyed, then mmu-notify will get already-freed spte on
> > the rmap or page Access/Dirty is miss tracked (if let mmu-notify do not 
> > access
> > the invalid rmap).
> 
> Why not release all rmaps?
> 
> Subject: [PATCH v2 3/7] KVM: x86: introduce kvm_clear_all_gfn_page_info
> 
> This function is used to reset the rmaps and page info of all guest page
> which will be used in later patch
> 
> Signed-off-by: Xiao Guangrong 

Which you have later in patchset.

So, what is the justification for the zap root + generation number increase 
to work on a per memslot basis, given that

	/*
	 * If memory slot is created, or moved, we need to clear all
	 * mmio sptes.
	 */
	if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
		kvm_mmu_zap_mmio_sptes(kvm);
		kvm_reload_remote_mmus(kvm);
	}

is going to be dealt with by the generation number on mmio spte idea?

Note at the moment all shadow pages are zapped on deletion / move,
and there is no performance complaint for those cases.

In fact, what case does the generation number on mmio spte optimize for?
The cases where slots are deleted/moved/created on a live guest
are:

- Legacy VGA mode operation where VGA slots are created/deleted. Zapping
all shadow not a performance 

Re: [PATCH v3 12/15] KVM: MMU: fast invalid all shadow pages

2013-04-18 Thread Xiao Guangrong
On 04/18/2013 09:29 PM, Marcelo Tosatti wrote:
> On Thu, Apr 18, 2013 at 10:03:06AM -0300, Marcelo Tosatti wrote:
>> On Thu, Apr 18, 2013 at 12:00:16PM +0800, Xiao Guangrong wrote:
>>>>
>>>> What is the justification for this? 
>>>
>>> We want the rmap of being deleted memslot is removed-only that is
>>> needed for unmapping rmap out of mmu-lock.
>>>
>>> ==
>>> 1) do not corrupt the rmap
>>> 2) keep pte-list-descs available
>>> 3) keep shadow page available
>>>
>>> Resolve 1):
>>> we make the invalid rmap be remove-only that means we only delete and
>>> clear spte from the rmap, no new sptes can be added to it.
>>> This is reasonable since kvm can not do address translation on invalid rmap
>>> (gfn_to_pfn is failed on invalid memslot) and all sptes on invalid rmap can
>>> not be reused (they belong to invalid shadow page).
>>> ==
>>>
>>> clear_flush_young / test_young / change_pte of mmu-notify can rewrite
>>> rmap with the present-spte (P bit is set), we should umap rmap in
>>> these handlers.
>>>
>>>>
>>>>> +
>>>>> +	/*
>>>>> +	 * To ensure that all vcpus and mmu-notify are not clearing
>>>>> +	 * spte and rmap entry.
>>>>> +	 */
>>>>> +	synchronize_srcu_expedited(&kvm->srcu);
>>>>> +}
>>>>> +
>>>>>  #ifdef MMU_DEBUG
>>>>>  static int is_empty_shadow_page(u64 *spt)
>>>>>  {
>>>>> @@ -2219,6 +2283,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
>>>>>  	__clear_sp_write_flooding_count(sp);
>>>>>  }
>>>>>  
>>>>> +static bool is_valid_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>>>>> +{
>>>>> +	return likely(sp->mmu_valid_gen == kvm->arch.mmu_valid_gen);
>>>>> +}
>>>>> +
>>>>>  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>>>>>  					     gfn_t gfn,
>>>>>  					     gva_t gaddr,
>>>>> @@ -2245,6 +2314,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>>>>>  		role.quadrant = quadrant;
>>>>>  	}
>>>>>  	for_each_gfn_sp(vcpu->kvm, sp, gfn) {
>>>>> +		if (!is_valid_sp(vcpu->kvm, sp))
>>>>> +			continue;
>>>>> +
>>>>>  		if (!need_sync && sp->unsync)
>>>>>  			need_sync = true;
>>>>>  
>>>>> @@ -2281,6 +2353,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>>>>>  
>>>>>  		account_shadowed(vcpu->kvm, gfn);
>>>>>  	}
>>>>> +	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
>>>>>  	init_shadow_page_table(sp);
>>>>>  	trace_kvm_mmu_get_page(sp, true);
>>>>>  	return sp;
>>>>> @@ -2451,8 +2524,12 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>>>>>  	ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
>>>>>  	kvm_mmu_page_unlink_children(kvm, sp);
>>>>
>>>> The rmaps[] arrays linked to !is_valid_sp() shadow pages should not be
>>>> accessed (as they have been freed already).
>>>>
>>>> I suppose the is_valid_sp() conditional below should be moved earlier,
>>>> before kvm_mmu_unlink_parents or any other rmap access.
>>>>
>>>> This is fine: the !is_valid_sp() shadow pages are only reachable
>>>> by SLAB and the hypervisor itself.
>>>
>>> Unfortunately we can not do this. :(
>>>
>>> The sptes in shadow pape can linked to many slots, if the spte is linked
>>> to the rmap of being deleted memslot, it is ok, otherwise, the rmap of
>>> still used memslot is miss updated.
>>>
>>> For example, slot 0 is being deleted, sp->spte[0] is linked on slot[0].rmap,
>>> sp->spte[1] is linked on slot[1].rmap. If we do not access rmap of this 'sp',
>>> the already-freed spte[1] is still linked on slot[1].rmap.
>>>
>>> We can let kvm update the rmap for sp->spte[1] and do not unlink sp->spte[0].
>>> This is also not allowed since mmu-notify can access the invalid rmap before
>>> the memslot is destroyed, then mmu-notify will get already-freed spte on
>>> the rmap or page Access/Dirty is miss tracked (if let mmu-notify do not access
>>> the invalid rmap).
>>
>> Why not release all rmaps?
>>
>> Subject: [PATCH v2 3/7] KVM: x86: introduce kvm_clear_all_gfn_page_info
>>
>> This function is used to reset the rmaps and page info of all guest page
>> which will be used in later patch
>>
>> Signed-off-by: Xiao Guangrong <xiaoguangr...@linux.vnet.ibm.com>
> 
> Which you have later in patchset.

The patch you mentioned is old (v2), now it only resets lpage-info excluding
rmap:

==
[PATCH v3 11/15] KVM: MMU: introduce kvm_clear_all_lpage_info

This function is used to reset the large page info of all guest page
which will be used in later patch

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
==

We can not release all rmaps. If we do this, ->invalidate_page and
->invalidate_range_start can not find any spte using the host page,
which means Accessed/Dirty for the host page is no longer tracked
(kvm_set_pfn_accessed and kvm_set_pfn_dirty are not called properly).

Furthermore, when we drop an invalid-gen spte, we will call
kvm_set_pfn_accessed/kvm_set_pfn_dirty for an already-freed host page since
mmu-notify can not find the spte by rmap.
(We can skip drop-spte for the invalid-gen sp, but then A/D for the host
page can be missed.)

That is why i introduced unmap_invalid_rmap out of mmu-lock.
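
As a rough illustration of the Accessed/Dirty argument above, here is a toy model of a notifier-style invalidation that finds sptes through the rmap. It is a sketch under assumptions, not the KVM implementation: the bit positions, struct names and toy_* helpers are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SPTE_PRESENT (1UL << 0)	/* hypothetical bit layout */
#define TOY_SPTE_DIRTY   (1UL << 6)

struct toy_host_page {
	bool dirty;
};

struct toy_ad_rmap {
	unsigned long *sptes[4];
	size_t nr;
};

/*
 * Find every spte mapping the host page through the rmap, carry its
 * dirty bit over to the host page, then clear the spte. Returns the
 * number of sptes that were found and dropped.
 */
static size_t toy_invalidate_page(struct toy_ad_rmap *rmap,
				  struct toy_host_page *page)
{
	size_t found = 0;

	assert(rmap && page);
	for (size_t i = 0; i < rmap->nr; i++) {
		unsigned long *sptep = rmap->sptes[i];

		if (!(*sptep & TOY_SPTE_PRESENT))
			continue;
		if (*sptep & TOY_SPTE_DIRTY)
			page->dirty = true;
		*sptep = 0;
		found++;
	}
	rmap->nr = 0;
	return found;
}

/* "Release all rmaps": forget the entries without touching the sptes. */
static void toy_release_rmap(struct toy_ad_rmap *rmap)
{
	rmap->nr = 0;
}
```

If toy_release_rmap() runs first, toy_invalidate_page() finds nothing: the spte stays present and its dirty bit never reaches the host page, which is the loss described above.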

 
> So, what is the justification for the zap root + generation number increase 
> to work on a per memslot basis, given that
> 
> 	/*
> 	 * If memory slot is created, or moved, we need to clear all
> 	 * mmio 

Re: [PATCH v3 12/15] KVM: MMU: fast invalid all shadow pages

2013-04-17 Thread Xiao Guangrong
On 04/18/2013 08:05 AM, Marcelo Tosatti wrote:
> On Tue, Apr 16, 2013 at 02:32:50PM +0800, Xiao Guangrong wrote:
>> The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to
>> walk and zap all shadow pages one by one, also it need to zap all guest
>> page's rmap and all shadow page's parent spte list. Particularly, things
>> become worse if guest uses more memory or vcpus. It is not good for
>> scalability.
>>
>> In this patch, we introduce a faster way to invalid all shadow pages.
>> KVM maintains a global mmu invalid generation-number which is stored in
>> kvm->arch.mmu_valid_gen and every shadow page stores the current global
>> generation-number into sp->mmu_valid_gen when it is created.
>>
>> When KVM need zap all shadow pages sptes, it just simply increase the
>> global generation-number then reload root shadow pages on all vcpus.
>> Vcpu will create a new shadow page table according to current kvm's
>> generation-number. It ensures the old pages are not used any more.
>>
>> The invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
>> are keeped in mmu-cache until page allocator reclaims page.
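
The generation-number scheme described above can be sketched in a few lines of user-space C. This is only a model under assumptions: struct toy_vm, struct toy_page and the toy_* helpers are hypothetical stand-ins for struct kvm, struct kvm_mmu_page and the patch's functions, and a flat array replaces the mmu page hash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kvm: only the global generation. */
struct toy_vm {
	unsigned long mmu_valid_gen;
};

/* Hypothetical stand-in for struct kvm_mmu_page. */
struct toy_page {
	unsigned long gfn;
	unsigned long mmu_valid_gen;	/* generation at creation time */
};

/* A new page is stamped with the current global generation. */
static void toy_page_init(const struct toy_vm *vm, struct toy_page *sp,
			  unsigned long gfn)
{
	assert(vm && sp);
	sp->gfn = gfn;
	sp->mmu_valid_gen = vm->mmu_valid_gen;
}

static bool toy_is_valid_sp(const struct toy_vm *vm, const struct toy_page *sp)
{
	return sp->mmu_valid_gen == vm->mmu_valid_gen;
}

/* Invalidate every existing page in O(1) by bumping the generation. */
static void toy_invalidate_all(struct toy_vm *vm)
{
	vm->mmu_valid_gen++;
}

/* Lookup that skips stale-generation pages, so they are never reused. */
static struct toy_page *toy_lookup(const struct toy_vm *vm,
				   struct toy_page *pages, size_t n,
				   unsigned long gfn)
{
	for (size_t i = 0; i < n; i++) {
		if (!toy_is_valid_sp(vm, &pages[i]))
			continue;
		if (pages[i].gfn == gfn)
			return &pages[i];
	}
	return NULL;
}
```

The cost of invalidation no longer depends on the number of pages; stale pages simply stop matching lookups until they are reclaimed.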
>>
>> If the invalidation is due to memslot changed, its rmap amd lpage-info
>> will be freed soon, in order to avoiding use invalid memory, we unmap
>> all sptes on its rmap and always reset the large-info all memslots so
>> that rmap and lpage info can be safely freed.
>>
>> Signed-off-by: Xiao Guangrong 
>> ---
>>  arch/x86/include/asm/kvm_host.h |2 +
>>  arch/x86/kvm/mmu.c  |   85 +-
>>  arch/x86/kvm/mmu.h  |4 ++
>>  arch/x86/kvm/x86.c  |6 +++
>>  4 files changed, 94 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 1ad9a34..6f8ee18 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -223,6 +223,7 @@ struct kvm_mmu_page {
>>  int root_count;  /* Currently serving as active root */
>>  unsigned int unsync_children;
>>  unsigned long parent_ptes;  /* Reverse mapping for parent_pte */
>> +unsigned long mmu_valid_gen;
>>  DECLARE_BITMAP(unsync_child_bitmap, 512);
>>  
>>  #ifdef CONFIG_X86_32
>> @@ -531,6 +532,7 @@ struct kvm_arch {
>>  unsigned int n_requested_mmu_pages;
>>  unsigned int n_max_mmu_pages;
>>  unsigned int indirect_shadow_pages;
>> +unsigned long mmu_valid_gen;
>>  struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
>>  /*
>>   * Hash table of struct kvm_mmu_page.
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 9ac584f..12129b7 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -1732,6 +1732,11 @@ static struct rmap_operations invalid_rmap_ops = {
>>  .rmap_unmap = kvm_unmap_invalid_rmapp
>>  };
>>  
>> +static void init_invalid_memslot_rmap_ops(struct kvm_memory_slot *slot)
>> +{
>> +slot->arch.ops = &invalid_rmap_ops;
>> +}
>> +
>>  typedef void (*handle_rmap_fun)(unsigned long *rmapp, void *data);
>>  static void walk_memslot_rmap_nolock(struct kvm_memory_slot *slot,
>>   handle_rmap_fun fun, void *data)
>> @@ -1812,6 +1817,65 @@ void free_meslot_rmap_desc_nolock(struct 
>> kvm_memory_slot *slot)
>>  walk_memslot_rmap_nolock(slot, free_rmap_desc_nolock, NULL);
>>  }
>>  
>> +/*
>> + * Fast invalid all shadow pages belong to @slot.
>> + *
>> + * @slot != NULL means the invalidation is caused the memslot specified
>> + * by @slot is being deleted, in this case, we should ensure that rmap
>> + * and lpage-info of the @slot can not be used after calling the function.
>> + * Specially, if @slot is INVALID_ALL_SLOTS, all slots will be deleted
>> + * soon, it always happens when kvm is being destroyed.
>> + *
>> + * @slot == NULL means the invalidation due to other reasons, we need
>> + * not care rmap and lpage-info since they are still valid after calling
>> + * the function.
>> + */
>> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
>> +   struct kvm_memory_slot *slot)
>> +{
>> +struct kvm_memory_slot *each_slot;
>> +
>> +spin_lock(&kvm->mmu_lock);
>> +kvm->arch.mmu_valid_gen++;
>> +
>> +if (slot == INVALID_ALL_SLOTS)
>> +kvm_for_each_memslot(each_slot, kvm_memslots(kvm))
>> +init_invalid_memslot_rmap_ops(each_slot);
>> +else if (slot)
>> +init_invalid_memslot_rmap_ops(slot);
>> +
>> +/*
>> + * All shadow paes are invalid, reset the large page info,
>> + * then we can safely desotry the memslot, it is also good
>> + * for large page used.
>> + */
>> +kvm_clear_all_lpage_info(kvm);
>> +
>> +/*
>> + * Notify all vcpus to reload its shadow page table
>> + * and flush TLB. Then all vcpus will switch to new
>> + * shadow page table with the new mmu_valid_gen.
>> + *
>> + * Note: we should do this under 

Re: [PATCH v3 12/15] KVM: MMU: fast invalid all shadow pages

2013-04-17 Thread Marcelo Tosatti
On Tue, Apr 16, 2013 at 02:32:50PM +0800, Xiao Guangrong wrote:
> The current kvm_mmu_zap_all is really slow: it holds mmu-lock while it
> walks and zaps all shadow pages one by one, and it also needs to zap
> every guest page's rmap and every shadow page's parent spte list. This
> gets worse as the guest uses more memory or vcpus, so it does not
> scale well.
> 
> This patch introduces a faster way to invalidate all shadow pages.
> KVM maintains a global mmu generation number, stored in
> kvm->arch.mmu_valid_gen, and every shadow page stores the current
> global generation number in sp->mmu_valid_gen when it is created.
> 
> When KVM needs to zap all shadow page sptes, it simply increases the
> global generation number and then reloads the root shadow pages on all
> vcpus. Each vcpu creates a new shadow page table tagged with kvm's
> current generation number, which ensures that the old pages are not
> used any more.
> 
> The invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
> are kept in the mmu cache until the page allocator reclaims them.
> 
> If the invalidation is due to a memslot change, the memslot's rmap and
> lpage-info will be freed soon. To avoid using invalid memory, we unmap
> all sptes on its rmap and always reset the lpage-info of all memslots,
> so that the rmap and lpage-info can be safely freed.
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/include/asm/kvm_host.h |2 +
>  arch/x86/kvm/mmu.c  |   85 +-
>  arch/x86/kvm/mmu.h  |4 ++
>  arch/x86/kvm/x86.c  |6 +++
>  4 files changed, 94 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 1ad9a34..6f8ee18 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -223,6 +223,7 @@ struct kvm_mmu_page {
>   int root_count;  /* Currently serving as active root */
>   unsigned int unsync_children;
>   unsigned long parent_ptes;  /* Reverse mapping for parent_pte */
> + unsigned long mmu_valid_gen;
>   DECLARE_BITMAP(unsync_child_bitmap, 512);
>  
>  #ifdef CONFIG_X86_32
> @@ -531,6 +532,7 @@ struct kvm_arch {
>   unsigned int n_requested_mmu_pages;
>   unsigned int n_max_mmu_pages;
>   unsigned int indirect_shadow_pages;
> + unsigned long mmu_valid_gen;
>   struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
>   /*
>* Hash table of struct kvm_mmu_page.
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 9ac584f..12129b7 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1732,6 +1732,11 @@ static struct rmap_operations invalid_rmap_ops = {
>   .rmap_unmap = kvm_unmap_invalid_rmapp
>  };
>  
> +static void init_invalid_memslot_rmap_ops(struct kvm_memory_slot *slot)
> +{
> + slot->arch.ops = &invalid_rmap_ops;
> +}
> +
>  typedef void (*handle_rmap_fun)(unsigned long *rmapp, void *data);
>  static void walk_memslot_rmap_nolock(struct kvm_memory_slot *slot,
>handle_rmap_fun fun, void *data)
> @@ -1812,6 +1817,65 @@ void free_meslot_rmap_desc_nolock(struct 
> kvm_memory_slot *slot)
>   walk_memslot_rmap_nolock(slot, free_rmap_desc_nolock, NULL);
>  }
>  
> +/*
> + * Fast invalidate all shadow pages that belong to @slot.
> + *
> + * @slot != NULL means the invalidation is caused by the deletion of the
> + * memslot specified by @slot. In this case, we must ensure that the rmap
> + * and lpage-info of @slot can not be used after calling the function.
> + * As a special case, if @slot is INVALID_ALL_SLOTS, all slots will be
> + * deleted soon; this always happens when kvm is being destroyed.
> + *
> + * @slot == NULL means the invalidation is due to other reasons. We need
> + * not care about rmap and lpage-info since they are still valid after
> + * calling the function.
> + */
> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> +struct kvm_memory_slot *slot)
> +{
> + struct kvm_memory_slot *each_slot;
> +
> + spin_lock(&kvm->mmu_lock);
> + kvm->arch.mmu_valid_gen++;
> +
> + if (slot == INVALID_ALL_SLOTS)
> + kvm_for_each_memslot(each_slot, kvm_memslots(kvm))
> + init_invalid_memslot_rmap_ops(each_slot);
> + else if (slot)
> + init_invalid_memslot_rmap_ops(slot);
> +
> + /*
> +  * All shadow pages are invalid, so reset the large page info;
> +  * then we can safely destroy the memslot. It is also good
> +  * for large page use.
> +  */
> + kvm_clear_all_lpage_info(kvm);
> +
> + /*
> +  * Notify all vcpus to reload their shadow page tables
> +  * and flush the TLB. All vcpus will then switch to a new
> +  * shadow page table with the new mmu_valid_gen.
> +  *
> +  * Note: we should do this under the protection of
> +  * mmu-lock; otherwise, a vcpu could purge shadow pages
> +  * but miss the tlb flush.
> +  */
> + 

[PATCH v3 12/15] KVM: MMU: fast invalid all shadow pages

2013-04-16 Thread Xiao Guangrong
The current kvm_mmu_zap_all is really slow: it holds mmu-lock while it
walks and zaps all shadow pages one by one, and it also needs to zap
every guest page's rmap and every shadow page's parent spte list. This
gets worse as the guest uses more memory or vcpus, so it does not
scale well.

This patch introduces a faster way to invalidate all shadow pages.
KVM maintains a global mmu generation number, stored in
kvm->arch.mmu_valid_gen, and every shadow page stores the current
global generation number in sp->mmu_valid_gen when it is created.

When KVM needs to zap all shadow page sptes, it simply increases the
global generation number and then reloads the root shadow pages on all
vcpus. Each vcpu creates a new shadow page table tagged with kvm's
current generation number, which ensures that the old pages are not
used any more.

The invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
are kept in the mmu cache until the page allocator reclaims them.

If the invalidation is due to a memslot change, the memslot's rmap and
lpage-info will be freed soon. To avoid using invalid memory, we unmap
all sptes on its rmap and always reset the lpage-info of all memslots,
so that the rmap and lpage-info can be safely freed.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h |2 +
 arch/x86/kvm/mmu.c  |   85 +-
 arch/x86/kvm/mmu.h  |4 ++
 arch/x86/kvm/x86.c  |6 +++
 4 files changed, 94 insertions(+), 3 deletions(-)
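
For readers who want to see the generation-number idea in isolation,
here is a minimal user-space sketch. It is not code from this patch:
struct toy_kvm, struct toy_sp and every toy_* function are hypothetical
stand-ins for struct kvm, struct kvm_mmu_page and the real MMU paths,
and locking and the vcpu reload are left out. Only the comparison in
toy_is_valid_sp() mirrors is_valid_sp() from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kvm: only the global generation. */
struct toy_kvm {
	unsigned long mmu_valid_gen;
};

/* Hypothetical stand-in for struct kvm_mmu_page. */
struct toy_sp {
	unsigned long mmu_valid_gen;	/* generation seen at creation */
};

/* A newly created shadow page records the current global generation. */
static void toy_sp_init(const struct toy_kvm *kvm, struct toy_sp *sp)
{
	sp->mmu_valid_gen = kvm->mmu_valid_gen;
}

/* Same comparison as is_valid_sp() in the patch, without likely(). */
static bool toy_is_valid_sp(const struct toy_kvm *kvm,
			    const struct toy_sp *sp)
{
	return sp->mmu_valid_gen == kvm->mmu_valid_gen;
}

/*
 * Invalidate every existing page in O(1): no page is walked or zapped.
 * The real function also holds mmu-lock and reloads the remote MMUs;
 * both are left out of this sketch.
 */
static void toy_invalidate_all(struct toy_kvm *kvm)
{
	kvm->mmu_valid_gen++;
}

/* Count the pages a lookup would still accept; stale ones are skipped. */
static size_t toy_count_valid(const struct toy_kvm *kvm,
			      const struct toy_sp *sps, size_t n)
{
	size_t i, valid = 0;

	assert(sps != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (toy_is_valid_sp(kvm, &sps[i]))
			valid++;
	return valid;
}
```

With three pages created at generation 0, toy_count_valid() accepts all
three. After one toy_invalidate_all() it accepts none, and only pages
created afterwards are accepted again. The cost of the invalidation
itself does not depend on the number of pages; the stale pages are left
to be reclaimed later.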

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1ad9a34..6f8ee18 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -223,6 +223,7 @@ struct kvm_mmu_page {
int root_count;  /* Currently serving as active root */
unsigned int unsync_children;
unsigned long parent_ptes;  /* Reverse mapping for parent_pte */
+   unsigned long mmu_valid_gen;
DECLARE_BITMAP(unsync_child_bitmap, 512);
 
 #ifdef CONFIG_X86_32
@@ -531,6 +532,7 @@ struct kvm_arch {
unsigned int n_requested_mmu_pages;
unsigned int n_max_mmu_pages;
unsigned int indirect_shadow_pages;
+   unsigned long mmu_valid_gen;
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
/*
 * Hash table of struct kvm_mmu_page.
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9ac584f..12129b7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1732,6 +1732,11 @@ static struct rmap_operations invalid_rmap_ops = {
.rmap_unmap = kvm_unmap_invalid_rmapp
 };
 
+static void init_invalid_memslot_rmap_ops(struct kvm_memory_slot *slot)
+{
+   slot->arch.ops = &invalid_rmap_ops;
+}
+
 typedef void (*handle_rmap_fun)(unsigned long *rmapp, void *data);
 static void walk_memslot_rmap_nolock(struct kvm_memory_slot *slot,
 handle_rmap_fun fun, void *data)
@@ -1812,6 +1817,65 @@ void free_meslot_rmap_desc_nolock(struct kvm_memory_slot *slot)
walk_memslot_rmap_nolock(slot, free_rmap_desc_nolock, NULL);
 }
 
+/*
+ * Fast invalidate all shadow pages that belong to @slot.
+ *
+ * @slot != NULL means the invalidation is caused by the deletion of the
+ * memslot specified by @slot. In this case, we must ensure that the rmap
+ * and lpage-info of @slot can not be used after calling the function.
+ * As a special case, if @slot is INVALID_ALL_SLOTS, all slots will be
+ * deleted soon; this always happens when kvm is being destroyed.
+ *
+ * @slot == NULL means the invalidation is due to other reasons. We need
+ * not care about rmap and lpage-info since they are still valid after
+ * calling the function.
+ */
+void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
+  struct kvm_memory_slot *slot)
+{
+   struct kvm_memory_slot *each_slot;
+
+   spin_lock(&kvm->mmu_lock);
+   kvm->arch.mmu_valid_gen++;
+
+   if (slot == INVALID_ALL_SLOTS)
+   kvm_for_each_memslot(each_slot, kvm_memslots(kvm))
+   init_invalid_memslot_rmap_ops(each_slot);
+   else if (slot)
+   init_invalid_memslot_rmap_ops(slot);
+
+   /*
+    * All shadow pages are invalid, so reset the large page info;
+    * then we can safely destroy the memslot. It is also good
+    * for large page use.
+    */
+   kvm_clear_all_lpage_info(kvm);
+
+   /*
+    * Notify all vcpus to reload their shadow page tables
+    * and flush the TLB. All vcpus will then switch to a new
+    * shadow page table with the new mmu_valid_gen.
+    *
+    * Note: we should do this under the protection of
+    * mmu-lock; otherwise, a vcpu could purge shadow pages
+    * but miss the tlb flush.
+    */
+   kvm_reload_remote_mmus(kvm);
+   spin_unlock(&kvm->mmu_lock);
+
+   if (slot == INVALID_ALL_SLOTS)
+   kvm_for_each_memslot(each_slot, kvm_memslots(kvm))
+   
