Re: [PATCH] KVM: Defer remote tlb flushes on invlpg (v4)

2009-04-20 Thread Andrea Arcangeli
On Sun, Apr 19, 2009 at 02:54:28PM -0300, Marcelo Tosatti wrote:
 I'm fine with your kvm_flush_local_tlb. Just one minor nit:
 
 +	/* get new asid before returning to guest mode */
 +	if (!test_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests))
 +		set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests);
 
 What's the test_bit for?

To avoid a write in case it was already set... but thinking about it
twice, I guess the probability that it's already set is near zero, so
I'll remove the test and just do set_bit.
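
For reference, a minimal sketch of what the simplified helper body
could look like (illustrative only, not the exact v4 code):

	/*
	 * set_bit is atomic already, and the bit is almost never set
	 * when we get here, so the test_bit pre-check buys nothing.
	 */
	/* get new asid before returning to guest mode */
	set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests);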

 It was nice to hide explicit knowledge about
 vcpu->kvm->remote_tlbs_dirty behind the interface instead of exposing
 it.

Hmm ok, if you prefer it I'll add it back. I guess ..._tlb_dirty_cond
is a better name, so it's clear it's not just checking the cond but
the dirty flag too.
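
A sketch of what such a wrapper could look like (the name and exact
shape are my assumption, inferred from the hunks later in the thread):

	/*
	 * Flush remote TLBs when the caller's condition holds, or when
	 * an earlier invlpg deferred its flush via remote_tlbs_dirty.
	 * kvm_flush_remote_tlbs itself clears the dirty flag.
	 */
	static void kvm_flush_remote_tlbs_dirty_cond(struct kvm *kvm, bool cond)
	{
		if (cond || kvm->remote_tlbs_dirty)
			kvm_flush_remote_tlbs(kvm);
	}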


Re: [PATCH] KVM: Defer remote tlb flushes on invlpg (v4)

2009-04-19 Thread Marcelo Tosatti
On Sat, Apr 18, 2009 at 05:34:27PM +0200, Andrea Arcangeli wrote:
 On Sun, Apr 12, 2009 at 07:31:58PM -0300, Marcelo Tosatti wrote:
  mmu_sync_children needs to find any _reachable_ sptes that are unsync,
  read the guest pagetable, and sync the sptes. Before it can inspect the
  guest pagetable, it needs to write protect it, with rmap_write_protect.
 
 So far so good.
 
  In theory it wouldn't be necessary to call
  kvm_flush_remote_tlbs_cond(protected) here, but only
  kvm_flush_remote_tlbs(), since the kvm->remote_tlbs_dirty information
  is not pertinent to mmu_sync_children.
 
 Hmm I'm not sure I fully understand how it is not pertinent.
 
 When we have a cr3 write in a remote vcpu, mmu_sync_children runs and
 it resyncs all sptes reachable from that remote vcpu context. But to
 resync the sptes, it also has to get rid of any old writable tlb entry
 for the remote vcpu where the cr3 write is running. Checking only the
 sptes to find writable ones, updating the pages mapped by those
 writable sptes, and marking them wrprotected, isn't enough: old spte
 contents may still be cached in the tlb if remote_tlbs_dirty is true!

 Think of the case where the invlpg'd spte that got nuked by rmap_remove
 in the invlpg handler running on the current vcpu still has a writable
 tlb entry active in the vcpu that later does the cr3 overwrite.
 mmu_sync_children running on the remote vcpu will find no writable spte
 in the rmap chain representing that spte (because the spte, while still
 cached in the remote tlb, has already been zapped by the current vcpu),
 but it is still cached and writable in the remote vcpu TLB when the cr3
 overwrite runs.

Right, so you have to cope with the fact that invlpg can skip a TLB
flush. OK.

 But this is done here to clear remote_tlbs_dirty (after a
 kvm_flush_remote_tlbs, remote_tlbs_dirty is clean), ie: as an
 optimization.
 
 The whole point of remote_tlbs_dirty is to defer smp tlb flushes as
 long as possible, so how can it be an optimization to run a
 kvm_flush_remote_tlbs earlier than necessary? The only way this can be
 an optimization is to never run kvm_flush_remote_tlbs unless
 absolutely necessary, and to leave remote_tlbs_dirty true instead of
 calling kvm_flush_remote_tlbs. Calling kvm_flush_remote_tlbs_cond
 instead of kvm_flush_remote_tlbs can never be an optimization.

Right.

 @@ -465,7 +464,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 			rmap_remove(vcpu->kvm, sptep);
 			if (is_large_pte(*sptep))
 				--vcpu->kvm->stat.lpages;
 -			need_flush = 1;
 +			vcpu->kvm->remote_tlbs_dirty = true;
 		}
 		set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
 		break;
 @@ -475,8 +474,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 		break;
 	}
 -	if (need_flush)
 -		kvm_flush_remote_tlbs(vcpu->kvm);
 	spin_unlock(&vcpu->kvm->mmu_lock);
   
  AFAIK to be compliant with lowlevel archs (without ASN it doesn't
  matter I think, as vmx always flushes on exit), we have to flush the
  local tlb here, with set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests). I
  don't see why it's missing. Or am I wrong?
  
  Caller does it:
  
  void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
  {
  	vcpu->arch.mmu.invlpg(vcpu, gva);
  	kvm_mmu_flush_tlb(vcpu);
  	++vcpu->stat.invlpg;
  }
 
 Ah great! I recall Avi also mentioned it, but I didn't realize it runs
 after FNAME(invlpg) returns. But isn't it always more efficient to
 set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests) instead, like I'm doing?

Sure that works too.

 See my version of kvm_flush_local_tlb; it's a bit different from
 kvm_mmu_flush_tlb, and I made it safe on old kernels without mmu
 notifiers too, which I think is worth it. If you'd rather keep my
 version of kvm_flush_local_tlb, then I simply have to remove the
 kvm_mmu_flush_tlb call from kvm_mmu_invlpg and move it inside the
 arch.mmu.invlpg handler, so each shadow implementation does it its
 own way. Comments?

I'm fine with your kvm_flush_local_tlb. Just one minor nit:

+	/* get new asid before returning to guest mode */
+	if (!test_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests))
+		set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests);

What's the test_bit for?

 If you disagree, and you want to run kvm_mmu_flush_tlb in
 kvm_mmu_invlpg instead of set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests),
 then I can expand the kvm_flush_local_tlb logic with the smp_wmb() in
 the FNAME(invlpg) code. Like:
 
 	if (need_flush) {
 		smp_wmb();
 		remote_tlbs_dirty = true;
 	}
 	spin_unlock(&mmu_lock);
 
 Then the local tlb flush will run when we return from FNAME(invlpg)
 and remote_tlbs_dirty is set _after_ set_shadow_pte and _before_
 releasing mmu_lock, making it still safe against
 mmu_notifier_invalidate_page/range.

Re: [PATCH] KVM: Defer remote tlb flushes on invlpg (v4)

2009-04-18 Thread Andrea Arcangeli
On Sun, Apr 12, 2009 at 07:31:58PM -0300, Marcelo Tosatti wrote:
 mmu_sync_children needs to find any _reachable_ sptes that are unsync,
 read the guest pagetable, and sync the sptes. Before it can inspect the
 guest pagetable, it needs to write protect it, with rmap_write_protect.

So far so good.

 In theory it wouldn't be necessary to call
 kvm_flush_remote_tlbs_cond(protected) here, but only
 kvm_flush_remote_tlbs(), since the kvm->remote_tlbs_dirty information
 is not pertinent to mmu_sync_children.

Hmm I'm not sure I fully understand how it is not pertinent.

When we have a cr3 write in a remote vcpu, mmu_sync_children runs and
it resyncs all sptes reachable from that remote vcpu context. But to
resync the sptes, it also has to get rid of any old writable tlb entry
for the remote vcpu where the cr3 write is running. Checking only the
sptes to find writable ones, updating the pages mapped by those
writable sptes, and marking them wrprotected, isn't enough: old spte
contents may still be cached in the tlb if remote_tlbs_dirty is true!

Think of the case where the invlpg'd spte that got nuked by rmap_remove
in the invlpg handler running on the current vcpu still has a writable
tlb entry active in the vcpu that later does the cr3 overwrite.
mmu_sync_children running on the remote vcpu will find no writable spte
in the rmap chain representing that spte (because the spte, while still
cached in the remote tlb, has already been zapped by the current vcpu),
but it is still cached and writable in the remote vcpu TLB when the cr3
overwrite runs.

 But this is done here to clear remote_tlbs_dirty (after a
 kvm_flush_remote_tlbs, remote_tlbs_dirty is clean), ie: as an
 optimization.

The whole point of remote_tlbs_dirty is to defer smp tlb flushes as
long as possible, so how can it be an optimization to run a
kvm_flush_remote_tlbs earlier than necessary? The only way this can be
an optimization is to never run kvm_flush_remote_tlbs unless
absolutely necessary, and to leave remote_tlbs_dirty true instead of
calling kvm_flush_remote_tlbs. Calling kvm_flush_remote_tlbs_cond
instead of kvm_flush_remote_tlbs can never be an optimization.
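
Schematically, the two variants being compared (reconstructed from the
hunks in this thread, for illustration):

	/* (a) plain: flush only when this path itself needs it */
	if (protected)
		kvm_flush_remote_tlbs(vcpu->kvm);

	/* (b) _cond: additionally flush (and thereby clear) a flush
	 * deferred by invlpg, even when this path alone wouldn't
	 * need one */
	if (protected || vcpu->kvm->remote_tlbs_dirty)
		kvm_flush_remote_tlbs(vcpu->kvm);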

@@ -465,7 +464,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 			rmap_remove(vcpu->kvm, sptep);
 			if (is_large_pte(*sptep))
 				--vcpu->kvm->stat.lpages;
-			need_flush = 1;
+			vcpu->kvm->remote_tlbs_dirty = true;
 		}
 		set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
 		break;
@@ -475,8 +474,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 		break;
 	}
-	if (need_flush)
-		kvm_flush_remote_tlbs(vcpu->kvm);
 	spin_unlock(&vcpu->kvm->mmu_lock);
  
 AFAIK to be compliant with lowlevel archs (without ASN it doesn't
 matter I think, as vmx always flushes on exit), we have to flush the
 local tlb here, with set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests). I
 don't see why it's missing. Or am I wrong?
 
 Caller does it:
 
 void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
 {
 	vcpu->arch.mmu.invlpg(vcpu, gva);
 	kvm_mmu_flush_tlb(vcpu);
 	++vcpu->stat.invlpg;
 }

Ah great! I recall Avi also mentioned it, but I didn't realize it runs
after FNAME(invlpg) returns. But isn't it always more efficient to
set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests) instead, like I'm doing?

See my version of kvm_flush_local_tlb; it's a bit different from
kvm_mmu_flush_tlb, and I made it safe on old kernels without mmu
notifiers too, which I think is worth it. If you'd rather keep my
version of kvm_flush_local_tlb, then I simply have to remove the
kvm_mmu_flush_tlb call from kvm_mmu_invlpg and move it inside the
arch.mmu.invlpg handler, so each shadow implementation does it its
own way. Comments?
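
A sketch of roughly the shape meant for kvm_flush_local_tlb (an
illustration under assumptions, not the exact patch; in particular the
no-mmu-notifier fallback is a guess at why it differs from
kvm_mmu_flush_tlb):

	static void kvm_flush_local_tlb(struct kvm_vcpu *vcpu)
	{
		/* get new asid before returning to guest mode */
		set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests);
	#ifdef CONFIG_MMU_NOTIFIER
		/* order the spte zap before publishing the deferral */
		smp_wmb();
		vcpu->kvm->remote_tlbs_dirty = true;
	#else
		/* without mmu notifiers nothing re-checks the dirty
		 * flag before pages can go away: flush remotely now */
		kvm_flush_remote_tlbs(vcpu->kvm);
	#endif
	}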

If you disagree, and you want to run kvm_mmu_flush_tlb in
kvm_mmu_invlpg instead of set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests),
then I can expand the kvm_flush_local_tlb logic with the smp_wmb() in
the FNAME(invlpg) code. Like:

	if (need_flush) {
		smp_wmb();
		remote_tlbs_dirty = true;
	}
	spin_unlock(&mmu_lock);

Then the local tlb flush will run when we return from FNAME(invlpg)
and remote_tlbs_dirty is set _after_ set_shadow_pte and _before_
releasing mmu_lock, making it still safe against
mmu_notifier_invalidate_page/range.

 What about protecting remote_tlbs_dirty with mmu_lock? The only caller
 of kvm_flush_remote_tlbs that lacks mmu_notifier is kvm_mmu_zap_all,
 which is not performance sensitive anyway.

I have to say I thought of that too, and tried it a bit; then I
figured out why Avi wanted to do it out of order. The reason is that
we want to allow kvm_flush_remote_tlbs to run outside the mmu_lock
too. So this

Re: [PATCH] KVM: Defer remote tlb flushes on invlpg (v4)

2009-04-12 Thread Marcelo Tosatti
Hi Andrea,

On Sat, Apr 11, 2009 at 06:48:54PM +0200, Andrea Arcangeli wrote:
 On Sun, Mar 29, 2009 at 01:36:01PM +0300, Avi Kivity wrote:
  Marcelo, Andrea?
 
 Had to read the code a bit more to understand the reason for the
 unsync_mmu flush on cr3 overwrite.
 
  Avi Kivity wrote:
  diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
  index 2a36f7f..f0ea56c 100644
  --- a/arch/x86/kvm/mmu.c
  +++ b/arch/x86/kvm/mmu.c
  @@ -1184,8 +1184,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
  		for_each_sp(pages, sp, parents, i)
  			protected |= rmap_write_protect(vcpu->kvm, sp->gfn);
  
  -		if (protected)
  -			kvm_flush_remote_tlbs(vcpu->kvm);
  +		kvm_flush_remote_tlbs_cond(vcpu->kvm, protected);
  
  		for_each_sp(pages, sp, parents, i) {
  			kvm_sync_page(vcpu, sp);
 
 Ok so because we didn't flush the tlb on the other vcpus when invlpg
 ran, if a cr3 overwrite needs to re-sync sptes, wrprotecting them, we
 have to flush the tlb on all vcpus to be sure the possibly writable
 tlb entry reflecting the old writable spte, instantiated before invlpg
 ran, is removed from the physical cpus. We wouldn't find it in
 for_each_sp because it was rmap_removed, but we'll find something in
 mmu_unsync_walk (right? we definitely have to find something in
 mmu_unsync_walk for this to work: the parent sp has to leave
 child->unsync set even after rmap_remove ran in invlpg without
 flushing the other vcpus' tlbs).

mmu_sync_children needs to find any _reachable_ sptes that are unsync,
read the guest pagetable, and sync the sptes. Before it can inspect the
guest pagetable, it needs to write protect it, with rmap_write_protect.

In theory it wouldn't be necessary to call
kvm_flush_remote_tlbs_cond(protected) here, but only
kvm_flush_remote_tlbs(), since the kvm->remote_tlbs_dirty information
is not pertinent to mmu_sync_children.

But this is done here to clear remote_tlbs_dirty (after a
kvm_flush_remote_tlbs, remote_tlbs_dirty is clean), ie: as an
optimization.

  @@ -465,7 +464,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
  			rmap_remove(vcpu->kvm, sptep);
  			if (is_large_pte(*sptep))
  				--vcpu->kvm->stat.lpages;
  -			need_flush = 1;
  +			vcpu->kvm->remote_tlbs_dirty = true;
  		}
  		set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
  		break;
  @@ -475,8 +474,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
  		break;
  	}
  -	if (need_flush)
  -		kvm_flush_remote_tlbs(vcpu->kvm);
  	spin_unlock(&vcpu->kvm->mmu_lock);
 
 AFAIK to be compliant with lowlevel archs (without ASN it doesn't
 matter I think, as vmx always flushes on exit), we have to flush the
 local tlb here, with set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests). I
 don't see why it's missing. Or am I wrong?

Caller does it:

void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
{
	vcpu->arch.mmu.invlpg(vcpu, gva);
	kvm_mmu_flush_tlb(vcpu);
	++vcpu->stat.invlpg;
}

  diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
  index 68b217e..12afa50 100644
  --- a/virt/kvm/kvm_main.c
  +++ b/virt/kvm/kvm_main.c
  @@ -758,10 +758,18 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
  
   void kvm_flush_remote_tlbs(struct kvm *kvm)
   {
  +	kvm->remote_tlbs_dirty = false;
  +	smp_wmb();
 
 Still no lock prefix on the asm insn, and here it runs outside the
 mmu_lock, but ok, I tend to agree smp_wmb should be enough to be sure
 the write is fully finished by the time smp_wmb returns. There's
 another problem though.
 
 CPU0                          CPU1
 ----                          ----
 remote_tlbs_dirty = false
                               remote_tlbs_dirty = true
 smp_tlb_flush
                               set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
 
 The flush for the sptep will be lost.

What about protecting remote_tlbs_dirty with mmu_lock? The only caller
of kvm_flush_remote_tlbs that lacks mmu_notifier is kvm_mmu_zap_all,
which is not performance sensitive anyway.
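
Schematically, that suggestion might look like this (an assumption
about the shape, not code from the thread; make_all_cpus_request is
what kvm_flush_remote_tlbs uses internally, per the hunk below):

	spin_lock(&kvm->mmu_lock);
	kvm->remote_tlbs_dirty = false;
	/* a concurrent invlpg now cannot set the flag between the
	 * clear and the flush, closing the race diagrammed above */
	make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH);
	spin_unlock(&kvm->mmu_lock);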

  @@ -907,8 +913,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
  	young = kvm_age_hva(kvm, address);
  	spin_unlock(&kvm->mmu_lock);
  
  -	if (young)
  -		kvm_flush_remote_tlbs(kvm);
  +	kvm_flush_remote_tlbs_cond(kvm, young);
  
  	return young;
   }
 
 No need to flush for clear_flush_young method, pages can't be freed
 there.
 
 I mangled the patch a bit, plus fixed the above smp race; let me know
 what you think.
 
 The best workload to exercise this is running a VM with lots of VCPUs
 and 8G of ram with a 32bit guest kernel, and then just malloc and
 touch a byte for each 4096-byte page allocated by malloc. That will
 run a flood of invlpg. Then push the system to swap. while :; 

[PATCH] KVM: Defer remote tlb flushes on invlpg (v4)

2009-04-11 Thread Andrea Arcangeli
On Sun, Mar 29, 2009 at 01:36:01PM +0300, Avi Kivity wrote:
 Marcelo, Andrea?

Had to read the code a bit more to understand the reason for the
unsync_mmu flush on cr3 overwrite.

 Avi Kivity wrote:
 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 2a36f7f..f0ea56c 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -1184,8 +1184,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 		for_each_sp(pages, sp, parents, i)
 			protected |= rmap_write_protect(vcpu->kvm, sp->gfn);
 
 -		if (protected)
 -			kvm_flush_remote_tlbs(vcpu->kvm);
 +		kvm_flush_remote_tlbs_cond(vcpu->kvm, protected);
 
 		for_each_sp(pages, sp, parents, i) {
 			kvm_sync_page(vcpu, sp);

Ok so because we didn't flush the tlb on the other vcpus when invlpg
ran, if a cr3 overwrite needs to re-sync sptes, wrprotecting them, we
have to flush the tlb on all vcpus to be sure the possibly writable
tlb entry reflecting the old writable spte, instantiated before invlpg
ran, is removed from the physical cpus. We wouldn't find it in
for_each_sp because it was rmap_removed, but we'll find something in
mmu_unsync_walk (right? we definitely have to find something in
mmu_unsync_walk for this to work: the parent sp has to leave
child->unsync set even after rmap_remove ran in invlpg without
flushing the other vcpus' tlbs).

 @@ -465,7 +464,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 			rmap_remove(vcpu->kvm, sptep);
 			if (is_large_pte(*sptep))
 				--vcpu->kvm->stat.lpages;
 -			need_flush = 1;
 +			vcpu->kvm->remote_tlbs_dirty = true;
 		}
 		set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
 		break;
 @@ -475,8 +474,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 		break;
 	}
 -	if (need_flush)
 -		kvm_flush_remote_tlbs(vcpu->kvm);
 	spin_unlock(&vcpu->kvm->mmu_lock);

AFAIK to be compliant with lowlevel archs (without ASN it doesn't
matter I think, as vmx always flushes on exit), we have to flush the
local tlb here, with set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests). I
don't see why it's missing. Or am I wrong?

 diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
 index 68b217e..12afa50 100644
 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -758,10 +758,18 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 
  void kvm_flush_remote_tlbs(struct kvm *kvm)
  {
 +	kvm->remote_tlbs_dirty = false;
 +	smp_wmb();

Still no lock prefix on the asm insn, and here it runs outside the
mmu_lock, but ok, I tend to agree smp_wmb should be enough to be sure
the write is fully finished by the time smp_wmb returns. There's
another problem though.

CPU0                          CPU1
----                          ----
remote_tlbs_dirty = false
                              remote_tlbs_dirty = true
smp_tlb_flush
                              set_shadow_pte(sptep, shadow_trap_nonpresent_pte);

The flush for the sptep will be lost.
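
(The fix discussed in the follow-ups above boils down to this ordering
on the invlpg side — sketched for illustration, assuming the mmu
notifier invalidate paths take mmu_lock before flushing:)

	/* zap the spte first, then publish the deferred flush; both
	 * happen before mmu_lock is released, so an invalidate that
	 * takes mmu_lock afterwards cannot miss the dirty flag */
	set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
	smp_wmb();
	vcpu->kvm->remote_tlbs_dirty = true;
	spin_unlock(&vcpu->kvm->mmu_lock);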

 @@ -907,8 +913,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 	young = kvm_age_hva(kvm, address);
 	spin_unlock(&kvm->mmu_lock);
 
 -	if (young)
 -		kvm_flush_remote_tlbs(kvm);
 +	kvm_flush_remote_tlbs_cond(kvm, young);
 
 	return young;
  }

No need to flush for clear_flush_young method, pages can't be freed
there.

I mangled the patch a bit, plus fixed the above smp race; let me know
what you think.

The best workload to exercise this is running a VM with lots of VCPUs
and 8G of ram with a 32bit guest kernel, and then just malloc and
touch a byte for each 4096-byte page allocated by malloc. That will
run a flood of invlpg. Then push the system to swap. while :; do cp
/dev/hda /dev/null; done also works without O_DIRECT, so the host
cache makes it fast on the second run (not so much faster with host
swapping though).
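
(A trivial userland sketch of the malloc-and-touch half of that
workload — purely illustrative, the chunk size is arbitrary:)

	#include <stdlib.h>

	#define PAGE_SIZE 4096UL
	#define CHUNK (64UL << 20)	/* 64M per malloc, arbitrary */

	int main(void)
	{
		char *p;

		/* keep allocating and touching one byte per page until
		 * the guest runs out of memory and starts swapping */
		while ((p = malloc(CHUNK)) != NULL)
			for (unsigned long i = 0; i < CHUNK; i += PAGE_SIZE)
				p[i] = 1;
		return 0;
	}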

I've only tested it so far with 12 VMs on swap with 64bit kernels with
heavy I/O, so it's not a good test, as I doubt any invlpg runs; not
even munmap(addr, 4k) uses invlpg.

Signed-off-by: Andrea Arcangeli aarca...@redhat.com
---

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d5bdf3a..900bc31 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1185,7 +1185,11 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 	for_each_sp(pages, sp, parents, i)
 		protected |= rmap_write_protect(vcpu->kvm, sp->gfn);
 
-	if (protected)
+	/*
+	 * Avoid leaving stale tlb entries in this vcpu representing
+	 * sptes rmap_removed by invlpg without vcpu smp tlb flush.
+	 */
+	if (protected || vcpu->kvm->remote_tlbs_dirty)