Re: [PATCH] KVM: Defer remote tlb flushes on invlpg (v4)
On Sun, Apr 19, 2009 at 02:54:28PM -0300, Marcelo Tosatti wrote:
> I'm fine with your kvm_flush_local_tlb. Just one minor nit:
>
> +	/* get new asid before returning to guest mode */
> +	if (!test_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests))
> +		set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests);
>
> What's the test_bit for?

To avoid a write in case it was already set... but thinking twice, I
guess the probability that it's already set is near zero, so I'll
remove it and just do set_bit.

> It was nice to hide explicit knowledge about
> vcpu->kvm->remote_tlbs_dirty behind the interface instead of
> exposing it.

Hmm, ok, if you prefer it I'll add it back. I guess ..._tlb_dirty_cond
is a better name, so it's clear it's not just checking the cond but
the dirty flag too.
Re: [PATCH] KVM: Defer remote tlb flushes on invlpg (v4)
On Sat, Apr 18, 2009 at 05:34:27PM +0200, Andrea Arcangeli wrote:
> On Sun, Apr 12, 2009 at 07:31:58PM -0300, Marcelo Tosatti wrote:
> > mmu_sync_children needs to find any _reachable_ sptes that are
> > unsync, read the guest pagetable, and sync the sptes. Before it can
> > inspect the guest pagetable, it needs to write protect it, with
> > rmap_write_protect.
>
> So far so good.
>
> > In theory it wouldn't be necessary to call
> > kvm_flush_remote_tlbs_cond(protected) here, but only
> > kvm_flush_remote_tlbs(), since the kvm->remote_tlbs_dirty
> > information is not pertinent to mmu_sync_children.
>
> Hmm, I'm not sure I fully understand how it is not pertinent. When we
> have a cr3 write in a remote vcpu, mmu_sync_children runs and it
> resyncs all sptes reachable from that remote vcpu context. But to
> resync the sptes, it also has to get rid of any old writable tlb
> entry for the remote vcpu where the cr3 write is running. Checking
> only sptes to find writable ones, updating the sptes that are mapped
> by the writable sptes, and marking them wrprotected, isn't enough, as
> old spte contents may still be cached in the tlb if remote_tlbs_dirty
> is true! Think if the invlpg'd spte that got nuked by rmap_remove in
> the invlpg handler running in the current vcpu has a writable tlb
> entry active in the vcpu that later does the cr3 overwrite.
> mmu_sync_children running in the remote vcpu will find no writable
> spte in the rmap chain representing that spte (because the spte that
> is still cached in the remote tlb has already been zapped by the
> current vcpu), but it is still cached and writable in the remote vcpu
> TLB cache when the cr3 overwrite runs.

Right, so you have to cope with the fact that invlpg can skip a TLB
flush. OK.

> > But this is done here to clear remote_tlbs_dirty (after a
> > kvm_flush_remote_tlbs, remote_tlbs_dirty is clean), ie: as an
> > optimization.
>
> The whole point of remote_tlbs_dirty is to defer any smp tlb flush to
> the latest possible time, so how can it be an optimization to run a
> kvm_flush_remote_tlbs earlier than necessary? The only way this can
> be an optimization is to never run kvm_flush_remote_tlbs unless
> absolutely necessary, and to leave remote_tlbs_dirty true instead of
> calling kvm_flush_remote_tlbs. Calling kvm_flush_remote_tlbs_cond
> instead of kvm_flush_remote_tlbs cannot ever be an optimization.

Right.

> > > @@ -465,7 +464,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
> > >  			rmap_remove(vcpu->kvm, sptep);
> > >  			if (is_large_pte(*sptep))
> > >  				--vcpu->kvm->stat.lpages;
> > > -			need_flush = 1;
> > > +			vcpu->kvm->remote_tlbs_dirty = true;
> > >  		}
> > >  		set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
> > >  		break;
> > > @@ -475,8 +474,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
> > >  		break;
> > >  	}
> > >  
> > > -	if (need_flush)
> > > -		kvm_flush_remote_tlbs(vcpu->kvm);
> > >  	spin_unlock(&vcpu->kvm->mmu_lock);
> > >
> > > AFAIK to be compliant with lowlevel archs (without ASN it doesn't
> > > matter I think, as vmx always flushes on exit), we have to flush
> > > the local tlb here, with set_bit(KVM_REQ_TLB_FLUSH,
> > > &vcpu->requests). I don't see why it's missing. Or am I wrong?
> >
> > Caller does it:
> >
> > void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
> > {
> > 	vcpu->arch.mmu.invlpg(vcpu, gva);
> > 	kvm_mmu_flush_tlb(vcpu);
> > 	++vcpu->stat.invlpg;
> > }
>
> Ah great! Avi also mentioned it, I recall, but I didn't figure out it
> was after FNAME(invlpg) returns. But isn't it always more efficient
> to set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests) instead, like I'm
> doing?

Sure, that works too.

> See my version of kvm_flush_local_tlb; it's a bit different from
> kvm_mmu_flush_tlb, and I made the old no-mmu-notifier kernels safe
> too, which I think is worth it. If you like to keep my version of
> kvm_flush_local_tlb, then I simply have to remove the
> kvm_mmu_flush_tlb from kvm_mmu_invlpg and move it inside the
> arch.mmu.invlpg handler, so each shadow implementation does it its
> own way. Comments?

I'm fine with your kvm_flush_local_tlb. Just one minor nit:

+	/* get new asid before returning to guest mode */
+	if (!test_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests))
+		set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests);

What's the test_bit for?

> If you disagree, and you want to run kvm_mmu_flush_tlb in
> kvm_mmu_invlpg instead of set_bit(KVM_REQ_TLB_FLUSH,
> &vcpu->requests), then I can expand the kvm_flush_local_tlb with the
> smp_wmb() in the FNAME(invlpg) code. Like:
>
> 	if (need_flush) {
> 		smp_wmb();
> 		remote_tlbs_dirty = true;
> 	}
> 	spin_unlock(&mmu_lock);
>
> Then the local tlb flush will run when we return from FNAME(invlpg),
> and remote_tlbs_dirty is set _after_ set_shadow_pte and _before_
> releasing mmu_lock, making it still safe against
> mmu_notifier_invalidate_page/range.
Re: [PATCH] KVM: Defer remote tlb flushes on invlpg (v4)
On Sun, Apr 12, 2009 at 07:31:58PM -0300, Marcelo Tosatti wrote:
> mmu_sync_children needs to find any _reachable_ sptes that are
> unsync, read the guest pagetable, and sync the sptes. Before it can
> inspect the guest pagetable, it needs to write protect it, with
> rmap_write_protect.

So far so good.

> In theory it wouldn't be necessary to call
> kvm_flush_remote_tlbs_cond(protected) here, but only
> kvm_flush_remote_tlbs(), since the kvm->remote_tlbs_dirty
> information is not pertinent to mmu_sync_children.

Hmm, I'm not sure I fully understand how it is not pertinent. When we
have a cr3 write in a remote vcpu, mmu_sync_children runs and it
resyncs all sptes reachable from that remote vcpu context. But to
resync the sptes, it also has to get rid of any old writable tlb entry
for the remote vcpu where the cr3 write is running. Checking only
sptes to find writable ones, updating the sptes that are mapped by the
writable sptes, and marking them wrprotected, isn't enough, as old
spte contents may still be cached in the tlb if remote_tlbs_dirty is
true! Think if the invlpg'd spte that got nuked by rmap_remove in the
invlpg handler running in the current vcpu has a writable tlb entry
active in the vcpu that later does the cr3 overwrite.
mmu_sync_children running in the remote vcpu will find no writable
spte in the rmap chain representing that spte (because the spte that
is still cached in the remote tlb has already been zapped by the
current vcpu), but it is still cached and writable in the remote vcpu
TLB cache when the cr3 overwrite runs.

> But this is done here to clear remote_tlbs_dirty (after a
> kvm_flush_remote_tlbs, remote_tlbs_dirty is clean), ie: as an
> optimization.

The whole point of remote_tlbs_dirty is to defer any smp tlb flush to
the latest possible time, so how can it be an optimization to run a
kvm_flush_remote_tlbs earlier than necessary? The only way this can be
an optimization is to never run kvm_flush_remote_tlbs unless
absolutely necessary, and to leave remote_tlbs_dirty true instead of
calling kvm_flush_remote_tlbs. Calling kvm_flush_remote_tlbs_cond
instead of kvm_flush_remote_tlbs cannot ever be an optimization.

> > @@ -465,7 +464,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
> >  			rmap_remove(vcpu->kvm, sptep);
> >  			if (is_large_pte(*sptep))
> >  				--vcpu->kvm->stat.lpages;
> > -			need_flush = 1;
> > +			vcpu->kvm->remote_tlbs_dirty = true;
> >  		}
> >  		set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
> >  		break;
> > @@ -475,8 +474,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
> >  		break;
> >  	}
> >  
> > -	if (need_flush)
> > -		kvm_flush_remote_tlbs(vcpu->kvm);
> >  	spin_unlock(&vcpu->kvm->mmu_lock);
> >
> > AFAIK to be compliant with lowlevel archs (without ASN it doesn't
> > matter I think, as vmx always flushes on exit), we have to flush
> > the local tlb here, with set_bit(KVM_REQ_TLB_FLUSH,
> > &vcpu->requests). I don't see why it's missing. Or am I wrong?
>
> Caller does it:
>
> void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
> {
> 	vcpu->arch.mmu.invlpg(vcpu, gva);
> 	kvm_mmu_flush_tlb(vcpu);
> 	++vcpu->stat.invlpg;
> }

Ah great! Avi also mentioned it, I recall, but I didn't figure out it
was after FNAME(invlpg) returns. But isn't it always more efficient to
set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests) instead, like I'm doing?

See my version of kvm_flush_local_tlb; it's a bit different from
kvm_mmu_flush_tlb, and I made the old no-mmu-notifier kernels safe
too, which I think is worth it. If you like to keep my version of
kvm_flush_local_tlb, then I simply have to remove the
kvm_mmu_flush_tlb from kvm_mmu_invlpg and move it inside the
arch.mmu.invlpg handler, so each shadow implementation does it its own
way. Comments?

If you disagree, and you want to run kvm_mmu_flush_tlb in
kvm_mmu_invlpg instead of set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests),
then I can expand the kvm_flush_local_tlb with the smp_wmb() in the
FNAME(invlpg) code. Like:

	if (need_flush) {
		smp_wmb();
		remote_tlbs_dirty = true;
	}
	spin_unlock(&mmu_lock);

Then the local tlb flush will run when we return from FNAME(invlpg),
and remote_tlbs_dirty is set _after_ set_shadow_pte and _before_
releasing mmu_lock, making it still safe against
mmu_notifier_invalidate_page/range.

> What about protecting remote_tlbs_dirty with mmu_lock? Only caller
> of kvm_flush_remote_tlbs that lacks mmu_notifier is kvm_mmu_zap_all,
> which is not performance sensitive anyway.

I thought of it too, I have to say. I tried a bit too, then I figured
out why Avi wanted to do it out of order. The reason is that we want
to allow kvm_flush_remote_tlbs to run outside the mmu_lock too. So this
Re: [PATCH] KVM: Defer remote tlb flushes on invlpg (v4)
Hi Andrea,

On Sat, Apr 11, 2009 at 06:48:54PM +0200, Andrea Arcangeli wrote:
> On Sun, Mar 29, 2009 at 01:36:01PM +0300, Avi Kivity wrote:
> > Marcelo, Andrea?
>
> Had to read the code a bit more to understand the reason of the
> unsync_mmu flush in cr3 overwrite.
>
> Avi Kivity wrote:
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index 2a36f7f..f0ea56c 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -1184,8 +1184,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
> >  		for_each_sp(pages, sp, parents, i)
> >  			protected |= rmap_write_protect(vcpu->kvm, sp->gfn);
> >  
> > -		if (protected)
> > -			kvm_flush_remote_tlbs(vcpu->kvm);
> > +		kvm_flush_remote_tlbs_cond(vcpu->kvm, protected);
> >  
> >  		for_each_sp(pages, sp, parents, i) {
> >  			kvm_sync_page(vcpu, sp);
>
> Ok, so because we didn't flush the tlb on the other vcpus when invlpg
> ran, if a cr3 overwrite needs to re-sync sptes, wrprotecting them, we
> have to flush the tlb in all vcpus to be sure the possibly writable
> tlb entry reflecting the old writable spte instantiated before invlpg
> ran is removed from the physical cpus. We wouldn't find it in
> for_each_sp because it was rmap_removed, but we'll find something in
> mmu_unsync_walk (right? we definitely have to find something in
> mmu_unsync_walk for this to work; the parent sp has to leave
> child->unsync set even after rmap_remove ran in invlpg without
> flushing the other vcpus' tlbs).

mmu_sync_children needs to find any _reachable_ sptes that are unsync,
read the guest pagetable, and sync the sptes. Before it can inspect
the guest pagetable, it needs to write protect it, with
rmap_write_protect.

In theory it wouldn't be necessary to call
kvm_flush_remote_tlbs_cond(protected) here, but only
kvm_flush_remote_tlbs(), since the kvm->remote_tlbs_dirty information
is not pertinent to mmu_sync_children. But this is done here to clear
remote_tlbs_dirty (after a kvm_flush_remote_tlbs, remote_tlbs_dirty is
clean), ie: as an optimization.

> > @@ -465,7 +464,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
> >  			rmap_remove(vcpu->kvm, sptep);
> >  			if (is_large_pte(*sptep))
> >  				--vcpu->kvm->stat.lpages;
> > -			need_flush = 1;
> > +			vcpu->kvm->remote_tlbs_dirty = true;
> >  		}
> >  		set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
> >  		break;
> > @@ -475,8 +474,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
> >  		break;
> >  	}
> >  
> > -	if (need_flush)
> > -		kvm_flush_remote_tlbs(vcpu->kvm);
> >  	spin_unlock(&vcpu->kvm->mmu_lock);
>
> AFAIK to be compliant with lowlevel archs (without ASN it doesn't
> matter I think, as vmx always flushes on exit), we have to flush the
> local tlb here, with set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests). I
> don't see why it's missing. Or am I wrong?

Caller does it:

void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
{
	vcpu->arch.mmu.invlpg(vcpu, gva);
	kvm_mmu_flush_tlb(vcpu);
	++vcpu->stat.invlpg;
}

> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 68b217e..12afa50 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -758,10 +758,18 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
> >  void kvm_flush_remote_tlbs(struct kvm *kvm)
> >  {
> > +	kvm->remote_tlbs_dirty = false;
> > +	smp_wmb();
>
> Still no lock prefix to the asm insn, and here it runs outside the
> mmu_lock, but ok, I tend to agree smp_wmb should be enough to be sure
> the write is fully finished by the time smp_wmb returns.
>
> There's another problem though.
>
> 	CPU0					CPU1
> 	remote_tlbs_dirty = false
> 						remote_tlbs_dirty = true
> 	smp_tlb_flush
> 						set_shadow_pte(sptep, shadow_trap_nonpresent_pte)
>
> The flush for the sptep will be lost.

What about protecting remote_tlbs_dirty with mmu_lock? The only caller
of kvm_flush_remote_tlbs that lacks mmu_notifier is kvm_mmu_zap_all,
which is not performance sensitive anyway.
> > @@ -907,8 +913,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
> >  	young = kvm_age_hva(kvm, address);
> >  	spin_unlock(&kvm->mmu_lock);
> >  
> > -	if (young)
> > -		kvm_flush_remote_tlbs(kvm);
> > +	kvm_flush_remote_tlbs_cond(kvm, young);
> >  
> >  	return young;
> >  }
>
> No need to flush for the clear_flush_young method, pages can't be
> freed there.
>
> I mangled over the patch a bit, plus fixed the above smp race, let me
> know what you think.
>
> The best workload to exercise this is running a VM with lots of VCPUs
> and 8G of ram with a 32bit guest kernel, and then just malloc and
> touch a byte for each 4096-byte page allocated by malloc. That will
> run a flood of invlpg. Then push the system to swap. while :; do cp
> /dev/hda /dev/null; done
[PATCH] KVM: Defer remote tlb flushes on invlpg (v4)
On Sun, Mar 29, 2009 at 01:36:01PM +0300, Avi Kivity wrote:
> Marcelo, Andrea?

Had to read the code a bit more to understand the reason of the
unsync_mmu flush in cr3 overwrite.

Avi Kivity wrote:
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 2a36f7f..f0ea56c 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1184,8 +1184,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
>  		for_each_sp(pages, sp, parents, i)
>  			protected |= rmap_write_protect(vcpu->kvm, sp->gfn);
>  
> -		if (protected)
> -			kvm_flush_remote_tlbs(vcpu->kvm);
> +		kvm_flush_remote_tlbs_cond(vcpu->kvm, protected);
>  
>  		for_each_sp(pages, sp, parents, i) {
>  			kvm_sync_page(vcpu, sp);

Ok, so because we didn't flush the tlb on the other vcpus when invlpg
ran, if a cr3 overwrite needs to re-sync sptes, wrprotecting them, we
have to flush the tlb in all vcpus to be sure the possibly writable
tlb entry reflecting the old writable spte instantiated before invlpg
ran is removed from the physical cpus. We wouldn't find it in
for_each_sp because it was rmap_removed, but we'll find something in
mmu_unsync_walk (right? we definitely have to find something in
mmu_unsync_walk for this to work; the parent sp has to leave
child->unsync set even after rmap_remove ran in invlpg without
flushing the other vcpus' tlbs).

> @@ -465,7 +464,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>  			rmap_remove(vcpu->kvm, sptep);
>  			if (is_large_pte(*sptep))
>  				--vcpu->kvm->stat.lpages;
> -			need_flush = 1;
> +			vcpu->kvm->remote_tlbs_dirty = true;
>  		}
>  		set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
>  		break;
> @@ -475,8 +474,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>  		break;
>  	}
>  
> -	if (need_flush)
> -		kvm_flush_remote_tlbs(vcpu->kvm);
>  	spin_unlock(&vcpu->kvm->mmu_lock);

AFAIK to be compliant with lowlevel archs (without ASN it doesn't
matter I think, as vmx always flushes on exit), we have to flush the
local tlb here, with set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests). I
don't see why it's missing. Or am I wrong?

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 68b217e..12afa50 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -758,10 +758,18 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
>  void kvm_flush_remote_tlbs(struct kvm *kvm)
>  {
> +	kvm->remote_tlbs_dirty = false;
> +	smp_wmb();

Still no lock prefix to the asm insn, and here it runs outside the
mmu_lock, but ok, I tend to agree smp_wmb should be enough to be sure
the write is fully finished by the time smp_wmb returns.

There's another problem though.

	CPU0					CPU1
	remote_tlbs_dirty = false
						remote_tlbs_dirty = true
	smp_tlb_flush
						set_shadow_pte(sptep, shadow_trap_nonpresent_pte)

The flush for the sptep will be lost.

> @@ -907,8 +913,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
>  	young = kvm_age_hva(kvm, address);
>  	spin_unlock(&kvm->mmu_lock);
>  
> -	if (young)
> -		kvm_flush_remote_tlbs(kvm);
> +	kvm_flush_remote_tlbs_cond(kvm, young);
>  
>  	return young;
>  }

No need to flush for the clear_flush_young method, pages can't be
freed there.

I mangled over the patch a bit, plus fixed the above smp race, let me
know what you think.

The best workload to exercise this is running a VM with lots of VCPUs
and 8G of ram with a 32bit guest kernel, and then just malloc and
touch a byte for each 4096-byte page allocated by malloc. That will
run a flood of invlpg. Then push the system to swap. while :; do cp
/dev/hda /dev/null; done also works without O_DIRECT, so the host
cache makes it fast at the second run (not so much faster with host
swapping though). I only tested it so far with 12 VMs on swap with
64bit kernels with heavy I/O, so it's not a good test, as I doubt any
invlpg runs; not even munmap(addr, 4k) uses invlpg.

Signed-off-by: Andrea Arcangeli aarca...@redhat.com
---

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d5bdf3a..900bc31 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1185,7 +1185,11 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 		for_each_sp(pages, sp, parents, i)
 			protected |= rmap_write_protect(vcpu->kvm, sp->gfn);
 
-		if (protected)
+		/*
+		 * Avoid leaving stale tlb entries in this vcpu representing
+		 * sptes rmap_removed by invlpg without vcpu smp tlb flush.
+		 */
+		if (protected || vcpu->kvm->remote_tlbs_dirty)