Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On 6/22/2011 3:21 AM, Chris Wright wrote:
> * Nai Xia (nai@gmail.com) wrote:
>> Introduced kvm_mmu_notifier_test_and_clear_dirty(),
>> kvm_mmu_notifier_dirty_update() and their mmu_notifier interfaces to
>> support KSM dirty bit tracking, which brings significant performance
>> gain in volatile pages scanning in KSM.
>> Currently, kvm_mmu_notifier_dirty_update() returns 0 if and only if
>> Intel EPT is enabled, to indicate that the dirty bits of the
>> underlying sptes are not updated by hardware.
>
> Did you test with each of EPT, NPT and shadow?
>
>> Signed-off-by: Nai Xia <nai@gmail.com>
>> Acked-by: Izik Eidus <izik.ei...@ravellosystems.com>
>> ---
>>  arch/x86/include/asm/kvm_host.h |    1 +
>>  arch/x86/kvm/mmu.c              |   36 +
>>  arch/x86/kvm/mmu.h              |    3 +-
>>  arch/x86/kvm/vmx.c              |    1 +
>>  include/linux/kvm_host.h        |    2 +-
>>  include/linux/mmu_notifier.h    |   48 +++
>>  mm/mmu_notifier.c               |   33 ++
>>  virt/kvm/kvm_main.c             |   27 ++
>>  8 files changed, 149 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index d2ac8e2..f0d7aa0 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -848,6 +848,7 @@ extern bool kvm_rebooting;
>>  int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
>>  int kvm_age_hva(struct kvm *kvm, unsigned long hva);
>>  int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
>> +int kvm_test_and_clear_dirty_hva(struct kvm *kvm, unsigned long hva);
>>  void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
>>  int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
>>  int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index aee3862..a5a0c51 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -979,6 +979,37 @@ out:
>>  	return young;
>>  }
>>
>> +/*
>> + * Caller is supposed to SetPageDirty(), it's not done inside this.
>> + */
>> +static
>> +int kvm_test_and_clear_dirty_rmapp(struct kvm *kvm, unsigned long *rmapp,
>> +				   unsigned long data)
>> +{
>> +	u64 *spte;
>> +	int dirty = 0;
>> +
>> +	if (!shadow_dirty_mask) {
>> +		WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
>> +		goto out;
>> +	}
>
> This should never fire with the dirty_update() notifier test, right?
> And that means that this whole optimization is for the shadow mmu case,
> arguably the legacy case.

Hi Chris,
AMD NPT does track the dirty bit in the nested page tables, so
shadow_dirty_mask should not be 0 in that case...

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On Wednesday 22 June 2011 14:15:51 Izik Eidus wrote:
> On 6/22/2011 3:21 AM, Chris Wright wrote:
>> * Nai Xia (nai@gmail.com) wrote:
>>> Introduced kvm_mmu_notifier_test_and_clear_dirty(),
>>> kvm_mmu_notifier_dirty_update() and their mmu_notifier interfaces to
>>> support KSM dirty bit tracking, which brings significant performance
>>> gain in volatile pages scanning in KSM.
[...]
>>> +	if (!shadow_dirty_mask) {
>>> +		WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
>>> +		goto out;
>>> +	}
>>
>> This should never fire with the dirty_update() notifier test, right?
>> And that means that this whole optimization is for the shadow mmu
>> case, arguably the legacy case.
>
> Hi Chris,
> AMD NPT does track the dirty bit in the nested page tables, so
> shadow_dirty_mask should not be 0 in that case...

Hi Izik,
I think he meant that if the caller is doing the right thing, then when
!shadow_dirty_mask, kvm_test_and_clear_dirty_rmapp() will never be
called at all, so this test inside kvm_test_and_clear_dirty_rmapp() is
useless. As I said, I added this test in case this interface is ever
abused by others, just like a softer BUG_ON(): the dirty bit is not
critical enough to bump into BUG().
Re: restricting users to only power control of VMs
On 06/22/2011 12:45 AM, Iordan Iordanov wrote:
>> It's a job for the management layer; I think it should be easy to
>> script libvirt to do this.
>
> I read the documentation of libvirt, and out of the box, I don't see
> how this can be configured. So, I understand your reply as meaning
> that we need to write a program that uses the libvirt API to control
> this?

Yes.

--
error compiling committee.c: too many arguments to function
Re: Time sync in KVM Guest
On 06/22/2011 06:17 AM, Chaitra Gorantla wrote:
> Hi all,
> We are working on a Fedora 15 host, and KVM is used to create a
> Fedora 14 guest. The clocksource details are as below. Our host
> supports constant_tsc.
>
> On the host:
> cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> tsc
>
> On the guest:
> cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> kvm-clock
>
> We are not having wall-clock time sync between host and guest. The
> time on the host lags behind the guest by at least 2 seconds.
> However, this time difference remains constant.

That's good.

> Does this mean that the host and guest clocks are synchronized? By
> using kvm-clock as the clock source on the guest, should the time
> sync happen by itself? Please correct me if I am wrong...
> How exactly can we find the time difference between host and guest?

Usually a Linux OS reads the RTC CMOS to get the time when it boots.
Both the guest and the host do that using hwclock. From there on, the
time is tracked by the clocksource as a delta.

It is recommended that both the guest and the host run ntpd to keep
their time updated. That's the only thing that you're recommended to
do.

Thanks,
dor

> Thanks in advance.
Re: [PATCH x86 kvm] Fix register corruption in pvclock_scale_delta
On Sun, 2011-06-19 at 15:44 +0300, Avi Kivity wrote:
> On 06/16/2011 06:50 AM, Zachary Amsden wrote:
>> The 128-bit multiply in pvclock.h was missing an output constraint
>> for EDX which caused a register corruption to appear.

Was there any particular symptom associated with corruption at that
particular point, or just general badness?

>> Thanks to Ulrich for diagnosing the EDX corruption and Avi for
>> providing this fix (which now I am sending back to you, Avi..)
>
> Applied and queued, thanks.

Seems like a stable candidate also?

Ian.
Re: [PATCH x86 kvm] Fix register corruption in pvclock_scale_delta
On 06/22/2011 12:35 PM, Ian Campbell wrote:
> On Sun, 2011-06-19 at 15:44 +0300, Avi Kivity wrote:
>> Applied and queued, thanks.
>
> Seems like a stable candidate also?

Yes. I see it's merged; Marcelo, can you backport it?
Re: [PATCH x86 kvm] Fix register corruption in pvclock_scale_delta
On 06/22/2011 12:35 PM, Ian Campbell wrote:
>> On 06/16/2011 06:50 AM, Zachary Amsden wrote:
>>> The 128-bit multiply in pvclock.h was missing an output constraint
>>> for EDX which caused a register corruption to appear.
>
> Was there any particular symptom associated with corruption at that
> particular point, or just general badness?

Time went backwards and forwards as vcpus were migrated across cpus.
The problem was in the host, not the guest, so Xen should not be
affected.
Re: [PATCH x86 kvm] Fix register corruption in pvclock_scale_delta
On Wed, 2011-06-22 at 10:40 +0100, Avi Kivity wrote:
> On 06/22/2011 12:35 PM, Ian Campbell wrote:
>> Was there any particular symptom associated with corruption at that
>> particular point, or just general badness?
>
> Time went backwards and forwards as vcpus were migrated across cpus.

Oops!

> The problem was in the host, not the guest, so Xen should not be
> affected.

Good to know, thanks!

Ian.
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On 06/21/2011 04:32 PM, Nai Xia wrote:
> Introduced kvm_mmu_notifier_test_and_clear_dirty(),
> kvm_mmu_notifier_dirty_update() and their mmu_notifier interfaces to
> support KSM dirty bit tracking, which brings significant performance
> gain in volatile pages scanning in KSM.
> Currently, kvm_mmu_notifier_dirty_update() returns 0 if and only if
> Intel EPT is enabled, to indicate that the dirty bits of the
> underlying sptes are not updated by hardware.

Can you quantify the performance gains?

> +int kvm_test_and_clear_dirty_rmapp(struct kvm *kvm, unsigned long *rmapp,
> +				   unsigned long data)
> +{
> +	u64 *spte;
> +	int dirty = 0;
> +
> +	if (!shadow_dirty_mask) {
> +		WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
> +		goto out;
> +	}
> +
> +	spte = rmap_next(kvm, rmapp, NULL);
> +	while (spte) {
> +		int _dirty;
> +		u64 _spte = *spte;
> +		BUG_ON(!(_spte & PT_PRESENT_MASK));
> +		_dirty = _spte & PT_DIRTY_MASK;
> +		if (_dirty) {
> +			dirty = 1;
> +			clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
> +		}

Racy. Also, needs a tlb flush eventually.

> +		spte = rmap_next(kvm, rmapp, spte);
> +	}
> +out:
> +	return dirty;
> +}
> +
>  #define RMAP_RECYCLE_THRESHOLD 1000

>  struct mmu_notifier_ops {
> +	int (*dirty_update)(struct mmu_notifier *mn,
> +			    struct mm_struct *mm);
> +

I prefer to have test_and_clear_dirty() always return 1 in this case
(if the spte is writeable), and drop this callback.

> +int __mmu_notifier_dirty_update(struct mm_struct *mm)
> +{
> +	struct mmu_notifier *mn;
> +	struct hlist_node *n;
> +	int dirty_update = 0;
> +
> +	rcu_read_lock();
> +	hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
> +		if (mn->ops->dirty_update)
> +			dirty_update |= mn->ops->dirty_update(mn, mm);
> +	}
> +	rcu_read_unlock();
> +

Should it not be &= instead?

> +	return dirty_update;
> +}
> +
>  /*
>   * This function can't run concurrently against mmu_notifier_register
>   * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On 6/22/2011 1:43 PM, Avi Kivity wrote:
> On 06/21/2011 04:32 PM, Nai Xia wrote:
[...]
>> +	spte = rmap_next(kvm, rmapp, NULL);
>> +	while (spte) {
>> +		int _dirty;
>> +		u64 _spte = *spte;
>> +		BUG_ON(!(_spte & PT_PRESENT_MASK));
>> +		_dirty = _spte & PT_DIRTY_MASK;
>> +		if (_dirty) {
>> +			dirty = 1;
>> +			clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
>> +		}
>
> Racy. Also, needs a tlb flush eventually.

Hi, one of the issues is that the whole point of this patch is not to
do a tlb flush eventually. But I see your point: because other users
will not expect such behavior, maybe there is a need for a flush_tlb
parameter, or for another mmu notifier call?
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On 06/22/2011 02:05 PM, Izik Eidus wrote:
>>> +		if (_dirty) {
>>> +			dirty = 1;
>>> +			clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
>>> +		}
>>
>> Racy. Also, needs a tlb flush eventually.
>
> Hi, one of the issues is that the whole point of this patch is not to
> do a tlb flush eventually. But I see your point: because other users
> will not expect such behavior, maybe there is a need for a flush_tlb
> parameter, or for another mmu notifier call?

If you don't flush the tlb, a subsequent write will not see that spte.d
is clear and the write will happen. So you'll see the page as clean
even though it's dirty. That's not acceptable.
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On 6/22/2011 2:10 PM, Avi Kivity wrote:
> If you don't flush the tlb, a subsequent write will not see that
> spte.d is clear and the write will happen. So you'll see the page as
> clean even though it's dirty. That's not acceptable.

Yes, but this is exactly what we want from this use case:
Right now ksm calculates the page hash to see if it was changed. The
idea behind this patch is to use the dirty bit instead; however, the
guest might not really like the fact that we will flush its tlb over
and over again, especially in a periodic scan like ksm does.

So what we say here is: it is better to have a little junk in the
unstable tree that gets flushed eventually anyway than to make the
guest slower. This race does not affect the accuracy of ksm anyway,
due to the full memcmp that we will eventually perform...
Of course we trust that in most cases, because it takes ksm a few
minutes to get back to a random virtual address on real systems, a tlb
flush will already have been performed.

What do you think about having 2 calls: one that does the expected
behavior and does flush the tlb, and one that clearly says it doesn't
flush the tlb and explains its use case for ksm?
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On 06/22/2011 02:19 PM, Izik Eidus wrote:
> Right now ksm calculates the page hash to see if it was changed. The
> idea behind this patch is to use the dirty bit instead; however, the
> guest might not really like the fact that we will flush its tlb over
> and over again, especially in a periodic scan like ksm does.

I see.

> What do you think about having 2 calls: one that does the expected
> behavior and does flush the tlb, and one that clearly says it doesn't
> flush the tlb and explains its use case for ksm?

Yes. And if the unstable/fast callback is not provided, have the common
code fall back to the stable/slow callback instead. Or have a parameter
that allows inaccurate results to be returned more quickly.
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On 06/22/2011 02:24 PM, Avi Kivity wrote:
> On 06/22/2011 02:19 PM, Izik Eidus wrote:
>> Right now ksm calculates the page hash to see if it was changed. The
>> idea behind this patch is to use the dirty bit instead; however, the
>> guest might not really like the fact that we will flush its tlb over
>> and over again, especially in a periodic scan like ksm does.
>
> I see.

Actually, this is dangerous. If we use the dirty bit for other things,
we will get data corruption. For example, we might want to map clean
host pages as writeable-clean in the spte on a read fault so that we
don't get a page fault when they eventually get written.
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On 06/22/2011 02:28 PM, Avi Kivity wrote:
> Actually, this is dangerous. If we use the dirty bit for other
> things, we will get data corruption. For example, we might want to
> map clean host pages as writeable-clean in the spte on a read fault
> so that we don't get a page fault when they eventually get written.

Another example: we can use the dirty bit for dirty page logging.

So I think we can get away with a conditional tlb flush: only flush if
the page was dirty. That should be rare after the first pass, at least
with small pages.
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On Wednesday 22 June 2011 19:28:08 Avi Kivity wrote:
[...]
> Actually, this is dangerous. If we use the dirty bit for other
> things, we will get data corruption.

Yeah, I actually clarified, in a reply to Chris about his similar
concern, that we are currently the _only_ user. :) We can add the
flushing when someone else needs to rely on this bit.

> For example we might want to map clean host pages as writeable-clean
> in the spte on a read fault so that we don't get a page fault when
> they eventually get written.
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On 6/22/2011 2:33 PM, Nai Xia wrote:
[...]
> Yeah, I actually clarified, in a reply to Chris about his similar
> concern, that we are currently the _only_ user. :) We can add the
> flushing when someone else needs to rely on this bit.

I suggest adding the flushing when someone else will use it as well.

Btw, I don't think this whole optimization is worthwhile for kvm guests
if a tlb flush must be performed: on a machine with a lot of cpus, it
is much better for ksm to burn one cpu than to slow down all the
others... So while this patch will make ksm look faster, the whole
system will be slower...

So in case you don't want to add the flushing when someone else relies
on it, it would be better to use the dirty bit just for userspace
applications and not for kvm guests..
collect some information when qemu-kvm exit
I find qemu-kvm only outputs a little information when it exits
abnormally. For example, if qemu-kvm exits with a segmentation fault,
there is no information in /var/log/libvirt/qemu/xx.log. So I want to
solve this by collecting some information when qemu-kvm exits.

My idea is to register some signal handlers and print some debug
information in the signal handler function, and to use atexit() to
register a function for process termination.

The main thread already uses SIGBUS, SIGUSR2, SIGALRM, SIGIO, and the
vcpu threads use SIGBUS, SIGIPI.

Signals that should be registered for the vcpu threads:
SIGHUP, SIGINT, SIGQUIT, SIGILL, SIGTRAP, SIGABRT, SIGFPE, SIGUSR1,
SIGPIPE, SIGSEGV, SIGUSR2, SIGTERM, SIGSTKFLT, SIGTSTP, SIGXCPU,
SIGXFSZ, SIGVTALRM, SIGIO, SIGSYS

Signals that should be registered for the main thread:
SIGHUP, SIGINT, SIGQUIT, SIGILL, SIGTRAP, SIGABRT, SIGFPE, SIGUSR1,
SIGPIPE, SIGSEGV, SIGTERM, SIGSTKFLT, SIGTSTP, SIGXCPU, SIGXFSZ,
SIGVTALRM, SIGSYS

A simple example of this function:

--- ../../BUILD/qemu-kvm-0.12.5/vl.c	2011-05-25 04:08:00.0 -0400
+++ vl.c	2011-06-22 06:57:51.0 -0400
+static void sig_handler(int n)
+{
+    int j, nptrs;
+    #define SIZE 100
+    void *buffer[100];
+    char **strings;
+
+    nptrs = backtrace(buffer, SIZE);
+    printf("backtrace() returned %d addresses\n", nptrs);
+
+    /* The call backtrace_symbols_fd(buffer, nptrs, STDOUT_FILENO)
+       would produce similar output to the following: */
+
+    strings = backtrace_symbols(buffer, nptrs);
+    if (strings == NULL) {
+        perror("backtrace_symbols");
+        exit(EXIT_FAILURE);
+    }
+
+    for (j = 0; j < nptrs; j++)
+        printf("%s\n", strings[j]);
+
+    free(strings);
+
+    abort();
+}
+
+void bye(void)
+{
+    printf("That was all, folks\n");
+    int j, nptrs;
+    #define SIZE 100
+    void *buffer[100];
+    char **strings;
+
+    nptrs = backtrace(buffer, SIZE);
+    printf("backtrace() returned %d addresses\n", nptrs);
+
+    /* The call backtrace_symbols_fd(buffer, nptrs, STDOUT_FILENO)
+       would produce similar output to the following: */
+
+    strings = backtrace_symbols(buffer, nptrs);
+    if (strings == NULL) {
+        perror("backtrace_symbols");
+        exit(EXIT_FAILURE);
+    }
+
+    for (j = 0; j < nptrs; j++)
+        printf("%s\n", strings[j]);
+
+    free(strings);
+}
+
 int main(int argc, char **argv, char **envp)
 {
     const char *gdbstub_dev = NULL;
@@ -4954,6 +5010,17 @@
     init_clocks();
+
+    signal(SIGSEGV, sig_handler);
+
+    i = atexit(bye);
+    if (i != 0) {
+        fprintf(stderr, "cannot set exit function\n");
+        return EXIT_FAILURE;
+    }
+

--- ../../BUILD/qemu-kvm-0.12.5/qemu-kvm.c	2011-05-25 03:31:25.0 -0400
+++ qemu-kvm.c	2011-06-22 07:04:05.0 -0400
@@ -1883,6 +1883,10 @@
 static void *ap_main_loop(void *_env)
 {
     CPUState *env = _env;
@@ -1908,10 +1927,10 @@
     struct ioperm_data *data = NULL;
 #endif
     current_env = env;
     env->thread_id = kvm_get_thread_id();
     sigfillset(&signals);
+    sigdelset(&signals, SIGSEGV);
     sigprocmask(SIG_BLOCK, &signals, NULL);
     kvm_create_vcpu(env, env->cpu_index);

The log in /var/log/libvirt/qemu/xx.log:

LC_ALL=C PATH=/bin:/sbin:/usr/bin:/usr/sbin HOME=/ QEMU_AUDIO_DRV=none /usr/bin/qemu-kvm -S -M pc-0.12 -enable-kvm -m 9767 -smp 16,sockets=16,cores=1,threads=1 -name sles11-4 -uuid 52841129-6b46-fef5-6a62-11c422597206 -nodefaults -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/sles11-4.monitor,server,nowait -mon chardev=monitor,mode=readline -no-reboot -boot dc -drive file=/mnt/disk2/c00104598/sles11sp1.raw,if=none,id=drive-virtio-disk0,boot=on -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -drive file=/opt/c00104598/iso/SLES-11-SP1.iso,if=none,media=cdrom,id=drive-ide0-1-0 -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -device virtio-net-pci,vlan=0,id=net0,mac=52:54:00:24:a5:1d,bus=pci.0,addr=0x5 -net tap,fd=15,vlan=0,name=hostnet0 -usb -vnc *:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3
19:22:40.436: debug : qemuSecurityDACSetProcessLabel:411 : Dropping privileges of VM to 0:0
backtrace() returned 16 addresses
/usr/bin/qemu-kvm(bye+0x2b) [0x43346e]
/lib64/libc.so.6(+0x355e5) [0x7f45c50605e5]
/lib64/libc.so.6(+0x35635) [0x7f45c5060635]
/usr/bin/qemu-kvm() [0x4489e2]
/usr/bin/qemu-kvm(virtio_queue_notify+0x89) [0x5bb841]
/usr/bin/qemu-kvm() [0x449bb3]
/usr/bin/qemu-kvm() [0x44a077]
/usr/bin/qemu-kvm() [0x4f2cb4]
/usr/bin/qemu-kvm(cpu_outw+0x20) [0x4f305d]
/usr/bin/qemu-kvm() [0x4513f8]
/usr/bin/qemu-kvm(kvm_run+0x2e0) [0x4539bb]
/usr/bin/qemu-kvm(kvm_cpu_exec+0x15)
Re: [Qemu-devel] [PATCH 00/15] [PULL] qemu-kvm.git uq/master queue
On 06/21/2011 12:07 PM, Marcelo Tosatti wrote: The following changes since commit eb47d7c5d96060040931c42773ee07e61e547af9: hw/9118.c: Implement active-low interrupt support (2011-06-15 13:23:37 +0200) are available in the git repository at: git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git uq/master Pulled. Thanks. Regards, Anthony Liguori Andre Przywara (1): KVM: Fix XSAVE feature bit enumeration Jan Kiszka (13): kvm: x86: Save/restore FPU OP, IP and DP Add kernel header update script Import kernel headers Switch build system to accompanied kernel headers kvm: Drop CONFIG_KVM_PARA kvm: ppc: Drop CONFIG_KVM_PPC_PVR kvm: Drop useless zero-initializations kvm: Drop KVM_CAP build dependencies kvm: x86: Drop KVM_CAP build dependencies kvm: ppc: Drop KVM_CAP build dependencies kvm: Clean up stubs kvm: x86: Pass KVMState to kvm_arch_get_supported_cpuid Remove unneeded kvm.h from cpu-exec.c Marcelo Tosatti (1): kvm: fix FPU state subsection Makefile.target |4 +- configure| 149 +-- cpu-exec.c |1 - hw/kvmclock.c|9 - kvm-all.c| 13 - kvm-stub.c | 18 +- kvm.h|2 +- linux-headers/COPYING| 356 +++ linux-headers/README |2 + linux-headers/asm-powerpc/kvm.h | 275 linux-headers/asm-powerpc/kvm_para.h | 53 +++ linux-headers/asm-s390/kvm.h | 44 ++ linux-headers/asm-s390/kvm_para.h| 17 + linux-headers/asm-x86/hyperv.h | 193 linux-headers/asm-x86/kvm.h | 324 ++ linux-headers/asm-x86/kvm_para.h | 79 linux-headers/linux/kvm.h| 804 ++ linux-headers/linux/kvm_para.h | 28 ++ linux-headers/linux/vhost.h | 130 ++ linux-headers/linux/virtio_config.h | 54 +++ linux-headers/linux/virtio_ring.h| 163 +++ scripts/update-linux-headers.sh | 55 +++ target-i386/cpu.h|4 + target-i386/cpuid.c | 20 +- target-i386/kvm.c| 148 ++- target-i386/machine.c| 23 + target-ppc/kvm.c | 23 - target-s390x/cpu.h | 10 - target-s390x/op_helper.c |1 + 29 files changed, 2675 insertions(+), 327 deletions(-) create mode 100644 linux-headers/COPYING create mode 100644 linux-headers/README create mode 100644 
linux-headers/asm-powerpc/kvm.h create mode 100644 linux-headers/asm-powerpc/kvm_para.h create mode 100644 linux-headers/asm-s390/kvm.h create mode 100644 linux-headers/asm-s390/kvm_para.h create mode 100644 linux-headers/asm-x86/hyperv.h create mode 100644 linux-headers/asm-x86/kvm.h create mode 100644 linux-headers/asm-x86/kvm_para.h create mode 100644 linux-headers/linux/kvm.h create mode 100644 linux-headers/linux/kvm_para.h create mode 100644 linux-headers/linux/vhost.h create mode 100644 linux-headers/linux/virtio_config.h create mode 100644 linux-headers/linux/virtio_ring.h create mode 100755 scripts/update-linux-headers.sh -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 01/22] KVM: MMU: fix walking shadow page table
Properly check the last mapping, and do not walk to the next level if the last spte is met.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |    9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9c629b5..f474e93 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1517,10 +1517,6 @@ static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
 	if (iterator->level < PT_PAGE_TABLE_LEVEL)
 		return false;
 
-	if (iterator->level == PT_PAGE_TABLE_LEVEL)
-		if (is_large_pte(*iterator->sptep))
-			return false;
-
 	iterator->index = SHADOW_PT_INDEX(iterator->addr, iterator->level);
 	iterator->sptep = ((u64 *)__va(iterator->shadow_addr)) + iterator->index;
 	return true;
@@ -1528,6 +1524,11 @@ static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
 
 static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
 {
+	if (is_last_spte(*iterator->sptep, iterator->level)) {
+		iterator->level = 0;
+		return;
+	}
+
 	iterator->shadow_addr = *iterator->sptep & PT64_BASE_ADDR_MASK;
 	--iterator->level;
 }
-- 
1.7.5.4
[PATCH v2 02/22] KVM: MMU: do not update slot bitmap if spte is nonpresent
Set the slot bitmap only if the spte is present.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   15 +++++++--------
 1 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f474e93..8316c2d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -743,9 +743,6 @@ static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
 	struct kvm_mmu_page *sp;
 	unsigned long *rmapp;
 
-	if (!is_rmap_spte(*spte))
-		return 0;
-
 	sp = page_header(__pa(spte));
 	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
 	rmapp = gfn_to_rmap(vcpu->kvm, gfn, sp->role.level);
@@ -2087,11 +2084,13 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	if (!was_rmapped && is_large_pte(*sptep))
 		++vcpu->kvm->stat.lpages;
 
-	page_header_update_slot(vcpu->kvm, sptep, gfn);
-	if (!was_rmapped) {
-		rmap_count = rmap_add(vcpu, sptep, gfn);
-		if (rmap_count > RMAP_RECYCLE_THRESHOLD)
-			rmap_recycle(vcpu, sptep, gfn);
+	if (is_shadow_present_pte(*sptep)) {
+		page_header_update_slot(vcpu->kvm, sptep, gfn);
+		if (!was_rmapped) {
+			rmap_count = rmap_add(vcpu, sptep, gfn);
+			if (rmap_count > RMAP_RECYCLE_THRESHOLD)
+				rmap_recycle(vcpu, sptep, gfn);
+		}
 	}
 	kvm_release_pfn_clean(pfn);
 	if (speculative) {
-- 
1.7.5.4
[PATCH v2 03/22] KVM: x86: fix broken read emulation spans a page boundary
If the range spans a page boundary, the mmio access can be broken; fix it in the same way as write emulation. And since we already have the guest physical address, use it to read the guest data directly to avoid walking the guest page table again.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/x86.c |   41 ++++++++++++++++++++++++++++++---------
 1 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0b803f0..eb27be4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3944,14 +3944,13 @@ out:
 }
 EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system);
 
-static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt,
-				  unsigned long addr,
-				  void *val,
-				  unsigned int bytes,
-				  struct x86_exception *exception)
+static int emulator_read_emulated_onepage(unsigned long addr,
+					  void *val,
+					  unsigned int bytes,
+					  struct x86_exception *exception,
+					  struct kvm_vcpu *vcpu)
 {
-	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
-	gpa_t gpa;
+	gpa_t gpa;
 	int handled;
 
 	if (vcpu->mmio_read_completed) {
@@ -3971,8 +3970,7 @@ static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt,
 	if ((gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
 		goto mmio;
 
-	if (kvm_read_guest_virt(ctxt, addr, val, bytes, exception)
-	    == X86EMUL_CONTINUE)
+	if (!kvm_read_guest(vcpu->kvm, gpa, val, bytes))
 		return X86EMUL_CONTINUE;
 
 mmio:
@@ -4001,6 +3999,31 @@ mmio:
 	return X86EMUL_IO_NEEDED;
 }
 
+static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt,
+				  unsigned long addr,
+				  void *val,
+				  unsigned int bytes,
+				  struct x86_exception *exception)
+{
+	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
+
+	/* Crossing a page boundary? */
+	if (((addr + bytes - 1) ^ addr) & PAGE_MASK) {
+		int rc, now;
+
+		now = -addr & ~PAGE_MASK;
+		rc = emulator_read_emulated_onepage(addr, val, now, exception,
+						    vcpu);
+		if (rc != X86EMUL_CONTINUE)
+			return rc;
+		addr += now;
+		val += now;
+		bytes -= now;
+	}
+	return emulator_read_emulated_onepage(addr, val, bytes, exception,
+					      vcpu);
+}
+
 int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
 			const void *val, int bytes)
 {
-- 
1.7.5.4
[PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code
Introduce vcpu_gva_to_gpa to translate a gva to a gpa; we can use it to clean up the code shared between read emulation and write emulation.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/x86.c |   38 ++++++++++++++++++++++++++++----------
 1 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eb27be4..c29ef96 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3944,6 +3944,27 @@ out:
 }
 EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system);
 
+static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
+			   gpa_t *gpa, struct x86_exception *exception,
+			   bool write)
+{
+	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
+
+	if (write)
+		access |= PFERR_WRITE_MASK;
+
+	*gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
+
+	if (*gpa == UNMAPPED_GVA)
+		return -1;
+
+	/* For APIC access vmexit */
+	if ((*gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
+		return 1;
+
+	return 0;
+}
+
 static int emulator_read_emulated_onepage(unsigned long addr,
 					  void *val,
 					  unsigned int bytes,
@@ -3951,7 +3972,7 @@ static int emulator_read_emulated_onepage(unsigned long addr,
 					  struct kvm_vcpu *vcpu)
 {
 	gpa_t gpa;
-	int handled;
+	int handled, ret;
 
 	if (vcpu->mmio_read_completed) {
@@ -3961,13 +3982,12 @@ static int emulator_read_emulated_onepage(unsigned long addr,
 		return X86EMUL_CONTINUE;
 	}
 
-	gpa = kvm_mmu_gva_to_gpa_read(vcpu, addr, exception);
+	ret = vcpu_gva_to_gpa(vcpu, addr, &gpa, exception, false);
 
-	if (gpa == UNMAPPED_GVA)
+	if (ret < 0)
 		return X86EMUL_PROPAGATE_FAULT;
 
-	/* For APIC access vmexit */
-	if ((gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
+	if (ret)
 		goto mmio;
 
 	if (!kvm_read_guest(vcpu->kvm, gpa, val, bytes))
@@ -4043,15 +4063,15 @@ static int emulator_write_emulated_onepage(unsigned long addr,
 					   struct kvm_vcpu *vcpu)
 {
 	gpa_t gpa;
-	int handled;
+	int handled, ret;
 
-	gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception);
+	ret = vcpu_gva_to_gpa(vcpu, addr, &gpa, exception, true);
 
-	if (gpa == UNMAPPED_GVA)
+	if (ret < 0)
 		return X86EMUL_PROPAGATE_FAULT;
 
 	/* For APIC access vmexit */
-	if ((gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
+	if (ret)
 		goto mmio;
 
 	if (emulator_write_phys(vcpu, gpa, val, bytes))
-- 
1.7.5.4
[PATCH v2 05/22] KVM: x86: abstract the operation for read/write emulation
The operations of read emulation and write emulation are very similar, so we can abstract them; a later patch uses this to clean up the duplicated code.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/x86.c |   72 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 72 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c29ef96..887714f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4056,6 +4056,78 @@ int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
 	return 1;
 }
 
+struct read_write_emulator_ops {
+	int (*read_write_prepare)(struct kvm_vcpu *vcpu, void *val,
+				  int bytes);
+	int (*read_write_emulate)(struct kvm_vcpu *vcpu, gpa_t gpa,
+				  void *val, int bytes);
+	int (*read_write_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa,
+			       int bytes, void *val);
+	int (*read_write_exit_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa,
+				    void *val, int bytes);
+	bool write;
+};
+
+static int read_prepare(struct kvm_vcpu *vcpu, void *val, int bytes)
+{
+	if (vcpu->mmio_read_completed) {
+		memcpy(val, vcpu->mmio_data, bytes);
+		trace_kvm_mmio(KVM_TRACE_MMIO_READ, bytes,
+			       vcpu->mmio_phys_addr, *(u64 *)val);
+		vcpu->mmio_read_completed = 0;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int read_emulate(struct kvm_vcpu *vcpu, gpa_t gpa,
+			void *val, int bytes)
+{
+	return !kvm_read_guest(vcpu->kvm, gpa, val, bytes);
+}
+
+static int write_emulate(struct kvm_vcpu *vcpu, gpa_t gpa,
+			 void *val, int bytes)
+{
+	return emulator_write_phys(vcpu, gpa, val, bytes);
+}
+
+static int write_mmio(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes, void *val)
+{
+	trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val);
+	return vcpu_mmio_write(vcpu, gpa, bytes, val);
+}
+
+static int read_exit_mmio(struct kvm_vcpu *vcpu, gpa_t gpa,
+			  void *val, int bytes)
+{
+	trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, bytes, gpa, 0);
+	return X86EMUL_IO_NEEDED;
+}
+
+static int write_exit_mmio(struct kvm_vcpu *vcpu, gpa_t gpa,
+			   void *val, int bytes)
+{
+	memcpy(vcpu->mmio_data, val, bytes);
+	memcpy(vcpu->run->mmio.data, vcpu->mmio_data, 8);
+	return X86EMUL_CONTINUE;
+}
+
+static struct read_write_emulator_ops read_emultor = {
+	.read_write_prepare = read_prepare,
+	.read_write_emulate = read_emulate,
+	.read_write_mmio = vcpu_mmio_read,
+	.read_write_exit_mmio = read_exit_mmio,
+};
+
+static struct read_write_emulator_ops write_emultor = {
+	.read_write_emulate = write_emulate,
+	.read_write_mmio = write_mmio,
+	.read_write_exit_mmio = write_exit_mmio,
+	.write = true,
+};
+
 static int emulator_write_emulated_onepage(unsigned long addr,
 					   const void *val,
 					   unsigned int bytes,
-- 
1.7.5.4
[PATCH v2 06/22] KVM: x86: cleanup the code of read/write emulation
Use the read/write operations to remove the duplicated code.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/x86.c |  149 ++++++++++++----------------------------------------
 1 files changed, 47 insertions(+), 102 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 887714f..baa5a11 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3965,85 +3965,6 @@ static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
 	return 0;
 }
 
-static int emulator_read_emulated_onepage(unsigned long addr,
-					  void *val,
-					  unsigned int bytes,
-					  struct x86_exception *exception,
-					  struct kvm_vcpu *vcpu)
-{
-	gpa_t gpa;
-	int handled, ret;
-
-	if (vcpu->mmio_read_completed) {
-		memcpy(val, vcpu->mmio_data, bytes);
-		trace_kvm_mmio(KVM_TRACE_MMIO_READ, bytes,
-			       vcpu->mmio_phys_addr, *(u64 *)val);
-		vcpu->mmio_read_completed = 0;
-		return X86EMUL_CONTINUE;
-	}
-
-	ret = vcpu_gva_to_gpa(vcpu, addr, &gpa, exception, false);
-
-	if (ret < 0)
-		return X86EMUL_PROPAGATE_FAULT;
-
-	if (ret)
-		goto mmio;
-
-	if (!kvm_read_guest(vcpu->kvm, gpa, val, bytes))
-		return X86EMUL_CONTINUE;
-
-mmio:
-	/*
-	 * Is this MMIO handled locally?
-	 */
-	handled = vcpu_mmio_read(vcpu, gpa, bytes, val);
-
-	if (handled == bytes)
-		return X86EMUL_CONTINUE;
-
-	gpa += handled;
-	bytes -= handled;
-	val += handled;
-
-	trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, bytes, gpa, 0);
-
-	vcpu->mmio_needed = 1;
-	vcpu->run->exit_reason = KVM_EXIT_MMIO;
-	vcpu->run->mmio.phys_addr = vcpu->mmio_phys_addr = gpa;
-	vcpu->mmio_size = bytes;
-	vcpu->run->mmio.len = min(vcpu->mmio_size, 8);
-	vcpu->run->mmio.is_write = vcpu->mmio_is_write = 0;
-	vcpu->mmio_index = 0;
-
-	return X86EMUL_IO_NEEDED;
-}
-
-static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt,
-				  unsigned long addr,
-				  void *val,
-				  unsigned int bytes,
-				  struct x86_exception *exception)
-{
-	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
-
-	/* Crossing a page boundary? */
-	if (((addr + bytes - 1) ^ addr) & PAGE_MASK) {
-		int rc, now;
-
-		now = -addr & ~PAGE_MASK;
-		rc = emulator_read_emulated_onepage(addr, val, now, exception,
-						    vcpu);
-		if (rc != X86EMUL_CONTINUE)
-			return rc;
-		addr += now;
-		val += now;
-		bytes -= now;
-	}
-	return emulator_read_emulated_onepage(addr, val, bytes, exception,
-					      vcpu);
-}
-
 int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
 			const void *val, int bytes)
 {
@@ -4128,16 +4049,22 @@ static struct read_write_emulator_ops write_emultor = {
 	.write = true,
 };
 
-static int emulator_write_emulated_onepage(unsigned long addr,
-					   const void *val,
-					   unsigned int bytes,
-					   struct x86_exception *exception,
-					   struct kvm_vcpu *vcpu)
+static int emulator_read_write_onepage(unsigned long addr, void *val,
+				       unsigned int bytes,
+				       struct x86_exception *exception,
+				       struct kvm_vcpu *vcpu,
+				       struct read_write_emulator_ops *ops)
+
 {
-	gpa_t gpa;
+	gpa_t gpa;
 	int handled, ret;
+	bool write = ops->write;
+
+	if (ops->read_write_prepare &&
+	    ops->read_write_prepare(vcpu, val, bytes))
+		return X86EMUL_CONTINUE;
 
-	ret = vcpu_gva_to_gpa(vcpu, addr, &gpa, exception, true);
+	ret = vcpu_gva_to_gpa(vcpu, addr, &gpa, exception, write);
 
 	if (ret < 0)
 		return X86EMUL_PROPAGATE_FAULT;
@@ -4146,15 +4073,14 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
 	if (ret)
 		goto mmio;
 
-	if (emulator_write_phys(vcpu, gpa, val, bytes))
+	if (ops->read_write_emulate(vcpu, gpa, val, bytes))
 		return X86EMUL_CONTINUE;
 
 mmio:
-	trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val);
 	/*
 	 * Is this MMIO handled locally?
 	 */
-	handled = vcpu_mmio_write(vcpu, gpa, bytes, val);
+	handled =
[PATCH v2 07/22] KVM: MMU: cache mmio info on page fault path
If the page fault is caused by mmio, we can cache the mmio info; later, we do not need to walk the guest page table again and can quickly know it is a mmio fault while emulating the mmio instruction.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/include/asm/kvm_host.h |    5 +++++
 arch/x86/kvm/mmu.c              |   21 +++++++++++----------
 arch/x86/kvm/mmu.h              |   23 +++++++++++++++++++++++
 arch/x86/kvm/paging_tmpl.h      |   21 ++++++++++-----------
 arch/x86/kvm/x86.c              |   11 +++++++++++
 arch/x86/kvm/x86.h              |   36 ++++++++++++++++++++++++++++++++++++
 6 files changed, 96 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index da6bbee..7b0834a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -415,6 +415,11 @@ struct kvm_vcpu_arch {
 	u64 mcg_ctl;
 	u64 *mce_banks;
 
+	/* Cache MMIO info */
+	u64 mmio_gva;
+	unsigned access;
+	gfn_t mmio_gfn;
+
 	/* used for guest single stepping over the given code position */
 	unsigned long singlestep_rip;
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8316c2d..7f53210 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -217,11 +217,6 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mask_ptes);
 
-static bool is_write_protection(struct kvm_vcpu *vcpu)
-{
-	return kvm_read_cr0_bits(vcpu, X86_CR0_WP);
-}
-
 static int is_cpuid_PSE36(void)
 {
 	return 1;
@@ -243,11 +238,6 @@ static int is_large_pte(u64 pte)
 	return pte & PT_PAGE_SIZE_MASK;
 }
 
-static int is_writable_pte(unsigned long pte)
-{
-	return pte & PT_WRITABLE_MASK;
-}
-
 static int is_dirty_gpte(unsigned long pte)
 {
 	return pte & PT_DIRTY_MASK;
@@ -2247,15 +2237,17 @@ static void kvm_send_hwpoison_signal(unsigned long address, struct task_struct *tsk)
 	send_sig_info(SIGBUS, &info, tsk);
 }
 
-static int kvm_handle_bad_page(struct kvm *kvm, gfn_t gfn, pfn_t pfn)
+static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gva_t gva,
+			       unsigned access, gfn_t gfn, pfn_t pfn)
 {
 	kvm_release_pfn_clean(pfn);
 	if (is_hwpoison_pfn(pfn)) {
-		kvm_send_hwpoison_signal(gfn_to_hva(kvm, gfn), current);
+		kvm_send_hwpoison_signal(gfn_to_hva(vcpu->kvm, gfn), current);
 		return 0;
 	} else if (is_fault_pfn(pfn))
 		return -EFAULT;
 
+	vcpu_cache_mmio_info(vcpu, gva, gfn, access);
 	return 1;
 }
 
@@ -2337,7 +2329,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn,
 
 	/* mmio */
 	if (is_error_pfn(pfn))
-		return kvm_handle_bad_page(vcpu->kvm, gfn, pfn);
+		return kvm_handle_bad_page(vcpu, v, ACC_ALL, gfn, pfn);
 
 	spin_lock(&vcpu->kvm->mmu_lock);
 	if (mmu_notifier_retry(vcpu, mmu_seq))
@@ -2564,6 +2556,7 @@ static void mmu_sync_roots(struct kvm_vcpu *vcpu)
 	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
 		return;
 
+	vcpu_clear_mmio_info(vcpu, ~0ul);
 	trace_kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
 	if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
 		hpa_t root = vcpu->arch.mmu.root_hpa;
@@ -2710,7 +2703,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 
 	/* mmio */
 	if (is_error_pfn(pfn))
-		return kvm_handle_bad_page(vcpu->kvm, gfn, pfn);
+		return kvm_handle_bad_page(vcpu, 0, 0, gfn, pfn);
 	spin_lock(&vcpu->kvm->mmu_lock);
 	if (mmu_notifier_retry(vcpu, mmu_seq))
 		goto out_unlock;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 7086ca8..05310b1 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -76,4 +76,27 @@ static inline int is_present_gpte(unsigned long pte)
 	return pte & PT_PRESENT_MASK;
 }
 
+static inline int is_writable_pte(unsigned long pte)
+{
+	return pte & PT_WRITABLE_MASK;
+}
+
+static inline bool is_write_protection(struct kvm_vcpu *vcpu)
+{
+	return kvm_read_cr0_bits(vcpu, X86_CR0_WP);
+}
+
+static inline bool check_write_user_access(struct kvm_vcpu *vcpu,
+					   bool write_fault, bool user_fault,
+					   unsigned long pte)
+{
+	if (unlikely(write_fault && !is_writable_pte(pte)
+		     && (user_fault || is_write_protection(vcpu))))
+		return false;
+
+	if (unlikely(user_fault && !(pte & PT_USER_MASK)))
+		return false;
+
+	return true;
+}
 #endif
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 1caeb4d..13978dc 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -201,11 +201,8 @@ walk:
 			break;
 		}
 
-		if (unlikely(write_fault
[PATCH v2 08/22] KVM: MMU: optimize to handle dirty bit
If the dirty bit is not set, we can make the pte access read-only to avoid handling the dirty bit everywhere.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c         |   13 ++++++-------
 arch/x86/kvm/paging_tmpl.h |   46 ++++++++++++++++++--------------------------
 2 files changed, 25 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7f53210..9e35577 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1923,7 +1923,7 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		    unsigned pte_access, int user_fault,
-		    int write_fault, int dirty, int level,
+		    int write_fault, int level,
 		    gfn_t gfn, pfn_t pfn, bool speculative,
 		    bool can_unsync, bool host_writable)
 {
@@ -1938,8 +1938,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	spte = PT_PRESENT_MASK;
 	if (!speculative)
 		spte |= shadow_accessed_mask;
-	if (!dirty)
-		pte_access &= ~ACC_WRITE_MASK;
+
 	if (pte_access & ACC_EXEC_MASK)
 		spte |= shadow_x_mask;
 	else
@@ -2023,7 +2022,7 @@ done:
 
 static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 			 unsigned pt_access, unsigned pte_access,
-			 int user_fault, int write_fault, int dirty,
+			 int user_fault, int write_fault,
 			 int *ptwrite, int level, gfn_t gfn,
 			 pfn_t pfn, bool speculative,
 			 bool host_writable)
@@ -2059,7 +2058,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	}
 
 	if (set_spte(vcpu, sptep, pte_access, user_fault, write_fault,
-		     dirty, level, gfn, pfn, speculative, true,
+		     level, gfn, pfn, speculative, true,
 		     host_writable)) {
 		if (write_fault)
 			*ptwrite = 1;
@@ -2129,7 +2128,7 @@ static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
 
 	for (i = 0; i < ret; i++, gfn++, start++)
 		mmu_set_spte(vcpu, start, ACC_ALL,
-			     access, 0, 0, 1, NULL,
+			     access, 0, 0, NULL,
 			     sp->role.level, gfn,
 			     page_to_pfn(pages[i]), true, true);
 
@@ -2193,7 +2192,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 			unsigned pte_access = ACC_ALL;
 
 			mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, pte_access,
-				     0, write, 1, &pt_write,
+				     0, write, &pt_write,
 				     level, gfn, pfn, prefault, map_writable);
 			direct_pte_prefetch(vcpu, iterator.sptep);
 			++vcpu->stat.pf_fixed;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 13978dc..e06c9e4 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -101,11 +101,15 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	return (ret != orig_pte);
 }
 
-static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte)
+static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte,
+				   bool last)
 {
 	unsigned access;
 
 	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
+	if (last && !is_dirty_gpte(gpte))
+		access &= ~ACC_WRITE_MASK;
+
 #if PTTYPE == 64
 	if (vcpu->arch.mmu.nx)
 		access &= ~(gpte >> PT64_NX_SHIFT);
@@ -227,8 +231,6 @@ walk:
 			pte |= PT_ACCESSED_MASK;
 		}
 
-		pte_access = pt_access & FNAME(gpte_access)(vcpu, pte);
-
 		walker->ptes[walker->level - 1] = pte;
 
 		if ((walker->level == PT_PAGE_TABLE_LEVEL) ||
@@ -269,7 +271,7 @@ walk:
 			break;
 		}
 
-		pt_access = pte_access;
+		pt_access &= FNAME(gpte_access)(vcpu, pte, false);
 		--walker->level;
 	}
 
@@ -293,6 +295,7 @@ walk:
 		walker->ptes[walker->level - 1] = pte;
 	}
 
+	pte_access = pt_access & FNAME(gpte_access)(vcpu, pte, true);
 	walker->pt_access = pt_access;
 	walker->pte_access = pte_access;
 	pgprintk("%s: pte %llx pte_access %x pt_access %x\n",
@@ -373,7 +376,7 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		return;
 	pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);
-	pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
+	pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte, true);
 	pfn =
[PATCH v2 09/22] KVM: MMU: cleanup for FNAME(fetch)
gw->pte_access is the final access permission, since it is unified with gw->pt_access when we walk the guest page table:

FNAME(walk_addr_generic):
	pte_access = pt_access & FNAME(gpte_access)(vcpu, pte, true);

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/paging_tmpl.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index e06c9e4..e9ee473 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -485,7 +485,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 	if (!is_present_gpte(gw->ptes[gw->level - 1]))
 		return NULL;
 
-	direct_access = gw->pt_access & gw->pte_access;
+	direct_access = gw->pte_access;
 
 	top_level = vcpu->arch.mmu.root_level;
 	if (top_level == PT32E_ROOT_LEVEL)
@@ -543,7 +543,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 		link_shadow_page(it.sptep, sp);
 	}
 
-	mmu_set_spte(vcpu, it.sptep, access, gw->pte_access & access,
+	mmu_set_spte(vcpu, it.sptep, access, gw->pte_access,
 		     user_fault, write_fault, ptwrite, it.level,
 		     gw->gfn, pfn, prefault, map_writable);
 	FNAME(pte_prefetch)(vcpu, gw, it.sptep);
-- 
1.7.5.4
[PATCH v2 10/22] KVM: MMU: rename 'pt_write' to 'emulate'
If 'pt_write' is true, we need to emulate the fault. In a later patch, we need to emulate the fault even though it is not a pt_write event, so rename the variable to better fit the meaning.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c         |   10 +++++-----
 arch/x86/kvm/paging_tmpl.h |   16 ++++++++--------
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9e35577..580ecf0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2023,7 +2023,7 @@ done:
 
 static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 			 unsigned pt_access, unsigned pte_access,
 			 int user_fault, int write_fault,
-			 int *ptwrite, int level, gfn_t gfn,
+			 int *emulate, int level, gfn_t gfn,
 			 pfn_t pfn, bool speculative,
 			 bool host_writable)
 {
@@ -2061,7 +2061,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		     level, gfn, pfn, speculative, true,
 		     host_writable)) {
 		if (write_fault)
-			*ptwrite = 1;
+			*emulate = 1;
 		kvm_mmu_flush_tlb(vcpu);
 	}
 
@@ -2184,7 +2184,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 {
 	struct kvm_shadow_walk_iterator iterator;
 	struct kvm_mmu_page *sp;
-	int pt_write = 0;
+	int emulate = 0;
 	gfn_t pseudo_gfn;
 
 	for_each_shadow_entry(vcpu, (u64)gfn << PAGE_SHIFT, iterator) {
@@ -2192,7 +2192,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 			unsigned pte_access = ACC_ALL;
 
 			mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, pte_access,
-				     0, write, &pt_write,
+				     0, write, &emulate,
 				     level, gfn, pfn, prefault, map_writable);
 			direct_pte_prefetch(vcpu, iterator.sptep);
 			++vcpu->stat.pf_fixed;
@@ -2220,7 +2220,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 				   | shadow_accessed_mask);
 		}
 	}
-	return pt_write;
+	return emulate;
 }
 
 static void kvm_send_hwpoison_signal(unsigned long address, struct task_struct *tsk)
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index e9ee473..1e534e0 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -473,7 +473,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
 static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			 struct guest_walker *gw,
 			 int user_fault, int write_fault, int hlevel,
-			 int *ptwrite, pfn_t pfn, bool map_writable,
+			 int *emulate, pfn_t pfn, bool map_writable,
 			 bool prefault)
 {
 	unsigned access = gw->pt_access;
@@ -544,7 +544,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 	}
 
 	mmu_set_spte(vcpu, it.sptep, access, gw->pte_access,
-		     user_fault, write_fault, ptwrite, it.level,
+		     user_fault, write_fault, emulate, it.level,
 		     gw->gfn, pfn, prefault, map_writable);
 	FNAME(pte_prefetch)(vcpu, gw, it.sptep);
 
@@ -578,7 +578,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	int user_fault = error_code & PFERR_USER_MASK;
 	struct guest_walker walker;
 	u64 *sptep;
-	int write_pt = 0;
+	int emulate = 0;
 	int r;
 	pfn_t pfn;
 	int level = PT_PAGE_TABLE_LEVEL;
@@ -639,19 +639,19 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	if (!force_pt_level)
 		transparent_hugepage_adjust(vcpu, &walker.gfn, &pfn, &level);
 	sptep = FNAME(fetch)(vcpu, addr, &walker, user_fault, write_fault,
-			     level, &write_pt, pfn, map_writable, prefault);
+			     level, &emulate, pfn, map_writable, prefault);
 	(void)sptep;
-	pgprintk("%s: shadow pte %p %llx ptwrite %d\n", __func__,
-		 sptep, *sptep, write_pt);
+	pgprintk("%s: shadow pte %p %llx emulate %d\n", __func__,
+		 sptep, *sptep, emulate);
 
-	if (!write_pt)
+	if (!emulate)
 		vcpu->arch.last_pt_write_count = 0; /* reset fork detector */
 
 	++vcpu->stat.pf_fixed;
 	trace_kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);
 	spin_unlock(&vcpu->kvm->mmu_lock);
 
-	return write_pt;
+	return emulate;
 
 out_unlock:
 	spin_unlock(&vcpu->kvm->mmu_lock);
-- 
1.7.5.4
[PATCH v2 11/22] KVM: MMU: count used shadow pages on prepareing path
Move counting used shadow pages from the committing path to the preparing path to reduce TLB flushes on some paths.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 580ecf0..8233aba 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1039,7 +1039,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
 	percpu_counter_add(&kvm_total_used_mmu_pages, nr);
 }
 
-static void kvm_mmu_free_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 {
 	ASSERT(is_empty_shadow_page(sp->spt));
 	hlist_del(&sp->hash_link);
@@ -1048,7 +1048,6 @@ static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 	if (!sp->role.direct)
 		free_page((unsigned long)sp->gfns);
 	kmem_cache_free(mmu_page_header_cache, sp);
-	kvm_mod_used_mmu_pages(kvm, -1);
 }
 
 static unsigned kvm_page_table_hashfn(gfn_t gfn)
@@ -1655,6 +1654,7 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 		/* Count self */
 		ret++;
 		list_move(&sp->link, invalid_list);
+		kvm_mod_used_mmu_pages(kvm, -1);
 	} else {
 		list_move(&sp->link, &kvm->arch.active_mmu_pages);
 		kvm_reload_remote_mmus(kvm);
@@ -1678,7 +1678,7 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 	do {
 		sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
 		WARN_ON(!sp->role.invalid || sp->root_count);
-		kvm_mmu_free_page(kvm, sp);
+		kvm_mmu_free_page(sp);
 	} while (!list_empty(invalid_list));
 
 }
@@ -1704,8 +1704,8 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int goal_nr_mmu_pages)
 			page = container_of(kvm->arch.active_mmu_pages.prev,
 					    struct kvm_mmu_page, link);
 			kvm_mmu_prepare_zap_page(kvm, page, &invalid_list);
-			kvm_mmu_commit_zap_page(kvm, &invalid_list);
 		}
+		kvm_mmu_commit_zap_page(kvm, &invalid_list);
 		goal_nr_mmu_pages = kvm->arch.n_used_mmu_pages;
 	}
 
@@ -3312,9 +3312,9 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu)
 		sp = container_of(vcpu->kvm->arch.active_mmu_pages.prev,
 				  struct kvm_mmu_page, link);
 		kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
-		kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
 		++vcpu->kvm->stat.mmu_recycled;
 	}
+	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
 }
 
 int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code,
-- 
1.7.5.4
[PATCH v2 13/22] KVM: MMU: remove bypass_guest_pf
The idea is from Avi: | Maybe it's time to kill off bypass_guest_pf=1. It's not as effective as | it used to be, since unsync pages always use shadow_trap_nonpresent_pte, | and since we convert between the two nonpresent_ptes during sync and unsync. Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- Documentation/kernel-parameters.txt |4 -- arch/x86/include/asm/kvm_host.h |3 - arch/x86/kvm/mmu.c | 83 ++- arch/x86/kvm/mmu_audit.c| 12 - arch/x86/kvm/paging_tmpl.h | 51 +++-- arch/x86/kvm/vmx.c | 11 + arch/x86/kvm/x86.c |1 - 7 files changed, 33 insertions(+), 132 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index fd248a31..a06e4f1 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1159,10 +1159,6 @@ bytes respectively. Such letter suffixes can also be entirely omitted. for all guests. Default is 1 (enabled) if in 64bit or 32bit-PAE mode - kvm-intel.bypass_guest_pf= - [KVM,Intel] Disables bypassing of guest page faults - on Intel chips. Default is 1 (enabled) - kvm-intel.ept= [KVM,Intel] Disable extended page tables (virtualized MMU) support on capable Intel chips. 
Default is 1 (enabled) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 7b0834a..42e577d 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -266,8 +266,6 @@ struct kvm_mmu { gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva, u32 access, struct x86_exception *exception); gpa_t (*translate_gpa)(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access); - void (*prefetch_page)(struct kvm_vcpu *vcpu, - struct kvm_mmu_page *page); int (*sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp); void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva); @@ -638,7 +636,6 @@ void kvm_mmu_module_exit(void); void kvm_mmu_destroy(struct kvm_vcpu *vcpu); int kvm_mmu_create(struct kvm_vcpu *vcpu); int kvm_mmu_setup(struct kvm_vcpu *vcpu); -void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte); void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask, u64 dirty_mask, u64 nx_mask, u64 x_mask); diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index c6fabf0..2c7b7a0 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -186,8 +186,6 @@ static struct kmem_cache *pte_list_desc_cache; static struct kmem_cache *mmu_page_header_cache; static struct percpu_counter kvm_total_used_mmu_pages; -static u64 __read_mostly shadow_trap_nonpresent_pte; -static u64 __read_mostly shadow_notrap_nonpresent_pte; static u64 __read_mostly shadow_nx_mask; static u64 __read_mostly shadow_x_mask;/* mutual exclusive with nx_mask */ static u64 __read_mostly shadow_user_mask; @@ -199,13 +197,6 @@ static inline u64 rsvd_bits(int s, int e) return ((1ULL (e - s + 1)) - 1) s; } -void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte) -{ - shadow_trap_nonpresent_pte = trap_pte; - shadow_notrap_nonpresent_pte = notrap_pte; -} -EXPORT_SYMBOL_GPL(kvm_mmu_set_nonpresent_ptes); - void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask, u64 dirty_mask, u64 nx_mask, u64 x_mask) { @@ -229,8 +220,7 @@ static int is_nx(struct kvm_vcpu 
*vcpu) static int is_shadow_present_pte(u64 pte) { - return pte != shadow_trap_nonpresent_pte -pte != shadow_notrap_nonpresent_pte; + return pte != 0ull; } static int is_large_pte(u64 pte) @@ -777,9 +767,9 @@ static int set_spte_track_bits(u64 *sptep, u64 new_spte) return 1; } -static void drop_spte(struct kvm *kvm, u64 *sptep, u64 new_spte) +static void drop_spte(struct kvm *kvm, u64 *sptep) { - if (set_spte_track_bits(sptep, new_spte)) + if (set_spte_track_bits(sptep, 0ull)) rmap_remove(kvm, sptep); } @@ -814,8 +804,7 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn) BUG_ON((*spte (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK)) != (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK)); pgprintk(rmap_write_protect(large): spte %p %llx %lld\n, spte, *spte, gfn); if (is_writable_pte(*spte)) { - drop_spte(kvm, spte, - shadow_trap_nonpresent_pte); + drop_spte(kvm, spte); --kvm-stat.lpages; spte = NULL; write_protected = 1; @@ -836,7 +825,7 @@ static
[PATCH v2 14/22] KVM: MMU: filter out the mmio pfn from the fault pfn
If the page fault is caused by mmio, the gfn can not be found in memslots, and 'bad_pfn' is returned on gfn_to_hva path, so we can use 'bad_pfn' to identify the mmio page fault. And, to clarify the meaning of mmio pfn, we return fault page instead of bad page when the gfn is not allowd to prefetch Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/mmu.c |4 ++-- include/linux/kvm_host.h |5 + virt/kvm/kvm_main.c | 16 ++-- 3 files changed, 21 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 2c7b7a0..4a51042 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2085,8 +2085,8 @@ static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, no_dirty_log); if (!slot) { - get_page(bad_page); - return page_to_pfn(bad_page); + get_page(fault_page); + return page_to_pfn(fault_page); } hva = gfn_to_hva_memslot(slot, gfn); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 31ebb59..3e548e8 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -326,12 +326,17 @@ static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm) static inline int is_error_hpa(hpa_t hpa) { return hpa HPA_MSB; } extern struct page *bad_page; +extern struct page *fault_page; + extern pfn_t bad_pfn; +extern pfn_t fault_pfn; int is_error_page(struct page *page); int is_error_pfn(pfn_t pfn); int is_hwpoison_pfn(pfn_t pfn); int is_fault_pfn(pfn_t pfn); +int is_mmio_pfn(pfn_t pfn); +int is_invalid_pfn(pfn_t pfn); int kvm_is_error_hva(unsigned long addr); int kvm_set_memory_region(struct kvm *kvm, struct kvm_userspace_memory_region *mem, diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 11d2783..c7d41d4 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -101,8 +101,8 @@ static bool largepages_enabled = true; static struct page *hwpoison_page; static pfn_t hwpoison_pfn; -static struct page *fault_page; -static pfn_t 
fault_pfn; +struct page *fault_page; +pfn_t fault_pfn; inline int kvm_is_mmio_pfn(pfn_t pfn) { @@ -931,6 +931,18 @@ int is_fault_pfn(pfn_t pfn) } EXPORT_SYMBOL_GPL(is_fault_pfn); +int is_mmio_pfn(pfn_t pfn) +{ + return pfn == bad_pfn; +} +EXPORT_SYMBOL_GPL(is_mmio_pfn); + +int is_invalid_pfn(pfn_t pfn) +{ + return pfn == hwpoison_pfn || pfn == fault_pfn; +} +EXPORT_SYMBOL_GPL(is_invalid_pfn); + static inline unsigned long bad_hva(void) { return PAGE_OFFSET; -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 15/22] KVM: MMU: abstract some functions to handle fault pfn
Introduce handle_abnormal_pfn to handle fault pfn on page fault path, introduce mmu_invalid_pfn to handle fault pfn on prefetch path It is the preparing work for mmio page fault support Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/mmu.c | 47 --- arch/x86/kvm/paging_tmpl.h | 12 +- 2 files changed, 41 insertions(+), 18 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 4a51042..4fe1cb3 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2221,18 +2221,15 @@ static void kvm_send_hwpoison_signal(unsigned long address, struct task_struct * send_sig_info(SIGBUS, info, tsk); } -static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gva_t gva, - unsigned access, gfn_t gfn, pfn_t pfn) +static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, pfn_t pfn) { kvm_release_pfn_clean(pfn); if (is_hwpoison_pfn(pfn)) { kvm_send_hwpoison_signal(gfn_to_hva(vcpu-kvm, gfn), current); return 0; - } else if (is_fault_pfn(pfn)) - return -EFAULT; + } - vcpu_cache_mmio_info(vcpu, gva, gfn, access); - return 1; + return -EFAULT; } static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, @@ -2277,6 +2274,33 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, } } +static bool mmu_invalid_pfn(pfn_t pfn) +{ + return unlikely(is_invalid_pfn(pfn) || is_mmio_pfn(pfn)); +} + +static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn, + pfn_t pfn, unsigned access, int *ret_val) +{ + bool ret = true; + + /* The pfn is invalid, report the error! 
*/ + if (unlikely(is_invalid_pfn(pfn))) { + *ret_val = kvm_handle_bad_page(vcpu, gfn, pfn); + goto exit; + } + + if (unlikely(is_mmio_pfn(pfn))) { + vcpu_cache_mmio_info(vcpu, gva, gfn, access); + *ret_val = 1; + goto exit; + } + + ret = false; +exit: + return ret; +} + static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn, gva_t gva, pfn_t *pfn, bool write, bool *writable); @@ -2311,9 +2335,8 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn, if (try_async_pf(vcpu, prefault, gfn, v, pfn, write, map_writable)) return 0; - /* mmio */ - if (is_error_pfn(pfn)) - return kvm_handle_bad_page(vcpu, v, ACC_ALL, gfn, pfn); + if (handle_abnormal_pfn(vcpu, v, gfn, pfn, ACC_ALL, r)) + return r; spin_lock(vcpu-kvm-mmu_lock); if (mmu_notifier_retry(vcpu, mmu_seq)) @@ -2685,9 +2708,9 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code, if (try_async_pf(vcpu, prefault, gfn, gpa, pfn, write, map_writable)) return 0; - /* mmio */ - if (is_error_pfn(pfn)) - return kvm_handle_bad_page(vcpu, 0, 0, gfn, pfn); + if (handle_abnormal_pfn(vcpu, 0, gfn, pfn, ACC_ALL, r)) + return r; + spin_lock(vcpu-kvm-mmu_lock); if (mmu_notifier_retry(vcpu, mmu_seq)) goto out_unlock; diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 58a29d0..870be69 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -373,7 +373,7 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, pgprintk(%s: gpte %llx spte %p\n, __func__, (u64)gpte, spte); pte_access = sp-role.access FNAME(gpte_access)(vcpu, gpte, true); pfn = gfn_to_pfn_atomic(vcpu-kvm, gpte_to_gfn(gpte)); - if (is_error_pfn(pfn)) { + if (mmu_invalid_pfn(pfn)) { kvm_release_pfn_clean(pfn); return; } @@ -451,7 +451,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw, gfn = gpte_to_gfn(gpte); pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn, pte_access ACC_WRITE_MASK); - if (is_error_pfn(pfn)) { + 
if (mmu_invalid_pfn(pfn)) { kvm_release_pfn_clean(pfn); break; } @@ -621,10 +621,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code, map_writable)) return 0; - /* mmio */ - if (is_error_pfn(pfn)) - return kvm_handle_bad_page(vcpu, mmu_is_nested(vcpu) ? 0 : - addr, walker.pte_access, walker.gfn, pfn); + if (handle_abnormal_pfn(vcpu, mmu_is_nested(vcpu) ? 0 : addr, + walker.gfn, pfn, walker.pte_access, r)) + return r; +
[PATCH v2 16/22] KVM: MMU: introduce the rules to modify shadow page table
Introduce some interfaces to modify spte as linux kernel does: - mmu_spte_clear_track_bits, it set the spte from present to nonpresent, and track the stat bits(accessed/dirty) of spte - mmu_spte_clear_no_track, the same as mmu_spte_clear_track_bits except tracking the stat bits - mmu_spte_set, set spte from nonpresent to present - mmu_spte_update, only update the stat bits Now, it does not allowed to set spte from present to present, later, we can drop the atomicly opration for X86_32 host, and it is the preparing work to get spte on X86_32 host out of the mmu lock Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/mmu.c | 103 +++- 1 files changed, 69 insertions(+), 34 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 4fe1cb3..5ceb64a 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -299,12 +299,30 @@ static bool spte_is_bit_cleared(u64 old_spte, u64 new_spte, u64 bit_mask) return (old_spte bit_mask) !(new_spte bit_mask); } -static void update_spte(u64 *sptep, u64 new_spte) +/* Rules for using mmu_spte_set: + * Set the sptep from nonpresent to present. + * Note: the sptep being assigned *must* be either not present + * or in a state where the hardware will not attempt to update + * the spte. + */ +static void mmu_spte_set(u64 *sptep, u64 new_spte) +{ + WARN_ON(is_shadow_present_pte(*sptep)); + __set_spte(sptep, new_spte); +} + +/* Rules for using mmu_spte_update: + * Update the state bits, it means the mapped pfn is not changged. 
+ */ +static void mmu_spte_update(u64 *sptep, u64 new_spte) { u64 mask, old_spte = *sptep; WARN_ON(!is_rmap_spte(new_spte)); + if (!is_shadow_present_pte(old_spte)) + return mmu_spte_set(sptep, new_spte); + new_spte |= old_spte shadow_dirty_mask; mask = shadow_accessed_mask; @@ -325,6 +343,42 @@ static void update_spte(u64 *sptep, u64 new_spte) kvm_set_pfn_dirty(spte_to_pfn(old_spte)); } +/* + * Rules for using mmu_spte_clear_track_bits: + * It sets the sptep from present to nonpresent, and track the + * state bits, it is used to clear the last level sptep. + */ +static int mmu_spte_clear_track_bits(u64 *sptep) +{ + pfn_t pfn; + u64 old_spte = *sptep; + + if (!spte_has_volatile_bits(old_spte)) + __set_spte(sptep, 0ull); + else + old_spte = __xchg_spte(sptep, 0ull); + + if (!is_rmap_spte(old_spte)) + return 0; + + pfn = spte_to_pfn(old_spte); + if (!shadow_accessed_mask || old_spte shadow_accessed_mask) + kvm_set_pfn_accessed(pfn); + if (!shadow_dirty_mask || (old_spte shadow_dirty_mask)) + kvm_set_pfn_dirty(pfn); + return 1; +} + +/* + * Rules for using mmu_spte_clear_no_track: + * Directly clear spte without caring the state bits of sptep, + * it is used to set the upper level spte. 
+ */ +static void mmu_spte_clear_no_track(u64 *sptep) +{ + __set_spte(sptep, 0ull); +} + static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache, struct kmem_cache *base_cache, int min) { @@ -746,30 +800,9 @@ static void rmap_remove(struct kvm *kvm, u64 *spte) pte_list_remove(spte, rmapp); } -static int set_spte_track_bits(u64 *sptep, u64 new_spte) -{ - pfn_t pfn; - u64 old_spte = *sptep; - - if (!spte_has_volatile_bits(old_spte)) - __set_spte(sptep, new_spte); - else - old_spte = __xchg_spte(sptep, new_spte); - - if (!is_rmap_spte(old_spte)) - return 0; - - pfn = spte_to_pfn(old_spte); - if (!shadow_accessed_mask || old_spte shadow_accessed_mask) - kvm_set_pfn_accessed(pfn); - if (!shadow_dirty_mask || (old_spte shadow_dirty_mask)) - kvm_set_pfn_dirty(pfn); - return 1; -} - static void drop_spte(struct kvm *kvm, u64 *sptep) { - if (set_spte_track_bits(sptep, 0ull)) + if (mmu_spte_clear_track_bits(sptep)) rmap_remove(kvm, sptep); } @@ -787,7 +820,7 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn) BUG_ON(!(*spte PT_PRESENT_MASK)); rmap_printk(rmap_write_protect: spte %p %llx\n, spte, *spte); if (is_writable_pte(*spte)) { - update_spte(spte, *spte ~PT_WRITABLE_MASK); + mmu_spte_update(spte, *spte ~PT_WRITABLE_MASK); write_protected = 1; } spte = rmap_next(kvm, rmapp, spte); @@ -856,7 +889,8 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp, new_spte = ~PT_WRITABLE_MASK; new_spte = ~SPTE_HOST_WRITEABLE; new_spte = ~shadow_accessed_mask; - set_spte_track_bits(spte,
[PATCH v2 17/22] KVM: MMU: clean up spte updating and clearing
Clean up the code between mmu_spte_clear_* and mmu_spte_update Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/mmu.c | 75 +++- 1 files changed, 39 insertions(+), 36 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 5ceb64a..339e1a3 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -279,24 +279,51 @@ static u64 __xchg_spte(u64 *sptep, u64 new_spte) #endif } -static bool spte_has_volatile_bits(u64 spte) +static bool spte_is_bit_cleared(u64 old_spte, u64 new_spte, u64 bit_mask) +{ + return (old_spte bit_mask) !(new_spte bit_mask); +} + +static bool spte_has_volatile_bits(u64 spte, u64 keep_bits) { + bool access_nonvolatile = false, dirty_nonvolatile = false; + if (!shadow_accessed_mask) return false; - if (!is_shadow_present_pte(spte)) - return false; + if ((spte | keep_bits) shadow_accessed_mask) + access_nonvolatile = true; + + if (!is_writable_pte(spte) || ((spte | keep_bits) shadow_dirty_mask)) + dirty_nonvolatile = true; - if ((spte shadow_accessed_mask) - (!is_writable_pte(spte) || (spte shadow_dirty_mask))) + if (access_nonvolatile dirty_nonvolatile) return false; return true; } -static bool spte_is_bit_cleared(u64 old_spte, u64 new_spte, u64 bit_mask) +static void track_spte_bits(u64 old_spte, u64 keep_bits, bool always_track) { - return (old_spte bit_mask) !(new_spte bit_mask); + if (always_track || + (spte_is_bit_cleared(old_spte, keep_bits, shadow_accessed_mask))) + kvm_set_pfn_accessed(spte_to_pfn(old_spte)); + + if (always_track || + (spte_is_bit_cleared(old_spte, keep_bits, shadow_dirty_mask))) + kvm_set_pfn_dirty(spte_to_pfn(old_spte)); +} + +static u64 spte_get_and_update_clear(u64 *sptep, u64 new_spte) +{ + u64 old_spte = *sptep; + + if (!spte_has_volatile_bits(old_spte, new_spte)) + __set_spte(sptep, new_spte); + else + old_spte = __xchg_spte(sptep, new_spte); + + return old_spte; } /* Rules for using mmu_spte_set: @@ -316,7 +343,7 @@ static void mmu_spte_set(u64 *sptep, u64 new_spte) */ 
static void mmu_spte_update(u64 *sptep, u64 new_spte) { - u64 mask, old_spte = *sptep; + u64 old_spte = *sptep; WARN_ON(!is_rmap_spte(new_spte)); @@ -324,23 +351,8 @@ static void mmu_spte_update(u64 *sptep, u64 new_spte) return mmu_spte_set(sptep, new_spte); new_spte |= old_spte shadow_dirty_mask; - - mask = shadow_accessed_mask; - if (is_writable_pte(old_spte)) - mask |= shadow_dirty_mask; - - if (!spte_has_volatile_bits(old_spte) || (new_spte mask) == mask) - __set_spte(sptep, new_spte); - else - old_spte = __xchg_spte(sptep, new_spte); - - if (!shadow_accessed_mask) - return; - - if (spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask)) - kvm_set_pfn_accessed(spte_to_pfn(old_spte)); - if (spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask)) - kvm_set_pfn_dirty(spte_to_pfn(old_spte)); + old_spte = spte_get_and_update_clear(sptep, new_spte); + track_spte_bits(old_spte, new_spte, false); } /* @@ -350,22 +362,13 @@ static void mmu_spte_update(u64 *sptep, u64 new_spte) */ static int mmu_spte_clear_track_bits(u64 *sptep) { - pfn_t pfn; u64 old_spte = *sptep; - if (!spte_has_volatile_bits(old_spte)) - __set_spte(sptep, 0ull); - else - old_spte = __xchg_spte(sptep, 0ull); - if (!is_rmap_spte(old_spte)) return 0; - pfn = spte_to_pfn(old_spte); - if (!shadow_accessed_mask || old_spte shadow_accessed_mask) - kvm_set_pfn_accessed(pfn); - if (!shadow_dirty_mask || (old_spte shadow_dirty_mask)) - kvm_set_pfn_dirty(pfn); + old_spte = spte_get_and_update_clear(sptep, 0ull); + track_spte_bits(old_spte, 0ull, !shadow_accessed_mask); return 1; } -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 18/22] KVM: MMU: do not need to set/clear spte atomically
Now, the spte is just from nonprsent to present or present to nonprsent, so we can use some trick to set/clear spte non-atomicly as linux kernel does Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/mmu.c | 82 +++ 1 files changed, 69 insertions(+), 13 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 339e1a3..0ba94e9 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -259,26 +259,82 @@ static gfn_t pse36_gfn_delta(u32 gpte) return (gpte PT32_DIR_PSE36_MASK) shift; } +#ifdef CONFIG_X86_64 static void __set_spte(u64 *sptep, u64 spte) { - set_64bit(sptep, spte); + *sptep = spte; } -static u64 __xchg_spte(u64 *sptep, u64 new_spte) +static void __update_clear_spte_fast(u64 *sptep, u64 spte) { -#ifdef CONFIG_X86_64 - return xchg(sptep, new_spte); + *sptep = spte; +} + +static u64 __update_clear_spte_slow(u64 *sptep, u64 spte) +{ + return xchg(sptep, spte); +} #else - u64 old_spte; +union split_spte { + struct { + u32 spte_low; + u32 spte_high; + }; + u64 spte; +}; - do { - old_spte = *sptep; - } while (cmpxchg64(sptep, old_spte, new_spte) != old_spte); +static void __set_spte(u64 *sptep, u64 spte) +{ + union split_spte *ssptep, sspte; - return old_spte; -#endif + ssptep = (union split_spte *)sptep; + sspte = (union split_spte)spte; + + ssptep-spte_high = sspte.spte_high; + + /* +* If we map the spte from nonpresent to present, We should store +* the high bits firstly, then set present bit, so cpu can not +* fetch this spte while we are setting the spte. +*/ + smp_wmb(); + + ssptep-spte_low = sspte.spte_low; } +static void __update_clear_spte_fast(u64 *sptep, u64 spte) +{ + union split_spte *ssptep, sspte; + + ssptep = (union split_spte *)sptep; + sspte = (union split_spte)spte; + + ssptep-spte_low = sspte.spte_low; + + /* +* If we map the spte from present to nonpresent, we should clear +* present bit firstly to avoid vcpu fetch the old high bits. 
+*/ + smp_wmb(); + + ssptep-spte_high = sspte.spte_high; +} + +static u64 __update_clear_spte_slow(u64 *sptep, u64 spte) +{ + union split_spte *ssptep, sspte, orig; + + ssptep = (union split_spte *)sptep; + sspte = (union split_spte)spte; + + /* xchg acts as a barrier before the setting of the high bits */ + orig.spte_low = xchg(ssptep-spte_low, sspte.spte_low); + orig.spte_high = ssptep-spte_high = sspte.spte_high; + + return orig.spte; +} +#endif + static bool spte_is_bit_cleared(u64 old_spte, u64 new_spte, u64 bit_mask) { return (old_spte bit_mask) !(new_spte bit_mask); @@ -319,9 +375,9 @@ static u64 spte_get_and_update_clear(u64 *sptep, u64 new_spte) u64 old_spte = *sptep; if (!spte_has_volatile_bits(old_spte, new_spte)) - __set_spte(sptep, new_spte); + __update_clear_spte_fast(sptep, new_spte); else - old_spte = __xchg_spte(sptep, new_spte); + old_spte = __update_clear_spte_slow(sptep, new_spte); return old_spte; } @@ -379,7 +435,7 @@ static int mmu_spte_clear_track_bits(u64 *sptep) */ static void mmu_spte_clear_no_track(u64 *sptep) { - __set_spte(sptep, 0ull); + __update_clear_spte_fast(sptep, 0ull); } static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache, -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 19/22] KVM: MMU: lockless walking shadow page table
Use rcu to protect shadow pages table to be freed, so we can safely walk it, it should run fastly and is needed by mmio page fault Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/include/asm/kvm_host.h |8 +++ arch/x86/kvm/mmu.c | 132 --- 2 files changed, 132 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 42e577d..87a868e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -233,6 +233,12 @@ struct kvm_mmu_page { unsigned int unsync_children; unsigned long parent_ptes; /* Reverse mapping for parent_pte */ DECLARE_BITMAP(unsync_child_bitmap, 512); + +#ifdef CONFIG_X86_32 + int clear_spte_count; +#endif + + struct rcu_head rcu; }; struct kvm_pv_mmu_op_buffer { @@ -477,6 +483,8 @@ struct kvm_arch { u64 hv_guest_os_id; u64 hv_hypercall; + atomic_t reader_counter; + #ifdef CONFIG_KVM_MMU_AUDIT int audit_point; #endif diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 0ba94e9..dad7ad9 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -182,6 +182,12 @@ struct kvm_shadow_walk_iterator { shadow_walk_okay((_walker)); \ shadow_walk_next((_walker))) +#define for_each_shadow_entry_lockless(_vcpu, _addr, _walker, spte)\ + for (shadow_walk_init((_walker), _vcpu, _addr);\ +shadow_walk_okay((_walker)) \ + ({ spte = mmu_spte_get_lockless(_walker.sptep); 1; }); \ +__shadow_walk_next((_walker), spte)) + static struct kmem_cache *pte_list_desc_cache; static struct kmem_cache *mmu_page_header_cache; static struct percpu_counter kvm_total_used_mmu_pages; @@ -274,6 +280,11 @@ static u64 __update_clear_spte_slow(u64 *sptep, u64 spte) { return xchg(sptep, spte); } + +static u64 __get_spte_lockless(u64 *sptep) +{ + return ACCESS_ONCE(*sptep); +} #else union split_spte { struct { @@ -283,6 +294,18 @@ union split_spte { u64 spte; }; +static void count_spte_clear(u64 *sptep, u64 spte) +{ + struct kvm_mmu_page *sp = page_header(__pa(sptep)); + + if 
(is_shadow_present_pte(spte)) + return; + + /* Ensure the spte is completely set before we increase the count */ + smp_wmb(); + sp-clear_spte_count++; +} + static void __set_spte(u64 *sptep, u64 spte) { union split_spte *ssptep, sspte; @@ -318,6 +341,7 @@ static void __update_clear_spte_fast(u64 *sptep, u64 spte) smp_wmb(); ssptep-spte_high = sspte.spte_high; + count_spte_clear(sptep, spte); } static u64 __update_clear_spte_slow(u64 *sptep, u64 spte) @@ -330,9 +354,40 @@ static u64 __update_clear_spte_slow(u64 *sptep, u64 spte) /* xchg acts as a barrier before the setting of the high bits */ orig.spte_low = xchg(ssptep-spte_low, sspte.spte_low); orig.spte_high = ssptep-spte_high = sspte.spte_high; + count_spte_clear(sptep, spte); return orig.spte; } + +/* + * The idea using the light way get the spte on x86_32 guest is from + * gup_get_pte(arch/x86/mm/gup.c). + * The difference is we can not catch the spte tlb flush if we leave + * guest mode, so we emulate it by increase clear_spte_count when spte + * is cleared. 
+ */ +static u64 __get_spte_lockless(u64 *sptep) +{ + struct kvm_mmu_page *sp = page_header(__pa(sptep)); + union split_spte spte, *orig = (union split_spte *)sptep; + int count; + +retry: + count = sp-clear_spte_count; + smp_rmb(); + + spte.spte_low = orig-spte_low; + smp_rmb(); + + spte.spte_high = orig-spte_high; + smp_rmb(); + + if (unlikely(spte.spte_low != orig-spte_low || + count != sp-clear_spte_count)) + goto retry; + + return spte.spte; +} #endif static bool spte_is_bit_cleared(u64 old_spte, u64 new_spte, u64 bit_mask) @@ -438,6 +493,28 @@ static void mmu_spte_clear_no_track(u64 *sptep) __update_clear_spte_fast(sptep, 0ull); } +static u64 mmu_spte_get_lockless(u64 *sptep) +{ + return __get_spte_lockless(sptep); +} + +static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu) +{ + rcu_read_lock(); + atomic_inc(vcpu-kvm-arch.reader_counter); + + /* Increase the counter before walking shadow page table */ + smp_mb__after_atomic_inc(); +} + +static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu) +{ + /* Decrease the counter after walking shadow page table finished */ + smp_mb__before_atomic_dec(); + atomic_dec(vcpu-kvm-arch.reader_counter); + rcu_read_unlock(); +} + static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache, struct kmem_cache *base_cache, int min) {
[PATCH v2 20/22] KVM: MMU: reorganize struct kvm_shadow_walk_iterator
Reorganize the fields to make better use of the cache line.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index dad7ad9..1319050 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -172,8 +172,8 @@ struct pte_list_desc {
 struct kvm_shadow_walk_iterator {
 	u64 addr;
 	hpa_t shadow_addr;
-	int level;
 	u64 *sptep;
+	int level;
 	unsigned index;
 };
-- 
1.7.5.4
[PATCH v2 21/22] KVM: MMU: mmio page fault support
The idea is from Avi: | We could cache the result of a miss in an spte by using a reserved bit, and | checking the page fault error code (or seeing if we get an ept violation or | ept misconfiguration), so if we get repeated mmio on a page, we don't need to | search the slot list/tree. | (https://lkml.org/lkml/2011/2/22/221) When the page fault is caused by mmio, we cache the info in the shadow page table, and also set the reserved bits in the shadow page table, so if the mmio is caused again, we can quickly identify it and emulate it directly Searching mmio gfn in memslots is heavy since we need to walk all memeslots, it can be reduced by this feature, and also avoid walking guest page table for soft mmu. Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/mmu.c | 156 ++-- arch/x86/kvm/mmu.h |1 + arch/x86/kvm/paging_tmpl.h | 18 -- arch/x86/kvm/vmx.c | 10 +++- 4 files changed, 173 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 1319050..e69a47a 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -197,6 +197,41 @@ static u64 __read_mostly shadow_x_mask;/* mutual exclusive with nx_mask */ static u64 __read_mostly shadow_user_mask; static u64 __read_mostly shadow_accessed_mask; static u64 __read_mostly shadow_dirty_mask; +static u64 __read_mostly shadow_mmio_mask = (0xffull 49 | 1ULL); + +static void mmu_spte_set(u64 *sptep, u64 spte); + +static void mark_mmio_spte(u64 *sptep, u64 gfn, unsigned access) +{ + access = ACC_WRITE_MASK | ACC_USER_MASK; + + mmu_spte_set(sptep, shadow_mmio_mask | access | gfn PAGE_SHIFT); +} + +static bool is_mmio_spte(u64 spte) +{ + return (spte shadow_mmio_mask) == shadow_mmio_mask; +} + +static gfn_t get_mmio_spte_gfn(u64 spte) +{ + return (spte ~shadow_mmio_mask) PAGE_SHIFT; +} + +static unsigned get_mmio_spte_access(u64 spte) +{ + return (spte ~shadow_mmio_mask) ~PAGE_MASK; +} + +static bool set_mmio_spte(u64 *sptep, gfn_t gfn, pfn_t pfn, unsigned access) +{ + if 
(unlikely(is_mmio_pfn(pfn))) { + mark_mmio_spte(sptep, gfn, access); + return true; + } + + return false; +} static inline u64 rsvd_bits(int s, int e) { @@ -226,7 +261,7 @@ static int is_nx(struct kvm_vcpu *vcpu) static int is_shadow_present_pte(u64 pte) { - return pte != 0ull; + return pte != 0ull !is_mmio_spte(pte); } static int is_large_pte(u64 pte) @@ -2123,6 +2158,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 spte, entry = *sptep; int ret = 0; + if (set_mmio_spte(sptep, gfn, pfn, pte_access)) + return 0; + /* * We don't set the accessed bit, since we sometimes want to see * whether the guest actually used the pte (in order to detect @@ -2258,6 +2296,9 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep, kvm_mmu_flush_tlb(vcpu); } + if (unlikely(is_mmio_spte(*sptep) emulate)) + *emulate = 1; + pgprintk(%s: setting spte %llx\n, __func__, *sptep); pgprintk(instantiating %s PTE (%s) at %llx (%llx) addr %p\n, is_large_pte(*sptep)? 2MB : 4kB, @@ -2484,7 +2525,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, static bool mmu_invalid_pfn(pfn_t pfn) { - return unlikely(is_invalid_pfn(pfn) || is_mmio_pfn(pfn)); + return unlikely(is_invalid_pfn(pfn)); } static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn, @@ -2498,11 +2539,8 @@ static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn, goto exit; } - if (unlikely(is_mmio_pfn(pfn))) { + if (unlikely(is_mmio_pfn(pfn))) vcpu_cache_mmio_info(vcpu, gva, gfn, access); - *ret_val = 1; - goto exit; - } ret = false; exit: @@ -2816,6 +2854,86 @@ static gpa_t nonpaging_gva_to_gpa_nested(struct kvm_vcpu *vcpu, gva_t vaddr, return vcpu-arch.nested_mmu.translate_gpa(vcpu, vaddr, access); } +static bool quickly_check_mmio_pf(struct kvm_vcpu *vcpu, u64 addr, bool direct) +{ + if (direct) + return vcpu_match_mmio_gpa(vcpu, addr); + + return vcpu_match_mmio_gva(vcpu, addr); +} + +static u64 walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 
addr)
+{
+	struct kvm_shadow_walk_iterator iterator;
+	u64 spte, ret = 0ull;
+
+	walk_shadow_page_lockless_begin(vcpu);
+	for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
+		if (iterator.level == 1) {
+			ret = spte;
+			break;
+		}
+
+		if (!is_shadow_present_pte(spte))
+			break;
+	}
+	walk_shadow_page_lockless_end(vcpu);
+
+	return ret;
+}
[PATCH v2 22/22] KVM: MMU: trace mmio page fault
Add tracepoints to trace mmio page fault Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/mmu.c |5 arch/x86/kvm/mmutrace.h | 48 +++ arch/x86/kvm/trace.h| 23 ++ arch/x86/kvm/x86.c |5 +++- 4 files changed, 80 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index e69a47a..5a4e36c 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -205,6 +205,7 @@ static void mark_mmio_spte(u64 *sptep, u64 gfn, unsigned access) { access = ACC_WRITE_MASK | ACC_USER_MASK; + trace_mark_mmio_spte(sptep, gfn, access); mmu_spte_set(sptep, shadow_mmio_mask | access | gfn PAGE_SHIFT); } @@ -1913,6 +1914,8 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm, kvm_mmu_isolate_pages(invalid_list); sp = list_first_entry(invalid_list, struct kvm_mmu_page, link); list_del_init(invalid_list); + + trace_kvm_mmu_delay_free_pages(sp); call_rcu(sp-rcu, free_pages_rcu); return; } @@ -2902,6 +2905,8 @@ int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct) if (direct) addr = 0; + + trace_handle_mmio_page_fault(addr, gfn, access); vcpu_cache_mmio_info(vcpu, addr, gfn, access); return 1; } diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h index b60b4fd..eed67f3 100644 --- a/arch/x86/kvm/mmutrace.h +++ b/arch/x86/kvm/mmutrace.h @@ -196,6 +196,54 @@ DEFINE_EVENT(kvm_mmu_page_class, kvm_mmu_prepare_zap_page, TP_ARGS(sp) ); +DEFINE_EVENT(kvm_mmu_page_class, kvm_mmu_delay_free_pages, + TP_PROTO(struct kvm_mmu_page *sp), + + TP_ARGS(sp) +); + +TRACE_EVENT( + mark_mmio_spte, + TP_PROTO(u64 *sptep, gfn_t gfn, unsigned access), + TP_ARGS(sptep, gfn, access), + + TP_STRUCT__entry( + __field(void *, sptep) + __field(gfn_t, gfn) + __field(unsigned, access) + ), + + TP_fast_assign( + __entry-sptep = sptep; + __entry-gfn = gfn; + __entry-access = access; + ), + + TP_printk(sptep:%p gfn %llx access %x, __entry-sptep, __entry-gfn, + __entry-access) +); + +TRACE_EVENT( + handle_mmio_page_fault, + TP_PROTO(u64 addr, 
gfn_t gfn, unsigned access), + TP_ARGS(addr, gfn, access), + + TP_STRUCT__entry( + __field(u64, addr) + __field(gfn_t, gfn) + __field(unsigned, access) + ), + + TP_fast_assign( + __entry-addr = addr; + __entry-gfn = gfn; + __entry-access = access; + ), + + TP_printk(addr:%llx gfn %llx access %x, __entry-addr, __entry-gfn, + __entry-access) +); + TRACE_EVENT( kvm_mmu_audit, TP_PROTO(struct kvm_vcpu *vcpu, int audit_point), diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h index 624f8cb..3ff898c 100644 --- a/arch/x86/kvm/trace.h +++ b/arch/x86/kvm/trace.h @@ -698,6 +698,29 @@ TRACE_EVENT(kvm_emulate_insn, #define trace_kvm_emulate_insn_start(vcpu) trace_kvm_emulate_insn(vcpu, 0) #define trace_kvm_emulate_insn_failed(vcpu) trace_kvm_emulate_insn(vcpu, 1) +TRACE_EVENT( + vcpu_match_mmio, + TP_PROTO(gva_t gva, gpa_t gpa, bool write, bool gpa_match), + TP_ARGS(gva, gpa, write, gpa_match), + + TP_STRUCT__entry( + __field(gva_t, gva) + __field(gpa_t, gpa) + __field(bool, write) + __field(bool, gpa_match) + ), + + TP_fast_assign( + __entry-gva = gva; + __entry-gpa = gpa; + __entry-write = write; + __entry-gpa_match = gpa_match + ), + + TP_printk(gva %#lx gpa %#llx %s %s, __entry-gva, __entry-gpa, + __entry-write ? Write : Read, + __entry-gpa_match ? GPA : GVA) +); #endif /* _TRACE_KVM_H */ #undef TRACE_INCLUDE_PATH diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 78d5519..24a0f2c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3955,6 +3955,7 @@ static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva, vcpu-arch.access)) { *gpa = vcpu-arch.mmio_gfn PAGE_SHIFT | (gva (PAGE_SIZE - 1)); + trace_vcpu_match_mmio(gva, *gpa, write, false); return 1; } @@ -3970,8 +3971,10 @@ static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva, if ((*gpa PAGE_MASK) == APIC_DEFAULT_PHYS_BASE) return 1; - if (vcpu_match_mmio_gpa(vcpu, *gpa)) + if (vcpu_match_mmio_gpa(vcpu, *gpa)) { +
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On Tue, Jun 21, 2011 at 09:32:39PM +0800, Nai Xia wrote:
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index d48ec60..b407a69 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -4674,6 +4674,7 @@ static int __init vmx_init(void)
>  		kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull,
>  				VMX_EPT_EXECUTABLE_MASK);
>  		kvm_enable_tdp();
> +		kvm_dirty_update = 0;
>  	} else
>  		kvm_disable_tdp();

Why not return !shadow_dirty_mask instead of adding a new var?

>  struct mmu_notifier_ops {
> +	int (*dirty_update)(struct mmu_notifier *mn,
> +			    struct mm_struct *mm);

Needs some documentation. I don't think "dirty_update" is a self-explanatory name; "has_test_and_clear_dirty" would be better.

If we don't flush the SMP TLB, don't we risk inserting pages into the unstable tree that are volatile, just because the dirty bit didn't get set again on the spte?

The first patch, I guess, is a sign of hugetlbfs going a little over the edge in trying to mix with the core VM... Passing that parameter "need_pte_unmap" all over the place is not so nice; maybe it'd be possible to fix within hugetlbfs to use a different method to walk the hugetlb vmas. I'd prefer that if possible.

Thanks,
Andrea
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
> If we don't flush the SMP TLB, don't we risk inserting pages into the
> unstable tree that are volatile, just because the dirty bit didn't get
> set again on the spte?

Yes, this is the trade-off we take. The unstable tree will be flushed anyway, so this is nothing that won't be recovered very soon after it happens... and most likely the TLB will be flushed before KSM gets there anyway (especially for a heavily modified page, which we don't want in the unstable tree).
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On 06/22/2011 07:19 AM, Izik Eidus wrote:
> So what we say here is: it is better to have a little junk in the
> unstable tree that gets flushed eventually anyway, instead of making
> the guest slower. This race is something that does not reflect the
> accuracy of KSM anyway, due to the full memcmp that we will eventually
> perform...

With 2MB pages, I am not convinced they will get flushed eventually, because there is a good chance at least one of the 4kB pages inside a 2MB page is in active use at all times.

I worry that the proposed changes may end up effectively preventing KSM from scanning inside 2MB pages when even one 4kB page inside is in active use. This could mean increased swapping on systems that run low on memory, which can be a much larger performance penalty than ksmd CPU use.

We need to scan inside 2MB pages when memory runs low, regardless of the accessed or dirty bits.

-- 
All rights reversed
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
* Izik Eidus (izik.ei...@ravellosystems.com) wrote:
> On 6/22/2011 3:21 AM, Chris Wright wrote:
>> * Nai Xia (nai@gmail.com) wrote:
>>> +	if (!shadow_dirty_mask) {
>>> +		WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
>>> +		goto out;
>>> +	}
>>
>> This should never fire with the dirty_update() notifier test, right?
>> And that means that this whole optimization is for the shadow mmu
>> case, arguably the legacy case.
>
> Hi Chris,
> AMD NPT does track the dirty bit in the nested page tables, so
> shadow_dirty_mask should not be 0 in that case...

Yeah, momentary lapse... ;)
Re: [PATCH v3 1/3] KVM: MMU: Clean up the error handling of walk_addr_generic()
On Mon, Jun 20, 2011 at 11:29:47PM +0900, Takuya Yoshikawa wrote: From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp Avoid two step jump to the error handling part. This eliminates the use of the variables present and rsvd_fault. We also use the const type qualifier to show that write/user/fetch_fault do not change in the function. Both of these were suggested by Ingo Molnar. Cc: Ingo Molnar mi...@elte.hu Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp --- v2-v3: only changelog update arch/x86/kvm/paging_tmpl.h | 64 +++ 1 files changed, 28 insertions(+), 36 deletions(-) diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 1caeb4d..137aa45 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -125,18 +125,17 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, gfn_t table_gfn; unsigned index, pt_access, uninitialized_var(pte_access); gpa_t pte_gpa; - bool eperm, present, rsvd_fault; - int offset, write_fault, user_fault, fetch_fault; - - write_fault = access PFERR_WRITE_MASK; - user_fault = access PFERR_USER_MASK; - fetch_fault = access PFERR_FETCH_MASK; + bool eperm; + int offset; + const int write_fault = access PFERR_WRITE_MASK; + const int user_fault = access PFERR_USER_MASK; + const int fetch_fault = access PFERR_FETCH_MASK; + u16 errcode = 0; trace_kvm_mmu_pagetable_walk(addr, write_fault, user_fault, fetch_fault); walk: - present = true; - eperm = rsvd_fault = false; + eperm = false; walker-level = mmu-root_level; pte = mmu-get_cr3(vcpu); @@ -145,7 +144,7 @@ walk: pte = kvm_pdptr_read_mmu(vcpu, mmu, (addr 30) 3); trace_kvm_mmu_paging_element(pte, walker-level); if (!is_present_gpte(pte)) { - present = false; + errcode |= PFERR_PRESENT_MASK; goto error; } --walker-level; @@ -171,34 +170,34 @@ walk: real_gfn = mmu-translate_gpa(vcpu, gfn_to_gpa(table_gfn), PFERR_USER_MASK|PFERR_WRITE_MASK); if (unlikely(real_gfn == UNMAPPED_GVA)) { - present = false; - break; + errcode |= 
PFERR_PRESENT_MASK; + goto error; } real_gfn = gpa_to_gfn(real_gfn); host_addr = gfn_to_hva(vcpu-kvm, real_gfn); if (unlikely(kvm_is_error_hva(host_addr))) { - present = false; - break; + errcode |= PFERR_PRESENT_MASK; + goto error; } ptep_user = (pt_element_t __user *)((void *)host_addr + offset); if (unlikely(__copy_from_user(pte, ptep_user, sizeof(pte { - present = false; - break; + errcode |= PFERR_PRESENT_MASK; + goto error; } trace_kvm_mmu_paging_element(pte, walker-level); if (unlikely(!is_present_gpte(pte))) { - present = false; - break; + errcode |= PFERR_PRESENT_MASK; + goto error; } Assignment of PFERR_PRESENT_MASK is inverted. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On Wed, Jun 22, 2011 at 11:39:40AM -0400, Rik van Riel wrote: On 06/22/2011 07:19 AM, Izik Eidus wrote: So what we say here is: it is better to have little junk in the unstable tree that get flushed eventualy anyway, instead of make the guest slower this race is something that does not reflect accurate of ksm anyway due to the full memcmp that we will eventualy perform... With 2MB pages, I am not convinced they will get flushed eventually, because there is a good chance at least one of the 4kB pages inside a 2MB page is in active use at all times. I worry that the proposed changes may end up effectively preventing KSM from scanning inside 2MB pages, when even one 4kB page inside is in active use. This could mean increased swapping on systems that run low on memory, which can be a much larger performance penalty than ksmd CPU use. We need to scan inside 2MB pages when memory runs low, regardless of the accessed or dirty bits. I guess we could fallback to the cksum when a THP is encountered (repeating the test_and_clear_dirty also wouldn't give the expected result if it's repeated on the same hugepmd for the next 4k virtual address candidate for unstable tree insertion, so it'd need special handling during the virtual walk anyway). So it's getting a little hairy, skip on THP, skip on EPT, then I wonder what is the common case that would be left using it... Or we could evaluate with statistic how many less pages are inserted into the unstable tree using the 2m dirty bit but clearly it'd be less reliable, the algorithm really is meant to track the volatility of what is later merged, not of a bigger chunk with unrelated data in it. On a side note, khugepaged should also be changed to preserve the dirty bit if at least one dirty bit of the ptes is dirty (currently the hugepmd is always created dirty, it can never happen for an hugepmd to be clean today so it wasn't preserved in khugepaged so far). 
Re: [PATCH v3 1/3] KVM: MMU: Clean up the error handling of walk_addr_generic()
On 06/22/2011 07:46 PM, Marcelo Tosatti wrote:
>> 	if (unlikely(!is_present_gpte(pte))) {
>> -		present = false;
>> -		break;
>> +		errcode |= PFERR_PRESENT_MASK;
>> +		goto error;
>> 	}
>
> Assignment of PFERR_PRESENT_MASK is inverted.

Note: kvm-unit-tests.git/x86/access.flat would have caught this.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH v2 01/22] KVM: MMU: fix walking shadow page table
On Wed, Jun 22, 2011 at 10:28:04PM +0800, Xiao Guangrong wrote:
> Properly check the last mapping, and do not walk to the next level
> if the last spte is met
>
> Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
> ---
>  arch/x86/kvm/mmu.c |    9 +
>  1 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 9c629b5..f474e93 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1517,10 +1517,6 @@ static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
>  	if (iterator->level < PT_PAGE_TABLE_LEVEL)
>  		return false;
>
> -	if (iterator->level == PT_PAGE_TABLE_LEVEL)

Change to <= PT_PAGE_TABLE_LEVEL; checks should be in shadow_walk_okay.

> -		if (is_large_pte(*iterator->sptep))
> -			return false;
> -
>  	iterator->index = SHADOW_PT_INDEX(iterator->addr, iterator->level);
>  	iterator->sptep = ((u64 *)__va(iterator->shadow_addr)) + iterator->index;
>  	return true;
Re: [Qemu-devel] [PATCH 03/12] Switch build system to accompanied kernel headers
Am 08.06.2011 16:10, schrieb Jan Kiszka: This helps reducing our build-time checks for feature support in the available Linux kernel headers. And it helps users that do not have sufficiently recent headers installed on their build machine. Consequently, the patch removes and build-time checks for kvm and vhost in configure, the --kerneldir switch, and KVM_CFLAGS. Kernel headers are supposed to be provided by QEMU only. s390 needs some extra love as it carries redefinitions from kernel headers. CC: Alexander Graf ag...@suse.de Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- Makefile.target | 4 +- configure | 151 ++ target-s390x/cpu.h | 10 --- target-s390x/op_helper.c | 1 + 4 files changed, 21 insertions(+), 145 deletions(-) diff --git a/Makefile.target b/Makefile.target index 5c22df8..be9c0e8 100644 --- a/Makefile.target +++ b/Makefile.target @@ -14,7 +14,7 @@ endif TARGET_PATH=$(SRC_PATH)/target-$(TARGET_BASE_ARCH) $(call set-vpath, $(SRC_PATH):$(TARGET_PATH):$(SRC_PATH)/hw) -QEMU_CFLAGS+= -I.. -I$(TARGET_PATH) -DNEED_CPU_H +QEMU_CFLAGS+= -I.. -I../linux-headers -I$(TARGET_PATH) -DNEED_CPU_H include $(SRC_PATH)/Makefile.objs @@ -37,8 +37,6 @@ ifndef CONFIG_HAIKU LIBS+=-lm endif -kvm.o kvm-all.o vhost.o vhost_net.o kvmclock.o: QEMU_CFLAGS+=$(KVM_CFLAGS) - config-target.h: config-target.h-timestamp config-target.h-timestamp: config-target.mak diff --git a/configure b/configure index d38b952..0e1dc46 100755 --- a/configure +++ b/configure @@ -113,8 +113,7 @@ curl= curses= docs= fdt= -kvm= -kvm_para= +kvm=yes nptl= Are you planning to add kvm support for all platforms which don't support it today? If not, kvm=yes should be restricted to platforms with kvm support. Otherwise, QEMU builds will fail very early: ERROR: Host kernel lacks signalfd() support, but KVM depends on it when the IO thread is disabled. Of course, users of those non-kvm platforms can set --disable-kvm, but I don't think that is the correct solution. 
Even with kvm disabled, builds still fail for non-kvm systems:

In file included from /qemu/hw/kvmclock.c:21:
/qemu/linux-headers/linux/kvm_para.h:26:26: warning: asm/kvm_para.h: No such file or directory

Cheers,
Stefan
Re: [PATCH 03/12] Switch build system to accompanied kernel headers
On 2011-06-22 22:51, Stefan Weil wrote:
> Am 08.06.2011 16:10, schrieb Jan Kiszka:
> [full patch quoted upthread, trimmed]
>
> Are you planning to add kvm support for all platforms which don't
> support it today?

That would mean replacing all their kernels with Linux. Will take a bit longer.
I simply overshot with my cleanups:

diff --git a/configure b/configure
index 3286e33..e6847c4 100755
--- a/configure
+++ b/configure
@@ -113,7 +113,7 @@ curl=
 curses=
 docs=
 fdt=
-kvm=yes
+kvm=
 nptl=
 sdl=
 vnc=yes
@@ -129,7 +129,7 @@ xen=
 xen_ctrl_version=
 linux_aio=
 attr=
-vhost_net=yes
+vhost_net=
 xfs=
 gprof=no
@@ -457,6 +457,8 @@ Haiku)
   linux=yes
   linux_user=yes
   usb=linux
+  kvm=yes
+  vhost_net=yes
   if [ $cpu = i386 -o $cpu = x86_64 ] ; then
     audio_possible_drivers=$audio_possible_drivers fmod
   fi

Will post this as a patch tomorrow.

> If not, kvm=yes should be restricted to platforms with kvm support.
> Otherwise, QEMU builds will fail very early:
>
> ERROR: Host kernel lacks signalfd() support, but KVM depends on it
> when the IO thread is disabled.
>
> Of course, users of those non-kvm platforms can set --disable-kvm,
> but I don't think that is the correct solution.
>
> Even with kvm disabled, builds still fail for non-kvm systems:
>
> In file included from /qemu/hw/kvmclock.c:21:
> /qemu/linux-headers/linux/kvm_para.h:26:26: warning: asm/kvm_para.h: No such file or directory

That indicates symlink emulation under Windows does not support directories. Can you confirm this (check what builddir/linux-headers/asm became)? Then we would have to link all files in the arch header dir individually.

Jan
[PATCH] KVM test: get_started.py: Remove confusing question
After receiving some feedback on get_started.py, resolved to remove a question regarding NFS shares. Since the script prints the directories clearly, if one wants to set up NFS or symlinks, he/she will do it.

Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com
---
 client/tests/kvm/get_started.py |    7 ---
 1 files changed, 0 insertions(+), 7 deletions(-)

diff --git a/client/tests/kvm/get_started.py b/client/tests/kvm/get_started.py
index c986f5e..5f76817 100755
--- a/client/tests/kvm/get_started.py
+++ b/client/tests/kvm/get_started.py
@@ -80,13 +80,6 @@ if __name__ == "__main__":
         else:
             logging.debug("Dir %s exists, not creating" % sub_dir_path)

-    answer = utils.ask("Do you want to setup NFS mounts for some of those "
-                       "dirs?")
-    if answer == 'y':
-        logging.info("Exiting the script so you can setup the NFS mounts. "
-                     "When you are done, re-run this script.")
-        sys.exit(0)
-
     logging.info("")
     logging.info("2 - Creating config files from samples (copy the default "
                  "config samples to actual config files)")
-- 
1.7.5.4
Re: [PATCH v2 21/22] KVM: MMU: mmio page fault support
Xiao,

On Wed, Jun 22, 2011 at 10:36:16PM +0800, Xiao Guangrong wrote:
> The idea is from Avi:
> [patch description and diffstat quoted in full upthread, trimmed here]
>
> @@ -2123,6 +2158,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>  	u64 spte, entry = *sptep;
>  	int ret = 0;
>
> +	if (set_mmio_spte(sptep, gfn, pfn, pte_access))
> +		return 0;
> +

Should zap all mmio sptes when creating new memory slots (current qemu never exhibits that pattern, but it's not an unsupported scenario). Easier to zap all mmu in that case.

For shadow you need to remove mmio sptes on invlpg and all mmio sptes under CR3 (mmu_sync_roots), other than clearing the gva/gpa variables. Otherwise you can move the mmio info from an mmio spte back to mmio_gva/mmio_gfn after a TLB flush, without rereading the guest pagetable.

> +	walk_shadow_page_lockless_begin(vcpu);
> +	for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
> +		if (iterator.level == 1) {
> +			ret = spte;
> +			break;
> +		}

Have you verified it is safe to walk sptes with parallel modifications to them? Other than shadow page deletion, which is in theory covered by RCU. It would be good to have all cases written down.

> +	/*
> +	 * It's ok if the gva is remapped by other cpus on shadow guest,
> +	 * it's a BUG if the gfn is not a mmio page.
> +	 */

If the gva is remapped there should be no mmio page
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On Wed, Jun 22, 2011 at 11:39 PM, Rik van Riel r...@redhat.com wrote:
> With 2MB pages, I am not convinced they will get flushed eventually,
> because there is a good chance at least one of the 4kB pages inside
> a 2MB page is in active use at all times.
> [...]
> We need to scan inside 2MB pages when memory runs low, regardless of
> the accessed or dirty bits.

I agree on this point. The dirty bit and young bit are by no means accurate. Even on 4kB pages, there is always a chance that the ptes are dirty but the contents are actually the same.

Yeah, the whole optimization contains trade-offs, and trade-offs always have the possibility of annoying someone. Just like page-bit-relying LRU approximations, none of them is perfect either. But I think it can benefit some people. So maybe we could just provide a generic, balanced solution but add fine-tuning interfaces to make sure that when it really gets in the way of someone, he has a way to work around it.

Do you agree with my argument? :-)
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On Wed, Jun 22, 2011 at 11:03 PM, Andrea Arcangeli aarca...@redhat.com wrote:
> [review comments quoted upthread, trimmed]
> The first patch, I guess, is a sign of hugetlbfs going a little over
> the edge in trying to mix with the core VM... Passing that parameter
> "need_pte_unmap" all over the place is not so nice; maybe it'd be
> possible to fix within hugetlbfs to use a different method to walk
> the hugetlb vmas. I'd prefer that if possible.

OK, I'll try some other workarounds. I am not feeling good about need_pte_unmap myself. :-)

Thanks for reviewing!

-Nai
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On Thu, Jun 23, 2011 at 07:13:54AM +0800, Nai Xia wrote:
> I agree on this point. The dirty bit and young bit are by no means
> accurate. Even on 4kB pages, there is always a chance that the ptes
> are dirty but the contents are actually the same. Yeah, the whole
> optimization contains trade-offs and

Just a side note: the fact the dirty bit would be set even when the data is the same is actually a pro, not a con. If the content is the same but the page was written to, it'd trigger a copy-on-write shortly after merging the page, rendering the whole exercise wasteful. The cksum plays a double role: it both stabilizes the unstable tree, so there's less chance of bad lookups, and it avoids merging stuff that is written to frequently, which would trigger copy-on-writes. And the dirty bit would also catch overwrites with the same data, something the cksum can't do.
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On 06/22/2011 07:13 PM, Nai Xia wrote: On Wed, Jun 22, 2011 at 11:39 PM, Rik van Rielr...@redhat.com wrote: On 06/22/2011 07:19 AM, Izik Eidus wrote: So what we say here is: it is better to have little junk in the unstable tree that get flushed eventualy anyway, instead of make the guest slower this race is something that does not reflect accurate of ksm anyway due to the full memcmp that we will eventualy perform... With 2MB pages, I am not convinced they will get flushed eventually, because there is a good chance at least one of the 4kB pages inside a 2MB page is in active use at all times. I worry that the proposed changes may end up effectively preventing KSM from scanning inside 2MB pages, when even one 4kB page inside is in active use. This could mean increased swapping on systems that run low on memory, which can be a much larger performance penalty than ksmd CPU use. We need to scan inside 2MB pages when memory runs low, regardless of the accessed or dirty bits. I agree on this point. Dirty bit , young bit, is by no means accurate. Even on 4kB pages, there is always a chance that the pte are dirty but the contents are actually the same. Yeah, the whole optimization contains trade-offs and trades-offs always have the possibilities to annoy someone. Just like page-bit-relying LRU approximations none of them is perfect too. But I think it can benefit some people. So maybe we could just provide a generic balanced solution but provide fine tuning interfaces to make sure tha when it really gets in the way of someone, he has a way to walk around. Do you agree on my argument? :-) That's not an argument. That is a if I wave my hands vigorously enough, maybe people will let my patch in without thinking about what I wrote style argument. I believe your optimization makes sense for 4kB pages, but is going to be counter-productive for 2MB pages. 
Your approach of "make ksmd skip over more pages, so it uses less CPU" is likely to reduce the effectiveness of ksm by not sharing some pages. For 4kB pages that is fine, because you'll get around to them eventually. However, the internal use of a 2MB page is likely to be quite different. Chances are most 2MB pages will have actively used, barely used, and free pages inside. You absolutely want ksm to get at the barely used and free sub-pages. Letting just one actively used 4kB sub-page prevent ksm from merging any of the other 511 sub-pages is a problem. -- All rights reversed
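Rik's 511-sub-pages point can be illustrated with a toy model (a sketch only; the function names are hypothetical and this is not kernel code). It compares two policies for a 2MB page made of 512 4kB subpages: scanning each subpage on its own merits versus skipping the whole huge page whenever its single hugepmd dirty bit is set.

```c
#include <assert.h>

#define SUBPAGES_PER_HUGE 512  /* 2MB / 4kB */

/* Policy 1: ksmd considers each 4kB subpage on its own; every clean
 * subpage remains a merge candidate. */
static int candidates_per_subpage(const int dirty[SUBPAGES_PER_HUGE])
{
    int i, n = 0;

    for (i = 0; i < SUBPAGES_PER_HUGE; i++)
        if (!dirty[i])
            n++;
    return n;
}

/* Policy 2: ksmd skips the whole 2MB page whenever the single hugepmd
 * dirty bit is set; one hot subpage hides all 512 candidates. */
static int candidates_per_hugepage(const int dirty[SUBPAGES_PER_HUGE])
{
    int i;

    for (i = 0; i < SUBPAGES_PER_HUGE; i++)
        if (dirty[i])
            return 0;
    return SUBPAGES_PER_HUGE;
}
```

With a single actively used subpage, the per-subpage policy still finds 511 candidates while the per-hugepage policy finds none, which is exactly the lost-sharing scenario described above.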
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On Thu, Jun 23, 2011 at 12:55 AM, Andrea Arcangeli aarca...@redhat.com wrote: On Wed, Jun 22, 2011 at 11:39:40AM -0400, Rik van Riel wrote: On 06/22/2011 07:19 AM, Izik Eidus wrote: So what we say here is: it is better to have a little junk in the unstable tree that gets flushed eventually anyway, instead of making the guest slower. This race does not affect the accuracy of ksm anyway, due to the full memcmp that we will eventually perform... With 2MB pages, I am not convinced they will get flushed eventually, because there is a good chance at least one of the 4kB pages inside a 2MB page is in active use at all times. I worry that the proposed changes may end up effectively preventing KSM from scanning inside 2MB pages when even one 4kB page inside is in active use. This could mean increased swapping on systems that run low on memory, which can be a much larger performance penalty than ksmd CPU use. We need to scan inside 2MB pages when memory runs low, regardless of the accessed or dirty bits. I guess we could fall back to the cksum when a THP is encountered (repeating the test_and_clear_dirty also wouldn't give the expected result if it's repeated on the same hugepmd for the next 4k virtual address candidate for unstable tree insertion, so it'd need special handling during the virtual walk anyway). So it's getting a little hairy: skip on THP, skip on EPT, and then I wonder what common case would be left using it... Or we could evaluate with statistics how many fewer pages are inserted into the unstable tree using the 2m dirty bit, but clearly it'd be less reliable; the algorithm really is meant to track the volatility of what is later merged, not of a bigger chunk with unrelated data in it. On 2MB pages, I'd like to remind you and Rik that ksmd currently splits huge pages before their sub-pages really get merged into the stable tree.
So when there are many 2MB pages each having a 4kB subpage changed all the time, it is already a concern for ksmd to judge whether it's worthwhile to split the 2MB page and get its sub-pages merged. I think the policy for ksmd in a system should be: if you cannot do something good, at least do nothing evil. So I really don't think we can satisfy _all_ people. Get a general method and give users one or two knobs to tune it when they hit the corner cases. What do you think of my proposal? On a side note, khugepaged should also be changed to preserve the dirty bit if at least one dirty bit of the ptes is dirty (currently the hugepmd is always created dirty; it can never happen for a hugepmd to be clean today, so it wasn't preserved in khugepaged so far). Thanks for pointing that out. This is something I had overlooked! thanks, Nai
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On Wed, Jun 22, 2011 at 11:03 PM, Andrea Arcangeli aarca...@redhat.com wrote: On Tue, Jun 21, 2011 at 09:32:39PM +0800, Nai Xia wrote: diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index d48ec60..b407a69 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -4674,6 +4674,7 @@ static int __init vmx_init(void) kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull, VMX_EPT_EXECUTABLE_MASK); kvm_enable_tdp(); + kvm_dirty_update = 0; } else kvm_disable_tdp(); Why not return !shadow_dirty_mask instead of adding a new var? struct mmu_notifier_ops { + int (*dirty_update)(struct mmu_notifier *mn, + struct mm_struct *mm); + Needs some docs. OK, I'll add it. I think dirty_update isn't a self-explanatory name. I think has_test_and_clear_dirty would be better. Agreed. So it will be the name in the next version. :) Thanks, Nai If we don't flush the smp tlb, don't we risk that we'll insert pages into the unstable tree that are volatile, just because the dirty bit didn't get set again on the spte? The first patch, I guess, is a sign of hugetlbfs going a little over the edge in trying to mix with the core VM... Passing that need_pte_unmap parameter all over the place is not so nice; maybe it'd be possible to fix it within hugetlbfs by using a different method to walk the hugetlb vmas. I'd prefer that if possible. Thanks, Andrea
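Andrea's suggestion (report `!shadow_dirty_mask` rather than adding a `kvm_dirty_update` variable) can be sketched as a small userspace model. This is a toy, not the actual KVM code: the function names follow the patch's `has_test_and_clear_dirty` / `kvm_test_and_clear_dirty_rmapp` but the spte layout here is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* What the proposed has_test_and_clear_dirty notifier op would report:
 * under EPT the hardware does not maintain a dirty bit in the spte, so
 * shadow_dirty_mask is 0 and the feature is unavailable. */
static int has_test_and_clear_dirty(uint64_t shadow_dirty_mask)
{
    return shadow_dirty_mask != 0;
}

/* Test-and-clear the dirty bit across n sptes, in the spirit of
 * kvm_test_and_clear_dirty_rmapp: report whether any spte was dirty,
 * clearing the bit so the next scan round sees fresh information. */
static int test_and_clear_dirty(uint64_t *sptes, int n,
                                uint64_t shadow_dirty_mask)
{
    int i, dirty = 0;

    if (!shadow_dirty_mask)   /* EPT: the bit is never set by hardware */
        return 0;
    for (i = 0; i < n; i++) {
        if (sptes[i] & shadow_dirty_mask) {
            sptes[i] &= ~shadow_dirty_mask;
            dirty = 1;
        }
    }
    return dirty;
}
```

The second call on the same sptes returns clean, which is also why the thread worries about a missing TLB flush: if the guest keeps writing through a stale TLB entry, the dirty bit is never set again and the page looks stable when it is not.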
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On Thu, Jun 23, 2011 at 07:19:06AM +0800, Nai Xia wrote: OK, I'll have a try at other workarounds. I am not feeling good about need_pte_unmap myself. :-) The usual way is to check VM_HUGETLB in the caller and to call another function that doesn't kmap. Casting pmd_t to pte_t isn't really nice (but hey, we're also doing that exceptionally in smaps_pte_range for THP; it's safe there because we're casting the value of the pmd, not the pointer to the pmd, so the kmap is done by the pte version of the caller and not by the pmd version of the caller). Is it done for migrate? Surely it's not for swapout ;). Thanks for reviewing! You're welcome! JFYI I'll be offline on vacation for a week, starting tomorrow, so if I don't answer in the next few days that's the reason, but I'll follow the progress in a week. Thanks! Andrea
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On Thu, Jun 23, 2011 at 07:37:47AM +0800, Nai Xia wrote: On 2MB pages, I'd like to remind you and Rik that ksmd currently splits huge pages before their sub-pages really get merged into the stable tree. So when there are many 2MB pages each having a 4kB subpage changed all the time, it is already a concern for ksmd to judge whether it's worthwhile to split the 2MB page and get its sub-pages merged. Hmm, not sure I follow. KSM memory density with THP on and off should be identical. The cksum is computed on subpages, so the fact that the 4k subpage is actually mapped by a hugepmd is invisible to KSM up to the point we get an unstable_tree_search_insert/stable_tree_search lookup succeeding. I think the policy for ksmd in a system should be: if you cannot do something good, at least do nothing evil. So I really don't think we can satisfy _all_ people. Get a general method and give users one or two knobs to tune it when they hit the corner cases. What do you think of my proposal? I'm neutral, but if we get two methods for deciding the unstable tree candidates, the default probably should prioritize maximum merging even if it takes more CPU (if one cares about the performance of the core dedicated to ksmd, KSM is likely going to be off or scanning at a low rate in the first place). On a side note, khugepaged should also be changed to preserve the dirty bit if at least one dirty bit of the ptes is dirty (currently the hugepmd is always created dirty; it can never happen for a hugepmd to be clean today, so it wasn't preserved in khugepaged so far). Thanks for pointing that out. This is something I had overlooked! No prob. And its default scan rate is very slow compared to ksmd, so it was unlikely to generate too many false positive dirty bits even if you were splitting hugepages through swap.
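The cksum fallback Andrea mentions works roughly like this sketch (the real mm/ksm.c uses jhash2 over the page contents; the simple rolling sum and the `unstable_candidate` name here are assumptions for illustration): a page only becomes an unstable-tree candidate once its checksum is unchanged since the previous scan round.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for KSM's calc_checksum (the real one is jhash2). */
static uint32_t calc_checksum(const uint8_t *page, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < len; i++)
        sum = sum * 31 + page[i];
    return sum;
}

/* A page is a candidate for unstable-tree insertion only if its content
 * did not change since the last round.  This filters volatile pages,
 * which both stabilizes the tree and avoids merges that would be undone
 * by a copy-on-write soon after. */
static int unstable_candidate(const uint8_t *page, size_t len,
                              uint32_t *oldchecksum)
{
    uint32_t sum = calc_checksum(page, len);

    if (sum != *oldchecksum) {
        *oldchecksum = sum;   /* remember, retry next round */
        return 0;
    }
    return 1;
}
```

Note what the dirty bit would add on top of this: a page overwritten with identical bytes keeps the same checksum (so the cksum passes it), yet the dirty bit still flags it as actively written, which is the point Andrea makes earlier in the thread.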
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On 06/22/2011 07:37 PM, Nai Xia wrote: On 2MB pages, I'd like to remind you and Rik that ksmd currently splits huge pages before their sub-pages really get merged into the stable tree. Your proposal appears to add a condition that causes ksmd to skip doing that, which can cause the system to start using swap instead of sharing memory. So when there are many 2MB pages each having a 4kB subpage changed all the time, it is already a concern for ksmd to judge whether it's worthwhile to split the 2MB page and get its sub-pages merged. I think the policy for ksmd in a system should be: if you cannot do something good, at least do nothing evil. So I really don't think we can satisfy _all_ people. Get a general method and give users one or two knobs to tune it when they hit the corner cases. What do you think of my proposal? I think your proposal makes sense for 4kB pages, but the ksmd policy for 2MB pages probably needs to be much more aggressive. -- All rights reversed
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On Thu, Jun 23, 2011 at 7:44 AM, Andrea Arcangeli aarca...@redhat.com wrote: On Thu, Jun 23, 2011 at 07:19:06AM +0800, Nai Xia wrote: OK, I'll have a try at other workarounds. I am not feeling good about need_pte_unmap myself. :-) The usual way is to check VM_HUGETLB in the caller and to call another function that doesn't kmap. Casting pmd_t to pte_t isn't really nice (but hey, we're also doing that exceptionally in smaps_pte_range for THP; it's safe there because we're casting the value of the pmd, not the pointer to the pmd, so the kmap is done by the pte version of the caller and not by the pmd version of the caller). Is it done for migrate? Surely it's not for swapout ;). Thanks for the hint. :-) You know, another thing I am worried about is that I think I did make page_check_address() return a pmd version for skipping the tail subpages... I did detect a "scheduling while atomic" if I kunmap() the returned value. :-( Thanks for reviewing! You're welcome! JFYI I'll be offline on vacation for a week, starting tomorrow, so if I don't answer in the next few days that's the reason, but I'll follow the progress in a week. Have a nice vacation, man! Enjoy the sunlight; we all have enough of code in rooms. ;-) Thanks, Nai Thanks! Andrea
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On Thu, Jun 23, 2011 at 7:59 AM, Andrea Arcangeli aarca...@redhat.com wrote: On Thu, Jun 23, 2011 at 07:37:47AM +0800, Nai Xia wrote: On 2MB pages, I'd like to remind you and Rik that ksmd currently splits huge pages before their sub-pages really get merged into the stable tree. So when there are many 2MB pages each having a 4kB subpage changed all the time, it is already a concern for ksmd to judge whether it's worthwhile to split the 2MB page and get its sub-pages merged. Hmm, not sure I follow. KSM memory density with THP on and off should be identical. The cksum is computed on subpages, so the fact that the 4k subpage is actually mapped by a hugepmd is invisible to KSM up to the point we get an unstable_tree_search_insert/stable_tree_search lookup succeeding. I agree on your points. But I mean that splitting the huge page into normal pages when some subpages need to be merged may increase the CPU's TLB miss overhead and _might_ hurt the workload ksmd is scanning. If only a small portion of the false-negative 2MB pages really get merged eventually, maybe it's not worthwhile, right? But, well, just like Rik said below, yes, ksmd should be more aggressive to avoid the much more time-consuming cost of swapping. I think the policy for ksmd in a system should be: if you cannot do something good, at least do nothing evil. So I really don't think we can satisfy _all_ people. Get a general method and give users one or two knobs to tune it when they hit the corner cases. What do you think of my proposal? I'm neutral, but if we get two methods for deciding the unstable tree candidates, the default probably should prioritize maximum merging even if it takes more CPU (if one cares about the performance of the core dedicated to ksmd, KSM is likely going to be off or scanning at a low rate in the first place). I agree with you here.
thanks, Nai On a side note, khugepaged should also be changed to preserve the dirty bit if at least one dirty bit of the ptes is dirty (currently the hugepmd is always created dirty; it can never happen for a hugepmd to be clean today, so it wasn't preserved in khugepaged so far). Thanks for pointing that out. This is something I had overlooked! No prob. And its default scan rate is very slow compared to ksmd, so it was unlikely to generate too many false positive dirty bits even if you were splitting hugepages through swap.
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On Thu, Jun 23, 2011 at 8:00 AM, Rik van Riel r...@redhat.com wrote: On 06/22/2011 07:37 PM, Nai Xia wrote: On 2MB pages, I'd like to remind you and Rik that ksmd currently splits huge pages before their sub-pages really get merged into the stable tree. Your proposal appears to add a condition that causes ksmd to skip doing that, which can cause the system to start using swap instead of sharing memory. Hmm, yes, no swapping. So we should make the checksum the default for huge pages, right? So when there are many 2MB pages each having a 4kB subpage changed all the time, it is already a concern for ksmd to judge whether it's worthwhile to split the 2MB page and get its sub-pages merged. I think the policy for ksmd in a system should be: if you cannot do something good, at least do nothing evil. So I really don't think we can satisfy _all_ people. Get a general method and give users one or two knobs to tune it when they hit the corner cases. What do you think of my proposal? I think your proposal makes sense for 4kB pages, but the ksmd policy for 2MB pages probably needs to be much more aggressive. I now agree with you on the whole point. Let's fall back to the checksum. Thanks for making my mind clear! :) And shall we provide an interface for users who _really_ want to judge the dirty bit at the pmd level? I think we should agree on that point before I misunderstand you and spam you with my next submission :P And thanks for your time reviewing! -Nai -- All rights reversed
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On Thu, Jun 23, 2011 at 08:31:56AM +0800, Nai Xia wrote: On Thu, Jun 23, 2011 at 7:59 AM, Andrea Arcangeli aarca...@redhat.com wrote: On Thu, Jun 23, 2011 at 07:37:47AM +0800, Nai Xia wrote: On 2MB pages, I'd like to remind you and Rik that ksmd currently splits huge pages before their sub-pages really get merged into the stable tree. So when there are many 2MB pages each having a 4kB subpage changed all the time, it is already a concern for ksmd to judge whether it's worthwhile to split the 2MB page and get its sub-pages merged. Hmm, not sure I follow. KSM memory density with THP on and off should be identical. The cksum is computed on subpages, so the fact that the 4k subpage is actually mapped by a hugepmd is invisible to KSM up to the point we get an unstable_tree_search_insert/stable_tree_search lookup succeeding. I agree on your points. But I mean that splitting the huge page into normal pages when some subpages need to be merged may increase the CPU's TLB miss overhead and _might_ hurt the workload ksmd is scanning. If only a small portion of the false-negative 2MB pages really get merged eventually, maybe it's not worthwhile, right? Yes, there's no threshold to say "only split if we could merge more than N subpages": one subpage match in two different hugepages is enough to split both and save just 4k, but then memory accesses will be slower for both 2m ranges that have been split. But the point is that it won't be slower than if THP was off in the first place. So in the end all we gain is 4k saved, but we still run faster than with THP off in the other hugepages that haven't been split yet. But, well, just like Rik said below, yes, ksmd should be more aggressive to avoid the much more time-consuming cost of swapping. Correct; the above logic also follows the idea to always maximize memory merging in KSM, which is why we have no threshold waiting for N subpages to be mergeable before we split the hugepage.
I'm unsure if admins in real life would then start to use those thresholds even if we'd implement them. The current way of enabling KSM on a per-VM (not per-host) basis is pretty simple: the performance-critical VM has KSM off, the non-performance-critical VM has KSM on, and it prioritizes memory merging.
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On Thu, Jun 23, 2011 at 7:28 AM, Rik van Riel r...@redhat.com wrote: On 06/22/2011 07:13 PM, Nai Xia wrote: On Wed, Jun 22, 2011 at 11:39 PM, Rik van Riel r...@redhat.com wrote: On 06/22/2011 07:19 AM, Izik Eidus wrote: So what we say here is: it is better to have a little junk in the unstable tree that gets flushed eventually anyway, instead of making the guest slower. This race does not affect the accuracy of ksm anyway, due to the full memcmp that we will eventually perform... With 2MB pages, I am not convinced they will get flushed eventually, because there is a good chance at least one of the 4kB pages inside a 2MB page is in active use at all times. I worry that the proposed changes may end up effectively preventing KSM from scanning inside 2MB pages when even one 4kB page inside is in active use. This could mean increased swapping on systems that run low on memory, which can be a much larger performance penalty than ksmd CPU use. We need to scan inside 2MB pages when memory runs low, regardless of the accessed or dirty bits. I agree on this point. The dirty bit, like the young bit, is by no means accurate. Even on 4kB pages, there is always a chance that the pte is dirty but the contents are actually the same. Yeah, the whole optimization contains trade-offs, and trade-offs always have the possibility to annoy someone. Just like page-bit-relying LRU approximations, none of them is perfect either. But I think it can benefit some people. So maybe we could just provide a generic, balanced solution but provide fine-tuning interfaces to make sure that when it really gets in the way of someone, he has a way to work around it. Do you agree with my argument? :-) That's not an argument. That is an "if I wave my hands vigorously enough, maybe people will let my patch in without thinking about what I wrote" style argument. Oh, NO, this is not what I meant. Really sorry if I made myself look so evil...
I actually meant: skip or not, we agree on a point that will not harm most people, and provide another interface to let someone who _really_ wants to take another way do so. I am by no means pushing the idea of skipping huge pages. I am just not sure about it and want to get a precise idea from you. And now I get it. I believe your optimization makes sense for 4kB pages, but is going to be counter-productive for 2MB pages. Your approach of "make ksmd skip over more pages, so it uses less CPU" is likely to reduce the effectiveness of ksm by not sharing some pages. For 4kB pages that is fine, because you'll get around to them eventually. However, the internal use of a 2MB page is likely to be quite different. Chances are most 2MB pages will have actively used, barely used, and free pages inside. You absolutely want ksm to get at the barely used and free sub-pages. Letting just one actively used 4kB sub-page prevent ksm from merging any of the other 511 sub-pages is a problem. No, no, I was just not sure about it. I meant we cannot satisfy all people, but I was not sure which way is good for most of them. Sorry, again, if I didn't make it clear. Nai -- All rights reversed
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On Thu, Jun 23, 2011 at 7:25 AM, Andrea Arcangeli aarca...@redhat.com wrote: On Thu, Jun 23, 2011 at 07:13:54AM +0800, Nai Xia wrote: I agree on this point. The dirty bit, like the young bit, is by no means accurate. Even on 4kB pages, there is always a chance that the pte is dirty but the contents are actually the same. Yeah, the whole optimization contains trade-offs and ... Just a side note: the fact the dirty bit would be set even when the data is the same is actually a pro, not a con. If the content is the same but the page was written to, it'd trigger a copy-on-write shortly after merging the page, rendering the whole exercise wasteful. The cksum plays a double role: it both stabilizes the unstable tree, so there's less chance of bad lookups, and it keeps us from merging pages that are written to frequently, which would trigger copy-on-writes; and the dirty bit would also catch overwrites with the same data, something the cksum can't do. Good point. I actually have another version of ksm myself (off topic, but if you want to take a glance: http://code.google.com/p/uksm/ :-) ) that did keep statistics of the ratio of pages in a VMA that really got COWed due to KSM merging, on a per-scan-round basis. It's complicated to deduce precise information only from the dirty bit and the cksum. Thanks, Nai
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
On Thu, Jun 23, 2011 at 8:44 AM, Andrea Arcangeli aarca...@redhat.com wrote: On Thu, Jun 23, 2011 at 08:31:56AM +0800, Nai Xia wrote: On Thu, Jun 23, 2011 at 7:59 AM, Andrea Arcangeli aarca...@redhat.com wrote: On Thu, Jun 23, 2011 at 07:37:47AM +0800, Nai Xia wrote: On 2MB pages, I'd like to remind you and Rik that ksmd currently splits huge pages before their sub-pages really get merged into the stable tree. So when there are many 2MB pages each having a 4kB subpage changed all the time, it is already a concern for ksmd to judge whether it's worthwhile to split the 2MB page and get its sub-pages merged. Hmm, not sure I follow. KSM memory density with THP on and off should be identical. The cksum is computed on subpages, so the fact that the 4k subpage is actually mapped by a hugepmd is invisible to KSM up to the point we get an unstable_tree_search_insert/stable_tree_search lookup succeeding. I agree on your points. But I mean that splitting the huge page into normal pages when some subpages need to be merged may increase the CPU's TLB miss overhead and _might_ hurt the workload ksmd is scanning. If only a small portion of the false-negative 2MB pages really get merged eventually, maybe it's not worthwhile, right? Yes, there's no threshold to say "only split if we could merge more than N subpages": one subpage match in two different hugepages is enough to split both and save just 4k, but then memory accesses will be slower for both 2m ranges that have been split. But the point is that it won't be slower than if THP was off in the first place. So in the end all we gain is 4k saved, but we still run faster than with THP off in the other hugepages that haven't been split yet. Yes, so ksmd is still doing good compared to THP off. Thanks for making my mind clearer :) But, well, just like Rik said below, yes, ksmd should be more aggressive to avoid the much more time-consuming cost of swapping.
Correct; the above logic also follows the idea to always maximize memory merging in KSM, which is why we have no threshold waiting for N subpages to be mergeable before we split the hugepage. I'm unsure if admins in real life would then start to use those thresholds even if we'd implement them. The current way of enabling KSM on a per-VM (not per-host) basis is pretty simple: the performance-critical VM has KSM off, the non-performance-critical VM has KSM on, and it prioritizes memory merging. Hmm, yes, you are right. Thanks, Nai
Re: [PATCH v2 01/22] KVM: MMU: fix walking shadow page table
On 06/23/2011 01:13 AM, Marcelo Tosatti wrote: On Wed, Jun 22, 2011 at 10:28:04PM +0800, Xiao Guangrong wrote: Properly check the last mapping, and do not walk to the next level if the last spte is met Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/mmu.c | 9 +++++---- 1 files changed, 5 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 9c629b5..f474e93 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1517,10 +1517,6 @@ static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator) if (iterator->level < PT_PAGE_TABLE_LEVEL) return false; -if (iterator->level == PT_PAGE_TABLE_LEVEL) Change to <= PT_PAGE_TABLE_LEVEL; checks should be in shadow_walk_okay. OK, will fix, thanks!
Re: [PATCH v2 21/22] KVM: MMU: mmio page fault support
Marcelo, Thanks for your review! On 06/23/2011 05:59 AM, Marcelo Tosatti wrote: static int is_large_pte(u64 pte) @@ -2123,6 +2158,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 spte, entry = *sptep; int ret = 0; +if (set_mmio_spte(sptep, gfn, pfn, pte_access)) +return 0; + Should zap all mmio sptes when creating new memory slots (current qemu never exhibits that pattern, but it's not an unsupported scenario). Easier to zap all mmu in that case. OK, will do it in the next version. For shadow you need to remove mmio sptes on invlpg and all mmio sptes under CR3 (mmu_sync_roots), other than clearing the gva/gpa variables. Oh, kvm_mmu_pte_write and invlpg do not zap mmio sptes; I will fix these, thanks for pointing it out. For mmu_sync_roots, we properly do it in FNAME(sync_page) in this patch: static bool sync_mmio_spte(u64 *sptep, gfn_t gfn, unsigned access, int *nr_present) { if (unlikely(is_mmio_spte(*sptep))) { if (gfn != get_mmio_spte_gfn(*sptep)) { mmu_spte_clear_no_track(sptep); return true; } (*nr_present)++; mark_mmio_spte(sptep, gfn, access); return true; } return false; } Otherwise you can move the mmio info from an mmio spte back to mmio_gva/mmio_gfn after a TLB flush, without rereading the guest pagetable. We do not read the guest page table when an mmio page fault occurs; we just do it as you say: - Firstly, walk the shadow page table to get the mmio spte - Then, cache the mmio spte info into mmio_gva/mmio_gfn Did I miss your meaning? +{ +struct kvm_shadow_walk_iterator iterator; +u64 spte, ret = 0ull; + +walk_shadow_page_lockless_begin(vcpu); +for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) { +if (iterator.level == 1) { +ret = spte; +break; +} Have you verified it is safe to walk sptes with parallel modifications to them? Other than shadow page deletion, which is in theory covered by RCU. It would be good to have all cases written down.
Um, I think it is safe: when the spte is being written, we can get either the old spte or the new spte, and both cases are OK for us. It is just like the page structure cache on a real machine: the cpu can fetch the old spte even if the page structure is changed by other cpus. + +if (!is_shadow_present_pte(spte)) +break; +} +walk_shadow_page_lockless_end(vcpu); + +return ret; +} + +/* + * If it is a real mmio page fault, return 1 and emulate the instruction + * directly, return 0 if it needs the page fault path to fix it, -1 is + * returned if a bug is detected. + */ +int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct) +{ +u64 spte; + +if (quickly_check_mmio_pf(vcpu, addr, direct)) +return 1; + +spte = walk_shadow_page_get_mmio_spte(vcpu, addr); + +if (is_mmio_spte(spte)) { +gfn_t gfn = get_mmio_spte_gfn(spte); +unsigned access = get_mmio_spte_access(spte); + +if (direct) +addr = 0; +vcpu_cache_mmio_info(vcpu, addr, gfn, access); +return 1; +} + +/* + * It's ok if the gva is remapped by other cpus on shadow guest, + * it's a BUG if the gfn is not a mmio page. + */ If the gva is remapped there should be no mmio page fault in the first place. For example, consider this case: VCPU 0 VCPU 1 gva is mapped to a mmio region access gva remap gva to another page emulate mmio access by kvm (*the point we are discussing*) flush all tlb In theory, it can happen. +if (direct && is_shadow_present_pte(spte)) +return -1; An spte does not have to contain the present bit to generate a valid EPT misconfiguration (and an spte dump is still required in that case). Use !is_mmio_spte() instead. We cannot use !is_mmio_spte() here, since the shadow page can be zapped at any time, for example: sp.spt[i] = mmio-spte VCPU 0 VCPU 1 Access sp.spte[i], ept misconfig occurs delete sp (if the number of shadow pages is over the limit or page shrinking is required, and other events...)
Walk the shadow page table outside the lock and get the non-present spte (*the point we are discussing*) So, the bug we can detect is: it is an mmio access but the spte points to a normal page. + +/* + * If the page table is zapped by other cpus, let the page + * fault path
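The mmio-spte caching being discussed packs the gfn and access bits into a special non-present spte so a later EPT misconfig exit can be recognized without walking the guest page table. The sketch below models that encoding; the function names follow the patch (`mark_mmio_spte`, `get_mmio_spte_gfn`, `get_mmio_spte_access`), but the bit layout here is an assumption for illustration, not the real masks.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed toy layout: bit 62 tags an mmio spte, the low 3 bits hold
 * the access bits, and the gfn sits above them. */
#define MMIO_SPTE_TAG    (1ull << 62)
#define MMIO_ACCESS_BITS 3
#define MMIO_ACCESS_MASK ((1ull << MMIO_ACCESS_BITS) - 1)

/* Encode gfn + access into a non-present spte. */
static uint64_t mark_mmio_spte(uint64_t gfn, unsigned access)
{
    return MMIO_SPTE_TAG | (gfn << MMIO_ACCESS_BITS) |
           (access & MMIO_ACCESS_MASK);
}

static int is_mmio_spte(uint64_t spte)
{
    return (spte & MMIO_SPTE_TAG) != 0;
}

static uint64_t get_mmio_spte_gfn(uint64_t spte)
{
    return (spte & ~MMIO_SPTE_TAG) >> MMIO_ACCESS_BITS;
}

static unsigned get_mmio_spte_access(uint64_t spte)
{
    return (unsigned)(spte & MMIO_ACCESS_MASK);
}
```

This also shows why the lockless walk discussion matters: if the shadow page holding such an spte is zapped concurrently, the walker may read a plain non-present spte (all zeroes) instead, which fails `is_mmio_spte()` and has to fall back to the page fault path.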
Re: [PATCH 03/12] Switch build system to accompanied kernel headers
Am 22.06.2011 23:37, schrieb Jan Kiszka: On 2011-06-22 22:51, Stefan Weil wrote: If not, kvm=yes should be restricted to platforms with kvm support. Otherwise, QEMU builds will fail very early: ERROR: Host kernel lacks signalfd() support, but KVM depends on it when the IO thread is disabled. Of course, users of those non-kvm platforms can set --disable-kvm, but I don't think that is the correct solution. Even with kvm disabled, builds still fail for non-kvm systems: In file included from /qemu/hw/kvmclock.c:21: /qemu/linux-headers/linux/kvm_para.h:26:26: warning: asm/kvm_para.h: No such file or directory That indicates symlink emulation under Windows does not support directories. Can you confirm this (check what builddir/linux-headers/asm became)? Then we would have to link all files in the arch header dir individually. Jan Even when cross compiling for w32 (on a linux host), kvmclock.c does not compile: $ LANG=C make CFLAGS=-g CC i386-softmmu/kvmclock.o In file included from /home/stefan/src/qemu/savannah/qemu/hw/kvmclock.c:20: /home/stefan/src/qemu/savannah/qemu/linux-headers/linux/kvm.h:10:25: warning: linux/types.h: No such file or directory /home/stefan/src/qemu/savannah/qemu/linux-headers/linux/kvm.h:12:25: warning: linux/ioctl.h: No such file or directory In file included from /home/stefan/src/qemu/savannah/qemu/linux-headers/linux/kvm.h:13, from /home/stefan/src/qemu/savannah/qemu/hw/kvmclock.c:20: ../linux-headers/asm/kvm.h:32: error: expected specifier-qualifier-list before '__u32' ../linux-headers/asm/kvm.h:41: error: expected specifier-qualifier-list before '__u8' ../linux-headers/asm/kvm.h:61: error: expected specifier-qualifier-list before '__u64' Is kvmclock.c really needed for non-kvm platforms? Or does it simply need an obj-i386-$(CONFIG_KVM) in Makefile.target? Stefan
Re: virtio_net sometimes didn't work
I find this problem has already been solved by this patch :)

Subject: [PATCH] [virtio] Replace virtio-net with native gPXE driver
http://git.etherboot.org/gpxe.git/commitdiff/180dcf4363bf1d8889a2398a4944e94ca150b311

2011/2/16 Michael S. Tsirkin m...@redhat.com:
> On Wed, Feb 16, 2011 at 04:00:15PM +0800, lidong chen wrote:
>> how to info pci in qemu?
>
> Just type it at the monitor prompt.
>
>> the usb controller also used irq 11, i think the interrupt may be caused by this. i will modify qemu to ignore -usb, and try again.
>>
>> i added some debug code to printk the ioaddr of vdev:
>>
>> static int i;
>> /* reading the ISR has the effect of also clearing it so it's very
>>  * important to save off the value. */
>> isr = ioread8(vp_dev->ioaddr + VIRTIO_PCI_ISR);
>> /* It's definitely not us if the ISR was not high */
>> if (!isr) {
>>         i++;
>>         if (i == 1) {
>>                 printk(KERN_EMERG "2");
>>                 printk(KERN_EMERG "ioaddr %p", vp_dev->ioaddr);
>>                 i = 0;
>>         }
>>         return IRQ_NONE;
>> }
>
> OK, so we seem to get it right. Try to stick the printout in qemu:
>
> case VIRTIO_PCI_ISR:
>         /* reading from the ISR also clears it. */
>         ret = vdev->isr;
>         vdev->isr = 0;
>         qemu_set_irq(proxy->pci_dev.irq[0], 0);
>
> see whether it's the balloon device or the virtio net device that gets to handle the reads.
ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 11
ACPI: PCI Interrupt 0000:00:03.0[A] -> Link [LNKC] -> GSI 11 (level, high) -> IRQ 11
2
ioaddr 0001c040
[... the "2" / "ioaddr 0001c040" debug printk pair repeats ten times ...]
irq 11: nobody cared (try booting with the "irqpoll" option)
 [<c01457b0>] __report_bad_irq+0x2b/0x69
 [<c0145979>] note_interrupt+0x18b/0x1b2
 [<c01452a9>] handle_IRQ_event+0x26/0x51
 [<c014537f>] __do_IRQ+0xab/0xdc
 [<c0106445>] do_IRQ+0x46/0x53
 [<c0104e8a>] common_interrupt+0x1a/0x20
 [<c01276f2>] __do_softirq+0x4f/0xc2
 [<c0127793>] do_softirq+0x2e/0x32
 [<c0104f3c>] apic_timer_interrupt+0x1c/0x30
 [<c02ae9fe>] _spin_unlock_irqrestore+0x6/0x7
 [<c01455dc>] setup_irq+0xab/0x108
 [<f8a9a09b>] vp_interrupt+0x0/0x114 [virtio_pci]
 [<c01456ad>] request_irq+0x74/0x90
 [<f8a9a2fb>] virtio_pci_probe+0x14c/0x1c2 [virtio_pci]
 [<c01d4096>] pci_device_probe+0x36/0x57
 [<c022f935>] driver_probe_device+0x42/0x8b
 [<c022fa23>] __driver_attach+0x4a/0x71
 [<c022f9d9>] __driver_attach+0x0/0x71
 [<c022f45a>] bus_for_each_dev+0x39/0x5b
 [<c022f89f>] driver_attach+0x11/0x13
 [<c022f9d9>] __driver_attach+0x0/0x71
 [<c022f17d>] bus_add_driver+0x64/0xfd
 [<c01d41f9>] __pci_register_driver+0x6c/0x8e
 [<f8830020>] virtio_pci_init+0x20/0x34 [virtio_pci]
 [<c013b216>] sys_init_module+0x1749/0x18c2
 [<c014ab9b>] generic_file_read+0x9a/0xaf
 [<c0167011>] vfs_read+0xa8/0x150
 [<c0103dcb>] sysenter_past_esp+0x54/0x79
handlers:
[<f8a9a09b>] (vp_interrupt+0x0/0x114 [virtio_pci])
Disabling IRQ #11

the output of lspci -vv:

00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
	Subsystem: Unknown device 1af4:1100
	Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-

00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
	Subsystem: Unknown device 1af4:1100
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 0

00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II] (prog-if 80 [Master])
	Subsystem: Unknown device 1af4:1100
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 0
	Region 4: I/O ports at c000 [size=16]

00:01.2 USB Controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01) (prog-if 00 [UHCI])
	Subsystem: Unknown device 1af4:1100
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 0
	Interrupt: pin D routed to IRQ 11
	Region 4: I/O ports at c020 [size=32]

00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
	Subsystem: Unknown device 1af4:1100
	Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Interrupt: pin A routed to IRQ 9