[RFC][PATCH] kvm: x86: fix stale mmio cache bug

2014-08-01 Thread David Matlack
The following events can lead to an incorrect KVM_EXIT_MMIO bubbling
up to userspace:

(1) Guest accesses gpa X without a memory slot. The gfn is cached in
struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets
the SPTE write-execute-noread so that future accesses cause
EPT_MISCONFIGs.

(2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION
covering the page just accessed.

(3) Guest attempts to read or write to gpa X again. On Intel, this
generates an EPT_MISCONFIG. The memory slot generation number that
was incremented in (2) would normally take care of this but we fast
path mmio faults through quickly_check_mmio_pf(), which only checks
the per-vcpu mmio cache (see the sketch below). Since we hit the cache, KVM passes a
KVM_EXIT_MMIO up to userspace.
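
For reference, the fast path in question looks roughly like this (a simplified
sketch, not necessarily the exact upstream code); note that it consults only the
per-vcpu cache and never the memslot generation:

/*
 * Simplified sketch of the mmio fast path. It trusts the per-vcpu cache
 * filled by vcpu_cache_mmio_info() and never looks at the current memslot
 * generation, which is why the cache goes stale after step (2) above.
 */
static bool quickly_check_mmio_pf(struct kvm_vcpu *vcpu, u64 addr, bool direct)
{
	if (direct)
		return vcpu_match_mmio_gpa(vcpu, addr);	/* cached mmio_gfn */

	return vcpu_match_mmio_gva(vcpu, addr);		/* cached mmio_gva */
}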

This patch fixes the issue by clearing the mmio cache in the
KVM_MR_CREATE code path.
 - introduce KVM_REQ_CLEAR_MMIO_CACHE for clearing all vcpu mmio
   caches.
 - extend vcpu_clear_mmio_info to clear mmio_gfn in addition to
   mmio_gva, since both can be used to fast path mmio faults.
 - issue KVM_REQ_CLEAR_MMIO_CACHE during memslot creation to flush
   the mmio cache.
 - in mmu_sync_roots, unconditionally clear the mmio cache since
   even direct_map (e.g. tdp) hosts use it.

Signed-off-by: David Matlack dmatl...@google.com
---
 arch/x86/kvm/mmu.c   |  3 ++-
 arch/x86/kvm/x86.c   |  5 +
 arch/x86/kvm/x86.h   |  8 +---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c  | 10 +-
 5 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9314678..8d50b84 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3157,13 +3157,14 @@ static void mmu_sync_roots(struct kvm_vcpu *vcpu)
int i;
struct kvm_mmu_page *sp;
 
+   vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
+
if (vcpu->arch.mmu.direct_map)
return;

if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
return;

-   vcpu_clear_mmio_info(vcpu, ~0ul);
kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
hpa_t root = vcpu->arch.mmu.root_hpa;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ef432f8..05b5629 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6001,6 +6001,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
kvm_deliver_pmi(vcpu);
if (kvm_check_request(KVM_REQ_SCAN_IOAPIC, vcpu))
vcpu_scan_ioapic(vcpu);
+
+   if (kvm_check_request(KVM_REQ_CLEAR_MMIO_CACHE, vcpu))
+   vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
}
 
if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
@@ -7281,6 +7284,8 @@ void kvm_arch_memslots_updated(struct kvm *kvm)
 * mmio generation may have reached its maximum value.
 */
kvm_mmu_invalidate_mmio_sptes(kvm);
+
+   kvm_make_all_vcpus_request(kvm, KVM_REQ_CLEAR_MMIO_CACHE);
 }
 
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 8c97bac..41ef197 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -81,15 +81,17 @@ static inline void vcpu_cache_mmio_info(struct kvm_vcpu 
*vcpu,
 }
 
 /*
- * Clear the mmio cache info for the given gva,
- * specially, if gva is ~0ul, we clear all mmio cache info.
+ * Clear the mmio cache info for the given gva. If gva is MMIO_GVA_ANY,
+ * unconditionally clear the mmio cache.
  */
+#define MMIO_GVA_ANY (~0ul)
 static inline void vcpu_clear_mmio_info(struct kvm_vcpu *vcpu, gva_t gva)
 {
-   if (gva != (~0ul) && vcpu->arch.mmio_gva != (gva & PAGE_MASK))
+   if (gva != MMIO_GVA_ANY && vcpu->arch.mmio_gva != (gva & PAGE_MASK))
return;

vcpu->arch.mmio_gva = 0;
+   vcpu->arch.mmio_gfn = 0;
 }
 
 static inline bool vcpu_match_mmio_gva(struct kvm_vcpu *vcpu, unsigned long 
gva)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ec4e3bd..e4edaff 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -136,6 +136,7 @@ static inline bool is_error_page(struct page *page)
 #define KVM_REQ_GLOBAL_CLOCK_UPDATE 22
 #define KVM_REQ_ENABLE_IBS        23
 #define KVM_REQ_DISABLE_IBS   24
+#define KVM_REQ_CLEAR_MMIO_CACHE  25
 
 #define KVM_USERSPACE_IRQ_SOURCE_ID0
 #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID   1
@@ -591,6 +592,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu);
 void kvm_load_guest_fpu(struct kvm_vcpu *vcpu);
 void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
 
+bool kvm_make_all_vcpus_request(struct kvm *kvm, unsigned int req);
 void kvm_flush_remote_tlbs(struct kvm *kvm);
 void kvm_reload_remote_mmus(struct kvm *kvm);
 void kvm_make_mclock_inprogress_request(struct kvm *kvm);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4b6c01b..d09527a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -152,7

Re: [RFC][PATCH] kvm: x86: fix stale mmio cache bug

2014-08-04 Thread David Matlack
On Mon, Aug 4, 2014 at 5:44 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 02/08/2014 06:15, Xiao Guangrong wrote:
 I prefer to also cache the spte’s generation number, then check the number
 in quickly_check_mmio_pf().

 I agree, thanks Xiao for the review and David for the report!

I like this approach as well. I'll send out a v2.

Thanks for the reviews!


[PATCH v2] kvm: x86: fix stale mmio cache bug

2014-08-04 Thread David Matlack
The following events can lead to an incorrect KVM_EXIT_MMIO bubbling
up to userspace:

(1) Guest accesses gpa X without a memory slot. The gfn is cached in
struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets
the SPTE write-execute-noread so that future accesses cause
EPT_MISCONFIGs.

(2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION
covering the page just accessed.

(3) Guest attempts to read or write to gpa X again. On Intel, this
generates an EPT_MISCONFIG. The memory slot generation number that
was incremented in (2) would normally take care of this but we fast
path mmio faults through quickly_check_mmio_pf(), which only checks
the per-vcpu mmio cache. Since we hit the cache, KVM passes a
KVM_EXIT_MMIO up to userspace.

This patch fixes the issue by doing the following:
  - Tag the mmio cache with the memslot generation and use it to
validate mmio cache lookups (see the sketch after this list).
  - Extend vcpu_clear_mmio_info to clear mmio_gfn in addition to
mmio_gva, since both can be used to fast path mmio faults.
  - In mmu_sync_roots, unconditionally clear the mmio cache since
even direct_map (e.g. tdp) hosts use it.
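
A minimal sketch of the idea (illustrative only; the real helpers are in the
diff below, in arch/x86/kvm/mmu.h): the cache records the generation observed
at fill time, and a lookup is valid only while that generation still matches
the current memslot generation:

/*
 * Illustrative sketch; the actual helpers live in the patch below. A cached
 * mmio gva/gfn is usable only if the generation recorded when the cache was
 * filled still matches kvm's current one.
 */
static inline bool vcpu_match_mmio_gen(struct kvm_vcpu *vcpu)
{
	return vcpu->arch.mmio_gen == kvm_current_mmio_generation(vcpu->kvm);
}

static inline bool vcpu_match_mmio_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
{
	return vcpu_match_mmio_gen(vcpu) && vcpu->arch.mmio_gfn &&
	       vcpu->arch.mmio_gfn == gpa >> PAGE_SHIFT;
}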

Signed-off-by: David Matlack dmatl...@google.com
---
Changes in v2:
  - Use memslot generation to invalidate the mmio cache rather than
actively invalidating the cache.
  - Update patch description with new cache invalidation technique.
  - Pull mmio cache/clear code up out of x86.h and mmu.c and into
mmu.h.

 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/mmu.c  | 16 ++
 arch/x86/kvm/mmu.h  | 70 +
 arch/x86/kvm/x86.h  | 36 -
 4 files changed, 73 insertions(+), 50 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 49205d0..f518d14 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -479,6 +479,7 @@ struct kvm_vcpu_arch {
u64 mmio_gva;
unsigned access;
gfn_t mmio_gfn;
+   unsigned int mmio_gen;
 
struct kvm_pmu pmu;
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9314678..43f1c18 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -206,11 +206,8 @@ EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
 #define MMIO_SPTE_GEN_LOW_SHIFT    3
 #define MMIO_SPTE_GEN_HIGH_SHIFT   52
 
-#define MMIO_GEN_SHIFT 19
 #define MMIO_GEN_LOW_SHIFT 9
 #define MMIO_GEN_LOW_MASK  ((1 << MMIO_GEN_LOW_SHIFT) - 1)
-#define MMIO_GEN_MASK  ((1 << MMIO_GEN_SHIFT) - 1)
-#define MMIO_MAX_GEN   ((1 << MMIO_GEN_SHIFT) - 1)
 
 static u64 generation_mmio_spte_mask(unsigned int gen)
 {
@@ -234,16 +231,6 @@ static unsigned int get_mmio_spte_generation(u64 spte)
return gen;
 }
 
-static unsigned int kvm_current_mmio_generation(struct kvm *kvm)
-{
-   /*
-* Init kvm generation close to MMIO_MAX_GEN to easily test the
-* code of handling generation number wrap-around.
-*/
-   return (kvm_memslots(kvm)->generation +
- MMIO_MAX_GEN - 150) & MMIO_GEN_MASK;
-}
-
 static void mark_mmio_spte(struct kvm *kvm, u64 *sptep, u64 gfn,
   unsigned access)
 {
@@ -3157,13 +3144,14 @@ static void mmu_sync_roots(struct kvm_vcpu *vcpu)
int i;
struct kvm_mmu_page *sp;
 
+   vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
+
if (vcpu->arch.mmu.direct_map)
return;

if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
return;

-   vcpu_clear_mmio_info(vcpu, ~0ul);
kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
hpa_t root = vcpu->arch.mmu.root_hpa;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b982112..058651a 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -82,6 +82,76 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct 
kvm_mmu *context,
 void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
bool ept);
 
+#define MMIO_GEN_SHIFT 19
+#define MMIO_GEN_MASK  ((1 << MMIO_GEN_SHIFT) - 1)
+#define MMIO_MAX_GEN   ((1 << MMIO_GEN_SHIFT) - 1)
+static inline unsigned int kvm_current_mmio_generation(struct kvm *kvm)
+{
+   /*
+* Init kvm generation close to MMIO_MAX_GEN to easily test the
+* code of handling generation number wrap-around.
+*/
+   return (kvm_memslots(kvm)->generation + MMIO_MAX_GEN - 150) &
+  MMIO_GEN_MASK;
+}
+
+static inline void vcpu_cache_mmio_info(struct kvm_vcpu *vcpu,
+   gva_t gva, gfn_t gfn, unsigned access)
+{
+   vcpu->arch.mmio_gen = kvm_current_mmio_generation(vcpu->kvm);
+
+   /*
+* Ensure that the mmio_gen is set before the rest of the cache entry

Re: [PATCH v2] kvm: x86: fix stale mmio cache bug

2014-08-05 Thread David Matlack
On Mon, Aug 4, 2014 at 5:31 PM, Wanpeng Li wanpeng...@linux.intel.com wrote:
 Hi David,
 On Mon, Aug 04, 2014 at 02:10:20PM -0700, David Matlack wrote:
The following events can lead to an incorrect KVM_EXIT_MMIO bubbling
up to userspace:

(1) Guest accesses gpa X without a memory slot. The gfn is cached in
struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets
the SPTE write-execute-noread so that future accesses cause
EPT_MISCONFIGs.

(2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION
covering the page just accessed.


 One question:

 Who triggers host userspace to create an mmio memslot? Will it be created
 just after the first mmio #PF?

Devices such as vga can be in modes where their memory behaves
like ram and using a memslot to back the memory makes sense. In
other modes, reading and writing to vga memory has side-effects
and so mmio makes sense (delete memslot). Switching between these
modes is a guest initiated event.


Re: [PATCH v2] kvm: x86: fix stale mmio cache bug

2014-08-05 Thread David Matlack
On Mon, Aug 4, 2014 at 8:36 PM, Xiao Guangrong
xiaoguangr...@linux.vnet.ibm.com wrote:
 On 08/05/2014 05:10 AM, David Matlack wrote:

 This patch fixes the issue by doing the following:
   - Tag the mmio cache with the memslot generation and use it to
 validate mmio cache lookups.
   - Extend vcpu_clear_mmio_info to clear mmio_gfn in addition to
 mmio_gva, since both can be used to fast path mmio faults.
   - In mmu_sync_roots, unconditionally clear the mmio cache since
 even direct_map (e.g. tdp) hosts use it.

 It's not needed.

 direct map only uses gpa (and never caches gva) and
 vcpu_clear_mmio_info only clears gva.

Ok thanks for the clarification.

 +static inline void vcpu_cache_mmio_info(struct kvm_vcpu *vcpu,
 + gva_t gva, gfn_t gfn, unsigned access)
 +{
 + vcpu->arch.mmio_gen = kvm_current_mmio_generation(vcpu->kvm);
 +
 + /*
 +  * Ensure that the mmio_gen is set before the rest of the cache entry.
 +  * Otherwise we might see a new generation number attached to an old
 +  * cache entry if creating/deleting a memslot races with mmio caching.
 +  * The inverse case is possible (old generation number with new cache
 +  * info), but that is safe. The next access will just miss the cache
 +  * when it should have hit.
 +  */
 + smp_wmb();

 The memory barrier can't help us, consider this scenario:

 CPU 0  CPU 1
 page-fault
 see gpa is not mapped in memslot

   create new memslot containing gpa from Qemu
   update the slots's generation number
 cache mmio info

 !!! later when vcpu accesses gpa again
 it will cause mmio-exit.

Ah! Thanks for catching my mistake.

 The easy way to fix this is that we update slots's generation-number
 after synchronize_srcu_expedited when memslot is being updated that
 ensures other sides can see the new generation-number only after
 finishing update.

It would be possible for a vcpu to see an inconsistent kvm_memslots struct
(new set of slots with an old generation number). Is that not an issue?

We could just use the generation number that is stored in the
spte. The only downside (that I can see) is that handle_abnormal_pfn()
calls vcpu_cache_mmio_info() but does not have access to the spte.
So presumably we'd have to do a page table walk.


Re: [PATCH v2] kvm: x86: fix stale mmio cache bug

2014-08-06 Thread David Matlack
On Tue, Aug 5, 2014 at 8:26 PM, Xiao Guangrong
xiaoguangr...@linux.vnet.ibm.com wrote:
 On 08/06/2014 06:39 AM, David Matlack wrote:
 On Mon, Aug 4, 2014 at 8:36 PM, Xiao Guangrong
 xiaoguangr...@linux.vnet.ibm.com wrote:
 The memory barrier can't help us, consider this scenario:

 CPU 0  CPU 1
 page-fault
 see gpa is not mapped in memslot

   create new memslot containing gpa from Qemu
   update the slots's generation number
 cache mmio info

 !!! later when vcpu accesses gpa again
 it will cause mmio-exit.

 Ah! Thanks for catching my mistake.

 The easy way to fix this is that we update slots's generation-number
 after synchronize_srcu_expedited when memslot is being updated that
 ensures other sides can see the new generation-number only after
 finishing update.

 It would be possible for a vcpu to see an inconsistent kvm_memslots struct
 (new set of slots with an old generation number). Is that not an issue?

 In this case, checking generation-number will fail, we will goto the slow path
 to handle mmio access - that's very rare, so i think it's ok.


 We could just use the generation number that is stored in the
 spte. The only downside (that I can see) is that handle_abnormal_pfn()
 calls vcpu_cache_mmio_info() but does not have access to the spte.
 So presumably we'd have to do a page table walk.

 The issue is not only in vcpu_cache_mmio_info but also in
 mark_mmio_spte() where we may cache new generation-number and old memslots
 info.

This is now a separate bug from the one I originally reported :). For now, I
will resend a 3rd version of my patch with the fixes you suggested (the memory
barrier is useless and there is no need to change mmu_sync_roots).


[PATCH v3] kvm: x86: fix stale mmio cache bug

2014-08-07 Thread David Matlack
The following events can lead to an incorrect KVM_EXIT_MMIO bubbling
up to userspace:

(1) Guest accesses gpa X without a memory slot. The gfn is cached in
struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets
the SPTE write-execute-noread so that future accesses cause
EPT_MISCONFIGs.

(2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION
covering the page just accessed.

(3) Guest attempts to read or write to gpa X again. On Intel, this
generates an EPT_MISCONFIG. The memory slot generation number that
was incremented in (2) would normally take care of this but we fast
path mmio faults through quickly_check_mmio_pf(), which only checks
the per-vcpu mmio cache. Since we hit the cache, KVM passes a
KVM_EXIT_MMIO up to userspace.

This patch fixes the issue by using the memslot generation number
to validate the mmio cache.

Signed-off-by: David Matlack dmatl...@google.com
---
The patch diff is rather large because I had to pull some code out
of x86.h and mmu.c and into mmu.h. The main change is adding the
memslot generation in vcpu_cache_mmio_info() and then validating
that slot in vcpu_match_mmio_*().

Changes in v3:
  - remove memory barrier in vcpu_cache_mmio_info()
  - don't unconditionally clear mmio cache in mmu_sync_roots

Changes in v2:
  - Use memslot generation to invalidate the mmio cache rather than
actively invalidating the cache.
  - Update patch description with new cache invalidation technique.
  - Pull mmio cache/clear code up out of x86.h and mmu.c and into
mmu.h.

 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/mmu.c  | 15 +--
 arch/x86/kvm/mmu.h  | 58 +
 arch/x86/kvm/x86.h  | 36 -
 4 files changed, 60 insertions(+), 50 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 49205d0..f518d14 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -479,6 +479,7 @@ struct kvm_vcpu_arch {
u64 mmio_gva;
unsigned access;
gfn_t mmio_gfn;
+   unsigned int mmio_gen;
 
struct kvm_pmu pmu;
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9314678..dd5a30d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -206,11 +206,8 @@ EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
 #define MMIO_SPTE_GEN_LOW_SHIFT    3
 #define MMIO_SPTE_GEN_HIGH_SHIFT   52
 
-#define MMIO_GEN_SHIFT 19
 #define MMIO_GEN_LOW_SHIFT 9
 #define MMIO_GEN_LOW_MASK  ((1 << MMIO_GEN_LOW_SHIFT) - 1)
-#define MMIO_GEN_MASK  ((1 << MMIO_GEN_SHIFT) - 1)
-#define MMIO_MAX_GEN   ((1 << MMIO_GEN_SHIFT) - 1)
 
 static u64 generation_mmio_spte_mask(unsigned int gen)
 {
@@ -234,16 +231,6 @@ static unsigned int get_mmio_spte_generation(u64 spte)
return gen;
 }
 
-static unsigned int kvm_current_mmio_generation(struct kvm *kvm)
-{
-   /*
-* Init kvm generation close to MMIO_MAX_GEN to easily test the
-* code of handling generation number wrap-around.
-*/
-   return (kvm_memslots(kvm)->generation +
- MMIO_MAX_GEN - 150) & MMIO_GEN_MASK;
-}
-
 static void mark_mmio_spte(struct kvm *kvm, u64 *sptep, u64 gfn,
   unsigned access)
 {
@@ -3163,7 +3150,7 @@ static void mmu_sync_roots(struct kvm_vcpu *vcpu)
if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
return;

-   vcpu_clear_mmio_info(vcpu, ~0ul);
+   vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
hpa_t root = vcpu->arch.mmu.root_hpa;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b982112..a98a060 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -82,6 +82,64 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct 
kvm_mmu *context,
 void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
bool ept);
 
+#define MMIO_GEN_SHIFT 19
+#define MMIO_GEN_MASK  ((1 << MMIO_GEN_SHIFT) - 1)
+#define MMIO_MAX_GEN   ((1 << MMIO_GEN_SHIFT) - 1)
+static inline unsigned int kvm_current_mmio_generation(struct kvm *kvm)
+{
+   /*
+* Init kvm generation close to MMIO_MAX_GEN to easily test the
+* code of handling generation number wrap-around.
+*/
+   return (kvm_memslots(kvm)->generation +
+   MMIO_MAX_GEN - 150) & MMIO_GEN_MASK;
+}
+
+static inline void vcpu_cache_mmio_info(struct kvm_vcpu *vcpu,
+   gva_t gva, gfn_t gfn, unsigned access)
+{
+   vcpu->arch.mmio_gen = kvm_current_mmio_generation(vcpu->kvm);
+   vcpu->arch.mmio_gva = gva & PAGE_MASK;
+   vcpu->arch.access = access;
+   vcpu->arch.mmio_gfn = gfn;
+}
+
+/*
+ * Clear the mmio cache

Re: [PATCH v3] kvm: x86: fix stale mmio cache bug

2014-08-07 Thread David Matlack
On Thu, Aug 7, 2014 at 6:36 PM, Xiao Guangrong
xiaoguangr...@linux.vnet.ibm.com wrote:
 On 08/08/2014 02:32 AM, David Matlack wrote:
 The following events can lead to an incorrect KVM_EXIT_MMIO bubbling
 up to userspace:

 (1) Guest accesses gpa X without a memory slot. The gfn is cached in
 struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets
 the SPTE write-execute-noread so that future accesses cause
 EPT_MISCONFIGs.

 (2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION
 covering the page just accessed.

 (3) Guest attempts to read or write to gpa X again. On Intel, this
 generates an EPT_MISCONFIG. The memory slot generation number that
 was incremented in (2) would normally take care of this but we fast
 path mmio faults through quickly_check_mmio_pf(), which only checks
 the per-vcpu mmio cache. Since we hit the cache, KVM passes a
 KVM_EXIT_MMIO up to userspace.

 This patch fixes the issue by using the memslot generation number
 to validate the mmio cache.

 Signed-off-by: David Matlack dmatl...@google.com
 ---
 The patch diff is rather large because I had to pull some code out
 of x86.h and mmu.c and into mmu.h. The main change is adding the
 memslot generation in vcpu_cache_mmio_info() and then validating
 that slot in vcpu_match_mmio_*().

 Why not just move vcpu_cache_mmio_info() into mmu.c, which is
 the only place vcpu_cache_mmio_info() is called. :)

If only it was so simple :). vcpu_match_mmio_*() now needs
kvm_current_mmio_generation() which *was* in mmu.c.
vcpu_match_mmio_*() is called from x86.c and mmu.c so all that
has to go in a header. I decided to keep vcpu_cache_mmio_info()
with the  rest for organization's sake.

 BTW, i will post a patch to fix the generation-number issue
 soon.



Great! Thanks.


Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-12 Thread David Matlack
On Mon, Aug 11, 2014 at 10:02 PM, Xiao Guangrong
xiaoguangr...@linux.vnet.ibm.com wrote:
 @@ -722,9 +719,10 @@ static struct kvm_memslots *install_new_memslots(struct 
 kvm *kvm,
  {
 struct kvm_memslots *old_memslots = kvm->memslots;


I think you want

  slots->generation = old_memslots->generation;

here.

On the KVM_MR_DELETE path, install_new_memslots is called twice so this
patch introduces a short window of time where the generation number
actually decreases.

 -   update_memslots(slots, new, kvm->memslots->generation);
 +   update_memslots(slots, new);
 rcu_assign_pointer(kvm->memslots, slots);
 synchronize_srcu_expedited(&kvm->srcu);
 +   slots->generation++;

 kvm_arch_memslots_updated(kvm);

 --
 1.8.3.1



Re: [PATCH 2/2] kvm: x86: fix stale mmio cache bug

2014-08-14 Thread David Matlack
On Thu, Aug 14, 2014 at 12:01 AM, Xiao Guangrong
xiaoguangr...@linux.vnet.ibm.com wrote:
 From: David Matlack dmatl...@google.com

 The following events can lead to an incorrect KVM_EXIT_MMIO bubbling
 up to userspace:

 (1) Guest accesses gpa X without a memory slot. The gfn is cached in
 struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets
 the SPTE write-execute-noread so that future accesses cause
 EPT_MISCONFIGs.

 (2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION
 covering the page just accessed.

 (3) Guest attempts to read or write to gpa X again. On Intel, this
 generates an EPT_MISCONFIG. The memory slot generation number that
 was incremented in (2) would normally take care of this but we fast
 path mmio faults through quickly_check_mmio_pf(), which only checks
 the per-vcpu mmio cache. Since we hit the cache, KVM passes a
 KVM_EXIT_MMIO up to userspace.

 This patch fixes the issue by using the memslot generation number
 to validate the mmio cache.

 [ xiaoguangrong: adjust the code to make it simpler for stable-tree fix. ]

The adjustments look good. Thanks!


Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-18 Thread David Matlack
On Mon, Aug 18, 2014 at 9:35 AM, Xiao Guangrong
xiaoguangrong.e...@gmail.com wrote:

 Hi Paolo,

 Thank you to review the patch!

 On Aug 18, 2014, at 9:57 PM, Paolo Bonzini pbonz...@redhat.com wrote:

 On 14/08/2014 09:01, Xiao Guangrong wrote:
 -update_memslots(slots, new, kvm->memslots->generation);
 +/* ensure generation number is always increased. */
 +slots->generation = old_memslots->generation;
 +update_memslots(slots, new);
  rcu_assign_pointer(kvm->memslots, slots);
  synchronize_srcu_expedited(&kvm->srcu);
 +slots->generation++;

 I don't trust my brain enough to review this patch.

Xiao, I thought about your approach a lot and I can't think of a bad race
that isn't already possible due to the fact that kvm allows memslot
mutation to race with vm exits. That being said, it's hard to reason about
all the other clients of memslots and what weirdness (or badness) will
be caused by updating generation after srcu_synch. I like Paolo's two
approaches because they fix the bug without any side-effects.

 Sorry for the confusion. I should explain it more clearly.

 What this patch tried to fix is: kvm will generate a wrong mmio-exit forever
 if no lucky event cleans the mmio spte (e.g. if there is no memory pressure or
 no context-sync on that spte).

 Note, it is hard to do precise sync between kvm_vm_ioctl_set_memory_region
 and mmio-exit - that means userspace is able to get mmio-exit even if
 kvm_vm_ioctl_set_memory_region have finished, for example, kvm identifies
 a mmio access before issuing the ioctl and injects mmio-exit to userspace 
 after
 ioctl return. So checking if mmio-exit is a real mmio access in userspace is
 needed anyway.

 kvm_current_mmio_generation seems like a very bad (race-prone) API.  One
 patch I trust myself reviewing would change a bunch of functions in
 kvm_main.c to take a memslots struct.  This would make it easy to
 respect the hard and fast rule of not dereferencing the same pointer
 twice.  But it would be a tedious change.

 kvm_set_memory_region is the only place updating memslot and
 kvm_current_mmio_generation accesses memslot by rcu-dereference,
 i do not know why other places need to take into account.

If you rcu_dereference() more than once, you can't trust decisions based
on the earlier rcu_dereference() calls. If the mmio cache code only
did one rcu_dereference() per vm exit, this bug would be gone.
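
To make that concrete, here is a hypothetical illustration (not code from any
of the patches) of the double-dereference hazard:

/*
 * Hypothetical example only. Each helper below does its own
 * rcu_dereference() of kvm->memslots, so a memslot update that lands
 * between the two calls means the "no slot, so it's mmio" decision and
 * the generation tagged onto the cache come from different memslot arrays.
 */
static void cache_mmio_example(struct kvm_vcpu *vcpu, gfn_t gfn)
{
	struct kvm *kvm = vcpu->kvm;

	if (!gfn_to_memslot(kvm, gfn)) {	/* 1st dereference: old memslots */
		/* ... memslot creation + generation bump can land here ... */
		vcpu->arch.mmio_gen =
			kvm_current_mmio_generation(kvm); /* 2nd: new memslots */
	}
}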

 I think this patch is auditable, page-fault is always called by holding
 srcu-lock so that a page fault can’t go across synchronize_srcu_expedited.
 Only these cases can happen:

 1)  page fault occurs before synchronize_srcu_expedited.
 In this case, vcpu will generate mmio-exit for the memslot being 
 registered
 by the ioctl. That’s ok since the ioctl have not finished.

 2) page fault occurs after synchronize_srcu_expedited and while the
generation-number is being increased.
In this case, userspace may get a wrong mmio-exit (that happens if handling
the page-fault is slower than the ioctl); that's ok too since userspace needs
   to do the check anyway, as I said above.

 3) page fault occurs after generation-number update
that’s definitely correct. :)

 Another alternative could be to use the low bit to mark an in-progress
 change, and skip the caching if the low bit is set.  Similar to a
 seqcount (except if read_seqcount_retry fails, we just punt and not
 retry anything), you could use it even though the memory barriers
 provided by write_seqcount_begin/end are not too useful in this case.

I like this approach best. It would have the least code changes and provide
the same guarantees.

 I do not know how the bit works; a page fault will cache the memslot before
 the bit is set and cache the generation-number after the bit is set.

 Maybe i missed your idea, could you please detail it?

In vcpu_cache_mmio_info() if generation is odd, just don't do the caching
because memslots were changed while we were running and we just assume the
worst case.
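
A minimal sketch of what I mean, assuming the low generation bit acts as the
"update in progress" marker (illustrative only, not a final patch):

/*
 * Sketch only: skip caching while a memslot update is in flight, assuming
 * install_new_memslots() keeps the generation odd for the duration of the
 * update.
 */
static inline void vcpu_cache_mmio_info(struct kvm_vcpu *vcpu,
					gva_t gva, gfn_t gfn, unsigned access)
{
	unsigned int gen = kvm_current_mmio_generation(vcpu->kvm);

	if (gen & 1)			/* memslots are being updated: punt */
		return;

	vcpu->arch.mmio_gen = gen;
	vcpu->arch.mmio_gva = gva & PAGE_MASK;
	vcpu->arch.access = access;
	vcpu->arch.mmio_gfn = gfn;
}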


Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-18 Thread David Matlack
On Mon, Aug 18, 2014 at 12:56 PM, Xiao Guangrong
xiaoguangrong.e...@gmail.com wrote:
 @@ -287,9 +293,15 @@ static bool set_mmio_spte(struct kvm *kvm, u64 *sptep, 
 gfn_t gfn,

  static bool check_mmio_spte(struct kvm *kvm, u64 spte)
  {
 +   struct kvm_memslots *slots = kvm_memslots(kvm);
 unsigned int kvm_gen, spte_gen;

 -   kvm_gen = kvm_current_mmio_generation(kvm);
 +   if (slots->updated)
 +   return false;
 +
 +   smp_rmb();
 +
 +   kvm_gen = __kvm_current_mmio_generation(slots);
 spte_gen = get_mmio_spte_generation(spte);


What does this fix? Case 2 can still happen. (Case 2 is unavoidable unless we
block during memslot updates, which I don't think we should :).

 trace_check_mmio_spte(spte, kvm_gen, spte_gen);
 diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
 index 4b6c01b..1d4e78f 100644
 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -96,7 +96,7 @@ static void hardware_disable_all(void);

  static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
  static void update_memslots(struct kvm_memslots *slots,
 -   struct kvm_memory_slot *new, u64 last_generation);
 +   struct kvm_memory_slot *new);

  static void kvm_release_pfn_dirty(pfn_t pfn);
  static void mark_page_dirty_in_slot(struct kvm *kvm,
 @@ -685,8 +685,7 @@ static void sort_memslots(struct kvm_memslots *slots)
  }

  static void update_memslots(struct kvm_memslots *slots,
 -   struct kvm_memory_slot *new,
 -   u64 last_generation)
 +   struct kvm_memory_slot *new)
  {
 if (new) {
 int id = new->id;
 @@ -697,8 +696,6 @@ static void update_memslots(struct kvm_memslots *slots,
 if (new->npages != npages)
 sort_memslots(slots);
 }
 -
 -   slots->generation = last_generation + 1;
  }

  static int check_memory_region_flags(struct kvm_userspace_memory_region *mem)
 @@ -720,10 +717,17 @@ static struct kvm_memslots *install_new_memslots(struct 
 kvm *kvm,
  {
 struct kvm_memslots *old_memslots = kvm->memslots;

 -   update_memslots(slots, new, kvm->memslots->generation);
 +   /* ensure generation number is always increased. */
 +   slots->updated = true;
 +   slots->generation = old_memslots->generation;
 +   update_memslots(slots, new);
 rcu_assign_pointer(kvm->memslots, slots);
 synchronize_srcu_expedited(&kvm->srcu);

 +   slots->generation++;
 +   smp_wmb();
 +   slots->updated = false;
 +
 kvm_arch_memslots_updated(kvm);

 return old_memslots;


This is effectively the same as the first approach.

I just realized how simple Paolo's idea is. I think it can be a one line
patch (without comments):

[...]
update_memslots(slots, new, kvm->memslots->generation);
rcu_assign_pointer(kvm->memslots, slots);
synchronize_srcu_expedited(&kvm->srcu);
+   slots->generation++;

kvm_arch_memslots_updated(kvm);
[...]


Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-18 Thread David Matlack
On Mon, Aug 18, 2014 at 2:24 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 18/08/2014 23:15, David Matlack wrote:
 I just realized how simple Paolo's idea is. I think it can be a one line
 patch (without comments):

 [...]
 update_memslots(slots, new, kvm->memslots->generation);
 rcu_assign_pointer(kvm->memslots, slots);
 synchronize_srcu_expedited(&kvm->srcu);
 +   slots->generation++;

 kvm_arch_memslots_updated(kvm);
 [...]

 Yeah, you're right.  I think at this point it makes sense to put all
 generation handling in install_new_memslots, but with proper comments
 the above can do as well.

 Would you like to send it?  Patch 2 still applies on top.

Sure, I like doing everything in install_new_memslots() as well so I'll fix
that.


[PATCH 1/2] kvm: fix potentially corrupt mmio cache

2014-08-18 Thread David Matlack
vcpu exits and memslot mutations can run concurrently as long as the
vcpu does not acquire the slots mutex. Thus it is theoretically possible
for memslots to change underneath a vcpu that is handling an exit.

If we increment the memslot generation number again after
synchronize_srcu_expedited(), vcpus can safely cache memslot generation
without maintaining a single rcu_dereference through an entire vm exit.
And much of the x86/kvm code does not maintain a single rcu_dereference
of the current memslots during each exit.

We can prevent the following case:

   vcpu (CPU 0) | thread (CPU 1)
+--
1  vm exit  |
2  decide to cache something based on   |
 old memslots   |
3   | change memslots
4   | increment generation
5  tag cache with new memslot generation|
... |
   action based on cache occurs even   |
though the caching decision was based   |
on the old memslots|
... |
   action *continues* to occur until next  |
memslot generation change, which may|
be never   |

By incrementing the generation again after synchronizing kvm->srcu
readers, we guarantee the generation cached in (5) will very soon
become invalid.

Cc: sta...@vger.kernel.org
Cc: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
Signed-off-by: David Matlack dmatl...@google.com
---
 virt/kvm/kvm_main.c | 18 +++---
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4b6c01b..86d3697 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -95,8 +95,6 @@ static int hardware_enable_all(void);
 static void hardware_disable_all(void);
 
 static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
-static void update_memslots(struct kvm_memslots *slots,
-   struct kvm_memory_slot *new, u64 last_generation);
 
 static void kvm_release_pfn_dirty(pfn_t pfn);
 static void mark_page_dirty_in_slot(struct kvm *kvm,
@@ -685,8 +683,7 @@ static void sort_memslots(struct kvm_memslots *slots)
 }
 
 static void update_memslots(struct kvm_memslots *slots,
-   struct kvm_memory_slot *new,
-   u64 last_generation)
+   struct kvm_memory_slot *new)
 {
if (new) {
int id = new->id;
@@ -697,8 +694,6 @@ static void update_memslots(struct kvm_memslots *slots,
if (new->npages != npages)
sort_memslots(slots);
}
-
-   slots->generation = last_generation + 1;
 }
 
 static int check_memory_region_flags(struct kvm_userspace_memory_region *mem)
@@ -720,10 +715,19 @@ static struct kvm_memslots *install_new_memslots(struct 
kvm *kvm,
 {
struct kvm_memslots *old_memslots = kvm->memslots;

-   update_memslots(slots, new, kvm->memslots->generation);
+   slots->generation = old_memslots->generation + 1;
+
+   update_memslots(slots, new);
rcu_assign_pointer(kvm->memslots, slots);
synchronize_srcu_expedited(&kvm->srcu);

+   /*
+* Increment the new memslot generation a second time. This prevents
+* vm exits that race with memslot updates from caching a memslot
+* generation that will (potentially) be valid forever.
+*/
+   slots->generation++;
+
kvm_arch_memslots_updated(kvm);
 
return old_memslots;
-- 
2.1.0.rc2.206.gedb03e5



[PATCH 2/2] kvm: x86: fix stale mmio cache bug

2014-08-18 Thread David Matlack
The following events can lead to an incorrect KVM_EXIT_MMIO bubbling
up to userspace:

(1) Guest accesses gpa X without a memory slot. The gfn is cached in
struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets
the SPTE write-execute-noread so that future accesses cause
EPT_MISCONFIGs.

(2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION
covering the page just accessed.

(3) Guest attempts to read or write to gpa X again. On Intel, this
generates an EPT_MISCONFIG. The memory slot generation number that
was incremented in (2) would normally take care of this but we fast
path mmio faults through quickly_check_mmio_pf(), which only checks
the per-vcpu mmio cache. Since we hit the cache, KVM passes a
KVM_EXIT_MMIO up to userspace.

This patch fixes the issue by using the memslot generation number
to validate the mmio cache.

[ xiaoguangrong: adjust the code to make it simpler for stable-tree fix. ]

Cc: sta...@vger.kernel.org
Signed-off-by: David Matlack dmatl...@google.com
Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/mmu.c  |  4 ++--
 arch/x86/kvm/mmu.h  |  2 ++
 arch/x86/kvm/x86.h  | 21 -
 4 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 49205d0..f518d14 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -479,6 +479,7 @@ struct kvm_vcpu_arch {
u64 mmio_gva;
unsigned access;
gfn_t mmio_gfn;
+   unsigned int mmio_gen;
 
struct kvm_pmu pmu;
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9314678..e00fbfe 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -234,7 +234,7 @@ static unsigned int get_mmio_spte_generation(u64 spte)
return gen;
 }
 
-static unsigned int kvm_current_mmio_generation(struct kvm *kvm)
+unsigned int kvm_current_mmio_generation(struct kvm *kvm)
 {
/*
 * Init kvm generation close to MMIO_MAX_GEN to easily test the
@@ -3163,7 +3163,7 @@ static void mmu_sync_roots(struct kvm_vcpu *vcpu)
if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
return;

-   vcpu_clear_mmio_info(vcpu, ~0ul);
+   vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
hpa_t root = vcpu->arch.mmu.root_hpa;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b982112..e2d902a 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -76,6 +76,8 @@ enum {
 };
 
 int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool 
direct);
+unsigned int kvm_current_mmio_generation(struct kvm *kvm);
+
 void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
 void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
bool execonly);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 8c97bac..ae7006d 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -3,6 +3,7 @@
 
 #include <linux/kvm_host.h>
 #include "kvm_cache_regs.h"
+#include "mmu.h"
 
 static inline void kvm_clear_exception_queue(struct kvm_vcpu *vcpu)
 {
@@ -78,15 +79,23 @@ static inline void vcpu_cache_mmio_info(struct kvm_vcpu 
*vcpu,
vcpu->arch.mmio_gva = gva & PAGE_MASK;
vcpu->arch.access = access;
vcpu->arch.mmio_gfn = gfn;
+   vcpu->arch.mmio_gen = kvm_current_mmio_generation(vcpu->kvm);
+}
+
+static inline bool vcpu_match_mmio_gen(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.mmio_gen == kvm_current_mmio_generation(vcpu->kvm);
 }
 
 /*
- * Clear the mmio cache info for the given gva,
- * specially, if gva is ~0ul, we clear all mmio cache info.
+ * Clear the mmio cache info for the given gva. If gva is MMIO_GVA_ANY, we
+ * clear all mmio cache info.
  */
+#define MMIO_GVA_ANY (~(gva_t)0)
+
 static inline void vcpu_clear_mmio_info(struct kvm_vcpu *vcpu, gva_t gva)
 {
-   if (gva != (~0ul) && vcpu->arch.mmio_gva != (gva & PAGE_MASK))
+   if (gva != MMIO_GVA_ANY && vcpu->arch.mmio_gva != (gva & PAGE_MASK))
return;
 
vcpu->arch.mmio_gva = 0;
@@ -94,7 +103,8 @@ static inline void vcpu_clear_mmio_info(struct kvm_vcpu 
*vcpu, gva_t gva)
 
 static inline bool vcpu_match_mmio_gva(struct kvm_vcpu *vcpu, unsigned long 
gva)
 {
-   if (vcpu->arch.mmio_gva && vcpu->arch.mmio_gva == (gva & PAGE_MASK))
+   if (vcpu_match_mmio_gen(vcpu) && vcpu->arch.mmio_gva &&
+ vcpu->arch.mmio_gva == (gva & PAGE_MASK))
return true;
 
return false;
@@ -102,7 +112,8 @@ static inline bool vcpu_match_mmio_gva(struct kvm_vcpu 
*vcpu, unsigned long gva)
 
 static inline bool vcpu_match_mmio_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
 {
-   if (vcpu->arch.mmio_gfn && vcpu->arch.mmio_gfn == gpa >> PAGE_SHIFT)
+   if (vcpu_match_mmio_gen(vcpu) && vcpu

Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-18 Thread David Matlack
On Mon, Aug 18, 2014 at 8:50 PM, Xiao Guangrong
xiaoguangr...@linux.vnet.ibm.com wrote:
 On 08/19/2014 05:15 AM, David Matlack wrote:
 On Mon, Aug 18, 2014 at 12:56 PM, Xiao Guangrong
 xiaoguangrong.e...@gmail.com wrote:
 @@ -287,9 +293,15 @@ static bool set_mmio_spte(struct kvm *kvm, u64 *sptep, 
 gfn_t gfn,

  static bool check_mmio_spte(struct kvm *kvm, u64 spte)
  {
 +   struct kvm_memslots *slots = kvm_memslots(kvm);
 unsigned int kvm_gen, spte_gen;

 -   kvm_gen = kvm_current_mmio_generation(kvm);
  +   if (slots->updated)
 +   return false;
 +
 +   smp_rmb();
 +
 +   kvm_gen = __kvm_current_mmio_generation(slots);
 spte_gen = get_mmio_spte_generation(spte);


 What does this fix? Case 2 can still happen. (Case 2 is unavoidable unless we
 block during memslot updates, which I don't think we should :).

 This exactly fixes case 2; slots->updated just acts as the low bit
 but avoids generation-number wrap-around and tricky handling of the number.
 Please see below for more details.


 trace_check_mmio_spte(spte, kvm_gen, spte_gen);
 diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
 index 4b6c01b..1d4e78f 100644
 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -96,7 +96,7 @@ static void hardware_disable_all(void);

  static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
  static void update_memslots(struct kvm_memslots *slots,
 -   struct kvm_memory_slot *new, u64 
 last_generation);
 +   struct kvm_memory_slot *new);

  static void kvm_release_pfn_dirty(pfn_t pfn);
  static void mark_page_dirty_in_slot(struct kvm *kvm,
 @@ -685,8 +685,7 @@ static void sort_memslots(struct kvm_memslots *slots)
  }

  static void update_memslots(struct kvm_memslots *slots,
 -   struct kvm_memory_slot *new,
 -   u64 last_generation)
 +   struct kvm_memory_slot *new)
  {
 if (new) {
  int id = new->id;
  @@ -697,8 +696,6 @@ static void update_memslots(struct kvm_memslots *slots,
  if (new->npages != npages)
  sort_memslots(slots);
  }
  -
  -   slots->generation = last_generation + 1;
  }

  static int check_memory_region_flags(struct kvm_userspace_memory_region 
 *mem)
 @@ -720,10 +717,17 @@ static struct kvm_memslots 
 *install_new_memslots(struct kvm *kvm,
  {
  struct kvm_memslots *old_memslots = kvm->memslots;

  -   update_memslots(slots, new, kvm->memslots->generation);
  +   /* ensure generation number is always increased. */
  +   slots->updated = true;
  +   slots->generation = old_memslots->generation;
  +   update_memslots(slots, new);
  rcu_assign_pointer(kvm->memslots, slots);
  synchronize_srcu_expedited(&kvm->srcu);

  +   slots->generation++;
  +   smp_wmb();
  +   slots->updated = false;
 +
 kvm_arch_memslots_updated(kvm);

 return old_memslots;


 This is effectively the same as the first approach.

 I just realized how simple Paolo's idea is. I think it can be a one line
 patch (without comments):

 [...]
  update_memslots(slots, new, kvm->memslots->generation);
  rcu_assign_pointer(kvm->memslots, slots);
  synchronize_srcu_expedited(&kvm->srcu);
  +   slots->generation++;

 kvm_arch_memslots_updated(kvm);
 [...]

 Really? Unfortunately no. :)

 See this scenario:

 CPU 0  CPU 1
 ioctl registering a new memslot which
 contains GPA:
page-fault handler:
   see it's a mmio access on GPA;

  assign the new memslots with generation number increased
  cache the generation-number into spte;
  fix the access and comeback to guest;
 SRCU-sync
  page-fault again and check the spte is a valid 
 mmio-spte(*)
 generation-number++;
 return to userspace;
  do mmio-emulation and inject mmio-exit;

 !!! userspace receives an unexpected mmio-exit; that is exactly the case 2 I
 described in the last mail.


 Note that in step (*), my approach detects the invalid generation-number, which
 will invalidate the mmio spte properly.

Sorry I didn't explain myself very well: Since we can get a single wrong
mmio exit no matter what, it has to be handled in userspace. So my point
was, it doesn't really help to fix that one very specific way that it can
happen, because it can just happen in other ways. (E.g. update memslots
occurs after is_noslot_pfn() and before mmio exit).

But it looks like you basically said the same thing earlier, so I think
we're on the same page.

The single line patch I suggested was only intended to fix the forever
incorrectly exit mmio.

Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-18 Thread David Matlack
On Mon, Aug 18, 2014 at 9:41 PM, Xiao Guangrong
xiaoguangr...@linux.vnet.ibm.com wrote:
 On 08/19/2014 12:31 PM, David Matlack wrote:
 But it looks like you basically said the same thing earlier, so I think
 we're on the same page.


 Yes, that is what I tried to explain in previous mails. :(

I'm glad we understand each other now! Sorry again for my confusion.

 The single line patch I suggested was only intended to fix the forever
 incorrectly exit mmio.

 My patch also fixes this case and that does not doubly increase the
 number. I think this is the better one.

I prefer doubly increasing the generation for this reason: the updated boolean
requires extra code on the client-side to check if there's an update in
progress. And that makes it easy to get wrong. In fact, your patch
forgot to check the updated bit in mark_mmio_spte(). Doubly increasing the
generation requires no client-side code to work.


Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-18 Thread David Matlack
On Mon, Aug 18, 2014 at 10:19 PM, Xiao Guangrong
xiaoguangr...@linux.vnet.ibm.com wrote:
 On 08/19/2014 01:00 PM, David Matlack wrote:
 On Mon, Aug 18, 2014 at 9:41 PM, Xiao Guangrong
 xiaoguangr...@linux.vnet.ibm.com wrote:
 On 08/19/2014 12:31 PM, David Matlack wrote:
 The single line patch I suggested was only intended to fix the forever
 incorrectly exit mmio.

 My patch also fixes this case and that does not doubly increase the
 number. I think this is the better one.

 I prefer doubly increasing the generation for this reason: the updated 
 boolean
 requires extra code on the client-side to check if there's an update in
 progress. And that makes it easy to get wrong. In fact, your patch
 forgot to check the updated bit in mark_mmio_spte(). Doubly increasing the
 generation requires no client-side code to work.

 No, the updated patch is used to fix case 2, for which I drew the scenario in
 the last mail. I mean the original patch in this patchset, which just
 increases the number after srcu-sync.

 Then could you tell me what your approach can do that my original patch
 cannot?

It avoids publishing new memslots with an old generation number attached to
them (even if it only lasts for a short period of time). Do you have a reason
why you don't want to doubly increase the generation?


Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-19 Thread David Matlack
On Tue, Aug 19, 2014 at 5:29 PM, Xiao Guangrong
xiaoguangr...@linux.vnet.ibm.com wrote:
 On 08/19/2014 05:03 PM, Paolo Bonzini wrote:
 On 19/08/2014 10:50, Xiao Guangrong wrote:
 Okay, what confused me is that it seemed the single line patch
 was ok to you. :)

 No, it was late and I was confused. :)

 Now, do we really need to care the case 2? like David said:
 Sorry I didn't explain myself very well: Since we can get a single wrong
 mmio exit no matter what, it has to be handled in userspace. So my point
 was, it doesn't really help to fix that one very specific way that it can
 happen, because it can just happen in other ways. (E.g. update memslots
 occurs after is_noslot_pfn() and before mmio exit).

 What's your idea?

 I think if you always treat the low bit as zero in mmio sptes, you can
 do that without losing a bit of the generation.

 What you did is avoid caching an invalid generation number into the spte, but
 actually if we can figure it out when we check the mmio access, it's ok. The
 updated patch I posted should fix it; that way avoids doubly increasing the
 number.

 Yes.

 Okay, if you're interested in increasing the number doubly, there is a
 simpler one:

 This wastes a bit in the mmio spte though.  My idea is to increase the
 memslots generation twice, but drop the low bit in the mmio spte.

 Yeah, really smart idea. :)

 Paolo/David, would you mind making a patch for this (+ the comments in David's
 patch)?

Paolo, since it was your idea would you like to write it? I don't mind either
way.


 Please feel free to add my:
 Reviewed-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
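
For reference, the approach converged on here amounts to bumping the generation
a second time after synchronize_srcu_expedited() and never storing the low
generation bit in an MMIO spte. A rough sketch (the single-field layout and
MMIO_SPTE_GEN_SHIFT are illustrative, not the final patch):

/*
 * Illustrative sketch only. An spte written while the generation is odd
 * (memslot update in flight) stores an even number that can never match
 * the post-update generation, so it is simply treated as a cache miss.
 */
static u64 generation_mmio_spte_mask(unsigned int gen)
{
	/* bit 0 of the generation is never stored in the spte */
	return (u64)(gen & ~1u) << MMIO_SPTE_GEN_SHIFT;
}

static unsigned int get_mmio_spte_generation(u64 spte)
{
	/* bit 0 is presumed zero on extraction, so the result is always even */
	return (spte >> MMIO_SPTE_GEN_SHIFT) & MMIO_GEN_MASK & ~1u;
}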


Re: [PATCH 2/2] kvm: x86: fix stale mmio cache bug

2014-08-28 Thread David Matlack
On Mon, Aug 18, 2014 at 3:46 PM, David Matlack dmatl...@google.com wrote:
 The following events can lead to an incorrect KVM_EXIT_MMIO bubbling
 up to userspace:

 (1) Guest accesses gpa X without a memory slot. The gfn is cached in
 struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets
 the SPTE write-execute-noread so that future accesses cause
 EPT_MISCONFIGs.

 (2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION
 covering the page just accessed.

 (3) Guest attempts to read or write to gpa X again. On Intel, this
 generates an EPT_MISCONFIG. The memory slot generation number that
 was incremented in (2) would normally take care of this but we fast
 path mmio faults through quickly_check_mmio_pf(), which only checks
 the per-vcpu mmio cache. Since we hit the cache, KVM passes a
 KVM_EXIT_MMIO up to userspace.

 This patch fixes the issue by using the memslot generation number
 to validate the mmio cache.

 [ xiaoguangrong: adjust the code to make it simpler for stable-tree fix. ]

 Cc: sta...@vger.kernel.org
 Signed-off-by: David Matlack dmatl...@google.com
 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---

Paolo,
It seems like this patch ([PATCH 2/2] kvm: x86: fix stale mmio cache)
is ready to go. Is there anything blocking it from being merged?

(It should be fine to merge this on its own, independent of the fix
discussed in [PATCH 1/2] KVM: fix cache stale memslot info with
correct mmio generation number, https://lkml.org/lkml/2014/8/14/62.)

  arch/x86/include/asm/kvm_host.h |  1 +
  arch/x86/kvm/mmu.c  |  4 ++--
  arch/x86/kvm/mmu.h  |  2 ++
  arch/x86/kvm/x86.h  | 21 -
  4 files changed, 21 insertions(+), 7 deletions(-)

 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 49205d0..f518d14 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -479,6 +479,7 @@ struct kvm_vcpu_arch {
 u64 mmio_gva;
 unsigned access;
 gfn_t mmio_gfn;
 +   unsigned int mmio_gen;

 struct kvm_pmu pmu;

 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 9314678..e00fbfe 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -234,7 +234,7 @@ static unsigned int get_mmio_spte_generation(u64 spte)
 return gen;
  }

 -static unsigned int kvm_current_mmio_generation(struct kvm *kvm)
 +unsigned int kvm_current_mmio_generation(struct kvm *kvm)
  {
 /*
  * Init kvm generation close to MMIO_MAX_GEN to easily test the
 @@ -3163,7 +3163,7 @@ static void mmu_sync_roots(struct kvm_vcpu *vcpu)
 if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
 return;

 -   vcpu_clear_mmio_info(vcpu, ~0ul);
 +   vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
 kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
 if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
 hpa_t root = vcpu->arch.mmu.root_hpa;
 diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
 index b982112..e2d902a 100644
 --- a/arch/x86/kvm/mmu.h
 +++ b/arch/x86/kvm/mmu.h
 @@ -76,6 +76,8 @@ enum {
  };

  int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool 
 direct);
 +unsigned int kvm_current_mmio_generation(struct kvm *kvm);
 +
  void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
  void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
 bool execonly);
 diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
 index 8c97bac..ae7006d 100644
 --- a/arch/x86/kvm/x86.h
 +++ b/arch/x86/kvm/x86.h
 @@ -3,6 +3,7 @@

  #include <linux/kvm_host.h>
  #include "kvm_cache_regs.h"
 +#include "mmu.h"

  static inline void kvm_clear_exception_queue(struct kvm_vcpu *vcpu)
  {
 @@ -78,15 +79,23 @@ static inline void vcpu_cache_mmio_info(struct kvm_vcpu 
 *vcpu,
 vcpu->arch.mmio_gva = gva & PAGE_MASK;
 vcpu->arch.access = access;
 vcpu->arch.mmio_gfn = gfn;
 +   vcpu->arch.mmio_gen = kvm_current_mmio_generation(vcpu->kvm);
 +}
 +
 +static inline bool vcpu_match_mmio_gen(struct kvm_vcpu *vcpu)
 +{
 +   return vcpu->arch.mmio_gen == kvm_current_mmio_generation(vcpu->kvm);
  }

  /*
 - * Clear the mmio cache info for the given gva,
 - * specially, if gva is ~0ul, we clear all mmio cache info.
 + * Clear the mmio cache info for the given gva. If gva is MMIO_GVA_ANY, we
 + * clear all mmio cache info.
   */
 +#define MMIO_GVA_ANY (~(gva_t)0)
 +
  static inline void vcpu_clear_mmio_info(struct kvm_vcpu *vcpu, gva_t gva)
  {
 -   if (gva != (~0ul) && vcpu->arch.mmio_gva != (gva & PAGE_MASK))
 +   if (gva != MMIO_GVA_ANY && vcpu->arch.mmio_gva != (gva & PAGE_MASK))
 return;

 vcpu->arch.mmio_gva = 0;
 @@ -94,7 +103,8 @@ static inline void vcpu_clear_mmio_info(struct kvm_vcpu 
 *vcpu, gva_t gva)

  static inline bool vcpu_match_mmio_gva(struct kvm_vcpu *vcpu, unsigned long 
 gva)
  {
 -   if (vcpu

Re: [PATCH 2/2] kvm: x86: fix stale mmio cache bug

2014-08-29 Thread David Matlack
On Fri, Aug 29, 2014 at 12:58 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 28/08/2014 23:10, David Matlack wrote:
 Paolo,
 It seems like this patch ([PATCH 2/2] kvm: x86: fix stale mmio cache)
 is ready to go. Is there anything blocking it from being merged?

 (It should be fine to merge this on its own, independent of the fix
 discussed in [PATCH 1/2] KVM: fix cache stale memslot info with
 correct mmio generation number, https://lkml.org/lkml/2014/8/14/62.)

 I'll post the full series today.  Sorry, I've been swamped a bit.

Thanks, I'll start testing it. Hope things quiet down for you soon :)


 Paolo


Re: [PATCH 2/3] kvm: fix potentially corrupt mmio cache

2014-09-02 Thread David Matlack
On Fri, Aug 29, 2014 at 3:31 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 From: David Matlack dmatl...@google.com

 vcpu exits and memslot mutations can run concurrently as long as the
 vcpu does not acquire the slots mutex. Thus it is theoretically possible
 for memslots to change underneath a vcpu that is handling an exit.

 If we increment the memslot generation number again after
 synchronize_srcu_expedited(), vcpus can safely cache memslot generation
 without maintaining a single rcu_dereference through an entire vm exit.
 And much of the x86/kvm code does not maintain a single rcu_dereference
 of the current memslots during each exit.

 We can prevent the following case:

vcpu (CPU 0) | thread (CPU 1)
 +--
 1  vm exit  |
 2  srcu_read_unlock(kvm->srcu) |
 3  decide to cache something based on   |
  old memslots   |
 4   | change memslots
 | (increments generation)
 5   | synchronize_srcu(kvm->srcu);
 6  retrieve generation # from new memslots  |
 7  tag cache with new memslot generation|
 8  srcu_read_unlock(kvm-srcu) |
 ... |
action based on cache occurs even   |
 though the caching decision was based   |
 on the old memslots|
 ... |
action *continues* to occur until next  |
 memslot generation change, which may|
 be never   |
 |

 By incrementing the generation after synchronizing with kvm->srcu readers,
 we ensure that the generation retrieved in (6) will become invalid soon
 after (8).

 Keeping the existing increment is not strictly necessary, but we
 do keep it and just move it for consistency from update_memslots to
 install_new_memslots.  It invalidates old cached MMIOs immediately,
 instead of having to wait for the end of synchronize_srcu_expedited,
 which makes the code more clearly correct in case CPU 1 is preempted
 right after synchronize_srcu() returns.

 To avoid halving the generation space in SPTEs, always presume that the
 low bit of the generation is zero when reconstructing a generation number
 out of an SPTE.  This effectively disables MMIO caching in SPTEs during
 the call to synchronize_srcu_expedited.  Using the low bit this way is
 somewhat like a seqcount---where the protected thing is a cache, and
 instead of retrying we can simply punt if we observe the low bit to be 1.
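
Boiled down, the trick looks like this (a sketch with placeholder names, not
the real mmu.c helpers):

	/*
	 * Bit 0 of the generation is never stored in the MMIO spte and is
	 * presumed zero when the generation is read back out.  While a
	 * memslot update is in flight the real generation is odd, so a
	 * freshly created spte can never compare equal to it and the fast
	 * path misses, falling back to the slow path.
	 */
	static unsigned int pack_mmio_gen(unsigned int gen)
	{
		return gen & ~1u;		/* low bit dropped on store */
	}

	static bool mmio_spte_gen_is_current(unsigned int spte_gen,
					     unsigned int cur_gen)
	{
		return spte_gen == cur_gen;	/* false whenever cur_gen is odd */
	}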

 Cc: sta...@vger.kernel.org
 Cc: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 Signed-off-by: David Matlack dmatl...@google.com
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
  Documentation/virtual/kvm/mmu.txt | 14 ++
  arch/x86/kvm/mmu.c| 20 
  virt/kvm/kvm_main.c   | 23 ---
  3 files changed, 42 insertions(+), 15 deletions(-)

 diff --git a/Documentation/virtual/kvm/mmu.txt 
 b/Documentation/virtual/kvm/mmu.txt
 index 290894176142..53838d9c6295 100644
 --- a/Documentation/virtual/kvm/mmu.txt
 +++ b/Documentation/virtual/kvm/mmu.txt
 @@ -425,6 +425,20 @@ fault through the slow path.
  Since only 19 bits are used to store generation-number on mmio spte, all
  pages are zapped when there is an overflow.

 +Unfortunately, a single memory access might access kvm_memslots(kvm) multiple
 +times, the last one happening when the generation number is retrieved and
 +stored into the MMIO spte.  Thus, the MMIO spte might be created based on
 +out-of-date information, but with an up-to-date generation number.
 +
 +To avoid this, the generation number is incremented again after synchronize_srcu
 +returns; thus, the low bit of kvm_memslots(kvm)->generation is only 1 during a
 +memslot update, while some SRCU readers might be using the old copy.  We do not
 +want to use an MMIO spte created with an odd generation number, and we can do
 +this without losing a bit in the MMIO spte.  The low bit of the generation
 +is not stored in MMIO spte, and presumed zero when it is extracted out of the
 +spte.  If KVM is unlucky and creates an MMIO spte while the low bit is 1,
 +the next access to the spte will always be a cache miss.
 +

  Further reading
  ===
 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 323c3f5f5c84..96515957ba82 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -199,16 +199,20 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask)
  EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);

  /*
 - * spte bits of bit 3 ~ bit 11 are used as low 9 bits of generation number,
 - * the bits of bits 52 ~ bit 61 are used as high 10 bits of generation
 - * number.
 + * the low bit

Re: [PATCH 0/3] fix bugs with stale or corrupt MMIO caches

2014-09-02 Thread David Matlack
On Tue, Sep 2, 2014 at 8:42 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 29/08/2014 12:31, Paolo Bonzini ha scritto:
 David and Xiao, here's my take on the MMIO generation patches.  Now
 with documentation, too. :)  Please review!

 David Matlack (2):
   kvm: fix potentially corrupt mmio cache
   kvm: x86: fix stale mmio cache bug

 Paolo Bonzini (1):
   KVM: do not bias the generation number in kvm_current_mmio_generation

  Documentation/virtual/kvm/mmu.txt | 14 ++
  arch/x86/include/asm/kvm_host.h   |  1 +
  arch/x86/kvm/mmu.c| 29 ++---
  arch/x86/kvm/x86.h| 20 +++-
  virt/kvm/kvm_main.c   | 30 +++---
  5 files changed, 67 insertions(+), 27 deletions(-)


 Ping?

Sorry for the delay. I think the patches look good. And patch 3/3 still
fixes the bug I was originally seeing, so I'm happy :). I just had one
small comment (see my reply to patch 2/3).


 Paolo


Re: [PATCH 2/3] kvm: fix potentially corrupt mmio cache

2014-09-02 Thread David Matlack
On Tue, Sep 2, 2014 at 9:49 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 02/09/2014 18:44, David Matlack ha scritto:
 
  -#define MMIO_GEN_SHIFT 19
  -#define MMIO_GEN_LOW_SHIFT 9
  -#define MMIO_GEN_LOW_MASK  ((1 << MMIO_GEN_LOW_SHIFT) - 1)
  +#define MMIO_GEN_SHIFT 20
  +#define MMIO_GEN_LOW_SHIFT 10
  +#define MMIO_GEN_LOW_MASK  ((1 << MMIO_GEN_LOW_SHIFT) - 2)
   #define MMIO_GEN_MASK  ((1 << MMIO_GEN_SHIFT) - 1)
   #define MMIO_MAX_GEN   ((1 << MMIO_GEN_SHIFT) - 1)
 
  @@ -4428,7 +4432,7 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm)
   * The very rare case: if the generation-number is round,
   * zap all shadow pages.
   */
  -   if (unlikely(kvm_current_mmio_generation(kvm) >= MMIO_MAX_GEN)) {
  +   if (unlikely(kvm_current_mmio_generation(kvm) == 0)) {

 This should be in patch 1/3.

 I don't think so.  This change is not due to the removal of biasing in
 x86.c, but rather due to removing bit 0 from MMIO_GEN_LOW_MASK.

 I placed it here, because the previous test works just fine until bit 0
 is removed from MMIO_GEN_LOW_MASK.

Ah ok, you're right.


 Paolo


Re: [PATCH 0/3] fix bugs with stale or corrupt MMIO caches

2014-09-02 Thread David Matlack
On Tue, Sep 2, 2014 at 9:50 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 02/09/2014 18:47, David Matlack ha scritto:
  Ping?
 Sorry for the delay. I think the patches look good. And patch 3/3 still
 fixes the bug I was originally seeing, so I'm happy :). I just had one
 small comment (see my reply to patch 2/3).


 I answered that question now.  Can I add your Reviewed-by and, for patch
 3, Tested-by?

Please do. And thanks again for working on these!


 Paolo


[PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls

2014-09-19 Thread David Matlack
vcpu ioctls can hang the calling thread if issued while a vcpu is
running. If we know ioctl is going to be rejected as invalid anyway,
we can fail before trying to take the vcpu mutex.

This patch does not change functionality, it just makes invalid ioctls
fail faster.

Signed-off-by: David Matlack dmatl...@google.com
---
 virt/kvm/kvm_main.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 96ec622..f9234e5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -52,6 +52,7 @@
 
 #include <asm/processor.h>
 #include <asm/io.h>
+#include <asm/ioctl.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 
@@ -1975,6 +1976,9 @@ static long kvm_vcpu_ioctl(struct file *filp,
if (vcpu-kvm-mm != current-mm)
return -EIO;
 
+   if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
+   return -EINVAL;
+
 #if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
/*
 * Special cases: vcpu ioctls that are asynchronous to vcpu execution,
-- 
2.1.0.rc2.206.gedb03e5



Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls

2014-09-22 Thread David Matlack
On 09/22, Paolo Bonzini wrote:
 Il 22/09/2014 15:45, Christian Borntraeger ha scritto:
  We now have an extra condition check for every valid ioctl, to make an 
  error case go faster.
  I know, the extra check is just a 1 or 2 cycles if branch prediction is 
  right, but still.
 
 I applied the patch because the delay could be substantial, depending on
 what the other VCPU is doing.  Perhaps something like this would be
 better?

I'm happy with either approach.

 
 (Untested, but Tested-by/Reviewed-bys are welcome).

There were a few build bugs in your diff. Here's a working version that
I tested. Feel free to add my Tested-by and Reviewed-by if you go with
this.


diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c71931f..fbdcdc2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -133,12 +133,10 @@ bool kvm_is_mmio_pfn(pfn_t pfn)
 /*
  * Switches to specified vcpu, until a matching vcpu_put()
  */
-int vcpu_load(struct kvm_vcpu *vcpu)
+static void __vcpu_load(struct kvm_vcpu *vcpu)
 {
int cpu;
 
-   if (mutex_lock_killable(vcpu-mutex))
-   return -EINTR;
if (unlikely(vcpu-pid != current-pids[PIDTYPE_PID].pid)) {
/* The thread running this VCPU changed. */
struct pid *oldpid = vcpu-pid;
@@ -151,6 +149,14 @@ int vcpu_load(struct kvm_vcpu *vcpu)
preempt_notifier_register(vcpu-preempt_notifier);
kvm_arch_vcpu_load(vcpu, cpu);
put_cpu();
+}
+
+int vcpu_load(struct kvm_vcpu *vcpu)
+{
+   if (mutex_lock_killable(vcpu-mutex))
+   return -EINTR;
+
+   __vcpu_load(vcpu);
return 0;
 }
 
@@ -2197,10 +2203,21 @@ static long kvm_vcpu_ioctl(struct file *filp,
return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
 #endif
 
+   if (!mutex_trylock(vcpu-mutex)) {
+   /*
+* Before a potentially long sleep, check if we'd exit anyway.
+* The common case is for the mutex not to be contended, when
+* this does not add overhead.
+*/
+   if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
+   return -EINVAL;
+
+   if (mutex_lock_killable(vcpu-mutex))
+   return -EINTR;
+   }
+
+   __vcpu_load(vcpu);
 
-   r = vcpu_load(vcpu);
-   if (r)
-   return r;
switch (ioctl) {
case KVM_RUN:
r = -EINVAL;


Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls

2014-09-22 Thread David Matlack
On 09/22, Christian Borntraeger wrote:
 On 09/22/2014 04:31 PM, Paolo Bonzini wrote:
  Il 22/09/2014 15:45, Christian Borntraeger ha scritto:
  We now have an extra condition check for every valid ioctl, to make an 
  error case go faster.
  I know, the extra check is just a 1 or 2 cycles if branch prediction is 
  right, but still.
  
  I applied the patch because the delay could be substantial,
 
 I know, but only for seriously misbehaving userspace, no? See my comment is 
 really a minor one - lets say 2 more cycles for something that exited to 
 userspace - nobody would even notice. I am just disturbed by the fact that we 
 care about something that is not slow-path but broken beyond repair (why does 
 userspace call a non-KVM ioctl on a fd of a vcpu from a different thread 
 (otherwise the mutex would be free)?
 
 Please, can we have an explanation, e.g. something like
 "while using trinity to fuzz KVM, we noticed long stalls on invalid ioctls.
 Let's bail out early on invalid ioctls." or similar?

We noticed this while using code that inspects the open file descriptors
of a process. One of the things it did was check if each file descriptor
was a tty (isatty() calls ioctl(TCGETS)).
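
For illustration, that probe boils down to something like this (a sketch, not
the actual inspection code):

	#include <sys/ioctl.h>
	#include <termios.h>

	static int fd_is_tty(int fd)
	{
		struct termios t;

		/*
		 * TCGETS is not a KVMIO ioctl, so on a vcpu fd this used to
		 * sleep on the vcpu mutex while KVM_RUN was in flight; with
		 * the patch it fails immediately instead.
		 */
		return ioctl(fd, TCGETS, &t) == 0;
	}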

 
 
  depending on what the other VCPU is doing.
  Perhaps something like this would be
  better?
  
  (Untested, but Tested-by/Reviewed-bys are welcome).
 
 Given that it makes sense to improve a misbehaving userspace, I prefer David's
 variant as the patch is smaller and easier to get right. No need to revert,
 but a better explanation for the use case would be appreciated.
 
 Christian 
  
  Paolo
  
  diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
  index 84e24b210273..ed31760d79fe 100644
  --- a/virt/kvm/kvm_main.c
  +++ b/virt/kvm/kvm_main.c
  @@ -117,12 +117,10 @@ bool kvm_is_mmio_pfn(pfn_t pfn)
   /*
* Switches to specified vcpu, until a matching vcpu_put()
*/
  -int vcpu_load(struct kvm_vcpu *vcpu)
  +static void __vcpu_load(struct kvm_vcpu *vcpu)
   {
  int cpu;
  
  -   if (mutex_lock_killable(vcpu-mutex))
  -   return -EINTR;
  if (unlikely(vcpu-pid != current-pids[PIDTYPE_PID].pid)) {
  /* The thread running this VCPU changed. */
  struct pid *oldpid = vcpu-pid;
  @@ -136,6 +134,14 @@ int vcpu_load(struct kvm_vcpu *vcpu)
  preempt_notifier_register(vcpu-preempt_notifier);
  kvm_arch_vcpu_load(vcpu, cpu);
  put_cpu();
  +}
  +
  +int vcpu_load(struct kvm_vcpu *vcpu)
  +{
  +   if (mutex_lock_killable(vcpu-mutex))
  +   return -EINTR;
  +
  +   __vcpu_load(vcpu);
  return 0;
   }
  
  @@ -1989,9 +1995,6 @@ static long kvm_vcpu_ioctl(struct file *filp,
  if (vcpu-kvm-mm != current-mm)
  return -EIO;
  
  -   if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
  -   return -EINVAL;
  -
   #if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
  /*
   * Special cases: vcpu ioctls that are asynchronous to vcpu execution,
  @@ -2001,8 +2004,21 @@ static long kvm_vcpu_ioctl(struct file *filp,
  return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
   #endif
  
  +   if (!mutex_trylock(vcpu-mutex))) {
  +   /*
  +* Before a potentially long sleep, check if we'd exit anyway.
  +* The common case is for the mutex not to be contended, when
  +* this does not add overhead.
  +*/
  +   if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
  +   return -EINVAL;
  +
  +   if (mutex_lock_killable(vcpu-mutex))
  +   return -EINTR;
  +   }
  +
  
  -   r = vcpu_load(vcpu);
  +   r = __vcpu_load(vcpu);
  if (r)
  return r;
  switch (ioctl) {
  
 


Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls

2014-09-22 Thread David Matlack
On 09/22, Marcelo Tosatti wrote:
 On Fri, Sep 19, 2014 at 04:03:25PM -0700, David Matlack wrote:
  vcpu ioctls can hang the calling thread if issued while a vcpu is
  running. 
 
 There is a mutex per-vcpu, so thats expected, OK...
 
  If we know ioctl is going to be rejected as invalid anyway,
  we can fail before trying to take the vcpu mutex.
 
 Consider a valid ioctl that takes the vcpu mutex. If you need immediate
 access for that valid ioctl, it is necessary to interrupt thread
 which KVM_RUN ioctl executes. 
 
 So knowledge of whether KVM_RUN is being executed is expected in
 userspace (either
 that or ask the KVM_RUN thread to run the ioctl for you, as qemu does).
 
 Can't see why having different behaviour for valid/invalid ioctls
 is a good thing.
 
  This patch does not change functionality, it just makes invalid ioctls
  fail faster.
 
 Should not be executing vcpu ioctls without interrupt KVM_RUN in the
 first place.

This patch is trying to be nice to code that isn't aware it's
probing kvm file descriptors. We saw long hangs with some generic
process inspection code that was probing all open file descriptors.
There's no reason non-kvm ioctls should have to wait for the vcpu
mutex to become available just to fail.


[PATCH] kvm: x86: add trace event for pvclock updates

2014-11-05 Thread David Matlack
The new trace event records:
  * the id of vcpu being updated
  * the pvclock_vcpu_time_info struct being written to guest memory

This is useful for debugging pvclock bugs, such as the bug fixed by
"[PATCH] kvm: x86: Fix kvm clock versioning".

Signed-off-by: David Matlack dmatl...@google.com
---
 arch/x86/kvm/trace.h | 37 +
 arch/x86/kvm/x86.c   |  2 ++
 2 files changed, 39 insertions(+)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 6b06ab8..c2a34bb 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -5,6 +5,7 @@
 #include <asm/vmx.h>
 #include <asm/svm.h>
 #include <asm/clocksource.h>
+#include <asm/pvclock-abi.h>
 
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM kvm
@@ -877,6 +878,42 @@ TRACE_EVENT(kvm_ple_window,
 #define trace_kvm_ple_window_shrink(vcpu_id, new, old) \
trace_kvm_ple_window(false, vcpu_id, new, old)
 
+TRACE_EVENT(kvm_pvclock_update,
+   TP_PROTO(unsigned int vcpu_id, struct pvclock_vcpu_time_info *pvclock),
+   TP_ARGS(vcpu_id, pvclock),
+
+   TP_STRUCT__entry(
+   __field(unsigned int,   vcpu_id )
+   __field(__u32,  version )
+   __field(__u64,  tsc_timestamp   )
+   __field(__u64,  system_time )
+   __field(__u32,  tsc_to_system_mul   )
+   __field(__s8,   tsc_shift   )
+   __field(__u8,   flags   )
+   ),
+
+   TP_fast_assign(
+   __entry-vcpu_id   = vcpu_id;
+   __entry-version   = pvclock-version;
+   __entry-tsc_timestamp = pvclock-tsc_timestamp;
+   __entry-system_time   = pvclock-system_time;
+   __entry-tsc_to_system_mul = pvclock-tsc_to_system_mul;
+   __entry-tsc_shift = pvclock-tsc_shift;
+   __entry-flags = pvclock-flags;
+   ),
+
+   TP_printk("vcpu_id %u, pvclock { version %u, tsc_timestamp 0x%llx, "
+ "system_time 0x%llx, tsc_to_system_mul 0x%x, tsc_shift %d, "
+ "flags 0x%x }",
+ __entry-vcpu_id,
+ __entry-version,
+ __entry-tsc_timestamp,
+ __entry-system_time,
+ __entry-tsc_to_system_mul,
+ __entry-tsc_shift,
+ __entry-flags)
+);
+
 #endif /* _TRACE_KVM_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0033df3..0f533df 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1662,6 +1662,8 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 
vcpu-hv_clock.flags = pvclock_flags;
 
+   trace_kvm_pvclock_update(v-vcpu_id, vcpu-hv_clock);
+
kvm_write_guest_cached(v-kvm, vcpu-pv_time,
vcpu-hv_clock,
sizeof(vcpu-hv_clock));
-- 
2.1.0.rc2.206.gedb03e5



Re: [PATCH] kvm: x86: add trace event for pvclock updates

2014-11-12 Thread David Matlack
On 11/10 11:18 PM, Marcelo Tosatti wrote:
 On Wed, Nov 05, 2014 at 11:46:42AM -0800, David Matlack wrote:
  The new trace event records:
* the id of vcpu being updated
* the pvclock_vcpu_time_info struct being written to guest memory
  
  This is useful for debugging pvclock bugs, such as the bug fixed by
  [PATCH] kvm: x86: Fix kvm clock versioning..
  
  Signed-off-by: David Matlack dmatl...@google.com
 
 So you actually hit that bug in practice? Can you describe the
 scenario?

We noticed guests running stress workloads would occasionally get stuck 
on the far side of a save/restore. Inspecting the guest state revealed
arch/x86/kernel/pvclock.c:last_value was stuck at a value like
8020566108469899263, despite TSC and pvclock looking sane.

Since these guests ran without PVCLOCK_TSC_STABLE_BIT set in their
CPUID, they were stuck with this large time value until real time
caught up (in about 250 years :).

We've been unable to reproduce the bug with kvm: x86: Fix kvm clock
versioning. so we didn't invest in catching the overflow in the act,
but a likely explanation is the guest gets save/restore-ed while
computing the pvclock delta:

u64 delta = __native_read_tsc() - src->tsc_timestamp;

causing the subtraction to underflow and delta to be huge.
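
Reusing the names from that line, the suspected wrap looks like this (sketch
only; we never caught it in the act):

	u64 tsc   = __native_read_tsc();   /* read on one side of the save/restore */
	u64 stamp = src->tsc_timestamp;    /* refreshed on the other side           */
	u64 delta = tsc - stamp;           /* if stamp > tsc, the unsigned subtraction
	                                      wraps to a huge value, which then gets
	                                      latched into last_value               */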

 
 


Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2014-12-24 Thread David Matlack
On Mon, Dec 22, 2014 at 4:39 PM, Andy Lutomirski l...@amacapital.net wrote:
 The pvclock vdso code was too abstracted to understand easily and
 excessively paranoid.  Simplify it for a huge speedup.

 This opens the door for additional simplifications, as the vdso no
 longer accesses the pvti for any vcpu other than vcpu 0.

 Before, vclock_gettime using kvm-clock took about 64ns on my machine.
 With this change, it takes 19ns, which is almost as fast as the pure TSC
 implementation.

 Signed-off-by: Andy Lutomirski l...@amacapital.net
 ---
  arch/x86/vdso/vclock_gettime.c | 82 
 --
  1 file changed, 47 insertions(+), 35 deletions(-)

 diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
 index 9793322751e0..f2e0396d5629 100644
 --- a/arch/x86/vdso/vclock_gettime.c
 +++ b/arch/x86/vdso/vclock_gettime.c
 @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info 
 *get_pvti(int cpu)

  static notrace cycle_t vread_pvclock(int *mode)
  {
 -   const struct pvclock_vsyscall_time_info *pvti;
 +   const struct pvclock_vcpu_time_info *pvti = get_pvti(0)-pvti;
 cycle_t ret;
 -   u64 last;
 -   u32 version;
 -   u8 flags;
 -   unsigned cpu, cpu1;
 -
 +   u64 tsc, pvti_tsc;
 +   u64 last, delta, pvti_system_time;
 +   u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;

 /*
 -* Note: hypervisor must guarantee that:
 -* 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
 -* 2. that per-CPU pvclock time info is updated if the
 -*underlying CPU changes.
 -* 3. that version is increased whenever underlying CPU
 -*changes.
 +* Note: The kernel and hypervisor must guarantee that cpu ID
 +* number maps 1:1 to per-CPU pvclock time info.
 +*
 +* Because the hypervisor is entirely unaware of guest userspace
 +* preemption, it cannot guarantee that per-CPU pvclock time
 +* info is updated if the underlying CPU changes or that that
 +* version is increased whenever underlying CPU changes.
 +*
 +* On KVM, we are guaranteed that pvti updates for any vCPU are
 +* atomic as seen by *all* vCPUs.  This is an even stronger
 +* guarantee than we get with a normal seqlock.
  *
 +* On Xen, we don't appear to have that guarantee, but Xen still
 +* supplies a valid seqlock using the version field.
 +

Forgotten * here?

 +* We only do pvclock vdso timing at all if
 +* PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
 +* mean that all vCPUs have matching pvti and that the TSC is
 +* synced, so we can just look at vCPU 0's pvti.
  */
 -   do {
 -   cpu = __getcpu()  VGETCPU_CPU_MASK;
 -   /* TODO: We can put vcpu id into higher bits of pvti.version.
 -* This will save a couple of cycles by getting rid of
 -* __getcpu() calls (Gleb).
 -*/
 -
 -   pvti = get_pvti(cpu);
 -
 -   version = __pvclock_read_cycles(pvti-pvti, ret, flags);
 -
 -   /*
 -* Test we're still on the cpu as well as the version.
 -* We could have been migrated just after the first
 -* vgetcpu but before fetching the version, so we
 -* wouldn't notice a version change.
 -*/
 -   cpu1 = __getcpu()  VGETCPU_CPU_MASK;
 -   } while (unlikely(cpu != cpu1 ||
 - (pvti-pvti.version  1) ||
 - pvti-pvti.version != version));
 -
 -   if (unlikely(!(flags  PVCLOCK_TSC_STABLE_BIT)))
 +
 +   if (unlikely(!(pvti-flags  PVCLOCK_TSC_STABLE_BIT))) {
 *mode = VCLOCK_NONE;
 +   return 0;
 +   }
 +
 +   do {
 +   version = pvti-version;
 +
 +   /* This is also a read barrier, so we'll read version first. 
 */
 +   rdtsc_barrier();
 +   tsc = __native_read_tsc();

Is there a reason why you read the tsc inside the loop rather than once
after the loop?

 +
 +   pvti_tsc_to_system_mul = pvti-tsc_to_system_mul;
 +   pvti_tsc_shift = pvti-tsc_shift;
 +   pvti_system_time = pvti-system_time;
 +   pvti_tsc = pvti-tsc_timestamp;
 +
 +   /* Make sure that the version double-check is last. */
 +   smp_rmb();
 +   } while (unlikely((version  1) || version != pvti-version));
 +
 +   delta = tsc - pvti_tsc;
 +   ret = pvti_system_time +
 +   pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
 +   pvti_tsc_shift);

 /* refer to tsc.c read_tsc() comment for rationale */
 last = gtod-cycle_last;
 --
 2.1.0


Re: [PATCH] kvm: add halt_poll_ns module parameter

2015-02-09 Thread David Matlack
On Fri, Feb 6, 2015 at 4:48 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 This patch introduces a new module parameter for the KVM module; when it
 is present, KVM attempts a bit of polling on every HLT before scheduling
 itself out via kvm_vcpu_block.

[...]

 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---

Looks good, thanks for making those changes. I ran this patch on my
benchmarks (loopback TCP_RR and memcache) using halt_poll_ns=7 and
saw performance go from 40% to 60-65% of bare-metal.

Tested-by: David Matlack dmatl...@google.com
Reviewed-by: David Matlack dmatl...@google.com

  arch/arm/include/asm/kvm_host.h |  1 +
  arch/arm64/include/asm/kvm_host.h   |  1 +
  arch/mips/include/asm/kvm_host.h|  1 +
  arch/mips/kvm/mips.c|  1 +
  arch/powerpc/include/asm/kvm_host.h |  1 +
  arch/powerpc/kvm/book3s.c   |  1 +
  arch/powerpc/kvm/booke.c|  1 +
  arch/s390/include/asm/kvm_host.h|  1 +
  arch/s390/kvm/kvm-s390.c|  1 +
  arch/x86/include/asm/kvm_host.h |  1 +
  arch/x86/kvm/x86.c  |  1 +
  include/trace/events/kvm.h  | 19 +++
  virt/kvm/kvm_main.c | 48 
 +++--
  13 files changed, 71 insertions(+), 7 deletions(-)

 diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
 index bde494654bcc..6a79314bc1df 100644
 --- a/arch/arm/include/asm/kvm_host.h
 +++ b/arch/arm/include/asm/kvm_host.h
 @@ -148,6 +148,7 @@ struct kvm_vm_stat {
  };

  struct kvm_vcpu_stat {
 +   u32 halt_successful_poll;
 u32 halt_wakeup;
  };

 diff --git a/arch/arm64/include/asm/kvm_host.h 
 b/arch/arm64/include/asm/kvm_host.h
 index 2c49aa4ac818..8efde89613f2 100644
 --- a/arch/arm64/include/asm/kvm_host.h
 +++ b/arch/arm64/include/asm/kvm_host.h
 @@ -165,6 +165,7 @@ struct kvm_vm_stat {
  };

  struct kvm_vcpu_stat {
 +   u32 halt_successful_poll;
 u32 halt_wakeup;
  };

 diff --git a/arch/mips/include/asm/kvm_host.h 
 b/arch/mips/include/asm/kvm_host.h
 index f2c249796ea8..ac4fc716062b 100644
 --- a/arch/mips/include/asm/kvm_host.h
 +++ b/arch/mips/include/asm/kvm_host.h
 @@ -120,6 +120,7 @@ struct kvm_vcpu_stat {
 u32 resvd_inst_exits;
 u32 break_inst_exits;
 u32 flush_dcache_exits;
 +   u32 halt_successful_poll;
 u32 halt_wakeup;
  };

 diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
 index e97b90784031..c9eccf5df912 100644
 --- a/arch/mips/kvm/mips.c
 +++ b/arch/mips/kvm/mips.c
 @@ -49,6 +49,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
 { resvd_inst,   VCPU_STAT(resvd_inst_exits),   KVM_STAT_VCPU },
 { break_inst,   VCPU_STAT(break_inst_exits),   KVM_STAT_VCPU },
 { flush_dcache, VCPU_STAT(flush_dcache_exits), KVM_STAT_VCPU },
 +   { halt_successful_poll, VCPU_STAT(halt_successful_poll), 
 KVM_STAT_VCPU },
 { halt_wakeup,  VCPU_STAT(halt_wakeup),KVM_STAT_VCPU },
 {NULL}
  };
 diff --git a/arch/powerpc/include/asm/kvm_host.h 
 b/arch/powerpc/include/asm/kvm_host.h
 index 7efd666a3fa7..8ef05121d3cd 100644
 --- a/arch/powerpc/include/asm/kvm_host.h
 +++ b/arch/powerpc/include/asm/kvm_host.h
 @@ -107,6 +107,7 @@ struct kvm_vcpu_stat {
 u32 emulated_inst_exits;
 u32 dec_exits;
 u32 ext_intr_exits;
 +   u32 halt_successful_poll;
 u32 halt_wakeup;
 u32 dbell_exits;
 u32 gdbell_exits;
 diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
 index 888bf466d8c6..cfbcdc654201 100644
 --- a/arch/powerpc/kvm/book3s.c
 +++ b/arch/powerpc/kvm/book3s.c
 @@ -52,6 +52,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
 { dec, VCPU_STAT(dec_exits) },
 { ext_intr,VCPU_STAT(ext_intr_exits) },
 { queue_intr,  VCPU_STAT(queue_intr) },
 +   { halt_successful_poll, VCPU_STAT(halt_successful_poll), },
 { halt_wakeup, VCPU_STAT(halt_wakeup) },
 { pf_storage,  VCPU_STAT(pf_storage) },
 { sp_storage,  VCPU_STAT(sp_storage) },
 diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
 index 9b55dec2d6cc..6c1316a15a27 100644
 --- a/arch/powerpc/kvm/booke.c
 +++ b/arch/powerpc/kvm/booke.c
 @@ -62,6 +62,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
 { inst_emu,   VCPU_STAT(emulated_inst_exits) },
 { dec,VCPU_STAT(dec_exits) },
 { ext_intr,   VCPU_STAT(ext_intr_exits) },
 +   { halt_successful_poll, VCPU_STAT(halt_successful_poll) },
 { halt_wakeup, VCPU_STAT(halt_wakeup) },
 { doorbell, VCPU_STAT(dbell_exits) },
 { guest doorbell, VCPU_STAT(gdbell_exits) },
 diff --git a/arch/s390/include/asm/kvm_host.h 
 b/arch/s390/include/asm/kvm_host.h
 index d1ecc7fd0579..f79058e3fd98 100644
 --- a/arch/s390/include/asm/kvm_host.h
 +++ b/arch/s390/include/asm/kvm_host.h
 @@ -196,6 +196,7 @@ struct kvm_vcpu_stat {
 u32 exit_stop_request

Re: [PATCH 1/2] KVM: x86: extract guest running logic from __vcpu_run

2015-02-09 Thread David Matlack
On Fri, Feb 6, 2015 at 4:16 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 Rename the old __vcpu_run to vcpu_run, and extract its main logic
 to a new function.

 The next patch will add a new early-exit condition to __vcpu_run,
 avoid extra indentation.

 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---

Reviewed-by: David Matlack dmatl...@google.com

  arch/x86/kvm/x86.c | 67 
 +-
  1 file changed, 36 insertions(+), 31 deletions(-)

 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index bd7a70be41b3..0b8dd13676ef 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -6165,7 +6165,7 @@ void kvm_arch_mmu_notifier_invalidate_page(struct kvm 
 *kvm,
  }

  /*
 - * Returns 1 to let __vcpu_run() continue the guest execution loop without
 + * Returns 1 to let vcpu_run() continue the guest execution loop without
   * exiting to the userspace.  Otherwise, the value will be returned to the
   * userspace.
   */
 @@ -6383,42 +6383,45 @@ out:
 return r;
  }

 +static inline int __vcpu_run(struct kvm *kvm, struct kvm_vcpu *vcpu)

Kind of confusing function body given the name. Maybe put

   if (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
       !vcpu->arch.apf.halted)
           return vcpu_enter_guest(vcpu);

back in vcpu_run and rename this function? vcpu_wait?
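
Roughly this shape (an untested sketch; vcpu_wait is just a stand-in name):

	static int vcpu_wait(struct kvm *kvm, struct kvm_vcpu *vcpu)
	{
		srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
		kvm_vcpu_block(vcpu);
		vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
		if (!kvm_check_request(KVM_REQ_UNHALT, vcpu))
			return 1;
		/* kvm_apic_accept_events() and the mp_state switch as below */
		return 1;
	}

	static int vcpu_run(struct kvm_vcpu *vcpu)
	{
		/* ... */
		for (;;) {
			if (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
			    !vcpu->arch.apf.halted)
				r = vcpu_enter_guest(vcpu);
			else
				r = vcpu_wait(kvm, vcpu);
			if (r <= 0)
				break;
			/* ... */
		}
		/* ... */
	}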

 +{
 +   if (vcpu-arch.mp_state == KVM_MP_STATE_RUNNABLE 
 +   !vcpu-arch.apf.halted)
 +   return vcpu_enter_guest(vcpu);
 +
 +   srcu_read_unlock(kvm-srcu, vcpu-srcu_idx);
 +   kvm_vcpu_block(vcpu);
 +   vcpu-srcu_idx = srcu_read_lock(kvm-srcu);
 +   if (!kvm_check_request(KVM_REQ_UNHALT, vcpu))
 +   return 1;

 -static int __vcpu_run(struct kvm_vcpu *vcpu)
 +   kvm_apic_accept_events(vcpu);
 +   switch(vcpu-arch.mp_state) {
 +   case KVM_MP_STATE_HALTED:
 +   vcpu-arch.pv.pv_unhalted = false;
 +   vcpu-arch.mp_state =
 +   KVM_MP_STATE_RUNNABLE;
 +   case KVM_MP_STATE_RUNNABLE:
 +   vcpu-arch.apf.halted = false;
 +   break;
 +   case KVM_MP_STATE_INIT_RECEIVED:
 +   break;
 +   default:
 +   return -EINTR;
 +   break;
 +   }
 +   return 1;
 +}
 +
 +static int vcpu_run(struct kvm_vcpu *vcpu)

vcpu_run_loop might be a clearer name.

  {
 int r;
 struct kvm *kvm = vcpu-kvm;

 vcpu-srcu_idx = srcu_read_lock(kvm-srcu);

 -   r = 1;
 -   while (r  0) {
 -   if (vcpu-arch.mp_state == KVM_MP_STATE_RUNNABLE 
 -   !vcpu-arch.apf.halted)
 -   r = vcpu_enter_guest(vcpu);
 -   else {
 -   srcu_read_unlock(kvm-srcu, vcpu-srcu_idx);
 -   kvm_vcpu_block(vcpu);
 -   vcpu-srcu_idx = srcu_read_lock(kvm-srcu);
 -   if (kvm_check_request(KVM_REQ_UNHALT, vcpu)) {
 -   kvm_apic_accept_events(vcpu);
 -   switch(vcpu-arch.mp_state) {
 -   case KVM_MP_STATE_HALTED:
 -   vcpu-arch.pv.pv_unhalted = false;
 -   vcpu-arch.mp_state =
 -   KVM_MP_STATE_RUNNABLE;
 -   case KVM_MP_STATE_RUNNABLE:
 -   vcpu-arch.apf.halted = false;
 -   break;
 -   case KVM_MP_STATE_INIT_RECEIVED:
 -   break;
 -   default:
 -   r = -EINTR;
 -   break;
 -   }
 -   }
 -   }
 -
 +   for (;;) {
 +   r = __vcpu_run(kvm, vcpu);
 if (r = 0)
 break;

 @@ -6430,6 +6433,7 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
 r = -EINTR;
 vcpu-run-exit_reason = KVM_EXIT_INTR;
 ++vcpu-stat.request_irq_exits;
 +   break;
 }

 kvm_check_async_pf_completion(vcpu);
 @@ -6438,6 +6442,7 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
 r = -EINTR;
 vcpu-run-exit_reason = KVM_EXIT_INTR;
 ++vcpu-stat.signal_exits;
 +   break;

The removal of the loop condition and the addition of these breaks
change the loop behavior slightly, but I think it's safe. We'll start
breaking before checking need_resched(), but we're about to return to
userspace anyway so we'll reschedule then.

 }
 if (need_resched()) {
 srcu_read_unlock(kvm-srcu, vcpu-srcu_idx);
 @@ -6569,7

Re: [PATCH RFC] kvm: x86: add halt_poll module parameter

2015-02-05 Thread David Matlack
On Thu, Feb 5, 2015 at 8:05 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 This patch introduces a new module parameter for the KVM module; when it
 is present, KVM attempts a bit of polling on every HLT before scheduling
 itself out via kvm_vcpu_block.

Awesome. I have been working on the same feature in parallel so I have
some suggestions :)


 This parameter helps a lot for latency-bound workloads---in particular
 I tested it with O_DSYNC writes with a battery-backed disk in the host.
 In this case, writes are fast (because the data doesn't have to go all
 the way to the platters) but they cannot be merged by either the host or
 the guest.  KVM's performance here is usually around 30% of bare metal,
 or 50% if you use cache=directsync or cache=writethrough (these
 parameters avoid that the guest sends pointless flush requests, and
 at the same time they are not slow because of the battery-backed cache).
 The bad performance happens because on every halt the host CPU decides
 to halt itself too.  When the interrupt comes, the vCPU thread is then
 migrated to a new physical CPU, and in general the latency is horrible
 because the vCPU thread has to be scheduled back in.

 With this patch performance reaches 60-65% of bare metal and, more
 important, 99% of what you get if you use idle=poll in the guest.  This

I used loopback TCP_RR and loopback memcache as benchmarks for halt
polling. I saw very similar results as you (before: 40% bare metal,
after: 60-65% bare metal and 95% of guest idle=poll).

 means that the tunable gets rid of this particular bottleneck, and more
 work can be done to improve performance in the kernel or QEMU.

 Of course there is some price to pay; every time an otherwise idle vCPUs
 is interrupted by an interrupt, it will poll unnecessarily and thus
 impose a little load on the host.  The above results were obtained with
 a mostly random value of the parameter (200), and the load was around
 1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.

 The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
 that can be used to tune the parameter.  It counts how many HLT
 instructions received an interrupt during the polling period; each
 successful poll avoids that Linux schedules the VCPU thread out and back
 in, and may also avoid a likely trip to C1 and back for the physical CPU.

 While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
 Of these halts, almost all are failed polls.  During the benchmark,
 instead, basically all halts end within the polling period, except a more
 or less constant stream of 50 per second coming from vCPUs that are not
 running the benchmark.  The wasted time is thus very low.  Things may
 be slightly different for Windows VMs, which have a ~10 ms timer tick.

 The effect is also visible on Marcelo's recently-introduced latency
 test for the TSC deadline timer.  Though of course a non-RT kernel has
 awful latency bounds, the latency of the timer is around 8000-1 clock
 cycles compared to 2-12 without setting halt_poll.  For the TSC
 deadline timer, thus, the effect is both a smaller average latency and
 a smaller variance.

 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---

Reviewed-by: David Matlack dmatl...@google.com

  arch/x86/include/asm/kvm_host.h |  1 +
  arch/x86/kvm/x86.c  | 28 
  include/linux/kvm_host.h|  1 +
  virt/kvm/kvm_main.c | 22 +++---
  4 files changed, 41 insertions(+), 11 deletions(-)

 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 848947ac6ade..a236e39cc385 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -655,6 +655,7 @@ struct kvm_vcpu_stat {
 u32 irq_window_exits;
 u32 nmi_window_exits;
 u32 halt_exits;
 +   u32 halt_successful_poll;
 u32 halt_wakeup;
 u32 request_irq_exits;
 u32 irq_exits;
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 1373e04e1f19..b7b20828f01c 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -96,6 +96,9 @@ EXPORT_SYMBOL_GPL(kvm_x86_ops);
  static bool ignore_msrs = 0;
  module_param(ignore_msrs, bool, S_IRUGO | S_IWUSR);

 +unsigned int halt_poll = 0;
 +module_param(halt_poll, uint, S_IRUGO | S_IWUSR);

Suggest encoding the units in the name. halt_poll_cycles in this case.

 +
  unsigned int min_timer_period_us = 500;
  module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);

 @@ -145,6 +148,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
 { irq_window, VCPU_STAT(irq_window_exits) },
 { nmi_window, VCPU_STAT(nmi_window_exits) },
 { halt_exits, VCPU_STAT(halt_exits) },
 +   { halt_successful_poll, VCPU_STAT(halt_successful_poll) },
 { halt_wakeup, VCPU_STAT(halt_wakeup) },
 { hypercalls, VCPU_STAT(hypercalls) },
 { request_irq, VCPU_STAT

Re: [PATCH 12/15] KVM: MTRR: introduce mtrr_for_each_mem_type

2015-06-08 Thread David Matlack
On Sat, May 30, 2015 at 3:59 AM, Xiao Guangrong
guangrong.x...@linux.intel.com wrote:
 It walks all MTRRs and gets all the memory cache type setting for the
 specified range also it checks if the range is fully covered by MTRRs

 Signed-off-by: Xiao Guangrong guangrong.x...@linux.intel.com
 ---
  arch/x86/kvm/mtrr.c | 183 
 
  1 file changed, 183 insertions(+)

 diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
 index e59d138..35f86303 100644
 --- a/arch/x86/kvm/mtrr.c
 +++ b/arch/x86/kvm/mtrr.c
 @@ -395,6 +395,189 @@ void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu)
 INIT_LIST_HEAD(vcpu-arch.mtrr_state.head);
  }

 +struct mtrr_looker {
 +   /* input fields. */
 +   struct kvm_mtrr *mtrr_state;
 +   u64 start;
 +   u64 end;
 +
 +   /* output fields. */
 +   int mem_type;
 +   /* [start, end) is fully covered in MTRRs? */

s/fully/not fully/ ?

 +   bool partial_map;
 +
 +   /* private fields. */
 +   union {
 +   /* used for fixed MTRRs. */
 +   struct {
 +   int index;
 +   int seg;
 +   };
 +
 +   /* used for var MTRRs. */
 +   struct {
 +   struct kvm_mtrr_range *range;
 +   /* max address has been covered in var MTRRs. */
 +   u64 start_max;
 +   };
 +   };
 +
 +   bool fixed;
 +};
 +
 +static void mtrr_lookup_init(struct mtrr_looker *looker,
 +struct kvm_mtrr *mtrr_state, u64 start, u64 end)
 +{
 +   looker-mtrr_state = mtrr_state;
 +   looker-start = start;
 +   looker-end = end;
 +}
 +
 +static u64 fixed_mtrr_range_end_addr(int seg, int index)
 +{
 +   struct fixed_mtrr_segment *mtrr_seg = fixed_seg_table[seg];
 +
 +return mtrr_seg-start + mtrr_seg-range_size * index;

Should be (index + 1)?
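
i.e. (a sketch of the helper with that fix applied):

	static u64 fixed_mtrr_range_end_addr(int seg, int index)
	{
		struct fixed_mtrr_segment *mtrr_seg = &fixed_seg_table[seg];

		/* exclusive end of range 'index', not its start */
		return mtrr_seg->start + mtrr_seg->range_size * (index + 1);
	}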

 +}
 +
 +static bool mtrr_lookup_fixed_start(struct mtrr_looker *looker)
 +{
 +   int seg, index;
 +
 +   if (!looker-mtrr_state-fixed_mtrr_enabled)
 +   return false;
 +
 +   seg = fixed_mtrr_addr_to_seg(looker-start);
 +   if (seg  0)
 +   return false;
 +
 +   looker-fixed = true;
 +   index = fixed_mtrr_addr_seg_to_range_index(looker-start, seg);
 +   looker-index = index;
 +   looker-seg = seg;
 +   looker-mem_type = looker-mtrr_state-fixed_ranges[index];
 +   looker-start = fixed_mtrr_range_end_addr(seg, index);
 +   return true;
 +}
 +
 +static bool match_var_range(struct mtrr_looker *looker,
 +   struct kvm_mtrr_range *range)
 +{
 +   u64 start, end;
 +
 +   var_mtrr_range(range, start, end);
 +   if (!(start = looker-end || end = looker-start)) {
 +   looker-range = range;
 +   looker-mem_type = range-base  0xff;
 +
 +   /*
 +* the function is called when we do kvm_mtrr.head walking
 +* that means range has the minimum base address interleaves
 +* with [looker-start_max, looker-end).
 +*/

I'm having trouble understanding this comment. I think this is what you
are trying to say:

  this function is called when we do kvm_mtrr.head walking. range has the
  minimum base address which interleaves [looker->start_max, looker->end).

Let me know if I parsed it wrong.

 +   looker-partial_map |= looker-start_max  start;
 +
 +   /* update the max address has been covered. */
 +   looker-start_max = max(looker-start_max, end);
 +   return true;
 +   }
 +
 +   return false;
 +}
 +
 +static void mtrr_lookup_var_start(struct mtrr_looker *looker)
 +{
 +   struct kvm_mtrr *mtrr_state = looker-mtrr_state;
 +   struct kvm_mtrr_range *range;
 +
 +   looker-fixed = false;
 +   looker-partial_map = false;
 +   looker-start_max = looker-start;
 +   looker-mem_type = -1;
 +
 +   list_for_each_entry(range, mtrr_state-head, node)
 +   if (match_var_range(looker, range))
 +   return;
 +
 +   looker-partial_map = true;
 +}
 +
 +static void mtrr_lookup_fixed_next(struct mtrr_looker *looker)
 +{
 +   struct fixed_mtrr_segment *eseg = fixed_seg_table[looker-seg];
 +   struct kvm_mtrr *mtrr_state = looker-mtrr_state;
 +   u64 end;
 +
 +   if (looker-start = looker-end) {
 +   looker-mem_type = -1;
 +   looker-partial_map = false;
 +   return;
 +   }
 +
 +   WARN_ON(!looker-fixed);
 +
 +   looker-index++;
 +   end = fixed_mtrr_range_end_addr(looker-seg, looker-index);
 +
 +   /* switch to next segment. */
 +   if (end = eseg-end) {
 +   looker-seg++;
 +   looker-index = 0;
 +
 +   /* have looked up for all fixed MTRRs. */
 +   if (looker-seg = 

Re: [PATCH 05/15] KVM: MTRR: clean up mtrr default type

2015-06-08 Thread David Matlack
On Sat, May 30, 2015 at 3:59 AM, Xiao Guangrong
guangrong.x...@linux.intel.com wrote:
 Use union definition to avoid the decode/code workload and drop all the
 hard code

Thank you for doing this cleanup. The new code is much clearer!


 Signed-off-by: Xiao Guangrong guangrong.x...@linux.intel.com
 ---
  arch/x86/include/asm/kvm_host.h | 12 ++--
  arch/x86/kvm/mtrr.c | 19 ---
  2 files changed, 18 insertions(+), 13 deletions(-)

 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 2c3c52d..95ce2ff 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -347,8 +347,16 @@ enum {
  struct kvm_mtrr {
 struct mtrr_var_range var_ranges[KVM_NR_VAR_MTRR];
 mtrr_type fixed_ranges[KVM_NR_FIXED_MTRR_REGION];
 -   unsigned char enabled;
 -   mtrr_type def_type;
 +
 +   union {
 +   u64 deftype;
 +   struct {
 +   unsigned def_type:8;
 +   unsigned reserved:2;
 +   unsigned fixed_mtrr_enabled:1;
 +   unsigned mtrr_enabled:1;
 +   };
 +   };
  };

  struct kvm_vcpu_arch {
 diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
 index 562341b..6de49dd 100644
 --- a/arch/x86/kvm/mtrr.c
 +++ b/arch/x86/kvm/mtrr.c
 @@ -105,7 +105,6 @@ EXPORT_SYMBOL_GPL(kvm_mtrr_valid);
  static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
  {
 struct kvm_mtrr *mtrr_state = vcpu-arch.mtrr_state;
 -   unsigned char mtrr_enabled = mtrr_state-enabled;
 gfn_t start, end, mask;
 int index;
 bool is_fixed = true;
 @@ -114,7 +113,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
   !kvm_arch_has_noncoherent_dma(vcpu-kvm))
 return;

 -   if (!(mtrr_enabled  0x2)  msr != MSR_MTRRdefType)
 +   if (!mtrr_state-mtrr_enabled  msr != MSR_MTRRdefType)
 return;

 switch (msr) {
 @@ -153,7 +152,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
 end = ((start  mask) | ~mask) + 1;
 }

 -   if (is_fixed  !(mtrr_enabled  0x1))
 +   if (is_fixed  !mtrr_state-fixed_mtrr_enabled)
 return;

 kvm_zap_gfn_range(vcpu-kvm, gpa_to_gfn(start), gpa_to_gfn(end));
 @@ -166,10 +165,9 @@ int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 
 data)
 if (!kvm_mtrr_valid(vcpu, msr, data))
 return 1;

 -   if (msr == MSR_MTRRdefType) {
 -   vcpu-arch.mtrr_state.def_type = data;
 -   vcpu-arch.mtrr_state.enabled = (data  0xc00)  10;
 -   } else if (msr == MSR_MTRRfix64K_0)
 +   if (msr == MSR_MTRRdefType)
 +   vcpu-arch.mtrr_state.deftype = data;
 +   else if (msr == MSR_MTRRfix64K_0)
 p[0] = data;
 else if (msr == MSR_MTRRfix16K_8 || msr == MSR_MTRRfix16K_A)
 p[1 + msr - MSR_MTRRfix16K_8] = data;
 @@ -216,8 +214,7 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 
 *pdata)
 return 1;

 if (msr == MSR_MTRRdefType)
 -   *pdata = vcpu-arch.mtrr_state.def_type +
 -(vcpu-arch.mtrr_state.enabled  10);
 +   *pdata = vcpu-arch.mtrr_state.deftype;
 else if (msr == MSR_MTRRfix64K_0)
 *pdata = p[0];
 else if (msr == MSR_MTRRfix16K_8 || msr == MSR_MTRRfix16K_A)
 @@ -256,14 +253,14 @@ static int get_mtrr_type(struct kvm_mtrr *mtrr_state,
 int i, num_var_ranges = KVM_NR_VAR_MTRR;

 /* MTRR is completely disabled, use UC for all of physical memory. */
 -   if (!(mtrr_state-enabled  0x2))
 +   if (!mtrr_state-mtrr_enabled)
 return MTRR_TYPE_UNCACHABLE;

 /* Make end inclusive end, instead of exclusive */
 end--;

 /* Look in fixed ranges. Just return the type as per start */
 -   if ((mtrr_state-enabled  0x1)  (start  0x10)) {
 +   if (mtrr_state-fixed_mtrr_enabled  (start  0x10)) {
 int idx;

 if (start  0x8) {
 --
 2.1.0



Re: [PATCH 09/15] KVM: MTRR: introduce var_mtrr_range

2015-06-08 Thread David Matlack
On Sat, May 30, 2015 at 3:59 AM, Xiao Guangrong
guangrong.x...@linux.intel.com wrote:
 It gets the range for the specified variable MTRR

 Signed-off-by: Xiao Guangrong guangrong.x...@linux.intel.com
 ---
  arch/x86/kvm/mtrr.c | 19 +--
  1 file changed, 13 insertions(+), 6 deletions(-)

 diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
 index 888441e..aeb9767 100644
 --- a/arch/x86/kvm/mtrr.c
 +++ b/arch/x86/kvm/mtrr.c
 @@ -217,10 +217,21 @@ static int fixed_msr_to_range_index(u32 msr)
 return 0;
  }

 +static void var_mtrr_range(struct kvm_mtrr_range *range, u64 *start, u64 
 *end)
 +{
 +   u64 mask;
 +
 +   *start = range-base  PAGE_MASK;
 +
 +   mask = range-mask  PAGE_MASK;
 +   mask |= ~0ULL  boot_cpu_data.x86_phys_bits;
 +   *end = ((*start  mask) | ~mask) + 1;
 +}
 +
  static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
  {
 struct kvm_mtrr *mtrr_state = vcpu-arch.mtrr_state;
 -   gfn_t start, end, mask;
 +   gfn_t start, end;
 int index;

 if (msr == MSR_IA32_CR_PAT || !tdp_enabled ||
 @@ -244,11 +255,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
 default:
 /* variable range MTRRs. */
 index = (msr - 0x200) / 2;
 -   start = mtrr_state-var_ranges[index].base  PAGE_MASK;
 -   mask = mtrr_state-var_ranges[index].mask  PAGE_MASK;
 -   mask |= ~0ULL  cpuid_maxphyaddr(vcpu);

Why did you drop this in favor of boot_cpu_data.x86_phys_bits?

 -
 -   end = ((start  mask) | ~mask) + 1;
 +   var_mtrr_range(mtrr_state-var_ranges[index], start, end);
 }

  do_zap:
 --
 2.1.0



Re: KVM slow LAMP guest

2015-06-18 Thread David Matlack
On Thu, Jun 18, 2015 at 1:25 AM, Hansa for...@logic-q.nl wrote:
 Hi,

 I have a LAMP server as guest in KVM. Whenever the server is idle for some
 time it takes about 30 seconds to load a Wordpress site.
 If the server is not idle the site shows up in max 5 seconds. I've already
 turned off power management in the guest by passing

 GRUB_CMDLINE_LINUX_DEFAULT=apm=off

 in /etc/default/grub. This has no effect.
 Does KVM do some power management on guests? If so, how do I turn this off
 for my LAMP guest?

KVM doesn't do any power management of guests. But if everything is idle on
the host (including your guest), then host power management could kick in.
Have you tried playing with host pm?

Could you try running your workload with the guest kernel parameter idle=poll
and let me know the performance?

Also, if you are running Linux 4.0 or later on the host, could you try running
your workload with the KVM module parameter halt_poll_ns=50?


 Best, Hansa




Re: [PATCH 1/3] KVM: make halt_poll_ns per-VCPU

2015-08-24 Thread David Matlack
On Mon, Aug 24, 2015 at 5:53 AM, Wanpeng Li wanpeng...@hotmail.com wrote:
 Change halt_poll_ns into per-VCPU variable, seeded from module parameter,
 to allow greater flexibility.

You should also change kvm_vcpu_block to read halt_poll_ns from
the vcpu instead of the module parameter.
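
Something along these lines in kvm_vcpu_block() (a sketch, not a complete
patch):

	if (vcpu->halt_poll_ns) {	/* was: if (halt_poll_ns) */
		ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);

		/* ... existing polling loop, unchanged ... */
	}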


 Signed-off-by: Wanpeng Li wanpeng...@hotmail.com
 ---
  include/linux/kvm_host.h | 1 +
  virt/kvm/kvm_main.c  | 1 +
  2 files changed, 2 insertions(+)

 diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
 index 81089cf..1bef9e2 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -242,6 +242,7 @@ struct kvm_vcpu {
 int sigset_active;
 sigset_t sigset;
 struct kvm_vcpu_stat stat;
 +   unsigned int halt_poll_ns;

  #ifdef CONFIG_HAS_IOMEM
 int mmio_needed;
 diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
 index d8db2f8f..a122b52 100644
 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -217,6 +217,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, 
 unsigned id)
 vcpu-kvm = kvm;
 vcpu-vcpu_id = id;
 vcpu-pid = NULL;
 +   vcpu-halt_poll_ns = halt_poll_ns;
 init_waitqueue_head(vcpu-wq);
 kvm_async_pf_vcpu_init(vcpu);

 --
 1.9.1



Re: [PATCH 2/3] KVM: dynamise halt_poll_ns adjustment

2015-08-24 Thread David Matlack
On Mon, Aug 24, 2015 at 5:53 AM, Wanpeng Li wanpeng...@hotmail.com wrote:
 There are two new kernel parameters for changing the halt_poll_ns:
 halt_poll_ns_grow and halt_poll_ns_shrink. halt_poll_ns_grow affects
 halt_poll_ns when an interrupt arrives and halt_poll_ns_shrink
 does it when idle VCPU is detected.

   halt_poll_ns_shrink/ |                      |
   halt_poll_ns_grow    | interrupt arrives    | idle VCPU is detected
   ---------------------+----------------------+------------------------
   < 1                  |  = halt_poll_ns      |  = 0
   < halt_poll_ns       | *= halt_poll_ns_grow | /= halt_poll_ns_shrink
   otherwise            | += halt_poll_ns_grow | -= halt_poll_ns_shrink

 A third new parameter, halt_poll_ns_max, controls the maximal halt_poll_ns;
 it is internally rounded down to a closest multiple of halt_poll_ns_grow.

I like the idea of growing and shrinking halt_poll_ns, but I'm not sure
we grow and shrink in the right places here. For example, if vcpu->halt_poll_ns
gets down to 0, I don't see how it can then grow back up.

This might work better:
  if (poll successfully for interrupt): stay the same
  else if (length of kvm_vcpu_block is longer than halt_poll_ns_max): shrink
  else if (length of kvm_vcpu_block is less than halt_poll_ns_max): grow

where halt_poll_ns_max is something reasonable, like 2 millisecond.

You get diminishing returns from halt polling as the length of the
halt gets longer (halt polling only reduces halt latency by 10-15 us).
So there's little benefit to polling longer than a few milliseconds.
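
A rough sketch of that policy (constant and helper names are placeholders,
not tested code):

	#define KVM_HALT_POLL_NS_CAP	2000000		/* ~2 ms */

	static void adjust_halt_poll_ns(struct kvm_vcpu *vcpu, bool poll_hit,
					u64 block_ns)
	{
		if (poll_hit)
			return;			/* poll caught the wakeup: keep halt_poll_ns */

		if (block_ns > KVM_HALT_POLL_NS_CAP)
			shrink_halt_poll_ns(vcpu);	/* long halt: polling was wasted        */
		else
			grow_halt_poll_ns(vcpu);	/* barely missed: poll longer next time */
	}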


 Signed-off-by: Wanpeng Li wanpeng...@hotmail.com
 ---
  virt/kvm/kvm_main.c | 81 
 -
  1 file changed, 80 insertions(+), 1 deletion(-)

 diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
 index a122b52..bcfbd35 100644
 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -66,9 +66,28 @@
  MODULE_AUTHOR(Qumranet);
  MODULE_LICENSE(GPL);

 -static unsigned int halt_poll_ns;
 +#define KVM_HALT_POLL_NS  50
 +#define KVM_HALT_POLL_NS_GROW   2
 +#define KVM_HALT_POLL_NS_SHRINK 0
 +#define KVM_HALT_POLL_NS_MAX \
 +   INT_MAX / KVM_HALT_POLL_NS_GROW
 +
 +static unsigned int halt_poll_ns = KVM_HALT_POLL_NS;
  module_param(halt_poll_ns, uint, S_IRUGO | S_IWUSR);

 +/* Default doubles per-vcpu halt_poll_ns. */
 +static int halt_poll_ns_grow = KVM_HALT_POLL_NS_GROW;
 +module_param(halt_poll_ns_grow, int, S_IRUGO);
 +
 +/* Default resets per-vcpu halt_poll_ns . */
 +int halt_poll_ns_shrink = KVM_HALT_POLL_NS_SHRINK;
 +module_param(halt_poll_ns_shrink, int, S_IRUGO);
 +
 +/* Default is to compute the maximum so we can never overflow. */
 +unsigned int halt_poll_ns_actual_max = KVM_HALT_POLL_NS_MAX;
 +unsigned int halt_poll_ns_max = KVM_HALT_POLL_NS_MAX;
 +module_param(halt_poll_ns_max, int, S_IRUGO);
 +
  /*
   * Ordering of locks:
   *
 @@ -1907,6 +1926,62 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, 
 gfn_t gfn)
  }
  EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);

 +static unsigned int __grow_halt_poll_ns(unsigned int val)
 +{
 +   if (halt_poll_ns_grow  1)
 +   return halt_poll_ns;
 +
 +   val = min(val, halt_poll_ns_actual_max);
 +
 +   if (val == 0)
 +   return halt_poll_ns;
 +
 +   if (halt_poll_ns_grow  halt_poll_ns)
 +   val *= halt_poll_ns_grow;
 +   else
 +   val += halt_poll_ns_grow;
 +
 +   return val;
 +}
 +
 +static unsigned int __shrink_halt_poll_ns(int val, int modifier, int minimum)
 +{
 +   if (modifier  1)
 +   return 0;
 +
 +   if (modifier  halt_poll_ns)
 +   val /= modifier;
 +   else
 +   val -= modifier;
 +
 +   return val;
 +}
 +
 +static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
 +{
 +   vcpu-halt_poll_ns = __grow_halt_poll_ns(vcpu-halt_poll_ns);
 +}
 +
 +static void shrink_halt_poll_ns(struct kvm_vcpu *vcpu)
 +{
 +   vcpu-halt_poll_ns = __shrink_halt_poll_ns(vcpu-halt_poll_ns,
 +   halt_poll_ns_shrink, halt_poll_ns);
 +}
 +
 +/*
 + * halt_poll_ns_actual_max is computed to be one grow_halt_poll_ns() below
 + * halt_poll_ns_max. (See __grow_halt_poll_ns for the reason.)
 + * This prevents overflows, because ple_halt_poll_ns is int.
 + * halt_poll_ns_max effectively rounded down to a multiple of 
 halt_poll_ns_grow in
 + * this process.
 + */
 +static void update_halt_poll_ns_actual_max(void)
 +{
 +   halt_poll_ns_actual_max =
 +   __shrink_halt_poll_ns(max(halt_poll_ns_max, halt_poll_ns),
 +   halt_poll_ns_grow, INT_MIN);
 +}
 +
  static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
  {
 if (kvm_arch_vcpu_runnable(vcpu)) {
 @@ -1941,6 +2016,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
  */
 if (kvm_vcpu_check_block(vcpu)  0) {
 ++vcpu-stat.halt_successful_poll;
 +   

Re: KVM slow LAMP guest

2015-07-14 Thread David Matlack
(adding back the kvm mailing list)

On Tue, Jul 14, 2015 at 1:12 AM, C. Bröcker
c.broc...@intercollect.com wrote:
 On 14-7-2015 10:04, Hansa wrote:

 On 13-7-2015 20:57, David Matlack wrote:

 On Thu, Jun 18, 2015 at 10:26 AM, David Matlack dmatl...@google.com
 wrote:

 On Thu, Jun 18, 2015 at 1:25 AM, Hansa for...@logic-q.nl wrote:

 Hi,

 I have a LAMP server as guest in KVM. Whenever the server is idle for
 some
 time it takes about 30 seconds to load a Wordpress site.
 If the server is not idle the site shows up in max 5 seconds. I've
 already
 turned off power management in the guest by passing

  GRUB_CMDLINE_LINUX_DEFAULT=apm=off

 in /etc/default/grub. This has no effect.
 Does KVM do some power management on guests? If so, how do I turn this
 off
 for my LAMP guest?

 KVM doesn't do any power management of guests. But if everything is idle
 on
 the host (including your guest), then host power management could kick
 in.
 Have you tried playing with host pm?

 Could you try running your workload with the guest kernel parameter
 idle=poll
 and let me know the performance?

 Also, if you are running Linux 4.0 or later on the host, could you try
 running
 your workload with the KVM module parameter halt_poll_ns=50?

 Hansa-- Did you ever get a chance to run your workload with these
 changes?

 Hi David,

 Sorry I didn't respond to your mail earlier and nice of you to check back.
 To be honest no I haven't. To much work overload on my side. I've just added
 the idle=poll to the guests' grub and rebooted the server. I'll let it run a
 few day's to check if it helps. Btw: I'm running 2.6 linux kernel here so
 I'm not sure if I can add the halt_poll_ns param to KVM.

 Best, Hansa

 Wow! That makes a real difference!
 The site was never that fast! Although CPU load is constantly at 100%, any
 chance I can tweak this param? And is there a similar param for Windows
 guests?

Since idle=poll gave a noticeable improvement, your workload is a good
candidate for halt_poll_ns. This is a host KVM module parameter. It will
give you a similar speedup, it will work independent of your guest os,
and won't peg your cpu usage at 100% during periods of no activity. If
you don't have halt_poll_ns available (your host kernel is older than 4.0),
is upgrading to a newer kernel or backporting the patch [1] an option?

I'm not sure of any more guest params you can tweak (linux or windows), but
I'll let you know if I come up with anything.

[1] 
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f7819512996361280b86259222456fcf15aad926


Re: [PATCH v2 2/3] KVM: dynamic halt_poll_ns adjustment

2015-08-25 Thread David Matlack
Thanks for writing v2, Wanpeng.

On Mon, Aug 24, 2015 at 11:35 PM, Wanpeng Li wanpeng...@hotmail.com wrote:
 There is a downside of halt_poll_ns since polling still happens for idle
 VCPUs, which can waste cpu usage. This patch adds the ability to adjust
 halt_poll_ns dynamically.

What testing have you done with these patches? Do you know if this removes
the overhead of polling in idle VCPUs? Do we lose any of the performance
from always polling?


 There are two new kernel parameters for changing the halt_poll_ns:
 halt_poll_ns_grow and halt_poll_ns_shrink. A third new parameter,
 halt_poll_ns_max, controls the maximal halt_poll_ns; it is internally
 rounded down to a closest multiple of halt_poll_ns_grow. The shrink/grow
 matrix is suggested by David:

 if (poll successfully for interrupt): stay the same
   else if (length of kvm_vcpu_block is longer than halt_poll_ns_max): shrink
   else if (length of kvm_vcpu_block is less than halt_poll_ns_max): grow

The way you implemented this wasn't what I expected. I thought you would time
the whole function (kvm_vcpu_block). But I like your approach better. It's
simpler and [by inspection] does what we want.


    halt_poll_ns_shrink/ |
    halt_poll_ns_grow    | grow halt_poll_ns    | shrink halt_poll_ns
    ---------------------+----------------------+------------------------
    < 1                  |  = halt_poll_ns      |  = 0
    < halt_poll_ns       | *= halt_poll_ns_grow | /= halt_poll_ns_shrink
    otherwise            | += halt_poll_ns_grow | -= halt_poll_ns_shrink

I was curious why you went with this approach rather than just the
middle row, or just the last row. Do you think we'll want the extra
flexibility?


 Signed-off-by: Wanpeng Li wanpeng...@hotmail.com
 ---
  virt/kvm/kvm_main.c | 65 
 -
  1 file changed, 64 insertions(+), 1 deletion(-)

 diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
 index 93db833..2a4962b 100644
 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -66,9 +66,26 @@
  MODULE_AUTHOR("Qumranet");
  MODULE_LICENSE("GPL");

 -static unsigned int halt_poll_ns;
 +#define KVM_HALT_POLL_NS  500000
 +#define KVM_HALT_POLL_NS_GROW   2
 +#define KVM_HALT_POLL_NS_SHRINK 0
 +#define KVM_HALT_POLL_NS_MAX 2000000

The macros are not necessary. Also, hard coding the numbers in the param
definitions will make reading the comments above them easier.
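
To illustrate, with the macros dropped the grow/shrink parameters would just
read (values taken from the patch above):

    /* Default doubles per-vcpu halt_poll_ns. */
    static unsigned int halt_poll_ns_grow = 2;
    module_param(halt_poll_ns_grow, int, S_IRUGO);

    /* Default resets per-vcpu halt_poll_ns. */
    static unsigned int halt_poll_ns_shrink;
    module_param(halt_poll_ns_shrink, int, S_IRUGO);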

 +
 +static unsigned int halt_poll_ns = KVM_HALT_POLL_NS;
  module_param(halt_poll_ns, uint, S_IRUGO | S_IWUSR);

 +/* Default doubles per-vcpu halt_poll_ns. */
 +static unsigned int halt_poll_ns_grow = KVM_HALT_POLL_NS_GROW;
 +module_param(halt_poll_ns_grow, int, S_IRUGO);
 +
 +/* Default resets per-vcpu halt_poll_ns . */
 +static unsigned int halt_poll_ns_shrink = KVM_HALT_POLL_NS_SHRINK;
 +module_param(halt_poll_ns_shrink, int, S_IRUGO);
 +
 +/* halt polling only reduces halt latency by 10-15 us, 2ms is enough */

Ah, I misspoke before. I was thinking about round-trip latency. The latency
of a single halt is reduced by about 5-7 us.

 +static unsigned int halt_poll_ns_max = KVM_HALT_POLL_NS_MAX;
 +module_param(halt_poll_ns_max, int, S_IRUGO);

We can remove halt_poll_ns_max. vcpu->halt_poll_ns can always start at zero
and grow from there. Then we just need one module param to keep
vcpu->halt_poll_ns from growing too large.

[ It would make more sense to remove halt_poll_ns and keep halt_poll_ns_max,
  but since halt_poll_ns already exists in upstream kernels, we probably can't
  remove it. ]
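
Something like this (rough sketch, not tested; halt_poll_ns itself acts as the
cap, and the 10 us base is just an example value):

    static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
    {
            unsigned int val = vcpu->halt_poll_ns;

            if (val == 0)
                    val = 10000;            /* e.g. start polling at 10 us */
            else
                    val *= halt_poll_ns_grow;

            vcpu->halt_poll_ns = min(val, halt_poll_ns);
    }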

 +
  /*
   * Ordering of locks:
   *
 @@ -1907,6 +1924,48 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, 
 gfn_t gfn)
  }
  EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);

 +static unsigned int __grow_halt_poll_ns(unsigned int val)
 +{
 +   if (halt_poll_ns_grow < 1)
 +   return halt_poll_ns;
 +
 +   val = min(val, halt_poll_ns_max);
 +
 +   if (val == 0)
 +   return halt_poll_ns;
 +
 +   if (halt_poll_ns_grow < halt_poll_ns)
 +   val *= halt_poll_ns_grow;
 +   else
 +   val += halt_poll_ns_grow;
 +
 +   return val;
 +}
 +
 +static unsigned int __shrink_halt_poll_ns(int val, int modifier, int minimum)

minimum never gets used.

 +{
 +   if (modifier < 1)
 +   return 0;
 +
 +   if (modifier < halt_poll_ns)
 +   val /= modifier;
 +   else
 +   val -= modifier;
 +
 +   return val;
 +}
 +
 +static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)

These wrappers aren't necessary.

 +{
 +   vcpu->halt_poll_ns = __grow_halt_poll_ns(vcpu->halt_poll_ns);
 +}
 +
 +static void shrink_halt_poll_ns(struct kvm_vcpu *vcpu)
 +{
 +   vcpu->halt_poll_ns = __shrink_halt_poll_ns(vcpu->halt_poll_ns,
 +   halt_poll_ns_shrink, halt_poll_ns);
 +}
 +
  static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
  {
 if (kvm_arch_vcpu_runnable(vcpu)) {
 @@ -1954,6 +2013,10 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
  

Re: [PATCH] KVM: x86: move steal time initialization to vcpu entry time

2015-10-15 Thread David Matlack
On Wed, Oct 14, 2015 at 3:33 PM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
>
> As reported at https://bugs.launchpad.net/qemu/+bug/1494350,
> it is possible to have vcpu->arch.st.last_steal initialized
> from a thread other than vcpu thread, say the iothread, via
> KVM_SET_MSRS.
>
> Which can cause an overflow later (when subtracting from vcpu threads
> sched_info.run_delay).
>
> To avoid that, move steal time accumulation to vcpu entry time,
> before copying steal time data to guest.
>
> Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>

Reviewed-by: David Matlack <dmatl...@google.com>

>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 8f0f6ec..0e0332e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2030,6 +2030,8 @@ static void accumulate_steal_time(struct kvm_vcpu *vcpu)
>
>  static void record_steal_time(struct kvm_vcpu *vcpu)
>  {
> +   accumulate_steal_time(vcpu);
> +
> if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
> return;
>
> @@ -2182,12 +2184,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
> if (!(data & KVM_MSR_ENABLED))
> break;
>
> -   vcpu->arch.st.last_steal = current->sched_info.run_delay;
> -
> -   preempt_disable();
> -   accumulate_steal_time(vcpu);
> -   preempt_enable();
> -
> kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
>
> break;
> @@ -2830,7 +2826,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> vcpu->cpu = cpu;
> }
>
> -   accumulate_steal_time(vcpu);
> kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
>  }
>


Re: [PATCH v9 17/18] KVM: Update Posted-Interrupts Descriptor when vCPU is blocked

2015-10-14 Thread David Matlack
Hi Feng.

On Fri, Sep 18, 2015 at 7:29 AM, Feng Wu  wrote:
> This patch updates the Posted-Interrupts Descriptor when vCPU
> is blocked.
>
> pre-block:
> - Add the vCPU to the blocked per-CPU list
> - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
>
> post-block:
> - Remove the vCPU from the per-CPU list

I'm wondering what happens if a posted interrupt arrives at the IOMMU
after pre-block and before post-block.

In pre_block, NV is set to POSTED_INTR_WAKEUP_VECTOR. IIUC, this means
future posted interrupts will not trigger "Posted-Interrupt Processing"
(PIR will not get copied to VIRR). Instead, the IOMMU will do ON := 1,
PIR |= (1 << vector), and send POSTED_INTR_WAKEUP_VECTOR. PIWV calls
wakeup_handler which does kvm_vcpu_kick. kvm_vcpu_kick does a wait-queue
wakeup and possibly a scheduler ipi.

But the VCPU is sitting in kvm_vcpu_block. It spins and/or schedules
(wait queue) until it has a reason to wake up. I couldn't find a code
path from kvm_vcpu_block that lead to checking ON or PIR. How does the
blocked VCPU "receive" the posted interrupt? (And when does Posted-
Interrupt Processing get triggered?)

Thanks!

>
> Signed-off-by: Feng Wu 
> ---
> v9:
> - Add description for blocked_vcpu_on_cpu_lock in 
> Documentation/virtual/kvm/locking.txt
> - Check !kvm_arch_has_assigned_device(vcpu->kvm) first, then
>   !irq_remapping_cap(IRQ_POSTING_CAP)
>
> v8:
> - Rename 'pi_pre_block' to 'pre_block'
> - Rename 'pi_post_block' to 'post_block'
> - Change some comments
> - Only add the vCPU to the blocking list when the VM has assigned devices.
>
>  Documentation/virtual/kvm/locking.txt |  12 +++
>  arch/x86/include/asm/kvm_host.h   |  13 +++
>  arch/x86/kvm/vmx.c| 153 
> ++
>  arch/x86/kvm/x86.c|  53 +---
>  include/linux/kvm_host.h  |   3 +
>  virt/kvm/kvm_main.c   |   3 +
>  6 files changed, 227 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/locking.txt 
> b/Documentation/virtual/kvm/locking.txt
> index d68af4d..19f94a6 100644
> --- a/Documentation/virtual/kvm/locking.txt
> +++ b/Documentation/virtual/kvm/locking.txt
> @@ -166,3 +166,15 @@ Comment:   The srcu read lock must be held while 
> accessing memslots (e.g.
> MMIO/PIO address->device structure mapping (kvm->buses).
> The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
> if it is needed by multiple functions.
> +
> +Name:  blocked_vcpu_on_cpu_lock
> +Type:  spinlock_t
> +Arch:  x86
> +Protects:  blocked_vcpu_on_cpu
> +Comment:   This is a per-CPU lock and it is used for VT-d 
> posted-interrupts.
> +   When VT-d posted-interrupts is supported and the VM has 
> assigned
> +   devices, we put the blocked vCPU on the list 
> blocked_vcpu_on_cpu
> +   protected by blocked_vcpu_on_cpu_lock, when VT-d hardware 
> issues
> +   wakeup notification event since external interrupts from the
> +   assigned devices happens, we will find the vCPU on the list to
> +   wakeup.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 0ddd353..304fbb5 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -552,6 +552,8 @@ struct kvm_vcpu_arch {
>  */
> bool write_fault_to_shadow_pgtable;
>
> +   bool halted;
> +
> /* set at EPT violation at this point */
> unsigned long exit_qualification;
>
> @@ -864,6 +866,17 @@ struct kvm_x86_ops {
> /* pmu operations of sub-arch */
> const struct kvm_pmu_ops *pmu_ops;
>
> +   /*
> +* Architecture specific hooks for vCPU blocking due to
> +* HLT instruction.
> +* Returns for .pre_block():
> +*- 0 means continue to block the vCPU.
> +*- 1 means we cannot block the vCPU since some event
> +*happens during this period, such as, 'ON' bit in
> +*posted-interrupts descriptor is set.
> +*/
> +   int (*pre_block)(struct kvm_vcpu *vcpu);
> +   void (*post_block)(struct kvm_vcpu *vcpu);
> int (*update_pi_irte)(struct kvm *kvm, unsigned int host_irq,
>   uint32_t guest_irq, bool set);
>  };
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 902a67d..9968896 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -879,6 +879,13 @@ static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
>  static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
>  static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
>
> +/*
> + * We maintain a per-CPU linked list of vCPUs, so in wakeup_handler() we
> + * can find which vCPU should be woken up.
> + */
> +static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
> +static DEFINE_PER_CPU(spinlock_t, 

Re: [PATCH v9 17/18] KVM: Update Posted-Interrupts Descriptor when vCPU is blocked

2015-10-15 Thread David Matlack
On Wed, Oct 14, 2015 at 6:33 PM, Wu, Feng <feng...@intel.com> wrote:
>
>> -Original Message-
>> From: David Matlack [mailto:dmatl...@google.com]
>> Sent: Thursday, October 15, 2015 7:41 AM
>> To: Wu, Feng <feng...@intel.com>
>> Cc: Paolo Bonzini <pbonz...@redhat.com>; alex.william...@redhat.com; Joerg
>> Roedel <j...@8bytes.org>; Marcelo Tosatti <mtosa...@redhat.com>;
>> eric.au...@linaro.org; kvm list <kvm@vger.kernel.org>; iommu@lists.linux-
>> foundation.org; linux-ker...@vger.kernel.org
>> Subject: Re: [PATCH v9 17/18] KVM: Update Posted-Interrupts Descriptor when
>> vCPU is blocked
>>
>> Hi Feng.
>>
>> On Fri, Sep 18, 2015 at 7:29 AM, Feng Wu <feng...@intel.com> wrote:
>> > This patch updates the Posted-Interrupts Descriptor when vCPU
>> > is blocked.
>> >
>> > pre-block:
>> > - Add the vCPU to the blocked per-CPU list
>> > - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
>> >
>> > post-block:
>> > - Remove the vCPU from the per-CPU list
>>
>> I'm wondering what happens if a posted interrupt arrives at the IOMMU
>> after pre-block and before post-block.
>>
>> In pre_block, NV is set to POSTED_INTR_WAKEUP_VECTOR. IIUC, this means
>> future posted interrupts will not trigger "Posted-Interrupt Processing"
>> (PIR will not get copied to VIRR). Instead, the IOMMU will do ON := 1,
>> PIR |= (1 << vector), and send POSTED_INTR_WAKEUP_VECTOR. PIWV calls
>> wakeup_handler which does kvm_vcpu_kick. kvm_vcpu_kick does a wait-queue
>> wakeup and possibly a scheduler ipi.
>>
>> But the VCPU is sitting in kvm_vcpu_block. It spins and/or schedules
>> (wait queue) until it has a reason to wake up. I couldn't find a code
>> path from kvm_vcpu_block that lead to checking ON or PIR. How does the
>> blocked VCPU "receive" the posted interrupt? (And when does Posted-
>> Interrupt Processing get triggered?)
>
> In the pre_block, it also changes the 'NDST' field to the pCPU on which the
> vCPU is put to the per-CPU list 'blocked_vcpu_on_cpu', so when posted-interrupts
> come in, it will send the wakeup notification event to the pCPU above, then in
> the wakeup_handler, it can find the vCPU on the per-CPU list, hence
> kvm_vcpu_kick can wake it up.

Thank you for your response. I was actually confused about something
else. After wakeup_handler->kvm_vcpu_kick causes the vcpu to wake up,
that vcpu calls kvm_vcpu_check_block() to check if there are pending
events, otherwise the vcpu goes back to sleep. I had trouble yesterday
finding the code path from kvm_vcpu_check_block() which checks PIR/ON.

But after spending more time reading the source code this morning I
found that kvm_vcpu_check_block() eventually calls into
vmx_sync_pir_to_irr(), which copies PIR to IRR and clears ON. And then
apic_find_highest_irr() detects the pending posted interrupt.
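
Roughly, as I read it (simplified; not a verbatim copy of the source):

    static void vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
    {
            struct vcpu_vmx *vmx = to_vmx(vcpu);

            /* If ON is set, clear it and merge the posted bits into the vIRR. */
            if (!pi_test_and_clear_on(&vmx->pi_desc))
                    return;

            kvm_apic_update_irr(vcpu, vmx->pi_desc.pir);
    }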

>
> Thanks,
> Feng
>
>>
>> Thanks!
>>
>> >
>> > Signed-off-by: Feng Wu <feng...@intel.com>
>> > ---
>> > v9:
>> > - Add description for blocked_vcpu_on_cpu_lock in
>> Documentation/virtual/kvm/locking.txt
>> > - Check !kvm_arch_has_assigned_device(vcpu->kvm) first, then
>> >   !irq_remapping_cap(IRQ_POSTING_CAP)
>> >
>> > v8:
>> > - Rename 'pi_pre_block' to 'pre_block'
>> > - Rename 'pi_post_block' to 'post_block'
>> > - Change some comments
>> > - Only add the vCPU to the blocking list when the VM has assigned devices.
>> >
>> >  Documentation/virtual/kvm/locking.txt |  12 +++
>> >  arch/x86/include/asm/kvm_host.h   |  13 +++
>> >  arch/x86/kvm/vmx.c| 153
>> ++
>> >  arch/x86/kvm/x86.c|  53 +---
>> >  include/linux/kvm_host.h  |   3 +
>> >  virt/kvm/kvm_main.c   |   3 +
>> >  6 files changed, 227 insertions(+), 10 deletions(-)
>> >
>> > diff --git a/Documentation/virtual/kvm/locking.txt
>> b/Documentation/virtual/kvm/locking.txt
>> > index d68af4d..19f94a6 100644
>> > --- a/Documentation/virtual/kvm/locking.txt
>> > +++ b/Documentation/virtual/kvm/locking.txt
>> > @@ -166,3 +166,15 @@ Comment:   The srcu read lock must be held while
>> accessing memslots (e.g.
>> > MMIO/PIO address->device structure mapping (kvm->buses).
>> > The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
>> > if it is needed by multiple func

Re: [PATCH 04/12] KVM: x86: Replace call-back set_tsc_khz() with a common function

2015-10-05 Thread David Matlack
On Mon, Oct 5, 2015 at 12:53 PM, Radim Krčmář  wrote:
> 2015-09-28 13:38+0800, Haozhong Zhang:
>> Both VMX and SVM propagate virtual_tsc_khz in the same way, so this
>> patch removes the call-back set_tsc_khz() and replaces it with a common
>> function.
>>
>> Signed-off-by: Haozhong Zhang 
>> ---
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> +static void set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
>> +{
>> + u64 ratio, khz;
> | [...]
>> + khz = user_tsc_khz;
>
> I'd use "user_tsc_khz" directly.
>
>> + /* TSC scaling required  - calculate ratio */
>> + shift = (kvm_tsc_scaling_ratio_frac_bits <= 32) ?
>> + kvm_tsc_scaling_ratio_frac_bits : 32;
>> + ratio = khz << shift;
>> + do_div(ratio, tsc_khz);
>> + ratio <<= (kvm_tsc_scaling_ratio_frac_bits - shift);
>
> VMX is losing 16 bits by this operation;  normal fixed point division
> could get us a smaller drift (and an one-liner here) ...
> at 4.3 GHz, 32 instead of 48 bits after decimal point translate to one
> "lost" TSC tick per second, in the worst case.

We can easily avoid losing precision on x86_64 (divq allows a 128-bit
dividend). 32-bit can just lose the 16 bits of precision (TSC scaling
is only available on SkyLake, and I'd be surprised if there were
many hosts running KVM in protected mode on SkyLake :)).
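
For example, on x86_64 something along these lines keeps the full
kvm_tsc_scaling_ratio_frac_bits of precision (illustrative only;
compute_tsc_ratio is a made-up name, and it assumes a 128-bit dividend is
available):

    static u64 compute_tsc_ratio(u32 user_tsc_khz, u32 host_tsc_khz, u8 frac_bits)
    {
            /* ratio = (user_tsc_khz << frac_bits) / host_tsc_khz */
            unsigned __int128 dividend = (unsigned __int128)user_tsc_khz << frac_bits;

            return (u64)(dividend / host_tsc_khz);
    }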

>
> Please mention that we are truncating on purpose :)


Re: VM exit profiling

2015-10-06 Thread David Matlack
Have you tried perf kvm stat? e.g.

perf kvm stat record -a sleep 10 # record all vmexits for 10 seconds
perf kvm stat report --event=vmexit

This gives per-exit counts and min/max/avg latencies.

Alternatively you can record the raw events kvm:kvm_exit and kvm:kvm_entry and
process the data however you want.

On Tue, Oct 6, 2015 at 3:42 PM, Prateek Sharma  wrote:
> Hello all,
>   There used to be perf scripts to do exit-level profiling of VMs
> (number of exits, time spent per exit, etc). I am using kernel version
> 3.19, and the perf utility which ships with it has a perf-kvm option, but
> that only reports total number of exits, and not the reason/latency.
>   The stats in  /sys/debug/kvm seem to be about number of exits only,
> not the time information. My question is: whats the most convenient way to
> get per-exit counts and durations?
>
> Thanks,
> --Prateek


Re: [PATCH v2 2/3] KVM: dynamic halt_poll_ns adjustment

2015-08-27 Thread David Matlack
On Thu, Aug 27, 2015 at 2:59 AM, Wanpeng Li wanpeng...@hotmail.com wrote:
 Hi David,
 On 8/26/15 1:19 AM, David Matlack wrote:

 Thanks for writing v2, Wanpeng.

 On Mon, Aug 24, 2015 at 11:35 PM, Wanpeng Li wanpeng...@hotmail.com
 wrote:

 There is a downside of halt_poll_ns since poll is still happen for idle
 VCPU which can waste cpu usage. This patch adds the ability to adjust
 halt_poll_ns dynamically.

 What testing have you done with these patches? Do you know if this removes
 the overhead of polling in idle VCPUs? Do we lose any of the performance
 from always polling?

 There are two new kernel parameters for changing the halt_poll_ns:
 halt_poll_ns_grow and halt_poll_ns_shrink. A third new parameter,
 halt_poll_ns_max, controls the maximal halt_poll_ns; it is internally
 rounded down to a closest multiple of halt_poll_ns_grow. The shrink/grow
 matrix is suggested by David:

 if (poll successfully for interrupt): stay the same
else if (length of kvm_vcpu_block is longer than halt_poll_ns_max):
 shrink
else if (length of kvm_vcpu_block is less than halt_poll_ns_max): grow

 The way you implemented this wasn't what I expected. I thought you would
 time
 the whole function (kvm_vcpu_block). But I like your approach better. It's
 simpler and [by inspection] does what we want.


 I see more idle vCPU overhead w/ this method than with always
 halt-poll, so I brought back growing vcpu->halt_poll_ns when an interrupt arrives
 and shrinking it when an idle VCPU is detected. The performance looks good in v4.

Why did this patch have a worse idle overhead than always poll?


 Regards,
 Wanpeng Li



    halt_poll_ns_shrink/ |
    halt_poll_ns_grow    | grow halt_poll_ns    | shrink halt_poll_ns
    ---------------------+----------------------+------------------------
    < 1                  |  = halt_poll_ns      |  = 0
    < halt_poll_ns       | *= halt_poll_ns_grow | /= halt_poll_ns_shrink
    otherwise            | += halt_poll_ns_grow | -= halt_poll_ns_shrink

 I was curious why you went with this approach rather than just the
 middle row, or just the last row. Do you think we'll want the extra
 flexibility?

 Signed-off-by: Wanpeng Li wanpeng...@hotmail.com
 ---
   virt/kvm/kvm_main.c | 65
 -
   1 file changed, 64 insertions(+), 1 deletion(-)

 diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
 index 93db833..2a4962b 100644
 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -66,9 +66,26 @@
   MODULE_AUTHOR(Qumranet);
   MODULE_LICENSE(GPL);

 -static unsigned int halt_poll_ns;
 +#define KVM_HALT_POLL_NS  50
 +#define KVM_HALT_POLL_NS_GROW   2
 +#define KVM_HALT_POLL_NS_SHRINK 0
 +#define KVM_HALT_POLL_NS_MAX 200

 The macros are not necessary. Also, hard coding the numbers in the param
 definitions will make reading the comments above them easier.

 +
 +static unsigned int halt_poll_ns = KVM_HALT_POLL_NS;
   module_param(halt_poll_ns, uint, S_IRUGO | S_IWUSR);

 +/* Default doubles per-vcpu halt_poll_ns. */
 +static unsigned int halt_poll_ns_grow = KVM_HALT_POLL_NS_GROW;
 +module_param(halt_poll_ns_grow, int, S_IRUGO);
 +
 +/* Default resets per-vcpu halt_poll_ns . */
 +static unsigned int halt_poll_ns_shrink = KVM_HALT_POLL_NS_SHRINK;
 +module_param(halt_poll_ns_shrink, int, S_IRUGO);
 +
 +/* halt polling only reduces halt latency by 10-15 us, 2ms is enough */

 Ah, I misspoke before. I was thinking about round-trip latency. The
 latency
 of a single halt is reduced by about 5-7 us.

 +static unsigned int halt_poll_ns_max = KVM_HALT_POLL_NS_MAX;
 +module_param(halt_poll_ns_max, int, S_IRUGO);

 We can remove halt_poll_ns_max. vcpu-halt_poll_ns can always start at
 zero
 and grow from there. Then we just need one module param to keep
 vcpu-halt_poll_ns from growing too large.

 [ It would make more sense to remove halt_poll_ns and keep
 halt_poll_ns_max,
but since halt_poll_ns already exists in upstream kernels, we probably
 can't
remove it. ]

 +
   /*
* Ordering of locks:
*
 @@ -1907,6 +1924,48 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu
 *vcpu, gfn_t gfn)
   }
   EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);

 +static unsigned int __grow_halt_poll_ns(unsigned int val)
 +{
 +   if (halt_poll_ns_grow  1)
 +   return halt_poll_ns;
 +
 +   val = min(val, halt_poll_ns_max);
 +
 +   if (val == 0)
 +   return halt_poll_ns;
 +
 +   if (halt_poll_ns_grow  halt_poll_ns)
 +   val *= halt_poll_ns_grow;
 +   else
 +   val += halt_poll_ns_grow;
 +
 +   return val;
 +}
 +
 +static unsigned int __shrink_halt_poll_ns(int val, int modifier, int
 minimum)

 minimum never gets used.

 +{
 +   if (modifier  1)
 +   return 0;
 +
 +   if (modifier  halt_poll_ns)
 +   val /= modifier;
 +   else
 +   val -= modifier;
 +
 +   return val;
 +}
 +
 +static void grow_halt_poll_ns(struct kvm_vcpu *vcpu

Re: [PATCH v4 0/3] KVM: Dynamic Halt-Polling

2015-09-01 Thread David Matlack
On Tue, Sep 1, 2015 at 5:29 PM, Wanpeng Li <wanpeng...@hotmail.com> wrote:
> On 9/2/15 7:24 AM, David Matlack wrote:
>>
>> On Tue, Sep 1, 2015 at 3:58 PM, Wanpeng Li <wanpeng...@hotmail.com> wrote:

>>>
>>> Why this can happen?
>>
>> Ah, probably because I'm missing 9c8fd1ba220 (KVM: x86: optimize delivery
>> of TSC deadline timer interrupt). I don't think the edge case exists in
>> the latest kernel.
>
>
> Yeah, hope we both (including Peter Kieser) can test against the latest kvm tree to
> avoid confusion. The reason to introduce the adaptive halt-polling toggle is
> to handle the "edge case" as you mentioned above. So I think we can make
> more effort to improve v4 instead. I will improve v4 to handle short halts
> today. ;-)

That's fine. It's just easier to convey my ideas with a patch. FYI the
other reason for the toggle patch was to add the timer for kvm_vcpu_block,
which I think is the only way to get dynamic halt-polling right. Feel free
to work on top of v4!

>

>>>
>>> Did you test your patch against a windows guest?
>>
>> I have not. I tested against a 250HZ linux guest to check how it performs
>> against a ticking guest. Presumably, windows should be the same, but at a
>> higher tick rate. Do you have a test for Windows?
>
>
> I just test the idle vCPUs usage.
>
>
> V4 for windows 10:
>
> +------------------+------------------+------------------------+
> |                  |                  |                        |
> |  w/o halt-poll   |  w/ halt-poll    | dynamic(v4) halt-poll  |
> +------------------+------------------+------------------------+
> |                  |                  |                        |
> |      ~2.1%       |      ~3.0%       |        ~2.4%           |
> +------------------+------------------+------------------------+

I'm not seeing the same results with v4. With a 250HZ ticking guest
I see 15% c0 with halt_poll_ns=2000000 and 1.27% with halt_poll_ns=0.
Are you running one vcpu per pcpu?

(The reason for the overhead: the new tracepoint shows each vcpu is
alternating between 0 and 500 us.)

>
> V4  for linux guest:
>
> +------------------+------------------+---------------------+
> |                  |                  |                     |
> |  w/o halt-poll   |  w/ halt-poll    | dynamic halt-poll   |
> +------------------+------------------+---------------------+
> |                  |                  |                     |
> |      ~0.9%       |      ~1.8%       |       ~1.2%         |
> +------------------+------------------+---------------------+
>
>
> Regards,
> Wanpeng Li


Re: [PATCH v4 0/3] KVM: Dynamic Halt-Polling

2015-09-01 Thread David Matlack
On Thu, Aug 27, 2015 at 2:47 AM, Wanpeng Li  wrote:
> v3 -> v4:
>  * bring back grow vcpu->halt_poll_ns when interrupt arrives and shrinks
>when idle VCPU is detected
>
> v2 -> v3:
>  * grow/shrink vcpu->halt_poll_ns by *halt_poll_ns_grow or 
> /halt_poll_ns_shrink
>  * drop the macros and hard coding the numbers in the param definitions
>  * update the comments "5-7 us"
>  * remove halt_poll_ns_max and use halt_poll_ns as the max halt_poll_ns time,
>vcpu->halt_poll_ns start at zero
>  * drop the wrappers
>  * move the grow/shrink logic before "out:" w/ "if (waited)"

I posted a patchset which adds dynamic poll toggling (on/off switch). I think
this gives you a good place to build your dynamic growth patch on top. The
toggling patch has close to zero overhead for idle VMs and equivalent
performance VMs doing message passing as always-poll. It's a patch that's been
in my queue for a few weeks but just haven't had the time to send out. We can
win even more with your patchset by only polling as much as we need (via
dynamic growth/shrink). It also gives us a better place to stand for choosing
a default for halt_poll_ns. (We can run experiments and see how high
vcpu->halt_poll_ns tends to grow.)

The reason I posted a separate patch for toggling is because it adds timers
to kvm_vcpu_block and deals with a weird edge case (kvm_vcpu_block can get
called multiple times for one halt). To do dynamic poll adjustment correctly,
we have to time the length of each halt. Otherwise we hit some bad edge cases:

  v3: v3 had lots of idle overhead. It's because vcpu->halt_poll_ns grew every
  time we had a long halt. So idle VMs looked like: 0 us -> 500 us -> 1 ms ->
  2 ms -> 4 ms -> 0 us. Ideally vcpu->halt_poll_ns should just stay at 0 when
  the halts are long.

  v4: v4 fixed the idle overhead problem but broke dynamic growth for message
  passing VMs. Every time a VM did a short halt, vcpu->halt_poll_ns would grow.
  That means vcpu->halt_poll_ns will always be maxed out, even when the halt
  time is much less than the max.

I think we can fix both edge cases if we make grow/shrink decisions based on
the length of kvm_vcpu_block rather than the arrival of a guest interrupt
during polling.
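
In other words, something like this after the halt completes (sketch only;
block_ns is the total poll + wait time, and the max/helper names are made up):

    static void adjust_halt_poll_ns(struct kvm_vcpu *vcpu, u64 block_ns)
    {
            if (block_ns <= vcpu->halt_poll_ns)
                    return;                         /* polling covered it: stay the same */

            if (block_ns > halt_poll_ns_max)
                    shrink_halt_poll_ns(vcpu);      /* long halt: the vcpu was idle */
            else
                    grow_halt_poll_ns(vcpu);        /* short halt, longer than the poll window */
    }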

Some thoughts for dynamic growth:
  * Given Windows 10 timer tick (1 ms), let's set the maximum poll time to
less than 1ms. 200 us has been a good value for always-poll. We can
probably go a bit higher once we have your patch. Maybe 500 us?

  * The base case of dynamic growth (the first grow() after being at 0) should
be small. 500 us is too big. When I run TCP_RR in my guest I see poll times
of < 10 us. TCP_RR is on the lower-end of message passing workload latency,
so 10 us would be a good base case.

>
> v1 -> v2:
>  * change kvm_vcpu_block to read halt_poll_ns from the vcpu instead of
>the module parameter
>  * use the shrink/grow matrix which is suggested by David
>  * set halt_poll_ns_max to 2ms
>
> There is a downside of halt_poll_ns since polling still happens for idle
> VCPUs, which can waste cpu usage. This patchset adds the ability to adjust
> halt_poll_ns dynamically, growing halt_poll_ns if an interrupt arrives and
> shrinking halt_poll_ns when an idle VCPU is detected.
>
> There are two new kernel parameters for changing the halt_poll_ns:
> halt_poll_ns_grow and halt_poll_ns_shrink.
>
>
> Test w/ high cpu overcommit ratio, pin vCPUs, and the halt_poll_ns of
> halt-poll is the default 500000ns, the max halt_poll_ns of dynamic
> halt-poll is 2ms. Then watch the %C0 in the dump of Powertop tool.
> The test method is almost from David.
>
> +------------------+------------------+---------------------+
> |                  |                  |                     |
> |  w/o halt-poll   |  w/ halt-poll    | dynamic halt-poll   |
> +------------------+------------------+---------------------+
> |                  |                  |                     |
> |      ~0.9%       |      ~1.8%       |       ~1.2%         |
> +------------------+------------------+---------------------+
>
> The always halt-poll will increase cpu usage by ~0.9% for idle vCPUs and the
> dynamic halt-poll drops that to ~0.3%, which removes 67% of the overhead
> introduced by always halt-poll.
>
> Wanpeng Li (3):
>   KVM: make halt_poll_ns per-VCPU
>   KVM: dynamic halt_poll_ns adjustment
>   KVM: trace kvm_halt_poll_ns grow/shrink
>
>  include/linux/kvm_host.h   |  1 +
>  include/trace/events/kvm.h | 30 
>  virt/kvm/kvm_main.c| 50 
> +++---
>  3 files changed, 78 insertions(+), 3 deletions(-)
> --
> 1.9.1
>


Re: [PATCH v4 0/3] KVM: Dynamic Halt-Polling

2015-09-01 Thread David Matlack
On Tue, Sep 1, 2015 at 3:58 PM, Wanpeng Li <wanpeng...@hotmail.com> wrote:
> On 9/2/15 6:34 AM, David Matlack wrote:
>>
>> On Tue, Sep 1, 2015 at 3:30 PM, Wanpeng Li <wanpeng...@hotmail.com> wrote:
>>>
>>> On 9/2/15 5:45 AM, David Matlack wrote:
>>>>
>>>> On Thu, Aug 27, 2015 at 2:47 AM, Wanpeng Li <wanpeng...@hotmail.com>
>>>> wrote:
>>>>>
>>>>> v3 -> v4:
>>>>>* bring back grow vcpu->halt_poll_ns when interrupt arrives and
>>>>> shrinks
>>>>>  when idle VCPU is detected
>>>>>
>>>>> v2 -> v3:
>>>>>* grow/shrink vcpu->halt_poll_ns by *halt_poll_ns_grow or
>>>>> /halt_poll_ns_shrink
>>>>>* drop the macros and hard coding the numbers in the param
>>>>> definitions
>>>>>* update the comments "5-7 us"
>>>>>* remove halt_poll_ns_max and use halt_poll_ns as the max
>>>>> halt_poll_ns
>>>>> time,
>>>>>  vcpu->halt_poll_ns start at zero
>>>>>* drop the wrappers
>>>>>* move the grow/shrink logic before "out:" w/ "if (waited)"
>>>>
>>>> I posted a patchset which adds dynamic poll toggling (on/off switch). I
>>>> think
>>>> this gives you a good place to build your dynamic growth patch on top.
>>>> The
>>>> toggling patch has close to zero overhead for idle VMs and equivalent
>>>> performance VMs doing message passing as always-poll. It's a patch
>>>> that's
>>>> been
>>>> in my queue for a few weeks but just haven't had the time to send out.
>>>> We
>>>> can
>>>> win even more with your patchset by only polling as much as we need (via
>>>> dynamic growth/shrink). It also gives us a better place to stand for
>>>> choosing
>>>> a default for halt_poll_ns. (We can run experiments and see how high
>>>> vcpu->halt_poll_ns tends to grow.)
>>>>
>>>> The reason I posted a separate patch for toggling is because it adds
>>>> timers
>>>> to kvm_vcpu_block and deals with a weird edge case (kvm_vcpu_block can
>>>> get
>>>> called multiple times for one halt). To do dynamic poll adjustment
>
>
> Why this can happen?

Ah, probably because I'm missing 9c8fd1ba220 (KVM: x86: optimize delivery
of TSC deadline timer interrupt). I don't think the edge case exists in
the latest kernel.

>
>
>>>> correctly,
>>>> we have to time the length of each halt. Otherwise we hit some bad edge
>>>> cases:
>>>>
>>>> v3: v3 had lots of idle overhead. It's because vcpu->halt_poll_ns
>>>> grew
>>>> every
>>>> time we had a long halt. So idle VMs looked like: 0 us -> 500 us ->
>>>> 1
>>>> ms ->
>>>> 2 ms -> 4 ms -> 0 us. Ideally vcpu->halt_poll_ns should just stay at
>>>> 0
>>>> when
>>>> the halts are long.
>>>>
>>>> v4: v4 fixed the idle overhead problem but broke dynamic growth for
>>>> message
>>>> passing VMs. Every time a VM did a short halt, vcpu->halt_poll_ns
>>>> would
>>>> grow.
>>>> That means vcpu->halt_poll_ns will always be maxed out, even when
>>>> the
>>>> halt
>>>> time is much less than the max.
>>>>
>>>> I think we can fix both edge cases if we make grow/shrink decisions
>>>> based
>>>> on
>>>> the length of kvm_vcpu_block rather than the arrival of a guest
>>>> interrupt
>>>> during polling.
>>>>
>>>> Some thoughts for dynamic growth:
>>>> * Given Windows 10 timer tick (1 ms), let's set the maximum poll
>>>> time
>>>> to
>>>>   less than 1ms. 200 us has been a good value for always-poll. We
>>>> can
>>>>   probably go a bit higher once we have your patch. Maybe 500 us?
>
>
> Did you test your patch against a windows guest?

I have not. I tested against a 250HZ linux guest to check how it performs
against a ticking guest. Presumably, windows should be the same, but at a
higher tick rate. Do you have a test for Windows?

>
>>>>
>>>> * The base case of dynamic growth (the first grow() after being at
>>>> 0)
>>>> should
>>>>   be small. 500 us is too big. When I run TCP_RR in my guest I see
>>>> poll
>>>> times
>>>>   of < 10 us. TCP_RR is on the lower-end of message passing workload
>>>> latency,
>>>>   so 10 us would be a good base case.
>>>
>>>
>>> How to get your TCP_RR benchmark?
>>>
>>> Regards,
>>> Wanpeng Li
>>
>> Install the netperf package, or build from here:
>> http://www.netperf.org/netperf/DownloadNetperf.html
>>
>> In the vm:
>>
>> # ./netserver
>> # ./netperf -t TCP_RR
>>
>> Be sure to use an SMP guest (we want TCP_RR to be a cross-core message
>> passing workload in order to test halt-polling).
>
>
> Ah, ok, I use the same benchmark as yours.
>
> Regards,
> Wanpeng Li
>
>


[PATCH 0/2] Adaptive halt-polling toggle

2015-09-01 Thread David Matlack
This patchset adds a dynamic on/off switch for polling. This patchset
gets good performance on its own for both idle and Message Passing
workloads.

                       no-poll   always-poll   adaptive-toggle
---------------------------------------------------------------
Idle (nohz) VCPU %c0   0.12      0.32          0.15
Idle (250HZ) VCPU %c0  1.22      6.35          1.27
TCP_RR latency         39 us     25 us         25 us

(3.16 Linux guest, halt_poll_ns=200000)

"Idle (X) VCPU %c0" is the percent of time the physical cpu spent in
c0 over 60 seconds (each VCPU is pinned to a PCPU). (nohz) means the
guest was tickless. (250HZ) means the guest was ticking at 250HZ.

The big win is with ticking operating systems. Running the linux guest
with nohz=off (and HZ=250), we save 5% CPUs/second and get close to
no-polling overhead levels by using the adaptive toggle. The savings
should be even higher for higher frequency ticks.

Since we get low idle overhead with polling now, halt_poll_ns defaults
to 200000, instead of 0. We can increase halt_poll_ns a bit more once
we have dynamic halt-polling length adjustments (Wanpeng's patch). We
should however keep halt_poll_ns below 1 ms since that is the tick
frequency used by windows.
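
The predictor at the core of the toggle is simple (simplified sketch; the real
code in patch 2/2 tracks poll time and wait time separately):

    static void update_vcpu_block_predictor(struct kvm_vcpu *vcpu, u64 block_ns)
    {
            if (block_ns < halt_poll_ns)
                    vcpu->halt_poll_ns = halt_poll_ns;  /* last halt was short: poll next time */
            else
                    vcpu->halt_poll_ns = 0;             /* last halt was long: don't poll */
    }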

David Matlack (1):
  kvm: adaptive halt-polling toggle

Wanpeng Li (1):
  KVM: make halt_poll_ns per-VCPU

 include/linux/kvm_host.h   |   1 +
 include/trace/events/kvm.h |  23 ++
 virt/kvm/kvm_main.c| 111 ++---
 3 files changed, 99 insertions(+), 36 deletions(-)

-- 
2.5.0.457.gab17608



Re: [PATCH v4 0/3] KVM: Dynamic Halt-Polling

2015-09-01 Thread David Matlack
On Tue, Sep 1, 2015 at 3:30 PM, Wanpeng Li <wanpeng...@hotmail.com> wrote:
> On 9/2/15 5:45 AM, David Matlack wrote:
>>
>> On Thu, Aug 27, 2015 at 2:47 AM, Wanpeng Li <wanpeng...@hotmail.com>
>> wrote:
>>>
>>> v3 -> v4:
>>>   * bring back grow vcpu->halt_poll_ns when interrupt arrives and shrinks
>>> when idle VCPU is detected
>>>
>>> v2 -> v3:
>>>   * grow/shrink vcpu->halt_poll_ns by *halt_poll_ns_grow or
>>> /halt_poll_ns_shrink
>>>   * drop the macros and hard coding the numbers in the param definitions
>>>   * update the comments "5-7 us"
>>>   * remove halt_poll_ns_max and use halt_poll_ns as the max halt_poll_ns
>>> time,
>>> vcpu->halt_poll_ns start at zero
>>>   * drop the wrappers
>>>   * move the grow/shrink logic before "out:" w/ "if (waited)"
>>
>> I posted a patchset which adds dynamic poll toggling (on/off switch). I
>> think
>> this gives you a good place to build your dynamic growth patch on top. The
>> toggling patch has close to zero overhead for idle VMs and equivalent
>> performance VMs doing message passing as always-poll. It's a patch that's
>> been
>> in my queue for a few weeks but just haven't had the time to send out. We
>> can
>> win even more with your patchset by only polling as much as we need (via
>> dynamic growth/shrink). It also gives us a better place to stand for
>> choosing
>> a default for halt_poll_ns. (We can run experiments and see how high
>> vcpu->halt_poll_ns tends to grow.)
>>
>> The reason I posted a separate patch for toggling is because it adds
>> timers
>> to kvm_vcpu_block and deals with a weird edge case (kvm_vcpu_block can get
>> called multiple times for one halt). To do dynamic poll adjustment
>> correctly,
>> we have to time the length of each halt. Otherwise we hit some bad edge
>> cases:
>>
>>v3: v3 had lots of idle overhead. It's because vcpu->halt_poll_ns grew
>> every
>>time we had a long halt. So idle VMs looked like: 0 us -> 500 us -> 1
>> ms ->
>>2 ms -> 4 ms -> 0 us. Ideally vcpu->halt_poll_ns should just stay at 0
>> when
>>the halts are long.
>>
>>v4: v4 fixed the idle overhead problem but broke dynamic growth for
>> message
>>passing VMs. Every time a VM did a short halt, vcpu->halt_poll_ns would
>> grow.
>>That means vcpu->halt_poll_ns will always be maxed out, even when the
>> halt
>>time is much less than the max.
>>
>> I think we can fix both edge cases if we make grow/shrink decisions based
>> on
>> the length of kvm_vcpu_block rather than the arrival of a guest interrupt
>> during polling.
>>
>> Some thoughts for dynamic growth:
>>* Given Windows 10 timer tick (1 ms), let's set the maximum poll time
>> to
>>  less than 1ms. 200 us has been a good value for always-poll. We can
>>  probably go a bit higher once we have your patch. Maybe 500 us?
>>
>>* The base case of dynamic growth (the first grow() after being at 0)
>> should
>>  be small. 500 us is too big. When I run TCP_RR in my guest I see poll
>> times
>>  of < 10 us. TCP_RR is on the lower-end of message passing workload
>> latency,
>>  so 10 us would be a good base case.
>
>
> How to get your TCP_RR benchmark?
>
> Regards,
> Wanpeng Li

Install the netperf package, or build from here:
http://www.netperf.org/netperf/DownloadNetperf.html

In the vm:

# ./netserver
# ./netperf -t TCP_RR

Be sure to use an SMP guest (we want TCP_RR to be a cross-core message
passing workload in order to test halt-polling).

>
>
>>> v1 -> v2:
>>>   * change kvm_vcpu_block to read halt_poll_ns from the vcpu instead of
>>> the module parameter
>>>   * use the shrink/grow matrix which is suggested by David
>>>   * set halt_poll_ns_max to 2ms
>>>
>>> There is a downside of halt_poll_ns since poll is still happen for idle
>>> VCPU which can waste cpu usage. This patchset add the ability to adjust
>>> halt_poll_ns dynamically, grows halt_poll_ns if an interrupt arrives and
>>> shrinks halt_poll_ns when idle VCPU is detected.
>>>
>>> There are two new kernel parameters for changing the halt_poll_ns:
>>> halt_poll_ns_grow and halt_poll_ns_shrink.
>>>
>>>
>>> Test w/ high cpu overcommit ratio, pin vCPUs, and the halt_poll_ns of
>>> halt-poll is the default 500000ns, the max 

[PATCH 2/2] kvm: adaptive halt-polling toggle

2015-09-01 Thread David Matlack
This patch removes almost all of the overhead of polling for idle VCPUs
by disabling polling for long halts. The length of the previous halt
is used as a predictor for the current halt:

  if (length of previous halt < halt_poll_ns): poll for halt_poll_ns
  else: don't poll

This tends to work well in practice. For VMs running Message Passing
workloads, all halts are short and so the VCPU should always poll. When
a VCPU is idle, all halts are long and so the VCPU should never poll.
Experimental results on an IvyBridge host show adaptive toggling gets
close to the best of both worlds.

                       no-poll   always-poll   adaptive-toggle
---------------------------------------------------------------
Idle (nohz) VCPU %c0   0.12      0.32          0.15
Idle (250HZ) VCPU %c0  1.22      6.35          1.27
TCP_RR latency         39 us     25 us         25 us

(3.16 Linux guest, halt_poll_ns=200000)

The big win is with ticking operating systems. Running the linux guest
with nohz=off (and HZ=250), we save 5% CPU/second and get close to
no-polling overhead levels by using the adaptive toggle. The savings
should be even higher for higher frequency ticks.

Signed-off-by: David Matlack <dmatl...@google.com>
---
 include/trace/events/kvm.h |  23 ++
 virt/kvm/kvm_main.c| 110 ++---
 2 files changed, 97 insertions(+), 36 deletions(-)

diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index a44062d..34e0b11 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -38,22 +38,27 @@ TRACE_EVENT(kvm_userspace_exit,
 );
 
 TRACE_EVENT(kvm_vcpu_wakeup,
-   TP_PROTO(__u64 ns, bool waited),
-   TP_ARGS(ns, waited),
+   TP_PROTO(bool poll, bool success, __u64 poll_ns, __u64 wait_ns),
+   TP_ARGS(poll, success, poll_ns, wait_ns),
 
TP_STRUCT__entry(
-   __field(__u64,  ns  )
-   __field(bool,   waited  )
+   __field( bool,  poll)
+   __field( bool,  success )
+   __field(__u64,  poll_ns )
+   __field(__u64,  wait_ns )
),
 
TP_fast_assign(
-   __entry->ns = ns;
-   __entry->waited = waited;
+   __entry->poll   = poll;
+   __entry->success= success;
+   __entry->poll_ns= poll_ns;
+   __entry->wait_ns= wait_ns;
),
 
-   TP_printk("%s time %lld ns",
- __entry->waited ? "wait" : "poll",
- __entry->ns)
+   TP_printk("%s %s, poll ns %lld, wait ns %lld",
+ __entry->poll ? "poll" : "wait",
+ __entry->success ? "success" : "fail",
+ __entry->poll_ns, __entry->wait_ns)
 );
 
 #if defined(CONFIG_HAVE_KVM_IRQFD)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 977ffb1..3a66694 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -66,7 +66,8 @@
 MODULE_AUTHOR("Qumranet");
 MODULE_LICENSE("GPL");
 
-static unsigned int halt_poll_ns;
+/* The maximum amount of time a vcpu will poll for interrupts while halted. */
+static unsigned int halt_poll_ns = 200000;
 module_param(halt_poll_ns, uint, S_IRUGO | S_IWUSR);
 
 /*
@@ -1907,6 +1908,7 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, 
gfn_t gfn)
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
 
+/* This sets KVM_REQ_UNHALT if an interrupt arrives. */
 static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
 {
if (kvm_arch_vcpu_runnable(vcpu)) {
@@ -1921,47 +1923,101 @@ static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
return 0;
 }
 
-/*
- * The vCPU has executed a HLT instruction with in-kernel mode enabled.
- */
-void kvm_vcpu_block(struct kvm_vcpu *vcpu)
+static void
+update_vcpu_block_predictor(struct kvm_vcpu *vcpu, u64 poll_ns, u64 wait_ns)
 {
-   ktime_t start, cur;
-   DEFINE_WAIT(wait);
-   bool waited = false;
-
-   start = cur = ktime_get();
-   if (vcpu->halt_poll_ns) {
-   ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
-
-   do {
-   /*
-* This sets KVM_REQ_UNHALT if an interrupt
-* arrives.
-*/
-   if (kvm_vcpu_check_block(vcpu) < 0) {
-   ++vcpu->stat.halt_successful_poll;
-   goto out;
-   }
-   cur = ktime_get();
-   } while (single_task_running() && ktime_befo

[PATCH 1/2] KVM: make halt_poll_ns per-VCPU

2015-09-01 Thread David Matlack
From: Wanpeng Li 

Change halt_poll_ns into per-VCPU variable, seeded from module parameter,
to allow greater flexibility.

Signed-off-by: Wanpeng Li 
---
 include/linux/kvm_host.h | 1 +
 virt/kvm/kvm_main.c  | 5 +++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 05e99b8..382cbef 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -241,6 +241,7 @@ struct kvm_vcpu {
int sigset_active;
sigset_t sigset;
struct kvm_vcpu_stat stat;
+   unsigned int halt_poll_ns;
 
 #ifdef CONFIG_HAS_IOMEM
int mmio_needed;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8b8a444..977ffb1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -217,6 +217,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, 
unsigned id)
vcpu->kvm = kvm;
vcpu->vcpu_id = id;
vcpu->pid = NULL;
+   vcpu->halt_poll_ns = 0;
init_waitqueue_head(&vcpu->wq);
kvm_async_pf_vcpu_init(vcpu);
 
@@ -1930,8 +1931,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
bool waited = false;
 
start = cur = ktime_get();
-   if (halt_poll_ns) {
-   ktime_t stop = ktime_add_ns(ktime_get(), halt_poll_ns);
+   if (vcpu->halt_poll_ns) {
+   ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
 
do {
/*
-- 
2.5.0.457.gab17608



Re: [PATCH v6 2/3] KVM: dynamic halt_poll_ns adjustment

2015-09-03 Thread David Matlack
On Thu, Sep 3, 2015 at 2:23 AM, Wanpeng Li  wrote:
>
> How about something like:
>
> @@ -1941,10 +1976,14 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>  */
> if (kvm_vcpu_check_block(vcpu) < 0) {
> ++vcpu->stat.halt_successful_poll;
> -   goto out;
> +   break;
> }
> cur = ktime_get();
> } while (single_task_running() && ktime_before(cur, stop));
> +
> +   poll_ns = ktime_to_ns(cur) - ktime_to_ns(start);
> +   if (ktime_before(cur, stop) && single_task_running())
> +   goto out;

I would prefer an explicit signal (e.g. set a bool to true before breaking out
of the loop, and check it here) to avoid duplicating the loop exit condition.
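
i.e. something like (untested):

        bool polled = false;
        ...
        do {
                if (kvm_vcpu_check_block(vcpu) < 0) {
                        ++vcpu->stat.halt_successful_poll;
                        polled = true;
                        break;
                }
                cur = ktime_get();
        } while (single_task_running() && ktime_before(cur, stop));

        poll_ns = ktime_to_ns(cur) - ktime_to_ns(start);
        if (polled)
                goto out;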


Re: [PATCH v6 0/3] KVM: Dynamic Halt-Polling

2015-09-02 Thread David Matlack
On Wed, Sep 2, 2015 at 12:29 AM, Wanpeng Li  wrote:
> v5 -> v6:
>  * fix wait_ns and poll_ns

Thanks for bearing with me through all the reviews. I think it's on the
verge of being done :). There are just a few small things to fix.

>
> v4 -> v5:
>  * set base case 10us and max poll time 500us
>  * handle short/long halt, idea from David, many thanks David ;-)
>
> v3 -> v4:
>  * bring back grow vcpu->halt_poll_ns when interrupt arrives and shrinks
>when idle VCPU is detected
>
> v2 -> v3:
>  * grow/shrink vcpu->halt_poll_ns by *halt_poll_ns_grow or 
> /halt_poll_ns_shrink
>  * drop the macros and hard coding the numbers in the param definitions
>  * update the comments "5-7 us"
>  * remove halt_poll_ns_max and use halt_poll_ns as the max halt_poll_ns time,
>vcpu->halt_poll_ns start at zero
>  * drop the wrappers
>  * move the grow/shrink logic before "out:" w/ "if (waited)"
>
> v1 -> v2:
>  * change kvm_vcpu_block to read halt_poll_ns from the vcpu instead of
>the module parameter
>  * use the shrink/grow matrix which is suggested by David
>  * set halt_poll_ns_max to 2ms
>
> There is a downside of always-poll since polling still happens for idle
> vCPUs, which can waste cpu usage. This patchset adds the ability to adjust
> halt_poll_ns dynamically, to grow halt_poll_ns when a short halt is detected,
> and to shrink halt_poll_ns when a long halt is detected.
>
> There are two new kernel parameters for changing the halt_poll_ns:
> halt_poll_ns_grow and halt_poll_ns_shrink.
>
>                        no-poll   always-poll   dynamic-poll
> ------------------------------------------------------------
> Idle (nohz) vCPU %c0   0.15%     0.3%          0.2%
> Idle (250HZ) vCPU %c0  1.1%      4.6%~14%      1.2%
> TCP_RR latency         34us      27us          26.7us
>
> "Idle (X) vCPU %c0" is the percent of time the physical cpu spent in
> c0 over 60 seconds (each vCPU is pinned to a pCPU). (nohz) means the
> guest was tickless. (250HZ) means the guest was ticking at 250HZ.
>
> The big win is with ticking operating systems. Running the linux guest
> with nohz=off (and HZ=250), we save 3.4%~12.8% CPUs/second and get close
> to no-polling overhead levels by using the dynamic-poll. The savings
> should be even higher for higher frequency ticks.
>
>
> Wanpeng Li (3):
>   KVM: make halt_poll_ns per-vCPU
>   KVM: dynamic halt_poll_ns adjustment
>   KVM: trace kvm_halt_poll_ns grow/shrink
>
>  include/linux/kvm_host.h   |  1 +
>  include/trace/events/kvm.h | 30 
>  virt/kvm/kvm_main.c| 70 
> ++
>  3 files changed, 96 insertions(+), 5 deletions(-)
>
> --
> 1.9.1
>


Re: [PATCH v6 2/3] KVM: dynamic halt_poll_ns adjustment

2015-09-02 Thread David Matlack
On Wed, Sep 2, 2015 at 12:29 AM, Wanpeng Li <wanpeng...@hotmail.com> wrote:
> There is a downside of always-poll since polling still happens for idle
> vCPUs, which can waste cpu usage. This patch adds the ability to adjust
> halt_poll_ns dynamically, to grow halt_poll_ns when a short halt is detected,
> and to shrink halt_poll_ns when a long halt is detected.
>
> There are two new kernel parameters for changing the halt_poll_ns:
> halt_poll_ns_grow and halt_poll_ns_shrink.
>
>                        no-poll   always-poll   dynamic-poll
> ------------------------------------------------------------
> Idle (nohz) vCPU %c0   0.15%     0.3%          0.2%
> Idle (250HZ) vCPU %c0  1.1%      4.6%~14%      1.2%
> TCP_RR latency         34us      27us          26.7us
>
> "Idle (X) vCPU %c0" is the percent of time the physical cpu spent in
> c0 over 60 seconds (each vCPU is pinned to a pCPU). (nohz) means the
> guest was tickless. (250HZ) means the guest was ticking at 250HZ.
>
> The big win is with ticking operating systems. Running the linux guest
> with nohz=off (and HZ=250), we save 3.4%~12.8% CPUs/second and get close
> to no-polling overhead levels by using the dynamic-poll. The savings
> should be even higher for higher frequency ticks.
>
> Suggested-by: David Matlack <dmatl...@google.com>
> Signed-off-by: Wanpeng Li <wanpeng...@hotmail.com>
> ---
>  virt/kvm/kvm_main.c | 61 
> ++---
>  1 file changed, 58 insertions(+), 3 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index c06e57c..3cff02f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -66,9 +66,18 @@
>  MODULE_AUTHOR("Qumranet");
>  MODULE_LICENSE("GPL");
>
> -static unsigned int halt_poll_ns;
> +/* halt polling only reduces halt latency by 5-7 us, 500us is enough */
> +static unsigned int halt_poll_ns = 500000;
>  module_param(halt_poll_ns, uint, S_IRUGO | S_IWUSR);
>
> +/* Default doubles per-vcpu halt_poll_ns. */
> +static unsigned int halt_poll_ns_grow = 2;
> +module_param(halt_poll_ns_grow, int, S_IRUGO);
> +
> +/* Default resets per-vcpu halt_poll_ns . */
> +static unsigned int halt_poll_ns_shrink;
> +module_param(halt_poll_ns_shrink, int, S_IRUGO);
> +
>  /*
>   * Ordering of locks:
>   *
> @@ -1907,6 +1916,31 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, 
> gfn_t gfn)
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
>
> +static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
> +{
> +   int val = vcpu->halt_poll_ns;
> +
> +   /* 10us base */
> +   if (val == 0 && halt_poll_ns_grow)
> +   val = 10000;
> +   else
> +   val *= halt_poll_ns_grow;
> +
> +   vcpu->halt_poll_ns = val;
> +}
> +
> +static void shrink_halt_poll_ns(struct kvm_vcpu *vcpu)
> +{
> +   int val = vcpu->halt_poll_ns;
> +
> +   if (halt_poll_ns_shrink == 0)
> +   val = 0;
> +   else
> +   val /= halt_poll_ns_shrink;
> +
> +   vcpu->halt_poll_ns = val;
> +}
> +
>  static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
>  {
> if (kvm_arch_vcpu_runnable(vcpu)) {
> @@ -1929,6 +1963,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> ktime_t start, cur;
> DEFINE_WAIT(wait);
> bool waited = false;
> +   u64 poll_ns = 0, wait_ns = 0, block_ns = 0;
>
> start = cur = ktime_get();
> if (vcpu->halt_poll_ns) {
> @@ -1941,10 +1976,15 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>  */
> if (kvm_vcpu_check_block(vcpu) < 0) {
> ++vcpu->stat.halt_successful_poll;
> -   goto out;
> +   break;
> }
> cur = ktime_get();
> } while (single_task_running() && ktime_before(cur, stop));
> +
> +   if (ktime_before(cur, stop)) {

You can't use 'cur' to tell if the interrupt arrived. single_task_running()
can break us out of the loop before 'stop'.

> +   poll_ns = ktime_to_ns(cur) - ktime_to_ns(start);

Put this line before the if(). block_ns should always include the time
spent polling; even if polling does not succeed.
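
i.e. (sketch; "polled" stands for whatever explicit flag the poll loop sets):

        poll_ns = ktime_to_ns(cur) - ktime_to_ns(start);        /* always charge the poll time */
        if (polled)
                goto out;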

> +   goto out;
> +   }
> }
>
> for (;;) {
> @@ -1959,9 +1999,24 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>
> finish_wait(&vcpu->wq, &wait);
> cur = ktime_get();
> +   wait_ns = ktime_to_ns(cur) - kt

Re: [PATCH v6 3/3] KVM: trace kvm_halt_poll_ns grow/shrink

2015-09-02 Thread David Matlack
On Wed, Sep 2, 2015 at 12:42 AM, Wanpeng Li  wrote:
> Tracepoint for dynamic halt_pool_ns, fired on every potential change.
>
> Signed-off-by: Wanpeng Li 
> ---
>  include/trace/events/kvm.h | 30 ++
>  virt/kvm/kvm_main.c|  8 ++--
>  2 files changed, 36 insertions(+), 2 deletions(-)
>
> diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
> index a44062d..75ddf80 100644
> --- a/include/trace/events/kvm.h
> +++ b/include/trace/events/kvm.h
> @@ -356,6 +356,36 @@ TRACE_EVENT(
>   __entry->address)
>  );
>
> +TRACE_EVENT(kvm_halt_poll_ns,
> +   TP_PROTO(bool grow, unsigned int vcpu_id, int new, int old),
> +   TP_ARGS(grow, vcpu_id, new, old),
> +
> +   TP_STRUCT__entry(
> +   __field(bool, grow)
> +   __field(unsigned int, vcpu_id)
> +   __field(int, new)
> +   __field(int, old)
> +   ),
> +
> +   TP_fast_assign(
> +   __entry->grow   = grow;
> +   __entry->vcpu_id= vcpu_id;
> +   __entry->new= new;
> +   __entry->old= old;
> +   ),
> +
> +   TP_printk("vcpu %u: halt_pool_ns %d (%s %d)",

s/pool/poll/

> +   __entry->vcpu_id,
> +   __entry->new,
> +   __entry->grow ? "grow" : "shrink",
> +   __entry->old)
> +);
> +
> +#define trace_kvm_halt_poll_ns_grow(vcpu_id, new, old) \
> +   trace_kvm_halt_poll_ns(true, vcpu_id, new, old)
> +#define trace_kvm_halt_poll_ns_shrink(vcpu_id, new, old) \
> +   trace_kvm_halt_poll_ns(false, vcpu_id, new, old)
> +
>  #endif
>
>  #endif /* _TRACE_KVM_MAIN_H */
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 3cff02f..fee339e 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1918,8 +1918,9 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
>
>  static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
>  {
> -   int val = vcpu->halt_poll_ns;
> +   int old, val;
>
> +   old = val = vcpu->halt_poll_ns;
> /* 10us base */
> if (val == 0 && halt_poll_ns_grow)
> val = 1;
> @@ -1927,18 +1928,21 @@ static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
> val *= halt_poll_ns_grow;
>
> vcpu->halt_poll_ns = val;
> +   trace_kvm_halt_poll_ns_grow(vcpu->vcpu_id, val, old);
>  }
>
>  static void shrink_halt_poll_ns(struct kvm_vcpu *vcpu)
>  {
> -   int val = vcpu->halt_poll_ns;
> +   int old, val;
>
> +   old = val = vcpu->halt_poll_ns;
> if (halt_poll_ns_shrink == 0)
> val = 0;
> else
> val /= halt_poll_ns_shrink;
>
> vcpu->halt_poll_ns = val;
> +   trace_kvm_halt_poll_ns_shrink(vcpu->vcpu_id, val, old);
>  }
>
>  static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
> --
> 1.9.1
>


Re: [PATCH v6 2/3] KVM: dynamic halt_poll_ns adjustment

2015-09-02 Thread David Matlack
On Wed, Sep 2, 2015 at 12:12 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>
>
> On 02/09/2015 20:09, David Matlack wrote:
>> On Wed, Sep 2, 2015 at 12:29 AM, Wanpeng Li <wanpeng...@hotmail.com> wrote:
>>> There is a downside of always-poll since polling still happens for idle
>>> vCPUs, which can waste cpu usage. This patch adds the ability to adjust
>>> halt_poll_ns dynamically: grow halt_poll_ns when a short halt is detected,
>>> and shrink halt_poll_ns when a long halt is detected.
>>>
>>> There are two new kernel parameters for changing the halt_poll_ns:
>>> halt_poll_ns_grow and halt_poll_ns_shrink.
>>>
>>>                            no-poll      always-poll     dynamic-poll
>>> ---------------------------------------------------------------------
>>> Idle (nohz) vCPU %c0       0.15%        0.3%            0.2%
>>> Idle (250HZ) vCPU %c0      1.1%         4.6%~14%        1.2%
>>> TCP_RR latency             34us         27us            26.7us
>>>
>>> "Idle (X) vCPU %c0" is the percent of time the physical cpu spent in
>>> c0 over 60 seconds (each vCPU is pinned to a pCPU). (nohz) means the
>>> guest was tickless. (250HZ) means the guest was ticking at 250HZ.
>>>
>>> The big win is with ticking operating systems. Running the linux guest
>>> with nohz=off (and HZ=250), we save 3.4%~12.8% CPUs/second and get close
>>> to no-polling overhead levels by using the dynamic-poll. The savings
>>> should be even higher for higher frequency ticks.
>>>
>>> Suggested-by: David Matlack <dmatl...@google.com>
>>> Signed-off-by: Wanpeng Li <wanpeng...@hotmail.com>
>>> ---
>>>  virt/kvm/kvm_main.c | 61 
>>> ++---
>>>  1 file changed, 58 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index c06e57c..3cff02f 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -66,9 +66,18 @@
>>>  MODULE_AUTHOR("Qumranet");
>>>  MODULE_LICENSE("GPL");
>>>
>>> -static unsigned int halt_poll_ns;
>>> +/* halt polling only reduces halt latency by 5-7 us, 500us is enough */
>>> +static unsigned int halt_poll_ns = 500000;
>>>  module_param(halt_poll_ns, uint, S_IRUGO | S_IWUSR);
>>>
>>> +/* Default doubles per-vcpu halt_poll_ns. */
>>> +static unsigned int halt_poll_ns_grow = 2;
>>> +module_param(halt_poll_ns_grow, int, S_IRUGO);
>>> +
>>> +/* Default resets per-vcpu halt_poll_ns . */
>>> +static unsigned int halt_poll_ns_shrink;
>>> +module_param(halt_poll_ns_shrink, int, S_IRUGO);
>>> +
>>>  /*
>>>   * Ordering of locks:
>>>   *
>>> @@ -1907,6 +1916,31 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, 
>>> gfn_t gfn)
>>>  }
>>>  EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
>>>
>>> +static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
>>> +{
>>> +   int val = vcpu->halt_poll_ns;
>>> +
>>> +   /* 10us base */
>>> +   if (val == 0 && halt_poll_ns_grow)
>>> +   val = 1;
>>> +   else
>>> +   val *= halt_poll_ns_grow;
>>> +
>>> +   vcpu->halt_poll_ns = val;
>>> +}
>>> +
>>> +static void shrink_halt_poll_ns(struct kvm_vcpu *vcpu)
>>> +{
>>> +   int val = vcpu->halt_poll_ns;
>>> +
>>> +   if (halt_poll_ns_shrink == 0)
>>> +   val = 0;
>>> +   else
>>> +   val /= halt_poll_ns_shrink;
>>> +
>>> +   vcpu->halt_poll_ns = val;
>>> +}
>>> +
>>>  static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
>>>  {
>>> if (kvm_arch_vcpu_runnable(vcpu)) {
>>> @@ -1929,6 +1963,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>>> ktime_t start, cur;
>>> DEFINE_WAIT(wait);
>>> bool waited = false;
>>> +   u64 poll_ns = 0, wait_ns = 0, block_ns = 0;
>>>
>>> start = cur = ktime_get();
>>> if (vcpu->halt_poll_ns) {
>>> @@ -1941,10 +1976,15 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>>>  */
>>> if (kvm_vcpu_check_block(vcpu) < 0) {
>>> ++vcpu->stat.halt_succes

Re: Questions about KVM TSC trapping

2015-09-15 Thread David Matlack
On Tue, Sep 15, 2015 at 12:04 AM, Oliver Yang  wrote:
> Hi Guys,
>
> I found below patch for KVM TSC trapping / migration support,
>
> https://lkml.org/lkml/2011/1/6/90
>
> It seemed the patch were not merged in Linux mainline.
>
> So I have 3 questions here,
>
> 1.  Can KVM support TSC trapping today? If not, what is the plan?

Not without a patch. Did you want to trap RDTSC? RDTSC works without
trapping thanks to hardware support.

>
> 2. What is the solution if my SMP Linux guest OS doesn't have reliable
> TSC?

If you are seeing an unreliable TSC in your guest, maybe your host
hardware does not support a synchronized TSC across CPUs. I can't
recall how to check for this. There might be a flag in your host's
/proc/cpuinfo.

>
> Because the no TSC trapping support, will kvmclock driver handle all
> TSC sync issues?
>
> 3. What if my Linux guest doesn't have kvmclock driver?

The guest will use a different clock source (e.g. acpi-pm). Note the
RDTSC[P] instruction will still work just fine.

>
> Does that mean I shouldn't run TSC sensitive application in my guest
> OS?
>
> BTW, my application is written with lots of rdtsc instructions, and
> which performs well in VMware guest.

Does it not work well in KVM?


Re: [PATCH] KVM: add halt_attempted_poll to VCPU stats

2015-09-15 Thread David Matlack
On Tue, Sep 15, 2015 at 9:27 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> This new statistic can help diagnose VCPUs that, for any reason,
> trigger bad behavior of halt_poll_ns autotuning.
>
> For example, say halt_poll_ns = 480000, and wakeups are spaced exactly
> like 479us, 481us, 479us, 481us. Then KVM always fails polling and wastes
> 10+20+40+80+160+320+480 = 1110 microseconds out of every
> 479+481+479+481+479+481+479 = 3359 microseconds. The VCPU then
> consumes about 30% more CPU than it would use without
> polling.  This would show up as an abnormally high number of
> attempted polls compared to successful polls.
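The arithmetic checks out; here is a quick userspace sketch that just
reproduces the numbers above (the 10us..480us doubling sequence and the
479/481us wakeups are taken from the example, not from the kernel code):

#include <stdio.h>

int main(void)
{
        unsigned int poll_us = 0, wasted_us = 0, total_us = 0;
        unsigned int wakeup_us[] = { 479, 481, 479, 481, 479, 481, 479 };
        int i;

        for (i = 0; i < 7; i++) {
                poll_us = poll_us ? poll_us * 2 : 10;   /* grow by doubling */
                if (poll_us > 480)                      /* cap at halt_poll_ns */
                        poll_us = 480;
                wasted_us += poll_us;           /* every poll fails here */
                total_us += wakeup_us[i];
        }
        /* prints: wasted 1110 us out of 3359 us */
        printf("wasted %u us out of %u us\n", wasted_us, total_us);
        return 0;
}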

Reviewed-by: David Matlack <dmatl...@google.com>

>
> Cc: Christian Borntraeger <borntrae...@de.ibm.com>
> Cc: David Matlack <dmatl...@google.com>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> ---
>  arch/arm/include/asm/kvm_host.h | 1 +
>  arch/arm64/include/asm/kvm_host.h   | 1 +
>  arch/mips/include/asm/kvm_host.h| 1 +
>  arch/mips/kvm/mips.c| 1 +
>  arch/powerpc/include/asm/kvm_host.h | 1 +
>  arch/powerpc/kvm/book3s.c   | 1 +
>  arch/powerpc/kvm/booke.c| 1 +
>  arch/s390/include/asm/kvm_host.h| 1 +
>  arch/s390/kvm/kvm-s390.c| 1 +
>  arch/x86/include/asm/kvm_host.h | 1 +
>  arch/x86/kvm/x86.c  | 1 +
>  virt/kvm/kvm_main.c | 1 +
>  12 files changed, 12 insertions(+)
>
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index dcba0fa5176e..687ddeba3611 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -148,6 +148,7 @@ struct kvm_vm_stat {
>
>  struct kvm_vcpu_stat {
> u32 halt_successful_poll;
> +   u32 halt_attempted_poll;
> u32 halt_wakeup;
>  };
>
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 415938dc45cf..486594583cc6 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -195,6 +195,7 @@ struct kvm_vm_stat {
>
>  struct kvm_vcpu_stat {
> u32 halt_successful_poll;
> +   u32 halt_attempted_poll;
> u32 halt_wakeup;
>  };
>
> diff --git a/arch/mips/include/asm/kvm_host.h 
> b/arch/mips/include/asm/kvm_host.h
> index e8c8d9d0c45f..3a54dbca9f7e 100644
> --- a/arch/mips/include/asm/kvm_host.h
> +++ b/arch/mips/include/asm/kvm_host.h
> @@ -128,6 +128,7 @@ struct kvm_vcpu_stat {
> u32 msa_disabled_exits;
> u32 flush_dcache_exits;
> u32 halt_successful_poll;
> +   u32 halt_attempted_poll;
> u32 halt_wakeup;
>  };
>
> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> index cd4c129ce743..49ff3bfc007e 100644
> --- a/arch/mips/kvm/mips.c
> +++ b/arch/mips/kvm/mips.c
> @@ -55,6 +55,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
> { "msa_disabled", VCPU_STAT(msa_disabled_exits), KVM_STAT_VCPU },
> { "flush_dcache", VCPU_STAT(flush_dcache_exits), KVM_STAT_VCPU },
> { "halt_successful_poll", VCPU_STAT(halt_successful_poll), 
> KVM_STAT_VCPU },
> +   { "halt_attempted_poll", VCPU_STAT(halt_attempted_poll), 
> KVM_STAT_VCPU },
> { "halt_wakeup",  VCPU_STAT(halt_wakeup),KVM_STAT_VCPU },
> {NULL}
>  };
> diff --git a/arch/powerpc/include/asm/kvm_host.h 
> b/arch/powerpc/include/asm/kvm_host.h
> index 98eebbf66340..195886a583ba 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -108,6 +108,7 @@ struct kvm_vcpu_stat {
> u32 dec_exits;
> u32 ext_intr_exits;
> u32 halt_successful_poll;
> +   u32 halt_attempted_poll;
> u32 halt_wakeup;
> u32 dbell_exits;
> u32 gdbell_exits;
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index d75bf325f54a..cf009167d208 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -53,6 +53,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
> { "ext_intr",VCPU_STAT(ext_intr_exits) },
> { "queue_intr",  VCPU_STAT(queue_intr) },
> { "halt_successful_poll", VCPU_STAT(halt_successful_poll), },
> +   { "halt_attempted_poll", VCPU_STAT(halt_attempted_poll), },
> { "halt_wakeup", VCPU_STAT(halt_wakeup) },
> { "pf_storage",  VCPU_STAT(pf_storage) },
> { "sp_storage",  VCPU_STAT(sp_storage) },
> diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
> index ae458f0fd061..fd5

Re: Questions about KVM TSC trapping

2015-09-25 Thread David Matlack
On Wed, Sep 23, 2015 at 5:55 PM, yangoliver <yang_oli...@hotmail.com> wrote:
>
> David,
>
> Sorry for late reply. See my inline comments.
>
>
> On Tue, 15 Sep 2015, David Matlack wrote:
>
>> On Tue, Sep 15, 2015 at 12:04 AM, Oliver Yang <yang_oli...@hotmail.com> 
>> wrote:
>> > Hi Guys,
>> >
>> > I found below patch for KVM TSC trapping / migration support,
>> >
>> > https://lkml.org/lkml/2011/1/6/90
>> >
>> > It seemed the patch were not merged in Linux mainline.
>> >
>> > So I have 3 questions here,
>> >
>> > 1.  Can KVM support TSC trapping today? If not, what is the plan?
>>
>> Not without a patch. Did you want to trap RDTSC? RDTSC works without
>> trapping thanks to hardware support.
> In fact, I think TSC trap is the only way to make TSC sync in guest OS.
>
> For exmaple, user space applications could use rdtsc directly, which
> may cause the problem if there is no TSC emulation.

Ah, I see. KVM does not provide an option to trap and emulate RDTSC to
provide a stable TSC. I do not know of any plans to change this.

>>
>> >
>> > 2. What is the solution if my SMP Linux guest OS doesn't have reliable
>> > TSC?
>>
>> If you are seeing an unreliable TSC in your guest, maybe your host
>> hardware does not support a synchronized TSC across CPUs. I can't
>> recall how to check for this. There might be a flag in your host's
>> /proc/cpuinfo.
>
> Yes. I knew how to check the hardware capabilitiy.
>
> My major question was whether KVM support TSC sync even running over
> TSC unreliable hardware.
>>
>> >
>> > Because the no TSC trapping support, will kvmclock driver handle all
>> > TSC sync issues?
>> >
>> > 3. What if my Linux guest doesn't have kvmclock driver?
>>
>> The guest will use a different clock source (e.g. acpi-pm). Note the
>> RDTSC[P] instruction will still work just fine.
>
> The major problem is user application, as my application calls rstsc under
> multi-thread environment.
>
>> >
>> > Does that mean I shouldn't run TSC sensitive application in my guest
>> > OS?
>> >
>> > BTW, my application is written with lots of rdtsc instructions, and
>> > which performs well in VMware guest.
>>
>> Does it not work well in KVM?
>
> Not quite sure. We plan to run our application over KVM.
> VMware ESX5.x supported TSC sync even underlying hardware TSC reliable,
> so I'm just wondering KVM cases.


[PATCH kvm-unit-tests] x86: always inline functions called after set_exception_return

2015-12-07 Thread David Matlack
set_exception_return forces exception handlers to return to a specific
address instead of returning to the instruction address pushed by the
CPU at the time of the exception. The unit tests apic.c and vmx.c use
this functionality to recover from expected exceptions.

When using set_exception_return we have to be careful not to modify the
stack (such as by doing a function call) as triggering the exception will
likely jump us past the instructions which undo the stack manipulation
(such as a ret). To accomplish this, declare all functions called after
set_exception_return as __always_inline, so that the compiler always
inlines them.

Signed-off-by: David Matlack <dmatl...@google.com>
---
 lib/libcflat.h  | 4 
 lib/x86/processor.h | 2 +-
 x86/vmx.c   | 4 ++--
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/lib/libcflat.h b/lib/libcflat.h
index 9747ccd..9ffb5db 100644
--- a/lib/libcflat.h
+++ b/lib/libcflat.h
@@ -27,6 +27,10 @@
 
 #define __unused __attribute__((__unused__))
 
+#ifndef __always_inline
+# define __always_inline inline __attribute__((always_inline))
+#endif
+
 #define xstr(s) xxstr(s)
 #define xxstr(s) #s
 
diff --git a/lib/x86/processor.h b/lib/x86/processor.h
index 95cea1a..c4bc64f 100644
--- a/lib/x86/processor.h
+++ b/lib/x86/processor.h
@@ -149,7 +149,7 @@ static inline u64 rdmsr(u32 index)
 return a | ((u64)d << 32);
 }
 
-static inline void wrmsr(u32 index, u64 val)
+static __always_inline void wrmsr(u32 index, u64 val)
 {
 u32 a = val, d = val >> 32;
 asm volatile ("wrmsr" : : "a"(a), "d"(d), "c"(index) : "memory");
diff --git a/x86/vmx.c b/x86/vmx.c
index f05cd33..28cd349 100644
--- a/x86/vmx.c
+++ b/x86/vmx.c
@@ -117,7 +117,7 @@ static void __attribute__((__used__)) syscall_handler(u64 
syscall_no)
current->syscall_handler(syscall_no);
 }
 
-static inline int vmx_on()
+static __always_inline int vmx_on()
 {
bool ret;
u64 rflags = read_rflags() | X86_EFLAGS_CF | X86_EFLAGS_ZF;
@@ -126,7 +126,7 @@ static inline int vmx_on()
return ret;
 }
 
-static inline int vmx_off()
+static __always_inline int vmx_off()
 {
bool ret;
u64 rflags = read_rflags() | X86_EFLAGS_CF | X86_EFLAGS_ZF;
-- 
2.6.0.rc2.230.g3dd15c0



Re: [PATCH kvm-unit-tests 3/4] x86: remove test_for_exception

2015-12-15 Thread David Matlack
On Tue, Dec 15, 2015 at 2:25 AM, Paolo Bonzini  wrote:
> Test functions know whether an exception was generated simply by checking
> the last value returned by set_exception_jmpbuf.  The exception number is
> passed to set_exception_jmpbuf so that it can set up the exception handler.

I like the idea of test_for_exception; it makes the unit tests simpler. But
this version (and the previous one) requires the trigger function to be in on
the joke (either by calling set_exception_return, or now by calling
set_exception_jmpbuf and handle_exception(NULL)).

What do you think about this simplification?

bool test_for_exception(unsigned int ex, void (*trigger_func)(void *data),
                        void *data)
{
    jmp_buf jmpbuf;
    int ret;

    handle_exception(ex, exception_handler);
    ret = set_exception_jmpbuf(jmpbuf);
    if (ret == 0)
        trigger_func(data);
    handle_exception(ex, NULL);
    return ret;
}

Then trigger_func can be very simple, e.g.

static void do_write_apicbase(u64 data)
{
wrmsr(MSR_IA32_APICBASE, data);
}
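
With that, the call sites in apic.c can stay one-liners, e.g. (sketch; the
trigger would need to keep a void * parameter, or gain a thin wrapper, to be
passable to test_for_exception):

    report("x2apic enabled to invalid state",
           test_for_exception(GP_VECTOR, do_write_apicbase, &invalid_state));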

>
> Signed-off-by: Paolo Bonzini 
> ---
>  lib/x86/desc.c | 13 +
>  lib/x86/desc.h |  8 +++-
>  x86/apic.c | 34 +-
>  x86/vmx.c  | 29 +++--
>  4 files changed, 40 insertions(+), 44 deletions(-)
>
> diff --git a/lib/x86/desc.c b/lib/x86/desc.c
> index acf29e3..acf66b6 100644
> --- a/lib/x86/desc.c
> +++ b/lib/x86/desc.c
> @@ -315,7 +315,6 @@ void setup_alt_stack(void)
>  }
>  #endif
>
> -static bool exception;
>  static jmp_buf *exception_jmpbuf;
>
>  static void exception_handler_longjmp(void)
> @@ -326,21 +325,11 @@ static void exception_handler_longjmp(void)
>  static void exception_handler(struct ex_regs *regs)
>  {
> /* longjmp must happen after iret, so do not do it now.  */
> -   exception = true;
> regs->rip = (unsigned long)&exception_handler_longjmp;
>  }
>
> -bool test_for_exception(unsigned int ex, void (*trigger_func)(void *data),
> -   void *data)
> +void __set_exception_jmpbuf(unsigned int ex, jmp_buf *addr)
>  {
> handle_exception(ex, exception_handler);
> -   exception = false;
> -   trigger_func(data);
> -   handle_exception(ex, NULL);
> -   return exception;
> -}
> -
> -void __set_exception_jmpbuf(jmp_buf *addr)
> -{
> exception_jmpbuf = addr;
>  }
> diff --git a/lib/x86/desc.h b/lib/x86/desc.h
> index be52fd4..fceabd8 100644
> --- a/lib/x86/desc.h
> +++ b/lib/x86/desc.h
> @@ -155,10 +155,8 @@ void set_intr_alt_stack(int e, void *fn);
>  void print_current_tss_info(void);
>  void handle_exception(u8 v, void (*func)(struct ex_regs *regs));
>
> -bool test_for_exception(unsigned int ex, void (*trigger_func)(void *data),
> -   void *data);
> -void __set_exception_jmpbuf(jmp_buf *addr);
> -#define set_exception_jmpbuf(jmpbuf) \
> -   (setjmp(jmpbuf) ? : (__set_exception_jmpbuf(&(jmpbuf)), 0))
> +void __set_exception_jmpbuf(unsigned int ex, jmp_buf *addr);
> +#define set_exception_jmpbuf(ex, jmpbuf) \
> +   (setjmp(jmpbuf) ? : (__set_exception_jmpbuf((ex), &(jmpbuf)), 0))
>
>  #endif
> diff --git a/x86/apic.c b/x86/apic.c
> index 2e04c82..de19724 100644
> --- a/x86/apic.c
> +++ b/x86/apic.c
> @@ -61,12 +61,18 @@ static void test_tsc_deadline_timer(void)
>  }
>  }
>
> -static void do_write_apicbase(void *data)
> +static bool do_write_apicbase(u64 data)
>  {
>  jmp_buf jmpbuf;
> -if (set_exception_jmpbuf(jmpbuf) == 0) {
> -wrmsr(MSR_IA32_APICBASE, *(u64 *)data);
> +int ret;
> +if (set_exception_jmpbuf(GP_VECTOR, jmpbuf) == 0) {
> +wrmsr(MSR_IA32_APICBASE, data);
> +ret = 0;
> +} else {
> +ret = 1;
>  }
> +handle_exception(GP_VECTOR, NULL);
> +return ret;
>  }
>
>  void test_enable_x2apic(void)
> @@ -80,24 +86,19 @@ void test_enable_x2apic(void)
>  printf("x2apic enabled\n");
>
>  report("x2apic enabled to invalid state",
> -   test_for_exception(GP_VECTOR, do_write_apicbase,
> -  &invalid_state));
> +   do_write_apicbase(invalid_state));
>  report("x2apic enabled to apic enabled",
> -   test_for_exception(GP_VECTOR, do_write_apicbase,
> -  &apic_enabled));
> +   do_write_apicbase(apic_enabled));
>
>  wrmsr(MSR_IA32_APICBASE, APIC_DEFAULT_PHYS_BASE | APIC_BSP);
>  report("disabled to invalid state",
> -   test_for_exception(GP_VECTOR, do_write_apicbase,
> -  &invalid_state));
> +   do_write_apicbase(invalid_state));
>  report("disabled to x2apic enabled",
> -   test_for_exception(GP_VECTOR, do_write_apicbase,
> -  &x2apic_enabled));
> +   do_write_apicbase(x2apic_enabled));
>
>  wrmsr(MSR_IA32_APICBASE, 

Re: [PATCH kvm-unit-tests] x86: always inline functions called after set_exception_return

2015-12-11 Thread David Matlack
On Wed, Dec 9, 2015 at 7:02 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> On 07/12/2015 21:36, David Matlack wrote:
>> set_exception_return forces exceptions handlers to return to a specific
>> address instead of returning to the instruction address pushed by the
>> CPU at the time of the exception. The unit tests apic.c and vmx.c use
>> this functionality to recover from expected exceptions.
>>
>> When using set_exception_return we have to be careful not to modify the
>> stack (such as by doing a function call) as triggering the exception will
>> likely jump us past the instructions which undo the stack manipulation
>> (such as a ret). To accomplish this, declare all functions called after
>> set_exception_return as __always_inline, so that the compiler always
>> inlines them.
>
> set_exception_return is generally not a great idea IMHO---thanks for
> looking at it.

Yup. This is a band-aid just to fix the current implementation.

>
> A couple years ago we discussed adding setjmp/longjmp to libcflat
> (http://www.spinics.net/lists/kvm/msg94159.html which is however missing
> a 32-bit version).  Making the exceptions do a longjmp would be a much
> safer option.

Good idea! I might give this a try, but don't hold your breath :)

>
> Paolo


Re: QUESTION: mmu.gva_to_gpa versus nested_mmu.gva_to_gpa

2015-12-29 Thread David Matlack
On Mon, Dec 28, 2015 at 6:25 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>
>
> On 28/12/2015 23:23, David Matlack wrote:
>> I'm wondering if this comment in mmu.c:init_kvm_nested_mmu is correct (at
>> least in the context of Nested EPT):
>>
>> 4055 /*
>> 4056  * Note that arch.mmu.gva_to_gpa translates l2_gva to l1_gpa. 
>> The
>> 4057  * translation of l2_gpa to l1_gpa addresses is done using the
>> 4058  * arch.nested_mmu.gva_to_gpa function. Basically the gva_to_gpa
>> 4059  * functions between mmu and nested_mmu are swapped.
>> 4060  */
>>
>> nested_mmu.get_cr3 gets set to get_cr3, which I believe will return L2's cr3.
>> In vmx.c:nested_ept_init_mmu_context, mmu.get_cr3 is set to
>> nested_ept_get_cr3, which should be the root of EPT12. Given these get_cr3
>> functions, shouldn't nested_mmu.gva_to_gpa translate l2_gva->l2_gpa and

I think I got this wrong. walk_addr_generic uses translate_gpa to convert
the l2_gpa into its l1_gpa address at the end of the translation. So
nested_mmu.gva_to_gpa should translate l2_gva to l1_gpa?
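
So the two translations, as I currently picture them (a sketch of my mental
model, not verified against the code):

    /*
     * nested_mmu.gva_to_gpa: l2_gva --(L2 page tables)--> l2_gpa
     *                               --(translate_gpa, i.e. L1's EPT12)--> l1_gpa
     *
     * mmu.gva_to_gpa:        l2_gpa --(L1's EPT12)--> l1_gpa
     */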

>> mmu.gva_to_gpa translate l2_gpa->l1_gpa?
>
> Yes, it's correct.  It can be trivially seen by looking at
> kvm_init_shadow_ept_mmu's usage of &vcpu->arch.mmu.  This is obviously a
> l2_gpa to l1_gpa translation.

If vcpu->arch.mmu.gva_to_gpa is a l2_gpa to l1_gpa translation, then the
comment is incorrect... right? The comment says "arch.mmu.gva_to_gpa
translates l2_gva to l1_gpa" and "l2_gpa to l1_gpa addresses is done
using the arch.nested_mmu.gva_to_gpa".

>
> Whether the roles are swapped, depends on whether you think of
> "nested_mmu" as "nested guest" or "nested virtualization"  nested_mmu is
> the MMU for the nested guest, mmu is the MMU for the L1 guest and it's
> the one that takes care of nested virtualization.
>
> Paolo


[PATCH] kvm: x86: fix comment about {mmu,nested_mmu}.gva_to_gpa

2015-12-30 Thread David Matlack
The comment had the meaning of mmu.gva_to_gpa and nested_mmu.gva_to_gpa
swapped. Fix that, and also add some details describing how each translation
works.

Signed-off-by: David Matlack <dmatl...@google.com>
---
 arch/x86/kvm/mmu.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e7c2c14..098a9c2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4058,10 +4058,12 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
g_context->inject_page_fault = kvm_inject_page_fault;
 
/*
-* Note that arch.mmu.gva_to_gpa translates l2_gva to l1_gpa. The
-* translation of l2_gpa to l1_gpa addresses is done using the
-* arch.nested_mmu.gva_to_gpa function. Basically the gva_to_gpa
-* functions between mmu and nested_mmu are swapped.
+* Note that arch.mmu.gva_to_gpa translates l2_gpa to l1_gpa using
+* L1's nested page tables (e.g. EPT12). The nested translation
+* of l2_gva to l1_gpa is done by arch.nested_mmu.gva_to_gpa using
+* L2's page tables as the first level of translation and L1's
+* nested page tables as the second level of translation. Basically
+* the gva_to_gpa functions between mmu and nested_mmu are swapped.
 */
if (!is_paging(vcpu)) {
g_context->nx = false;
-- 
2.6.0.rc2.230.g3dd15c0



QUESTION: mmu.gva_to_gpa versus nested_mmu.gva_to_gpa

2015-12-28 Thread David Matlack
I'm wondering if this comment in mmu.c:init_kvm_nested_mmu is correct (at
least in the context of Nested EPT):

4055 /*
4056  * Note that arch.mmu.gva_to_gpa translates l2_gva to l1_gpa. The
4057  * translation of l2_gpa to l1_gpa addresses is done using the
4058  * arch.nested_mmu.gva_to_gpa function. Basically the gva_to_gpa
4059  * functions between mmu and nested_mmu are swapped.
4060  */

nested_mmu.get_cr3 gets set to get_cr3, which I believe will return L2's cr3.
In vmx.c:nested_ept_init_mmu_context, mmu.get_cr3 is set to
nested_ept_get_cr3, which should be the root of EPT12. Given these get_cr3
functions, shouldn't nested_mmu.gva_to_gpa translate l2_gva->l2_gpa and
mmu.gva_to_gpa translate l2_gpa->l1_gpa?


Re: QUESTION: mmu.gva_to_gpa versus nested_mmu.gva_to_gpa

2015-12-30 Thread David Matlack
On Wed, Dec 30, 2015 at 3:36 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>
>
> On 29/12/2015 17:37, David Matlack wrote:
>>> > Yes, it's correct.
>
> s/it's/you're/ :)

Ah ok. Thanks for your help!

I will send a patch to fix the comment then.

>
> Paolo