Re: [Qemu-devel] [PATCH] msix: fix interrupt aggregation problem at the passthrough of NVMe SSD

2019-04-10 Thread Zhuangyanying



> -Original Message-
> From: Michael S. Tsirkin [mailto:m...@redhat.com]
> Sent: Tuesday, April 09, 2019 11:04 PM
> To: Zhuangyanying 
> Cc: marcel.apfelb...@gmail.com; qemu-devel@nongnu.org; Gonglei (Arei)
> 
> Subject: Re: [PATCH] msix: fix interrupt aggregation problem at the
> passthrough of NVMe SSD
> 
> On Tue, Apr 09, 2019 at 02:14:56PM +, Zhuangyanying wrote:
> > From: Zhuang Yanying 
> >
> > Recently I tested the performance of NVMe SSD passthrough and found
> > that interrupts were aggregated on vcpu0 (or the first vcpu of each
> > NUMA node) according to /proc/interrupts, when the GuestOS was upgraded
> > to sles12sp3 (or redhat7.6). But /proc/irq/X/smp_affinity_list shows
> > that the interrupts are spread out, such as 0-10, 11-21, and so on.
> > This problem cannot be resolved by "echo X >
> > /proc/irq/X/smp_affinity_list", because the NVMe SSD interrupts are
> > requested via pci_alloc_irq_vectors(), so they carry the
> > IRQD_AFFINITY_MANAGED flag.
> >
> > GuestOS sles12sp3 backported "automatic interrupt affinity for MSI/MSI-X
> > capable devices", but the implementation of __setup_irq() was not
> > changed accordingly. It is still irq_startup() first, then
> > setup_affinity(), i.e. the affinity message is sent only after the
> > interrupt has been unmasked.
> 
> So does latest upstream still change data/address of an unmasked vector?
> 
The latest upstream works fine. Affinity configuration is successful.

The original order, as I understand it, is:

nvme_setup_io_queues()
  -> pci_alloc_irq_vectors_affinity()
       -> msi_domain_alloc_irqs()
            /* if IRQD_AFFINITY_MANAGED, then "mask = affinity" */
            -> ... -> __irq_alloc_descs()
                 -> ... -> desc_smp_init()
                      /* cpumask_copy(desc->irq_common_data.affinity, affinity); */
  -> request_threaded_irq()
       -> __setup_irq()
            -> irq_startup() -> msi_domain_activate()
                 -> irq_enable() -> pci_msi_unmask_irq()
            -> setup_affinity()
                 -> if (irqd_affinity_is_managed(&desc->irq_data))
                        set = desc->irq_common_data.affinity;
                    cpumask_and(mask, cpu_online_mask, set);
                 -> irq_do_set_affinity()
                      -> msi_domain_set_affinity()
                           -> __pci_write_msi_msg()   /* actually sets the affinity */

After commit e43b3b58 ("genirq/cpuhotplug: Enforce affinity setting on startup
of managed irqs"):

@@ -265,8 +265,8 @@ int irq_startup(struct irq_desc *desc, bool resend, bool force)
		irq_setup_affinity(desc);
		break;
	case IRQ_STARTUP_MANAGED:
+		irq_do_set_affinity(d, aff, false);
		ret = __irq_startup(desc);
-		irq_set_affinity_locked(d, aff, false);
		break;

First irq_do_set_affinity(), then __irq_startup(), i.e. the irq is unmasked
only after the affinity has been written.

> > The bare-metal configuration is successful, but qemu will not trigger
> > the msix update, so the affinity configuration fails.
> > Configuring the affinity through /proc/irq/X/smp_affinity_list is
> > implemented at apic_ack_edge(): the bitmap is stored in pending_mask,
> > then mask -> __pci_write_msi_msg() -> unmask. The ordering is
> > guaranteed, so that configuration takes effect.
> >
> > The GuestOS linux-master incorporates "genirq/cpuhotplug: Enforce
> > affinity setting on startup of managed irqs" to ensure that the
> > affinity is issued first and __irq_startup() runs afterwards for
> > managed interrupts. So there the configuration is successful.
> >
> > It now looks like sles12sp3 (up to sles15sp1, linux-4.12.x) and redhat7.6
> > (3.10.0-957.10.1) have not backported that patch yet.
> 
> Sorry - which patch?
genirq/cpuhotplug: Enforce affinity setting on startup of managed irqs
commit e43b3b58
> 
> > "if (is_masked == was_masked) return;" can it be removed at qemu?
> > What is the reason for this check?
> >
> > Signed-off-by: Zhuang Yanying 
> > ---
> >  hw/pci/msix.c | 4 
> >  1 file changed, 4 deletions(-)
> >
> > diff --git a/hw/pci/msix.c b/hw/pci/msix.c index 4e33641..e1ff533
> > 100644
> > --- a/hw/pc

[Qemu-devel] [PATCH] msix: fix interrupt aggregation problem at the passthrough of NVMe SSD

2019-04-09 Thread Zhuangyanying
From: Zhuang Yanying 

Recently I tested the performance of NVMe SSD passthrough and found that
interrupts were aggregated on vcpu0 (or the first vcpu of each NUMA node)
according to /proc/interrupts, when the GuestOS was upgraded to sles12sp3
(or redhat7.6). But /proc/irq/X/smp_affinity_list shows that the interrupts
are spread out, such as 0-10, 11-21, and so on.
This problem cannot be resolved by "echo X > /proc/irq/X/smp_affinity_list",
because the NVMe SSD interrupts are requested via pci_alloc_irq_vectors(),
so they carry the IRQD_AFFINITY_MANAGED flag.

GuestOS sles12sp3 backported "automatic interrupt affinity for MSI/MSI-X
capable devices", but the implementation of __setup_irq() was not changed
accordingly. It is still irq_startup() first, then setup_affinity(), i.e.
the affinity message is sent only after the interrupt has been unmasked.
On bare metal this configuration succeeds, but qemu will not trigger the
msix update, so the affinity configuration fails.
Configuring the affinity through /proc/irq/X/smp_affinity_list is
implemented at apic_ack_edge(): the bitmap is stored in pending_mask, then
mask -> __pci_write_msi_msg() -> unmask. The ordering is guaranteed, so
that configuration takes effect.
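For illustration only, the ordering behind that /proc path can be sketched
roughly as below (move_vector_masked() is a made-up name; the three calls are
real kernel primitives, but this is not the exact apic_ack_edge() call chain):

#include <linux/irq.h>
#include <linux/msi.h>

/*
 * Illustrative sketch: the deferred affinity change toggles the per-vector
 * mask around the message write, so QEMU's msix_handle_mask_update() sees
 * both transitions and fires the vector notifier that updates the KVM route.
 */
static void move_vector_masked(struct irq_data *d, struct msi_msg *msg)
{
	pci_msi_mask_irq(d);             /* QEMU observes the mask           */
	pci_write_msi_msg(d->irq, msg);  /* new address/data in MSI-X table  */
	pci_msi_unmask_irq(d);           /* unmask -> notifier -> KVM reroute */
}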

The GuestOS linux-master incorporates "genirq/cpuhotplug: Enforce affinity
setting on startup of managed irqs" to ensure that, for managed interrupts,
the affinity is issued first and __irq_startup() runs afterwards. So there
the configuration is successful.

It now looks like sles12sp3 (up to sles15sp1, linux-4.12.x) and redhat7.6
(3.10.0-957.10.1) have not backported that patch yet.
Can the check "if (is_masked == was_masked) return;" be removed from qemu?
What is the reason for this check?

Signed-off-by: Zhuang Yanying 
---
 hw/pci/msix.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index 4e33641..e1ff533 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -119,10 +119,6 @@ static void msix_handle_mask_update(PCIDevice *dev, int vector, bool was_masked)
 {
 bool is_masked = msix_is_masked(dev, vector);
 
-if (is_masked == was_masked) {
-return;
-}
-
 msix_fire_vector_notifier(dev, vector, is_masked);
 
 if (!is_masked && msix_is_pending(dev, vector)) {
-- 
1.8.3.1





[Qemu-devel] [PATCH v2 2/3] KVM: MMU: introduce kvm_mmu_write_protect_all_pages

2019-01-24 Thread Zhuangyanying
From: Xiao Guangrong 

The original idea is from Avi. kvm_mmu_write_protect_all_pages() is
extremely fast at write protecting all the guest memory. Compared with
the ordinary algorithm, which write protects last-level sptes based on
the rmap one by one, it simply updates the generation number to ask all
vCPUs to reload their root page tables; in particular, this can be done
outside of mmu-lock, so it does not hurt the vMMU's parallelism. It is
an O(1) algorithm which does not depend on the capacity of the guest's
memory or the number of the guest's vCPUs.

When reloading its root page table, the vCPU checks the root page table's
generation number against the current global number; if they do not match,
it makes all the entries in the root page readonly and goes directly back
into the VM. So read access still proceeds smoothly without KVM's
involvement, while write access triggers a page fault. KVM then moves the
write protection from the upper level to the lower level page - by making
all the entries in the lower page readonly first and then making the upper
level entry writable; this operation is repeated until we reach the last
spte.
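A condensed sketch of that generation check, using only names introduced by
this patch (illustrative only; the real walk lives in mmu_load_shadow_page(),
also added by this patch):

/*
 * Sketch: decide whether a shadow page linked before the last
 * write-protect-all request still needs its entries made readonly.
 */
static bool sp_write_protect_gen_stale(struct kvm *kvm, struct kvm_mmu_page *sp)
{
	u64 gen;

	if (!get_write_protect_all_indicator(kvm, &gen))
		return false;			/* write-protect-all disabled */

	return sp->mmu_write_protect_all_gen != gen;	/* stale generation */
}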

Note that the number of page faults and TLB flushes is the same as in the
ordinary algorithm. During our test, for a VM which has 3G memory and 12
vCPUs, we benchmarked the performance of pure memory writes after write
protection and noticed only a 3% drop. However, we also benchmarked the
case of running a pure memory-write test in a newly booted VM (i.e. it
will trigger #PF to map memory) while live migration is going on, and we
noticed the dirty page ratio increased by ~50%, which means the memory
performance is hugely improved during live migration.

Signed-off-by: Xiao Guangrong 
Signed-off-by: Zhuang Yanying 
---
 arch/x86/include/asm/kvm_host.h |  19 +
 arch/x86/kvm/mmu.c  | 172 ++--
 arch/x86/kvm/mmu.h  |   1 +
 arch/x86/kvm/paging_tmpl.h  |  13 ++-
 4 files changed, 196 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6633b40..3d4231b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -338,6 +338,13 @@ struct kvm_mmu_page {
/* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen.  */
unsigned long mmu_valid_gen;
 
+   /*
+* The generation number of write protection for all guest memory
+* which is synced with kvm_arch.mmu_write_protect_all_indicator
+* whenever it is linked into upper entry.
+*/
+   u64 mmu_write_protect_all_gen;
+
DECLARE_BITMAP(unsync_child_bitmap, KVM_MMU_SP_ENTRY_NR);
 
DECLARE_BITMAP(possible_writable_spte_bitmap, KVM_MMU_SP_ENTRY_NR);
@@ -850,6 +857,18 @@ struct kvm_arch {
unsigned int n_max_mmu_pages;
unsigned int indirect_shadow_pages;
unsigned long mmu_valid_gen;
+
+   /*
+* The indicator of write protection for all guest memory.
+*
+* The top bit indicates if the write-protect is enabled,
+* remaining bits are used as a generation number which is
+* increased whenever write-protect is enabled.
+*
+* The enable bit and generation number are squeezed into
+* a single u64 so that it can be accessed atomically.
+*/
+   u64 mmu_write_protect_all_indicator;
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
/*
 * Hash table of struct kvm_mmu_page.
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e8adafc..effae7a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -490,6 +490,29 @@ static void kvm_mmu_reset_all_pte_masks(void)
GENMASK_ULL(low_phys_bits - 1, PAGE_SHIFT);
 }
 
+/* see the comments in struct kvm_arch. */
+#define WP_ALL_ENABLE_BIT  (63)
+#define WP_ALL_ENABLE_MASK (1ull << WP_ALL_ENABLE_BIT)
+#define WP_ALL_GEN_MASK	(~0ull & ~WP_ALL_ENABLE_MASK)
+
+/* should under mmu_lock */
+static bool get_write_protect_all_indicator(struct kvm *kvm, u64 *generation)
+{
+   u64 indicator = kvm->arch.mmu_write_protect_all_indicator;
+
+   *generation = indicator & WP_ALL_GEN_MASK;
+   return !!(indicator & WP_ALL_ENABLE_MASK);
+}
+
+static void
+set_write_protect_all_indicator(struct kvm *kvm, bool enable, u64 generation)
+{
+   u64 value = (u64)(!!enable) << WP_ALL_ENABLE_BIT;
+
+   value |= generation & WP_ALL_GEN_MASK;
+   kvm->arch.mmu_write_protect_all_indicator = value;
+}
+
 static int is_cpuid_PSE36(void)
 {
return 1;
@@ -2532,6 +2555,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu,
flush |= kvm_sync_pages(vcpu, gfn, &invalid_list);
}
sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
+   get_write_protect_all_indicator(vcpu->kvm,
+   &sp->mmu_write_protect_all_gen);
clear_page(sp->spt);
trace_kvm_mmu_get_page(sp, true);
 
@@ 

[Qemu-devel] [PATCH v2 3/3] KVM: MMU: fast cleanup D bit based on fast write protect

2019-01-24 Thread Zhuangyanying
From: Zhuang Yanying 

When live-migrating guests with large memory, vcpus may hang for a long
time while starting migration, e.g. 9s for 2T
(linux-5.0.0-rc2+qemu-3.1.0).
The reason is that memory_global_dirty_log_start() takes too long and the
vcpus are waiting for the BQL. The page-by-page D bit clearing is the main
time consumer. I think the idea of "KVM: MMU: fast write protect" by
xiaoguangrong, especially the function kvm_mmu_write_protect_all_pages(),
is very helpful. With a small modification on top of his patch, it can
solve this problem, going from 9s to 0.5s.

At the beginning of live migration, write protection is only applied to
the top-level SPTE. Writes from the VM then trigger EPT violations, and
write protection is pushed down level by level with for_each_shadow_entry
in __direct_map(). Finally the Dirty bit of the target page (at the level 1
page table) is cleared and dirty page tracking starts. The page where the
GPA is located is marked dirty in mmu_set_spte().

Signed-off-by: Zhuang Yanying 
---
 arch/x86/kvm/mmu.c | 6 +-
 arch/x86/kvm/vmx/vmx.c | 5 ++---
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index effae7a..ac7a994 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3230,7 +3230,10 @@ static bool mmu_load_shadow_page(struct kvm *kvm, struct 
kvm_mmu_page *sp)
break;
 
if (is_last_spte(spte, sp->role.level)) {
-   flush |= spte_write_protect(sptep, false);
+   if (sp->role.level == PT_PAGE_TABLE_LEVEL)
+   flush |= spte_clear_dirty(sptep);
+   else
+   flush |= spte_write_protect(sptep, false);
continue;
}
 
@@ -6106,6 +6109,7 @@ void kvm_mmu_write_protect_all_pages(struct kvm *kvm, 
bool write_protect)
kvm_reload_remote_mmus(kvm);
spin_unlock(&kvm->mmu_lock);
 }
+EXPORT_SYMBOL_GPL(kvm_mmu_write_protect_all_pages);
 
 static unsigned long
 mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f6915f1..540ec21 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7180,14 +7180,13 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 static void vmx_slot_enable_log_dirty(struct kvm *kvm,
 struct kvm_memory_slot *slot)
 {
-   kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
-   kvm_mmu_slot_largepage_remove_write_access(kvm, slot);
+   kvm_mmu_write_protect_all_pages(kvm, true);
 }
 
 static void vmx_slot_disable_log_dirty(struct kvm *kvm,
   struct kvm_memory_slot *slot)
 {
-   kvm_mmu_slot_set_dirty(kvm, slot);
+   kvm_mmu_write_protect_all_pages(kvm, false);
 }
 
 static void vmx_flush_log_dirty(struct kvm *kvm)
-- 
1.8.3.1





[Qemu-devel] [PATCH v2 1/3] KVM: MMU: introduce possible_writable_spte_bitmap

2019-01-24 Thread Zhuangyanying
From: Xiao Guangrong 

It is used to track possible writable sptes on a shadow page: the bit is
set to 1 for sptes that are already writable or that can be locklessly
updated to writable on the fast_page_fault path. A counter for the number
of possible writable sptes is also introduced to speed up walking the
bitmap.

This works very efficiently, as usually only one entry in the PML4
(< 512G), a few entries in the PDPT (one entry covers 1G of memory), and
the PDEs and PTEs need to be write protected in the worst case.

A later patch will get good performance by using this bitmap and counter
to quickly figure out the writable sptes and write protect them.
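A condensed sketch of the intended use (the actual consumer arrives in a
later patch; the helper below is illustrative only, built from the fields
added here plus the existing spte_write_protect() in mmu.c):

/*
 * Sketch: write protect only the entries of a shadow page that can
 * actually be writable, instead of scanning all KVM_MMU_SP_ENTRY_NR slots.
 */
static void wp_possible_writable_sptes(struct kvm_mmu_page *sp)
{
	unsigned int i;

	if (!sp->possible_writable_sptes)
		return;				/* nothing can be writable */

	for_each_set_bit(i, sp->possible_writable_spte_bitmap,
			 KVM_MMU_SP_ENTRY_NR)
		spte_write_protect(sp->spt + i, false);
}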

Signed-off-by: Xiao Guangrong 
Signed-off-by: Zhuang Yanying 
---
 arch/x86/include/asm/kvm_host.h |  5 +++-
 arch/x86/kvm/mmu.c  | 53 +++--
 2 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4660ce9..6633b40 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -128,6 +128,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, 
int level)
 #define KVM_MIN_ALLOC_MMU_PAGES 64
 #define KVM_MMU_HASH_SHIFT 12
 #define KVM_NUM_MMU_PAGES (1 << KVM_MMU_HASH_SHIFT)
+#define KVM_MMU_SP_ENTRY_NR 512
 #define KVM_MIN_FREE_MMU_PAGES 5
 #define KVM_REFILL_PAGES 25
 #define KVM_MAX_CPUID_ENTRIES 80
@@ -331,13 +332,15 @@ struct kvm_mmu_page {
gfn_t *gfns;
int root_count;  /* Currently serving as active root */
unsigned int unsync_children;
+   unsigned int possible_writable_sptes;
struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
 
/* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen.  */
unsigned long mmu_valid_gen;
 
-   DECLARE_BITMAP(unsync_child_bitmap, 512);
+   DECLARE_BITMAP(unsync_child_bitmap, KVM_MMU_SP_ENTRY_NR);
 
+   DECLARE_BITMAP(possible_writable_spte_bitmap, KVM_MMU_SP_ENTRY_NR);
 #ifdef CONFIG_X86_32
/*
 * Used out of the mmu-lock to avoid reading spte values while an
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ce770b4..e8adafc 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -718,6 +718,49 @@ static bool is_dirty_spte(u64 spte)
return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
 }
 
+static bool is_possible_writable_spte(u64 spte)
+{
+   if (!is_shadow_present_pte(spte))
+   return false;
+
+   if (is_writable_pte(spte))
+   return true;
+
+   if (spte_can_locklessly_be_made_writable(spte))
+   return true;
+
+   /*
+* although is_access_track_spte() sptes can be updated out of
+* mmu-lock, we need not take them into account as access_track
+* drops writable bit for them
+*/
+   return false;
+}
+
+static void
+mmu_log_possible_writable_spte(u64 *sptep, u64 old_spte, u64 new_spte)
+{
+   struct kvm_mmu_page *sp = page_header(__pa(sptep));
+   bool old_state, new_state;
+
+   old_state = is_possible_writable_spte(old_spte);
+   new_state = is_possible_writable_spte(new_spte);
+
+   if (old_state == new_state)
+   return;
+
+   /* a possible writable spte is dropped */
+   if (old_state) {
+   sp->possible_writable_sptes--;
+   __clear_bit(sptep - sp->spt, sp->possible_writable_spte_bitmap);
+   return;
+   }
+
+   /* a new possible writable spte is set */
+   sp->possible_writable_sptes++;
+   __set_bit(sptep - sp->spt, sp->possible_writable_spte_bitmap);
+}
+
 /* Rules for using mmu_spte_set:
  * Set the sptep from nonpresent to present.
  * Note: the sptep being assigned *must* be either not present
@@ -728,6 +771,7 @@ static void mmu_spte_set(u64 *sptep, u64 new_spte)
 {
WARN_ON(is_shadow_present_pte(*sptep));
__set_spte(sptep, new_spte);
+   mmu_log_possible_writable_spte(sptep, 0ull, new_spte);
 }
 
 /*
@@ -751,7 +795,7 @@ static u64 mmu_spte_update_no_track(u64 *sptep, u64 
new_spte)
old_spte = __update_clear_spte_slow(sptep, new_spte);
 
WARN_ON(spte_to_pfn(old_spte) != spte_to_pfn(new_spte));
-
+   mmu_log_possible_writable_spte(sptep, old_spte, new_spte);
return old_spte;
 }
 
@@ -817,6 +861,8 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
else
old_spte = __update_clear_spte_slow(sptep, 0ull);
 
+   mmu_log_possible_writable_spte(sptep, old_spte, 0ull);
+
if (!is_shadow_present_pte(old_spte))
return 0;
 
@@ -845,7 +891,10 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
  */
 static void mmu_spte_clear_no_track(u64 *sptep)
 {
+   u64 old_spte = *sptep;
+
__update_clear_spte_fast(sptep, 0ull);
+   mmu_log_possible_writable_spte(sptep, old_spte, 0ull);
 }
 
 static u64 mmu_spte_get_lockless(u64 *sptep)
@@ -2140,7 

[Qemu-devel] [PATCH v2 0/3] KVM: MMU: fast cleanup D bit based on fast write protect

2019-01-24 Thread Zhuangyanying
From: Zhuang yanying 

When live-migrating guests with large memory, vcpus may hang for a long
time while starting migration, e.g. 9s for 2T
(linux-5.0.0-rc2+qemu-3.1.0).
The reason is that memory_global_dirty_log_start() takes too long and the
vcpus are waiting for the BQL. The page-by-page D bit clearing is the main
time consumer. I think the idea of "KVM: MMU: fast write protect" by
xiaoguangrong, especially the function kvm_mmu_write_protect_all_pages(),
is very helpful. With a small modification on top of his patch, it can
solve this problem, going from 9s to 0.5s.

At the beginning of live migration, write protection is only applied to
the top-level SPTE. Writes from the VM then trigger EPT violations, and
write protection is pushed down level by level with for_each_shadow_entry
in __direct_map(). Finally the Dirty bit of the target page (at the level 1
page table) is cleared and dirty page tracking starts. The page where the
GPA is located is marked dirty in mmu_set_spte().
A similar implementation exists on xen, just with emt instead of write
protection.

Xiao Guangrong (2):
  KVM: MMU: introduce possible_writable_spte_bitmap
  KVM: MMU: introduce kvm_mmu_write_protect_all_pages

Zhuang Yanying (1):
  KVM: MMU: fast cleanup D bit based on fast write protect

 arch/x86/include/asm/kvm_host.h |  24 -
 arch/x86/kvm/mmu.c  | 229 ++--
 arch/x86/kvm/mmu.h  |   1 +
 arch/x86/kvm/paging_tmpl.h  |  13 ++-
 arch/x86/kvm/vmx/vmx.c  |   5 +-
 5 files changed, 257 insertions(+), 15 deletions(-)

--
v1 -> v2:

  - drop "KVM: MMU: correct the behavior of mmu_spte_update_no_track"
  - mmu_write_protect_all_indicator is no longer an atomic variable,
protected by mmu_lock
  - Implement kvm_mmu_slot_set_dirty with kvm_mmu_write_protect_all_pages
  - some modification on the commit messages
--
1.8.3.1





Re: [Qemu-devel] [PATCH 4/4] KVM: MMU: fast cleanup D bit based on fast write protect

2019-01-23 Thread Zhuangyanying



> -Original Message-
> From: Sean Christopherson [mailto:sean.j.christopher...@intel.com]
> Sent: Tuesday, January 22, 2019 11:17 PM
> To: Zhuangyanying 
> Cc: xiaoguangr...@tencent.com; pbonz...@redhat.com; Gonglei (Arei)
> ; qemu-devel@nongnu.org; k...@vger.kernel.org;
> wangxin (U) ; Liujinsong (Paul)
> ; Zhoujian (jay) ;
> xiaoguangrong.e...@gmail.com
> Subject: Re: [PATCH 4/4] KVM: MMU: fast cleanup D bit based on fast write
> protect
> 
> On Mon, Jan 21, 2019 at 06:37:36AM +, Zhuangyanying wrote:
> >
> > > > u64 wp_all_indicator, kvm_wp_all_gen;
> > > >
> > > > -   mutex_lock(&kvm->slots_lock);
> > > > wp_all_indicator = get_write_protect_all_indicator(kvm);
> > > > kvm_wp_all_gen = get_write_protect_all_gen(wp_all_indicator);
> > > >
> > > > @@ -6134,8 +6136,8 @@ void
> kvm_mmu_write_protect_all_pages(struct
> > > kvm *kvm, bool write_protect)
> > > >  */
> > > > if (write_protect)
> > > > kvm_reload_remote_mmus(kvm);
> > > > -   mutex_unlock(&kvm->slots_lock);
> > >
> > > Why is the lock removed?  And why was it added in the first place?
> > >
> > The original purpose of fast write protect is to implement lock-free
> > get_dirty_log, kvm_mmu_write_protect_all_pages is a stand-alone kvm
> > API. See
> > https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg04370.html
> > A total of 7 patches, I only need the first 3 patches to achieve
> > step-by-step page table traversal. In order to maintain the integrity
> > of the xiaoguangrong patch, I did not directly modify it on his patch.
> 
> That's not a sufficient argument for adding locking and removing it one patch
> later.
> 
> > > >  }
> > > > +EXPORT_SYMBOL_GPL(kvm_mmu_write_protect_all_pages);
> > > >
> > > >  static unsigned long
> > > >  mmu_shrink_scan(struct shrinker *shrink, struct shrink_control
> > > > *sc) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > > index
> > > > f6915f1..5236a07 100644
> > > > --- a/arch/x86/kvm/vmx/vmx.c
> > > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > > @@ -7180,8 +7180,7 @@ static void vmx_sched_in(struct kvm_vcpu
> > > > *vcpu, int cpu)  static void vmx_slot_enable_log_dirty(struct kvm *kvm,
> > > >  struct kvm_memory_slot *slot)  {
> > > > -   kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
> > > > -   kvm_mmu_slot_largepage_remove_write_access(kvm, slot);
> > > > +   kvm_mmu_write_protect_all_pages(kvm, true);
> > >
> > > What's the purpose of having @write_protect if
> > > kvm_mmu_write_protect_all_pages() is only ever called to enable
> > > protection?  If there's no known scenario where write protection is
> > > explicitly disabled then there's no need for WP_ALL_ENABLE_MASK,
> > > i.e. a non-zero generation would indicate write protection is
> > > enabled.  That'd simplify the code and clean up the atomic usage.
> > >
> > In live migration, the large page split depends on the creation of
> > memslot->dirty_bitmap in __kvm_set_memory_region().
> > The interface between qemu and kvm enables the dirty log slot by slot.
> > In order to enable dirty page tracking for the entire vm,
> > kvm_mmu_write_protect_all_pages() has to be called multiple times; the
> > page table update requests can be merged thanks to the atomic usage.
> > This method is not elegant, but it works.
> > Creating all slots' dirty_bitmap in a single API, so that
> > kvm_mmu_write_protect_all_pages() is called only once, would need more
> > implementation changes, even in qemu.
> 
> Calling kvm_mmu_write_protect_all_pages() multiple times is fine.  My
> question was regarding the 'write_protect' parameter.  If all callers always
> pass %true for 'write_protect' then why does the parameter exist?
> And eliminating the parameter means you don't need an 'enable' flag buried in
> the generation, which would simplify the implementation.

In fact, when cancelling migration of a 2T VM, memory_global_dirty_log_stop()
holds the BQL for 12s: vmx_slot_disable_log_dirty() takes 4s and
kvm_mmu_zap_collapsible_sptes() takes 2s, doubled because SMM is enabled by
default.
What about:
static void vmx_slot_disable_log_dirty(struct kvm *kvm,
   struct kvm_memory_slot *slot)
{
+   kvm_mmu_write_protect_all_pages(kvm, false);
-   kvm_mmu_slot_set_dirty(kvm, slot);
}

Sorry, this patch is not complete. I will send patch v2 soon.

Best regards,
-Zhuang Yanying



Re: [Qemu-devel] [PATCH 4/4] KVM: MMU: fast cleanup D bit based on fast write protect

2019-01-20 Thread Zhuangyanying



> -Original Message-
> From: Sean Christopherson [mailto:sean.j.christopher...@intel.com]
> Sent: Friday, January 18, 2019 12:32 AM
> To: Zhuangyanying 
> Cc: xiaoguangr...@tencent.com; pbonz...@redhat.com; Gonglei (Arei)
> ; qemu-devel@nongnu.org; k...@vger.kernel.org;
> wangxin (U) ; Liujinsong (Paul)
> 
> Subject: Re: [PATCH 4/4] KVM: MMU: fast cleanup D bit based on fast write
> protect
> 
> On Thu, Jan 17, 2019 at 01:55:31PM +, Zhuangyanying wrote:
> > From: Zhuang Yanying 
> >
> > When live-migrating guests with large memory, vcpus may hang for a
> > long time while starting migration, e.g. 9s for 2T
> > (linux-5.0.0-rc2+qemu-3.1.0).
> > The reason is that memory_global_dirty_log_start() takes too long and
> > the vcpus are waiting for the BQL. The page-by-page D bit clearing is
> > the main time consumer. I think the idea of "KVM: MMU: fast write
> > protect" by xiaoguangrong, especially the function
> > kvm_mmu_write_protect_all_pages(), is very helpful. With a small
> > modification on top of his patch, it can solve this problem, going
> > from 9s to 0.5s.
> >
> > At the beginning of live migration, write protection is only applied
> > to the top-level SPTE. Writes from the VM then trigger EPT violations,
> > and write protection is pushed down level by level with
> > for_each_shadow_entry in __direct_map().
> > Finally the Dirty bit of the target page (at the level 1 page table)
> > is cleared and dirty page tracking starts. Of course, the page where
> > the GPA is located is marked dirty in mmu_set_spte().
> > A similar implementation exists on xen, just with emt instead of
> > write protection.
> >
> > Signed-off-by: Zhuang Yanying 
> > ---
> >  arch/x86/kvm/mmu.c | 8 +---
> >  arch/x86/kvm/vmx/vmx.c | 3 +--
> >  2 files changed, 6 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index
> > 047b897..a18bcc0 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -3257,7 +3257,10 @@ static bool mmu_load_shadow_page(struct kvm
> *kvm, struct kvm_mmu_page *sp)
> > break;
> >
> > if (is_last_spte(spte, sp->role.level)) {
> > -   flush |= spte_write_protect(sptep, false);
> > +   if (sp->role.level == PT_PAGE_TABLE_LEVEL)
> > +   flush |= spte_clear_dirty(sptep);
> > +   else
> > +   flush |= spte_write_protect(sptep, false);
> > continue;
> > }
> >
> > @@ -6114,7 +6117,6 @@ void kvm_mmu_write_protect_all_pages(struct
> kvm
> > *kvm, bool write_protect)  {
> > u64 wp_all_indicator, kvm_wp_all_gen;
> >
> > -   mutex_lock(&kvm->slots_lock);
> > wp_all_indicator = get_write_protect_all_indicator(kvm);
> > kvm_wp_all_gen = get_write_protect_all_gen(wp_all_indicator);
> >
> > @@ -6134,8 +6136,8 @@ void kvm_mmu_write_protect_all_pages(struct
> kvm *kvm, bool write_protect)
> >  */
> > if (write_protect)
> > kvm_reload_remote_mmus(kvm);
> > -   mutex_unlock(&kvm->slots_lock);
> 
> Why is the lock removed?  And why was it added in the first place?
> 
The original purpose of fast write protect is to implement a lock-free
get_dirty_log; kvm_mmu_write_protect_all_pages() is a stand-alone kvm API. See
https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg04370.html
That series has 7 patches in total; I only need the first 3 to achieve the
step-by-step page table traversal. In order to maintain the integrity of
xiaoguangrong's patches, I did not modify them directly.
> >  }
> > +EXPORT_SYMBOL_GPL(kvm_mmu_write_protect_all_pages);
> >
> >  static unsigned long
> >  mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index
> > f6915f1..5236a07 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -7180,8 +7180,7 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu,
> > int cpu)  static void vmx_slot_enable_log_dirty(struct kvm *kvm,
> >  struct kvm_memory_slot *slot)  {
> > -   kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
> > -   kvm_mmu_slot_largepage_remove_write_access(kvm, slot);
> > +   kvm_mmu_write_protect_all_pages(kvm, true);
> 
> What's the purpose of having @write_protect if
> kvm_mmu_write_protect_all_pages() is only ever called to enable
> protection?  If there's no known scenario where write protection is
> explicitly disabled then there's no need for WP_ALL_

[Qemu-devel] [PATCH 4/4] KVM: MMU: fast cleanup D bit based on fast write protect

2019-01-17 Thread Zhuangyanying
From: Zhuang Yanying 

When live-migrating guests with large memory, vcpus may hang for a long
time while starting migration, e.g. 9s for 2T
(linux-5.0.0-rc2+qemu-3.1.0).
The reason is that memory_global_dirty_log_start() takes too long and the
vcpus are waiting for the BQL. The page-by-page D bit clearing is the main
time consumer. I think the idea of "KVM: MMU: fast write protect" by
xiaoguangrong, especially the function kvm_mmu_write_protect_all_pages(),
is very helpful. With a small modification on top of his patch, it can
solve this problem, going from 9s to 0.5s.

At the beginning of live migration, write protection is only applied to
the top-level SPTE. Writes from the VM then trigger EPT violations, and
write protection is pushed down level by level with for_each_shadow_entry
in __direct_map(). Finally the Dirty bit of the target page (at the level 1
page table) is cleared and dirty page tracking starts. Of course, the page
where the GPA is located is marked dirty in mmu_set_spte().
A similar implementation exists on xen, just with emt instead of write
protection.

Signed-off-by: Zhuang Yanying 
---
 arch/x86/kvm/mmu.c | 8 +---
 arch/x86/kvm/vmx/vmx.c | 3 +--
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 047b897..a18bcc0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3257,7 +3257,10 @@ static bool mmu_load_shadow_page(struct kvm *kvm, struct 
kvm_mmu_page *sp)
break;
 
if (is_last_spte(spte, sp->role.level)) {
-   flush |= spte_write_protect(sptep, false);
+   if (sp->role.level == PT_PAGE_TABLE_LEVEL)
+   flush |= spte_clear_dirty(sptep);
+   else
+   flush |= spte_write_protect(sptep, false);
continue;
}
 
@@ -6114,7 +6117,6 @@ void kvm_mmu_write_protect_all_pages(struct kvm *kvm, 
bool write_protect)
 {
u64 wp_all_indicator, kvm_wp_all_gen;
 
-   mutex_lock(&kvm->slots_lock);
wp_all_indicator = get_write_protect_all_indicator(kvm);
kvm_wp_all_gen = get_write_protect_all_gen(wp_all_indicator);
 
@@ -6134,8 +6136,8 @@ void kvm_mmu_write_protect_all_pages(struct kvm *kvm, 
bool write_protect)
 */
if (write_protect)
kvm_reload_remote_mmus(kvm);
-   mutex_unlock(&kvm->slots_lock);
 }
+EXPORT_SYMBOL_GPL(kvm_mmu_write_protect_all_pages);
 
 static unsigned long
 mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f6915f1..5236a07 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7180,8 +7180,7 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 static void vmx_slot_enable_log_dirty(struct kvm *kvm,
 struct kvm_memory_slot *slot)
 {
-   kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
-   kvm_mmu_slot_largepage_remove_write_access(kvm, slot);
+   kvm_mmu_write_protect_all_pages(kvm, true);
 }
 
 static void vmx_slot_disable_log_dirty(struct kvm *kvm,
-- 
1.8.3.1





[Qemu-devel] [PATCH 2/4] KVM: MMU: introduce possible_writable_spte_bitmap

2019-01-17 Thread Zhuangyanying
From: Xiao Guangrong 

It is used to track possible writable sptes on a shadow page: the bit is
set to 1 for sptes that are already writable or that can be locklessly
updated to writable on the fast_page_fault path. A counter for the number
of possible writable sptes is also introduced to speed up walking the
bitmap.

A later patch will get good performance by using this bitmap and counter
to quickly figure out the writable sptes and write protect them.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h |  6 -
 arch/x86/kvm/mmu.c  | 53 -
 2 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4660ce9..5c30aa0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -128,6 +128,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, 
int level)
 #define KVM_MIN_ALLOC_MMU_PAGES 64
 #define KVM_MMU_HASH_SHIFT 12
 #define KVM_NUM_MMU_PAGES (1 << KVM_MMU_HASH_SHIFT)
+#define KVM_MMU_SP_ENTRY_NR 512
 #define KVM_MIN_FREE_MMU_PAGES 5
 #define KVM_REFILL_PAGES 25
 #define KVM_MAX_CPUID_ENTRIES 80
@@ -331,12 +332,15 @@ struct kvm_mmu_page {
gfn_t *gfns;
int root_count;  /* Currently serving as active root */
unsigned int unsync_children;
+   unsigned int possiable_writable_sptes;
struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
 
/* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen.  */
unsigned long mmu_valid_gen;
 
-   DECLARE_BITMAP(unsync_child_bitmap, 512);
+   DECLARE_BITMAP(unsync_child_bitmap, KVM_MMU_SP_ENTRY_NR);
+
+   DECLARE_BITMAP(possible_writable_spte_bitmap, KVM_MMU_SP_ENTRY_NR);
 
 #ifdef CONFIG_X86_32
/*
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index eeb3bac..9daab00 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -718,6 +718,49 @@ static bool is_dirty_spte(u64 spte)
return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
 }
 
+static bool is_possible_writable_spte(u64 spte)
+{
+   if (!is_shadow_present_pte(spte))
+   return false;
+
+   if (is_writable_pte(spte))
+   return true;
+
+   if (spte_can_locklessly_be_made_writable(spte))
+   return true;
+
+   /*
+* although is_access_track_spte() sptes can be updated out of
+* mmu-lock, we need not take them into account as access_track
+* drops writable bit for them
+*/
+   return false;
+}
+
+static void
+mmu_log_possible_writable_spte(u64 *sptep, u64 old_spte, u64 new_spte)
+{
+   struct kvm_mmu_page *sp = page_header(__pa(sptep));
+   bool old_state, new_state;
+
+   old_state = is_possible_writable_spte(old_spte);
+   new_state = is_possible_writable_spte(new_spte);
+
+   if (old_state == new_state)
+   return;
+
+   /* a possible writable spte is dropped */
+   if (old_state) {
+   sp->possiable_writable_sptes--;
+   __clear_bit(sptep - sp->spt, sp->possible_writable_spte_bitmap);
+   return;
+   }
+
+   /* a new possible writable spte is set */
+   sp->possiable_writable_sptes++;
+   __set_bit(sptep - sp->spt, sp->possible_writable_spte_bitmap);
+}
+
 /* Rules for using mmu_spte_set:
  * Set the sptep from nonpresent to present.
  * Note: the sptep being assigned *must* be either not present
@@ -728,6 +771,7 @@ static void mmu_spte_set(u64 *sptep, u64 new_spte)
 {
WARN_ON(is_shadow_present_pte(*sptep));
__set_spte(sptep, new_spte);
+   mmu_log_possible_writable_spte(sptep, 0ull, new_spte);
 }
 
 /*
@@ -746,6 +790,7 @@ static void mmu_spte_update_no_track(u64 *sptep, u64 
new_spte)
}
 
__update_clear_spte_fast(sptep, new_spte);
+   mmu_log_possible_writable_spte(sptep, old_spte, new_spte);
 }
 
 /*
@@ -771,6 +816,7 @@ static u64 mmu_spte_update_track(u64 *sptep, u64 new_spte)
 
WARN_ON(spte_to_pfn(old_spte) != spte_to_pfn(new_spte));
 
+   mmu_log_possible_writable_spte(sptep, old_spte, new_spte);
return old_spte;
 }
 
@@ -836,6 +882,8 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
else
old_spte = __update_clear_spte_slow(sptep, 0ull);
 
+   mmu_log_possible_writable_spte(sptep, old_spte, 0ull);
+
if (!is_shadow_present_pte(old_spte))
return 0;
 
@@ -864,7 +912,10 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
  */
 static void mmu_spte_clear_no_track(u64 *sptep)
 {
+   u64 old_spte = *sptep;
+
__update_clear_spte_fast(sptep, 0ull);
+   mmu_log_possible_writable_spte(sptep, old_spte, 0ull);
 }
 
 static u64 mmu_spte_get_lockless(u64 *sptep)
@@ -2159,7 +2210,7 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
 {
int i, ret, nr_unsync_leaf = 0;
 
-   for_each_set_bit(i, 

[Qemu-devel] [PATCH 1/4] KVM: MMU: correct the behavior of mmu_spte_update_no_track

2019-01-17 Thread Zhuangyanying
From: Xiao Guangrong 

The current behavior of mmu_spte_update_no_track() does not match the
_no_track() name, as the A/D bits are actually tracked and returned to
the caller.

This patch introduces a real _no_track() function that updates the spte
regardless of the A/D bits, and renames the original function to
_track().

The _no_track() function will be used by later patches to update upper
sptes, which indeed do not need to care about A/D bits.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ce770b4..eeb3bac 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -731,10 +731,29 @@ static void mmu_spte_set(u64 *sptep, u64 new_spte)
 }
 
 /*
- * Update the SPTE (excluding the PFN), but do not track changes in its
+ * Update the SPTE (excluding the PFN) regardless of accessed/dirty
+ * status which is used to update the upper level spte.
+ */
+static void mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
+{
+   u64 old_spte = *sptep;
+
+   WARN_ON(!is_shadow_present_pte(new_spte));
+
+   if (!is_shadow_present_pte(old_spte)) {
+   mmu_spte_set(sptep, new_spte);
+   return;
+   }
+
+   __update_clear_spte_fast(sptep, new_spte);
+}
+
+/*
+ * Update the SPTE (excluding the PFN), the original value is
+ * returned, based on it, the caller can track changes of its
  * accessed/dirty status.
  */
-static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
+static u64 mmu_spte_update_track(u64 *sptep, u64 new_spte)
 {
u64 old_spte = *sptep;
 
@@ -769,7 +788,7 @@ static u64 mmu_spte_update_no_track(u64 *sptep, u64 
new_spte)
 static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 {
bool flush = false;
-   u64 old_spte = mmu_spte_update_no_track(sptep, new_spte);
+   u64 old_spte = mmu_spte_update_track(sptep, new_spte);
 
if (!is_shadow_present_pte(old_spte))
return false;
-- 
1.8.3.1





[Qemu-devel] [PATCH 3/4] KVM: MMU: introduce kvm_mmu_write_protect_all_pages

2019-01-17 Thread Zhuangyanying
From: Xiao Guangrong 

The original idea is from Avi. kvm_mmu_write_protect_all_pages() is
extremely fast at write protecting all the guest memory. Compared with
the ordinary algorithm, which write protects last-level sptes based on
the rmap one by one, it simply updates the generation number to ask all
vCPUs to reload their root page tables; in particular, this can be done
outside of mmu-lock, so it does not hurt the vMMU's parallelism. It is
an O(1) algorithm which does not depend on the capacity of the guest's
memory or the number of the guest's vCPUs.

When reloading its root page table, the vCPU checks the root page table's
generation number against the current global number; if they do not match,
it makes all the entries in the root page readonly and goes directly back
into the VM. So read access still proceeds smoothly without KVM's
involvement, while write access triggers a page fault. KVM then moves the
write protection from the upper level to the lower level page - by making
all the entries in the lower page readonly first and then making the upper
level entry writable; this operation is repeated until we reach the last
spte.

In order to speed up the process of making all entries readonly, we
introduce possible_writable_spte_bitmap, which indicates the writable
sptes, and possiable_writable_sptes, a counter for the number of writable
sptes. This works very efficiently, as usually only one entry in the PML4
(< 512G), a few entries in the PDPT (one entry covers 1G of memory), and
the PDEs and PTEs need to be write protected in the worst case.
Note that the number of page faults and TLB flushes is the same as in the
ordinary algorithm. During our test, for a VM which has 3G memory and 12
vCPUs, we benchmarked the performance of pure memory writes after write
protection and noticed only a 3% drop. However, we also benchmarked the
case of running a pure memory-write test in a newly booted VM (i.e. it
will trigger #PF to map memory) while live migration is going on, and we
noticed the dirty page ratio increased by ~50%, which means the memory
performance is hugely improved during live migration.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h |  19 +
 arch/x86/kvm/mmu.c  | 179 ++--
 arch/x86/kvm/mmu.h  |   1 +
 arch/x86/kvm/paging_tmpl.h  |  13 ++-
 4 files changed, 204 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5c30aa0..a581ff4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -338,6 +338,13 @@ struct kvm_mmu_page {
/* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen.  */
unsigned long mmu_valid_gen;
 
+   /*
+* The generation number of write protection for all guest memory
+* which is synced with kvm_arch.mmu_write_protect_all_indicator
+* whenever it is linked into upper entry.
+*/
+   u64 mmu_write_protect_all_gen;
+
DECLARE_BITMAP(unsync_child_bitmap, KVM_MMU_SP_ENTRY_NR);
 
DECLARE_BITMAP(possible_writable_spte_bitmap, KVM_MMU_SP_ENTRY_NR);
@@ -851,6 +858,18 @@ struct kvm_arch {
unsigned int n_max_mmu_pages;
unsigned int indirect_shadow_pages;
unsigned long mmu_valid_gen;
+
+   /*
+* The indicator of write protection for all guest memory.
+*
+* The top bit indicates if the write-protect is enabled,
+* remaining bits are used as a generation number which is
+* increased whenever write-protect is enabled.
+*
+* The enable bit and generation number are squeezed into
+* a single u64 so that it can be accessed atomically.
+*/
+   atomic64_t mmu_write_protect_all_indicator;
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
/*
 * Hash table of struct kvm_mmu_page.
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9daab00..047b897 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -489,6 +489,34 @@ static void kvm_mmu_reset_all_pte_masks(void)
shadow_nonpresent_or_rsvd_lower_gfn_mask =
GENMASK_ULL(low_phys_bits - 1, PAGE_SHIFT);
 }
+/* see the comments in struct kvm_arch. */
+#define WP_ALL_ENABLE_BIT  (63)
+#define WP_ALL_ENABLE_MASK (1ull << WP_ALL_ENABLE_BIT)
+#define WP_ALL_GEN_MASK	(~0ull & ~WP_ALL_ENABLE_MASK)
+
+static bool is_write_protect_all_enabled(u64 indicator)
+{
+   return !!(indicator & WP_ALL_ENABLE_MASK);
+}
+
+static u64 get_write_protect_all_gen(u64 indicator)
+{
+   return indicator & WP_ALL_GEN_MASK;
+}
+
+static u64 get_write_protect_all_indicator(struct kvm *kvm)
+{
+   return atomic64_read(&kvm->arch.mmu_write_protect_all_indicator);
+}
+
+static void
+set_write_protect_all_indicator(struct kvm *kvm, bool enable, u64 generation)
+{
+   u64 value = (u64)(!!enable) << WP_ALL_ENABLE_BIT;
+
+   value |= generation & WP_ALL_GEN_MASK;
+   

[Qemu-devel] [PATCH 0/4] KVM: MMU: fast cleanup D bit based on fast write protect

2019-01-17 Thread Zhuangyanying
From: Zhuang Yanying 

Recently I tested live migration with large-memory guests and found that
vcpus may hang for a long time while starting migration, e.g. 9s for
2048G (linux-5.0.0-rc2+qemu-3.1.0).
The reason is that memory_global_dirty_log_start() takes too long and the
vcpus are waiting for the BQL. The page-by-page D bit clearing is the main
time consumer. I think the idea of "KVM: MMU: fast write protect" by
xiaoguangrong, especially the function kvm_mmu_write_protect_all_pages(),
is very helpful. With a small modification on top of his patch, it can
solve this problem, going from 9s to 0.5s.

At the beginning of live migration, write protection is only applied to
the top-level SPTE. Writes from the VM then trigger EPT violations, and
write protection is pushed down level by level with for_each_shadow_entry
in __direct_map(). Finally the Dirty bit of the target page (at the level 1
page table) is cleared and dirty page tracking starts. Of course, the page
where the GPA is located is marked dirty in mmu_set_spte().
A similar implementation exists on xen, just with emt instead of write
protection.

What do you think about this solution?

Xiao Guangrong (3):
  KVM: MMU: correct the behavior of mmu_spte_update_no_track
  KVM: MMU: introduce possible_writable_spte_bitmap
  KVM: MMU: introduce kvm_mmu_write_protect_all_pages

Zhuang Yanying (1):
  KVM: MMU: fast cleanup D bit based on fast write protect

 arch/x86/include/asm/kvm_host.h |  25 +++-
 arch/x86/kvm/mmu.c  | 259 ++--
 arch/x86/kvm/mmu.h  |   1 +
 arch/x86/kvm/paging_tmpl.h  |  13 +-
 arch/x86/kvm/vmx/vmx.c  |   3 +-
 5 files changed, 286 insertions(+), 15 deletions(-)

-- 
1.8.3.1





[Qemu-devel] [PATCH] KVM: MMU: fast cleanup D bit based on fast write protect

2019-01-12 Thread Zhuangyanying
From: Zhuang Yanying 

Recently I tested live migration with large-memory guests and found that
vcpus may hang for a long time while starting migration, e.g. 9s for
2048G (linux-4.20.1+qemu-3.1.0).
The reason is that memory_global_dirty_log_start() takes too long and the
vcpus are waiting for the BQL. The page-by-page D bit clearing is the main
time consumer. I think the idea of "KVM: MMU: fast write protect" by
xiaoguangrong, especially the function kvm_mmu_write_protect_all_pages(),
is very helpful. With a small modification on top of his patch, it can
solve this problem, going from 9s to 0.5s.

At the beginning of live migration, write protection is only applied to
the top-level SPTE. Writes from the VM then trigger EPT violations, and
write protection is pushed down level by level with for_each_shadow_entry
in __direct_map(). Finally the Dirty bit of the target page (at the level 1
page table) is cleared and dirty page tracking starts. Of course, the page
where the GPA is located is marked dirty in mmu_set_spte().
A similar implementation exists on xen, just with emt instead of write
protection.

What do you think about this solution?
---
 mmu.c | 5 -
 vmx.c | 3 +--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/mmu.c b/mmu.c
index b079d74..f49d316 100755
--- a/mmu.c
+++ b/mmu.c
@@ -3210,7 +3210,10 @@ static bool mmu_load_shadow_page(struct kvm *kvm, struct 
kvm_mmu_page *sp)
break;
 
 if (is_last_spte(spte, sp->role.level)) {
-   flush |= spte_write_protect(sptep, false);
+   if (sp->role.level == PT_PAGE_TABLE_LEVEL)
+   flush |= spte_clear_dirty(sptep);
+   else
+   flush |= spte_write_protect(sptep, false);
continue;
 }
 
diff --git a/vmx.c b/vmx.c
index 95784bc..7ec717f 100755
--- a/vmx.c
+++ b/vmx.c
@@ -14421,8 +14421,7 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 static void vmx_slot_enable_log_dirty(struct kvm *kvm,
 struct kvm_memory_slot *slot)
 {
-   kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
-   kvm_mmu_slot_largepage_remove_write_access(kvm, slot);
+   kvm_mmu_write_protect_all_pages(kvm, true);
 }
 
 static void vmx_slot_disable_log_dirty(struct kvm *kvm,
-- 
1.8.3.1





[Qemu-devel] [RFH]vcpu may hang for up to 4s while starting migration

2018-12-11 Thread Zhuangyanying
From: Zhuang Yanying 

Hi,

Recently I tested live migration of a VM with 1T memory and found that
vcpus may hang for up to 4s while starting migration.
The reason is that memory_global_dirty_log_start() takes too long and the
vcpu is waiting for the BQL.

migrate thread                                   vcpu
--------------                                   ----
qemu_mutex_lock_iothread
                                                 kvm_handle_io
memory_global_dirty_log_start  /* lasts 4s */    try qemu_mutex_lock_iothread
qemu_mutex_unlock_iothread
                                                 success qemu_mutex_lock_iothread

memory_global_dirty_log_start() cleans up the dirty bits of the sptes in
KVM and starts PML dirty page tracking. Because the VM's memory is very
large, this takes a long time in KVM.

Is the following scheme feasible:
(1) Move the action of turning on dirty page tracking in
memory_global_dirty_log_start() to asynchronous execution outside the BQL.
(2) The first time dirty page tracking is turned on, only clean up the
dirty bits of the 1G page tables, so that the time spent in
memory_global_dirty_log_start() is reduced.

Best regards,
 -Zhuang Yanying
---





[Qemu-devel] [PATCH v3] KVM: x86: Fix nmi injection failure when vcpu got blocked

2017-05-25 Thread Zhuangyanying
From: ZhuangYanying <ann.zhuangyany...@huawei.com>

When a spin_lock_irqsave() deadlock occurs inside the guest, the vcpu
threads other than the lock-holding one enter the S state because of
pvspinlock. If an NMI is then injected via the libvirt API "inject-nmi",
the NMI cannot be delivered to the VM.

The reason is:
1 It sets nmi_queued to 1 when calling ioctl KVM_NMI in qemu, and sets
cpu->kvm_vcpu_dirty to true in do_inject_external_nmi() meanwhile.
2 It sets nmi_queued to 0 in process_nmi(), before entering the guest,
because cpu->kvm_vcpu_dirty is true.

It is not enough to check only nmi_queued to decide whether to stay in
vcpu_block() or not; an NMI should be injected immediately in any
situation. Add a check of nmi_pending, and test KVM_REQ_NMI instead of
nmi_queued in kvm_vcpu_has_events().

Do the same change for SMIs.

Signed-off-by: Zhuang Yanying <ann.zhuangyany...@huawei.com>
---
v1->v2
  - simplify message. The complete description is here:
http://www.spinics.net/lists/kvm/msg150380.html
  - Testing KVM_REQ_NMI replaces nmi_pending.
  - Add Testing kvm_x86_ops->nmi_allowed(vcpu).
v2->v3
  - Testing KVM_REQ_NMI replaces nmi_queued, not nmi_pending.
  - Do the same change for SMIs. 
---
 arch/x86/kvm/x86.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 02363e3..a2cd099 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8394,10 +8394,13 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu 
*vcpu)
if (vcpu->arch.pv.pv_unhalted)
return true;
 
-   if (atomic_read(&vcpu->arch.nmi_queued))
+   if (kvm_test_request(KVM_REQ_NMI, vcpu) ||
+   (vcpu->arch.nmi_pending &&
+kvm_x86_ops->nmi_allowed(vcpu)))
return true;
 
-   if (kvm_test_request(KVM_REQ_SMI, vcpu))
+   if (kvm_test_request(KVM_REQ_SMI, vcpu) ||
+   (vcpu->arch.smi_pending && !is_smm(vcpu)))
return true;
 
if (kvm_arch_interrupt_allowed(vcpu) &&
-- 
1.8.3.1





[Qemu-devel] [PATCH v2] KVM: x86: Fix nmi injection failure when vcpu got blocked

2017-05-25 Thread Zhuangyanying
From: ZhuangYanying <ann.zhuangyany...@huawei.com>

When a spin_lock_irqsave() deadlock occurs inside the guest, the vcpu
threads other than the lock-holding one enter the S state because of
pvspinlock. If an NMI is then injected via the libvirt API "inject-nmi",
the NMI cannot be delivered to the VM.

The reason is:
1 It sets nmi_queued to 1 when calling ioctl KVM_NMI in qemu, and sets
cpu->kvm_vcpu_dirty to true in do_inject_external_nmi() meanwhile.
2 It sets nmi_queued to 0 in process_nmi(), before entering the guest,
because cpu->kvm_vcpu_dirty is true.

It is not enough to check only nmi_queued to decide whether to stay in
vcpu_block() or not; an NMI should be injected immediately in any
situation. Add a check of the KVM_REQ_NMI request in addition to
nmi_queued in kvm_vcpu_has_events().

Signed-off-by: Zhuang Yanying <ann.zhuangyany...@huawei.com>
---
 arch/x86/kvm/x86.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 02363e3..2d15708 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8394,7 +8394,9 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu 
*vcpu)
if (vcpu->arch.pv.pv_unhalted)
return true;
 
-   if (atomic_read(&vcpu->arch.nmi_queued))
+   if ((kvm_test_request(KVM_REQ_NMI, vcpu) ||
+   atomic_read(&vcpu->arch.nmi_queued)) &&
+   kvm_x86_ops->nmi_allowed(vcpu))
return true;
 
if (kvm_test_request(KVM_REQ_SMI, vcpu))
-- 
1.8.3.1





Re: [Qemu-devel] [PATCH] Fix nmi injection failure when vcpu got blocked

2017-05-25 Thread Zhuangyanying

> -Original Message-
> From: Radim Krčmář [mailto:rkrc...@redhat.com]
> Sent: Wednesday, May 24, 2017 10:34 PM
> To: Zhuangyanying
> Cc: pbonz...@redhat.com; Herongguang (Stephen); qemu-devel@nongnu.org;
> Gonglei (Arei); Zhangbo (Oscar); k...@vger.kernel.org
> Subject: Re: [PATCH] Fix nmi injection failure when vcpu got blocked
> 
> Please use tags in patches.
> We usually begin the subject with "KVM: x86:" when touching
> arch/x86/kvm/x86.c.
> 
Sorry, I will add in patch v2.
> 2017-05-24 13:48+0800, Zhuangyanying:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c @@ -8394,7
> > +8394,8 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
> > if (vcpu->arch.pv.pv_unhalted)
> > return true;
> >
> > -   if (atomic_read(&vcpu->arch.nmi_queued))
> > +   if (vcpu->arch.nmi_pending ||
> > +   atomic_read(&vcpu->arch.nmi_queued))
> > return true;
> 
> Hm, I think we've been missing '&& kvm_x86_ops->nmi_allowed(vcpu)'.
> 
Yes, we've been missing that; I will add it in patch v2.
> The undesired resume if we have suppressed NMI is not making it much worse,
> but wouldn't "kvm_test_request(KVM_REQ_NMI, vcpu)" also work here?
> 
> > if (kvm_test_request(KVM_REQ_SMI, vcpu))
> 
> Thanks.
"kvm_test_request(KVM_REQ_NMI, vcpu)", works fine for me.



[Qemu-devel] [PATCH] Fix nmi injection failure when vcpu got blocked

2017-05-23 Thread Zhuangyanying
From: ZhuangYanying <ann.zhuangyany...@huawei.com>

Recently I found that an NMI could not be injected into a VM via the
libvirt API.
To reproduce the problem:
1 use a guest running redhat 7.3
2 disable nmi_watchdog and trigger a spinlock deadlock inside the guest;
check the running vcpu thread and make sure it is not vcpu0
3 inject an NMI into the guest via the libvirt API "inject-nmi"

Result:
The NMI could not be injected into the guest.

Reason:
1 It sets nmi_queued to 1 when calling ioctl KVM_NMI in qemu, and
sets cpu->kvm_vcpu_dirty to true in do_inject_external_nmi() meanwhile.
2 It sets nmi_queued to 0 in process_nmi(), before entering guest,
because cpu->kvm_vcpu_dirty is true.

Normally, the vcpu can call vcpu_enter_guest() successfully to inject the
NMI. However, in the problematic scenario, when a guest thread holds
spin_lock_irqsave for a long time, such as entering a while loop after
spin_lock_irqsave(), the other vcpus enter the S state because of the
pvspinlock scheme, and the KVM module then loops in vcpu_block() rather
than entering the guest.

I think it is not suitable to decide whether to stay in vcpu_block() or
not just by checking nmi_queued; an NMI should be injected immediately
even in this situation.

Solution:
There are 2 ways to solve the problem:
1 call cpu_synchronize_state_not_set_dirty() rather than
cpu_synchronize_state() while injecting the NMI, to avoid changing
nmi_queued to 0. But other workqueues may affect cpu->kvm_vcpu_dirty,
so this is not recommended.
2 check nmi_pending in addition to nmi_queued in kvm_vcpu_has_events()
in the KVM module.

qemu_kvm_wait_io_event
  qemu_wait_io_event_common
flush_queued_work
  do_inject_external_nmi
cpu_synchronize_state
  kvm_cpu_synchronize_state
do_kvm_cpu_synchronize_state
  cpu->kvm_vcpu_dirty = true;/* trigger process_nmi */
kvm_vcpu_ioctl(cpu, KVM_NMI)
  kvm_vcpu_ioctl_nmi
kvm_inject_nmi
  atomic_inc(&vcpu->arch.nmi_queued);
nmi_queued = 1
/* nmi_queued set to 1, when qemu ioctl KVM_NMI */
  kvm_make_request(KVM_REQ_NMI, vcpu);
kvm_cpu_exec
  kvm_arch_put_registers(cpu, KVM_PUT_RUNTIME_STATE);
kvm_arch_put_registers
  kvm_put_vcpu_events
kvm_vcpu_ioctl(CPU(cpu), KVM_SET_VCPU_EVENTS, &events);
  kvm_vcpu_ioctl_x86_set_vcpu_events
process_nmi(vcpu);
  vcpu->arch.nmi_pending +=
atomic_xchg(&vcpu->arch.nmi_queued, 0);
nmi_queued = 0
/* nmi_queued set to 0, vcpu thread always block */
nmi_pending = 1
  kvm_make_request(KVM_REQ_EVENT, vcpu);
  kvm_vcpu_ioctl(cpu, KVM_RUN, 0);
kvm_arch_vcpu_ioctl_run
  vcpu_run(vcpu);
kvm_vcpu_running(vcpu)
  /* always false, could not call vcpu_enter_guest */
  vcpu_block
kvm_arch_vcpu_runnable
  kvm_vcpu_has_events
if (atomic_read(&vcpu->arch.nmi_queued))
   /* nmi_queued is 0, vcpu thread always block*/

Signed-off-by: Zhuang Yanying <ann.zhuangyany...@huawei.com>
---
 arch/x86/kvm/x86.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 02363e3..96983dc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8394,7 +8394,8 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu 
*vcpu)
if (vcpu->arch.pv.pv_unhalted)
return true;
 
-   if (atomic_read(&vcpu->arch.nmi_queued))
+   if (vcpu->arch.nmi_pending ||
+   atomic_read(&vcpu->arch.nmi_queued))
return true;
 
if (kvm_test_request(KVM_REQ_SMI, vcpu))
-- 
1.8.3.1





Re: [Qemu-devel] [BUG] Migrate failes between boards with different PMC counts

2017-04-24 Thread Zhuangyanying


> -Original Message-
> From: Daniel P. Berrange [mailto:berra...@redhat.com]
> Sent: Monday, April 24, 2017 6:34 PM
> To: Dr. David Alan Gilbert
> Cc: Zhuangyanying; Zhanghailiang; wangxin (U); qemu-devel@nongnu.org;
> Gonglei (Arei); Huangzhichao; pbonz...@redhat.com
> Subject: Re: [Qemu-devel] [BUG] Migrate failes between boards with different
> PMC counts
> 
> On Mon, Apr 24, 2017 at 11:27:16AM +0100, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrange (berra...@redhat.com) wrote:
> > > On Mon, Apr 24, 2017 at 10:23:21AM +0100, Dr. David Alan Gilbert wrote:
> > > > * Zhuangyanying (ann.zhuangyany...@huawei.com) wrote:
> > > > > Hi all,
> > > > >
> > > > > Recently, I found migration failed when enable vPMU.
> > > > >
> > > > > migrate vPMU state was introduced in linux-3.10 + qemu-1.7.
> > > > >
> > > > > As long as enable vPMU, qemu will save / load the
> > > > > vmstate_msr_architectural_pmu(msr_global_ctrl) register during the
> migration.
> > > > > But global_ctrl generated based on cpuid(0xA), the number of
> > > > > general-purpose performance monitoring counters(PMC) can vary
> > > > > according to Intel SDN. The number of PMC presented to vm, does
> > > > > not support configuration currently, it depend on host cpuid, and 
> > > > > enable
> all pmc defaultly at KVM. It cause migration to fail between boards with
> different PMC counts.
> > > > >
> > > > > The return value of cpuid (0xA) is different dur to cpu, according to 
> > > > > Intel
> SDN,18-10 Vol. 3B:
> > > > >
> > > > > Note: The number of general-purpose performance monitoring
> > > > > counters (i.e. N in Figure 18-9) can vary across processor
> > > > > generations within a processor family, across processor
> > > > > families, or could be different depending on the configuration
> > > > > chosen at boot time in the BIOS regarding Intel Hyper Threading
> > > > > Technology, (e.g. N=2 for 45 nm Intel Atom processors; N =4 for
> processors based on the Nehalem microarchitecture; for processors based on
> the Sandy Bridge microarchitecture, N = 4 if Intel Hyper Threading Technology
> is active and N=8 if not active).
> > > > >
> > > > > Also I found, N=8 if HT is not active based on the broadwell,,
> > > > > such as CPU E7-8890 v4 @ 2.20GHz
> > > > >
> > > > > # ./x86_64-softmmu/qemu-system-x86_64 --enable-kvm -smp 4 -m
> > > > > 4096 -hda
> > > > > /data/zyy/test_qemu.img.sles12sp1 -vnc :99 -cpu kvm64,pmu=true
> > > > > -incoming tcp:: Completed 100 %
> > > > > qemu-system-x86_64: error: failed to set MSR 0x38f to
> > > > > 0x700ff
> > > > > qemu-system-x86_64: /data/zyy/git/test/qemu/target/i386/kvm.c:1833:
> kvm_put_msrs:
> > > > > Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.
> > > > > Aborted
> > > > >
> > > > > So make number of pmc configurable to vm ? Any better idea ?
> > > >
> > > > Coincidentally we hit a similar problem a few days ago with -cpu
> > > > host  - it took me quite a while to spot the difference between
> > > > the machines was the source had hyperthreading disabled.
> > > >
> > > > An option to set the number of counters makes sense to me; but I
> > > > wonder how many other options we need as well.  Also, I'm not sure
> > > > there's any easy way for libvirt etc to figure out how many
> > > > counters a host supports - it's not in /proc/cpuinfo.
> > >
> > > We actually try to avoid /proc/cpuinfo whereever possible. We do
> > > direct CPUID asm instructions to identify features, and prefer to
> > > use /sys/devices/system/cpu if that has suitable data
> > >
> > > Where do the PMC counts come from originally ? CPUID or something
> else ?
> >
> > Yes, they're bits 8..15 of CPUID leaf 0xa
> 
> Ok, that's easy enough for libvirt to detect then. More a question of what 
> libvirt
> should then do this with the info
> 

Do you mean doing a validation at the beginning of migration, i.e. in
qemuMigrationBakeCookie() & qemuMigrationEatCookie(): if the PMC numbers are
not equal, just abort the migration?
That may be good enough for a first version.
But a better long-term solution is probably to support migration between
heterogeneous hosts, so we would need to make the PMC number configurable,
which means modifying KVM/qemu as well.
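
For reference, a small user-space sketch (my own illustration, not existing
libvirt or qemu code) of reading the general-purpose counter count from CPUID
leaf 0xA, bits 8..15 of EAX, which is the value the two hosts would have to
agree on:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0xa, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "CPUID leaf 0xA not supported\n");
        return 1;
    }
    printf("architectural PMU version: %u\n", eax & 0xff);        /* bits 0..7 */
    printf("general-purpose counters:  %u\n", (eax >> 8) & 0xff); /* bits 8..15 */
    return 0;
}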

Regards,
-Zhuang Yanying


[Qemu-devel] [BUG] Migrate fails between boards with different PMC counts

2017-04-24 Thread Zhuangyanying
Hi all,

Recently, I found that migration fails when vPMU is enabled.

Migrating vPMU state was introduced in linux-3.10 + qemu-1.7.

Whenever vPMU is enabled, qemu saves/loads the
vmstate_msr_architectural_pmu (msr_global_ctrl) register during migration.
But global_ctrl is generated from cpuid(0xA), and the number of general-purpose
performance monitoring counters (PMCs) can vary according to the Intel SDM.
The number of PMCs presented to the VM is currently not configurable: it
depends on the host cpuid, and KVM enables all PMCs by default. This causes
migration to fail between boards with different PMC counts.

The return value of cpuid(0xA) differs depending on the CPU; according to the
Intel SDM, 18-10 Vol. 3B:

Note: The number of general-purpose performance monitoring counters (i.e. N in
Figure 18-9) can vary across processor generations within a processor family,
across processor families, or could be different depending on the
configuration chosen at boot time in the BIOS regarding Intel Hyper Threading
Technology, (e.g. N=2 for 45 nm Intel Atom processors; N=4 for processors
based on the Nehalem microarchitecture; for processors based on the Sandy
Bridge microarchitecture, N = 4 if Intel Hyper Threading Technology is active
and N=8 if not active).

I also found that N=8 when HT is not active on Broadwell,
for example on the CPU E7-8890 v4 @ 2.20GHz.

# ./x86_64-softmmu/qemu-system-x86_64 --enable-kvm -smp 4 -m 4096 -hda /data/zyy/test_qemu.img.sles12sp1 -vnc :99 -cpu kvm64,pmu=true -incoming tcp::
Completed 100 %
qemu-system-x86_64: error: failed to set MSR 0x38f to 0x700ff
qemu-system-x86_64: /data/zyy/git/test/qemu/target/i386/kvm.c:1833: kvm_put_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.
Aborted

So should we make the number of PMCs configurable for the VM? Any better ideas?


Regards,
-Zhuang Yanying


[Qemu-devel] [PATCH] ipmi: fix qemu crash while migrating with ipmi

2016-11-18 Thread Zhuangyanying
From: ZhuangYanying <ann.zhuangyany...@huawei.com>

Qemu crashes on the source side during migration, after the ipmi service has
been started inside the VM.

./x86_64-softmmu/qemu-system-x86_64 --enable-kvm -smp 4 -m 4096 \
-drive file=/work/suse/suse11_sp3_64_vt,format=raw,if=none,id=drive-virtio-disk0,cache=none \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0 \
-vnc :99 -monitor vc -device ipmi-bmc-sim,id=bmc0 -device isa-ipmi-kcs,bmc=bmc0,ioport=0xca2

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffec4268700 (LWP 7657)]
__memcpy_ssse3_back () at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:2757
(gdb) bt
 #0  __memcpy_ssse3_back () at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:2757
 #1  0x559ef775 in memcpy (__len=3, __src=0xc1421c, __dest=)
     at /usr/include/bits/string3.h:51
 #2  qemu_put_buffer (f=0x57a97690, buf=0xc1421c , size=3)
     at migration/qemu-file.c:346
 #3  0x559eef66 in vmstate_save_state (f=f@entry=0x57a97690,
     vmsd=0x55f8a5a0 , opaque=0x57231160,
     vmdesc=vmdesc@entry=0x5798cc40) at migration/vmstate.c:333
 #4  0x557cfe45 in vmstate_save (f=f@entry=0x57a97690, se=se@entry=0x57231de0,
     vmdesc=vmdesc@entry=0x5798cc40) at /mnt/sdb/zyy/qemu/migration/savevm.c:720
 #5  0x557d2be7 in qemu_savevm_state_complete_precopy (f=0x57a97690,
     iterable_only=iterable_only@entry=false) at /mnt/sdb/zyy/qemu/migration/savevm.c:1128
 #6  0x559ea102 in migration_completion (start_time=,
     old_vm_running=, current_active_state=,
     s=0x560eaa80 ) at migration/migration.c:1707
 #7  migration_thread (opaque=0x560eaa80 ) at migration/migration.c:1855
 #8  0x73900dc5 in start_thread (arg=0x7ffec4268700) at pthread_create.c:308
 #9  0x7fffefc6c71d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
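
My reading of the backtrace (an assumption on my part, not verified against
the VMSTATE macros): kcs.outmsg/kcs.inmsg are fixed-size arrays embedded in
the device state, while VMSTATE_VBUFFER_UINT32 describes a field that holds a
pointer to a separately allocated buffer, so the migration code ends up
treating the first message bytes as a buffer address and memcpy()ing from a
bogus pointer. A stand-alone sketch of that pointer/array mismatch:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct msg_as_pointer { uint8_t *buf; uint32_t len; };     /* what VBUFFER expects */
struct msg_as_array   { uint8_t buf[300]; uint32_t len; }; /* 300 stands in for MAX_IPMI_MSG_SIZE */

int main(void)
{
    struct msg_as_array m = { .buf = { 0x1c, 0x42, 0xc1 }, .len = 3 };
    uint8_t *bogus;

    /* Treat the start of the embedded array as if it stored a pointer: */
    memcpy(&bogus, m.buf, sizeof(bogus));
    printf("would memcpy %u bytes from %p\n", (unsigned)m.len, (void *)bogus);
    /* memcpy(dst, bogus, m.len);   <- the crash seen in frame #2 above */
    return 0;
}

Switching to VMSTATE_UINT8_ARRAY saves the arrays in place and avoids the
bogus dereference.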

Signed-off-by: Zhuang Yanying <ann.zhuangyany...@huawei.com>
---
 hw/ipmi/isa_ipmi_kcs.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/hw/ipmi/isa_ipmi_kcs.c b/hw/ipmi/isa_ipmi_kcs.c
index 9a38f8a..8044497 100644
--- a/hw/ipmi/isa_ipmi_kcs.c
+++ b/hw/ipmi/isa_ipmi_kcs.c
@@ -433,10 +433,8 @@ const VMStateDescription vmstate_ISAIPMIKCSDevice = {
 VMSTATE_BOOL(kcs.use_irq, ISAIPMIKCSDevice),
 VMSTATE_BOOL(kcs.irqs_enabled, ISAIPMIKCSDevice),
 VMSTATE_UINT32(kcs.outpos, ISAIPMIKCSDevice),
-VMSTATE_VBUFFER_UINT32(kcs.outmsg, ISAIPMIKCSDevice, 1, NULL, 0,
-   kcs.outlen),
-VMSTATE_VBUFFER_UINT32(kcs.inmsg, ISAIPMIKCSDevice, 1, NULL, 0,
-   kcs.inlen),
+VMSTATE_UINT8_ARRAY(kcs.outmsg, ISAIPMIKCSDevice, MAX_IPMI_MSG_SIZE),
+VMSTATE_UINT8_ARRAY(kcs.inmsg, ISAIPMIKCSDevice, MAX_IPMI_MSG_SIZE),
 VMSTATE_BOOL(kcs.write_end, ISAIPMIKCSDevice),
 VMSTATE_UINT8(kcs.status_reg, ISAIPMIKCSDevice),
 VMSTATE_UINT8(kcs.data_out_reg, ISAIPMIKCSDevice),
-- 
1.8.3.1





[Qemu-devel] [PATCH v3] ivshmem: Fix 64 bit memory bar configuration

2016-11-17 Thread Zhuangyanying
From: Zhuang Yanying 

Device ivshmem property use64=0 is designed to make the device
expose a 32 bit shared memory BAR instead of 64 bit one.  The
default is a 64 bit BAR, except pc-1.2 and older retain a 32 bit
BAR.  A 32 bit BAR can support only up to 1 GiB of shared memory.

This worked as designed until commit 5400c02 accidentally flipped
its sense: since then, we misinterpret use64=0 as use64=1 and vice
versa.  Worse, the default got flipped as well.  Devices
ivshmem-plain and ivshmem-doorbell are not affected.

Fix by restoring the test of IVShmemState member not_legacy_32bit
that got messed up in commit 5400c02.  Also update its
initialization for devices ivhsmem-plain and ivshmem-doorbell.
Without that, they'd regress to 32 bit BARs.
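
For context, an illustrative invocation (the backend name and the 2G size are
my own example, not taken from a report) where the shared memory exceeds
1 GiB and therefore relies on the 64 bit BAR that this patch restores as the
default:

qemu-system-x86_64 ... \
    -object memory-backend-file,id=hostmem,share=on,mem-path=/dev/shm/ivshmem,size=2G \
    -device ivshmem-plain,memdev=hostmem

As the commit message notes, a 32 bit BAR can only cover up to 1 GiB, so such
a region needs the 64 bit BAR.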

Cc: qemu-sta...@nongnu.org
Signed-off-by: Zhuang Yanying 
Reviewed-by: Gonglei 
Reviewed-by: Marc-André Lureau 

---
 hw/misc/ivshmem.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 230e51b..abeaf3d 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -858,7 +858,7 @@ static void ivshmem_common_realize(PCIDevice *dev, Error **errp)
 pci_register_bar(dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY,
  &s->ivshmem_mmio);
 
-if (!s->not_legacy_32bit) {
+if (s->not_legacy_32bit) {
 attr |= PCI_BASE_ADDRESS_MEM_TYPE_64;
 }
 
@@ -1045,6 +1045,7 @@ static void ivshmem_plain_init(Object *obj)
  ivshmem_check_memdev_is_busy,
  OBJ_PROP_LINK_UNREF_ON_RELEASE,
  &error_abort);
+s->not_legacy_32bit = 1;
 }
 
 static void ivshmem_plain_realize(PCIDevice *dev, Error **errp)
@@ -1116,6 +1117,7 @@ static void ivshmem_doorbell_init(Object *obj)
 
 s->features |= (1 << IVSHMEM_MSI);
 s->legacy_size = SIZE_MAX;  /* whatever the server sends */
+s->not_legacy_32bit = 1;
 }
 
 static void ivshmem_doorbell_realize(PCIDevice *dev, Error **errp)
-- 
1.8.3.1





[Qemu-devel] [PATCH v2 1/2] ivshmem: fix misconfig of not_legacy_32bit

2016-11-15 Thread Zhuangyanying
From: ZhuangYanying <ann.zhuangyany...@huawei.com>

After commit 5400c02, ivshmem_64bit was renamed to not_legacy_32bit,
and the implementation of this property changed.
Now with use64=1 the BAR does not get PCI_BASE_ADDRESS_MEM_TYPE_64
(the default for ivshmem), so the device actually behaves like the
legacy model and cannot map 1G or more of shared memory,
which is the opposite of what the configuration requests.

Cc: qemu-sta...@nongnu.org
Signed-off-by: Zhuang Yanying <ann.zhuangyany...@huawei.com>
Reviewed-by: Gonglei <arei.gong...@huawei.com>
---
 hw/misc/ivshmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 230e51b..b897685 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -858,7 +858,7 @@ static void ivshmem_common_realize(PCIDevice *dev, Error **errp)
 pci_register_bar(dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY,
  &s->ivshmem_mmio);
 
-if (!s->not_legacy_32bit) {
+if (s->not_legacy_32bit) {
 attr |= PCI_BASE_ADDRESS_MEM_TYPE_64;
 }
 
-- 
1.8.3.1





[Qemu-devel] [PATCH v2 0/2] ivshmem: fix misconfig of not_legacy_32bit

2016-11-15 Thread Zhuangyanying
From: ZhuangYanying <ann.zhuangyany...@huawei.com>

Recently I tested ivshmem and found that the implementation of use64
(i.e. not_legacy_32bit) is odd, in fact inverted.
Previously, use64 = ivshmem_64bit = 1 set
attr |= PCI_BASE_ADDRESS_MEM_TYPE_64, so ivshmem could expose 1G or more
of shared memory in BAR2 to the virtual machine.
But after commit 5400c02, PCI_BASE_ADDRESS_MEM_TYPE_64 is set when
not_legacy_32bit = 0, i.e. for the legacy model.

ZhuangYanying (2):
  hw/misc/ivshmem: fix misconfig of not_legacy_32bit
  hw/misc/ivshmem: set not_legacy_32bit to 1 for ivshmem_doorbell and
ivshmem-plain

 hw/misc/ivshmem.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

-- 
1.8.3.1





[Qemu-devel] [PATCH v2 2/2] ivshmem: set not_legacy_32bit to 1 for ivshmem_doorbell and ivshmem-plain

2016-11-15 Thread Zhuangyanying
From: ZhuangYanying <ann.zhuangyany...@huawei.com>

Signed-off-by: Zhuang Yanying <ann.zhuangyany...@huawei.com>
---
 hw/misc/ivshmem.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index b897685..abeaf3d 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -1045,6 +1045,7 @@ static void ivshmem_plain_init(Object *obj)
  ivshmem_check_memdev_is_busy,
  OBJ_PROP_LINK_UNREF_ON_RELEASE,
  &error_abort);
+s->not_legacy_32bit = 1;
 }
 
 static void ivshmem_plain_realize(PCIDevice *dev, Error **errp)
@@ -1116,6 +1117,7 @@ static void ivshmem_doorbell_init(Object *obj)
 
 s->features |= (1 << IVSHMEM_MSI);
 s->legacy_size = SIZE_MAX;  /* whatever the server sends */
+s->not_legacy_32bit = 1;
 }
 
 static void ivshmem_doorbell_realize(PCIDevice *dev, Error **errp)
-- 
1.8.3.1





[Qemu-devel] [PATCH] hw/misc/ivshmem:fix misconfig of not_legacy_32bit

2016-11-14 Thread Zhuangyanying
From: ZhuangYanying <ann.zhuangyany...@huawei.com>

After "ivshmem: Split ivshmem-plain, ivshmem-doorbell off ivshmem", 
ivshmem_64bit renamed to not_legacy_32bit, and changed the implementation of 
this property.
Then use64 = not_legacy_32bit = 1, then PCI attribute configuration ~ 
PCI_BASE_ADDRESS_MEM_TYPE_64 (default for ivshmem), the actual use is the 
legacy model, can not support greater than or equal 1G mapping, which is the 
opposite of configuration requirements.

Signed-off-by: ann.zhuangyany...@huawei.com
---
Recently I tested ivshmem and found that the implementation of use64
(i.e. not_legacy_32bit) is odd, in fact inverted.
Previously, use64 = ivshmem_64bit = 1 set
attr |= PCI_BASE_ADDRESS_MEM_TYPE_64, so ivshmem could expose 1G or more
of shared memory in BAR2 to the virtual machine.
But after "ivshmem: Split ivshmem-plain, ivshmem-doorbell off ivshmem",
PCI_BASE_ADDRESS_MEM_TYPE_64 is set when not_legacy_32bit = 0, i.e. for
the legacy model.
---
 hw/misc/ivshmem.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 230e51b..b71acf6 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -858,7 +858,7 @@ static void ivshmem_common_realize(PCIDevice *dev, Error **errp)
 pci_register_bar(dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY,
  &s->ivshmem_mmio);
 
-if (!s->not_legacy_32bit) {
+if (s->not_legacy_32bit) {
 attr |= PCI_BASE_ADDRESS_MEM_TYPE_64;
 }
 
@@ -1033,6 +1033,7 @@ static const VMStateDescription ivshmem_plain_vmsd = {
 
 static Property ivshmem_plain_properties[] = {
 DEFINE_PROP_ON_OFF_AUTO("master", IVShmemState, master, ON_OFF_AUTO_OFF),
+DEFINE_PROP_UINT32("use64", IVShmemState, not_legacy_32bit, 1),
 DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -1107,6 +1108,7 @@ static Property ivshmem_doorbell_properties[] = {
 DEFINE_PROP_BIT("ioeventfd", IVShmemState, features, IVSHMEM_IOEVENTFD,
 true),
 DEFINE_PROP_ON_OFF_AUTO("master", IVShmemState, master, ON_OFF_AUTO_OFF),
+DEFINE_PROP_UINT32("use64", IVShmemState, not_legacy_32bit, 1),
 DEFINE_PROP_END_OF_LIST(),
 };
 
-- 
1.8.3.1





[Qemu-devel] [PATCH] target-i386/machine:fix migrate faile because of Hyper-V HV_X64_MSR_VP_RUNTIME

2016-11-04 Thread Zhuangyanying
From: ZhuangYanying <ann.zhuangyany...@huawei.com>

Hyper-V HV_X64_MSR_VP_RUNTIME was introduced in linux-4.4 + qemu-2.5.

As long as the KVM module supports it, qemu saves/loads the
vmstate_msr_hyperv_runtime register during migration, regardless of whether
the hyperv_runtime property of x86_cpu_properties is enabled.

qemu-2.3 does not know this subsection, so migration back to it of course fails.

linux-BGSfqC:/home/qemu # ./x86_64-softmmu/qemu-system-x86_64 --enable-kvm -nodefaults -machine pc-i440fx-2.3,accel=kvm,usb=off -smp 4 -m 4096 -drive file=/work/suse/sles11sp3.img.bak,format=raw,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0 -vnc :99 -device cirrus-vga,id=video0,vgamem_mb=8,bus=pci.0,addr=0x2 -monitor vc

save_section_header:se->section_id=3,se->idstr:ram,se->instance_id=0,se->version_id=4

save_section_header:se->section_id=0,se->idstr:timer,se->instance_id=0,se->version_id=2

save_section_header:se->section_id=4,se->idstr:cpu_common,se->instance_id=0,se->version_id=1

save_section_header:se->section_id=5,se->idstr:cpu,se->instance_id=0,se->version_id=12

vmstate_subsection_save:vmsd->name:cpu/async_pf_msr

hyperv_runtime_enable_needed:env->msr_hv_runtime=128902811

vmstate_subsection_save:vmsd->name:cpu/msr_hyperv_runtime

Since hyperv_runtime is false, the VM does not use hv->runtime_offset, so
there is no need to transfer vmstate_msr_hyperv_runtime during migration.

Signed-off-by: ann.zhuangyany...@huawei.com
---
Hi,

Recently I tested cross-version migration/rollback between qemu tag v2.3.1 and
qemu master, and found that the rollback fails.

linux-rIVrzS:/home/git/qemu # ./x86_64-softmmu/qemu-system-x86_64 --enable-kvm -nodefaults -machine pc-i440fx-2.3,accel=kvm,usb=off -smp 4 -m 4096 -drive file=/work/suse/sles11sp3.img.bak,format=raw,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0 -vnc :99 -device cirrus-vga,id=video0,vgamem_mb=8,bus=pci.0,addr=0x2 -monitor vc -incoming tcp:0:

qemu-system-x86_64: error while loading state for instance 0x0 of device 'cpu'

qemu-system-x86_64: load of migration failed: No such file or directory

Maybe set compat_props on PC_COMPAT_2_5? Any better ideas?
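
For reference, the subsection in question is gated by its .needed callback
roughly like this (a sketch from memory of target-i386/machine.c, details may
differ slightly); returning 0 from hyperv_runtime_enable_needed() keeps the
subsection out of the migration stream when hyperv_runtime is off:

static const VMStateDescription vmstate_msr_hyperv_runtime = {
    .name = "cpu/msr_hyperv_runtime",
    .version_id = 1,
    .minimum_version_id = 1,
    .needed = hyperv_runtime_enable_needed, /* subsection sent only if this returns true */
    .fields = (VMStateField[]) {
        VMSTATE_UINT64(env.msr_hv_runtime, X86CPU),
        VMSTATE_END_OF_LIST()
    }
};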
---
 target-i386/machine.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/target-i386/machine.c b/target-i386/machine.c
index 48037f1..e984d77 100644
--- a/target-i386/machine.c
+++ b/target-i386/machine.c
@@ -709,6 +709,10 @@ static bool hyperv_runtime_enable_needed(void *opaque)
 X86CPU *cpu = opaque;
 CPUX86State *env = &cpu->env;
 
+if (!cpu->hyperv_runtime) {
+return 0;
+}
+
 return env->msr_hv_runtime != 0;
 }
 
-- 
1.8.3.1