Re: [Qemu-devel] [PATCH] msix: fix interrupt aggregation problem at the passthrough of NVMe SSD
> -----Original Message-----
> From: Michael S. Tsirkin [mailto:m...@redhat.com]
> Sent: Tuesday, April 09, 2019 11:04 PM
> To: Zhuangyanying
> Cc: marcel.apfelb...@gmail.com; qemu-devel@nongnu.org; Gonglei (Arei)
> Subject: Re: [PATCH] msix: fix interrupt aggregation problem at the
> passthrough of NVMe SSD
>
> On Tue, Apr 09, 2019 at 02:14:56PM +, Zhuangyanying wrote:
> > From: Zhuang Yanying
> >
> > Recently I tested the performance of NVMe SSD passthrough and found
> > that interrupts were aggregated on vcpu0 (or the first vcpu of each
> > NUMA node), as seen in /proc/interrupts, when the guest OS was
> > upgraded to sles12sp3 (or redhat7.6). But /proc/irq/X/smp_affinity_list
> > shows that the interrupt is spread out, such as 0-10, 11-21, and so on.
> > This problem cannot be resolved by "echo X >
> > /proc/irq/X/smp_affinity_list", because the NVMe SSD interrupt is
> > requested by the API pci_alloc_irq_vectors(), so the interrupt has the
> > IRQD_AFFINITY_MANAGED flag.
> >
> > GuestOS sles12sp3 backports "automatic interrupt affinity for MSI/MSI-X
> > capable devices", but the implementation of __setup_irq has no
> > corresponding modification. It is still irq_startup(), then
> > setup_affinity(), i.e. the affinity message is sent while the
> > interrupt is unmasked.
>
> So does latest upstream still change data/address of an unmasked vector?

The latest upstream works fine; the affinity configuration succeeds.
The original order, as I understand it, is:

nvme_setup_io_queues()
  -> pci_alloc_irq_vectors_affinity()
     -> msi_domain_alloc_irqs()
        /* if IRQD_AFFINITY_MANAGED, then "mask = affinity" */
        -> ... -> __irq_alloc_descs()
           /* cpumask_copy(desc->irq_common_data.affinity, affinity); */
           -> ... -> desc_smp_init()
  -> request_threaded_irq()
     -> __setup_irq()
        -> irq_startup() -> msi_domain_activate()
           -> irq_enable() -> pci_msi_unmask_irq()
        -> setup_affinity()
           -> if (irqd_affinity_is_managed(&desc->irq_data))
                  set = desc->irq_common_data.affinity;
              cpumask_and(mask, cpu_online_mask, set);
           -> irq_do_set_affinity()
              -> msi_domain_set_affinity()
                 /* Actual setting of affinity */
                 -> __pci_write_msi_msg()

After commit e43b3b58: "genirq/cpuhotplug: Enforce affinity setting on
startup of managed irqs":

@@ -265,8 +265,8 @@ int irq_startup(struct irq_desc *desc, bool resend, bool force)
 		irq_setup_affinity(desc);
 		break;
 	case IRQ_STARTUP_MANAGED:
+		irq_do_set_affinity(d, aff, false);
 		ret = __irq_startup(desc);
-		irq_set_affinity_locked(d, aff, false);
 		break;

it is first irq_do_set_affinity(), then __irq_startup(), i.e. the irq is
unmasked only after the affinity has been set.

> > The bare metal configuration is successful, but qemu will not trigger
> > the msix update, and the affinity configuration fails.
> > The affinity configured via /proc/irq/X/smp_affinity_list is
> > implemented in apic_ack_edge(): the bitmap is stored in pending_mask,
> > then mask -> __pci_write_msi_msg() -> unmask,
> > so the timing is guaranteed and the configuration takes effect.
> >
> > The GuestOS linux-master incorporates "genirq/cpuhotplug: Enforce
> > affinity setting on startup of managed irqs" to ensure that, for a
> > managed interrupt, the affinity is issued first and __irq_startup()
> > runs afterwards. So the configuration is successful.
> >
> > It now looks like sles12sp3 (up to sles15sp1, linux-4.12.x) and
> > redhat7.6 (3.10.0-957.10.1) have not backported the patch yet.
>
> Sorry - which patch?
genirq/cpuhotplug: Enforce affinity setting on startup of managed irqs
(commit e43b3b58)

> > "if (is_masked == was_masked) return;" can it be removed at qemu?
> > What is the reason for this check?
> >
> > Signed-off-by: Zhuang Yanying
> > ---
> >  hw/pci/msix.c | 4 ----
> >  1 file changed, 4 deletions(-)
> >
> > diff --git a/hw/pci/msix.c b/hw/pci/msix.c
> > index 4e33641..e1ff533 100644
> > --- a/hw/pc
[Qemu-devel] [PATCH] msix: fix interrupt aggregation problem at the passthrough of NVMe SSD
From: Zhuang Yanying

Recently I tested the performance of NVMe SSD passthrough and found that
interrupts were aggregated on vcpu0 (or the first vcpu of each NUMA node),
as seen in /proc/interrupts, when the guest OS was upgraded to sles12sp3
(or redhat7.6). But /proc/irq/X/smp_affinity_list shows that the interrupt
is spread out, such as 0-10, 11-21, and so on. This problem cannot be
resolved by "echo X > /proc/irq/X/smp_affinity_list", because the NVMe SSD
interrupt is requested by the API pci_alloc_irq_vectors(), so the
interrupt has the IRQD_AFFINITY_MANAGED flag.

GuestOS sles12sp3 backports "automatic interrupt affinity for MSI/MSI-X
capable devices", but the implementation of __setup_irq has no
corresponding modification. It is still irq_startup(), then
setup_affinity(), i.e. the affinity message is sent while the interrupt
is unmasked. The bare metal configuration is successful, but qemu will not
trigger the msix update, and the affinity configuration fails.

The affinity configured via /proc/irq/X/smp_affinity_list is implemented
in apic_ack_edge(): the bitmap is stored in pending_mask, then
mask -> __pci_write_msi_msg() -> unmask, so the timing is guaranteed and
the configuration takes effect.

The GuestOS linux-master incorporates "genirq/cpuhotplug: Enforce affinity
setting on startup of managed irqs" to ensure that, for a managed
interrupt, the affinity is issued first and __irq_startup() runs
afterwards. So the configuration is successful.

It now looks like sles12sp3 (up to sles15sp1, linux-4.12.x) and redhat7.6
(3.10.0-957.10.1) have not backported the patch yet.

"if (is_masked == was_masked) return;" - can it be removed from qemu?
What is the reason for this check?
Signed-off-by: Zhuang Yanying
---
 hw/pci/msix.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index 4e33641..e1ff533 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -119,10 +119,6 @@ static void msix_handle_mask_update(PCIDevice *dev, int vector,
                                     bool was_masked)
 {
     bool is_masked = msix_is_masked(dev, vector);
 
-    if (is_masked == was_masked) {
-        return;
-    }
-
     msix_fire_vector_notifier(dev, vector, is_masked);
 
     if (!is_masked && msix_is_pending(dev, vector)) {
-- 
1.8.3.1
[Qemu-devel] [PATCH v2 2/3] KVM: MMU: introduce kvm_mmu_write_protect_all_pages
From: Xiao Guangrong

The original idea is from Avi. kvm_mmu_write_protect_all_pages() is
extremely fast at write-protecting all the guest memory. Compared with the
ordinary algorithm, which write-protects the last-level sptes one by one
based on the rmap, it simply updates the generation number to ask all
vCPUs to reload their root page table. In particular, this can be done
outside of mmu-lock, so it does not hurt the vMMU's parallelism. It is an
O(1) algorithm which does not depend on the size of guest memory or the
number of guest vCPUs.

When reloading its root page table, the vCPU checks the root page table's
generation number against the current global number. If they do not match,
it makes all the entries in the root page read-only and directly re-enters
the VM. So read access continues smoothly without KVM's involvement, while
write access triggers a page fault; KVM then moves the write protection
from the upper level to the lower-level page - by first making all the
entries in the lower page read-only and then making the upper-level entry
writable. This operation is repeated until we reach the last spte.

Note, the number of page faults and TLB flushes is the same as in the
ordinary algorithm.
During our test, for a VM which has 3G memory and 12 vCPUs, we benchmarked
the performance of pure memory writes after write protection and noticed
only a 3% drop. We also benchmarked running the pure memory-write test
case in a freshly booted VM (i.e. it triggers #PF to map memory) while
live migration was going on, and noticed the dirty page ratio increased by
~50%; that means the memory performance is hugely improved during live
migration.

Signed-off-by: Xiao Guangrong
Signed-off-by: Zhuang Yanying
---
 arch/x86/include/asm/kvm_host.h |  19 +
 arch/x86/kvm/mmu.c              | 172 ++--
 arch/x86/kvm/mmu.h              |   1 +
 arch/x86/kvm/paging_tmpl.h      |  13 ++-
 4 files changed, 196 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6633b40..3d4231b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -338,6 +338,13 @@ struct kvm_mmu_page {
 	/* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen. */
 	unsigned long mmu_valid_gen;
 
+	/*
+	 * The generation number of write protection for all guest memory
+	 * which is synced with kvm_arch.mmu_write_protect_all_indicator
+	 * whenever it is linked into upper entry.
+	 */
+	u64 mmu_write_protect_all_gen;
+
 	DECLARE_BITMAP(unsync_child_bitmap, KVM_MMU_SP_ENTRY_NR);
 	DECLARE_BITMAP(possible_writable_spte_bitmap, KVM_MMU_SP_ENTRY_NR);
 
@@ -850,6 +857,18 @@ struct kvm_arch {
 	unsigned int n_max_mmu_pages;
 	unsigned int indirect_shadow_pages;
 	unsigned long mmu_valid_gen;
+
+	/*
+	 * The indicator of write protection for all guest memory.
+	 *
+	 * The top bit indicates if the write-protect is enabled,
+	 * remaining bits are used as a generation number which is
+	 * increased whenever write-protect is enabled.
+	 *
+	 * The enable bit and generation number are squeezed into
+	 * a single u64 so that it can be accessed atomically.
+	 */
+	u64 mmu_write_protect_all_indicator;
+
 	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
 	/*
 	 * Hash table of struct kvm_mmu_page.
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e8adafc..effae7a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -490,6 +490,29 @@ static void kvm_mmu_reset_all_pte_masks(void)
 			GENMASK_ULL(low_phys_bits - 1, PAGE_SHIFT);
 }
 
+/* see the comments in struct kvm_arch. */
+#define WP_ALL_ENABLE_BIT	(63)
+#define WP_ALL_ENABLE_MASK	(1ull << WP_ALL_ENABLE_BIT)
+#define WP_ALL_GEN_MASK		(~0ull & ~WP_ALL_ENABLE_MASK)
+
+/* should be under mmu_lock */
+static bool get_write_protect_all_indicator(struct kvm *kvm, u64 *generation)
+{
+	u64 indicator = kvm->arch.mmu_write_protect_all_indicator;
+
+	*generation = indicator & WP_ALL_GEN_MASK;
+	return !!(indicator & WP_ALL_ENABLE_MASK);
+}
+
+static void
+set_write_protect_all_indicator(struct kvm *kvm, bool enable, u64 generation)
+{
+	u64 value = (u64)(!!enable) << WP_ALL_ENABLE_BIT;
+
+	value |= generation & WP_ALL_GEN_MASK;
+	kvm->arch.mmu_write_protect_all_indicator = value;
+}
+
 static int is_cpuid_PSE36(void)
 {
 	return 1;
@@ -2532,6 +2555,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			flush |= kvm_sync_pages(vcpu, gfn, &invalid_list);
 	}
 	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
+	get_write_protect_all_indicator(vcpu->kvm,
+				&sp->mmu_write_protect_all_gen);
 	clear_page(sp->spt);
 	trace_kvm_mmu_get_page(sp, true);
 
 @@
[Qemu-devel] [PATCH v2 3/3] KVM: MMU: fast cleanup D bit based on fast write protect
From: Zhuang Yanying

When live-migrating large-memory guests, the vcpu may hang for a long time
while starting migration, e.g. 9s for 2T (linux-5.0.0-rc2 + qemu-3.1.0).
The reason is memory_global_dirty_log_start() taking too long while the
vcpu is waiting for the BQL. The page-by-page D bit cleanup is the main
time consumer.

I think that the idea of "KVM: MMU: fast write protect" by xiaoguangrong,
especially the function kvm_mmu_write_protect_all_pages(), is very
helpful. With a little modification on top of his patch, it can solve this
problem: 9s down to 0.5s.

At the beginning of live migration, write protection is only applied to
the top-level SPTE. A write from the VM then triggers an EPT violation,
and write protection is pushed down via for_each_shadow_entry in
direct_map. Finally the Dirty bit of the target page (at the level-1 page
table) is cleared, and dirty page tracking is started. The page where the
GPA is located is marked dirty in mmu_set_spte.

Signed-off-by: Zhuang Yanying
---
 arch/x86/kvm/mmu.c     | 6 +-
 arch/x86/kvm/vmx/vmx.c | 5 ++---
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index effae7a..ac7a994 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3230,7 +3230,10 @@ static bool mmu_load_shadow_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 			break;
 
 		if (is_last_spte(spte, sp->role.level)) {
-			flush |= spte_write_protect(sptep, false);
+			if (sp->role.level == PT_PAGE_TABLE_LEVEL)
+				flush |= spte_clear_dirty(sptep);
+			else
+				flush |= spte_write_protect(sptep, false);
 			continue;
 		}
 
@@ -6106,6 +6109,7 @@ void kvm_mmu_write_protect_all_pages(struct kvm *kvm, bool write_protect)
 		kvm_reload_remote_mmus(kvm);
 	spin_unlock(&kvm->mmu_lock);
 }
+EXPORT_SYMBOL_GPL(kvm_mmu_write_protect_all_pages);
 
 static unsigned long
 mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f6915f1..540ec21 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7180,14 +7180,13 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 static void vmx_slot_enable_log_dirty(struct kvm *kvm,
 				      struct kvm_memory_slot *slot)
 {
-	kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
-	kvm_mmu_slot_largepage_remove_write_access(kvm, slot);
+	kvm_mmu_write_protect_all_pages(kvm, true);
 }
 
 static void vmx_slot_disable_log_dirty(struct kvm *kvm,
 				       struct kvm_memory_slot *slot)
 {
-	kvm_mmu_slot_set_dirty(kvm, slot);
+	kvm_mmu_write_protect_all_pages(kvm, false);
 }
 
 static void vmx_flush_log_dirty(struct kvm *kvm)
-- 
1.8.3.1
[Qemu-devel] [PATCH v2 1/3] KVM: MMU: introduce possible_writable_spte_bitmap
From: Xiao Guangrong

It is used to track possible writable sptes on the shadow page, on which
the bit is set to 1 for the sptes that are already writable or can be
locklessly updated to writable on the fast_page_fault path; also a counter
for the number of possible writable sptes is introduced to speed up bitmap
walking.

This works very efficiently, as in the worst case usually only one entry
in the PML4 (covering < 512G), a few entries in the PDPT (one entry covers
1G of memory), and the PDEs and PTEs need to be write-protected.

A later patch will get good performance by using this bitmap and counter
to quickly figure out writable sptes and write-protect them.

Signed-off-by: Xiao Guangrong
Signed-off-by: Zhuang Yanying
---
 arch/x86/include/asm/kvm_host.h |  5 +++-
 arch/x86/kvm/mmu.c              | 53 +++--
 2 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4660ce9..6633b40 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -128,6 +128,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
 #define KVM_MIN_ALLOC_MMU_PAGES 64
 #define KVM_MMU_HASH_SHIFT 12
 #define KVM_NUM_MMU_PAGES (1 << KVM_MMU_HASH_SHIFT)
+#define KVM_MMU_SP_ENTRY_NR 512
 #define KVM_MIN_FREE_MMU_PAGES 5
 #define KVM_REFILL_PAGES 25
 #define KVM_MAX_CPUID_ENTRIES 80
@@ -331,13 +332,15 @@ struct kvm_mmu_page {
 	gfn_t *gfns;
 	int root_count;          /* Currently serving as active root */
 	unsigned int unsync_children;
+	unsigned int possible_writable_sptes;
 	struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
 
 	/* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen. */
 	unsigned long mmu_valid_gen;
 
-	DECLARE_BITMAP(unsync_child_bitmap, 512);
+	DECLARE_BITMAP(unsync_child_bitmap, KVM_MMU_SP_ENTRY_NR);
+	DECLARE_BITMAP(possible_writable_spte_bitmap, KVM_MMU_SP_ENTRY_NR);
 
 #ifdef CONFIG_X86_32
 	/*
 	 * Used out of the mmu-lock to avoid reading spte values while an
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ce770b4..e8adafc 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -718,6 +718,49 @@ static bool is_dirty_spte(u64 spte)
 	return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
 }
 
+static bool is_possible_writable_spte(u64 spte)
+{
+	if (!is_shadow_present_pte(spte))
+		return false;
+
+	if (is_writable_pte(spte))
+		return true;
+
+	if (spte_can_locklessly_be_made_writable(spte))
+		return true;
+
+	/*
+	 * although is_access_track_spte() sptes can be updated out of
+	 * mmu-lock, we need not take them into account as access_track
+	 * drops writable bit for them
+	 */
+	return false;
+}
+
+static void
+mmu_log_possible_writable_spte(u64 *sptep, u64 old_spte, u64 new_spte)
+{
+	struct kvm_mmu_page *sp = page_header(__pa(sptep));
+	bool old_state, new_state;
+
+	old_state = is_possible_writable_spte(old_spte);
+	new_state = is_possible_writable_spte(new_spte);
+
+	if (old_state == new_state)
+		return;
+
+	/* a possible writable spte is dropped */
+	if (old_state) {
+		sp->possible_writable_sptes--;
+		__clear_bit(sptep - sp->spt, sp->possible_writable_spte_bitmap);
+		return;
+	}
+
+	/* a new possible writable spte is set */
+	sp->possible_writable_sptes++;
+	__set_bit(sptep - sp->spt, sp->possible_writable_spte_bitmap);
+}
+
 /* Rules for using mmu_spte_set:
  * Set the sptep from nonpresent to present.
  * Note: the sptep being assigned *must* be either not present
@@ -728,6 +771,7 @@ static void mmu_spte_set(u64 *sptep, u64 new_spte)
 {
 	WARN_ON(is_shadow_present_pte(*sptep));
 	__set_spte(sptep, new_spte);
+	mmu_log_possible_writable_spte(sptep, 0ull, new_spte);
 }
 
 /*
@@ -751,7 +795,7 @@ static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
 		old_spte = __update_clear_spte_slow(sptep, new_spte);
 
 	WARN_ON(spte_to_pfn(old_spte) != spte_to_pfn(new_spte));
-
+	mmu_log_possible_writable_spte(sptep, old_spte, new_spte);
 	return old_spte;
 }
 
@@ -817,6 +861,8 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
 	else
 		old_spte = __update_clear_spte_slow(sptep, 0ull);
 
+	mmu_log_possible_writable_spte(sptep, old_spte, 0ull);
+
 	if (!is_shadow_present_pte(old_spte))
 		return 0;
 
@@ -845,7 +891,10 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
  */
 static void mmu_spte_clear_no_track(u64 *sptep)
 {
+	u64 old_spte = *sptep;
+
 	__update_clear_spte_fast(sptep, 0ull);
+	mmu_log_possible_writable_spte(sptep, old_spte, 0ull);
 }
 
 static u64 mmu_spte_get_lockless(u64 *sptep)

@@ -2140,7
[Qemu-devel] [PATCH v2 0/3] KVM: MMU: fast cleanup D bit based on fast write protect
From: Zhuang Yanying

When live-migrating large-memory guests, the vcpu may hang for a long time
while starting migration, e.g. 9s for 2T (linux-5.0.0-rc2 + qemu-3.1.0).
The reason is memory_global_dirty_log_start() taking too long while the
vcpu is waiting for the BQL. The page-by-page D bit cleanup is the main
time consumer.

I think that the idea of "KVM: MMU: fast write protect" by xiaoguangrong,
especially the function kvm_mmu_write_protect_all_pages(), is very
helpful. With a little modification on top of his patch, it can solve this
problem: 9s down to 0.5s.

At the beginning of live migration, write protection is only applied to
the top-level SPTE. A write from the VM then triggers an EPT violation,
and write protection is pushed down via for_each_shadow_entry in
direct_map. Finally the Dirty bit of the target page (at the level-1 page
table) is cleared, and dirty page tracking is started. The page where the
GPA is located is marked dirty in mmu_set_spte.

There is a similar implementation in xen, just with emt instead of write
protection.

Xiao Guangrong (2):
  KVM: MMU: introduce possible_writable_spte_bitmap
  KVM: MMU: introduce kvm_mmu_write_protect_all_pages

Zhuang Yanying (1):
  KVM: MMU: fast cleanup D bit based on fast write protect

 arch/x86/include/asm/kvm_host.h |  24 +-
 arch/x86/kvm/mmu.c              | 229 ++--
 arch/x86/kvm/mmu.h              |   1 +
 arch/x86/kvm/paging_tmpl.h      |  13 ++-
 arch/x86/kvm/vmx/vmx.c          |   5 +-
 5 files changed, 257 insertions(+), 15 deletions(-)

--
v1 -> v2:
  - drop "KVM: MMU: correct the behavior of mmu_spte_update_no_track"
  - mmu_write_protect_all_indicator is no longer an atomic variable,
    protected by mmu_lock
  - implement kvm_mmu_slot_set_dirty with kvm_mmu_write_protect_all_pages
  - some modification of the commit messages
--
1.8.3.1
Re: [Qemu-devel] [PATCH 4/4] KVM: MMU: fast cleanup D bit based on fast write protect
> -----Original Message-----
> From: Sean Christopherson [mailto:sean.j.christopher...@intel.com]
> Sent: Tuesday, January 22, 2019 11:17 PM
> To: Zhuangyanying
> Cc: xiaoguangr...@tencent.com; pbonz...@redhat.com; Gonglei (Arei)
> ; qemu-devel@nongnu.org; k...@vger.kernel.org;
> wangxin (U) ; Liujinsong (Paul)
> ; Zhoujian (jay) ;
> xiaoguangrong.e...@gmail.com
> Subject: Re: [PATCH 4/4] KVM: MMU: fast cleanup D bit based on fast write
> protect
>
> On Mon, Jan 21, 2019 at 06:37:36AM +, Zhuangyanying wrote:
> > > > 	u64 wp_all_indicator, kvm_wp_all_gen;
> > > >
> > > > -	mutex_lock(&kvm->slots_lock);
> > > > 	wp_all_indicator = get_write_protect_all_indicator(kvm);
> > > > 	kvm_wp_all_gen = get_write_protect_all_gen(wp_all_indicator);
> > > >
> > > > @@ -6134,8 +6136,8 @@ void kvm_mmu_write_protect_all_pages(struct
> > > > kvm *kvm, bool write_protect)
> > > > 	 */
> > > > 	if (write_protect)
> > > > 		kvm_reload_remote_mmus(kvm);
> > > > -	mutex_unlock(&kvm->slots_lock);
> > >
> > > Why is the lock removed? And why was it added in the first place?
> > >
> > The original purpose of fast write protect is to implement lock-free
> > get_dirty_log; kvm_mmu_write_protect_all_pages is a stand-alone kvm
> > API. See
> > https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg04370.html
> > A total of 7 patches; I only need the first 3 patches to achieve
> > step-by-step page table traversal. In order to maintain the integrity
> > of the xiaoguangrong patch, I did not directly modify it on his patch.
>
> That's not a sufficient argument for adding locking and removing it one
> patch later.
>
> > > > 	}
> > > > +EXPORT_SYMBOL_GPL(kvm_mmu_write_protect_all_pages);
> > > >
> > > > 	static unsigned long
> > > > 	mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > > index f6915f1..5236a07 100644
> > > > --- a/arch/x86/kvm/vmx/vmx.c
> > > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > > @@ -7180,8 +7180,7 @@ static void vmx_sched_in(struct kvm_vcpu
> > > > *vcpu, int cpu) static void vmx_slot_enable_log_dirty(struct kvm *kvm,
> > > > 	struct kvm_memory_slot *slot) {
> > > > -	kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
> > > > -	kvm_mmu_slot_largepage_remove_write_access(kvm, slot);
> > > > +	kvm_mmu_write_protect_all_pages(kvm, true);
> > >
> > > What's the purpose of having @write_protect if
> > > kvm_mmu_write_protect_all_pages() is only ever called to enable
> > > protection? If there's no known scenario where write protection is
> > > explicitly disabled then there's no need for WP_ALL_ENABLE_MASK,
> > > i.e. a non-zero generation would indicate write protection is
> > > enabled. That'd simplify the code and clean up the atomic usage.
> > >
> > In live migration, the large page split depends on the creation of
> > memslot->dirty_bitmap in the function __kvm_set_memory_region().
> > The interface design between qemu and kvm enables the dirty log slot
> > by slot. In order to enable dirty page tracking of the entire vm, it
> > is necessary to call kvm_mmu_write_protect_all_pages multiple times.
> > The page table update requests can be merged for processing through
> > the atomic usage. This method is not elegant, but it works.
> > Completing the creation of all slots' dirty_bitmap in one API, so that
> > kvm_mmu_write_protect_all_pages is called just once, would need more
> > implementation changes, even in qemu.
>
> Calling kvm_mmu_write_protect_all_pages() multiple times is fine. My
> question was regarding the 'write_protect' parameter. If all callers
> always pass %true for 'write_protect' then why does the parameter exist?
> And eliminating the parameter means you don't need an 'enable' flag
> buried in the generation, which would simplify the implementation.

In fact, when cancelling migration for a 2T VM, the function
memory_global_dirty_log_stop() will hold the BQL for 12s,
vmx_slot_disable_log_dirty() for 4s and kvm_mmu_zap_collapsible_sptes()
for 2s, doubled because smm is enabled by default.

What about:

static void vmx_slot_disable_log_dirty(struct kvm *kvm,
				       struct kvm_memory_slot *slot)
{
+	kvm_mmu_write_protect_all_pages(kvm, false);
-	kvm_mmu_slot_set_dirty(kvm, slot);
}

Sorry, this patch is not complete. I will send patch v2 soon.

Best regards,
-Zhuang Yanying
Re: [Qemu-devel] [PATCH 4/4] KVM: MMU: fast cleanup D bit based on fast write protect
> -----Original Message-----
> From: Sean Christopherson [mailto:sean.j.christopher...@intel.com]
> Sent: Friday, January 18, 2019 12:32 AM
> To: Zhuangyanying
> Cc: xiaoguangr...@tencent.com; pbonz...@redhat.com; Gonglei (Arei)
> ; qemu-devel@nongnu.org; k...@vger.kernel.org;
> wangxin (U) ; Liujinsong (Paul)
>
> Subject: Re: [PATCH 4/4] KVM: MMU: fast cleanup D bit based on fast write
> protect
>
> On Thu, Jan 17, 2019 at 01:55:31PM +, Zhuangyanying wrote:
> > From: Zhuang Yanying
> >
> > When live-migration with large-memory guests, vcpu may hang for a long
> > time while starting migration, such as 9s for 2T
> > (linux-5.0.0-rc2+qemu-3.1.0).
> > The reason is memory_global_dirty_log_start() taking too long, and the
> > vcpu is waiting for BQL. The page-by-page D bit clearup is the main
> > time consumption. I think that the idea of "KVM: MMU: fast write
> > protect" by xiaoguangrong, especially the function
> > kvm_mmu_write_protect_all_pages(),
> > is very helpful. After a little modifcation, on his patch, can solve
> > this problem, 9s to 0.5s.
> >
> > At the beginning of live migration, write protection is only applied
> > to the top-level SPTE. Then the write from vm trigger the EPT
> > violation, with for_each_shadow_entry write protection is performed at
> > dirct_map.
> > Finally the Dirty bit of the target page(at level 1 page table) is
> > cleared, and the dirty page tracking is started. Of coure, the page
> > where GPA is located is marked dirty when mmu_set_spte.
> > A similar implementation on xen, just emt instead of write protection.
> >
> > Signed-off-by: Zhuang Yanying
> > ---
> >  arch/x86/kvm/mmu.c     | 8 +---
> >  arch/x86/kvm/vmx/vmx.c | 3 +--
> >  2 files changed, 6 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index 047b897..a18bcc0 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -3257,7 +3257,10 @@ static bool mmu_load_shadow_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> > 		break;
> >
> > 	if (is_last_spte(spte, sp->role.level)) {
> > -		flush |= spte_write_protect(sptep, false);
> > +		if (sp->role.level == PT_PAGE_TABLE_LEVEL)
> > +			flush |= spte_clear_dirty(sptep);
> > +		else
> > +			flush |= spte_write_protect(sptep, false);
> > 		continue;
> > 	}
> >
> > @@ -6114,7 +6117,6 @@ void kvm_mmu_write_protect_all_pages(struct kvm *kvm, bool write_protect)
> > {
> > 	u64 wp_all_indicator, kvm_wp_all_gen;
> >
> > -	mutex_lock(&kvm->slots_lock);
> > 	wp_all_indicator = get_write_protect_all_indicator(kvm);
> > 	kvm_wp_all_gen = get_write_protect_all_gen(wp_all_indicator);
> >
> > @@ -6134,8 +6136,8 @@ void kvm_mmu_write_protect_all_pages(struct kvm *kvm, bool write_protect)
> > 	 */
> > 	if (write_protect)
> > 		kvm_reload_remote_mmus(kvm);
> > -	mutex_unlock(&kvm->slots_lock);
>
> Why is the lock removed? And why was it added in the first place?
>
The original purpose of fast write protect is to implement lock-free
get_dirty_log; kvm_mmu_write_protect_all_pages is a stand-alone kvm API.
See https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg04370.html
A total of 7 patches; I only need the first 3 patches to achieve
step-by-step page table traversal. In order to maintain the integrity of
the xiaoguangrong patch, I did not directly modify it on his patch.

> > }
> > +EXPORT_SYMBOL_GPL(kvm_mmu_write_protect_all_pages);
> >
> > static unsigned long
> > mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index f6915f1..5236a07 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -7180,8 +7180,7 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu,
> > int cpu) static void vmx_slot_enable_log_dirty(struct kvm *kvm,
> > struct kvm_memory_slot *slot) {
> > -	kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
> > -	kvm_mmu_slot_largepage_remove_write_access(kvm, slot);
> > +	kvm_mmu_write_protect_all_pages(kvm, true);
>
> What's the purpose of having @write_protect if
> kvm_mmu_write_protect_all_pages() is only ever called to enable
> protection? If there's no known scenario where write protection is
> explicitly disabled then there's no need for WP_ALL_
[Qemu-devel] [PATCH 4/4] KVM: MMU: fast cleanup D bit based on fast write protect
From: Zhuang Yanying

When live-migrating large-memory guests, the vcpu may hang for a long time
while starting migration, e.g. 9s for 2T (linux-5.0.0-rc2 + qemu-3.1.0).
The reason is memory_global_dirty_log_start() taking too long while the
vcpu is waiting for the BQL. The page-by-page D bit cleanup is the main
time consumer.

I think that the idea of "KVM: MMU: fast write protect" by xiaoguangrong,
especially the function kvm_mmu_write_protect_all_pages(), is very
helpful. With a little modification on top of his patch, it can solve this
problem: 9s down to 0.5s.

At the beginning of live migration, write protection is only applied to
the top-level SPTE. A write from the VM then triggers an EPT violation,
and write protection is pushed down via for_each_shadow_entry in
direct_map. Finally the Dirty bit of the target page (at the level-1 page
table) is cleared, and dirty page tracking is started. Of course, the page
where the GPA is located is marked dirty in mmu_set_spte.

There is a similar implementation in xen, just with emt instead of write
protection.
Signed-off-by: Zhuang Yanying
---
 arch/x86/kvm/mmu.c     | 8 +---
 arch/x86/kvm/vmx/vmx.c | 3 +--
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 047b897..a18bcc0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3257,7 +3257,10 @@ static bool mmu_load_shadow_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 			break;
 
 		if (is_last_spte(spte, sp->role.level)) {
-			flush |= spte_write_protect(sptep, false);
+			if (sp->role.level == PT_PAGE_TABLE_LEVEL)
+				flush |= spte_clear_dirty(sptep);
+			else
+				flush |= spte_write_protect(sptep, false);
 			continue;
 		}
 
@@ -6114,7 +6117,6 @@ void kvm_mmu_write_protect_all_pages(struct kvm *kvm, bool write_protect)
 {
 	u64 wp_all_indicator, kvm_wp_all_gen;
 
-	mutex_lock(&kvm->slots_lock);
 	wp_all_indicator = get_write_protect_all_indicator(kvm);
 	kvm_wp_all_gen = get_write_protect_all_gen(wp_all_indicator);
 
@@ -6134,8 +6136,8 @@ void kvm_mmu_write_protect_all_pages(struct kvm *kvm, bool write_protect)
 	 */
 	if (write_protect)
 		kvm_reload_remote_mmus(kvm);
-	mutex_unlock(&kvm->slots_lock);
 }
+EXPORT_SYMBOL_GPL(kvm_mmu_write_protect_all_pages);
 
 static unsigned long
 mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f6915f1..5236a07 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7180,8 +7180,7 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 static void vmx_slot_enable_log_dirty(struct kvm *kvm,
 				      struct kvm_memory_slot *slot)
 {
-	kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
-	kvm_mmu_slot_largepage_remove_write_access(kvm, slot);
+	kvm_mmu_write_protect_all_pages(kvm, true);
 }
 
 static void vmx_slot_disable_log_dirty(struct kvm *kvm,
-- 
1.8.3.1
[Qemu-devel] [PATCH 2/4] KVM: MMU: introduce possible_writable_spte_bitmap
From: Xiao Guangrong It is used to track possible writable sptes on the shadow page on which the bit is set to 1 for the sptes that are already writable or can be locklessly updated to writable on the fast_page_fault path, also a counter for the number of possible writable sptes is introduced to speed up bitmap walking Later patch will benefit good performance by using this bitmap and counter to fast figure out writable sptes and write protect them Signed-off-by: Xiao Guangrong --- arch/x86/include/asm/kvm_host.h | 6 - arch/x86/kvm/mmu.c | 53 - 2 files changed, 57 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 4660ce9..5c30aa0 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -128,6 +128,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level) #define KVM_MIN_ALLOC_MMU_PAGES 64 #define KVM_MMU_HASH_SHIFT 12 #define KVM_NUM_MMU_PAGES (1 << KVM_MMU_HASH_SHIFT) +#define KVM_MMU_SP_ENTRY_NR 512 #define KVM_MIN_FREE_MMU_PAGES 5 #define KVM_REFILL_PAGES 25 #define KVM_MAX_CPUID_ENTRIES 80 @@ -331,12 +332,15 @@ struct kvm_mmu_page { gfn_t *gfns; int root_count; /* Currently serving as active root */ unsigned int unsync_children; + unsigned int possiable_writable_sptes; struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */ /* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen. */ unsigned long mmu_valid_gen; - DECLARE_BITMAP(unsync_child_bitmap, 512); + DECLARE_BITMAP(unsync_child_bitmap, KVM_MMU_SP_ENTRY_NR); + + DECLARE_BITMAP(possible_writable_spte_bitmap, KVM_MMU_SP_ENTRY_NR); #ifdef CONFIG_X86_32 /* diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index eeb3bac..9daab00 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -718,6 +718,49 @@ static bool is_dirty_spte(u64 spte) return dirty_mask ? 
spte & dirty_mask : spte & PT_WRITABLE_MASK; } +static bool is_possible_writable_spte(u64 spte) +{ + if (!is_shadow_present_pte(spte)) + return false; + + if (is_writable_pte(spte)) + return true; + + if (spte_can_locklessly_be_made_writable(spte)) + return true; + + /* +* although is_access_track_spte() sptes can be updated out of +* mmu-lock, we need not take them into account as access_track +* drops writable bit for them +*/ + return false; +} + +static void +mmu_log_possible_writable_spte(u64 *sptep, u64 old_spte, u64 new_spte) +{ + struct kvm_mmu_page *sp = page_header(__pa(sptep)); + bool old_state, new_state; + + old_state = is_possible_writable_spte(old_spte); + new_state = is_possible_writable_spte(new_spte); + + if (old_state == new_state) + return; + + /* a possible writable spte is dropped */ + if (old_state) { + sp->possiable_writable_sptes--; + __clear_bit(sptep - sp->spt, sp->possible_writable_spte_bitmap); + return; + } + + /* a new possible writable spte is set */ + sp->possiable_writable_sptes++; + __set_bit(sptep - sp->spt, sp->possible_writable_spte_bitmap); +} + /* Rules for using mmu_spte_set: * Set the sptep from nonpresent to present. 
* Note: the sptep being assigned *must* be either not present @@ -728,6 +771,7 @@ static void mmu_spte_set(u64 *sptep, u64 new_spte) { WARN_ON(is_shadow_present_pte(*sptep)); __set_spte(sptep, new_spte); + mmu_log_possible_writable_spte(sptep, 0ull, new_spte); } /* @@ -746,6 +790,7 @@ static void mmu_spte_update_no_track(u64 *sptep, u64 new_spte) } __update_clear_spte_fast(sptep, new_spte); + mmu_log_possible_writable_spte(sptep, old_spte, new_spte); } /* @@ -771,6 +816,7 @@ static u64 mmu_spte_update_track(u64 *sptep, u64 new_spte) WARN_ON(spte_to_pfn(old_spte) != spte_to_pfn(new_spte)); + mmu_log_possible_writable_spte(sptep, old_spte, new_spte); return old_spte; } @@ -836,6 +882,8 @@ static int mmu_spte_clear_track_bits(u64 *sptep) else old_spte = __update_clear_spte_slow(sptep, 0ull); + mmu_log_possible_writable_spte(sptep, old_spte, 0ull); + if (!is_shadow_present_pte(old_spte)) return 0; @@ -864,7 +912,10 @@ static int mmu_spte_clear_track_bits(u64 *sptep) */ static void mmu_spte_clear_no_track(u64 *sptep) { + u64 old_spte = *sptep; + __update_clear_spte_fast(sptep, 0ull); + mmu_log_possible_writable_spte(sptep, old_spte, 0ull); } static u64 mmu_spte_get_lockless(u64 *sptep) @@ -2159,7 +2210,7 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp, { int i, ret, nr_unsync_leaf = 0; - for_each_set_bit(i,
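The bitmap-plus-counter bookkeeping that mmu_log_possible_writable_spte() performs in the patch above can be modeled in user space. The sketch below is illustrative only: the struct, the simplified SPTE bit layout, and all names (shadow_page, log_spte_change, demo) are stand-ins for the kernel's types, not the real ones.

```c
/* User-space model of the possible-writable-SPTE tracking: a per-page
 * bitmap plus a counter, updated whenever an SPTE transitions into or
 * out of the "possibly writable" state. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SP_ENTRY_NR 512
#define SPTE_PRESENT             (1ull << 0)
#define SPTE_WRITABLE            (1ull << 1)
#define SPTE_LOCKLESSLY_WRITABLE (1ull << 2) /* stand-in for the mmu-writable bit */

struct shadow_page {
    uint64_t spt[SP_ENTRY_NR];
    uint64_t writable_bitmap[SP_ENTRY_NR / 64];
    unsigned int possible_writable_sptes;   /* counter kept in sync with bitmap */
};

static bool is_possible_writable(uint64_t spte)
{
    if (!(spte & SPTE_PRESENT))
        return false;
    return (spte & SPTE_WRITABLE) || (spte & SPTE_LOCKLESSLY_WRITABLE);
}

/* Mirrors mmu_log_possible_writable_spte(): update bitmap and counter
 * only when the possible-writable state actually changes. */
static void log_spte_change(struct shadow_page *sp, int idx,
                            uint64_t old_spte, uint64_t new_spte)
{
    bool old_state = is_possible_writable(old_spte);
    bool new_state = is_possible_writable(new_spte);

    sp->spt[idx] = new_spte;
    if (old_state == new_state)
        return;
    if (old_state) {                      /* a possible writable spte is dropped */
        sp->possible_writable_sptes--;
        sp->writable_bitmap[idx / 64] &= ~(1ull << (idx % 64));
    } else {                              /* a new possible writable spte is set */
        sp->possible_writable_sptes++;
        sp->writable_bitmap[idx / 64] |= 1ull << (idx % 64);
    }
}

/* Install two writable sptes, then write-protect one of them. */
static unsigned int demo(void)
{
    struct shadow_page sp;
    memset(&sp, 0, sizeof(sp));
    log_spte_change(&sp, 0, 0, SPTE_PRESENT | SPTE_WRITABLE);
    log_spte_change(&sp, 7, 0, SPTE_PRESENT | SPTE_LOCKLESSLY_WRITABLE);
    log_spte_change(&sp, 0, sp.spt[0], SPTE_PRESENT);  /* write-protected */
    return sp.possible_writable_sptes;
}
```

The key invariant is that the counter always equals the number of set bits, so a later walker can skip a fully write-protected page by checking the counter before touching the bitmap at all.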
[Qemu-devel] [PATCH 1/4] KVM: MMU: correct the behavior of mmu_spte_update_no_track
From: Xiao Guangrong The current behavior of mmu_spte_update_no_track() does not match the name _no_track(): the A/D bits actually are tracked and returned to the caller. This patch introduces a real _no_track() function that updates the spte regardless of the A/D bits, and renames the original function to _track(). The _no_track() function will be used by later patches to update upper-level sptes, which indeed need not care about the A/D bits. Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c | 25 ++--- 1 file changed, 22 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index ce770b4..eeb3bac 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -731,10 +731,29 @@ static void mmu_spte_set(u64 *sptep, u64 new_spte) } /* - * Update the SPTE (excluding the PFN), but do not track changes in its + * Update the SPTE (excluding the PFN) regardless of accessed/dirty + * status which is used to update the upper level spte. + */ +static void mmu_spte_update_no_track(u64 *sptep, u64 new_spte) +{ + u64 old_spte = *sptep; + + WARN_ON(!is_shadow_present_pte(new_spte)); + + if (!is_shadow_present_pte(old_spte)) { + mmu_spte_set(sptep, new_spte); + return; + } + + __update_clear_spte_fast(sptep, new_spte); +} + +/* + * Update the SPTE (excluding the PFN), the original value is + * returned, based on it, the caller can track changes of its * accessed/dirty status. */ -static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte) +static u64 mmu_spte_update_track(u64 *sptep, u64 new_spte) { u64 old_spte = *sptep; @@ -769,7 +788,7 @@ static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte) static bool mmu_spte_update(u64 *sptep, u64 new_spte) { bool flush = false; - u64 old_spte = mmu_spte_update_no_track(sptep, new_spte); + u64 old_spte = mmu_spte_update_track(sptep, new_spte); if (!is_shadow_present_pte(old_spte)) return false; -- 1.8.3.1
[Qemu-devel] [PATCH 3/4] KVM: MMU: introduce kvm_mmu_write_protect_all_pages
From: Xiao Guangrong The original idea is from Avi. kvm_mmu_write_protect_all_pages() is extremely fast at write protecting all the guest memory. Compared with the ordinary algorithm, which write protects the last-level sptes one by one based on the rmap, it simply updates the generation number to ask all vCPUs to reload their root page table; in particular, this can be done outside the mmu-lock, so it does not hurt the vMMU's parallelism. It is an O(1) algorithm that does not depend on the size of guest memory or the number of guest vCPUs. When reloading its root page table, a vCPU checks the root page table's generation number against the current global number; if they do not match, it makes all the entries in the page read-only and directly enters the VM. So read access still proceeds smoothly without KVM's involvement, while write access triggers a page fault; KVM then moves the write protection from the upper level down to the lower-level page, by first making all the entries in the lower page read-only and then making the upper-level entry writable, and this operation is repeated until we reach the last spte. In order to speed up the process of making all entries read-only, we introduce possible_writable_spte_bitmap, which indicates the writable sptes, and possiable_writable_sptes, a counter of the number of writable sptes. This works very efficiently, as usually only one entry in the PML4 (< 512G), a few entries in the PDPT (each entry covers 1G of memory), and some PDEs and PTEs need to be write protected in the worst case. Note: the number of page faults and TLB flushes is the same as in the ordinary algorithm.
In our test, for a VM with 3G memory and 12 vCPUs, we benchmarked pure memory-write performance after write protection and noticed only a 3% drop. We also benchmarked running the pure memory-write test case in a freshly booted VM (i.e., it triggers #PF to map memory) while live migration is going on, and noticed the dirty page ratio increase by ~50%; that means memory performance is hugely improved during live migration. Signed-off-by: Xiao Guangrong --- arch/x86/include/asm/kvm_host.h | 19 + arch/x86/kvm/mmu.c | 179 ++-- arch/x86/kvm/mmu.h | 1 + arch/x86/kvm/paging_tmpl.h | 13 ++- 4 files changed, 204 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 5c30aa0..a581ff4 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -338,6 +338,13 @@ struct kvm_mmu_page { /* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen. */ unsigned long mmu_valid_gen; + /* +* The generation number of write protection for all guest memory +* which is synced with kvm_arch.mmu_write_protect_all_indicator +* whenever it is linked into upper entry. +*/ + u64 mmu_write_protect_all_gen; + DECLARE_BITMAP(unsync_child_bitmap, KVM_MMU_SP_ENTRY_NR); DECLARE_BITMAP(possible_writable_spte_bitmap, KVM_MMU_SP_ENTRY_NR); @@ -851,6 +858,18 @@ struct kvm_arch { unsigned int n_max_mmu_pages; unsigned int indirect_shadow_pages; unsigned long mmu_valid_gen; + + /* +* The indicator of write protection for all guest memory. +* +* The top bit indicates if the write-protect is enabled, +* remaining bits are used as a generation number which is +* increased whenever write-protect is enabled. +* +* The enable bit and generation number are squeezed into +* a single u64 so that it can be accessed atomically. +*/ + atomic64_t mmu_write_protect_all_indicator; struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES]; /* * Hash table of struct kvm_mmu_page.
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 9daab00..047b897 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -489,6 +489,34 @@ static void kvm_mmu_reset_all_pte_masks(void) shadow_nonpresent_or_rsvd_lower_gfn_mask = GENMASK_ULL(low_phys_bits - 1, PAGE_SHIFT); } +/* see the comments in struct kvm_arch. */ +#define WP_ALL_ENABLE_BIT (63) +#define WP_ALL_ENABLE_MASK (1ull << WP_ALL_ENABLE_BIT) +#define WP_ALL_GEN_MASK (~0ull & ~WP_ALL_ENABLE_MASK) + +static bool is_write_protect_all_enabled(u64 indicator) +{ + return !!(indicator & WP_ALL_ENABLE_MASK); +} + +static u64 get_write_protect_all_gen(u64 indicator) +{ + return indicator & WP_ALL_GEN_MASK; +} + +static u64 get_write_protect_all_indicator(struct kvm *kvm) +{ + return atomic64_read(&kvm->arch.mmu_write_protect_all_indicator); +} + +static void +set_write_protect_all_indicator(struct kvm *kvm, bool enable, u64 generation) +{ + u64 value = (u64)(!!enable) << WP_ALL_ENABLE_BIT; + + value |= generation & WP_ALL_GEN_MASK; +
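The enable-bit/generation encoding used by mmu_write_protect_all_indicator above is a plain bit-packing trick; here is a minimal user-space model of it. The helper names mirror the patch's macros, but this is a sketch, not the kernel code:

```c
/* Model of squeezing the write-protect-all enable flag (top bit) and a
 * generation counter (remaining 63 bits) into one u64, so both fields
 * can be read together with a single atomic 64-bit load. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WP_ALL_ENABLE_BIT  63
#define WP_ALL_ENABLE_MASK (1ull << WP_ALL_ENABLE_BIT)
#define WP_ALL_GEN_MASK    (~0ull & ~WP_ALL_ENABLE_MASK)

static bool wp_all_enabled(uint64_t indicator)
{
    return (indicator & WP_ALL_ENABLE_MASK) != 0;
}

static uint64_t wp_all_gen(uint64_t indicator)
{
    return indicator & WP_ALL_GEN_MASK;
}

/* Pack an (enable, generation) pair into one word. */
static uint64_t wp_all_pack(bool enable, uint64_t generation)
{
    return ((uint64_t)!!enable << WP_ALL_ENABLE_BIT) |
           (generation & WP_ALL_GEN_MASK);
}
```

Because both fields live in one u64, a reader sees a consistent (enable, generation) pair from a single atomic64_read(), with no lock needed; that is exactly why the patch can drop the slots_lock around kvm_mmu_write_protect_all_pages().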
[Qemu-devel] [PATCH 0/4] KVM: MMU: fast cleanup D bit based on fast write protect
From: Zhuang Yanying Recently I tested live migration with large-memory guests and found that vcpus may hang for a long time while starting migration, e.g. 9s for a 2048G guest (linux-5.0.0-rc2 + qemu-3.1.0). The reason is that memory_global_dirty_log_start() takes too long and the vcpu is waiting for the BQL. The page-by-page D-bit cleanup is the main time consumer. I think the idea of "KVM: MMU: fast write protect" by xiaoguangrong, especially the function kvm_mmu_write_protect_all_pages(), is very helpful. With a little modification on top of his patch it can solve this problem: 9s down to 0.5s. At the beginning of live migration, write protection is only applied to the top-level SPTE. A write from the VM then triggers an EPT violation, and write protection is pushed down the direct map with for_each_shadow_entry. Finally the Dirty bit of the target page (at the level-1 page table) is cleared, and dirty page tracking is started. Of course, the page where the GPA is located is marked dirty in mmu_set_spte. Xen has a similar implementation, just with emt instead of write protection. What do you think about this solution? Xiao Guangrong (3): KVM: MMU: correct the behavior of mmu_spte_update_no_track KVM: MMU: introduce possible_writable_spte_bitmap KVM: MMU: introduce kvm_mmu_write_protect_all_pages Zhuang Yanying (1): KVM: MMU: fast cleanup D bit based on fast write protect arch/x86/include/asm/kvm_host.h | 25 +++- arch/x86/kvm/mmu.c | 259 ++-- arch/x86/kvm/mmu.h | 1 + arch/x86/kvm/paging_tmpl.h | 13 +- arch/x86/kvm/vmx/vmx.c | 3 +- 5 files changed, 286 insertions(+), 15 deletions(-) -- 1.8.3.1
[Qemu-devel] [PATCH] KVM: MMU: fast cleanup D bit based on fast write protect
From: Zhuang Yanying Recently I tested live migration with large-memory guests and found that vcpus may hang for a long time while starting migration, e.g. 9s for a 2048G guest (linux-4.20.1 + qemu-3.1.0). The reason is that memory_global_dirty_log_start() takes too long and the vcpu is waiting for the BQL. The page-by-page D-bit cleanup is the main time consumer. I think the idea of "KVM: MMU: fast write protect" by xiaoguangrong, especially the function kvm_mmu_write_protect_all_pages(), is very helpful. With a little modification on top of his patch it can solve this problem: 9s down to 0.5s. At the beginning of live migration, write protection is only applied to the top-level SPTE. A write from the VM then triggers an EPT violation, and write protection is pushed down the direct map with for_each_shadow_entry. Finally the Dirty bit of the target page (at the level-1 page table) is cleared, and dirty page tracking is started. Of course, the page where the GPA is located is marked dirty in mmu_set_spte. Xen has a similar implementation, just with emt instead of write protection. What do you think about this solution?
--- mmu.c | 5 - vmx.c | 3 +-- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/mmu.c b/mmu.c index b079d74..f49d316 100755 --- a/mmu.c +++ b/mmu.c @@ -3210,7 +3210,10 @@ static bool mmu_load_shadow_page(struct kvm *kvm, struct kvm_mmu_page *sp) break; if (is_last_spte(spte, sp->role.level)) { - flush |= spte_write_protect(sptep, false); + if (sp->role.level == PT_PAGE_TABLE_LEVEL) + flush |= spte_clear_dirty(sptep); + else + flush |= spte_write_protect(sptep, false); continue; } diff --git a/vmx.c b/vmx.c index 95784bc..7ec717f 100755 --- a/vmx.c +++ b/vmx.c @@ -14421,8 +14421,7 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu) static void vmx_slot_enable_log_dirty(struct kvm *kvm, struct kvm_memory_slot *slot) { - kvm_mmu_slot_leaf_clear_dirty(kvm, slot); - kvm_mmu_slot_largepage_remove_write_access(kvm, slot); + kvm_mmu_write_protect_all_pages(kvm, true); } static void vmx_slot_disable_log_dirty(struct kvm *kvm, -- 1.8.3.1
[Qemu-devel] [RFH]vcpu may hang for up to 4s while starting migration
From: Zhuang Yanying Hi, Recently I tested live-migrating a VM with 1T memory and found that the vcpu may hang for up to 4s while starting migration. The reason is that memory_global_dirty_log_start takes too long and the vcpu is waiting for the BQL. migrate thread: qemu_mutex_lock_iothread -> memory_global_dirty_log_start /* lasts 4s */ -> qemu_mutex_unlock_iothread; vcpu: kvm_handle_io -> try qemu_mutex_lock_iothread ... -> success qemu_mutex_lock_iothread. memory_global_dirty_log_start cleans up the dirty bits of the sptes in KVM and starts the dirty page tracking of PML. Because the VM's memory is very large, this takes too long in KVM. Is the following scheme feasible: (1) Move the action of turning on dirty page tracking in memory_global_dirty_log_start into asynchronous execution outside the BQL. (2) The first time dirty page tracking is turned on, only clean up the dirty bits of the 1G page table entries, so the time spent in memory_global_dirty_log_start is reduced. Best regards, -Zhuang Yanying ---
[Qemu-devel] [PATCH v3] KVM: x86: Fix nmi injection failure when vcpu got blocked
From: ZhuangYanying <ann.zhuangyany...@huawei.com> When a spin_lock_irqsave() deadlock occurs inside the guest, vcpu threads other than the lock-holding one enter the S state because of pvspinlock. Then when an NMI is injected via the libvirt API "inject-nmi", the NMI cannot be injected into the vm. The reason is: 1 It sets nmi_queued to 1 when calling ioctl KVM_NMI in qemu, and sets cpu->kvm_vcpu_dirty to true in do_inject_external_nmi() meanwhile. 2 It sets nmi_queued to 0 in process_nmi(), before entering the guest, because cpu->kvm_vcpu_dirty is true. It is not enough to check only nmi_queued to decide whether to stay in vcpu_block() or not; the NMI should be injected immediately in any situation. Also check nmi_pending, and test KVM_REQ_NMI instead of nmi_queued, in kvm_vcpu_has_events(). Do the same change for SMIs. Signed-off-by: Zhuang Yanying <ann.zhuangyany...@huawei.com> --- v1->v2 - simplify message. The complete description is here: http://www.spinics.net/lists/kvm/msg150380.html - Testing KVM_REQ_NMI replaces nmi_pending. - Add Testing kvm_x86_ops->nmi_allowed(vcpu). v2->v3 - Testing KVM_REQ_NMI replaces nmi_queued, not nmi_pending. - Do the same change for SMIs. --- arch/x86/kvm/x86.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 02363e3..a2cd099 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8394,10 +8394,13 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu) if (vcpu->arch.pv.pv_unhalted) return true; - if (atomic_read(&vcpu->arch.nmi_queued)) + if (kvm_test_request(KVM_REQ_NMI, vcpu) || + (vcpu->arch.nmi_pending && +kvm_x86_ops->nmi_allowed(vcpu))) return true; - if (kvm_test_request(KVM_REQ_SMI, vcpu)) + if (kvm_test_request(KVM_REQ_SMI, vcpu) || + (vcpu->arch.smi_pending && !is_smm(vcpu))) return true; if (kvm_arch_interrupt_allowed(vcpu) && -- 1.8.3.1
[Qemu-devel] [PATCH v2] KVM: x86: Fix nmi injection failure when vcpu got blocked
From: ZhuangYanying <ann.zhuangyany...@huawei.com> When a spin_lock_irqsave() deadlock occurs inside the guest, vcpu threads other than the lock-holding one enter the S state because of pvspinlock. Then when an NMI is injected via the libvirt API "inject-nmi", the NMI cannot be injected into the vm. The reason is: 1 It sets nmi_queued to 1 when calling ioctl KVM_NMI in qemu, and sets cpu->kvm_vcpu_dirty to true in do_inject_external_nmi() meanwhile. 2 It sets nmi_queued to 0 in process_nmi(), before entering the guest, because cpu->kvm_vcpu_dirty is true. It is not enough to check only nmi_queued to decide whether to stay in vcpu_block() or not; the NMI should be injected immediately in any situation. Add a check of the KVM_REQ_NMI request along with nmi_queued in kvm_vcpu_has_events(). Signed-off-by: Zhuang Yanying <ann.zhuangyany...@huawei.com> --- arch/x86/kvm/x86.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 02363e3..2d15708 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8394,7 +8394,9 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu) if (vcpu->arch.pv.pv_unhalted) return true; - if (atomic_read(&vcpu->arch.nmi_queued)) + if ((kvm_test_request(KVM_REQ_NMI, vcpu) || + atomic_read(&vcpu->arch.nmi_queued)) && + kvm_x86_ops->nmi_allowed(vcpu)) return true; if (kvm_test_request(KVM_REQ_SMI, vcpu)) -- 1.8.3.1
Re: [Qemu-devel] [PATCH] Fix nmi injection failure when vcpu got blocked
> -Original Message- > From: Radim Krčmář [mailto:rkrc...@redhat.com] > Sent: Wednesday, May 24, 2017 10:34 PM > To: Zhuangyanying > Cc: pbonz...@redhat.com; Herongguang (Stephen); qemu-devel@nongnu.org; > Gonglei (Arei); Zhangbo (Oscar); k...@vger.kernel.org > Subject: Re: [PATCH] Fix nmi injection failure when vcpu got blocked > > Please use tags in patches. > We usually begin the subject with "KVM: x86:" when touching > arch/x86/kvm/x86.c. > Sorry, I will add in patch v2. > 2017-05-24 13:48+0800, Zhuangyanying: > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c @@ -8394,7 > > +8394,8 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu) > > if (vcpu->arch.pv.pv_unhalted) > > return true; > > > > - if (atomic_read(>arch.nmi_queued)) > > + if (vcpu->arch.nmi_pending || > > + atomic_read(>arch.nmi_queued)) > > return true; > > Hm, I think we've been missing '&& kvm_x86_ops->nmi_allowed(vcpu)'. > Yes, we've been missing, I will add in patch v2. > The undesired resume if we have suppressed NMI is not making it much worse, > but wouldn't "kvm_test_request(KVM_REQ_NMI, vcpu)" also work here? > > > if (kvm_test_request(KVM_REQ_SMI, vcpu)) > > Thanks. "kvm_test_request(KVM_REQ_NMI, vcpu)", works fine for me.
[Qemu-devel] [PATCH] Fix nmi injection failure when vcpu got blocked
From: ZhuangYanying <ann.zhuangyany...@huawei.com> Recently I found that an NMI could not be injected into a vm via the libvirt API. Reproduce the problem: 1 use a guest of redhat 7.3 2 disable nmi_watchdog and trigger a spinlock deadlock inside the guest; check the running vcpu thread, and make sure it is not vcpu0 3 inject an NMI into the guest via the libvirt API "inject-nmi" Result: The NMI could not be injected into the guest. Reason: 1 It sets nmi_queued to 1 when calling ioctl KVM_NMI in qemu, and sets cpu->kvm_vcpu_dirty to true in do_inject_external_nmi() meanwhile. 2 It sets nmi_queued to 0 in process_nmi(), before entering the guest, because cpu->kvm_vcpu_dirty is true. Normally, the vcpu can call vcpu_enter_guest successfully to inject the NMI. However, in the problematic scenario, when the guest's threads hold spin_lock_irqsave for a long time, such as entering a while loop after spin_lock_irqsave(), the other vcpus enter the S state because of the pvspinlock scheme, and the KVM module then loops in vcpu_block rather than entering the guest. I think it is not suitable to decide whether to stay in vcpu_block() or not just by checking nmi_queued; the NMI should be injected immediately even in this situation. Solution: There are 2 ways to solve the problem: 1 call cpu_synchronize_state_not_set_dirty() rather than cpu_synchronize_state() while injecting the NMI, to avoid changing nmi_queued to 0. But other workqueues may affect cpu->kvm_vcpu_dirty, so it is not recommended. 2 add a check of nmi_pending along with nmi_queued in kvm_vcpu_has_events() in the KVM module.
qemu_kvm_wait_io_event qemu_wait_io_event_common flush_queued_work do_inject_external_nmi cpu_synchronize_state kvm_cpu_synchronize_state do_kvm_cpu_synchronize_state cpu->kvm_vcpu_dirty = true; /* trigger process_nmi */ kvm_vcpu_ioctl(cpu, KVM_NMI) kvm_vcpu_ioctl_nmi kvm_inject_nmi atomic_inc(&vcpu->arch.nmi_queued); nmi_queued = 1 /* nmi_queued set to 1, when qemu ioctl KVM_NMI */ kvm_make_request(KVM_REQ_NMI, vcpu); kvm_cpu_exec kvm_arch_put_registers(cpu, KVM_PUT_RUNTIME_STATE); kvm_arch_put_registers kvm_put_vcpu_events kvm_vcpu_ioctl(CPU(cpu), KVM_SET_VCPU_EVENTS, &events); kvm_vcpu_ioctl_x86_set_vcpu_events process_nmi(vcpu); vcpu->arch.nmi_pending += atomic_xchg(&vcpu->arch.nmi_queued, 0); nmi_queued = 0 /* nmi_queued set to 0, vcpu thread always blocks */ nmi_pending = 1 kvm_make_request(KVM_REQ_EVENT, vcpu); kvm_vcpu_ioctl(cpu, KVM_RUN, 0); kvm_arch_vcpu_ioctl_run vcpu_run(vcpu); kvm_vcpu_running(vcpu) /* always false, could not call vcpu_enter_guest */ vcpu_block kvm_arch_vcpu_runnable kvm_vcpu_has_events if (atomic_read(&vcpu->arch.nmi_queued)) /* nmi_queued is 0, vcpu thread always blocks */ Signed-off-by: Zhuang Yanying <ann.zhuangyany...@huawei.com> --- arch/x86/kvm/x86.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 02363e3..96983dc 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8394,7 +8394,8 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu) if (vcpu->arch.pv.pv_unhalted) return true; - if (atomic_read(&vcpu->arch.nmi_queued)) + if (vcpu->arch.nmi_pending || + atomic_read(&vcpu->arch.nmi_queued)) return true; if (kvm_test_request(KVM_REQ_SMI, vcpu)) -- 1.8.3.1
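The call flow above boils down to a small state machine: KVM_SET_VCPU_EVENTS drains nmi_queued into nmi_pending before KVM_RUN, so a runnable check that looks only at nmi_queued misses the pending NMI. A minimal user-space model of that, with field and function names chosen to mirror the trace (this is an illustration, not kernel code):

```c
/* Model of the vcpu_block() bug: the NMI moves from "queued" to
 * "pending" before the runnable check runs, so a check on nmi_queued
 * alone reports no events and the vcpu stays blocked. */
#include <assert.h>
#include <stdbool.h>

struct vcpu {
    int nmi_queued;   /* incremented by the KVM_NMI ioctl */
    int nmi_pending;  /* filled by process_nmi() */
};

/* Reached via KVM_SET_VCPU_EVENTS, before KVM_RUN. */
static void process_nmi(struct vcpu *v)
{
    v->nmi_pending += v->nmi_queued;
    v->nmi_queued = 0;
}

static bool has_events_buggy(const struct vcpu *v)
{
    return v->nmi_queued != 0;                       /* misses drained NMI */
}

static bool has_events_fixed(const struct vcpu *v)
{
    return v->nmi_queued != 0 || v->nmi_pending != 0; /* sees it */
}

static bool demo_buggy(void)
{
    struct vcpu v = { .nmi_queued = 1 };
    process_nmi(&v);
    return has_events_buggy(&v);
}

static bool demo_fixed(void)
{
    struct vcpu v = { .nmi_queued = 1 };
    process_nmi(&v);
    return has_events_fixed(&v);
}
```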
Re: [Qemu-devel] [BUG] Migrate failes between boards with different PMC counts
> -Original Message- > From: Daniel P. Berrange [mailto:berra...@redhat.com] > Sent: Monday, April 24, 2017 6:34 PM > To: Dr. David Alan Gilbert > Cc: Zhuangyanying; Zhanghailiang; wangxin (U); qemu-devel@nongnu.org; > Gonglei (Arei); Huangzhichao; pbonz...@redhat.com > Subject: Re: [Qemu-devel] [BUG] Migrate failes between boards with different > PMC counts > > On Mon, Apr 24, 2017 at 11:27:16AM +0100, Dr. David Alan Gilbert wrote: > > * Daniel P. Berrange (berra...@redhat.com) wrote: > > > On Mon, Apr 24, 2017 at 10:23:21AM +0100, Dr. David Alan Gilbert wrote: > > > > * Zhuangyanying (ann.zhuangyany...@huawei.com) wrote: > > > > > Hi all, > > > > > > > > > > Recently, I found migration failed when enable vPMU. > > > > > > > > > > migrate vPMU state was introduced in linux-3.10 + qemu-1.7. > > > > > > > > > > As long as enable vPMU, qemu will save / load the > > > > > vmstate_msr_architectural_pmu(msr_global_ctrl) register during the > migration. > > > > > But global_ctrl generated based on cpuid(0xA), the number of > > > > > general-purpose performance monitoring counters(PMC) can vary > > > > > according to Intel SDN. The number of PMC presented to vm, does > > > > > not support configuration currently, it depend on host cpuid, and > > > > > enable > all pmc defaultly at KVM. It cause migration to fail between boards with > different PMC counts. > > > > > > > > > > The return value of cpuid (0xA) is different dur to cpu, according to > > > > > Intel > SDN,18-10 Vol. 3B: > > > > > > > > > > Note: The number of general-purpose performance monitoring > > > > > counters (i.e. N in Figure 18-9) can vary across processor > > > > > generations within a processor family, across processor > > > > > families, or could be different depending on the configuration > > > > > chosen at boot time in the BIOS regarding Intel Hyper Threading > > > > > Technology, (e.g. 
N=2 for 45 nm Intel Atom processors; N =4 for > processors based on the Nehalem microarchitecture; for processors based on > the Sandy Bridge microarchitecture, N = 4 if Intel Hyper Threading Technology > is active and N=8 if not active). > > > > > > > > > > Also I found, N=8 if HT is not active based on the broadwell,, > > > > > such as CPU E7-8890 v4 @ 2.20GHz > > > > > > > > > > # ./x86_64-softmmu/qemu-system-x86_64 --enable-kvm -smp 4 -m > > > > > 4096 -hda > > > > > /data/zyy/test_qemu.img.sles12sp1 -vnc :99 -cpu kvm64,pmu=true > > > > > -incoming tcp:: Completed 100 % > > > > > qemu-system-x86_64: error: failed to set MSR 0x38f to > > > > > 0x700ff > > > > > qemu-system-x86_64: /data/zyy/git/test/qemu/target/i386/kvm.c:1833: > kvm_put_msrs: > > > > > Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed. > > > > > Aborted > > > > > > > > > > So make number of pmc configurable to vm ? Any better idea ? > > > > > > > > Coincidentally we hit a similar problem a few days ago with -cpu > > > > host - it took me quite a while to spot the difference between > > > > the machines was the source had hyperthreading disabled. > > > > > > > > An option to set the number of counters makes sense to me; but I > > > > wonder how many other options we need as well. Also, I'm not sure > > > > there's any easy way for libvirt etc to figure out how many > > > > counters a host supports - it's not in /proc/cpuinfo. > > > > > > We actually try to avoid /proc/cpuinfo whereever possible. We do > > > direct CPUID asm instructions to identify features, and prefer to > > > use /sys/devices/system/cpu if that has suitable data > > > > > > Where do the PMC counts come from originally ? CPUID or something > else ? > > > > Yes, they're bits 8..15 of CPUID leaf 0xa > > Ok, that's easy enough for libvirt to detect then. More a question of what > libvirt > should then do this with the info > Do you mean to do a validation at the begining of migration? 
In qemuMigrationBakeCookie() & qemuMigrationEatCookie(), if the PMC counts are not equal, just abort the migration? That may be good enough as a first version. For a further, better version, I think we should support heterogeneous migration, so we would need to make the PMC count configurable, which means modifying KVM/qemu as well. Regards, -Zhuang Yanying
[Qemu-devel] [BUG] Migrate failes between boards with different PMC counts
Hi all, Recently, I found that migration fails when vPMU is enabled. Migrating vPMU state was introduced in linux-3.10 + qemu-1.7. As long as vPMU is enabled, qemu saves/loads the vmstate_msr_architectural_pmu (msr_global_ctrl) register during migration. But global_ctrl is generated based on CPUID(0xA), and the number of general-purpose performance monitoring counters (PMCs) can vary according to the Intel SDM. The number of PMCs presented to the vm is currently not configurable; it depends on the host CPUID, and KVM enables all PMCs by default. This causes migration to fail between boards with different PMC counts. The return value of CPUID(0xA) differs by CPU; according to the Intel SDM, 18-10 Vol. 3B: Note: The number of general-purpose performance monitoring counters (i.e. N in Figure 18-9) can vary across processor generations within a processor family, across processor families, or could be different depending on the configuration chosen at boot time in the BIOS regarding Intel Hyper Threading Technology, (e.g. N=2 for 45 nm Intel Atom processors; N =4 for processors based on the Nehalem microarchitecture; for processors based on the Sandy Bridge microarchitecture, N = 4 if Intel Hyper Threading Technology is active and N=8 if not active). I also found N=8 when HT is not active on Broadwell, such as CPU E7-8890 v4 @ 2.20GHz # ./x86_64-softmmu/qemu-system-x86_64 --enable-kvm -smp 4 -m 4096 -hda /data/zyy/test_qemu.img.sles12sp1 -vnc :99 -cpu kvm64,pmu=true -incoming tcp:: Completed 100 % qemu-system-x86_64: error: failed to set MSR 0x38f to 0x700ff qemu-system-x86_64: /data/zyy/git/test/qemu/target/i386/kvm.c:1833: kvm_put_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed. Aborted So should we make the number of PMCs configurable for the vm? Any better idea? Regards, -Zhuang Yanying
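The PMC count this report refers to is enumerated in CPUID leaf 0xA: bits 15:8 of EAX give the number of general-purpose counters per logical processor. Extracting it is a pure bit-field operation. The sample EAX values below are illustrative; on a live host one would fill eax via __get_cpuid(0xA, &eax, &ebx, &ecx, &edx) from <cpuid.h>:

```c
/* Extract the general-purpose PMC count from CPUID leaf 0xA's EAX,
 * per CPUID.0AH:EAX[15:8]. */
#include <assert.h>
#include <stdint.h>

static unsigned int pmc_count_from_cpuid_eax(uint32_t eax)
{
    return (eax >> 8) & 0xff;   /* CPUID.0AH:EAX[15:8] */
}
```

This is the value libvirt would need to compare across source and destination hosts before allowing a migration with vPMU enabled.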
[Qemu-devel] [PATCH] ipmi: fix qemu crash while migrating with ipmi
From: ZhuangYanying <ann.zhuangyany...@huawei.com> Qemu crash in the source side while migrating, after starting ipmi service inside vm. ./x86_64-softmmu/qemu-system-x86_64 --enable-kvm -smp 4 -m 4096 \ -drive file=/work/suse/suse11_sp3_64_vt,format=raw,if=none,id=drive-virtio-disk0,cache=none \ -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0 \ -vnc :99 -monitor vc -device ipmi-bmc-sim,id=bmc0 -device isa-ipmi-kcs,bmc=bmc0,ioport=0xca2 Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7ffec4268700 (LWP 7657)] __memcpy_ssse3_back () at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:2757 (gdb) bt #0 __memcpy_ssse3_back () at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:2757 #1 0x559ef775 in memcpy (__len=3, __src=0xc1421c, __dest=) at /usr/include/bits/string3.h:51 #2 qemu_put_buffer (f=0x57a97690, buf=0xc1421c , size=3) at migration/qemu-file.c:346 #3 0x559eef66 in vmstate_save_state (f=f@entry=0x57a97690, vmsd=0x55f8a5a0 , opaque=0x57231160, vmdesc=vmdesc@entry=0x5798cc40) at migration/vmstate.c:333 #4 0x557cfe45 in vmstate_save (f=f@entry=0x57a97690, se=se@entry=0x57231de0, vmdesc=vmdesc@entry=0x5798cc40) at /mnt/sdb/zyy/qemu/migration/savevm.c:720 #5 0x557d2be7 in qemu_savevm_state_complete_precopy (f=0x57a97690, iterable_only=iterable_only@entry=false) at /mnt/sdb/zyy/qemu/migration/savevm.c:1128 #6 0x559ea102 in migration_completion (start_time=, old_vm_running=, current_active_state=, s=0x560eaa80 ) at migration/migration.c:1707 #7 migration_thread (opaque=0x560eaa80 ) at migration/migration.c:1855 #8 0x73900dc5 in start_thread (arg=0x7ffec4268700) at pthread_create.c:308 #9 0x7fffefc6c71d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 Signed-off-by: Zhuang Yanying <ann.zhuangyany...@huawei.com> --- hw/ipmi/isa_ipmi_kcs.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/hw/ipmi/isa_ipmi_kcs.c b/hw/ipmi/isa_ipmi_kcs.c index 9a38f8a..8044497 100644 --- 
a/hw/ipmi/isa_ipmi_kcs.c +++ b/hw/ipmi/isa_ipmi_kcs.c @@ -433,10 +433,8 @@ const VMStateDescription vmstate_ISAIPMIKCSDevice = { VMSTATE_BOOL(kcs.use_irq, ISAIPMIKCSDevice), VMSTATE_BOOL(kcs.irqs_enabled, ISAIPMIKCSDevice), VMSTATE_UINT32(kcs.outpos, ISAIPMIKCSDevice), -VMSTATE_VBUFFER_UINT32(kcs.outmsg, ISAIPMIKCSDevice, 1, NULL, 0, - kcs.outlen), -VMSTATE_VBUFFER_UINT32(kcs.inmsg, ISAIPMIKCSDevice, 1, NULL, 0, - kcs.inlen), +VMSTATE_UINT8_ARRAY(kcs.outmsg, ISAIPMIKCSDevice, MAX_IPMI_MSG_SIZE), +VMSTATE_UINT8_ARRAY(kcs.inmsg, ISAIPMIKCSDevice, MAX_IPMI_MSG_SIZE), VMSTATE_BOOL(kcs.write_end, ISAIPMIKCSDevice), VMSTATE_UINT8(kcs.status_reg, ISAIPMIKCSDevice), VMSTATE_UINT8(kcs.data_out_reg, ISAIPMIKCSDevice), -- 1.8.3.1
[Qemu-devel] [PATCH v3] ivshmem: Fix 64 bit memory bar configuration
From: Zhuang Yanying Device ivshmem property use64=0 is designed to make the device expose a 32 bit shared memory BAR instead of a 64 bit one. The default is a 64 bit BAR, except pc-1.2 and older retain a 32 bit BAR. A 32 bit BAR can support only up to 1 GiB of shared memory. This worked as designed until commit 5400c02 accidentally flipped its sense: since then, we misinterpret use64=0 as use64=1 and vice versa. Worse, the default got flipped as well. Devices ivshmem-plain and ivshmem-doorbell are not affected. Fix by restoring the test of IVShmemState member not_legacy_32bit that got messed up in commit 5400c02. Also update its initialization for devices ivshmem-plain and ivshmem-doorbell. Without that, they'd regress to 32 bit BARs. Cc: qemu-sta...@nongnu.org Signed-off-by: Zhuang Yanying Reviewed-by: Gonglei Reviewed-by: Marc-André Lureau --- hw/misc/ivshmem.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c index 230e51b..abeaf3d 100644 --- a/hw/misc/ivshmem.c +++ b/hw/misc/ivshmem.c @@ -858,7 +858,7 @@ static void ivshmem_common_realize(PCIDevice *dev, Error **errp) pci_register_bar(dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &s->ivshmem_mmio); -if (!s->not_legacy_32bit) { +if (s->not_legacy_32bit) { attr |= PCI_BASE_ADDRESS_MEM_TYPE_64; } @@ -1045,6 +1045,7 @@ static void ivshmem_plain_init(Object *obj) ivshmem_check_memdev_is_busy, OBJ_PROP_LINK_UNREF_ON_RELEASE, &error_abort); +s->not_legacy_32bit = 1; } static void ivshmem_plain_realize(PCIDevice *dev, Error **errp) @@ -1116,6 +1117,7 @@ static void ivshmem_doorbell_init(Object *obj) s->features |= (1 << IVSHMEM_MSI); s->legacy_size = SIZE_MAX; /* whatever the server sends */ +s->not_legacy_32bit = 1; } static void ivshmem_doorbell_realize(PCIDevice *dev, Error **errp) -- 1.8.3.1
[Qemu-devel] [PATCH v2 1/2] ivshmem: fix misconfig of not_legacy_32bit
From: ZhuangYanying <ann.zhuangyany...@huawei.com>

After commit 5400c02, ivshmem_64bit was renamed to not_legacy_32bit,
and the implementation of the property changed with it.  Now with
use64 = 1 the attribute is left without PCI_BASE_ADDRESS_MEM_TYPE_64
(the default for ivshmem), so the device actually uses the legacy
model and cannot support mappings of 1 GiB or more, which is the
opposite of what was configured.

Cc: qemu-sta...@nongnu.org
Signed-off-by: Zhuang Yanying <ann.zhuangyany...@huawei.com>
Reviewed-by: Gonglei <arei.gong...@huawei.com>
---
 hw/misc/ivshmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 230e51b..b897685 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -858,7 +858,7 @@ static void ivshmem_common_realize(PCIDevice *dev, Error **errp)
     pci_register_bar(dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY,
                      &s->ivshmem_mmio);

-    if (!s->not_legacy_32bit) {
+    if (s->not_legacy_32bit) {
         attr |= PCI_BASE_ADDRESS_MEM_TYPE_64;
     }
--
1.8.3.1
[Qemu-devel] [PATCH v2 0/2] ivshmem: fix misconfig of not_legacy_32bit
From: ZhuangYanying <ann.zhuangyany...@huawei.com>

Recently I tested ivshmem and found that use64, i.e. not_legacy_32bit,
is implemented oddly, in fact backwards.  Previously use64 =
ivshmem_64bit = 1 meant attr |= PCI_BASE_ADDRESS_MEM_TYPE_64, so
ivshmem could package shared memory of 1 GiB and above into BAR 2 and
present it to the virtual machine.  But since commit 5400c02,
PCI_BASE_ADDRESS_MEM_TYPE_64 is configured when not_legacy_32bit = 0,
that is, in the legacy model.

ZhuangYanying (2):
  hw/misc/ivshmem: fix misconfig of not_legacy_32bit
  hw/misc/ivshmem: set not_legacy_32bit to 1 for ivshmem_doorbell and
    ivshmem-plain

 hw/misc/ivshmem.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--
1.8.3.1
[Qemu-devel] [PATCH v2 2/2] ivshmem: set not_legacy_32bit to 1 for ivshmem_doorbell and ivshmem-plain
From: ZhuangYanying <ann.zhuangyany...@huawei.com>

Signed-off-by: Zhuang Yanying <ann.zhuangyany...@huawei.com>
---
 hw/misc/ivshmem.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index b897685..abeaf3d 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -1045,6 +1045,7 @@ static void ivshmem_plain_init(Object *obj)
                              ivshmem_check_memdev_is_busy,
                              OBJ_PROP_LINK_UNREF_ON_RELEASE,
                              &error_abort);
+    s->not_legacy_32bit = 1;
 }

 static void ivshmem_plain_realize(PCIDevice *dev, Error **errp)
@@ -1116,6 +1117,7 @@ static void ivshmem_doorbell_init(Object *obj)

     s->features |= (1 << IVSHMEM_MSI);
     s->legacy_size = SIZE_MAX;  /* whatever the server sends */
+    s->not_legacy_32bit = 1;
 }

 static void ivshmem_doorbell_realize(PCIDevice *dev, Error **errp)
--
1.8.3.1
[Qemu-devel] [PATCH] hw/misc/ivshmem: fix misconfig of not_legacy_32bit
From: ZhuangYanying <ann.zhuangyany...@huawei.com>

After "ivshmem: Split ivshmem-plain, ivshmem-doorbell off ivshmem",
ivshmem_64bit was renamed to not_legacy_32bit, and the implementation
of the property changed.  Now with use64 = not_legacy_32bit = 1 the PCI
attribute is configured without PCI_BASE_ADDRESS_MEM_TYPE_64 (the
default for ivshmem), so the actual behaviour is the legacy model,
which cannot support mappings of 1 GiB or more, the opposite of the
configuration's intent.

Signed-off-by: ann.zhuangyany...@huawei.com
---
Recently I tested ivshmem and found that use64, i.e. not_legacy_32bit,
is implemented oddly, in fact backwards.  Previously use64 =
ivshmem_64bit = 1 meant attr |= PCI_BASE_ADDRESS_MEM_TYPE_64, so
ivshmem could package shared memory of 1 GiB and above into BAR 2 and
present it to the virtual machine.  But after "ivshmem: Split
ivshmem-plain, ivshmem-doorbell off ivshmem",
PCI_BASE_ADDRESS_MEM_TYPE_64 is configured when not_legacy_32bit = 0,
that is, in the legacy model.
---
 hw/misc/ivshmem.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 230e51b..b71acf6 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -858,7 +858,7 @@ static void ivshmem_common_realize(PCIDevice *dev, Error **errp)
     pci_register_bar(dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY,
                      &s->ivshmem_mmio);

-    if (!s->not_legacy_32bit) {
+    if (s->not_legacy_32bit) {
         attr |= PCI_BASE_ADDRESS_MEM_TYPE_64;
     }

@@ -1033,6 +1033,7 @@ static const VMStateDescription ivshmem_plain_vmsd = {
 static Property ivshmem_plain_properties[] = {
     DEFINE_PROP_ON_OFF_AUTO("master", IVShmemState, master, ON_OFF_AUTO_OFF),
+    DEFINE_PROP_UINT32("use64", IVShmemState, not_legacy_32bit, 1),
     DEFINE_PROP_END_OF_LIST(),
 };

@@ -1107,6 +1108,7 @@ static Property ivshmem_doorbell_properties[] = {
     DEFINE_PROP_BIT("ioeventfd", IVShmemState, features, IVSHMEM_IOEVENTFD,
                     true),
     DEFINE_PROP_ON_OFF_AUTO("master", IVShmemState, master, ON_OFF_AUTO_OFF),
+    DEFINE_PROP_UINT32("use64", IVShmemState, not_legacy_32bit, 1),
     DEFINE_PROP_END_OF_LIST(),
 };
--
1.8.3.1
[Qemu-devel] [PATCH] target-i386/machine: fix migration failure caused by Hyper-V HV_X64_MSR_VP_RUNTIME
From: ZhuangYanying <ann.zhuangyany...@huawei.com>

Hyper-V HV_X64_MSR_VP_RUNTIME was introduced in linux-4.4 + qemu-2.5.
As long as the KVM module supports it, qemu saves/loads the
vmstate_msr_hyperv_runtime register during migration, regardless of
whether the hyperv_runtime property of x86_cpu_properties is enabled.
qemu-2.3 does not support this feature, so migration to it fails:

linux-BGSfqC:/home/qemu # ./x86_64-softmmu/qemu-system-x86_64 --enable-kvm \
  -nodefaults -machine pc-i440fx-2.3,accel=kvm,usb=off -smp 4 -m 4096 \
  -drive file=/work/suse/sles11sp3.img.bak,format=raw,if=none,id=drive-virtio-disk0,cache=none \
  -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0 \
  -vnc :99 -device cirrus-vga,id=video0,vgamem_mb=8,bus=pci.0,addr=0x2 -monitor vc

save_section_header:se->section_id=3,se->idstr:ram,se->instance_id=0,se->version_id=4
save_section_header:se->section_id=0,se->idstr:timer,se->instance_id=0,se->version_id=2
save_section_header:se->section_id=4,se->idstr:cpu_common,se->instance_id=0,se->version_id=1
save_section_header:se->section_id=5,se->idstr:cpu,se->instance_id=0,se->version_id=12
vmstate_subsection_save:vmsd->name:cpu/async_pf_msr
hyperv_runtime_enable_needed:env->msr_hv_runtime=128902811
vmstate_subsection_save:vmsd->name:cpu/msr_hyperv_runtime

Since hyperv_runtime is false, the vm will not use hv->runtime_offset,
so there is no need to transfer vmstate_msr_hyperv_runtime while
migrating.

Signed-off-by: ann.zhuangyany...@huawei.com
---
Hi,
Recently, I tested cross-version migration/rollback between qemu tag
v2.3.1 and qemu-master, and found that rollback failed.
linux-rIVrzS:/home/git/qemu # ./x86_64-softmmu/qemu-system-x86_64 --enable-kvm \
  -nodefaults -machine pc-i440fx-2.3,accel=kvm,usb=off -smp 4 -m 4096 \
  -drive file=/work/suse/sles11sp3.img.bak,format=raw,if=none,id=drive-virtio-disk0,cache=none \
  -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0 \
  -vnc :99 -device cirrus-vga,id=video0,vgamem_mb=8,bus=pci.0,addr=0x2 -monitor vc \
  -incoming tcp:0:
qemu-system-x86_64: error while loading state for instance 0x0 of device 'cpu'
qemu-system-x86_64: load of migration failed: No such file or directory

Maybe set compat_props on PC_COMPAT_2_5?  Any better idea?
---
 target-i386/machine.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/target-i386/machine.c b/target-i386/machine.c
index 48037f1..e984d77 100644
--- a/target-i386/machine.c
+++ b/target-i386/machine.c
@@ -709,6 +709,10 @@ static bool hyperv_runtime_enable_needed(void *opaque)
     X86CPU *cpu = opaque;
     CPUX86State *env = &cpu->env;

+    if (!cpu->hyperv_runtime) {
+        return 0;
+    }
+
     return env->msr_hv_runtime != 0;
 }
--
1.8.3.1