[PATCH] KVM: nVMX: Fix IRQs injected to L2 which belong to L1 due to a race
This patch fixes the bug https://bugzilla.kernel.org/show_bug.cgi?id=72381

If we did not inject a still-pending event to L1 because of nested_run_pending, KVM_REQ_EVENT should be requested after the vmexit so that the event can be injected to L1. However, the current logic blindly requests KVM_REQ_EVENT even if there is no still-pending event to L1 blocked by nested_run_pending. This opens a race window in which an interrupt that belongs to L1 is injected into L2 if L0 sends an interrupt to L1 during that window:

               VCPU0                                another thread

L1 intr not blocked on L2 first entry
vmx_vcpu_run req event
kvm check request req event
check_nested_events don't have any intr
not nested exit
                                            intr occur (8254, lapic timer etc)
inject_pending_event now have intr
inject interrupt

This patch fixes the race by introducing an l1_events_blocked field in nested_vmx, which indicates that a still-pending event is blocked by nested_run_pending, and by requesting KVM_REQ_EVENT only when such a blocked event exists.

Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com
---
 arch/x86/kvm/vmx.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f4e5aed..fe69c49 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -372,6 +372,7 @@ struct nested_vmx {
 	u64 vmcs01_tsc_offset;
 	/* L2 must run next, and mustn't decide to exit to L1. */
 	bool nested_run_pending;
+	bool l1_events_blocked;
 	/*
 	 * Guest pages referred to in vmcs02 with host-physical pointers, so
 	 * we must keep them pinned while L2 runs.
@@ -7380,8 +7381,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	 * we did not inject a still-pending event to L1 now because of
 	 * nested_run_pending, we need to re-enable this bit.
 	 */
-	if (vmx->nested.nested_run_pending)
+	if (to_vmx(vcpu)->nested.l1_events_blocked) {
+		to_vmx(vcpu)->nested.l1_events_blocked = false;
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
+	}

 	vmx->nested.nested_run_pending = 0;

@@ -8197,15 +8200,20 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
 	if (nested_cpu_has_preemption_timer(get_vmcs12(vcpu)) &&
 	    vmx->nested.preemption_timer_expired) {
-		if (vmx->nested.nested_run_pending)
+		if (vmx->nested.nested_run_pending) {
+			vmx->nested.l1_events_blocked = true;
 			return -EBUSY;
+		}
 		nested_vmx_vmexit(vcpu, EXIT_REASON_PREEMPTION_TIMER, 0, 0);
 		return 0;
 	}

 	if (vcpu->arch.nmi_pending && nested_exit_on_nmi(vcpu)) {
-		if (vmx->nested.nested_run_pending ||
-		    vcpu->arch.interrupt.pending)
+		if (vmx->nested.nested_run_pending) {
+			vmx->nested.l1_events_blocked = true;
+			return -EBUSY;
+		}
+		if (vcpu->arch.interrupt.pending)
 			return -EBUSY;
 		nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
 				  NMI_VECTOR | INTR_TYPE_NMI_INTR |
@@ -8221,8 +8229,10 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
 	if ((kvm_cpu_has_interrupt(vcpu) || external_intr) &&
 	    nested_exit_on_intr(vcpu)) {
-		if (vmx->nested.nested_run_pending)
+		if (vmx->nested.nested_run_pending) {
+			vmx->nested.l1_events_blocked = true;
 			return -EBUSY;
+		}
 		nested_vmx_vmexit(vcpu,
 				  EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0);
 	}
-- 
1.9.1
RE: [PATCH] KVM: nVMX: Fix IRQs injected to L2 which belong to L1 due to a race
-----Original Message-----
From: Wanpeng Li [mailto:wanpeng...@linux.intel.com]
Sent: Wednesday, July 2, 2014 2:54 PM
To: Paolo Bonzini; Jan Kiszka; Gleb Natapov
Cc: Hu, Robert; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; Wanpeng Li
Subject: [PATCH] KVM: nVMX: Fix IRQs injected to L2 which belong to L1 due to a race

[...]

Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com
Tested-by: Robert Hu robert...@intel.com
[PATCH 4/4] kvm, mem-hotplug: Update apic access page when it is migrated.
The apic access page is pinned in memory, and as a result it cannot be migrated/hot-removed. Actually it doesn't need to be pinned in memory.

This patch introduces a new vcpu request: KVM_REQ_MIGRATE_APIC. This request is made when kvm_mmu_notifier_invalidate_page() is called because the page is being unmapped from the qemu user space, to reset the APIC_ACCESS_ADDR pointer in each online vcpu to 0. It is also made when an ept violation happens, to reset APIC_ACCESS_ADDR to the new page's phys_addr (host phys_addr).

---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/mmu.c              | 15 +++++++++++++++
 arch/x86/kvm/vmx.c              |  9 ++++++++-
 arch/x86/kvm/x86.c              | 20 ++++++++++++++++++++
 include/linux/kvm_host.h        |  1 +
 virt/kvm/kvm_main.c             | 15 +++++++++++++++
 6 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8771c0f..f104b87 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -575,6 +575,7 @@ struct kvm_arch {
 	unsigned int tss_addr;
 	struct page *apic_access_page;
+	bool apic_access_page_migrated;

 	gpa_t wall_clock;
@@ -739,6 +740,7 @@ struct kvm_x86_ops {
 	void (*hwapic_isr_update)(struct kvm *kvm, int isr);
 	void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
 	void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
+	void (*set_apic_access_page_addr)(struct kvm *kvm, hpa_t hpa);
 	void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
 	void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
 	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c0d72f6..a655444 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3436,6 +3436,21 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 		kvm_make_request(KVM_REQ_MIGRATE_EPT, vcpu);
 	}

+	if (gpa == VMX_APIC_ACCESS_PAGE_ADDR &&
+	    vcpu->kvm->arch.apic_access_page_migrated) {
+		int i;
+
+		vcpu->kvm->arch.apic_access_page_migrated = false;
+
+		/*
+		 * We need update APIC_ACCESS_ADDR pointer in each VMCS of
+		 * all the online vcpus.
+		 */
+		for (i = 0; i < atomic_read(&vcpu->kvm->online_vcpus); i++)
+			kvm_make_request(KVM_REQ_MIGRATE_APIC,
+					 vcpu->kvm->vcpus[i]);
+	}
+
 	spin_unlock(&vcpu->kvm->mmu_lock);

 	return r;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c336cb3..abc152f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3988,7 +3988,7 @@ static int alloc_apic_access_page(struct kvm *kvm)
 	if (r)
 		goto out;

-	page = gfn_to_page(kvm, VMX_APIC_ACCESS_PAGE_ADDR >> PAGE_SHIFT);
+	page = gfn_to_page_no_pin(kvm, VMX_APIC_ACCESS_PAGE_ADDR >> PAGE_SHIFT);
 	if (is_error_page(page)) {
 		r = -EFAULT;
 		goto out;
@@ -7075,6 +7075,12 @@ static void vmx_set_virtual_x2apic_mode(struct kvm_vcpu *vcpu, bool set)
 	vmx_set_msr_bitmap(vcpu);
 }

+static void vmx_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)
+{
+	if (vm_need_virtualize_apic_accesses(kvm))
+		vmcs_write64(APIC_ACCESS_ADDR, hpa);
+}
+
 static void vmx_hwapic_isr_update(struct kvm *kvm, int isr)
 {
 	u16 status;
@@ -8846,6 +8852,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.enable_irq_window = enable_irq_window,
 	.update_cr8_intercept = update_cr8_intercept,
 	.set_virtual_x2apic_mode = vmx_set_virtual_x2apic_mode,
+	.set_apic_access_page_addr = vmx_set_apic_access_page_addr,
 	.vm_has_apicv = vmx_vm_has_apicv,
 	.load_eoi_exitmap = vmx_load_eoi_exitmap,
 	.hwapic_irr_update = vmx_hwapic_irr_update,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a26524f..14e7174 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5943,6 +5943,24 @@ static void vcpu_migrated_page_update_ept(struct kvm_vcpu *vcpu)
 	}
 }

+static void vcpu_migrated_page_update_apic(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm = vcpu->kvm;
+
+	if (kvm->arch.apic_access_page_migrated) {
+		if (kvm->arch.apic_access_page)
+			kvm->arch.apic_access_page = pfn_to_page(0);
+		kvm_x86_ops->set_apic_access_page_addr(kvm, 0x0ull);
+	} else {
+		struct page *page;
+		page = gfn_to_page_no_pin(kvm,
+				VMX_APIC_ACCESS_PAGE_ADDR >> PAGE_SHIFT);
+		kvm->arch.apic_access_page = page;
+		kvm_x86_ops->set_apic_access_page_addr(kvm,
+						       page_to_phys(page));
+	}
+}
+
 /*
  * Returns 1 to let __vcpu_run()
[PATCH 2/4] kvm: Add macro VMX_APIC_ACCESS_PAGE_ADDR
Define the guest phys_addr of the apic access page.

---
 arch/x86/include/asm/vmx.h | 2 +-
 arch/x86/kvm/svm.c         | 3 ++-
 arch/x86/kvm/vmx.c         | 7 ++++---
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 7004d21..c4672d1 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -422,7 +422,7 @@ enum vmcs_field {
 #define VMX_EPT_DIRTY_BIT			(1ull << 9)

 #define VMX_EPT_IDENTITY_PAGETABLE_ADDR		0xfffbc000ul
-
+#define VMX_APIC_ACCESS_PAGE_ADDR		0xfee00000ull

 #define ASM_VMX_VMCLEAR_RAX       ".byte 0x66, 0x0f, 0xc7, 0x30"
 #define ASM_VMX_VMLAUNCH          ".byte 0x0f, 0x01, 0xc2"
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ec8366c..22aa2ae 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1257,7 +1257,8 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id)
 	svm->asid_generation = 0;
 	init_vmcb(svm);

-	svm->vcpu.arch.apic_base = 0xfee00000 | MSR_IA32_APICBASE_ENABLE;
+	svm->vcpu.arch.apic_base = VMX_APIC_ACCESS_PAGE_ADDR |
+				   MSR_IA32_APICBASE_ENABLE;
 	if (kvm_vcpu_is_bsp(&svm->vcpu))
 		svm->vcpu.arch.apic_base |= MSR_IA32_APICBASE_BSP;

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 801332e..366b5b3 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3982,13 +3982,13 @@ static int alloc_apic_access_page(struct kvm *kvm)
 		goto out;
 	kvm_userspace_mem.slot = APIC_ACCESS_PAGE_PRIVATE_MEMSLOT;
 	kvm_userspace_mem.flags = 0;
-	kvm_userspace_mem.guest_phys_addr = 0xfee00000ULL;
+	kvm_userspace_mem.guest_phys_addr = VMX_APIC_ACCESS_PAGE_ADDR;
 	kvm_userspace_mem.memory_size = PAGE_SIZE;
 	r = __kvm_set_memory_region(kvm, &kvm_userspace_mem);
 	if (r)
 		goto out;

-	page = gfn_to_page(kvm, 0xfee00);
+	page = gfn_to_page(kvm, VMX_APIC_ACCESS_PAGE_ADDR >> PAGE_SHIFT);
 	if (is_error_page(page)) {
 		r = -EFAULT;
 		goto out;
@@ -4460,7 +4460,8 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 	vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
 	kvm_set_cr8(&vmx->vcpu, 0);
-	apic_base_msr.data = 0xfee00000 | MSR_IA32_APICBASE_ENABLE;
+	apic_base_msr.data = VMX_APIC_ACCESS_PAGE_ADDR |
+			     MSR_IA32_APICBASE_ENABLE;
 	if (kvm_vcpu_is_bsp(&vmx->vcpu))
 		apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
 	apic_base_msr.host_initiated = true;
-- 
1.8.3.1
[PATCH 3/4] kvm, memory-hotplug: Update ept identity pagetable when it is migrated.
The ept identity pagetable is pinned in memory, and as a result it cannot be migrated/hot-removed. But actually it doesn't need to be pinned in memory.

This patch introduces a new vcpu request, KVM_REQ_MIGRATE_EPT, to reset the ept identity pagetable related variable. This request is made when kvm_mmu_notifier_invalidate_page() is called because the page is being unmapped from the qemu user space, to reset kvm->arch.ept_identity_pagetable to NULL. It is also made when an ept violation happens, to reset kvm->arch.ept_identity_pagetable to the new page.

---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/mmu.c              | 11 +++++++++++
 arch/x86/kvm/vmx.c              |  3 ++-
 arch/x86/kvm/x86.c              | 16 ++++++++++++++++
 include/linux/kvm_host.h        |  1 +
 virt/kvm/kvm_main.c             |  6 ++++++
 6 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4931415..8771c0f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -581,6 +581,7 @@ struct kvm_arch {
 	struct page *ept_identity_pagetable;
 	bool ept_identity_pagetable_done;
 	gpa_t ept_identity_map_addr;
+	bool ept_identity_pagetable_migrated;

 	unsigned long irq_sources_bitmap;
 	s64 kvmclock_offset;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9314678..c0d72f6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3425,6 +3425,17 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 		transparent_hugepage_adjust(vcpu, &gfn, &pfn, &level);
 	r = __direct_map(vcpu, gpa, write, map_writable, level, gfn, pfn,
 			 prefault);
+
+	/*
+	 * Update ept identity pagetable page and apic access page if
+	 * they are migrated.
+	 */
+	if (gpa == vcpu->kvm->arch.ept_identity_map_addr &&
+	    vcpu->kvm->arch.ept_identity_pagetable_migrated) {
+		vcpu->kvm->arch.ept_identity_pagetable_migrated = false;
+		kvm_make_request(KVM_REQ_MIGRATE_EPT, vcpu);
+	}
+
 	spin_unlock(&vcpu->kvm->mmu_lock);

 	return r;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 366b5b3..c336cb3 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4018,7 +4018,8 @@ static int alloc_identity_pagetable(struct kvm *kvm)
 	if (r)
 		goto out;

-	page = gfn_to_page(kvm, kvm->arch.ept_identity_map_addr >> PAGE_SHIFT);
+	page = gfn_to_page_no_pin(kvm,
+			kvm->arch.ept_identity_map_addr >> PAGE_SHIFT);
 	if (is_error_page(page)) {
 		r = -EFAULT;
 		goto out;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f32a025..a26524f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5929,6 +5929,20 @@ static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
 	kvm_apic_update_tmr(vcpu, tmr);
 }

+static void vcpu_migrated_page_update_ept(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm = vcpu->kvm;
+
+	if (kvm->arch.ept_identity_pagetable_migrated)
+		kvm->arch.ept_identity_pagetable = NULL;
+	else {
+		struct page *page;
+		page = gfn_to_page_no_pin(kvm,
+				kvm->arch.ept_identity_map_addr >> PAGE_SHIFT);
+		kvm->arch.ept_identity_pagetable = page;
+	}
+}
+
 /*
  * Returns 1 to let __vcpu_run() continue the guest execution loop without
  * exiting to the userspace.  Otherwise, the value will be returned to the
@@ -5989,6 +6003,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			kvm_deliver_pmi(vcpu);
 		if (kvm_check_request(KVM_REQ_SCAN_IOAPIC, vcpu))
 			vcpu_scan_ioapic(vcpu);
+		if (kvm_check_request(KVM_REQ_MIGRATE_EPT, vcpu))
+			vcpu_migrated_page_update_ept(vcpu);
 	}

 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7c58d9d..4b7e51a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -136,6 +136,7 @@ static inline bool is_error_page(struct page *page)
 #define KVM_REQ_GLOBAL_CLOCK_UPDATE 22
 #define KVM_REQ_ENABLE_IBS        23
 #define KVM_REQ_DISABLE_IBS       24
+#define KVM_REQ_MIGRATE_EPT       25

 #define KVM_USERSPACE_IRQ_SOURCE_ID		0
 #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID	1
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6091849..d271e89 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -294,6 +294,12 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 	if (need_tlb_flush)
 		kvm_flush_remote_tlbs(kvm);

+	if (address ==
+	    gfn_to_hva(kvm, kvm->arch.ept_identity_map_addr >> PAGE_SHIFT)) {
+		kvm->arch.ept_identity_pagetable_migrated = true;
+
[PATCH 0/4] kvm, mem-hotplug: Do not pin ept identity pagetable and apic access page.
Hi Gleb, Marcelo,

Please help to review this patch-set.

NOTE: This patch-set doesn't work properly.

The ept identity pagetable and apic access page in kvm are pinned in memory. As a result, they cannot be migrated/hot-removed. But actually they don't need to be pinned in memory.

This patch-set introduces two new vcpu requests: KVM_REQ_MIGRATE_EPT and KVM_REQ_MIGRATE_APIC. These requests are made when the two pages are unmapped by the mmu_notifier, to reset the related variables to an unusable value. They are also made when an ept violation happens, to reset the variables to the new pages.

[Known problem]
After this patch-set is applied, the two pages can be migrated/hot-removed, but after migrating the apic access page, the guest dies. The host physical address of the apic access page is stored in the VMCS. I reset it to 0 to stop the guest from accessing it when it is unmapped by kvm_mmu_notifier_invalidate_page(), and reset it to the new page's host physical address in tdp_page_fault(). But it seems that the guest will still access the apic page directly by the host physical address.

Tang Chen (4):
  kvm: Add gfn_to_page_no_pin()
  kvm: Add macro VMX_APIC_ACCESS_PAGE_ADDR
  kvm, memory-hotplug: Update ept identity pagetable when it is migrated.
  kvm, mem-hotplug: Update apic access page when it is migrated.

 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/include/asm/vmx.h      |  2 +-
 arch/x86/kvm/mmu.c              | 26 ++++++++++++++++++++++++
 arch/x86/kvm/svm.c              |  3 ++-
 arch/x86/kvm/vmx.c              | 17 ++++++++++++++++
 arch/x86/kvm/x86.c              | 36 ++++++++++++++++++++++++++++++++
 include/linux/kvm_host.h        |  3 +++
 virt/kvm/kvm_main.c             | 38 ++++++++++++++++++++++++++++++++-
 8 files changed, 121 insertions(+), 7 deletions(-)

-- 
1.8.3.1
[PATCH 1/4] kvm: Add gfn_to_page_no_pin()
Used by the following patches.

---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 17 ++++++++++++++++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ec4e3bd..7c58d9d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -541,6 +541,7 @@ int gfn_to_page_many_atomic(struct kvm *kvm, gfn_t gfn, struct page **pages,
 			    int nr_pages);

 struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn);
+struct page *gfn_to_page_no_pin(struct kvm *kvm, gfn_t gfn);
 unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn);
 unsigned long gfn_to_hva_prot(struct kvm *kvm, gfn_t gfn, bool *writable);
 unsigned long gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4b6c01b..6091849 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1371,9 +1371,24 @@ struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)

 	return kvm_pfn_to_page(pfn);
 }
-
 EXPORT_SYMBOL_GPL(gfn_to_page);

+struct page *gfn_to_page_no_pin(struct kvm *kvm, gfn_t gfn)
+{
+	struct page *page = gfn_to_page(kvm, gfn);
+
+	/*
+	 * gfn_to_page() will finally call hva_to_pfn() to get the pfn, and pin
+	 * the page in memory by calling GUP functions. This function unpins
+	 * the page.
+	 */
+	if (!is_error_page(page))
+		put_page(page);
+
+	return page;
+}
+EXPORT_SYMBOL_GPL(gfn_to_page_no_pin);
+
 void kvm_release_page_clean(struct page *page)
 {
 	WARN_ON(is_error_page(page));
-- 
1.8.3.1
Re: [PATCH] KVM: nVMX: Fix IRQs injected to L2 which belong to L1 due to a race
On 2014-07-02 08:54, Wanpeng Li wrote:

[...]

There are more, unrelated reasons why KVM_REQ_EVENT could be set. Why aren't those able to trigger this scenario?

In any case, unconditionally setting KVM_REQ_EVENT seems strange and should be changed.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
Re: [PATCH] KVM: nVMX: Fix IRQs injected to L2 which belong to L1 due to a race
On 2014-07-02 09:20, Hu, Robert wrote:

[...]

> Tested-by: Robert Hu robert...@intel.com

Do you happen to have a kvm-unit-test for this race? Or how did you test? Just curious, and if there is a test, it would be good to integrate it.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
RE: [PATCH] KVM: nVMX: Fix IRQs injected to L2 which belong to L1 due to a race
-----Original Message-----
From: Jan Kiszka [mailto:jan.kis...@siemens.com]
Sent: Wednesday, July 2, 2014 5:03 PM
To: Hu, Robert; Wanpeng Li; Paolo Bonzini; Gleb Natapov
Cc: kvm@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: Re: [PATCH] KVM: nVMX: Fix IRQs injected to L2 which belong to L1 due to a race

[...]

> Do you happen to have a kvm-unit-test for this race? Or how did you test?
> Just curious, and if there is a test, it would be good to integrate it.

I just applied the patch and tested the reported scenario.
Re: [PATCH] KVM: nVMX: Fix IRQs injected to L2 which belong to L1 due to a race
On 2014-07-02 11:13, Hu, Robert wrote:

[...]

> I just applied the patch and tested the reported scenario.

Ah, sorry, missed the referenced bug report.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
Re: [PATCH v2 1/4] KVM: device: add simple registration mechanism for kvm_device_ops
On Tue, 1 Jul 2014 15:45:15 +0100
Will Deacon will.dea...@arm.com wrote:

> kvm_ioctl_create_device currently has knowledge of all the device types
> and their associated ops. This is fairly inflexible when adding support
> for new in-kernel device emulations, so move what we currently have out
> into a table, which can support dynamic registration of ops by new
> drivers for virtual hardware.
>
> I didn't try to port all current drivers over, as it's not always clear
> which initialisation hook the ops should be registered from.

I think that last paragraph should rather go into a cover letter :)

> Cc: Cornelia Huck cornelia.h...@de.ibm.com
> Cc: Alex Williamson alex.william...@redhat.com
> Cc: Alex Graf ag...@suse.de
> Cc: Gleb Natapov g...@kernel.org
> Cc: Paolo Bonzini pbonz...@redhat.com
> Cc: Marc Zyngier marc.zyng...@arm.com
> Cc: Christoffer Dall christoffer.d...@linaro.org
> Signed-off-by: Will Deacon will.dea...@arm.com
> ---
> v1 -> v2: Added enum for KVM_DEV_TYPE* IDs, changed limits to ARRAY_SIZE,
>           removed stray semicolon, had a crack at porting VFIO, included
>           Cornelia's s390 FLIC patch.

...and the changelog as well (or keep changelogs for individual patches).

>  include/linux/kvm_host.h |  1 +
>  include/uapi/linux/kvm.h | 22 +++++++++++++++-------
>  virt/kvm/kvm_main.c      | 65 ++++++++++++++++++++++++------------------------
>  3 files changed, 55 insertions(+), 33 deletions(-)
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index e11d8f170a62..6875cc225dff 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -940,15 +940,25 @@ struct kvm_device_attr {
>  	__u64	addr;		/* userspace address of attr data */
>  };
>
> -#define KVM_DEV_TYPE_FSL_MPIC_20	1
> -#define KVM_DEV_TYPE_FSL_MPIC_42	2
> -#define KVM_DEV_TYPE_XICS		3
> -#define KVM_DEV_TYPE_VFIO		4
>  #define  KVM_DEV_VFIO_GROUP		1
>  #define   KVM_DEV_VFIO_GROUP_ADD	1
>  #define   KVM_DEV_VFIO_GROUP_DEL	2
> -#define KVM_DEV_TYPE_ARM_VGIC_V2	5
> -#define KVM_DEV_TYPE_FLIC		6
> +
> +enum kvm_device_type {
> +	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> +#define KVM_DEV_TYPE_FSL_MPIC_20	KVM_DEV_TYPE_FSL_MPIC_20
> +	KVM_DEV_TYPE_FSL_MPIC_42,
> +#define KVM_DEV_TYPE_FSL_MPIC_42	KVM_DEV_TYPE_FSL_MPIC_42
> +	KVM_DEV_TYPE_XICS,
> +#define KVM_DEV_TYPE_XICS		KVM_DEV_TYPE_XICS
> +	KVM_DEV_TYPE_VFIO,
> +#define KVM_DEV_TYPE_VFIO		KVM_DEV_TYPE_VFIO
> +	KVM_DEV_TYPE_ARM_VGIC_V2,
> +#define KVM_DEV_TYPE_ARM_VGIC_V2	KVM_DEV_TYPE_ARM_VGIC_V2
> +	KVM_DEV_TYPE_FLIC,
> +#define KVM_DEV_TYPE_FLIC		KVM_DEV_TYPE_FLIC
> +	KVM_DEV_TYPE_MAX,

Do you want to add the individual values to the enum? A removal of a type would be an API break (so we should be safe against renumbering), but it's sometimes helpful if one can get the numeric value at a glance.

> +};
>
>  /*
>   * ioctls for VM fds
Re: [PATCH v2 1/4] KVM: device: add simple registration mechanism for kvm_device_ops
On Wed, Jul 02, 2014 at 10:17:20AM +0100, Cornelia Huck wrote:
> On Tue, 1 Jul 2014 15:45:15 +0100
> Will Deacon will.dea...@arm.com wrote:
>
> > kvm_ioctl_create_device currently has knowledge of all the device types
> > and their associated ops. This is fairly inflexible when adding support
> > for new in-kernel device emulations, so move what we currently have out
> > into a table, which can support dynamic registration of ops by new
> > drivers for virtual hardware.
> >
> > I didn't try to port all current drivers over, as it's not always clear
> > which initialisation hook the ops should be registered from.
>
> I think that last paragraph should rather go into a cover letter :)
>
> [...]
>
> ...and the changelog as well (or keep changelogs for individual patches).

Yeah... this has grown to be bigger than one patch now. I can include that for v3.

> [...]
>
> Do you want to add the individual values to the enum? A removal of a type
> would be an API break (so we should be safe against renumbering), but it's
> sometimes helpful if one can get the numeric value at a glance.

Could do, but then I think the advantage of the enum is questionable over the #defines, since having two enum entries with the same ID value is just as easy a mistake to make as forgetting to update a KVM_DEV_TYPE_MAX #define (which was the reason for the enum in the first place).

So I'd be inclined to keep the patch as-is, unless you have really strong objections?

Will
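[Editorial note, not part of the thread: the reason UAPI headers echo each enumerator with a #define of the same name, as in the patch above, is that enum constants are invisible to the C preprocessor; the #define is what lets userspace feature-test the header. A hypothetical userspace check, for illustration only:

    #include <linux/kvm.h>

    #ifdef KVM_DEV_TYPE_ARM_VGIC_V2
    	/* these kernel headers know about the VGICv2 device type */
    #endif
]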
Re: [KVM] About releasing vcpu when closing vcpu fd
On Mon, 30 Jun 2014 16:41:07 +0200
Anshul Makkar anshul.mak...@profitbricks.com wrote:

Hi,

Currently, as per the specs for cpu_hot(un)plug, the ACPI GPE Block is at IO ports 0xafe0-0xafe3, where each bit corresponds to one CPU. Currently, the EJ0 method in acpi-dsdt-cpu-hotplug.dsl doesn't do anything:

    Method(CPEJ, 2, NotSerialized) {
        // _EJ0 method - eject callback
        Sleep(200)
    }

I want to implement a notification mechanism for CPU hot-unplug, just like the one we have in memory hot-unplug, where we write to a particular IO port and the read/write is caught in memory-hotplug.c. So I just want a suggestion as to whether I should expand the IO port range from 0xafe0 to 0xafe4 (addition of 1 byte), where the last byte is for notification of the EJ0 event.

I have it on my TODO list to rewrite the CPU hotplug IO interface to be similar to the memory hotplug one. So you can try that; it would allow us to drop the CPU bitmask and make the interface scalable to more than 256 cpus.

Or if you have any other suggestion, please share.

Thanks
Anshul Makkar

On Fri, Jun 6, 2014 at 3:41 PM, Anshul Makkar anshul.mak...@profitbricks.com wrote:

Oh yes, sorry for the ambiguity. I meant the proposal to park unplugged vcpus. Thanks for suggesting the practical approach.

Anshul Makkar

On Fri, Jun 6, 2014 at 3:36 PM, Gleb Natapov g...@minantech.com wrote:

On Fri, Jun 06, 2014 at 03:02:59PM +0200, Anshul Makkar wrote:

IIRC, Igor was of the opinion that a patch for vcpu deletion would be incomplete until it is handled properly in kvm, i.e. until vcpus are destroyed completely. http://comments.gmane.org/gmane.comp.emulators.kvm.devel/114347 So can the above proposal, where vcpus are just disabled and reused in qemu, be an acceptable solution?

If by "above proposal" you mean the proposal in the email you linked, then no, since it tries to destroy the vcpu but does it incorrectly. If you mean the proposal to park an unplugged vcpu, so that the guest will not be able to use it, then yes, it is the pragmatic path forward.

Thanks
Anshul Makkar

On Thu, May 29, 2014 at 10:12 AM, Gleb Natapov g...@kernel.org wrote:

On Thu, May 29, 2014 at 01:40:08PM +0800, Gu Zheng wrote:

There was a patch (from Chen Fan, last August) about releasing the vcpu when closing the vcpu fd: http://www.spinics.net/lists/kvm/msg95701.html. But your comment said that attempts were made before to make it possible to destroy individual vcpus separately from destroying the VM, and they were unsuccessful thus far. So what is the pain point here? If we want to achieve the goal, what should we do? Looking forward to your further comments. :)

The CPU array is accessed locklessly in a lot of places, so it would have to be RCU-ified. There was an attempt to do so two years or so ago, but it didn't go anywhere. Adding locks is too big a price to pay for the ability to free a little bit of memory by destroying a vcpu.

Yes, it's a pain point. But if we want to implement vcpu hot-remove, this must be fixed sooner or later.

Why? vcpu hot-remove already works (or at least worked in the past, for some value of "work"). There is no need to destroy the vcpu completely; just park it and tell the guest not to use it via an ACPI hot-unplug event.

And is anyone working on kvm vcpu hot-remove now?

An alternative may be to make sure that a stopped vcpu takes as little memory as possible.

Yeah. But if we add a new vcpu with the old id that we stopped before, it will fail.

There is no need to create the vcpu again; just unpark it and notify the guest via an ACPI hot-plug event that the vcpu can be used now.

--
	Gleb.

--
Regards,
  Igor
Re: [RFC v2] ARM VM System Specification
On Tue, Jul 01, 2014 at 06:10:19PM +0100, Peter Maydell wrote:
> On 1 July 2014 18:03, Stefano Stabellini stefano.stabell...@eu.citrix.com wrote:
>> On Mon, 30 Jun 2014, Peter Maydell wrote:
>>> How about:
>>>
>>> = Guest OSes in the VM image should rely on the UEFI RTC API for real
>>> time clock services. (To provide that API, the VM system will likely
>>> need to implement some real time clock device, but the details of these
>>> are a private implementation issue between it and its associated UEFI
>>> implementation.)
>>
>> I don't see why we need to add the text within brackets: it is out of
>> scope for this document.
>
> The intention is to be an informative note, not normative text (I'm happy
> if we want to format the text to make that clearer, with some sort of
> NOTE: markup). I'd like VM implementors reading this spec to make the
> correct decisions even if they don't happen to know inside-out the details
> of the UEFI specification and what exactly UEFI demands of the hardware,
> and I think it's worth adding the occasional clarifying sentence even if
> it doesn't strictly speaking add any extra rules to the specification.

That's also the approach we've taken so far elsewhere in the document, so I think it's useful.

-Christoffer
Re: [PATCH v2 1/4] KVM: device: add simple registration mechanism for kvm_device_ops
On Wed, 2 Jul 2014 10:32:41 +0100
Will Deacon will.dea...@arm.com wrote:

> On Wed, Jul 02, 2014 at 10:17:20AM +0100, Cornelia Huck wrote:
>
> [...]
>
>> Do you want to add the individual values to the enum? A removal of a type
>> would be an API break (so we should be safe against renumbering), but
>> it's sometimes helpful if one can get the numeric value at a glance.
>
> Could do, but then I think the advantage of the enum is questionable over
> the #defines, since having two enum entries with the same ID value is just
> as easy a mistake to make as forgetting to update a KVM_DEV_TYPE_MAX
> #define (which was the reason for the enum in the first place).
>
> So I'd be inclined to keep the patch as-is, unless you have really strong
> objections?

I don't really have a strong opinion on that, so

Acked-by: Cornelia Huck cornelia.h...@de.ibm.com
RE: [PATCH 2/2] KVM : powerpc/booke: Allow debug interrupt injection to guest
-----Original Message-----
From: Wood Scott-B07421
Sent: Tuesday, July 01, 2014 10:11 PM
To: Alexander Graf
Cc: Bhushan Bharat-R65777; kvm-...@vger.kernel.org; kvm@vger.kernel.org
Subject: Re: [PATCH 2/2] KVM : powerpc/booke: Allow debug interrupt injection to guest

On Tue, 2014-07-01 at 18:22 +0200, Alexander Graf wrote:
On 01.07.14 17:35, Scott Wood wrote:
On Tue, 2014-07-01 at 17:04 +0200, Alexander Graf wrote:
On 01.07.14 16:58, Scott Wood wrote:
On Tue, 2014-07-01 at 08:23 +0200, Alexander Graf wrote:

I don't think QEMU should be aware of these limitations.

OK, but we should at least have some idea of how the whole thing is supposed to work, in order to determine whether this is the correct behavior for QEMU. I thought the model was that debug resources are either owned by QEMU or by the guest, and in the latter case QEMU would never see the debug exception to begin with.

That's bad for a number of reasons. For starters, it's different from how x86 handles debug registers - and I hate to be different just for the sake of being different.

How does it work on x86?

It overwrites more-or-less random breakpoints with its own ones, but leaves the others intact ;).

Are you talking about software breakpoints or management of hardware debug registers? So if we do want to declare that debug registers are owned by either QEMU or the guest, we should change the semantics for all architectures. If we want to say that ownership of the registers is shared, we need a plan for how that would actually work.

I think you're overengineering here :). When do people actually use gdbstub? Usually when they want to debug a broken guest. We can either:

 * overengineer heavily and reduce the number of registers available to the guest to always have spares,
 * overengineer a bit and turn off guest debugging completely when we use gdbstub, or
 * just leave as much alive as we can, hoping that it helps with the debugging.

Option 3 is what x86 does - and I think it's a reasonable approach. This is not an interface that needs to be 100% consistent and bullet-proof; it's a best effort to enable you to debug as much as possible.

I'm not insisting on 100% -- just hoping for some explanation/discussion about how it's intended to work for the cases where it can. How will MSR[DE] and MSRP[DEP] be handled? How would I go about telling QEMU/KVM that I don't want this shared mode, because I don't want the guest to interfere with the debugging I'm trying to do from QEMU? Will guest accesses to debug registers cause a userspace exit when guest_debug is enabled?

I think we're in a path that is slow enough already to not worry about performance.

It's not just about performance, but simplicity of use and consistency of the API.

Oh, and it looks like there already exist one-reg definitions and implementations for most of the debug registers.

For BookE? Where?

arch/powerpc/kvm/booke.c: KVM_REG_PPC_IACn, KVM_REG_PPC_DACn

I tried to quickly prototype what I think we want to do (this is not tested):

---
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index e8b3982..746b5c6 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -179,6 +179,9 @@ extern int kvmppc_xics_get_xive(struct kvm *kvm, u32 irq, u32 *server,
 extern int kvmppc_xics_int_on(struct kvm *kvm, u32 irq);
 extern int kvmppc_xics_int_off(struct kvm *kvm, u32 irq);

+void kvmppc_core_dequeue_debug(struct kvm_vcpu *vcpu);
+void kvmppc_core_queue_debug(struct kvm_vcpu *vcpu);
+
 /*
  * Cuts out inst bits with ordering according to spec.
  * That means the leftmost bit is zero. All given bits are included.
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 9f13056..0b7e4e4 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -235,6 +235,16 @@ void kvmppc_core_dequeue_dec(struct kvm_vcpu *vcpu)
 	clear_bit(BOOKE_IRQPRIO_DECREMENTER, &vcpu->arch.pending_exceptions);
 }

+void kvmppc_core_queue_debug(struct kvm_vcpu *vcpu)
+{
+	kvmppc_booke_queue_irqprio(vcpu, BOOKE_IRQPRIO_DEBUG);
+}
+
+void kvmppc_core_dequeue_debug(struct kvm_vcpu *vcpu)
+{
+	clear_bit(BOOKE_IRQPRIO_DEBUG, &vcpu->arch.pending_exceptions);
+}
+
 void kvmppc_core_queue_external(struct kvm_vcpu *vcpu,
 				struct kvm_interrupt *irq)
 {
@@ -841,6 +851,20 @@ static int kvmppc_handle_debug(struct kvm_run *run, struct kvm_vcpu *vcpu)
 	struct debug_reg *dbg_reg = &(vcpu->arch.shadow_dbg_reg);
 	u32 dbsr = vcpu->arch.dbsr;

+	/* Userspace (QEMU) is not using debug resource, so inject debug interrupt
+	 * directly to guest debug.
+	 */
+	if (vcpu->guest_debug == 0) {
+		if (dbsr && (vcpu->arch.shared->msr & MSR_DE))
Re: [PATCH 4/6] KVM: PPC: BOOK3S: HV: Use new functions for mapping/unmapping hpte in host
Paul Mackerras pau...@samba.org writes:

> On Sun, Jun 29, 2014 at 04:47:33PM +0530, Aneesh Kumar K.V wrote:
>> We want to use the virtual page class key protection mechanism for
>> indicating an MMIO mapped hpte entry or a guest hpte entry that is swapped
>> out in the host. Those hptes will be marked valid, but have virtual page
>> class key set to 30 or 31. These virtual page class numbers are configured
>> in AMR to deny read/write. To accommodate such a change, add new functions
>> that map, unmap and check whether a hpte is mapped in the host. This patch
>> still uses HPTE_V_VALID and HPTE_V_ABSENT and doesn't use virtual page
>> class keys. But we want to differentiate in the code the places where we
>> explicitly check for HPTE_V_VALID from the places where we want to check
>> whether the hpte is host mapped. This patch enables a closer review for
>> such a change.
>
> [...]
>
>>  	/* Check for pending invalidations under the rmap chain lock */
>>  	if (kvm->arch.using_mmu_notifiers &&
>>  	    mmu_notifier_retry(kvm, mmu_seq)) {
>> -		/* inval in progress, write a non-present HPTE */
>> -		pteh |= HPTE_V_ABSENT;
>> -		pteh &= ~HPTE_V_VALID;
>> +		/*
>> +		 * inval in progress in host, write host unmapped pte.
>> +		 */
>> +		host_unmapped_hpte = 1;
>
> This isn't right. We already have HPTE_V_VALID set here, and you now
> don't clear it here, and it doesn't get cleared by the
> __kvmppc_unmap_host_hpte() call below either.

Ok, missed that. Will fix that in the next update. In the earlier version I had kvmppc_unmap_host_hpte always clearing V_VALID.

-aneesh
Re: [PATCH 5/6] KVM: PPC: BOOK3S: Use hpte_update_in_progress to track invalid hpte during an hpte update
Paul Mackerras pau...@samba.org writes:

> On Sun, Jun 29, 2014 at 04:47:34PM +0530, Aneesh Kumar K.V wrote:
>> As per the ISA, we first need to mark the hpte invalid (V=0) before we
>> update the hpte lower half bits. With the virtual page class key
>> protection mechanism we want to send any fault other than a key fault to
>> the guest directly, without searching the hash page table. But then we can
>> get a NO_HPTE fault while we are updating the hpte. To track that, add a
>> vm-specific atomic variable that we check in the fault path so that such
>> faults are always sent to the host.
>
> [...]
>
>> @@ -750,13 +751,15 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
>>  	r = rcbits | ~(HPTE_R_R | HPTE_R_C);
>>
>>  	if (be64_to_cpu(hptep[0]) & HPTE_V_VALID) {
>> -		/* HPTE was previously valid, so we need to invalidate it */
>> +		/*
>> +		 * If we had mapped this hpte before, we now need to
>> +		 * invalidate that.
>> +		 */
>>  		unlock_rmap(rmap);
>> -		/* Always mark HPTE_V_ABSENT before invalidating */
>> -		kvmppc_unmap_host_hpte(kvm, hptep);
>>  		kvmppc_invalidate_hpte(kvm, hptep, index);
>>  		/* don't lose previous R and C bits */
>>  		r |= be64_to_cpu(hptep[1]) & (HPTE_R_R | HPTE_R_C);
>> +		hpte_invalidated = true;
>
> So now we're not setting the ABSENT bit before invalidating the HPTE.
> That means that another guest vcpu could do an H_ENTER which could
> think that this HPTE is free and use it for another unrelated guest
> HPTE, which would be bad...

But h_enter looks at HPTE_V_HVLOCK, and we keep that set throughout. I will double-check the code again to make sure it is safe in the above scenario.

>> @@ -1144,8 +1149,8 @@ static int kvm_test_clear_dirty_npages(struct kvm *kvm, unsigned long *rmapp)
>>  			npages_dirty = n;
>>  		eieio();
>>  	}
>> -	kvmppc_map_host_hpte(kvm, &v, &r);
>> -	hptep[0] = cpu_to_be64(v & ~HPTE_V_HVLOCK);
>> +	hptep[0] = cpu_to_be64(v & ~HPTE_V_LOCK);
>> +	atomic_dec(&kvm->arch.hpte_update_in_progress);
>
> Why are we using LOCK rather than HVLOCK now? (And why didn't you
> mention this change and its rationale in the patch description?)

Sorry, that is a typo. I intend to use HPTE_V_HVLOCK.

-aneesh
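[Editorial note, not part of the thread: the ordering constraint referenced in the commit message ("mark the hpte invalid before updating the lower half") can be sketched in pseudo-code as below. This is illustrative only, not taken from the patch; hpte_v, new_r and new_v are hypothetical locals, and the window between step 1 and step 3 is where a guest access can see V=0 and fault:

    	/* Illustrative only -- not the actual kernel code. */
    	hpte_v = be64_to_cpu(hptep[0]);
    	hptep[0] = cpu_to_be64(hpte_v & ~HPTE_V_VALID);	/* 1. clear V */
    	kvmppc_invalidate_hpte(kvm, hptep, index);	/* flush old translation */
    	hptep[1] = cpu_to_be64(new_r);			/* 2. update low doubleword */
    	eieio();					/* order the two stores */
    	hptep[0] = cpu_to_be64(new_v | HPTE_V_VALID);	/* 3. set V again */
]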
Re: [PATCH 2/6] KVM: PPC: BOOK3S: HV: Deny virtual page class key update via h_protect
Paul Mackerras pau...@samba.org writes:

> On Sun, Jun 29, 2014 at 04:47:31PM +0530, Aneesh Kumar K.V wrote:
>> This makes it consistent with h_enter, where we clear the key bits. We
>> also want to use the virtual page class key protection mechanism for
>> indicating a host page fault. For that we will be using key class indexes
>> 30 and 31. So prevent the guest from updating key bits until we add proper
>> support for the virtual page class protection mechanism for the guest.
>> This will not have any impact for PAPR Linux guests because Linux guests
>> currently don't use the virtual page class key protection model.
>
> As things stand, without this patch series, we do actually have
> everything we need in the kernel for guests to use virtual page class
> keys. Arguably we should have a capability to tell userspace how many
> storage keys the guest can use, but that's the only missing piece as
> far as I can see.

Yes.

> If we add such a capability, I can't see any reason why we should need
> to disable guest use of storage keys in this patchset.

With this patchset, we would need additional changes to find out whether a key fault happened because of the guest's own use of the key. I was planning to do that as an add-on series to keep the changes here minimal. Also, since Linux didn't use keys, I was not sure whether guest support for keys is an important item.

-aneesh
Re: [PATCH REPOST v2] ARM: virt: fix wrong HSCTLR.EE bit setting
On 01/07/14 11:01, john.liuli wrote: From: Li Liu john.li...@huawei.com HSCTLR.EE is defined as bit[25] referring to arm manual DDI0606C.b(p1590). Reviewed-by: Marc Zyngier marc.zyng...@arm.com Signed-off-by: Li Liu john.li...@huawei.com --- changes v1 - v2: 1)Use the canonical syntax for orr (orr r7, r7, #(1 25)) 2)Use the ARM_BE8 macro instead of the open-coded CONFIG_CPU_BIG_ENDIAN arch/arm/kernel/hyp-stub.S |4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/arch/arm/kernel/hyp-stub.S b/arch/arm/kernel/hyp-stub.S index 797b1a6..7e666cf 100644 --- a/arch/arm/kernel/hyp-stub.S +++ b/arch/arm/kernel/hyp-stub.S @@ -134,9 +134,7 @@ ENTRY(__hyp_stub_install_secondary) mcr p15, 4, r7, c1, c1, 3 @ HSTR THUMB( orr r7, #(1 30) ) @ HSCTLR.TE -#ifdef CONFIG_CPU_BIG_ENDIAN - orr r7, #(1 9) @ HSCTLR.EE -#endif +ARM_BE8(orr r7, r7, #(1 25)) @ HSCTLR.EE mcr p15, 4, r7, c1, c0, 0 @ HSCTLR mrc p15, 4, r7, c1, c1, 1 @ HDCR Russell, Assuming you're happy with this patch, would you agree to it going through the KVM/ARM tree? We have a whole bunch of BE patches queued up, and this is a looks like a better fix than what we have so far [1]. Thanks, M. [1]: https://lists.cs.columbia.edu/pipermail/kvmarm/2014-June/009950.html -- Jazz is not dead. It just smells funny... -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] kvm: Add macro VMX_APIC_ACCESS_PAGE_ADDR
On Wed, Jul 02, 2014 at 05:00:35PM +0800, Tang Chen wrote: Define guest phys_addr of apic access page. --- arch/x86/include/asm/vmx.h | 2 +- arch/x86/kvm/svm.c | 3 ++- arch/x86/kvm/vmx.c | 7 --- 3 files changed, 7 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h index 7004d21..c4672d1 100644 --- a/arch/x86/include/asm/vmx.h +++ b/arch/x86/include/asm/vmx.h @@ -422,7 +422,7 @@ enum vmcs_field { #define VMX_EPT_DIRTY_BIT(1ull 9) #define VMX_EPT_IDENTITY_PAGETABLE_ADDR 0xfffbc000ul - +#define VMX_APIC_ACCESS_PAGE_ADDR0xfee0ull It has nothing to do with VMX and there is already define for that: APIC_DEFAULT_PHYS_BASE -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
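For reference, the existing definition Gleb points to lives in arch/x86/include/asm/apicdef.h and already names the same guest-physical address, which is why the new VMX_APIC_ACCESS_PAGE_ADDR define is redundant:

	#define APIC_DEFAULT_PHYS_BASE	0xfee00000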
Re: [PATCH 3/4] kvm, memory-hotplug: Update ept identity pagetable when it is migrated.
On Wed, Jul 02, 2014 at 05:00:36PM +0800, Tang Chen wrote: ept identity pagetable is pinned in memory, and as a result it cannot be migrated/hot-removed. But actually it doesn't need to be pinned in memory. This patch introduces a new vcpu request: KVM_REQ_MIGRATE_EPT to reset ept indetity pagetable related variable. This request will be made when kvm_mmu_notifier_invalidate_page() is called when the page is unmapped from the qemu user space to reset kvm-arch.ept_identity_pagetable to NULL. And will also be made when ept violation happens to reset kvm-arch.ept_identity_pagetable to the new page. kvm-arch.ept_identity_pagetable is never used as a page address, just boolean null/!null to see if identity pagetable is initialized. I do not see why would we want to track its address at all. Changing it to bool and assigning true during initialization should be enough. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
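A minimal sketch of what Gleb suggests (the field name below is only illustrative): keep a flag instead of the page address, set it once the identity map has been built, and never treat it as a pointer.

	struct kvm_arch {
		/* ... */
		bool ept_identity_pagetable_done;
		gpa_t ept_identity_map_addr;
		/* ... */
	};

	/* in init_rmode_identity_map(), roughly: */
	if (kvm->arch.ept_identity_pagetable_done)
		return 1;
	/* ... allocate and fill the identity pagetable ... */
	kvm->arch.ept_identity_pagetable_done = true;
	return 1;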
Re: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since race
Wanpeng Li wanpeng...@linux.intel.com writes: This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 I can also reproduce this easily with Linux as L1 by slowing it down eg. running with ept = 0 I suggest changing the subject to - KVM: nVMX: Fix race that incorrectly injects L1's irq to L2 If we didn't inject a still-pending event to L1 since nested_run_pending, KVM_REQ_EVENT should be requested after the vmexit in order to inject the event to L1. However, current log blindly request a KVM_REQ_EVENT even if What's current log ? Do you mean current code ? there is no still-pending event to L1 which blocked by nested_run_pending. There is a race which lead to an interrupt will be injected to L2 which belong to L1 if L0 send an interrupt to L1 during this window. VCPU0 another thread L1 intr not blocked on L2 first entry vmx_vcpu_run req event kvm check request req event check_nested_events don't have any intr not nested exit intr occur (8254, lapic timer etc) inject_pending_event now have intr inject interrupt This patch fix this race by introduced a l1_events_blocked field in nested_vmx which indicates there is still-pending event which blocked by nested_run_pending, and smart request a KVM_REQ_EVENT if there is a still-pending event which blocked by nested_run_pending. Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com --- arch/x86/kvm/vmx.c | 20 +++- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index f4e5aed..fe69c49 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -372,6 +372,7 @@ struct nested_vmx { u64 vmcs01_tsc_offset; /* L2 must run next, and mustn't decide to exit to L1. */ bool nested_run_pending; + bool l1_events_blocked; /* * Guest pages referred to in vmcs02 with host-physical pointers, so * we must keep them pinned while L2 runs. @@ -7380,8 +7381,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu) * we did not inject a still-pending event to L1 now because of * nested_run_pending, we need to re-enable this bit. */ - if (vmx-nested.nested_run_pending) + if (to_vmx(vcpu)-nested.l1_events_blocked) { Is to_vmx() necessary since we alredy have the vmx pointer ? + to_vmx(vcpu)-nested.l1_events_blocked = false; kvm_make_request(KVM_REQ_EVENT, vcpu); + } vmx-nested.nested_run_pending = 0; @@ -8197,15 +8200,20 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr) if (nested_cpu_has_preemption_timer(get_vmcs12(vcpu)) vmx-nested.preemption_timer_expired) { - if (vmx-nested.nested_run_pending) + if (vmx-nested.nested_run_pending) { + vmx-nested.l1_events_blocked = true; return -EBUSY; + } nested_vmx_vmexit(vcpu, EXIT_REASON_PREEMPTION_TIMER, 0, 0); return 0; } if (vcpu-arch.nmi_pending nested_exit_on_nmi(vcpu)) { - if (vmx-nested.nested_run_pending || - vcpu-arch.interrupt.pending) + if (vmx-nested.nested_run_pending) { + vmx-nested.l1_events_blocked = true; + return -EBUSY; + } + if (vcpu-arch.interrupt.pending) return -EBUSY; nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI, NMI_VECTOR | INTR_TYPE_NMI_INTR | @@ -8221,8 +8229,10 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr) if ((kvm_cpu_has_interrupt(vcpu) || external_intr) nested_exit_on_intr(vcpu)) { - if (vmx-nested.nested_run_pending) + if (vmx-nested.nested_run_pending) { + vmx-nested.l1_events_blocked = true; return -EBUSY; + } nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0); } Also, I am wondering isn't it enough to just do this to avoid this race ? 
static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu) { - return (!to_vmx(vcpu)-nested.nested_run_pending + return (!is_guest_mode(vcpu) + !to_vmx(vcpu)-nested.nested_run_pending vmcs_readl(GUEST_RFLAGS) X86_EFLAGS_IF) !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) Thanks, Bandan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
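The fragment above is truncated by the digest; assuming the rest of the function stays as in the then-current vmx.c, the suggested alternative would read roughly:

	static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
	{
		return (!is_guest_mode(vcpu) &&
			!to_vmx(vcpu)->nested.nested_run_pending &&
			(vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
			!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
			  (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS)));
	}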
RE: [PATCH 2/2] KVM : powerpc/booke: Allow debug interrupt injection to guest
-Original Message- From: Bhushan Bharat-R65777 Sent: Wednesday, July 02, 2014 5:07 PM To: Wood Scott-B07421; Alexander Graf Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org Subject: RE: [PATCH 2/2] KVM : powerpc/booke: Allow debug interrupt injection to guest -Original Message- From: Wood Scott-B07421 Sent: Tuesday, July 01, 2014 10:11 PM To: Alexander Graf Cc: Bhushan Bharat-R65777; kvm-...@vger.kernel.org; kvm@vger.kernel.org Subject: Re: [PATCH 2/2] KVM : powerpc/booke: Allow debug interrupt injection to guest On Tue, 2014-07-01 at 18:22 +0200, Alexander Graf wrote: On 01.07.14 17:35, Scott Wood wrote: On Tue, 2014-07-01 at 17:04 +0200, Alexander Graf wrote: On 01.07.14 16:58, Scott Wood wrote: On Tue, 2014-07-01 at 08:23 +0200, Alexander Graf wrote: I don't think QEMU should be aware of these limitations. OK, but we should at least have some idea of how the whole thing is supposed to work, in order to determine if this is the correct behavior for QEMU. I thought the model was that debug resources are either owned by QEMU or by the guest, and in the latter case, QEMU would never see the debug exception to begin with. That's bad for a number of reasons. For starters it's different from how x86 handles debug registers - and I hate to be different just for the sake of being different. How does it work on x86? It overwrites more-or-less random breakpoints with its own ones, but leaves the others intact ;). Are you talking about software breakpoints or management of hardware debug registers? So if we do want to declare that debug registers are owned by either QEMU or the guest, we should change the semantics for all architectures. If we want to say that ownership of the registers is shared, we need a plan for how that would actually work. I think you're overengineering here :). When do people actually use gdbstub? Usually when they want to debug a broken guest. We can either * overengineer heavily and reduce the number of registers available to the guest to always have spares * overengineer a bit and turn off guest debugging completely when we use gdbstub * just leave as much alive as we can, hoping that it helps with the debugging Option 3 is what x86 does - and I think it's a reasonable approach. This is not an interface that needs to be 100% consistent and bullet proof, it's a best effort to enable you to debug as much as possible. I'm not insisting on 100% -- just hoping for some explanation/discussion about how it's intended to work for the cases where it can. How will MSR[DE] and MSRP[DEP] be handled? How would I go about telling QEMU/KVM that I don't want this shared mode, because I don't want guest to interfere with the debugging I'm trying to do from QEMU? Will guest accesses to debug registers cause a userspace exit when guest_debug is enabled? I think we're in a path that is slow enough already to not worry about performance. It's not just about performance, but simplicity of use, and consistency of API. Oh, and it looks like there already exist one reg definitions and implementations for most of the debug registers. For BookE? Where? arch/powerpc/kvm/booke.c: KVM_REG_PPC_IACn, KVM_REG_PPC_DACn I tried to quickly prototype what I think we want to do (this is not tested) Hi Scott, There is one problem which is stopping us to share debug resource between qemu and guest, looking for suggestions: - As qemu is also using debug resource, We have to set MSR_DE and set MSRP_DEP (guest will not be able to clear MSR_DE). So qemu set debug events will always cause the debug interrupts. 
- Now guest is also using debug resources and for some reason if guest wants to clear MSR_DE (disable debug interrupt) But it will not be able to disable as MSRP_DEP is set and KVM will not come to know guest willingness to disable MSR_DE. - If the debug interrupts occurs then we will exit to QEMU and this may not a QEMU set event so it will inject interrupt to guest (using one-reg or set-sregs) - Now KVM, when handling one-reg/sregs request to inject debug interrupt, do not know whether guest can handle the debug interrupt or not (as guest might have tried to set/clear MSR_DE). Thanks -Bharat --- diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index e8b3982..746b5c6 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -179,6 +179,9 @@ extern int kvmppc_xics_get_xive(struct kvm *kvm, u32 irq, u32 *server, extern int kvmppc_xics_int_on(struct kvm *kvm, u32 irq); extern int kvmppc_xics_int_off(struct kvm *kvm, u32 irq); +void kvmppc_core_dequeue_debug(struct kvm_vcpu *vcpu); void
Re: [PATCH 0/4] kvm, mem-hotplug: Do not pin ept identity pagetable and apic access page.
Hi Gleb, On 07/02/2014 05:00 PM, Tang Chen wrote: Hi Gleb, Marcelo, Please help to review this patch-set. NOTE: This patch-set doesn't work properly. ept identity pagetable and apic access page in kvm are pinned in memory. As a result, they cannot be migrated/hot-removed. But actually they don't need to be pinned in memory. This patch-set introduces two new vcpu requests: KVM_REQ_MIGRATE_EPT and KVM_REQ_MIGRATE_APIC. These two requests are made when the two pages are migrated by the mmu_notifier to reset the related variable to unusable value. And will also be made when ept violation happens to reset new pages. [Known problem] After this patch-set applied, the two pages can be migrated/hot-removed. But after migrating apic access page, the guest died. The host physical address of apic access page is stored in VMCS. I reset it to 0 to stop guest from accessing it when it is unmapped by kvm_mmu_notifier_invalidate_page(). And reset it to new page's host physical address in tdp_page_fault(). But it seems that guest will access apic page directly by the host physical address. Would you please to give some advice about this problem ? Thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] kvm: Add macro VMX_APIC_ACCESS_PAGE_ADDR
On 07/03/2014 12:24 AM, Gleb Natapov wrote: On Wed, Jul 02, 2014 at 05:00:35PM +0800, Tang Chen wrote: Define guest phys_addr of apic access page. --- arch/x86/include/asm/vmx.h | 2 +- arch/x86/kvm/svm.c | 3 ++- arch/x86/kvm/vmx.c | 7 --- 3 files changed, 7 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h index 7004d21..c4672d1 100644 --- a/arch/x86/include/asm/vmx.h +++ b/arch/x86/include/asm/vmx.h @@ -422,7 +422,7 @@ enum vmcs_field { #define VMX_EPT_DIRTY_BIT (1ull 9) #define VMX_EPT_IDENTITY_PAGETABLE_ADDR 0xfffbc000ul - +#define VMX_APIC_ACCESS_PAGE_ADDR 0xfee0ull It has nothing to do with VMX and there is already define for that: APIC_DEFAULT_PHYS_BASE OK, followed. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] kvm, memory-hotplug: Update ept identity pagetable when it is migrated.
On 07/03/2014 12:34 AM, Gleb Natapov wrote: On Wed, Jul 02, 2014 at 05:00:36PM +0800, Tang Chen wrote: ept identity pagetable is pinned in memory, and as a result it cannot be migrated/hot-removed. But actually it doesn't need to be pinned in memory. This patch introduces a new vcpu request: KVM_REQ_MIGRATE_EPT to reset ept indetity pagetable related variable. This request will be made when kvm_mmu_notifier_invalidate_page() is called when the page is unmapped from the qemu user space to reset kvm-arch.ept_identity_pagetable to NULL. And will also be made when ept violation happens to reset kvm-arch.ept_identity_pagetable to the new page. kvm-arch.ept_identity_pagetable is never used as a page address, just boolean null/!null to see if identity pagetable is initialized. I do not see why would we want to track its address at all. Changing it to bool and assigning true during initialization should be enough. OK, followed. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/10] RFC: userfault
On 07/02/2014 09:50 AM, Andrea Arcangeli wrote: Hello everyone, There's a large CC list for this RFC because this adds two new syscalls (userfaultfd and remap_anon_pages) and MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on changes to the API or on a completely different API if somebody has better ideas are welcome now. cc:linux-api -- this is certainly worthy of linux-api discussion. The combination of these features are what I would propose to implement postcopy live migration in qemu, and in general demand paging of remote memory, hosted in different cloud nodes. The MADV_USERFAULT feature should be generic enough that it can provide the userfaults to the Android volatile range feature too, on access of reclaimed volatile pages. If the access could ever happen in kernel context through syscalls (not not just from userland context), then userfaultfd has to be used to make the userfault unnoticeable to the syscall (no error will be returned). This latter feature is more advanced than what volatile ranges alone could do with SIGBUS so far (but it's optional, if the process doesn't call userfaultfd, the regular SIGBUS will fire, if the fd is closed SIGBUS will also fire for any blocked userfault that was waiting a userfaultfd_write ack). userfaultfd is also a generic enough feature, that it allows KVM to implement postcopy live migration without having to modify a single line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all other GUP features works just fine in combination with userfaults (userfaults trigger async page faults in the guest scheduler so those guest processes that aren't waiting for userfaults can keep running in the guest vcpus). remap_anon_pages is the syscall to use to resolve the userfaults (it's not mandatory, vmsplice will likely still be used in the case of local postcopy live migration just to upgrade the qemu binary, but remap_anon_pages is faster and ideal for transferring memory across the network, it's zerocopy and doesn't touch the vma: it only holds the mmap_sem for reading). The current behavior of remap_anon_pages is very strict to avoid any chance of memory corruption going unnoticed. mremap is not strict like that: if there's a synchronization bug it would drop the destination range silently resulting in subtle memory corruption for example. remap_anon_pages would return -EEXIST in that case. If there are holes in the source range remap_anon_pages will return -ENOENT. If remap_anon_pages is used always with 2M naturally aligned addresses, transparent hugepages will not be splitted. In there could be 4k (or any size) holes in the 2M (or any size) source range, remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to relax some of its strict checks (-ENOENT won't be returned if RAP_ALLOW_SRC_HOLES is set, remap_anon_pages then will just behave as a noop on any hole in the source range). This flag is generally useful when implementing userfaults with THP granularity, but it shouldn't be set if doing the userfaults with PAGE_SIZE granularity if the developer wants to benefit from the strict -ENOENT behavior. 
The remap_anon_pages syscall API is not vectored, as I expect it to be used mainly for demand paging (where there can be just one faulting range per userfault) or for large ranges (with the THP model as an alternative to zapping re-dirtied pages with MADV_DONTNEED with 4k granularity before starting the guest in the destination node) where vectoring isn't going to provide much performance advantages (thanks to the THP coarser granularity). On the rmap side remap_anon_pages doesn't add much complexity: there's no need of nonlinear anon vmas to support it because I added the constraint that it will fail if the mapcount is more than 1. So in general the source range of remap_anon_pages should be marked MADV_DONTFORK to prevent any risk of failure if the process ever forks (like qemu can in some case). One part that hasn't been tested is the poll() syscall on the userfaultfd because the postcopy migration thread currently is more efficient waiting on blocking read()s (I'll write some code to test poll() too). I also appended below a patch to trinity to exercise remap_anon_pages and userfaultfd and it completes trinity successfully. The code can be found here: git clone --reference linux git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault The branch is rebased so you can get updates for example with: git fetch git checkout -f origin/userfault Comments welcome, thanks! Andrea From cbe940e13b4cead41e0f862b3abfa3814f235ec3 Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli aarca...@redhat.com Date: Wed, 2 Jul 2014 18:32:35 +0200 Subject: [PATCH] add remap_anon_pages and userfaultfd Signed-off-by: Andrea Arcangeli aarca...@redhat.com --- include/syscalls-x86_64.h | 2 + syscalls/remap_anon_pages.c | 100
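Since remap_anon_pages exists only in this RFC branch, there is no libc wrapper or official syscall number; a caller would look roughly like the sketch below, where the __NR value and argument order are assumptions and only the RAP_ALLOW_SRC_HOLES flag and the -EEXIST/-ENOENT semantics come from the description above.

	#include <errno.h>
	#include <stdio.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#define __NR_remap_anon_pages	317		/* placeholder number */
	#define RAP_ALLOW_SRC_HOLES	(1UL << 0)	/* flag named above */

	static int move_pages(void *dst, void *src, size_t len)
	{
		long ret = syscall(__NR_remap_anon_pages, dst, src, len,
				   RAP_ALLOW_SRC_HOLES);
		if (ret < 0 && errno == EEXIST)
			/* destination already populated: likely a
			 * synchronization bug; nothing was dropped silently */
			fprintf(stderr, "remap_anon_pages: -EEXIST\n");
		return ret < 0 ? -errno : 0;
	}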
Re: [PATCH 08/10] userfaultfd: add new syscall to provide memory externalization
On 07/02/2014 09:50 AM, Andrea Arcangeli wrote: Once an userfaultfd is created MADV_USERFAULT regions talks through the userfaultfd protocol with the thread responsible for doing the memory externalization of the process. The protocol starts by userland writing the requested/preferred USERFAULT_PROTOCOL version into the userfault fd (64bit write), if kernel knows it, it will ack it by allowing userland to read 64bit from the userfault fd that will contain the same 64bit USERFAULT_PROTOCOL version that userland asked. Otherwise userfault will read __u64 value -1ULL (aka USERFAULTFD_UNKNOWN_PROTOCOL) and it will have to try again by writing an older protocol version if suitable for its usage too, and read it back again until it stops reading -1ULL. After that the userfaultfd protocol starts. The protocol consists in the userfault fd reads 64bit in size providing userland the fault addresses. After a userfault address has been read and the fault is resolved by userland, the application must write back 128bits in the form of [ start, end ] range (64bit each) that will tell the kernel such a range has been mapped. Multiple read userfaults can be resolved in a single range write. poll() can be used to know when there are new userfaults to read (POLLIN) and when there are threads waiting a wakeup through a range write (POLLOUT). Signed-off-by: Andrea Arcangeli aarca...@redhat.com +#ifdef CONFIG_PROC_FS +static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f) +{ + struct userfaultfd_ctx *ctx = f-private_data; + int ret; + wait_queue_t *wq; + struct userfaultfd_wait_queue *uwq; + unsigned long pending = 0, total = 0; + + spin_lock(ctx-fault_wqh.lock); + list_for_each_entry(wq, ctx-fault_wqh.task_list, task_list) { + uwq = container_of(wq, struct userfaultfd_wait_queue, wq); + if (uwq-pending) + pending++; + total++; + } + spin_unlock(ctx-fault_wqh.lock); + + ret = seq_printf(m, pending:\t%lu\ntotal:\t%lu\n, pending, total); This should show the protocol version, too. + +SYSCALL_DEFINE1(userfaultfd, int, flags) +{ + int fd, error; + struct file *file; This looks like it can't be used more than once in a process. That will be unfortunate for libraries. Would it be feasible to either have userfaultfd claim a range of addresses or for a vma to be explicitly associated with a userfaultfd? (In the latter case, giant PROT_NONE MAP_NORESERVE mappings could be used.) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
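Putting the handshake and fault loop described above into code, a minimal single-threaded, blocking consumer might look like the sketch below; the syscall number and the requested version value are placeholders, since this is an unmerged RFC interface, and error handling is largely elided.

	#include <stdint.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#define __NR_userfaultfd	318	/* placeholder number */

	static int userfault_loop(uint64_t proto)
	{
		uint64_t addr, range[2];
		int ufd = syscall(__NR_userfaultfd, 0);

		if (ufd < 0)
			return -1;
		/* handshake: write the wanted version, read back the ack */
		write(ufd, &proto, sizeof(proto));
		read(ufd, &proto, sizeof(proto));
		if (proto == (uint64_t)-1)
			return -1;	/* kernel doesn't speak this version */

		for (;;) {
			/* each 64-bit read returns one faulting address */
			if (read(ufd, &addr, sizeof(addr)) != sizeof(addr))
				break;
			/* ... map the missing page, e.g. via remap_anon_pages ... */
			range[0] = addr & ~0xfffULL;
			range[1] = range[0] + 0x1000;
			/* 128-bit [start, end] write marks the range resolved
			 * and wakes the blocked faulting thread */
			write(ufd, range, sizeof(range));
		}
		return 0;
	}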
[PATCH V2 3/3] kvm: ignore LBR and offcore rsp
From: Kan Liang kan.li...@intel.com With -cpu host KVM reports LBR and offcore support, so the perf driver may accesses the LBR and offcore MSRs. However, there is no LBR and offcore virtualization support yet. This could causes guest to crash. As a workaround, KVM just simply ignore the LBR and offcore_rsp MSRs to lie the guest. For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y. And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n. Start the guest with -cpu host. Run perf record with --branch-any or --branch-filter in guest to trigger LBR #GP. Run perf stat offcore events (E.g. LLC-loads/LLC-load-misses ...) in guest to trigger offcore_rsp #GP Signed-off-by: Andi Kleen a...@linux.intel.com Signed-off-by: Kan Liang kan.li...@intel.com --- arch/x86/kvm/pmu.c | 29 + 1 file changed, 29 insertions(+) diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index cbecaa9..ad9b4fa 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -331,6 +331,15 @@ bool kvm_pmu_msr(struct kvm_vcpu *vcpu, u32 msr) case MSR_CORE_PERF_GLOBAL_OVF_CTRL: ret = pmu-version 1; break; + case MSR_OFFCORE_RSP_0: + case MSR_OFFCORE_RSP_1: + /* At most 8-deep LBR for core and atom */ + case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 7: + case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 7: + /* 16-deep LBR for core i3/i5/i7 series processors */ + case MSR_LBR_NHM_FROM ... MSR_LBR_NHM_FROM + 15: + case MSR_LBR_NHM_TO ... MSR_LBR_NHM_TO + 15: + return 1; /* to avoid crashes */ default: ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) || get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) @@ -358,6 +367,16 @@ int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data) case MSR_CORE_PERF_GLOBAL_OVF_CTRL: *data = pmu-global_ovf_ctrl; return 0; + case MSR_OFFCORE_RSP_0: + case MSR_OFFCORE_RSP_1: + /* At most 8-deep LBR for core and atom */ + case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 7: + case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 7: + /* 16-deep LBR for core i3/i5/i7 series processors */ + case MSR_LBR_NHM_FROM ... MSR_LBR_NHM_FROM + 15: + case MSR_LBR_NHM_TO ... MSR_LBR_NHM_TO + 15: + *data = 0; + return 0; default: if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) || (pmc = get_fixed_pmc(pmu, index))) { @@ -409,6 +428,16 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) return 0; } break; + case MSR_OFFCORE_RSP_0: + case MSR_OFFCORE_RSP_1: + /* At most 8-deep LBR for core and atom */ + case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 7: + case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 7: + /* 16-deep LBR for core i3/i5/i7 series processors */ + case MSR_LBR_NHM_FROM ... MSR_LBR_NHM_FROM + 15: + case MSR_LBR_NHM_TO ... MSR_LBR_NHM_TO + 15: + /* dummy for now */ + break; default: if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) || (pmc = get_fixed_pmc(pmu, index))) { -- 1.8.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH V2 1/3] perf ignore LBR and offcore_rsp.
From: Kan Liang kan.li...@intel.com x86, perf: Protect LBR and offcore rsp against KVM lying With -cpu host, KVM reports LBR and offcore support, if the host has support. When the guest perf driver tries to access LBR or offcore_rsp MSR, it #GPs all MSR accesses,since KVM doesn't handle LBR and offcore support. So check the related MSRs access right once at initialization time to avoid the error access at runtime. For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y. And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n. Start the guest with -cpu host. Run perf record with --branch-any or --branch-filter in guest to trigger LBR #GP. Run perf stat offcore events (E.g. LLC-loads/LLC-load-misses ...) in guest to trigger offcore_rsp #GP Signed-off-by: Andi Kleen a...@linux.intel.com Signed-off-by: Kan Liang kan.li...@intel.com V2: Move the check code to initialization time. --- arch/x86/kernel/cpu/perf_event.h | 19 +-- arch/x86/kernel/cpu/perf_event_intel.c | 7 +++ arch/x86/kernel/cpu/perf_event_intel_lbr.c | 4 ++-- 3 files changed, 26 insertions(+), 4 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h index 3b2f9bd..5d977b2 100644 --- a/arch/x86/kernel/cpu/perf_event.h +++ b/arch/x86/kernel/cpu/perf_event.h @@ -458,12 +458,13 @@ struct x86_pmu { u64 lbr_sel_mask; /* LBR_SELECT valid bits */ const int *lbr_sel_map; /* lbr_select mappings */ boollbr_double_abort; /* duplicated lbr aborts */ - + boollbr_msr_access;/* LBR MSR can be accessed */ /* * Extra registers for events */ struct extra_reg *extra_regs; unsigned int er_flags; + boolextra_msr_access; /* EXTRA REG MSR can be accessed */ /* * Intel host/guest support (KVM) @@ -525,6 +526,20 @@ extern u64 __read_mostly hw_cache_extra_regs [PERF_COUNT_HW_CACHE_OP_MAX] [PERF_COUNT_HW_CACHE_RESULT_MAX]; +/* + * Under certain circumstances, access certain MSR may cause #GP. + * The function tests if the input MSR can be safely accessed. + */ +static inline bool test_msr_access(unsigned long msr) +{ + u64 value; + if (rdmsrl_safe(msr, value) 0) + return false; + if (wrmsrl_safe(msr, value) 0) + return false; + return true; +} + u64 x86_perf_event_update(struct perf_event *event); static inline unsigned int x86_pmu_config_addr(int index) @@ -555,7 +570,7 @@ static inline void __x86_pmu_enable_event(struct hw_perf_event *hwc, { u64 disable_mask = __this_cpu_read(cpu_hw_events.perf_ctr_virt_mask); - if (hwc-extra_reg.reg) + if (hwc-extra_reg.reg x86_pmu.extra_msr_access) wrmsrl(hwc-extra_reg.reg, hwc-extra_reg.config); wrmsrl(hwc-config_base, (hwc-config | enable_mask) ~disable_mask); } diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c index adb02aa..8011d42 100644 --- a/arch/x86/kernel/cpu/perf_event_intel.c +++ b/arch/x86/kernel/cpu/perf_event_intel.c @@ -2565,6 +2565,13 @@ __init int intel_pmu_init(void) } } + /* Access LBR MSR may cause #GP under certain circumstances. E.g. KVM doesn't support LBR MSR */ + if (x86_pmu.lbr_nr) + x86_pmu.lbr_msr_access = test_msr_access(x86_pmu.lbr_tos) test_msr_access(x86_pmu.lbr_from); + /* Access extra MSR may cause #GP under certain circumstances. E.g. 
KVM doesn't support offcore event */ + if (x86_pmu.extra_regs) + x86_pmu.extra_msr_access = test_msr_access(x86_pmu.extra_regs-msr); + /* Support full width counters using alternative MSR range */ if (x86_pmu.intel_cap.full_width_write) { x86_pmu.max_period = x86_pmu.cntval_mask; diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c index 9dd2459..9508d1e 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c @@ -172,7 +172,7 @@ static void intel_pmu_lbr_reset_64(void) void intel_pmu_lbr_reset(void) { - if (!x86_pmu.lbr_nr) + if (!x86_pmu.lbr_nr || !x86_pmu.lbr_msr_access) return; if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32) @@ -334,7 +334,7 @@ void intel_pmu_lbr_read(void) { struct cpu_hw_events *cpuc = __get_cpu_var(cpu_hw_events); - if (!cpuc-lbr_users) + if (!cpuc-lbr_users || !x86_pmu.lbr_msr_access) return; if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32) -- 1.8.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at
[PATCH V2 2/3] perf protect LBR when Intel PT is enabled.
From: Kan Liang kan.li...@intel.com If RTIT_CTL.TraceEn=1, any attempt to read or write the LBR or LER MSRs, including LBR_TOS, will result in a #GP. Since Intel PT can be enabled/disabled at runtime, LBR MSRs have to be protected by _safe() at runtime. Signed-off-by: Kan Liang kan.li...@intel.com --- arch/x86/kernel/cpu/perf_event.h | 1 - arch/x86/kernel/cpu/perf_event_intel.c | 3 --- arch/x86/kernel/cpu/perf_event_intel_lbr.c | 38 +- 3 files changed, 21 insertions(+), 21 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h index 5d977b2..fafb809 100644 --- a/arch/x86/kernel/cpu/perf_event.h +++ b/arch/x86/kernel/cpu/perf_event.h @@ -458,7 +458,6 @@ struct x86_pmu { u64 lbr_sel_mask; /* LBR_SELECT valid bits */ const int *lbr_sel_map; /* lbr_select mappings */ boollbr_double_abort; /* duplicated lbr aborts */ - boollbr_msr_access;/* LBR MSR can be accessed */ /* * Extra registers for events */ diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c index 8011d42..ddd3590 100644 --- a/arch/x86/kernel/cpu/perf_event_intel.c +++ b/arch/x86/kernel/cpu/perf_event_intel.c @@ -2565,9 +2565,6 @@ __init int intel_pmu_init(void) } } - /* Access LBR MSR may cause #GP under certain circumstances. E.g. KVM doesn't support LBR MSR */ - if (x86_pmu.lbr_nr) - x86_pmu.lbr_msr_access = test_msr_access(x86_pmu.lbr_tos) test_msr_access(x86_pmu.lbr_from); /* Access extra MSR may cause #GP under certain circumstances. E.g. KVM doesn't support offcore event */ if (x86_pmu.extra_regs) x86_pmu.extra_msr_access = test_msr_access(x86_pmu.extra_regs-msr); diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c index 9508d1e..980b8dc 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c @@ -157,7 +157,7 @@ static void intel_pmu_lbr_reset_32(void) int i; for (i = 0; i x86_pmu.lbr_nr; i++) - wrmsrl(x86_pmu.lbr_from + i, 0); + wrmsrl_safe(x86_pmu.lbr_from + i, 0ULL); } static void intel_pmu_lbr_reset_64(void) @@ -165,14 +165,14 @@ static void intel_pmu_lbr_reset_64(void) int i; for (i = 0; i x86_pmu.lbr_nr; i++) { - wrmsrl(x86_pmu.lbr_from + i, 0); - wrmsrl(x86_pmu.lbr_to + i, 0); + wrmsrl_safe(x86_pmu.lbr_from + i, 0ULL); + wrmsrl_safe(x86_pmu.lbr_to + i, 0ULL); } } void intel_pmu_lbr_reset(void) { - if (!x86_pmu.lbr_nr || !x86_pmu.lbr_msr_access) + if (!x86_pmu.lbr_nr) return; if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32) @@ -237,19 +237,14 @@ void intel_pmu_lbr_disable_all(void) /* * TOS = most recently recorded branch */ -static inline u64 intel_pmu_lbr_tos(void) +static inline int intel_pmu_lbr_tos(u64 *tos) { - u64 tos; - - rdmsrl(x86_pmu.lbr_tos, tos); - - return tos; + return rdmsrl_safe(x86_pmu.lbr_tos, tos); } -static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc) +static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc, u64 tos) { unsigned long mask = x86_pmu.lbr_nr - 1; - u64 tos = intel_pmu_lbr_tos(); int i; for (i = 0; i x86_pmu.lbr_nr; i++) { @@ -278,11 +273,10 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc) * is the same as the linear address, allowing us to merge the LIP and EIP * LBR formats. 
*/ -static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc) +static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc, u64 tos) { unsigned long mask = x86_pmu.lbr_nr - 1; int lbr_format = x86_pmu.intel_cap.lbr_format; - u64 tos = intel_pmu_lbr_tos(); int i; int out = 0; @@ -333,14 +327,24 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc) void intel_pmu_lbr_read(void) { struct cpu_hw_events *cpuc = __get_cpu_var(cpu_hw_events); + u64 tos; - if (!cpuc-lbr_users || !x86_pmu.lbr_msr_access) + if (!cpuc-lbr_users) + return; + + /* +* If KVM doesn't support LBR MSRs or Intel PT is enabled, +* accessing LBR MSRs cause GP#. +* Since Intel PT can be enabled/disabled at runtime, +* checking the LBR MSRs access right here. +*/ + if (intel_pmu_lbr_tos(tos) 0) return; if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32) - intel_pmu_lbr_read_32(cpuc); + intel_pmu_lbr_read_32(cpuc, tos); else - intel_pmu_lbr_read_64(cpuc); + intel_pmu_lbr_read_64(cpuc, tos);
Re: [PATCH V2 1/3] perf ignore LBR and offcore_rsp.
Signed-off-by: Andi Kleen a...@linux.intel.com I did not contribute to this patch, so please remove that SOB. Signed-off-by: Kan Liang kan.li...@intel.com struct extra_reg *extra_regs; unsigned int er_flags; + boolextra_msr_access; /* EXTRA REG MSR can be accessed */ This doesn't look right, needs a flag for each extra register. They are completely unrelated to each other. BTW this will also cause KVM messages at each boot now. wrmsrl(hwc-extra_reg.reg, hwc-extra_reg.config); wrmsrl(hwc-config_base, (hwc-config | enable_mask) ~disable_mask); } diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c index adb02aa..8011d42 100644 --- a/arch/x86/kernel/cpu/perf_event_intel.c +++ b/arch/x86/kernel/cpu/perf_event_intel.c @@ -2565,6 +2565,13 @@ __init int intel_pmu_init(void) } } + /* Access LBR MSR may cause #GP under certain circumstances. E.g. KVM doesn't support LBR MSR */ + if (x86_pmu.lbr_nr) + x86_pmu.lbr_msr_access = test_msr_access(x86_pmu.lbr_tos) test_msr_access(x86_pmu.lbr_from); s/// And also this doesn't cover the case when someone takes over the LBRs and they start #GPing later. So for LBR the test has to be still at each access. -Andi -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V2 1/3] perf ignore LBR and offcore_rsp.
On Wed, Jul 2, 2014 at 2:14 PM, kan.li...@intel.com wrote: From: Kan Liang kan.li...@intel.com x86, perf: Protect LBR and offcore rsp against KVM lying With -cpu host, KVM reports LBR and offcore support, if the host has support. When the guest perf driver tries to access LBR or offcore_rsp MSR, it #GPs all MSR accesses,since KVM doesn't handle LBR and offcore support. So check the related MSRs access right once at initialization time to avoid the error access at runtime. For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y. This is for host kernel, And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n. And this is for guest kernel, right? -Jidong -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V2 2/3] perf protect LBR when Intel PT is enabled.
On Wed, Jul 02, 2014 at 11:14:14AM -0700, kan.li...@intel.com wrote: From: Kan Liang kan.li...@intel.com If RTIT_CTL.TraceEn=1, any attempt to read or write the LBR or LER MSRs, including LBR_TOS, will result in a #GP. Since Intel PT can be enabled/disabled at runtime, LBR MSRs have to be protected by _safe() at runtime. Signed-off-by: Kan Liang kan.li...@intel.com Patch looks good to me. -Andi -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since race
Hi Jan, On Wed, Jul 02, 2014 at 11:01:30AM +0200, Jan Kiszka wrote: On 2014-07-02 08:54, Wanpeng Li wrote: This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 If we didn't inject a still-pending event to L1 since nested_run_pending, KVM_REQ_EVENT should be requested after the vmexit in order to inject the event to L1. However, current log blindly request a KVM_REQ_EVENT even if there is no still-pending event to L1 which blocked by nested_run_pending. There is a race which lead to an interrupt will be injected to L2 which belong to L1 if L0 send an interrupt to L1 during this window. VCPU0 another thread L1 intr not blocked on L2 first entry vmx_vcpu_run req event kvm check request req event check_nested_events don't have any intr not nested exit intr occur (8254, lapic timer etc) inject_pending_event now have intr inject interrupt This patch fix this race by introduced a l1_events_blocked field in nested_vmx which indicates there is still-pending event which blocked by nested_run_pending, and smart request a KVM_REQ_EVENT if there is a still-pending event which blocked by nested_run_pending. There are more, unrelated reasons why KVM_REQ_EVENT could be set. Why aren't those able to trigger this scenario? Other KVM_REQ_EVENT has a specific intr or pending , however, the KVM_REQ_EVENT which request after vmexit with nested_run_pending may or may not have a specific intr or pending. Regards, Wanpeng Li In any case, unconditionally setting KVM_REQ_EVENT seems strange and should be changed. Jan Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com --- arch/x86/kvm/vmx.c | 20 +++- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index f4e5aed..fe69c49 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -372,6 +372,7 @@ struct nested_vmx { u64 vmcs01_tsc_offset; /* L2 must run next, and mustn't decide to exit to L1. */ bool nested_run_pending; +bool l1_events_blocked; /* * Guest pages referred to in vmcs02 with host-physical pointers, so * we must keep them pinned while L2 runs. @@ -7380,8 +7381,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu) * we did not inject a still-pending event to L1 now because of * nested_run_pending, we need to re-enable this bit. 
*/ -if (vmx-nested.nested_run_pending) +if (to_vmx(vcpu)-nested.l1_events_blocked) { +to_vmx(vcpu)-nested.l1_events_blocked = false; kvm_make_request(KVM_REQ_EVENT, vcpu); +} vmx-nested.nested_run_pending = 0; @@ -8197,15 +8200,20 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr) if (nested_cpu_has_preemption_timer(get_vmcs12(vcpu)) vmx-nested.preemption_timer_expired) { -if (vmx-nested.nested_run_pending) +if (vmx-nested.nested_run_pending) { +vmx-nested.l1_events_blocked = true; return -EBUSY; +} nested_vmx_vmexit(vcpu, EXIT_REASON_PREEMPTION_TIMER, 0, 0); return 0; } if (vcpu-arch.nmi_pending nested_exit_on_nmi(vcpu)) { -if (vmx-nested.nested_run_pending || -vcpu-arch.interrupt.pending) +if (vmx-nested.nested_run_pending) { +vmx-nested.l1_events_blocked = true; +return -EBUSY; +} +if (vcpu-arch.interrupt.pending) return -EBUSY; nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI, NMI_VECTOR | INTR_TYPE_NMI_INTR | @@ -8221,8 +8229,10 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr) if ((kvm_cpu_has_interrupt(vcpu) || external_intr) nested_exit_on_intr(vcpu)) { -if (vmx-nested.nested_run_pending) +if (vmx-nested.nested_run_pending) { +vmx-nested.l1_events_blocked = true; return -EBUSY; +} nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0); } -- Siemens AG, Corporate Technology, CT RTC ITP SES-DE Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [RFC] COLO HA Project proposal
Hi David, On 07/01/2014 08:12 PM, Dr. David Alan Gilbert wrote: * Hongyang Yang (yan...@cn.fujitsu.com) wrote: Hi Yang, Background: COLO HA project is a high availability solution. Both primary VM (PVM) and secondary VM (SVM) run in parallel. They receive the same request from client, and generate response in parallel too. If the response packets from PVM and SVM are identical, they are released immediately. Otherwise, a VM checkpoint (on demand) is conducted. The idea is presented in Xen summit 2012, and 2013, and academia paper in SOCC 2013. It's also presented in KVM forum 2013: http://www.linux-kvm.org/wiki/images/1/1d/Kvm-forum-2013-COLO.pdf Please refer to above document for detailed information. Yes, I remember that talk - very interesting. I didn't quite understand a couple of things though, perhaps you can explain: 1) If we ignore the TCP sequence number problem, in an SMP machine don't we get other randomnesses - e.g. which core completes something first, or who wins a lock contention, so the output stream might not be identical - so do those normal bits of randomness cause the machines to flag as out-of-sync? It's about COLO agent, CCing Congyang, he can give the detailed explanation. 2) If the PVM has decided that the SVM is out of sync (due to 1) and the PVM fails at about the same point - can we switch over to the SVM? Yes, we can switch over, we have some mechanisms to ensure the SVM's state is consentient: - memory cache. The memory cache was initially the same as PVM's memory. At checkpoint, we cache the dirty memory of PVM while transporting the memory, write cached memory to SVM when we received all PVM memory (we only need to write memory that was both dirty on PVM and SVM from last checkpoint). This solves problem 2) you've mentioned above: If PVM fails while checkpointing, SVM will discard the cached memory and continue to run and to provide service just as it is. - COLO Disk manager Like memory cache, COLO Disk manager caches the Disk modifications of PVM, and write it to SVM Disk when checkpointing. If PVM fails while checkpointing, SVM will discard the cached Disk modifications. I'm worried that due to (1) there are periods where the system is out-of-sync and a failure of the PVM is not protected. Does that happen? If so how often? The attached was the architecture of kvm-COLO we proposed. - COLO Manager: Requires modifications of qemu - COLO Controller COLO Controller includes modifications of save/restore flow just like MC(macrocheckpoint), a memory cache on secondary VM which cache the dirty pages of primary VM and a failover module which provides APIs to communicate with external heartbead module. - COLO Disk Manager When pvm writes data into image, the colo disk manger captures this data and send it to the colo disk manger which makes sure the context of svm's image is consentient with the context of pvm's image. I wonder if there is anyway to coordinate this between COLO, Michael Hines microcheckpointing and the two separate reverse-execution projects that also need to do some similar things. Are there any standard APIs for the heartbeet thing we can already tie into? Sadly we have checked MC, it does not have heartbeat support for now. - COLO Agent(Proxy module in the arch picture) We need an agent to compare the packets returned by Primary VM and Secondary VM, and decide whether to start a checkpoint according to some rules. It is a linux kernel module for host. Why is that a kernel module, and how does it communicate the state to the QEMU instance? 
The reason we made this a kernel module is to gain better performance. We can easily hook the packets in a kernel module. QEMU instance uses ioctl() to communicate with the COLO Agent. - Other minor modifications We may need other modifications for better performance. Dave P.S. I'm starting to look at fault-tolerance stuff, but haven't got very far yet, so starting to try and understand the details of COLO, microcheckpointing, etc -- Thanks, Yang. -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK . -- Thanks, Yang. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH V2 1/3] perf ignore LBR and offcore_rsp.
Signed-off-by: Andi Kleen a...@linux.intel.com I did not contribute to this patch, so please remove that SOB. OK Signed-off-by: Kan Liang kan.li...@intel.com struct extra_reg *extra_regs; unsigned int er_flags; + boolextra_msr_access; /* EXTRA REG MSR can be accessed */ This doesn't look right, needs a flag for each extra register. They are completely unrelated to each other. The extra register is either MSR_OFFCORE_RSP_0 or MSR_OFFCORE_RSP_1. I will add two variables to handle them. BTW this will also cause KVM messages at each boot now. Do you mean the unhandled wrmsr or unhandled rdmsr messages? You should not observer such error message for offcore msrs and LBR from/to msrs. But I forget to handle LBR TOS in KVM patch. You may observed unhandled rdmsr: 0x1c9 when doing LBR in guest. I will fix it in next version. wrmsrl(hwc-extra_reg.reg, hwc-extra_reg.config); wrmsrl(hwc-config_base, (hwc-config | enable_mask) ~disable_mask); } diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c index adb02aa..8011d42 100644 --- a/arch/x86/kernel/cpu/perf_event_intel.c +++ b/arch/x86/kernel/cpu/perf_event_intel.c @@ -2565,6 +2565,13 @@ __init int intel_pmu_init(void) } } + /* Access LBR MSR may cause #GP under certain circumstances. E.g. KVM doesn't support LBR MSR */ + if (x86_pmu.lbr_nr) + x86_pmu.lbr_msr_access = test_msr_access(x86_pmu.lbr_tos) +test_msr_access(x86_pmu.lbr_from); s/// And also this doesn't cover the case when someone takes over the LBRs and they start #GPing later. So for LBR the test has to be still at each access. In the second patch, the LBR test has been moved to runtime check. This case has been handled. Kan -Andi -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
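A sketch of the follow-up Kan describes (names below are illustrative, not the actual v3 patch): probe each offcore response MSR separately at init time with the same test_msr_access() helper and keep one flag per register, instead of a single shared extra_msr_access bit.

	/* in struct x86_pmu */
	bool	offcore_rsp_0_access;	/* MSR_OFFCORE_RSP_0 usable */
	bool	offcore_rsp_1_access;	/* MSR_OFFCORE_RSP_1 usable */

	/* in intel_pmu_init() */
	x86_pmu.offcore_rsp_0_access = test_msr_access(MSR_OFFCORE_RSP_0);
	x86_pmu.offcore_rsp_1_access = test_msr_access(MSR_OFFCORE_RSP_1);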
RE: [PATCH V2 1/3] perf ignore LBR and offcore_rsp.
On Wed, Jul 2, 2014 at 2:14 PM, kan.li...@intel.com wrote: From: Kan Liang kan.li...@intel.com x86, perf: Protect LBR and offcore rsp against KVM lying With -cpu host, KVM reports LBR and offcore support, if the host has support. When the guest perf driver tries to access LBR or offcore_rsp MSR, it #GPs all MSR accesses,since KVM doesn't handle LBR and offcore support. So check the related MSRs access right once at initialization time to avoid the error access at runtime. For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y. This is for host kernel, And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n. And this is for guest kernel, right? Right. -Jidong
Re: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since race
Hi Bandan, On Wed, Jul 02, 2014 at 12:27:59PM -0400, Bandan Das wrote: Wanpeng Li wanpeng...@linux.intel.com writes: This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 I can also reproduce this easily with Linux as L1 by slowing it down eg. running with ept = 0 I suggest changing the subject to - KVM: nVMX: Fix race that incorrectly injects L1's irq to L2 Ok, I will fold this to next version. ;-) If we didn't inject a still-pending event to L1 since nested_run_pending, KVM_REQ_EVENT should be requested after the vmexit in order to inject the event to L1. However, current log blindly request a KVM_REQ_EVENT even if What's current log ? Do you mean current code ? Yeah, it's a typo. I mean logic. [...] Also, I am wondering isn't it enough to just do this to avoid this race ? static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu) { - return (!to_vmx(vcpu)-nested.nested_run_pending + return (!is_guest_mode(vcpu) + !to_vmx(vcpu)-nested.nested_run_pending vmcs_readl(GUEST_RFLAGS) X86_EFLAGS_IF) !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) I don't think you fix the root cause of the race, and there are two cases which I concern about your proposal: - If there is a special L1 which don't ask to exit on external intrs, you will lose the intrs which L0 inject to L2. - If inject_pending_event fail to inject an intr since the change that you made in vmx_interrupt_allowed, L0 will request an intr window for L2, however, the intr is belong to L1. Regards, Wanpeng Li Thanks, Bandan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since race
Jan Kiszka jan.kis...@siemens.com writes: On 2014-07-02 08:54, Wanpeng Li wrote: This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 If we didn't inject a still-pending event to L1 since nested_run_pending, KVM_REQ_EVENT should be requested after the vmexit in order to inject the event to L1. However, current log blindly request a KVM_REQ_EVENT even if there is no still-pending event to L1 which blocked by nested_run_pending. There is a race which lead to an interrupt will be injected to L2 which belong to L1 if L0 send an interrupt to L1 during this window. VCPU0 another thread L1 intr not blocked on L2 first entry vmx_vcpu_run req event kvm check request req event check_nested_events don't have any intr not nested exit intr occur (8254, lapic timer etc) inject_pending_event now have intr inject interrupt This patch fix this race by introduced a l1_events_blocked field in nested_vmx which indicates there is still-pending event which blocked by nested_run_pending, and smart request a KVM_REQ_EVENT if there is a still-pending event which blocked by nested_run_pending. There are more, unrelated reasons why KVM_REQ_EVENT could be set. Why aren't those able to trigger this scenario? In any case, unconditionally setting KVM_REQ_EVENT seems strange and should be changed. Ugh! I think I am hitting another one but this one's probably because we are not setting KVM_REQ_EVENT for something we should. Before this patch, I was able to hit this bug everytime with modprobe kvm_intel ept=0 nested=1 enable_shadow_vmcs=0 and then booting L2. I can verify that I was indeed hitting the race in inject_pending_event. After this patch, I believe I am hitting another bug - this happens after I boot L2, as above, and then start a Linux kernel compilation and then wait and watch :) It's a pain to debug because this happens almost once in three times; it never happens if I run with ept=1, however, I think that's only because the test completes sooner. But I can confirm that I don't see it if I always set REQ_EVENT if nested_run_pending is set instead of the approach this patch takes. (Any debug hints help appreciated!) So, I am not sure if this is the right fix. Rather, I think the safer thing to do is to have the interrupt pending check for injection into L1 at the same site as the call to kvm_queue_interrupt() just like we had before commit b6b8a1451fc40412c57d1. Is there any advantage to having all the nested events checks together ? PS - Actually, a much easier fix (or rather hack) is to return 1 in vmx_interrupt_allowed() (as I mentioned elsewhere) only if !is_guest_mode(vcpu) That way, the pending interrupt interrupt can be taken care of correctly during the next vmexit. Bandan Jan Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com --- arch/x86/kvm/vmx.c | 20 +++- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index f4e5aed..fe69c49 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -372,6 +372,7 @@ struct nested_vmx { u64 vmcs01_tsc_offset; /* L2 must run next, and mustn't decide to exit to L1. */ bool nested_run_pending; +bool l1_events_blocked; /* * Guest pages referred to in vmcs02 with host-physical pointers, so * we must keep them pinned while L2 runs. @@ -7380,8 +7381,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu) * we did not inject a still-pending event to L1 now because of * nested_run_pending, we need to re-enable this bit. 
*/ -if (vmx-nested.nested_run_pending) +if (to_vmx(vcpu)-nested.l1_events_blocked) { +to_vmx(vcpu)-nested.l1_events_blocked = false; kvm_make_request(KVM_REQ_EVENT, vcpu); +} vmx-nested.nested_run_pending = 0; @@ -8197,15 +8200,20 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr) if (nested_cpu_has_preemption_timer(get_vmcs12(vcpu)) vmx-nested.preemption_timer_expired) { -if (vmx-nested.nested_run_pending) +if (vmx-nested.nested_run_pending) { +vmx-nested.l1_events_blocked = true; return -EBUSY; +} nested_vmx_vmexit(vcpu, EXIT_REASON_PREEMPTION_TIMER, 0, 0); return 0; } if (vcpu-arch.nmi_pending nested_exit_on_nmi(vcpu)) { -if (vmx-nested.nested_run_pending || -vcpu-arch.interrupt.pending) +if (vmx-nested.nested_run_pending) { +vmx-nested.l1_events_blocked = true; +return -EBUSY; +} +if (vcpu-arch.interrupt.pending) return -EBUSY; nested_vmx_vmexit(vcpu,
Re: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since race
Wanpeng Li wanpeng...@linux.intel.com writes: Hi Bandan, On Wed, Jul 02, 2014 at 12:27:59PM -0400, Bandan Das wrote: Wanpeng Li wanpeng...@linux.intel.com writes: This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 I can also reproduce this easily with Linux as L1 by slowing it down eg. running with ept = 0 I suggest changing the subject to - KVM: nVMX: Fix race that incorrectly injects L1's irq to L2 Ok, I will fold this to next version. ;-) If we didn't inject a still-pending event to L1 since nested_run_pending, KVM_REQ_EVENT should be requested after the vmexit in order to inject the event to L1. However, current log blindly request a KVM_REQ_EVENT even if What's current log ? Do you mean current code ? Yeah, it's a typo. I mean logic. [...] Also, I am wondering isn't it enough to just do this to avoid this race ? static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu) { - return (!to_vmx(vcpu)-nested.nested_run_pending + return (!is_guest_mode(vcpu) + !to_vmx(vcpu)-nested.nested_run_pending vmcs_readl(GUEST_RFLAGS) X86_EFLAGS_IF) !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) I don't think you fix the root cause of the race, and there are two cases which I concern about your proposal: - If there is a special L1 which don't ask to exit on external intrs, you will lose the intrs which L0 inject to L2. Oh didn't think about that case :), thanks for the pointing this out. It's easy to check this with Xen as L1, I suppose. - If inject_pending_event fail to inject an intr since the change that you made in vmx_interrupt_allowed, L0 will request an intr window for L2, however, the intr is belong to L1. Oh, I thought inject_pending_event has kvm_queue_interrupt only if vmx_interrupt_allowed() returns true so, interrupt will be injected correctly on the next vmexit. Anyway, I am hitting another bug with your patch! Please see my other mail to the list. Thanks! Regards, Wanpeng Li Thanks, Bandan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/2] KVM : powerpc/booke: Allow debug interrupt injection to guest
-----Original Message-----
From: Wood Scott-B07421
Sent: Tuesday, July 01, 2014 10:11 PM
To: Alexander Graf
Cc: Bhushan Bharat-R65777; kvm-ppc@vger.kernel.org; k...@vger.kernel.org
Subject: Re: [PATCH 2/2] KVM : powerpc/booke: Allow debug interrupt injection to guest

On Tue, 2014-07-01 at 18:22 +0200, Alexander Graf wrote:

On 01.07.14 17:35, Scott Wood wrote:

On Tue, 2014-07-01 at 17:04 +0200, Alexander Graf wrote:

On 01.07.14 16:58, Scott Wood wrote:

On Tue, 2014-07-01 at 08:23 +0200, Alexander Graf wrote:

I don't think QEMU should be aware of these limitations.

OK, but we should at least have some idea of how the whole thing is supposed to work, in order to determine if this is the correct behavior for QEMU. I thought the model was that debug resources are either owned by QEMU or by the guest, and in the latter case, QEMU would never see the debug exception to begin with.

That's bad for a number of reasons. For starters it's different from how x86 handles debug registers - and I hate to be different just for the sake of being different.

How does it work on x86?

It overwrites more-or-less random breakpoints with its own ones, but leaves the others intact ;).

Are you talking about software breakpoints or management of hardware debug registers?

So if we do want to declare that debug registers are owned by either QEMU or the guest, we should change the semantics for all architectures. If we want to say that ownership of the registers is shared, we need a plan for how that would actually work.

I think you're overengineering here :). When do people actually use gdbstub? Usually when they want to debug a broken guest. We can either
* overengineer heavily and reduce the number of registers available to the guest to always have spares,
* overengineer a bit and turn off guest debugging completely when we use gdbstub, or
* just leave as much alive as we can, hoping that it helps with the debugging.
Option 3 is what x86 does - and I think it's a reasonable approach. This is not an interface that needs to be 100% consistent and bullet proof, it's a best effort to enable you to debug as much as possible.

I'm not insisting on 100% -- just hoping for some explanation/discussion about how it's intended to work for the cases where it can. How will MSR[DE] and MSRP[DEP] be handled? How would I go about telling QEMU/KVM that I don't want this shared mode, because I don't want the guest to interfere with the debugging I'm trying to do from QEMU? Will guest accesses to debug registers cause a userspace exit when guest_debug is enabled?

I think we're in a path that is slow enough already to not worry about performance.

It's not just about performance, but simplicity of use, and consistency of API.

Oh, and it looks like there already exist one-reg definitions and implementations for most of the debug registers.

For BookE? Where?

arch/powerpc/kvm/booke.c: KVM_REG_PPC_IACn, KVM_REG_PPC_DACn

I tried to quickly prototype what I think we want to do (this is not tested)

---
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index e8b3982..746b5c6 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -179,6 +179,9 @@ extern int kvmppc_xics_get_xive(struct kvm *kvm, u32 irq, u32 *server,
 extern int kvmppc_xics_int_on(struct kvm *kvm, u32 irq);
 extern int kvmppc_xics_int_off(struct kvm *kvm, u32 irq);
 
+void kvmppc_core_dequeue_debug(struct kvm_vcpu *vcpu);
+void kvmppc_core_queue_debug(struct kvm_vcpu *vcpu);
+
 /*
  * Cuts out inst bits with ordering according to spec.
  * That means the leftmost bit is zero. All given bits are included.
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 9f13056..0b7e4e4 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -235,6 +235,16 @@ void kvmppc_core_dequeue_dec(struct kvm_vcpu *vcpu)
 	clear_bit(BOOKE_IRQPRIO_DECREMENTER, &vcpu->arch.pending_exceptions);
 }
 
+void kvmppc_core_queue_debug(struct kvm_vcpu *vcpu)
+{
+	kvmppc_booke_queue_irqprio(vcpu, BOOKE_IRQPRIO_DEBUG);
+}
+
+void kvmppc_core_dequeue_debug(struct kvm_vcpu *vcpu)
+{
+	clear_bit(BOOKE_IRQPRIO_DEBUG, &vcpu->arch.pending_exceptions);
+}
+
 void kvmppc_core_queue_external(struct kvm_vcpu *vcpu,
 				struct kvm_interrupt *irq)
 {
@@ -841,6 +851,20 @@ static int kvmppc_handle_debug(struct kvm_run *run, struct kvm_vcpu *vcpu)
 	struct debug_reg *dbg_reg = &(vcpu->arch.shadow_dbg_reg);
 	u32 dbsr = vcpu->arch.dbsr;
 
+	/*
+	 * Userspace (QEMU) is not using debug resource, so inject debug interrupt
+	 * directly to guest debug.
+	 */
+	if (vcpu->guest_debug == 0) {
+		if (dbsr && (vcpu->arch.shared->msr & MSR_DE))
+
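[Editorial note: the last hunk is cut off in the archive. A hypothetical sketch of the idea it describes - when userspace is not using the debug resources, turn the debug exit into a guest-visible debug interrupt instead of a KVM_EXIT_DEBUG - is given below. The queue helper is the one added earlier in the same prototype; the DBSR clearing and the return values are assumptions for illustration, not the author's actual hunk.]

	/*
	 * Hypothetical continuation: userspace (QEMU) is not using the
	 * debug resources, so reflect the event into the guest if the
	 * guest has debug interrupts enabled, then resume it.
	 */
	if (vcpu->guest_debug == 0) {
		if (dbsr && (vcpu->arch.shared->msr & MSR_DE))
			kvmppc_core_queue_debug(vcpu);

		/* Clear DBSR so the same event is not reported twice (assumption). */
		vcpu->arch.dbsr = 0;
		return RESUME_GUEST;
	}

	/* Otherwise hand the debug exit to userspace as before. */
	run->exit_reason = KVM_EXIT_DEBUG;
	return RESUME_HOST;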
Re: [PATCH 4/6] KVM: PPC: BOOK3S: HV: Use new functions for mapping/unmapping hpte in host
Paul Mackerras pau...@samba.org writes:

On Sun, Jun 29, 2014 at 04:47:33PM +0530, Aneesh Kumar K.V wrote:

We want to use the virtual page class key protection mechanism for indicating an MMIO mapped hpte entry or a guest hpte entry that is swapped out in the host. Those hptes will be marked valid, but have virtual page class key set to 30 or 31. These virtual page class numbers are configured in AMR to deny read/write. To accommodate such a change, add new functions that map, unmap and check whether a hpte is mapped in the host. This patch still uses HPTE_V_VALID and HPTE_V_ABSENT and doesn't use virtual page class keys. But we want to differentiate in the code the places where we explicitly check for HPTE_V_VALID from places where we want to check whether the hpte is host mapped. This patch enables a closer review for such a change.

[...]

 	/* Check for pending invalidations under the rmap chain lock */
 	if (kvm->arch.using_mmu_notifiers &&
 	    mmu_notifier_retry(kvm, mmu_seq)) {
-		/* inval in progress, write a non-present HPTE */
-		pteh |= HPTE_V_ABSENT;
-		pteh &= ~HPTE_V_VALID;
+		/*
+		 * inval in progress in host, write host unmapped pte.
+		 */
+		host_unmapped_hpte = 1;

This isn't right. We already have HPTE_V_VALID set here, and you now don't clear it here, and it doesn't get cleared by the __kvmppc_unmap_host_hpte() call below either.

OK, I missed that. Will fix that in the next update. In the earlier version I had kvmppc_unmap_host_hpte always clearing V_VALID.

-aneesh
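[Editorial note: a hedged sketch of what "always clearing V_VALID", per Paul's objection, could look like. The helper name comes from the quoted patch, but its signature and body are assumptions for illustration only.]

	/*
	 * Sketch (not the actual patch): marking an HPTE as "host unmapped"
	 * must both set ABSENT and clear VALID; otherwise a concurrent
	 * H_ENTER can still treat the slot as a live mapping, which is the
	 * problem pointed out above.
	 */
	static inline void kvmppc_unmap_host_hpte(struct kvm *kvm, __be64 *hptep)
	{
		u64 v = be64_to_cpu(hptep[0]);

		v |= HPTE_V_ABSENT;
		v &= ~HPTE_V_VALID;
		hptep[0] = cpu_to_be64(v);
	}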
Re: [PATCH 5/6] KVM: PPC: BOOK3S: Use hpte_update_in_progress to track invalid hpte during an hpte update
Paul Mackerras pau...@samba.org writes:

On Sun, Jun 29, 2014 at 04:47:34PM +0530, Aneesh Kumar K.V wrote:

As per the ISA, we first need to mark the hpte invalid (V=0) before we update the hpte lower half bits. With the virtual page class key protection mechanism we want to send any fault other than a key fault to the guest directly without searching the hash page table. But then we can get a NO_HPTE fault while we are updating the hpte. To track that, add a vm-specific atomic variable that we check in the fault path to always send the fault to the host.

[...]

@@ -750,13 +751,15 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	r = rcbits | ~(HPTE_R_R | HPTE_R_C);
 
 	if (be64_to_cpu(hptep[0]) & HPTE_V_VALID) {
-		/* HPTE was previously valid, so we need to invalidate it */
+		/*
+		 * If we had mapped this hpte before, we now need to
+		 * invalidate that.
+		 */
 		unlock_rmap(rmap);
-		/* Always mark HPTE_V_ABSENT before invalidating */
-		kvmppc_unmap_host_hpte(kvm, hptep);
 		kvmppc_invalidate_hpte(kvm, hptep, index);
 		/* don't lose previous R and C bits */
 		r |= be64_to_cpu(hptep[1]) & (HPTE_R_R | HPTE_R_C);
+		hpte_invalidated = true;

So now we're not setting the ABSENT bit before invalidating the HPTE. That means that another guest vcpu could do an H_ENTER which could think that this HPTE is free and use it for another unrelated guest HPTE, which would be bad...

But H_ENTER looks at HPTE_V_HVLOCK, and we keep that set throughout. But I will double-check the code again to make sure it is safe in the above scenario.

@@ -1144,8 +1149,8 @@ static int kvm_test_clear_dirty_npages(struct kvm *kvm, unsigned long *rmapp)
 			npages_dirty = n;
 		eieio();
 	}
-	kvmppc_map_host_hpte(kvm, v, r);
-	hptep[0] = cpu_to_be64(v & ~HPTE_V_HVLOCK);
+	hptep[0] = cpu_to_be64(v & ~HPTE_V_LOCK);
+	atomic_dec(&kvm->arch.hpte_update_in_progress);

Why are we using LOCK rather than HVLOCK now? (And why didn't you mention this change and its rationale in the patch description?)

Sorry, that is a typo. I intend to use HPTE_V_HVLOCK.

-aneesh
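[Editorial note: a hedged sketch of the update sequence under discussion, using the names from the quoted patch; new_v and new_r are hypothetical placeholders for the rewritten HPTE halves, and this is an illustration rather than the actual kernel code.]

	/*
	 * Bump the counter so the fault path routes NO_HPTE faults to the
	 * host while the entry is temporarily invalid, as the changelog
	 * above describes.
	 */
	atomic_inc(&kvm->arch.hpte_update_in_progress);

	kvmppc_invalidate_hpte(kvm, hptep, index);	/* V=0 first, per the ISA */
	hptep[1] = cpu_to_be64(new_r);			/* now safe to rewrite the low word */
	eieio();
	hptep[0] = cpu_to_be64(new_v & ~HPTE_V_HVLOCK);	/* publish and drop the lock bit */

	atomic_dec(&kvm->arch.hpte_update_in_progress);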
Re: [PATCH 2/6] KVM: PPC: BOOK3S: HV: Deny virtual page class key update via h_protect
Paul Mackerras pau...@samba.org writes:

On Sun, Jun 29, 2014 at 04:47:31PM +0530, Aneesh Kumar K.V wrote:

This makes it consistent with h_enter, where we clear the key bits. We also want to use the virtual page class key protection mechanism for indicating host page faults. For that we will be using key class index 30 and 31. So prevent the guest from updating key bits until we add proper support for the virtual page class protection mechanism for the guest. This will not have any impact for PAPR Linux guests because the Linux guest currently doesn't use the virtual page class key protection model.

As things stand, without this patch series, we do actually have everything we need in the kernel for guests to use virtual page class keys. Arguably we should have a capability to tell userspace how many storage keys the guest can use, but that's the only missing piece as far as I can see.

Yes.

If we add such a capability, I can't see any reason why we should need to disable guest use of storage keys in this patchset.

With this patchset, we would need additional changes to find out whether the key fault happened because of the guest's usage of the key. I was planning to do that as an add-on series to keep the changes in this one minimal. Also, since Linux didn't use keys, I was not sure whether guest support for keys is an important item.

-aneesh
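[Editorial note: a small illustration of how key classes 30 and 31 could be made to deny access via the AMR, as the changelog above describes. The AMR has two bits per key with key 0 in the most-significant pair; the macro names and the exact programming site are assumptions for illustration, not part of the posted series.]

	/* Hypothetical key assignments taken from the changelog above. */
	#define HOST_KEY_MMIO		30	/* MMIO-mapped hpte */
	#define HOST_KEY_SWAPPED	31	/* hpte swapped out in the host */

	/* Both bits of a key's AMR pair set => loads and stores are denied. */
	#define AMR_KEY_DENY_ALL(key)	(0x3ULL << (62 - 2 * (key)))

	static void deny_host_key_classes(void)
	{
		unsigned long amr = AMR_KEY_DENY_ALL(HOST_KEY_MMIO) |
				    AMR_KEY_DENY_ALL(HOST_KEY_SWAPPED);

		/* Any guest access to pages tagged with these keys now faults. */
		mtspr(SPRN_AMR, amr);
	}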
RE: [PATCH 2/2] KVM : powerpc/booke: Allow debug interrupt injection to guest
-----Original Message-----
From: Bhushan Bharat-R65777
Sent: Wednesday, July 02, 2014 5:07 PM
To: Wood Scott-B07421; Alexander Graf
Cc: kvm-ppc@vger.kernel.org; k...@vger.kernel.org
Subject: RE: [PATCH 2/2] KVM : powerpc/booke: Allow debug interrupt injection to guest

-----Original Message-----
From: Wood Scott-B07421
Sent: Tuesday, July 01, 2014 10:11 PM
To: Alexander Graf
Cc: Bhushan Bharat-R65777; kvm-ppc@vger.kernel.org; k...@vger.kernel.org
Subject: Re: [PATCH 2/2] KVM : powerpc/booke: Allow debug interrupt injection to guest

[...]

I tried to quickly prototype what I think we want to do (this is not tested)

Hi Scott,

There is one problem which is stopping us from sharing debug resources between QEMU and the guest; looking for suggestions:

- As QEMU is also using debug resources, we have to set MSR_DE and set MSRP_DEP (the guest will not be able to clear MSR_DE). So QEMU-set debug events will always cause debug interrupts.
- Now the guest is also using debug resources, and for some reason it may want to clear MSR_DE (disable debug interrupts), but it will not be able to, as MSRP_DEP is set, and KVM will not come to know of the guest's wish to disable MSR_DE.
- If a debug interrupt occurs then we will exit to QEMU, and this may not be a QEMU-set event, so QEMU will inject the interrupt into the guest (using one-reg or set-sregs).
- Now KVM, when handling the one-reg/sregs request to inject a debug interrupt, does not know whether the guest can handle the debug interrupt or not (as the guest might have tried to set/clear MSR_DE).

Thanks
-Bharat

[...]