Re: [PATCH v3 03/13] nEPT: Add EPT tables support to paging_tmpl.h
On Tue, May 21, 2013 at 03:26:18PM -0700, Nakajima, Jun wrote:
On Tue, May 21, 2013 at 4:05 AM, Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote:
On 05/21/2013 05:01 PM, Gleb Natapov wrote:
On Tue, May 21, 2013 at 04:30:13PM +0800, Xiao Guangrong wrote:

@@ -772,6 +810,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
         return gpa;
 }
+#endif

Strange! Why does nested EPT not need these functions? How do we emulate an instruction that faulted in L2?

Sorry, I misunderstood it. I have found the reason.

You can write it down here for future reviewers :)

Okay. The functions used to translate L2's gva to L1's gpa are paging32_gva_to_gpa_nested and paging64_gva_to_gpa_nested, which are generated by including paging_tmpl.h with PTTYPE == 32 and PTTYPE == 64.

Back to your comments on PT_MAX_FULL_LEVELS:

+#ifdef CONFIG_X86_64
+#define PT_MAX_FULL_LEVELS 4
+#define CMPXCHG cmpxchg
+#else
+#define CMPXCHG cmpxchg64
+#define PT_MAX_FULL_LEVELS 2

I don't think we need to support nEPT on 32-bit hosts. So, I plan to remove such code. What do you think?

Why shouldn't we support nEPT on 32-bit hosts?

--
Gleb.
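For reviewers unfamiliar with the mechanism: paging_tmpl.h is included several times with different PTTYPE values, and each inclusion stamps out a complete family of FNAME() walker functions. A minimal sketch of the pattern, simplified from arch/x86/kvm/mmu.c (the real file defines several more helper macros around each inclusion):

/* Each include emits one family of functions, e.g.
 * paging64_gva_to_gpa_nested() and paging32_gva_to_gpa_nested(),
 * which is why nEPT can reuse them for L2 gva -> L1 gpa translation. */
#define PTTYPE 64
#include "paging_tmpl.h"
#undef PTTYPE

#define PTTYPE 32
#include "paging_tmpl.h"
#undef PTTYPE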
Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote:
On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote:

Any pages with stale information will be zapped by kvm_mmu_zap_all(). When that happens, page faults will take place, and they will automatically use the new generation number. So it is still not clear why this is necessary.

This is not, strictly speaking, necessary, but it is the sane thing to do. You cannot update a page's generation number to prevent it from being destroyed, since after kvm_mmu_zap_all() completes, stale ptes in the shadow page may point to a now-deleted memslot. So why build a shadow page table with a page that is in the process of being destroyed?

OK, can this be introduced separately, in a later patch, with separate justification, then? Xiao, please have the first patches of the patchset focus on the problem at hand: fixing long mmu_lock hold times.

Not sure what you mean again. We flush the TLB once before entering this function. kvm_reload_remote_mmus() does this for us, no?

kvm_reload_remote_mmus() is used as an optimization; it's separate from the problem solution. What was suggested was... go to the phrase which starts with "The only purpose of the generation number should be to".

The comment quoted here does not match that description. The comment describes what the code does, and in this it is correct. You propose to not reload roots right away and to do it only when a root sp is encountered, right? So my question is: what's the point? There are, obviously, root sps with an invalid generation number at this point, so the reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not do it here right away and avoid it in kvm_mmu_prepare_zap_page() for invalid and obsolete sps, as I proposed in one of my emails?

Sure. But Xiao, please introduce that TLB collapsing optimization as a later patch, so we can reason about it in a more organized fashion.

So, if I understand correctly, you are asking to move the is_obsolete_sp() check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() to a separate patch. Fine by me, but if we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages(), the call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will become a nop. But I question the need to zap all shadow page tables there in the first place: why is kvm_flush_remote_tlbs() not enough?

--
Gleb.
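For orientation, the scheme being debated works roughly like the sketch below; the names mirror the patchset, but this is a simplified illustration, not the exact code under review:

/* Every shadow page records the generation it was created in. */
static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
{
        return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
}

void kvm_mmu_invalidate_all_pages(struct kvm *kvm)
{
        spin_lock(&kvm->mmu_lock);
        kvm->arch.mmu_valid_gen++;      /* all existing sps become obsolete */
        kvm_reload_remote_mmus(kvm);    /* the reload being questioned above */
        spin_unlock(&kvm->mmu_lock);
}

Obsolete pages can then be zapped lazily, in small batches, instead of holding mmu_lock across one huge kvm_mmu_zap_all().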
Re: [RFC PATCH v2 2/2] add support for Hyper-V invariant TSC
- Original Message -
From: Marcelo Tosatti mtosa...@redhat.com
To: Vadim Rozenfeld vroze...@redhat.com
Cc: kvm@vger.kernel.org, g...@redhat.com, p...@dlh.net
Sent: Wednesday, May 22, 2013 10:50:46 AM
Subject: Re: [RFC PATCH v2 2/2] add support for Hyper-V invariant TSC

On Sun, May 19, 2013 at 05:06:37PM +1000, Vadim Rozenfeld wrote:

The following patch allows activating the partition reference time enlightenment, based on the host platform's support for an Invariant Time Stamp Counter (iTSC). NOTE: This code will survive migration due to the lack of VM stop/resume handlers, where offset, scale and sequence should be readjusted.

---
 arch/x86/kvm/x86.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9645dab..b423fe4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1838,7 +1838,6 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data)
         u64 gfn;
         unsigned long addr;
         HV_REFERENCE_TSC_PAGE tsc_ref;
-        tsc_ref.TscSequence = 0;
         if (!(data & HV_X64_MSR_TSC_REFERENCE_ENABLE)) {
                 kvm->arch.hv_tsc_page = data;
                 break;
@@ -1848,6 +1847,11 @@
                 HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT);
         if (kvm_is_error_hva(addr))
                 return 1;
+        tsc_ref.TscSequence =
+                boot_cpu_has(X86_FEATURE_CONSTANT_TSC) ? 1 : 0;

1) You want NONSTOP_TSC (see commit 40fb1715), which matches INVARIANT TSC.

[VR] Thank you for reviewing. Will fix it.

2) TscSequence should increase? "This field serves as a sequence number that is incremented whenever..."

[VR] Yes, on every VM resume, including migration. After migration we also need to recalculate the scale and adjust the offset.

3) 0x is the value for an invalid source of reference time?

[VR] Yes, on boot-up. In this case the guest will go with PMTimer (not sure about HPET, but I can check). But if we set the sequence to 0x after migration, it probably will not work.

+        tsc_ref.TscScale =
+                ((1LL << 32) / vcpu->arch.virtual_tsc_khz) << 32;
+        tsc_ref.TscOffset = 0;
         if (__copy_to_user((void __user *)addr, &tsc_ref, sizeof(tsc_ref)))
                 return 1;
         mark_page_dirty(kvm, gfn);
--
1.8.1.2
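To make the scale/offset arithmetic concrete: the guest computes reference time from the page as ((TSC * TscScale) >> 64) + TscOffset, in 100ns units, retrying while the sequence changes. A hedged guest-side sketch (rdtsc(), fallback_read() and the HV_INVALID_TSC_SEQUENCE sentinel are assumptions for illustration; the sentinel value is exactly what point 3 above debates):

static u64 hv_read_reference_time(volatile HV_REFERENCE_TSC_PAGE *p)
{
        u32 seq;
        u64 time;

        do {
                seq = p->TscSequence;
                if (seq == HV_INVALID_TSC_SEQUENCE)     /* assumed constant */
                        return fallback_read();         /* e.g. PMTimer */
                /* high 64 bits of the 64x64 -> 128-bit product */
                time = (u64)(((unsigned __int128)rdtsc() * p->TscScale) >> 64)
                       + p->TscOffset;
        } while (p->TscSequence != seq);        /* retry if page was updated */

        return time;    /* 100ns units */
}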
Re: [RFC PATCH v2 1/2] add support for Hyper-V reference time counter
- Original Message -
From: Marcelo Tosatti mtosa...@redhat.com
To: Vadim Rozenfeld vroze...@redhat.com
Cc: kvm@vger.kernel.org, g...@redhat.com, p...@dlh.net
Sent: Wednesday, May 22, 2013 10:46:14 AM
Subject: Re: [RFC PATCH v2 1/2] add support for Hyper-V reference time counter

On Sun, May 19, 2013 at 05:06:36PM +1000, Vadim Rozenfeld wrote:

Signed-off: Peter Lieven p...@dlh.net
Signed-off: Gleb Natapov g...@redhat.com
Signed-off: Vadim Rozenfeld vroze...@redhat.com

v1 -> v2:
1. Mark the TSC page dirty, as suggested by Eric Northup digitale...@google.com and Gleb.
2. Disable local irqs when calling get_kernel_ns, as was done by Peter Lieven p...@dlhnet.de.
3. Move the check for TSC page enable from the second patch to this one.

---
 arch/x86/include/asm/kvm_host.h    |  2 ++
 arch/x86/include/uapi/asm/hyperv.h | 14 ++
 arch/x86/kvm/x86.c                 | 39 +-
 include/uapi/linux/kvm.h           |  1 +
 4 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3741c65..f0fee35 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -575,6 +575,8 @@ struct kvm_arch {
         /* fields used by HYPER-V emulation */
         u64 hv_guest_os_id;
         u64 hv_hypercall;
+        u64 hv_ref_count;
+        u64 hv_tsc_page;

 #ifdef CONFIG_KVM_MMU_AUDIT
         int audit_point;
diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h
index b80420b..890dfc3 100644
--- a/arch/x86/include/uapi/asm/hyperv.h
+++ b/arch/x86/include/uapi/asm/hyperv.h
@@ -136,6 +136,9 @@
 /* MSR used to read the per-partition time reference counter */
 #define HV_X64_MSR_TIME_REF_COUNT		0x40000020

+/* A partition's reference time stamp counter (TSC) page */
+#define HV_X64_MSR_REFERENCE_TSC		0x40000021
+
 /* Define the virtual APIC registers */
 #define HV_X64_MSR_EOI				0x40000070
 #define HV_X64_MSR_ICR				0x40000071
@@ -179,6 +182,9 @@
 #define HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_MASK	\
 		(~((1ull << HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_SHIFT) - 1))

+#define HV_X64_MSR_TSC_REFERENCE_ENABLE		0x00000001
+#define HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT	12
+
 #define HV_PROCESSOR_POWER_STATE_C0		0
 #define HV_PROCESSOR_POWER_STATE_C1		1
 #define HV_PROCESSOR_POWER_STATE_C2		2
@@ -191,4 +197,12 @@
 #define HV_STATUS_INVALID_ALIGNMENT		4
 #define HV_STATUS_INSUFFICIENT_BUFFERS		19

+typedef struct _HV_REFERENCE_TSC_PAGE {
+	__u32 TscSequence;
+	__u32 Rserved1;
+	__u64 TscScale;
+	__s64 TscOffset;
+} HV_REFERENCE_TSC_PAGE, *PHV_REFERENCE_TSC_PAGE;
+
+
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8d28810..9645dab 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -843,7 +843,7 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc);
 static u32 msrs_to_save[] = {
 	MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
 	MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,

not needed.

-	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
+	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL, HV_X64_MSR_TIME_REF_COUNT,
 	HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
 	MSR_KVM_PV_EOI_EN,
 	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
@@ -1788,6 +1788,8 @@ static bool kvm_hv_msr_partition_wide(u32 msr)
 	switch (msr) {
 	case HV_X64_MSR_GUEST_OS_ID:
 	case HV_X64_MSR_HYPERCALL:
+	case HV_X64_MSR_REFERENCE_TSC:
+	case HV_X64_MSR_TIME_REF_COUNT:
 		r = true;
 		break;
 	}
@@ -1827,6 +1829,29 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 	if (__copy_to_user((void __user *)addr, instructions, 4))
 		return 1;
 	kvm->arch.hv_hypercall = data;
+	local_irq_disable();
+	kvm->arch.hv_ref_count = get_kernel_ns();
+	local_irq_enable();
+	break;

local_irq_disable/local_irq_enable not needed.

What is the reasoning behind reading this time value at MSR write time?

[VR] Windows writes this MSR only once, during HAL initialization. So, I decided to treat this call as a partition create event.

+	}
+	case HV_X64_MSR_REFERENCE_TSC: {
+		u64 gfn;
+		unsigned long addr;
+		HV_REFERENCE_TSC_PAGE tsc_ref;
+		tsc_ref.TscSequence = 0;
+		if (!(data & HV_X64_MSR_TSC_REFERENCE_ENABLE)) {
+			kvm->arch.hv_tsc_page = data;
+			break;
+		}
+		gfn = data >> HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT;
+		addr = gfn_to_hva(kvm, data +
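The get_msr side is not captured in this digest; a hedged sketch of the read-out the patch implies, with hv_ref_count serving as the partition-creation timestamp recorded above (the actual hunk may differ):

case HV_X64_MSR_TIME_REF_COUNT: {
        u64 now_ns;

        local_irq_disable();
        now_ns = get_kernel_ns();
        local_irq_enable();
        /* partition reference counter ticks in 100ns units */
        *pdata = div_u64(now_ns - kvm->arch.hv_ref_count, 100);
        break;
}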
Re: [PATCH v4 21/32] arm64: KVM: Build system integration
On Tue, May 21, 2013 at 05:09:47PM +0100, Paolo Bonzini wrote:

- Original Message -
From: Catalin Marinas catalin.mari...@arm.com
To: Marc Zyngier marc.zyng...@arm.com
Cc: linux-arm-ker...@lists.infradead.org, kvm...@lists.cs.columbia.edu, kvm@vger.kernel.org, Will Deacon will.dea...@arm.com, pbonz...@redhat.com, g...@redhat.com, Christopher Covington c...@codeaurora.org
Sent: Tuesday, May 21, 2013 17:42:27
Subject: Re: [PATCH v4 21/32] arm64: KVM: Build system integration

On Tue, May 14, 2013 at 03:13:49PM +0100, Marc Zyngier wrote:

--- /dev/null
+++ b/arch/arm64/kvm/Kconfig
...
+config KVM_ARM_VGIC
+	bool
+	depends on KVM_ARM_HOST && OF
+	select HAVE_KVM_IRQCHIP
+	---help---
+	  Adds support for a hardware assisted, in-kernel GIC emulation.
+
+config KVM_ARM_TIMER
+	bool
+	depends on KVM_ARM_VGIC
+	select HAVE_KVM_IRQCHIP
+	---help---
+	  Adds support for the Architected Timers in virtual machines
+
+source drivers/virtio/Kconfig

Shouldn't the above configs go to drivers/virtio/Kconfig as well (or drivers/virtio/arm/Kconfig)?

virtio doesn't mean "virtual versions of devices that also exist in hardware"; those are indeed in arch/ARCH/kvm. virtio is a specific bus for paravirtualized devices. It is not KVM-specific, and the code runs in the guest (whereas arch/arm64/kvm/Kconfig is host code).

You are right. What I meant was kvm/virt/arm/Kconfig (the place where the vgic and timer drivers go).

--
Catalin
Re: [PATCH v4 21/32] arm64: KVM: Build system integration
On 22/05/2013 10:42, Catalin Marinas wrote:
On Tue, May 21, 2013 at 05:09:47PM +0100, Paolo Bonzini wrote:

- Original Message -
From: Catalin Marinas catalin.mari...@arm.com
To: Marc Zyngier marc.zyng...@arm.com
Cc: linux-arm-ker...@lists.infradead.org, kvm...@lists.cs.columbia.edu, kvm@vger.kernel.org, Will Deacon will.dea...@arm.com, pbonz...@redhat.com, g...@redhat.com, Christopher Covington c...@codeaurora.org
Sent: Tuesday, May 21, 2013 17:42:27
Subject: Re: [PATCH v4 21/32] arm64: KVM: Build system integration

On Tue, May 14, 2013 at 03:13:49PM +0100, Marc Zyngier wrote:

--- /dev/null
+++ b/arch/arm64/kvm/Kconfig
...
+config KVM_ARM_VGIC
+	bool
+	depends on KVM_ARM_HOST && OF
+	select HAVE_KVM_IRQCHIP
+	---help---
+	  Adds support for a hardware assisted, in-kernel GIC emulation.
+
+config KVM_ARM_TIMER
+	bool
+	depends on KVM_ARM_VGIC
+	select HAVE_KVM_IRQCHIP
+	---help---
+	  Adds support for the Architected Timers in virtual machines
+
+source drivers/virtio/Kconfig

Shouldn't the above configs go to drivers/virtio/Kconfig as well (or drivers/virtio/arm/Kconfig)?

virtio doesn't mean "virtual versions of devices that also exist in hardware"; those are indeed in arch/ARCH/kvm. virtio is a specific bus for paravirtualized devices. It is not KVM-specific, and the code runs in the guest (whereas arch/arm64/kvm/Kconfig is host code).

You are right. What I meant was kvm/virt/arm/Kconfig (the place where the vgic and timer drivers go).

Ah yeah, that could make sense.

Paolo
Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
On 05/22/2013 02:34 PM, Gleb Natapov wrote:
On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote:
On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote:

Any pages with stale information will be zapped by kvm_mmu_zap_all(). When that happens, page faults will take place, and they will automatically use the new generation number. So it is still not clear why this is necessary.

This is not, strictly speaking, necessary, but it is the sane thing to do. You cannot update a page's generation number to prevent it from being destroyed, since after kvm_mmu_zap_all() completes, stale ptes in the shadow page may point to a now-deleted memslot. So why build a shadow page table with a page that is in the process of being destroyed?

OK, can this be introduced separately, in a later patch, with separate justification, then? Xiao, please have the first patches of the patchset focus on the problem at hand: fixing long mmu_lock hold times.

Not sure what you mean again. We flush the TLB once before entering this function. kvm_reload_remote_mmus() does this for us, no?

kvm_reload_remote_mmus() is used as an optimization; it's separate from the problem solution. What was suggested was... go to the phrase which starts with "The only purpose of the generation number should be to".

The comment quoted here does not match that description. The comment describes what the code does, and in this it is correct. You propose to not reload roots right away and to do it only when a root sp is encountered, right? So my question is: what's the point? There are, obviously, root sps with an invalid generation number at this point, so the reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not do it here right away and avoid it in kvm_mmu_prepare_zap_page() for invalid and obsolete sps, as I proposed in one of my emails?

Sure. But Xiao, please introduce that TLB collapsing optimization as a later patch, so we can reason about it in a more organized fashion.

So, if I understand correctly, you are asking to move the is_obsolete_sp() check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() to a separate patch. Fine by me, but if we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages(), the call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will become a nop. But I question the need to zap all shadow page tables there in the first place: why is kvm_flush_remote_tlbs() not enough?

I do not know either... I even do not know why kvm_flush_remote_tlbs is needed. :(
Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
On Wed, May 22, 2013 at 04:46:04PM +0800, Xiao Guangrong wrote:
On 05/22/2013 02:34 PM, Gleb Natapov wrote:
On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote:
On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote:

Any pages with stale information will be zapped by kvm_mmu_zap_all(). When that happens, page faults will take place, and they will automatically use the new generation number. So it is still not clear why this is necessary.

This is not, strictly speaking, necessary, but it is the sane thing to do. You cannot update a page's generation number to prevent it from being destroyed, since after kvm_mmu_zap_all() completes, stale ptes in the shadow page may point to a now-deleted memslot. So why build a shadow page table with a page that is in the process of being destroyed?

OK, can this be introduced separately, in a later patch, with separate justification, then? Xiao, please have the first patches of the patchset focus on the problem at hand: fixing long mmu_lock hold times.

Not sure what you mean again. We flush the TLB once before entering this function. kvm_reload_remote_mmus() does this for us, no?

kvm_reload_remote_mmus() is used as an optimization; it's separate from the problem solution. What was suggested was... go to the phrase which starts with "The only purpose of the generation number should be to".

The comment quoted here does not match that description. The comment describes what the code does, and in this it is correct. You propose to not reload roots right away and to do it only when a root sp is encountered, right? So my question is: what's the point? There are, obviously, root sps with an invalid generation number at this point, so the reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not do it here right away and avoid it in kvm_mmu_prepare_zap_page() for invalid and obsolete sps, as I proposed in one of my emails?

Sure. But Xiao, please introduce that TLB collapsing optimization as a later patch, so we can reason about it in a more organized fashion.

So, if I understand correctly, you are asking to move the is_obsolete_sp() check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() to a separate patch. Fine by me, but if we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages(), the call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will become a nop. But I question the need to zap all shadow page tables there in the first place: why is kvm_flush_remote_tlbs() not enough?

I do not know either... I even do not know why kvm_flush_remote_tlbs is needed. :(

We changed the content of an executable page and need to flush the instruction cache of all vcpus so that they do not use stale data. So my suggestion to call kvm_flush_remote_tlbs() is obviously incorrect, since that flushes the TLB, not the instruction cache; but why would kvm_reload_remote_mmus() flush the instruction cache?

--
Gleb.
Re: [PATCH v2 00/10] uaccess: better might_sleep/might_fault behavior
On Thursday 16 May 2013, Michael S. Tsirkin wrote:

This improves the might_fault annotations used by uaccess routines:

1. The only reason uaccess routines might sleep is if they fault. Make this explicit for all architectures.
2. Accesses (e.g. through socket ops) to kernel memory with KERNEL_DS, as net/sunrpc does, will never sleep. Remove an unconditional might_sleep in the inline might_fault in kernel.h (used when PROVE_LOCKING is not set).
3. Accesses with pagefault_disable return EFAULT but won't cause the caller to sleep. Check for that and avoid might_sleep when PROVE_LOCKING is set.

I'd like these changes to go in for the benefit of the vhost driver, where we want to call socket ops under a spinlock and fall back on a slower thread handler on error.

Hi Michael,

I have recently stumbled over a related topic, which is the highly inconsistent placement of might_fault() or might_sleep() in certain classes of uaccess functions. Your patches seem completely reasonable, but it would be good to also fix the other problem, at least on the architectures we most care about.

Given the most commonly used functions and a couple of architectures I'm familiar with, these are the ones that currently call might_fault():

                   x86-32  x86-64  arm  arm64  powerpc  s390  generic
  copy_to_user       -       x      -     -       -      x      x
  copy_from_user     -       x      -     -       -      x      x
  put_user           x       x      x     x       x      x      x
  get_user           x       x      x     x       x      x      x
  __copy_to_user     x       x      -     -       x      -      -
  __copy_from_user   x       x      -     -       x      -      -
  __put_user         -       -      x     -       x      -      -
  __get_user         -       -      x     -       x      -      -

WTF?

Calling might_fault() for every __get_user/__put_user is rather expensive because it turns what should be a single instruction (plus fixup) into an external function call. My feeling is that we should do might_fault() only in access_ok() to get the right balance.

	Arnd
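The inconsistency is easier to see against the intended layering, where put_user() is essentially access_ok() plus __put_user(). A sketch of the generic-flavor relationship (illustrative, not any particular architecture's header):

#define put_user(x, ptr)                                        \
({                                                              \
        might_fault();          /* the call being discussed */  \
        access_ok(VERIFY_WRITE, (ptr), sizeof(*(ptr))) ?        \
                __put_user((x), (ptr)) : -EFAULT;               \
})

The question raised by the table is which layer of this pair should carry the might_fault() annotation on each architecture.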
Re: [PATCH] kvm: add detail error message when fail to add ioeventfd
On Wed, May 22, 2013 at 12:57:35PM +0800, Amos Kong wrote:

I tried to hotplug 28 * 8 multiple-function devices to a guest with an old host kernel; the ioeventfds in the host kernel were exhausted, and then qemu failed to allocate ioeventfds for blk/nic devices. It's better to print a detailed error here.

Signed-off-by: Amos Kong ak...@redhat.com
---
 kvm-all.c | 4 ++++
 1 file changed, 4 insertions(+), 0 deletions(-)

It would be nice to make the kvm bus scalable so that the hardcoded in-kernel I/O device limit can be lifted.

Reviewed-by: Stefan Hajnoczi stefa...@redhat.com
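The patch body is not quoted in this digest; the shape of the change is along these lines (an illustrative sketch only, the actual hunk and wording may differ):

if (r < 0) {
        /* report why the ioeventfd could not be registered in the kernel */
        fprintf(stderr, "%s: unable to map ioeventfd: %s.\n",
                __func__, strerror(-r));
        abort();
}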
Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
On 05/22/2013 04:54 PM, Gleb Natapov wrote:
On Wed, May 22, 2013 at 04:46:04PM +0800, Xiao Guangrong wrote:
On 05/22/2013 02:34 PM, Gleb Natapov wrote:
On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote:
On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote:

Any pages with stale information will be zapped by kvm_mmu_zap_all(). When that happens, page faults will take place, and they will automatically use the new generation number. So it is still not clear why this is necessary.

This is not, strictly speaking, necessary, but it is the sane thing to do. You cannot update a page's generation number to prevent it from being destroyed, since after kvm_mmu_zap_all() completes, stale ptes in the shadow page may point to a now-deleted memslot. So why build a shadow page table with a page that is in the process of being destroyed?

OK, can this be introduced separately, in a later patch, with separate justification, then? Xiao, please have the first patches of the patchset focus on the problem at hand: fixing long mmu_lock hold times.

Not sure what you mean again. We flush the TLB once before entering this function. kvm_reload_remote_mmus() does this for us, no?

kvm_reload_remote_mmus() is used as an optimization; it's separate from the problem solution. What was suggested was... go to the phrase which starts with "The only purpose of the generation number should be to".

The comment quoted here does not match that description. The comment describes what the code does, and in this it is correct. You propose to not reload roots right away and to do it only when a root sp is encountered, right? So my question is: what's the point? There are, obviously, root sps with an invalid generation number at this point, so the reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not do it here right away and avoid it in kvm_mmu_prepare_zap_page() for invalid and obsolete sps, as I proposed in one of my emails?

Sure. But Xiao, please introduce that TLB collapsing optimization as a later patch, so we can reason about it in a more organized fashion.

So, if I understand correctly, you are asking to move the is_obsolete_sp() check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() to a separate patch. Fine by me, but if we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages(), the call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will become a nop. But I question the need to zap all shadow page tables there in the first place: why is kvm_flush_remote_tlbs() not enough?

I do not know either... I even do not know why kvm_flush_remote_tlbs is needed. :(

We changed the content of an executable page and need to flush the instruction cache of all vcpus so that they do not use stale data. So my suggestion to call kvm_flush_remote_tlbs() is obviously incorrect, since that flushes the TLB, not the instruction cache; but why would kvm_reload_remote_mmus() flush the instruction cache?

I thought the reason was the icache too, but the icache is automatically flushed on x86; we only need to invalidate prefetched instructions by executing a serializing operation. See the SDM, chapter 8.1.3, "Handling Self- and Cross-Modifying Code".

kvm_reload_remote_mmus() does not help, I think.

I found that this change was introduced by commit 7aa81cc0, and I have added Anthony to the CC.

I also found some discussions related to calling kvm_reload_remote_mmus():

"But if the instruction is architecture dependent, and you run on the wrong architecture, now you have to patch many locations at fault time, introducing some nasty runtime code / data cache overlap performance problems. Granted, they go away eventually. We're addressing that by blowing away the shadow cache and holding the big kvm lock to ensure SMP safety. Not a great thing to do from a performance perspective but the whole point of patching is that the cost is amortized."
(http://kerneltrap.org/mailarchive/linux-kernel/2007/9/14/260288)

But I cannot understand...
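For reference, the SDM 8.1.3 cross-modifying-code protocol mentioned above looks like this (a sketch with assumed helper names, not kernel APIs):

/*
 *   Modifying CPU:                      Executing CPU:
 *     memcpy(addr, new_insn, len);
 *     WRITE_ONCE(code_ready, 1);          while (!READ_ONCE(code_ready))
 *                                                 cpu_relax();
 *                                         cpuid(...);      // serializing insn
 *                                         goto_code(addr); // now safe to run
 */

The open question in this thread is what, in the hypercall-patching path, plays the role of the serializing instruction on the executing vcpus.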
Re: [PATCH v2 10/10] kernel: might_fault does not imply might_sleep
On Tue, May 21, 2013 at 01:57:34PM +0200, Peter Zijlstra wrote:
On Sun, May 19, 2013 at 12:35:26PM +0300, Michael S. Tsirkin wrote:

--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -198,7 +198,6 @@ void might_fault(void);
 #else
 static inline void might_fault(void)
 {
-	might_sleep();

This removes potential resched points for PREEMPT_VOLUNTARY -- was that intentional?

No, it's a bug. Thanks for pointing this out. OK, so I guess it should be

	might_sleep_if(!in_atomic())

and this means might_fault would have to move from linux/kernel.h to linux/uaccess.h, since in_atomic() is in linux/hardirq.h. Makes sense?

So the only difference between PROVE_LOCKING and not should be the might_lock_read() thing; so how about something like this?

---
 include/linux/kernel.h  |  7 ++-----
 include/linux/uaccess.h | 26 ++++++++++++++++++++++++++
 mm/memory.c             | 14 ++------------
 3 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index e96329c..70812f4 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -194,12 +194,9 @@ extern int _cond_resched(void);
 })

 #ifdef CONFIG_PROVE_LOCKING
-void might_fault(void);
+void might_fault_lockdep(void);
 #else
-static inline void might_fault(void)
-{
-	might_sleep();
-}
+static inline void might_fault_lockdep(void) { }
 #endif

 extern struct atomic_notifier_head panic_notifier_list;
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 5ca0951..50a2cc9 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -38,6 +38,32 @@ static inline void pagefault_enable(void)
 	preempt_check_resched();
 }

+static inline bool __can_fault(void)
+{
+	/*
+	 * Some code (nfs/sunrpc) uses socket ops on kernel memory while
+	 * holding the mmap_sem, this is safe because kernel memory doesn't
+	 * get paged out, therefore we'll never actually fault, and the
+	 * below annotations will generate false positives.
+	 */
+	if (segment_eq(get_fs(), KERNEL_DS))
+		return false;
+
+	if (in_atomic() /* || pagefault_disabled() */)

One question here: I'm guessing you put this comment here for illustrative purposes, implying code that will be enabled in -rt? We don't want it upstream, I think, right?

+		return false;
+
+	return true;
+}
+
+static inline void might_fault(void)
+{
+	if (!__can_fault())
+		return;
+
+	might_sleep();
+	might_fault_lockdep();
+}
+
 #ifndef ARCH_HAS_NOCACHE_UACCESS

 static inline unsigned long __copy_from_user_inatomic_nocache(void *to,
diff --git a/mm/memory.c b/mm/memory.c
index 6dc1882..266610c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4211,19 +4211,9 @@ void print_vma_addr(char *prefix, unsigned long ip)
 }

 #ifdef CONFIG_PROVE_LOCKING
-void might_fault(void)
+void might_fault_lockdep(void)
 {
 	/*
-	 * Some code (nfs/sunrpc) uses socket ops on kernel memory while
-	 * holding the mmap_sem, this is safe because kernel memory doesn't
-	 * get paged out, therefore we'll never actually fault, and the
-	 * below annotations will generate false positives.
-	 */
-	if (segment_eq(get_fs(), KERNEL_DS))
-		return;
-
-	might_sleep();
-	/*
 	 * it would be nicer only to annotate paths which are not under
 	 * pagefault_disable, however that requires a larger audit and
 	 * providing helpers like get_user_atomic.
@@ -4231,7 +4221,7 @@ void might_fault(void)
 	if (!in_atomic() && current->mm)
 		might_lock_read(&current->mm->mmap_sem);
 }
-EXPORT_SYMBOL(might_fault);
+EXPORT_SYMBOL(might_fault_lockdep);
 #endif

 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
Re: [PATCH v2 00/10] uaccess: better might_sleep/might_fault behavior
On Wed, May 22, 2013 at 11:25:36AM +0200, Arnd Bergmann wrote:
On Thursday 16 May 2013, Michael S. Tsirkin wrote:

This improves the might_fault annotations used by uaccess routines:

1. The only reason uaccess routines might sleep is if they fault. Make this explicit for all architectures.
2. Accesses (e.g. through socket ops) to kernel memory with KERNEL_DS, as net/sunrpc does, will never sleep. Remove an unconditional might_sleep in the inline might_fault in kernel.h (used when PROVE_LOCKING is not set).
3. Accesses with pagefault_disable return EFAULT but won't cause the caller to sleep. Check for that and avoid might_sleep when PROVE_LOCKING is set.

I'd like these changes to go in for the benefit of the vhost driver, where we want to call socket ops under a spinlock and fall back on a slower thread handler on error.

Hi Michael,

I have recently stumbled over a related topic, which is the highly inconsistent placement of might_fault() or might_sleep() in certain classes of uaccess functions. Your patches seem completely reasonable, but it would be good to also fix the other problem, at least on the architectures we most care about.

Given the most commonly used functions and a couple of architectures I'm familiar with, these are the ones that currently call might_fault():

                   x86-32  x86-64  arm  arm64  powerpc  s390  generic
  copy_to_user       -       x      -     -       -      x      x
  copy_from_user     -       x      -     -       -      x      x
  put_user           x       x      x     x       x      x      x
  get_user           x       x      x     x       x      x      x
  __copy_to_user     x       x      -     -       x      -      -
  __copy_from_user   x       x      -     -       x      -      -
  __put_user         -       -      x     -       x      -      -
  __get_user         -       -      x     -       x      -      -

WTF?

Yea.

Calling might_fault() for every __get_user/__put_user is rather expensive because it turns what should be a single instruction (plus fixup) into an external function call.

You mean _cond_resched with CONFIG_PREEMPT_VOLUNTARY? Or do you mean when we build with PROVE_LOCKING?

My feeling is that we should do might_fault() only in access_ok() to get the right balance.

	Arnd

Well, access_ok is currently non-blocking, I think; we'd have to audit all callers. There are some 200 of these in drivers and some 1000 total, so... a bit risky.

-- MST
Re: provide vhost thread per virtqueue for forwarding scenario
On 2013/5/20 15:43, Michael S. Tsirkin wrote:
On Mon, May 20, 2013 at 02:11:19AM +0000, Qinchuanyu wrote:

The vhost thread provides both tx and rx for virtio-net. In forwarding scenarios, tx and rx share the vhost thread, and throughput is limited by the single thread. So I did a patch to provide a vhost thread per virtqueue, not per vhost_net device. Of course, multi-queue virtio-net is the final solution, but it requires a new version of virtio-net working in the guest. If you have to work with suse10/11 or redhat 5.x as the guest, and want to improve the forwarding throughput, using a vhost thread per queue seems to be the only solution.

Why is it? If multi-queue works well for you, just update the drivers in the guests that you care about. A guest driver backport is not so hard.

In my testing, performance of a thread per vq varies: some workloads might gain throughput, but you get more IPIs and more scheduling overhead, so you waste more host CPU per byte. As you create more VMs, this stops being a win.

I did the test with kernel 3.0.27 and qemu-1.4.0; the guest is suse11-sp2, and two vhost threads provide double the tx/rx forwarding performance of a single vhost thread. The virtqueue count of vhost_blk is 1, so it still uses one vhost thread without change. Is there something wrong with this solution? If not, I will post the patch later.

Best regards
King

Yes, I don't think we want to create threads even more aggressively in all cases. I'm worried about scalability as it is. I think we should explore a flexible approach: use a thread pool (for example, a wq) to share threads between virtqueues, and switch to a separate thread only if there's a free CPU and existing threads are busy. Hopefully share threads between vhost instances too.

On the Xen platform, the network backend pv driver model has evolved this way. Netbacks from all DomUs share a thread pool, and the thread number equals the CPU core number. Is there any plan for the kvm platform?
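A hedged sketch of the per-virtqueue worker idea being proposed (names loosely follow drivers/vhost/vhost.c; this is not the actual patch):

struct vhost_virtqueue {
        /* ... */
        struct task_struct *worker;     /* per-vq instead of per-device */
};

static int vhost_vq_start_worker(struct vhost_virtqueue *vq, int idx)
{
        /* spawn a dedicated kernel thread that services only this vq */
        vq->worker = kthread_create(vhost_worker, vq, "vhost-vq-%d", idx);
        if (IS_ERR(vq->worker))
                return PTR_ERR(vq->worker);
        wake_up_process(vq->worker);
        return 0;
}

With one worker per vq, the tx and rx rings of a single device no longer serialize on one thread, which is where the doubled forwarding throughput reported above comes from.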
Re: provide vhost thread per virtqueue for forwarding scenario
On 05/22/2013 05:59 PM, Zang Hongyong wrote:
On 2013/5/20 15:43, Michael S. Tsirkin wrote:
On Mon, May 20, 2013 at 02:11:19AM +0000, Qinchuanyu wrote:

The vhost thread provides both tx and rx for virtio-net. In forwarding scenarios, tx and rx share the vhost thread, and throughput is limited by the single thread. So I did a patch to provide a vhost thread per virtqueue, not per vhost_net device. Of course, multi-queue virtio-net is the final solution, but it requires a new version of virtio-net working in the guest. If you have to work with suse10/11 or redhat 5.x as the guest, and want to improve the forwarding throughput, using a vhost thread per queue seems to be the only solution.

Why is it? If multi-queue works well for you, just update the drivers in the guests that you care about. A guest driver backport is not so hard.

In my testing, performance of a thread per vq varies: some workloads might gain throughput, but you get more IPIs and more scheduling overhead, so you waste more host CPU per byte. As you create more VMs, this stops being a win.

I did the test with kernel 3.0.27 and qemu-1.4.0; the guest is suse11-sp2, and two vhost threads provide double the tx/rx forwarding performance of a single vhost thread. The virtqueue count of vhost_blk is 1, so it still uses one vhost thread without change. Is there something wrong with this solution? If not, I will post the patch later.

Best regards
King

Yes, I don't think we want to create threads even more aggressively in all cases. I'm worried about scalability as it is. I think we should explore a flexible approach: use a thread pool (for example, a wq) to share threads between virtqueues, and switch to a separate thread only if there's a free CPU and existing threads are busy. Hopefully share threads between vhost instances too.

On the Xen platform, the network backend pv driver model has evolved this way. Netbacks from all DomUs share a thread pool, and the thread number equals the CPU core number. Is there any plan for the kvm platform?

There used to be two related RFCs for this: one is the multiple vhost workers from Anthony, another is the percpu vhost thread from Shirley. You can search the archives on netdev or kvm for the patches.
Re: provide vhost thread per virtqueue for forwarding scenario
On Wed, May 22, 2013 at 05:59:03PM +0800, Zang Hongyong wrote:
On 2013/5/20 15:43, Michael S. Tsirkin wrote:
On Mon, May 20, 2013 at 02:11:19AM +0000, Qinchuanyu wrote:

The vhost thread provides both tx and rx for virtio-net. In forwarding scenarios, tx and rx share the vhost thread, and throughput is limited by the single thread. So I did a patch to provide a vhost thread per virtqueue, not per vhost_net device. Of course, multi-queue virtio-net is the final solution, but it requires a new version of virtio-net working in the guest. If you have to work with suse10/11 or redhat 5.x as the guest, and want to improve the forwarding throughput, using a vhost thread per queue seems to be the only solution.

Why is it? If multi-queue works well for you, just update the drivers in the guests that you care about. A guest driver backport is not so hard.

In my testing, performance of a thread per vq varies: some workloads might gain throughput, but you get more IPIs and more scheduling overhead, so you waste more host CPU per byte. As you create more VMs, this stops being a win.

I did the test with kernel 3.0.27 and qemu-1.4.0; the guest is suse11-sp2, and two vhost threads provide double the tx/rx forwarding performance of a single vhost thread. The virtqueue count of vhost_blk is 1, so it still uses one vhost thread without change. Is there something wrong with this solution? If not, I will post the patch later.

Best regards
King

Yes, I don't think we want to create threads even more aggressively in all cases. I'm worried about scalability as it is. I think we should explore a flexible approach: use a thread pool (for example, a wq) to share threads between virtqueues, and switch to a separate thread only if there's a free CPU and existing threads are busy. Hopefully share threads between vhost instances too.

On the Xen platform, the network backend pv driver model has evolved this way. Netbacks from all DomUs share a thread pool, and the thread number equals the CPU core number. Is there any plan for the kvm platform?

Shirley Ma had a patchset like this. Look it up: 'NUMA aware scheduling per vhost thread patch'. Unfortunately I don't think we can fix the thread number: if a thread gets blocked because it's accessing swapped-out memory of guest 1, we must allow guest 2 to make progress. But it shouldn't be too hard to fix: detect that a thread is blocked and spawn a new one, or pre-create a per-guest thread and bounce the work there.
[nVMX] A couple more tests
Heya,

So, Gleb suggested (thank you) to try kvm unit tests in L1; here's some (non-comprehensive) detail. I also ran a defconfig build in a 20-iteration loop (in L2), where subsequent runs after the first one are served from the page cache.

1. A couple of unit tests w/ 64 bit compiled:

$ ./configure
$ time make
[...]
gcc -g -D__x86_64__ -I../include/x86 -m64 -O1 -MMD -MF x86/.init.d -g -fomit-frame-pointer -Wall -fno-stack-protector -I. -std=gnu99 -ffreestanding -I lib -I lib/x86 -c -o x86/init.o x86/init.c
x86/init.c: In function ‘main’:
x86/init.c:110:1: warning: control reaches end of non-void function [-Wreturn-type]
 }
 ^
gcc -g -D__x86_64__ -I../include/x86 -m64 -O1 -MMD -MF x86/.ini

real	0m14.358s
user	0m6.990s
sys	0m6.639s

= MSR test =

$ time qemu-system-x86_64 -enable-kvm -device pc-testdev -serial stdio \
  -nographic -no-user-config -nodefaults -device \
  isa-debug-exit,iobase=0xf4,iosize=0x4 -kernel ./x86/msr.flat

enabling apic
MSR_IA32_APICBASE: PASS
MSR_IA32_APICBASE: PASS
IA32_SYSENTER_CS: PASS
MSR_IA32_SYSENTER_ESP: PASS
IA32_SYSENTER_EIP: PASS
MSR_IA32_MISC_ENABLE: PASS
MSR_IA32_CR_PAT: PASS
MSR_FS_BASE: PASS
MSR_GS_BASE: PASS
MSR_KERNEL_GS_BASE: PASS
MSR_EFER: PASS
MSR_LSTAR: PASS
MSR_CSTAR: PASS
MSR_SYSCALL_MASK: PASS
MSR_*STAR eager loading: PASS
15 tests, 0 failures

real	0m0.525s
user	0m0.147s
sys	0m0.121s

= eventinj test =

$ time qemu-system-x86_64 -enable-kvm -device pc-testdev -serial stdio \
  -nographic -no-user-config -nodefaults -device \
  isa-debug-exit,iobase=0xf4,iosize=0x4 -kernel ./x86/eventinj.flat

enabling apic
paging enabled
cr0 = 80010011
cr3 = 7fff000
cr4 = 20
Try to divide by 0
DE isr running divider is 0
Result is 150
DE exception: PASS
Try int 3
BP isr running
After int 3
BP exception: PASS
Try send vec 33 to itself
irq1 running
After vec 33 to itself
vec 33: PASS
Try int $33
irq1 running
After int $33
int $33: PASS
Try send vec 32 and 33 to itself
irq1 running
irq0 running
After vec 32 and 33 to itself
vec 32/33: PASS
Try send vec 32 and int $33
irq1 running
irq0 running
After vec 32 and int $33
vec 32/int $33: PASS
Try send vec 33 and 62 and mask one with TPR
irq1 running
After 33/62 TPR test
TPR: PASS
irq0 running
Try send NMI to itself
After NMI to itself
NMI: FAIL
Try int 33 with shadowed stack
irq1 running
After int 33 with shadowed stack
int 33 with shadowed stack: PASS
summary: 9 tests, 1 failures

real	0m0.586s
user	0m0.159s
sys	0m0.163s

2. A couple of kvm unit tests w/ 32-bit compiled:

$ yum install '*/stubs-32.h'
[...]
$ make ARCH=i386 clean all
[...]
gcc -g -D__i386__ -I /usr/src/kernels/3.10.0-0.rc1.git5.2.fc20.x86_64/include -I../include/x86 -m32 -O1 -MMD -MF x86/.eventinj.d -g -fomit-frame-pointer -Wall -fno-stack-protector -I. -nostdlib -o x86/eventinj.elf -Wl,-T,flat.lds x86/eventinj.o x86/cstart.o lib/libcflat.a /usr/lib/gcc/x86_64-redhat-linux/4.8.0/32/libgcc.a
objcopy -O elf32-i386 x86/eventinj.elf x86/eventinj.flat
[...]

= eventinj test =

$ time qemu-system-x86_64 -enable-kvm -device pc-testdev -serial stdio -nographic \
  -no-user-config -nodefaults -device \
  isa-debug-exit,iobase=0xf4,iosize=0x4 -kernel \
  ./x86/eventinj.flat
[...]
After NMI to itself
NMI: FAIL
Try to divide by 0
PF running
DE isr running divider is 0
Result is 150
DE PF exceptions: PASS
Before NP test
PF running
NP isr running 400b5f err=18
irq1 running
After int33
NP PF exceptions: PASS
Try int 33 with shadowed stack
irq1 running
After int 33 with shadowed stack
int 33 with shadowed stack: PASS
summary: 14 tests, 1 failures

real	0m0.589s
user	0m0.188s
sys	0m0.127s

= MSR test =

$ time qemu-system-x86_64 -enable-kvm -device pc-testdev -serial stdio -nographic \
  -no-user-config -nodefaults -device isa-debug-exit,iobase=0xf4,iosize=0x4 -kernel \
  ./x86/msr.flat

enabling apic
MSR_IA32_APICBASE: PASS
MSR_IA32_APICBASE: PASS
IA32_SYSENTER_CS: PASS
MSR_IA32_SYSENTER_ESP: PASS
IA32_SYSENTER_EIP: PASS
MSR_IA32_MISC_ENABLE: PASS
MSR_IA32_CR_PAT: PASS
MSR_FS_BASE: PASS
MSR_GS_BASE: PASS
MSR_KERNEL_GS_BASE: PASS
MSR_LSTAR: PASS
MSR_CSTAR: PASS
MSR_SYSCALL_MASK: PASS
13 tests, 0 failures

real	0m0.499s
user	0m0.136s
sys	0m0.117s

3. 20 iterations of kernel compile time:

http://kashyapc.fedorapeople.org/virt/Haswell-VMCS-Shadowing-tests/kernel-build-times-20-iterations.txt

Average real (elapsed) time: 9m44.495s.

The result in the above file is obtained after:

$ for i in {1..20}; do make clean; \
  time make -j 3; done 2>&1 | tee \
  defconfig-make-output-20-iterations-l2.txt

in L2 (nested guest). And, then
Re: [PATCH v2 10/10] kernel: might_fault does not imply might_sleep
On Wed, May 22, 2013 at 12:47:09PM +0300, Michael S. Tsirkin wrote:

+static inline bool __can_fault(void)
+{
+	/*
+	 * Some code (nfs/sunrpc) uses socket ops on kernel memory while
+	 * holding the mmap_sem, this is safe because kernel memory doesn't
+	 * get paged out, therefore we'll never actually fault, and the
+	 * below annotations will generate false positives.
+	 */
+	if (segment_eq(get_fs(), KERNEL_DS))
+		return false;
+
+	if (in_atomic() /* || pagefault_disabled() */)

One question here: I'm guessing you put this comment here for illustrative purposes, implying code that will be enabled in -rt? We don't want it upstream, I think, right?

Right, and as a reminder that when we do this we need to add a patch to -rt. But yeah, we should have a look and see if it's worth pulling those patches from -rt into mainline in some way, shape or form. They're big but trivial, IIRC. I'm fine with you leaving that comment out though..
Re: [PATCH v2 00/10] uaccess: better might_sleep/might_fault behavior
On Wed, May 22, 2013 at 11:25:36AM +0200, Arnd Bergmann wrote:

Calling might_fault() for every __get_user/__put_user is rather expensive because it turns what should be a single instruction (plus fixup) into an external function call.

We could hide it all behind CONFIG_DEBUG_ATOMIC_SLEEP, just like might_sleep() is. I'm not sure there's a point to might_fault() when might_sleep() is a NOP.
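Concretely, the gating being suggested might look like this (a sketch/assumption, not a posted patch):

#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
/* debug builds keep the full check and the "called from file:line" report */
# define might_fault()	__might_fault(__FILE__, __LINE__)
#else
/* production builds compile the annotation away entirely */
# define might_fault()	do { } while (0)
#endif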
Re: [PATCH v2 00/10] uaccess: better might_sleep/might_fault behavior
On Wed, May 22, 2013 at 12:19:16PM +0200, Peter Zijlstra wrote:
On Wed, May 22, 2013 at 11:25:36AM +0200, Arnd Bergmann wrote:

Calling might_fault() for every __get_user/__put_user is rather expensive because it turns what should be a single instruction (plus fixup) into an external function call.

We could hide it all behind CONFIG_DEBUG_ATOMIC_SLEEP, just like might_sleep() is. I'm not sure there's a point to might_fault() when might_sleep() is a NOP.

The patch that you posted gets pretty close. E.g. I'm testing this now:

+#define might_fault() do { \
+	if (_might_fault()) \
+		__might_sleep(__FILE__, __LINE__, 0); \
+	might_resched(); \
+} while(0)

So if might_sleep is a NOP, __might_sleep and might_resched are NOPs, so the compiler will optimize this all out. However, in a related thread, you pointed out that might_sleep is not a NOP if CONFIG_PREEMPT_VOLUNTARY is set, even without CONFIG_DEBUG_ATOMIC_SLEEP. Do you think we should drop the preemption point in might_fault? Only in copy_XXX_user? Only in __copy_XXX_user?

-- MST
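One plausible _might_fault() to pair with the macro above (an assumption; the helper's body was not posted in this part of the thread):

bool _might_fault(void)
{
	/* KERNEL_DS accesses never fault; see the nfs/sunrpc note above */
	if (segment_eq(get_fs(), KERNEL_DS))
		return false;

	/* under pagefault_disable() we return -EFAULT instead of sleeping */
	return !in_atomic();
}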
Re: [PATCH v2 00/10] uaccess: better might_sleep/might_fault behavior
On Wed, May 22, 2013 at 02:07:29PM +0300, Michael S. Tsirkin wrote:
On Wed, May 22, 2013 at 12:19:16PM +0200, Peter Zijlstra wrote:
On Wed, May 22, 2013 at 11:25:36AM +0200, Arnd Bergmann wrote:

Calling might_fault() for every __get_user/__put_user is rather expensive because it turns what should be a single instruction (plus fixup) into an external function call.

We could hide it all behind CONFIG_DEBUG_ATOMIC_SLEEP, just like might_sleep() is. I'm not sure there's a point to might_fault() when might_sleep() is a NOP.

The patch that you posted gets pretty close. E.g. I'm testing this now:

+#define might_fault() do { \
+	if (_might_fault()) \
+		__might_sleep(__FILE__, __LINE__, 0); \
+	might_resched(); \
+} while(0)

So if might_sleep is a NOP, __might_sleep and might_resched are NOPs, so the compiler will optimize this all out. However, in a related thread, you pointed out that might_sleep is not a NOP if CONFIG_PREEMPT_VOLUNTARY is set, even without CONFIG_DEBUG_ATOMIC_SLEEP.

Oh crud, yeah.. and you actually need that _might_fault() stuff for that too. Bugger.

Yeah, I wouldn't know what the effects of dropping it (for the copy functions) would be; VOLUNTARY isn't a preemption mode I ever use (even though it seems most distros default to it).
[PATCH 1/1] KVM: add kvm_para_available to asm-generic/kvm_para.h
According to include/uapi/linux/kvm_para.h, architectures should define kvm_para_available, so add an implementation to asm-generic/kvm_para.h which just returns false.

Signed-off-by: James Hogan james.ho...@imgtec.com
Cc: Marcelo Tosatti mtosa...@redhat.com
Cc: Gleb Natapov g...@redhat.com
Cc: Arnd Bergmann a...@arndb.de
---
 include/asm-generic/kvm_para.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/asm-generic/kvm_para.h b/include/asm-generic/kvm_para.h
index 9d96605..fa25bec 100644
--- a/include/asm-generic/kvm_para.h
+++ b/include/asm-generic/kvm_para.h
@@ -18,4 +18,9 @@ static inline unsigned int kvm_arch_para_features(void)
 	return 0;
 }

+static inline bool kvm_para_available(void)
+{
+	return false;
+}
+
 #endif
--
1.8.1.2
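For context, a sketch of the guest-side code this stub lets compile on architectures using asm-generic (the feature bit is chosen for illustration):

#include <linux/kvm_para.h>

static bool want_pv_clock(void)
{
	/* With the asm-generic stub this now builds and returns false. */
	return kvm_para_available() &&
	       (kvm_arch_para_features() & (1 << KVM_FEATURE_CLOCKSOURCE));
}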
Re: [PATCH v4 0/6] mips/kvm: Fix ABI for compatibility with 64-bit guests.
On Tue, May 21, 2013 at 01:54:49PM -0700, David Daney wrote:

From: David Daney david.da...@cavium.com

The initial patch set implementing MIPS KVM does not handle 64-bit guests or use of the FPU. This patch set corrects these ABI issues and does some very minor cleanup.

Sanjay, does this look good to you? What userspace is MIPS using for machine emulation? Are there corresponding patches to the userspace?

Changes from v3:
  Use KVM_SET_ONE_REG instead of KVM_SET_MSRS.
  Added ENOIOCTLCMD patch.

Changes from v2:
  Split into five parts, no code change.

David Daney (6):
  mips/kvm: Fix ABI for use of FPU.
  mips/kvm: Fix ABI for use of 64-bit registers.
  mips/kvm: Fix name of gpr field in struct kvm_regs.
  mips/kvm: Use ARRAY_SIZE() instead of hardcoded constants in kvm_arch_vcpu_ioctl_{s,g}et_regs
  mips/kvm: Fix ABI by moving manipulation of CP0 registers to KVM_{G,S}ET_ONE_REG
  mips/kvm: Use ENOIOCTLCMD to indicate unimplemented ioctls.

 arch/mips/include/asm/kvm.h      | 137 ++++----
 arch/mips/include/asm/kvm_host.h |   4 -
 arch/mips/kvm/kvm_mips.c         | 118 +++---
 arch/mips/kvm/kvm_trap_emul.c    | 338 ++++++++++++++++++++++++-
 4 files changed, 430 insertions(+), 167 deletions(-)

--
1.7.11.7

--
Gleb.
Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
On Wed, May 22, 2013 at 05:41:10PM +0800, Xiao Guangrong wrote:
On 05/22/2013 04:54 PM, Gleb Natapov wrote:
On Wed, May 22, 2013 at 04:46:04PM +0800, Xiao Guangrong wrote:
On 05/22/2013 02:34 PM, Gleb Natapov wrote:
On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote:
On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote:

Any pages with stale information will be zapped by kvm_mmu_zap_all(). When that happens, page faults will take place, and they will automatically use the new generation number. So it is still not clear why this is necessary.

This is not, strictly speaking, necessary, but it is the sane thing to do. You cannot update a page's generation number to prevent it from being destroyed, since after kvm_mmu_zap_all() completes, stale ptes in the shadow page may point to a now-deleted memslot. So why build a shadow page table with a page that is in the process of being destroyed?

OK, can this be introduced separately, in a later patch, with separate justification, then? Xiao, please have the first patches of the patchset focus on the problem at hand: fixing long mmu_lock hold times.

Not sure what you mean again. We flush the TLB once before entering this function. kvm_reload_remote_mmus() does this for us, no?

kvm_reload_remote_mmus() is used as an optimization; it's separate from the problem solution. What was suggested was... go to the phrase which starts with "The only purpose of the generation number should be to".

The comment quoted here does not match that description. The comment describes what the code does, and in this it is correct. You propose to not reload roots right away and to do it only when a root sp is encountered, right? So my question is: what's the point? There are, obviously, root sps with an invalid generation number at this point, so the reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not do it here right away and avoid it in kvm_mmu_prepare_zap_page() for invalid and obsolete sps, as I proposed in one of my emails?

Sure. But Xiao, please introduce that TLB collapsing optimization as a later patch, so we can reason about it in a more organized fashion.

So, if I understand correctly, you are asking to move the is_obsolete_sp() check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() to a separate patch. Fine by me, but if we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages(), the call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will become a nop. But I question the need to zap all shadow page tables there in the first place: why is kvm_flush_remote_tlbs() not enough?

I do not know either... I even do not know why kvm_flush_remote_tlbs is needed. :(

We changed the content of an executable page and need to flush the instruction cache of all vcpus so that they do not use stale data. So my suggestion to call kvm_flush_remote_tlbs() is obviously incorrect, since that flushes the TLB, not the instruction cache; but why would kvm_reload_remote_mmus() flush the instruction cache?

I thought the reason was the icache too, but the icache is automatically flushed on x86; we only need to invalidate prefetched instructions by executing a serializing operation. See the SDM, chapter 8.1.3, "Handling Self- and Cross-Modifying Code".

Right, so we do cross-modifying code here and we need to make sure no vcpu is running in guest mode while this happens, but kvm_mmu_zap_all() does not provide this guarantee, since vcpus will continue running after reloading roots!

kvm_flush_remote_tlbs() is obviously incorrect since it flushes the TLB, not the instruction cache; but why would kvm_reload_remote_mmus() flush the instruction cache?

kvm_reload_remote_mmus() does not help, I think.

I found that this change was introduced by commit 7aa81cc0, and I have added Anthony to the CC.

I also found some discussions related to calling kvm_reload_remote_mmus():

"But if the instruction is architecture dependent, and you run on the wrong architecture, now you have to patch many locations at fault time, introducing some nasty runtime code / data cache overlap performance problems. Granted, they go away eventually. We're addressing that by blowing away the shadow cache and holding the big kvm lock to ensure SMP safety. Not a great thing to do from a performance perspective but the whole point of patching is that the cost is amortized."
(http://kerneltrap.org/mailarchive/linux-kernel/2007/9/14/260288)

But I cannot understand...

Back then, kvm->lock protected memslot access, so code like:

	mutex_lock(&vcpu->kvm->lock);
	kvm_mmu_zap_all(vcpu->kvm);
	mutex_unlock(&vcpu->kvm->lock);

which is what 7aa81cc0 does, was enough to guarantee that no vcpu would run while the code was patched. This is no longer the case; the mutex_lock(&vcpu->kvm->lock) is long gone from that code path, so now the kvm_mmu_zap_all() there is useless and the code is incorrect. Let's drop kvm_mmu_zap_all() there (in a separate patch) and fix the patching properly later.

--
Gleb.
Re: [PATCH v2 00/10] uaccess: better might_sleep/might_fault behavior
On Wed, May 22, 2013 at 11:25:36AM +0200, Arnd Bergmann wrote:

Given the most commonly used functions and a couple of architectures I'm familiar with, these are the ones that currently call might_fault():

                   x86-32  x86-64  arm  arm64  powerpc  s390  generic
  copy_to_user       -       x      -     -       -      x      x
  copy_from_user     -       x      -     -       -      x      x
  put_user           x       x      x     x       x      x      x
  get_user           x       x      x     x       x      x      x
  __copy_to_user     x       x      -     -       x      -      -
  __copy_from_user   x       x      -     -       x      -      -
  __put_user         -       -      x     -       x      -      -
  __get_user         -       -      x     -       x      -      -

WTF?

I think your table is rather screwed - especially on ARM. Tell me: how can __copy_to_user() use might_fault() but copy_to_user() not, when copy_to_user() is implemented using __copy_to_user()? The same for copy_from_user(), but with the reverse argument: there's nothing special in our copy_from_user() which would make it do might_fault() when __copy_from_user() wouldn't.

The correct position for ARM is: our (__)?(pu|ge)t_user all use might_fault(), but (__)?copy_(to|from)_user do not. Neither does (__)?clear_user. We might want to fix those to use might_fault().
Re: [PATCH] kvm: add detail error message when fail to add ioeventfd
On Wed, May 22, 2013 at 11:32:27AM +0200, Stefan Hajnoczi wrote: On Wed, May 22, 2013 at 12:57:35PM +0800, Amos Kong wrote: When I try to hotplug 28 * 8 multiple-function devices into a guest with an old host kernel, the ioeventfds in the host kernel are exhausted and qemu then fails to allocate ioeventfds for blk/nic devices. It's better to add a detailed error here. Signed-off-by: Amos Kong ak...@redhat.com --- kvm-all.c | 4 1 files changed, 4 insertions(+), 0 deletions(-) It would be nice to make the kvm bus scalable so that the hardcoded in-kernel I/O device limit can be lifted. I had increased the kernel's NR_IOBUS_DEVS to 1000 (a limit is still needed for security) last March, and made resizing of the kvm_io_range array dynamic. Reviewed-by: Stefan Hajnoczi stefa...@redhat.com -- Amos. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
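For reference, the failure path being decorated looks roughly like this (a sketch in the style of QEMU's kvm-all.c; the function shape and message are illustrative, not the exact patch):

	static int kvm_set_ioeventfd_mmio_long(int fd, uint32_t addr, uint32_t val,
	                                       bool assign)
	{
	    struct kvm_ioeventfd iofd = {
	        .datamatch = val,
	        .addr      = addr,
	        .len       = 4,
	        .flags     = KVM_IOEVENTFD_FLAG_DATAMATCH,
	        .fd        = fd,
	    };
	    int ret;

	    if (!assign) {
	        iofd.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
	    }

	    ret = kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, &iofd);
	    if (ret < 0) {
	        /* the point of the patch: report *why* it failed, e.g. when
	         * the in-kernel I/O device limit is exhausted (-ENOSPC) */
	        fprintf(stderr, "%s: unable to add ioeventfd: %s\n",
	                __func__, strerror(-ret));
	    }
	    return ret;
	}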
Re: [PATCH v2 07/10] powerpc: uaccess s/might_sleep/might_fault/
On Thursday 16 May 2013, Michael S. Tsirkin wrote: @@ -178,7 +178,7 @@ do { \ long __pu_err; \ __typeof__(*(ptr)) __user *__pu_addr = (ptr); \ if (!is_kernel_addr((unsigned long)__pu_addr)) \ - might_sleep(); \ + might_fault(); \ __chk_user_ptr(ptr);\ __put_user_size((x), __pu_addr, (size), __pu_err); \ __pu_err; \ Another observation: if (!is_kernel_addr((unsigned long)__pu_addr)) might_sleep(); is almost the same as might_fault(); except that it does not call might_lock_read(). The version above may have been put there intentionally and correctly, but if you want to replace it with might_fault(), you should remove the if () condition. Arnd -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
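For context, the generic might_fault() being discussed looked roughly like this at the time (CONFIG_PROVE_LOCKING variant in mm/memory.c; simplified sketch), which is why the powerpc guard above is "almost the same ... except might_lock_read()":

	void might_fault(void)
	{
		if (segment_eq(get_fs(), KERNEL_DS))
			return;	/* kernel memory never faults */

		might_sleep();
		if (current->mm)
			might_lock_read(&current->mm->mmap_sem);
	}

The segment_eq() test plays a role similar to powerpc's is_kernel_addr() check, only keyed on the task's address limit rather than on the pointer value.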
Re: [PATCH v2 00/10] uaccess: better might_sleep/might_fault behavior
On Wednesday 22 May 2013, Russell King - ARM Linux wrote: On Wed, May 22, 2013 at 11:25:36AM +0200, Arnd Bergmann wrote: Given the most commonly used functions and a couple of architectures I'm familiar with, these are the ones that currently call might_fault()

                   x86-32  x86-64  arm  arm64  powerpc  s390  generic
copy_to_user         -       x      -     -       -      x       x
copy_from_user       -       x      -     -       -      x       x
put_user             x       x      x     x       x      x       x
get_user             x       x      x     x       x      x       x
__copy_to_user       x       x      -     -       x      -       -
__copy_from_user     x       x      -     -       x      -       -
__put_user           -       -      x     -       x      -       -
__get_user           -       -      x     -       x      -       -

WTF? I think your table is rather screwed - especially on ARM. Tell me - how can __copy_to_user() use might_fault() but copy_to_user() not, when copy_to_user() is implemented using __copy_to_user()? Same for copy_from_user() but the reverse argument - there's nothing special in our copy_from_user() which would make it do might_fault() when __copy_from_user() wouldn't. I think something went wrong with the formatting of the tabstops in the table. I've tried to correct it above to the same version I see on the mailing list. The correct position for ARM is: our (__)?(pu|ge)t_user all use might_fault(), but (__)?copy_(to|from)_user do not. Neither does (__)?clear_user. We might want to fix those to use might_fault(). Yes, that sounds like a good idea, especially since they are all implemented out-of-line. For __get_user()/__put_user(), I would probably do the reverse and make them not call might_fault() though, like we do on most other architectures: look at the object code produced for setup_sigframe for instance; it calls might_fault() around 25 times where one should really be enough. Using __put_user() instead of put_user() is normally an indication that the author of that function has made performance considerations and moved the (trivial) access_ok() call out, but now we add a more expensive call instead. Arnd -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 07/10] powerpc: uaccess s/might_sleep/might_fault/
On Wed, May 22, 2013 at 03:59:01PM +0200, Arnd Bergmann wrote: On Thursday 16 May 2013, Michael S. Tsirkin wrote: @@ -178,7 +178,7 @@ do { \ long __pu_err; \ __typeof__(*(ptr)) __user *__pu_addr = (ptr); \ if (!is_kernel_addr((unsigned long)__pu_addr)) \ - might_sleep(); \ + might_fault(); \ __chk_user_ptr(ptr);\ __put_user_size((x), __pu_addr, (size), __pu_err); \ __pu_err; \ Another observation: if (!is_kernel_addr((unsigned long)__pu_addr)) might_sleep(); is almost the same as might_fault(); except that it does not call might_lock_read(). The version above may have been put there intentionally and correctly, but if you want to replace it with might_fault(), you should remove the if () condition. Arnd Good point, thanks. I think I'll do it in a separate patch on top, just to make sure bisect does not result in a revision that produces false positive warnings. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 00/10] uaccess: better might_sleep/might_fault behavior
On Wed, May 22, 2013 at 04:04:48PM +0200, Arnd Bergmann wrote: On Wednesday 22 May 2013, Russell King - ARM Linux wrote: On Wed, May 22, 2013 at 11:25:36AM +0200, Arnd Bergmann wrote: Given the most commonly used functions and a couple of architectures I'm familiar with, these are the ones that currently call might_fault()

                   x86-32  x86-64  arm  arm64  powerpc  s390  generic
copy_to_user         -       x      -     -       -      x       x
copy_from_user       -       x      -     -       -      x       x
put_user             x       x      x     x       x      x       x
get_user             x       x      x     x       x      x       x
__copy_to_user       x       x      -     -       x      -       -
__copy_from_user     x       x      -     -       x      -       -
__put_user           -       -      x     -       x      -       -
__get_user           -       -      x     -       x      -       -

WTF? I think your table is rather screwed - especially on ARM. Tell me - how can __copy_to_user() use might_fault() but copy_to_user() not, when copy_to_user() is implemented using __copy_to_user()? Same for copy_from_user() but the reverse argument - there's nothing special in our copy_from_user() which would make it do might_fault() when __copy_from_user() wouldn't. I think something went wrong with the formatting of the tabstops in the table. I've tried to correct it above to the same version I see on the mailing list. The correct position for ARM is: our (__)?(pu|ge)t_user all use might_fault(), but (__)?copy_(to|from)_user do not. Neither does (__)?clear_user. We might want to fix those to use might_fault(). Yes, that sounds like a good idea, especially since they are all implemented out-of-line. For __get_user()/__put_user(), I would probably do the reverse and make them not call might_fault() though, like we do on most other architectures: look at the object code produced for setup_sigframe for instance; it calls might_fault() around 25 times where one should really be enough. Well, it depends on what config options you set. But with VOLUNTARY you are right. Also, look at memcpy_fromiovec and weep. Using __put_user() instead of put_user() is normally an indication that the author of that function has made performance considerations and moved the (trivial) access_ok() call out, but now we add a more expensive call instead. Arnd I think exactly the same rules should apply to __XXX_user and __copy_XXX_user - otherwise it's really confusing. Maybe the preempt point in might_fault should go away? Basically: #define might_fault() __might_sleep(__FILE__, __LINE__, 0) Possibly adding the in_atomic() etc. checks that Peter suggested. Ingo, what do you think? And what testing would be appropriate for such a change? Thanks, -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
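A sketch of what that proposal could look like (this is the idea being floated here, not merged code; the in_atomic() test is Peter's suggested addition):

	/* plain variant: annotation only, no preemption point */
	#define might_fault() __might_sleep(__FILE__, __LINE__, 0)

	/* CONFIG_PROVE_LOCKING variant */
	void might_fault(void)
	{
		if (in_atomic())	/* e.g. under pagefault_disable() */
			return;

		__might_sleep(__FILE__, __LINE__, 0);
		if (current->mm)
			might_lock_read(&current->mm->mmap_sem);
	}

Compared with might_sleep(), __might_sleep() alone does not call might_resched(), which is exactly the preempt point in question.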
Re: [RFC PATCH v2 1/2] add support for Hyper-V reference time counter
On Wed, May 22, 2013 at 06:38:47AM +0300, Gleb Natapov wrote: On Wed, May 22, 2013 at 12:32:57AM -0300, Marcelo Tosatti wrote: On Wed, May 22, 2013 at 06:28:47AM +0300, Gleb Natapov wrote:

	+	case HV_X64_MSR_TIME_REF_COUNT:
		r = true;
		break;
	}
	@@ -1827,6 +1829,29 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data)
		if (__copy_to_user((void __user *)addr, instructions, 4))
			return 1;
		kvm->arch.hv_hypercall = data;
	+	local_irq_disable();
	+	kvm->arch.hv_ref_count = get_kernel_ns();
	+	local_irq_enable();
	+	break;

local_irq_disable/local_irq_enable not needed. get_kernel_ns has WARN_ON(preemptible()), so at least preempt_disable() is needed, but all callers of the function disable interrupts. OK. It's not necessary to disable interrupts (this is a leftover of kvm_guest_update_time). So preempt_disable() then? Should we change other callers? Yes. We should only disable interrupts where necessary (the case in kvm_write_guest_time is necessary). -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
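So the hunk would end up as something like this (a sketch of the conclusion reached here; hv_ref_count comes from the RFC patch itself):

	+	preempt_disable();
	+	kvm->arch.hv_ref_count = get_kernel_ns();
	+	preempt_enable();

preempt_disable() is enough to satisfy get_kernel_ns()'s WARN_ON(preemptible()) and to read a consistent clock; disabling interrupts buys nothing extra here.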
Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
On Wed, May 22, 2013 at 09:34:13AM +0300, Gleb Natapov wrote: On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote: On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote: Any pages with stale information will be zapped by kvm_mmu_zap_all(). When that happens, page faults will take place which will automatically use the new generation number. So still not clear why is this necessary. This is not, strictly speaking, necessary, but it is the sane thing to do. You cannot update page's generation number to prevent it from been destroyed since after kvm_mmu_zap_all() completes stale ptes in the shadow page may point to now deleted memslot. So why build shadow page table with a page that is in a process of been destroyed? OK, can this be introduced separately, in a later patch, with separate justification, then? Xiao please have the first patches of the patchset focus on the problem at hand: fix long mmu_lock hold times. Not sure what you mean again. We flush TLB once before entering this function. kvm_reload_remote_mmus() does this for us, no? kvm_reload_remote_mmus() is used as an optimization, its separate from the problem solution. What was suggested was... go to phrase which starts with The only purpose of the generation number should be to. The comment quoted here does not match that description. The comment describes what code does and in this it is correct. You propose to not reload roots right away and do it only when root sp is encountered, right? So my question is what's the point? There are, obviously, root sps with invalid generation number at this point, so reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not do it here right away and avoid it in kvm_mmu_prepare_zap_page() for invalid and obsolete sps as I proposed in one of my email? Sure. But Xiao please introduce that TLB collapsing optimization as a later patch, so we can reason about it in a more organized fashion. So, if I understand correctly, you are asking to move is_obsolete_sp() check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() to a separate patch. - Move the is_obsolete_sp() check from kvm_mmu_get_page() to a separate optimization patch. - Move kvm_reload_remote_mmus() optimization to a separate optimization patch. - Leave the is_obsolete_sp() check in kvm_mmu_invalidate_all_pages(), as it is necessary to guarantee forward progress, in the initial patches. Fine by me, but if we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() the call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will become nop. But I question the need to zap all shadow pages tables there in the first place, why kvm_flush_remote_tlbs() is not enough? Good question. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
On 05/22/2013 09:17 PM, Gleb Natapov wrote: On Wed, May 22, 2013 at 05:41:10PM +0800, Xiao Guangrong wrote: On 05/22/2013 04:54 PM, Gleb Natapov wrote: On Wed, May 22, 2013 at 04:46:04PM +0800, Xiao Guangrong wrote: On 05/22/2013 02:34 PM, Gleb Natapov wrote: On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote: On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote: Any pages with stale information will be zapped by kvm_mmu_zap_all(). When that happens, page faults will take place which will automatically use the new generation number. So still not clear why is this necessary. This is not, strictly speaking, necessary, but it is the sane thing to do. You cannot update page's generation number to prevent it from been destroyed since after kvm_mmu_zap_all() completes stale ptes in the shadow page may point to now deleted memslot. So why build shadow page table with a page that is in a process of been destroyed? OK, can this be introduced separately, in a later patch, with separate justification, then? Xiao please have the first patches of the patchset focus on the problem at hand: fix long mmu_lock hold times. Not sure what you mean again. We flush TLB once before entering this function. kvm_reload_remote_mmus() does this for us, no? kvm_reload_remote_mmus() is used as an optimization, its separate from the problem solution. What was suggested was... go to phrase which starts with The only purpose of the generation number should be to. The comment quoted here does not match that description. The comment describes what code does and in this it is correct. You propose to not reload roots right away and do it only when root sp is encountered, right? So my question is what's the point? There are, obviously, root sps with invalid generation number at this point, so reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not do it here right away and avoid it in kvm_mmu_prepare_zap_page() for invalid and obsolete sps as I proposed in one of my email? Sure. But Xiao please introduce that TLB collapsing optimization as a later patch, so we can reason about it in a more organized fashion. So, if I understand correctly, you are asking to move is_obsolete_sp() check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() to a separate patch. Fine by me, but if we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() the call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will become nop. But I question the need to zap all shadow pages tables there in the first place, why kvm_flush_remote_tlbs() is not enough? I do not know too... I even do no know why kvm_flush_remote_tlbs is needed. :( We changed the content of an executable page, we need to flush instruction cache of all vcpus to not use stale data, so my suggestion to call I thought the reason is about icache too but icache is automatically flushed on x86, we only need to invalidate the prefetched instructions by executing a serializing operation. See the SDM in the chapter of 8.1.3 Handling Self- and Cross-Modifying Code Right, so we do cross-modifying code here and we need to make sure no vcpu is running in a guest mode while this happens, but kvm_mmu_zap_all() does not provide this guaranty since vcpus will continue running after reloading roots! May be we can introduce a function to atomic write gpa, then the guest either 1) see the old value, in that case, it can be intercepted or 2) see the the new value in that case, it can continue to execute. 
kvm_flush_remote_tlbs() is obviously incorrect since this flushes tlb, not instruction cache, but why kvm_reload_remote_mmus() would flush instruction cache? kvm_reload_remote_mmus do not have any help i think. I find that this change is introduced by commit: 7aa81cc0 and I have added Anthony in the CC. I also find some discussions related to calling kvm_reload_remote_mmus(): But if the instruction is architecture dependent, and you run on the wrong architecture, now you have to patch many locations at fault time, introducing some nasty runtime code / data cache overlap performance problems. Granted, they go away eventually. We're addressing that by blowing away the shadow cache and holding the big kvm lock to ensure SMP safety. Not a great thing to do from a performance perspective but the whole point of patching is that the cost is amortized. (http://kerneltrap.org/mailarchive/linux-kernel/2007/9/14/260288) But i can not understand... Back then kvm-lock protected memslot access so code like: mutex_lock(vcpu-kvm-lock); kvm_mmu_zap_all(vcpu-kvm); mutex_unlock(vcpu-kvm-lock); which is what 7aa81cc0 does was enough to guaranty that no vcpu will run while code is patched. So, at that time, kvm-lock is also held when #PF is being fixed? This is no longer the case and mutex_lock(vcpu-kvm-lock); is gone
Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
On Wed, May 22, 2013 at 11:25:13PM +0800, Xiao Guangrong wrote: On 05/22/2013 09:17 PM, Gleb Natapov wrote: On Wed, May 22, 2013 at 05:41:10PM +0800, Xiao Guangrong wrote: On 05/22/2013 04:54 PM, Gleb Natapov wrote: On Wed, May 22, 2013 at 04:46:04PM +0800, Xiao Guangrong wrote: On 05/22/2013 02:34 PM, Gleb Natapov wrote: On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote: On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote: Any pages with stale information will be zapped by kvm_mmu_zap_all(). When that happens, page faults will take place which will automatically use the new generation number. So still not clear why is this necessary. This is not, strictly speaking, necessary, but it is the sane thing to do. You cannot update page's generation number to prevent it from been destroyed since after kvm_mmu_zap_all() completes stale ptes in the shadow page may point to now deleted memslot. So why build shadow page table with a page that is in a process of been destroyed? OK, can this be introduced separately, in a later patch, with separate justification, then? Xiao please have the first patches of the patchset focus on the problem at hand: fix long mmu_lock hold times. Not sure what you mean again. We flush TLB once before entering this function. kvm_reload_remote_mmus() does this for us, no? kvm_reload_remote_mmus() is used as an optimization, its separate from the problem solution. What was suggested was... go to phrase which starts with The only purpose of the generation number should be to. The comment quoted here does not match that description. The comment describes what code does and in this it is correct. You propose to not reload roots right away and do it only when root sp is encountered, right? So my question is what's the point? There are, obviously, root sps with invalid generation number at this point, so reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not do it here right away and avoid it in kvm_mmu_prepare_zap_page() for invalid and obsolete sps as I proposed in one of my email? Sure. But Xiao please introduce that TLB collapsing optimization as a later patch, so we can reason about it in a more organized fashion. So, if I understand correctly, you are asking to move is_obsolete_sp() check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() to a separate patch. Fine by me, but if we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() the call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will become nop. But I question the need to zap all shadow pages tables there in the first place, why kvm_flush_remote_tlbs() is not enough? I do not know too... I even do no know why kvm_flush_remote_tlbs is needed. :( We changed the content of an executable page, we need to flush instruction cache of all vcpus to not use stale data, so my suggestion to call I thought the reason is about icache too but icache is automatically flushed on x86, we only need to invalidate the prefetched instructions by executing a serializing operation. See the SDM in the chapter of 8.1.3 Handling Self- and Cross-Modifying Code Right, so we do cross-modifying code here and we need to make sure no vcpu is running in a guest mode while this happens, but kvm_mmu_zap_all() does not provide this guaranty since vcpus will continue running after reloading roots! 
May be we can introduce a function to atomic write gpa, then the guest either 1) see the old value, in that case, it can be intercepted or 2) see the the new value in that case, it can continue to execute. SDM says atomic write is not enough. All vcpu should be guarantied to not execute code in the vicinity of modified code. This is easy to achieve though: vcpu0: lock(x); make_all_cpus_request(EXIT); unlock(x); vcpuX: if (kvm_check_request(EXIT)) { lock(x); unlock(x); } kvm_flush_remote_tlbs() is obviously incorrect since this flushes tlb, not instruction cache, but why kvm_reload_remote_mmus() would flush instruction cache? kvm_reload_remote_mmus do not have any help i think. I find that this change is introduced by commit: 7aa81cc0 and I have added Anthony in the CC. I also find some discussions related to calling kvm_reload_remote_mmus(): But if the instruction is architecture dependent, and you run on the wrong architecture, now you have to patch many locations at fault time, introducing some nasty runtime code / data cache overlap performance problems. Granted, they go away eventually. We're addressing that by blowing away the shadow cache and holding the big kvm lock to ensure SMP safety. Not a great thing to do from a performance perspective but the whole point of patching is that the cost is amortized.
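Spelled out slightly (a sketch only: patch_lock and the request name are made up for illustration, and make_all_cpus_request() is assumed to kick all vcpus out of guest mode and wait for them, as it does today):

	/* patching side */
	mutex_lock(&kvm->patch_lock);
	make_all_cpus_request(kvm, KVM_REQ_GUEST_EXIT);
	/* every vcpu is now outside guest mode; patch the instruction */
	mutex_unlock(&kvm->patch_lock);

	/* vcpu side, before re-entering the guest */
	if (kvm_check_request(KVM_REQ_GUEST_EXIT, vcpu)) {
		mutex_lock(&vcpu->kvm->patch_lock);	/* waits for the patcher */
		mutex_unlock(&vcpu->kvm->patch_lock);
	}

The lock/unlock pair on the vcpu side is only a barrier: it parks any vcpu that saw the request until patching is complete.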
Re: [PATCH v4 0/6] mips/kvm: Fix ABI for compatibility with 64-bit guests.
On May 22, 2013, at 5:54 AM, Gleb Natapov wrote: On Tue, May 21, 2013 at 01:54:49PM -0700, David Daney wrote: From: David Daney david.da...@cavium.com The initial patch set implementing MIPS KVM does not handle 64-bit guests or use of the FPU. This patch set corrects these ABI issues, and does some very minor clean up. Sanjay, does this look good to you? What userspace is MIPS using for machine emulation? Are there corresponding patches for the userspace? Gleb, I'll post some comments on the patches later in the day. We use QEMU for the machine emulation. I am in the process of integrating with the new ABI, and will post the QEMU patches shortly. Regards Sanjay -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 5/6] mips/kvm: Fix ABI by moving manipulation of CP0 registers to KVM_{G,S}ET_ONE_REG
On May 21, 2013, at 1:54 PM, David Daney wrote: From: David Daney david.da...@cavium.com Because not all 256 CP0 registers are ever implemented, we need a different method of manipulating them. Use the KVM_SET_ONE_REG/KVM_GET_ONE_REG mechanism. Code related to implementing KVM_SET_ONE_REG/KVM_GET_ONE_REG is consolidated in to kvm_trap_emul.c, now unused code and definitions are removed. Signed-off-by: David Daney david.da...@cavium.com --- arch/mips/include/asm/kvm.h | 91 +-- arch/mips/include/asm/kvm_host.h | 4 - arch/mips/kvm/kvm_mips.c | 90 +-- arch/mips/kvm/kvm_trap_emul.c| 338 ++- 4 files changed, 383 insertions(+), 140 deletions(-) diff --git a/arch/mips/include/asm/kvm.h b/arch/mips/include/asm/kvm.h index d145ead..3f424f5 100644 --- a/arch/mips/include/asm/kvm.h +++ b/arch/mips/include/asm/kvm.h @@ -13,10 +13,11 @@ #include linux/types.h -#define __KVM_MIPS - -#define N_MIPS_COPROC_REGS 32 -#define N_MIPS_COPROC_SEL8 +/* + * KVM MIPS specific structures and definitions. + * + * Some parts derived from the x86 version of this file. + */ /* * for KVM_GET_REGS and KVM_SET_REGS @@ -31,12 +32,6 @@ struct kvm_regs { __u64 hi; __u64 lo; __u64 pc; - - __u32 cp0reg[N_MIPS_COPROC_REGS][N_MIPS_COPROC_SEL]; -}; - -/* for KVM_GET_SREGS and KVM_SET_SREGS */ -struct kvm_sregs { }; /* @@ -55,21 +50,89 @@ struct kvm_fpu { __u32 pad; }; + +/* + * For MIPS, we use KVM_SET_ONE_REG and KVM_GET_ONE_REG to access CP0 + * registers. The id field is broken down as follows: + * + * bits[2..0] - Register 'sel' index. + * bits[7..3] - Register 'rd' index. + * bits[15..8] - Must be zero. + * bits[63..16] - 1 - CP0 registers. + * + * Other sets registers may be added in the future. Each set would + * have its own identifier in bits[63..16]. + * + * The addr field of struct kvm_one_reg must point to an aligned + * 64-bit wide location. For registers that are narrower than + * 64-bits, the value is stored in the low order bits of the location, + * and sign extended to 64-bits. + * + * The registers defined in struct kvm_regs are also accessible, the + * id values for these are below. 
+ */ + +#define KVM_REG_MIPS_R0 0 +#define KVM_REG_MIPS_R1 1 +#define KVM_REG_MIPS_R2 2 +#define KVM_REG_MIPS_R3 3 +#define KVM_REG_MIPS_R4 4 +#define KVM_REG_MIPS_R5 5 +#define KVM_REG_MIPS_R6 6 +#define KVM_REG_MIPS_R7 7 +#define KVM_REG_MIPS_R8 8 +#define KVM_REG_MIPS_R9 9 +#define KVM_REG_MIPS_R10 10 +#define KVM_REG_MIPS_R11 11 +#define KVM_REG_MIPS_R12 12 +#define KVM_REG_MIPS_R13 13 +#define KVM_REG_MIPS_R14 14 +#define KVM_REG_MIPS_R15 15 +#define KVM_REG_MIPS_R16 16 +#define KVM_REG_MIPS_R17 17 +#define KVM_REG_MIPS_R18 18 +#define KVM_REG_MIPS_R19 19 +#define KVM_REG_MIPS_R20 20 +#define KVM_REG_MIPS_R21 21 +#define KVM_REG_MIPS_R22 22 +#define KVM_REG_MIPS_R23 23 +#define KVM_REG_MIPS_R24 24 +#define KVM_REG_MIPS_R25 25 +#define KVM_REG_MIPS_R26 26 +#define KVM_REG_MIPS_R27 27 +#define KVM_REG_MIPS_R28 28 +#define KVM_REG_MIPS_R29 29 +#define KVM_REG_MIPS_R30 30 +#define KVM_REG_MIPS_R31 31 + +#define KVM_REG_MIPS_HI 32 +#define KVM_REG_MIPS_LO 33 +#define KVM_REG_MIPS_PC 34 + +/* + * KVM MIPS specific structures and definitions + * + */ struct kvm_debug_exit_arch { + __u64 epc; }; /* for KVM_SET_GUEST_DEBUG */ struct kvm_guest_debug_arch { }; +/* definition of registers in kvm_run */ +struct kvm_sync_regs { +}; + +/* dummy definition */ +struct kvm_sregs { +}; + struct kvm_mips_interrupt { /* in */ __u32 cpu; __u32 irq; }; -/* definition of registers in kvm_run */ -struct kvm_sync_regs { -}; - #endif /* __LINUX_KVM_MIPS_H */ diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h index 143875c..4d6fa0b 100644 --- a/arch/mips/include/asm/kvm_host.h +++ b/arch/mips/include/asm/kvm_host.h @@ -496,10 +496,6 @@ struct kvm_mips_callbacks { uint32_t cause); int (*irq_clear) (struct kvm_vcpu *vcpu, unsigned int priority, uint32_t cause); - int (*vcpu_ioctl_get_regs) (struct kvm_vcpu *vcpu, - struct kvm_regs *regs); - int (*vcpu_ioctl_set_regs) (struct kvm_vcpu *vcpu, - struct kvm_regs *regs); }; extern struct kvm_mips_callbacks *kvm_mips_callbacks; int kvm_mips_emulation_init(struct kvm_mips_callbacks **install_callbacks); diff --git a/arch/mips/kvm/kvm_mips.c b/arch/mips/kvm/kvm_mips.c index 71a1fc1..bc879bd 100644 --- a/arch/mips/kvm/kvm_mips.c +++ b/arch/mips/kvm/kvm_mips.c @@ -51,16 +51,6 @@ struct kvm_stats_debugfs_item debugfs_entries[] = { {NULL} }; -static int kvm_mips_reset_vcpu(struct kvm_vcpu *vcpu) -{ - int i; - for_each_possible_cpu(i) {
Re: [PATCH v4 5/6] mips/kvm: Fix ABI by moving manipulation of CP0 registers to KVM_{G,S}ET_ONE_REG
On 05/22/2013 10:44 AM, Sanjay Lal wrote: On May 21, 2013, at 1:54 PM, David Daney wrote: From: David Daney david.da...@cavium.com Because not all 256 CP0 registers are ever implemented, we need a different method of manipulating them. Use the KVM_SET_ONE_REG/KVM_GET_ONE_REG mechanism. Code related to implementing KVM_SET_ONE_REG/KVM_GET_ONE_REG is consolidated in to kvm_trap_emul.c, now unused code and definitions are removed. Signed-off-by: David Daney david.da...@cavium.com --- arch/mips/include/asm/kvm.h | 91 +-- arch/mips/include/asm/kvm_host.h | 4 - arch/mips/kvm/kvm_mips.c | 90 +-- arch/mips/kvm/kvm_trap_emul.c| 338 ++- 4 files changed, 383 insertions(+), 140 deletions(-) [...] Most of the functions that have been relocated to kvm_trap_emul.c should stay in kvm_mips.c. They are/will shared between the trap and emulate and VZ modes. They include kvm_mips_reset_vcpu(), kvm_vcpu_ioctl_interrupt(), kvm_arch_vcpu_ioctl(). kvm_mips_get_reg() and kvm_mips_set_reg() should be in kvm_mips.c as they will be shared by the trap and emulate and VZ code. OK, I will revise the patch set to rearrange things in a manner that leaves these in kvm_mips.c. However, this is of secondary importance to the question of the suitability of the ABI. If you plan on defining specific versions of these functions for Cavium's implementation of KVM, please make them callbacks. There will soon be follow on patches that do exactly that. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
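If the VZ implementation does end up needing its own register accessors, the natural shape would be per-implementation callbacks, e.g. (a sketch; the names are illustrative, not from a posted patch):

	/* additions to struct kvm_mips_callbacks (sketch) */
	int (*get_one_reg)(struct kvm_vcpu *vcpu,
			   const struct kvm_one_reg *reg, s64 *v);
	int (*set_one_reg)(struct kvm_vcpu *vcpu,
			   const struct kvm_one_reg *reg, s64 v);

kvm_mips.c would then handle the common register ids itself and fall back to the callback for implementation-specific ones.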
Re: [PATCH -next] ia64, metag: Do not export min_low_pfn in arch-specific code
On Sun, Apr 28, 2013 at 11:24:35AM +0100, James Hogan wrote: On 27 April 2013 19:20, Geert Uytterhoeven ge...@linux-m68k.org wrote: As of commit 787dcbe6984b3638e94f60d807dcb51bb8a07211 (MIPS: Export symbols used by KVM/MIPS module), min_low_pfn is already exported by the generic mm/bootmem.c, causing: WARNING: vmlinux: 'min_low_pfn' exported twice. Previous export was in vmlinux Signed-off-by: Geert Uytterhoeven ge...@linux-m68k.org This has been pointed out several times and needs fixing in the mips tree where the warning was introduced. I've changed the MIPS code to no longer require the export of min_low_pfn and dropped the export of that particular file. I still see that IA-64 and metag export min_low_pfn so maybe it should be exported from mm/ after all? Ralf -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
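The generic alternative amounts to a one-line addition next to the existing definition (sketch):

	/* mm/bootmem.c */
	unsigned long min_low_pfn;
	EXPORT_SYMBOL(min_low_pfn);

with the corresponding EXPORT_SYMBOL()s then dropped from the ia64 and metag arch code so the "exported twice" warning cannot recur.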
[PATCH v5 0/6] mips/kvm: Fix ABI for compatibility with 64-bit guests.
From: David Daney david.da...@cavium.com The initial patch set implementing MIPS KVM does not handle 64-bit guests or use of the FPU. This patch set corrects these ABI issues, and does some very minor clean up. Changes from v4: No code change, just keep more of the code in kvm_mips.c rather than kvm_trap_emul.c Changes from v3: Use KVM_SET_ONE_REG instead of KVM_SET_MSRS. Added ENOIOCTLCMD patch. Changes from v2: Split into five parts, no code change. David Daney (6): mips/kvm: Fix ABI for use of FPU. mips/kvm: Fix ABI for use of 64-bit registers. mips/kvm: Fix name of gpr field in struct kvm_regs. mips/kvm: Use ARRAY_SIZE() instead of hardcoded constants in kvm_arch_vcpu_ioctl_{s,g}et_regs mips/kvm: Fix ABI by moving manipulation of CP0 registers to KVM_{G,S}ET_ONE_REG mips/kvm: Use ENOIOCTLCMD to indicate unimplemented ioctls. arch/mips/include/asm/kvm.h | 137 +++ arch/mips/include/asm/kvm_host.h | 4 - arch/mips/kvm/kvm_mips.c | 278 --- arch/mips/kvm/kvm_trap_emul.c| 50 --- 4 files changed, 367 insertions(+), 102 deletions(-) -- 1.7.11.7 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 2/6] mips/kvm: Fix ABI for use of 64-bit registers.
From: David Daney david.da...@cavium.com All registers are 64-bits wide, 32-bit guests use the least significant portion of the register storage fields. Signed-off-by: David Daney david.da...@cavium.com --- arch/mips/include/asm/kvm.h | 16 +++- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/arch/mips/include/asm/kvm.h b/arch/mips/include/asm/kvm.h index 0e8f565..86812fb 100644 --- a/arch/mips/include/asm/kvm.h +++ b/arch/mips/include/asm/kvm.h @@ -18,12 +18,18 @@ #define N_MIPS_COPROC_REGS 32 #define N_MIPS_COPROC_SEL 8 -/* for KVM_GET_REGS and KVM_SET_REGS */ +/* + * for KVM_GET_REGS and KVM_SET_REGS + * + * If Config[AT] is zero (32-bit CPU), the register contents are + * stored in the lower 32-bits of the struct kvm_regs fields and sign + * extended to 64-bits. + */ struct kvm_regs { - __u32 gprs[32]; - __u32 hi; - __u32 lo; - __u32 pc; + __u64 gprs[32]; + __u64 hi; + __u64 lo; + __u64 pc; __u32 cp0reg[N_MIPS_COPROC_REGS][N_MIPS_COPROC_SEL]; }; -- 1.7.11.7 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
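To illustrate the sign-extension rule (not part of the patch itself): a 32-bit guest's register value would be presented in the 64-bit field as

	regs->gpr[i] = (__s64)(__s32)val32;
	/* e.g. 0x80000000 becomes 0xffffffff80000000 */

so 32-bit userspace sees the value it expects in the low half, and 64-bit userspace sees a canonically sign-extended MIPS register.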
[PATCH v5 1/6] mips/kvm: Fix ABI for use of FPU.
From: David Daney david.da...@cavium.com Define a non-empty struct kvm_fpu. Signed-off-by: David Daney david.da...@cavium.com --- arch/mips/include/asm/kvm.h | 29 + 1 file changed, 21 insertions(+), 8 deletions(-) diff --git a/arch/mips/include/asm/kvm.h b/arch/mips/include/asm/kvm.h index 85789ea..0e8f565 100644 --- a/arch/mips/include/asm/kvm.h +++ b/arch/mips/include/asm/kvm.h @@ -1,11 +1,12 @@ /* -* This file is subject to the terms and conditions of the GNU General Public -* License. See the file COPYING in the main directory of this archive -* for more details. -* -* Copyright (C) 2012 MIPS Technologies, Inc. All rights reserved. -* Authors: Sanjay Lal sanj...@kymasys.com -*/ + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of this archive + * for more details. + * + * Copyright (C) 2012 MIPS Technologies, Inc. All rights reserved. + * Copyright (C) 2013 Cavium, Inc. + * Authors: Sanjay Lal sanj...@kymasys.com + */ #ifndef __LINUX_KVM_MIPS_H #define __LINUX_KVM_MIPS_H @@ -31,8 +32,20 @@ struct kvm_regs { struct kvm_sregs { }; -/* for KVM_GET_FPU and KVM_SET_FPU */ +/* + * for KVM_GET_FPU and KVM_SET_FPU + * + * If Status[FR] is zero (32-bit FPU), the upper 32-bits of the FPRs + * are zero filled. + */ struct kvm_fpu { + __u64 fpr[32]; + __u32 fir; + __u32 fccr; + __u32 fexr; + __u32 fenr; + __u32 fcsr; + __u32 pad; }; struct kvm_debug_exit_arch { -- 1.7.11.7 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
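From userspace the new struct is used with the existing vcpu ioctls, e.g. (sketch):

	struct kvm_fpu fpu;

	if (ioctl(vcpu_fd, KVM_GET_FPU, &fpu) < 0)
		perror("KVM_GET_FPU");
	/* with Status[FR] == 0, only the low 32 bits of fpu.fpr[i] are valid */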
[PATCH v5 6/6] mips/kvm: Use ENOIOCTLCMD to indicate unimplemented ioctls.
From: David Daney david.da...@cavium.com The Linux Way is to return -ENOIOCTLCMD to the vfs when an unimplemented ioctl is requested. Do this in kvm_mips instead of a random mixture of -ENOTSUPP and -EINVAL. Signed-off-by: David Daney david.da...@cavium.com --- arch/mips/kvm/kvm_mips.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/arch/mips/kvm/kvm_mips.c b/arch/mips/kvm/kvm_mips.c index 5aec4f4..13e9438 100644 --- a/arch/mips/kvm/kvm_mips.c +++ b/arch/mips/kvm/kvm_mips.c @@ -195,7 +195,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm) long kvm_arch_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) { - return -EINVAL; + return -ENOIOCTLCMD; } void kvm_arch_free_memslot(struct kvm_memory_slot *free, @@ -401,7 +401,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu, struct kvm_guest_debug *dbg) { - return -EINVAL; + return -ENOIOCTLCMD; } int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) @@ -475,14 +475,14 @@ int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu, struct kvm_mp_state *mp_state) { - return -EINVAL; + return -ENOIOCTLCMD; } int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu, struct kvm_mp_state *mp_state) { - return -EINVAL; + return -ENOIOCTLCMD; } #define KVM_REG_MIPS_CP0_INDEX (0x1 + 8 * 0 + 0) @@ -799,7 +799,7 @@ long kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) switch (ioctl) { default: - r = -EINVAL; + r = -ENOIOCTLCMD; } return r; @@ -827,13 +827,13 @@ void kvm_arch_exit(void) int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) { - return -ENOTSUPP; + return -ENOIOCTLCMD; } int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) { - return -ENOTSUPP; + return -ENOIOCTLCMD; } int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu) @@ -843,12 +843,12 @@ int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu) int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu) { - return -ENOTSUPP; + return -ENOIOCTLCMD; } int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu) { - return -ENOTSUPP; + return -ENOIOCTLCMD; } int kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf) -- 1.7.11.7 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 4/6] mips/kvm: Use ARRAY_SIZE() instead of hardcoded constants in kvm_arch_vcpu_ioctl_{s,g}et_regs
From: David Daney david.da...@cavium.com

Also, we cannot set the special zero register, so force it to zero.

Signed-off-by: David Daney david.da...@cavium.com
---
 arch/mips/kvm/kvm_mips.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/mips/kvm/kvm_mips.c b/arch/mips/kvm/kvm_mips.c
index 93da750..71a1fc1 100644
--- a/arch/mips/kvm/kvm_mips.c
+++ b/arch/mips/kvm/kvm_mips.c
@@ -677,9 +677,9 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 {
 	int i;
 
-	for (i = 0; i < 32; i++)
+	for (i = 1; i < ARRAY_SIZE(vcpu->arch.gprs); i++)
 		vcpu->arch.gprs[i] = regs->gpr[i];
-
+	vcpu->arch.gprs[0] = 0; /* zero is special, and cannot be set. */
 	vcpu->arch.hi = regs->hi;
 	vcpu->arch.lo = regs->lo;
 	vcpu->arch.pc = regs->pc;
@@ -691,7 +691,7 @@ int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 {
 	int i;
 
-	for (i = 0; i < 32; i++)
+	for (i = 0; i < ARRAY_SIZE(vcpu->arch.gprs); i++)
 		regs->gpr[i] = vcpu->arch.gprs[i];
 
 	regs->hi = vcpu->arch.hi;
-- 
1.7.11.7
-- 
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 5/6] mips/kvm: Fix ABI by moving manipulation of CP0 registers to KVM_{G,S}ET_ONE_REG
From: David Daney david.da...@cavium.com Because not all 256 CP0 registers are ever implemented, we need a different method of manipulating them. Use the KVM_SET_ONE_REG/KVM_GET_ONE_REG mechanism. Now unused code and definitions are removed. Signed-off-by: David Daney david.da...@cavium.com --- arch/mips/include/asm/kvm.h | 91 +++--- arch/mips/include/asm/kvm_host.h | 4 - arch/mips/kvm/kvm_mips.c | 250 +-- arch/mips/kvm/kvm_trap_emul.c| 50 4 files changed, 320 insertions(+), 75 deletions(-) diff --git a/arch/mips/include/asm/kvm.h b/arch/mips/include/asm/kvm.h index d145ead..3f424f5 100644 --- a/arch/mips/include/asm/kvm.h +++ b/arch/mips/include/asm/kvm.h @@ -13,10 +13,11 @@ #include linux/types.h -#define __KVM_MIPS - -#define N_MIPS_COPROC_REGS 32 -#define N_MIPS_COPROC_SEL 8 +/* + * KVM MIPS specific structures and definitions. + * + * Some parts derived from the x86 version of this file. + */ /* * for KVM_GET_REGS and KVM_SET_REGS @@ -31,12 +32,6 @@ struct kvm_regs { __u64 hi; __u64 lo; __u64 pc; - - __u32 cp0reg[N_MIPS_COPROC_REGS][N_MIPS_COPROC_SEL]; -}; - -/* for KVM_GET_SREGS and KVM_SET_SREGS */ -struct kvm_sregs { }; /* @@ -55,21 +50,89 @@ struct kvm_fpu { __u32 pad; }; + +/* + * For MIPS, we use KVM_SET_ONE_REG and KVM_GET_ONE_REG to access CP0 + * registers. The id field is broken down as follows: + * + * bits[2..0] - Register 'sel' index. + * bits[7..3] - Register 'rd' index. + * bits[15..8] - Must be zero. + * bits[63..16] - 1 - CP0 registers. + * + * Other sets registers may be added in the future. Each set would + * have its own identifier in bits[63..16]. + * + * The addr field of struct kvm_one_reg must point to an aligned + * 64-bit wide location. For registers that are narrower than + * 64-bits, the value is stored in the low order bits of the location, + * and sign extended to 64-bits. + * + * The registers defined in struct kvm_regs are also accessible, the + * id values for these are below. 
+ */ + +#define KVM_REG_MIPS_R0 0 +#define KVM_REG_MIPS_R1 1 +#define KVM_REG_MIPS_R2 2 +#define KVM_REG_MIPS_R3 3 +#define KVM_REG_MIPS_R4 4 +#define KVM_REG_MIPS_R5 5 +#define KVM_REG_MIPS_R6 6 +#define KVM_REG_MIPS_R7 7 +#define KVM_REG_MIPS_R8 8 +#define KVM_REG_MIPS_R9 9 +#define KVM_REG_MIPS_R10 10 +#define KVM_REG_MIPS_R11 11 +#define KVM_REG_MIPS_R12 12 +#define KVM_REG_MIPS_R13 13 +#define KVM_REG_MIPS_R14 14 +#define KVM_REG_MIPS_R15 15 +#define KVM_REG_MIPS_R16 16 +#define KVM_REG_MIPS_R17 17 +#define KVM_REG_MIPS_R18 18 +#define KVM_REG_MIPS_R19 19 +#define KVM_REG_MIPS_R20 20 +#define KVM_REG_MIPS_R21 21 +#define KVM_REG_MIPS_R22 22 +#define KVM_REG_MIPS_R23 23 +#define KVM_REG_MIPS_R24 24 +#define KVM_REG_MIPS_R25 25 +#define KVM_REG_MIPS_R26 26 +#define KVM_REG_MIPS_R27 27 +#define KVM_REG_MIPS_R28 28 +#define KVM_REG_MIPS_R29 29 +#define KVM_REG_MIPS_R30 30 +#define KVM_REG_MIPS_R31 31 + +#define KVM_REG_MIPS_HI 32 +#define KVM_REG_MIPS_LO 33 +#define KVM_REG_MIPS_PC 34 + +/* + * KVM MIPS specific structures and definitions + * + */ struct kvm_debug_exit_arch { + __u64 epc; }; /* for KVM_SET_GUEST_DEBUG */ struct kvm_guest_debug_arch { }; +/* definition of registers in kvm_run */ +struct kvm_sync_regs { +}; + +/* dummy definition */ +struct kvm_sregs { +}; + struct kvm_mips_interrupt { /* in */ __u32 cpu; __u32 irq; }; -/* definition of registers in kvm_run */ -struct kvm_sync_regs { -}; - #endif /* __LINUX_KVM_MIPS_H */ diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h index 143875c..4d6fa0b 100644 --- a/arch/mips/include/asm/kvm_host.h +++ b/arch/mips/include/asm/kvm_host.h @@ -496,10 +496,6 @@ struct kvm_mips_callbacks { uint32_t cause); int (*irq_clear) (struct kvm_vcpu *vcpu, unsigned int priority, uint32_t cause); - int (*vcpu_ioctl_get_regs) (struct kvm_vcpu *vcpu, - struct kvm_regs *regs); - int (*vcpu_ioctl_set_regs) (struct kvm_vcpu *vcpu, - struct kvm_regs *regs); }; extern struct kvm_mips_callbacks *kvm_mips_callbacks; int kvm_mips_emulation_init(struct kvm_mips_callbacks **install_callbacks); diff --git a/arch/mips/kvm/kvm_mips.c b/arch/mips/kvm/kvm_mips.c index 71a1fc1..5aec4f4 100644 --- a/arch/mips/kvm/kvm_mips.c +++ b/arch/mips/kvm/kvm_mips.c @@ -485,15 +485,251 @@ kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu, return -EINVAL; } +#define KVM_REG_MIPS_CP0_INDEX (0x1 + 8 * 0 + 0) +#define KVM_REG_MIPS_CP0_ENTRYLO0 (0x1 + 8 * 2 + 0) +#define KVM_REG_MIPS_CP0_ENTRYLO1 (0x1 + 8 * 3 + 0) +#define KVM_REG_MIPS_CP0_CONTEXT (0x1 + 8 * 4 + 0) +#define KVM_REG_MIPS_CP0_USERLOCAL (0x1 + 8 * 4 + 2) +#define KVM_REG_MIPS_CP0_PAGEMASK (0x1 + 8 * 5 + 0) +#define
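Given the encoding in the comment above (bits[63..16] == 1 selects CP0, i.e. a 0x10000 base; 'rd' in bits[7..3], 'sel' in bits[2..0]), userspace can construct ids like this (sketch; the helper macro is made up for illustration):

	#define MIPS_CP0_REG(rd, sel)	(0x10000 | ((rd) << 3) | (sel))

	__u64 val;
	struct kvm_one_reg reg = {
		.id   = MIPS_CP0_REG(12, 0),	/* CP0 Status: rd = 12, sel = 0 */
		.addr = (__u64)(unsigned long)&val,	/* aligned 64-bit location */
	};

	if (ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg) < 0)
		perror("KVM_GET_ONE_REG");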
[PATCH v5 3/6] mips/kvm: Fix name of gpr field in struct kvm_regs.
From: David Daney david.da...@cavium.com

Signed-off-by: David Daney david.da...@cavium.com
---
 arch/mips/include/asm/kvm.h | 3 ++-
 arch/mips/kvm/kvm_mips.c    | 4 ++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/mips/include/asm/kvm.h b/arch/mips/include/asm/kvm.h
index 86812fb..d145ead 100644
--- a/arch/mips/include/asm/kvm.h
+++ b/arch/mips/include/asm/kvm.h
@@ -26,7 +26,8 @@
  * extended to 64-bits.
  */
 struct kvm_regs {
-	__u64 gprs[32];
+	/* out (KVM_GET_REGS) / in (KVM_SET_REGS) */
+	__u64 gpr[32];
 	__u64 hi;
 	__u64 lo;
 	__u64 pc;
diff --git a/arch/mips/kvm/kvm_mips.c b/arch/mips/kvm/kvm_mips.c
index e0dad02..93da750 100644
--- a/arch/mips/kvm/kvm_mips.c
+++ b/arch/mips/kvm/kvm_mips.c
@@ -678,7 +678,7 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 	int i;
 
 	for (i = 0; i < 32; i++)
-		vcpu->arch.gprs[i] = regs->gprs[i];
+		vcpu->arch.gprs[i] = regs->gpr[i];
 
 	vcpu->arch.hi = regs->hi;
 	vcpu->arch.lo = regs->lo;
@@ -692,7 +692,7 @@ int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 	int i;
 
 	for (i = 0; i < 32; i++)
-		regs->gprs[i] = vcpu->arch.gprs[i];
+		regs->gpr[i] = vcpu->arch.gprs[i];
 
 	regs->hi = vcpu->arch.hi;
 	regs->lo = vcpu->arch.lo;
-- 
1.7.11.7
-- 
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 0/6] mips/kvm: Fix ABI for compatibility with 64-bit guests.
On May 22, 2013, at 11:43 AM, David Daney wrote: From: David Daney david.da...@cavium.com The initial patch set implementing MIPS KVM does not handle 64-bit guests or use of the FPU. This patch set corrects these ABI issues, and does some very minor clean up. Changes from v4: No code change, just keep more of the code in kvm_mips.c rather than kvm_trap_emul.c Changes from v3: Use KVM_SET_ONE_REG instead of KVM_SET_MSRS. Added ENOIOCTLCMD patch. Changes from v2: Split into five parts, no code change. David Daney (6): mips/kvm: Fix ABI for use of FPU. mips/kvm: Fix ABI for use of 64-bit registers. mips/kvm: Fix name of gpr field in struct kvm_regs. mips/kvm: Use ARRAY_SIZE() instead of hardcoded constants in kvm_arch_vcpu_ioctl_{s,g}et_regs mips/kvm: Fix ABI by moving manipulation of CP0 registers to KVM_{G,S}ET_ONE_REG mips/kvm: Use ENOIOCTLCMD to indicate unimplemented ioctls. arch/mips/include/asm/kvm.h | 137 +++ arch/mips/include/asm/kvm_host.h | 4 - arch/mips/kvm/kvm_mips.c | 278 --- arch/mips/kvm/kvm_trap_emul.c| 50 --- 4 files changed, 367 insertions(+), 102 deletions(-) -- 1.7.11.7 Acked-by: Sanjay Lal sanj...@kymasys.com Regards Sanjay -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v7 04/11] KVM: MMU: zap pages in batch
Zap at least 10 pages before releasing mmu-lock to reduce the overhead caused by re-acquiring the lock

After the patch, kvm_zap_obsolete_pages can make forward progress anyway, so update the comments

[ It improves kernel building 0.6% ~ 1% ]

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c | 35 +++
 1 files changed, 11 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f302540..688e755 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4203,14 +4203,18 @@ restart:
 	spin_unlock(&kvm->mmu_lock);
 }
 
+#define BATCH_ZAP_PAGES	10
 static void kvm_zap_obsolete_pages(struct kvm *kvm)
 {
 	struct kvm_mmu_page *sp, *node;
 	LIST_HEAD(invalid_list);
+	int batch = 0;
 
 restart:
 	list_for_each_entry_safe_reverse(sp, node,
 	      &kvm->arch.active_mmu_pages, link) {
+		int ret;
+
 		/*
 		 * No obsolete page exists before new created page since
 		 * active_mmu_pages is the FIFO list.
@@ -4219,28 +4223,6 @@ restart:
 			break;
 
 		/*
-		 * Do not repeatedly zap a root page to avoid unnecessary
-		 * KVM_REQ_MMU_RELOAD, otherwise we may not be able to
-		 * progress:
-		 *    vcpu 0                        vcpu 1
-		 *                         call vcpu_enter_guest():
-		 *                            1): handle KVM_REQ_MMU_RELOAD
-		 *                                and require mmu-lock to
-		 *                                load mmu
-		 * repeat:
-		 *    1): zap root page and
-		 *        send KVM_REQ_MMU_RELOAD
-		 *
-		 *    2): if (cond_resched_lock(mmu-lock))
-		 *
-		 *                            2): hold mmu-lock and load mmu
-		 *
-		 *                            3): see KVM_REQ_MMU_RELOAD bit
-		 *                                on vcpu->requests is set
-		 *                                then return 1 to call
-		 *                                vcpu_enter_guest() again.
-		 *            goto repeat;
-		 *
 		 * Since we are reversely walking the list and the invalid
 		 * list will be moved to the head, skip the invalid page
 		 * can help us to avoid the infinity list walking.
 		 */
@@ -4248,13 +4230,18 @@ restart:
 		if (sp->role.invalid)
 			continue;
 
-		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+		if (batch >= BATCH_ZAP_PAGES &&
+		      (need_resched() || spin_needbreak(&kvm->mmu_lock))) {
+			batch = 0;
 			kvm_mmu_commit_zap_page(kvm, &invalid_list);
 			cond_resched_lock(&kvm->mmu_lock);
 			goto restart;
 		}
 
-		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
+		ret = kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+		batch += ret;
+
+		if (ret)
 			goto restart;
 	}
-- 
1.7.7.6
-- 
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v7 05/11] KVM: x86: use the fast way to invalidate all pages
Replace kvm_mmu_zap_all by kvm_mmu_invalidate_zap_all_pages Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/mmu.c | 15 --- arch/x86/kvm/x86.c |4 ++-- 2 files changed, 2 insertions(+), 17 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 688e755..c010ace 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -4188,21 +4188,6 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot) spin_unlock(kvm-mmu_lock); } -void kvm_mmu_zap_all(struct kvm *kvm) -{ - struct kvm_mmu_page *sp, *node; - LIST_HEAD(invalid_list); - - spin_lock(kvm-mmu_lock); -restart: - list_for_each_entry_safe(sp, node, kvm-arch.active_mmu_pages, link) - if (kvm_mmu_prepare_zap_page(kvm, sp, invalid_list)) - goto restart; - - kvm_mmu_commit_zap_page(kvm, invalid_list); - spin_unlock(kvm-mmu_lock); -} - #define BATCH_ZAP_PAGES10 static void kvm_zap_obsolete_pages(struct kvm *kvm) { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 3758ff9..15e10f7 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7066,13 +7066,13 @@ void kvm_arch_commit_memory_region(struct kvm *kvm, void kvm_arch_flush_shadow_all(struct kvm *kvm) { - kvm_mmu_zap_all(kvm); + kvm_mmu_invalidate_zap_all_pages(kvm); } void kvm_arch_flush_shadow_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) { - kvm_arch_flush_shadow_all(kvm); + kvm_mmu_invalidate_zap_all_pages(kvm); } int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu) -- 1.7.7.6 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v7 07/11] KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages
It is good for debug and development Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/mmu.c |1 + arch/x86/kvm/mmutrace.h | 20 2 files changed, 21 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index c010ace..3a3e6c5 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -4245,6 +4245,7 @@ restart: void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm) { spin_lock(kvm-mmu_lock); + trace_kvm_mmu_invalidate_zap_all_pages(kvm); kvm-arch.mmu_valid_gen++; kvm_zap_obsolete_pages(kvm); diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h index 697f466..eb444dd 100644 --- a/arch/x86/kvm/mmutrace.h +++ b/arch/x86/kvm/mmutrace.h @@ -276,6 +276,26 @@ TRACE_EVENT( __spte_satisfied(old_spte), __spte_satisfied(new_spte) ) ); + +TRACE_EVENT( + kvm_mmu_invalidate_zap_all_pages, + TP_PROTO(struct kvm *kvm), + TP_ARGS(kvm), + + TP_STRUCT__entry( + __field(unsigned long, mmu_valid_gen) + __field(unsigned int, mmu_used_pages) + ), + + TP_fast_assign( + __entry-mmu_valid_gen = kvm-arch.mmu_valid_gen; + __entry-mmu_used_pages = kvm-arch.n_used_mmu_pages; + ), + + TP_printk(kvm-mmu-valid-gen %lx used_pages %x, + __entry-mmu_valid_gen, __entry-mmu_used_pages + ) +); #endif /* _TRACE_KVMMMU_H */ #undef TRACE_INCLUDE_PATH -- 1.7.7.6 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v7 11/11] KVM: MMU: reduce KVM_REQ_MMU_RELOAD when root page is zapped
From: Gleb Natapov g...@redhat.com Quote Gleb's mail: | why don't we check for sp-role.invalid in | kvm_mmu_prepare_zap_page before calling kvm_reload_remote_mmus()? and | Actually we can add check for is_obsolete_sp() there too since | kvm_mmu_invalidate_all_pages() already calls kvm_reload_remote_mmus() | after incrementing mmu_valid_gen. [ Xiao: add some comments and the check of is_obsolete_sp() ] Signed-off-by: Gleb Natapov g...@redhat.com Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/mmu.c |8 +++- 1 files changed, 7 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 5e34056..055d675 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2097,7 +2097,13 @@ __kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, kvm_mod_used_mmu_pages(kvm, -1); } else { list_move(sp-link, kvm-arch.active_mmu_pages); - kvm_reload_remote_mmus(kvm); + + /* +* The obsolete pages can not be used on any vcpus. +* See the comments in kvm_mmu_invalidate_zap_all_pages(). +*/ + if (!sp-role.invalid !is_obsolete_sp(kvm, sp)) + kvm_reload_remote_mmus(kvm); } sp-role.invalid = 1; -- 1.7.7.6 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v7 10/11] KVM: MMU: collapse TLB flushes when zap all pages
kvm_zap_obsolete_pages uses the lock-break technique to zap pages; it flushes the tlb every time it does a lock-break

We can reload the mmu on all vcpus after updating the generation number, so that the obsolete pages are not used on any vcpu; after that we do not need to flush the tlb when obsolete pages are zapped

Note: kvm_mmu_commit_zap_page is still needed before freeing the pages, since other vcpus may be doing lockless shadow page walks

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c | 32 ++--
 1 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e676356..5e34056 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4237,8 +4237,6 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
 restart:
 	list_for_each_entry_safe_reverse(sp, node,
 	      &kvm->arch.active_mmu_pages, link) {
-		int ret;
-
 		/*
 		 * No obsolete page exists before new created page since
 		 * active_mmu_pages is the FIFO list.
@@ -4254,21 +4252,24 @@ restart:
 		if (sp->role.invalid)
 			continue;
 
+		/*
+		 * Need not flush tlb since we only zap the sp with invalid
+		 * generation number.
+		 */
 		if (batch >= BATCH_ZAP_PAGES &&
-		      (need_resched() || spin_needbreak(&kvm->mmu_lock))) {
+		      cond_resched_lock(&kvm->mmu_lock)) {
 			batch = 0;
-			kvm_mmu_commit_zap_page(kvm, &invalid_list);
-			cond_resched_lock(&kvm->mmu_lock);
 			goto restart;
 		}
 
-		ret = kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
-		batch += ret;
-
-		if (ret)
-			goto restart;
+		batch += kvm_mmu_prepare_zap_obsolete_page(kvm, sp,
+							   &invalid_list);
 	}
 
+	/*
+	 * Should flush tlb before free page tables since lockless-walking
+	 * may use the pages.
+	 */
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
 }
 
@@ -4287,6 +4288,17 @@ void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm)
 	trace_kvm_mmu_invalidate_zap_all_pages(kvm);
 	kvm->arch.mmu_valid_gen++;
 
+	/*
+	 * Notify all vcpus to reload its shadow page table
+	 * and flush TLB. Then all vcpus will switch to new
+	 * shadow page table with the new mmu_valid_gen.
+	 *
+	 * Note: we should do this under the protection of
+	 * mmu-lock, otherwise, vcpu would purge shadow page
+	 * but miss tlb flush.
+	 */
+	kvm_reload_remote_mmus(kvm);
+
 	kvm_zap_obsolete_pages(kvm);
 	spin_unlock(&kvm->mmu_lock);
 }
-- 
1.7.7.6
-- 
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v7 08/11] KVM: MMU: do not reuse the obsolete page
The obsolete page will be zapped soon; do not reuse it, to reduce future page faults

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c | 3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3a3e6c5..9b57faa 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1869,6 +1869,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		role.quadrant = quadrant;
 	}
 	for_each_gfn_sp(vcpu->kvm, sp, gfn) {
+		if (is_obsolete_sp(vcpu->kvm, sp))
+			continue;
+
 		if (!need_sync && sp->unsync)
 			need_sync = true;
 
-- 
1.7.7.6
-- 
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v7 00/11] KVM: MMU: fast zap all shadow pages
Changelog:

V7:
  1): separate some optimizations into two patches which do not reuse the obsolete pages and collapse tlb flushes, suggested by Marcelo.
  2): base the patch on Gleb's diff change which reduces KVM_REQ_MMU_RELOAD when the root page is being zapped.
  3): remove calling kvm_mmu_zap_page when patching hypercall, investigated by Gleb.
  4): drop the patch which deleted the page from the hash list at prepare time, since it can break the hash-list-based walk.
  5): rename kvm_mmu_invalidate_all_pages to kvm_mmu_invalidate_zap_all_pages.
  6): introduce kvm_mmu_prepare_zap_obsolete_page, which is used to zap obsolete pages to collapse tlb flushes.

V6:
  1): reversely walk active_list to skip the newly created pages, based on the comments from Gleb and Paolo.
  2): completely replace kvm_mmu_zap_all by kvm_mmu_invalidate_all_pages, based on Gleb's comments.
  3): improve the parameters of kvm_mmu_invalidate_all_pages, based on Gleb's comments.
  4): rename kvm_mmu_invalidate_memslot_pages to kvm_mmu_invalidate_all_pages.
  5): rename zap_invalid_pages to kvm_zap_obsolete_pages.

V5:
  1): rename is_valid_sp to is_obsolete_sp.
  2): use the lock-break technique to zap all old pages instead of only the pages linked on the invalid slot's rmap, suggested by Marcelo.
  3): trace invalid pages and kvm_mmu_invalidate_memslot_pages().
  4): rename kvm_mmu_invalid_memslot_pages to kvm_mmu_invalidate_memslot_pages, according to Takuya's comments.

V4:
  1): drop unmapping the invalid rmap outside of mmu-lock and use the lock-break technique instead. Thanks to Gleb's comments.
  2): no need to handle invalid-gen pages specially, since the page table is always switched by KVM_REQ_MMU_RELOAD. Thanks to Marcelo's comments.

V3: completely redesign the algorithm, please see below.

V2:
  - do not reset n_requested_mmu_pages and n_max_mmu_pages
  - batch free root shadow pages to reduce vcpu notification and mmu-lock contention
  - remove the first patch that introduced kvm->arch.mmu_cache, since we only 'memset zero' the hashtable rather than all mmu cache members in this version
  - remove the unnecessary kvm_reload_remote_mmus after kvm_mmu_zap_all

* Issue
The current kvm_mmu_zap_all is really slow - it holds mmu-lock to walk and zap all shadow pages one by one, and it also needs to zap all guest pages' rmaps and all shadow pages' parent spte lists. Things become worse as the guest uses more memory or more vcpus; it is not good for scalability.

* Idea
KVM maintains a global mmu invalid generation number which is stored in kvm->arch.mmu_valid_gen, and every shadow page stores the current global generation number into sp->mmu_valid_gen when it is created.

When KVM needs to zap all shadow page sptes, it simply increases the global generation number and then reloads the root shadow pages on all vcpus. Each vcpu creates a new shadow page table according to the current generation number; this ensures the old pages are not used any more. The invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen) are then zapped using the lock-break technique.
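The generation check at the heart of this is tiny; in the series it boils down to something like the following (a sketch matching the description above; the helper itself is introduced in patch 3, which is not reproduced here):

	static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
	{
		return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
	}

	/* and at shadow page creation: */
	sp->mmu_valid_gen = kvm->arch.mmu_valid_gen;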
Gleb Natapov (1):
  KVM: MMU: reduce KVM_REQ_MMU_RELOAD when root page is zapped

Xiao Guangrong (10):
  KVM: x86: drop calling kvm_mmu_zap_all in emulator_fix_hypercall
  KVM: MMU: drop unnecessary kvm_reload_remote_mmus
  KVM: MMU: fast invalidate all pages
  KVM: MMU: zap pages in batch
  KVM: x86: use the fast way to invalidate all pages
  KVM: MMU: show mmu_valid_gen in shadow page related tracepoints
  KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages
  KVM: MMU: do not reuse the obsolete page
  KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page
  KVM: MMU: collapse TLB flushes when zap all pages

 arch/x86/include/asm/kvm_host.h |    2 +
 arch/x86/kvm/mmu.c              |  134 ++++++++++++++++++++++++++++++-------
 arch/x86/kvm/mmu.h              |    1 +
 arch/x86/kvm/mmutrace.h         |   42 +++++++++---
 arch/x86/kvm/x86.c              |   16 ++---
 5 files changed, 162 insertions(+), 33 deletions(-)

--
1.7.7.6
[PATCH v7 01/11] KVM: x86: drop calling kvm_mmu_zap_all in emulator_fix_hypercall
Quote Gleb's mail:

| Back then kvm->lock protected memslot access so code like:
|
| mutex_lock(&vcpu->kvm->lock);
| kvm_mmu_zap_all(vcpu->kvm);
| mutex_unlock(&vcpu->kvm->lock);
|
| which is what 7aa81cc0 does was enough to guarantee that no vcpu will
| run while code is patched. This is no longer the case and
| mutex_lock(&vcpu->kvm->lock); is gone from that code path long time ago,
| so now kvm_mmu_zap_all() there is useless and the code is incorrect.

So we drop it; it will be fixed properly later.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/x86.c |    7 -------
 1 files changed, 0 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8d28810..6739b1d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5523,13 +5523,6 @@ static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt)
 	char instruction[3];
 	unsigned long rip = kvm_rip_read(vcpu);
 
-	/*
-	 * Blow out the MMU to ensure that no other VCPU has an active mapping
-	 * to ensure that the updated hypercall appears atomically across all
-	 * VCPUs.
-	 */
-	kvm_mmu_zap_all(vcpu->kvm);
-
 	kvm_x86_ops->patch_hypercall(vcpu, instruction);
 
 	return emulator_write_emulated(ctxt, rip, instruction, 3, NULL);
--
1.7.7.6
[PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page
It is only used to zap obsolete pages. Since an obsolete page will not be
used again, we need not spend time finding its unsync children. Also, we
delete the page from the shadow page cache so that the page is completely
isolated after calling this function.

A later patch will use it to collapse tlb flushes.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |   46 +++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9b57faa..e676356 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
 static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 {
 	ASSERT(is_empty_shadow_page(sp->spt));
-	hlist_del(&sp->hash_link);
+	hlist_del_init(&sp->hash_link);
 	list_del(&sp->link);
 	free_page((unsigned long)sp->spt);
 	if (!sp->role.direct)
@@ -2069,14 +2069,19 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
 	return zapped;
 }
 
-static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
-				    struct list_head *invalid_list)
+static int
+__kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+			   bool zap_unsync_children,
+			   struct list_head *invalid_list)
 {
-	int ret;
+	int ret = 0;
 
 	trace_kvm_mmu_prepare_zap_page(sp);
 	++kvm->stat.mmu_shadow_zapped;
-	ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
+
+	if (likely(zap_unsync_children))
+		ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
+
 	kvm_mmu_page_unlink_children(kvm, sp);
 	kvm_mmu_unlink_parents(kvm, sp);
 
@@ -2099,6 +2104,37 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 	return ret;
 }
 
+/*
+ * The obsolete page will not be used, so we need not spend time finding
+ * its unsync children. Also, we delete the page from the shadow page
+ * cache so that the page is completely isolated after calling this
+ * function.
+ *
+ * Note: if we use this function in for_each_gfn_xxx macros, we should
+ * re-walk the list when it successfully zaps one page.
+ */
+static int
+kvm_mmu_prepare_zap_obsolete_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+				  struct list_head *invalid_list)
+{
+	int ret;
+
+	WARN_ON(!is_obsolete_sp(kvm, sp));
+
+	ret = __kvm_mmu_prepare_zap_page(kvm, sp, false, invalid_list);
+	if (ret)
+		hlist_del_init(&sp->hash_link);
+
+	WARN_ON(ret > 1);
+	return ret;
+}
+
+static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+				    struct list_head *invalid_list)
+{
+	return __kvm_mmu_prepare_zap_page(kvm, sp, true, invalid_list);
+}
+
 static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 				    struct list_head *invalid_list)
 {
--
1.7.7.6
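Since a successful zap now unlinks the page from the hash list, a hash-list walk must restart after each hit, as the comment in the patch warns. A sketch of the calling pattern (an editor's illustration, assuming the existing for_each_gfn_indirect_valid_sp walker in mmu.c; the surrounding function is hypothetical):

/*
 * Sketch: re-walk after each successful zap, because the zapped page
 * has been removed from the hash list we are iterating over.
 */
static void zap_obsolete_gfn(struct kvm *kvm, gfn_t gfn)
{
	struct kvm_mmu_page *sp;
	LIST_HEAD(invalid_list);

restart:
	for_each_gfn_indirect_valid_sp(kvm, sp, gfn)
		if (is_obsolete_sp(kvm, sp) &&
		    kvm_mmu_prepare_zap_obsolete_page(kvm, sp, &invalid_list))
			goto restart;

	kvm_mmu_commit_zap_page(kvm, &invalid_list);
}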
[PATCH v7 06/11] KVM: MMU: show mmu_valid_gen in shadow page related tracepoints
Show sp->mmu_valid_gen.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmutrace.h |   22 ++++++++++++----------
 1 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index b8f6172..697f466 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -7,16 +7,18 @@
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM kvmmmu
 
-#define KVM_MMU_PAGE_FIELDS \
-	__field(__u64, gfn) \
-	__field(__u32, role) \
-	__field(__u32, root_count) \
+#define KVM_MMU_PAGE_FIELDS			\
+	__field(unsigned long, mmu_valid_gen)	\
+	__field(__u64, gfn)			\
+	__field(__u32, role)			\
+	__field(__u32, root_count)		\
 	__field(bool, unsync)
 
-#define KVM_MMU_PAGE_ASSIGN(sp)			\
-	__entry->gfn = sp->gfn;			\
-	__entry->role = sp->role.word;		\
-	__entry->root_count = sp->root_count;	\
+#define KVM_MMU_PAGE_ASSIGN(sp)				\
+	__entry->mmu_valid_gen = sp->mmu_valid_gen;	\
+	__entry->gfn = sp->gfn;				\
+	__entry->role = sp->role.word;			\
+	__entry->root_count = sp->root_count;		\
 	__entry->unsync = sp->unsync;
 
 #define KVM_MMU_PAGE_PRINTK() ({				\
@@ -28,8 +30,8 @@
 								\
 	role.word = __entry->role;				\
 								\
-	trace_seq_printf(p, "sp gfn %llx %u%s q%u%s %s%s"	\
-			 " %snxe root %u %s%c",			\
+	trace_seq_printf(p, "sp gen %lx gfn %llx %u%s q%u%s %s%s" \
+			 " %snxe root %u %s%c", __entry->mmu_valid_gen, \
 			 __entry->gfn, role.level,		\
 			 role.cr4_pae ? " pae" : "",		\
 			 role.quadrant,				\
--
1.7.7.6
[PATCH v7 03/11] KVM: MMU: fast invalidate all pages
The current kvm_mmu_zap_all is really slow - it holds mmu-lock to walk and
zap all shadow pages one by one, and it also needs to zap all guest pages'
rmaps and all shadow pages' parent spte lists. Things become worse if the
guest uses more memory or vcpus. It is not good for scalability.

In this patch, we introduce a faster way to invalidate all shadow pages.
KVM maintains a global mmu invalid generation number, stored in
kvm->arch.mmu_valid_gen, and every shadow page stores the current global
generation number into sp->mmu_valid_gen when it is created.

When KVM needs to zap all shadow page sptes, it simply increases the global
generation number and then reloads the root shadow pages on all vcpus. A
vcpu then creates a new shadow page table using the current generation
number, which ensures the old pages are not used any more. The obsolete
pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen) are zapped afterwards
using the lock-break technique.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/include/asm/kvm_host.h |    2 +
 arch/x86/kvm/mmu.c              |   84 +++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu.h              |    1 +
 3 files changed, 87 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3741c65..bff7d46 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -222,6 +222,7 @@ struct kvm_mmu_page {
 	int root_count;          /* Currently serving as active root */
 	unsigned int unsync_children;
 	unsigned long parent_ptes;	/* Reverse mapping for parent_pte */
+	unsigned long mmu_valid_gen;
 	DECLARE_BITMAP(unsync_child_bitmap, 512);
 
 #ifdef CONFIG_X86_32
@@ -529,6 +530,7 @@ struct kvm_arch {
 	unsigned int n_requested_mmu_pages;
 	unsigned int n_max_mmu_pages;
 	unsigned int indirect_shadow_pages;
+	unsigned long mmu_valid_gen;
 	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
 	/*
 	 * Hash table of struct kvm_mmu_page.
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f8ca2f3..f302540 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1838,6 +1838,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sp);
 }
 
+static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+	return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
+}
+
 static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 					     gfn_t gfn,
 					     gva_t gaddr,
@@ -1900,6 +1905,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 
 		account_shadowed(vcpu->kvm, gfn);
 	}
+	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
 	init_shadow_page_table(sp);
 	trace_kvm_mmu_get_page(sp, true);
 	return sp;
@@ -2070,8 +2076,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 	ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
 	kvm_mmu_page_unlink_children(kvm, sp);
 	kvm_mmu_unlink_parents(kvm, sp);
+
 	if (!sp->role.invalid && !sp->role.direct)
 		unaccount_shadowed(kvm, sp->gfn);
+
 	if (sp->unsync)
 		kvm_unlink_unsync_page(kvm, sp);
 	if (!sp->root_count) {
@@ -4195,6 +4203,82 @@ restart:
 	spin_unlock(&kvm->mmu_lock);
 }
 
+static void kvm_zap_obsolete_pages(struct kvm *kvm)
+{
+	struct kvm_mmu_page *sp, *node;
+	LIST_HEAD(invalid_list);
+
+restart:
+	list_for_each_entry_safe_reverse(sp, node,
+	      &kvm->arch.active_mmu_pages, link) {
+		/*
+		 * No obsolete page exists before new created page since
+		 * active_mmu_pages is the FIFO list.
+		 */
+		if (!is_obsolete_sp(kvm, sp))
+			break;
+
+		/*
+		 * Do not repeatedly zap a root page to avoid unnecessary
+		 * KVM_REQ_MMU_RELOAD, otherwise we may not be able to
+		 * progress:
+		 *    vcpu 0                        vcpu 1
+		 *                          call vcpu_enter_guest():
+		 *                             1): handle KVM_REQ_MMU_RELOAD
+		 *                                 and require mmu-lock to
+		 *                                 load mmu
+		 * repeat:
+		 *    1): zap root page and
+		 *        send KVM_REQ_MMU_RELOAD
+		 *
+		 *    2): if (cond_resched_lock(mmu-lock))
+		 *
+		 *                             2): hold mmu-lock and load mmu
+		 *
+		 *                             3): see KVM_REQ_MMU_RELOAD bit
+		 *                                 on vcpu->requests is set
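The hunk above is cut off mid-comment in this archive. Based on the cover letter's description of the algorithm, the function plausibly ends along the lines of the following hedged reconstruction (an editor's sketch, not the verbatim patch):

static void kvm_zap_obsolete_pages(struct kvm *kvm)
{
	struct kvm_mmu_page *sp, *node;
	LIST_HEAD(invalid_list);

restart:
	list_for_each_entry_safe_reverse(sp, node,
	      &kvm->arch.active_mmu_pages, link) {
		if (!is_obsolete_sp(kvm, sp))
			break;

		/* see the race diagram above: skip invalid root pages */
		if (sp->role.invalid && sp->root_count)
			continue;

		/* lock-break: yield mmu-lock under contention */
		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
			kvm_mmu_commit_zap_page(kvm, &invalid_list);
			cond_resched_lock(&kvm->mmu_lock);
			goto restart;
		}

		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
			goto restart;
	}

	kvm_mmu_commit_zap_page(kvm, &invalid_list);
}

/*
 * Sketch of the entry point: make every existing page obsolete with one
 * increment, force vcpus onto fresh roots, then reclaim lazily.
 */
void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm)
{
	spin_lock(&kvm->mmu_lock);
	kvm->arch.mmu_valid_gen++;
	kvm_reload_remote_mmus(kvm);
	kvm_zap_obsolete_pages(kvm);
	spin_unlock(&kvm->mmu_lock);
}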
[PATCH v7 02/11] KVM: MMU: drop unnecessary kvm_reload_remote_mmus
It is the responsibility of kvm_mmu_zap_all to keep the mmu and tlbs
consistent. It is also unnecessary after zapping all mmio sptes, since no
mmio spte exists on a root shadow page and it can not be cached into the
tlb.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/x86.c |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6739b1d..3758ff9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7060,16 +7060,13 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 	 * If memory slot is created, or moved, we need to clear all
 	 * mmio sptes.
 	 */
-	if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
+	if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE))
 		kvm_mmu_zap_mmio_sptes(kvm);
-		kvm_reload_remote_mmus(kvm);
-	}
 }
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
 {
 	kvm_mmu_zap_all(kvm);
-	kvm_reload_remote_mmus(kvm);
 }
 
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
--
1.7.7.6
Re: [PATCH v2 10/10] kernel: might_fault does not imply might_sleep
On Thu, May 16, 2013 at 08:40:41PM +0200, Peter Zijlstra wrote:

On Thu, May 16, 2013 at 02:16:10PM +0300, Michael S. Tsirkin wrote:

There are several ways to make sure a might_fault-calling function does not
sleep. One is to use it on kernel or otherwise locked memory - apparently
nfs/sunrpc does this. As noted by Ingo, this is handled by the might_fault()
implementation in mm/memory.c but not the one in linux/kernel.h, so in the
current code might_fault() schedules differently depending on
CONFIG_PROVE_LOCKING, which is an undesired semantic side effect.

Another is to call pagefault_disable: in this case the page fault handler
will go to fixups processing and we get an error instead of sleeping, so the
might_sleep annotation is a false positive. The vhost driver wants to do
this now in order to reuse socket ops under a spinlock (and fall back on a
slower thread handler on error).

Are you using the assumption that spin_lock() implies preempt_disable()
implies pagefault_disable()? Note that this assumption isn't valid for -rt,
where the spinlock becomes preemptible but we'll not disable pagefaults.

Address both issues by:
- dropping the unconditional call to might_sleep from the fast might_fault
  code in linux/kernel.h
- checking for pagefault_disable() in the CONFIG_PROVE_LOCKING
  implementation

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 include/linux/kernel.h |  1 -
 mm/memory.c            | 14 +++++++++-----
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index e96329c..322b065 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -198,7 +198,6 @@ void might_fault(void);
 #else
 static inline void might_fault(void)
 {
-	might_sleep();

This removes potential resched points for PREEMPT_VOLUNTARY -- was that
intentional?

OK so I'm thinking of going back to this idea: it has the advantage of being
very simple, and just might make some workloads faster if they do lots of
copy_XX_user in a loop. Will have to be tested of course - anyone has
objections?

 }
 #endif

diff --git a/mm/memory.c b/mm/memory.c
index 6dc1882..1b8327b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4222,13 +4222,17 @@ void might_fault(void)
 	if (segment_eq(get_fs(), KERNEL_DS))
 		return;
 
-	might_sleep();
 	/*
-	 * it would be nicer only to annotate paths which are not under
-	 * pagefault_disable, however that requires a larger audit and
-	 * providing helpers like get_user_atomic.
+	 * It would be nicer to annotate paths which are under preempt_disable
+	 * but not under pagefault_disable, however that requires a new flag
+	 * for differentiating between the two.

-rt has this, pagefault_disable() doesn't change the preempt count but pokes
at task_struct::pagefault_disable.

 	 */
-	if (!in_atomic() && current->mm)
+	if (in_atomic())
+		return;
+
+	might_sleep();
+
+	if (current->mm)
 		might_lock_read(&current->mm->mmap_sem);
 }
 EXPORT_SYMBOL(might_fault);

--
MST
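For readers unfamiliar with the second technique: pagefault_disable() makes a faulting uaccess instruction take the exception-fixup path and return an error instead of sleeping in the fault handler. A sketch of the pattern the vhost driver wants (an editor's illustration under stated assumptions; the lock and the fallback are hypothetical, __copy_from_user_inatomic is the existing kernel helper):

#include <linux/spinlock.h>
#include <linux/uaccess.h>

/*
 * Sketch: copy from userspace under a spinlock. The copy cannot sleep
 * because page faults are disabled, so a fault returns a nonzero count
 * and the caller punts to a thread context that is allowed to sleep.
 */
static int try_copy_locked(spinlock_t *lock, void *dst,
			   const void __user *src, size_t len)
{
	unsigned long left;

	spin_lock(lock);
	pagefault_disable();
	left = __copy_from_user_inatomic(dst, src, len);
	pagefault_enable();
	spin_unlock(lock);

	return left ? -EFAULT : 0;	/* on -EFAULT, fall back to a worker */
}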
Re: [PATCH v2 10/10] kernel: might_fault does not imply might_sleep
On Tue, May 21, 2013 at 01:57:34PM +0200, Peter Zijlstra wrote:

On Sun, May 19, 2013 at 12:35:26PM +0300, Michael S. Tsirkin wrote:

--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -198,7 +198,6 @@ void might_fault(void);
 #else
 static inline void might_fault(void)
 {
-	might_sleep();

This removes potential resched points for PREEMPT_VOLUNTARY -- was that
intentional?

No, it's a bug. Thanks for pointing this out. OK, so I guess it should be
might_sleep_if(!in_atomic()), and this means might_fault would have to move
from linux/kernel.h to linux/uaccess.h, since in_atomic() is in
linux/hardirq.h. Makes sense?

So the only difference between PROVE_LOCKING and not should be the
might_lock_read() thing; so how about something like this?

So the problem with the below is that might_fault is needed in
asm/uaccess.h. I'm still trying various approaches, but the dependencies
there are very complex.

---
 include/linux/kernel.h  |  7 ++-----
 include/linux/uaccess.h | 26 ++++++++++++++++++++++++++
 mm/memory.c             | 14 ++------------
 3 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index e96329c..70812f4 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -194,12 +194,9 @@ extern int _cond_resched(void);
 })
 
 #ifdef CONFIG_PROVE_LOCKING
-void might_fault(void);
+void might_fault_lockdep(void);
 #else
-static inline void might_fault(void)
-{
-	might_sleep();
-}
+static inline void might_fault_lockdep(void) { }
 #endif
 
 extern struct atomic_notifier_head panic_notifier_list;
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 5ca0951..50a2cc9 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -38,6 +38,32 @@ static inline void pagefault_enable(void)
 	preempt_check_resched();
 }
 
+static inline bool __can_fault(void)
+{
+	/*
+	 * Some code (nfs/sunrpc) uses socket ops on kernel memory while
+	 * holding the mmap_sem, this is safe because kernel memory doesn't
+	 * get paged out, therefore we'll never actually fault, and the
+	 * below annotations will generate false positives.
+	 */
+	if (segment_eq(get_fs(), KERNEL_DS))
+		return false;
+
+	if (in_atomic() /* || pagefault_disabled() */)
+		return false;
+
+	return true;
+}
+
+static inline void might_fault(void)
+{
+	if (!__can_fault())
+		return;
+
+	might_sleep();
+	might_fault_lockdep();
+}
+
 #ifndef ARCH_HAS_NOCACHE_UACCESS
 
 static inline unsigned long __copy_from_user_inatomic_nocache(void *to,
diff --git a/mm/memory.c b/mm/memory.c
index 6dc1882..266610c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4211,19 +4211,9 @@ void print_vma_addr(char *prefix, unsigned long ip)
 }
 
 #ifdef CONFIG_PROVE_LOCKING
-void might_fault(void)
+void might_fault_lockdep(void)
 {
 	/*
-	 * Some code (nfs/sunrpc) uses socket ops on kernel memory while
-	 * holding the mmap_sem, this is safe because kernel memory doesn't
-	 * get paged out, therefore we'll never actually fault, and the
-	 * below annotations will generate false positives.
-	 */
-	if (segment_eq(get_fs(), KERNEL_DS))
-		return;
-
-	might_sleep();
-	/*
 	 * it would be nicer only to annotate paths which are not under
 	 * pagefault_disable, however that requires a larger audit and
 	 * providing helpers like get_user_atomic.
@@ -4231,7 +4221,7 @@ void might_fault(void)
 	if (!in_atomic() && current->mm)
 		might_lock_read(&current->mm->mmap_sem);
 }
-EXPORT_SYMBOL(might_fault);
+EXPORT_SYMBOL(might_fault_lockdep);
 #endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
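For reference, the might_sleep_if(!in_atomic()) variant mentioned above would be tiny (an editor's sketch of the suggestion, not a posted patch):

/* Sketch: keep the resched point, but only when sleeping is legal. */
static inline void might_fault(void)
{
	might_sleep_if(!in_atomic());
}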
[PATCH 3/5] x86: Annotate _ASM_EXTABLE users to distinguish uaccess from everything else
The idea is that the kernel can be much more careful fixing up uaccess
exceptions -- page faults on user addresses are the only legitimate reason
for a uaccess instruction to fault.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---

I'm not 100% sure what's happening in the KVM code. Can someone familiar
with it take a look?

 arch/x86/ia32/ia32entry.S             |   4 +-
 arch/x86/include/asm/asm.h            |  13 ++-
 arch/x86/include/asm/fpu-internal.h   |   6 +-
 arch/x86/include/asm/futex.h          |   8 +-
 arch/x86/include/asm/kvm_host.h       |   2 +-
 arch/x86/include/asm/msr.h            |   4 +-
 arch/x86/include/asm/segment.h        |   2 +-
 arch/x86/include/asm/special_insns.h  |   2 +-
 arch/x86/include/asm/uaccess.h        |   8 +-
 arch/x86/include/asm/word-at-a-time.h |   2 +-
 arch/x86/include/asm/xsave.h          |   6 +-
 arch/x86/kernel/entry_32.S            |  26 ++---
 arch/x86/kernel/entry_64.S            |   6 +-
 arch/x86/kernel/ftrace.c              |   4 +-
 arch/x86/kernel/test_nx.c             |   2 +-
 arch/x86/kernel/test_rodata.c         |   2 +-
 arch/x86/kvm/emulate.c                |   4 +-
 arch/x86/lib/checksum_32.S            |   4 +-
 arch/x86/lib/copy_user_64.S           |  50 ++++----
 arch/x86/lib/copy_user_nocache_64.S   |  44 +++----
 arch/x86/lib/csum-copy_64.S           |   6 +-
 arch/x86/lib/getuser.S                |  12 +-
 arch/x86/lib/mmx_32.c                 |  12 +-
 arch/x86/lib/msr-reg.S                |   4 +-
 arch/x86/lib/putuser.S                |  10 +-
 arch/x86/lib/usercopy_32.c            | 212 +++++++++++++++++-----------------
 arch/x86/lib/usercopy_64.c            |   4 +-
 arch/x86/mm/init_32.c                 |   2 +-
 arch/x86/um/checksum_32.S             |   4 +-
 arch/x86/xen/xen-asm_32.S             |   2 +-
 30 files changed, 236 insertions(+), 231 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 474dc1b..8d3b5c2 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -149,7 +149,7 @@ ENTRY(ia32_sysenter_target)
 	   32bit zero extended */
 	ASM_STAC
 1:	movl	(%rbp),%ebp
-	_ASM_EXTABLE(1b,ia32_badarg)
+	_ASM_EXTABLE_UACCESS(1b,ia32_badarg)
 	ASM_CLAC
 	orl	$TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP-ARGOFFSET)
 	testl	$_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
@@ -306,7 +306,7 @@ ENTRY(ia32_cstar_target)
 	/* hardware stack frame is complete now */
 	ASM_STAC
 1:	movl	(%r8),%r9d
-	_ASM_EXTABLE(1b,ia32_badarg)
+	_ASM_EXTABLE_UACCESS(1b,ia32_badarg)
 	ASM_CLAC
 	orl	$TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP-ARGOFFSET)
 	testl	$_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index fa47fd4..f48a850 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -57,14 +57,16 @@
  */
 
 /* There are two bits of extable entry class, added to a signed offset. */
-#define _EXTABLE_CLASS_DEFAULT	0	/* standard uaccess fixup */
+#define _EXTABLE_CLASS_UACCESS	0	/* standard uaccess fixup */
+#define _EXTABLE_CLASS_ANY	0x4000	/* catch any exception */
 #define _EXTABLE_CLASS_EX	0x8000	/* uaccess + set uaccess_err */
 
 /*
  * The biases are the class constants + 0x2000, as signed integers.
  * This can't use ordinary arithmetic -- the assembler isn't that smart.
  */
-#define _EXTABLE_BIAS_DEFAULT	0x2000
+#define _EXTABLE_BIAS_UACCESS	0x2000
+#define _EXTABLE_BIAS_ANY	0x2000 + 0x4000
 #define _EXTABLE_BIAS_EX	0x2000 - 0x8000
 
 #ifdef __ASSEMBLY__
@@ -85,8 +87,11 @@
 	.popsection\n
 #endif
 
-#define _ASM_EXTABLE(from,to)					\
-	_ASM_EXTABLE_CLASS(from, to, _EXTABLE_BIAS_DEFAULT)
+#define _ASM_EXTABLE_UACCESS(from,to)				\
+	_ASM_EXTABLE_CLASS(from, to, _EXTABLE_BIAS_UACCESS)
+
+#define _ASM_EXTABLE_ANY(from,to)				\
+	_ASM_EXTABLE_CLASS(from, to, _EXTABLE_BIAS_ANY)
 
 #define _ASM_EXTABLE_EX(from,to)				\
 	_ASM_EXTABLE_CLASS(from, to, _EXTABLE_BIAS_EX)
diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index e25cc33..7f86031 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -133,7 +133,7 @@ static inline void sanitize_i387_state(struct task_struct *tsk)
 		     "3:  movl $-1,%[err]\n"
 		     "    jmp  2b\n"
 		     ".previous\n"
-		     _ASM_EXTABLE(1b, 3b)
+		     _ASM_EXTABLE_UACCESS(1b, 3b)
Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 8465c2a..da6bf61 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 		break;
 #endif
 	case KVM_CAP_SPAPR_MULTITCE:
+	case KVM_CAP_SPAPR_TCE_IOMMU:
 		r = 1;
 		break;
 	default:

Don't advertise SPAPR capabilities if it's not book3s -- and probably
there's some additional limitation that would be appropriate.

@@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
 		goto out;
 	}
+	case KVM_CREATE_SPAPR_TCE_IOMMU: {
+		struct kvm_create_spapr_tce_iommu create_tce_iommu;
+		struct kvm *kvm = filp->private_data;
+
+		r = -EFAULT;
+		if (copy_from_user(&create_tce_iommu, argp,
+				sizeof(create_tce_iommu)))
+			goto out;
+		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu);
+		goto out;
+	}
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 5a2afda..450c82a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_RTAS 91
 #define KVM_CAP_IRQ_XICS 92
 #define KVM_CAP_SPAPR_MULTITCE (0x11 + 89)
+#define KVM_CAP_SPAPR_TCE_IOMMU (0x11 + 90)

Hmm...

@@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
 #define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
 #define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
 
+/* ioctl for SPAPR TCE IOMMU */
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct kvm_create_spapr_tce_iommu)

Shouldn't this go under the vm ioctl section?

-Scott
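For context, the new interface would be driven from userspace roughly as below (an editor's sketch; the field names of struct kvm_create_spapr_tce_iommu are assumptions inferred from the quoted hunks, not confirmed by this mail):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Sketch: create the in-kernel TCE/IOMMU acceleration object on a VM fd,
 * after checking the capability that the first hunk advertises.
 */
static int create_tce_iommu(int vm_fd)
{
	struct kvm_create_spapr_tce_iommu args = {
		.liobn    = 0x10000000,	/* hypothetical LIOBN */
		.iommu_id = 0,		/* hypothetical IOMMU group id */
		.flags    = 0,
	};

	if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_TCE_IOMMU) <= 0)
		return -1;	/* capability not present */

	return ioctl(vm_fd, KVM_CREATE_SPAPR_TCE_IOMMU, &args);
}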
Re: [RFC PATCH v2 2/2] add support for Hyper-V invariant TSC
On Wed, May 22, 2013 at 03:22:55AM -0400, Vadim Rozenfeld wrote:

----- Original Message -----
From: Marcelo Tosatti mtosa...@redhat.com
To: Vadim Rozenfeld vroze...@redhat.com
Cc: kvm@vger.kernel.org, g...@redhat.com, p...@dlh.net
Sent: Wednesday, May 22, 2013 10:50:46 AM
Subject: Re: [RFC PATCH v2 2/2] add support for Hyper-V invariant TSC

On Sun, May 19, 2013 at 05:06:37PM +1000, Vadim Rozenfeld wrote:

The following patch makes it possible to activate a partition reference
time enlightenment that is based on the host platform's support for an
Invariant Time Stamp Counter (iTSC).

NOTE: This code will not survive migration, due to the lack of VM
stop/resume handlers where offset, scale and sequence should be
readjusted.

---
 arch/x86/kvm/x86.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9645dab..b423fe4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1838,7 +1838,6 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 		u64 gfn;
 		unsigned long addr;
 		HV_REFERENCE_TSC_PAGE tsc_ref;
-		tsc_ref.TscSequence = 0;
 		if (!(data & HV_X64_MSR_TSC_REFERENCE_ENABLE)) {
 			kvm->arch.hv_tsc_page = data;
 			break;
@@ -1848,6 +1847,11 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 			    HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT);
 		if (kvm_is_error_hva(addr))
 			return 1;
+		tsc_ref.TscSequence =
+			boot_cpu_has(X86_FEATURE_CONSTANT_TSC) ? 1 : 0;

1) You want NONSTOP_TSC (see commit 40fb1715), which matches INVARIANT TSC.

[VR] Thank you for reviewing. Will fix it.

2) TscSequence should increase? "This field serves as a sequence number
that is incremented whenever..."

[VR] Yes, on every VM resume, including migration. After migration we also
need to recalculate the scale and adjust the offset.

3) 0xFFFFFFFF is the value for an invalid source of reference time?

[VR] Yes, on boot-up. In this case the guest will go with the PM timer (not
sure about HPET, but I can check). But if we set the sequence to 0xFFFFFFFF
after migration, it will probably not work.

"Reference TSC during Save and Restore and Migration: To address migration
scenarios to physical platforms that do not support iTSC, the TscSequence
field is used. In the event that a guest partition is migrated from an
iTSC-capable host to a non-iTSC-capable host, the hypervisor sets
TscSequence to the special value of 0xFFFFFFFF, which directs the guest
operating system to fall back to a different clock source (for example, the
virtual PM timer)."

Why would it not, or does it not, work after migration?
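For context, the guest consumes the page this patch fills in via the published Hyper-V reference-TSC algorithm: read TscSequence, compute the 100ns reference time, and retry if the sequence changed underneath. A sketch of the guest side (based on the published spec, not on this patch; rdtsc and fallback are placeholders supplied by the caller):

#include <stdint.h>

/* Layout per the Hyper-V spec (abridged). */
typedef struct {
	volatile uint32_t TscSequence;
	uint32_t Reserved;
	volatile uint64_t TscScale;
	volatile int64_t  TscOffset;
} HV_REFERENCE_TSC_PAGE;

/* Sketch of the guest-side read: 100ns reference time from the TSC. */
static uint64_t read_ref_time(HV_REFERENCE_TSC_PAGE *p,
			      uint64_t (*rdtsc)(void),
			      uint64_t (*fallback)(void))
{
	uint32_t seq;
	uint64_t time;

	do {
		seq = p->TscSequence;
		if (seq == 0xFFFFFFFF)	/* iTSC unusable: PM timer etc. */
			return fallback();
		/* time = ((tsc * scale) >> 64) + offset */
		time = (uint64_t)(((unsigned __int128)rdtsc() *
				   p->TscScale) >> 64) +
		       (uint64_t)p->TscOffset;
	} while (p->TscSequence != seq);	/* retry on concurrent update */

	return time;
}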
Re: [RFC PATCH v2 1/2] add support for Hyper-V reference time counter
On 22/05/2013 09:32, Vadim Rozenfeld wrote:

@@ -1827,6 +1829,29 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 		if (__copy_to_user((void __user *)addr, instructions, 4))
 			return 1;
 		kvm->arch.hv_hypercall = data;
+		local_irq_disable();
+		kvm->arch.hv_ref_count = get_kernel_ns();
+		local_irq_enable();
 		break;

local_irq_disable/local_irq_enable are not needed.

What is the reasoning behind reading this time value at MSR-write time?

[VR] Windows writes this MSR only once, during HAL initialization. So I
decided to treat this call as a partition create event.

But is it expected by Windows that the reference count starts counting up
from 0 at partition creation time? If you could just use
(get_kernel_ns() + kvm->arch.kvmclock_offset) / 100, it would also be
simpler for migration purposes.

Paolo
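Concretely, Paolo's suggestion amounts to computing the reference counter on each read from the same base kvmclock uses, so no extra state needs to be migrated. A minimal sketch of what the read side would look like (an editor's illustration, not a posted patch):

/*
 * Sketch: derive the Hyper-V partition reference counter (100ns units)
 * from the guest's nanosecond clock, the same base kvmclock uses.
 */
static u64 get_time_ref_counter(struct kvm *kvm)
{
	return div_u64(get_kernel_ns() + kvm->arch.kvmclock_offset, 100);
}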
[PATCH] qemu-kvm: fix unmatched RAM alloction/free
mmap is used in the qemu_vmalloc function instead of qemu_memalign (commit
7dda5dc8), so qemu_vfree should be changed to munmap to fix an unmatched
allocation/free pair. This issue appears when a PCI device is being
assigned to a KVM guest: a failure to read the PCI rom file triggers the
RAM free, and the incorrect qemu_vfree call then causes a segmentation
fault.

Signed-off-by: Xudong Hao xudong@intel.com
---
 exec.c |    6 +-----
 1 files changed, 1 insertions(+), 5 deletions(-)

diff --git a/exec.c b/exec.c
index fa1e0c3..d40d237 100644
--- a/exec.c
+++ b/exec.c
@@ -1152,15 +1152,11 @@ void qemu_ram_free(ram_addr_t addr)
             abort();
 #endif
         } else {
-#if defined(TARGET_S390X) && defined(CONFIG_KVM)
-            munmap(block->host, block->length);
-#else
             if (xen_enabled()) {
                 xen_invalidate_map_cache_entry(block->host);
             } else {
-                qemu_vfree(block->host);
+                munmap(block->host, block->length);
             }
-#endif
         }
         g_free(block);
         break;
--
1.5.6
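The reason munmap() is the right release call: since commit 7dda5dc8, qemu_vmalloc obtains guest RAM with mmap, roughly like the simplified sketch below (an editor's illustration, not the exact QEMU code). Freeing mmap'ed memory through the allocator-family qemu_vfree hands a non-malloc pointer to free(), which is what crashes.

#include <sys/mman.h>
#include <stddef.h>

/* Simplified sketch: memory obtained with mmap must be released with
 * munmap; it cannot be handed to the malloc-family free path. */
static void *ram_alloc(size_t size)
{
    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return ptr == MAP_FAILED ? NULL : ptr;
}

static void ram_free(void *ptr, size_t size)
{
    munmap(ptr, size);          /* matches the mmap above */
}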
Re: provide vhost thread per virtqueue for forwarding scenario
Jason Wang jasow...@redhat.com writes:

On 05/22/2013 05:59 PM, Zang Hongyong wrote:

On 2013/5/20 15:43, Michael S. Tsirkin wrote:

On Mon, May 20, 2013 at 02:11:19AM, Qinchuanyu wrote:

Yes, I don't think we want to create threads even more aggressively in all
cases. I'm worried about scalability as it is. I think we should explore a
flexible approach: use a thread pool (for example, a wq) to share threads
between virtqueues, and switch to a separate thread only if there's a free
CPU and existing threads are busy. Hopefully share threads between vhost
instances too.

On the Xen platform, the network backend pv driver model has evolved this
way. Netbacks from all DomUs share a thread pool, and the thread number
equals the cpu core number. Is there any plan for the kvm platform?

There used to be two related RFCs for this: one is the multiple vhost
workers from Anthony, another is the percpu vhost thread from Shirley. You
can search the archives on netdev or kvm for the patches.

As I've said to MST before, I think our entire model is wrong. Userspace
should create the threads and call in. If you're doing kernel acceleration,
two extra threads per NIC is a tiny overhead.

Of course, such radical changes to vhost don't help existing users, as
Qinchuanyu asked...

Cheers,
Rusty.
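A minimal sketch of the thread-pool direction discussed above (purely illustrative; the structures and fields are hypothetical and not from any posted RFC):

#include <linux/workqueue.h>

/* Hypothetical: one work item per virtqueue, all serviced by a shared
 * workqueue instead of one kthread per vhost device. */
struct vq_work {
	struct work_struct work;
	void (*handle_kick)(struct vq_work *vw);	/* hypothetical */
};

static struct workqueue_struct *vhost_wq;	/* shared pool */

static void vq_work_fn(struct work_struct *work)
{
	struct vq_work *vw = container_of(work, struct vq_work, work);

	vw->handle_kick(vw);
}

/* On a guest kick, queue the virtqueue for service by the pool;
 * the scheduler spreads the pool's workers across idle CPUs. */
static void vq_kick(struct vq_work *vw)
{
	queue_work(vhost_wq, &vw->work);
}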
Re: [PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page
On Thu, May 23, 2013 at 03:55:58AM +0800, Xiao Guangrong wrote:

It is only used to zap obsolete pages. Since an obsolete page will not be
used again, we need not spend time finding its unsync children. Also, we
delete the page from the shadow page cache so that the page is completely
isolated after calling this function.

A later patch will use it to collapse tlb flushes.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |   46 +++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9b57faa..e676356 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
 static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 {
 	ASSERT(is_empty_shadow_page(sp->spt));
-	hlist_del(&sp->hash_link);
+	hlist_del_init(&sp->hash_link);

Why do you need hlist_del_init() here? Why not move it into
kvm_mmu_prepare_zap_page() like we discussed here:
https://patchwork.kernel.org/patch/2580351/
instead of doing it differently for obsolete and non-obsolete pages?

 	list_del(&sp->link);
 	free_page((unsigned long)sp->spt);
 	if (!sp->role.direct)
@@ -2069,14 +2069,19 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
 	return zapped;
 }
 
-static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
-				    struct list_head *invalid_list)
+static int
+__kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+			   bool zap_unsync_children,
+			   struct list_head *invalid_list)
 {
-	int ret;
+	int ret = 0;
 
 	trace_kvm_mmu_prepare_zap_page(sp);
 	++kvm->stat.mmu_shadow_zapped;
-	ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
+
+	if (likely(zap_unsync_children))
+		ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
+
 	kvm_mmu_page_unlink_children(kvm, sp);
 	kvm_mmu_unlink_parents(kvm, sp);
 
@@ -2099,6 +2104,37 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 	return ret;
 }
 
+/*
+ * The obsolete page will not be used, so we need not spend time finding
+ * its unsync children. Also, we delete the page from the shadow page
+ * cache so that the page is completely isolated after calling this
+ * function.
+ *
+ * Note: if we use this function in for_each_gfn_xxx macros, we should
+ * re-walk the list when it successfully zaps one page.
+ */
+static int
+kvm_mmu_prepare_zap_obsolete_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+				  struct list_head *invalid_list)
+{
+	int ret;
+
+	WARN_ON(!is_obsolete_sp(kvm, sp));
+
+	ret = __kvm_mmu_prepare_zap_page(kvm, sp, false, invalid_list);
+	if (ret)
+		hlist_del_init(&sp->hash_link);

Why is hlist_del() not enough?

+
+	WARN_ON(ret > 1);
+	return ret;
+}
+
+static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+				    struct list_head *invalid_list)
+{
+	return __kvm_mmu_prepare_zap_page(kvm, sp, true, invalid_list);
+}
+
 static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 				    struct list_head *invalid_list)
 {
--
1.7.7.6

--
Gleb.