[RFC PATCH v5] add support for Hyper-V reference time counter
Signed-off-by: Peter Lieven <p...@kamp.de>
Signed-off-by: Gleb Natapov
Signed-off-by: Vadim Rozenfeld <vroze...@redhat.com>

After some consideration I decided to submit only Hyper-V reference
counter support this time. I will submit iTSC support as a separate
patch as soon as it is ready.

v1 -> v2
1. mark TSC page dirty as suggested by Eric Northup <digitale...@google.com> and Gleb
2. disable local irq when calling get_kernel_ns, as it was done by Peter Lieven <p...@kamp.de>
3. move check for TSC page enable from second patch to this one

v3 -> v4
Get rid of ref counter offset.

v4 -> v5
Replace __copy_to_user with kvm_write_guest when updating the TSC page.
---
 arch/x86/include/asm/kvm_host.h    |  1 +
 arch/x86/include/uapi/asm/hyperv.h | 13 +
 arch/x86/kvm/x86.c                 | 28 +++-
 include/uapi/linux/kvm.h           |  1 +
 4 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ae5d783..33fef07 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -605,6 +605,7 @@ struct kvm_arch {
	/* fields used by HYPER-V emulation */
	u64 hv_guest_os_id;
	u64 hv_hypercall;
+	u64 hv_tsc_page;

 #ifdef CONFIG_KVM_MMU_AUDIT
	int audit_point;
diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h
index b8f1c01..462efe7 100644
--- a/arch/x86/include/uapi/asm/hyperv.h
+++ b/arch/x86/include/uapi/asm/hyperv.h
@@ -28,6 +28,9 @@
 /* Partition Reference Counter (HV_X64_MSR_TIME_REF_COUNT) available */
 #define HV_X64_MSR_TIME_REF_COUNT_AVAILABLE	(1 << 1)

+/* A partition's reference time stamp counter (TSC) page */
+#define HV_X64_MSR_REFERENCE_TSC		0x40000021
+
 /*
  * There is a single feature flag that signifies the presence of the MSR
  * that can be used to retrieve both the local APIC Timer frequency as
@@ -198,6 +201,9 @@
 #define HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_MASK	\
		(~((1ull << HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_SHIFT) - 1))

+#define HV_X64_MSR_TSC_REFERENCE_ENABLE		0x00000001
+#define HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT	12
+
 #define HV_PROCESSOR_POWER_STATE_C0		0
 #define HV_PROCESSOR_POWER_STATE_C1		1
 #define HV_PROCESSOR_POWER_STATE_C2		2
@@ -210,4 +216,11 @@
 #define HV_STATUS_INVALID_ALIGNMENT		4
 #define HV_STATUS_INSUFFICIENT_BUFFERS		19

+typedef struct _HV_REFERENCE_TSC_PAGE {
+	__u32 tsc_sequence;
+	__u32 res1;
+	__u64 tsc_scale;
+	__s64 tsc_offset;
+} HV_REFERENCE_TSC_PAGE, *PHV_REFERENCE_TSC_PAGE;
+
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5d004da..8e685b8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -836,11 +836,12 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc);
  * kvm-specific. Those are put in the beginning of the list.
  */

-#define KVM_SAVE_MSRS_BEGIN	10
+#define KVM_SAVE_MSRS_BEGIN	12
 static u32 msrs_to_save[] = {
	MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
	MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
+	HV_X64_MSR_TIME_REF_COUNT, HV_X64_MSR_REFERENCE_TSC,
	HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
	MSR_KVM_PV_EOI_EN,
	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
@@ -1826,6 +1827,8 @@ static bool kvm_hv_msr_partition_wide(u32 msr)
	switch (msr) {
	case HV_X64_MSR_GUEST_OS_ID:
	case HV_X64_MSR_HYPERCALL:
+	case HV_X64_MSR_REFERENCE_TSC:
+	case HV_X64_MSR_TIME_REF_COUNT:
		r = true;
		break;
	}
@@ -1867,6 +1870,20 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data)
		kvm->arch.hv_hypercall = data;
		break;
	}
+	case HV_X64_MSR_REFERENCE_TSC: {
+		u64 gfn;
+		HV_REFERENCE_TSC_PAGE tsc_ref;
+		memset(&tsc_ref, 0, sizeof(tsc_ref));
+		kvm->arch.hv_tsc_page = data;
+		if (!(data & HV_X64_MSR_TSC_REFERENCE_ENABLE))
+			break;
+		gfn = data >> HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT;
+		if (kvm_write_guest(kvm, data,
+			&tsc_ref, sizeof(tsc_ref)))
+			return 1;
+		mark_page_dirty(kvm, gfn);
+		break;
+	}
	default:
		vcpu_unimpl(vcpu, "HYPER-V unimplemented wrmsr: 0x%x data 0x%llx\n",
			    msr, data);
@@ -2291,6 +2308,14 @@ static int get_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
	case HV_X64_MSR_HYPERCALL:
		data = kvm->arch.hv_hypercall;
		break;
+	case HV_X64_MSR_TIME_REF_COUNT: {
+		data = div_u64(get_kernel_ns() +
[PATCH] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING
When starting lots of dataplane devices the bootup takes very long on
my s390 system (prototype irqfd code). With larger setups we are even
able to trigger some timeouts in some components. It turns out that the
KVM_SET_GSI_ROUTING ioctl takes very long (strace claims up to 0.1 sec)
when having multiple CPUs. This is caused by the synchronize_rcu and
the HZ=100 of s390. We can defer the freeing outside of the ioctl path
by using kfree_rcu. Please note that we now have to check for a NULL
pointer, since the underlying call_rcu in kfree_rcu cannot handle
broken rcu callbacks.

This patch reduces the boot time till mounting root from 8 to 2
seconds on my s390 guest with 100 disks.

Signed-off-by: Christian Borntraeger <borntrae...@de.ibm.com>
---
 include/linux/kvm_host.h | 1 +
 virt/kvm/irqchip.c       | 5 ++---
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3d1b0e6..2f19079 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -327,6 +327,7 @@ struct kvm_kernel_irq_routing_entry {
 #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING

 struct kvm_irq_routing_table {
+	struct rcu_head rcu;
	int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
	struct kvm_kernel_irq_routing_entry *rt_entries;
	u32 nr_rt_entries;
diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
index 20dc9e4..3e2ebed 100644
--- a/virt/kvm/irqchip.c
+++ b/virt/kvm/irqchip.c
@@ -226,12 +226,11 @@ int kvm_set_irq_routing(struct kvm *kvm,
	kvm_irq_routing_update(kvm, new);
	mutex_unlock(&kvm->irq_lock);

-	synchronize_rcu();
-
	new = old;
	r = 0;

 out:
-	kfree(new);
+	if (new)
+		kfree_rcu(new, rcu);
	return r;
 }
-- 
1.8.4.2
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING
On 16/01/2014 10:23, Christian Borntraeger wrote:
> When starting lots of dataplane devices the bootup takes very long on
> my s390 system (prototype irqfd code). [...] This patch reduces the
> boot time till mounting root from 8 to 2 seconds on my s390 guest
> with 100 disks.
>
> Signed-off-by: Christian Borntraeger <borntrae...@de.ibm.com>

Hi Christian,

please take a look at the thread at
http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/116933 - uses
of call_rcu need to be rate limited, otherwise the guest can allocate
arbitrarily large amounts of memory.

Can you check if using a dedicated SRCU (possibly with
synchronize_srcu_expedited in kvm_set_irq_routing) speeds up
KVM_SET_GSI_ROUTING enough? Back-of-the-envelope, the latency should be
on the order of tens of microseconds.
Thanks,

Paolo
[GIT PULL] KVM changes for 3.13 final
Linus,

The following changes since commit 7e22e91102c6b9df7c4ae2168910e19d2bb14cd6:

  Linux 3.13-rc8 (2014-01-12 17:04:18 +0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/for-linus

for you to fetch changes up to 0dce7cd67fd9055c4a2ff278f8af1431e646d346:

  kvm: x86: fix apic_base enable check (2014-01-15 13:42:14 +0100)

Fix for a brown paper bag bug. Thanks to Drew Jones for noticing.

Andrew Jones (1):
      kvm: x86: fix apic_base enable check

 arch/x86/kvm/lapic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[PATCHv2/RFC] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING
When starting lots of dataplane devices the bootup takes very long on
my s390 system (prototype irqfd code). With larger setups we are even
able to trigger some timeouts in some components. It turns out that the
KVM_SET_GSI_ROUTING ioctl takes very long (strace claims up to 0.1 sec)
when having multiple CPUs. This is caused by the synchronize_rcu and
the HZ=100 of s390. By changing the code to use a private srcu we can
speed things up.

This patch reduces the boot time till mounting root from 8 to 2
seconds on my s390 guest with 100 disks.

I converted most of the rcu routines to srcu. Review for the
unconverted use of hlist_for_each_entry_rcu, hlist_add_head_rcu,
hlist_del_init_rcu is necessary, though. They look fine to me since
they are protected by outer functions. In addition, we should also
discuss if a global srcu (for all guests) is fine.

Signed-off-by: Christian Borntraeger <borntrae...@de.ibm.com>
---
 virt/kvm/irqchip.c | 31 +--
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
index 20dc9e4..5283eb8 100644
--- a/virt/kvm/irqchip.c
+++ b/virt/kvm/irqchip.c
@@ -26,17 +26,20 @@

 #include <linux/kvm_host.h>
 #include <linux/slab.h>
+#include <linux/srcu.h>
 #include <linux/export.h>
 #include <trace/events/kvm.h>
 #include "irq.h"

+DEFINE_STATIC_SRCU(irq_srcu);
+
 bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin)
 {
	struct kvm_irq_ack_notifier *kian;
-	int gsi;
+	int gsi, idx;

-	rcu_read_lock();
-	gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
+	idx = srcu_read_lock(&irq_srcu);
+	gsi = srcu_dereference(kvm->irq_routing, &irq_srcu)->chip[irqchip][pin];
	if (gsi != -1)
		hlist_for_each_entry_rcu(kian, &kvm->irq_ack_notifier_list,
					 link)
@@ -45,7 +48,7 @@ bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin)
				return true;
			}

-	rcu_read_unlock();
+	srcu_read_unlock(&irq_srcu, idx);

	return false;
 }
@@ -54,18 +57,18 @@ EXPORT_SYMBOL_GPL(kvm_irq_has_notifier);

 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
 {
	struct kvm_irq_ack_notifier *kian;
-	int gsi;
+	int gsi, idx;

	trace_kvm_ack_irq(irqchip, pin);

-	rcu_read_lock();
-	gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
+	idx = srcu_read_lock(&irq_srcu);
+	gsi = srcu_dereference(kvm->irq_routing, &irq_srcu)->chip[irqchip][pin];
	if (gsi != -1)
		hlist_for_each_entry_rcu(kian, &kvm->irq_ack_notifier_list,
					 link)
			if (kian->gsi == gsi)
				kian->irq_acked(kian);
-	rcu_read_unlock();
+	srcu_read_unlock(&irq_srcu, idx);
 }

 void kvm_register_irq_ack_notifier(struct kvm *kvm,
@@ -85,7 +88,7 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
	mutex_lock(&kvm->irq_lock);
	hlist_del_init_rcu(&kian->link);
	mutex_unlock(&kvm->irq_lock);
-	synchronize_rcu();
+	synchronize_srcu_expedited(&irq_srcu);
 #ifdef __KVM_HAVE_IOAPIC
	kvm_vcpu_request_scan_ioapic(kvm);
 #endif
@@ -115,7 +118,7 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level,
		bool line_status)
 {
	struct kvm_kernel_irq_routing_entry *e, irq_set[KVM_NR_IRQCHIPS];
-	int ret = -1, i = 0;
+	int ret = -1, i = 0, idx;
	struct kvm_irq_routing_table *irq_rt;

	trace_kvm_set_irq(irq, level, irq_source_id);
@@ -124,12 +127,12 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level,
	 * IOAPIC.  So set the bit in both. The guest will ignore
	 * writes to the unused one.
	 */
-	rcu_read_lock();
-	irq_rt = rcu_dereference(kvm->irq_routing);
+	idx = srcu_read_lock(&irq_srcu);
+	irq_rt = srcu_dereference(kvm->irq_routing, &irq_srcu);
	if (irq < irq_rt->nr_rt_entries)
		hlist_for_each_entry(e, &irq_rt->map[irq], link)
			irq_set[i++] = *e;
-	rcu_read_unlock();
+	srcu_read_unlock(&irq_srcu, idx);

	while(i--) {
		int r;
@@ -226,7 +229,7 @@ int kvm_set_irq_routing(struct kvm *kvm,
	kvm_irq_routing_update(kvm, new);
	mutex_unlock(&kvm->irq_lock);

-	synchronize_rcu();
+	synchronize_srcu_expedited(&irq_srcu);

	new = old;
	r = 0;
-- 
1.8.4.2
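For readers unfamiliar with the pattern, the conversion above boils down to the following kernel-API sketch (pseudocode-level, not standalone-compilable; the point is that the read side stays cheap while the writer waits only on readers of this one srcu_struct rather than on a global RCU grace period):

```c
/* Read side: cheap and nestable; the returned index must be passed
 * back to srcu_read_unlock(). */
int idx = srcu_read_lock(&irq_srcu);
struct kvm_irq_routing_table *irq_rt =
	srcu_dereference(kvm->irq_routing, &irq_srcu);
/* ... use irq_rt under the SRCU read lock ... */
srcu_read_unlock(&irq_srcu, idx);

/* Update side: publish the new table, then wait only for readers of
 * this particular srcu_struct. synchronize_srcu_expedited() detects
 * the grace period much faster than synchronize_rcu() does on a
 * HZ=100 kernel, which is exactly the s390 bottleneck described. */
rcu_assign_pointer(kvm->irq_routing, new);
synchronize_srcu_expedited(&irq_srcu);
kfree(old);
```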
Re: [PATCHv2/RFC] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING
On 16/01/2014 13:44, Christian Borntraeger wrote:
> I converted most of the rcu routines to srcu. Review for the
> unconverted use of hlist_for_each_entry_rcu, hlist_add_head_rcu,
> hlist_del_init_rcu is necessary, though. They look fine to me since
> they are protected by outer functions.

They are fine because they do not have lockdep checks
(hlist_for_each_entry_rcu uses rcu_dereference_raw rather than
rcu_dereference, and write-sides do not do rcu lockdep at all).

> In addition, we should also discuss if a global srcu (for all guests)
> is fine.

I think it is. It is already way cheaper than it used to be, and we're
hardly relying on the sleepable part of srcu. We just want its faster
detection of grace periods. One instance should be fine because our
read sides are so small and mostly they are not even preemptable.

Thanks for writing the patch!

Paolo
Re: [PATCHv2/RFC] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING
On 16/01/14 13:59, Paolo Bonzini wrote:
> [...]
> Thanks for writing the patch!

Will you edit the patch description or shall I resend the patch?
Re: [PATCHv2/RFC] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING
On 16/01/2014 14:06, Christian Borntraeger wrote:
> Will you edit the patch description or shall I resend the patch?

I can edit the commit message.

Paolo
Re: [RFC PATCH v4 2/2] add support for Hyper-V partition reference time enlightenment
On 14/01/2014 10:22, Vadim Rozenfeld wrote:
> @@ -1883,6 +1884,13 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>  		addr = gfn_to_hva(kvm, gfn);
>  		if (kvm_is_error_hva(addr))
>  			return 1;
> +		if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> +		    ka->use_master_clock) {
> +			tsc_ref.tsc_sequence = 1;

I think that after the first migration the sequence number will always
be 2: set_msr_hyperv_pw will set it to 1, and KVM_SET_KVMCLOCK will
increment it. Can you check this?

The solution would be to propagate the struct msr_data * argument from
kvm_set_msr_common to kvm_set_msr_hyper_pw. Then, here in this "if" you
can test msr_info->host_initiated, and increment tsc_sequence instead
of resetting it. Like

	if ((!msr_info->host_initiated || tsc_ref.tsc_sequence) &&
	    boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
	    ka->use_master_clock) {
		tsc_ref.tsc_sequence =
			msr_info->host_initiated ? tsc_ref.tsc_sequence + 1 : 1;

> +			tsc_ref.tsc_scale = ((1LL << 32) /
> +				vcpu->arch.virtual_tsc_khz) << 32;
> +			tsc_ref.tsc_offset = kvm->arch.kvmclock_offset;
> +		}
>  		if (__copy_to_user((void __user *)addr, &tsc_ref, sizeof(tsc_ref)))
>  			return 1;
>  		mark_page_dirty(kvm, gfn);
> @@ -3871,6 +3879,27 @@ long kvm_arch_vm_ioctl(struct file *filp,
>  		local_irq_enable();
>  		kvm->arch.kvmclock_offset = delta;
>  		kvm_gen_update_masterclock(kvm);
> +		if (kvm->arch.hv_tsc_page & HV_X64_MSR_TSC_REFERENCE_ENABLE) {
> +			HV_REFERENCE_TSC_PAGE tsc_ref;
> +			struct kvm_arch *ka = &kvm->arch;
> +			r = kvm_read_guest(kvm, kvm->arch.hv_tsc_page,
> +					   &tsc_ref, sizeof(tsc_ref));
> +			if (r)
> +				goto out;
> +			if (tsc_ref.tsc_sequence &&
> +			    boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> +			    ka->use_master_clock) {
> +				tsc_ref.tsc_sequence++;
> +				tsc_ref.tsc_scale = ((1LL << 32) /
> +					__get_cpu_var(cpu_tsc_khz)) << 32;

Since on migration kvm_set_hyperv_pw will always be called, do you need
to write tsc_ref.tsc_scale at all here?

Paolo

> +				tsc_ref.tsc_offset = kvm->arch.kvmclock_offset;
> +			} else
> +				tsc_ref.tsc_sequence = 0;
> +			r = kvm_write_guest(kvm, kvm->arch.hv_tsc_page,
> +					    &tsc_ref, sizeof(tsc_ref));
> +			if (r)
> +				goto out;
> +		}
>  		break;
>  	}
>  	case KVM_GET_CLOCK: {
Re: [PATCH 04/12] KVM: x86: Validate guest writes to MSR_IA32_APICBASE
On 04/01/2014 18:47, Jan Kiszka wrote:
> +	u64 old_state = vcpu->arch.apic_base &
> +		(MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE);
> +	u64 new_state = msr_info->data &
> +		(MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE);
> +	u64 reserved_bits = ((~0ULL) << boot_cpu_data.x86_phys_bits) | 0x2ff |
> +		(guest_cpuid_has_x2apic(vcpu) ? 0 : X2APIC_ENABLE);
> +

Should this use the guest CPUID instead?

> +	if (!msr_info->host_initiated

Is this check on host_initiated just for backwards compatibility, or is
there another case that I am missing?

Paolo
Re: [PATCH 04/12] KVM: x86: Validate guest writes to MSR_IA32_APICBASE
On 2014-01-16 15:07, Paolo Bonzini wrote:
> On 04/01/2014 18:47, Jan Kiszka wrote:
>> +	u64 old_state = vcpu->arch.apic_base &
>> +		(MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE);
>> +	u64 new_state = msr_info->data &
>> +		(MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE);
>> +	u64 reserved_bits = ((~0ULL) << boot_cpu_data.x86_phys_bits) | 0x2ff |
>> +		(guest_cpuid_has_x2apic(vcpu) ? 0 : X2APIC_ENABLE);
>> +
>
> Should this use the guest CPUID instead?

Hmm, they may differ... Then yes.

>> +	if (!msr_info->host_initiated
>
> Is this check on host_initiated just for backwards compatibility, or
> is there another case that I am missing?

The path is taken for both host-initiated and guest-initiated APICBASE
updates. Host-initiated ones are allowed to perform architecturally
invalid state transitions. And the MSR is emulated, so if they like to
set a reserved bit...

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
[PATCH] KVM, XEN: Fix potential race in pvclock code
The paravirtualized clock used in KVM and Xen uses a version field to
allow the guest to see when the shared data structure is inconsistent.
The code reads the version field twice (before and after the data
structure is copied) and checks whether they haven't changed and that
the lowest bit is not set.

As the second access is not synchronized, the compiler could generate
code that accesses version more than two times and you end up with
inconsistent data. An example using pvclock_get_time_values:

host starts updating data, sets src->version to 1
guest reads src->version (1) and stores it into dst->version
guest copies inconsistent data
guest reads src->version (1) and computes xor with dst->version
host finishes updating data and sets src->version to 2
guest reads src->version (2) and checks whether lower bit is not set
while loop exits with inconsistent data!

AFAICS the compiler is allowed to optimize the given code this way.

Signed-off-by: Julian Stecklina <jstec...@os.inf.tu-dresden.de>
---
 arch/x86/kernel/pvclock.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 42eb330..f62b41c 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -55,6 +55,8 @@ static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
 static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
					struct pvclock_vcpu_time_info *src)
 {
+	u32 nversion;
+
	do {
		dst->version = src->version;
		rmb();		/* fetch version before data */
@@ -64,7 +66,8 @@ static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
		dst->tsc_shift = src->tsc_shift;
		dst->flags = src->flags;
		rmb();		/* test version after fetching data */
-	} while ((src->version & 1) || (dst->version != src->version));
+		nversion = ACCESS_ONCE(src->version);
+	} while ((nversion & 1) || (dst->version != nversion));

	return dst->version;
 }
@@ -135,7 +138,7 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,
			    struct pvclock_vcpu_time_info *vcpu_time,
			    struct timespec *ts)
 {
-	u32 version;
+	u32 version, nversion;
	u64 delta;
	struct timespec now;

@@ -146,7 +149,8 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,
		now.tv_sec = wall_clock->sec;
		now.tv_nsec = wall_clock->nsec;
		rmb();		/* fetch time before checking version */
-	} while ((wall_clock->version & 1) || (version != wall_clock->version));
+		nversion = ACCESS_ONCE(wall_clock->version);
+	} while ((nversion & 1) || (version != nversion));

	delta = pvclock_clocksource_read(vcpu_time);	/* time since system boot */
	delta += now.tv_sec * (u64)NSEC_PER_SEC + now.tv_nsec;
-- 
1.8.4.2
Re: [Xen-devel] [PATCH] KVM, XEN: Fix potential race in pvclock code
On 16.01.14 at 15:13, Julian Stecklina <jstec...@os.inf.tu-dresden.de> wrote:
> As the second access is not synchronized, the compiler could generate
> code that accesses version more than two times and you end up with
> inconsistent data.
> [...]
> AFAICS the compiler is allowed to optimize the given code this way.

I don't think so - this would only be an issue if the conditions used |
instead of ||. || implies a sequence point between evaluating the left
and right sides, and the standard says: "The presence of a sequence
point between the evaluation of expressions A and B implies that every
value computation and side effect associated with A is sequenced before
every value computation and side effect associated with B."

And even if there was a problem (i.e. my interpretation of the above
being incorrect), I don't think you'd need ACCESS_ONCE() here: the same
local variable can't have two different values in two different use
sites when there was no intermediate assignment to it.

Interestingly, the old XenoLinux code uses | instead of || in both
cases, yet only one of the two also entertains the use of a local
variable. I shall fix this (read: thanks for pointing this out, even if
in the upstream incarnation it is not a problem).

Jan
Re: [PATCH 00/12] KVM: x86: Fixes for debug registers, IA32_APIC_BASE, and nVMX
On 2014-01-16 16:08, Paolo Bonzini wrote:
> On 04/01/2014 18:47, Jan Kiszka wrote:
>> This is on top of next after merging in the two patches of mine that
>> are only present in master ATM.
>>
>> Highlights:
>> - reworked fix of DR6 reading on SVM
>> - full check for invalid writes to IA32_APIC_BASE
>> - fixed support for halting in L2 (nVMX)
>> - fully emulated preemption timer (nVMX)
>> - tracing of nested vmexits (nVMX)
>>
>> The patch "KVM: nVMX: Leave VMX mode on clearing of feature control
>> MSR" is included again, unchanged from previous posting.
>>
>> Most fixes are backed by KVM unit tests, to be posted soon as well.
>
> I'm applying patches 1-10 (for now to kvm/queue). For the last two, I
> prefer to wait for 3.15.

Should we disable the broken features in 3.14?

> Also, for patch 11 I would really prefer to use check_nested_events
> for both VMX and SVM. I will look at SVM next week.

OK, thanks. I will send an updated patch 11 and also a patch on top of
4 to read the physical bit width from the guest cpuid.

Jan

> Thanks,
> Paolo
>
>> Jan Kiszka (12):
>>   KVM: x86: Sync DR7 on KVM_SET_DEBUGREGS
>>   KVM: SVM: Fix reading of DR6
>>   KVM: VMX: Fix DR6 update on #DB exception
>>   KVM: x86: Validate guest writes to MSR_IA32_APICBASE
>>   KVM: nVMX: Leave VMX mode on clearing of feature control MSR
>>   KVM: nVMX: Pass vmexit parameters to nested_vmx_vmexit
>>   KVM: nVMX: Add tracepoints for nested_vmexit and nested_vmexit_inject
>>   KVM: nVMX: Clean up handling of VMX-related MSRs
>>   KVM: nVMX: Fix nested_run_pending on activity state HLT
>>   KVM: nVMX: Update guest activity state field on L2 exits
>>   KVM: nVMX: Rework interception of IRQs and NMIs
>>   KVM: nVMX: Fully emulate preemption timer
>>
>>  arch/x86/include/asm/kvm_host.h       |   4 +
>>  arch/x86/include/uapi/asm/msr-index.h |   1 +
>>  arch/x86/kvm/cpuid.h                  |   8 +
>>  arch/x86/kvm/lapic.h                  |   2 +-
>>  arch/x86/kvm/svm.c                    |  15 ++
>>  arch/x86/kvm/vmx.c                    | 399 --
>>  arch/x86/kvm/x86.c                    |  67 +-
>>  7 files changed, 318 insertions(+), 178 deletions(-)
Re: [PATCH 11/12] KVM: nVMX: Rework interception of IRQs and NMIs
On 04/01/2014 18:47, Jan Kiszka wrote:
From: Jan Kiszka jan.kis...@siemens.com

Move the check for leaving L2 on pending and intercepted IRQs or NMIs from the *_allowed handler into a dedicated callback. Invoke this callback at the relevant points before KVM checks if IRQs/NMIs can be injected. The callback has the task to switch from L2 to L1 if needed and inject the proper vmexit events. The rework fixes L2 wakeups from HLT and provides the foundation for preemption timer emulation.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/vmx.c              | 67 +++--
 arch/x86/kvm/x86.c              | 15 +++--
 3 files changed, 53 insertions(+), 31 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e73651b..d195421 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -764,6 +764,8 @@ struct kvm_x86_ops {
 			       struct x86_instruction_info *info,
 			       enum x86_intercept_stage stage);
 	void (*handle_external_intr)(struct kvm_vcpu *vcpu);
+
+	int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr);
 };
 
 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1245ff1..ec8a976 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4620,22 +4620,8 @@ static void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked)
 
 static int vmx_nmi_allowed(struct kvm_vcpu *vcpu)
 {
-	if (is_guest_mode(vcpu)) {
-		if (to_vmx(vcpu)->nested.nested_run_pending)
-			return 0;
-		if (nested_exit_on_nmi(vcpu)) {
-			nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
-					  NMI_VECTOR | INTR_TYPE_NMI_INTR |
-					  INTR_INFO_VALID_MASK, 0);
-			/*
-			 * The NMI-triggered VM exit counts as injection:
-			 * clear this one and block further NMIs.
-			 */
-			vcpu->arch.nmi_pending = 0;
-			vmx_set_nmi_mask(vcpu, true);
-			return 0;
-		}
-	}
+	if (to_vmx(vcpu)->nested.nested_run_pending)
+		return 0;
 
 	if (!cpu_has_virtual_nmis() && to_vmx(vcpu)->soft_vnmi_blocked)
 		return 0;
@@ -4647,19 +4633,8 @@ static int vmx_nmi_allowed(struct kvm_vcpu *vcpu)
 
 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
-	if (is_guest_mode(vcpu)) {
-		if (to_vmx(vcpu)->nested.nested_run_pending)
-			return 0;
-		if (nested_exit_on_intr(vcpu)) {
-			nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT,
-					  0, 0);
-			/*
-			 * fall through to normal code, but now in L1, not L2
-			 */
-		}
-	}
-
-	return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
+	return (!to_vmx(vcpu)->nested.nested_run_pending &&
+		vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
 		!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
 		  (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
 }
@@ -8158,6 +8133,35 @@ static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu,
 	}
 }
 
+static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (vcpu->arch.nmi_pending && nested_exit_on_nmi(vcpu)) {
+		if (vmx->nested.nested_run_pending)
+			return -EBUSY;
+		nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
+				  NMI_VECTOR | INTR_TYPE_NMI_INTR |
+				  INTR_INFO_VALID_MASK, 0);
+		/*
+		 * The NMI-triggered VM exit counts as injection:
+		 * clear this one and block further NMIs.
+		 */
+		vcpu->arch.nmi_pending = 0;
+		vmx_set_nmi_mask(vcpu, true);
+		return 0;
+	}
+
+	if ((kvm_cpu_has_interrupt(vcpu) || external_intr) &&
+	    nested_exit_on_intr(vcpu)) {
+		if (vmx->nested.nested_run_pending)
+			return -EBUSY;
+		nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0);
+	}
+
+	return 0;
+}
+
 /*
  * prepare_vmcs12 is part of what we need to do when the nested L2 guest exits
  * and we want to prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12),
@@ -8498,6 +8502,9 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
 	nested_vmx_succeed(vcpu);
 	if (enable_shadow_vmcs)
 		vmx->nested.sync_shadow_vmcs = true;
+
+
Re: [PATCH 00/12] KVM: x86: Fixes for debug registers, IA32_APIC_BASE, and nVMX
On 04/01/2014 18:47, Jan Kiszka wrote:
This is on top of next after merging in the two patches of mine that are only present in master ATM. Highlights:
- reworked fix of DR6 reading on SVM
- full check for invalid writes to IA32_APIC_BASE
- fixed support for halting in L2 (nVMX)
- fully emulated preemption timer (nVMX)
- tracing of nested vmexits (nVMX)
The patch "KVM: nVMX: Leave VMX mode on clearing of feature control MSR" is included again, unchanged from the previous posting. Most fixes are backed by KVM unit tests, to be posted soon as well.

I'm applying patches 1-10 (for now to kvm/queue). For the last two, I prefer to wait for 3.15. Also, for patch 11 I would really prefer to use check_nested_events for both VMX and SVM. I will look at SVM next week.

Thanks,

Paolo

Jan Kiszka (12):
  KVM: x86: Sync DR7 on KVM_SET_DEBUGREGS
  KVM: SVM: Fix reading of DR6
  KVM: VMX: Fix DR6 update on #DB exception
  KVM: x86: Validate guest writes to MSR_IA32_APICBASE
  KVM: nVMX: Leave VMX mode on clearing of feature control MSR
  KVM: nVMX: Pass vmexit parameters to nested_vmx_vmexit
  KVM: nVMX: Add tracepoints for nested_vmexit and nested_vmexit_inject
  KVM: nVMX: Clean up handling of VMX-related MSRs
  KVM: nVMX: Fix nested_run_pending on activity state HLT
  KVM: nVMX: Update guest activity state field on L2 exits
  KVM: nVMX: Rework interception of IRQs and NMIs
  KVM: nVMX: Fully emulate preemption timer

 arch/x86/include/asm/kvm_host.h       |   4 +
 arch/x86/include/uapi/asm/msr-index.h |   1 +
 arch/x86/kvm/cpuid.h                  |   8 +
 arch/x86/kvm/lapic.h                  |   2 +-
 arch/x86/kvm/svm.c                    |  15 ++
 arch/x86/kvm/vmx.c                    | 399 --
 arch/x86/kvm/x86.c                    |  67 +-
 7 files changed, 318 insertions(+), 178 deletions(-)

-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/12] KVM: x86: Fixes for debug registers, IA32_APIC_BASE, and nVMX
On 16/01/2014 16:12, Jan Kiszka wrote:

I'm applying patches 1-10 (for now to kvm/queue). For the last two, I prefer to wait for 3.15.

Should we disable the broken features in 3.14?

I was thinking about it---which would have meant applying patches 1-8 only. In the end I decided not to do it, because the patch doesn't affect KVM-on-KVM emulation. Since the patch already helped you in developing Jailhouse, the feature must not be _completely_ broken. So the churn of reverting the feature now and reapplying it later is not warranted, IMO.

Thanks,

Paolo

Also, for patch 11 I would really prefer to use check_nested_events for both VMX and SVM. I will look at SVM next week.

OK, thanks. I will send an updated patch 11 and also a patch on top of 4 to read the physical bit width from the guest cpuid.
Re: [PATCH 00/12] KVM: x86: Fixes for debug registers, IA32_APIC_BASE, and nVMX
On 2014-01-16 16:20, Paolo Bonzini wrote:
On 16/01/2014 16:12, Jan Kiszka wrote:
I'm applying patches 1-10 (for now to kvm/queue). For the last two, I prefer to wait for 3.15.
Should we disable the broken features in 3.14?
I was thinking about it---which would have meant applying patches 1-8 only.

And we would have to disable the option to turn off HLT interception.

In the end I decided not to do it because the patch doesn't affect KVM-on-KVM emulation. Since the patch already helped you in developing Jailhouse, the feature must not be _completely_ broken. So the churn of reverting the feature now and reapplying it later is not warranted, IMO.

Well, in the end nested=1 remains the sign that this feature is still experimental.

Jan

-- Siemens AG, Corporate Technology, CT RTC ITP SES-DE Corporate Competence Center Embedded Linux
Re: [Xen-devel] [PATCH] KVM, XEN: Fix potential race in pvclock code
On 01/16/2014 04:04 PM, Jan Beulich wrote:

I don't think so - this would only be an issue if the conditions used | instead of ||. || implies a sequence point between evaluating the left and right sides, and the standard says: "The presence of a sequence point between the evaluation of expressions A and B implies that every value computation and side effect associated with A is sequenced before every value computation and side effect associated with B."

This only applies to single-threaded code. Multithreaded code must be data-race free for that to be true. See https://lwn.net/Articles/508991/

And even if there was a problem (i.e. my interpretation of the above being incorrect), I don't think you'd need ACCESS_ONCE() here: the same local variable can't have two different values at two different use sites when there was no intermediate assignment to it.

Same comment as above.

Julian
nested EPT
Dear All,

I am having a problem with using nested EPT in my system: in the L0 hypervisor the CPUs support vmx and ept; however, the L1 hypervisor's CPUs do not have the ept capability. The flag 'ept' appears in /proc/cpuinfo of L0 but does not show in that of L1.

- 'nested' and 'ept' are enabled in L0:
  $ cat /sys/module/kvm_intel/parameters/nested
  Y
  $ cat /sys/module/kvm_intel/parameters/ept
  Y
- The libvirt xml file used in L0 has this cpu configuration: <cpu mode='host-passthrough'/>
- The kernel version I am using for both L0 and L1 is 3.9.11

Thank you very much.

Best,
Hai
[RFC PATCH 0/4] virtio_balloon: add pressure notification via a new virtqueue
The pressure notification added to the virtio-balloon driver by this series is going to be used by the host (QEMU, in this case) to implement automatic ballooning support. More details in patch 3/4 and patch 4/4.

Patch 1/4 adds in-kernel vmpressure support and is not really part of this series; it's added here for the convenience of anyone who wants to try automatic ballooning. Patch 2/4 is a hack to make in-kernel vmpressure work for something not related to cgroups; I'll improve it in later versions.

Glauber Costa (1):
  vmpressure: in-kernel notifications

Luiz capitulino (3):
  vmpressure in-kernel: hack
  virtio_balloon: add pressure notification via a new virtqueue
  virtio_balloon: skip inflation if guest is under pressure

 drivers/virtio/virtio_balloon.c     | 100
 include/linux/vmpressure.h          |   6 +++
 include/uapi/linux/virtio_balloon.h |   1 +
 mm/vmpressure.c                     |  58 +++--
 4 files changed, 151 insertions(+), 14 deletions(-)

-- 
1.8.1.4
[RFC PATCH 2/4] vmpressure in-kernel: hack
From: Luiz capitulino lcapitul...@redhat.com

1. Allow drivers to register private data
2. Allow drivers to pass css=NULL
3. Pass level to the callback

Signed-off-by: Luiz capitulino lcapitul...@redhat.com
---
 include/linux/vmpressure.h |  3 ++-
 mm/vmpressure.c            | 13 +
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index 9102e53..de416b6 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -42,7 +42,8 @@ extern int vmpressure_register_event(struct cgroup_subsys_state *css,
 				     struct eventfd_ctx *eventfd,
 				     const char *args);
 extern int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
-					    void (*fn)(void));
+					    void (*fn)(void *data, int level),
+					    void *data);
 extern void vmpressure_unregister_event(struct cgroup_subsys_state *css,
 					struct cftype *cft,
 					struct eventfd_ctx *eventfd);
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 730e7c1..4ed0e85 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -132,9 +132,10 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 struct vmpressure_event {
 	union {
 		struct eventfd_ctx *efd;
-		void (*fn)(void);
+		void (*fn)(void *data, int level);
 	};
 	enum vmpressure_levels level;
+	void *data;
 	bool kernel_event;
 	struct list_head node;
 };
@@ -152,7 +153,7 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 
 	list_for_each_entry(ev, &vmpr->events, node) {
 		if (ev->kernel_event) {
-			ev->fn();
+			ev->fn(ev->data, level);
 		} else if (vmpr->notify_userspace && level >= ev->level) {
 			eventfd_signal(ev->efd, 1);
 			signalled = true;
@@ -352,21 +353,25 @@ int vmpressure_register_event(struct cgroup_subsys_state *css,
  * well-defined cgroup aware interface.
  */
 int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
-				     void (*fn)(void))
+				     void (*fn)(void *data, int level), void *data)
 {
-	struct vmpressure *vmpr = css_to_vmpressure(css);
+	struct vmpressure *vmpr;
 	struct vmpressure_event *ev;
 
+	vmpr = css ? css_to_vmpressure(css) : memcg_to_vmpressure(NULL);
+
 	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
 	if (!ev)
 		return -ENOMEM;
 
 	ev->kernel_event = true;
+	ev->data = data;
 	ev->fn = fn;
 
 	mutex_lock(&vmpr->events_lock);
 	list_add(&ev->node, &vmpr->events);
 	mutex_unlock(&vmpr->events_lock);
+
 	return 0;
 }
-- 
1.8.1.4
[RFC PATCH 1/4] vmpressure: in-kernel notifications
From: Glauber Costa glom...@openvz.org

During the past weeks, it became clear to us that the shrinker interface we have right now works very well for some particular types of users, but not that well for others. The latter are usually people interested in one-shot notifications, who were forced to adapt themselves to the count+scan behavior of shrinkers. To do so, they had no choice but to greatly abuse the shrinker interface, producing little monsters all over.

During LSF/MM, one of the proposals that popped up during our session was to reuse Anton Vorontsov's vmpressure for this. It is designed for userspace consumption, but also provides a well-established, cgroup-aware entry point for notifications.

This patch extends that to also support in-kernel users. Events that should be generated for in-kernel consumption will be marked as such, and for those, we will call a registered function instead of triggering an eventfd notification.

Please note that due to my lack of understanding of each shrinker user, I will stay away from converting the actual users; you are all welcome to do so.
Signed-off-by: Glauber Costa glom...@openvz.org Signed-off-by: Vladimir Davydov vdavy...@parallels.com Acked-by: Anton Vorontsov an...@enomsg.org Acked-by: Pekka Enberg penb...@kernel.org Reviewed-by: Greg Thelen gthe...@google.com Cc: Dave Chinner dchin...@redhat.com Cc: John Stultz john.stu...@linaro.org Cc: Andrew Morton a...@linux-foundation.org Cc: Joonsoo Kim iamjoonsoo@lge.com Cc: Michal Hocko mho...@suse.cz Cc: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com Cc: Johannes Weiner han...@cmpxchg.org Signed-off-by: Luiz capitulino lcapitul...@redhat.com --- include/linux/vmpressure.h | 5 + mm/vmpressure.c| 53 +++--- 2 files changed, 55 insertions(+), 3 deletions(-) diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h index 3f3788d..9102e53 100644 --- a/include/linux/vmpressure.h +++ b/include/linux/vmpressure.h @@ -19,6 +19,9 @@ struct vmpressure { /* Have to grab the lock on events traversal or modifications. */ struct mutex events_lock; + /* False if only kernel users want to be notified, true otherwise. 
*/ + bool notify_userspace; + struct work_struct work; }; @@ -38,6 +41,8 @@ extern int vmpressure_register_event(struct cgroup_subsys_state *css, struct cftype *cft, struct eventfd_ctx *eventfd, const char *args); +extern int vmpressure_register_kernel_event(struct cgroup_subsys_state *css, + void (*fn)(void)); extern void vmpressure_unregister_event(struct cgroup_subsys_state *css, struct cftype *cft, struct eventfd_ctx *eventfd); diff --git a/mm/vmpressure.c b/mm/vmpressure.c index e0f6283..730e7c1 100644 --- a/mm/vmpressure.c +++ b/mm/vmpressure.c @@ -130,8 +130,12 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned, } struct vmpressure_event { - struct eventfd_ctx *efd; + union { + struct eventfd_ctx *efd; + void (*fn)(void); + }; enum vmpressure_levels level; + bool kernel_event; struct list_head node; }; @@ -147,12 +151,15 @@ static bool vmpressure_event(struct vmpressure *vmpr, mutex_lock(vmpr-events_lock); list_for_each_entry(ev, vmpr-events, node) { - if (level = ev-level) { + if (ev-kernel_event) { + ev-fn(); + } else if (vmpr-notify_userspace level = ev-level) { eventfd_signal(ev-efd, 1); signalled = true; } } + vmpr-notify_userspace = false; mutex_unlock(vmpr-events_lock); return signalled; @@ -222,7 +229,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, * we account it too. */ if (!(gfp (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS))) - return; + goto schedule; /* * If we got here with no pages scanned, then that is an indicator @@ -239,8 +246,15 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, vmpr-scanned += scanned; vmpr-reclaimed += reclaimed; scanned = vmpr-scanned; + /* +* If we didn't reach this point, only kernel events will be triggered. +* It is the job of the worker thread to clean this up once the +* notifications are all delivered. 
+*/ + vmpr-notify_userspace = true; spin_unlock(vmpr-sr_lock); +schedule: if (scanned vmpressure_win) return; schedule_work(vmpr-work); @@ -324,6 +338,39 @@ int vmpressure_register_event(struct cgroup_subsys_state *css, }
[RFC PATCH 3/4] virtio_balloon: add pressure notification via a new virtqueue
From: Luiz capitulino lcapitul...@redhat.com

This commit adds support for a new virtqueue called the message virtqueue. The message virtqueue can be used by guests to notify the host about important memory-related state changes in the guest. Currently, the only implemented notification is the "guest is under pressure" one, which informs the host that the guest is under memory pressure. This notification can be used to implement automatic memory ballooning in the host. For example, once it learns that the guest is under pressure, the host could cancel an on-going inflate and/or start a deflate operation.

Doing this through a virtqueue might seem like overkill, as all we're doing is transferring an integer between guest and host. However, using a virtqueue offers the following advantages:

1. We can reliably synchronize host and guest. That is, the guest will only continue once the host acknowledges the notification. This is important, because if the guest gets under pressure while inflating the balloon, it has to stop to give the host a chance to reset num_pages (see next commit)

2. It's extensible. We may (or may not) want to tell the host which pressure level the guest finds itself in (ie. low, medium or critical)

The lightweight alternative is to use a configuration space parameter. For this to work though, the guest would have to wait for the host to acknowledge the receipt of a configuration change update. I could try this if the virtqueue is overkill.

Finally, the guest learns it's under pressure by registering a callback with the in-kernel vmpressure API.
FIXMEs: - vmpressure API is missing an de-registration routine - not completely sure my changes in virtballoon_probe() are correct Signed-off-by: Luiz capitulino lcapitul...@redhat.com --- drivers/virtio/virtio_balloon.c | 93 + include/uapi/linux/virtio_balloon.h | 1 + 2 files changed, 84 insertions(+), 10 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 5c4a95b..1c3ee71 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -29,6 +29,9 @@ #include linux/module.h #include linux/balloon_compaction.h +#include linux/cgroup.h +#include linux/vmpressure.h + /* * Balloon device works in 4K page units. So each page is pointed to by * multiple balloon pages. All memory counters in this driver are in balloon @@ -37,10 +40,12 @@ #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE VIRTIO_BALLOON_PFN_SHIFT) #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256 +#define VIRTIO_BALLOON_MSG_PRESSURE 1 + struct virtio_balloon { struct virtio_device *vdev; - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq; + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *message_vq; /* Where the ballooning thread waits for config to change. */ wait_queue_head_t config_change; @@ -51,6 +56,8 @@ struct virtio_balloon /* Waiting for host to ack the pages we released. */ wait_queue_head_t acked; + wait_queue_head_t message_acked; + /* Number of balloon pages we've told the Host we're not using. 
*/ unsigned int num_pages; /* @@ -71,6 +78,9 @@ struct virtio_balloon /* Memory statistics */ int need_stats_update; struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR]; + + /* Message virtqueue */ + atomic_t guest_pressure; }; static struct virtio_device_id id_table[] = { @@ -78,6 +88,41 @@ static struct virtio_device_id id_table[] = { { 0 }, }; +static inline bool guest_under_pressure(const struct virtio_balloon *vb) +{ + return atomic_read(vb-guest_pressure) == 1; +} + +static void vmpressure_event_handler(void *data, int level) +{ + struct virtio_balloon *vb = data; + + atomic_set(vb-guest_pressure, 1); + wake_up(vb-config_change); +} + +static void tell_host_pressure(struct virtio_balloon *vb) +{ + const uint32_t msg = VIRTIO_BALLOON_MSG_PRESSURE; + struct scatterlist sg; + unsigned int len; + int err; + + sg_init_one(sg, msg, sizeof(msg)); + + err = virtqueue_add_outbuf(vb-message_vq, sg, 1, vb, GFP_KERNEL); + if (err 0) { + printk(KERN_WARNING virtio-balloon: failed to send host message (%d)\n, err); + goto out; + } + virtqueue_kick(vb-message_vq); + + wait_event(vb-message_acked, virtqueue_get_buf(vb-message_vq, len)); + +out: + atomic_set(vb-guest_pressure, 0); +} + static u32 page_to_balloon_pfn(struct page *page) { unsigned long pfn = page_to_pfn(page); @@ -100,6 +145,13 @@ static void balloon_ack(struct virtqueue *vq) wake_up(vb-acked); } +static void message_ack(struct virtqueue *vq) +{ + struct virtio_balloon *vb = vq-vdev-priv; + + wake_up(vb-message_acked); +} + static void tell_host(struct virtio_balloon
[RFC PATCH 4/4] virtio_balloon: skip inflation if guest is under pressure
From: Luiz capitulino lcapitul...@redhat.com

This is necessary for automatic ballooning. If the guest gets under pressure while there's an on-going inflation operation, we want the guest to do the following:

 1. Stop the on-going inflation
 2. Notify the host we're under pressure
 3. Wait for the host's acknowledgement

While the guest is waiting, the host has the opportunity to reset num_pages to a value from before the guest got under pressure. This cancels the current inflation.

Signed-off-by: Luiz capitulino lcapitul...@redhat.com
---
 drivers/virtio/virtio_balloon.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 1c3ee71..7f5b7d2 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -188,8 +188,13 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
 	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-		struct page *page = balloon_page_enqueue(vb_dev_info);
+		struct page *page;
 
+		if (guest_under_pressure(vb)) {
+			break;
+		}
+
+		page = balloon_page_enqueue(vb_dev_info);
 		if (!page) {
 			dev_info_ratelimited(&vb->vdev->dev,
 					     "Out of puff! Can't get %u pages\n",
-- 
1.8.1.4
[RFC PATCH] balloon: add automatic ballooning support
The current balloon device has an important drawback: it's entirely manual. This largely limits the feature when used to manage memory on a memory-overcommitted host. In order to make the balloon device really useful for memory-overcommit setups, we have to make it automatic. This is what this patch is about.

The general idea is simple. When the host is under pressure it notifies QEMU. QEMU then asks the guest to inflate its balloon by some amount of pages. When the guest is under pressure, it also notifies QEMU. QEMU then asks the guest to deflate its balloon by some amount of pages.

The rest of this commit is divided into the following sections: Usage, Algorithm, FIXMEs/TODOs, Testing and some code in the end.

Usage
=====

This series depends on a kernel series for the virtio-balloon driver called "add pressure notification via a new virtqueue". You have to apply that series to your guest kernel to play with automatic ballooning. Then, on the QEMU side you can enable automatic ballooning with the following command-line:

 $ qemu [...] -device virtio-balloon,automatic=true

Algorithm
=========

On host pressure:

 1. On boot QEMU registers for the Linux kernel's vmpressure event "low". This event is sent by the kernel when it has started reclaiming memory. For more details, please read Documentation/cgroups/memory.txt in the kernel's source

 2. When QEMU is notified of host pressure, it first checks whether the guest is currently under pressure; if it is, the event is skipped. If the guest is not under pressure, QEMU asks the guest to inflate its balloon (by 32MB by default)

 NOTE: QEMU will update num_pages whenever an event is received and the guest is not under pressure. This means that if QEMU receives 10 events in a row, num_pages will be updated to 320MB.

On guest pressure:

 1. QEMU is notified by the virtio-balloon driver in the guest (via the message virtqueue) that the guest is under pressure

 2. QEMU checks if there's an inflate going on. If true, QEMU resets num_pages to the current balloon value so that the guest stops inflating (IOW, QEMU cancels the current inflation), and returns

 3. If there's no on-going inflate, QEMU asks the guest to deflate (by 32MB by default)

 4. Every time a guest pressure notification is received, QEMU sets a hysteresis period of 60 seconds. During this period the guest is defined to be under pressure (and inflates will be ignored)

FIXMEs/TODOs
============

 - The number of pages to inflate/deflate and the memcg path are hardcoded. Will add command-line options for them

 - The default value of 32MB for inflates/deflates is what worked for me in my very specific test-case. This is probably not a good default, but I don't know how to define a good one

 - QEMU registers for vmpressure's level "low" notification. The guest too will notify QEMU on low pressure in the guest. The "low" notification is sent whenever the kernel has started reclaiming memory. On the guest side this means that it will only give free memory to the host. On the host side this means that a host with lots of large freeable caches will be considered as being under pressure. There are two ways to solve this:

   1. Register for "medium" pressure instead of "low". This solves the problem above but adds a different one: "medium" is sent when the kernel has started to swap, so it's a bit too late

   2. Add a new event to vmpressure which is between "low" and "medium". The perfect event would be triggered before waking up kswapd

 - It would be nice (required?) to be able to dynamically enable/disable automatic ballooning. With this patch you enable it for the lifetime of the VM

 - I think manual ballooning should be disabled when automatic ballooning is enabled, but this is not done yet

 - This patch probably doesn't build on Windows

Testing
=======

Testing is by far the most difficult aspect of this project to me. It's been hard to find a good way to measure this work. So take this with a grain of salt.

This is my test-case: a 2G host runs two VMs (guest A and guest B), each with 1.3G of memory. When the VMs are fully booted (but idle) the host has around 1.2G of free memory. Then the VMs do the following:

 1. Guest A runs ebizzy five times in a row, with a chunk size of 1MB and the following number of chunks: 1024, 824, 624, 424, 224. IOW, the memory usage of this VM is going down. Let's call it vm-down

 2. Guest B runs ebizzy in a similar manner, but with the following number of chunks: 224, 424, 624, 824, 1024. IOW, the memory usage of this VM is going up. Let's call it vm-up

Also, each ebizzy run takes 60 seconds. And the vm-up one waits 60 seconds before running ebizzy for the first time. This gives vm-down time to consume most of the host's pressure and release it.

Here are the results. This is an average
Re: nested EPT
On 2014-01-16 17:10, duy hai nguyen wrote: Dear All, I am having a problem with using nested EPT in my system: In L0 hypervisor CPUs support vmx and ept; however, L1 hypervisor's CPUs do not have ept capability. Flag 'ept' appears in /proc/cpuinfo of L0 but does not show in that of L1. - 'Nested' and 'EPT' are enabled in L0: $cat /sys/module/kvm_intel/parameters/nested Y $cat /sys/module/kvm_intel/parameters/ept Y - The libvirt xml file used in L0 has this cpu configuration: cpu mode='host-passthrough'/ - The kernel version I am using for both L0 and L1 is 3.9.11

Update your host kernel (L0), nEPT got merged in 3.12.

Jan

-- Siemens AG, Corporate Technology, CT RTC ITP SES-DE Corporate Competence Center Embedded Linux
Re: nested EPT
On 16/01/2014 17:10, duy hai nguyen wrote: Dear All, I am having a problem with using nested EPT in my system: In L0 hypervisor CPUs support vmx and ept; however, L1 hypervisor's CPUs do not have ept capability. Flag 'ept' appears in /proc/cpuinfo of L0 but does not show in that of L1. - 'Nested' and 'EPT' are enabled in L0: $cat /sys/module/kvm_intel/parameters/nested Y $cat /sys/module/kvm_intel/parameters/ept Y - The libvirt xml file used in L0 has this cpu configuration: cpu mode='host-passthrough'/ - The kernel version I am using for both L0 and L1 is 3.9.11

Nested EPT was added in 3.12. You need that version in L0.

Paolo
Re: troubleshoot live migration
Found a workaround. I don't know why I disabled hpet when the issue was lapic, late nights. I can't reproduce the issue if I disable kvmclock on the qemu cmdline, or if I leave the vm at the grub prompt rather than loading the guest OS. It kind of sucks to disable kvmclock but at least migration works again now. I have no idea whether it's the guest KVM drivers, something in qemu, or the host (probably an incompatibility in the combination) but it happens with CentOS 6.5 guests and fully patched Ubuntu 12.04 guests. Here's our test environment in case anyone cares to try to reproduce. 2 x intel E5-2650 KVM servers running linux 3.10.26 and qemu 1.6.x or 1.7.0 Start a virtual machine with CentOS 6.5 (or perhaps any common, current distro) without disabling kvmclock. I tried single, dual, quad vcpus. Let the vm run for a few hours (migrations immediately after start or even 30 minutes later seem to have no problem) Migrate, and the vm should become unusable. You may be lucky enough to get console/serial output complaining about lapic, or it may just consume cpu On Wed, Jan 15, 2014 at 7:23 AM, Marcus Sorensen shadow...@gmail.com wrote: Just an update, I found that with different tools I was able to see a repeating 'lapic increasing min_delta_ns' scrolling furiously. I've added -no-hpet to the cmdline, but was still able to replicate it. On Tue, Jan 14, 2014 at 1:36 PM, Marcus Sorensen shadow...@gmail.com wrote: Does anyone have tips on troubleshooting live migration? I'm not sure if this should be a qemu question or a kvm one. I've got several E5-2650 servers running in test environment, kernel 3.10.26 and qemu 1.7.0. If I start a VM guest (say ubuntu, debian, or centos), I can migrate it around from host to host to host just fine, but if I wait awhile (say 1 hour), I try to migrate and it succeeds but the guest is hosed. No longer pings, cpu is thrashing. 
I've tried to strace it and don't see anything that other working hosts aren't doing, and I've tried gdb but I'm not entirely sure what I'm doing. I tried downgrading to qemu 1.6.1. I've found dozens of reports of such behavior, but they're all due to other things (migrating between different host CPUs, someone thinking it's virtio or memballoon only to later find a fix like changing machine type, etc). I'm at a loss. I've tried replacing the virtio disk/network with sata/e1000, no difference. I'm just trying to figure out where to go from here. I've got a core dump of the qemu process that was spinning, and I can reproduce it fairly easily. Here's an example qemu cmdline: /usr/bin/qemu-system-x86_64 -machine accel=kvm -name VM12 -S -machine pc-i440fx-1.7,accel=kvm,usb=off -m 512 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid dd10a210-ab41-4cc6-a8f2-51113dd39515 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/VM12.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive file=/dev/sdd,if=none,id=drive-virtio-disk0,format=raw,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw,cache=none -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=1 -netdev tap,fd=29,id=hostnet0,vhost=on,vhostfd=31 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=02:00:39:28:00:01,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/VM12.agent,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=VM12.vport -device usb-tablet,id=input0 -vnc 0.0.0.0:1 -device 
cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: nested EPT
Thanks, Jan and Paolo! Great! It helps solve the problem.

Sincerely,
Hai

On Thu, Jan 16, 2014 at 12:09 PM, Paolo Bonzini pbonz...@redhat.com wrote: On 16/01/2014 17:10, duy hai nguyen wrote:

Dear All, I am having a problem with using nested EPT in my system: in the L0 hypervisor the CPUs support vmx and ept; however, the L1 hypervisor's CPUs do not have the ept capability. The 'ept' flag appears in /proc/cpuinfo of L0 but does not show in that of L1.

- 'nested' and 'ept' are enabled in L0:
  $ cat /sys/module/kvm_intel/parameters/nested
  Y
  $ cat /sys/module/kvm_intel/parameters/ept
  Y
- The libvirt xml file used in L0 has this cpu configuration: <cpu mode='host-passthrough'/>
- The kernel version I am using for both L0 and L1 is 3.9.11

Nested EPT was added in 3.12. You need that version in L0.

Paolo
Re: [PATCHv2/RFC] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING
On Thu, Jan 16, 2014 at 01:44:20PM +0100, Christian Borntraeger wrote:

When starting lots of dataplane devices the bootup takes very long on my s390 system (prototype irqfd code). With larger setups we are even able to trigger some timeouts in some components. It turns out that the KVM_SET_GSI_ROUTING ioctl takes very long (strace claims up to 0.1 sec) when having multiple CPUs. This is caused by the synchronize_rcu and the HZ=100 of s390. By changing the code to use a private srcu we can speed things up. This patch reduces the boot time until mounting root from 8 to 2 seconds on my s390 guest with 100 disks.

I converted most of the rcu routines to srcu. Review for the unconverted use of hlist_for_each_entry_rcu, hlist_add_head_rcu, hlist_del_init_rcu is necessary, though. They look fine to me since they are protected by outer functions. In addition, we should also discuss whether a global srcu (for all guests) is fine.

Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com

That's nice, but did you try to measure the overhead on some interrupt-intensive workloads, such as RX with 10G ethernet? srcu locks aren't free like rcu ones.
---
 virt/kvm/irqchip.c | 31 +++++++++++++++++--------------
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
index 20dc9e4..5283eb8 100644
--- a/virt/kvm/irqchip.c
+++ b/virt/kvm/irqchip.c
@@ -26,17 +26,20 @@
 #include <linux/kvm_host.h>
 #include <linux/slab.h>
+#include <linux/srcu.h>
 #include <linux/export.h>
 #include <trace/events/kvm.h>
 #include "irq.h"

+DEFINE_STATIC_SRCU(irq_srcu);
+
 bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin)
 {
 	struct kvm_irq_ack_notifier *kian;
-	int gsi;
+	int gsi, idx;

-	rcu_read_lock();
-	gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
+	idx = srcu_read_lock(&irq_srcu);
+	gsi = srcu_dereference(kvm->irq_routing, &irq_srcu)->chip[irqchip][pin];
 	if (gsi != -1)
 		hlist_for_each_entry_rcu(kian, &kvm->irq_ack_notifier_list, link)
@@ -45,7 +48,7 @@ bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin)
 			return true;
 		}

-	rcu_read_unlock();
+	srcu_read_unlock(&irq_srcu, idx);

 	return false;
 }
@@ -54,18 +57,18 @@ EXPORT_SYMBOL_GPL(kvm_irq_has_notifier);

 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
 {
 	struct kvm_irq_ack_notifier *kian;
-	int gsi;
+	int gsi, idx;

 	trace_kvm_ack_irq(irqchip, pin);

-	rcu_read_lock();
-	gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
+	idx = srcu_read_lock(&irq_srcu);
+	gsi = srcu_dereference(kvm->irq_routing, &irq_srcu)->chip[irqchip][pin];
 	if (gsi != -1)
 		hlist_for_each_entry_rcu(kian, &kvm->irq_ack_notifier_list, link)
 			if (kian->gsi == gsi)
 				kian->irq_acked(kian);
-	rcu_read_unlock();
+	srcu_read_unlock(&irq_srcu, idx);
 }

 void kvm_register_irq_ack_notifier(struct kvm *kvm,
@@ -85,7 +88,7 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
 	mutex_lock(&kvm->irq_lock);
 	hlist_del_init_rcu(&kian->link);
 	mutex_unlock(&kvm->irq_lock);
-	synchronize_rcu();
+	synchronize_srcu_expedited(&irq_srcu);
 #ifdef __KVM_HAVE_IOAPIC
 	kvm_vcpu_request_scan_ioapic(kvm);
 #endif
@@ -115,7 +118,7 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level,
 		bool line_status)
 {
 	struct kvm_kernel_irq_routing_entry *e, irq_set[KVM_NR_IRQCHIPS];
-	int ret = -1, i = 0;
+	int ret = -1, i = 0, idx;
 	struct kvm_irq_routing_table *irq_rt;

 	trace_kvm_set_irq(irq, level, irq_source_id);

@@ -124,12 +127,12 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level,
 	 * IOAPIC.  So set the bit in both. The guest will ignore
 	 * writes to the unused one.
 	 */
-	rcu_read_lock();
-	irq_rt = rcu_dereference(kvm->irq_routing);
+	idx = srcu_read_lock(&irq_srcu);
+	irq_rt = srcu_dereference(kvm->irq_routing, &irq_srcu);
 	if (irq < irq_rt->nr_rt_entries)
 		hlist_for_each_entry(e, &irq_rt->map[irq], link)
 			irq_set[i++] = *e;
-	rcu_read_unlock();
+	srcu_read_unlock(&irq_srcu, idx);

 	while (i--) {
 		int r;
@@ -226,7 +229,7 @@ int kvm_set_irq_routing(struct kvm *kvm,
 	kvm_irq_routing_update(kvm, new);
 	mutex_unlock(&kvm->irq_lock);
-	synchronize_rcu();
+	synchronize_srcu_expedited(&irq_srcu);

Hmm, it's a bit strange that you also do _expedited here. What if this synchronize_rcu is replaced by synchronize_rcu_expedited and no other changes are made? Maybe that's enough?

 	new = old;
Re: [PATCHv2/RFC] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING
On Thu, Jan 16, 2014 at 02:07:19PM +0100, Paolo Bonzini wrote: On 16/01/2014 14:06, Christian Borntraeger wrote:

Will you edit the patch description or shall I resend the patch?

I can edit the commit message.

Paolo

I think we really need to see the effect adding srcu has on interrupt injection.
[PATCHv3] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING
On 16/01/14 19:55, Michael S. Tsirkin wrote: On Thu, Jan 16, 2014 at 01:44:20PM +0100, Christian Borntraeger wrote:

[...] I converted most of the rcu routines to srcu. Review for the unconverted [...]

That's nice, but did you try to measure the overhead on some interrupt-intensive workloads, such as RX with 10G ethernet? srcu locks aren't free like rcu ones.

Right, but the read side is only acting on cpu-local data structures, so the overhead is probably very small in contrast to the hlist work and the injection itself. You have a more compelling review comment, though:

[...]
-	synchronize_rcu();
+	synchronize_srcu_expedited(&irq_srcu);

Hmm, it's a bit strange that you also do _expedited here.

Well, I just did what the original mail thread suggested :-)

What if this synchronize_rcu is replaced by synchronize_rcu_expedited and no other changes are made? Maybe that's enough?

Yes, it's enough (seems slightly slower than v2, but fast enough). Patch below:

[PATCHv3] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING

When starting lots of dataplane devices the bootup takes very long on my s390 system (prototype irqfd code). With larger setups we are even able to trigger some timeouts in some components. It turns out that the KVM_SET_GSI_ROUTING ioctl takes very long (strace claims up to 0.1 sec) when having multiple CPUs. This is caused by the synchronize_rcu and the HZ=100 of s390. Let's use the expedited variant to speed things up, as suggested by Michael S.
Tsirkin.

Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 virt/kvm/irqchip.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
index 20dc9e4..dbcfde7 100644
--- a/virt/kvm/irqchip.c
+++ b/virt/kvm/irqchip.c
@@ -226,7 +226,7 @@ int kvm_set_irq_routing(struct kvm *kvm,
 	kvm_irq_routing_update(kvm, new);
 	mutex_unlock(&kvm->irq_lock);

-	synchronize_rcu();
+	synchronize_rcu_expedited();

 	new = old;
 	r = 0;
Re: [PATCHv3] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING
On Thu, Jan 16, 2014 at 09:07:14PM +0100, Christian Borntraeger wrote: On 16/01/14 19:55, Michael S. Tsirkin wrote: On Thu, Jan 16, 2014 at 01:44:20PM +0100, Christian Borntraeger wrote:

[...] I converted most of the rcu routines to srcu. Review for the unconverted [...]

That's nice, but did you try to measure the overhead on some interrupt-intensive workloads, such as RX with 10G ethernet? srcu locks aren't free like rcu ones.

Right, but the read side is only acting on cpu-local data structures, so the overhead is probably very small in contrast to the hlist work and the injection itself.

My testing of VM exit paths shows the overhead of read-side SRCU is not negligible; IIRC it involves memory barriers, which are expensive operations.

You have a more compelling review comment, though:

[...]
-	synchronize_rcu();
+	synchronize_srcu_expedited(&irq_srcu);

Hmm, it's a bit strange that you also do _expedited here.

Well, I just did what the original mail thread suggested :-)

What if this synchronize_rcu is replaced by synchronize_rcu_expedited and no other changes are made? Maybe that's enough?

Yes, it's enough (seems slightly slower than v2, but fast enough). Patch below:

[PATCHv3] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING

When starting lots of dataplane devices the bootup takes very long on my s390 system (prototype irqfd code). With larger setups we are even able to trigger some timeouts in some components. It turns out that the KVM_SET_GSI_ROUTING ioctl takes very long (strace claims up to 0.1 sec) when having multiple CPUs. This is caused by the synchronize_rcu and the HZ=100 of s390. Let's use the expedited variant to speed things up, as suggested by Michael S.
Tsirkin.

Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 virt/kvm/irqchip.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
index 20dc9e4..dbcfde7 100644
--- a/virt/kvm/irqchip.c
+++ b/virt/kvm/irqchip.c
@@ -226,7 +226,7 @@ int kvm_set_irq_routing(struct kvm *kvm,
 	kvm_irq_routing_update(kvm, new);
 	mutex_unlock(&kvm->irq_lock);

-	synchronize_rcu();
+	synchronize_rcu_expedited();

 	new = old;
 	r = 0;
Re: [RFC PATCH v5] add support for Hyper-V reference time counter
On Thu, Jan 16, 2014 at 08:18:37PM +1100, Vadim Rozenfeld wrote:

Signed-off: Peter Lieven p...@kamp.de
Signed-off: Gleb Natapov
Signed-off: Vadim Rozenfeld vroze...@redhat.com

After some consideration I decided to submit only Hyper-V reference counters support this time. I will submit iTSC support as a separate patch as soon as it is ready.

v1 -> v2
1. mark TSC page dirty as suggested by Eric Northup digitale...@google.com and Gleb
2. disable local irq when calling get_kernel_ns, as it was done by Peter Lieven p...@kamp.de
3. move check for TSC page enable from second patch to this one.

v3 -> v4
Get rid of ref counter offset.

v4 -> v5
Replace __copy_to_user with kvm_write_guest when updating the iTSC page.
---
 arch/x86/include/asm/kvm_host.h    |  1 +
 arch/x86/include/uapi/asm/hyperv.h | 13 +++++++++++++
 arch/x86/kvm/x86.c                 | 28 +++++++++++++++++++++++++++-
 include/uapi/linux/kvm.h           |  1 +
 4 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ae5d783..33fef07 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -605,6 +605,7 @@ struct kvm_arch {
 	/* fields used by HYPER-V emulation */
 	u64 hv_guest_os_id;
 	u64 hv_hypercall;
+	u64 hv_tsc_page;

 #ifdef CONFIG_KVM_MMU_AUDIT
 	int audit_point;
diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h
index b8f1c01..462efe7 100644
--- a/arch/x86/include/uapi/asm/hyperv.h
+++ b/arch/x86/include/uapi/asm/hyperv.h
@@ -28,6 +28,9 @@
 /* Partition Reference Counter (HV_X64_MSR_TIME_REF_COUNT) available*/
 #define HV_X64_MSR_TIME_REF_COUNT_AVAILABLE	(1 << 1)

+/* A partition's reference time stamp counter (TSC) page */
+#define HV_X64_MSR_REFERENCE_TSC		0x40000021
+
 /*
  * There is a single feature flag that signifies the presence of the MSR
  * that can be used to retrieve both the local APIC Timer frequency as
@@ -198,6 +201,9 @@
 #define HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_MASK	\
 		(~((1ull << HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_SHIFT) - 1))

+#define HV_X64_MSR_TSC_REFERENCE_ENABLE		0x00000001
+#define HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT	12
+
 #define HV_PROCESSOR_POWER_STATE_C0		0
 #define HV_PROCESSOR_POWER_STATE_C1		1
 #define HV_PROCESSOR_POWER_STATE_C2		2
@@ -210,4 +216,11 @@
 #define HV_STATUS_INVALID_ALIGNMENT		4
 #define HV_STATUS_INSUFFICIENT_BUFFERS		19

+typedef struct _HV_REFERENCE_TSC_PAGE {
+	__u32 tsc_sequence;
+	__u32 res1;
+	__u64 tsc_scale;
+	__s64 tsc_offset;
+} HV_REFERENCE_TSC_PAGE, *PHV_REFERENCE_TSC_PAGE;
+
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5d004da..8e685b8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -836,11 +836,12 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc);
  * kvm-specific. Those are put in the beginning of the list.
  */

-#define KVM_SAVE_MSRS_BEGIN	10
+#define KVM_SAVE_MSRS_BEGIN	12
 static u32 msrs_to_save[] = {
 	MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK, MSR_KVM_SYSTEM_TIME_NEW,
 	MSR_KVM_WALL_CLOCK_NEW, HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
+	HV_X64_MSR_TIME_REF_COUNT, HV_X64_MSR_REFERENCE_TSC,
 	HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
 	MSR_KVM_PV_EOI_EN,
 	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
@@ -1826,6 +1827,8 @@ static bool kvm_hv_msr_partition_wide(u32 msr)
 	switch (msr) {
 	case HV_X64_MSR_GUEST_OS_ID:
 	case HV_X64_MSR_HYPERCALL:
+	case HV_X64_MSR_REFERENCE_TSC:
+	case HV_X64_MSR_TIME_REF_COUNT:
 		r = true;
 		break;
 	}
@@ -1867,6 +1870,20 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 		kvm->arch.hv_hypercall = data;
 		break;
 	}
+	case HV_X64_MSR_REFERENCE_TSC: {
+		u64 gfn;
+		HV_REFERENCE_TSC_PAGE tsc_ref;
+		memset(&tsc_ref, 0, sizeof(tsc_ref));
+		kvm->arch.hv_tsc_page = data;

Comment 1)

Is there a reason (that is, compliance with the spec) to maintain the value for a HV_X64_MSR_REFERENCE_TSC wrmsr operation in case HV_X64_MSR_TSC_REFERENCE_ENABLE is not set? If not, kvm->arch.hv_tsc_page should only be assigned after the proper checks.

+		if (!(data & HV_X64_MSR_TSC_REFERENCE_ENABLE))
+			break;
+		gfn = data >> HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT;
+		if (kvm_write_guest(kvm, data,
+			&tsc_ref, sizeof(tsc_ref)))
+			return 1;
+		mark_page_dirty(kvm, gfn);
+		break;
+	}
 	default:
 		vcpu_unimpl(vcpu, "HYPER-V unimplemented wrmsr: 0x%x
Re: [RFC PATCH 3/4] virtio_balloon: add pressure notification via a new virtqueue
Luiz Capitulino lcapitul...@redhat.com writes:

From: Luiz capitulino lcapitul...@redhat.com

This commit adds support for a new virtqueue called the message virtqueue.

OK, this needs a lot of thought (especially since reworking the virtio balloon is on the TODO list for the OASIS virtio technical committee...)

But AFAICT this is a way of explicitly saying "no" to the host's target (vs the implicit method of just not meeting the target). I'm not sure that gives enough information to the host. On the other hand, I'm not sure what information we *can* give. Should we instead be counter-proposing a target? What does qemu do with this information?

Thanks,
Rusty.

The message virtqueue can be used by guests to notify the host about important memory-related state changes in the guest. Currently, the only implemented notification is the "guest is under pressure" one, which informs the host that the guest is under memory pressure. This notification can be used to implement automatic memory ballooning in the host. For example, once learning that the guest is under pressure, the host could cancel an on-going inflate and/or start a deflate operation.

Doing this through a virtqueue might seem like overkill, as all we're doing is to transfer an integer between guest and host. However, using a virtqueue offers the following advantages:

1. We can reliably synchronize host and guest. That is, the guest will only continue once the host acknowledges the notification. This is important, because if the guest gets under pressure while inflating the balloon, it has to stop to give the host a chance to reset num_pages (see next commit)

2. It's extensible. We may (or may not) want to tell the host which pressure level the guest finds itself in (ie. low, medium or critical)

The lightweight alternative is to use a configuration space parameter. For this to work though, the guest would have to wait for the host to acknowledge the receipt of a configuration change update.
I could try this if the virtqueue is too much overkill.

Finally, the guest learns it's under pressure by registering a callback with the in-kernel vmpressure API.

FIXMEs:
- the vmpressure API is missing a de-registration routine
- not completely sure my changes in virtballoon_probe() are correct

Signed-off-by: Luiz capitulino lcapitul...@redhat.com
---
 drivers/virtio/virtio_balloon.c     | 93 ++++++++++++++++++++++++++++++-----
 include/uapi/linux/virtio_balloon.h |  1 +
 2 files changed, 84 insertions(+), 10 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 5c4a95b..1c3ee71 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -29,6 +29,9 @@
 #include <linux/module.h>
 #include <linux/balloon_compaction.h>

+#include <linux/cgroup.h>
+#include <linux/vmpressure.h>
+
 /*
  * Balloon device works in 4K page units. So each page is pointed to by
  * multiple balloon pages. All memory counters in this driver are in balloon
@@ -37,10 +40,12 @@
 #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
 #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256

+#define VIRTIO_BALLOON_MSG_PRESSURE 1
+
 struct virtio_balloon
 {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *message_vq;

 	/* Where the ballooning thread waits for config to change. */
 	wait_queue_head_t config_change;
@@ -51,6 +56,8 @@ struct virtio_balloon
 	/* Waiting for host to ack the pages we released. */
 	wait_queue_head_t acked;

+	wait_queue_head_t message_acked;
+
 	/* Number of balloon pages we've told the Host we're not using. */
 	unsigned int num_pages;
 	/*
@@ -71,6 +78,9 @@ struct virtio_balloon
 	/* Memory statistics */
 	int need_stats_update;
 	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
+
+	/* Message virtqueue */
+	atomic_t guest_pressure;
 };

 static struct virtio_device_id id_table[] = {
@@ -78,6 +88,41 @@ static struct virtio_device_id id_table[] = {
 	{ 0 },
 };

+static inline bool guest_under_pressure(const struct virtio_balloon *vb)
+{
+	return atomic_read(&vb->guest_pressure) == 1;
+}
+
+static void vmpressure_event_handler(void *data, int level)
+{
+	struct virtio_balloon *vb = data;
+
+	atomic_set(&vb->guest_pressure, 1);
+	wake_up(&vb->config_change);
+}
+
+static void tell_host_pressure(struct virtio_balloon *vb)
+{
+	const uint32_t msg = VIRTIO_BALLOON_MSG_PRESSURE;
+	struct scatterlist sg;
+	unsigned int len;
+	int err;
+
+	sg_init_one(&sg, &msg, sizeof(msg));
+
+	err = virtqueue_add_outbuf(vb->message_vq, &sg, 1, vb, GFP_KERNEL);
+	if (err < 0) {
+		printk(KERN_WARNING "virtio-balloon: failed to send host message
Re: KVM: MMU: handle invalid root_hpa at __direct_map
Hi everybody,

Marcelo, your suggestion above should work together with the same patch to __direct_map. After running some tests on different kernels, the !VALID_PAGE case happens as follows:

Linux localhost.localdomain 3.12.5-200.fc19.x86_64 #1 SMP Tue Dec 17 22:21:14 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux:
INVALID PAGE kvm/arch/x86/kvm/paging_tmpl.h:FNAME(fetch):573

Linux localhost.localdomain 3.11.8-200.strato0002.fc19.strato.44671928e2e2.x86_64 #1 SMP Sun Jan 5 18:30:38 IST 2014 x86_64 x86_64 x86_64 GNU/Linux:
INVALID PAGE arch/x86/kvm/mmu.c:__direct_map:2695

So you probably should add both patches to kvm. Here is my complete patch for 3.11, if you like:

diff --git a/kvm/arch/x86/kvm/mmu.c b/kvm/arch/x86/kvm/mmu.c
index 9e9285a..c38e480 100644
--- a/kvm/arch/x86/kvm/mmu.c
+++ b/kvm/arch/x86/kvm/mmu.c
@@ -2691,6 +2691,11 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 	int emulate = 0;
 	gfn_t pseudo_gfn;

+	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa)) {
+		pgprintk(KERN_WARNING "invalid page access %s:%s:%d: %llx\n",
+			 __FILE__, __func__, __LINE__, vcpu->arch.mmu.root_hpa);
+		return 0;
+	}
+
 	for_each_shadow_entry(vcpu, (u64)gfn << PAGE_SHIFT, iterator) {
 		if (iterator.level == level) {
 			mmu_set_spte(vcpu, iterator.sptep, ACC_ALL,
@@ -2861,6 +2866,11 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 	bool ret = false;
 	u64 spte = 0ull;

+	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa)) {
+		pgprintk(KERN_WARNING "invalid page access %s:%s:%d: %llx\n",
+			 __FILE__, __func__, __LINE__, vcpu->arch.mmu.root_hpa);
+		return false;
+	}
+
 	if (!page_fault_can_be_fast(vcpu, error_code))
 		return false;
@@ -3255,6 +3265,11 @@ static u64 walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr)
 	struct kvm_shadow_walk_iterator iterator;
 	u64 spte = 0ull;

+	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa)) {
+		pgprintk(KERN_WARNING "invalid page access %s:%s:%d: %llx\n",
+			 __FILE__, __func__, __LINE__, vcpu->arch.mmu.root_hpa);
+		return spte;
+	}
+
 	walk_shadow_page_lockless_begin(vcpu);
 	for_each_shadow_entry_lockless(vcpu, addr, iterator, spte)
 		if (!is_shadow_present_pte(spte))
@@ -4525,6 +4540,11 @@ int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4])
 	u64 spte;
 	int nr_sptes = 0;

+	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa)) {
+		pgprintk(KERN_WARNING "invalid page access %s:%s:%d: %llx\n",
+			 __FILE__, __func__, __LINE__, vcpu->arch.mmu.root_hpa);
+		return nr_sptes;
+	}
+
 	walk_shadow_page_lockless_begin(vcpu);
 	for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
 		sptes[iterator.level-1] = spte;
diff --git a/kvm/arch/x86/kvm/paging_tmpl.h b/kvm/arch/x86/kvm/paging_tmpl.h
index 7769699..202a1fc 100644
--- a/kvm/arch/x86/kvm/paging_tmpl.h
+++ b/kvm/arch/x86/kvm/paging_tmpl.h
@@ -423,6 +423,11 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 	if (FNAME(gpte_changed)(vcpu, gw, top_level))
 		goto out_gpte_changed;

+	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa)) {
+		pgprintk(KERN_WARNING "invalid page access %s:%s:%d: %llx\n",
+			 __FILE__, __func__, __LINE__, vcpu->arch.mmu.root_hpa);
+		goto out_gpte_changed;
+	}
+
 	for (shadow_walk_init(&it, vcpu, addr);
 	     shadow_walk_okay(&it) && it.level > gw->level;
 	     shadow_walk_next(&it)) {
@@ -674,6 +679,12 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 	 */
 	mmu_topup_memory_caches(vcpu);

+	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa)) {
+		pgprintk(KERN_WARNING "invalid page access %s:%s:%d: %llx\n",
+			 __FILE__, __func__, __LINE__, vcpu->arch.mmu.root_hpa);
+		WARN_ON(1);
+		return;
+	}

Another issue I'm struggling with is the following bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1052861
All I'm trying to do there is run virt-v2v in a nested environment. Maybe you have an idea? It's also related to memory pressure on L2. Of course L0 does not crash (I'm using it with the above patch), but L2 crashes during the conversion process.
Thanks,
Rom

On Fri, Dec 27, 2013 at 9:43 PM, Marcelo Tosatti mtosa...@redhat.com wrote: On Sun, Dec 22, 2013 at 12:56:49PM -0200, Marcelo Tosatti wrote: On Sun, Dec 22, 2013 at 11:17:21AM +0200, Rom Freiman wrote:

Hello everyone, I've been chasing this bug for a while. According to my research, this bug fix works fine for the 3.11.9-200.fc19.x86_64 kernel version (and I also came to an almost identical solution that really did solve the crash). But the problem is that it seems this patch does not work on 3.13.0-rc2+ - it looks
Re: internal error End of file from monitor with shutdown vms
On Thu, Jan 16, 2014 at 8:57 PM, Amr ElGebaly a...@vision-as.com wrote:

I didn't modify anything in cgconfig (didn't define any limits). I'm using the default config:

[root@Node1 ~]# cat /etc/cgconfig.conf
#
#  Copyright IBM Corporation. 2007
#
#  Authors: Balbir Singh bal...@linux.vnet.ibm.com
#  This program is free software; you can redistribute it and/or modify it
#  under the terms of version 2.1 of the GNU Lesser General Public License
#  as published by the Free Software Foundation.
#
#  This program is distributed in the hope that it would be useful, but
#  WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See man cgconfig.conf for further details.
#
# By default, mount all controllers to /cgroup/controller

mount {
	cpuset = /cgroup/cpuset;
	cpu = /cgroup/cpu;
	cpuacct = /cgroup/cpuacct;
	memory = /cgroup/memory;
	devices = /cgroup/devices;
	freezer = /cgroup/freezer;
	net_cls = /cgroup/net_cls;
	blkio = /cgroup/blkio;
}

[root@Node1 ~]# cat /etc/cg
cgconfig.conf  cgrules.conf  cgsnapshot_blacklist.conf
[root@Node1 ~]# cat /etc/cgrules.conf
# /etc/cgrules.conf
# The format of this file is described in cgrules.conf(5)
# manual page.
#
# Example:
# user		controllers	destination
# @student	cpu,memory	usergroup/student/
# peter		cpu		test1/
# %		memory		test

How can I fix that?

I'm not sure where the 1 GB limit is coming from. Perhaps libvirt set it?

Stefan
Re: [RFC PATCH 3/4] virtio_balloon: add pressure notification via a new virtqueue
On Fri, 17 Jan 2014 09:10:47 +1030 Rusty Russell ru...@rustcorp.com.au wrote: Luiz Capitulino lcapitul...@redhat.com writes: From: Luiz capitulino lcapitul...@redhat.com

This commit adds support for a new virtqueue called the message virtqueue.

OK, this needs a lot of thought (especially since reworking the virtio balloon is on the TODO list for the OASIS virtio technical committee...)

But AFAICT this is a way of explicitly saying "no" to the host's target (vs the implicit method of just not meeting the target). I'm not sure that gives enough information to the host. On the other hand, I'm not sure what information we *can* give. Should we instead be counter-proposing a target?

The problem is how to estimate a target value. I found it simpler to just try to obey what the host is asking for (and fail if not possible) than trying to make the guest negotiate with the host.

What does qemu do with this information?

There are two possible scenarios:

1. The balloon driver is currently inflating when it gets under pressure. QEMU resets num_pages to the current balloon size. This cancels the on-going inflate.

2. The balloon driver is not inflating, eg. it's possibly sleeping. QEMU issues a deflate.

But note that those scenarios are not supposed to be used with the current device; they are part of the automatic ballooning feature. I CC'ed you on the QEMU patch, you can find it here in case you didn't see it:

http://marc.info/?l=kvmm=138988966315690w=2

Thanks,
Rusty.

The message virtqueue can be used by guests to notify the host about important memory-related state changes in the guest. Currently, the only implemented notification is the "guest is under pressure" one, which informs the host that the guest is under memory pressure. This notification can be used to implement automatic memory ballooning in the host. For example, once learning that the guest is under pressure, the host could cancel an on-going inflate and/or start a deflate operation.
Re: [RFC PATCH 3/4] virtio_balloon: add pressure notification via a new virtqueue
On Thu, 16 Jan 2014 20:38:19 -0500 Luiz Capitulino lcapitul...@redhat.com wrote:

What does qemu do with this information?

There are two possible scenarios:

1. The balloon driver is currently inflating when it gets under pressure. QEMU resets num_pages to the current balloon size. This cancels the on-going inflate.

2. The balloon driver is not inflating, eg. it's possibly sleeping. QEMU issues a deflate.

But note that those scenarios are not supposed to be used with the current device; they are part of the automatic ballooning feature. I CC'ed you on the QEMU patch, you can find it here in case you didn't see it:

http://marc.info/?l=kvmm=138988966315690w=2

By "current device" I meant outside of the automatic ballooning scope.
Re: nested EPT
Now I can run an L2 guest (nested guest) using the kvm kernel module of kernel 3.12.

However, I am facing a new problem when trying to build and use the kvm kernel module from git://git.kiszka.org/kvm-kmod.git: L1 (the nested hypervisor) cannot boot L2, and the graphic console of virt-manager hangs displaying 'Booting from Hard Disk...'. L1 still runs fine. Loading kvm_intel with 'emulate_invalid_guest_state=0' in L0 does not solve the problem. I have also tried different kernel versions: 3.12.0, 3.12.8 and 3.13.0, without success.

Can you give me some suggestions? Thank you very much.

Best,
Hai

On Thu, Jan 16, 2014 at 1:17 PM, duy hai nguyen hain...@gmail.com wrote:

Thanks, Jan and Paolo! Great! It helps solve the problem.

Sincerely,
Hai

On Thu, Jan 16, 2014 at 12:09 PM, Paolo Bonzini pbonz...@redhat.com wrote: On 16/01/2014 17:10, duy hai nguyen wrote:

Dear All, I am having a problem with using nested EPT in my system: in the L0 hypervisor the CPUs support vmx and ept; however, the L1 hypervisor's CPUs do not have the ept capability. The 'ept' flag appears in /proc/cpuinfo of L0 but does not show in that of L1.

- 'nested' and 'ept' are enabled in L0:
  $ cat /sys/module/kvm_intel/parameters/nested
  Y
  $ cat /sys/module/kvm_intel/parameters/ept
  Y
- The libvirt xml file used in L0 has this cpu configuration: <cpu mode='host-passthrough'/>
- The kernel version I am using for both L0 and L1 is 3.9.11

Nested EPT was added in 3.12. You need that version in L0.

Paolo