Re: [Xen-devel] [PATCH] KVM, XEN: Fix potential race in pvclock code
On 01/28/2014 04:16 PM, Konrad Rzeszutek Wilk wrote: At least 4.8 and probably older compilers compile this as intended. The point is that the standard does not guarantee the intended behavior, i.e. the code is wrong. Perhaps I misunderstood Jan's response, but it sounded to me like the compiler was not adhering to the standard? Compilers are free to generate code that breaks the current clock code while staying within the standard. If the compiler ignores the standard, all bets are off anyway... I can refer to this LWN article: https://lwn.net/Articles/508991/ The whole point of ACCESS_ONCE is to avoid time bombs like that. There are lots of places where ACCESS_ONCE is used in the kernel: http://lxr.free-electrons.com/ident?i=ACCESS_ONCE See for example the check_zero function here: http://lxr.free-electrons.com/source/arch/x86/kernel/kvm.c#L559 In other words, you don't have a sample of 'bad' compiler code. Nope. Julian -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] [PATCH] KVM, XEN: Fix potential race in pvclock code
On 01/24/2014 07:08 PM, Konrad Rzeszutek Wilk wrote: On Thu, Jan 16, 2014 at 03:13:44PM +0100, Julian Stecklina wrote: The paravirtualized clock used in KVM and Xen uses a version field to allow the guest to see when the shared data structure is inconsistent. The code reads the version field twice (before and after the data structure is copied) and checks that it hasn't changed and that the lowest bit is not set. As the second access is not synchronized, the compiler could generate code that accesses version more than two times and you end up with inconsistent data. Could you paste in the code that the 'bad' compiler generates vs the compiler that generates 'good' code please? At least 4.8 and probably older compilers compile this as intended. The point is that the standard does not guarantee the intended behavior, i.e. the code is wrong. I can refer to this LWN article: https://lwn.net/Articles/508991/ The whole point of ACCESS_ONCE is to avoid time bombs like that. There are lots of places where ACCESS_ONCE is used in the kernel: http://lxr.free-electrons.com/ident?i=ACCESS_ONCE See for example the check_zero function here: http://lxr.free-electrons.com/source/arch/x86/kernel/kvm.c#L559 Julian An example using pvclock_get_time_values: host starts updating data, sets src->version to 1; guest reads src->version (1) and stores it into dst->version; guest copies inconsistent data; guest reads src->version (1) and computes the xor with dst->version; host finishes updating data and sets src->version to 2; guest reads src->version (2) and checks whether the lower bit is not set; the while loop exits with inconsistent data! AFAICS the compiler is allowed to optimize the given code this way.
Signed-off-by: Julian Stecklina jstec...@os.inf.tu-dresden.de --- arch/x86/kernel/pvclock.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c index 42eb330..f62b41c 100644 --- a/arch/x86/kernel/pvclock.c +++ b/arch/x86/kernel/pvclock.c @@ -55,6 +55,8 @@ static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow) static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst, struct pvclock_vcpu_time_info *src) { +u32 nversion; + do { dst->version = src->version; rmb(); /* fetch version before data */ @@ -64,7 +66,8 @@ static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst, dst->tsc_shift = src->tsc_shift; dst->flags = src->flags; rmb(); /* test version after fetching data */ -} while ((src->version & 1) || (dst->version != src->version)); +nversion = ACCESS_ONCE(src->version); +} while ((nversion & 1) || (dst->version != nversion)); return dst->version; } @@ -135,7 +138,7 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock, struct pvclock_vcpu_time_info *vcpu_time, struct timespec *ts) { -u32 version; +u32 version, nversion; u64 delta; struct timespec now; @@ -146,7 +149,8 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock, now.tv_sec = wall_clock->sec; now.tv_nsec = wall_clock->nsec; rmb(); /* fetch time before checking version */ -} while ((wall_clock->version & 1) || (version != wall_clock->version)); +nversion = ACCESS_ONCE(wall_clock->version); +} while ((nversion & 1) || (version != nversion)); delta = pvclock_clocksource_read(vcpu_time); /* time since system boot */ delta += now.tv_sec * (u64)NSEC_PER_SEC + now.tv_nsec; -- 1.8.4.2 ___ Xen-devel mailing list xen-de...@lists.xen.org http://lists.xen.org/xen-devel
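The pattern the patch fixes can be modeled in plain C. Below is a minimal user-space sketch, not the kernel code itself: ACCESS_ONCE is re-implemented here as the usual volatile cast, the struct layout is illustrative, and the kernel's rmb() barriers are omitted since this single-threaded model only shows why the final version read must happen exactly once.

```c
#include <stdint.h>

/* User-space model of the kernel's ACCESS_ONCE(): a volatile cast
 * that forces the compiler to emit exactly one load from memory. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Simplified stand-in for pvclock_vcpu_time_info: the version field
 * is odd while the host is in the middle of an update. */
struct time_info {
    uint32_t version;
    uint64_t payload;
};

/* One iteration of the reader loop: snapshot the version, copy the
 * data, then re-read the version exactly once.  Returns 1 if the
 * snapshot is consistent (even version, unchanged across the copy),
 * 0 if the caller must retry. */
static int try_read(struct time_info *src, uint64_t *out)
{
    uint32_t before = ACCESS_ONCE(src->version);
    uint64_t data = src->payload;            /* copy the shared data */
    uint32_t after = ACCESS_ONCE(src->version);

    if ((after & 1) || before != after)
        return 0;                            /* torn read: retry */
    *out = data;
    return 1;
}
```

Without the volatile cast, the compiler would be allowed to read src->version once per operand of the `||`, which is exactly the window the mail thread describes.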
Re: [Xen-devel] [PATCH] KVM, XEN: Fix potential race in pvclock code
On 01/17/2014 10:41 AM, Jan Beulich wrote: The half about converting local variable accesses back to memory reads (i.e. eliding the local variable), however, is only a theoretical issue afaict: If a compiler really did this, I think there'd be far more places where this would hurt. It happens rarely, but it does happen. Not fixing those issues is inviting trouble with new compiler generations. And these issues are terribly hard to debug. Julian
[PATCH] KVM, XEN: Fix potential race in pvclock code
The paravirtualized clock used in KVM and Xen uses a version field to allow the guest to see when the shared data structure is inconsistent. The code reads the version field twice (before and after the data structure is copied) and checks that it hasn't changed and that the lowest bit is not set. As the second access is not synchronized, the compiler could generate code that accesses version more than two times and you end up with inconsistent data. An example using pvclock_get_time_values: host starts updating data, sets src->version to 1; guest reads src->version (1) and stores it into dst->version; guest copies inconsistent data; guest reads src->version (1) and computes the xor with dst->version; host finishes updating data and sets src->version to 2; guest reads src->version (2) and checks whether the lower bit is not set; the while loop exits with inconsistent data! AFAICS the compiler is allowed to optimize the given code this way. Signed-off-by: Julian Stecklina jstec...@os.inf.tu-dresden.de --- arch/x86/kernel/pvclock.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c index 42eb330..f62b41c 100644 --- a/arch/x86/kernel/pvclock.c +++ b/arch/x86/kernel/pvclock.c @@ -55,6 +55,8 @@ static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow) static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst, struct pvclock_vcpu_time_info *src) { + u32 nversion; + do { dst->version = src->version; rmb(); /* fetch version before data */ @@ -64,7 +66,8 @@ static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst, dst->tsc_shift = src->tsc_shift; dst->flags = src->flags; rmb(); /* test version after fetching data */ - } while ((src->version & 1) || (dst->version != src->version)); + nversion = ACCESS_ONCE(src->version); + } while ((nversion & 1) || (dst->version != nversion)); return dst->version; } @@ -135,7 +138,7 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock, struct pvclock_vcpu_time_info *vcpu_time, struct timespec *ts) { - u32 version; + u32 version, nversion; u64 delta; struct timespec now; @@ -146,7 +149,8 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock, now.tv_sec = wall_clock->sec; now.tv_nsec = wall_clock->nsec; rmb(); /* fetch time before checking version */ - } while ((wall_clock->version & 1) || (version != wall_clock->version)); + nversion = ACCESS_ONCE(wall_clock->version); + } while ((nversion & 1) || (version != nversion)); delta = pvclock_clocksource_read(vcpu_time); /* time since system boot */ delta += now.tv_sec * (u64)NSEC_PER_SEC + now.tv_nsec; -- 1.8.4.2
Re: [Xen-devel] [PATCH] KVM, XEN: Fix potential race in pvclock code
On 01/16/2014 04:04 PM, Jan Beulich wrote: I don't think so - this would only be an issue if the conditions used | instead of ||. || implies a sequence point between evaluating the left and right sides, and the standard says: "The presence of a sequence point between the evaluation of expressions A and B implies that every value computation and side effect associated with A is sequenced before every value computation and side effect associated with B." This only applies to single-threaded code. Multithreaded code must be data-race free for that to be true. See https://lwn.net/Articles/508991/ And even if there were a problem (i.e. my interpretation of the above being incorrect), I don't think you'd need ACCESS_ONCE() here: The same local variable can't have two different values in two different use sites when there was no intermediate assignment to it. Same comment as above. Julian
[PATCH v4] KVM: x86: Make register state after reset conform to specification
VMX now behaves like SVM with respect to FPU initialization: the code has been moved to the generic code path. General-purpose registers are now cleared on reset and INIT, and the SVM code properly initializes EDX. Signed-off-by: Julian Stecklina jstec...@os.inf.tu-dresden.de --- arch/x86/kvm/cpuid.c | 1 + arch/x86/kvm/svm.c | 14 ++ arch/x86/kvm/vmx.c | 7 --- arch/x86/kvm/x86.c | 10 ++ 4 files changed, 17 insertions(+), 15 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 52f6166..a20ecb5 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -661,6 +661,7 @@ void kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx, u32 *ecx, u32 *edx) } else *eax = *ebx = *ecx = *edx = 0; } +EXPORT_SYMBOL_GPL(kvm_cpuid); void kvm_emulate_cpuid(struct kvm_vcpu *vcpu) { diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index dcb79527..d29d3cd 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -20,6 +20,7 @@ #include "mmu.h" #include "kvm_cache_regs.h" #include "x86.h" +#include "cpuid.h" #include <linux/module.h> #include <linux/mod_devicetable.h> @@ -1193,6 +1194,8 @@ static void init_vmcb(struct vcpu_svm *svm) static int svm_vcpu_reset(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); + u32 dummy; + u32 eax = 1; init_vmcb(svm); @@ -1201,8 +1204,9 @@ static int svm_vcpu_reset(struct kvm_vcpu *vcpu) svm->vmcb->save.cs.base = svm->vcpu.arch.sipi_vector << 12; svm->vmcb->save.cs.selector = svm->vcpu.arch.sipi_vector << 8; } - vcpu->arch.regs_avail = ~0; - vcpu->arch.regs_dirty = ~0; + + kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy); + kvm_register_write(vcpu, VCPU_REGS_RDX, eax); return 0; } @@ -1259,10 +1263,6 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id) svm->asid_generation = 0; init_vmcb(svm); - err = fx_init(&svm->vcpu); - if (err) - goto free_page4; - svm->vcpu.arch.apic_base = 0xfee00000 | MSR_IA32_APICBASE_ENABLE; if (kvm_vcpu_is_bsp(&svm->vcpu)) svm->vcpu.arch.apic_base |= MSR_IA32_APICBASE_BSP; @@ -1271,8 +1271,6 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id) return &svm->vcpu; -free_page4: - __free_page(hsave_page); free_page3: __free_pages(nested_msrpm_pages, MSRPM_ALLOC_ORDER); free_page2: diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd2046..4ea3cb3 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3934,8 +3934,6 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu) u64 msr; int ret; - vcpu->arch.regs_avail = ~((1 << VCPU_REGS_RIP) | (1 << VCPU_REGS_RSP)); - vmx->rmode.vm86_active = 0; vmx->soft_vnmi_blocked = 0; @@ -3947,10 +3945,6 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu) msr |= MSR_IA32_APICBASE_BSP; kvm_set_apic_base(&vmx->vcpu, msr); - ret = fx_init(&vmx->vcpu); - if (ret != 0) - goto out; - vmx_segment_cache_clear(vmx); seg_setup(VCPU_SREG_CS); @@ -3991,7 +3985,6 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu) kvm_rip_write(vcpu, 0xfff0); else kvm_rip_write(vcpu, 0); - kvm_register_write(vcpu, VCPU_REGS_RSP, 0); vmcs_writel(GUEST_GDTR_BASE, 0); vmcs_write32(GUEST_GDTR_LIMIT, 0xffff); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 3bdaf29..57c76e8 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6461,6 +6461,10 @@ static int kvm_vcpu_reset(struct kvm_vcpu *vcpu) kvm_pmu_reset(vcpu); + memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs)); + vcpu->arch.regs_avail = ~0; + vcpu->arch.regs_dirty = ~0; + return kvm_x86_ops->vcpu_reset(vcpu); } @@ -6629,11 +6633,17 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) if (!zalloc_cpumask_var(&vcpu->arch.wbinvd_dirty_mask, GFP_KERNEL)) goto fail_free_mce_banks; + r = fx_init(vcpu); + if (r) + goto fail_free_wbinvd_dirty_mask; + vcpu->arch.ia32_tsc_adjust_msr = 0x0; kvm_async_pf_hash_reset(vcpu); kvm_pmu_init(vcpu); return 0; +fail_free_wbinvd_dirty_mask: + free_cpumask_var(vcpu->arch.wbinvd_dirty_mask); fail_free_mce_banks: kfree(vcpu->arch.mce_banks); fail_free_lapic: -- 1.7.11.7
[PATCH v5] KVM: x86: Make register state after reset conform to specification
VMX now behaves like SVM with respect to FPU initialization: the code has been moved to the generic code path. General-purpose registers are now cleared on reset and INIT, and the SVM code properly initializes EDX. Signed-off-by: Julian Stecklina jstec...@os.inf.tu-dresden.de --- arch/x86/kvm/cpuid.c | 1 + arch/x86/kvm/svm.c | 14 ++ arch/x86/kvm/vmx.c | 8 arch/x86/kvm/x86.c | 10 ++ 4 files changed, 17 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 52f6166..a20ecb5 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -661,6 +661,7 @@ void kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx, u32 *ecx, u32 *edx) } else *eax = *ebx = *ecx = *edx = 0; } +EXPORT_SYMBOL_GPL(kvm_cpuid); void kvm_emulate_cpuid(struct kvm_vcpu *vcpu) { diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index dcb79527..d29d3cd 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -20,6 +20,7 @@ #include "mmu.h" #include "kvm_cache_regs.h" #include "x86.h" +#include "cpuid.h" #include <linux/module.h> #include <linux/mod_devicetable.h> @@ -1193,6 +1194,8 @@ static void init_vmcb(struct vcpu_svm *svm) static int svm_vcpu_reset(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); + u32 dummy; + u32 eax = 1; init_vmcb(svm); @@ -1201,8 +1204,9 @@ static int svm_vcpu_reset(struct kvm_vcpu *vcpu) svm->vmcb->save.cs.base = svm->vcpu.arch.sipi_vector << 12; svm->vmcb->save.cs.selector = svm->vcpu.arch.sipi_vector << 8; } - vcpu->arch.regs_avail = ~0; - vcpu->arch.regs_dirty = ~0; + + kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy); + kvm_register_write(vcpu, VCPU_REGS_RDX, eax); return 0; } @@ -1259,10 +1263,6 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id) svm->asid_generation = 0; init_vmcb(svm); - err = fx_init(&svm->vcpu); - if (err) - goto free_page4; - svm->vcpu.arch.apic_base = 0xfee00000 | MSR_IA32_APICBASE_ENABLE; if (kvm_vcpu_is_bsp(&svm->vcpu)) svm->vcpu.arch.apic_base |= MSR_IA32_APICBASE_BSP; @@ -1271,8 +1271,6 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id) return &svm->vcpu; -free_page4: - __free_page(hsave_page); free_page3: __free_pages(nested_msrpm_pages, MSRPM_ALLOC_ORDER); free_page2: diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd2046..6adbad6 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3934,8 +3934,6 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu) u64 msr; int ret; - vcpu->arch.regs_avail = ~((1 << VCPU_REGS_RIP) | (1 << VCPU_REGS_RSP)); - vmx->rmode.vm86_active = 0; vmx->soft_vnmi_blocked = 0; @@ -3947,10 +3945,6 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu) msr |= MSR_IA32_APICBASE_BSP; kvm_set_apic_base(&vmx->vcpu, msr); - ret = fx_init(&vmx->vcpu); - if (ret != 0) - goto out; - vmx_segment_cache_clear(vmx); seg_setup(VCPU_SREG_CS); @@ -3991,7 +3985,6 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu) kvm_rip_write(vcpu, 0xfff0); else kvm_rip_write(vcpu, 0); - kvm_register_write(vcpu, VCPU_REGS_RSP, 0); vmcs_writel(GUEST_GDTR_BASE, 0); vmcs_write32(GUEST_GDTR_LIMIT, 0xffff); @@ -4041,7 +4034,6 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu) /* HACK: Don't enable emulation on guest boot/reset */ vmx->emulation_required = 0; -out: return ret; } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 3bdaf29..57c76e8 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6461,6 +6461,10 @@ static int kvm_vcpu_reset(struct kvm_vcpu *vcpu) kvm_pmu_reset(vcpu); + memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs)); + vcpu->arch.regs_avail = ~0; + vcpu->arch.regs_dirty = ~0; + return kvm_x86_ops->vcpu_reset(vcpu); } @@ -6629,11 +6633,17 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) if (!zalloc_cpumask_var(&vcpu->arch.wbinvd_dirty_mask, GFP_KERNEL)) goto fail_free_mce_banks; + r = fx_init(vcpu); + if (r) + goto fail_free_wbinvd_dirty_mask; + vcpu->arch.ia32_tsc_adjust_msr = 0x0; kvm_async_pf_hash_reset(vcpu); kvm_pmu_init(vcpu); return 0; +fail_free_wbinvd_dirty_mask: + free_cpumask_var(vcpu->arch.wbinvd_dirty_mask); fail_free_mce_banks: kfree(vcpu->arch.mce_banks); fail_free_lapic: -- 1.7.11.7
Re: [PATCH v4] KVM: x86: Make register state after reset conform to specification
Thus spake Gleb Natapov g...@redhat.com: - ret = fx_init(&vmx->vcpu); - if (ret != 0) - goto out; - Label out is now unused. Compiler complains. 5th time's the charm. ;) Patch is updated. Julian
[PATCH v3] KVM: x86: Make register state after reset conform to specification
VMX now behaves like SVM with respect to FPU initialization: the code has been moved to the generic code path. General-purpose registers are now cleared on reset and INIT, and the SVM code properly initializes EDX. Signed-off-by: Julian Stecklina jstec...@os.inf.tu-dresden.de --- arch/x86/kvm/cpuid.c | 1 + arch/x86/kvm/svm.c | 14 ++ arch/x86/kvm/vmx.c | 7 --- arch/x86/kvm/x86.c | 8 4 files changed, 15 insertions(+), 15 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 0595f13..aa468c2 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -659,6 +659,7 @@ void kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx, u32 *ecx, u32 *edx) } else *eax = *ebx = *ecx = *edx = 0; } +EXPORT_SYMBOL_GPL(kvm_cpuid); void kvm_emulate_cpuid(struct kvm_vcpu *vcpu) { diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index baead95..6c50121 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -20,6 +20,7 @@ #include "mmu.h" #include "kvm_cache_regs.h" #include "x86.h" +#include "cpuid.h" #include <linux/module.h> #include <linux/mod_devicetable.h> @@ -1190,6 +1191,8 @@ static void init_vmcb(struct vcpu_svm *svm) static int svm_vcpu_reset(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); + u32 dummy; + u32 eax = 1; init_vmcb(svm); @@ -1198,8 +1201,9 @@ static int svm_vcpu_reset(struct kvm_vcpu *vcpu) svm->vmcb->save.cs.base = svm->vcpu.arch.sipi_vector << 12; svm->vmcb->save.cs.selector = svm->vcpu.arch.sipi_vector << 8; } - vcpu->arch.regs_avail = ~0; - vcpu->arch.regs_dirty = ~0; + + kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy); + kvm_register_write(vcpu, VCPU_REGS_RDX, eax); return 0; } @@ -1257,10 +1261,6 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id) init_vmcb(svm); kvm_write_tsc(&svm->vcpu, 0); - err = fx_init(&svm->vcpu); - if (err) - goto free_page4; - svm->vcpu.arch.apic_base = 0xfee00000 | MSR_IA32_APICBASE_ENABLE; if (kvm_vcpu_is_bsp(&svm->vcpu)) svm->vcpu.arch.apic_base |= MSR_IA32_APICBASE_BSP; @@ -1269,8 +1269,6 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id) return &svm->vcpu; -free_page4: - __free_page(hsave_page); free_page3: __free_pages(nested_msrpm_pages, MSRPM_ALLOC_ORDER); free_page2: diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index ff66a3b..85cecfe 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3942,8 +3942,6 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu) u64 msr; int ret; - vcpu->arch.regs_avail = ~((1 << VCPU_REGS_RIP) | (1 << VCPU_REGS_RSP)); - vmx->rmode.vm86_active = 0; vmx->soft_vnmi_blocked = 0; @@ -3955,10 +3953,6 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu) msr |= MSR_IA32_APICBASE_BSP; kvm_set_apic_base(&vmx->vcpu, msr); - ret = fx_init(&vmx->vcpu); - if (ret != 0) - goto out; - vmx_segment_cache_clear(vmx); seg_setup(VCPU_SREG_CS); @@ -3999,7 +3993,6 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu) kvm_rip_write(vcpu, 0xfff0); else kvm_rip_write(vcpu, 0); - kvm_register_write(vcpu, VCPU_REGS_RSP, 0); vmcs_writel(GUEST_DR7, 0x400); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 2966c84..2e031bf 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6066,6 +6066,10 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu) kvm_pmu_reset(vcpu); + memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs)); + vcpu->arch.regs_avail = ~0; + vcpu->arch.regs_dirty = ~0; + return kvm_x86_ops->vcpu_reset(vcpu); } @@ -6229,6 +6233,10 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) if (!zalloc_cpumask_var(&vcpu->arch.wbinvd_dirty_mask, GFP_KERNEL)) goto fail_free_mce_banks; + r = fx_init(vcpu); + if (r) + goto fail_free_mce_banks; + kvm_async_pf_hash_reset(vcpu); kvm_pmu_init(vcpu); -- 1.7.11.7
Re: KVM VMX: register state after reset violates spec
Thus spake Gleb Natapov g...@redhat.com: On Thu, Nov 29, 2012 at 03:07:38PM +0100, Julian Stecklina wrote: Hello, we have noticed that at least on 3.6.8 with VMX after a VCPU has been reset via the INIT-SIPI-SIPI sequence its register state violates Intel's specification. [...] Shouldn't vmx_vcpu_reset actively clear those registers? And from a quick glance at the SVM code the problem might exist there, too. It should, so why not move the fix to kvm_vcpu_reset() so it will work for both. Also what about R8-R15? Intel SDM says nothing about them in the section you mention, but in Volume 1 section 3.4.1.1 it says: [...] I take it that they are undefined on the first transition to 64-bit mode too. AMD spec says that they should be zeroed on reset, so lets do that. Also SVM does not set EDX to the correct value on reset. I'll post a revised patch later today. Julian
[PATCH v2] KVM: x86: Make register state after reset conform to specification
Floating point initialization is moved to kvm_arch_vcpu_init, and general-purpose register clearing is added to the same function. SVM code now properly initializes EDX. Signed-off-by: Julian Stecklina jstec...@os.inf.tu-dresden.de --- arch/x86/kvm/cpuid.c | 1 + arch/x86/kvm/svm.c | 7 +-- arch/x86/kvm/vmx.c | 7 +-- arch/x86/kvm/x86.c | 12 +++- 4 files changed, 18 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 0595f13..aa468c2 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -659,6 +659,7 @@ void kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx, u32 *ecx, u32 *edx) } else *eax = *ebx = *ecx = *edx = 0; } +EXPORT_SYMBOL_GPL(kvm_cpuid); void kvm_emulate_cpuid(struct kvm_vcpu *vcpu) { diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index baead95..642213a 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -20,6 +20,7 @@ #include "mmu.h" #include "kvm_cache_regs.h" #include "x86.h" +#include "cpuid.h" #include <linux/module.h> #include <linux/mod_devicetable.h> @@ -1190,6 +1191,7 @@ static void init_vmcb(struct vcpu_svm *svm) static int svm_vcpu_reset(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); + u32 dummy, eax; init_vmcb(svm); @@ -1198,8 +1200,9 @@ static int svm_vcpu_reset(struct kvm_vcpu *vcpu) svm->vmcb->save.cs.base = svm->vcpu.arch.sipi_vector << 12; svm->vmcb->save.cs.selector = svm->vcpu.arch.sipi_vector << 8; } - vcpu->arch.regs_avail = ~0; - vcpu->arch.regs_dirty = ~0; + + kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy); + kvm_register_write(vcpu, VCPU_REGS_RDX, eax); return 0; } diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index ff66a3b..363fb03 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3942,7 +3942,7 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu) u64 msr; int ret; - vcpu->arch.regs_avail = ~((1 << VCPU_REGS_RIP) | (1 << VCPU_REGS_RSP)); + vcpu->arch.regs_avail = ~(1 << VCPU_REGS_RIP); vmx->rmode.vm86_active = 0; @@ -3955,10 +3955,6 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu) msr |= MSR_IA32_APICBASE_BSP; kvm_set_apic_base(&vmx->vcpu, msr); - ret = fx_init(&vmx->vcpu); - if (ret != 0) - goto out; - vmx_segment_cache_clear(vmx); seg_setup(VCPU_SREG_CS); @@ -3999,7 +3995,6 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu) kvm_rip_write(vcpu, 0xfff0); else kvm_rip_write(vcpu, 0); - kvm_register_write(vcpu, VCPU_REGS_RSP, 0); vmcs_writel(GUEST_DR7, 0x400); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 2966c84..9640ea5 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6045,6 +6045,8 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu) int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu) { + int ret; + atomic_set(&vcpu->arch.nmi_queued, 0); vcpu->arch.nmi_pending = 0; vcpu->arch.nmi_injected = false; @@ -6066,7 +6068,15 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu) kvm_pmu_reset(vcpu); - return kvm_x86_ops->vcpu_reset(vcpu); + memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs)); + vcpu->arch.regs_avail = ~0; + vcpu->arch.regs_dirty = ~0; + + if ((ret = fx_init(vcpu)) != 0) goto out; + + ret = kvm_x86_ops->vcpu_reset(vcpu); +out: + return ret; } int kvm_arch_hardware_enable(void *garbage) -- 1.7.11.7
Re: KVM VMX: register state after reset violates spec
Thus spake Gleb Natapov g...@redhat.com: It should, so why not move the fix to kvm_vcpu_reset() so it will work for both. Also what about R8-R15? Intel SDM says nothing about them in the section you mention, but in Volume 1 section 3.4.1.1 is says: [...] I take it that they are undefined on the first transition to 64-bit mode too. AMD spec says that they should be zeroed on reset, so lets do that. Also SVM does not set EDX to correct value on reset. It should be: I have posted a new version of the patch taking your suggestions into account. The VMX version is working for me. I could not test it on AMD hardware, though. Julian
KVM VMX: register state after reset violates spec
Hello, we have noticed that at least on 3.6.8 with VMX after a VCPU has been reset via the INIT-SIPI-SIPI sequence its register state violates Intel's specification. Specifically for our case we see at the end of vmx_vcpu_reset the following vcpu state: regs_avail=ffef regs_dirty=00010010 EIP= EAX=06e8 EBX=0001 ECX=8001 EDX=0600 ESI=d238 EDI= EBP= ESP= although EAX, EBX, ECX, ESI, EDI, EBP, ESP should _all_ be zero. See http://download.intel.com/products/processor/manual/253668.pdf section 9.1.1 (page 9-2). Shouldn't vmx_vcpu_reset actively clear those registers? And from a quick glance at the SVM code the problem might exist there, too. A workaround is to use qemu-kvm with -kvm-no-irqchip. Julian
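The spec requirement the report cites can be stated as a tiny checker over a register dump like the one above. This is an illustrative user-space sketch (the struct layout is ours, not KVM's): per the Intel SDM section referenced, all general-purpose registers should be zero after reset/INIT except EDX, which carries the processor signature.

```c
#include <stdint.h>

/* 32-bit GPR snapshot, mirroring the dump format in the report. */
struct gpr_dump {
    uint32_t eax, ebx, ecx, edx, esi, edi, ebp, esp;
};

/* Returns 1 if the dump matches the SDM's post-reset requirements:
 * all GPRs zero.  EDX is deliberately not checked because it holds
 * the processor signature (family/model/stepping) after reset. */
static int conforms_to_reset_spec(const struct gpr_dump *r)
{
    return r->eax == 0 && r->ebx == 0 && r->ecx == 0 &&
           r->esi == 0 && r->edi == 0 && r->ebp == 0 && r->esp == 0;
}
```

The dump in the report (non-zero EAX, EBX, ECX, ESI) fails this check, which is exactly the violation being reported.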
Re: Networking latency - what to expect?
Thus spake David Mohr damaili...@mcbf.net: * vm->vm (same host): 22k This number is in the same ballpark as what I am seeing on pretty much the same hardware. AFAICS, there is little you can do to the current virtio-virtio code path that would make this substantially faster. Julian
[PATCH] KVM VMX: Make register state after reset conform to specification.
Signed-off-by: Julian Stecklina jstec...@os.inf.tu-dresden.de --- arch/x86/kvm/vmx.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index ff66a3b..ec5a3b3 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3948,7 +3948,6 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu) vmx->soft_vnmi_blocked = 0; - vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val(); kvm_set_cr8(&vmx->vcpu, 0); msr = 0xfee00000 | MSR_IA32_APICBASE_ENABLE; if (kvm_vcpu_is_bsp(&vmx->vcpu)) @@ -3999,6 +3998,14 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu) kvm_rip_write(vcpu, 0xfff0); else kvm_rip_write(vcpu, 0); + + kvm_register_write(vcpu, VCPU_REGS_RAX, 0); + kvm_register_write(vcpu, VCPU_REGS_RBX, 0); + kvm_register_write(vcpu, VCPU_REGS_RCX, 0); + kvm_register_write(vcpu, VCPU_REGS_RDX, get_rdx_init_val()); + kvm_register_write(vcpu, VCPU_REGS_RSI, 0); + kvm_register_write(vcpu, VCPU_REGS_RDI, 0); + kvm_register_write(vcpu, VCPU_REGS_RBP, 0); kvm_register_write(vcpu, VCPU_REGS_RSP, 0); vmcs_writel(GUEST_DR7, 0x400); -- 1.7.11.7
Re: Networking latency - what to expect?
Thus spake David Mohr damaili...@mcbf.net: On 2012-11-29 07:48, Julian Stecklina wrote: Thus spake David Mohr damaili...@mcbf.net: * vm->vm (same host): 22k This number is in the same ballpark as what I am seeing on pretty much the same hardware. AFAICS, there is little you can do to the current virtio-virtio code path that would make this substantially faster. Thanks for the feedback. Considering that it's better than the hardware network performance, my main issue is actually the latency of communication between VMs on different hosts: * vm->vm (diff. hosts): 7k It is obvious that there is a lot more going on compared to same-host communication, but only ~30% of the performance when the network hardware should not be slowing it down (too) much? You are probably better off using SR-IOV NICs with PCI passthrough in this case. Maybe someone can comment on whether virtual interrupt delivery and posted interrupts[1] are already usable. The first one should help in either the virtio or SR-IOV scenario, the latter only applies to SR-IOV (and PCI passthrough in general). Julian [1] http://www.spinics.net/lists/kvm/msg82762.html
[PATCH] KVM: Enable VMX-related bits in MSR_IA32_FEATURE_CONTROL.
The spec (Vol 3C, Chapter 34.1) regarding the IA32_FEATURE_CONTROL MSR says "Therefore the lock bit must be set after configuring support for Intel Virtualization Technology and prior to transferring control to an option ROM or the OS." and regarding bit 2: "This bit enables VMX for system executive that do not require SMX."

Signed-off-by: Julian Stecklina j...@alien8.de
---
 arch/x86/kvm/vmx.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4ea7678..aef1e5b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2007,7 +2007,12 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
 	switch (msr_index) {
 	case MSR_IA32_FEATURE_CONTROL:
-		*pdata = 0;
+		/*
+		 * If nested VMX is enabled, set the lock bit (bit 0)
+		 * and the "Enable VMX outside SMX" bit (bit 2) in the
+		 * FEATURE_CONTROL MSR.
+		 */
+		*pdata = nested_vmx_allowed(vcpu) ? 0x5 : 0;
 		break;
 	case MSR_IA32_VMX_BASIC:
 		/*
--
1.7.9.2
Re: [PATCH] KVM: Enable VMX-related bits in MSR_IA32_FEATURE_CONTROL.
Thus spake Avi Kivity a...@redhat.com: On 03/06/2012 05:02 PM, Julian Stecklina wrote: The spec (Vol 3C, Chapter 34.1) regarding the IA32_FEATURE_CONTROL MSR says "Therefore the lock bit must be set after configuring support for Intel Virtualization Technology and prior to transferring control to an option ROM or the OS." and regarding bit 2: "This bit enables VMX for system executive that do not require SMX."

Signed-off-by: Julian Stecklina j...@alien8.de
---
 arch/x86/kvm/vmx.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4ea7678..aef1e5b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2007,7 +2007,12 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
 	switch (msr_index) {
 	case MSR_IA32_FEATURE_CONTROL:
-		*pdata = 0;
+		/*
+		 * If nested VMX is enabled, set the lock bit (bit 0)
+		 * and the "Enable VMX outside SMX" bit (bit 2) in the
+		 * FEATURE_CONTROL MSR.
+		 */
+		*pdata = nested_vmx_allowed(vcpu) ? 0x5 : 0;
 		break;
 	case MSR_IA32_VMX_BASIC:
 		/*

The way I read it, it should be done by the guest, not the host.

It should be done by the BIOS, before it hands off control to the OS kernel. The question is whether you want to emulate this MSR at this level or just set it to sane values when nested VMX should be enabled and be happy. :)

Regards, Julian
Re: [PATCH] KVM: Enable VMX-related bits in MSR_IA32_FEATURE_CONTROL.
Am Dienstag, den 06.03.2012, 17:47 +0200 schrieb Nadav Har'El: On Tue, Mar 06, 2012, Avi Kivity wrote about "Re: [PATCH] KVM: Enable VMX-related bits in MSR_IA32_FEATURE_CONTROL.":

	case MSR_IA32_FEATURE_CONTROL:
-		*pdata = 0;
+		/*
+		 * If nested VMX is enabled, set the lock bit (bit 0)
+		 * and the "Enable VMX outside SMX" bit (bit 2) in the
+		 * FEATURE_CONTROL MSR.
+		 */
+		*pdata = nested_vmx_allowed(vcpu) ? 0x5 : 0;

0x5 can be written as FEATURE_CONTROL_LOCKED | FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX

Nice. Didn't know those constants. Next time I'll try harder to find those. :)

	break;
	case MSR_IA32_VMX_BASIC:
	/*

The way I read it, it should be done by the guest, not the host.

This is also how I understand it. Check out KVM's own hardware_enable() to see how a guest does turn on these bits before using VMXON - it doesn't need to rely on the BIOS to have done it earlier (the BIOS can, however, prevent the guest from doing this by setting only the lock bit).

After looking through the code (vmx_disabled_by_bios), it seems that KVM doesn't bother with FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX if FEATURE_CONTROL_LOCKED is not set. It seems like our kernel should do the same. Sorry for the noise.

What is true, however, is that the existing code is probably incomplete in three ways (see section 20.7 of the SDM):
1. It always returns 0 for this MSR, even if the guest previously set it to something else.
2. handle_vmon() does not check the previous setting of this MSR. If the guest (or its BIOS) doesn't set both FEATURE_CONTROL_LOCKED and FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX, it should get a #GP on an attempt to VMXON. This will allow L1's BIOS to disable nested VMX if it wishes (not that I think this is a very useful usecase...).
3. A write via vmx_set_vmx_msr to MSR_IA32_FEATURE_CONTROL should throw a #GP if FEATURE_CONTROL_LOCKED is on.

I'll create a patch to do this shortly.

This is greatly appreciated!
Regards, Julian
Re: [PATCH] KVM: Don't mistreat edge-triggered INIT IPI as INIT de-assert. (LAPIC)
Am Freitag, den 13.01.2012, 08:52 -0200 schrieb Marcelo Tosatti: On Thu, Jan 12, 2012 at 06:07:51PM +0100, Julian Stecklina wrote: Am Freitag, den 23.12.2011, 08:40 -0200 schrieb Marcelo Tosatti: On Mon, Dec 19, 2011 at 02:14:27AM +0100, Julian Stecklina wrote: If the guest programs an IPI with level=0 (de-assert) and trig_mode=0 (edge), it is erroneously treated as INIT de-assert and ignored, but to quote the spec: "For this delivery mode [INIT de-assert], the level flag must be set to 0 and trigger mode flag to 1."

Yes, the implementation ignores INIT de-assert. Quoting the spec: "(INIT Level De-assert) (Not supported in the Pentium 4 and Intel Xeon processors.)" Your patch below is not improving the implementation to be closer to the spec: it'll trigger the INIT state initialization with trig_mode == 0 (which is not in accordance with your spec quote above).

After reading the spec again and consulting with the guy who wrote the code triggering this, it seems the whole if (level) in the code path below is superfluous.

No. Look at what's inside if (level): the mp_state assignment is the internal implementation of "delivers an INIT request to the target processor". According to the spec, the INIT level de-assert "Sends a synchronization message to all the local APICs in the system to set their arbitration IDs (stored in their Arb ID registers) to the values of their APIC IDs (see Section 10.7, “System and APIC Bus Arbitration”)." So if you remove the if (level) check, INIT de-assert will be emulated as INIT!

Newer processors don't support INIT level de-assert and will interpret this as INIT. Without the if (level) check, KVM would behave in the same way, thus not breaking code that actually runs on real processors. For processors that still supported INIT level de-assert: If you look into older specs (243192), you read: "101 (INIT) ... INIT is treated as an edge triggered interrupt even if programmed otherwise."
"101 (INIT Level De-assert) The trigger mode must also be set to 1 and level mode to 0." This means that if you don't set trigger mode to 1, you will get an INIT instead of INIT level de-assert. This is where the current code in KVM is wrong. So with my original patch, KVM would behave like the old spec mandates (check for trigger mode). With the if (level) check removed, it would behave like recent processors. Either way, the current code is bogus.

Regards, Julian
Re: [PATCH] KVM: Don't mistreat edge-triggered INIT IPI as INIT de-assert. (LAPIC)
Am Freitag, den 23.12.2011, 08:40 -0200 schrieb Marcelo Tosatti: On Mon, Dec 19, 2011 at 02:14:27AM +0100, Julian Stecklina wrote: If the guest programs an IPI with level=0 (de-assert) and trig_mode=0 (edge), it is erroneously treated as INIT de-assert and ignored, but to quote the spec: "For this delivery mode [INIT de-assert], the level flag must be set to 0 and trigger mode flag to 1."

Yes, the implementation ignores INIT de-assert. Quoting the spec: "(INIT Level De-assert) (Not supported in the Pentium 4 and Intel Xeon processors.)" Your patch below is not improving the implementation to be closer to the spec: it'll trigger the INIT state initialization with trig_mode == 0 (which is not in accordance with your spec quote above).

After reading the spec again and consulting with the guy who wrote the code triggering this, it seems the whole if (level) in the code path below is superfluous. Regarding level my spec says: "This flag has no meaning in Pentium 4 and Intel Xeon processors, and will always be issued as a 1." Judging from Section 10.2 this means every CPU where the LAPICs communicate over the system bus (everything after and including the P4). Thus the if can be removed, since its condition is always true (regardless of what the programmer actually programmed into the level field). If there is no objection, I'll submit a revised patch.

Signed-off-by: Julian Stecklina j...@alien8.de
---
 arch/x86/kvm/lapic.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index a7f3e65..260770d 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -433,7 +433,7 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
 		break;
 
 	case APIC_DM_INIT:
-		if (level) {
+		if (!trig_mode || level) {
 			result = 1;
 			vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
 			kvm_make_request(KVM_REQ_EVENT, vcpu);
--
1.7.7.4

Regards, Julian
Re: [PATCH] KVM: Don't mistreat edge-triggered INIT IPI as INIT de-assert. (LAPIC)
On Fr, 2011-12-23 at 08:40 -0200, Marcelo Tosatti wrote: On Mon, Dec 19, 2011 at 02:14:27AM +0100, Julian Stecklina wrote: If the guest programs an IPI with level=0 (de-assert) and trig_mode=0 (edge), it is erroneously treated as INIT de-assert and ignored, but to quote the spec: "For this delivery mode [INIT de-assert], the level flag must be set to 0 and trigger mode flag to 1." Yes, the implementation ignores INIT de-assert. Quoting the spec: "(INIT Level De-assert) (Not supported in the Pentium 4 and Intel Xeon processors.)" Your patch below is not improving the implementation to be closer to the spec: it'll trigger the INIT state initialization with trig_mode == 0 (which is not in accordance with your spec quote above).

I think our code that triggers this does weird things. Let me check that. Until then ignore that patch and sorry for the noise. :) At least it seems as if real hardware is interpreting the spec in a slightly different way...

Regards, Julian
[PATCH] KVM: Don't mistreat edge-triggered INIT IPI as INIT de-assert. (LAPIC)
If the guest programs an IPI with level=0 (de-assert) and trig_mode=0 (edge), it is erroneously treated as INIT de-assert and ignored, but to quote the spec: "For this delivery mode [INIT de-assert], the level flag must be set to 0 and trigger mode flag to 1."

Signed-off-by: Julian Stecklina j...@alien8.de
---
 arch/x86/kvm/lapic.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index a7f3e65..260770d 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -433,7 +433,7 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
 		break;
 
 	case APIC_DM_INIT:
-		if (level) {
+		if (!trig_mode || level) {
 			result = 1;
 			vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
 			kvm_make_request(KVM_REQ_EVENT, vcpu);
--
1.7.7.4
Re: [PATCH] Allow aligned byte and word writes to IOAPIC registers.
On Mi, 2011-11-23 at 12:47 +0200, Avi Kivity wrote: On 11/22/2011 06:09 PM, Julian Stecklina wrote: This fixes byte accesses to IOAPIC_REG_SELECT as mandated by at least the ICH10 and Intel Series 5 chipset specs. It also makes ioapic_mmio_write consistent with ioapic_mmio_read, which also allows byte and word accesses. Your patch indents with spaces, while Linux uses tabs for indents.

True. I'll post a revised version in a minute. I think this is my second patch to Linux in total, so I am slowly getting there... Thanks for the patience.

Regards, Julian
[PATCHv2] KVM: Allow aligned byte and word writes to IOAPIC registers.
This fixes byte accesses to IOAPIC_REG_SELECT as mandated by at least the ICH10 and Intel Series 5 chipset specs. It also makes ioapic_mmio_write consistent with ioapic_mmio_read, which also allows byte and word accesses.

Signed-off-by: Julian Stecklina j...@alien8.de
---
 virt/kvm/ioapic.c | 15 ++++++++++++---
 1 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index 3eed61e..71e2253 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -332,9 +332,18 @@ static int ioapic_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
 		 (void*)addr, len, val);
 	ASSERT(!(addr & 0xf));	/* check alignment */
-	if (len == 4 || len == 8)
+	switch (len) {
+	case 8:
+	case 4:
 		data = *(u32 *) val;
-	else {
+		break;
+	case 2:
+		data = *(u16 *) val;
+		break;
+	case 1:
+		data = *(u8 *) val;
+		break;
+	default:
 		printk(KERN_WARNING "ioapic: Unsupported size %d\n", len);
 		return 0;
 	}
@@ -343,7 +352,7 @@ static int ioapic_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
 	spin_lock(&ioapic->lock);
 	switch (addr) {
 	case IOAPIC_REG_SELECT:
-		ioapic->ioregsel = data;
+		ioapic->ioregsel = data & 0xFF;	/* 8-bit register */
 		break;
 
 	case IOAPIC_REG_WINDOW:
--
1.7.7.3
IOAPIC doesn't handle byte writes
Hello, KVM emulates an IOAPIC that doesn't handle byte writes to its IOAPIC_REG_SELECT register, although for example the ICH10 spec clearly states that this is an 8-bit register. See http://www.intel.com/content/dam/doc/datasheet/io-controller-hub-10-family-datasheet.pdf Table 13-4 on page 433. The code in question is: http://git.kernel.org/?p=virt/kvm/kvm.git;a=blob;f=virt/kvm/ioapic.c;h=3eed61eb48675a63dd1f31b0095217ab6bc5f646;hb=HEAD#l323 This breaks IOAPIC code in OSes that adhere to the spec. I've created a small testcase[1]:

$ qemu-kvm -serial stdio -kernel ioapic
[26303.961804] ioapic: Unsupported size 1
IOAPIC ID
[26303.970466] ioapic: Unsupported size 1
IOAPIC VER
Done
qemu: terminating on signal 2
$ qemu-kvm -no-kvm-irqchip -serial stdio -kernel ioapic
IOAPIC ID
IOAPIC VER 00170011
Done
qemu: terminating on signal 2

Expected behavior is that the IOAPIC register is not read as zero with KVM irqchip emulation. I would file a bug, but the kernel bugzilla seems to be down at the moment.

Regards, Julian

[1] http://os.inf.tu-dresden.de/~jsteckli/tmp/ioapic
Re: IOAPIC doesn't handle byte writes
Hello, Avi Kivity wrote: Care to post a patch instead? Sure. Never hacked KVM, though. Is there a particular reason why the void *val argument to ioapic_mmio_read/_write is only dereferenced when ioapic->lock is not held?

Regards, Julian
[PATCH] Allow aligned byte and word writes to IOAPIC registers.
This fixes byte accesses to IOAPIC_REG_SELECT as mandated by at least the ICH10 and Intel Series 5 chipset specs. It also makes ioapic_mmio_write consistent with ioapic_mmio_read, which also allows byte and word accesses.

Signed-off-by: Julian Stecklina j...@alien8.de
---
 virt/kvm/ioapic.c | 17 +++++++++++-----
 1 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index 3eed61e..e94ef6ba 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -332,9 +332,18 @@ static int ioapic_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
 		 (void*)addr, len, val);
 	ASSERT(!(addr & 0xf));	/* check alignment */
-	if (len == 4 || len == 8)
-		data = *(u32 *) val;
-	else {
+    switch (len) {
+    case 8:
+    case 4:
+        data = *(u32 *) val;
+        break;
+    case 2:
+        data = *(u16 *) val;
+        break;
+    case 1:
+        data = *(u8 *) val;
+        break;
+    default:
 		printk(KERN_WARNING "ioapic: Unsupported size %d\n", len);
 		return 0;
 	}
@@ -343,7 +352,7 @@ static int ioapic_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
 	spin_lock(&ioapic->lock);
 	switch (addr) {
 	case IOAPIC_REG_SELECT:
-		ioapic->ioregsel = data;
+		ioapic->ioregsel = data & 0xFF;	/* 8-bit register */
 		break;
 
 	case IOAPIC_REG_WINDOW:
--
1.7.7.3
Re: [PATCH 13/15] Add NMI injection support to SVM.
Gleb Natapov g...@redhat.com writes: On Sun, Apr 19, 2009 at 04:43:12PM +0300, Avi Kivity wrote: Gleb Natapov wrote: On Sun, Apr 19, 2009 at 04:40:51PM +0300, Avi Kivity wrote: Gleb Natapov wrote: It's OK as a first step towards correct NMI emulation. Additionally, you could enable the IRQ window interception in case there is an NMI pending. The resulting behavior should then be much like the VNMI mask emulation for vmx. Yeah, but the question is if the IRQ window is already open, will the exit happen before or after IRET? You mean if the NMI handler enabled interrupts? Yes. Then the guest deserves whatever it gets... I suspect Windows may do this since it uses NMI for task switching.

Could you elaborate on that? How/why does it use NMIs for task switching?

Regards,
-- 
Julian Stecklina

"The day Microsoft makes something that doesn't suck is probably the day they start making vacuum cleaners" - Ernst Jan Plugge
Re: [PATCH 4/4] Fix task switching.
Gleb Natapov g...@redhat.com writes: On Mon, Mar 30, 2009 at 06:35:05PM +0200, Jan Kiszka wrote: Haven't tried. I wrote my own tests for task switching. How can I check it? There is a test case attached to Julian's sourceforge-reported bug: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2681442&group_id=180599 Works for me.

Then the patches should be fine (at least for me *g*).

Regards,
-- 
Julian Stecklina

"The day Microsoft makes something that doesn't suck is probably the day they start making vacuum cleaners" - Ernst Jan Plugge
Re: [PATCH] KVM: Improvements for task switching
Kohl, Bernhard (NSN - DE/Munich) bernhard.k...@nsn.com writes: Jan Kiszka wrote: [...] OK, after the discussion has finished, I will submit separate patches. Is there any progress on this?

I've been using this patch for several days now with no ill effects. The patch fixes Bug 2681442 for me: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2681442&group_id=180599

Regards,
-- 
Julian Stecklina

"The day Microsoft makes something that doesn't suck is probably the day they start making vacuum cleaners" - Ernst Jan Plugge