[PATCH v2 10/12] KVM: VMX: Setup TSC scaling ratio when a vcpu is loaded
This patch makes the kvm-intel module load the TSC scaling ratio into the
TSC multiplier field of the VMCS when a vcpu is loaded, so that the TSC
scaling ratio can take effect if VMX TSC scaling is enabled.

Signed-off-by: Haozhong Zhang
---
 arch/x86/kvm/vmx.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a86f790..c241ff3 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2062,6 +2062,12 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
 		vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
+
+		/* Setup TSC multiplier */
+		if (cpu_has_vmx_tsc_scaling())
+			vmcs_write64(TSC_MULTIPLIER,
+				     vcpu->arch.tsc_scaling_ratio);
+
 		vmx->loaded_vmcs->cpu = cpu;
 	}
--
2.4.8
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 00/12] KVM: x86: add support for VMX TSC scaling
This patchset adds support for the VMX TSC scaling feature, which is
available on Intel Skylake CPUs. The specification of VMX TSC scaling
can be found at
http://www.intel.com/content/www/us/en/processors/timestamp-counter-scaling-virtualization-white-paper.html

VMX TSC scaling allows the guest TSC, which is read by guest rdtsc(p)
instructions, to increase at a rate that is customized by the hypervisor
and can be different from the host TSC rate. Basically, VMX TSC scaling
adds a 64-bit field called "TSC multiplier" to the VMCS so that, if VMX
TSC scaling is enabled, the TSC read by guest rdtsc(p) instructions is
calculated by the following formula:

    guest EDX:EAX = (Host TSC * TSC multiplier) >> 48 + VMX TSC Offset

where Host TSC = Host MSR_IA32_TSC + Host MSR_IA32_TSC_ADJUST.

This patchset, when cooperating with another QEMU patchset (sent in
another email "target-i386: save/restore vcpu's TSC rate during
migration"), allows guest programs to observe a consistent TSC rate even
though they are migrated among machines with different host TSC rates.

VMX TSC scaling shares some common logic with the SVM TSC ratio, which
is already supported by KVM. Patches 1 ~ 8 move that common logic from
the SVM code into the common code. On top of them, patches 9 ~ 12 add
VMX-specific support for VMX TSC scaling.

Changes in v2:
 * Remove the duplicated variable 'kvm_tsc_scaling_ratio_rsvd'.
 * Remove an unnecessary error check in original patch 2.
 * Do 64-bit arithmetic by functions recommended by Paolo.
 * Make kvm_set_tsc_khz() return an error number so that ioctl
   KVM_SET_TSC_KHZ does not return 0 if errors happen.

Reviewed-by: Eric Northup

Haozhong Zhang (12):
  KVM: x86: Collect information for setting TSC scaling ratio
  KVM: x86: Add a common TSC scaling ratio field in kvm_vcpu_arch
  KVM: x86: Add a common TSC scaling function
  KVM: x86: Replace call-back set_tsc_khz() with a common function
  KVM: x86: Replace call-back compute_tsc_offset() with a common function
  KVM: x86: Move TSC scaling logic out of call-back adjust_tsc_offset()
  KVM: x86: Move TSC scaling logic out of call-back read_l1_tsc()
  KVM: x86: Use the correct vcpu's TSC rate to compute time scale
  KVM: VMX: Enable and initialize VMX TSC scaling
  KVM: VMX: Setup TSC scaling ratio when a vcpu is loaded
  KVM: VMX: Use a scaled host TSC for guest readings of MSR_IA32_TSC
  KVM: VMX: Dump TSC multiplier in dump_vmcs()

 arch/x86/include/asm/kvm_host.h |  24 +++
 arch/x86/include/asm/vmx.h      |   3 +
 arch/x86/kvm/lapic.c            |   4 +-
 arch/x86/kvm/svm.c              | 116 --------
 arch/x86/kvm/vmx.c              |  64 ++-
 arch/x86/kvm/x86.c              | 134 +++-
 include/linux/kvm_host.h        |  20 ++
 include/linux/math64.h          |  99 +
 8 files changed, 297 insertions(+), 167 deletions(-)

--
2.4.8
[PATCH v2 01/12] KVM: x86: Collect information for setting TSC scaling ratio
The number of bits of the fractional part of the 64-bit TSC scaling
ratio differs between VMX and SVM. This patch makes the architecture
code collect the number of fractional bits and other related information
into variables that can be accessed in the common code.

Signed-off-by: Haozhong Zhang
---
 arch/x86/include/asm/kvm_host.h | 6 ++++++
 arch/x86/kvm/svm.c              | 4 ++++
 arch/x86/kvm/x86.c              | 6 ++++++
 3 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 53deb27..0540dc8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -990,6 +990,12 @@ extern bool kvm_has_tsc_control;
 extern u32 kvm_min_guest_tsc_khz;
 /* maximum supported tsc_khz for guests */
 extern u32 kvm_max_guest_tsc_khz;
+/* number of bits of the fractional part of the TSC scaling ratio */
+extern u8 kvm_tsc_scaling_ratio_frac_bits;
+/* default TSC scaling ratio (= 1.0) */
+extern u64 kvm_default_tsc_scaling_ratio;
+/* maximum allowed value of TSC scaling ratio */
+extern u64 kvm_max_tsc_scaling_ratio;
 
 enum emulation_result {
 	EMULATE_DONE,         /* no further processing */
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index cd8659c..55f5f49 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -908,7 +908,11 @@ static __init int svm_hardware_setup(void)
 		max = min(0x7fffULL, __scale_tsc(tsc_khz, TSC_RATIO_MAX));
 
 		kvm_max_guest_tsc_khz = max;
+
+		kvm_max_tsc_scaling_ratio = TSC_RATIO_MAX;
+		kvm_tsc_scaling_ratio_frac_bits = 32;
 	}
+	kvm_default_tsc_scaling_ratio = TSC_RATIO_DEFAULT;
 
 	if (nested) {
 		printk(KERN_INFO "kvm: Nested Virtualization enabled\n");
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9e9c226..79cbbb5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -109,6 +109,12 @@ bool kvm_has_tsc_control;
 EXPORT_SYMBOL_GPL(kvm_has_tsc_control);
 u32 kvm_max_guest_tsc_khz;
 EXPORT_SYMBOL_GPL(kvm_max_guest_tsc_khz);
+u8 kvm_tsc_scaling_ratio_frac_bits;
+EXPORT_SYMBOL_GPL(kvm_tsc_scaling_ratio_frac_bits);
+u64 kvm_default_tsc_scaling_ratio;
+EXPORT_SYMBOL_GPL(kvm_default_tsc_scaling_ratio);
+u64 kvm_max_tsc_scaling_ratio;
+EXPORT_SYMBOL_GPL(kvm_max_tsc_scaling_ratio);
 
 /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
 static u32 tsc_tolerance_ppm = 250;
--
2.4.8
[PATCH v2 05/12] KVM: x86: Replace call-back compute_tsc_offset() with a common function
Both VMX and SVM calculate the tsc-offset in the same way, so this patch
removes the call-back compute_tsc_offset() and replaces it with a common
function kvm_compute_tsc_offset().

Signed-off-by: Haozhong Zhang
---
 arch/x86/include/asm/kvm_host.h |  1 -
 arch/x86/kvm/svm.c              | 10 --
 arch/x86/kvm/vmx.c              |  6 --
 arch/x86/kvm/x86.c              | 15 ---
 4 files changed, 12 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c67469b..d5e820b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -856,7 +856,6 @@ struct kvm_x86_ops {
 	u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu);
 	void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
-	u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc);
 	u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu, u64 host_tsc);
 
 	void (*get_exit_info)(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index a1364927..481fdd3 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1005,15 +1005,6 @@ static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment, bool ho
 	mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
 }
 
-static u64 svm_compute_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc)
-{
-	u64 tsc;
-
-	tsc = kvm_scale_tsc(vcpu, rdtsc());
-
-	return target_tsc - tsc;
-}
-
 static void init_vmcb(struct vcpu_svm *svm, bool init_event)
 {
 	struct vmcb_control_area *control = &svm->vmcb->control;
@@ -4371,7 +4362,6 @@ static struct kvm_x86_ops svm_x86_ops = {
 	.read_tsc_offset = svm_read_tsc_offset,
 	.write_tsc_offset = svm_write_tsc_offset,
 	.adjust_tsc_offset = svm_adjust_tsc_offset,
-	.compute_tsc_offset = svm_compute_tsc_offset,
 	.read_l1_tsc = svm_read_l1_tsc,
 
 	.set_tdp_cr3 = set_tdp_cr3,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 7f87cf6..7896395 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2426,11 +2426,6 @@ static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment, bool ho
 					   offset + adjustment);
 }
 
-static u64 vmx_compute_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc)
-{
-	return target_tsc - rdtsc();
-}
-
 static bool guest_cpuid_has_vmx(struct kvm_vcpu *vcpu)
 {
 	struct kvm_cpuid_entry2 *best = kvm_find_cpuid_entry(vcpu, 1, 0);
@@ -10815,7 +10810,6 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.read_tsc_offset = vmx_read_tsc_offset,
 	.write_tsc_offset = vmx_write_tsc_offset,
 	.adjust_tsc_offset = vmx_adjust_tsc_offset,
-	.compute_tsc_offset = vmx_compute_tsc_offset,
 	.read_l1_tsc = vmx_read_l1_tsc,
 
 	.set_tdp_cr3 = vmx_set_cr3,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index db5ef73..75129bd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1397,6 +1397,15 @@ u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc)
 }
 EXPORT_SYMBOL_GPL(kvm_scale_tsc);
 
+static u64 kvm_compute_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc)
+{
+	u64 tsc;
+
+	tsc = kvm_scale_tsc(vcpu, rdtsc());
+
+	return target_tsc - tsc;
+}
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr)
 {
 	struct kvm *kvm = vcpu->kvm;
@@ -1408,7 +1417,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	u64 data = msr->data;
 
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
-	offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
+	offset = kvm_compute_tsc_offset(vcpu, data);
 	ns = get_kernel_ns();
 	elapsed = ns - kvm->arch.last_tsc_nsec;
@@ -1465,7 +1474,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr)
 		} else {
 			u64 delta = nsec_to_cycles(vcpu, elapsed);
 			data += delta;
-			offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
+			offset = kvm_compute_tsc_offset(vcpu, data);
 			pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
 		}
 		matched = true;
@@ -2692,7 +2701,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		if (tsc_delta < 0)
 			mark_tsc_unstable("KVM discovered backwards TSC");
 		if (check_tsc_unstable()) {
-			u64 offset = kvm_x86_ops->compute_tsc_offset(vcpu,
+			u64 offset = kvm_compute_tsc_offset(vcpu,
						vcpu->arch.last_guest_tsc);
 			kvm_x86_ops->write_tsc_offset(vcpu, offset);
 			vcpu->arch.tsc_catchup = 1;
--
2.4.8
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to
[PATCH v2 08/12] KVM: x86: Use the correct vcpu's TSC rate to compute time scale
This patch makes KVM use virtual_tsc_khz rather than the host TSC rate
as the vcpu's TSC rate when computing the time scale, if TSC scaling is
enabled.

Signed-off-by: Haozhong Zhang
---
 arch/x86/kvm/x86.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f2516bf..d5690a3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1698,7 +1698,7 @@ static void kvm_gen_update_masterclock(struct kvm *kvm)
 
 static int kvm_guest_time_update(struct kvm_vcpu *v)
 {
-	unsigned long flags, this_tsc_khz;
+	unsigned long flags, this_tsc_khz, tgt_tsc_khz;
 	struct kvm_vcpu_arch *vcpu = &v->arch;
 	struct kvm_arch *ka = &v->kvm->arch;
 	s64 kernel_ns;
@@ -1761,7 +1761,9 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 		return 0;
 
 	if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
-		kvm_get_time_scale(NSEC_PER_SEC / 1000, this_tsc_khz,
+		tgt_tsc_khz = kvm_has_tsc_control ?
+			      vcpu->virtual_tsc_khz : this_tsc_khz;
+		kvm_get_time_scale(NSEC_PER_SEC / 1000, tgt_tsc_khz,
 				   &vcpu->hv_clock.tsc_shift,
 				   &vcpu->hv_clock.tsc_to_system_mul);
 		vcpu->hw_tsc_khz = this_tsc_khz;
--
2.4.8
Re: [PATCH 3/3] arm/arm64: KVM: Fix disabled distributor operation
Hi Christoffer,

On 10/17/2015 10:30 PM, Christoffer Dall wrote:
> We currently do a single update of the vgic state when the distrbutor
distributor
> enable/disable control register is accessed and then bypass updating the
> state for as long as the distributor remains disabled.
>
> This is incorrect, because updating the state does not consider the
> distributor enable bit, and thus you can end up in a situation where an
> interrupt is marked as pending on the CPU interface, but not pending on
> the distributor, which is an impossible state to be in, and triggers a
> warning. Consider for example the following sequence of events:
>
> 1. An interrupt is marked as pending on the distributor
>    - the interrupt is also forwarded to the CPU interface
> 2. The guest turns off the distributor (it's about to do a reboot)
>    - we stop updating the CPU interface state from now on
> 3. The guest disables the pending interrupt
>    - we remove the pending state from the distributor, but don't touch
>      the CPU interface, see point 2.
>
> Since the distributor disable bit really means that no interrupts should
> be forwarded to the CPU interface, we modify the code to keep updating
> the internal VGIC state, but always set the CPU interface pending bits
> to zero when the distributor is disabled.
>
> Signed-off-by: Christoffer Dall
> ---
>  virt/kvm/arm/vgic.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> index 58b1256..66c6616 100644
> --- a/virt/kvm/arm/vgic.c
> +++ b/virt/kvm/arm/vgic.c
> @@ -1012,6 +1012,12 @@ static int compute_pending_for_cpu(struct kvm_vcpu *vcpu)
>  	pend_percpu = vcpu->arch.vgic_cpu.pending_percpu;
>  	pend_shared = vcpu->arch.vgic_cpu.pending_shared;
>
> +	if (!dist->enabled) {
> +		bitmap_zero(pend_percpu, VGIC_NR_PRIVATE_IRQS);
> +		bitmap_zero(pend_shared, nr_shared);
> +		return 0;
> +	}
> +
>  	pending = vgic_bitmap_get_cpu_map(&dist->irq_pending, vcpu_id);
>  	enabled = vgic_bitmap_get_cpu_map(&dist->irq_enabled, vcpu_id);
>  	bitmap_and(pend_percpu, pending, enabled, VGIC_NR_PRIVATE_IRQS);
> @@ -1039,11 +1045,6 @@ void vgic_update_state(struct kvm *kvm)
>  	struct kvm_vcpu *vcpu;
>  	int c;
>
> -	if (!dist->enabled) {
> -		set_bit(0, dist->irq_pending_on_cpu);
> -		return;

I am confused. Don't you want to clear the whole bitmap?

Shouldn't we also handle interrupts programmed in the LRs? The spec says
any ack should return a spurious ID. Is that what is going to happen
with the current implementation?

Eric

> -	}
> -
>  	kvm_for_each_vcpu(c, vcpu, kvm) {
>  		if (compute_pending_for_cpu(vcpu))
>  			set_bit(c, dist->irq_pending_on_cpu);
>
PING: [PATCH v5 0/7] KVM: arm64: Implement API for vGICv3 live migration
Hello! How are things going? There was lots of activity, discussions,
and then - silence...

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

> -----Original Message-----
> From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Pavel Fedin
> Sent: Monday, October 12, 2015 11:30 AM
> To: kvm...@lists.cs.columbia.edu; kvm@vger.kernel.org
> Cc: Christoffer Dall; Marc Zyngier; Andre Przywara
> Subject: [PATCH v5 0/7] KVM: arm64: Implement API for vGICv3 live migration
>
> This patchset adds the necessary userspace API in order to support vGICv3
> live migration. GICv3 registers are accessed using device attribute
> ioctls, similar to GICv2.
>
> Whoever wants to test it, please note that this version is not
> binary-compatible with the previous one; the API has been seriously
> changed. qemu patches will be posted in some time.
>
> v4 => v5:
> - Adapted to new API by Peter Maydell, Marc Zyngier and Christoffer Dall.
>   Acked-by's on the documentation were dropped, just in case, because i
>   slightly adjusted it. Additionally, i merged all doc updates into one
>   patch.
>
> v3 => v4:
> - Split pure refactoring from anything else
> - Documentation brought up to date
> - Cleaned up 'mmio' structure usage in vgic_attr_regs_access(),
>   use call_range_handler() for 64-bit access handling
> - Rebased on new linux-next
>
> v2 => v3:
> - KVM_DEV_ARM_VGIC_CPUID_MASK enlarged to 20 bits, allowing more than 256 CPUs.
> - Bug fix: Correctly set mmio->private, necessary for redistributor access.
> - Added accessors for ICC_AP0R and ICC_AP1R registers
> - Rebased on new linux-next
>
> v1 => v2:
> - Do not use generic register get/set API for CPU interface, use only
>   device attributes.
> - Introduce size specifier for distributor and redistributor register
>   accesses, do not assume size any more.
> - Lots of refactor and reusable code extraction.
> - Added forgotten documentation
>
> Christoffer Dall (1):
>   KVM: arm/arm64: Update API documentation
>
> Pavel Fedin (6):
>   KVM: arm/arm64: Move endianess conversion out of vgic_attr_regs_access()
>   KVM: arm/arm64: Refactor vGIC attributes handling code
>   KVM: arm64: Implement vGICv3 distributor and redistributor access from userspace
>   KVM: arm64: Refactor system register handlers
>   KVM: arm64: Introduce find_reg_by_id()
>   KVM: arm64: Implement vGICv3 CPU interface access
>
>  Documentation/virtual/kvm/devices/arm-vgic-v3.txt | 117 ++++
>  Documentation/virtual/kvm/devices/arm-vgic.txt    |  41 +--
>  arch/arm64/include/uapi/asm/kvm.h                 |  17 +-
>  arch/arm64/kvm/sys_regs.c                         |  83 +++---
>  arch/arm64/kvm/sys_regs.h                         |   8 +-
>  arch/arm64/kvm/sys_regs_generic_v8.c              |   2 +-
>  include/linux/irqchip/arm-gic-v3.h                |  19 +-
>  virt/kvm/arm/vgic-v2-emul.c                       | 124 ++--
>  virt/kvm/arm/vgic-v3-emul.c                       | 343 +-
>  virt/kvm/arm/vgic.c                               |  57 ++++
>  virt/kvm/arm/vgic.h                               |   3 +
>  11 files changed, 630 insertions(+), 184 deletions(-)
>  create mode 100644 Documentation/virtual/kvm/devices/arm-vgic-v3.txt
>
> --
> 2.4.4
Re: [PATCH 3/3] arm/arm64: KVM: Fix disabled distributor operation
On Tue, Oct 20, 2015 at 11:08:44AM +0200, Eric Auger wrote:
> Hi Christoffer,
> On 10/17/2015 10:30 PM, Christoffer Dall wrote:
> > We currently do a single update of the vgic state when the distrbutor
> distributor
> > enable/disable control register is accessed and then bypass updating the
> > state for as long as the distributor remains disabled.
> >
> > This is incorrect, because updating the state does not consider the
> > distributor enable bit, and thus you can end up in a situation where an
> > interrupt is marked as pending on the CPU interface, but not pending on
> > the distributor, which is an impossible state to be in, and triggers a
> > warning. Consider for example the following sequence of events:
> >
> > 1. An interrupt is marked as pending on the distributor
> >    - the interrupt is also forwarded to the CPU interface
> > 2. The guest turns off the distributor (it's about to do a reboot)
> >    - we stop updating the CPU interface state from now on
> > 3. The guest disables the pending interrupt
> >    - we remove the pending state from the distributor, but don't touch
> >      the CPU interface, see point 2.
> >
> > Since the distributor disable bit really means that no interrupts should
> > be forwarded to the CPU interface, we modify the code to keep updating
> > the internal VGIC state, but always set the CPU interface pending bits
> > to zero when the distributor is disabled.
> >
> > Signed-off-by: Christoffer Dall
> > ---
> >  virt/kvm/arm/vgic.c | 11 ++-
> >  1 file changed, 6 insertions(+), 5 deletions(-)
> >
> > diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> > index 58b1256..66c6616 100644
> > --- a/virt/kvm/arm/vgic.c
> > +++ b/virt/kvm/arm/vgic.c
> > @@ -1012,6 +1012,12 @@ static int compute_pending_for_cpu(struct kvm_vcpu *vcpu)
> >  	pend_percpu = vcpu->arch.vgic_cpu.pending_percpu;
> >  	pend_shared = vcpu->arch.vgic_cpu.pending_shared;
> >
> > +	if (!dist->enabled) {
> > +		bitmap_zero(pend_percpu, VGIC_NR_PRIVATE_IRQS);
> > +		bitmap_zero(pend_shared, nr_shared);
> > +		return 0;
> > +	}
> > +
> >  	pending = vgic_bitmap_get_cpu_map(&dist->irq_pending, vcpu_id);
> >  	enabled = vgic_bitmap_get_cpu_map(&dist->irq_enabled, vcpu_id);
> >  	bitmap_and(pend_percpu, pending, enabled, VGIC_NR_PRIVATE_IRQS);
> > @@ -1039,11 +1045,6 @@ void vgic_update_state(struct kvm *kvm)
> >  	struct kvm_vcpu *vcpu;
> >  	int c;
> >
> > -	if (!dist->enabled) {
> > -		set_bit(0, dist->irq_pending_on_cpu);
> > -		return;
> I am confused. Don't you want to clear the whole bitmap?

So the first line used to set irq_pending_on_cpu for VCPU0 when the
distributor was disabled, which I think basically worked around the
guest kernel expecting to see a timer interrupt during boot when doing
a WFI. Now that that's fixed, we don't need this (gigantic hack)
anymore.

The return statement was also weird and buggy, because it never
prevented anything from going to the CPU interface, it just stopped
updating things.

> Shouldn't we also handle interrupts programmed in the LRs? The spec says
> any ack should return a spurious ID. Is that what is going to happen
> with the current implementation?

Yes, we really should. We should unqueue them, but I haven't seen any
bugs from this, and I feel like we're already changing a lot on short
notice, so I'd rather not disrupt anything more right now.

Besides, when we get around to redesigning this whole thing, the
concept of unqueueing goes away.

I know it sucks reviewing fixes that only fix a subset of a bad
implementation, but I'm aiming for 'slightly better than current state'
right now :)

-Christoffer

> > -	}
> > -
> >  	kvm_for_each_vcpu(c, vcpu, kvm) {
> >  		if (compute_pending_for_cpu(vcpu))
> >  			set_bit(c, dist->irq_pending_on_cpu);
> >
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: PING: [PATCH v5 0/7] KVM: arm64: Implement API for vGICv3 live migration
Hello!

> We'll get to your patches when we get to them, but seriously, you have
> to curb your eagerness a bit.

Ok, sorry for this disturbance then. It is enough for me to know it.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Re: PING: [PATCH v5 0/7] KVM: arm64: Implement API for vGICv3 live migration
On Tue, Oct 20, 2015 at 12:30:41PM +0300, Pavel Fedin wrote:
> Hello! How are things going? There was lots of activity, discussions,
> and then - silence...
>

As I told you, this is not going to make it for v4.4, and if you haven't
noticed, we're getting close to the release of v4.3 and to the v4.4
merge window opening up, so we're busy.

Also, take a look at the number of patches on the kvmarm list; they all
have to be reviewed. Then think about all the patches you reviewed
(wait, that was none), and you will know how many patches I have to
review.

You can keep sending your pings, but they're useless; as we've told you
before, we're not ignoring your patches, we're just busy and have to
prioritize. We'll get to your patches when we get to them, but
seriously, you have to curb your eagerness a bit.

Thanks,
-Christoffer

> > -----Original Message-----
> > From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Pavel Fedin
> > Sent: Monday, October 12, 2015 11:30 AM
> > To: kvm...@lists.cs.columbia.edu; kvm@vger.kernel.org
> > Cc: Christoffer Dall; Marc Zyngier; Andre Przywara
> > Subject: [PATCH v5 0/7] KVM: arm64: Implement API for vGICv3 live migration
> >
> > This patchset adds the necessary userspace API in order to support vGICv3
> > live migration. GICv3 registers are accessed using device attribute
> > ioctls, similar to GICv2.
> >
> > Whoever wants to test it, please note that this version is not
> > binary-compatible with the previous one; the API has been seriously
> > changed. qemu patches will be posted in some time.
> >
> > v4 => v5:
> > - Adapted to new API by Peter Maydell, Marc Zyngier and Christoffer Dall.
> >   Acked-by's on the documentation were dropped, just in case, because i
> >   slightly adjusted it. Additionally, i merged all doc updates into one
> >   patch.
> >
> > v3 => v4:
> > - Split pure refactoring from anything else
> > - Documentation brought up to date
> > - Cleaned up 'mmio' structure usage in vgic_attr_regs_access(),
> >   use call_range_handler() for 64-bit access handling
> > - Rebased on new linux-next
> >
> > v2 => v3:
> > - KVM_DEV_ARM_VGIC_CPUID_MASK enlarged to 20 bits, allowing more than 256 CPUs.
> > - Bug fix: Correctly set mmio->private, necessary for redistributor access.
> > - Added accessors for ICC_AP0R and ICC_AP1R registers
> > - Rebased on new linux-next
> >
> > v1 => v2:
> > - Do not use generic register get/set API for CPU interface, use only
> >   device attributes.
> > - Introduce size specifier for distributor and redistributor register
> >   accesses, do not assume size any more.
> > - Lots of refactor and reusable code extraction.
> > - Added forgotten documentation
> >
> > Christoffer Dall (1):
> >   KVM: arm/arm64: Update API documentation
> >
> > Pavel Fedin (6):
> >   KVM: arm/arm64: Move endianess conversion out of vgic_attr_regs_access()
> >   KVM: arm/arm64: Refactor vGIC attributes handling code
> >   KVM: arm64: Implement vGICv3 distributor and redistributor access from userspace
> >   KVM: arm64: Refactor system register handlers
> >   KVM: arm64: Introduce find_reg_by_id()
> >   KVM: arm64: Implement vGICv3 CPU interface access
> >
> >  Documentation/virtual/kvm/devices/arm-vgic-v3.txt | 117 ++++
> >  Documentation/virtual/kvm/devices/arm-vgic.txt    |  41 +--
> >  arch/arm64/include/uapi/asm/kvm.h                 |  17 +-
> >  arch/arm64/kvm/sys_regs.c                         |  83 +++---
> >  arch/arm64/kvm/sys_regs.h                         |   8 +-
> >  arch/arm64/kvm/sys_regs_generic_v8.c              |   2 +-
> >  include/linux/irqchip/arm-gic-v3.h                |  19 +-
> >  virt/kvm/arm/vgic-v2-emul.c                       | 124 ++--
> >  virt/kvm/arm/vgic-v3-emul.c                       | 343 +-
> >  virt/kvm/arm/vgic.c                               |  57 ++++
> >  virt/kvm/arm/vgic.h                               |   3 +
> >  11 files changed, 630 insertions(+), 184 deletions(-)
> >  create mode 100644 Documentation/virtual/kvm/devices/arm-vgic-v3.txt
> >
> > --
> > 2.4.4
Difference between vcpu_load and kvm_sched_in ?
Hi,

I'm a student working on virtual machine introspection. I'm trying to
implement an application on top of KVM in which I need to trap writes to
CR3 (host with 8 cores, guest with one vcpu).

When I do this while handling a VM exit, using
vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL, CPU_BASED_CR3_LOAD_EXITING),
it works correctly and I can see the traps in my log file.

Now when I do the same thing after receiving a command from Qemu (the
command is handled in kvm_vm_ioctl by calling a function I added to
kvm_x86_ops vmx_x86_ops), I get a vmwrite error. I found out that the
problem is that the logical processor on the host that is handling the
ioctl command is not the same one that is running the VM and holding its
state, so I must do the vmwrite on the one executing the VM.

To change the logical cpu executing the VM, I tried this:

    vcpu_load; start cr3 trapping; vcpu_put

It worked correctly (in my logs I see that vcpu.cpu becomes equal to
"cpu = raw_smp_processor_id();"), but the VM blocks for a long time due
to the mutex in vcpu_load (up to several seconds and sometimes
minutes!).

I replaced vcpu_load with kvm_sched_in; now everything works perfectly
and the VM doesn't block at all (logs here: http://pastebin.com/h5XNNMcb).

So, what I want to know is: what is the difference between vcpu_load and
kvm_sched_in? Both of these functions call kvm_arch_vcpu_load, but the
latter does it without taking a mutex. Is there a problem in using
kvm_sched_in instead of vcpu_load for my use case?
Re: Network hangs when communicating with host
On 10/20/2015 09:42 AM, Dmitry Vyukov wrote:
> I now have another issue. My binary fails to mmap a file within the lkvm
> sandbox. The same binary works fine on the host and in qemu. I've added
> strace into the sandbox script, and here is the output:
>
> [pid 837] openat(AT_FDCWD, "syzkaller-shm048878722", O_RDWR|O_CLOEXEC) = 5
> [pid 837] mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_SHARED, 5, 0) = -1 EINVAL (Invalid argument)
>
> I don't see anything that can potentially cause EINVAL here. Is it
> possible that lkvm somehow affects kernel behavior here?
>
> I run lkvm as:
>
> $ taskset 1 /kvmtool/lkvm sandbox --disk syz-0 --mem=2048 --cpus=2
> --kernel /arch/x86/boot/bzImage --network mode=user --sandbox
> /workdir/kvm/syz-0.sh

It's possible that something in the virtio-9p layer is broken. I'll give
it a look in the evening.

Thanks,
Sasha
[PATCH v2 12/12] KVM: VMX: Dump TSC multiplier in dump_vmcs()
This patch enhances dump_vmcs() to dump the value of the TSC multiplier
field of the VMCS.

Signed-off-by: Haozhong Zhang
---
 arch/x86/kvm/vmx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a02b59c..66d25be 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8013,6 +8013,9 @@ static void dump_vmcs(void)
 	       vmcs_read32(IDT_VECTORING_INFO_FIELD),
 	       vmcs_read32(IDT_VECTORING_ERROR_CODE));
 	pr_err("TSC Offset = 0x%016lx\n", vmcs_readl(TSC_OFFSET));
+	if (secondary_exec_control & SECONDARY_EXEC_TSC_SCALING)
+		pr_err("TSC Multiplier = 0x%016lx\n",
+		       vmcs_readl(TSC_MULTIPLIER));
 	if (cpu_based_exec_ctrl & CPU_BASED_TPR_SHADOW)
 		pr_err("TPR Threshold = 0x%02x\n", vmcs_read32(TPR_THRESHOLD));
 	if (pin_based_exec_ctrl & PIN_BASED_POSTED_INTR)
--
2.4.8
Re: [RFT - PATCH v2 0/2] KVM/arm64: add fp/simd lazy switch support
On Mon, Oct 19, 2015 at 03:06:59PM -0700, Mario Smarduch wrote:
>
> On 10/18/2015 2:07 PM, Christoffer Dall wrote:
> > On Mon, Oct 12, 2015 at 09:29:23AM -0700, Mario Smarduch wrote:
> >> Hi Christoffer, Marc -
> >> I just threw this test your way without any explanation.
> >
> > I'm confused. Did you send me something somewhere already?
> Yes, in the last patchset
>
> https://lists.cs.columbia.edu/pipermail/kvmarm/2015-October/016698.html
>
> I included a simple test I put together.

Sorry, I missed that change in the cover letter.

> >> The test loops, does fp arithmetic, and checks the truncated result.
> >> It could be a little more dynamic - have an initial run to
> >> get the sum to compare against while looping; different fp
> >> hardware may come up with a different sum, but truncation is
> >> to the 5th decimal point.
> >>
> >> The rationale is that if there is any fp/simd corruption,
> >> one of these runs should fail. I think the most likely scenario
> >> for that is a world switch in the midst of an fp operation. I've
> >> instrumented (basically added some tracing to vcpu_put()) and
> >> validated that vcpu_put gets called thousands of times (for v7, v8)
> >> in an overnight test running two guests/host crunching
> >> fp operations.
> >>
> >> Other than that, I'm not sure how to really catch any problems
> >> with the patches applied. Obviously this is a huge issue if this has
> >> any problems. If you or Marc have any other ideas I'd be happy
> >> to enhance the test.
> >
> > I think it's important to run two VMs at the same time, each with some
> > floating-point work, and then run some floating point on the host at the
> > same time.
> >
> > You can make that even more interesting by doing 32-bit guests at the
> > same time as well.
>
> Yes, that's the test combination I've been running.

ok, cool, then I trust these patches.

> >
> > I believe Marc was running Paranoia
> > (http://www.netlib.org/paranoia/paranoia.c) to test the last lazy
> > series.
>
> I'll try this test and run it for several days, see if anything shows up.

I actually don't know what it does, i.e. if it just uses the FP hardware
or if it actually checks the results produced.

Several days may be unnecessary, but if your machine has nothing else to
do, then why not.

Thanks,
-Christoffer
Re: [PATCH v2 2/2] KVM/arm: enable enhanced armv7 fp/simd lazy switch
On Mon, Oct 19, 2015 at 04:25:04PM -0700, Mario Smarduch wrote: > > > On 10/19/2015 3:14 AM, Christoffer Dall wrote: > > On Sat, Sep 26, 2015 at 04:43:29PM -0700, Mario Smarduch wrote: > >> This patch enhances current lazy vfp/simd hardware switch. In addition to > >> current lazy switch, it tracks vfp/simd hardware state with a vcpu > >> lazy flag. > >> > >> vcpu lazy flag is set on guest access and trap to vfp/simd hardware switch > >> handler. On vm-enter if lazy flag is set skip trap enable and saving > >> host fpexc. On vm-exit if flag is set skip hardware context switch > >> and return to host with guest context. > >> > >> On vcpu_put check if vcpu lazy flag is set, and execute a hardware context > >> switch to restore host. > >> > >> Signed-off-by: Mario Smarduch> >> --- > >> arch/arm/include/asm/kvm_asm.h | 1 + > >> arch/arm/kvm/arm.c | 17 > >> arch/arm/kvm/interrupts.S | 60 > >> +++--- > >> arch/arm/kvm/interrupts_head.S | 12 ++--- > >> 4 files changed, 71 insertions(+), 19 deletions(-) > >> > >> diff --git a/arch/arm/include/asm/kvm_asm.h > >> b/arch/arm/include/asm/kvm_asm.h > >> index 194c91b..4b45d79 100644 > >> --- a/arch/arm/include/asm/kvm_asm.h > >> +++ b/arch/arm/include/asm/kvm_asm.h > >> @@ -97,6 +97,7 @@ extern char __kvm_hyp_code_end[]; > >> extern void __kvm_flush_vm_context(void); > >> extern void __kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa); > >> extern void __kvm_tlb_flush_vmid(struct kvm *kvm); > >> +extern void __kvm_restore_host_vfp_state(struct kvm_vcpu *vcpu); > >> > >> extern int __kvm_vcpu_run(struct kvm_vcpu *vcpu); > >> #endif > >> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c > >> index ce404a5..79f49c7 100644 > >> --- a/arch/arm/kvm/arm.c > >> +++ b/arch/arm/kvm/arm.c > >> @@ -105,6 +105,20 @@ void kvm_arch_check_processor_compat(void *rtn) > >>*(int *)rtn = 0; > >> } > >> > >> +/** > >> + * kvm_switch_fp_regs() - switch guest/host VFP/SIMD registers > >> + * @vcpu: pointer to vcpu structure. 
> >> + * > > > > nit: stray blank line > ok > > > >> + */ > >> +static void kvm_switch_fp_regs(struct kvm_vcpu *vcpu) > >> +{ > >> +#ifdef CONFIG_ARM > >> + if (vcpu->arch.vfp_lazy == 1) { > >> + kvm_call_hyp(__kvm_restore_host_vfp_state, vcpu); > > > > why do you have to do this in HYP mode ? > Calling it directly works fine. I moved the function outside hyp start/end > range in interrupts.S. Not thinking outside the box, just thought let them all > be hyp calls. > > > > >> + vcpu->arch.vfp_lazy = 0; > >> + } > >> +#endif > > > > we've tried to put stuff like this in header files to avoid the ifdefs > > so far. Could that be done here as well? > > That was a to do, but didn't get around to it. > > > >> +} > >> > >> /** > >> * kvm_arch_init_vm - initializes a VM data structure > >> @@ -295,6 +309,9 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) > >> > >> void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) > >> { > >> + /* Check if Guest accessed VFP registers */ > > > > misleading comment: this function does more than checking > Yep sure does, will change. > > > >> + kvm_switch_fp_regs(vcpu); > >> + > >>/* > >> * The arch-generic KVM code expects the cpu field of a vcpu to be -1 > >> * if the vcpu is no longer assigned to a cpu. This is used for the > >> diff --git a/arch/arm/kvm/interrupts.S b/arch/arm/kvm/interrupts.S > >> index 900ef6d..6d98232 100644 > >> --- a/arch/arm/kvm/interrupts.S > >> +++ b/arch/arm/kvm/interrupts.S > >> @@ -96,6 +96,29 @@ ENTRY(__kvm_flush_vm_context) > >>bx lr > >> ENDPROC(__kvm_flush_vm_context) > >> > >> +/** > >> + * void __kvm_restore_host_vfp_state(struct vcpu *vcpu) - Executes a lazy > >> + * fp/simd switch, saves the guest, restores host. > >> + * > > > > nit: stray blank line > ok. 
> > > >> + */ > >> +ENTRY(__kvm_restore_host_vfp_state) > >> +#ifdef CONFIG_VFPv3 > >> + push{r3-r7} > >> + > >> + add r7, r0, #VCPU_VFP_GUEST > >> + store_vfp_state r7 > >> + > >> + add r7, r0, #VCPU_VFP_HOST > >> + ldr r7, [r7] > >> + restore_vfp_state r7 > >> + > >> + ldr r3, [vcpu, #VCPU_VFP_FPEXC] > > > > either use r0 or vcpu throughout this function please > Yeah that's bad - in the same function to > > > >> + VFPFMXR FPEXC, r3 > >> + > >> + pop {r3-r7} > >> +#endif > >> + bx lr > >> +ENDPROC(__kvm_restore_host_vfp_state) > >> > >> / > >> * Hypervisor world-switch code > >> @@ -119,11 +142,15 @@ ENTRY(__kvm_vcpu_run) > >>@ If the host kernel has not been configured with VFPv3 support, > >>@ then it is safer if we deny guests from using it as well. > >> #ifdef CONFIG_VFPv3 > >> + @ r7 must be preserved until next vfp lazy check > > > > I don't understand this comment > > > >> + vfp_inlazy_mode r7,
[PATCH v2 2/3] target-i386: calculate vcpu's TSC rate to be migrated
If the vcpu's TSC rate is not specified by the cpu option 'tsc-freq', we will use the value returned by KVM_GET_TSC_KHZ; otherwise, we use the user-specified value. Signed-off-by: Haozhong Zhang--- target-i386/kvm.c | 33 + 1 file changed, 33 insertions(+) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 80d1a7e..698524a 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -2213,6 +2213,35 @@ static int kvm_get_debugregs(X86CPU *cpu) return 0; } +static int kvm_setup_tsc_khz(X86CPU *cpu, int level) +{ +CPUState *cs = CPU(cpu); +CPUX86State *env = &cpu->env; +int r; + +if (level < KVM_PUT_FULL_STATE) +return 0; + +/* + * Prepare the vcpu's TSC rate (ie. env->tsc_khz_incoming) to be migrated. + * 1. If no user-specified value is provided, we will use the value from + *KVM; + * 2. Otherwise, we just use the user-specified value. + */ +if (!env->tsc_khz) { +r = kvm_vcpu_ioctl(cs, KVM_GET_TSC_KHZ); +if (r < 0) { +fprintf(stderr, "KVM_GET_TSC_KHZ failed\n"); +return r; +} +env->tsc_khz_incoming = r; +} else { +env->tsc_khz_incoming = env->tsc_khz; +} + +return 0; +} + int kvm_arch_put_registers(CPUState *cpu, int level) { X86CPU *x86_cpu = X86_CPU(cpu); @@ -2248,6 +2277,10 @@ int kvm_arch_put_registers(CPUState *cpu, int level) if (ret < 0) { return ret; } +ret = kvm_setup_tsc_khz(x86_cpu, level); +if (ret < 0) { +return ret; +} ret = kvm_put_msrs(x86_cpu, level); if (ret < 0) { return ret; -- 2.4.8
[PATCH v2 1/3] target-i386: add a subsection for migrating vcpu's TSC rate
The newly added subsection 'vmstate_tsc_khz' is used by following patches to migrate vcpu's TSC rate. For the back migration compatibility, this subsection is not migrated on pc-*-2.4 and older machine types by default. If users do want to migrate this subsection on older machine types, they can enable it by giving a new cpu option 'save-tsc-freq'. Signed-off-by: Haozhong Zhang--- include/hw/i386/pc.h | 5 + target-i386/cpu.c | 1 + target-i386/cpu.h | 2 ++ target-i386/machine.c | 19 +++ 4 files changed, 27 insertions(+) diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h index 0503485..7fde50f 100644 --- a/include/hw/i386/pc.h +++ b/include/hw/i386/pc.h @@ -300,6 +300,11 @@ bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *); #define PC_COMPAT_2_4 \ HW_COMPAT_2_4 \ {\ +.driver = TYPE_X86_CPU,\ +.property = "save-tsc-freq",\ +.value= "off",\ +},\ +{\ .driver = "Haswell-" TYPE_X86_CPU,\ .property = "abm",\ .value= "off",\ diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 05d7f26..b6bb457 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -3143,6 +3143,7 @@ static Property x86_cpu_properties[] = { DEFINE_PROP_BOOL("check", X86CPU, check_cpuid, false), DEFINE_PROP_BOOL("enforce", X86CPU, enforce_cpuid, false), DEFINE_PROP_BOOL("kvm", X86CPU, expose_kvm, true), +DEFINE_PROP_BOOL("save-tsc-freq", X86CPU, env.save_tsc_khz, true), DEFINE_PROP_UINT32("level", X86CPU, env.cpuid_level, 0), DEFINE_PROP_UINT32("xlevel", X86CPU, env.cpuid_xlevel, 0), DEFINE_PROP_UINT32("xlevel2", X86CPU, env.cpuid_xlevel2, 0), diff --git a/target-i386/cpu.h b/target-i386/cpu.h index 54d9d50..ba1a289 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -966,6 +966,8 @@ typedef struct CPUX86State { uint32_t sipi_vector; bool tsc_valid; int64_t tsc_khz; +int64_t tsc_khz_incoming; +bool save_tsc_khz; void *kvm_xsave_buf; uint64_t mcg_cap; diff --git a/target-i386/machine.c b/target-i386/machine.c index 9fa0563..7d68d63 100644 --- a/target-i386/machine.c +++ 
b/target-i386/machine.c @@ -752,6 +752,24 @@ static const VMStateDescription vmstate_xss = { } }; +static bool tsc_khz_needed(void *opaque) +{ +X86CPU *cpu = opaque; +CPUX86State *env = &cpu->env; +return env->tsc_khz_incoming && env->save_tsc_khz; +} + +static const VMStateDescription vmstate_tsc_khz = { +.name = "cpu/tsc_khz", +.version_id = 1, +.minimum_version_id = 1, +.needed = tsc_khz_needed, +.fields = (VMStateField[]) { +VMSTATE_INT64(env.tsc_khz_incoming, X86CPU), +VMSTATE_END_OF_LIST() +} +}; + VMStateDescription vmstate_x86_cpu = { .name = "cpu", .version_id = 12, @@ -871,6 +889,7 @@ VMStateDescription vmstate_x86_cpu = { &vmstate_msr_hyperv_crash, &vmstate_avx512, &vmstate_xss, +&vmstate_tsc_khz, NULL } }; -- 2.4.8
[PATCH v2 0/3] target-i386: save/restore vcpu's TSC rate during migration
This patchset enables QEMU to save/restore vcpu's TSC rate during the migration. When cooperating with KVM which supports TSC scaling, guest programs can observe a consistent guest TSC rate even though they are migrated among machines with different host TSC rates. A pair of cpu options 'save-tsc-freq' and 'load-tsc-freq' are added to control the migration of vcpu's TSC rate. * By default, the migration of vcpu's TSC rate is enabled only on pc-*-2.5 and newer machine types. If the cpu option 'save-tsc-freq' is present, the vcpu's TSC rate will be migrated from older machine types as well. * Another cpu option 'load-tsc-freq' controls whether the migrated vcpu's TSC rate is used. By default, QEMU will not use the migrated TSC rate if this option is not present. Otherwise, QEMU will use the migrated TSC rate and override the TSC rate given by the cpu option 'tsc-freq'. Changes in v2: * Add a pair of cpu options 'save-tsc-freq' and 'load-tsc-freq' to control the migration of vcpu's TSC rate. * Move all logic of setting TSC rate to target-i386. * Remove the duplicated TSC setup in kvm_arch_init_vcpu(). Haozhong Zhang (3): target-i386: add a subsection for migrating vcpu's TSC rate target-i386: calculate vcpu's TSC rate to be migrated target-i386: load the migrated vcpu's TSC rate include/hw/i386/pc.h | 5 + target-i386/cpu.c | 2 ++ target-i386/cpu.h | 3 +++ target-i386/kvm.c | 61 +++ target-i386/machine.c | 19 5 files changed, 81 insertions(+), 9 deletions(-) -- 2.4.8
[PATCH v2 3/3] target-i386: load the migrated vcpu's TSC rate
Set the vcpu's TSC rate to the migrated value (if any). If KVM supports TSC scaling, guest programs will observe the TSC increasing at the migrated rate rather than at the host TSC rate. The loading is controlled by a new cpu option 'load-tsc-freq'. If it is present, then the loading will be enabled and the migrated vcpu's TSC rate will override the value specified by the cpu option 'tsc-freq'. Otherwise, the loading will be disabled. The setting of vcpu's TSC rate in this patch duplicates the code in kvm_arch_init_vcpu(), so we remove the latter. Signed-off-by: Haozhong Zhang--- target-i386/cpu.c | 1 + target-i386/cpu.h | 1 + target-i386/kvm.c | 28 +++- 3 files changed, 21 insertions(+), 9 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index b6bb457..763ba4b 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -3144,6 +3144,7 @@ static Property x86_cpu_properties[] = { DEFINE_PROP_BOOL("enforce", X86CPU, enforce_cpuid, false), DEFINE_PROP_BOOL("kvm", X86CPU, expose_kvm, true), DEFINE_PROP_BOOL("save-tsc-freq", X86CPU, env.save_tsc_khz, true), +DEFINE_PROP_BOOL("load-tsc-freq", X86CPU, env.load_tsc_khz, false), DEFINE_PROP_UINT32("level", X86CPU, env.cpuid_level, 0), DEFINE_PROP_UINT32("xlevel", X86CPU, env.cpuid_xlevel, 0), DEFINE_PROP_UINT32("xlevel2", X86CPU, env.cpuid_xlevel2, 0), diff --git a/target-i386/cpu.h b/target-i386/cpu.h index ba1a289..353f5fb 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -968,6 +968,7 @@ typedef struct CPUX86State { int64_t tsc_khz; int64_t tsc_khz_incoming; bool save_tsc_khz; +bool load_tsc_khz; void *kvm_xsave_buf; uint64_t mcg_cap; diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 698524a..34616f5 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -743,15 +743,6 @@ int kvm_arch_init_vcpu(CPUState *cs) return r; } -r = kvm_check_extension(cs->kvm_state, KVM_CAP_TSC_CONTROL); -if (r && env->tsc_khz) { -r = kvm_vcpu_ioctl(cs, KVM_SET_TSC_KHZ, env->tsc_khz); -if (r < 0) { -fprintf(stderr,
"KVM_SET_TSC_KHZ failed\n"); -return r; -} -} - if (kvm_has_xsave()) { env->kvm_xsave_buf = qemu_memalign(4096, sizeof(struct kvm_xsave)); } @@ -2223,6 +2214,25 @@ static int kvm_setup_tsc_khz(X86CPU *cpu, int level) return 0; /* + * If the cpu option 'load-tsc-freq' is present, the vcpu's TSC rate in the + * migrated state will be used and will override the user-specified vcpu's + * TSC rate (if any). + */ +if (runstate_check(RUN_STATE_INMIGRATE) && +env->load_tsc_khz && env->tsc_khz_incoming) { +env->tsc_khz = env->tsc_khz_incoming; +} + +r = kvm_check_extension(cs->kvm_state, KVM_CAP_TSC_CONTROL); +if (r && env->tsc_khz) { +r = kvm_vcpu_ioctl(cs, KVM_SET_TSC_KHZ, env->tsc_khz); +if (r < 0) { +fprintf(stderr, "KVM_SET_TSC_KHZ failed\n"); +return r; +} +} + +/* * Prepare the vcpu's TSC rate (ie. env->tsc_khz_incoming) to be migrated. * 1. If no user-specified value is provided, we will use the value from *KVM; -- 2.4.8
[PATCH v2 07/12] KVM: x86: Move TSC scaling logic out of call-back read_l1_tsc()
Both VMX and SVM scale the host TSC in the same way in call-back read_l1_tsc(), so this patch moves the scaling logic from call-back read_l1_tsc() to a common function kvm_read_l1_tsc(). Signed-off-by: Haozhong Zhang--- arch/x86/kvm/lapic.c | 4 ++-- arch/x86/kvm/svm.c | 3 +-- arch/x86/kvm/x86.c | 11 --- include/linux/kvm_host.h | 1 + 4 files changed, 12 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 168b875..355a400 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1250,7 +1250,7 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu) tsc_deadline = apic->lapic_timer.expired_tscdeadline; apic->lapic_timer.expired_tscdeadline = 0; - guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu, rdtsc()); + guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc()); trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline); /* __delay is delay_tsc whenever the hardware has TSC, thus always. */ @@ -1318,7 +1318,7 @@ static void start_apic_timer(struct kvm_lapic *apic) local_irq_save(flags); now = apic->lapic_timer.timer.base->get_time(); - guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu, rdtsc()); + guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc()); if (likely(tscdeadline > guest_tsc)) { ns = (tscdeadline - guest_tsc) * 100ULL; do_div(ns, this_tsc_khz); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 9cfc02a..8e46be1 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -2985,8 +2985,7 @@ static int cr8_write_interception(struct vcpu_svm *svm) static u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc) { struct vmcb *vmcb = get_host_vmcb(to_svm(vcpu)); - return vmcb->control.tsc_offset + - kvm_scale_tsc(vcpu, host_tsc); + return vmcb->control.tsc_offset + host_tsc; } static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 75129bd..f2516bf 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1406,6 +1406,12 @@ static u64
kvm_compute_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc) return target_tsc - tsc; } +u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc) +{ + return kvm_x86_ops->read_l1_tsc(vcpu, kvm_scale_tsc(vcpu, host_tsc)); +} +EXPORT_SYMBOL_GPL(kvm_read_l1_tsc); + void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr) { struct kvm *kvm = vcpu->kvm; @@ -1729,7 +1735,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v) kernel_ns = get_kernel_ns(); } - tsc_timestamp = kvm_x86_ops->read_l1_tsc(v, host_tsc); + tsc_timestamp = kvm_read_l1_tsc(v, host_tsc); /* * We may have to catch up the TSC to match elapsed wall clock @@ -6532,8 +6538,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) if (hw_breakpoint_active()) hw_breakpoint_restore(); - vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu, - rdtsc()); + vcpu->arch.last_guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc()); vcpu->mode = OUTSIDE_GUEST_MODE; smp_wmb(); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 95a6bf2..0d3fd3c 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1185,6 +1185,7 @@ int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq, #endif /* CONFIG_HAVE_KVM_IRQ_BYPASS */ u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc); +u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc); extern struct kvm_x86_ops *kvm_x86_ops; -- 2.4.8
[PATCH v2 03/12] KVM: x86: Add a common TSC scaling function
VMX and SVM calculate the TSC scaling ratio using similar logic, so this patch generalizes it to a common TSC scaling function. Signed-off-by: Haozhong Zhang--- arch/x86/kvm/svm.c | 48 +++-- arch/x86/kvm/x86.c | 45 +++ include/linux/kvm_host.h | 3 +++ include/linux/math64.h | 70 4 files changed, 122 insertions(+), 44 deletions(-) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 04b58cf..d347170 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -212,7 +212,6 @@ static int nested_svm_intercept(struct vcpu_svm *svm); static int nested_svm_vmexit(struct vcpu_svm *svm); static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr, bool has_error_code, u32 error_code); -static u64 __scale_tsc(u64 ratio, u64 tsc); enum { VMCB_INTERCEPTS, /* Intercept vectors, TSC offset, @@ -892,21 +891,7 @@ static __init int svm_hardware_setup(void) kvm_enable_efer_bits(EFER_FFXSR); if (boot_cpu_has(X86_FEATURE_TSCRATEMSR)) { - u64 max; - kvm_has_tsc_control = true; - - /* -* Make sure the user can only configure tsc_khz values that -* fit into a signed integer. -* A min value is not calculated needed because it will always -* be 1 on all machines and a value of 0 is used to disable -* tsc-scaling for the vcpu.
-*/ - max = min(0x7fffULL, __scale_tsc(tsc_khz, TSC_RATIO_MAX)); - - kvm_max_guest_tsc_khz = max; - kvm_max_tsc_scaling_ratio = TSC_RATIO_MAX; kvm_tsc_scaling_ratio_frac_bits = 32; } @@ -973,31 +958,6 @@ static void init_sys_seg(struct vmcb_seg *seg, uint32_t type) seg->base = 0; } -static u64 __scale_tsc(u64 ratio, u64 tsc) -{ - u64 mult, frac, _tsc; - - mult = ratio >> 32; - frac = ratio & ((1ULL << 32) - 1); - - _tsc = tsc; - _tsc *= mult; - _tsc += (tsc >> 32) * frac; - _tsc += ((tsc & ((1ULL << 32) - 1)) * frac) >> 32; - - return _tsc; -} - -static u64 svm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc) -{ - u64 _tsc = tsc; - - if (vcpu->arch.tsc_scaling_ratio != TSC_RATIO_DEFAULT) - _tsc = __scale_tsc(vcpu->arch.tsc_scaling_ratio, tsc); - - return _tsc; -} - static void svm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale) { u64 ratio; @@ -1066,7 +1026,7 @@ static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment, bool ho if (host) { if (vcpu->arch.tsc_scaling_ratio != TSC_RATIO_DEFAULT) WARN_ON(adjustment < 0); - adjustment = svm_scale_tsc(vcpu, (u64)adjustment); + adjustment = kvm_scale_tsc(vcpu, (u64)adjustment); } svm->vmcb->control.tsc_offset += adjustment; @@ -1084,7 +1044,7 @@ static u64 svm_compute_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc) { u64 tsc; - tsc = svm_scale_tsc(vcpu, rdtsc()); + tsc = kvm_scale_tsc(vcpu, rdtsc()); return target_tsc - tsc; } @@ -3076,7 +3036,7 @@ static u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc) { struct vmcb *vmcb = get_host_vmcb(to_svm(vcpu)); return vmcb->control.tsc_offset + - svm_scale_tsc(vcpu, host_tsc); + kvm_scale_tsc(vcpu, host_tsc); } static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) @@ -3086,7 +3046,7 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) switch (msr_info->index) { case MSR_IA32_TSC: { msr_info->data = svm->vmcb->control.tsc_offset + - svm_scale_tsc(vcpu, rdtsc()); + kvm_scale_tsc(vcpu, rdtsc()); break; } 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 8849e8b..29c5781 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1328,6 +1328,39 @@ static void update_ia32_tsc_adjust_msr(struct kvm_vcpu *vcpu, s64 offset) vcpu->arch.ia32_tsc_adjust_msr += offset - curr_offset; } +/* + * Multiply tsc by a fixed point number represented by ratio. + * + * The most significant 64-N bits (mult) of ratio represent the + * integral part of the fixed point number; the remaining N bits + * (frac) represent the fractional part, ie. ratio represents a fixed + * point number (mult + frac * 2^(-N)). + * + * N.B: we always assume not all 64 bits of ratio are used for the + * fractional part and the ratio has at least 1 bit for the fractional + * part, i.e. 0 < N < 64. + * + * N equals to kvm_tsc_scaling_ratio_frac_bits. + */ +static inline u64 __scale_tsc(u64 ratio, u64 tsc) +{ +
[PATCH v2 04/12] KVM: x86: Replace call-back set_tsc_khz() with a common function
Both VMX and SVM propagate virtual_tsc_khz in the same way, so this patch removes the call-back set_tsc_khz() and replaces it with a common function. Signed-off-by: Haozhong Zhang--- arch/x86/include/asm/kvm_host.h | 1 - arch/x86/kvm/svm.c | 36 arch/x86/kvm/vmx.c | 17 --- arch/x86/kvm/x86.c | 46 - include/linux/math64.h | 29 ++ 5 files changed, 70 insertions(+), 59 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 1e08ad5..c67469b 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -853,7 +853,6 @@ struct kvm_x86_ops { bool (*has_wbinvd_exit)(void); - void (*set_tsc_khz)(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale); u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu); void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index d347170..a1364927 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -958,41 +958,6 @@ static void init_sys_seg(struct vmcb_seg *seg, uint32_t type) seg->base = 0; } -static void svm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale) -{ - u64 ratio; - u64 khz; - - /* Guest TSC same frequency as host TSC? */ - if (!scale) { - vcpu->arch.tsc_scaling_ratio = TSC_RATIO_DEFAULT; - return; - } - - /* TSC scaling supported? 
*/ - if (!boot_cpu_has(X86_FEATURE_TSCRATEMSR)) { - if (user_tsc_khz > tsc_khz) { - vcpu->arch.tsc_catchup = 1; - vcpu->arch.tsc_always_catchup = 1; - } else - WARN(1, "user requested TSC rate below hardware speed\n"); - return; - } - - khz = user_tsc_khz; - - /* TSC scaling required - calculate ratio */ - ratio = khz << 32; - do_div(ratio, tsc_khz); - - if (ratio == 0 || ratio & TSC_RATIO_RSVD) { - WARN_ONCE(1, "Invalid TSC ratio - virtual-tsc-khz=%u\n", - user_tsc_khz); - return; - } - vcpu->arch.tsc_scaling_ratio = ratio; -} - static u64 svm_read_tsc_offset(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); @@ -4403,7 +4368,6 @@ static struct kvm_x86_ops svm_x86_ops = { .has_wbinvd_exit = svm_has_wbinvd_exit, - .set_tsc_khz = svm_set_tsc_khz, .read_tsc_offset = svm_read_tsc_offset, .write_tsc_offset = svm_write_tsc_offset, .adjust_tsc_offset = svm_adjust_tsc_offset, diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 15bff51..7f87cf6 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2382,22 +2382,6 @@ static u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc) return host_tsc + tsc_offset; } -/* - * Engage any workarounds for mis-matched TSC rates. Currently limited to - * software catchup for faster rates on slower CPUs. 
- */ -static void vmx_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale) -{ - if (!scale) - return; - - if (user_tsc_khz > tsc_khz) { - vcpu->arch.tsc_catchup = 1; - vcpu->arch.tsc_always_catchup = 1; - } else - WARN(1, "user requested TSC rate below hardware speed\n"); -} - static u64 vmx_read_tsc_offset(struct kvm_vcpu *vcpu) { return vmcs_read64(TSC_OFFSET); @@ -10828,7 +10812,6 @@ static struct kvm_x86_ops vmx_x86_ops = { .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit, - .set_tsc_khz = vmx_set_tsc_khz, .read_tsc_offset = vmx_read_tsc_offset, .write_tsc_offset = vmx_write_tsc_offset, .adjust_tsc_offset = vmx_adjust_tsc_offset, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 29c5781..db5ef73 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1252,7 +1252,43 @@ static u32 adjust_tsc_khz(u32 khz, s32 ppm) return v; } -static void kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 this_tsc_khz) +static int set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale) +{ + u64 ratio; + + /* Guest TSC same frequency as host TSC? */ + if (!scale) { + vcpu->arch.tsc_scaling_ratio = kvm_default_tsc_scaling_ratio; + return 0; + } + + /* TSC scaling supported? */ + if (!kvm_has_tsc_control) { + if (user_tsc_khz > tsc_khz) { + vcpu->arch.tsc_catchup = 1; + vcpu->arch.tsc_always_catchup = 1; + return 0; + } else { + WARN(1, "user requested TSC rate below hardware speed\n"); + return -1; + } + } + +
[PATCH v2 06/12] KVM: x86: Move TSC scaling logic out of call-back adjust_tsc_offset()
For both VMX and SVM, if the 2nd argument of call-back adjust_tsc_offset() is the host TSC, then adjust_tsc_offset() will scale it first. This patch moves this common TSC scaling logic to its caller adjust_tsc_offset_host() and renames the call-back adjust_tsc_offset() to adjust_tsc_offset_guest(). Signed-off-by: Haozhong Zhang--- arch/x86/include/asm/kvm_host.h | 15 +-- arch/x86/kvm/svm.c | 10 ++ arch/x86/kvm/vmx.c | 4 ++-- include/linux/kvm_host.h| 16 4 files changed, 21 insertions(+), 24 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index d5e820b..b70cebb 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -845,7 +845,7 @@ struct kvm_x86_ops { int (*get_lpage_level)(void); bool (*rdtscp_supported)(void); bool (*invpcid_supported)(void); - void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment, bool host); + void (*adjust_tsc_offset_guest)(struct kvm_vcpu *vcpu, s64 adjustment); void (*set_tdp_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3); @@ -920,19 +920,6 @@ struct kvm_arch_async_pf { bool direct_map; }; -extern struct kvm_x86_ops *kvm_x86_ops; - -static inline void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu, - s64 adjustment) -{ - kvm_x86_ops->adjust_tsc_offset(vcpu, adjustment, false); -} - -static inline void adjust_tsc_offset_host(struct kvm_vcpu *vcpu, s64 adjustment) -{ - kvm_x86_ops->adjust_tsc_offset(vcpu, adjustment, true); -} - int kvm_mmu_module_init(void); void kvm_mmu_module_exit(void); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 481fdd3..9cfc02a 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -984,16 +984,10 @@ static void svm_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset) mark_dirty(svm->vmcb, VMCB_INTERCEPTS); } -static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment, bool host) +static void svm_adjust_tsc_offset_guest(struct kvm_vcpu *vcpu, s64 adjustment) { struct vcpu_svm *svm = to_svm(vcpu); - if (host)
{ - if (vcpu->arch.tsc_scaling_ratio != TSC_RATIO_DEFAULT) - WARN_ON(adjustment < 0); - adjustment = kvm_scale_tsc(vcpu, (u64)adjustment); - } - svm->vmcb->control.tsc_offset += adjustment; if (is_guest_mode(vcpu)) svm->nested.hsave->control.tsc_offset += adjustment; @@ -4361,7 +4355,7 @@ static struct kvm_x86_ops svm_x86_ops = { .read_tsc_offset = svm_read_tsc_offset, .write_tsc_offset = svm_write_tsc_offset, - .adjust_tsc_offset = svm_adjust_tsc_offset, + .adjust_tsc_offset_guest = svm_adjust_tsc_offset_guest, .read_l1_tsc = svm_read_l1_tsc, .set_tdp_cr3 = set_tdp_cr3, diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 7896395..1f72480 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2413,7 +2413,7 @@ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset) } } -static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment, bool host) +static void vmx_adjust_tsc_offset_guest(struct kvm_vcpu *vcpu, s64 adjustment) { u64 offset = vmcs_read64(TSC_OFFSET); @@ -10809,7 +10809,7 @@ static struct kvm_x86_ops vmx_x86_ops = { .read_tsc_offset = vmx_read_tsc_offset, .write_tsc_offset = vmx_write_tsc_offset, - .adjust_tsc_offset = vmx_adjust_tsc_offset, + .adjust_tsc_offset_guest = vmx_adjust_tsc_offset_guest, .read_l1_tsc = vmx_read_l1_tsc, .set_tdp_cr3 = vmx_set_cr3, diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 3556148..95a6bf2 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1186,4 +1186,20 @@ int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq, u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc); +extern struct kvm_x86_ops *kvm_x86_ops; + +static inline void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu, + s64 adjustment) +{ + kvm_x86_ops->adjust_tsc_offset_guest(vcpu, adjustment); +} + +static inline void adjust_tsc_offset_host(struct kvm_vcpu *vcpu, s64 adjustment) +{ + if (vcpu->arch.tsc_scaling_ratio != kvm_default_tsc_scaling_ratio) + WARN_ON(adjustment < 
0); + adjustment = kvm_scale_tsc(vcpu, (u64) adjustment); + kvm_x86_ops->adjust_tsc_offset_guest(vcpu, adjustment); +} + #endif -- 2.4.8
[PATCH v2 09/12] KVM: VMX: Enable and initialize VMX TSC scaling
This patch enhances the kvm-intel module to enable VMX TSC scaling and collects information about the TSC scaling ratio during initialization. Signed-off-by: Haozhong Zhang--- arch/x86/include/asm/vmx.h | 3 +++ arch/x86/kvm/vmx.c | 19 ++- 2 files changed, 21 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h index aa336ff..14c63c7 100644 --- a/arch/x86/include/asm/vmx.h +++ b/arch/x86/include/asm/vmx.h @@ -73,6 +73,7 @@ #define SECONDARY_EXEC_ENABLE_PML 0x00020000 #define SECONDARY_EXEC_XSAVES 0x00100000 #define SECONDARY_EXEC_PCOMMIT 0x00200000 +#define SECONDARY_EXEC_TSC_SCALING 0x02000000 #define PIN_BASED_EXT_INTR_MASK 0x00000001 #define PIN_BASED_NMI_EXITING 0x00000008 @@ -167,6 +168,8 @@ enum vmcs_field { VMWRITE_BITMAP = 0x2028, XSS_EXIT_BITMAP = 0x202C, XSS_EXIT_BITMAP_HIGH= 0x202D, + TSC_MULTIPLIER = 0x2032, + TSC_MULTIPLIER_HIGH = 0x2033, GUEST_PHYSICAL_ADDRESS = 0x2400, GUEST_PHYSICAL_ADDRESS_HIGH = 0x2401, VMCS_LINK_POINTER = 0x2800, diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 1f72480..a86f790 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -107,6 +107,9 @@ static u64 __read_mostly host_xss; static bool __read_mostly enable_pml = 1; module_param_named(pml, enable_pml, bool, S_IRUGO); +#define KVM_VMX_TSC_MULTIPLIER_DEFAULT 0x0001000000000000ULL +#define KVM_VMX_TSC_MULTIPLIER_MAX 0xffffffffffffffffULL + #define KVM_GUEST_CR0_MASK (X86_CR0_NW | X86_CR0_CD) #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST (X86_CR0_WP | X86_CR0_NE) #define KVM_VM_CR0_ALWAYS_ON \ @@ -1172,6 +1175,12 @@ static inline bool cpu_has_vmx_pml(void) return vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_ENABLE_PML; } +static inline bool cpu_has_vmx_tsc_scaling(void) +{ + return vmcs_config.cpu_based_2nd_exec_ctrl & + SECONDARY_EXEC_TSC_SCALING; +} + static inline bool report_flexpriority(void) { return flexpriority_enabled; @@ -3133,7 +3142,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) SECONDARY_EXEC_SHADOW_VMCS | SECONDARY_EXEC_XSAVES |
SECONDARY_EXEC_ENABLE_PML | - SECONDARY_EXEC_PCOMMIT; + SECONDARY_EXEC_PCOMMIT | + SECONDARY_EXEC_TSC_SCALING; if (adjust_vmx_controls(min2, opt2, MSR_IA32_VMX_PROCBASED_CTLS2, &_cpu_based_2nd_exec_control) < 0) @@ -6177,6 +6187,13 @@ static __init int hardware_setup(void) if (!cpu_has_vmx_apicv()) enable_apicv = 0; + if (cpu_has_vmx_tsc_scaling()) { + kvm_has_tsc_control = true; + kvm_max_tsc_scaling_ratio = KVM_VMX_TSC_MULTIPLIER_MAX; + kvm_tsc_scaling_ratio_frac_bits = 48; + } + kvm_default_tsc_scaling_ratio = KVM_VMX_TSC_MULTIPLIER_DEFAULT; + if (enable_apicv) kvm_x86_ops->update_cr8_intercept = NULL; else { -- 2.4.8 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 11/12] KVM: VMX: Use a scaled host TSC for guest readings of MSR_IA32_TSC
This patch makes kvm-intel return a scaled host TSC plus the TSC offset when handling guest reads of MSR_IA32_TSC.

Signed-off-by: Haozhong Zhang
---
 arch/x86/kvm/vmx.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c241ff3..a02b59c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2372,15 +2372,16 @@ static void setup_msrs(struct vcpu_vmx *vmx)
 
 /*
  * reads and returns guest's timestamp counter "register"
- * guest_tsc = host_tsc + tsc_offset    -- 21.3
+ * guest_tsc = (host_tsc * tsc multiplier) >> 48 + tsc_offset
+ *    -- Intel TSC Scaling for Virtualization White Paper, sec 1.3
  */
-static u64 guest_read_tsc(void)
+static u64 guest_read_tsc(struct kvm_vcpu *vcpu)
 {
 	u64 host_tsc, tsc_offset;
 
 	host_tsc = rdtsc();
 	tsc_offset = vmcs_read64(TSC_OFFSET);
-	return host_tsc + tsc_offset;
+	return kvm_scale_tsc(vcpu, host_tsc) + tsc_offset;
 }
 
 /*
@@ -2772,7 +2773,7 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_EFER:
 		return kvm_get_msr_common(vcpu, msr_info);
 	case MSR_IA32_TSC:
-		msr_info->data = guest_read_tsc();
+		msr_info->data = guest_read_tsc(vcpu);
 		break;
 	case MSR_IA32_SYSENTER_CS:
 		msr_info->data = vmcs_read32(GUEST_SYSENTER_CS);
--
2.4.8
[PATCH v2 02/12] KVM: x86: Add a common TSC scaling ratio field in kvm_vcpu_arch
This patch moves the TSC scaling ratio field from the architecture-specific struct vcpu_svm to the common struct kvm_vcpu_arch.

Signed-off-by: Haozhong Zhang
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/svm.c              | 23 +++++++++++------------
 arch/x86/kvm/x86.c              |  5 ++++-
 3 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0540dc8..1e08ad5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -505,6 +505,7 @@ struct kvm_vcpu_arch {
 	u32 virtual_tsc_mult;
 	u32 virtual_tsc_khz;
 	s64 ia32_tsc_adjust_msr;
+	u64 tsc_scaling_ratio;
 
 	atomic_t nmi_queued;  /* unprocessed asynchronous NMIs */
 	unsigned nmi_pending; /* NMI queued after currently running handler */
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 55f5f49..04b58cf 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -158,8 +158,6 @@ struct vcpu_svm {
 	unsigned long int3_rip;
 	u32 apf_reason;
 
-	u64 tsc_ratio;
-
 	/* cached guest cpuid flags for faster access */
 	bool nrips_enabled	: 1;
 };
@@ -992,24 +990,22 @@ static u64 __scale_tsc(u64 ratio, u64 tsc)
 
 static u64 svm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc)
 {
-	struct vcpu_svm *svm = to_svm(vcpu);
 	u64 _tsc = tsc;
 
-	if (svm->tsc_ratio != TSC_RATIO_DEFAULT)
-		_tsc = __scale_tsc(svm->tsc_ratio, tsc);
+	if (vcpu->arch.tsc_scaling_ratio != TSC_RATIO_DEFAULT)
+		_tsc = __scale_tsc(vcpu->arch.tsc_scaling_ratio, tsc);
 
 	return _tsc;
 }
 
 static void svm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
 {
-	struct vcpu_svm *svm = to_svm(vcpu);
 	u64 ratio;
 	u64 khz;
 
 	/* Guest TSC same frequency as host TSC? */
 	if (!scale) {
-		svm->tsc_ratio = TSC_RATIO_DEFAULT;
+		vcpu->arch.tsc_scaling_ratio = TSC_RATIO_DEFAULT;
 		return;
 	}
 
@@ -1034,7 +1030,7 @@ static void svm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
 			 user_tsc_khz);
 		return;
 	}
-	svm->tsc_ratio = ratio;
+	vcpu->arch.tsc_scaling_ratio = ratio;
 }
 
 static u64 svm_read_tsc_offset(struct kvm_vcpu *vcpu)
@@ -1068,7 +1064,7 @@ static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment, bool host)
 	struct vcpu_svm *svm = to_svm(vcpu);
 
 	if (host) {
-		if (svm->tsc_ratio != TSC_RATIO_DEFAULT)
+		if (vcpu->arch.tsc_scaling_ratio != TSC_RATIO_DEFAULT)
 			WARN_ON(adjustment < 0);
 		adjustment = svm_scale_tsc(vcpu, (u64)adjustment);
 	}
@@ -1240,8 +1236,6 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id)
 		goto out;
 	}
 
-	svm->tsc_ratio = TSC_RATIO_DEFAULT;
-
 	err = kvm_vcpu_init(&svm->vcpu, kvm, id);
 	if (err)
 		goto free_svm;
@@ -1311,6 +1305,7 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 	int i;
+	u64 tsc_ratio = vcpu->arch.tsc_scaling_ratio;
 
 	if (unlikely(cpu != vcpu->cpu)) {
 		svm->asid_generation = 0;
@@ -1328,9 +1323,9 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		rdmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]);
 
 	if (static_cpu_has(X86_FEATURE_TSCRATEMSR) &&
-	    svm->tsc_ratio != __this_cpu_read(current_tsc_ratio)) {
-		__this_cpu_write(current_tsc_ratio, svm->tsc_ratio);
-		wrmsrl(MSR_AMD64_TSC_RATIO, svm->tsc_ratio);
+	    tsc_ratio != __this_cpu_read(current_tsc_ratio)) {
+		__this_cpu_write(current_tsc_ratio, tsc_ratio);
+		wrmsrl(MSR_AMD64_TSC_RATIO, tsc_ratio);
 	}
 }
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 79cbbb5..8849e8b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1258,8 +1258,11 @@ static void kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 this_tsc_khz)
 	int use_scaling = 0;
 
 	/* tsc_khz can be zero if TSC calibration fails */
-	if (this_tsc_khz == 0)
+	if (this_tsc_khz == 0) {
+		/* set tsc_scaling_ratio to a safe value */
+		vcpu->arch.tsc_scaling_ratio = kvm_default_tsc_scaling_ratio;
 		return;
+	}
 
 	/* Compute a scale to convert nanoseconds in TSC cycles */
 	kvm_get_time_scale(this_tsc_khz, NSEC_PER_SEC / 1000,
--
2.4.8
Re: [PATCH] tools lib traceevent: update KVM plugin
Em Fri, Oct 09, 2015 at 10:10:13PM +0200, Paolo Bonzini escreveu: > On 01/10/2015 12:28, Paolo Bonzini wrote: > > The format of the role word has changed through the years and the > > plugin was never updated; some VMX exit reasons were missing too. > > > > Signed-off-by: Paolo Bonzini> > --- > > tools/lib/traceevent/plugin_kvm.c | 25 + > > 1 file changed, 17 insertions(+), 8 deletions(-) > > > > diff --git a/tools/lib/traceevent/plugin_kvm.c > > b/tools/lib/traceevent/plugin_kvm.c > > index 88fe83dff7cd..18536f756577 100644 > > --- a/tools/lib/traceevent/plugin_kvm.c > > +++ b/tools/lib/traceevent/plugin_kvm.c > > @@ -124,7 +124,10 @@ static const char *disassemble(unsigned char *insn, > > int len, uint64_t rip, > > _ER(WBINVD, 54)\ > > _ER(XSETBV, 55)\ > > _ER(APIC_WRITE, 56)\ > > - _ER(INVPCID, 58) > > + _ER(INVPCID, 58)\ > > + _ER(PML_FULL,62)\ > > + _ER(XSAVES, 63)\ > > + _ER(XRSTORS, 64) > > > > #define SVM_EXIT_REASONS \ > > _ER(EXIT_READ_CR0, 0x000) \ > > @@ -352,15 +355,18 @@ static int kvm_nested_vmexit_handler(struct trace_seq > > *s, struct pevent_record * > > union kvm_mmu_page_role { > > unsigned word; > > struct { > > - unsigned glevels:4; > > unsigned level:4; > > + unsigned cr4_pae:1; > > unsigned quadrant:2; > > - unsigned pad_for_nice_hex_output:6; > > unsigned direct:1; > > unsigned access:3; > > unsigned invalid:1; > > - unsigned cr4_pge:1; > > unsigned nxe:1; > > + unsigned cr0_wp:1; > > + unsigned smep_and_not_wp:1; > > + unsigned smap_and_not_wp:1; > > + unsigned pad_for_nice_hex_output:8; > > + unsigned smm:8; > > }; > > }; > > > > @@ -385,15 +391,18 @@ static int kvm_mmu_print_role(struct trace_seq *s, > > struct pevent_record *record, > > if (pevent_is_file_bigendian(event->pevent) == > > pevent_is_host_bigendian(event->pevent)) { > > > > - trace_seq_printf(s, "%u/%u q%u%s %s%s %spge %snxe", > > + trace_seq_printf(s, "%u q%u%s %s%s %spae %snxe %swp%s%s%s", > > role.level, > > -role.glevels, > > role.quadrant, > > role.direct ? 
" direct" : "", > > access_str[role.access], > > role.invalid ? " invalid" : "", > > -role.cr4_pge ? "" : "!", > > -role.nxe ? "" : "!"); > > +role.cr4_pae ? "" : "!", > > +role.nxe ? "" : "!", > > +role.cr0_wp ? "" : "!", > > +role.smep_and_not_wp ? " smep" : "", > > +role.smap_and_not_wp ? " smap" : "", > > +role.smm ? " smm" : ""); > > } else > > trace_seq_printf(s, "WORD: %08x", role.word); > > > > > > Ping? Arnaldo, ok to include this patch in my tree? Applying, I saw it before, but lost track, perhaps was waiting for Steven Rostedt to chime in, but now I noticed he wasn't CCed, he is now. - Arnaldo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [edk2] KVM: MTRR: fix memory type handling if MTRR is completely disabled
Hi, On 10/20/15 19:27, Janusz wrote: > W dniu 15.10.2015 o 20:46, Laszlo Ersek pisze: >> On 10/15/15 18:53, Kinney, Michael D wrote: >>> Laszlo, >>> >>> There is already a PCD for this timeout that is used by CpuMpPei. >>> >>> gUefiCpuPkgTokenSpaceGuid.PcdCpuApInitTimeOutInMicroSeconds >>> >>> I noticed that CpuDxe is using a hard coded AP timeout. I think we should >>> just use this same PCD for both the PEI and DXE CPU module and then set it >>> for OVMF to the compatible value. >> Perfect, thank you! >> >> (I notice the default in the DEC file is 5, which is half of what >> the DXE driver hardcodes.) >> >> Now we only need a recommended (or experimental) value for it, and an >> explanation why 100*1000 is no longer sufficient on KVM :) >> >> Thanks! >> Laszlo >> >> >> > Laszlo, > > I saw that there is already some change in ovmf for MicroSecondDelay > https://github.com/tianocore/edk2/commit/1e410eadd80c328e66868263b3006a274ce81ae0 > Is that a fix for it? Because I tried it and it still doesn't work for > me: https://bpaste.net/show/2514b51bf41f > I still get internal error I think you guys are now "mature enough OVMF users" to start employing the correct terminology. "edk2" (also spelled as "EDK II") is: "a modern, feature-rich, cross-platform firmware development environment for the UEFI and PI specifications". The source tree contains a whole bunch of modules (drivers, applications, libraries), organized into packages. "OVMF" usually denotes a firmware binary built from one of the OvmfPkg/OvmfPkg*.dsc "platform description files". Think of them as "top level makefiles". The difference between them is the target architecture (there's Ia32, X64, and Ia32X64 -- the last one means that the SEC and PEI phases are 32-bit, whereas the DXE and later phases are 64-bit.) In practice you'll only care about full X64. 
Now, each of OvmfPkg/OvmfPkg*.dsc builds the following three kinds of modules into the final binary: - platform-independent modules from various top-level packages - platform- (ie. Ia32/X64-) dependent modules from various top-level packages - modules from under OvmfPkg that are specific to QEMU/KVM (and Xen, if you happen to use OVMF with Xen) Now, when you reference a commit like 1e410ead above, you can look at the diffstat, and decide if it is OvmfPkg-specific (third category above) or not. Here you see UefiCpuPkg, which happens to be the second category. The important point is: please do *not* call any and all edk2 patches "OVMF changes", indiscriminately. That's super confusing for people who understand the above distinctions. Which now you do too. :) Let me add that in edk2, patches that straddle top level packages are generally forbidden -- you can't have a patch that modifies OvmfPkg and UefiCpuPkg at the same time, modulo *very* rare exceptions. If a feature or bugfix needs to touch several top-level packages, the series must be built up carefully in stages. Knowing all of the above, you can tell that the patch you referenced had only *enabled* OvmfPkg to customize UefiCpuPkg, via "PcdCpuApInitTimeOutInMicroSeconds". But for that customization to occur actually, a small patch for OvmfPkg will be necessary too, in order to set "PcdCpuApInitTimeOutInMicroSeconds" differently from the default. I plan to send that patch soon. If you'd like to be CC'd, that's great (reporting back with a Tested-by is even better!), but I'll need your real name for that. (Or any name that looks like a real name.) Thanks! Laszlo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
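Concretely, the OvmfPkg-side change Laszlo describes would amount to a one-line PCD override in the OvmfPkg*.dsc platform files, along these lines (both the section placement and the timeout value below are illustrative guesses, not the actual forthcoming patch):

```ini
[PcdsFixedAtBuild]
  # Override UefiCpuPkg's default AP startup timeout for the QEMU/KVM
  # platform; 200000 is a placeholder value only.
  gUefiCpuPkgTokenSpaceGuid.PcdCpuApInitTimeOutInMicroSeconds|200000
```

Because the PCD lives in UefiCpuPkg's DEC file, OvmfPkg can customize it without touching UefiCpuPkg at all — which is the point of the packaging rules explained above.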
Re: [edk2] KVM: MTRR: fix memory type handling if MTRR is completely disabled
W dniu 15.10.2015 o 20:46, Laszlo Ersek pisze: > On 10/15/15 18:53, Kinney, Michael D wrote: >> Laszlo, >> >> There is already a PCD for this timeout that is used by CpuMpPei. >> >> gUefiCpuPkgTokenSpaceGuid.PcdCpuApInitTimeOutInMicroSeconds >> >> I noticed that CpuDxe is using a hard coded AP timeout. I think we should >> just use this same PCD for both the PEI and DXE CPU module and then set it >> for OVMF to the compatible value. > Perfect, thank you! > > (I notice the default in the DEC file is 5, which is half of what > the DXE driver hardcodes.) > > Now we only need a recommended (or experimental) value for it, and an > explanation why 100*1000 is no longer sufficient on KVM :) > > Thanks! > Laszlo > > > Laszlo, I saw that there is already some change in ovmf for MicroSecondDelay https://github.com/tianocore/edk2/commit/1e410eadd80c328e66868263b3006a274ce81ae0 Is that a fix for it? Because I tried it and it still doesn't work for me: https://bpaste.net/show/2514b51bf41f I still get internal error -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] arm/arm64: KVM: Fix disabled distributor operation
On 10/20/2015 11:44 AM, Christoffer Dall wrote: > On Tue, Oct 20, 2015 at 11:08:44AM +0200, Eric Auger wrote: >> Hi Christoffer, >> On 10/17/2015 10:30 PM, Christoffer Dall wrote: >>> We currently do a single update of the vgic state when the distrbutor >> distributor >>> enable/disable control register is accessed and then bypass updating the >>> state for as long as the distributor remains disabled. >>> >>> This is incorrect, because updating the state does not consider the >>> distributor enable bit, and this you can end up in a situation where an >>> interrupt is marked as pending on the CPU interface, but not pending on >>> the distributor, which is an impossible state to be in, and triggers a >>> warning. Consider for example the following sequence of events: >>> >>> 1. An interrupt is marked as pending on the distributor >>>- the interrupt is also forwarded to the CPU interface >>> 2. The guest turns off the distributor (it's about to do a reboot) >>>- we stop updating the CPU interface state from now on >>> 3. The guest disables the pending interrupt >>>- we remove the pending state from the distributor, but don't touch >>> the CPU interface, see point 2. >>> >>> Since the distributor disable bit really means that no interrupts should >>> be forwarded to the CPU interface, we modify the code to keep updating >>> the internal VGIC state, but always set the CPU interface pending bits >>> to zero when the distributor is disabled. 
>>>
>>> Signed-off-by: Christoffer Dall
>>> ---
>>>  virt/kvm/arm/vgic.c | 11 ++++++-----
>>>  1 file changed, 6 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
>>> index 58b1256..66c6616 100644
>>> --- a/virt/kvm/arm/vgic.c
>>> +++ b/virt/kvm/arm/vgic.c
>>> @@ -1012,6 +1012,12 @@ static int compute_pending_for_cpu(struct kvm_vcpu *vcpu)
>>>  	pend_percpu = vcpu->arch.vgic_cpu.pending_percpu;
>>>  	pend_shared = vcpu->arch.vgic_cpu.pending_shared;
>>>
>>> +	if (!dist->enabled) {
>>> +		bitmap_zero(pend_percpu, VGIC_NR_PRIVATE_IRQS);
>>> +		bitmap_zero(pend_shared, nr_shared);
>>> +		return 0;
>>> +	}
>>> +
>>>  	pending = vgic_bitmap_get_cpu_map(&dist->irq_pending, vcpu_id);
>>>  	enabled = vgic_bitmap_get_cpu_map(&dist->irq_enabled, vcpu_id);
>>>  	bitmap_and(pend_percpu, pending, enabled, VGIC_NR_PRIVATE_IRQS);
>>> @@ -1039,11 +1045,6 @@ void vgic_update_state(struct kvm *kvm)
>>>  	struct kvm_vcpu *vcpu;
>>>  	int c;
>>>
>>> -	if (!dist->enabled) {
>>> -		set_bit(0, dist->irq_pending_on_cpu);
>>> -		return;
>> I am confused. Don't you want to clear the whole bitmap?
>
> So the first line used to set irq_pending_on_cpu for VCPU0 when the
> distributor was disabled, which I think basically worked around the
> guest kernel expecting to see a timer interrupt during boot when doing a
> WFI. Now when that's fixed, we don't need this (gigantuous hack) anymore.
>
> The return statement was also weird and buggy, because it never
> prevented anything from going to the CPU interface, it just stopped
> updating things.

Yeah sorry I read it as an addition so I didn't understand that code but
I definitively understand you remove it. Sorry - tired.

Best Regards

Eric

>
>> Shouldn't we also handle interrupts programmed in the LR. Spec says any
>> ack should return a spurious ID. Is it what is going to happen with the
>> current implementation?
>
> yes, we really should. We should unqueue them, but I haven't seen any
> bugs from this, and I feel like we're already changing a lot with short
> notice, so I'd rather not disrupt anything more right now.
>
> Besides, when we get around to redesigning this whole thing, the concept
> of unqueueing goes away.
>
> I know it sucks reviewing fixes that only fix a subset of a bad
> implementation, but I'm aiming for 'slightly better than current state'
> right now :)
>
> -Christoffer
>
>>> -	}
>>> -
>>>  	kvm_for_each_vcpu(c, vcpu, kvm) {
>>>  		if (compute_pending_for_cpu(vcpu))
>>>  			set_bit(c, dist->irq_pending_on_cpu);
Re: [PATCH] tools lib traceevent: update KVM plugin
On Tue, 20 Oct 2015 17:19:12 +0200 Paolo Bonziniwrote: > > > On 20/10/2015 16:44, Steven Rostedt wrote: > > What happens if you run new perf on an older kernel. Is this new plugin > > going to be screwed up? Plugins should be backward compatible. > > If you run new perf on older kernel, the new plugin will print the > "role" field (see kvm_mmu_print_role) slightly incorrectly. That said, > the existing plugin was _also_ printing the role in a wildly wrong > format, like 2.6.35 vintage; the glevels field was removed by commit > 5b7e0102ae74, "KVM: MMU: Replace role.glevels with role.cr4_pae", in > April 2010. Can we add a check if glevels field exists, and do the old format if it does? I'm very strict on making sure event-parse works with all incarnations of kernels. > > Going forward it's really unlikely that the role will change apart from > adding new bits. These can be added to the plugin while keeping it > backwards-compatible. Addition to the role happen when you implement > new virtual MMU features such as SMEP, SMAP or SMM. That's once per year > or less. > > > Is the plugin even still needed? I'm looking at some of the kvm events > > and they seem to be mostly self sufficient. What ones need a plugin > > today? > > Yes, most of them are. It's only needed for kvm_mmu_get_page, > kvm_mmu_prepare_zap_page and kvm_emulate_insn. The latter is only > interesting if you install the disassembler library, and I wouldn't > really care if it went away. > > kvm_mmu_get_page and kvm_mmu_prepare_zap_page, however, have output like > > kvm_mmu_get_page: [FAILED TO PARSE] mmu_valid_gen=0x2 gfn=786432 > role=1923 root_count=0 unsync=0 created=1 > > without the plugin vs. > > kvm_mmu_get_page: new sp gfn c 3 q0 direct rwx !pae !nxe !wp root 0 > sync > > with the plugin. Thanks for the update. 
-- Steve -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Fwd: Re: [PATCH v3 2/4] KVM: use simple waitqueue for vcpu->wq
Hi kvm-ppc folks, there is a remark for you in here...

Message ID is 20151020140031.gg17...@twins.programming.kicks-ass.net

Thanks,

Paolo

-------- Forwarded Message --------
Subject: Re: [PATCH v3 2/4] KVM: use simple waitqueue for vcpu->wq
Date: Tue, 20 Oct 2015 16:00:31 +0200
From: Peter Zijlstra
To: Daniel Wagner
CC: linux-ker...@vger.kernel.org, linux-rt-us...@vger.kernel.org,
 Marcelo Tosatti, Paolo Bonzini, Paul E. McKenney, Paul Gortmaker,
 Thomas Gleixner

On Tue, Oct 20, 2015 at 09:28:08AM +0200, Daniel Wagner wrote:
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 2280497..f534e15 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -2560,10 +2560,9 @@ static void kvmppc_vcore_blocked(struct kvmppc_vcore *vc)
>  {
>  	struct kvm_vcpu *vcpu;
>  	int do_sleep = 1;
> +	DECLARE_SWAITQUEUE(wait);
> 
> -	DEFINE_WAIT(wait);
> -
> -	prepare_to_wait(&vc->wq, &wait, TASK_INTERRUPTIBLE);
> +	prepare_to_swait(&vc->wq, &wait, TASK_INTERRUPTIBLE);
> 
>  	/*
>  	 * Check one last time for pending exceptions and ceded state after
> @@ -2577,7 +2576,7 @@ static void kvmppc_vcore_blocked(struct kvmppc_vcore *vc)
>  	}
> 
>  	if (!do_sleep) {
> -		finish_wait(&vc->wq, &wait);
> +		finish_swait(&vc->wq, &wait);
>  		return;
>  	}
> 
> @@ -2585,7 +2584,7 @@ static void kvmppc_vcore_blocked(struct kvmppc_vcore *vc)
>  	trace_kvmppc_vcore_blocked(vc, 0);
>  	spin_unlock(&vc->lock);
>  	schedule();
> -	finish_wait(&vc->wq, &wait);
> +	finish_swait(&vc->wq, &wait);
>  	spin_lock(&vc->lock);
>  	vc->vcore_state = VCORE_INACTIVE;
>  	trace_kvmppc_vcore_blocked(vc, 1);

This one looks buggy, one should _NOT_ assume that your blocking
condition is true after schedule().

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 8db1d93..45ab55f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2019,7 +2018,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>  	}
> 
>  	for (;;) {
> -		prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
> +		prepare_to_swait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
> 
>  		if (kvm_vcpu_check_block(vcpu) < 0)
>  			break;
> 
>  		schedule();
>  	}
> 
> -	finish_wait(&vcpu->wq, &wait);
> +	finish_swait(&vcpu->wq, &wait);
>  	cur = ktime_get();
> 
>  out:

Should we not take this opportunity to get rid of these open-coded wait
loops? Does this work?

---
 arch/powerpc/kvm/book3s_hv.c | 33 +++++++++++++++++----------------
 virt/kvm/kvm_main.c          | 13 ++++---------
 2 files changed, 19 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 228049786888..b5b8bcad5105 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2552,18 +2552,10 @@ static void kvmppc_wait_for_exec(struct kvmppc_vcore *vc,
 	finish_wait(&vcpu->arch.cpu_run, &wait);
 }
 
-/*
- * All the vcpus in this vcore are idle, so wait for a decrementer
- * or external interrupt to one of the vcpus.  vc->lock is held.
- */
-static void kvmppc_vcore_blocked(struct kvmppc_vcore *vc)
+static inline bool kvmppc_vcore_should_sleep(struct kvmppc_vcore *vc)
 {
 	struct kvm_vcpu *vcpu;
-	int do_sleep = 1;
-
-	DEFINE_WAIT(wait);
-
-	prepare_to_wait(&vc->wq, &wait, TASK_INTERRUPTIBLE);
+	bool sleep = true;
 
 	/*
 	 * Check one last time for pending exceptions and ceded state after
@@ -2571,26 +2563,35 @@ static void kvmppc_vcore_blocked(struct kvmppc_vcore *vc)
 	 */
 	list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list) {
 		if (vcpu->arch.pending_exceptions || !vcpu->arch.ceded) {
-			do_sleep = 0;
+			sleep = false;
 			break;
 		}
 	}
 
-	if (!do_sleep) {
-		finish_wait(&vc->wq, &wait);
-		return;
-	}
+	return sleep;
+}
 
+static inline void kvmppc_vcore_schedule(struct kvmppc_vcore *vc)
+{
 	vc->vcore_state = VCORE_SLEEPING;
 	trace_kvmppc_vcore_blocked(vc, 0);
 	spin_unlock(&vc->lock);
 	schedule();
-	finish_wait(&vc->wq, &wait);
 	spin_lock(&vc->lock);
 	vc->vcore_state = VCORE_INACTIVE;
 	trace_kvmppc_vcore_blocked(vc, 1);
 }
 
+/*
+ * All the vcpus in this vcore are idle, so wait for a decrementer
+ * or external interrupt to one of the vcpus.  vc->lock is held.
+ */
+static void kvmppc_vcore_blocked(struct kvmppc_vcore *vc)
+{
+	___wait_event(vc->wq, !kvmppc_vcore_should_sleep(vc), TASK_IDLE, 0, 0,
+		      kvmppc_vcore_schedule(vc));
+}
+
 static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct
Re: [PATCH 1/4] vfio: platform: add capability to register a reset function
Hi Alex, On 10/19/2015 09:45 PM, Alex Williamson wrote: > On Sun, 2015-10-18 at 18:00 +0200, Eric Auger wrote: >> In preparation for subsequent changes in reset function lookup, >> lets introduce a dynamic list of reset combos (compat string, >> reset module, reset function). The list can be populated/voided with >> two new functions, vfio_platform_register/unregister_reset. Those are >> not yet used in this patch. >> >> Signed-off-by: Eric Auger>> --- >> drivers/vfio/platform/vfio_platform_common.c | 55 >> +++ >> drivers/vfio/platform/vfio_platform_private.h | 14 +++ >> 2 files changed, 69 insertions(+) >> >> diff --git a/drivers/vfio/platform/vfio_platform_common.c >> b/drivers/vfio/platform/vfio_platform_common.c >> index e43efb5..d36afc9 100644 >> --- a/drivers/vfio/platform/vfio_platform_common.c >> +++ b/drivers/vfio/platform/vfio_platform_common.c >> @@ -23,6 +23,8 @@ >> >> #include "vfio_platform_private.h" >> >> +struct list_head reset_list; > > What's the purpose of this one above? useless indeed > >> +LIST_HEAD(reset_list); > > static > > I think you also need a mutex protecting this list, otherwise nothing > prevents concurrent list changes and walks. A rw lock probably fits the > usage model best, but I don't expect much contention if you just want to > use a mutex. yes you're right. I am going to use a standard mutex I think. 
>
>>  static DEFINE_MUTEX(driver_lock);
>>
>>  static const struct vfio_platform_reset_combo reset_lookup_table[] = {
>> @@ -573,3 +575,56 @@ struct vfio_platform_device *vfio_platform_remove_common(struct device *dev)
>>  	return vdev;
>>  }
>>  EXPORT_SYMBOL_GPL(vfio_platform_remove_common);
>> +
>> +int vfio_platform_register_reset(struct module *reset_owner, char *compat,
>
> const char *

ok

>
>> +				 vfio_platform_reset_fn_t reset)
>> +{
>> +	struct vfio_platform_reset_node *node, *iter;
>> +	bool found = false;
>> +
>> +	list_for_each_entry(iter, &reset_list, link) {
>> +		if (!strcmp(iter->compat, compat)) {
>> +			found = true;
>> +			break;
>> +		}
>> +	}
>> +	if (found)
>> +		return -EINVAL;
>> +
>> +	node = kmalloc(sizeof(*node), GFP_KERNEL);
>> +	if (!node)
>> +		return -ENOMEM;
>> +
>> +	node->compat = kstrdup(compat, GFP_KERNEL);
>> +	if (!node->compat)
>> +		return -ENOMEM;
>
> Leaks node

thanks

>
>> +
>> +	node->owner = reset_owner;
>> +	node->reset = reset;
>> +
>> +	list_add(&node->link, &reset_list);
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_platform_register_reset);
>> +
>> +int vfio_platform_unregister_reset(char *compat)
>
> const char *

ok

Thanks for your quick review!

Best Regards

Eric

>
>> +{
>> +	struct vfio_platform_reset_node *iter;
>> +	bool found = false;
>> +
>> +	list_for_each_entry(iter, &reset_list, link) {
>> +		if (!strcmp(iter->compat, compat)) {
>> +			found = true;
>> +			break;
>> +		}
>> +	}
>> +	if (!found)
>> +		return -EINVAL;
>> +
>> +	list_del(&iter->link);
>> +	kfree(iter->compat);
>> +	kfree(iter);
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_platform_unregister_reset);
>> +
>> diff --git a/drivers/vfio/platform/vfio_platform_private.h b/drivers/vfio/platform/vfio_platform_private.h
>> index 1c9b3d5..17323f0 100644
>> --- a/drivers/vfio/platform/vfio_platform_private.h
>> +++ b/drivers/vfio/platform/vfio_platform_private.h
>> @@ -76,6 +76,15 @@ struct vfio_platform_reset_combo {
>>  	const char *module_name;
>>  };
>>
>> +typedef int (*vfio_platform_reset_fn_t)(struct vfio_platform_device *vdev);
>> +
>> +struct vfio_platform_reset_node {
>> +	struct list_head link;
>> +	char *compat;
>> +	struct module *owner;
>> +	vfio_platform_reset_fn_t reset;
>> +};
>> +
>>  extern int vfio_platform_probe_common(struct vfio_platform_device *vdev,
>>  				      struct device *dev);
>>  extern struct vfio_platform_device *vfio_platform_remove_common
>> @@ -89,4 +98,9 @@ extern int vfio_platform_set_irqs_ioctl(struct vfio_platform_device *vdev,
>>  					unsigned start, unsigned count,
>>  					void *data);
>>
>> +extern int vfio_platform_register_reset(struct module *owner,
>> +					char *compat,
>> +					vfio_platform_reset_fn_t reset);
>> +extern int vfio_platform_unregister_reset(char *compat);
>> +
>>  #endif /* VFIO_PLATFORM_PRIVATE_H */
Re: [edk2] KVM: MTRR: fix memory type handling if MTRR is completely disabled
W dniu 20.10.2015 o 19:44, Laszlo Ersek pisze: > Hi, > > On 10/20/15 19:27, Janusz wrote: >> W dniu 15.10.2015 o 20:46, Laszlo Ersek pisze: >>> On 10/15/15 18:53, Kinney, Michael D wrote: Laszlo, There is already a PCD for this timeout that is used by CpuMpPei. gUefiCpuPkgTokenSpaceGuid.PcdCpuApInitTimeOutInMicroSeconds I noticed that CpuDxe is using a hard coded AP timeout. I think we should just use this same PCD for both the PEI and DXE CPU module and then set it for OVMF to the compatible value. >>> Perfect, thank you! >>> >>> (I notice the default in the DEC file is 5, which is half of what >>> the DXE driver hardcodes.) >>> >>> Now we only need a recommended (or experimental) value for it, and an >>> explanation why 100*1000 is no longer sufficient on KVM :) >>> >>> Thanks! >>> Laszlo >>> >>> >>> >> Laszlo, >> >> I saw that there is already some change in ovmf for MicroSecondDelay >> https://github.com/tianocore/edk2/commit/1e410eadd80c328e66868263b3006a274ce81ae0 >> Is that a fix for it? Because I tried it and it still doesn't work for >> me: https://bpaste.net/show/2514b51bf41f >> I still get internal error > I think you guys are now "mature enough OVMF users" to start employing > the correct terminology. Sory for that :) > "edk2" (also spelled as "EDK II") is: "a modern, feature-rich, > cross-platform firmware development environment for the UEFI and PI > specifications". > > The source tree contains a whole bunch of modules (drivers, > applications, libraries), organized into packages. > > "OVMF" usually denotes a firmware binary built from one of the > OvmfPkg/OvmfPkg*.dsc "platform description files". Think of them as "top > level makefiles". The difference between them is the target architecture > (there's Ia32, X64, and Ia32X64 -- the last one means that the SEC and > PEI phases are 32-bit, whereas the DXE and later phases are 64-bit.) In > practice you'll only care about full X64. 
> > Now, each of OvmfPkg/OvmfPkg*.dsc builds the following three kinds of > modules into the final binary: > - platform-independent modules from various top-level packages > - platform- (ie. Ia32/X64-) dependent modules from various top-level > packages > - modules from under OvmfPkg that are specific to QEMU/KVM (and Xen, if > you happen to use OVMF with Xen) > > Now, when you reference a commit like 1e410ead above, you can look at > the diffstat, and decide if it is OvmfPkg-specific (third category > above) or not. Here you see UefiCpuPkg, which happens to be the second > category. > > The important point is: please do *not* call any and all edk2 patches > "OVMF changes", indiscriminately. That's super confusing for people who > understand the above distinctions. Which now you do too. :) > > Let me add that in edk2, patches that straddle top level packages are > generally forbidden -- you can't have a patch that modifies OvmfPkg and > UefiCpuPkg at the same time, modulo *very* rare exceptions. If a feature > or bugfix needs to touch several top-level packages, the series must be > built up carefully in stages. > > Knowing all of the above, you can tell that the patch you referenced had > only *enabled* OvmfPkg to customize UefiCpuPkg, via > "PcdCpuApInitTimeOutInMicroSeconds". But for that customization to occur > actually, a small patch for OvmfPkg will be necessary too, in order to > set "PcdCpuApInitTimeOutInMicroSeconds" differently from the default. > > I plan to send that patch soon. If you'd like to be CC'd, that's great > (reporting back with a Tested-by is even better!), but I'll need your > real name for that. (Or any name that looks like a real name.) would be great if you could add me to cc list, thanks > > Thanks! > Laszlo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] tools lib traceevent: update KVM plugin
On Tue, Oct 20, 2015 at 05:49:22PM +0200, Paolo Bonzini wrote: > > > On 20/10/2015 17:48, Steven Rostedt wrote: > > On Tue, 20 Oct 2015 17:37:43 +0200 > > Paolo Bonzini wrote: > > > >> However, it frankly seems a bit academic. The parsing _will_ work, > >> apart from printing a nonsensical role just like it has always done for > >> the past four years. > > > > I'm not going to be too picky about it. But things like this may seem > > academic, but it's also what we forget about, and people still use > > 2.6.32 kernels, and this may come back and bite us later. > > > > But it's more likely to bite you than me, so if you don't think it's > > worth while, then I'm not going to push it. > > Thanks. The role is not going to matter except in really weird cases > and hopefully those are just not there in 2.6.32 kernels. Thanks also > for educating me so I won't get it wrong in the future! Cool, Steven, can I take that as an Acked-by for this patch? - Arnaldo
Re: [PATCH] tools lib traceevent: update KVM plugin
On 20/10/2015 17:26, Steven Rostedt wrote: > > > > What happens if you run new perf on an older kernel. Is this new plugin > > > > going to be screwed up? Plugins should be backward compatible. > > > > If you run new perf on older kernel, the new plugin will print the > > "role" field (see kvm_mmu_print_role) slightly incorrectly. That said, > > the existing plugin was _also_ printing the role in a wildly wrong > > format, like 2.6.35 vintage; the glevels field was removed by commit > > 5b7e0102ae74, "KVM: MMU: Replace role.glevels with role.cr4_pae", in > > April 2010. > > Can we add a check if glevels field exists, and do the old format if it > does? I'm very strict on making sure event-parse works with all > incarnations of kernels. Note that this is not a glevels _tracepoint_ field. The tracepoint field is "role" and it is a 64-bit integer. The plugin interprets the bitfields of that integer, and the position/meaning of the bitfields has changed. I guess I could use the mmu_valid_gen field as a proxy (it was introduced after glevels was removed), and only provide the "very old" (up to 2.6.34) and "new" (4.2+) formats. The "old" format (2.6.35-4.1) was never implemented in the plugin; it may well remain that way. However, it frankly seems a bit academic. The parsing _will_ work, apart from printing a nonsensical role just like it has always done for the past four years. Paolo
Re: Difference between vcpu_load and kvm_sched_in ?
On 20/10/2015 11:57, Yacine wrote: > vcpu_load; start cr3 trapping; vcpu_put > > it worked correctly (in my logs I see that vcpu.cpu becomes equal to "cpu = > raw_smp_processor_id();") but the VM blocks for a lot of time due to the mutex > in vcpu_load (up to several seconds and sometimes minutes!) Right, that's because while the CPU is running the mutex is taken. If the VCPU doesn't exit, the mutex is held. > I replaced vcpu_load with kvm_sched_in, now everything works perfectly and > the VM doesn't block at all (logs here: http://pastebin.com/h5XNNMcb). > > So, what I want to know is: what is the difference between vcpu_load and > kvm_sched_in ? Both of these functions call kvm_arch_vcpu_load, but the latter > one does it without taking the mutex. kvm_sched_out and kvm_sched_in are part of KVM's preemption hooks. The hooks are registered only between vcpu_load and vcpu_put, therefore they know that the mutex is taken. The sequence will go like this: vcpu_load, kvm_sched_out, kvm_sched_in, kvm_sched_out, kvm_sched_in, ..., vcpu_put, and it will all happen with the mutex held. > Is there a problem in using kvm_sched_in instead of vcpu_load for my use case > ? Yes, unfortunately it is a problem: you are loading the same VMCS on two processors, which has undefined results. To fix the problem, wrap the ioctl into a function and pass the function to QEMU's "run_on_cpu" function. It will send the ioctl from the right thread, so that the kernel will not be holding the vcpu mutex. Paolo
Re: [PATCH v4 28/33] nvdimm acpi: support DSM_FUN_IMPLEMENTED function
On 10/21/2015 12:26 AM, Xiao Guangrong wrote: On 10/20/2015 11:51 PM, Stefan Hajnoczi wrote: On Mon, Oct 19, 2015 at 08:54:14AM +0800, Xiao Guangrong wrote: +exit: +/* Write our output result to dsm memory. */ +((dsm_out *)dsm_ram_addr)->len = out->len; Missing byteswap? I thought you were going to remove this field because it wasn't needed by the guest. The @len is the size of the _DSM result buffer; for example, for the function DSM_FUN_IMPLEMENTED the result buffer is 8 bytes, and for DSM_DEV_FUN_NAMESPACE_LABEL_SIZE the buffer size is 4 bytes. It tells the ASL code Sorry, s/DSM_DEV_FUN_NAMESPACE_LABEL_SIZE/DSM_DEV_FUN_SET_NAMESPACE_LABEL_DATA how much memory we need to return to the _DSM caller. In the _DSM code, it's handled like this: "RLEN" is @len, "OBUF" is the remaining memory in the DSM page. /* get @len */ aml_append(method, aml_store(aml_name("RLEN"), aml_local(6))); /* @len << 3 to get bits. */ aml_append(method, aml_store(aml_shiftleft(aml_local(6), aml_int(3)), aml_local(6))); /* get @len << 3 bits from OBUF, and return it to the caller. */ aml_append(method, aml_create_field(aml_name("ODAT"), aml_int(0), aml_local(6), "OBUF")); Since @len is used internally, it is not returned to the guest, so I did not do a byteswap here.
Re: [PATCH v4 28/33] nvdimm acpi: support DSM_FUN_IMPLEMENTED function
On Tue, Oct 20, 2015 at 04:51:49PM +0100, Stefan Hajnoczi wrote: > On Mon, Oct 19, 2015 at 08:54:14AM +0800, Xiao Guangrong wrote: > > +exit: > > +/* Write our output result to dsm memory. */ > > +((dsm_out *)dsm_ram_addr)->len = out->len; > > Missing byteswap? That's why I'd prefer no structures, using build_append_int_noprefix, unless the structure is used outside ACPI as well. > I thought you were going to remove this field because it wasn't needed > by the guest.
Re: [PATCH] tools lib traceevent: update KVM plugin
On Tue, 20 Oct 2015 12:50:26 -0300 Arnaldo Carvalho de Melo wrote: > Cool, Steven, can I take that as an Acked-by for this patch? > Acked-by: Steven Rostedt -- Steve
Re: [PATCH v4 28/33] nvdimm acpi: support DSM_FUN_IMPLEMENTED function
On Mon, Oct 19, 2015 at 08:54:14AM +0800, Xiao Guangrong wrote: > +exit: > +/* Write our output result to dsm memory. */ > +((dsm_out *)dsm_ram_addr)->len = out->len; Missing byteswap? I thought you were going to remove this field because it wasn't needed by the guest.
Re: [PATCH v4 28/33] nvdimm acpi: support DSM_FUN_IMPLEMENTED function
On 10/20/2015 11:51 PM, Stefan Hajnoczi wrote: On Mon, Oct 19, 2015 at 08:54:14AM +0800, Xiao Guangrong wrote: +exit: +/* Write our output result to dsm memory. */ +((dsm_out *)dsm_ram_addr)->len = out->len; Missing byteswap? I thought you were going to remove this field because it wasn't needed by the guest. The @len is the size of the _DSM result buffer; for example, for the function DSM_FUN_IMPLEMENTED the result buffer is 8 bytes, and for DSM_DEV_FUN_NAMESPACE_LABEL_SIZE the buffer size is 4 bytes. It tells the ASL code how much memory we need to return to the _DSM caller. In the _DSM code, it's handled like this: "RLEN" is @len, "OBUF" is the remaining memory in the DSM page. /* get @len */ aml_append(method, aml_store(aml_name("RLEN"), aml_local(6))); /* @len << 3 to get bits. */ aml_append(method, aml_store(aml_shiftleft(aml_local(6), aml_int(3)), aml_local(6))); /* get @len << 3 bits from OBUF, and return it to the caller. */ aml_append(method, aml_create_field(aml_name("ODAT"), aml_int(0), aml_local(6), "OBUF")); Since @len is used internally, it is not returned to the guest, so I did not do a byteswap here.
Re: [PATCH] tools lib traceevent: update KVM plugin
On 20/10/2015 16:44, Steven Rostedt wrote: > What happens if you run new perf on an older kernel. Is this new plugin > going to be screwed up? Plugins should be backward compatible. If you run new perf on an older kernel, the new plugin will print the "role" field (see kvm_mmu_print_role) slightly incorrectly. That said, the existing plugin was _also_ printing the role in a wildly wrong format, like 2.6.35 vintage; the glevels field was removed by commit 5b7e0102ae74, "KVM: MMU: Replace role.glevels with role.cr4_pae", in April 2010. Going forward it's really unlikely that the role will change apart from adding new bits. These can be added to the plugin while keeping it backwards-compatible. Additions to the role happen when you implement new virtual MMU features such as SMEP, SMAP or SMM. That's once per year or less. > Is the plugin even still needed? I'm looking at some of the kvm events > and they seem to be mostly self sufficient. What ones need a plugin > today? Yes, most of them are. It's only needed for kvm_mmu_get_page, kvm_mmu_prepare_zap_page and kvm_emulate_insn. The latter is only interesting if you install the disassembler library, and I wouldn't really care if it went away. kvm_mmu_get_page and kvm_mmu_prepare_zap_page, however, have output like kvm_mmu_get_page: [FAILED TO PARSE] mmu_valid_gen=0x2 gfn=786432 role=1923 root_count=0 unsync=0 created=1 without the plugin vs. kvm_mmu_get_page: new sp gfn c 3 q0 direct rwx !pae !nxe !wp root 0 sync with the plugin. Paolo
[GIT PULL 3/6] KVM: arm: use GIC support unconditionally
From: Arnd Bergmann The vgic code on ARM is built for all configurations that enable KVM, but the parent_data field that it references is only present when CONFIG_IRQ_DOMAIN_HIERARCHY is set: virt/kvm/arm/vgic.c: In function 'kvm_vgic_map_phys_irq': virt/kvm/arm/vgic.c:1781:13: error: 'struct irq_data' has no member named 'parent_data' This flag is implied by the GIC driver, and indeed the VGIC code only makes sense if a GIC is present. This changes the CONFIG_KVM symbol to always select GIC, which avoids the issue. Fixes: 662d9715840 ("arm/arm64: KVM: Kill CONFIG_KVM_ARM_{VGIC,TIMER}") Signed-off-by: Arnd Bergmann Acked-by: Marc Zyngier Signed-off-by: Christoffer Dall --- arch/arm/kvm/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig index 210ecca..356970f 100644 --- a/arch/arm/kvm/Kconfig +++ b/arch/arm/kvm/Kconfig @@ -21,6 +21,7 @@ config KVM depends on MMU && OF select PREEMPT_NOTIFIERS select ANON_INODES + select ARM_GIC select HAVE_KVM_CPU_RELAX_INTERCEPT select HAVE_KVM_ARCH_TLB_FLUSH_ALL select KVM_MMIO -- 2.1.2.330.g565301e.dirty
[GIT PULL 2/6] KVM: arm/arm64: Fix memory leak if timer initialization fails
From: Pavel Fedin Jump to the correct label and free kvm_host_cpu_state. Reviewed-by: Wei Huang Signed-off-by: Pavel Fedin Signed-off-by: Christoffer Dall --- arch/arm/kvm/arm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index dc017ad..78b2869 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -1080,7 +1080,7 @@ static int init_hyp_mode(void) */ err = kvm_timer_hyp_init(); if (err) - goto out_free_mappings; + goto out_free_context; #ifndef CONFIG_HOTPLUG_CPU free_boot_hyp_pgd(); -- 2.1.2.330.g565301e.dirty
[GIT PULL 6/6] arm/arm64: KVM: Fix disabled distributor operation
We currently do a single update of the vgic state when the distributor enable/disable control register is accessed and then bypass updating the state for as long as the distributor remains disabled. This is incorrect, because updating the state does not consider the distributor enable bit, and thus you can end up in a situation where an interrupt is marked as pending on the CPU interface, but not pending on the distributor, which is an impossible state to be in, and triggers a warning. Consider for example the following sequence of events: 1. An interrupt is marked as pending on the distributor - the interrupt is also forwarded to the CPU interface 2. The guest turns off the distributor (it's about to do a reboot) - we stop updating the CPU interface state from now on 3. The guest disables the pending interrupt - we remove the pending state from the distributor, but don't touch the CPU interface, see point 2. Since the distributor disable bit really means that no interrupts should be forwarded to the CPU interface, we modify the code to keep updating the internal VGIC state, but always set the CPU interface pending bits to zero when the distributor is disabled. 
Signed-off-by: Christoffer Dall --- virt/kvm/arm/vgic.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c index 58b1256..66c6616 100644 --- a/virt/kvm/arm/vgic.c +++ b/virt/kvm/arm/vgic.c @@ -1012,6 +1012,12 @@ static int compute_pending_for_cpu(struct kvm_vcpu *vcpu) pend_percpu = vcpu->arch.vgic_cpu.pending_percpu; pend_shared = vcpu->arch.vgic_cpu.pending_shared; + if (!dist->enabled) { + bitmap_zero(pend_percpu, VGIC_NR_PRIVATE_IRQS); + bitmap_zero(pend_shared, nr_shared); + return 0; + } + pending = vgic_bitmap_get_cpu_map(&dist->irq_pending, vcpu_id); enabled = vgic_bitmap_get_cpu_map(&dist->irq_enabled, vcpu_id); bitmap_and(pend_percpu, pending, enabled, VGIC_NR_PRIVATE_IRQS); @@ -1039,11 +1045,6 @@ void vgic_update_state(struct kvm *kvm) struct kvm_vcpu *vcpu; int c; - if (!dist->enabled) { - set_bit(0, dist->irq_pending_on_cpu); - return; - } - kvm_for_each_vcpu(c, vcpu, kvm) { if (compute_pending_for_cpu(vcpu)) set_bit(c, dist->irq_pending_on_cpu); -- 2.1.2.330.g565301e.dirty
Re: [PATCH] tools lib traceevent: update KVM plugin
On 20/10/2015 17:48, Steven Rostedt wrote: > On Tue, 20 Oct 2015 17:37:43 +0200 > Paolo Bonzini wrote: > >> However, it frankly seems a bit academic. The parsing _will_ work, >> apart from printing a nonsensical role just like it has always done for >> the past four years. > > I'm not going to be too picky about it. But things like this may seem > academic, but it's also what we forget about, and people still use > 2.6.32 kernels, and this may come back and bite us later. > > But it's more likely to bite you than me, so if you don't think it's > worth while, then I'm not going to push it. Thanks. The role is not going to matter except in really weird cases and hopefully those are just not there in 2.6.32 kernels. Thanks also for educating me so I won't get it wrong in the future! Paolo
Re: [PATCH] tools lib traceevent: update KVM plugin
On Tue, 20 Oct 2015 17:37:43 +0200 Paolo Bonzini wrote: > However, it frankly seems a bit academic. The parsing _will_ work, > apart from printing a nonsensical role just like it has always done for > the past four years. I'm not going to be too picky about it. But things like this may seem academic, but it's also what we forget about, and people still use 2.6.32 kernels, and this may come back and bite us later. But it's more likely to bite you than me, so if you don't think it's worthwhile, then I'm not going to push it. -- Steve
[GIT PULL 0/6] A handful of fixes for KVM/ARM for v4.3-rc7
Hi Paolo, The following changes since commit 920552b213e3dc832a874b4e7ba29ecddbab31bc: KVM: disable halt_poll_ns as default for s390x (2015-09-25 10:31:30 +0200) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git tags/kvm-arm-for-v4.3-rc7 for you to fetch changes up to 0d997491f814c87310a6ad7be30a9049c7150489: arm/arm64: KVM: Fix disabled distributor operation (2015-10-20 18:09:13 +0200) Sorry for sending these relatively late, but we had a situation where we found one breakage in the timer implementation changes merged for 4.3, then fixing that issue revealed another bug, and then that happened again, and now we have something that looks stable. Description of the fixes is in the tag and quoted below. Thanks, -Christoffer A late round of KVM/ARM fixes for v4.3-rc7, fixing: - A bug where level-triggered interrupts lowered from userspace are still routed to the guest - A memory leak in a failed initialization path - A build error under certain configurations - Several timer bugs introduced with moving the timer to the active state handling instead of the masking trick. Arnd Bergmann (1): KVM: arm: use GIC support unconditionally Christoffer Dall (3): arm/arm64: KVM: Fix arch timer behavior for disabled interrupts arm/arm64: KVM: Clear map->active on pend/active clear arm/arm64: KVM: Fix disabled distributor operation Pavel Fedin (2): KVM: arm/arm64: Do not inject spurious interrupts KVM: arm/arm64: Fix memory leak if timer initialization fails arch/arm/kvm/Kconfig | 1 + arch/arm/kvm/arm.c | 2 +- virt/kvm/arm/arch_timer.c | 19 ++ virt/kvm/arm/vgic.c | 95 +++ 4 files changed, 76 insertions(+), 41 deletions(-)
[GIT PULL 5/6] arm/arm64: KVM: Clear map->active on pend/active clear
When a guest reboots or offlines/onlines CPUs, it is not uncommon for it to clear the pending and active states of an interrupt through the emulated VGIC distributor. However, the architected timers are defined by the architecture to be level triggered and the guest rightfully expects them to behave that way, while we emulate them as edge-triggered; we therefore have to mimic level-triggered behavior for an edge-triggered virtual implementation. We currently do not signal the VGIC when the map->active field is true, because it indicates that the guest has already been signalled of the interrupt as required. Normally this field is set to false when the guest deactivates the virtual interrupt through the sync path. We also need to catch the case where the guest deactivates the interrupt through the emulated distributor, again allowing guests to boot even if the original virtual timer signal hit before the guest's GIC initialization sequence is run. Reviewed-by: Eric Auger Signed-off-by: Christoffer Dall --- virt/kvm/arm/vgic.c | 32 +++- 1 file changed, 31 insertions(+), 1 deletion(-) diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c index ea21bc2..58b1256 100644 --- a/virt/kvm/arm/vgic.c +++ b/virt/kvm/arm/vgic.c @@ -531,6 +531,34 @@ bool vgic_handle_set_pending_reg(struct kvm *kvm, return false; } +/* + * If a mapped interrupt's state has been modified by the guest such that it + * is no longer active or pending, without it have gone through the sync path, + * then the map->active field must be cleared so the interrupt can be taken + * again. 
+ */ +static void vgic_handle_clear_mapped_irq(struct kvm_vcpu *vcpu) +{ + struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu; + struct list_head *root; + struct irq_phys_map_entry *entry; + struct irq_phys_map *map; + + rcu_read_lock(); + + /* Check for PPIs */ + root = &vgic_cpu->irq_phys_map_list; + list_for_each_entry_rcu(entry, root, entry) { + map = &entry->map; + + if (!vgic_dist_irq_is_pending(vcpu, map->virt_irq) && + !vgic_irq_is_active(vcpu, map->virt_irq)) + map->active = false; + } + + rcu_read_unlock(); +} + bool vgic_handle_clear_pending_reg(struct kvm *kvm, struct kvm_exit_mmio *mmio, phys_addr_t offset, int vcpu_id) @@ -561,6 +589,7 @@ bool vgic_handle_clear_pending_reg(struct kvm *kvm, vcpu_id, offset); vgic_reg_access(mmio, reg, offset, mode); + vgic_handle_clear_mapped_irq(kvm_get_vcpu(kvm, vcpu_id)); vgic_update_state(kvm); return true; } @@ -598,6 +627,7 @@ bool vgic_handle_clear_active_reg(struct kvm *kvm, ACCESS_READ_VALUE | ACCESS_WRITE_CLEARBIT); if (mmio->is_write) { + vgic_handle_clear_mapped_irq(kvm_get_vcpu(kvm, vcpu_id)); vgic_update_state(kvm); return true; } @@ -1406,7 +1436,7 @@ static int vgic_sync_hwirq(struct kvm_vcpu *vcpu, struct vgic_lr vlr) return 0; map = vgic_irq_map_search(vcpu, vlr.irq); - BUG_ON(!map || !map->active); + BUG_ON(!map); ret = irq_get_irqchip_state(map->irq, IRQCHIP_STATE_ACTIVE, -- 2.1.2.330.g565301e.dirty
[GIT PULL 1/6] KVM: arm/arm64: Do not inject spurious interrupts
From: Pavel Fedin When lowering a level-triggered line from userspace, we forgot to lower the pending bit on the emulated CPU interface and we also did not re-compute the pending_on_cpu bitmap for the CPU affected by the change. Update vgic_update_irq_pending() to fix the two issues above and also raise a warning in vgic_queue_irq_to_lr if we encounter an interrupt pending on a CPU which is neither marked active nor pending. [ Commit text reworked completely - Christoffer ] Signed-off-by: Pavel Fedin Signed-off-by: Christoffer Dall --- virt/kvm/arm/vgic.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c index 6bd1c9b..596455a 100644 --- a/virt/kvm/arm/vgic.c +++ b/virt/kvm/arm/vgic.c @@ -1132,7 +1132,8 @@ static void vgic_queue_irq_to_lr(struct kvm_vcpu *vcpu, int irq, kvm_debug("Set active, clear distributor: 0x%x\n", vlr.state); vgic_irq_clear_active(vcpu, irq); vgic_update_state(vcpu->kvm); - } else if (vgic_dist_irq_is_pending(vcpu, irq)) { + } else { + WARN_ON(!vgic_dist_irq_is_pending(vcpu, irq)); vlr.state |= LR_STATE_PENDING; kvm_debug("Set pending: 0x%x\n", vlr.state); } @@ -1607,8 +1608,12 @@ static int vgic_update_irq_pending(struct kvm *kvm, int cpuid, } else { if (level_triggered) { vgic_dist_irq_clear_level(vcpu, irq_num); - if (!vgic_dist_irq_soft_pend(vcpu, irq_num)) + if (!vgic_dist_irq_soft_pend(vcpu, irq_num)) { vgic_dist_irq_clear_pending(vcpu, irq_num); + vgic_cpu_irq_clear(vcpu, irq_num); + if (!compute_pending_for_cpu(vcpu)) + clear_bit(cpuid, dist->irq_pending_on_cpu); + } } ret = false; -- 2.1.2.330.g565301e.dirty
[GIT PULL 4/6] arm/arm64: KVM: Fix arch timer behavior for disabled interrupts
We have an interesting issue when the guest disables the timer interrupt on the VGIC, which happens when turning VCPUs off using PSCI, for example. The problem is that because the guest disables the virtual interrupt at the VGIC level, we never inject interrupts to the guest and therefore never mark the interrupt as active on the physical distributor. The host also never takes the timer interrupt (we only use the timer device to trigger a guest exit and everything else is done in software), so the interrupt does not become active through normal means. The result is that we keep entering the guest with a programmed timer that will always fire as soon as we context switch the hardware timer state and run the guest, preventing forward progress for the VCPU. Since the active state on the physical distributor is really part of the timer logic, it is the job of our virtual arch timer driver to manage this state. The timer->map->active boolean field indicates whether we have signalled this interrupt to the vgic and if that interrupt is still pending or active. As long as that is the case, the hardware doesn't have to generate physical interrupts and therefore we mark the interrupt as active on the physical distributor. We also have to restore the pending state of an interrupt that was queued to an LR but was retired from the LR for some reason, while remaining pending in the LR. 
Cc: Marc Zyngier Reported-by: Lorenzo Pieralisi Signed-off-by: Christoffer Dall --- virt/kvm/arm/arch_timer.c | 19 +++ virt/kvm/arm/vgic.c | 43 +++ 2 files changed, 30 insertions(+), 32 deletions(-) diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c index 48c6e1a..b9d3a32 100644 --- a/virt/kvm/arm/arch_timer.c +++ b/virt/kvm/arm/arch_timer.c @@ -137,6 +137,8 @@ bool kvm_timer_should_fire(struct kvm_vcpu *vcpu) void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu) { struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu; + bool phys_active; + int ret; /* * We're about to run this vcpu again, so there is no need to @@ -151,6 +153,23 @@ void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu) */ if (kvm_timer_should_fire(vcpu)) kvm_timer_inject_irq(vcpu); + + /* +* We keep track of whether the edge-triggered interrupt has been +* signalled to the vgic/guest, and if so, we mask the interrupt and +* the physical distributor to prevent the timer from raising a +* physical interrupt whenever we run a guest, preventing forward +* VCPU progress. +*/ + if (kvm_vgic_get_phys_irq_active(timer->map)) + phys_active = true; + else + phys_active = false; + + ret = irq_set_irqchip_state(timer->map->irq, + IRQCHIP_STATE_ACTIVE, + phys_active); + WARN_ON(ret); } /** diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c index 596455a..ea21bc2 100644 --- a/virt/kvm/arm/vgic.c +++ b/virt/kvm/arm/vgic.c @@ -1092,6 +1092,15 @@ static void vgic_retire_lr(int lr_nr, int irq, struct kvm_vcpu *vcpu) struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu; struct vgic_lr vlr = vgic_get_lr(vcpu, lr_nr); + /* +* We must transfer the pending state back to the distributor before +* retiring the LR, otherwise we may lose edge-triggered interrupts. 
+*/ + if (vlr.state & LR_STATE_PENDING) { + vgic_dist_irq_set_pending(vcpu, irq); + vlr.hwirq = 0; + } + vlr.state = 0; vgic_set_lr(vcpu, lr_nr, vlr); clear_bit(lr_nr, vgic_cpu->lr_used); @@ -1241,7 +1250,7 @@ static void __kvm_vgic_flush_hwstate(struct kvm_vcpu *vcpu) struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu; struct vgic_dist *dist = &vcpu->kvm->arch.vgic; unsigned long *pa_percpu, *pa_shared; - int i, vcpu_id, lr, ret; + int i, vcpu_id; int overflow = 0; int nr_shared = vgic_nr_shared_irqs(dist); @@ -1296,31 +1305,6 @@ epilog: */ clear_bit(vcpu_id, dist->irq_pending_on_cpu); } - - for (lr = 0; lr < vgic->nr_lr; lr++) { - struct vgic_lr vlr; - - if (!test_bit(lr, vgic_cpu->lr_used)) - continue; - - vlr = vgic_get_lr(vcpu, lr); - - /* -* If we have a mapping, and the virtual interrupt is -* presented to the guest (as pending or active), then we must -* set the state to active in the physical world. See -* Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt. -*/ - if (vlr.state & LR_HW) { - struct irq_phys_map *map; - map
Re: Steal time accounting in KVM. Benchmark.
'echo NO_NONTASK_CAPACITY > /sys/kernel/debug/sched_features' in both guests. Results: VM1: STA is disabled -- no changes, still a little bit below the expected 90% VM2: STA is enabled -- the result changed, but it is still bad. Hard to say whether better or worse. It prefers to get stuck at quarters (100% 75% 50% 25%) Output is attached. Thanks, Alexey On Tue, Oct 20, 2015 at 5:39 AM, Wanpeng Li wrote: > Cc Peterz, > 2015-10-20 5:58 GMT+08:00 Alexey Makhalov : >> >> Hi, >> >> I did benchmarking of scheduler fairness with enabled steal time >> accounting (STA) in KVM. >> And the results are really interesting. >> >> Looks like STA provides worse scheduler fairness than disabled STA >> (no-steal-acc cmdline param) >> >> I created a benchmark; the main idea is: 2 cgroups with a cpu.shares >> proportion like 1/9, run identical work in both groups, and expect >> to get the same proportion of work done – 1/9. Condition – CPU >> overcommit. >> >> On bare metal it is fair +- some percentage of fluctuation. >> On KVM with no STA it's less fair. With STA enabled the results are >> ugly! Once again – in a CPU overcommit situation. >> >> >> Host: ubuntu 14.04, Intel i5 – 2x2 CPUs >> 2 VMs (4 vCPUs each) are working in parallel. 2:1 CPU overcommit. >> >> Each VM has a running benchmark: >> cgroups cpu.shares proportion is 128/1152 (10%/90%), work – spinning >> in a cycle, the number of cycles being counted. > > > Could you try if 'echo NO_NONTASK_CAPACITY > > /sys/kernel/debug/sched_features' in guests works? 
> Regards,
> Wanpeng Li

[Attachment: benchmark output — columns of "CPU% sample-time" pairs.
Inline annotations preserved from the data: "benchmark was stopped on
the first VM"; "VM1 was in idle from here ... to here". VM2 readings
cluster around 100%, 90%, 75%, 50% and 25% over the run.]
[PATCH 2/4] tools lib traceevent: update KVM plugin
From: Paolo Bonzini

The format of the role word has changed through the years and the
plugin was never updated; some VMX exit reasons were missing too.

Signed-off-by: Paolo Bonzini
Acked-by: Steven Rostedt
Cc: David Ahern
Cc: Namhyung Kim
Cc: kvm@vger.kernel.org
Link: http://lkml.kernel.org/r/1443695293-31127-1-git-send-email-pbonz...@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo
---
 tools/lib/traceevent/plugin_kvm.c | 25 +
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/tools/lib/traceevent/plugin_kvm.c b/tools/lib/traceevent/plugin_kvm.c
index 88fe83dff7cd..18536f756577 100644
--- a/tools/lib/traceevent/plugin_kvm.c
+++ b/tools/lib/traceevent/plugin_kvm.c
@@ -124,7 +124,10 @@ static const char *disassemble(unsigned char *insn, int len, uint64_t rip,
 	_ER(WBINVD,		54)		\
 	_ER(XSETBV,		55)		\
 	_ER(APIC_WRITE,		56)		\
-	_ER(INVPCID,		58)
+	_ER(INVPCID,		58)		\
+	_ER(PML_FULL,		62)		\
+	_ER(XSAVES,		63)		\
+	_ER(XRSTORS,		64)

 #define SVM_EXIT_REASONS \
 	_ER(EXIT_READ_CR0,	0x000)	\
@@ -352,15 +355,18 @@ static int kvm_nested_vmexit_handler(struct trace_seq *s, struct pevent_record *
 union kvm_mmu_page_role {
 	unsigned word;
 	struct {
-		unsigned glevels:4;
 		unsigned level:4;
+		unsigned cr4_pae:1;
 		unsigned quadrant:2;
-		unsigned pad_for_nice_hex_output:6;
 		unsigned direct:1;
 		unsigned access:3;
 		unsigned invalid:1;
-		unsigned cr4_pge:1;
 		unsigned nxe:1;
+		unsigned cr0_wp:1;
+		unsigned smep_and_not_wp:1;
+		unsigned smap_and_not_wp:1;
+		unsigned pad_for_nice_hex_output:8;
+		unsigned smm:8;
 	};
 };

@@ -385,15 +391,18 @@ static int kvm_mmu_print_role(struct trace_seq *s, struct pevent_record *record,

 	if (pevent_is_file_bigendian(event->pevent) ==
 	    pevent_is_host_bigendian(event->pevent)) {

-		trace_seq_printf(s, "%u/%u q%u%s %s%s %spge %snxe",
+		trace_seq_printf(s, "%u q%u%s %s%s %spae %snxe %swp%s%s%s",
 				 role.level,
-				 role.glevels,
 				 role.quadrant,
 				 role.direct ? " direct" : "",
 				 access_str[role.access],
 				 role.invalid ? " invalid" : "",
-				 role.cr4_pge ? "" : "!",
-				 role.nxe ? "" : "!");
+				 role.cr4_pae ? "" : "!",
+				 role.nxe ? "" : "!",
+				 role.cr0_wp ? "" : "!",
+				 role.smep_and_not_wp ? " smep" : "",
+				 role.smap_and_not_wp ? " smap" : "",
+				 role.smm ? " smm" : "");
 	} else
 		trace_seq_printf(s, "WORD: %08x", role.word);
--
2.1.0
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Difference between vcpu_load and kvm_sched_in ?
On 10/20/15 11:44 PM, Paolo Bonzini wrote:
> On 20/10/2015 11:57, Yacine wrote:
>> vcpu_load; start cr3 trapping; vcpu_put
>>
>> it worked correctly (in my logs I see that vcpu.cpu becomes equal to
>> "cpu = raw_smp_processor_id();") but the VM blocks for a lot of time
>> due to the mutex in vcpu_load (up to several seconds and sometimes
>> minutes!)
>
> Right, that's because while the CPU is running the mutex is taken. If
> the VCPU doesn't exit, the mutex is held.
>
>> I replaced vcpu_load with kvm_sched_in, now everything works perfectly
>> and the VM doesn't block at all (logs here:
>> http://pastebin.com/h5XNNMcb). So, what I want to know is: what is the
>> difference between vcpu_load and kvm_sched_in? Both of these functions
>> call kvm_arch_vcpu_load, but the latter does it without taking a mutex.
>
> kvm_sched_out and kvm_sched_in are part of KVM's preemption hooks. The
> hooks are registered only between vcpu_load and vcpu_put, therefore
> they know that the mutex is taken. The sequence will go like this:
>
>   vcpu_load
>   kvm_sched_out
>   kvm_sched_in
>   kvm_sched_out
>   kvm_sched_in
>   ...
>   vcpu_put

I think this should be:

  vcpu_load
  kvm_sched_in
  kvm_sched_out
  kvm_sched_in
  kvm_sched_out
  ...
  vcpu_put

Regards,
Wanpeng Li
Re: [PATCH] tools lib traceevent: update KVM plugin
On Tue, 20 Oct 2015 12:32:44 -0200
Arnaldo Carvalho de Melo wrote:

> > Ping? Arnaldo, ok to include this patch in my tree?
>
> Applying, I saw it before, but lost track, perhaps was waiting for
> Steven Rostedt to chime in, but now I noticed he wasn't CCed, he is now.

What happens if you run new perf on an older kernel? Is this new plugin
going to be screwed up? Plugins should be backward compatible.

Is the plugin even still needed? I'm looking at some of the kvm events
and they seem to be mostly self sufficient. What ones need a plugin
today?

-- Steve
Re: [PATCH v2 2/2] KVM/arm: enable enhanced armv7 fp/simd lazy switch
On 10/20/2015 12:24 AM, Christoffer Dall wrote:
> On Mon, Oct 19, 2015 at 04:25:04PM -0700, Mario Smarduch wrote:
>>
>> On 10/19/2015 3:14 AM, Christoffer Dall wrote:
>>> On Sat, Sep 26, 2015 at 04:43:29PM -0700, Mario Smarduch wrote:
>>>>
>>>> This patch enhances current lazy vfp/simd hardware switch. In
>>>> addition to current lazy switch, it tracks vfp/simd hardware state
>>>> with a vcpu lazy flag. vcpu lazy flag is set on guest access and
>>>> trap to vfp/simd hardware switch handler. On vm-enter if lazy flag
>>>> is set skip trap enable and saving host fpexc. On vm-exit if flag
>>>> is set skip hardware context switch and return to host with guest
>>>> context. On vcpu_put check if vcpu lazy flag is set, and execute a
>>>> hardware context switch to restore host.
>>>>
>>>> Signed-off-by: Mario Smarduch
>>>> ---
>>>>  arch/arm/include/asm/kvm_asm.h |  1 +
>>>>  arch/arm/kvm/arm.c             | 17
>>>>  arch/arm/kvm/interrupts.S      | 60 +++---
>>>>  arch/arm/kvm/interrupts_head.S | 12 ++---
>>>>  4 files changed, 71 insertions(+), 19 deletions(-)
>>>>
>>>> diff --git a/arch/arm/include/asm/kvm_asm.h b/arch/arm/include/asm/kvm_asm.h
>>>> index 194c91b..4b45d79 100644
>>>> --- a/arch/arm/include/asm/kvm_asm.h
>>>> +++ b/arch/arm/include/asm/kvm_asm.h
>>>> @@ -97,6 +97,7 @@ extern char __kvm_hyp_code_end[];
>>>>  extern void __kvm_flush_vm_context(void);
>>>>  extern void __kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa);
>>>>  extern void __kvm_tlb_flush_vmid(struct kvm *kvm);
>>>> +extern void __kvm_restore_host_vfp_state(struct kvm_vcpu *vcpu);
>>>>  extern int __kvm_vcpu_run(struct kvm_vcpu *vcpu);
>>>>  #endif
>>>>
>>>> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
>>>> index ce404a5..79f49c7 100644
>>>> --- a/arch/arm/kvm/arm.c
>>>> +++ b/arch/arm/kvm/arm.c
>>>> @@ -105,6 +105,20 @@ void kvm_arch_check_processor_compat(void *rtn)
>>>>  	*(int *)rtn = 0;
>>>>  }
>>>>
>>>> +/**
>>>> + * kvm_switch_fp_regs() - switch guest/host VFP/SIMD registers
>>>> + * @vcpu: pointer to vcpu structure.
>>>> + *
>>>
>>> nit: stray blank line
>>
>> ok
>>
>>>> + */
>>>> +static void kvm_switch_fp_regs(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +#ifdef CONFIG_ARM
>>>> +	if (vcpu->arch.vfp_lazy == 1) {
>>>> +		kvm_call_hyp(__kvm_restore_host_vfp_state, vcpu);
>>>
>>> why do you have to do this in HYP mode ?
>>
>> Calling it directly works fine. I moved the function outside hyp
>> start/end range in interrupts.S. Not thinking outside the box, just
>> thought let them all be hyp calls.
>>
>>>> +		vcpu->arch.vfp_lazy = 0;
>>>> +	}
>>>> +#endif
>>>
>>> we've tried to put stuff like this in header files to avoid the
>>> ifdefs so far. Could that be done here as well?
>>
>> That was a to do, but didn't get around to it.
>>
>>>> +}
>>>>
>>>>  /**
>>>>   * kvm_arch_init_vm - initializes a VM data structure
>>>> @@ -295,6 +309,9 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>>>
>>>>  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>>>>  {
>>>> +	/* Check if Guest accessed VFP registers */
>>>
>>> misleading comment: this function does more than checking
>>
>> Yep sure does, will change.
>>
>>>> +	kvm_switch_fp_regs(vcpu);
>>>> +
>>>>  	/*
>>>>  	 * The arch-generic KVM code expects the cpu field of a vcpu to be -1
>>>>  	 * if the vcpu is no longer assigned to a cpu. This is used for the
>>>>
>>>> diff --git a/arch/arm/kvm/interrupts.S b/arch/arm/kvm/interrupts.S
>>>> index 900ef6d..6d98232 100644
>>>> --- a/arch/arm/kvm/interrupts.S
>>>> +++ b/arch/arm/kvm/interrupts.S
>>>> @@ -96,6 +96,29 @@ ENTRY(__kvm_flush_vm_context)
>>>>  	bx	lr
>>>>  ENDPROC(__kvm_flush_vm_context)
>>>>
>>>> +/**
>>>> + * void __kvm_restore_host_vfp_state(struct vcpu *vcpu) - Executes a lazy
>>>> + *	fp/simd switch, saves the guest, restores host.
>>>> + *
>>>
>>> nit: stray blank line
>>
>> ok.
>>
>>>> + */
>>>> +ENTRY(__kvm_restore_host_vfp_state)
>>>> +#ifdef CONFIG_VFPv3
>>>> +	push	{r3-r7}
>>>> +
>>>> +	add	r7, r0, #VCPU_VFP_GUEST
>>>> +	store_vfp_state r7
>>>> +
>>>> +	add	r7, r0, #VCPU_VFP_HOST
>>>> +	ldr	r7, [r7]
>>>> +	restore_vfp_state r7
>>>> +
>>>> +	ldr	r3, [vcpu, #VCPU_VFP_FPEXC]
>>>
>>> either use r0 or vcpu throughout this function please
>>
>> Yeah that's bad - in the same function too
>>
>>>> +	VFPFMXR FPEXC, r3
>>>> +
>>>> +	pop	{r3-r7}
>>>> +#endif
>>>> +	bx	lr
>>>> +ENDPROC(__kvm_restore_host_vfp_state)
>>>>
>>>> / * Hypervisor world-switch code
>>>> @@ -119,11 +142,15 @@ ENTRY(__kvm_vcpu_run)
>>>>  	@ If the host kernel has not been configured with VFPv3 support,
>>>>  	@ then it is safer if we deny guests from using it as well.
>>>>  #ifdef CONFIG_VFPv3
>>>> +	@ r7 must be preserved until next vfp lazy check
>>>
>>> I don't understand this
Re: Steal time accounting in KVM. Benchmark.
Yes, VM1 results are as before.

Alexey

On Tue, Oct 20, 2015 at 4:04 PM, Wanpeng Li wrote:
> On 10/21/15 4:05 AM, Alexey Makhalov wrote:
>>
>> 'echo NO_NONTASK_CAPACITY > /sys/kernel/debug/sched_features' in both
>> guests.
>> Results:
>> VM1: STA is disabled -- no changes, still a little bit below the
>> expected 90%
>> VM2: STA is enabled -- the result has changed, but is still bad. Hard
>> to say better or worse. It prefers to get stuck at quarters
>> (100% 75% 50% 25%)
>> Output is attached.
>
> Is the output in the attachment for VM2 only?
>
> Regards,
> Wanpeng Li
Re: Steal time accounting in KVM. Benchmark.
On 10/21/15 4:05 AM, Alexey Makhalov wrote:
> 'echo NO_NONTASK_CAPACITY > /sys/kernel/debug/sched_features' in both
> guests.
> Results:
> VM1: STA is disabled -- no changes, still a little bit below the
> expected 90%
> VM2: STA is enabled -- the result has changed, but is still bad. Hard
> to say better or worse. It prefers to get stuck at quarters
> (100% 75% 50% 25%)
> Output is attached.

Is the output in the attachment for VM2 only?

Regards,
Wanpeng Li
Re: Network hangs when communicating with host
I now have another issue. My binary fails to mmap a file within the lkvm
sandbox. The same binary works fine on the host and in qemu. I've added
strace into the sandbox script, and here is the output:

[pid 837] openat(AT_FDCWD, "syzkaller-shm048878722", O_RDWR|O_CLOEXEC) = 5
[pid 837] mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_SHARED, 5, 0) = -1 EINVAL (Invalid argument)

I don't see anything that can potentially cause EINVAL here. Is it
possible that lkvm somehow affects kernel behavior here? I run lkvm as:

$ taskset 1 /kvmtool/lkvm sandbox --disk syz-0 --mem=2048 --cpus=2 --kernel /arch/x86/boot/bzImage --network mode=user --sandbox /workdir/kvm/syz-0.sh

On Mon, Oct 19, 2015 at 4:20 PM, Sasha Levin wrote:
> On 10/19/2015 05:28 AM, Dmitry Vyukov wrote:
>> On Mon, Oct 19, 2015 at 11:22 AM, Andre Przywara wrote:
>>> Hi Dmitry,
>>>
>>> On 19/10/15 10:05, Dmitry Vyukov wrote:
>>>> On Fri, Oct 16, 2015 at 7:25 PM, Sasha Levin wrote:
>>>>> On 10/15/2015 04:20 PM, Dmitry Vyukov wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I am trying to run a program in lkvm sandbox so that it
>>>>>> communicates with a program on host. I run lkvm as:
>>>>>>
>>>>>> ./lkvm sandbox --disk sandbox-test --mem=2048 --cpus=4 --kernel
>>>>>> /arch/x86/boot/bzImage --network mode=user -- /my_prog
>>>>>>
>>>>>> /my_prog then connects to a program on host over a tcp socket.
>>>>>> I see that the host receives some data, sends some data back, but
>>>>>> then my_prog hangs on network read.
>>>>>>
>>>>>> To localize this I wrote 2 programs (attached). ping is run on the
>>>>>> host and pong is run from the lkvm sandbox. They successfully
>>>>>> establish a tcp connection, but after some iterations both hang on
>>>>>> read.
>>>>>>
>>>>>> Networking code in the Go runtime has been there for more than 3
>>>>>> years, is widely used in production and does not have any known
>>>>>> bugs. However, it uses epoll edge-triggered readiness
>>>>>> notifications that are known to be tricky. Is it possible that
>>>>>> lkvm contains some networking bug? Can it be related to the data
>>>>>> races in lkvm I reported earlier today?
>>>
>>> Just to let you know:
>>> I think we have seen networking issues in the past - root over NFS
>>> had issues IIRC. Will spent some time on debugging this and it looked
>>> like a race condition in kvmtool's virtio implementation. I think
>>> pinning kvmtool's virtio threads to one host core made this go away.
>>> However, although he tried hard (even by Will's standards!) he
>>> couldn't find the real root cause or a fix at the time he looked at
>>> it and we found other ways to work around the issues (using
>>> virtio-blk or initrd's).
>>>
>>> So it's quite possible that there are issues. I haven't had time yet
>>> to look at your sanitizer reports, but it looks like a promising
>>> approach to find the root cause.
>>
>> Thanks, Andre!
>>
>> ping/pong does not hang within at least 5 minutes when I run lkvm
>> under taskset 1.
>>
>> And, yeah, this pretty strongly suggests a data race. ThreadSanitizer
>> can point you to the bug within a minute, so you just need to say
>> "aha! it is here". Or maybe not. There are no guarantees. But if you
>> already spent significant time on this, then checking the reports
>> definitely looks like a good idea.
>
> Okay, that's good to know.
>
> I have a few busy days, but I'll definitely try to clear up these
> reports as they seem to be pointing to real issues.
>
> Thanks,
> Sasha
Re: [PATCH 0/7] userfault21 update
On Mon, Oct 19, 2015 at 5:42 PM, Andrea Arcangeli wrote:
> Hello Patrick,
>
> On Mon, Oct 12, 2015 at 11:04:11AM -0400, Patrick Donnelly wrote:
>> Hello Andrea,
>>
>> On Mon, Jun 15, 2015 at 1:22 PM, Andrea Arcangeli wrote:
>>> This is an incremental update to the userfaultfd code in -mm.
>>
>> Sorry I'm late to this party. I'm curious how a ptrace monitor might
>> use a userfaultfd to handle faults in all of its tracees. Is this
>> possible without having each (newly forked) tracee "cooperate" by
>> creating a userfaultfd and passing that to the tracer?
>
> To make the non cooperative usage work, userfaultfd also needs more
> features to track fork() and mremap() syscalls and such, as the
> monitor needs to be aware of modifications to the address space of
> each "mm" it is managing and of newly forked "mm"s as well. So fork()
> won't need to call userfaultfd once we add those features, but it
> still doesn't need to know about the "pid". The uffd_msg already has
> padding to add the features you need for that.
>
> Pavel invented and developed those features for the non cooperative
> usage to implement postcopy live migration of containers. He posted
> some patchset on the lists too, but it probably needs to be rebased
> on upstream.
>
> The ptrace monitor thread can also fault into the userfault area if it
> wants to (but only if it's not the userfault manager thread as well).
> I didn't expect the ptrace monitor to want to be a userfault manager
> too though.
> [...]

Okay, it's definitely tricky to make this work for a tree of
non-cooperative processes. Brainstorming some ideas:

o If we are using ptrace, then we can add a ptrace event for receiving
  the userfaultfd associated with the tracee, via waitpid (!). The
  ptrace monitor can deduplicate userfaultfds by looking at the inode.
  It can also associate a userfaultfd with a group of threads sharing
  an mm.

  [For my possible use-case with Parrot [1], we already track the
  shared address spaces of tracees in order to implement an mmap hook.]

o The userfaultfd can have a flag for tracking a tree of processes
  (which can be sent via unix sockets to the userfault handler) and use
  an opaque tag (the mm pointer?) to disambiguate the faults, instead
  of a pid. There would need to be some kind of message to notify about
  newly cloned threads and the mm associated with them? Yes, you
  wouldn't be able to know which pid (or kernel/ptrace thread)
  generated a fault, but at least you would know which pids the mm
  belongs to.

I didn't see the patchset Pavel posted in a quick search of the
archives. Only this [2].

[1] http://ccl.cse.nd.edu/software/parrot/
[2] https://lkml.org/lkml/2015/1/15/103

--
Patrick Donnelly