Re: [Qemu-devel] Re: KVM call minutes for June 15
Anthony Liguori anth...@codemonkey.ws writes: On 06/15/2010 10:41 AM, Christoph Hellwig wrote: On Tue, Jun 15, 2010 at 08:18:12AM -0700, Chris Wright wrote: KVM/qemu patches - patch rate is high, documentation is low, review is low - patches need to include better descriptions and documentation - will slow down patch writers - will make it easier for patch reviewers What is the qemu patch review policy anyway? We don't really have a coherent policy. Suggestions for improvement are always appreciated. There are no Reviewed-by: included in the actual commits, Reviewed-by/Ack-by's are pretty helpful for me. In terms of including them in commit messages, if there's a strong feeling that that would be helpful then it's something I can look at doing but it also requires a fair bit of manual work during commit. Can't hurt reviewer motivation. Could it be automated? Find replies, extract tags. If you want your acks to be picked up, you better make sure your References header works, and your tags are formatted correctly. [...] -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] Really lazy fpu
On 06/13/2010 06:03 PM, Avi Kivity wrote: Currently fpu management is only lazy in one direction. When we switch into a task, we may avoid loading the fpu state in the hope that the task will never use it. If we guess right we save an fpu load/save cycle; if not, a Device not Available exception will remind us to load the fpu. However, in the other direction, fpu management is eager. When we switch out of an fpu-using task, we always save its fpu state. This is wasteful if the task(s) that run until we switch back in all don't use the fpu, since we could have kept the task's fpu on the cpu all this time and saved an fpu save/load cycle. This can be quite common with threaded interrupts, but will also happen with normal kernel threads and even normal user tasks. This patch series converts task fpu management to be fully lazy. When switching out of a task, we keep its fpu state on the cpu, only flushing it if some other task needs the fpu. Ingo, Peter, any feedback on this? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] Really lazy fpu
On 06/16/2010 12:24 AM, Avi Kivity wrote: Ingo, Peter, any feedback on this? Conceptually, this makes sense to me. However, I have a concern what happens when a task is scheduled on another CPU, while its FPU state is still in registers in the original CPU. That would seem to require expensive IPIs to spill the state in order for the rescheduling to proceed, and this could really damage performance. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/24] Implement VMXON and VMXOFF
On Tue, Jun 15, 2010, Marcelo Tosatti wrote about Re: [PATCH 3/24] Implement VMXON and VMXOFF: + if (!(vcpu-arch.cr4 X86_CR4_VMXE) || + !(vcpu-arch.cr0 X86_CR0_PE) || kvm_read_cr0_bits, kvm_read_cr4_bits. Thanks. I'll change that. -- Nadav Har'El|Wednesday, Jun 16 2010, 4 Tammuz 5770 n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |In Fortran, God is real unless declared http://nadav.harel.org.il |an integer. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/6] KVM: MMU: introduce gfn_to_page_atomic() and gfn_to_pfn_atomic()
On Tue, Jun 15, 2010 at 02:22:06PM +0300, Avi Kivity wrote: Too much duplication. How about putting the tail end of the function in a common helper (with an inatomic flag)? btw, is_hwpoison_address() is racy. While it looks up the address, some other task can unmap the page tables under us. Where is is_hwpoison_address() coming from? I can't find it anywhere. Anyways hwpoison will not remove the page as long as there is a reference to it, so as long as you get the reference race free against another task you're ok. Of course there might be always an error in between and the hardware may poison the data, but we don't try to handle all kernel errors. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] Really lazy fpu
On 06/16/2010 10:32 AM, H. Peter Anvin wrote: On 06/16/2010 12:24 AM, Avi Kivity wrote: Ingo, Peter, any feedback on this? Conceptually, this makes sense to me. However, I have a concern what happens when a task is scheduled on another CPU, while its FPU state is still in registers in the original CPU. That would seem to require expensive IPIs to spill the state in order for the rescheduling to proceed, and this could really damage performance. Right, this optimization isn't free. I think the tradeoff is favourable since task migrations are much less frequent than context switches within the same cpu, can the scheduler experts comment? We can also mitigate some of the IPIs if we know that we're migrating on the cpu we're migrating from (i.e. we're pushing tasks to another cpu, not pulling them from their cpu). Is that a common case, and if so, where can I hook a call to unlazy_fpu() (or its new equivalent)? Note that kvm on intel has exactly the same issue (the VMPTR and VMCS are on-chip registers that are expensive to load and save, so we keep them loaded even while not scheduled, and IPI if we notice we've migrated; note that architecturally the cpu can cache multiple VMCSs simultaneously (though I doubt they cache multiple VMCSs microarchitecturally at this point)). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 03/17] Unify vendor TSC logic
Zachary Amsden wrote: Move the TSC control logic from the vendor backends into x86.c by adding adjust_tsc_offset to x86 ops. Now all TSC decisions can be done in one place. Also, rename some variable in the VCPU structure to more accurately reflect their actual content. VMX backend would record last observed TSC very late (possibly in an IPI to clear the VMCS from a remote CPU), now it is done earlier. Note this is still not yet perfect. We adjust TSC in both directions when the TSC is unstable, resulting in desynchronized TSCs for SMP guests. A better choice would be to compensate based on a master reference clock. Signed-off-by: Zachary Amsden zams...@redhat.com --- arch/x86/include/asm/kvm_host.h |5 +++-- arch/x86/kvm/svm.c | 25 +++-- arch/x86/kvm/vmx.c | 20 arch/x86/kvm/x86.c | 16 +++- 4 files changed, 33 insertions(+), 33 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 91631b8..94f6ce8 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -253,7 +253,6 @@ struct kvm_mmu { }; struct kvm_vcpu_arch { - u64 host_tsc; /* * rip and regs accesses must go through * kvm_{register,rip}_{read,write} functions. @@ -334,9 +333,10 @@ struct kvm_vcpu_arch { gpa_t time; struct pvclock_vcpu_time_info hv_clock; - unsigned int hv_clock_tsc_khz; + unsigned int hw_tsc_khz; unsigned int time_offset; struct page *time_page; + u64 last_host_tsc; bool nmi_pending; bool nmi_injected; @@ -530,6 +530,7 @@ struct kvm_x86_ops { u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio); int (*get_lpage_level)(void); bool (*rdtscp_supported)(void); + void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment); void (*set_supported_cpuid)(u32 func, struct kvm_cpuid_entry2 *entry); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 09b2145..ee2cf30 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -948,18 +948,6 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu) int i; if (unlikely(cpu != vcpu-cpu)) { - u64 delta; - - if (check_tsc_unstable()) { - /* - * Make sure that the guest sees a monotonically - * increasing TSC. - */ - delta = vcpu-arch.host_tsc - native_read_tsc(); - svm-vmcb-control.tsc_offset += delta; - if (is_nested(svm)) - svm-nested.hsave-control.tsc_offset += delta; - } svm-asid_generation = 0; } @@ -975,8 +963,6 @@ static void svm_vcpu_put(struct kvm_vcpu *vcpu) ++vcpu-stat.host_state_reload; for (i = 0; i NR_HOST_SAVE_USER_MSRS; i++) wrmsrl(host_save_user_msrs[i], svm-host_user_msrs[i]); - - vcpu-arch.host_tsc = native_read_tsc(); } static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu) @@ -3422,6 +3408,15 @@ static bool svm_rdtscp_supported(void) return false; } +static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + svm-vmcb-control.tsc_offset += adjustment; + if (is_nested(svm)) + svm-nested.hsave-control.tsc_offset += adjustment; +} + static void svm_fpu_deactivate(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); @@ -3506,6 +3501,8 @@ static struct kvm_x86_ops svm_x86_ops = { .rdtscp_supported = svm_rdtscp_supported, .set_supported_cpuid = svm_set_supported_cpuid, + + .adjust_tsc_offset = svm_adjust_tsc_offset, }; static int __init svm_init(void) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index a657e08..a993e67 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -498,7 +498,6 @@ static void __vcpu_clear(void *arg) vmcs_clear(vmx-vmcs); if (per_cpu(current_vmcs, cpu) == vmx-vmcs) per_cpu(current_vmcs, cpu) = NULL; - rdtscll(vmx-vcpu.arch.host_tsc); list_del(vmx-local_vcpus_link); vmx-vcpu.cpu = -1; vmx-launched = 0; @@ -881,7 +880,6 @@ static void vmx_load_host_state(struct vcpu_vmx *vmx) static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); - u64 tsc_this, delta, new_offset; u64 phys_addr = __pa(per_cpu(vmxarea, cpu)); if (!vmm_exclusive) @@ -914,16 +912,6 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp); vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */ - - /* -
Re: [PATCH 04/17] Fix deep C-state TSC desynchronization
Zachary Amsden wrote: When CPUs with unstable TSCs enter deep C-state, TSC may stop running. This causes us to require resynchronization. Since we can't tell when this may potentially happen, we assume the worst by forcing re-compensation for it at every point the VCPU task is descheduled. Is there any method to do the compensation on host TSC when recovering from C-state? It should be simpler than do it on guest TSC. Signed-off-by: Zachary Amsden zams...@redhat.com --- arch/x86/kvm/x86.c | 11 +++ 1 files changed, 11 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index c8289d0..618c435 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1822,7 +1822,18 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) { kvm_x86_ops-vcpu_put(vcpu); kvm_put_guest_fpu(vcpu); + vcpu-arch.last_host_tsc = native_read_tsc(); + + /* + * When potentially leaving a CPU with unstable TSCs, we risk + * that the CPU enters deep C-state. If it does, the TSC may + * go out of sync but we will not recalibrate because the test + * vcpu-cpu != cpu can not detect this condition. So set + * vcpu-cpu = -1 to force the recalibration above. + */ + if (check_tsc_unstable()) + vcpu-cpu = -1; } static int is_efer_nx(void) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/17] Keep SMP VMs more in sync on unstable TSC
Zachary Amsden wrote: SMP VMs on machines with unstable TSC have their TSC offset adjusted by the local offset delta from last measurement. This does not take into account how long it has been since the measurement, leading to drift. Minimize the drift by accounting for any time difference the kernel has observed. Signed-off-by: Zachary Amsden zams...@redhat.com --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/kvm/x86.c | 20 +++- 2 files changed, 20 insertions(+), 1 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 94f6ce8..1afecd7 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -337,6 +337,7 @@ struct kvm_vcpu_arch { unsigned int time_offset; struct page *time_page; u64 last_host_tsc; + u64 last_host_ns; bool nmi_pending; bool nmi_injected; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 618c435..b1bdf05 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1810,6 +1810,19 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) /* Make sure TSC doesn't go backwards */ s64 tsc_delta = !vcpu-arch.last_host_tsc ? 0 : native_read_tsc() - vcpu-arch.last_host_tsc; + + /* Subtract elapsed cycle time from the delta computation */ + if (check_tsc_unstable() vcpu-arch.last_host_ns) { + s64 delta; + struct timespec ts; + ktime_get_ts(ts); + monotonic_to_bootbased(ts); + delta = timespec_to_ns(ts) - vcpu-arch.last_host_ns; + delta = delta * per_cpu(cpu_tsc_khz, cpu); This seems not work well on a cpu w/o CONSTANT_TSC. + delta = delta / USEC_PER_SEC; + tsc_delta -= delta; + } + if (tsc_delta 0 || check_tsc_unstable()) kvm_x86_ops-adjust_tsc_offset(vcpu, -tsc_delta); kvm_migrate_timers(vcpu); @@ -1832,8 +1845,13 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) * vcpu-cpu != cpu can not detect this condition. So set * vcpu-cpu = -1 to force the recalibration above. */ - if (check_tsc_unstable()) + if (check_tsc_unstable()) { + struct timespec ts; + ktime_get_ts(ts); + monotonic_to_bootbased(ts); + vcpu-arch.last_host_ns = timespec_to_ns(ts); vcpu-cpu = -1; + } } static int is_efer_nx(void) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 11/17] Fix a possible backwards warp of kvmclock
Zachary Amsden wrote: Kernel time, which advances in discrete steps may progress much slower than TSC. As a result, when kvmclock is adjusted to a new base, the apparent time to the guest, which runs at a much higher, nsec scaled rate based on the current TSC, may have already been observed to have a larger value (kernel_ns + scaled tsc) than the value to which we are setting it (kernel_ns + 0). This is one issue of kvmclock which tries to supply a clocksource whose precision may even higher than host. We must instead compute the clock as potentially observed by the guest for kernel_ns to make sure it does not go backwards. Signed-off-by: Zachary Amsden zams...@redhat.com --- arch/x86/include/asm/kvm_host.h |4 ++ arch/x86/kvm/x86.c | 79 +-- 2 files changed, 71 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 1afecd7..7ec2472 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -338,6 +338,8 @@ struct kvm_vcpu_arch { struct page *time_page; u64 last_host_tsc; u64 last_host_ns; + u64 last_guest_tsc; + u64 last_kernel_ns; bool nmi_pending; bool nmi_injected; @@ -455,6 +457,8 @@ struct kvm_vcpu_stat { u32 hypercalls; u32 irq_injections; u32 nmi_injections; + u32 tsc_overshoot; + u32 tsc_ahead; }; struct kvm_x86_ops { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 52d7d34..703ea43 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -138,6 +138,8 @@ struct kvm_stats_debugfs_item debugfs_entries[] = { { insn_emulation_fail, VCPU_STAT(insn_emulation_fail) }, { irq_injections, VCPU_STAT(irq_injections) }, { nmi_injections, VCPU_STAT(nmi_injections) }, + { tsc_overshoot, VCPU_STAT(tsc_overshoot) }, + { tsc_ahead, VCPU_STAT(tsc_ahead) }, { mmu_shadow_zapped, VM_STAT(mmu_shadow_zapped) }, { mmu_pte_write, VM_STAT(mmu_pte_write) }, { mmu_pte_updated, VM_STAT(mmu_pte_updated) }, @@ -927,33 +929,84 @@ static int kvm_recompute_guest_time(struct kvm_vcpu *v) struct kvm_vcpu_arch *vcpu = v-arch; void *shared_kaddr; unsigned long this_tsc_khz; + s64 kernel_ns, max_kernel_ns; + u64 tsc_timestamp; if ((!vcpu-time_page)) return 0; - this_tsc_khz = get_cpu_var(cpu_tsc_khz); - put_cpu_var(cpu_tsc_khz); + /* + * The protection we require is simple: we must not be preempted from + * the CPU between our read of the TSC khz and our read of the TSC. + * Interrupt protection is not strictly required, but it does result in + * greater accuracy for the TSC / kernel_ns measurement. + */ + local_irq_save(flags); + this_tsc_khz = __get_cpu_var(cpu_tsc_khz); + kvm_get_msr(v, MSR_IA32_TSC, tsc_timestamp); + ktime_get_ts(ts); + monotonic_to_bootbased(ts); + kernel_ns = timespec_to_ns(ts); + local_irq_restore(flags); + if (unlikely(this_tsc_khz == 0)) { kvm_request_guest_time_update(v); return 1; } + /* + * Time as measured by the TSC may go backwards when resetting the base + * tsc_timestamp. The reason for this is that the TSC resolution is + * higher than the resolution of the other clock scales. Thus, many + * possible measurments of the TSC correspond to one measurement of any + * other clock, and so a spread of values is possible. This is not a + * problem for the computation of the nanosecond clock; with TSC rates + * around 1GHZ, there can only be a few cycles which correspond to one + * nanosecond value, and any path through this code will inevitably + * take longer than that. However, with the kernel_ns value itself, + * the precision may be much lower, down to HZ granularity. If the + * first sampling of TSC against kernel_ns ends in the low part of the + * range, and the second in the high end of the range, we can get: + * + * (TSC - offset_low) * S + kns_old (TSC - offset_high) * S + kns_new + * + * As the sampling errors potentially range in the thousands of cycles, + * it is possible such a time value has already been observed by the + * guest. To protect against this, we must compute the system time as + * observed by the guest and ensure the new system time is greater. + */ + max_kernel_ns = 0; + if (vcpu-hv_clock.tsc_timestamp) { + max_kernel_ns = vcpu-last_guest_tsc - Since you do the comparison with kernel_ns, so what you need here is tsc_timestamp which looks more like the 'last' tsc seen by guest. + vcpu-hv_clock.tsc_timestamp; + max_kernel_ns = pvclock_scale_delta(max_kernel_ns, +
Re: [PATCH 3/6] KVM: MMU: introduce gfn_to_page_atomic() and gfn_to_pfn_atomic()
On 06/16/2010 10:59 AM, Andi Kleen wrote: On Tue, Jun 15, 2010 at 02:22:06PM +0300, Avi Kivity wrote: Too much duplication. How about putting the tail end of the function in a common helper (with an inatomic flag)? btw, is_hwpoison_address() is racy. While it looks up the address, some other task can unmap the page tables under us. Where is is_hwpoison_address() coming from? I can't find it anywhere. kvm.git master mm/memory-failure.c (19564281fe). Anyways hwpoison will not remove the page as long as there is a reference to it, so as long as you get the reference race free against another task you're ok. Of course there might be always an error in between and the hardware may poison the data, but we don't try to handle all kernel errors. The page is fine, the page tables are not. Another task can munmap() the thing while is_hwpoison_address() is running. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 12/17] Add helper function get_kernel_ns
Zachary Amsden wrote: On 06/14/2010 10:41 PM, Avi Kivity wrote: On 06/15/2010 10:34 AM, Zachary Amsden wrote: Add a helper function for the multiple places this is used. Note that it must not be called in preemptible context, as that would mean the kernel could enter software suspend state, which would cause non-atomic operation of the monotonic_to_bootbased computation. Open question: should the KVM_SET_CLOCK / KVM_GET_CLOCK ioctls use this as well? Currently, they are not bootbased (but perhaps should be). Signed-off-by: Zachary Amsdenzams...@redhat.com --- arch/x86/kvm/x86.c | 26 +- 1 files changed, 13 insertions(+), 13 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 703ea43..15c7317 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -915,6 +915,16 @@ static void kvm_get_time_scale(uint32_t scaled_khz, uint32_t base_khz, __func__, base_khz, scaled_khz, shift, *pmultiplier); } +static inline u64 get_kernel_ns(void) +{ +struct timespec ts; + +WARN_ON(preemptible()); +ktime_get_ts(ts); +monotonic_to_bootbased(ts); +return timespec_to_ns(ts); +} + static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz); Isn't something like this a candidate for the time infrastructure? Should it be? It certainly seems reasonable. Also, should we be using it for the cases KVM_GET_CLOCK / KVM_SET_CLOCK? It seems global kvmclock_offset for the VM ignores the bootbased conversion. Yes we should. Zach -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC] KVM: busy-spin detector
On 06/11/2010 05:25 AM, Marcelo Tosatti wrote: The following patch implements a simple busy-spin detector. It considers a vcpu as busy-spinning if there are two consecutive exits due to external interrupt on the same RIP, and sleeps for 100us in that case. It is very likely that if the vcpu is making progress it will either exit for other reasons or change RIP. The percentage numbers below represent improvement in kernel build time in comparison with mainline (RHEL 5.4 guest). make -j16, 8 cpu host: smp 16: 3% smp 18: 10% smp 32: 14% smp 4, make -j4, pinned to 2 cpus: 4% smp 8, make -j8, pinned to 2 cpus: 5% The big problem here is not in the patch itself, but in kvm_vcpu_on_spin(). An un-overcommitted guest will trigger this on long spinlocks and go to sleep, and if the lock is released while we're still sleeping, we lose performance. @@ -4654,6 +4665,8 @@ static int vcpu_enter_guest(struct kvm_v kvm_lapic_sync_from_vapic(vcpu); r = kvm_x86_ops-handle_exit(vcpu); + if (r == 1) + vcpu-arch.last_rip = ~(0UL); out: return r; } != 2, no? It's a little ugly to overload the return code like that. Perhaps: +vcpu-arch.save_last_rip = vcpu-arch.last_rip +vcpu-arch.last_rip = ~0UL; r = kvm_x86_ops-handle_exit(vcpu); And kvm_detect_spin() can restore last_rip. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()
Alex Williamson alex.william...@redhat.com writes: On Tue, 2010-06-15 at 12:28 +0100, Paul Brook wrote: Alex proposed to disambiguate by adding identified properties of the immediate parent bus and device to the path component. For PCI, these are dev.fn. Likewise for any other bus where devices have unambigous bus address. The driver name carries no information! From user POV, driver names are very handly to address a device intuitively - except for the case you have tones of devices on the same bus that are handled by the same driver. For that case we need to augment the device name with a useful per-bus ID, derived from the bus address where available, otherwise based on instance numbers. This is where I think you're missing a trick. We don't need to augment the name, we just need to allow the bus id to be used instead. For the case of a hot remove, I agree. If the user specifies pci_del pci.0/03.0, that's completely sufficient because we don't care what's in that slot, just remove it. However, I still see some use cases for device names in the path. Take for example: (A): /i440FX-pcihost/pci.0/e1000.05.0 vs (B): /pci.0/05.0 (removing both the root bridge driver name and the device driver name) / is the main system bus. System bus defines no bus address at the moment. Therefore, you have to use the driver name i440FX-pcihost. /i440FX-pcihost/pci.0 is the PCI bus. PCI bus defines a bus address: dev.fn. Therefore, you can either use the bus address @05.0, or the driver name e1000. We have /i440FX-pcihost/pci.0/e1000 vs. /i440FX-pcihost/pci.0/@05.0. If we attach a pci option rom to the device, create some ram to store the option rom and name the ram block $(PATH)/rom, with (A) we know more about the hierarchy to get to the actual devfn device, and we know the type of device that's in the slot. With (B), there's no robustness. If we migrated using (B), we could be stuffing a pc e1000 option rom into a ppc lsi895, just because it happened to live that the same place and have a ram block named rom. Including driver names increases the uniqueness of the path. Another example; if we have two drivers that create a vmstate with name foo, plug one driver into slot 03.0 on the migration source, the other into slot 03.0 on the migration destination, what happens? It seems likely that the destination will try to load the vmstate from a different driver and fail in wonderful and bizarre ways. If we use (A), each path automatically has it's own namespace. ISA is also a good example even though it doesn't do hotplug. Given this set: /i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/mc146818rtc /i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/isa-serial.0x3f8 /i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/isa-parallel.0x378 /i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/i8042 /i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/isa-fdc /i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/ne2k_isa.0x300 versus this set: /pci.0.01.0/isa.0 /pci.0.01.0/isa.0/0x3f8 /pci.0.01.0/isa.0/0x378 /pci.0.01.0/isa.0 /pci.0.01.0/isa.0 /pci.0.01.0/isa.0/0x300 Which one has devices that are easier to uniquely identify? For other buses, we need to make something up. Note that addressing by bus address rather than name is generally useful, not just in the context of savevm. For instance, I'd appreciate being able to say something like device_del pci.0/04.0. And I prefer device_del [.../]pci.0/e1000. Otherwise you need to dump the bus first before you can identify which device you want to remove. We can allow both. A bus address is sufficient to uniquely identify a device. A bus address (assuming it exists) is sufficient to uniquely identify a device, on a given VM. A bus address only identifies a location when comparing two separate VMs. Identifying a device on a given VM is all a qdev path does. If you want to check two VMs have the same device in the same slot, then you need to compare *devices*, not their names. You propose to encode *one* property of the device in the name, namely its driver. This is far from sufficient. If you tell me you need it anyway for migration, I'll have to take that at face value (I'm not an expert there). But please do not call it qdev path, because it ain't. I see no reason to require the driver name, or to include it in the canonical device address. Migration. Including the driver name extends our ability to uniquely identify a device across separate VMs. It's then up to the vmstate code to figure out whether the device are compatible for migrate state. Migration needs to recreate the same qdev tree on the destination. Driver name is just *one* property of a device. Migration needs to transfer *all* properties. Why encode driver name in the path, but not the rest? Why can't you put the driver name wherever you put the rest? An easy way to get that is to reserve part of the name space for bus addresses. If
Re: [Qemu-devel] Re: [RFC PATCH 0/5] Introduce canonical device hierarchy string
Alex Williamson alex.william...@redhat.com writes: On Tue, 2010-06-15 at 10:53 +0200, Markus Armbruster wrote: Alex Williamson alex.william...@redhat.com writes: On Mon, 2010-06-14 at 09:02 +0200, Gerd Hoffmann wrote: Hi, My premise with this attempt is that we walk the hierarchy and use the names to create the base of the path. As we get to the device, particularly to the parent bus of the device, we need to start looking at properties to ensure uniqueness. You'll need that for every bus along the way down to the device. Create a virtual machine with two lsi scsi host adapters, then attach a disk with scsi id 0 to each. Just the scsi id isn't good enougth to identify the device. You'll need the lsi pci address too. Yep, see below. For now, the only properties I've tagged as path properties are PCI bus addresses and MAC addresses. mac address isn't needed here. You need the property which specifies the bus address. For PCI this obviously is the PCI address. For scsi the scsi id. For ISA you can use the I/O port base. virtio-serial the port number, ... PCI: addr SCSI: scsi-id ISA: serial/parallel = iobase, others?? If there's no iobase (pathological case), require ID. ide-drive: unit Bus name is IDE, but it's clear enough what you mean :) I put ide-drive here because the unit is a property of the device, not the bus. I consider that a (very minor) bug. I2C: address virtio-serial doesn't seem to make a DeviceState per port, so I think it can be skipped. Really? Anyway, its port number should do as bus address. Maybe I'm not specifying it correctly. I see a max_nr_ports property, but I don't see that each port is a separate qdev. I see property nr in virtconsole_info and virtserialport_info. I can't see any other virtio-serial devices. I'm sure I'm still missing some... s390-virtio SSI System I'll need some help coming up with useful properties to key on for these. I had hoped there's only one System bus. USB usb-storage seems to have a useful drive property that lets me distinguish these devices: /i440FX-pcihost/pci.0/piix3-usb-uhci.01.2/usb.0/usb-storage.usb0/scsi.0/scsi-disk.0 /i440FX-pcihost/pci.0/piix3-usb-uhci.01.2/usb.0/usb-storage.usb1/scsi.0/scsi-disk.0 drive But otherwise USB is disappointingly devoid of useful properties at the bus level. Paul suggested physical ports. Doesn't look like we have them, but that should be fixable. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 0/5] Introduce canonical device hierarchy string
Markus Armbruster arm...@redhat.com writes: Alex Williamson alex.william...@redhat.com writes: On Tue, 2010-06-15 at 10:53 +0200, Markus Armbruster wrote: Alex Williamson alex.william...@redhat.com writes: [...] virtio-serial doesn't seem to make a DeviceState per port, so I think it can be skipped. Really? Anyway, its port number should do as bus address. Maybe I'm not specifying it correctly. I see a max_nr_ports property, but I don't see that each port is a separate qdev. I see property nr in virtconsole_info and virtserialport_info. I can't see any other virtio-serial devices. Same issue as with ide-drive's unit: should be a bus property. [...] -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 0/9] rework KVM mmu_shrink() code
On 06/15/2010 04:55 PM, Dave Hansen wrote: This is a big RFC for the moment. These need a bunch more runtime testing. -- We've seen contention in the mmu_shrink() function. First of all, that's surprising. I tried to configure the shrinker so it would stay away from kvm unless memory was really tight. The reason is that kvm mmu pages can cost as much as 1-2 ms of cpu time to build, perhaps even more, so we shouldn't drop them lightly. It's certainly a neglected area that needs attention, though. This patch set reworks it to hopefully be more scalable to large numbers of CPUs, as well as large numbers of running VMs. The patches are ordered with increasing invasiveness. These seem to boot and run fine. I'm running about 40 VMs at once, while doing echo 3 /proc/sys/vm/drop_caches, and killing/restarting VMs constantly. Will drop_caches actually shrink the kvm caches too? If so we probably need to add that to autotest since it's a really good stress test for the mmu. Seems to be relatively stable, and seems to keep the numbers of kvm_mmu_page_header objects down. That's no necessarily a good thing, those things are expensive to recreate. Of course, when we do need to reclaim them, that should be efficient. We also do a very bad job of selecting which page to reclaim. We need to start using the accessed bit on sptes that point to shadow page tables, and then look those up and reclaim unreferenced pages sooner. With shadow paging there can be tons of unsync pages that are basically unused and can be reclaimed at no cost to future runtime. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] Really lazy fpu
(Cc:-ed various performance/optimization folks) * Avi Kivity a...@redhat.com wrote: On 06/16/2010 10:32 AM, H. Peter Anvin wrote: On 06/16/2010 12:24 AM, Avi Kivity wrote: Ingo, Peter, any feedback on this? Conceptually, this makes sense to me. However, I have a concern what happens when a task is scheduled on another CPU, while its FPU state is still in registers in the original CPU. That would seem to require expensive IPIs to spill the state in order for the rescheduling to proceed, and this could really damage performance. Right, this optimization isn't free. I think the tradeoff is favourable since task migrations are much less frequent than context switches within the same cpu, can the scheduler experts comment? This cannot be stated categorically without precise measurements of known-good, known-bad, average FPU usage and average CPU usage scenarios. All these workloads have different characteristics. I can imagine bad effects across all sorts of workloads: tcpbench, AIM7, various lmbench components, X benchmarks, tiobench - you name it. Combined with the fact that most micro-benchmarks wont be using the FPU, while in the long run most processes will be using the FPU due to SIMM instructions. So even a positive result might be skewed in practice. Has to be measured carefully IMO - and i havent seen a _single_ performance measurement in the submission mail. This is really essential. So this does not look like a patch-set we could apply without gathering a _ton_ of hard data about advantages and disadvantages. We can also mitigate some of the IPIs if we know that we're migrating on the cpu we're migrating from (i.e. we're pushing tasks to another cpu, not pulling them from their cpu). Is that a common case, and if so, where can I hook a call to unlazy_fpu() (or its new equivalent)? When the system goes from idle to less idle then most of the 'fast' migrations happen on a 'push' model - on a busy CPU we wake up a new task and push it out to a known-idle CPU. At that point we can indeed unlazy the FPU with probably little cost. But on busy servers where most wakeups are IRQ based the chance of being on the right CPU is 1/nr_cpus - i.e. decreasing with every new generation of CPUs. If there's some sucky corner case in theory we could approach it statistically and measure the ratio of fast vs. slow migration vs. local context switches - but that looks a bit complex. Dunno. Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 1/9] abstract kvm x86 mmu-n_free_mmu_pages
On 06/15/2010 04:55 PM, Dave Hansen wrote: First of all, I think free is a poor name for this value. In this context, it means, the number of mmu pages which this kvm instance should be able to allocate. To me, free implies much more that the objects are there and ready for use. I think available is a much better description, especially when you see how it is calculated. Agreed. Note that if reclaim is improved, we can drop it completely and let the kernel trim the cache when it grows too large. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 4/9] create aggregate kvm_total_used_mmu_pages value
On 06/15/2010 04:55 PM, Dave Hansen wrote: Note: this is the real meat of the patch set. It can be applied up to this point, and everything will probably be improved, at least a bit. Of slab shrinkers, the VM code says: * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is * querying the cache size, so a fastpath for that case is appropriate. and it *means* it. Look at how it calls the shrinkers: nr_before = (*shrinker-shrink)(0, gfp_mask); shrink_ret = (*shrinker-shrink)(this_scan, gfp_mask); So, if you do anything stupid in your shrinker, the VM will doubly punish you. Ouch. The mmu_shrink() function takes the global kvm_lock, then acquires every VM's kvm-mmu_lock in sequence. If we have 100 VMs, then we're going to take 101 locks. We do it twice, so each call takes 202 locks. If we're under memory pressure, we can have each cpu trying to do this. It can get really hairy, and we've seen lock spinning in mmu_shrink() be the dominant entry in profiles. This is guaranteed to optimize at least half of those lock aquisitions away. It removes the need to take any of the locks when simply trying to count objects. Signed-off-by: Dave Hansend...@linux.vnet.ibm.com --- linux-2.6.git-dave/arch/x86/kvm/mmu.c | 33 +++-- 1 file changed, 23 insertions(+), 10 deletions(-) diff -puN arch/x86/kvm/mmu.c~make_global_used_value arch/x86/kvm/mmu.c --- linux-2.6.git/arch/x86/kvm/mmu.c~make_global_used_value 2010-06-09 15:14:30.0 -0700 +++ linux-2.6.git-dave/arch/x86/kvm/mmu.c 2010-06-09 15:14:30.0 -0700 @@ -891,6 +891,19 @@ static int is_empty_shadow_page(u64 *spt } #endif +/* + * This value is the sum of all of the kvm instances's + * kvm-arch.n_used_mmu_pages values. We need a global, + * aggregate version in order to make the slab shrinker + * faster + */ +static unsigned int kvm_total_used_mmu_pages; The variable needs to be at the top of the file. +static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr) +{ + kvm-arch.n_used_mmu_pages += nr; + kvm_total_used_mmu_pages += nr; Needs an atomic operation, since there's no global lock here. To avoid bouncing this cacheline around, make the variable percpu and make readers take a sum across all cpus. Side benefit is that you no longer need an atomic but a local_t, which is considerably cheaper. +} + -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/6] KVM: MMU: introduce gfn_to_page_atomic() and gfn_to_pfn_atomic()
The page is fine, the page tables are not. Another task can munmap() the thing while is_hwpoison_address() is running. Ok that boils down to me not seeing that source. If it accesses the page tables yes then it's racy. But whoever looked up the page tables in the first place should have returned an -EFAULT. There's no useful address attached to poison. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 8/9] reduce kvm_lock hold times in mmu_skrink()
On 06/15/2010 04:55 PM, Dave Hansen wrote: mmu_shrink() is effectively single-threaded since the global kvm_lock is held over the entire function. I beleive its only use here is for synchronization of the vm_list. Instead of using the kvm_lock to ensure consistency of the list, we instead obtain a kvm_get_kvm() reference. This keeps the kvm object on the vm_list while we shrink it. Since we don't need the lock to maintain the list any more, we can drop it. We'll reacquire it if we need to get another object off. This leads to a larger number of atomic ops, but reduces lock hold times: the typical latency vs. throughput debate. snip content diff -puN kernel/profile.c~optimize_shrinker-3 kernel/profile.c --- linux-2.6.git/kernel/profile.c~optimize_shrinker-3 2010-06-11 09:09:43.0 -0700 +++ linux-2.6.git-dave/kernel/profile.c 2010-06-11 09:12:24.0 -0700 @@ -314,6 +314,8 @@ void profile_hits(int type, void *__pc, if (prof_on != type || !prof_buffer) return; pc = min((pc - (unsigned long)_stext) prof_shift, prof_len - 1); + if ((pc == prof_len - 1) printk_ratelimit()) + printk(profile_hits(%d, %p, %d)\n, type, __pc, nr_hits); i = primary = (pc (NR_PROFILE_GRP - 1)) PROFILE_GRPSHIFT; secondary = (~(pc 1) (NR_PROFILE_GRP - 1)) PROFILE_GRPSHIFT; cpu = get_cpu(); _ Unrelated gunk... -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: KVM call minutes for June 15
2010/6/16 Markus Armbruster arm...@redhat.com: Anthony Liguori anth...@codemonkey.ws writes: On 06/15/2010 10:41 AM, Christoph Hellwig wrote: On Tue, Jun 15, 2010 at 08:18:12AM -0700, Chris Wright wrote: KVM/qemu patches - patch rate is high, documentation is low, review is low - patches need to include better descriptions and documentation - will slow down patch writers - will make it easier for patch reviewers What is the qemu patch review policy anyway? We don't really have a coherent policy. Suggestions for improvement are always appreciated. There are no Reviewed-by: included in the actual commits, Reviewed-by/Ack-by's are pretty helpful for me. In terms of including them in commit messages, if there's a strong feeling that that would be helpful then it's something I can look at doing but it also requires a fair bit of manual work during commit. Can't hurt reviewer motivation. Could it be automated? Find replies, extract tags. If you want your acks to be picked up, you better make sure your References header works, and your tags are formatted correctly. How about letting the submitter to include acked-by or reviewed-by manually and repost? It wouldn't make the maintainers busy. Although the traffic would increase, it would show the gratitude from submitter to the reviewer. Thanks, Yoshi [...] -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM call minutes for June 15
On 06/15/2010 05:18 PM, Chris Wright wrote: - size for each section would be useful (breaks protocol) - while size is possibly useful, breaks protocol It is not necessary to break the protocol. If you're okay with only having the size information when the migration data has been saved to a file, you can put the directory at the end of the migration data, after the EOF section. Something like QEMU_VM_SECTION_EOF QEMU_VM_SECTION_DIRECTORY copy of the migration data, with the actual data replaced by a single 8-byte pointer to the beginning of the section: QEMU_VM_SECTION_START section id 5 block instance id version id 8-byte pointer QEMU_VM_SECTION_START section id 3 ram instance id version id 8-byte pointer ... QEMU_VM_SECTION_FULL section id 10 cpu_common instance id version id 8-byte pointer ... QEMU_VM_SECTION_EOF 8-byte pointer QEMU_VM_SECTION_DIRECTORY 8-byte pointer Note that by definition the last 8 bytes will point to the beginning of the directory. You can read the last 18 bytes to reduce (to almost zero) the possibility of a false positive. The directory table can be built at save time and streamed after the EOF without causing an error if the receiver closes its connection during the streaming of the directory. Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] Really lazy fpu
On Wed, Jun 16, 2010 at 10:39:41AM +0200, Ingo Molnar wrote: (Cc:-ed various performance/optimization folks) * Avi Kivity a...@redhat.com wrote: On 06/16/2010 10:32 AM, H. Peter Anvin wrote: On 06/16/2010 12:24 AM, Avi Kivity wrote: Ingo, Peter, any feedback on this? Conceptually, this makes sense to me. However, I have a concern what happens when a task is scheduled on another CPU, while its FPU state is still in registers in the original CPU. That would seem to require expensive IPIs to spill the state in order for the rescheduling to proceed, and this could really damage performance. Right, this optimization isn't free. I think the tradeoff is favourable since task migrations are much less frequent than context switches within the same cpu, can the scheduler experts comment? This cannot be stated categorically without precise measurements of known-good, known-bad, average FPU usage and average CPU usage scenarios. All these workloads have different characteristics. I can imagine bad effects across all sorts of workloads: tcpbench, AIM7, various lmbench components, X benchmarks, tiobench - you name it. Combined with the fact that most micro-benchmarks wont be using the FPU, while in the long run most processes will be using the FPU due to SIMM instructions. So even a positive result might be skewed in practice. Has to be measured carefully IMO - and i havent seen a _single_ performance measurement in the submission mail. This is really essential. It can be nice to code an absolute worst-case microbenchmark too. Task migration can actually be very important to the point of being almost a fastpath in some workloads where threads are oversubscribed to CPUs and blocking on some contented resource (IO or mutex or whatever). I suspect the main issues in that case is the actual context switching and contention, but it would be nice to see just how much slower it could get. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] Really lazy fpu
Ingo Molnar, le Wed 16 Jun 2010 10:39:41 +0200, a écrit : in the long run most processes will be using the FPU due to SIMM instructions. I believe glibc already uses SIMM instructions for e.g. memcpy and friends, i.e. basically all applications... Samuel -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[ kvm-Bugs-2933400 ] virtio-blk io errors / data corruption on raw drives 1 TB
Bugs item #2933400, was opened at 2010-01-16 15:35 Message generated for change (Comment added) made by masc82 You can respond by visiting: https://sourceforge.net/tracker/?func=detailatid=893831aid=2933400group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: None Group: None Status: Closed Resolution: Fixed Priority: 9 Private: No Submitted By: MaSc82 (masc82) Assigned to: Nobody/Anonymous (nobody) Summary: virtio-blk io errors / data corruption on raw drives 1 TB Initial Comment: When attaching raw drives 1 TB, buffer io errors will most likely occur, filesystems get corrupted. Processes (like mkfs.ext4) might freeze completely when filesystems are created on the guest. Here's a typical log excerpt when using mkfs.ext4 on a 1.5 TB drive on a Ubuntu 9.10 guest: (kern.log) Jan 15 20:40:44 q kernel: [ 677.076602] Buffer I/O error on device vde, logical block 366283764 Jan 15 20:40:44 q kernel: [ 677.076607] Buffer I/O error on device vde, logical block 366283765 Jan 15 20:40:44 q kernel: [ 677.076611] Buffer I/O error on device vde, logical block 366283766 Jan 15 20:40:44 q kernel: [ 677.076616] Buffer I/O error on device vde, logical block 366283767 Jan 15 20:40:44 q kernel: [ 677.076621] Buffer I/O error on device vde, logical block 366283768 Jan 15 20:40:44 q kernel: [ 677.076626] Buffer I/O error on device vde, logical block 366283769 (messages) Jan 15 20:40:44 q kernel: [ 677.076534] lost page write due to I/O error on vde Jan 15 20:40:44 q kernel: [ 677.076541] lost page write due to I/O error on vde Jan 15 20:40:44 q kernel: [ 677.076546] lost page write due to I/O error on vde Jan 15 20:40:44 q kernel: [ 677.076599] lost page write due to I/O error on vde Jan 15 20:40:44 q kernel: [ 677.076604] lost page write due to I/O error on vde Jan 15 20:40:44 q kernel: [ 677.076609] lost page write due to I/O error on vde Jan 15 20:40:44 q kernel: [ 677.076613] lost page write due to I/O error on vde Jan 15 20:40:44 q kernel: [ 677.076618] lost page write due to I/O error on vde Jan 15 20:40:44 q kernel: [ 677.076623] lost page write due to I/O error on vde Jan 15 20:40:44 q kernel: [ 677.076628] lost page write due to I/O error on vde Jan 15 20:45:55 q Backgrounding to notify hosts... (The following entries will repeat infinitely, mkfs.ext4 will not exit and cannot be killed) Jan 15 20:49:27 q kernel: [ 1200.520096] mkfs.ext4 D 0 1839 1709 0x Jan 15 20:49:27 q kernel: [ 1200.520101] 88004e157cb8 0082 88004e157c58 00015880 Jan 15 20:49:27 q kernel: [ 1200.520115] 88004ef6c7c0 00015880 00015880 00015880 Jan 15 20:49:27 q kernel: [ 1200.520118] 00015880 88004ef6c7c0 00015880 00015880 Jan 15 20:49:27 q kernel: [ 1200.520123] Call Trace: Jan 15 20:49:27 q kernel: [ 1200.520157] [810da0f0] ? sync_page+0x0/0x50 Jan 15 20:49:27 q kernel: [ 1200.520178] [815255f8] io_schedule+0x28/0x40 Jan 15 20:49:27 q kernel: [ 1200.520182] [810da12d] sync_page+0x3d/0x50 Jan 15 20:49:27 q kernel: [ 1200.520185] [81525b17] __wait_on_bit+0x57/0x80 Jan 15 20:49:27 q kernel: [ 1200.520192] [810da29e] wait_on_page_bit+0x6e/0x80 Jan 15 20:49:27 q kernel: [ 1200.520205] [81078650] ? wake_bit_function+0x0/0x40 Jan 15 20:49:27 q kernel: [ 1200.520210] [810e44e0] ? pagevec_lookup_tag+0x20/0x30 Jan 15 20:49:27 q kernel: [ 1200.520213] [810da745] wait_on_page_writeback_range+0xf5/0x190 Jan 15 20:49:27 q kernel: [ 1200.520217] [810da807] filemap_fdatawait+0x27/0x30 Jan 15 20:49:27 q kernel: [ 1200.520220] [810dacb4] filemap_write_and_wait+0x44/0x50 Jan 15 20:49:27 q kernel: [ 1200.520235] [8114ba9f] __sync_blockdev+0x1f/0x40 Jan 15 20:49:27 q kernel: [ 1200.520239] [8114bace] sync_blockdev+0xe/0x10 Jan 15 20:49:27 q kernel: [ 1200.520241] [8114baea] block_fsync+0x1a/0x20 Jan 15 20:49:27 q kernel: [ 1200.520249] [81142f26] vfs_fsync+0x86/0xf0 Jan 15 20:49:27 q kernel: [ 1200.520252] [81142fc9] do_fsync+0x39/0x60 Jan 15 20:49:27 q kernel: [ 1200.520255] [8114301b] sys_fsync+0xb/0x10 Jan 15 20:49:27 q kernel: [ 1200.520271] [81011fc2] system_call_fastpath+0x16/0x1b In my case I was switching to virtio at one point, but the behaviour didn't show until there was 1 TB data on the filesystem. very dangerous. Tested using 2 different SATA controllers, 1.5 TB lvm/mdraid, single 1.5 TB drive and 2 TB lvm/mdraid. The behaviour does not occur with if=scsi or if=ide. #2914397 might be related: https://sourceforge.net/tracker/?func=detailaid=2914397group_id=180599atid=893831 This blog post might also relate: http://www.neuhalfen.name/2009/08/05/OpenSolaris_KVM_and_large_IDE_drives/ CPU: Intel
Re: [RFC][PATCH 9/9] make kvm mmu shrinker more aggressive
On 06/15/2010 04:55 PM, Dave Hansen wrote: In a previous patch, we removed the 'nr_to_scan' tracking. It was not being used to track the number of objects scanned, so we stopped using it entirely. Here, we strart using it again. The theory here is simple; if we already have the refcount and the kvm-mmu_lock, then we should do as much work as possible under the lock. The downside is that we're less fair about the KVM instances from which we reclaim. Each call to mmu_shrink() will tend to pick on one instance, after which it gets moved to the end of the list and left alone for a while. That also increases the latency hit, as well as a potential fault storm, on that instance. Spreading out is less efficient, but smoother. If mmu_shrink() has already done a significant amount of scanning, the use of 'nr_to_scan' inside shrink_kvm_mmu() will also ensure that we do not over-reclaim when we have already done a lot of work in this call. In the end, this patch defines a scan as: 1. An attempt to acquire a refcount on a 'struct kvm' 2. freeing a kvm mmu page This would probably be most ideal if we can expose some of the work done by kvm_mmu_remove_some_alloc_mmu_pages() as also counting as scanning, but I think we have churned enough for the moment. It usually removes one page. Signed-off-by: Dave Hansend...@linux.vnet.ibm.com --- linux-2.6.git-dave/arch/x86/kvm/mmu.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff -puN arch/x86/kvm/mmu.c~make-shrinker-more-aggressive arch/x86/kvm/mmu.c --- linux-2.6.git/arch/x86/kvm/mmu.c~make-shrinker-more-aggressive 2010-06-14 11:30:44.0 -0700 +++ linux-2.6.git-dave/arch/x86/kvm/mmu.c 2010-06-14 11:38:04.0 -0700 @@ -2935,8 +2935,10 @@ static int shrink_kvm_mmu(struct kvm *kv idx = srcu_read_lock(kvm-srcu); spin_lock(kvm-mmu_lock); - if (kvm-arch.n_used_mmu_pages 0) - freed_pages = kvm_mmu_remove_some_alloc_mmu_pages(kvm); + while (nr_to_scan 0 kvm-arch.n_used_mmu_pages 0) { + freed_pages += kvm_mmu_remove_some_alloc_mmu_pages(kvm); + nr_to_scan--; + } What tree are you patching? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] Really lazy fpu
On 06/16/2010 11:39 AM, Ingo Molnar wrote: (Cc:-ed various performance/optimization folks) * Avi Kivitya...@redhat.com wrote: On 06/16/2010 10:32 AM, H. Peter Anvin wrote: On 06/16/2010 12:24 AM, Avi Kivity wrote: Ingo, Peter, any feedback on this? Conceptually, this makes sense to me. However, I have a concern what happens when a task is scheduled on another CPU, while its FPU state is still in registers in the original CPU. That would seem to require expensive IPIs to spill the state in order for the rescheduling to proceed, and this could really damage performance. Right, this optimization isn't free. I think the tradeoff is favourable since task migrations are much less frequent than context switches within the same cpu, can the scheduler experts comment? This cannot be stated categorically without precise measurements of known-good, known-bad, average FPU usage and average CPU usage scenarios. All these workloads have different characteristics. I can imagine bad effects across all sorts of workloads: tcpbench, AIM7, various lmbench components, X benchmarks, tiobench - you name it. Combined with the fact that most micro-benchmarks wont be using the FPU, while in the long run most processes will be using the FPU due to SIMM instructions. So even a positive result might be skewed in practice. Has to be measured carefully IMO - and i havent seen a _single_ performance measurement in the submission mail. This is really essential. I have really no idea what to measure. Which would you most like to see? So this does not look like a patch-set we could apply without gathering a _ton_ of hard data about advantages and disadvantages. I agree (not to mention that I'm not really close to having an applyable patchset). Note some of the advantages will not be in throughput but in latency (making kernel_fpu_begin() preemptible, and reducing context switch time for event threads). We can also mitigate some of the IPIs if we know that we're migrating on the cpu we're migrating from (i.e. we're pushing tasks to another cpu, not pulling them from their cpu). Is that a common case, and if so, where can I hook a call to unlazy_fpu() (or its new equivalent)? When the system goes from idle to less idle then most of the 'fast' migrations happen on a 'push' model - on a busy CPU we wake up a new task and push it out to a known-idle CPU. At that point we can indeed unlazy the FPU with probably little cost. Can you point me to the code which does this? But on busy servers where most wakeups are IRQ based the chance of being on the right CPU is 1/nr_cpus - i.e. decreasing with every new generation of CPUs. But don't we usually avoid pulls due to NUMA and cache considerations? If there's some sucky corner case in theory we could approach it statistically and measure the ratio of fast vs. slow migration vs. local context switches - but that looks a bit complex. I certainly wouldn't want to start with it. Dunno. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] Really lazy fpu
On 06/16/2010 12:10 PM, Nick Piggin wrote: This cannot be stated categorically without precise measurements of known-good, known-bad, average FPU usage and average CPU usage scenarios. All these workloads have different characteristics. I can imagine bad effects across all sorts of workloads: tcpbench, AIM7, various lmbench components, X benchmarks, tiobench - you name it. Combined with the fact that most micro-benchmarks wont be using the FPU, while in the long run most processes will be using the FPU due to SIMM instructions. So even a positive result might be skewed in practice. Has to be measured carefully IMO - and i havent seen a _single_ performance measurement in the submission mail. This is really essential. It can be nice to code an absolute worst-case microbenchmark too. Sure. Task migration can actually be very important to the point of being almost a fastpath in some workloads where threads are oversubscribed to CPUs and blocking on some contented resource (IO or mutex or whatever). I suspect the main issues in that case is the actual context switching and contention, but it would be nice to see just how much slower it could get. If it's just cpu oversubscription then the IPIs will be limited by the rebalance rate and the time slice, so as you say it has to involve contention and frequent wakeups as well as heavy cpu usage. That won't be easy to code. Can you suggest an existing benchmark to run? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/6] KVM: MMU: introduce gfn_to_page_atomic() and gfn_to_pfn_atomic()
On 06/16/2010 11:49 AM, Andi Kleen wrote: The page is fine, the page tables are not. Another task can munmap() the thing while is_hwpoison_address() is running. Ok that boils down to me not seeing that source. If it accesses the page tables yes then it's racy. But whoever looked up the page tables in the first place should have returned an -EFAULT. There's no useful address attached to poison. We need to distinguish between genuine -EFAULT and poisoned address. That's why I suggested get_user_pages_ptes_fast. You can return page = NULL (-EFAULT) and the pte in the same go. No race, and useful for other cases. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] Really lazy fpu
On 06/16/2010 12:01 PM, Samuel Thibault wrote: Ingo Molnar, le Wed 16 Jun 2010 10:39:41 +0200, a écrit : in the long run most processes will be using the FPU due to SIMM instructions. I believe glibc already uses SIMM instructions for e.g. memcpy and friends, i.e. basically all applications... I think they ought to be using 'rep movs' on newer processors, but yes you're right. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RFC qdev path semantics (was: [Qemu-devel] [RFC PATCH 0/5] Introduce canonical device hierarchy string)
A number of changes to qdev paths have been proposed in various threads. It's becoming harder to keep track of them, so let me sum them up in one place. Please correct me if I misrepresent your ideas. I'm going to describe the current state of things, and the proposed changes (marked with ###). The device tree has the main system bus as root. A child of a bus is a device. A child of a device is a bus. A qdev path consists of qdev path components separated by '/'. It resolves to a node in the device tree, either bus or device. The qdev path / resolves to the root, i.e. the main system bus. The qdev path IDENT, where IDENT is an identifier, resolves to the device whose id is IDENT. If PATH resolves to device DEV, and a child of DEV has the name IDENT, then we resolve to that bus. Bus names are chosen by the system as follows: * If the driver of the parent device model provides a name, use that. * Else, if the parent device has id ID, use ID.NUM, where NUM is the bus number, counting from zero in creation order. * Else, use TYPE.NUM, where TYPE is derived from the bus type, and NUM is the bus number, as above. ### Paul proposes to drop ID.NUM. ### Paul proposes to either drop TYPE.NUM (and require drivers to provide bus names), or make NUM count separately for each bus type. If PATH resolves to bus BUS, then we resolve PATH/IDENT as follows: * If a child of BUS has id IDENT, then we resolve to that device. ### Jan proposes to drop this rule. * Else, if a child of BUS has a driver with name IDENT, then we resolve to that device. If more than one exist, resolve to the first one. This assumes children are ordered. Order is the same as in info qtree. Currently, it's reverse creation order. This is *not* a stable address. * Else, if a child of BUS has a driver with alias name IDENT, then we resolve to that device. If more than one exist, resolve to the first one. This assumes children are ordered. Order is the same as in info qtree. Currently, it's reverse creation order. This is *not* a stable address. ### I propose: we resolve PATH/@BUS-ADDR to the child of BUS with bus address BUS-ADDR, if devices on this type of bus have bus addresses. The format of BUS-ADDR depends on the bus. ### Paul proposes to require all buses to define bus addresses. Make one up if necessary. ### Jan proposes: we resolve PATH/IDENT.NUM as follows: * If a child of BUS has a driver with name IDENT and an instance number NUM, then we resolve to that device. Need a suitable definition of instance number. Jan proposes to number devices with the same driver on the same bus. This assumes children are ordered, see above. This is *not* a stable address if the bus supports hot-plug. I propose to us bus-address as instance number. Works best together with Paul's proposal to define bus addresses. Syntax id...@bus-addr makes more sense then. * Else, same with driver alias name. ### Here's a possible synthesis of the above three proposals: require bus addresses, and permit any of PATH/IDENT PATH/@BUS-ADDR PATH/id...@bus-addr PATH/IDENT can't address instances that don't come first. IDENT in PATH/id...@bus-addr is redundant. Therefore, it can't be the canonical qdev path. That's fine, PATH/@BUS-ADDR serves. If the above rules resolve PATH to a device in a context where we expect a bus, and the device has exactly one bus, resolve it to that bus instead. ### Jan and I propose to drop this rule. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
kvm_queue_exception - what is going wrong here?
Hi *, we recently have encountered kvm_queue_exception events in dmesg. What is causing these exceptions? It appears that the guest is at least suffering performance loss. Is the qemu-system-x86 hang event linked to the exception? Is there anything we can do to prevent these exceptions to happen? Best regards, Sebastian - [1811816.496620] set_cr3: #GP, reserved bits [1811816.496624] [ cut here ] [1811816.496640] WARNING: at /usr/src/linux-2.6.31/arch/x86/kvm/x86.c:202 kvm_queue_exception_e+0x61/0x70 [kvm ]() [1811816.496643] Hardware name: S5520HC [1811816.496644] Modules linked in: drbd tun ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv 4 xt_state nf_conntrack ipt_REJECT xt_tcpudp ppdev iptable_filter ip_tables lp parport x_tables kvm_intel kvm bridge stp bonding e100 via_rhine 3c59x 8139too mii shpchp iTCO_wdt iTCO_vendor_support pcspkr joydev ses encl osure usbhid bnx2 aacraid e1000e igb dca fbcon tileblit font bitblit softcursor [1811816.496669] Pid: 26170, comm: qemu-system-x86 Not tainted 2.6.31-14-server #48 [1811816.496672] Call Trace: [1811816.496682] [a0140651] ? kvm_queue_exception_e+0x61/0x70 [kvm] [1811816.496689] [8105f458] warn_slowpath_common+0x78/0xd0 [1811816.496692] [8105f4bf] warn_slowpath_null+0xf/0x20 [1811816.496701] [a0140651] kvm_queue_exception_e+0x61/0x70 [kvm] [1811816.496711] [a01406ff] load_guest_segment_descriptor+0x9f/0xb0 [kvm] [1811816.496721] [a0140753] kvm_load_segment_descriptor+0x43/0x110 [kvm] [1811816.496730] [a0140a20] ? kvm_inject_gp+0x10/0x20 [kvm] [1811816.496740] [a0140bed] ? kvm_set_cr3+0x15d/0x170 [kvm] [1811816.496749] [a0140d67] load_state_from_tss32+0x167/0x240 [kvm] [1811816.496758] [a0142a69] kvm_task_switch_32+0x119/0x130 [kvm] [1811816.496768] [a0142f93] kvm_task_switch+0x153/0x300 [kvm] [1811816.496777] [a0137df1] ? gfn_to_hva+0x11/0x90 [kvm] [1811816.496786] [a0138122] ? kvm_read_guest_page+0x62/0x70 [kvm] [1811816.496792] [a016cbab] handle_task_switch+0x6b/0x160 [kvm_intel] [1811816.496796] [a0169b45] vmx_handle_exit+0xf5/0x280 [kvm_intel] [1811816.496801] [a016d822] ? vmx_vcpu_run+0x292/0x856 [kvm_intel] [1811816.496811] [a013ddbf] vcpu_enter_guest+0x2cf/0x600 [kvm] [1811816.496814] [8107a12e] ? finish_wait+0x5e/0x80 [1811816.496819] [81541710] ? _spin_lock_irq+0x10/0x20 [1811816.496822] [81541603] ? __down_read+0xc3/0xce [1811816.496825] [81079fa0] ? autoremove_wake_function+0x0/0x40 [1811816.496834] [a013e153] __vcpu_run+0x63/0x330 [kvm] [1811816.496844] [a01440ab] kvm_arch_vcpu_ioctl_run+0x8b/0x1f0 [kvm] [1811816.496853] [a01367f2] kvm_vcpu_ioctl+0x2e2/0x630 [kvm] [1811816.496856] [8106e99f] ? dequeue_signal+0x9f/0x180 [1811816.496859] [81131d41] vfs_ioctl+0x31/0xa0 [1811816.496862] [811321b3] do_vfs_ioctl+0x373/0x400 [1811816.496864] [811322d9] sys_ioctl+0x99/0xa0 [1811816.496862] [811321b3] do_vfs_ioctl+0x373/0x400 [1811816.496864] [811322d9] sys_ioctl+0x99/0xa0 [1811816.496867] [8106d3e9] ? do_sigpending+0xa9/0xc0 [1811816.496870] [81012082] system_call_fastpath+0x16/0x1b [1811816.496872] ---[ end trace c6ea0da2c3c9d0e4 ]--- [1811816.496887] set_cr3: #GP, pdptrs reserved bits [1811816.496889] [ cut here ] [1811816.496898] WARNING: at /usr/src/linux-2.6.31/arch/x86/kvm/x86.c:202 kvm_queue_exception_e+0x61/0x70 [kvm]() [1811816.496900] Hardware name: S5520HC [1811816.496901] Modules linked in: drbd tun ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp ppdev iptable_filter ip_tables lp parport x_tables kvm_intel kvm bridge stp bonding e100 via_rhine 3c59x 8139too mii shpchp iTCO_wdt iTCO_vendor_support pcspkr joydev ses enclosure usbhid bnx2 aacraid e1000e igb dca fbcon tileblit font bitblit softcursor [1811816.496923] Pid: 26170, comm: qemu-system-x86 Tainted: GW 2.6.31-14-server #48 [1811816.496925] Call Trace: [1811816.496934] [a0140651] ? kvm_queue_exception_e+0x61/0x70 [kvm] [1811816.496937] [8105f458] warn_slowpath_common+0x78/0xd0 [1811816.496940] [8105f4bf] warn_slowpath_null+0xf/0x20 [1811816.496950] [a0140651] kvm_queue_exception_e+0x61/0x70 [kvm] [1811816.496959] [a01406ff] load_guest_segment_descriptor+0x9f/0xb0 [kvm] [1811816.496969] [a0140753] kvm_load_segment_descriptor+0x43/0x110 [kvm] [1811816.496978] [a0140b52] ? kvm_set_cr3+0xc2/0x170 [kvm] [1811816.496988] [a0140d9c] load_state_from_tss32+0x19c/0x240 [kvm] [1811816.496997] [a0142a69] kvm_task_switch_32+0x119/0x130 [kvm] [1811816.497006] [a0142f93] kvm_task_switch+0x153/0x300 [kvm] [1811816.497015] [a0137df1] ?
Re: Loss of network connectivity with high load
On Wednesday, 16 June 2010 00:36:43 -0300, Daniel Bareiro wrote: There have been a few bugs similar to this reported that should be pretty easy to find. Basically, I'd say try 0.12.4, it should have a few fixes in this area over 0.12.3. I was testing with qemu-kvm 0.12.4 but I'm having the same problem. I re-read the announce about qemu-kvm 0.12.4 sent by Avi, but I didn't find in it some reference about fixes related to Virtio. If there's anything else I can try, please tell me. I found some references to similar problems: http://forum.proxmox.com/threads/3117-virtio-net-crashing-(stop-sending-traffic) http://www.mail-archive.com/kvm@vger.kernel.org/msg26033.html http://marc.info/?l=kvmm=126564542625725w=2 In particular, I note the patches provided by Tom Lendacky and Herbert Xu. Are these patches applied to qemu-kvm 0.12.3 or 0.12.4? If so, it could be a regression? Thanks in advance for your reply. Regards, Daniel -- Fingerprint: BFB3 08D6 B4D1 31B2 72B9 29CE 6696 BF1B 14E6 1D37 Powered by Debian GNU/Linux Lenny - Linux user #188.598 signature.asc Description: Digital signature
Re: kvm_queue_exception - what is going wrong here?
On Wed, Jun 16, 2010 at 12:05:10PM +0200, Sebastian Hetze wrote: Hi *, we recently have encountered kvm_queue_exception events in dmesg. What is causing these exceptions? It appears that the guest is at least suffering performance loss. Is the qemu-system-x86 hang event linked to the exception? Is there anything we can do to prevent these exceptions to happen? What is your guest? It look like it tries to switch task and tss it tries to switch to is broken, so task switch emulation code encounters multiple exception and it complains about it. Theoretically we should abort task switch after fist exception condition (set_cr3: #GP, reserved bits), but error handling in task switch emulation code is far from perfect especially in such old kernel like 2.6.31. Best regards, Sebastian - [1811816.496620] set_cr3: #GP, reserved bits [1811816.496624] [ cut here ] [1811816.496640] WARNING: at /usr/src/linux-2.6.31/arch/x86/kvm/x86.c:202 kvm_queue_exception_e+0x61/0x70 [kvm ]() [1811816.496643] Hardware name: S5520HC [1811816.496644] Modules linked in: drbd tun ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv 4 xt_state nf_conntrack ipt_REJECT xt_tcpudp ppdev iptable_filter ip_tables lp parport x_tables kvm_intel kvm bridge stp bonding e100 via_rhine 3c59x 8139too mii shpchp iTCO_wdt iTCO_vendor_support pcspkr joydev ses encl osure usbhid bnx2 aacraid e1000e igb dca fbcon tileblit font bitblit softcursor [1811816.496669] Pid: 26170, comm: qemu-system-x86 Not tainted 2.6.31-14-server #48 [1811816.496672] Call Trace: [1811816.496682] [a0140651] ? kvm_queue_exception_e+0x61/0x70 [kvm] [1811816.496689] [8105f458] warn_slowpath_common+0x78/0xd0 [1811816.496692] [8105f4bf] warn_slowpath_null+0xf/0x20 [1811816.496701] [a0140651] kvm_queue_exception_e+0x61/0x70 [kvm] [1811816.496711] [a01406ff] load_guest_segment_descriptor+0x9f/0xb0 [kvm] [1811816.496721] [a0140753] kvm_load_segment_descriptor+0x43/0x110 [kvm] [1811816.496730] [a0140a20] ? kvm_inject_gp+0x10/0x20 [kvm] [1811816.496740] [a0140bed] ? kvm_set_cr3+0x15d/0x170 [kvm] [1811816.496749] [a0140d67] load_state_from_tss32+0x167/0x240 [kvm] [1811816.496758] [a0142a69] kvm_task_switch_32+0x119/0x130 [kvm] [1811816.496768] [a0142f93] kvm_task_switch+0x153/0x300 [kvm] [1811816.496777] [a0137df1] ? gfn_to_hva+0x11/0x90 [kvm] [1811816.496786] [a0138122] ? kvm_read_guest_page+0x62/0x70 [kvm] [1811816.496792] [a016cbab] handle_task_switch+0x6b/0x160 [kvm_intel] [1811816.496796] [a0169b45] vmx_handle_exit+0xf5/0x280 [kvm_intel] [1811816.496801] [a016d822] ? vmx_vcpu_run+0x292/0x856 [kvm_intel] [1811816.496811] [a013ddbf] vcpu_enter_guest+0x2cf/0x600 [kvm] [1811816.496814] [8107a12e] ? finish_wait+0x5e/0x80 [1811816.496819] [81541710] ? _spin_lock_irq+0x10/0x20 [1811816.496822] [81541603] ? __down_read+0xc3/0xce [1811816.496825] [81079fa0] ? autoremove_wake_function+0x0/0x40 [1811816.496834] [a013e153] __vcpu_run+0x63/0x330 [kvm] [1811816.496844] [a01440ab] kvm_arch_vcpu_ioctl_run+0x8b/0x1f0 [kvm] [1811816.496853] [a01367f2] kvm_vcpu_ioctl+0x2e2/0x630 [kvm] [1811816.496856] [8106e99f] ? dequeue_signal+0x9f/0x180 [1811816.496859] [81131d41] vfs_ioctl+0x31/0xa0 [1811816.496862] [811321b3] do_vfs_ioctl+0x373/0x400 [1811816.496864] [811322d9] sys_ioctl+0x99/0xa0 [1811816.496862] [811321b3] do_vfs_ioctl+0x373/0x400 [1811816.496864] [811322d9] sys_ioctl+0x99/0xa0 [1811816.496867] [8106d3e9] ? do_sigpending+0xa9/0xc0 [1811816.496870] [81012082] system_call_fastpath+0x16/0x1b [1811816.496872] ---[ end trace c6ea0da2c3c9d0e4 ]--- [1811816.496887] set_cr3: #GP, pdptrs reserved bits [1811816.496889] [ cut here ] [1811816.496898] WARNING: at /usr/src/linux-2.6.31/arch/x86/kvm/x86.c:202 kvm_queue_exception_e+0x61/0x70 [kvm]() [1811816.496900] Hardware name: S5520HC [1811816.496901] Modules linked in: drbd tun ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp ppdev iptable_filter ip_tables lp parport x_tables kvm_intel kvm bridge stp bonding e100 via_rhine 3c59x 8139too mii shpchp iTCO_wdt iTCO_vendor_support pcspkr joydev ses enclosure usbhid bnx2 aacraid e1000e igb dca fbcon tileblit font bitblit softcursor [1811816.496923] Pid: 26170, comm: qemu-system-x86 Tainted: GW 2.6.31-14-server #48 [1811816.496925] Call Trace: [1811816.496934] [a0140651] ? kvm_queue_exception_e+0x61/0x70 [kvm] [1811816.496937] [8105f458] warn_slowpath_common+0x78/0xd0 [1811816.496940] [8105f4bf] warn_slowpath_null+0xf/0x20
Re: RFC qdev path semantics (was: [Qemu-devel] [RFC PATCH 0/5] Introduce canonical device hierarchy string)
* Else, use TYPE.NUM, where TYPE is derived from the bus type, and NUM is the bus number, as above. ### Paul proposes to either drop TYPE.NUM (and require drivers to provide bus names), or make NUM count separately for each bus type. I revised this proposal: Drop the .NUM part, and require drivers provide a bus name if TYPE would result in ambiguous names. Paul -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/6] KVM: MMU: introduce gfn_to_page_atomic() and gfn_to_pfn_atomic()
On Tue, Jun 15, 2010 at 7:22 PM, Avi Kivity a...@redhat.com wrote: btw, is_hwpoison_address() is racy. While it looks up the address, some other task can unmap the page tables under us. Andi/Huang? One way of fixing it is get_user_pages_ptes_fast(), which also returns the pte, also atomically. I want it for other reasons as well (respond to a read fault by gupping the page for read, but allowing write access if the pte indicates it is writeable). Yes. is_hwpoison_address() is racy. But I think it is not absolutely necessary to call is_hwpoison_address() in hva_to_pfn_atomic(), is it? For is_hwpoison_address() in hva_to_pfn(), we can protect it with mmap_sem. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/6] KVM: MMU: introduce gfn_to_page_atomic() and gfn_to_pfn_atomic()
On 06/16/2010 01:51 PM, huang ying wrote: On Tue, Jun 15, 2010 at 7:22 PM, Avi Kivitya...@redhat.com wrote: btw, is_hwpoison_address() is racy. While it looks up the address, some other task can unmap the page tables under us. Andi/Huang? One way of fixing it is get_user_pages_ptes_fast(), which also returns the pte, also atomically. I want it for other reasons as well (respond to a read fault by gupping the page for read, but allowing write access if the pte indicates it is writeable). Yes. is_hwpoison_address() is racy. But I think it is not absolutely necessary to call is_hwpoison_address() in hva_to_pfn_atomic(), is it? We can probably ignore it, yes. For is_hwpoison_address() in hva_to_pfn(), we can protect it with mmap_sem. Not very appealing, but should work. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/24] Implement VMXON and VMXOFF
Hi, On Mon, Jun 14, 2010, Avi Kivity wrote about Re: [PATCH 3/24] Implement VMXON and VMXOFF: On 06/13/2010 03:24 PM, Nadav Har'El wrote: This patch allows a guest to use the VMXON and VMXOFF instructions, and emulates them accordingly. Basically this amounts to checking some prerequisites, and then remembering whether the guest has enabled or disabled VMX operation. Should probably reorder with next patch. I can't do this if I want the code to compile after each patch, because the next patch (controlling when setting cr4.VMXE can be set) needs to check whether VMXON was done. Please (here and elsewhere) use the standard kernel style for multiline comments - start with /* on a line by itself. Sure, sorry about that. I guess I need to (re)read the Linux coding style document. +vmx-nested.vmxon = 1; = true I'll change that. I learned C more than a decade before the advent of stdbool.h, so in my mind, 1 has always been, and still is, the right and only way to write true... But of course it doesn't mean I need to inflict my old style on everybody else ;-) Need to block INIT signals in the local apic as well (fine for a separate patch). I've been looking into how I might best go about achieving this. The APIC_DM_INIT handler is in lapic.c, which is not aware of VMX or (obviously) nested VMX. So I need to add some sort of generic block INIT flag which that code will check. Is this the sort of fix you had in mind? A different change could be to write a handler for exit reason 3, which we get if there's a real INIT signal in the host; If we get exit reason 3 from L2, we need to exit to L1 to handle it, while if we get exit reason 3 from L1 that has done VMXON, we simply need to do nothing (according to the spec). So I'm not sure which of these two things is what we really about. What kind of scenario did you have in mind where this INIT business is relevant? Here is the patch with your above comments fixed *except* the INIT thing: --- Subject: [PATCH 3/24] Implement VMXON and VMXOFF This patch allows a guest to use the VMXON and VMXOFF instructions, and emulates them accordingly. Basically this amounts to checking some prerequisites, and then remembering whether the guest has enabled or disabled VMX operation. Signed-off-by: Nadav Har'El n...@il.ibm.com --- --- .before/arch/x86/kvm/vmx.c 2010-06-16 13:20:19.0 +0300 +++ .after/arch/x86/kvm/vmx.c 2010-06-16 13:20:19.0 +0300 @@ -125,6 +125,17 @@ struct shared_msr_entry { u64 mask; }; +/* + * The nested_vmx structure is part of vcpu_vmx, and holds information we need + * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example, + * the current VMCS set by L1, a list of the VMCSs used to run the active + * L2 guests on the hardware, and more. + */ +struct nested_vmx { + /* Has the level1 guest done vmxon? */ + bool vmxon; +}; + struct vcpu_vmx { struct kvm_vcpu vcpu; struct list_head local_vcpus_link; @@ -176,6 +187,9 @@ struct vcpu_vmx { u32 exit_reason; bool rdtscp_enabled; + + /* Support for guest hypervisors (nested VMX) */ + struct nested_vmx nested; }; static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu) @@ -3361,6 +3375,90 @@ static int handle_vmx_insn(struct kvm_vc return 1; } +/* + * Emulate the VMXON instruction. + * Currently, we just remember that VMX is active, and do not save or even + * inspect the argument to VMXON (the so-called VMXON pointer) because we + * do not currently need to store anything in that guest-allocated memory + * region. Consequently, VMCLEAR and VMPTRLD also do not verify that the their + * argument is different from the VMXON pointer (which the spec says they do). + */ +static int handle_vmon(struct kvm_vcpu *vcpu) +{ + struct kvm_segment cs; + struct vcpu_vmx *vmx = to_vmx(vcpu); + + /* The Intel VMX Instruction Reference lists a bunch of bits that +* are prerequisite to running VMXON, most notably CR4.VMXE must be +* set to 1. Otherwise, we should fail with #UD. We test these now: +*/ + if (!nested || + !kvm_read_cr4_bits(vcpu, X86_CR4_VMXE) || + !kvm_read_cr0_bits(vcpu, X86_CR0_PE) || + (vmx_get_rflags(vcpu) X86_EFLAGS_VM)) { + kvm_queue_exception(vcpu, UD_VECTOR); + return 1; + } + + vmx_get_segment(vcpu, cs, VCPU_SREG_CS); + if (is_long_mode(vcpu) !cs.l) { + kvm_queue_exception(vcpu, UD_VECTOR); + return 1; + } + + if (vmx_get_cpl(vcpu)) { + kvm_inject_gp(vcpu, 0); + return 1; + } + + vmx-nested.vmxon = true; + + skip_emulated_instruction(vcpu); + return 1; +} + +/* + * Intel's VMX Instruction Reference specifies a common set of prerequisites + * for running VMX instructions (except VMXON, whose prerequisites are + * slightly
Re: [PATCH 3/24] Implement VMXON and VMXOFF
On 06/16/2010 02:14 PM, Nadav Har'El wrote: Hi, On Mon, Jun 14, 2010, Avi Kivity wrote about Re: [PATCH 3/24] Implement VMXON and VMXOFF: On 06/13/2010 03:24 PM, Nadav Har'El wrote: This patch allows a guest to use the VMXON and VMXOFF instructions, and emulates them accordingly. Basically this amounts to checking some prerequisites, and then remembering whether the guest has enabled or disabled VMX operation. Should probably reorder with next patch. I can't do this if I want the code to compile after each patch, because the next patch (controlling when setting cr4.VMXE can be set) needs to check whether VMXON was done. You can have this patch add the vmxon check. But it doesn't matter too much, you can keep the current order. Need to block INIT signals in the local apic as well (fine for a separate patch). I've been looking into how I might best go about achieving this. The APIC_DM_INIT handler is in lapic.c, which is not aware of VMX or (obviously) nested VMX. So I need to add some sort of generic block INIT flag which that code will check. Is this the sort of fix you had in mind? It's not enough to block INIT, there is also exit reason 3, INIT signal. So you need to call x86.c code from the lapic, which needs to call a kvm_x86_op hook which lets vmx.c decide whether the INIT needs to be intercepted or not, and what to do with it (ignore in root mode, exit in non-root mode) Note the check needs to be done in vcpu context, not during delivery as it is done now. So we probably need a KVM_REQ_INIT bit in vcpu-requests, which we can check during guest entry where we know if we're in root or non-root mode. Pretty complicated and esoteric. We can defer this now while we work out more immediate issues, but it needs to be addressed. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC qdev path semantics
Markus Armbruster wrote: A number of changes to qdev paths have been proposed in various threads. It's becoming harder to keep track of them, so let me sum them up in one place. Please correct me if I misrepresent your ideas. I'm going to describe the current state of things, and the proposed changes (marked with ###). The device tree has the main system bus as root. A child of a bus is a device. A child of a device is a bus. A qdev path consists of qdev path components separated by '/'. It resolves to a node in the device tree, either bus or device. The qdev path / resolves to the root, i.e. the main system bus. Another aspect: A path may start with an arbitrary bus name, not only the system bus. Although this is ambiguous, we need to keep it for addressing the bus itself due to existing client use. But, IMO, we should at least start deprecating this for addressing elements below that bus (e.g. pci.0/e1000). And besides specifying devices via absolute qdev paths, we also support addressing them via their ID if present. I checked if ID/BUS[/...] was supported so far, but it wasn't. So I did not propose this yet although it might be a useful abbreviation. The qdev path IDENT, where IDENT is an identifier, resolves to the device whose id is IDENT. If PATH resolves to device DEV, and a child of DEV has the name IDENT, then we resolve to that bus. Bus names are chosen by the system as follows: * If the driver of the parent device model provides a name, use that. * Else, if the parent device has id ID, use ID.NUM, where NUM is the bus number, counting from zero in creation order. * Else, use TYPE.NUM, where TYPE is derived from the bus type, and NUM is the bus number, as above. ### Paul proposes to drop ID.NUM. ### Paul proposes to either drop TYPE.NUM (and require drivers to provide bus names), or make NUM count separately for each bus type. If PATH resolves to bus BUS, then we resolve PATH/IDENT as follows: * If a child of BUS has id IDENT, then we resolve to that device. ### Jan proposes to drop this rule. * Else, if a child of BUS has a driver with name IDENT, then we resolve to that device. If more than one exist, resolve to the first one. This assumes children are ordered. Order is the same as in info qtree. Currently, it's reverse creation order. This is *not* a stable address. * Else, if a child of BUS has a driver with alias name IDENT, then we resolve to that device. If more than one exist, resolve to the first one. This assumes children are ordered. Order is the same as in info qtree. Currently, it's reverse creation order. This is *not* a stable address. ### I propose: we resolve PATH/@BUS-ADDR to the child of BUS with bus address BUS-ADDR, if devices on this type of bus have bus addresses. The format of BUS-ADDR depends on the bus. ### Paul proposes to require all buses to define bus addresses. Make one up if necessary. ### Jan proposes: we resolve PATH/IDENT.NUM as follows: * If a child of BUS has a driver with name IDENT and an instance number NUM, then we resolve to that device. Need a suitable definition of instance number. Jan proposes to number devices with the same driver on the same bus. This assumes children are ordered, see above. This is *not* a stable address if the bus supports hot-plug. I propose to us bus-address as instance number. Works best together with Paul's proposal to define bus addresses. Syntax id...@bus-addr makes more sense then. I would be fine with this scheme, but I assume we still need instance numbers as fallback for buses without any usable addressing. Example: I have a patch queued that uses this for internal addressing of all hpet devices on the system bus (to connect them to the ISA IRQs). * Else, same with driver alias name. ### Here's a possible synthesis of the above three proposals: require bus addresses, and permit any of PATH/IDENT PATH/@BUS-ADDR PATH/id...@bus-addr PATH/IDENT can't address instances that don't come first. PATH/IDENT[.INSTANCE] would resolve the addressability. IDENT in PATH/id...@bus-addr is redundant. Therefore, it can't be the canonical qdev path. That's fine, PATH/@BUS-ADDR serves. If the above rules resolve PATH to a device in a context where we expect a bus, and the device has exactly one bus, resolve it to that bus instead. ### Jan and I propose to drop this rule. Thanks for this summary! Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On 06/15/2010 05:47 PM, Dave Hansen wrote: That's a bug that needs to be fixed. Eventually the host will come under pressure and will balloon the guest. If that kills the guest, the ballooning is not effective as a host memory management technique. I'm not convinced that it's just a bug that can be fixed. Consider a case where a host sees a guest with 100MB of free memory at the exact moment that a database app sees that memory. The host tries to balloon that memory away at the same time that the app goes and allocates it. That can certainly lead to an OOM very quickly, even for very small amounts of memory (much less than 100MB). Where's the bug? I think the issues are really fundamental to ballooning. There are two issues involved. One is, can the kernel accurately determine the amount of memory it needs to work? We have resources such as RAM and swap. We have liabilities in the form of swappable userspace memory, mlocked userspace memory, kernel memory to support these, and various reclaimable and non-reclaimable kernel caches. Can we determine the minimum amount of RAM to support are workload at a point in time? If we had this, we could modify the balloon to refuse to balloon if it takes the kernel beneath the minimum amount of RAM needed. In fact, this is similar to allocating memory with overcommit_memory = 0. The difference is the balloon allocates mlocked memory, while normal allocations can be charged against swap. But fundamentally it's the same. If all the guests do this, then it leaves that much more free memory on the host, which can be used flexibly for extra host page cache, new guests, etc... If the host detects lots of pagecache misses it can balloon guests down. If pagecache is quiet, why change anything? Page cache misses alone are not really sufficient. This is the classic problem where we try to differentiate streaming I/O (which we can't effectively cache) from I/O which can be effectively cached. True. Random I/O across a very large dataset is also difficult to cache. If the host wants to start new guests, it can balloon guests down. If no new guests are wanted, why change anything? We're talking about an environment which we're always trying to optimize. Imagine that we're always trying to consolidate guests on to smaller numbers of hosts. We're effectively in a state where we _always_ want new guests. If this came at no cost to the guests, you'd be right. But at some point guest performance will be hit by this, so the advantage gained from freeing memory will be balanced by the disadvantage. Also, memory is not the only resource. At some point you become cpu bound; at that point freeing memory doesn't help and in fact may increase your cpu load. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC qdev path semantics
Markus Armbruster wrote: A number of changes to qdev paths have been proposed in various threads. It's becoming harder to keep track of them, so let me sum them up in one place. Please correct me if I misrepresent your ideas. I'm going to describe the current state of things, and the proposed changes (marked with ###). The device tree has the main system bus as root. A child of a bus is a device. A child of a device is a bus. A qdev path consists of qdev path components separated by '/'. It resolves to a node in the device tree, either bus or device. The qdev path / resolves to the root, i.e. the main system bus. Another aspect: A path may start with an arbitrary bus name, not only the system bus. Although this is ambiguous, we need to keep it for addressing the bus itself due to existing client use. But, IMO, we should at least start deprecating this for addressing elements below that bus (e.g. pci.0/e1000). I think this would be better served by adding explicit aliases/IDs for those use-cases. i.e. define the global ID pci.0 to be an alias for /i440FX-pcihost/pci Paul -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: KVM call minutes for June 15
On Wednesday 16 June 2010, Markus Armbruster wrote: Can't hurt reviewer motivation. Could it be automated? Find replies, extract tags. If you want your acks to be picked up, you better make sure your References header works, and your tags are formatted correctly. I think pwclient (https://patchwork.kernel.org/) can do this for you. Arnd -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC qdev path semantics
Paul Brook wrote: Markus Armbruster wrote: A number of changes to qdev paths have been proposed in various threads. It's becoming harder to keep track of them, so let me sum them up in one place. Please correct me if I misrepresent your ideas. I'm going to describe the current state of things, and the proposed changes (marked with ###). The device tree has the main system bus as root. A child of a bus is a device. A child of a device is a bus. A qdev path consists of qdev path components separated by '/'. It resolves to a node in the device tree, either bus or device. The qdev path / resolves to the root, i.e. the main system bus. Another aspect: A path may start with an arbitrary bus name, not only the system bus. Although this is ambiguous, we need to keep it for addressing the bus itself due to existing client use. But, IMO, we should at least start deprecating this for addressing elements below that bus (e.g. pci.0/e1000). I think this would be better served by adding explicit aliases/IDs for those use-cases. i.e. define the global ID pci.0 to be an alias for /i440FX-pcihost/pci Makes sense. We could attach this ID to the BusState (corresponding to DeviceState:id) and manually set it during machine init. qbus_find_recursive would then look for a matching ID instead of a name. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 1/2] [scsi-bus]: Add PR-OUT and PR-IN case for SCSIRequest xfer and xfer_mode setup
Am 04.06.2010 16:06, schrieb Kevin Wolf: Am 31.05.2010 03:43, schrieb Nicholas A. Bellinger: From: Nicholas Bellinger n...@linux-iscsi.org This patch updates hw/scsi-bus.c to add PERSISTENT_RESERVE_OUT and PERSISTENT_RESERVE_IN case in scsi_req_length() to extra the incoming buffer length into SCSIRequest-cmd.xfer, and adds a second PERSISTENT_RESERVE_OUT case in scsi_req_xfer_mode() in order to properly set SCSI_XFER_TO_DEV for WRITE data. Tested with Linux KVM guests and Megasas 8708EM2 HBA emulation and TCM_Loop target ports. Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org --- hw/scsi-bus.c |5 + 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c index b8e4b71..75ec74e 100644 --- a/hw/scsi-bus.c +++ b/hw/scsi-bus.c @@ -325,6 +325,10 @@ static int scsi_req_length(SCSIRequest *req, uint8_t *cmd) case INQUIRY: req-cmd.xfer = cmd[4] | (cmd[3] 8); break; +case PERSISTENT_RESERVE_OUT: +case PERSISTENT_RESERVE_IN: +req-cmd.xfer = cmd[8] | (cmd[7] 8); Maybe I'm missing something, but isn't exactly the same value set in the switch block above? (for cmd[0] 5 == 2) Nicholas? This isn't applied yet because I'm waiting for your answer. Is there a reason why it makes sense to do it explicitly here instead using the generic code a few lines above? I think the same applied to patch 2/2. Kevin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: RFC qdev path semantics
I think this would be better served by adding explicit aliases/IDs for those use-cases. i.e. define the global ID pci.0 to be an alias for /i440FX-pcihost/pci Makes sense. We could attach this ID to the BusState (corresponding to DeviceState:id) and manually set it during machine init. qbus_find_recursive would then look for a matching ID instead of a name. If we accept your proposal that global ids are not accepted as local device identifiers, then these probably shouldn't be stored in the device tree at all. Allowing at most one global ID per node also seems like a rather arbitrary limitation. How about having a global alias table that maps an ID to a devtree path? The code to handle these becomes a straight string substitution followed by a normal path lookup. We don't need separate code for device v.s. bus, and global_device_id/busname just works. Paul -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1
On Mon, Jun 14, 2010, Avi Kivity wrote about Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1: +struct __attribute__ ((__packed__)) vmcs12 { +/* According to the Intel spec, a VMCS region must start with the + * following two fields. Then follow implementation-specific data. + */ +u32 revision_id; +u32 abort; +}; Note that this structure becomes an ABI, it cannot change except in a backward compatible way due to the need for live migration. So I'd like a documentation patch that adds a description of the content to Documentation/kvm/. It can be as simple as listing the structure definition. I agree that if struct vmcs12 is changed, this will cause problems for live migration, but why does this mean that the struct's fields or layout an ABI worth documenting? After all, isn't the idea of VMCS that its internal content and layout is opaque for the L1 guest - he can only read/write it with VMREAD/VMWRITE, and those two instructions are the ABI (which is of course documented in the Intel spec) - not the content of the vmcs12 structure. Even if the guest knew the exact layout of this structure, he's not supposed to use it. By the way, we have not actually checked that live migration is working as expected with nested virtualization running. I expect there to be more pitfalls and bugs even before we consider migration between two different versions. We would indeed like to allow live migration of different kinds (of L1 with all its L2 guests; Of all L2 guests of a L1; Of a single L2 guest), but we're trying to finish the more basic functionality first. Thanks, Nadav. -- Nadav Har'El|Wednesday, Jun 16 2010, 4 Tammuz 5770 n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |It is error alone which needs the support http://nadav.harel.org.il |of government. Truth can stand by itself. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] Monitor support QEMU trace framework
This is v3 of a set of patches to enable trace visualization control via the QEMU monitor. It is based on trace infrastructure posted by Stefan : ( http://lists.gnu.org/archive/html/qemu-devel/2010-05/msg02407.html ) This patchset adds monitor commands : - info trace : to view current contents of the trace buffer - info tracepoints : to view all available tracepoints and their state. - tracepoint NAME on|off : to enable/disable the logging of data from tracepoint 'NAME' to on/off. Changelog : - Clean-ups from v2, particularly relating to export of tdb_hash() -- Prerna Saxena Linux Technology Centre, IBM Systems and Technology Lab, Bangalore, India -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] Export hash function
Rename tdb_hash() to qemu_hash(). Move definition from qdict.c to a new file qemu-misc.c for use by tracing infrastructure. Signed-off-by: Prerna Saxena pre...@linux.vnet.ibm.com --- Makefile.objs |2 +- qdict.c | 24 qemu-common.h |3 +++ qemu-misc.c | 24 4 files changed, 32 insertions(+), 21 deletions(-) create mode 100644 qemu-misc.c diff --git a/Makefile.objs b/Makefile.objs index 7cb40ac..53e3a65 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -1,6 +1,6 @@ ### # QObject -qobject-obj-y = qint.o qstring.o qdict.o qlist.o qfloat.o qbool.o +qobject-obj-y = qint.o qstring.o qdict.o qemu-misc.o qlist.o qfloat.o qbool.o qobject-obj-y += qjson.o json-lexer.o json-streamer.o json-parser.o qobject-obj-y += qerror.o diff --git a/qdict.c b/qdict.c index 175bc17..d4588ef 100644 --- a/qdict.c +++ b/qdict.c @@ -53,22 +53,6 @@ QDict *qobject_to_qdict(const QObject *obj) } /** - * tdb_hash(): based on the hash agorithm from gdbm, via tdb - * (from module-init-tools) - */ -static unsigned int tdb_hash(const char *name) -{ -unsigned value;/* Used to compute the hash value. */ -unsigned i; /* Used to cycle through random values. */ - -/* Set the initial value from the key size. */ -for (value = 0x238F13AF * strlen(name), i=0; name[i]; i++) -value = (value + (((const unsigned char *)name)[i] (i*5 % 24))); - -return (1103515243 * value + 12345); -} - -/** * alloc_entry(): allocate a new QDictEntry */ static QDictEntry *alloc_entry(const char *key, QObject *value) @@ -113,7 +97,7 @@ void qdict_put_obj(QDict *qdict, const char *key, QObject *value) unsigned int hash; QDictEntry *entry; -hash = tdb_hash(key) % QDICT_HASH_SIZE; +hash = qemu_hash(key) % QDICT_HASH_SIZE; entry = qdict_find(qdict, key, hash); if (entry) { /* replace key's value */ @@ -137,7 +121,7 @@ QObject *qdict_get(const QDict *qdict, const char *key) { QDictEntry *entry; -entry = qdict_find(qdict, key, tdb_hash(key) % QDICT_HASH_SIZE); +entry = qdict_find(qdict, key, qemu_hash(key) % QDICT_HASH_SIZE); return (entry == NULL ? NULL : entry-value); } @@ -148,7 +132,7 @@ QObject *qdict_get(const QDict *qdict, const char *key) */ int qdict_haskey(const QDict *qdict, const char *key) { -unsigned int hash = tdb_hash(key) % QDICT_HASH_SIZE; +unsigned int hash = qemu_hash(key) % QDICT_HASH_SIZE; return (qdict_find(qdict, key, hash) == NULL ? 0 : 1); } @@ -347,7 +331,7 @@ void qdict_del(QDict *qdict, const char *key) { QDictEntry *entry; -entry = qdict_find(qdict, key, tdb_hash(key) % QDICT_HASH_SIZE); +entry = qdict_find(qdict, key, qemu_hash(key) % QDICT_HASH_SIZE); if (entry) { QLIST_REMOVE(entry, next); qentry_destroy(entry); diff --git a/qemu-common.h b/qemu-common.h index a4888e5..d225e45 100644 --- a/qemu-common.h +++ b/qemu-common.h @@ -17,6 +17,9 @@ typedef struct QEMUTimer QEMUTimer; typedef struct QEMUFile QEMUFile; typedef struct QEMUBH QEMUBH; +/* Hash function definition */ +unsigned int qemu_hash(const char *name); + /* Hack around the mess dyngen-exec.h causes: We need QEMU_NORETURN in files that cannot include the following headers without conflicts. This condition has to be removed once dyngen is gone. */ diff --git a/qemu-misc.c b/qemu-misc.c new file mode 100644 index 000..a69196a --- /dev/null +++ b/qemu-misc.c @@ -0,0 +1,24 @@ +/* + * Definition of tdb_hash() moved here, from qdict.c. Renamed to qemu_hash() + * + * This work is licensed under the terms of the GNU LGPL, version 2.1 or later. + * See the COPYING.LIB file in the top-level directory. + * + */ +#include qemu-common.h + +/** + * tdb_hash(): based on the hash agorithm from gdbm, via tdb + * (from module-init-tools). Renamed to qemu_hash(). + */ +unsigned int qemu_hash(const char *name) +{ +unsigned value; /* Used to compute the hash value. */ +unsigned i; /* Used to cycle through random values. */ + +/* Set the initial value from the key size. */ +for (value = 0x238F13AF * strlen(name), i=0; name[i]; i++) +value = (value + (((const unsigned char *)name)[i] (i*5 % 24))); + +return (1103515243 * value + 12345); +} -- Prerna Saxena Linux Technology Centre, IBM Systems and Technology Lab, Bangalore, India -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] Monitor command 'info trace'
Monitor command 'info trace' to display contents of trace buffer Signed-off-by: Prerna Saxena pre...@linux.vnet.ibm.com --- configure |3 +++ monitor.c | 12 qemu-monitor.hx |4 simpletrace.c | 13 + tracetool |2 ++ 5 files changed, 34 insertions(+), 0 deletions(-) diff --git a/configure b/configure index 675d0fc..56af8dd 100755 --- a/configure +++ b/configure @@ -2302,6 +2302,9 @@ bsd) esac echo TRACE_BACKEND=$trace_backend $config_host_mak +if test $trace_backend = simple; then + echo CONFIG_SIMPLE_TRACE=y $config_host_mak +fi if test $trace_backend = ust; then LIBS=-lust $LIBS fi diff --git a/monitor.c b/monitor.c index ad50f12..8b60830 100644 --- a/monitor.c +++ b/monitor.c @@ -55,6 +55,9 @@ #include json-streamer.h #include json-parser.h #include osdep.h +#ifdef CONFIG_SIMPLE_TRACE +#include trace.h +#endif //#define DEBUG //#define DEBUG_COMPLETION @@ -2780,6 +2783,15 @@ static const mon_cmd_t info_cmds[] = { .help = show roms, .mhandler.info = do_info_roms, }, +#if defined(CONFIG_SIMPLE_TRACE) +{ +.name = trace, +.args_type = , +.params = , +.help = show current contents of trace buffer, +.mhandler.info = do_info_trace, +}, +#endif { .name = NULL, }, diff --git a/qemu-monitor.hx b/qemu-monitor.hx index b6e3467..766c30f 100644 --- a/qemu-monitor.hx +++ b/qemu-monitor.hx @@ -114,6 +114,10 @@ show migration status show balloon information @item info qtree show device tree +#ifdef CONFIG_SIMPLE_TRACE +...@item info trace +show contents of trace buffer +#endif @end table ETEXI diff --git a/simpletrace.c b/simpletrace.c index 2fec4d3..239ae3f 100644 --- a/simpletrace.c +++ b/simpletrace.c @@ -62,3 +62,16 @@ void trace4(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long void trace5(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4, unsigned long x5) { trace(event, x1, x2, x3, x4, x5); } + +void do_info_trace(Monitor *mon) +{ +unsigned int i, max_idx; + +max_idx = trace_idx ? trace_idx : TRACE_BUF_LEN; + +for (i=0; imax_idx ;i++) { +monitor_printf(mon, Event %ld : %ld %ld %ld %ld %ld\n, + trace_buf[i].event, trace_buf[i].x1, trace_buf[i].x2, +trace_buf[i].x3, trace_buf[i].x4, trace_buf[i].x5); +} +} diff --git a/tracetool b/tracetool index 9ea9c08..2c73bab 100755 --- a/tracetool +++ b/tracetool @@ -130,6 +130,7 @@ void trace2(TraceEvent event, unsigned long x1, unsigned long x2); void trace3(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3); void trace4(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4); void trace5(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4, unsigned long x5); +void do_info_trace(Monitor *mon); EOF simple_event_num=0 @@ -289,6 +290,7 @@ tracetoh() #define TRACE_H #include qemu-common.h +#include monitor.h EOF convert h echo #endif /* TRACE_H */ -- Prerna Saxena Linux Technology Centre, IBM Systems and Technology Lab, Bangalore, India -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] Toggle tracepoint state
This patch adds support for dynamically enabling/disabling of tracepoints. This is done by internally maintaining each tracepoint's state, and permitting logging of data from a tracepoint only if it is in an 'active' state. Monitor commands added : 1) info tracepoints : to view all available tracepoints and their state. 2) tracepoint NAME on|off : to enable/disable data logging from a given tracepoint. Eg, tracepoint paio_submit off disables logging of data when paio_submit is hit. Signed-off-by: Prerna Saxena pre...@linux.vnet.ibm.com --- monitor.c | 16 ++ qemu-monitor.hx | 18 simpletrace.c | 63 +++ tracetool | 30 +++--- vl.c|6 + 5 files changed, 129 insertions(+), 4 deletions(-) diff --git a/monitor.c b/monitor.c index 8b60830..238bdf0 100644 --- a/monitor.c +++ b/monitor.c @@ -548,6 +548,15 @@ static void do_commit(Monitor *mon, const QDict *qdict) } } +#ifdef CONFIG_SIMPLE_TRACE +static void do_change_tracepoint_state(Monitor *mon, const QDict *qdict) +{ +const char *tp_name = qdict_get_str(qdict, name); +bool new_state = qdict_get_bool(qdict, option); +change_tracepoint_state(tp_name, new_state); +} +#endif + static void user_monitor_complete(void *opaque, QObject *ret_data) { MonitorCompletionData *data = (MonitorCompletionData *)opaque; @@ -2791,6 +2800,13 @@ static const mon_cmd_t info_cmds[] = { .help = show current contents of trace buffer, .mhandler.info = do_info_trace, }, +{ +.name = tracepoints, +.args_type = , +.params = , +.help = show available tracepoints their state, +.mhandler.info = do_info_all_tracepoints, +}, #endif { .name = NULL, diff --git a/qemu-monitor.hx b/qemu-monitor.hx index 766c30f..8540b8f 100644 --- a/qemu-monitor.hx +++ b/qemu-monitor.hx @@ -117,6 +117,8 @@ show device tree #ifdef CONFIG_SIMPLE_TRACE @item info trace show contents of trace buffer +...@item info tracepoints +show available tracepoints and their state #endif @end table ETEXI @@ -225,6 +227,22 @@ STEXI @item logfile @var{filename} @findex logfile Output logs to @var{filename}. +#ifdef CONFIG_SIMPLE_TRACE +ETEXI + +{ +.name = tracepoint, +.args_type = name:s,option:b, +.params = name on|off, +.help = changes status of a specific tracepoint, +.mhandler.cmd = do_change_tracepoint_state, +}, + +STEXI +...@item tracepoint +...@findex tracepoint +changes status of a tracepoint +#endif ETEXI { diff --git a/simpletrace.c b/simpletrace.c index 239ae3f..4221a8f 100644 --- a/simpletrace.c +++ b/simpletrace.c @@ -3,6 +3,12 @@ #include trace.h typedef struct { +char *tp_name; +bool state; +unsigned int hash; +} Tracepoint; + +typedef struct { unsigned long event; unsigned long x1; unsigned long x2; @@ -18,11 +24,29 @@ enum { static TraceRecord trace_buf[TRACE_BUF_LEN]; static unsigned int trace_idx; static FILE *trace_fp; +static Tracepoint trace_list[NR_TRACEPOINTS]; + +void init_tracepoint(const char *tname, TraceEvent tevent) +{ +if (!tname || tevent NR_TRACEPOINTS) { +return; +} + +trace_list[tevent].tp_name = (char*)qemu_malloc(strlen(tname)+1); +strncpy(trace_list[tevent].tp_name, tname, strlen(tname)); +trace_list[tevent].hash = qemu_hash(tname); +trace_list[tevent].state = 1; /* Enable all by default */ +} static void trace(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4, unsigned long x5) { TraceRecord *rec = trace_buf[trace_idx]; + +if (!trace_list[event].state) { +return; +} + rec-event = event; rec-x1 = x1; rec-x2 = x2; @@ -75,3 +99,42 @@ void do_info_trace(Monitor *mon) trace_buf[i].x3, trace_buf[i].x4, trace_buf[i].x5); } } + +void do_info_all_tracepoints(Monitor *mon) +{ +unsigned int i; + +for (i=0; iNR_TRACEPOINTS; i++) { +monitor_printf(mon, %s [Event ID %u] : state %u\n, +trace_list[i].tp_name, i, trace_list[i].state); +} +} + +static Tracepoint* find_tracepoint_by_name(const char *tname) +{ +unsigned int i, name_hash; + +if (!tname) { +return NULL; +} + +name_hash = qemu_hash(tname); + +for (i=0; iNR_TRACEPOINTS; i++) { +if (trace_list[i].hash == name_hash + !strncmp(trace_list[i].tp_name, tname, strlen(tname))) { +return trace_list[i]; +} +} +return NULL; /* indicates
Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()
Jan Kiszka jan.kis...@siemens.com writes: Markus Armbruster wrote: Jan Kiszka jan.kis...@siemens.com writes: [...] The format I will propose is global-ID|/absolute/path, no more /path/global-ID as this comes with the risk of ambiguity (ID may shadow bus-local name of a device). Doesn't this break existing usage? Please name one. It could only be weird corner cases like device_add driver,bus=bus1/ID/bus2 or bus=ID/bus. But given that we always allowed to address bus2 directly (and this is something I cannot and will not change), does this really matter? Maybe allowing paths to start with an ID is something worth considering, will think about this again. I checked with Dan: libvirt doesn't care. So I don't either. [...] -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 1/2] [scsi-bus]: Add PR-OUT and PR-IN case for SCSIRequest xfer and xfer_mode setup
On Wed, 2010-06-16 at 14:13 +0200, Kevin Wolf wrote: Am 04.06.2010 16:06, schrieb Kevin Wolf: Am 31.05.2010 03:43, schrieb Nicholas A. Bellinger: From: Nicholas Bellinger n...@linux-iscsi.org This patch updates hw/scsi-bus.c to add PERSISTENT_RESERVE_OUT and PERSISTENT_RESERVE_IN case in scsi_req_length() to extra the incoming buffer length into SCSIRequest-cmd.xfer, and adds a second PERSISTENT_RESERVE_OUT case in scsi_req_xfer_mode() in order to properly set SCSI_XFER_TO_DEV for WRITE data. Tested with Linux KVM guests and Megasas 8708EM2 HBA emulation and TCM_Loop target ports. Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org --- hw/scsi-bus.c |5 + 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c index b8e4b71..75ec74e 100644 --- a/hw/scsi-bus.c +++ b/hw/scsi-bus.c @@ -325,6 +325,10 @@ static int scsi_req_length(SCSIRequest *req, uint8_t *cmd) case INQUIRY: req-cmd.xfer = cmd[4] | (cmd[3] 8); break; +case PERSISTENT_RESERVE_OUT: +case PERSISTENT_RESERVE_IN: +req-cmd.xfer = cmd[8] | (cmd[7] 8); Maybe I'm missing something, but isn't exactly the same value set in the switch block above? (for cmd[0] 5 == 2) Nicholas? This isn't applied yet because I'm waiting for your answer. Is there a reason why it makes sense to do it explicitly here instead using the generic code a few lines above? I think the same applied to patch 2/2. Hi Kevin, I just tested this again and you are correct, the reassignment of req-cmd.xfer for PR and Maintence CDBs is unnecessary in scsi_req_length(). I will go ahead and drop part this from my tree. Please let me know if you would like me to resend the patch series. Best, --nab -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC qdev path semantics
Markus Armbruster arm...@redhat.com writes: [...] Bus names are chosen by the system as follows: * If the driver of the parent device model provides a name, use that. * Else, if the parent device has id ID, use ID.NUM, where NUM is the bus number, counting from zero in creation order. * Else, use TYPE.NUM, where TYPE is derived from the bus type, and NUM is the bus number, as above. ### Paul proposes to drop ID.NUM. ABI change: -device lsi,id=my-scsi -device scsi-disk,bus=my-scsi.0 no longer works. ### Paul proposes to either drop TYPE.NUM (and require drivers to provide bus names), or make NUM count separately for each bus type. Likewise. I'm not saying we can't do this, just that we need to consider backward compatibility. [...] -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 01/17] Eliminate duplicated timer code
On Mon, Jun 14, 2010 at 09:34:03PM -1000, Zachary Amsden wrote: Move duplicated timer related code into arch_vcpu_load rather than vendor callouts. Should be an isomorphic transformation. Signed-off-by: Zachary Amsden zams...@redhat.com Acked-by: Glauber Costa glom...@redhat.com -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 1/2] [scsi-bus]: Add PR-OUT and PR-IN case for SCSIRequest xfer and xfer_mode setup
Am 16.06.2010 15:03, schrieb Nicholas A. Bellinger: On Wed, 2010-06-16 at 14:13 +0200, Kevin Wolf wrote: Am 04.06.2010 16:06, schrieb Kevin Wolf: Am 31.05.2010 03:43, schrieb Nicholas A. Bellinger: From: Nicholas Bellinger n...@linux-iscsi.org This patch updates hw/scsi-bus.c to add PERSISTENT_RESERVE_OUT and PERSISTENT_RESERVE_IN case in scsi_req_length() to extra the incoming buffer length into SCSIRequest-cmd.xfer, and adds a second PERSISTENT_RESERVE_OUT case in scsi_req_xfer_mode() in order to properly set SCSI_XFER_TO_DEV for WRITE data. Tested with Linux KVM guests and Megasas 8708EM2 HBA emulation and TCM_Loop target ports. Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org --- hw/scsi-bus.c |5 + 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c index b8e4b71..75ec74e 100644 --- a/hw/scsi-bus.c +++ b/hw/scsi-bus.c @@ -325,6 +325,10 @@ static int scsi_req_length(SCSIRequest *req, uint8_t *cmd) case INQUIRY: req-cmd.xfer = cmd[4] | (cmd[3] 8); break; +case PERSISTENT_RESERVE_OUT: +case PERSISTENT_RESERVE_IN: +req-cmd.xfer = cmd[8] | (cmd[7] 8); Maybe I'm missing something, but isn't exactly the same value set in the switch block above? (for cmd[0] 5 == 2) Nicholas? This isn't applied yet because I'm waiting for your answer. Is there a reason why it makes sense to do it explicitly here instead using the generic code a few lines above? I think the same applied to patch 2/2. Hi Kevin, I just tested this again and you are correct, the reassignment of req-cmd.xfer for PR and Maintence CDBs is unnecessary in scsi_req_length(). I will go ahead and drop part this from my tree. Please let me know if you would like me to resend the patch series. Sure, taking simpler code is always better, so I'd be happier with a v2. Kevin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 1/2] [scsi-bus]: Add PR-OUT and PR-IN case for SCSIRequest xfer and xfer_mode setup
On Wed, 2010-06-16 at 06:03 -0700, Nicholas A. Bellinger wrote: On Wed, 2010-06-16 at 14:13 +0200, Kevin Wolf wrote: Am 04.06.2010 16:06, schrieb Kevin Wolf: Am 31.05.2010 03:43, schrieb Nicholas A. Bellinger: From: Nicholas Bellinger n...@linux-iscsi.org This patch updates hw/scsi-bus.c to add PERSISTENT_RESERVE_OUT and PERSISTENT_RESERVE_IN case in scsi_req_length() to extra the incoming buffer length into SCSIRequest-cmd.xfer, and adds a second PERSISTENT_RESERVE_OUT case in scsi_req_xfer_mode() in order to properly set SCSI_XFER_TO_DEV for WRITE data. Tested with Linux KVM guests and Megasas 8708EM2 HBA emulation and TCM_Loop target ports. Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org --- hw/scsi-bus.c |5 + 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c index b8e4b71..75ec74e 100644 --- a/hw/scsi-bus.c +++ b/hw/scsi-bus.c @@ -325,6 +325,10 @@ static int scsi_req_length(SCSIRequest *req, uint8_t *cmd) case INQUIRY: req-cmd.xfer = cmd[4] | (cmd[3] 8); break; +case PERSISTENT_RESERVE_OUT: +case PERSISTENT_RESERVE_IN: +req-cmd.xfer = cmd[8] | (cmd[7] 8); Maybe I'm missing something, but isn't exactly the same value set in the switch block above? (for cmd[0] 5 == 2) Nicholas? This isn't applied yet because I'm waiting for your answer. Is there a reason why it makes sense to do it explicitly here instead using the generic code a few lines above? I think the same applied to patch 2/2. Hi Kevin, I just tested this again and you are correct, the reassignment of req-cmd.xfer for PR and Maintence CDBs is unnecessary in scsi_req_length(). I will go ahead and drop part this from my tree. Please let me know if you would like me to resend the patch series. Actually, I should mention that I have only tested this with TYPE_DISK so far. I think we do still need the Maintenance CDBs length reassignment for MMC and TYPE_ROM from the second patch: diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c index 75ec74e..7d80405 100644 --- a/hw/scsi-bus.c +++ b/hw/scsi-bus.c @@ -329,6 +329,17 @@ static int scsi_req_length(SCSIRequest *req, uint8_t *cmd) case PERSISTENT_RESERVE_IN: req-cmd.xfer = cmd[8] | (cmd[7] 8); break; +case MAINTENANCE_OUT: +case MAINTENANCE_IN: +if (req-dev-type != TYPE_ROM) { +/* Used for MI_REPORT_TARGET_PGS, MO_SET_TARGET_PGS et al. */ +req-cmd.xfer = cmd[9] | (cmd[8] 8) | +(cmd[7] 16) | (cmd[6] 24); +} else { +/* GPCMD_REPORT_KEY and GPCMD_SEND_KEY from multi media commands */ +req-cmd.xfer = cmd[9] | (cmd[8] 8); +} +break; } Do you have a problem with leaving this reassignment in for TYPE_ROM..? Best, --nab -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1
On 06/16/2010 03:24 PM, Nadav Har'El wrote: On Mon, Jun 14, 2010, Avi Kivity wrote about Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1: +struct __attribute__ ((__packed__)) vmcs12 { + /* According to the Intel spec, a VMCS region must start with the +* following two fields. Then follow implementation-specific data. +*/ + u32 revision_id; + u32 abort; +}; Note that this structure becomes an ABI, it cannot change except in a backward compatible way due to the need for live migration. So I'd like a documentation patch that adds a description of the content to Documentation/kvm/. It can be as simple as listing the structure definition. I agree that if struct vmcs12 is changed, this will cause problems for live migration, but why does this mean that the struct's fields or layout an ABI worth documenting? It's a way of adding a barrier to changing it, and of determining which versions are compatible with which other versions. After all, isn't the idea of VMCS that its internal content and layout is opaque for the L1 guest - he can only read/write it with VMREAD/VMWRITE, and those two instructions are the ABI (which is of course documented in the Intel spec) - not the content of the vmcs12 structure. Even if the guest knew the exact layout of this structure, he's not supposed to use it. Right, it's only migration, not for guest use. Or perhaps for someone debugging a hypervisor. By the way, we have not actually checked that live migration is working as expected with nested virtualization running. I expect there to be more pitfalls and bugs even before we consider migration between two different versions. We would indeed like to allow live migration of different kinds (of L1 with all its L2 guests; Of all L2 guests of a L1; Of a single L2 guest), but we're trying to finish the more basic functionality first. Live migration will not work without ioctls to save/load the vmptr and vmxon state. nsvm has a hack where they force an exit before existing to host userspace, so host userspace never sees guest mode. I don't like it much, and in any case it can't work for nvmx since you need to migrate the vmxon state. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 03/17] Unify vendor TSC logic
On Wed, Jun 16, 2010 at 04:10:10PM +0800, Jason Wang wrote: Zachary Amsden wrote: void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { + kvm_x86_ops-vcpu_load(vcpu, cpu); if (unlikely(vcpu-cpu != cpu)) { + /* Make sure TSC doesn't go backwards */ + s64 tsc_delta = !vcpu-arch.last_host_tsc ? 0 : + native_read_tsc() - vcpu-arch.last_host_tsc; + if (tsc_delta 0 || check_tsc_unstable()) It's better to do the adjustment also when tsc_delta 0 And why do you think so? Doing it on tsc_delta 0 would force us to adjust at every entry but the first. And I guess we want to adjust as few times as we can. For example, we would adjust on every cpu bounce even for machines that has a perfectly sync tsc. This could introduce an error not present before. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 1/2] [scsi-bus]: Add PR-OUT and PR-IN case for SCSIRequest xfer and xfer_mode setup
Am 16.06.2010 15:09, schrieb Nicholas A. Bellinger: On Wed, 2010-06-16 at 06:03 -0700, Nicholas A. Bellinger wrote: On Wed, 2010-06-16 at 14:13 +0200, Kevin Wolf wrote: Am 04.06.2010 16:06, schrieb Kevin Wolf: Am 31.05.2010 03:43, schrieb Nicholas A. Bellinger: From: Nicholas Bellinger n...@linux-iscsi.org This patch updates hw/scsi-bus.c to add PERSISTENT_RESERVE_OUT and PERSISTENT_RESERVE_IN case in scsi_req_length() to extra the incoming buffer length into SCSIRequest-cmd.xfer, and adds a second PERSISTENT_RESERVE_OUT case in scsi_req_xfer_mode() in order to properly set SCSI_XFER_TO_DEV for WRITE data. Tested with Linux KVM guests and Megasas 8708EM2 HBA emulation and TCM_Loop target ports. Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org --- hw/scsi-bus.c |5 + 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c index b8e4b71..75ec74e 100644 --- a/hw/scsi-bus.c +++ b/hw/scsi-bus.c @@ -325,6 +325,10 @@ static int scsi_req_length(SCSIRequest *req, uint8_t *cmd) case INQUIRY: req-cmd.xfer = cmd[4] | (cmd[3] 8); break; +case PERSISTENT_RESERVE_OUT: +case PERSISTENT_RESERVE_IN: +req-cmd.xfer = cmd[8] | (cmd[7] 8); Maybe I'm missing something, but isn't exactly the same value set in the switch block above? (for cmd[0] 5 == 2) Nicholas? This isn't applied yet because I'm waiting for your answer. Is there a reason why it makes sense to do it explicitly here instead using the generic code a few lines above? I think the same applied to patch 2/2. Hi Kevin, I just tested this again and you are correct, the reassignment of req-cmd.xfer for PR and Maintence CDBs is unnecessary in scsi_req_length(). I will go ahead and drop part this from my tree. Please let me know if you would like me to resend the patch series. Actually, I should mention that I have only tested this with TYPE_DISK so far. I think we do still need the Maintenance CDBs length reassignment for MMC and TYPE_ROM from the second patch: diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c index 75ec74e..7d80405 100644 --- a/hw/scsi-bus.c +++ b/hw/scsi-bus.c @@ -329,6 +329,17 @@ static int scsi_req_length(SCSIRequest *req, uint8_t *cmd) case PERSISTENT_RESERVE_IN: req-cmd.xfer = cmd[8] | (cmd[7] 8); break; +case MAINTENANCE_OUT: +case MAINTENANCE_IN: +if (req-dev-type != TYPE_ROM) { +/* Used for MI_REPORT_TARGET_PGS, MO_SET_TARGET_PGS et al. */ +req-cmd.xfer = cmd[9] | (cmd[8] 8) | +(cmd[7] 16) | (cmd[6] 24); +} else { +/* GPCMD_REPORT_KEY and GPCMD_SEND_KEY from multi media commands */ +req-cmd.xfer = cmd[9] | (cmd[8] 8); +} +break; } Do you have a problem with leaving this reassignment in for TYPE_ROM..? In a place where it's different from what was set above, I'm not against it. I just wanted to reduce some code duplication. Kevin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC qdev path semantics
Bus names are chosen by the system as follows: * If the driver of the parent device model provides a name, use that. * Else, if the parent device has id ID, use ID.NUM, where NUM is the bus number, counting from zero in creation order. * Else, use TYPE.NUM, where TYPE is derived from the bus type, and NUM is the bus number, as above. ### Paul proposes to drop ID.NUM. ABI change: -device lsi,id=my-scsi -device scsi-disk,bus=my-scsi.0 no longer works. IMO this is a fundamentally broken ABI, so I don't care. ### Paul proposes to either drop TYPE.NUM (and require drivers to provide bus names), or make NUM count separately for each bus type. Likewise. I'd be surprised if anyone actually uses absolute device paths at this time, and they're probably going to be broken by other changes. Using these default bus names as global identifiers is fixable using aliases (e.g. -device lsi,bus=pci.0). I'd expect this to cover most interesting uses. See http://lists.nongnu.org/archive/html/qemu-devel/2010-06/msg02149.html Paul -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/17] Fix deep C-state TSC desynchronization
On Mon, Jun 14, 2010 at 09:34:06PM -1000, Zachary Amsden wrote: When CPUs with unstable TSCs enter deep C-state, TSC may stop running. This causes us to require resynchronization. Since we can't tell when this may potentially happen, we assume the worst by forcing re-compensation for it at every point the VCPU task is descheduled. can't we just compensate everytime check_tsc_unstable() is true? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/17] Keep SMP VMs more in sync on unstable TSC
On Mon, Jun 14, 2010 at 09:34:07PM -1000, Zachary Amsden wrote: SMP VMs on machines with unstable TSC have their TSC offset adjusted by the local offset delta from last measurement. This does not take into account how long it has been since the measurement, leading to drift. Minimize the drift by accounting for any time difference the kernel has observed. Signed-off-by: Zachary Amsden zams...@redhat.com I believe this should be done not only if we have check_tsc_unstable() == true, but anytime we adjust the tsc. I mean: Sure it is expected to be much more relevant in this case, but if we're testing generally for tsc_delta 0 in the adjustment code, it is because we believe it can happen, even if tsc is stable (otherwise, we'd better take it off completely). And in that case, we should account elapsed time too. --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/kvm/x86.c | 20 +++- 2 files changed, 20 insertions(+), 1 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 94f6ce8..1afecd7 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -337,6 +337,7 @@ struct kvm_vcpu_arch { unsigned int time_offset; struct page *time_page; u64 last_host_tsc; + u64 last_host_ns; bool nmi_pending; bool nmi_injected; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 618c435..b1bdf05 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1810,6 +1810,19 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) /* Make sure TSC doesn't go backwards */ s64 tsc_delta = !vcpu-arch.last_host_tsc ? 0 : native_read_tsc() - vcpu-arch.last_host_tsc; + + /* Subtract elapsed cycle time from the delta computation */ + if (check_tsc_unstable() vcpu-arch.last_host_ns) { + s64 delta; + struct timespec ts; + ktime_get_ts(ts); + monotonic_to_bootbased(ts); + delta = timespec_to_ns(ts) - vcpu-arch.last_host_ns; + delta = delta * per_cpu(cpu_tsc_khz, cpu); + delta = delta / USEC_PER_SEC; + tsc_delta -= delta; + } + if (tsc_delta 0 || check_tsc_unstable()) kvm_x86_ops-adjust_tsc_offset(vcpu, -tsc_delta); kvm_migrate_timers(vcpu); @@ -1832,8 +1845,13 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) * vcpu-cpu != cpu can not detect this condition. So set * vcpu-cpu = -1 to force the recalibration above. */ - if (check_tsc_unstable()) + if (check_tsc_unstable()) { + struct timespec ts; + ktime_get_ts(ts); + monotonic_to_bootbased(ts); + vcpu-arch.last_host_ns = timespec_to_ns(ts); vcpu-cpu = -1; + } } static int is_efer_nx(void) -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 10/24] Implement VMPTRLD
On Sun, Jun 13, 2010 at 03:27:41PM +0300, Nadav Har'El wrote: This patch implements the VMPTRLD instruction. Signed-off-by: Nadav Har'El n...@il.ibm.com --- --- .before/arch/x86/kvm/vmx.c2010-06-13 15:01:29.0 +0300 +++ .after/arch/x86/kvm/vmx.c 2010-06-13 15:01:29.0 +0300 @@ -3829,6 +3829,26 @@ static int read_guest_vmcs_gpa(struct kv return 0; } +static void set_rflags_to_vmx_fail_invalid(struct kvm_vcpu *vcpu) +{ + unsigned long rflags; + rflags = vmx_get_rflags(vcpu); + rflags |= X86_EFLAGS_CF; + rflags = ~X86_EFLAGS_PF ~X86_EFLAGS_AF ~X86_EFLAGS_ZF + ~X86_EFLAGS_SF ~X86_EFLAGS_OF; + vmx_set_rflags(vcpu, rflags); +} + +static void set_rflags_to_vmx_fail_valid(struct kvm_vcpu *vcpu) +{ + unsigned long rflags; + rflags = vmx_get_rflags(vcpu); + rflags |= X86_EFLAGS_ZF; + rflags = ~X86_EFLAGS_PF ~X86_EFLAGS_AF ~X86_EFLAGS_CF + ~X86_EFLAGS_SF ~X86_EFLAGS_OF; + vmx_set_rflags(vcpu, rflags); +} + static void clear_rflags_cf_zf(struct kvm_vcpu *vcpu) { unsigned long rflags; @@ -3869,6 +3889,57 @@ static int handle_vmclear(struct kvm_vcp return 1; } +static bool verify_vmcs12_revision(struct kvm_vcpu *vcpu, gpa_t guest_vmcs_addr) +{ + bool ret; + struct vmcs12 *vmcs12; + struct page *vmcs_page = nested_get_page(vcpu, guest_vmcs_addr); + if (vmcs_page == NULL) + return 0; + vmcs12 = (struct vmcs12 *)kmap_atomic(vmcs_page, KM_USER0); + if (vmcs12-revision_id == VMCS12_REVISION) + ret = 1; + else { + set_rflags_to_vmx_fail_valid(vcpu); Should set VM-Instruction Error Field accordingly. + ret = 0; + } + kunmap_atomic(vmcs12, KM_USER0); + kvm_release_page_dirty(vmcs_page); + return ret; +} + +/* Emulate the VMPTRLD instruction */ +static int handle_vmptrld(struct kvm_vcpu *vcpu) +{ + struct vcpu_vmx *vmx = to_vmx(vcpu); + gpa_t guest_vmcs_addr; + + if (!nested_vmx_check_permission(vcpu)) + return 1; + + if (read_guest_vmcs_gpa(vcpu, guest_vmcs_addr)) { + set_rflags_to_vmx_fail_invalid(vcpu); + return 1; + } + + if (!verify_vmcs12_revision(vcpu, guest_vmcs_addr)) + return 1; Should check that guest_vmcs_addr != VMXON address. I think this check is missing from VMCLEAR too. + + if (vmx-nested.current_vmptr != guest_vmcs_addr) { + vmx-nested.current_vmptr = guest_vmcs_addr; + + if (nested_create_current_vmcs(vcpu)) { + printk(KERN_ERR %s error could not allocate memory, + __func__); + return -ENOMEM; + } + } + + clear_rflags_cf_zf(vcpu); + skip_emulated_instruction(vcpu); + return 1; +} + static int handle_invlpg(struct kvm_vcpu *vcpu) { unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION); @@ -4153,7 +4224,7 @@ static int (*kvm_vmx_exit_handlers[])(st [EXIT_REASON_VMCALL] = handle_vmcall, [EXIT_REASON_VMCLEAR] = handle_vmclear, [EXIT_REASON_VMLAUNCH]= handle_vmx_insn, - [EXIT_REASON_VMPTRLD] = handle_vmx_insn, + [EXIT_REASON_VMPTRLD] = handle_vmptrld, [EXIT_REASON_VMPTRST] = handle_vmx_insn, [EXIT_REASON_VMREAD] = handle_vmx_insn, [EXIT_REASON_VMRESUME]= handle_vmx_insn, -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/2] [scsi-bus]: Changes for PERSISTENT_RESERVE_OUT and MAINTENANCE CDBs v2
From: Nicholas Bellinger n...@linux-iscsi.org Greetings, Attached is v2 for a series containing a handful of changes required for scsi-bus in order to get PERSISTENT_RESERVE_OUT and MAINTENANCE CDBs working as expected. Thanks to Kevin Wolf for pointing out the unnecessary SCSIRequest-cmd.xfer reassignments in scsi_req_length() from the original series. Tested with Linux KVM guests and Megasas 8708EM2 HBA emulation + scsi-generic and TCM_Loop target ports. Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org Nicholas Bellinger (2): [scsi-bus]: Add PERSISTENT_RESERVE_OUT SCSIRequest-cmd.mode setup [scsi-bus]: Add MAINTENANCE_IN and MAINTENANCE_OUT SCSIRequest xfer and mode assignments hw/scsi-bus.c | 11 +++ hw/scsi-defs.h |2 ++ 2 files changed, 13 insertions(+), 0 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] [scsi-bus]: Add PERSISTENT_RESERVE_OUT SCSIRequest-cmd.mode setup
From: Nicholas Bellinger n...@linux-iscsi.org This patch updates hw/scsi-bus.c to add the PERSISTENT_RESERVE_OUT cdb case in scsi_req_xfer_mode() to set SCSI_XFER_TO_DEV for outgoing WRITE data. Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org --- hw/scsi-bus.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c index b8e4b71..7a5fa48 100644 --- a/hw/scsi-bus.c +++ b/hw/scsi-bus.c @@ -389,6 +389,7 @@ static void scsi_req_xfer_mode(SCSIRequest *req) case MEDIUM_SCAN: case SEND_VOLUME_TAG: case WRITE_LONG_2: +case PERSISTENT_RESERVE_OUT: req-cmd.mode = SCSI_XFER_TO_DEV; break; default: -- 1.5.6.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] [scsi-bus]: Add MAINTENANCE_IN and MAINTENANCE_OUT SCSIRequest xfer and mode assignments
From: Nicholas Bellinger n...@linux-iscsi.org This patch updates hw/scsi-bus.c to add MAINTENANCE_IN and MAINTENANCE_OUT case in scsi_req_length() for TYPE_ROM with MMC commands. It also adds the MAINTENANCE_OUT case in scsi_req_xfer_mode() to set SCSI_XFER_TO_DEV for outgoing write data. Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org --- hw/scsi-bus.c | 10 ++ hw/scsi-defs.h |2 ++ 2 files changed, 12 insertions(+), 0 deletions(-) diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c index 7a5fa48..ed3554c 100644 --- a/hw/scsi-bus.c +++ b/hw/scsi-bus.c @@ -325,6 +325,13 @@ static int scsi_req_length(SCSIRequest *req, uint8_t *cmd) case INQUIRY: req-cmd.xfer = cmd[4] | (cmd[3] 8); break; +case MAINTENANCE_OUT: +case MAINTENANCE_IN: +if (req-dev-type == TYPE_ROM) { +/* GPCMD_REPORT_KEY and GPCMD_SEND_KEY from multi media commands */ +req-cmd.xfer = cmd[9] | (cmd[8] 8); +} +break; } return 0; } @@ -390,6 +397,7 @@ static void scsi_req_xfer_mode(SCSIRequest *req) case SEND_VOLUME_TAG: case WRITE_LONG_2: case PERSISTENT_RESERVE_OUT: +case MAINTENANCE_OUT: req-cmd.mode = SCSI_XFER_TO_DEV; break; default: @@ -535,6 +543,8 @@ const char *scsi_command_name(uint8_t cmd) [ SPACE] = SPACE, [ INQUIRY ] = INQUIRY, [ RECOVER_BUFFERED_DATA] = RECOVER_BUFFERED_DATA, +[ MAINTENANCE_IN ] = MAINTENANCE_IN, +[ MAINTENANCE_OUT ] = MAINTENANCE_OUT, [ MODE_SELECT ] = MODE_SELECT, [ RESERVE ] = RESERVE, [ RELEASE ] = RELEASE, diff --git a/hw/scsi-defs.h b/hw/scsi-defs.h index 5890094..a4a3518 100644 --- a/hw/scsi-defs.h +++ b/hw/scsi-defs.h @@ -84,6 +84,8 @@ #define MODE_SENSE_10 0x5a #define PERSISTENT_RESERVE_IN 0x5e #define PERSISTENT_RESERVE_OUT 0x5f +#define MAINTENANCE_IN0xa3 +#define MAINTENANCE_OUT 0xa4 #define MOVE_MEDIUM 0xa5 #define READ_12 0xa8 #define WRITE_12 0xaa -- 1.5.6.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: RFC qdev path semantics
Paul Brook wrote: I think this would be better served by adding explicit aliases/IDs for those use-cases. i.e. define the global ID pci.0 to be an alias for /i440FX-pcihost/pci Makes sense. We could attach this ID to the BusState (corresponding to DeviceState:id) and manually set it during machine init. qbus_find_recursive would then look for a matching ID instead of a name. If we accept your proposal that global ids are not accepted as local device identifiers, then these probably shouldn't be stored in the device tree at all. Allowing at most one global ID per node also seems like a rather arbitrary limitation. How about having a global alias table that maps an ID to a devtree path? The code to handle these becomes a straight string substitution followed by a normal path lookup. We don't need separate code for device v.s. bus, and global_device_id/busname just works. No principle concerns about such a table. It should nevertheless be reference-based to assist lifetime management of aliases that are created together with the devices they point to. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Monitor support QEMU trace framework
Prerna Saxena wrote: This is v3 of a set of patches to enable trace visualization control via the QEMU monitor. It is based on trace infrastructure posted by Stefan : ( http://lists.gnu.org/archive/html/qemu-devel/2010-05/msg02407.html ) This patchset adds monitor commands : - info trace : to view current contents of the trace buffer - info tracepoints : to view all available tracepoints and their state. - tracepoint NAME on|off : to enable/disable the logging of data from tracepoint 'NAME' to on/off. Changelog : - Clean-ups from v2, particularly relating to export of tdb_hash() But it's still using the legacy monitor command interface. See [1]+[2] for a mechanism that allows you to convert to the new interface without automatically exposing your commands via QMP. Jan [1] http://thread.gmane.org/gmane.comp.emulators.qemu/74356 [2] http://thread.gmane.org/gmane.comp.emulators.qemu/74372 -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 16/17] TSC reset compensation
On Mon, Jun 14, 2010 at 09:34:18PM -1000, Zachary Amsden wrote: Attempt to synchronize TSCs which are reset to the same value. In the case of a reliable hardware TSC, we can just re-use the same offset, but on non-reliable hardware, we can get closer by adjusting the offset to match the elapsed time. Signed-off-by: Zachary Amsden zams...@redhat.com --- arch/x86/kvm/x86.c | 34 -- 1 files changed, 32 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 8e836e9..cedb71f 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -937,14 +937,44 @@ static inline void kvm_request_guest_time_update(struct kvm_vcpu *v) set_bit(KVM_REQ_CLOCK_SYNC, v-requests); } +static inline int kvm_tsc_reliable(void) +{ + return (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) + boot_cpu_has(X86_FEATURE_NONSTOP_TSC) + !check_tsc_unstable()); +} + why can't we re-use vmware TSC_RELIABLE flag? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 11/24] Implement VMPTRST
On Mon, Jun 14, 2010 at 12:15:10PM +0300, Avi Kivity wrote: On 06/13/2010 03:28 PM, Nadav Har'El wrote: This patch implements the VMPTRST instruction. Signed-off-by: Nadav Har'Eln...@il.ibm.com --- --- .before/arch/x86/kvm/x86.c 2010-06-13 15:01:29.0 +0300 +++ .after/arch/x86/kvm/x86.c2010-06-13 15:01:29.0 +0300 @@ -3301,7 +3301,7 @@ static int kvm_read_guest_virt_system(gv return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, 0, error); } -static int kvm_write_guest_virt_system(gva_t addr, void *val, +int kvm_write_guest_virt_system(gva_t addr, void *val, unsigned int bytes, struct kvm_vcpu *vcpu, u32 *error) write_guest_virt_system() is used by writes which need to ignore the cpl, for example when a cpl 3 instruction loads a segment, the processor needs to update the accessed flag even though it is only accessible to cpl 0. That's not your case, you need the ordinary write_guest_virt(). Um, I see there is no kvm_write_guest_virt(), you'll have to introduce it. the code uses this function after checking cpl to be zero, so may be it is ok, not to pretty though. I was actually hoping to get rid of all kvm_(read|write)_guest_virt* and replace existing uses with emulator_(read|write)_emulated, but this patch series adds more users that will be hard to replace :( +/* Emulate the VMPTRST instruction */ +static int handle_vmptrst(struct kvm_vcpu *vcpu) +{ +int r = 0; +unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION); +u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO); +gva_t vmcs_gva; + +if (!nested_vmx_check_permission(vcpu)) +return 1; + +vmcs_gva = get_vmx_mem_address(vcpu, exit_qualification, + vmx_instruction_info); +if (vmcs_gva == 0) +return 1; What's wrong with gva 0? It's favoured by exploiters everywhere. +r = kvm_write_guest_virt_system(vmcs_gva, + (void *)to_vmx(vcpu)-nested.current_vmptr, + sizeof(u64), vcpu, NULL); +if (r) { Check against the X86EMUL return codes. You'll need to inject a page fault on failure. +printk(KERN_INFO %s failed to write vmptr\n, __func__); +return 1; +} +clear_rflags_cf_zf(vcpu); +skip_emulated_instruction(vcpu); +return 1; +} + -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 11/17] Fix a possible backwards warp of kvmclock
On Wed, Jun 16, 2010 at 04:11:26PM +0800, Jason Wang wrote: Zachary Amsden wrote: Kernel time, which advances in discrete steps may progress much slower than TSC. As a result, when kvmclock is adjusted to a new base, the apparent time to the guest, which runs at a much higher, nsec scaled rate based on the current TSC, may have already been observed to have a larger value (kernel_ns + scaled tsc) than the value to which we are setting it (kernel_ns + 0). This is one issue of kvmclock which tries to supply a clocksource whose precision may even higher than host. What if we export to the guest the current clock resolution, and when doing guest reads, simply chop whatever value we got to the lowest acceptable value? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 12/17] Add helper function get_kernel_ns
On Wed, Jun 16, 2010 at 04:12:46PM +0800, Jason Wang wrote: Zachary Amsden wrote: On 06/14/2010 10:41 PM, Avi Kivity wrote: On 06/15/2010 10:34 AM, Zachary Amsden wrote: Add a helper function for the multiple places this is used. Note that it must not be called in preemptible context, as that would mean the kernel could enter software suspend state, which would cause non-atomic operation of the monotonic_to_bootbased computation. Open question: should the KVM_SET_CLOCK / KVM_GET_CLOCK ioctls use this as well? Currently, they are not bootbased (but perhaps should be). Signed-off-by: Zachary Amsdenzams...@redhat.com --- arch/x86/kvm/x86.c | 26 +- 1 files changed, 13 insertions(+), 13 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 703ea43..15c7317 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -915,6 +915,16 @@ static void kvm_get_time_scale(uint32_t scaled_khz, uint32_t base_khz, __func__, base_khz, scaled_khz, shift, *pmultiplier); } +static inline u64 get_kernel_ns(void) +{ +struct timespec ts; + +WARN_ON(preemptible()); +ktime_get_ts(ts); +monotonic_to_bootbased(ts); +return timespec_to_ns(ts); +} + static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz); Isn't something like this a candidate for the time infrastructure? Should it be? It certainly seems reasonable. Also, should we be using it for the cases KVM_GET_CLOCK / KVM_SET_CLOCK? It seems global kvmclock_offset for the VM ignores the bootbased conversion. Yes we should. indeed. Also shows the importance of doing it in a central place. Have we had an acessor like that before, we would probably not have made this mistake. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 11/17] Fix a possible backwards warp of kvmclock
On 06/16/2010 04:58 PM, Glauber Costa wrote: On Wed, Jun 16, 2010 at 04:11:26PM +0800, Jason Wang wrote: Zachary Amsden wrote: Kernel time, which advances in discrete steps may progress much slower than TSC. As a result, when kvmclock is adjusted to a new base, the apparent time to the guest, which runs at a much higher, nsec scaled rate based on the current TSC, may have already been observed to have a larger value (kernel_ns + scaled tsc) than the value to which we are setting it (kernel_ns + 0). This is one issue of kvmclock which tries to supply a clocksource whose precision may even higher than host. What if we export to the guest the current clock resolution, and when doing guest reads, simply chop whatever value we got to the lowest acceptable value? The clock resolution can change, and while we can expose it reliably through pvclock, do we need a notification so that the guest can update other internal structures? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Loss of network connectivity with high load
On 06/16/2010 12:00 AM, Daniel Bareiro wrote: Hi all! I'm using Linux 2.6.31.13 compiled with the kernel.org source code on KVM host with Debian GNU/Linux Lenny amd64. Also I'm using Debian GNU/Linux Lenny amd64 virtual machine with kernel 2.6.26-2 from Debian repositories. I'm using qemu-kvm 0.12.3. 2.6.26 is really old. Can you try a new kernel in the guest? Recent kernels had many virtio fixes. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] [scsi-bus]: Changes for PERSISTENT_RESERVE_OUT and MAINTENANCE CDBs v2
Am 16.06.2010 15:42, schrieb Nicholas A. Bellinger: From: Nicholas Bellinger n...@linux-iscsi.org Greetings, Attached is v2 for a series containing a handful of changes required for scsi-bus in order to get PERSISTENT_RESERVE_OUT and MAINTENANCE CDBs working as expected. Thanks to Kevin Wolf for pointing out the unnecessary SCSIRequest-cmd.xfer reassignments in scsi_req_length() from the original series. Tested with Linux KVM guests and Megasas 8708EM2 HBA emulation + scsi-generic and TCM_Loop target ports. Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org Thanks, applied to the block branch. One more thing: You usually put things like [scsi-bus] in brackets so that they are removed by git am. I think this is actually useful information to have in the commit log, so would you mind leaving the brackets out for your next submissions? Kevin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 12/24] Add VMCS fields to the vmcs12
On Sun, Jun 13, 2010 at 03:28:43PM +0300, Nadav Har'El wrote: In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the standard VMCS fields. These fields are encapsulated in a struct shadow_vmcs. Later patches will enable L1 to read and write these fields using VMREAD/ VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing a real VMCS for running L2. Signed-off-by: Nadav Har'El n...@il.ibm.com --- --- .before/arch/x86/kvm/vmx.c2010-06-13 15:01:29.0 +0300 +++ .after/arch/x86/kvm/vmx.c 2010-06-13 15:01:29.0 +0300 @@ -117,6 +117,136 @@ struct shared_msr_entry { u64 mask; }; +/* shadow_vmcs is a structure used in nested VMX for holding a copy of all + * standard VMCS fields. It is used for emulating a VMCS for L1 (see vmcs12), + * and also for easier access to VMCS data (see l1_shadow_vmcs). + */ +struct __attribute__ ((__packed__)) shadow_vmcs { + u16 virtual_processor_id; + u16 guest_es_selector; + u16 guest_cs_selector; + u16 guest_ss_selector; + u16 guest_ds_selector; + u16 guest_fs_selector; + u16 guest_gs_selector; + u16 guest_ldtr_selector; + u16 guest_tr_selector; + u16 host_es_selector; + u16 host_cs_selector; + u16 host_ss_selector; + u16 host_ds_selector; + u16 host_fs_selector; + u16 host_gs_selector; + u16 host_tr_selector; + u64 io_bitmap_a; + u64 io_bitmap_b; + u64 msr_bitmap; + u64 vm_exit_msr_store_addr; + u64 vm_exit_msr_load_addr; + u64 vm_entry_msr_load_addr; + u64 tsc_offset; + u64 virtual_apic_page_addr; + u64 apic_access_addr; + u64 ept_pointer; + u64 guest_physical_address; + u64 vmcs_link_pointer; + u64 guest_ia32_debugctl; + u64 guest_ia32_pat; + u64 guest_pdptr0; + u64 guest_pdptr1; + u64 guest_pdptr2; + u64 guest_pdptr3; + u64 host_ia32_pat; + u32 pin_based_vm_exec_control; + u32 cpu_based_vm_exec_control; + u32 exception_bitmap; + u32 page_fault_error_code_mask; + u32 page_fault_error_code_match; + u32 cr3_target_count; + u32 vm_exit_controls; + u32 vm_exit_msr_store_count; + u32 vm_exit_msr_load_count; + u32 vm_entry_controls; + u32 vm_entry_msr_load_count; + u32 vm_entry_intr_info_field; + u32 vm_entry_exception_error_code; + u32 vm_entry_instruction_len; + u32 tpr_threshold; + u32 secondary_vm_exec_control; + u32 vm_instruction_error; + u32 vm_exit_reason; + u32 vm_exit_intr_info; + u32 vm_exit_intr_error_code; + u32 idt_vectoring_info_field; + u32 idt_vectoring_error_code; + u32 vm_exit_instruction_len; + u32 vmx_instruction_info; + u32 guest_es_limit; + u32 guest_cs_limit; + u32 guest_ss_limit; + u32 guest_ds_limit; + u32 guest_fs_limit; + u32 guest_gs_limit; + u32 guest_ldtr_limit; + u32 guest_tr_limit; + u32 guest_gdtr_limit; + u32 guest_idtr_limit; + u32 guest_es_ar_bytes; + u32 guest_cs_ar_bytes; + u32 guest_ss_ar_bytes; + u32 guest_ds_ar_bytes; + u32 guest_fs_ar_bytes; + u32 guest_gs_ar_bytes; + u32 guest_ldtr_ar_bytes; + u32 guest_tr_ar_bytes; + u32 guest_interruptibility_info; + u32 guest_activity_state; + u32 guest_sysenter_cs; + u32 host_ia32_sysenter_cs; + unsigned long cr0_guest_host_mask; + unsigned long cr4_guest_host_mask; + unsigned long cr0_read_shadow; + unsigned long cr4_read_shadow; + unsigned long cr3_target_value0; + unsigned long cr3_target_value1; + unsigned long cr3_target_value2; + unsigned long cr3_target_value3; + unsigned long exit_qualification; + unsigned long guest_linear_address; + unsigned long guest_cr0; + unsigned long guest_cr3; + unsigned long guest_cr4; + unsigned long guest_es_base; + unsigned long guest_cs_base; + unsigned long guest_ss_base; + unsigned long guest_ds_base; + unsigned long guest_fs_base; + unsigned long guest_gs_base; + unsigned long guest_ldtr_base; + unsigned long guest_tr_base; + unsigned long guest_gdtr_base; + unsigned long guest_idtr_base; + unsigned long guest_dr7; + unsigned long guest_rsp; + unsigned long guest_rip; + unsigned long guest_rflags; + unsigned long guest_pending_dbg_exceptions; + unsigned long guest_sysenter_esp; + unsigned long guest_sysenter_eip; + unsigned long host_cr0; + unsigned long host_cr3; + unsigned long host_cr4; + unsigned long host_fs_base; + unsigned long host_gs_base; + unsigned long host_tr_base; + unsigned long host_gdtr_base; + unsigned long host_idtr_base; + unsigned long host_ia32_sysenter_esp; + unsigned long host_ia32_sysenter_eip; + unsigned long host_rsp; + unsigned long host_rip; +}; +
Re: [Qemu-devel] Re: RFC qdev path semantics
Paul Brook p...@codesourcery.com writes: Bus names are chosen by the system as follows: * If the driver of the parent device model provides a name, use that. * Else, if the parent device has id ID, use ID.NUM, where NUM is the bus number, counting from zero in creation order. * Else, use TYPE.NUM, where TYPE is derived from the bus type, and NUM is the bus number, as above. ### Paul proposes to drop ID.NUM. ABI change: -device lsi,id=my-scsi -device scsi-disk,bus=my-scsi.0 no longer works. IMO this is a fundamentally broken ABI, so I don't care. Users of this ABI won't appreciate that attitude. I do support dropping ID.NUM, but we owe our users due ABI diligence. ### Paul proposes to either drop TYPE.NUM (and require drivers to provide bus names), or make NUM count separately for each bus type. Likewise. I'd be surprised if anyone actually uses absolute device paths at this time, and they're probably going to be broken by other changes. Yes. Using these default bus names as global identifiers is fixable using aliases (e.g. -device lsi,bus=pci.0). I'd expect this to cover most interesting uses. See http://lists.nongnu.org/archive/html/qemu-devel/2010-06/msg02149.html Keeping the old bus names work for the buses we create automatically shouldn't be hard. Only a few, and no ID.NUM there. It's the user-created buses that worry me. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 3/9] replace x86 kvm n_free_mmu_pages with n_used_mmu_pages
On Tue, Jun 15, 2010 at 06:55:22AM -0700, Dave Hansen wrote: I think doing this makes the code much more readable. That's borne out by the fact that this patch removes code. used also happens to be the number that we need to return back to the slab code when our shrinker gets called. Keeping this value as opposed to free makes the next patch simpler. So, 'struct kvm' is kzalloc()'d. 'struct kvm_arch' is a structure member (and not a pointer) of 'struct kvm'. That means they start out zeroed. I _think_ they get initialized properly by kvm_mmu_change_mmu_pages(). But, that only happens via kvm ioctls. I have a suspicion that they values are actually inconsistent until those ioctls get called; free and alloc are both zero. But, the VM can't really get run until these ioctl()s get called anyway. There are also some checks for negative used_pages values which confused me. It might all tie together. Anyway, another benefit of storing 'used' intead of 'free' is that the values are consistent from the moment the structure is allocated: no negative used value. Signed-off-by: Dave Hansen d...@linux.vnet.ibm.com --- linux-2.6.git-dave/arch/x86/include/asm/kvm_host.h |2 - linux-2.6.git-dave/arch/x86/kvm/mmu.c | 29 +++-- linux-2.6.git-dave/arch/x86/kvm/mmu.h |3 +- 3 files changed, 13 insertions(+), 21 deletions(-) diff -puN arch/x86/include/asm/kvm_host.h~replace-free-with-used arch/x86/include/asm/kvm_host.h --- linux-2.6.git/arch/x86/include/asm/kvm_host.h~replace-free-with-used 2010-06-09 15:14:29.0 -0700 +++ linux-2.6.git-dave/arch/x86/include/asm/kvm_host.h2010-06-09 15:14:29.0 -0700 @@ -380,7 +380,7 @@ struct kvm_mem_aliases { struct kvm_arch { struct kvm_mem_aliases *aliases; - unsigned int n_free_mmu_pages; + unsigned int n_used_mmu_pages; unsigned int n_requested_mmu_pages; unsigned int n_max_mmu_pages; atomic_t invlpg_counter; diff -puN arch/x86/kvm/mmu.c~replace-free-with-used arch/x86/kvm/mmu.c --- linux-2.6.git/arch/x86/kvm/mmu.c~replace-free-with-used 2010-06-09 15:14:29.0 -0700 +++ linux-2.6.git-dave/arch/x86/kvm/mmu.c 2010-06-09 15:14:29.0 -0700 @@ -898,7 +898,7 @@ static void kvm_mmu_free_page(struct kvm __free_page(virt_to_page(sp-spt)); __free_page(virt_to_page(sp-gfns)); kfree(sp); - ++kvm-arch.n_free_mmu_pages; + --kvm-arch.n_used_mmu_pages; } static unsigned kvm_page_table_hashfn(gfn_t gfn) @@ -919,7 +919,7 @@ static struct kvm_mmu_page *kvm_mmu_allo bitmap_zero(sp-slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS); sp-multimapped = 0; sp-parent_pte = parent_pte; - --vcpu-kvm-arch.n_free_mmu_pages; + ++vcpu-kvm-arch.n_used_mmu_pages; return sp; } @@ -1516,39 +1516,30 @@ static int kvm_mmu_zap_page(struct kvm * /* * Changing the number of mmu pages allocated to the vm - * Note: if kvm_nr_mmu_pages is too small, you will get dead lock + * Note: if goal_nr_mmu_pages is too small, you will get dead lock */ -void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages) +void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int goal_nr_mmu_pages) { - int used_pages; - - used_pages = kvm-arch.n_max_mmu_pages - kvm_mmu_available_pages(kvm); - used_pages = max(0, used_pages); - /* * If we set the number of mmu pages to be smaller be than the * number of actived pages , we must to free some mmu pages before we * change the value */ - if (used_pages kvm_nr_mmu_pages) { - while (used_pages kvm_nr_mmu_pages + if (kvm-arch.n_used_mmu_pages goal_nr_mmu_pages) { + while (kvm-arch.n_used_mmu_pages goal_nr_mmu_pages !list_empty(kvm-arch.active_mmu_pages)) { struct kvm_mmu_page *page; page = container_of(kvm-arch.active_mmu_pages.prev, struct kvm_mmu_page, link); - used_pages -= kvm_mmu_zap_page(kvm, page); - used_pages--; + kvm-arch.n_used_mmu_pages -= kvm_mmu_zap_page(kvm, page); + kvm-arch.n_used_mmu_pages--; kvm-arch.n_used_mmu_pages is deaccounted for in kvm_mmu_zap_page - kvm_mmu_free_page already. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: x86 emulator: fix pusha instruction emulation
On Tue, Jun 15, 2010 at 09:00:16AM +0800, Wei Yongjun wrote: emulate pusha instruction only writeback the last EDI register, but the other registers which need to be writeback is ignored. This patch fixed it. --- arch/x86/kvm/emulate.c | 133 ++-- 1 files changed, 73 insertions(+), 60 deletions(-) Applied, thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 13/24] Implement VMREAD and VMWRITE
On Mon, Jun 14, 2010 at 12:36:02PM +0300, Avi Kivity wrote: On 06/13/2010 03:29 PM, Nadav Har'El wrote: Implement the VMREAD and VMWRITE instructions. With these instructions, L1 can read and write to the VMCS it is holding. The values are read or written to the fields of the shadow_vmcs structure introduced in the previous patch. + +static inline int vmcs_field_size(int field_type, struct kvm_vcpu *vcpu) +{ +switch (field_type) { +case VMCS_FIELD_TYPE_U16: +return 2; +case VMCS_FIELD_TYPE_U32: +return 4; +case VMCS_FIELD_TYPE_U64: +return 8; +case VMCS_FIELD_TYPE_ULONG: +#ifdef CONFIG_X86_64 +if (is_long_mode(vcpu)) +return 8; +#endif +return 4; No need for the ifdef, is_long_mode() works everywhere. +} +return 0; /* should never happen */ Then BUG()? +} + struct vcpu_vmx { struct kvm_vcpu vcpu; struct list_head local_vcpus_link; @@ -4184,6 +4220,189 @@ static int handle_vmclear(struct kvm_vcp return 1; } +static int handle_vmread_reg(struct kvm_vcpu *vcpu, int reg, + unsigned long field) +{ +u64 field_value; +if (!nested_vmcs_read_any(vcpu, field,field_value)) +return 0; + +#ifdef CONFIG_X86_64 +switch (vmcs_field_type(field)) { +case VMCS_FIELD_TYPE_U64: case VMCS_FIELD_TYPE_ULONG: +if (!is_long_mode(vcpu)) { +kvm_register_write(vcpu, reg+1, field_value 32); What's this reg+1 thing? I thought vmread simply ignores the upper half. +field_value = (u32)field_value; +} +} +#endif +kvm_register_write(vcpu, reg, field_value); +return 1; +} + +static int handle_vmread_mem(struct kvm_vcpu *vcpu, gva_t gva, + unsigned long field) +{ +u64 field_value; +if (!nested_vmcs_read_any(vcpu, field,field_value)) +return 0; + +/* It's ok to use *_system, because handle_vmread verifies cpl=0 */ +kvm_write_guest_virt_system(gva,field_value, + vmcs_field_size(vmcs_field_type(field), vcpu), + vcpu, NULL); vmread doesn't support 64-bit writes to memory outside long mode, so you'll have to truncate the write. I think you'll be better off returning a 32-bit size in vmcs_field_size() in these cases. Actually write should be always 32bit long outside IA-32e mode and 64bit long in 64 bit mode. Unused bits should be set to zero. +return 1; +} + +static int handle_vmread(struct kvm_vcpu *vcpu) +{ +unsigned long field; +unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION); +u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO); +gva_t gva = 0; +int read_succeed; + +if (!nested_vmx_check_permission(vcpu)) +return 1; + +if (!nested_map_current(vcpu)) { +printk(KERN_INFO %s invalid shadow vmcs\n, __func__); +set_rflags_to_vmx_fail_invalid(vcpu); +return 1; +} Can do the read_any() here. + +/* decode instruction info to get the field to read and where to store + * its value */ +field = kvm_register_read(vcpu, VMX_OPERAND_REG2(vmx_instruction_info)); +if (VMX_OPERAND_IS_REG(vmx_instruction_info)) { +read_succeed = handle_vmread_reg(vcpu, +VMX_OPERAND_REG(vmx_instruction_info), field); +} else { +gva = get_vmx_mem_address(vcpu, exit_qualification, + vmx_instruction_info); +if (gva == 0) +return 1; +read_succeed = handle_vmread_mem(vcpu, gva, field); +} + +if (read_succeed) { +clear_rflags_cf_zf(vcpu); +skip_emulated_instruction(vcpu); +} else { +set_rflags_to_vmx_fail_valid(vcpu); +vmcs_write32(VM_INSTRUCTION_ERROR, 12); s_e_i() in any case but an exception. +} + +nested_unmap_current(vcpu); +return 1; +} + + +if (VMX_OPERAND_IS_REG(vmx_instruction_info)) +field_value = kvm_register_read(vcpu, +VMX_OPERAND_REG(vmx_instruction_info)); +else { +gva = get_vmx_mem_address(vcpu, exit_qualification, +vmx_instruction_info); +if (gva == 0) +return 1; +kvm_read_guest_virt(gva,field_value, +vmcs_field_size(field_type, vcpu), vcpu, NULL); Check for exception. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Gleb. -- To unsubscribe from
Re: [PATCH 11/17] Fix a possible backwards warp of kvmclock
On Wed, Jun 16, 2010 at 05:13:02PM +0300, Avi Kivity wrote: On 06/16/2010 04:58 PM, Glauber Costa wrote: On Wed, Jun 16, 2010 at 04:11:26PM +0800, Jason Wang wrote: Zachary Amsden wrote: Kernel time, which advances in discrete steps may progress much slower than TSC. As a result, when kvmclock is adjusted to a new base, the apparent time to the guest, which runs at a much higher, nsec scaled rate based on the current TSC, may have already been observed to have a larger value (kernel_ns + scaled tsc) than the value to which we are setting it (kernel_ns + 0). This is one issue of kvmclock which tries to supply a clocksource whose precision may even higher than host. What if we export to the guest the current clock resolution, and when doing guest reads, simply chop whatever value we got to the lowest acceptable value? The clock resolution can change, and while we can expose it reliably through pvclock, do we need a notification so that the guest can update other internal structures? I believe the only thing we need is to warn the guest whenever this changes. We can probably fit it in one of the paddings we have, and add a flag to say it is valid (now that we have the infrastructure). -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] qemu: Enable XSAVE related CPUID
On Thursday 10 June 2010 11:31:02 Sheng Yang wrote: We can support it in KVM now. The 0xd leaf is queried from KVM. Hi Marcelo Avi How about patch 1 and 2 in this series? They are used to enable XSAVE cpuid. -- regards Yang, Sheng Signed-off-by: Sheng Yang sh...@linux.intel.com --- target-i386/cpuid.c | 21 + 1 files changed, 21 insertions(+), 0 deletions(-) diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c index fb78061..26e4054 100644 --- a/target-i386/cpuid.c +++ b/target-i386/cpuid.c @@ -1081,6 +1081,27 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, *ecx = 0; *edx = 0; break; +case 0xD: +/* Processor Extended State */ +if (!(env-cpuid_ext_features CPUID_EXT_XSAVE)) { +*eax = 0; +*ebx = 0; +*ecx = 0; +*edx = 0; +break; +} +if (kvm_enabled()) { +*eax = kvm_arch_get_supported_cpuid(env, 0xd, count, R_EAX); +*ebx = kvm_arch_get_supported_cpuid(env, 0xd, count, R_EBX); +*ecx = kvm_arch_get_supported_cpuid(env, 0xd, count, R_ECX); +*edx = kvm_arch_get_supported_cpuid(env, 0xd, count, R_EDX); +} else { +*eax = 0; +*ebx = 0; +*ecx = 0; +*edx = 0; +} +break; case 0x8000: *eax = env-cpuid_xlevel; *ebx = env-cpuid_vendor1; -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 13/24] Implement VMREAD and VMWRITE
On Sun, Jun 13, 2010 at 03:29:13PM +0300, Nadav Har'El wrote: Implement the VMREAD and VMWRITE instructions. With these instructions, L1 can read and write to the VMCS it is holding. The values are read or written to the fields of the shadow_vmcs structure introduced in the previous patch. Signed-off-by: Nadav Har'El n...@il.ibm.com --- --- .before/arch/x86/kvm/vmx.c2010-06-13 15:01:29.0 +0300 +++ .after/arch/x86/kvm/vmx.c 2010-06-13 15:01:29.0 +0300 @@ -299,6 +299,42 @@ struct nested_vmx { int l2_vmcs_num; }; +enum vmcs_field_type { + VMCS_FIELD_TYPE_U16 = 0, + VMCS_FIELD_TYPE_U64 = 1, + VMCS_FIELD_TYPE_U32 = 2, + VMCS_FIELD_TYPE_ULONG = 3 +}; + +#define VMCS_FIELD_LENGTH_OFFSET 13 +#define VMCS_FIELD_LENGTH_MASK 0x6000 + +static inline int vmcs_field_type(unsigned long field) +{ + if (0x1 field)/* one of the *_HIGH fields, all are 32 bit */ + return VMCS_FIELD_TYPE_U32; + return (VMCS_FIELD_LENGTH_MASK field) 13; +} + +static inline int vmcs_field_size(int field_type, struct kvm_vcpu *vcpu) +{ + switch (field_type) { + case VMCS_FIELD_TYPE_U16: + return 2; + case VMCS_FIELD_TYPE_U32: + return 4; + case VMCS_FIELD_TYPE_U64: + return 8; + case VMCS_FIELD_TYPE_ULONG: +#ifdef CONFIG_X86_64 + if (is_long_mode(vcpu)) + return 8; +#endif + return 4; + } + return 0; /* should never happen */ +} + struct vcpu_vmx { struct kvm_vcpu vcpu; struct list_head local_vcpus_link; @@ -4184,6 +4220,189 @@ static int handle_vmclear(struct kvm_vcp return 1; } +static inline bool nested_vmcs_read_any(struct kvm_vcpu *vcpu, + unsigned long field, u64 *ret) +{ + short offset = vmcs_field_to_offset(field); + char *p; + + if (offset 0) + return 0; + if (!to_vmx(vcpu)-nested.current_l2_page) + return 0; + + p = ((char *)(get_shadow_vmcs(vcpu))) + offset; + + switch (vmcs_field_type(field)) { + case VMCS_FIELD_TYPE_ULONG: + *ret = *((unsigned long *)p); + return 1; + case VMCS_FIELD_TYPE_U16: + *ret = (u16) *((unsigned long *)p); + return 1; + case VMCS_FIELD_TYPE_U32: + *ret = (u32) *((unsigned long *)p); + return 1; + case VMCS_FIELD_TYPE_U64: + *ret = *((u64 *)p); + return 1; + default: + return 0; /* can never happen. */ + } +} + +static int handle_vmread_reg(struct kvm_vcpu *vcpu, int reg, + unsigned long field) +{ + u64 field_value; + if (!nested_vmcs_read_any(vcpu, field, field_value)) + return 0; + +#ifdef CONFIG_X86_64 + switch (vmcs_field_type(field)) { + case VMCS_FIELD_TYPE_U64: case VMCS_FIELD_TYPE_ULONG: + if (!is_long_mode(vcpu)) { + kvm_register_write(vcpu, reg+1, field_value 32); + field_value = (u32)field_value; + } + } +#endif + kvm_register_write(vcpu, reg, field_value); + return 1; +} + +static int handle_vmread_mem(struct kvm_vcpu *vcpu, gva_t gva, + unsigned long field) +{ + u64 field_value; + if (!nested_vmcs_read_any(vcpu, field, field_value)) + return 0; + + /* It's ok to use *_system, because handle_vmread verifies cpl=0 */ + kvm_write_guest_virt_system(gva, field_value, + vmcs_field_size(vmcs_field_type(field), vcpu), + vcpu, NULL); + return 1; +} + +static int handle_vmread(struct kvm_vcpu *vcpu) +{ + unsigned long field; + unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION); + u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO); + gva_t gva = 0; + int read_succeed; + + if (!nested_vmx_check_permission(vcpu)) + return 1; + + if (!nested_map_current(vcpu)) { + printk(KERN_INFO %s invalid shadow vmcs\n, __func__); + set_rflags_to_vmx_fail_invalid(vcpu); + return 1; + } + + /* decode instruction info to get the field to read and where to store + * its value */ + field = kvm_register_read(vcpu, VMX_OPERAND_REG2(vmx_instruction_info)); + if (VMX_OPERAND_IS_REG(vmx_instruction_info)) { + read_succeed = handle_vmread_reg(vcpu, + VMX_OPERAND_REG(vmx_instruction_info), field); + } else { + gva = get_vmx_mem_address(vcpu, exit_qualification, + vmx_instruction_info); + if (gva == 0) + return 1; + read_succeed = handle_vmread_mem(vcpu, gva, field); + } +
Re: [RFC][PATCH 0/9] rework KVM mmu_shrink() code
On Wed, 2010-06-16 at 11:38 +0300, Avi Kivity wrote: On 06/15/2010 04:55 PM, Dave Hansen wrote: These seem to boot and run fine. I'm running about 40 VMs at once, while doing echo 3 /proc/sys/vm/drop_caches, and killing/restarting VMs constantly. Will drop_caches actually shrink the kvm caches too? If so we probably need to add that to autotest since it's a really good stress test for the mmu. I'm completely sure. I crashed my machines several times this way during testing. Seems to be relatively stable, and seems to keep the numbers of kvm_mmu_page_header objects down. That's no necessarily a good thing, those things are expensive to recreate. Of course, when we do need to reclaim them, that should be efficient. Oh, I meant that I didn't break the shrinker completely. We also do a very bad job of selecting which page to reclaim. We need to start using the accessed bit on sptes that point to shadow page tables, and then look those up and reclaim unreferenced pages sooner. With shadow paging there can be tons of unsync pages that are basically unused and can be reclaimed at no cost to future runtime. Sounds like a good next step. -- Dave -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 4/9] create aggregate kvm_total_used_mmu_pages value
On Wed, 2010-06-16 at 11:48 +0300, Avi Kivity wrote: On 06/15/2010 04:55 PM, Dave Hansen wrote: +/* + * This value is the sum of all of the kvm instances's + * kvm-arch.n_used_mmu_pages values. We need a global, + * aggregate version in order to make the slab shrinker + * faster + */ +static unsigned int kvm_total_used_mmu_pages; The variable needs to be at the top of the file. Gotcha, will do. +static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr) +{ + kvm-arch.n_used_mmu_pages += nr; + kvm_total_used_mmu_pages += nr; Needs an atomic operation, since there's no global lock here. To avoid bouncing this cacheline around, make the variable percpu and make readers take a sum across all cpus. Side benefit is that you no longer need an atomic but a local_t, which is considerably cheaper. That's a good point. All of the modifications are done under locks, but the fast path isn't any more. I'll fix it up. -- Dave -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 9/9] make kvm mmu shrinker more aggressive
On Wed, 2010-06-16 at 12:24 +0300, Avi Kivity wrote: On 06/15/2010 04:55 PM, Dave Hansen wrote: In a previous patch, we removed the 'nr_to_scan' tracking. It was not being used to track the number of objects scanned, so we stopped using it entirely. Here, we strart using it again. The theory here is simple; if we already have the refcount and the kvm-mmu_lock, then we should do as much work as possible under the lock. The downside is that we're less fair about the KVM instances from which we reclaim. Each call to mmu_shrink() will tend to pick on one instance, after which it gets moved to the end of the list and left alone for a while. That also increases the latency hit, as well as a potential fault storm, on that instance. Spreading out is less efficient, but smoother. This is probably something that we need to go back and actually measure. My suspicion is that, when memory fills up and this shrinker is getting called a lot, it will be naturally fair. That list gets shuffled around enough, and mmu_shrink() called often enough that no VMs get picked on too unfairly. I'll go back and see if I can quantify this a bit, though. I do worry about the case where you really have only a single CPU going into reclaim and a very small number of VMs on the system. You're basically guaranteeing that you'll throw away nr_to_scan of the poor victim VM's, with no penalty on the other guy. If mmu_shrink() has already done a significant amount of scanning, the use of 'nr_to_scan' inside shrink_kvm_mmu() will also ensure that we do not over-reclaim when we have already done a lot of work in this call. In the end, this patch defines a scan as: 1. An attempt to acquire a refcount on a 'struct kvm' 2. freeing a kvm mmu page This would probably be most ideal if we can expose some of the work done by kvm_mmu_remove_some_alloc_mmu_pages() as also counting as scanning, but I think we have churned enough for the moment. It usually removes one page. Does it always just go right now and free it, or is there any real scanning that has to go on? diff -puN arch/x86/kvm/mmu.c~make-shrinker-more-aggressive arch/x86/kvm/mmu.c --- linux-2.6.git/arch/x86/kvm/mmu.c~make-shrinker-more-aggressive 2010-06-14 11:30:44.0 -0700 +++ linux-2.6.git-dave/arch/x86/kvm/mmu.c 2010-06-14 11:38:04.0 -0700 @@ -2935,8 +2935,10 @@ static int shrink_kvm_mmu(struct kvm *kv idx = srcu_read_lock(kvm-srcu); spin_lock(kvm-mmu_lock); - if (kvm-arch.n_used_mmu_pages 0) - freed_pages = kvm_mmu_remove_some_alloc_mmu_pages(kvm); + while (nr_to_scan 0 kvm-arch.n_used_mmu_pages 0) { + freed_pages += kvm_mmu_remove_some_alloc_mmu_pages(kvm); + nr_to_scan--; + } What tree are you patching? These applied to Linus's latest as of yesterday. -- Dave -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 11/24] Implement VMPTRST
On Wed, Jun 16, 2010, Gleb Natapov wrote about Re: [PATCH 11/24] Implement VMPTRST: On Mon, Jun 14, 2010 at 12:15:10PM +0300, Avi Kivity wrote: write_guest_virt_system() is used by writes which need to ignore the cpl, for example when a cpl 3 instruction loads a segment, the processor needs to update the accessed flag even though it is only accessible to cpl 0. That's not your case, you need the ordinary write_guest_virt(). Um, I see there is no kvm_write_guest_virt(), you'll have to introduce it. the code uses this function after checking cpl to be zero, so may be it is ok, not to pretty though. I was actually hoping to get rid of all kvm_(read|write)_guest_virt* and replace existing uses with emulator_(read|write)_emulated, but this patch series adds more users that will be hard to replace :( If I remember the history correctly, this is exactly what happened in this code. We used to use kvm_write_guest_virt(), until a few months ago it disappeared. I thought it to be fine to call write_guest_virt_system() because, like you said, we only already check above that cpl=0, so it is fine to assume we have cpl 0 privileges. So while it might look a bit strange at first, I think it should be fine and there's no need to create more functions. -- Nadav Har'El|Wednesday, Jun 16 2010, 4 Tammuz 5770 n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |Amateurs built the ark - professionals http://nadav.harel.org.il |built the Titanic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[ kvm-Bugs-2948108 ] qemu-kvm fails to migrate properly in 0.12.2
Bugs item #2948108, was opened at 2010-02-08 14:06 Message generated for change (Comment added) made by ramereth You can respond by visiting: https://sourceforge.net/tracker/?func=detailatid=893831aid=2948108group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: qemu Group: None Status: Open Resolution: None Priority: 5 Private: No Submitted By: Lance Albertson (ramereth) Assigned to: Nobody/Anonymous (nobody) Summary: qemu-kvm fails to migrate properly in 0.12.2 Initial Comment: I'm using Ganeti and encountered a problem [1] with migrating KVM virtual machines. I have confirmed that this is not a problem specific to Ganeti and is also happening to the Libvirt project [2]. When migrating from host A to host B, everything goes fine however the guest VM (running CentOS 5.4) clock changes to approximately 600 seconds in the future. When attempting to migrate back to host B, the VM locks up and stops responding. However, if you fix the clock prior to syncing back to host A, the migration happens however the clock on the VM literally stops so the guest is pretty unusable. I tried the same scenario using Ubuntu 9.10 for the guest OS and has different issues. Upon migrating to host B, the clock on the guest VM was ahead by over 3 days into the future. Then when I migrate it back to host A the VM locks up hard. I can confirm I have no problem using migration when using 0.11.1. Here's the detail information about our setup: Hardware: Intel(R) Xeon(R) CPU 5160 @ 3.00GHz qemu-kvm: 0.12.2 kernel: 2.6.29-hardened Linux OS: Gentoo Hardened If you need more information, please let me know and I'll be happy to provide it. Thanks- [1] http://groups.google.com/group/ganeti/browse_thread/thread/9b2c556557e749cf [2] http://thread.gmane.org/gmane.comp.emulators.libvirt/20771 -- Comment By: Lance Albertson (ramereth) Date: 2010-06-16 08:39 Message: Has there been any progress made on this bug? I noticed that I had the same problem with 0.11.1 and just resorted to not using the kvm-clock. -- Comment By: Lance Albertson (ramereth) Date: 2010-02-09 11:29 Message: The patch you provided doesn't seem to apply cleanly on older kernels so I went ahead and used the latest stable 2.6.32.8 which seems to include the patch. Unfortunately I have the same problem as before with the time moving up by nearly three days and the VM locking up upon returning to the original host. -- Comment By: Lance Albertson (ramereth) Date: 2010-02-09 08:09 Message: I'll give that a try sometime today -- Comment By: Brian Jackson (iggy_cav) Date: 2010-02-09 07:55 Message: Can someone try this? http://www.mail-archive.com/kvm@vger.kernel.org/msg28046.html -- Comment By: ttt (mrt2k9) Date: 2010-02-09 01:50 Message: Using acpi_pm both on hosts and guests (Debian 5.0 AMD64) seems to work. I now did 20-30 live migrations or so and it worked like a charm. Thanks! -- Comment By: Lance Albertson (ramereth) Date: 2010-02-08 16:33 Message: Any idea why kvmclock is causing this issue? Has there been any recent changes to that part of qemu-kvm? Thanks for the tips on CentOS, I'll give that a try and see if that works as a workaround for now. -- Comment By: Brian Jackson (iggy_cav) Date: 2010-02-08 16:30 Message: I think 5.4 had kvmclock backported to it... I just think it doesn't have the same /sys adjustments for clock, so you have to use the no-kvmclock thing or possibly something else. http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.4/html/Virtualization_Guide/chap-Virtualization-KVM_guest_timing_management.html has some tips for disabling it I think possibly just clocksource=acpi_pm is needed -- Comment By: Lance Albertson (ramereth) Date: 2010-02-08 16:13 Message: It seems to work if I use something like acpi_pm for the clock source (on Debian). The clock also seemed to stay in time. I haven't tested it on the other OS's but it seems that CentOS doesn't even use kvmclock (which means sense considering the age of the kernel). -- Comment By: Brian Jackson (iggy_cav) Date: 2010-02-08 15:54 Message: guest you can either use the no-kvmclock boot param or choose a different clock in: /sys/devices/system/clocksource/clocksource0/current_clocksource possible values are in:
Re: [RFC][PATCH 3/9] replace x86 kvm n_free_mmu_pages with n_used_mmu_pages
On Wed, 2010-06-16 at 11:25 -0300, Marcelo Tosatti wrote: - if (used_pages kvm_nr_mmu_pages) { - while (used_pages kvm_nr_mmu_pages + if (kvm-arch.n_used_mmu_pages goal_nr_mmu_pages) { + while (kvm-arch.n_used_mmu_pages goal_nr_mmu_pages !list_empty(kvm-arch.active_mmu_pages)) { struct kvm_mmu_page *page; page = container_of(kvm-arch.active_mmu_pages.prev, struct kvm_mmu_page, link); - used_pages -= kvm_mmu_zap_page(kvm, page); - used_pages--; + kvm-arch.n_used_mmu_pages -= kvm_mmu_zap_page(kvm, page); + kvm-arch.n_used_mmu_pages--; kvm-arch.n_used_mmu_pages is deaccounted for in kvm_mmu_zap_page - kvm_mmu_free_page already. Ahh, I see that now. Thanks! -- Dave -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RESEND PATCH] acpi_piix4: save gpe and pci hotplug slot status
Anthony Liguori anth...@codemonkey.ws wrote: On 06/14/2010 03:28 PM, Alex Williamson wrote: PCI hotplug currently doesn't work after a migration because we don't migrate the enable bits of the GPE state. Pull hotplug structs into vmstate. Signed-off-by: Alex Williamsonalex.william...@redhat.com Applied. Thanks. Regards, Anthony Liguori I think this is better implemented as a subsection. We didin't need this until hotplug arrived, I think that checking if any up/down are != 0 and then send it as subsections is a best way to do it. This way it could also be backported to stable. Later, Juan. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] qemu: kvm: Enable XSAVE live migration support
Marcelo Tosatti wrote: On Fri, Jun 11, 2010 at 12:36:49PM +0800, Sheng Yang wrote: Signed-off-by: Sheng Yang sh...@linux.intel.com --- qemu-kvm-x86.c| 109 - qemu-kvm.c| 24 +++ qemu-kvm.h| 28 + target-i386/cpu.h |5 ++ target-i386/kvm.c |2 + target-i386/machine.c | 20 + 6 files changed, 169 insertions(+), 19 deletions(-) Applied, thanks. Oops, late remark: Why introducing this feature against qemu-kvm instead of upstream? Doesn't this just generate additional conversion work and the risk of divergence to upstream in the migration protocol? Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 4/9] create aggregate kvm_total_used_mmu_pages value
On Wed, 2010-06-16 at 11:48 +0300, Avi Kivity wrote: +static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr) +{ + kvm-arch.n_used_mmu_pages += nr; + kvm_total_used_mmu_pages += nr; Needs an atomic operation, since there's no global lock here. To avoid bouncing this cacheline around, make the variable percpu and make readers take a sum across all cpus. Side benefit is that you no longer need an atomic but a local_t, which is considerably cheaper. We do have the stuff in: include/linux/percpu_counter.h the downside being that they're not precise and they're *HUGE* according to the comment. :) It's actually fairly difficult to do a counter which is precise, scalable, and works well for small CPU counts when NR_CPUS is large. Do you mind if we just stick with a plain atomic_t for now? -- Dave -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RESEND PATCH] acpi_piix4: save gpe and pci hotplug slot status
On Wed, 2010-06-16 at 17:47 +0200, Juan Quintela wrote: Anthony Liguori anth...@codemonkey.ws wrote: On 06/14/2010 03:28 PM, Alex Williamson wrote: PCI hotplug currently doesn't work after a migration because we don't migrate the enable bits of the GPE state. Pull hotplug structs into vmstate. Signed-off-by: Alex Williamsonalex.william...@redhat.com Applied. Thanks. Regards, Anthony Liguori I think this is better implemented as a subsection. We didin't need this until hotplug arrived, I think that checking if any up/down are != 0 and then send it as subsections is a best way to do it. This way it could also be backported to stable. The slots aren't really the issue, they were mostly for completeness. The key is gpe.en, which is likely always going to be all 1s for an ACPI aware OS. So if we test != 0, we're going to need that subsection in 99% of the cases. Maybe we can assume gpe.en is all set on the target, but I don't really look forward to finding out the ways that might break. Thanks, Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] qemu: kvm: Enable XSAVE live migration support
On Wed, Jun 16, 2010 at 05:48:46PM +0200, Jan Kiszka wrote: Marcelo Tosatti wrote: On Fri, Jun 11, 2010 at 12:36:49PM +0800, Sheng Yang wrote: Signed-off-by: Sheng Yang sh...@linux.intel.com --- qemu-kvm-x86.c| 109 - qemu-kvm.c| 24 +++ qemu-kvm.h| 28 + target-i386/cpu.h |5 ++ target-i386/kvm.c |2 + target-i386/machine.c | 20 + 6 files changed, 169 insertions(+), 19 deletions(-) Applied, thanks. Oops, late remark: Why introducing this feature against qemu-kvm instead of upstream? Doesn't this just generate additional conversion work and the risk of divergence to upstream in the migration protocol? Thats true. Sheng, can you add save/restore support to uq/master to avoid these problems? Then the cpuid bits can be also merged upstream. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[ kvm-Bugs-2948108 ] qemu-kvm fails to migrate properly in 0.12.2
Bugs item #2948108, was opened at 2010-02-08 16:06 Message generated for change (Comment added) made by iggy_cav You can respond by visiting: https://sourceforge.net/tracker/?func=detailatid=893831aid=2948108group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: qemu Group: None Status: Open Resolution: None Priority: 5 Private: No Submitted By: Lance Albertson (ramereth) Assigned to: Nobody/Anonymous (nobody) Summary: qemu-kvm fails to migrate properly in 0.12.2 Initial Comment: I'm using Ganeti and encountered a problem [1] with migrating KVM virtual machines. I have confirmed that this is not a problem specific to Ganeti and is also happening to the Libvirt project [2]. When migrating from host A to host B, everything goes fine however the guest VM (running CentOS 5.4) clock changes to approximately 600 seconds in the future. When attempting to migrate back to host B, the VM locks up and stops responding. However, if you fix the clock prior to syncing back to host A, the migration happens however the clock on the VM literally stops so the guest is pretty unusable. I tried the same scenario using Ubuntu 9.10 for the guest OS and has different issues. Upon migrating to host B, the clock on the guest VM was ahead by over 3 days into the future. Then when I migrate it back to host A the VM locks up hard. I can confirm I have no problem using migration when using 0.11.1. Here's the detail information about our setup: Hardware: Intel(R) Xeon(R) CPU 5160 @ 3.00GHz qemu-kvm: 0.12.2 kernel: 2.6.29-hardened Linux OS: Gentoo Hardened If you need more information, please let me know and I'll be happy to provide it. Thanks- [1] http://groups.google.com/group/ganeti/browse_thread/thread/9b2c556557e749cf [2] http://thread.gmane.org/gmane.comp.emulators.libvirt/20771 -- Comment By: Brian Jackson (iggy_cav) Date: 2010-06-16 12:30 Message: There have been some fixes committed and there is another long set of patches being discussed on the list now. Unfortunately, the fixes require both host and guest updates. I don't think they will be in till 2.6.36 or so. If they are suitable for stable releases, they can probably go into 2.6.32 and the other maintained stable trees. -- Comment By: Lance Albertson (ramereth) Date: 2010-06-16 10:39 Message: Has there been any progress made on this bug? I noticed that I had the same problem with 0.11.1 and just resorted to not using the kvm-clock. -- Comment By: Lance Albertson (ramereth) Date: 2010-02-09 13:29 Message: The patch you provided doesn't seem to apply cleanly on older kernels so I went ahead and used the latest stable 2.6.32.8 which seems to include the patch. Unfortunately I have the same problem as before with the time moving up by nearly three days and the VM locking up upon returning to the original host. -- Comment By: Lance Albertson (ramereth) Date: 2010-02-09 10:09 Message: I'll give that a try sometime today -- Comment By: Brian Jackson (iggy_cav) Date: 2010-02-09 09:55 Message: Can someone try this? http://www.mail-archive.com/kvm@vger.kernel.org/msg28046.html -- Comment By: ttt (mrt2k9) Date: 2010-02-09 03:50 Message: Using acpi_pm both on hosts and guests (Debian 5.0 AMD64) seems to work. I now did 20-30 live migrations or so and it worked like a charm. Thanks! -- Comment By: Lance Albertson (ramereth) Date: 2010-02-08 18:33 Message: Any idea why kvmclock is causing this issue? Has there been any recent changes to that part of qemu-kvm? Thanks for the tips on CentOS, I'll give that a try and see if that works as a workaround for now. -- Comment By: Brian Jackson (iggy_cav) Date: 2010-02-08 18:30 Message: I think 5.4 had kvmclock backported to it... I just think it doesn't have the same /sys adjustments for clock, so you have to use the no-kvmclock thing or possibly something else. http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.4/html/Virtualization_Guide/chap-Virtualization-KVM_guest_timing_management.html has some tips for disabling it I think possibly just clocksource=acpi_pm is needed -- Comment By: Lance Albertson (ramereth) Date: 2010-02-08 18:13 Message: It seems to work if I use something like acpi_pm for the clock source (on Debian). The clock also seemed
Re: [PATCH 2/3] qemu: Enable XSAVE related CPUID
On Wed, Jun 16, 2010 at 10:59:07PM +0800, Sheng Yang wrote: On Thursday 10 June 2010 11:31:02 Sheng Yang wrote: We can support it in KVM now. The 0xd leaf is queried from KVM. Hi Marcelo Avi How about patch 1 and 2 in this series? They are used to enable XSAVE cpuid. Applied. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: KVM call minutes for June 15
On Tue, Jun 15, 2010 at 4:13 PM, Anthony Liguori anth...@codemonkey.ws wrote: On 06/15/2010 10:41 AM, Christoph Hellwig wrote: On Tue, Jun 15, 2010 at 08:18:12AM -0700, Chris Wright wrote: KVM/qemu patches - patch rate is high, documentation is low, review is low - patches need to include better descriptions and documentation - will slow down patch writers - will make it easier for patch reviewers What is the qemu patch review policy anyway? We don't really have a coherent policy. Suggestions for improvement are always appreciated. I think a policy exists, it's just sort of 'laissez-faire'. But a clear, written policy agreed by all wouldn't hurt. Perhaps we should make a checklist or FAQ. I think it should be an easy task even for someone without much QEMU background to do just CODING_STYLE or missing SoB checks. Currently there are a lot of commits with blatant CODING_STYLE violations. Even partial reviews for specific issues would be helpful. There are no Reviewed-by: included in the actual commits, Reviewed-by/Ack-by's are pretty helpful for me. In terms of including them in commit messages, if there's a strong feeling that that would be helpful then it's something I can look at doing but it also requires a fair bit of manual work during commit. and the requirement for a positive review also seem to vary a lot, up to the point that some commiters commit code that has never hit a public mailing list before. That really shouldn't happen and if it does, please point it out on the list. This also happens due to lack of policy. Regards, Anthony Liguori -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html