KVM causing high CPU load since 2.6.26
Hello virtualists,

Avi asked me to bring this issue to the ML, so here it is:

While searching for solutions to a persistent problem with our KVM hosts, I stumbled across Avi's post "[REGRESSION] High, likely incorrect process cpu usage counters with kvm and 2.6.2[67]", dated Sun, 31 Aug 2008 08:43:41 -0700.

Long story short: we've been seeing 100% CPU (sometimes even 200% or more) in top on the KVM host, where there was 2-3% before. I think this started after 2.6.25. We've been running 2.6.29 for some time. We have now updated the host machine again and are running Gentoo with 2.6.30-r6, the latest Linux kernel (as of 2009-09-07). We're using the KVM that ships with the kernel. The host machines have identical hardware configurations, built around two Xeon 5148 CPUs (Core2, 2.33 GHz).

The problem persists. In the guest, the load is 0.0 and idle is 100%. However, top on the host shows the corresponding kvm process at 100% (or more) CPU from time to time. I do NOT believe that this is only an accounting problem, as the host (a 4-core machine) starts to assign less CPU time to other processes once all CPUs are occupied by these KVM processes.

The hosts run 8-12 KVM processes on average. The guests are also Gentoo Linux machines, also on kernel 2.6.30; some are 32bit, some 64bit, and the problem seems independent of that. No guest is paravirtualized.

I'm somewhat puzzled that the problem seems to persist more than a year after you discovered this, or something that looks identical to me. The only improvement we've seen with the latest kernel update is that the kvm processes fall to 0% CPU usage for some time, so the load of the host machine went down. Still, when they pop up with 100% CPU for 20 seconds or so, no reason can be observed within the guest.

I'm willing to assist with more qualified reports and (eventual) bug hunting, should you consider this worth pursuing.

As for a way to reproduce this: I have no idea. I tried on my notebook (Intel T7200 @ 2.00GHz running 2.6.30-tuxonice-r5, also Gentoo) and the problem is not there.

So to sum up, the problem:
* was not present up to and including 2.6.25; it has been present since 2.6.26
* is only present on our Xeon 5148 machines
* is independent of the guest being 32bit or 64bit (also Gentoo Linux)
* is independent of whether we use the kvm included in the kernel or the external one (88)

So if anyone has ideas on how to treat this, I'd be glad to hear them.

greetings,
Richard

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Heads up: More user-unaccessible x86 states?
Hi,

while preparing new IOCTLs to let user space query/set the so far inaccessible NMI states (pending and masked), I also came across the interrupt shadow masks. Unless I missed something, we currently break them in the rare case that a migration happens right while any of them is asserted. So I guess I should extend my interface and stuff them in as well.

Do we have more such inaccessible states on x86 that could be included, too? This would be a good chance...

Jan
Re: Heads up: More user-unaccessible x86 states?
On 10/04/2009 10:59 AM, Jan Kiszka wrote:
> Hi,
>
> while preparing new IOCTLs to let user space query/set the so far inaccessible NMI states (pending and masked), I also came across the interrupt shadow masks. Unless I missed something, we currently break them in the rare case that a migration happens right while any of them is asserted. So I guess I should extend my interface and stuff them in as well.
>
> Do we have more such inaccessible states on x86 that could be included, too? This would be a good chance...

There's some hidden state in the cpuid mechanism. I think we expose it though (just don't use it in qemu).

The PDPTRs are hidden state that we should save/restore, though no sane guest relies on them.

I think we can lose information if we migrate during a SIPI (sipi_vector), though that might be fixable without exposing it.

We might also lose debug traps.

We drop pending exceptions; normally that's fine since they'll reinject themselves, but MCE will not.

--
error compiling committee.c: too many arguments to function
Re: [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests
On 10/02/2009 10:19 PM, Gregory Haskins wrote:
> What: xinterface is a mechanism that allows kernel modules external to the kvm.ko proper to interface with a running guest. It accomplishes this by creating an abstracted interface which does not expose any private details of the guest or its related KVM structures, and provides a mechanism to find and bind to this interface at run-time.

If this is needed, it should be done as a virt_address_space to which kvm and other modules bind, instead of as something that kvm exports and other modules import. The virt_address_space can be identified by an fd and passed around to kvm and other modules.

> Why: There are various subsystems that would like to interact with a KVM guest which are ideally suited to exist outside the domain of the kvm.ko core logic. For instance, external pci-passthrough, virtual-bus, and virtio-net modules are currently under development. In order for these modules to successfully interact with the guest, they need, at the very least, various interfaces for signaling IO events, pointer translation, and possibly memory mapping.
>
> The signaling case is covered by the recent introduction of the irqfd/ioeventfd mechanisms. This patch provides a mechanism to cover the other cases. Note that today we only expose pointer-translation related functions, but more could be added at a future date as needs arise.
>
> Example usage: QEMU instantiates a guest, and an external module "foo" that desires the ability to interface with the guest (say via open("/dev/foo")). QEMU may then pass the kvmfd to foo via an ioctl, such as: ioctl(foofd, FOO_SET_VMID, &kvmfd). Upon receipt, the foo module can issue kvm_xinterface_bind(kvmfd) to acquire the proper context. Internally, the struct kvm* and associated struct module* will remain pinned at least until the foo module calls kvm_xinterface_put().

So, under my suggestion above, you'd call sys_create_virt_address_space(), populate it, and pass the result to kvm and to foo. This allows the use of virt_address_space without kvm and doesn't require foo to interact with kvm.

> +struct kvm_xinterface_ops {
> +        unsigned long (*copy_to)(struct kvm_xinterface *intf,
> +                                 unsigned long gpa, const void *src,
> +                                 unsigned long len);
> +        unsigned long (*copy_from)(struct kvm_xinterface *intf, void *dst,
> +                                   unsigned long gpa, unsigned long len);
> +        struct kvm_xvmap* (*vmap)(struct kvm_xinterface *intf,
> +                                  unsigned long gpa,
> +                                  unsigned long len);

How would vmap() work with live migration?

> +
> +static inline void
> +_kvm_xinterface_release(struct kref *kref)
> +{
> +        struct kvm_xinterface *intf;
> +        struct module *owner;
> +
> +        intf = container_of(kref, struct kvm_xinterface, kref);
> +
> +        owner = intf->owner;
> +        rmb();

Why rmb?

> +
> +        intf->ops->release(intf);
> +        module_put(owner);
> +}
> +
> +
> +/*
> + * gpa_to_hva() - translate a guest-physical to host-virtual using
> + * a per-cpu cache of the memslot.
> + *
> + * The gfn_to_memslot() call is relatively expensive, and the gpa access
> + * patterns exhibit a high degree of locality. Therefore, lets cache
> + * the last slot used on a per-cpu basis to optimize the lookup
> + *
> + * assumes slots_lock held for read
> + */
> +static unsigned long
> +gpa_to_hva(struct _xinterface *_intf, unsigned long gpa)
> +{
> +        int                     cpu     = get_cpu();
> +        unsigned long           gfn     = gpa >> PAGE_SHIFT;
> +        struct kvm_memory_slot *memslot = _intf->slotcache[cpu];
> +        unsigned long           addr    = 0;
> +
> +        if (!memslot
> +            || gfn < memslot->base_gfn
> +            || gfn >= memslot->base_gfn + memslot->npages) {
> +
> +                memslot = gfn_to_memslot(_intf->kvm, gfn);
> +                if (!memslot)
> +                        goto out;
> +
> +                _intf->slotcache[cpu] = memslot;
> +        }
> +
> +        addr = _gfn_to_hva(gfn, memslot) + offset_in_page(gpa);
> +
> +out:
> +        put_cpu();
> +
> +        return addr;
> +}

A simple per-vcpu cache (in struct kvm_vcpu) is likely to give better results.

> +static unsigned long
> +xinterface_copy_to(struct kvm_xinterface *intf, unsigned long gpa,
> +                   const void *src, unsigned long n)
> +{
> +        struct _xinterface *_intf = to_intf(intf);
> +        unsigned long dst;
> +        bool kthread = !current->mm;
> +
> +        down_read(&_intf->kvm->slots_lock);
> +
> +        dst = gpa_to_hva(_intf, gpa);
> +        if (!dst)
> +                goto out;
> +
> +        if (kthread)
> +                use_mm(_intf->mm);
> +
> +        if (kthread || _intf->mm == current->mm)
> +                n = copy_to_user((void *)dst, src, n);
> +        else
> +                n = _slow_copy_to_user(_intf, dst, src, n);

Can't you switch the mm temporarily instead of this?

--
error compiling committee.c: too many arguments to function
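The last-slot cache that the quoted gpa_to_hva() relies on can be sketched in plain user-space C. This is only an illustration of the lookup pattern under review, not the kernel code: struct memslot, struct slot_table, the 4 KiB page size and the slow_lookups counter are all assumptions made for the example, and a single cache stands in for the per-cpu (or, as suggested above, per-vcpu) one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for the sketch: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)

static_assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1)) == 0,
              "page size must be a power of two");

/* Hypothetical stand-in for a memory slot: a run of guest frames
 * backed by a contiguous host-virtual range. */
struct memslot {
    uint64_t base_gfn;        /* first guest frame number in the slot */
    uint64_t npages;          /* number of pages in the slot */
    uint64_t userspace_addr;  /* host-virtual address of the first page */
};

/* Hypothetical slot table with a single "last slot used" cache. */
struct slot_table {
    const struct memslot *slots;
    size_t nslots;
    const struct memslot *cached;  /* last slot that matched, or NULL */
    unsigned long slow_lookups;    /* counts full table walks */
};

/* Full walk of the table; this is the lookup the cache tries to avoid. */
static const struct memslot *find_slot_slow(struct slot_table *t, uint64_t gfn)
{
    t->slow_lookups++;
    for (size_t i = 0; i < t->nslots; i++) {
        const struct memslot *s = &t->slots[i];
        if (gfn >= s->base_gfn && gfn < s->base_gfn + s->npages)
            return s;
    }
    return NULL;
}

/* Translate a guest-physical address; returns 0 if no slot covers it. */
static uint64_t gpa_to_hva_sketch(struct slot_table *t, uint64_t gpa)
{
    uint64_t gfn = gpa >> SKETCH_PAGE_SHIFT;
    const struct memslot *s = t->cached;

    /* Cache miss: no cached slot, or gfn lies outside of it. */
    if (!s || gfn < s->base_gfn || gfn >= s->base_gfn + s->npages) {
        s = find_slot_slow(t, gfn);
        if (!s)
            return 0;
        t->cached = s;
    }

    return s->userspace_addr
         + (gfn - s->base_gfn) * SKETCH_PAGE_SIZE
         + (gpa & (SKETCH_PAGE_SIZE - 1));
}
```

With good locality, consecutive translations hit the cached slot and skip the table walk entirely; where the cache lives (per cpu or per vcpu) only changes which accesses share it.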
Re: [PATCH v2 3/4] KVM: add io services to xinterface
On 10/02/2009 10:19 PM, Gregory Haskins wrote:
> We want to add a more efficient way to get PIO signals out of the guest, so we add an "xioevent" interface. This allows a client to register for notifications when a specific MMIO/PIO address is touched by the guest. This is an alternative interface to ioeventfd, which is performance limited by io-bus scaling and eventfd wait-queue based notification mechanism. This also has the advantage of retaining the full PIO data payload and passing it to the recipient.

Can you detail the problems with io-bus scaling and eventfd wait-queues? Maybe we should fix these instead.

--
error compiling committee.c: too many arguments to function
Re: [PATCH v2 4/4] KVM: add scatterlist support to xinterface
On 10/02/2009 10:19 PM, Gregory Haskins wrote:
> This allows a scatter-gather approach to IO, which will be useful for building high performance interfaces, like zero-copy and low-latency copy (avoiding multiple calls to copy_to/from).
>
> The interface is based on the existing scatterlist infrastructure. The caller is expected to pass in a scatterlist with its "dma" field populated with valid GPAs. The xinterface will then populate each entry by translating the GPA to a page*.
>
> The caller signifies completion by simply performing a put_page() on each page returned in the list.
>
> Signed-off-by: Gregory Haskins <ghask...@novell.com>
> ---
>
>  include/linux/kvm_xinterface.h |    4 ++
>  virt/kvm/xinterface.c          |   72 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 76 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/kvm_xinterface.h b/include/linux/kvm_xinterface.h
> index 684b6f8..eefb575 100644
> --- a/include/linux/kvm_xinterface.h
> +++ b/include/linux/kvm_xinterface.h
> @@ -9,6 +9,7 @@
>  #include <linux/kref.h>
>  #include <linux/module.h>
>  #include <linux/file.h>
> +#include <linux/scatterlist.h>
>
>  struct kvm_xinterface;
>  struct kvm_xvmap;
> @@ -36,6 +37,9 @@ struct kvm_xinterface_ops {
>                             u64 addr,
>                             unsigned long len,
>                             unsigned long flags);
> +        unsigned long (*sgmap)(struct kvm_xinterface *intf,
> +                               struct scatterlist *sgl, int nents,
> +                               unsigned long flags);
>          void (*release)(struct kvm_xinterface *);
>  };
>
> diff --git a/virt/kvm/xinterface.c b/virt/kvm/xinterface.c
> index c356835..16729f6 100644
> --- a/virt/kvm/xinterface.c
> +++ b/virt/kvm/xinterface.c
> @@ -467,6 +467,77 @@ fail:
>
>  }
>
> +static unsigned long
> +xinterface_sgmap(struct kvm_xinterface *intf,
> +                 struct scatterlist *sgl, int nents,
> +                 unsigned long flags)
> +{
> +        struct _xinterface     *_intf   = to_intf(intf);
> +        struct task_struct     *p       = _intf->task;
> +        struct mm_struct       *mm      = _intf->mm;
> +        struct kvm             *kvm     = _intf->kvm;
> +        struct kvm_memory_slot *memslot = NULL;
> +        bool                    kthread = !current->mm;
> +        int                     ret;
> +        struct scatterlist     *sg;
> +        int                     i;
> +
> +        down_read(&kvm->slots_lock);
> +
> +        if (kthread)
> +                use_mm(_intf->mm);
> +
> +        for_each_sg(sgl, sg, nents, i) {
> +                unsigned long gpa    = sg_dma_address(sg);
> +                unsigned long len    = sg_dma_len(sg);
> +                unsigned long gfn    = gpa >> PAGE_SHIFT;
> +                off_t         offset = offset_in_page(gpa);
> +                unsigned long hva;
> +                struct page  *pg;
> +
> +                /* ensure that we do not have more than one page per entry */
> +                if ((PAGE_ALIGN(len + offset) >> PAGE_SHIFT) != 1) {
> +                        ret = -EINVAL;
> +                        break;
> +                }
> +
> +                /* check for a memslot-cache miss */
> +                if (!memslot
> +                    || gfn < memslot->base_gfn
> +                    || gfn >= memslot->base_gfn + memslot->npages) {
> +                        memslot = gfn_to_memslot(kvm, gfn);
> +                        if (!memslot) {
> +                                ret = -EFAULT;
> +                                break;
> +                        }
> +                }
> +
> +                hva = (memslot->userspace_addr +
> +                       (gfn - memslot->base_gfn) * PAGE_SIZE);
> +
> +                if (kthread || current->mm == mm)
> +                        ret = get_user_pages_fast(hva, 1, 1, &pg);
> +                else
> +                        ret = get_user_pages(p, mm, hva, 1, 1, 0, &pg, NULL);

One of these needs the mm semaphore.

> +
> +                if (ret != 1) {
> +                        if (ret >= 0)
> +                                ret = -EFAULT;
> +                        break;
> +                }
> +
> +                sg_set_page(sg, pg, len, offset);
> +                ret = 0;
> +        }
> +
> +        if (kthread)
> +                unuse_mm(_intf->mm);
> +
> +        up_read(&kvm->slots_lock);
> +
> +        return ret;
> +}
> +
>  static void
>  xinterface_release(struct kvm_xinterface *intf)
>  {
> @@ -483,6 +554,7 @@ struct kvm_xinterface_ops _xinterface_ops = {
>          .copy_from = xinterface_copy_from,
>          .vmap      = xinterface_vmap,
>          .ioevent   = xinterface_ioevent,
> +        .sgmap     = xinterface_sgmap,
>          .release   = xinterface_release,
>  };

--
error compiling committee.c: too many arguments to function
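The "one page per entry" test in the quoted xinterface_sgmap() is easy to misread, so here is a small user-space sketch of the same arithmetic. The 4 KiB page size and the helper names are assumptions for illustration; the kernel code uses PAGE_ALIGN(), PAGE_SHIFT and offset_in_page() instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for the sketch: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)

static_assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1)) == 0,
              "page size must be a power of two");

/* Round x up to the next page boundary (user-space analogue of PAGE_ALIGN). */
static uint64_t sketch_page_align(uint64_t x)
{
    return (x + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/*
 * True if the byte range [gpa, gpa + len) touches exactly one page.
 * A zero-length entry touches zero pages and is rejected as well.
 */
static bool sg_entry_fits_one_page(uint64_t gpa, uint64_t len)
{
    uint64_t offset = gpa & (SKETCH_PAGE_SIZE - 1);

    return (sketch_page_align(len + offset) >> SKETCH_PAGE_SHIFT) == 1;
}
```

For example, an entry of 2048 bytes starting 2048 bytes into a page is accepted, while one more byte spills into the next page and the entry is rejected (the patch returns -EINVAL in that case).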
Re: Heads up: More user-unaccessible x86 states?
Avi Kivity wrote:
> On 10/04/2009 10:59 AM, Jan Kiszka wrote:
>> Hi,
>>
>> while preparing new IOCTLs to let user space query/set the so far inaccessible NMI states (pending and masked), I also came across the interrupt shadow masks. Unless I missed something, we currently break them in the rare case that a migration happens right while any of them is asserted. So I guess I should extend my interface and stuff them in as well.
>>
>> Do we have more such inaccessible states on x86 that could be included, too? This would be a good chance...
>
> There's some hidden state in the cpuid mechanism. I think we expose it though (just don't use it in qemu).

Do you have more details on this?

> The PDPTRs are hidden state that we should save/restore, though no sane guest relies on them.

A quick glance at SVM makes me think that those registers are not exposed there. So, keeping in mind that we may only help VMX guests, I think it makes even less sense to fix this, does it?

> I think we can lose information if we migrate during a SIPI (sipi_vector), though that might be fixable without exposing it.

Hmm, I see. But even if it's not fixable, such an extension would be an in-kernel irqchip thing.

> We might also lose debug traps.
>
> We drop pending exceptions; normally that's fine since they'll reinject themselves, but MCE will not.

So would it make sense, and fix those two issues, if we simply save and restore the pending exception?

Jan
Re: Heads up: More user-unaccessible x86 states?
On 10/04/2009 12:50 PM, Jan Kiszka wrote:
> Avi Kivity wrote:
>> On 10/04/2009 10:59 AM, Jan Kiszka wrote:
>>> Hi,
>>>
>>> while preparing new IOCTLs to let user space query/set the so far inaccessible NMI states (pending and masked), I also came across the interrupt shadow masks. Unless I missed something, we currently break them in the rare case that a migration happens right while any of them is asserted. So I guess I should extend my interface and stuff them in as well.
>>>
>>> Do we have more such inaccessible states on x86 that could be included, too? This would be a good chance...
>>
>> There's some hidden state in the cpuid mechanism. I think we expose it though (just don't use it in qemu).
>
> Do you have more details on this?

Some cpuid leaves return different information based on an internal counter. This is indicated by KVM_CPUID_FLAG_STATEFUL_FUNC.

>> The PDPTRs are hidden state that we should save/restore, though no sane guest relies on them.
>
> A quick glance at SVM makes me think that those registers are not exposed there. So, keeping in mind that we may only help VMX guests, I think it makes even less sense to fix this, does it?

Yes. With npt the PDPTRs are essentially gone.

>> I think we can lose information if we migrate during a SIPI (sipi_vector), though that might be fixable without exposing it.
>
> Hmm, I see. But even if it's not fixable, such an extension would be an in-kernel irqchip thing.

Yes.

>> We might also lose debug traps.
>>
>> We drop pending exceptions; normally that's fine since they'll reinject themselves, but MCE will not.
>
> So would it make sense, and fix those two issues, if we simply save and restore the pending exception?

Yes.

btw, instead of adding a new ioctl, perhaps it makes sense to define a new KVM_VCPU_STATE structure that holds all current and future state (with generous reserved space), instead of separating state over a dozen ioctls.

--
error compiling committee.c: too many arguments to function
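To make the "one structure with generous reserved space" idea concrete, here is a purely hypothetical layout. Nothing in it is an existing KVM interface: the struct name, every field, and the 256-byte total are assumptions chosen only to show how reserved padding keeps the size stable while fields are added later. The fields simply mirror the hidden states discussed in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical total size; fixed so that old and new users agree on it. */
#define VCPU_STATE_SKETCH_SIZE 256

/* Hypothetical combined state block (not a real KVM structure). */
struct vcpu_state_sketch {
    uint32_t size;               /* sizeof(struct), for sanity checking */
    uint32_t valid;              /* bitmask: which fields below are meaningful */

    uint8_t  nmi_pending;
    uint8_t  nmi_masked;
    uint8_t  interrupt_shadow;   /* e.g. STI / MOV SS shadow bits */
    uint8_t  exception_pending;
    uint8_t  exception_nr;
    uint8_t  exception_has_error_code;
    uint8_t  pad[2];

    uint32_t exception_error_code;
    uint32_t sipi_vector;
    uint64_t pdptrs[4];

    /* Room for future state; must be zero until a field is defined. */
    uint8_t  reserved[200];
};

/* The layout has no implicit padding, so the size is exactly as planned. */
static_assert(sizeof(struct vcpu_state_sketch) == VCPU_STATE_SKETCH_SIZE,
              "reserved[] must be shrunk whenever a field is added");

/* Return a zeroed block with only the size field filled in. */
static struct vcpu_state_sketch vcpu_state_sketch_init(void)
{
    struct vcpu_state_sketch s = {0};

    s.size = (uint32_t)sizeof(s);
    return s;
}
```

The design point is that a new field takes bytes out of reserved[] rather than growing the structure, so a single get/set pair can carry it without a new ioctl number.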
Re: kvm guest: hrtimer: interrupt too slow
Marcelo Tosatti wrote:
> Michael,
>
> Can you please give the patch below a try? (without acpi_pm timer or priority adjustments for the guest).

Sure. I'll try it out in an hour or two, while I can experiment freely because it's the weekend. But I wonder...

>> [] hrtimer: interrupt too slow, forcing clock min delta to 93629025 ns
>
> It seems the way hrtimer_interrupt_hanging calculates min_delta is wrong (especially for virtual machines). The guest vcpu can be scheduled out during the execution of the hrtimer callbacks (and the callbacks themselves can do operations that translate to blocking operations in the hypervisor).
>
> So high min_delta values can be calculated if, for example, a single hrtimer_interrupt run takes two host time slices to execute, while some other higher priority task runs for N slices in between.
>
> Using the hrtimer_interrupt execution time (which can be the worst case at any given time) as the min_delta is problematic.
>
> So simply increase min_delta_ns by 50% once every detected failure, which will eventually lead to an acceptable threshold (the algorithm should scale back down to a lower min_delta, to adjust back to wealthier times, too).

..I wonder what I should check for. I mean, the end result of this patch is not entirely clear to me: what should it change? I see that instead of the now-calculated-after-error (probably very large) min_delta, it's increased to a smaller value. So I should be getting more such messages (forcing min_delta to $foo), but the responsiveness of the guest should stay in a more or less acceptable range (unless it continues erroring, in which case the responsiveness will be slowly reduced).

Yes indeed, it's better than the current situation, where the guest works fine initially but all of a sudden switches to a wild very-slow-to-reply mode. But it does not look like the right solution either, even if the back adjustment (mentioned in the last statement above) is implemented. Unless I completely missed the point...

Nevertheless, the question stands: what am I looking for now, once the patch is applied? I can't measure the responsiveness, especially since min_delta gets set to different (large) values each time (I mean the current situation without the patch).

Thanks!

/mjt

> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> index 49da79a..8997978 100644
> --- a/kernel/hrtimer.c
> +++ b/kernel/hrtimer.c
> @@ -1234,28 +1234,20 @@ static void __run_hrtimer(struct hrtimer *timer)
>
>  #ifdef CONFIG_HIGH_RES_TIMERS
>
> -static int force_clock_reprogram;
> -
>  /*
>   * After 5 iteration's attempts, we consider that hrtimer_interrupt()
>   * is hanging, which could happen with something that slows the interrupt
> - * such as the tracing. Then we force the clock reprogramming for each future
> - * hrtimer interrupts to avoid infinite loops and use the min_delta_ns
> - * threshold that we will overwrite.
> - * The next tick event will be scheduled to 3 times we currently spend on
> - * hrtimer_interrupt(). This gives a good compromise, the cpus will spend
> - * 1/4 of their time to process the hrtimer interrupts. This is enough to
> - * let it running without serious starvation.
> + * such as the tracing, so we increase min_delta_ns.
>   */
>
>  static inline void
> -hrtimer_interrupt_hanging(struct clock_event_device *dev,
> -                        ktime_t try_time)
> +hrtimer_interrupt_hanging(struct clock_event_device *dev)
>  {
> -        force_clock_reprogram = 1;
> -        dev->min_delta_ns = (unsigned long)try_time.tv64 * 3;
> -        printk(KERN_WARNING "hrtimer: interrupt too slow, "
> -                "forcing clock min delta to %lu ns\n", dev->min_delta_ns);
> +        dev->min_delta_ns += dev->min_delta_ns >> 1;
> +        if (printk_ratelimit())
> +                printk(KERN_WARNING "hrtimer: interrupt too slow, "
> +                        "forcing clock min delta to %lu ns\n",
> +                        dev->min_delta_ns);
>  }
>  /*
>   * High resolution timer interrupt
> @@ -1276,7 +1268,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
>   retry:
>          /* 5 retries is enough to notice a hang */
>          if (!(++nr_retries % 5))
> -                hrtimer_interrupt_hanging(dev, ktime_sub(ktime_get(), now));
> +                hrtimer_interrupt_hanging(dev);
>
>          now = ktime_get();
>
> @@ -1342,7 +1334,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
>
>          /* Reprogramming necessary ? */
>          if (expires_next.tv64 != KTIME_MAX) {
> -                if (tick_program_event(expires_next, force_clock_reprogram))
> +                if (tick_program_event(expires_next, 0))
>                          goto retry;
>          }
>  }
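The effect of the "+50% per detected hang" rule in the quoted patch can be checked with a few lines of user-space C. The helper names and the starting values are assumptions for illustration; only the update expression `x += x >> 1` is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* One step of the patched policy: grow the threshold by 50%. */
static uint64_t grow_min_delta_ns(uint64_t min_delta_ns)
{
    return min_delta_ns + (min_delta_ns >> 1);
}

/*
 * Number of detected hangs needed before the threshold reaches
 * target_ns when starting from start_ns (hypothetical helper).
 */
static unsigned steps_to_reach(uint64_t start_ns, uint64_t target_ns)
{
    unsigned steps = 0;

    /* With start_ns < 2 the shift yields 0 and the value never grows. */
    assert(start_ns >= 2);

    while (start_ns < target_ns) {
        start_ns = grow_min_delta_ns(start_ns);
        steps++;
    }
    return steps;
}
```

Starting from 1000 ns, successive hangs give 1500, 2250 and 3375 ns, so the threshold creeps up geometrically instead of jumping straight to three times a single (possibly descheduled) interrupt run as the old code did. That matches the expectation voiced above: more warnings, each with a smaller increase.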
PATCH: kvm-userspace: ksm support
From a8ca226de8efb4f0447e4ef87bf034cf18996745 Mon Sep 17 00:00:00 2001
From: Izik Eidus <iei...@redhat.com>
Date: Sun, 4 Oct 2009 14:01:31 +0200
Subject: [PATCH] kvm-userspace: add ksm support

Calling to madvise(MADV_MERGEABLE) on the memory allocations.

Signed-off-by: Izik Eidus <iei...@redhat.com>
---
 exec.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/exec.c b/exec.c
index 5c9edf7..406d2cb 100644
--- a/exec.c
+++ b/exec.c
@@ -2538,6 +2538,9 @@ ram_addr_t qemu_ram_alloc(ram_addr_t size)
         new_block->host = file_ram_alloc(size, mem_path);
         if (!new_block->host) {
             new_block->host = qemu_vmalloc(size);
+#ifdef MADV_MERGEABLE
+            madvise(new_block->host, size, MADV_MERGEABLE);
+#endif
         }
     new_block->offset = last_ram_offset;
     new_block->length = size;
--
1.5.6.5
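For readers who want to try the same hint outside of qemu, the following stand-alone sketch does what the patch does for an anonymous mapping. The function name alloc_mergeable() is an assumption for the example; mmap(), madvise() and MADV_MERGEABLE are the real Linux interfaces. As in the patch, the advice is treated as best effort and its return value is ignored, since the call fails on kernels built without KSM.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Allocate anonymous, private memory and mark it as a candidate for
 * same-page merging. Returns NULL if the mapping itself fails.
 */
static void *alloc_mergeable(size_t size)
{
    void *p;

    assert(size > 0);

    p = mmap(NULL, size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

#ifdef MADV_MERGEABLE
    /* Advisory only: the memory stays fully usable if this fails. */
    (void)madvise(p, size, MADV_MERGEABLE);
#endif

    return p;
}
```

The #ifdef guard mirrors the patch: on systems whose headers predate KSM the code still builds and simply skips the hint.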
Re: PATCH: kvm-userspace: ksm support
On 10/04/2009 02:16 PM, Izik Eidus wrote:
> From a8ca226de8efb4f0447e4ef87bf034cf18996745 Mon Sep 17 00:00:00 2001
> From: Izik Eidus <iei...@redhat.com>
> Date: Sun, 4 Oct 2009 14:01:31 +0200
> Subject: [PATCH] kvm-userspace: add ksm support
>
> Calling to madvise(MADV_MERGEABLE) on the memory allocations.

Applied, thanks.

--
error compiling committee.c: too many arguments to function
buildbot failure in qemu-kvm on default_x86_64_debian_5_0
The Buildbot has detected a new failure of default_x86_64_debian_5_0 on qemu-kvm.
Full details are available at:
 http://buildbot.b1-systems.de/qemu-kvm/builders/default_x86_64_debian_5_0/builds/94

Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/

Buildslave for this Build: b1_qemu_kvm_1

Build Reason:
Build Source Stamp: [branch next] HEAD
Blamelist: Aurelien Jarno <aurel...@aurel32.net>, Blue Swirl <blauwir...@gmail.com>, Bruce Rogers <brog...@novell.com>, Dominic Evans <oldma...@gmail.com>, Gerd Hoffmann <kra...@redhat.com>, Glauber Costa <glom...@redhat.com>, Igor V. Kovalenko <igor.v.kovale...@gmail.com>, Izik Eidus <iei...@redhat.com>, Juan Quintela <quint...@redhat.com>, Laurent Desnogues <laurent.desnog...@gmail.com>, Luiz Capitulino <lcapitul...@redhat.com>, Marcelo Tosatti <mtosa...@redhat.com>, Mark McLoughlin <mar...@redhat.com>, Pierre Riteau <pierre.rit...@irisa.fr>, Stefan Weil <w...@mail.berlios.de>, malc <av1...@comtv.ru>

BUILD FAILED: failed git

sincerely,
 -The Buildbot
buildbot failure in qemu-kvm on default_i386_out_of_tree
The Buildbot has detected a new failure of default_i386_out_of_tree on qemu-kvm.
Full details are available at:
 http://buildbot.b1-systems.de/qemu-kvm/builders/default_i386_out_of_tree/builds/33

Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/

Buildslave for this Build: b1_qemu_kvm_2

Build Reason:
Build Source Stamp: [branch next] HEAD
Blamelist: Aurelien Jarno <aurel...@aurel32.net>, Blue Swirl <blauwir...@gmail.com>, Bruce Rogers <brog...@novell.com>, Dominic Evans <oldma...@gmail.com>, Gerd Hoffmann <kra...@redhat.com>, Glauber Costa <glom...@redhat.com>, Igor V. Kovalenko <igor.v.kovale...@gmail.com>, Izik Eidus <iei...@redhat.com>, Juan Quintela <quint...@redhat.com>, Laurent Desnogues <laurent.desnog...@gmail.com>, Luiz Capitulino <lcapitul...@redhat.com>, Marcelo Tosatti <mtosa...@redhat.com>, Mark McLoughlin <mar...@redhat.com>, Pierre Riteau <pierre.rit...@irisa.fr>, Stefan Weil <w...@mail.berlios.de>, malc <av1...@comtv.ru>

BUILD FAILED: failed git

sincerely,
 -The Buildbot
Re: [PATCH 1/2] KVM: x86: Refactor guest debug IOCTL handling
On 10/03/2009 12:31 AM, Jan Kiszka wrote:
> Much of so far vendor-specific code for setting up guest debug can actually be handled by the generic code. This also fixes a minor deficit in the SVM part /wrt processing KVM_GUESTDBG_ENABLE.

Applied both, thanks.

--
error compiling committee.c: too many arguments to function
Re: [PATCH 2/2] KVM: x86: Preserve guest single-stepping on register
On 10/03/2009 12:31 AM, Jan Kiszka wrote:
> Give user space more flexibility /wrt its IOCTL order. So far updating the rflags via KVM_SET_REGS ignored potentially set single-step flags. Now they will be kept.
>
>          kvm_rip_write(vcpu, regs->rip);
> -        kvm_x86_ops->set_rflags(vcpu, regs->rflags);
> +        rflags = regs->rflags;
> +        if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
> +                rflags |= X86_EFLAGS_TF | X86_EFLAGS_RF;
> +        kvm_x86_ops->set_rflags(vcpu, rflags);

I think we need the same for popf instruction emulation.

--
error compiling committee.c: too many arguments to function
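The flag handling in the quoted hunk boils down to one masking rule, which the following user-space sketch isolates. The SKETCH_* constants carry the architectural bit positions of TF (bit 8) and RF (bit 16) and the value of KVM_GUESTDBG_SINGLESTEP from the KVM headers, but the names and the helper function are assumptions for illustration rather than kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_EFLAGS_TF           0x00000100u  /* trap flag, bit 8 */
#define SKETCH_EFLAGS_RF           0x00010000u  /* resume flag, bit 16 */
#define SKETCH_GUESTDBG_SINGLESTEP 0x00000002u  /* host-requested stepping */

static_assert((SKETCH_EFLAGS_TF & SKETCH_EFLAGS_RF) == 0,
              "TF and RF are distinct bits");

/*
 * Compute the rflags value that is actually loaded into the guest:
 * while the host debugger single-steps, TF and RF stay forced on no
 * matter what value user space (or an emulated popf) supplies.
 */
static uint64_t rflags_for_guest(uint64_t requested, uint32_t guest_debug)
{
    if (guest_debug & SKETCH_GUESTDBG_SINGLESTEP)
        requested |= SKETCH_EFLAGS_TF | SKETCH_EFLAGS_RF;

    return requested;
}
```

Routing every rflags write through one helper like this is one way to cover the popf emulation path mentioned in the reply; whether the real fix is structured that way is not stated in the thread.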
Re: Q: Stopped VM still using host cpu CPU ?
On 10/01/2009 12:32 PM, Daniel Schwager wrote:
> Hi,
>
> we are running some stopped (sending "stop" via kvm-monitor socket) vm's on our system. My intention was to pause (stop) the vm's and unpause (cont) them on demand (very fast, without time delay, within 2 seconds ..).
>
> After 'stop'ing, the vm's still using CPU-load, like the top will tell:
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 25983 root      20   0  495m 407m 1876 R  8.9  2.5 228:09.15 qemu-system-x86
> 25523 root      20   0  495m 2040 1868 S  7.9  0.0   2700:16 qemu-system-x86

It shouldn't do that. How long is this after the 'stop'? Was the guest doing intensive I/O? Are you sure qemu responded to the stop command (by printing the monitor prompt)?

--
error compiling committee.c: too many arguments to function
Re: Q: Stopped VM still using host cpu CPU ?
On 10/01/2009 01:47 PM, Daniel Schwager wrote:
> If I send a signal STOP/CONT (kill -STOP <pid> or kill -CONT <pid>) to the KVM process, it looks like the kvm does not (sure ;-) use any host CPU.
>
> - Are there some side effects using this approach? (e.g. with networking, ...)

The monitor, vnc, and sdl stop working.

> - And, why is there a way to send STOP/CONT via socket to the KVM process? Why not use the sending-signal approach?

There's the 'stop' command...

--
error compiling committee.c: too many arguments to function
Re: KVM: x86: disable paravirt mmu reporting
On 10/02/2009 12:28 AM, Marcelo Tosatti wrote:
> Disable paravirt MMU capability reporting, so that new (or rebooted) guests switch to native operation.
>
> Paravirt MMU is a burden to maintain and does not bring significant advantages compared to shadow anymore.

Applied, thanks.

--
error compiling committee.c: too many arguments to function
RE: Q: Stopped VM still using host cpu CPU ?
> > After 'stop'ing, the vm's still using CPU-load, like the top will tell
> >
> >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> > 25983 root      20   0  495m 407m 1876 R  8.9  2.5 228:09.15 qemu-system-x86
>
> It shouldn't do that.

ok.

> How long is this after the 'stop'?

30 seconds or 2 days ... the process takes CPU all the time

> Was the guest doing intensive I/O?

No - normally only an idle Windows XP, without running programs

> Are you sure qemu responded to the stop command (by printing the monitor prompt)?

sure. I use the socket communication to CONT the process - works well (-:
Also, if I STOP the kvm, I can access the internal VNC server (sure, mouse movements will NOT move the mouse - because it's stopped..)

I'm using kvm-86.

regards
Danny
Re: Q: Stopped VM still using host cpu CPU ?
On 10/04/2009 05:21 PM, Daniel Schwager wrote:
>> How long is this after the 'stop'?
>
> 30 seconds or 2 days ... the process takes CPU all the time

Can you take an oprofile run to see where it's spending its time?

--
error compiling committee.c: too many arguments to function
Re: [PATCH 04/47] KVM: x86: Disallow hypercalls for guest callers in rings 0
On 09/30/2009 08:58 AM, Jan Lübbe wrote:
> Hi!
>
> On Wed, 2009-08-26 at 13:29 +0300, Avi Kivity wrote:
>> From: Jan Kiszka <jan.kis...@siemens.com>
>>
>> So far unprivileged guest callers running in ring 3 can issue, e.g., MMU hypercalls. Normally, such callers cannot provide any hand-crafted MMU command structure as it has to be passed by its physical address, but they can still crash the guest kernel by passing random addresses.
>>
>> To close the hole, this patch considers hypercalls valid only if issued from guest ring 0. This may still be relaxed on a per-hypercall base in the future once required.
>
> Does kvm-72 (used by Debian and Ubuntu in stable releases) have the problem? If yes, would the approach in this fix also work there?

Probably yes to both.

--
error compiling committee.c: too many arguments to function
Re: Heads up: More user-unaccessible x86 states?
Avi Kivity wrote:
> On 10/04/2009 12:50 PM, Jan Kiszka wrote:
>> Avi Kivity wrote:
>>> On 10/04/2009 10:59 AM, Jan Kiszka wrote:
>>>> Hi,
>>>>
>>>> while preparing new IOCTLs to let user space query and set the yet inaccessible NMI states (pending and masked) I also came across the interrupt shadow masks. Unless I missed something I would say that we so far break them in the rare case that a migration happens right while any of them is asserted. So I guess I should extend my interface and stuff them in as well.
>>>>
>>>> Do we have more of such inaccessible states on x86 that could be included, too? Would be a good chance...
>>>
>>> There's some hidden state in the cpuid mechanism. I think we expose it though (just don't use it in qemu).
>>
>> Do you have more details on this?
>
> Some cpuid leaves return different information based on an internal counter. This is indicated by KVM_CPUID_FLAG_STATEFUL_FUNC.
>
>>> The PDPTRs are hidden state that we should save/restore, though no sane guest relies on them.
>>
>> A quick glance at SVM makes me think that those registers are not exposed there. So when keeping in mind that we may only help VMX guests, I think it makes even less sense to fix this, does it?
>
> Yes. With npt the PDPTRs are essentially gone.
>
>>> I think we can lose information if we migrate during a SIPI (sipi_vector), though that might be fixable without exposing it.
>>
>> Hmm, I see. But even if it's not fixable, such an extension would be an in-kernel irqchip thing.
>
> Yes.
>
>>> We might also lose debug traps.
>>>
>>> We drop pending exceptions; normally that's fine since they'll reinject themselves, but MCE will not.
>>
>> So would it make sense and fix those two issues when we simply save and restore the pending exception?
>
> Yes.
>
> btw, instead of adding a new ioctl, perhaps it makes sense to define a new KVM_VCPU_STATE structure that holds all current and future state (with generous reserved space), instead of separating state over a dozen ioctls.

OK, makes sense. With or without lapic state? How much future space?

Jan
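To make the "one structure with generous reserved space" idea a bit more concrete, here is a purely hypothetical layout. Nothing below is a real KVM ABI: the struct name, the field selection, the 256-byte size and the "reserved must be zero" rule are assumptions picked only to show how a fixed-size blob leaves room for later additions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_RESERVED_BYTES 236 /* hypothetical amount of future space */

/*
 * Hypothetical layout, NOT the real KVM ABI: one fixed-size blob carrying
 * several pieces of otherwise hidden vCPU state, padded so that fields can
 * be added later without changing the structure size.
 */
struct demo_vcpu_state {
    uint32_t version;            /* layout revision known to the writer */
    uint32_t flags;              /* which optional fields are valid */
    uint8_t  nmi_pending;
    uint8_t  nmi_masked;
    uint8_t  interrupt_shadow;
    uint8_t  sipi_vector;
    uint8_t  exception_pending;
    uint8_t  exception_nr;
    uint8_t  pad[2];
    uint32_t exception_error_code;
    uint8_t  reserved[DEMO_RESERVED_BYTES]; /* must be zero for now */
};

/* The size is part of the contract, so pin it at compile time. */
typedef char demo_size_check[(sizeof(struct demo_vcpu_state) == 256) ? 1 : -1];

/* Returns 1 if the reserved area is all zero, 0 otherwise. */
int demo_vcpu_state_valid(const struct demo_vcpu_state *s)
{
    static const uint8_t zero[DEMO_RESERVED_BYTES];

    assert(s != NULL);
    return memcmp(s->reserved, zero, sizeof zero) == 0;
}
```

Rejecting non-zero reserved bytes today is what keeps those bytes usable tomorrow: an old kernel never silently ignores state that a newer user space tried to hand over.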
Re: [PATCH 2/2] KVM: x86: Preserve guest single-stepping on register
Avi Kivity wrote:
> On 10/03/2009 12:31 AM, Jan Kiszka wrote:
>> Give user space more flexibility /wrt its IOCTL order. So far updating the rflags via KVM_SET_REGS ignored potentially set single-step flags. Now they will be kept.
>>
>> 	kvm_rip_write(vcpu, regs->rip);
>> -	kvm_x86_ops->set_rflags(vcpu, regs->rflags);
>> +	rflags = regs->rflags;
>> +	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
>> +		rflags |= X86_EFLAGS_TF | X86_EFLAGS_RF;
>> +	kvm_x86_ops->set_rflags(vcpu, rflags);
>
> I think we need the same on popf instruction emulation.

Hmm, good point. Mind reverting 2/2 and applying this one instead?

Jan

KVM: x86: Rework guest single-step flag injection and filtering

Push TF and RF injection and filtering on guest single-stepping into the vendor get/set_rflags callbacks. This makes the whole mechanism more robust /wrt user space IOCTL order and instruction emulations.

Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>
---
 arch/x86/kvm/svm.c |    8 +++++++-
 arch/x86/kvm/vmx.c |    4 ++++
 arch/x86/kvm/x86.c |   24 +++++++++---------------
 3 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 279a2ae..407e1a7 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -797,11 +797,17 @@ static void svm_vcpu_put(struct kvm_vcpu *vcpu)

 static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu)
 {
-	return to_svm(vcpu)->vmcb->save.rflags;
+	unsigned long rflags = to_svm(vcpu)->vmcb->save.rflags;
+
+	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
+		rflags &= ~(unsigned long)(X86_EFLAGS_TF | X86_EFLAGS_RF);
+	return rflags;
 }

 static void svm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 {
+	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
+		rflags |= X86_EFLAGS_TF | X86_EFLAGS_RF;
 	to_svm(vcpu)->vmcb->save.rflags = rflags;
 }

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 70020e5..8e678ef 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -787,6 +787,8 @@ static unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu)
 	rflags = vmcs_readl(GUEST_RFLAGS);
 	if (to_vmx(vcpu)->rmode.vm86_active)
 		rflags &= ~(unsigned long)(X86_EFLAGS_IOPL | X86_EFLAGS_VM);
+	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
+		rflags &= ~(unsigned long)(X86_EFLAGS_TF | X86_EFLAGS_RF);
 	return rflags;
 }

@@ -794,6 +796,8 @@ static void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 {
 	if (to_vmx(vcpu)->rmode.vm86_active)
 		rflags |= X86_EFLAGS_IOPL | X86_EFLAGS_VM;
+	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
+		rflags |= X86_EFLAGS_TF | X86_EFLAGS_RF;
 	vmcs_writel(GUEST_RFLAGS, rflags);
 }

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index aa5d574..5b562dd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3840,12 +3840,6 @@ int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 	regs->rip = kvm_rip_read(vcpu);
 	regs->rflags = kvm_x86_ops->get_rflags(vcpu);

-	/*
-	 * Don't leak debug flags in case they were set for guest debugging
-	 */
-	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
-		regs->rflags &= ~(X86_EFLAGS_TF | X86_EFLAGS_RF);
-
 	vcpu_put(vcpu);

 	return 0;
@@ -3872,13 +3866,11 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 	kvm_register_write(vcpu, VCPU_REGS_R13, regs->r13);
 	kvm_register_write(vcpu, VCPU_REGS_R14, regs->r14);
 	kvm_register_write(vcpu, VCPU_REGS_R15, regs->r15);
-
 #endif

 	kvm_rip_write(vcpu, regs->rip);
 	kvm_x86_ops->set_rflags(vcpu, regs->rflags);

-
 	vcpu->arch.exception.pending = false;

 	vcpu_put(vcpu);
@@ -4471,12 +4463,15 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
 					struct kvm_guest_debug *dbg)
 {
 	unsigned long rflags;
-	int old_debug;
 	int i;

 	vcpu_load(vcpu);

-	old_debug = vcpu->guest_debug;
+	/*
+	 * Read rflags as long as potentially injected trace flags are still
+	 * filtered out.
+	 */
+	rflags = kvm_x86_ops->get_rflags(vcpu);

 	vcpu->guest_debug = dbg->control;
 	if (!(vcpu->guest_debug & KVM_GUESTDBG_ENABLE))
@@ -4493,11 +4488,10 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
 		vcpu->arch.switch_db_regs = (vcpu->arch.dr7 & DR7_BP_EN_MASK);
 	}

-	rflags = kvm_x86_ops->get_rflags(vcpu);
-	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
-		rflags |= X86_EFLAGS_TF | X86_EFLAGS_RF;
-	else if (old_debug & KVM_GUESTDBG_SINGLESTEP)
-		rflags &= ~(X86_EFLAGS_TF | X86_EFLAGS_RF);
+	/*
+	 * Trigger an rflags update that will inject or remove the trace
+	 * flags.
+	 */
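The inject-on-write, filter-on-read idea of this patch can be played with in user space using a small model like the one below. This is only a sketch: `dbg_vcpu`, the `dbg_*` helpers and the value of `DBG_GUESTDBG_SINGLESTEP` are invented here. The TF (bit 8) and RF (bit 16) positions are the architectural x86 ones.

```c
#include <assert.h>
#include <stddef.h>

#define DBG_EFLAGS_TF (1UL << 8)       /* x86 trap flag */
#define DBG_EFLAGS_RF (1UL << 16)      /* x86 resume flag */
#define DBG_GUESTDBG_SINGLESTEP 0x1UL  /* hypothetical control bit */

/* Hypothetical model of the per-vCPU state touched by the callbacks. */
struct dbg_vcpu {
    unsigned long guest_debug; /* debug control bits set by user space */
    unsigned long hw_rflags;   /* value the hardware would run with */
};

/* Read path: hide the flags that were injected for single-stepping. */
unsigned long dbg_get_rflags(const struct dbg_vcpu *vcpu)
{
    unsigned long rflags;

    assert(vcpu != NULL);
    rflags = vcpu->hw_rflags;
    if (vcpu->guest_debug & DBG_GUESTDBG_SINGLESTEP)
        rflags &= ~(DBG_EFLAGS_TF | DBG_EFLAGS_RF);
    return rflags;
}

/* Write path: (re-)inject the flags while single-stepping is active. */
void dbg_set_rflags(struct dbg_vcpu *vcpu, unsigned long rflags)
{
    assert(vcpu != NULL);
    if (vcpu->guest_debug & DBG_GUESTDBG_SINGLESTEP)
        rflags |= DBG_EFLAGS_TF | DBG_EFLAGS_RF;
    vcpu->hw_rflags = rflags;
}
```

Because every writer goes through the same setter, the order in which user space enables single-stepping and writes registers no longer matters in this model, which seems to be the robustness the commit message is after.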
Re: docs on storage pools?
Richard Wurman wrote:
> So far I've been using files and/or LVM partitions for my VMs -- basically by using virt-manager and modifying existing XML configs and just copying my VM files to be reused.
>
> I'm wondering how KVM storage pools work -- at first I thought it was something like KVM's version of LVM where you can just dump all your VMs in one space ... but it looks like it really means different places you want to store your VMs:

The 'storage pool' concept you're talking about is libvirt functionality, not KVM/QEMU:

http://libvirt.org/storage.html

> - dir: Filesystem Directory
> - disk: Physical Disk Device
> - fs: Pre-Formatted Block Device
> - iscsi: iSCSI Target
> - logical: LVM Volume Group
> - netfs: Network exported directory
>
> I understand things like LVM and storing VMs in a filesystem directory.. but what real difference is there by going through the GUI? I suppose nothing. Maybe I'm overthinking this -- it's just a frontend to where you store your VMs?

Exposing storage management through libvirt allows remote storage provisioning, and saves libvirt users (like virt-install and virt-manager) the trouble of knowing all the differing details between creating lvm LVs, disk partitions, raw/qcow2/vmdk images, etc.

For desktop virt using raw files for storage, there isn't much need to concern yourself with the concept.

Any further questions should be directed to libvirt-l...@redhat.com (for libvirt) or virt-tools-l...@redhat.com (for virt-manager).

- Cole
[PATCH 11/21] i8254.c: Add pr_fmt(fmt)
Add pr_fmt(fmt) "pit: " fmt
Strip pit: prefixes from pr_debug

Signed-off-by: Joe Perches <j...@perches.com>
---
 arch/x86/kvm/i8254.c |   12 +++++++-----
 1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 82ad523..fa83a15 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -29,6 +29,8 @@
  * Based on QEMU and Xen.
  */

+#define pr_fmt(fmt) "pit: " fmt
+
 #include <linux/kvm_host.h>

 #include "irq.h"
@@ -262,7 +264,7 @@ void __kvm_migrate_pit_timer(struct kvm_vcpu *vcpu)

 static void destroy_pit_timer(struct kvm_timer *pt)
 {
-	pr_debug("pit: execute del timer!\n");
+	pr_debug("execute del timer!\n");
 	hrtimer_cancel(&pt->timer);
 }

@@ -284,7 +286,7 @@ static void create_pit_timer(struct kvm_kpit_state *ps, u32 val, int is_period)

 	interval = muldiv64(val, NSEC_PER_SEC, KVM_PIT_FREQ);

-	pr_debug("pit: create pit timer, interval is %llu nsec\n", interval);
+	pr_debug("create pit timer, interval is %llu nsec\n", interval);

 	/* TODO The new value only affected after the retriggered */
 	hrtimer_cancel(&pt->timer);
@@ -309,7 +311,7 @@ static void pit_load_count(struct kvm *kvm, int channel, u32 val)

 	WARN_ON(!mutex_is_locked(&ps->lock));

-	pr_debug("pit: load_count val is %d, channel is %d\n", val, channel);
+	pr_debug("load_count val is %d, channel is %d\n", val, channel);

 	/*
 	 * The largest possible initial count is 0; this is equivalent
@@ -395,8 +397,8 @@ static int pit_ioport_write(struct kvm_io_device *this,
 	mutex_lock(&pit_state->lock);

 	if (val != 0)
-		pr_debug("pit: write addr is 0x%x, len is %d, val is 0x%x\n",
-			  (unsigned int)addr, len, val);
+		pr_debug("write addr is 0x%x, len is %d, val is 0x%x\n",
+			 (unsigned int)addr, len, val);

 	if (addr == 3) {
 		channel = val >> 6;
-- 
1.6.3.1.10.g659a0.dirty
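For readers who have not used pr_fmt before, the mechanism is plain string-literal concatenation done by the preprocessor. The user-space sketch below imitates it; `demo_pr_debug` and `demo_log_load_count` are invented stand-ins (the real pr_debug writes to the kernel log, not to a buffer), and `##__VA_ARGS__` is the GNU extension the kernel itself relies on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Defined before the logging macro is used, as the patch does with "pit: ". */
#define pr_fmt(fmt) "pit: " fmt

/*
 * User-space stand-in for pr_debug(): formats into a caller-supplied
 * buffer. pr_fmt() glues the prefix onto the format string at compile time.
 */
#define demo_pr_debug(buf, size, fmt, ...) \
    snprintf((buf), (size), pr_fmt(fmt), ##__VA_ARGS__)

/* Example call site: note that the message itself carries no prefix. */
int demo_log_load_count(char *buf, size_t size, int val, int channel)
{
    assert(buf != NULL && size > 0);
    return demo_pr_debug(buf, size,
                         "load_count val is %d, channel is %d\n",
                         val, channel);
}
```

The prefix then lives in exactly one place per file, which is what lets the individual messages drop their hand-written "pit: ".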
[PATCH 00/21] pr_dbg, pr_fmt
One possible long term goal is to stop adding

	#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

to source files to prefix modulename to logging output.

It might be useful to eventually have kernel.h use a standard #define pr_fmt which includes KBUILD_MODNAME instead of a blank or empty define.

Perhaps over time, the source modules that use pr_<level> with prefixes can be converted to use pr_fmt.

If all modules are converted, that will allow the printk routine to add module/filename/line/offset to the logging lines using some function similar to dynamic_debug and substantially reduce object string use by removing the repeated prefixes.

This patchset does not get to that result. The patches right now use _more_ string space because all logging messages have unshared prefixes, but it may be a useful start.

The patchset strips prefixes from printks and adds pr_fmt to arch/x86, crypto, kernel, and a few drivers.

It also converts printk(KERN_<level> to pr_<level> in a few files that already had some pr_<level> uses.

The conversion also generally used long length format strings in the place of multiple short strings to ease any grep/search.

Joe Perches (21):
  include/linux/ dynamic_debug.h kernel.h: Remove KBUILD_MODNAME from dynamic_pr_debug, add #define pr_dbg
  ftrace.c: Add #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
  mtrr: use pr_<level> and pr_fmt
  microcode: use pr_<level> and add pr_fmt
  amd_iommu.c: use pr_<level> and add pr_fmt(fmt)
  es7000_32.c: use pr_<level> and add pr_fmt(fmt)
  arch/x86/kernel/apic/: use pr_<level> and add pr_fmt(fmt)
  mcheck/: use pr_<level> and add pr_fmt(fmt)
  arch/x86/kernel/setup_percpu.c: use pr_<level> and add pr_fmt(fmt)
  arch/x86/kernel/cpu/: use pr_<level> and add pr_fmt(fmt)
  i8254.c: Add pr_fmt(fmt)
  kmmio.c: Add and use pr_fmt(fmt)
  testmmiotrace.c: Add and use pr_fmt(fmt)
  crypto/: use pr_<level> and add pr_fmt(fmt)
  kernel/power/: use pr_<level> and add pr_fmt(fmt)
  kernel/kexec.c: use pr_<level> and add pr_fmt(fmt)
  crypto/async_tx/raid6test.c: use pr_<level> and add pr_fmt(fmt)
  arch/x86/mm/mmio-mod.c: use pr_fmt
  drivers/net/bonding/: : use pr_fmt
  drivers/net/tlan: use pr_<level> and add pr_fmt(fmt)
  drivers/net/tlan.h: Convert printk(KERN_DEBUG to pr_dbg(

 arch/x86/kernel/amd_iommu.c                |   71 ++--
 arch/x86/kernel/apic/apic.c                |   48 ++--
 arch/x86/kernel/apic/apic_flat_64.c        |    5 +-
 arch/x86/kernel/apic/bigsmp_32.c           |    8 +-
 arch/x86/kernel/apic/es7000_32.c           |   12 +-
 arch/x86/kernel/apic/io_apic.c             |  239 ++--
 arch/x86/kernel/apic/nmi.c                 |   29 +-
 arch/x86/kernel/apic/numaq_32.c            |   53 ++--
 arch/x86/kernel/apic/probe_32.c            |   18 +-
 arch/x86/kernel/apic/probe_64.c            |    8 +-
 arch/x86/kernel/apic/summit_32.c           |   23 +-
 arch/x86/kernel/apic/x2apic_uv_x.c         |   26 +-
 arch/x86/kernel/cpu/addon_cpuid_features.c |    9 +-
 arch/x86/kernel/cpu/amd.c                  |   26 +-
 arch/x86/kernel/cpu/bugs.c                 |   23 +-
 arch/x86/kernel/cpu/bugs_64.c              |    4 +-
 arch/x86/kernel/cpu/centaur.c              |   12 +-
 arch/x86/kernel/cpu/common.c               |   54 ++--
 arch/x86/kernel/cpu/cpu_debug.c            |    4 +-
 arch/x86/kernel/cpu/cyrix.c                |   12 +-
 arch/x86/kernel/cpu/intel.c                |   14 +-
 arch/x86/kernel/cpu/intel_cacheinfo.c      |   14 +-
 arch/x86/kernel/cpu/mcheck/mce-inject.c    |   20 +-
 arch/x86/kernel/cpu/mcheck/mce.c           |   59 ++--
 arch/x86/kernel/cpu/mcheck/mce_intel.c     |    8 +-
 arch/x86/kernel/cpu/mcheck/p5.c            |   21 +-
 arch/x86/kernel/cpu/mcheck/therm_throt.c   |   21 +-
 arch/x86/kernel/cpu/mcheck/threshold.c     |    7 +-
 arch/x86/kernel/cpu/mcheck/winchip.c       |    8 +-
 arch/x86/kernel/cpu/mtrr/centaur.c         |    4 +-
 arch/x86/kernel/cpu/mtrr/cleanup.c         |   59 ++--
 arch/x86/kernel/cpu/mtrr/generic.c         |   39 +-
 arch/x86/kernel/cpu/mtrr/main.c            |   32 +-
 arch/x86/kernel/cpu/perf_event.c           |   10 +-
 arch/x86/kernel/cpu/perfctr-watchdog.c     |   11 +-
 arch/x86/kernel/cpu/transmeta.c            |   20 +-
 arch/x86/kernel/cpu/vmware.c               |   11 +-
 arch/x86/kernel/ftrace.c                   |    8 +-
 arch/x86/kernel/microcode_amd.c            |    5 +-
 arch/x86/kernel/microcode_core.c           |   23 +-
 arch/x86/kernel/microcode_intel.c          |   47 +--
 arch/x86/kernel/setup_percpu.c             |   13 +-
 arch/x86/kvm/i8254.c                       |   12 +-
 arch/x86/mm/kmmio.c                        |   40 +-
 arch/x86/mm/mmio-mod.c                     |   71 ++--
 arch/x86/mm/testmmiotrace.c                |   29 +-
 crypto/algapi.c                            |    4 +-
 crypto/ansi_cprng.c                        |   39 +-
 crypto/async_tx/async_tx.c                 |    5 +-
 crypto/async_tx/raid6test.c                |   30 +-
 crypto/fips.c
buildbot failure in qemu-kvm on disable_kvm_x86_64_out_of_tree
The Buildbot has detected a new failure of disable_kvm_x86_64_out_of_tree on qemu-kvm.
Full details are available at:
 http://buildbot.b1-systems.de/qemu-kvm/builders/disable_kvm_x86_64_out_of_tree/builds/34

Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/

Buildslave for this Build: b1_qemu_kvm_1

Build Reason: The Nightly scheduler named 'nightly_disable_kvm' triggered this build
Build Source Stamp: [branch master] HEAD
Blamelist:

BUILD FAILED: failed git

sincerely,
 -The Buildbot
Re: kvm guest: hrtimer: interrupt too slow
On Sun, Oct 04, 2009 at 04:01:02PM +0400, Michael Tokarev wrote:
> Marcelo Tosatti wrote:
>> Michael,
>>
>> Can you please give the patch below a try? (without acpi_pm timer or priority adjustments for the guest).
>
> Sure. I'll try it out in an hour or two, while I can experiment freely because it's the weekend.
>
> But I wonder...
>
>>> [] hrtimer: interrupt too slow, forcing clock min delta to 93629025 ns
>>
>> It seems the way hrtimer_interrupt_hanging calculates min_delta is wrong (especially for virtual machines). The guest vcpu can be scheduled out during the execution of the hrtimer callbacks (and the callbacks themselves can do operations that translate to blocking operations in the hypervisor).
>>
>> So high min_delta values can be calculated if, for example, a single hrtimer_interrupt run takes two host time slices to execute, while some other higher priority task runs for N slices in between.
>>
>> Using the hrtimer_interrupt execution time (which can be the worst case at any given time) as the min_delta is problematic.
>>
>> So simply increase min_delta_ns by 50% once every detected failure, which will eventually lead to an acceptable threshold (the algorithm should scale back down to a lower min_delta, to adjust back to wealthier times, too).
>
> ..I wonder what I should check for. I mean, the end result of this patch is not entirely clear to me, what should it change? I see that instead of the now-calculated-after-error (probably very large) min_delta, it's increased to a smaller value. So I should be getting more such messages (forcing min_delta to $foo), but the responsiveness of the guest should stay in a more or less acceptable range (unless it continues erroring, in which case the responsiveness will be slowly reduced).

Right.

> Yes indeed, it's better than the current situation, when the guest works fine initially but all of a sudden switches to a wild very-slow-to-reply mode. But it does not look like the right solution either, even if the back adjustment (mentioned in the last statement above) is implemented. Unless I completely missed the point...
>
> Nevertheless, the question stands: what am I looking for now, when the patch is applied? I can't measure the responsiveness, especially since the min_delta gets set to different (large) values each time (I mean the current situation without the patch).

You should see min_delta_ns increase to a much smaller value, hopefully in the 2000-1 range. min_delta_ns is the minimum delay a high resolution timer can have. You had it set in the hundreds of milliseconds range.

> Thanks!
>
> /mjt
>
>> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
>> index 49da79a..8997978 100644
>> --- a/kernel/hrtimer.c
>> +++ b/kernel/hrtimer.c
>> @@ -1234,28 +1234,20 @@ static void __run_hrtimer(struct hrtimer *timer)
>>
>>  #ifdef CONFIG_HIGH_RES_TIMERS
>>
>> -static int force_clock_reprogram;
>> -
>>  /*
>>   * After 5 iteration's attempts, we consider that hrtimer_interrupt()
>>   * is hanging, which could happen with something that slows the interrupt
>> - * such as the tracing. Then we force the clock reprogramming for each future
>> - * hrtimer interrupts to avoid infinite loops and use the min_delta_ns
>> - * threshold that we will overwrite.
>> - * The next tick event will be scheduled to 3 times we currently spend on
>> - * hrtimer_interrupt(). This gives a good compromise, the cpus will spend
>> - * 1/4 of their time to process the hrtimer interrupts. This is enough to
>> - * let it running without serious starvation.
>> + * such as the tracing, so we increase min_delta_ns.
>>   */
>>
>>  static inline void
>> -hrtimer_interrupt_hanging(struct clock_event_device *dev,
>> -			ktime_t try_time)
>> +hrtimer_interrupt_hanging(struct clock_event_device *dev)
>>  {
>> -	force_clock_reprogram = 1;
>> -	dev->min_delta_ns = (unsigned long)try_time.tv64 * 3;
>> -	printk(KERN_WARNING "hrtimer: interrupt too slow, "
>> -		"forcing clock min delta to %lu ns\n", dev->min_delta_ns);
>> +	dev->min_delta_ns += dev->min_delta_ns >> 1;
>> +	if (printk_ratelimit())
>> +		printk(KERN_WARNING "hrtimer: interrupt too slow, "
>> +			"forcing clock min delta to %lu ns\n",
>> +			dev->min_delta_ns);
>>  }
>>  /*
>>   * High resolution timer interrupt
>> @@ -1276,7 +1268,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
>>   retry:
>>  	/* 5 retries is enough to notice a hang */
>>  	if (!(++nr_retries % 5))
>> -		hrtimer_interrupt_hanging(dev, ktime_sub(ktime_get(), now));
>> +		hrtimer_interrupt_hanging(dev);
>>
>>  	now = ktime_get();
>>
>> @@ -1342,7 +1334,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
>>
>>  	/* Reprogramming necessary ? */
>>  	if (expires_next.tv64 != KTIME_MAX) {
>> -		if (tick_program_event(expires_next, force_clock_reprogram))
>> +		if (tick_program_event(expires_next, 0))
>>  			goto retry;
>>  	}
>>  }
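To get a feeling for how quickly the "increase by 50% per detected hang" rule moves min_delta_ns, here is a small stand-alone calculation. The helper names and the sample numbers in the note below are made up for illustration and are not measurements from this thread; only the `x += x >> 1` step is taken from the patch.

```c
#include <assert.h>

/* One step of the patch's rule: grow the minimum delta by 50%. */
unsigned long demo_bump_min_delta(unsigned long min_delta_ns)
{
    return min_delta_ns + (min_delta_ns >> 1);
}

/*
 * Count how many bumps it takes to get from start_ns to at least
 * target_ns. Overflow for targets near ULONG_MAX is not handled.
 */
unsigned int demo_bumps_until(unsigned long start_ns, unsigned long target_ns)
{
    unsigned int bumps = 0;

    /* Below 2 ns the shift adds nothing and the loop would never end. */
    assert(start_ns >= 2);

    while (start_ns < target_ns) {
        start_ns = demo_bump_min_delta(start_ns);
        bumps++;
    }
    return bumps;
}
```

Starting from an assumed 1000 ns, six bumps are enough to pass 10000 ns (1000, 1500, 2250, 3375, 5062, 7593, 11389), so the growth is geometric but far gentler than jumping straight to three times the measured interrupt duration.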