Re: [PATCH v4 8/8] macvtap/tun: add VNET_BE flag
On Wed, Apr 22, 2015 at 12:01:29PM +0200, Greg Kurz wrote:
> On Tue, 21 Apr 2015 20:30:23 +0200 Michael S. Tsirkin m...@redhat.com wrote:
> > On Tue, Apr 21, 2015 at 06:22:20PM +0200, Greg Kurz wrote:
> > > On Tue, 21 Apr 2015 16:06:33 +0200 Michael S. Tsirkin m...@redhat.com wrote:
> > > > On Fri, Apr 10, 2015 at 12:20:21PM +0200, Greg Kurz wrote:
> > > > > The VNET_LE flag was introduced to fix accesses to virtio 1.0 headers
> > > > > that are always little-endian. It can also be used to handle the special
> > > > > case of a legacy little-endian device implemented by a big-endian host.
> > > > >
> > > > > Let's add a flag and ioctls for big-endian devices as well. If both flags
> > > > > are set, little-endian wins.
> > > > >
> > > > > Since this isn't a common use case, the feature is controlled by a kernel
> > > > > config option (not set by default).
> > > > >
> > > > > Both macvtap and tun are covered by this patch since they share the same
> > > > > API with userland.
> > > > >
> > > > > Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com
> > > > > ---
> > > > >  drivers/net/Kconfig         |   12
> > > > >  drivers/net/macvtap.c       |   60 +-
> > > > >  drivers/net/tun.c           |   62 ++-
> > > > >  include/uapi/linux/if_tun.h |    2 +
> > > > >  4 files changed, 134 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> > > > > index df51d60..f0e23a0 100644
> > > > > --- a/drivers/net/Kconfig
> > > > > +++ b/drivers/net/Kconfig
> > > > > @@ -244,6 +244,18 @@ config TUN
> > > > >  	  If you don't know what to use this for, you don't need it.
> > > > >
> > > > > +config TUN_VNET_BE
> > > > > +	bool "Support for big-endian vnet headers"
> > > > > +	default n
> > > > > +	---help---
> > > > > +	  This option allows TUN/TAP and MACVTAP device drivers to parse
> > > > > +	  vnet headers that are in big-endian byte order. It is useful
> > > > > +	  when the headers come from a big-endian legacy virtio driver and
> > > > > +	  the host is little-endian.
> > > > > +
> > > > > +	  Unless you have a little-endian system hosting a big-endian virtual
> > > > > +	  machine with a virtio NIC, you should say N.
> > > > > +
> > > >
> > > > should mention cross-endian, not big-endian, right?
> > >
> > > The current TUN_VNET_LE related code is already doing cross-endian: without
> > > this patch, one can already run a LE guest on a BE host... wouldn't it be
> > > confusing to mention cross-endian only when the guest is BE ?
> >
> > Hmm I think no - LE is also useful for virtio 1 - this is what
> > it was intended for after all.
> >
> > > What about having a completely distinct implementation for cross-endian
> > > that doesn't reuse the existing code and defines then ?
> >
> > I think implementation and interface are fine, just the
> > documentation can be improved a bit.
> >
> > How about:
> > 	Support for cross-endian vnet headers on little-endian kernels.
>
> Accordingly CONFIG_TUN_VNET_CROSS_LE ?

Sure.

> And what about also renaming the ioctl to TUNSETVNETCROSSLE then ?
>
> --
> Greg

I think not.

--
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
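The flag semantics discussed in this thread ("if both flags are set, little-endian wins") can be illustrated with a minimal userspace sketch. The flag values and helper names below (SKETCH_VNET_LE, SKETCH_VNET_BE, vnet16_to_cpu) are hypothetical and are not the identifiers used in tun.c or macvtap.c; the fallback to host byte order when neither flag is set is an assumption about legacy devices.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; the real values live in the drivers. */
#define SKETCH_VNET_LE 0x1u
#define SKETCH_VNET_BE 0x2u

/* Runtime probe: true when the host itself is little-endian. */
static bool host_is_little_endian(void)
{
	const uint16_t probe = 1;

	return *(const uint8_t *)&probe == 1;
}

/*
 * Decide the byte order of the vnet header. Little-endian wins when
 * both flags are set; with neither flag set, this sketch assumes a
 * legacy device that uses the host's native order.
 */
static bool vnet_hdr_is_little_endian(unsigned int flags)
{
	if (flags & SKETCH_VNET_LE)
		return true;
	if (flags & SKETCH_VNET_BE)
		return false;
	return host_is_little_endian();
}

static uint16_t swab16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/* Convert a 16-bit header field from device order to host order. */
static uint16_t vnet16_to_cpu(unsigned int flags, uint16_t val)
{
	if (vnet_hdr_is_little_endian(flags) == host_is_little_endian())
		return val;
	return swab16(val);
}
```

On a little-endian host, vnet16_to_cpu(SKETCH_VNET_BE, x) swaps the bytes of x, which corresponds to the cross-endian case (big-endian legacy guest on a little-endian host) that the Kconfig help text describes.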
Re: [PATCH 0/2] KVM/ARM updates for v4.1, take 2
On 22/04/2015 17:08, Marc Zyngier wrote:
> Paolo, Marcelo,
>
> This is the second pull request for the KVM/ARM updates targeting v4.1.
> Not much to see this time, just a couple of boring fixes.

Pulled.

Paolo
Re: [PATCH stable] KVM: x86: Fix lost interrupt on irr_pending race
On Tue, Apr 21, 2015 at 10:47:37AM +0200, Paolo Bonzini wrote:
> On 21/04/2015 09:52, Paolo Bonzini wrote:
> > From: Nadav Amit na...@cs.technion.ac.il
> >
> > [ upstream commit f210f7572bedf3320599e8b2d8e8ec2d96270d0b ]
> >
> > apic_find_highest_irr assumes irr_pending is set if any vector in APIC_IRR is
> > set. If this assumption is broken and apicv is disabled, the injection of
> > interrupts may be deferred until another interrupt is delivered to the guest.
> > Ultimately, if no other interrupt should be injected to that vCPU, the pending
> > interrupt may be lost.
> >
> > commit 56cc2406d68c ("KVM: nVMX: fix "acknowledge interrupt on exit" when APICv
> > is in use") changed the behavior of apic_clear_irr so irr_pending is cleared
> > after setting APIC_IRR vector. After this commit, if apic_set_irr and
> > apic_clear_irr run simultaneously, a race may occur, resulting in APIC_IRR vector
> > set, and irr_pending cleared. In the following example, assume a single vector
> > is set in IRR prior to calling apic_clear_irr:
> >
> > apic_set_irr			apic_clear_irr
> > ------------			--------------
> > apic->irr_pending = true;
> > 				apic_clear_vector(...);
> > 				vec = apic_search_irr(apic);
> > 				// => vec == -1
> > apic_set_vector(...);
> > 				apic->irr_pending = (vec != -1);
> > 				// => apic->irr_pending == false
> >
> > Nonetheless, it appears the race might even occur prior to this commit:
> >
> > apic_set_irr			apic_clear_irr
> > ------------			--------------
> > apic->irr_pending = true;
> > 				apic->irr_pending = false;
> > 				apic_clear_vector(...);
> > 				if (apic_search_irr(apic) != -1)
> > 					apic->irr_pending = true;
> > 				// => apic->irr_pending == false
> > apic_set_vector(...);
> >
> > Fixing this issue by:
> > 1. Restoring the previous behavior of apic_clear_irr: clear irr_pending, call
> >    apic_clear_vector, and then if APIC_IRR is non-zero, set irr_pending.
> > 2. On apic_set_irr: first call apic_set_vector, then set irr_pending.
> >
> > Signed-off-by: Nadav Amit na...@cs.technion.ac.il
> > Fixes: 33e4c68656a2e461b296ce714ec322978de85412
> > Cc: sta...@vger.kernel.org # 2.6.32+
> > Signed-off-by: Paolo Bonzini pbonz...@redhat.com
> > ---
> > The race was reported in 3.17+ by Brad Campbell and in
> > 2.6.32 by Saso Slavicic, so it qualifies for stable.
>
> Patch for kernels before 3.17:

Thanks Paolo. I was going to apply this backport to the 3.16 kernel
but it looks like the original commit is a clean cherry-pick. Shall I
still apply your backport, or do you think the original commit should
be applied instead?

Cheers,
--
Luís

> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 6e8ce5a1a05d..e0e5642dae41 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -341,8 +341,12 @@ EXPORT_SYMBOL_GPL(kvm_apic_update_irr);
>
>  static inline void apic_set_irr(int vec, struct kvm_lapic *apic)
>  {
> -	apic->irr_pending = true;
>  	apic_set_vector(vec, apic->regs + APIC_IRR);
> +	/*
> +	 * irr_pending must be true if any interrupt is pending; set it after
> +	 * APIC_IRR to avoid race with apic_clear_irr
> +	 */
> +	apic->irr_pending = true;
>  }
>
>  static inline int apic_search_irr(struct kvm_lapic *apic)
>
> Thanks,
>
> Paolo
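The ordering described in the commit message (set the vector first and the irr_pending hint second; clear the hint first and re-raise it after the scan) can be sketched in portable C11. This is only an illustration: struct sketch_lapic and the sketch_* helpers are hypothetical stand-ins, and the real code in arch/x86/kvm/lapic.c works on the APIC register page with kernel bitops rather than with <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NR_VECTORS 256
#define SKETCH_WORDS (SKETCH_NR_VECTORS / 32)

/* Hypothetical stand-in for the relevant parts of struct kvm_lapic. */
struct sketch_lapic {
	_Atomic uint32_t irr[SKETCH_WORDS];
	atomic_bool irr_pending;
};

/* Return the highest vector set in the IRR, or -1 if it is empty. */
static int sketch_search_irr(struct sketch_lapic *apic)
{
	for (int w = SKETCH_WORDS - 1; w >= 0; w--) {
		uint32_t bits = atomic_load(&apic->irr[w]);

		for (int b = 31; bits && b >= 0; b--)
			if (bits & (UINT32_C(1) << b))
				return w * 32 + b;
	}
	return -1;
}

/* Set the vector first, then the hint (step 2 of the fix). */
static void sketch_set_irr(struct sketch_lapic *apic, int vec)
{
	atomic_fetch_or(&apic->irr[vec / 32], UINT32_C(1) << (vec % 32));
	atomic_store(&apic->irr_pending, true);
}

/*
 * Clear the hint, clear the vector, then re-raise the hint if any
 * vector is still set (step 1 of the fix).
 */
static void sketch_clear_irr(struct sketch_lapic *apic, int vec)
{
	atomic_store(&apic->irr_pending, false);
	atomic_fetch_and(&apic->irr[vec / 32], ~(UINT32_C(1) << (vec % 32)));
	if (sketch_search_irr(apic) != -1)
		atomic_store(&apic->irr_pending, true);
}
```

With this ordering, a concurrent sketch_set_irr either sets its bit before the scan in sketch_clear_irr (so the scan re-raises the hint) or stores the hint after the last store made by sketch_clear_irr, so a different vector should not be left set while irr_pending is false.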
Re: [GSoC] project proposal
On 22/04/2015 10:51, Catalin Vasile wrote:
> If we want a mainstream userspace backend that could interact with a
> lot of crypto engines, we could use OpenSSL (it can actually use
> cryptodev and AF_ALG as engines).
> For now, until mid June (my diploma project presentation) I still want
> to use vhost as a backend for the sole purpose of having a finished
> backend which now I have a good grasp upon. If the finished work would
> be good enough work to be merged upstream will be talked later.
> As a GSoC project, OpenSSL as a backend would continue the
> virtio-crypto development, as it's not uncommon to have multiple types
> of backends. The current work on virtio-crypto qemu and guest module
> is pretty backend agnostic, and could allow future development (use of
> other backends and other features).

OpenSSL's license is not compatible with QEMU, hence the suggestion of
using gnutls.

Paolo
Re: [PATCH stable] KVM: x86: Fix lost interrupt on irr_pending race
On 22/04/2015 15:34, Luis Henriques wrote:
> Thanks Paolo. I was going to apply this backport to the 3.16 kernel
> but it looks like the original commit is a clean cherry-pick. Shall I
> still apply your backport, or do you think the original commit should
> be applied instead?

Indeed you're right. I wrote the backport for 3.16(.0). However,
commit 56cc2406d68c0f09505c389e276f27a99f495cbd was marked for stable,
so it's necessary to cherry-pick the entire patch on the stable kernel
where the buggy commit was backported.

That should be, according to the sta...@vger.kernel.org archives,
3.10.54+, 3.13.11.7+, 3.14.18+, 3.16.2+.

Paolo

> Cheers,
> --
> Luís
>
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 6e8ce5a1a05d..e0e5642dae41 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -341,8 +341,12 @@ EXPORT_SYMBOL_GPL(kvm_apic_update_irr);
> >
> >  static inline void apic_set_irr(int vec, struct kvm_lapic *apic)
> >  {
> > -	apic->irr_pending = true;
> >  	apic_set_vector(vec, apic->regs + APIC_IRR);
> > +	/*
> > +	 * irr_pending must be true if any interrupt is pending; set it after
> > +	 * APIC_IRR to avoid race with apic_clear_irr
> > +	 */
> > +	apic->irr_pending = true;
> >  }
> >
> >  static inline int apic_search_irr(struct kvm_lapic *apic)
> >
> > Thanks,
> >
> > Paolo
Re: [PATCH stable] KVM: x86: Fix lost interrupt on irr_pending race
On Wed, Apr 22, 2015 at 03:47:04PM +0200, Paolo Bonzini wrote:
> On 22/04/2015 15:34, Luis Henriques wrote:
> > Thanks Paolo. I was going to apply this backport to the 3.16 kernel
> > but it looks like the original commit is a clean cherry-pick. Shall I
> > still apply your backport, or do you think the original commit should
> > be applied instead?
>
> Indeed you're right. I wrote the backport for 3.16(.0). However,
> commit 56cc2406d68c0f09505c389e276f27a99f495cbd was marked for stable,
> so it's necessary to cherry-pick the entire patch on the stable kernel
> where the buggy commit was backported.
>
> That should be, according to the sta...@vger.kernel.org archives,
> 3.10.54+, 3.13.11.7+, 3.14.18+, 3.16.2+.

Great, thanks for the quick reply. I'll queue the (entire) fix for
the 3.16 kernel.

Cheers,
--
Luís

> Paolo
>
> > Cheers,
> > --
> > Luís
> >
> > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > > index 6e8ce5a1a05d..e0e5642dae41 100644
> > > --- a/arch/x86/kvm/lapic.c
> > > +++ b/arch/x86/kvm/lapic.c
> > > @@ -341,8 +341,12 @@ EXPORT_SYMBOL_GPL(kvm_apic_update_irr);
> > >
> > >  static inline void apic_set_irr(int vec, struct kvm_lapic *apic)
> > >  {
> > > -	apic->irr_pending = true;
> > >  	apic_set_vector(vec, apic->regs + APIC_IRR);
> > > +	/*
> > > +	 * irr_pending must be true if any interrupt is pending; set it after
> > > +	 * APIC_IRR to avoid race with apic_clear_irr
> > > +	 */
> > > +	apic->irr_pending = true;
> > >  }
> > >
> > >  static inline int apic_search_irr(struct kvm_lapic *apic)
> > >
> > > Thanks,
> > >
> > > Paolo
[PATCH 2/2] KVM: arm/arm64: check IRQ number on userland injection
From: Andre Przywara andre.przyw...@arm.com

When userland injects an SPI via the KVM_IRQ_LINE ioctl we currently
only check it against a fixed limit, which historically is set
to 127. With the new dynamic IRQ allocation the effective limit may
actually be smaller (64).
So when a malicious or buggy userland injects an SPI in that
range, we spill over on our VGIC bitmaps and bytemaps memory.
I could trigger a host kernel NULL pointer dereference with current
mainline by injecting some bogus IRQ number from a hacked kvmtool:
-----------------
DEBUG: kvm_vgic_inject_irq(kvm, cpu=0, irq=114, level=1)
DEBUG: vgic_update_irq_pending(kvm, cpu=0, irq=114, level=1)
DEBUG: IRQ #114 still in the game, writing to bytemap now...
Unable to handle kernel NULL pointer dereference at virtual address
pgd = ffc07652e000
[] *pgd=f658b003, *pud=f658b003, *pmd=
Internal error: Oops: 9606 [#1] PREEMPT SMP
Modules linked in:
CPU: 1 PID: 1053 Comm: lkvm-msi-irqinj Not tainted 4.0.0-rc7+ #3027
Hardware name: FVP Base (DT)
task: ffc0774e9680 ti: ffc0765a8000 task.ti: ffc0765a8000
PC is at kvm_vgic_inject_irq+0x234/0x310
LR is at kvm_vgic_inject_irq+0x30c/0x310
pc : [ffcae0a8] lr : [ffcae180] pstate: 8145
.
-----------------

So this patch fixes this by checking the SPI number against the
actual limit. Also we remove the former legacy hard limit of
127 in the ioctl code.

Signed-off-by: Andre Przywara andre.przyw...@arm.com
Reviewed-by: Christoffer Dall christoffer.d...@linaro.org
CC: sta...@vger.kernel.org # 4.0, 3.19, 3.18
[maz: wrap KVM_ARM_IRQ_GIC_MAX with #ifndef __KERNEL__,
as suggested by Christopher Covington]
Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 arch/arm/include/uapi/asm/kvm.h   | 8 +++-
 arch/arm/kvm/arm.c                | 3 +--
 arch/arm64/include/uapi/asm/kvm.h | 8 +++-
 virt/kvm/arm/vgic.c               | 3 +++
 4 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h
index 2499867..df3f60c 100644
--- a/arch/arm/include/uapi/asm/kvm.h
+++ b/arch/arm/include/uapi/asm/kvm.h
@@ -195,8 +195,14 @@ struct kvm_arch_memory_slot {
 #define KVM_ARM_IRQ_CPU_IRQ		0
 #define KVM_ARM_IRQ_CPU_FIQ		1

-/* Highest supported SPI, from VGIC_NR_IRQS */
+/*
+ * This used to hold the highest supported SPI, but it is now obsolete
+ * and only here to provide source code level compatibility with older
+ * userland. The highest SPI number can be set via KVM_DEV_ARM_VGIC_GRP_NR_IRQS.
+ */
+#ifndef __KERNEL__
 #define KVM_ARM_IRQ_GIC_MAX		127
+#endif

 /* One single KVM irqchip, ie. the VGIC */
 #define KVM_NR_IRQCHIPS          1
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 6f53645..d9631ec 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -671,8 +671,7 @@ int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level,
 		if (!irqchip_in_kernel(kvm))
 			return -ENXIO;

-		if (irq_num < VGIC_NR_PRIVATE_IRQS ||
-		    irq_num > KVM_ARM_IRQ_GIC_MAX)
+		if (irq_num < VGIC_NR_PRIVATE_IRQS)
 			return -EINVAL;

 		return kvm_vgic_inject_irq(kvm, 0, irq_num, level);
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index c154c0b..d268320 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -188,8 +188,14 @@ struct kvm_arch_memory_slot {
 #define KVM_ARM_IRQ_CPU_IRQ		0
 #define KVM_ARM_IRQ_CPU_FIQ		1

-/* Highest supported SPI, from VGIC_NR_IRQS */
+/*
+ * This used to hold the highest supported SPI, but it is now obsolete
+ * and only here to provide source code level compatibility with older
+ * userland. The highest SPI number can be set via KVM_DEV_ARM_VGIC_GRP_NR_IRQS.
+ */
+#ifndef __KERNEL__
 #define KVM_ARM_IRQ_GIC_MAX		127
+#endif

 /* One single KVM irqchip, ie. the VGIC */
 #define KVM_NR_IRQCHIPS          1
diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index 7ed7873..78fb820 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -1561,6 +1561,9 @@ int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int irq_num,
 		goto out;
 	}

+	if (irq_num >= kvm->arch.vgic.nr_irqs)
+		return -EINVAL;
+
 	vcpu_id = vgic_update_irq_pending(kvm, cpuid, irq_num, level);
 	if (vcpu_id >= 0) {
 		/* kick the specified vcpu */
--
2.1.4
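The essence of the fix is to validate the userland-supplied IRQ number against the per-VM limit instead of a compile-time constant. A minimal userspace sketch of that check follows; struct sketch_vgic, SKETCH_NR_PRIVATE_IRQS and sketch_check_spi are hypothetical names chosen for illustration, and the value 32 for the number of private IRQs is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>

/* Assumed number of per-CPU (private) interrupts; SPIs start here. */
#define SKETCH_NR_PRIVATE_IRQS 32

/* Hypothetical per-VM state: nr_irqs is the limit chosen at VGIC setup. */
struct sketch_vgic {
	unsigned int nr_irqs;
};

/*
 * Validate an SPI number coming from userland. Returns 0 if the number
 * is a shared interrupt below the dynamic limit, -EINVAL otherwise.
 */
static int sketch_check_spi(const struct sketch_vgic *vgic,
			    unsigned int irq_num)
{
	if (irq_num < SKETCH_NR_PRIVATE_IRQS)
		return -EINVAL;
	if (irq_num >= vgic->nr_irqs)
		return -EINVAL;
	return 0;
}
```

With nr_irqs set to 64, the bogus IRQ 114 from the log above is rejected, whereas a check against a fixed maximum of 127 would have let it through.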
Re: [GSoC] project proposal
I found my way through its API.
http://www.gnutls.org/manual/gnutls.html#Cryptographic-API
Does anyone know if it has one-shot givencrypt (generate IV and
encrypt as one job)?
I see an option to get random data, but I was wondering if there is a
one-shot option.

On Wed, Apr 22, 2015 at 4:43 PM, Paolo Bonzini pbonz...@redhat.com wrote:
> On 22/04/2015 10:51, Catalin Vasile wrote:
> > If we want a mainstream userspace backend that could interact with a
> > lot of crypto engines, we could use OpenSSL (it can actually use
> > cryptodev and AF_ALG as engines).
> > For now, until mid June (my diploma project presentation) I still want
> > to use vhost as a backend for the sole purpose of having a finished
> > backend which now I have a good grasp upon. If the finished work would
> > be good enough work to be merged upstream will be talked later.
> > As a GSoC project, OpenSSL as a backend would continue the
> > virtio-crypto development, as it's not uncommon to have multiple types
> > of backends. The current work on virtio-crypto qemu and guest module
> > is pretty backend agnostic, and could allow future development (use of
> > other backends and other features).
>
> OpenSSL's license is not compatible with QEMU, hence the suggestion of
> using gnutls.
>
> Paolo
[PATCH 1/2] KVM: arm: irqfd: fix value returned by kvm_irq_map_gsi
From: Eric Auger eric.au...@linaro.org

irqfd/arm currently does not support routing. kvm_irq_map_gsi is
supposed to return all the routing entries associated with the
provided gsi and return the number of those entries. We should
return 0 at this point.

Signed-off-by: Eric Auger eric.au...@linaro.org
Acked-by: Christoffer Dall christoffer.d...@linaro.org
Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 virt/kvm/arm/vgic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index 8d550ff..7ed7873 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -2141,7 +2141,7 @@ int kvm_irq_map_gsi(struct kvm *kvm,
 		    struct kvm_kernel_irq_routing_entry *entries,
 		    int gsi)
 {
-	return gsi;
+	return 0;
 }

 int kvm_irq_map_chip_pin(struct kvm *kvm, unsigned irqchip, unsigned pin)
--
2.1.4
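Why the return value matters can be shown with a small userspace sketch of the contract stated in the commit message: the function fills in routing entries and returns how many it wrote, so a caller that iterates over that count would read uninitialised entries if the GSI number itself were returned. The types and the caller below are hypothetical and only loosely modelled on the kernel interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified routing entry. */
struct sketch_routing_entry {
	int gsi;
	int pin;
};

/*
 * No routing support: nothing is written to 'entries', so the number
 * of entries reported must be zero.
 */
static int sketch_irq_map_gsi(struct sketch_routing_entry *entries, int gsi)
{
	(void)entries;
	(void)gsi;
	return 0;
}

/*
 * Hypothetical caller that trusts the returned count. Walking n entries
 * is only safe if n is the number of entries actually filled in.
 */
static int sketch_count_matches(int gsi)
{
	struct sketch_routing_entry entries[4];
	int n = sketch_irq_map_gsi(entries, gsi);
	int matches = 0;

	for (int i = 0; i < n; i++)
		matches += (entries[i].gsi == gsi);
	return matches;
}
```

Had sketch_irq_map_gsi returned gsi (42, say), the loop above would have walked 42 entries of a 4-element, uninitialised array.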
[PATCH 0/2] KVM/ARM updates for v4.1, take 2
Paolo, Marcelo,

This is the second pull request for the KVM/ARM updates targeting
v4.1. Not much to see this time, just a couple of boring fixes.

Thanks,

	M.

The following changes since commit b79013b2449c23f1f505bdf39c5a6c330338b244:

  Merge tag 'staging-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging (2015-04-13 17:37:33 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git tags/kvm-arm-for-4.1-take2

for you to fetch changes up to fd1d0ddf2ae92fb3df42ed476939861806c5d785:

  KVM: arm/arm64: check IRQ number on userland injection (2015-04-22 15:42:24 +0100)

KVM/ARM changes for v4.1, take #2:

Rather small this time:

- a fix for a nasty bug with virtual IRQ injection
- a fix for irqfd

Andre Przywara (1):
      KVM: arm/arm64: check IRQ number on userland injection

Eric Auger (1):
      KVM: arm: irqfd: fix value returned by kvm_irq_map_gsi

 arch/arm/include/uapi/asm/kvm.h   | 8 +++-
 arch/arm/kvm/arm.c                | 3 +--
 arch/arm64/include/uapi/asm/kvm.h | 8 +++-
 virt/kvm/arm/vgic.c               | 5 -
 4 files changed, 19 insertions(+), 5 deletions(-)
Re: Non-exiting rdpmc on KVM guests?
On Tue, Apr 21, 2015 at 02:10:53PM -0700, Andy Lutomirski wrote:
> One question is whether we care if we leak unrelated counters to the
> guest. (We already leak them to unrelated user tasks, so this is
> hopefully not a big deal. OTOH, the API is different for guests.)

Good question indeed. I really do not know.

> Another question is whether it's even worth trying to optimize this.

I think I just ran into a bunch of people who think virt pmu stuff is
important, but we'll have to see if they follow up with the effort of
actually doing the work.

My only concern is that they'll not make a mess of things ;-)
Re: [PATCH v2 00/12] Remaining improvements for HV KVM
On Wed, Apr 15, 2015 at 10:16:41PM +0200, Alexander Graf wrote:
> On 14.04.15 13:56, Paul Mackerras wrote:
> > Did you forget to push it out or something? Your kvm-ppc-queue branch
> > is still at 4.0-rc1 as far as I can see.
>
> Oops, not sure how that happened. Does it show up correctly for you now?

Yes, it's fine now, thanks.

Paul.
Re: [PATCH V3 4/4] KVM: x86/vPMU: Enable PMU handling for AMD PERFCTRn and EVNTSELn MSRs
2015-04-18 02:23-0400, Wei Huang:
> This patch enables AMD guest VM to access (R/W) PMU related MSRs, which
> include PERFCTR[0..3] and EVNTSEL[0..3].
>
> Signed-off-by: Wei Huang w...@redhat.com
> ---

Reviewed-by: Radim Krčmář rkrc...@redhat.com

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> @@ -2268,27 +2268,17 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	 * which we perfectly emulate ;-). Any other value should be at least
>  	 * reported, some guests depend on them.

(This comment is a bit outdated now too.)

>  	 */
> -	case MSR_K7_EVNTSEL0:
> -	case MSR_K7_EVNTSEL1:
> -	case MSR_K7_EVNTSEL2:
> -	case MSR_K7_EVNTSEL3:
> -		if (data != 0)
> -			vcpu_unimpl(vcpu, "unimplemented perfctr wrmsr: "
> -				    "0x%x data 0x%llx\n", msr, data);
> -		break;
> -	/* at least RHEL 4 unconditionally writes to the perfctr registers,
> -	 * so we ignore writes to make it happy.
> -	 */
> @@ -2513,6 +2503,12 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
>  	case MSR_K7_EVNTSEL0:
>  	case MSR_K7_EVNTSEL1:
>  	case MSR_K7_EVNTSEL2:
| 	case MSR_K7_EVNTSEL3:
| 	case MSR_K7_PERFCTR0:
>  	case MSR_K7_PERFCTR1:
>  	case MSR_K7_PERFCTR2:
>  	case MSR_K7_PERFCTR3:

(As we depend on continuous ranges anyway, the GCCism comes to mind:
 'case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR3:')

>  	case MSR_P6_PERFCTR0:
>  	case MSR_P6_PERFCTR1:
>  	case MSR_P6_EVNTSEL0:
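The "GCCism" mentioned in the review is the case-range extension of GNU C, which is also accepted by Clang but is not part of ISO C. A minimal sketch of how the eight K7 labels could collapse into two ranges follows; the SKETCH_ constants mirror the K7 MSR numbers (C001_0000h to C001_0007h, with C001_0007h being the last performance counter as quoted elsewhere in this series) and sketch_is_k7_pmu_msr is a hypothetical helper, not a function from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed MSR numbers: four event selects followed by four counters. */
#define SKETCH_MSR_K7_EVNTSEL0 0xc0010000u
#define SKETCH_MSR_K7_EVNTSEL3 0xc0010003u
#define SKETCH_MSR_K7_PERFCTR0 0xc0010004u
#define SKETCH_MSR_K7_PERFCTR3 0xc0010007u

/* Hypothetical helper: true for the K7 PMU MSRs, using GNU case ranges. */
static bool sketch_is_k7_pmu_msr(uint32_t msr)
{
	switch (msr) {
	case SKETCH_MSR_K7_EVNTSEL0 ... SKETCH_MSR_K7_EVNTSEL3:
	case SKETCH_MSR_K7_PERFCTR0 ... SKETCH_MSR_K7_PERFCTR3:
		return true;
	default:
		return false;
	}
}
```

Note that the spaces around "..." matter when the bounds are integer literals, since "1...4" would be parsed as a malformed number.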
Re: [PATCH V3 0/4] KVM vPMU support for AMD CPUs
2015-04-18 02:23-0400, Wei Huang:
> Currently KVM only supports vPMU for Intel CPUs. This patchset enables
> KVM vPMU support for AMD platform by creating a common PMU interface for
> x86. By refactoring, PMU related MSR accesses from guest VMs are
> dispatched to corresponding functions defined in arch specific files.
>
> V3:
>   * Rebase the code to the latest of KVM tree (queue branch);
>   * Branch out the Intel specific code from pmu.c to pmu_intel.c, in
>     order to reflect the change history more accurately;
>   * Name the parameters/variables more consistently (use msr, idx,
>     pmc_idx) across files;
>   * Fix issues (whitespaces, macro names, ...) based on Radim's V2 comments;
>   * Fix the MSR_K7_PERFCTRn and MSR_K7_EVNTSELn access code (in patch 4);

I still wasn't happy about the API, especially naming, but didn't find
any bugs in functionality, also

Tested-by: Radim Krčmář rkrc...@redhat.com

I didn't give reviewed-by to all patches, but if we want it fast,
there's no problem in fixing some stuff later. (Though it usually
doesn't happen and we end up with bad code.)
Re: [PATCH V3 3/4] KVM: x86/vPMU: Implement AMD vPMU code for KVM
2015-04-18 02:23-0400, Wei Huang:
> This patch replaces the empty AMD vPMU functions (in pmu_amd.c) with real
> implementation.
>
> Signed-off-by: Wei Huang w...@redhat.com
> ---

Reviewed-by: Radim Krčmář rkrc...@redhat.com
Re: [PATCH V3 2/4] KVM: x86/vPMU: Create vPMU interface for VMX and SVM
2015-04-18 02:23-0400, Wei Huang:
> This patch splits existing vPMU code into a common vPMU interface (pmu.c)
> and Intel specific vPMU code (pmu_intel.c) using the following steps:
>
> - Part of architectural vPMU code is extracted and moved to pmu_intel.c
>   file. They are hooked up with the newly-defined intel_pmu_ops, which
>   will be called from pmu interface;
> - Create a dummy pmu_amd.c file for AMD SVM with empty functions;
>
> All architectural vPMU functions are now called via PMU function
> dispatcher (kvm_pmu_ops). This function dispatcher is initialized by
> calling kvm_x86_ops->get_pmu_ops() at the beginning. Also note that Intel
> and AMD modules are now generated by combining their corresponding arch
> files (vmx.c/svm.c) and pmu files (pmu_intel.c/pmu_amd.c).
>
> Signed-off-by: Wei Huang w...@redhat.com
> ---
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> @@ -19,83 +18,41 @@
> +/* NOTE:
> + * - Each perf counter is defined as struct kvm_pmc;
> + * - There are two types of perf counters: general purpose (gp) and fixed.
> + *   gp counters are stored in gp_counters[] and fixed counters are stored
> + *   in fixed_counters[] respectively. Both of them are part of struct
> + *   kvm_pmu;
> + * - pmu.c understands the difference between gp counters and fixed counters.
> + *   However AMD doesn't support fixed-counters;
> + * - There are three types of index to access perf counters (PMC):
> + *     1. MSR (named msr): For example Intel has MSR_IA32_PERFCTRn and AMD
> + *        has MSR_K7_PERFCTRn.
> + *     2. MSR Index (named idx):

Unless it's named msr :(

> +                                 This normally is used by RDPMC instruction.
> + *        For instance AMD RDPMC instruction uses _0003h in ECX to access
> + *        C001_0007h (MSR_K7_PERFCTR3). Intel has a similar mechanism, except
> + *        that it also supports fixed counters. idx can be used as index to
> + *        gp and fixed counters.
> + *     3. Global PMC Index (named pmc_idx): pmc_idx is an index specific to PMU
> + *        code. Each pmc_idx, stored in kvm_pmc.idx field, is unique across
> + *        all perf counters (both gp and fixed). The mapping relationship
> + *        between pmc_idx and perf counters is as the following:
> + *        * Intel: [0 .. INTEL_PMC_MAX_GENERIC-1] = gp counters
> + *                 [INTEL_PMC_IDX_FIXED .. INTEL_PMC_IDX_FIXED + 2] = fixed
> + *        * AMD:   [0 .. AMD64_NUM_COUNTERS-1] = gp counters
> + */

The declaration from [1/4] will hopefully help to show what I dislike:

  struct kvm_pmu_ops {
  	int (*check_msr)(struct kvm_vcpu *vcpu, unsigned msr);
  	struct kvm_pmc *(*msr_to_pmc)(struct kvm_vcpu *vcpu, unsigned idx);
  	bool (*is_pmu_msr)(struct kvm_vcpu *vcpu, u32 msr);
  	int (*get_msr)(struct kvm_vcpu *vcpu, u32 index, u64 *data);
  	int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
  };

This makes you wonder how to use those two similar checks and what
you gain by converting to pmc ...

There are actually two groups of meaning for msr
 1) check_msr, msr_to_pmc (msr = PMC identifier)
 2) is_pmu_msr, get_msr, set_msr (msr = MSR identifier)

And even after you know there are two meanings, only the position in
declaration really helps to distinguish them, which is far from what I'd
call good naming for API.
(I think that 'check_msr' goes well with 'get_msr' and 'set_msr', and
 wrappers just prepend 'kvm_pmu_'.)

Any different names (ideally not very similar) would work better.
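To make the naming concern concrete, here is a minimal sketch of a function-pointer dispatch table in which each callback name states which kind of identifier it takes. All names (sketch_pmu_ops, rdpmc_idx_to_pmc, and so on) are hypothetical and are not a proposal taken from the thread or from the patch series; the AMD ranges assume four counters at C001_0004h to C001_0007h preceded by four event selects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NUM_COUNTERS 4

struct sketch_pmc {
	uint64_t counter;
};

struct sketch_pmu {
	struct sketch_pmc gp[SKETCH_NUM_COUNTERS];
};

/* Hypothetical ops table: the names distinguish RDPMC indices from MSRs. */
struct sketch_pmu_ops {
	/* rdpmc_idx is the counter index a guest passes to RDPMC in ECX. */
	struct sketch_pmc *(*rdpmc_idx_to_pmc)(struct sketch_pmu *pmu,
					       unsigned int rdpmc_idx);
	/* msr is an MSR address as used by RDMSR/WRMSR. */
	bool (*is_pmu_msr)(uint32_t msr);
};

static struct sketch_pmc *amd_rdpmc_idx_to_pmc(struct sketch_pmu *pmu,
					       unsigned int rdpmc_idx)
{
	return rdpmc_idx < SKETCH_NUM_COUNTERS ? &pmu->gp[rdpmc_idx] : NULL;
}

static bool amd_is_pmu_msr(uint32_t msr)
{
	return msr >= 0xc0010000u && msr <= 0xc0010007u;
}

/* Vendor table selected once at init time, then used for every call. */
static const struct sketch_pmu_ops sketch_amd_pmu_ops = {
	.rdpmc_idx_to_pmc = amd_rdpmc_idx_to_pmc,
	.is_pmu_msr = amd_is_pmu_msr,
};
```

A common wrapper would then call, for example, sketch_amd_pmu_ops.is_pmu_msr(msr) without needing to know which vendor file implements it.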
Re: Non-exiting rdpmc on KVM guests?
On 21/04/2015 22:51, Peter Zijlstra wrote:
> > However, if you take into account that RDPMC can also be used
> > to read an inactive counter, and that multiple guests fight for the
> > same host counters, it's even harder to ensure that the guest counter
> > indices match those on the host.
>
> That doesn't make sense, only a single vcpu task will ever run at any
> one time.

Right, but it puts more pressure on the scheduler which could end up
going more often through the slow path.

Paolo
Re: [GIT PULL] First batch of KVM changes for 4.1
On Mon, Apr 20, 2015 at 01:27:58PM -0700, Andy Lutomirski wrote:
> On Mon, Apr 20, 2015 at 9:59 AM, Paolo Bonzini pbonz...@redhat.com wrote:
> > On 17/04/2015 22:18, Marcelo Tosatti wrote:
> > > The bug which this is fixing is very rare, have no memory of a report.
> > >
> > > In fact, it's even difficult to create a synthetic reproducer.
> >
> > But then why was the task migration notifier even in Jeremy's original
> > code for Xen? Was it supposed to work even on non-synchronized TSC?
> >
> > If that's the case, then it could be reverted indeed; but then why did
> > you commit this patch to 4.1? Did you think of something that would
> > cause the seqcount-like protocol to fail, and that turned out not to be
> > the case later? I was only following the mailing list sparsely in March.
>
> I don't think anyone ever tried that hard to test this stuff. There
> was an infinite loop that Firefox was triggering as a KVM guest
> somewhat reliably until a couple months ago in the same vdso code. :(

https://bugzilla.redhat.com/show_bug.cgi?id=1174664

--- Comment #5 from Juan Quintela quint...@redhat.com ---

Another round

# dmesg | grep msr
[0.00] kvm-clock: Using msrs 4b564d01 and 4b564d00
[0.00] kvm-clock: cpu 0, msr 1:1ffd8001, primary cpu clock
[0.00] kvm-stealtime: cpu 0, msr 11fc0d100
[0.041174] kvm-clock: cpu 1, msr 1:1ffd8041, secondary cpu clock
[0.053011] kvm-stealtime: cpu 1, msr 11fc8d100

After start:

[root@trasno yum.repos.d]# virsh qemu-monitor-command --hmp browser 'xp /8x 0x1ffd8000'
1ffd8000: 0x3b401060 0xfffc7f4b 0x3b42d040 0xfffc7f4b
1ffd8010: 0x3b42d460 0xfffc7f4b 0x3b42d4c0 0xfffc7f4b

[root@trasno yum.repos.d]# virsh qemu-monitor-command --hmp browser 'xp /8x 0x1ffd8040'
1ffd8040: 0x3b42d700 0xfffc7f4b 0x3b42d760 0xfffc7f4b
1ffd8050: 0x3b42d7c0 0xfffc7f4b 0x3b42d820 0xfffc7f4b

When firefox hangs

[root@trasno yum.repos.d]# virsh qemu-monitor-command --hmp browser 'xp /8x 0x1ffd8000'
1ffd8000: 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a
1ffd8010: 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a

[root@trasno yum.repos.d]# virsh qemu-monitor-command --hmp browser 'xp /8x 0x1ffd8040'
1ffd8040: 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a
1ffd8050: 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a
Re: [GIT PULL] First batch of KVM changes for 4.1
On Wed, Apr 22, 2015 at 11:01:49PM +0200, Paolo Bonzini wrote:
> On 22/04/2015 22:56, Marcelo Tosatti wrote:
> > > But then why was the task migration notifier even in Jeremy's original
> > > code for Xen?
> >
> > To cover for the vcpu1 -> vcpu2 -> vcpu1 case, i believe.
>
> Ok, to cover it for non-synchronized TSC. While KVM requires
> synchronized TSC.
>
> > > If that's the case, then it could be reverted indeed; but then why did
> > > you commit this patch to 4.1?
> >
> > Because it fixes the problem Andy reported (see Subject: KVM: x86: fix
> > kvmclock write race (v2) on kvm@). As long as you have Radim's
> > fix on top.
>
> But if it's so rare, and it was known that fixing the host protocol was
> just as good a solution, why was the guest fix committed?

I don't know. Should have fixed the host protocol.

> I'm just trying to understand. I am worried that this patch was rushed
> in; so far I had assumed it wasn't (a revert of a revert is rare enough
> that you don't do it lightly...) but maybe I was wrong.

Yes it was rushed in.

> Right now I cannot even decide whether to revert it (and please Peter in
> the process :)) or submit the Kconfig symbol patch officially.
>
> Paolo
Re: [GIT PULL] First batch of KVM changes for 4.1
On Mon, Apr 20, 2015 at 06:59:04PM +0200, Paolo Bonzini wrote:
> On 17/04/2015 22:18, Marcelo Tosatti wrote:
> > The bug which this is fixing is very rare, have no memory of a report.
> > In fact, it's even difficult to create a synthetic reproducer.
>
> But then why was the task migration notifier even in Jeremy's original
> code for Xen?

To cover for the vcpu1 -> vcpu2 -> vcpu1 case, I believe.

> Was it supposed to work even on non-synchronized TSC?

Yes it is supposed to work on non-synchronized TSC.

> If that's the case, then it could be reverted indeed; but then why did
> you commit this patch to 4.1?

Because it fixes the problem Andy reported (see Subject: KVM: x86: fix
kvmclock write race (v2) on kvm@). As long as you have Radim's fix on
top.

> Did you think of something that would cause the seqcount-like protocol
> to fail, and that turned out not to be the case later? I was only
> following the mailing list sparsely in March.

No.
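
[Editor's sketch] The "seqcount-like protocol" referred to in this thread is, as far as I understand it, the usual scheme where the writer makes a version counter odd before updating the shared time data and even again afterwards, and the reader retries until it sees the same even version before and after its read. The following is only a minimal user-space illustration under that assumption; `struct time_page`, `time_page_write` and `time_page_read` are hypothetical names and the layout is not the kernel's real time-info structure.

```c
#include <stdint.h>

/* Hypothetical, simplified shared page; the real structure has more fields. */
struct time_page {
	volatile uint32_t version;     /* odd while an update is in progress */
	volatile uint64_t system_time; /* payload protected by version */
};

/* Writer side: make version odd, update the payload, make it even again. */
static void time_page_write(struct time_page *p, uint64_t now)
{
	p->version++;                  /* odd: update in progress */
	__sync_synchronize();          /* GCC/Clang full memory barrier */
	p->system_time = now;
	__sync_synchronize();
	p->version++;                  /* even: update complete */
}

/* Reader side: retry until one stable, even version brackets the read. */
static uint64_t time_page_read(const struct time_page *p)
{
	uint32_t v;
	uint64_t t;

	do {
		v = p->version;
		__sync_synchronize();
		t = p->system_time;
		__sync_synchronize();
	} while ((v & 1) || v != p->version);

	return t;
}
```

A reader written this way loops for as long as the version stays odd or keeps changing, which is why the correctness of the writer side matters so much in the discussion above.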
Re: [GIT PULL] First batch of KVM changes for 4.1
On 22/04/2015 22:56, Marcelo Tosatti wrote:
> > But then why was the task migration notifier even in Jeremy's original
> > code for Xen?
>
> To cover for the vcpu1 -> vcpu2 -> vcpu1 case, I believe.

Ok, to cover it for non-synchronized TSC. While KVM requires
synchronized TSC.

> > If that's the case, then it could be reverted indeed; but then why did
> > you commit this patch to 4.1?
>
> Because it fixes the problem Andy reported (see Subject: KVM: x86: fix
> kvmclock write race (v2) on kvm@). As long as you have Radim's fix on
> top.

But if it's so rare, and it was known that fixing the host protocol was
just as good a solution, why was the guest fix committed?

I'm just trying to understand. I am worried that this patch was rushed
in; so far I had assumed it wasn't (a revert of a revert is rare enough
that you don't do it lightly...) but maybe I was wrong.

Right now I cannot even decide whether to revert it (and please Peter in
the process :)) or submit the Kconfig symbol patch officially.

Paolo
Re: [GSoC] project proposal
On Tue, Apr 21, 2015 at 04:07:56PM +0200, Paolo Bonzini wrote:
> On 21/04/2015 16:07, Catalin Vasile wrote:
> > I don't get the part with getting cryptodev upstream.
> > I don't know what getting cryptodev upstream actually implies.
> > From what I know cryptodev is done (is a functional project) that was
> > rejected in the Linux Kernel and there isn't actually a way to get it
> > upstream.
>
> Yes, I agree.

The limitations of AF_ALG need to be addressed somehow, so what is the
next step?

Stefan
Re: [GSoC] project proposal
On Tue, Apr 21, 2015 at 05:24:55PM +0300, Catalin Vasile wrote:
> Can you give me more details on GnuTLS?
> I'm going through some documentation and code and I see that it doesn't
> actually have separate encryption and authentication primitives.

gnutls is a natural choice because QEMU already uses it for TLS, but if
it doesn't support the primitives you need, then AF_ALG could be used
directly.

http://www.gnutls.org/manual/gnutls.html#Using-GnuTLS-as-a-cryptographic-library

Stefan
[RFC PATCH 3/3] kvm/powerpc: report guest steal time in host
On powerpc, kvm tracks both the guest steal time as well as the time
when guest was idle and this gets sent in to the guest through DTL. The
guest accounts these entries as either steal time or idle time based on
the last running task. Since the true guest idle status is not visible
to the host, we can't accurately report the guest steal time in the
host.

However, tracking the guest vcpu cede status can get us a reasonable
(within 5% variation) vcpu steal time since guest vcpus cede the
processor on entering the idle task. To do this, we introduce a new
field ceded_st in kvm_vcpu_arch structure to accurately track the guest
vcpu cede status (this is needed since the existing ceded field is
modified before we can use it). During DTL entry creation, we check this
flag and account the time as stolen if the guest vcpu had not ceded.

Tests show that the steal time being reported in the host with this
approach is around 5% higher than the steal time shown in guest. Please
suggest if there are ways to get more accurate steal time information in
the host.

Signed-off-by: Naveen N. Rao naveen.n@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/kvm_host.h     | 1 +
 arch/powerpc/kernel/asm-offsets.c       | 1 +
 arch/powerpc/kvm/book3s_hv.c            | 2 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 3 +++
 4 files changed, 7 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 8ef0512..7db48c4 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -655,6 +655,7 @@ struct kvm_vcpu_arch {
 	u64 busy_preempt;
 	u32 emul_inst;
+	u8 ceded_st;
 #endif
 };
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 4717859..765c7c4 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -521,6 +521,7 @@ int main(void)
 	DEFINE(VCPU_DEC_EXPIRES, offsetof(struct kvm_vcpu, arch.dec_expires));
 	DEFINE(VCPU_PENDING_EXC, offsetof(struct kvm_vcpu, arch.pending_exceptions));
 	DEFINE(VCPU_CEDED, offsetof(struct kvm_vcpu, arch.ceded));
+	DEFINE(VCPU_CEDED_ST, offsetof(struct kvm_vcpu, arch.ceded_st));
 	DEFINE(VCPU_PRODDED, offsetof(struct kvm_vcpu, arch.prodded));
 	DEFINE(VCPU_MMCR, offsetof(struct kvm_vcpu, arch.mmcr));
 	DEFINE(VCPU_PMC, offsetof(struct kvm_vcpu, arch.pmc));
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index de74756..ad7c0e3 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -545,6 +545,8 @@ static void kvmppc_create_dtl_entry(struct kvm_vcpu *vcpu,
 	spin_lock_irq(&vcpu->arch.tbacct_lock);
 	stolen += vcpu->arch.busy_stolen;
 	vcpu->arch.busy_stolen = 0;
+	if (!vcpu->arch.ceded_st && stolen)
+		(pid_task(vcpu->pid, PIDTYPE_PID))->gstime += stolen;
 	spin_unlock_irq(&vcpu->arch.tbacct_lock);
 	if (!dt || !vpa)
 		return;
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 6cbf163..28f304e 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -873,6 +873,7 @@ deliver_guest_interrupt:
 fast_guest_return:
 	li	r0,0
 	stb	r0,VCPU_CEDED(r4)	/* cancel cede */
+	stb	r0,VCPU_CEDED_ST(r4)	/* cancel cede */
 	mtspr	SPRN_HSRR0,r10
 	mtspr	SPRN_HSRR1,r11
@@ -1889,6 +1890,7 @@ _GLOBAL(kvmppc_h_cede)
 	std	r11,VCPU_MSR(r3)
 	li	r0,1
 	stb	r0,VCPU_CEDED(r3)
+	stb	r0,VCPU_CEDED_ST(r3)
 	sync			/* order setting ceded vs. testing prodded */
 	lbz	r5,VCPU_PRODDED(r3)
 	cmpwi	r5,0
@@ -2052,6 +2054,7 @@ kvm_cede_prodded:
 	stb	r0,VCPU_PRODDED(r3)
 	sync			/* order testing prodded vs. clearing ceded */
 	stb	r0,VCPU_CEDED(r3)
+	stb	r0,VCPU_CEDED_ST(r3)
 	li	r3,H_SUCCESS
 	blr
-- 
2.3.5
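
[Editor's sketch] The accounting rule described in this patch (count an interval as stolen only if the vcpu had not ceded) can be illustrated outside the kernel as follows. This is a simplified model, not the patch itself: `struct vcpu_acct` and `account_stolen` are hypothetical names, locking is omitted, and the counter lives in the same structure here, whereas the patch adds it to the task that backs the vcpu.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-vcpu state the patch touches. */
struct vcpu_acct {
	bool ceded_st;        /* set when the guest cedes, cleared on guest entry */
	uint64_t busy_stolen; /* time accumulated while the vcpu was preempted */
	uint64_t gstime;      /* steal time made visible to the host */
};

/*
 * Fold the pending preempted time into this interval's stolen time and
 * add it to the host-visible counter unless the vcpu was idle (ceded).
 * Returns the total stolen time for the interval.
 */
static uint64_t account_stolen(struct vcpu_acct *v, uint64_t stolen)
{
	stolen += v->busy_stolen;
	v->busy_stolen = 0;
	if (!v->ceded_st && stolen)
		v->gstime += stolen;
	return stolen;
}
```

With this model, time that passes while the guest is sitting in its idle task is never added to the host-side figure, which matches the intent stated in the commit message.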
[RFC PATCH 2/3] kvm/x86: report guest steal time in host
Report guest steal time in host task statistics. On x86, this is just
the scheduler run_delay.

Signed-off-by: Naveen N. Rao naveen.n@linux.vnet.ibm.com
---
 arch/x86/kvm/x86.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0ee725f..737b0e4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2094,6 +2094,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 	vcpu->arch.st.steal.steal += vcpu->arch.st.accum_steal;
 	vcpu->arch.st.steal.version += 2;
+	current->gstime += vcpu->arch.st.accum_steal;
 	vcpu->arch.st.accum_steal = 0;
 	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
-- 
2.3.5
[RFC PATCH 0/3] Report guest steal time in host
Steal time accounts the time duration during which a guest vcpu was
ready to run, but was not scheduled to run by the hypervisor. This is
particularly relevant in cloud environment where customers would want to
use this as an indicator that their guests are being throttled. However,
as it stands today, guest steal time information is not visible from the
hypervisor.

For cloud service providers, this is problematic since they would want
to overcommit cpu resources to achieve optimum resource utilization
while at the same time ensuring guests are not throttled. It is useful
for service providers to have access to the guest steal time data so
that they can base their overcommit/guest packing decisions on this.
Higher guest steal time can be used as a trigger to change how the
guests are scheduled, or even migrate guests out of a system.

This patchset attempts to make the guest steal times available in the
host. This is achieved by introducing a new field in per-task statistics
(/proc/pid/stat and /proc/pid/task/pid/stat) to accumulate per-vcpu
steal time. Programs (such as pidstat) can then be enhanced to report
this information on a per-thread basis [If there is a better place/way
to expose this, please let me know].

As an example, with pidstat on ppc64:

Guest steal time information using mpstat:
-
[root@rhel7-img ~]# mpstat -P ALL 1
Linux 3.19.0nnr (rhel7-img)  04/15/2015  _ppc64_  (4 CPU)

03:13:23 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:13:24 PM  all   12.25    0.00    1.25    0.00    1.00    2.25   13.75    0.00    0.00   69.50
03:13:24 PM    0   46.53    0.00    0.00    0.00    0.00    4.95   45.54    0.00    0.00    2.97
03:13:24 PM    1    0.00    0.00    0.00    0.00    0.00    4.04    3.03    0.00    0.00   92.93
03:13:24 PM    2    0.00    0.00    0.00    0.00    3.96    0.99    2.97    0.00    0.00   92.08
03:13:24 PM    3    3.00    0.00    4.00    0.00    0.00    0.00    4.00    0.00    0.00   89.00

03:13:24 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:13:25 PM  all   12.59    0.00    0.00    0.00    0.00    0.25   12.35    0.00    0.00   74.81
03:13:25 PM    0   50.00    0.00    0.00    0.00    0.00    0.98   49.02    0.00    0.00    0.00
03:13:25 PM    1    0.98    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.02
03:13:25 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:13:25 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

03:13:25 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:13:26 PM  all   12.99    0.00    0.00    0.00    0.25    0.00   12.75    0.00    0.00   74.02
03:13:26 PM    0   51.96    0.00    0.00    0.00    0.00    0.00   48.04    0.00    0.00    0.00
03:13:26 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:13:26 PM    2    0.00    0.00    0.00    0.00    0.98    0.00    2.94    0.00    0.00   96.08
03:13:26 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

03:13:26 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:13:27 PM  all   12.53    0.00    1.00    0.25    0.00    0.25   12.03    0.00    0.00   73.93
03:13:27 PM    0   51.02    0.00    0.00    0.00    0.00    0.00   48.98    0.00    0.00    0.00
03:13:27 PM    1    0.00    0.00    4.04    0.00    0.00    0.00    0.00    0.00    0.00   95.96
03:13:27 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:13:27 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   12.91    0.00    0.54    0.01    0.04    0.12   12.39    0.00    0.00   74.00
Average:       0   51.36    0.00    0.03    0.00    0.03    0.26   48.27    0.00    0.00    0.05
Average:       1    0.02    0.00    1.54    0.02    0.02    0.15    0.36    0.00    0.00   97.89
Average:       2    0.00    0.00    0.52    0.00    0.09    0.02    0.36    0.00    0.00   99.02
Average:       3    0.05    0.00    0.07    0.00    0.02    0.09    0.34    0.00    0.00   99.43

Steal time information in host using (locally modified) pidstat:
---
[naveen@xx sysstat]$ ./pidstat -C qemu -tIu 1
Linux 3.19.0nnr (xx.in.ibm.com)  04/15/2015  _ppc64_  (64 CPU)

04:43:20 AM   UID      TGID       TID    %usr %system  %guest    %CPU  %steal   CPU  Command
04:43:22 AM  1008      3001         -    0.00    0.00   54.21    3.39   45.79    12  qemu-system-ppc
04:43:22 AM  1008         -      3005    0.00    0.00   54.21    3.39    0.00    12  |__qemu-system-ppc
04:43:22 AM   UID
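
[Editor's sketch] A percentage like the %steal column above could be derived from two samples of the new per-task counter. The helper below is a hedged illustration only: it assumes the counter is exported in clock ticks (patch 1/3 passes it through cputime_to_clock_t) and that the caller supplies the ticks-per-second value, for example from sysconf(_SC_CLK_TCK). `steal_percent` is a hypothetical name, and this is not necessarily how the locally modified pidstat computes its figure.

```c
/*
 * Percentage of a sampling interval that was accounted as steal time,
 * given two readings of a tick counter. Returns 0.0 for invalid input
 * or if the counter appears to have gone backwards.
 */
static double steal_percent(unsigned long long prev_ticks,
			    unsigned long long cur_ticks,
			    double interval_sec, long ticks_per_sec)
{
	if (interval_sec <= 0.0 || ticks_per_sec <= 0 || cur_ticks < prev_ticks)
		return 0.0;

	return 100.0 * (double)(cur_ticks - prev_ticks) /
	       (interval_sec * (double)ticks_per_sec);
}
```

For instance, a counter that advances by 50 ticks over a one-second interval at 100 ticks per second corresponds to 50% steal for that thread.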
[RFC PATCH 1/3] procfs: add guest steal time in /proc/pid/stat
Introduce a field in /proc/pid/stat to expose guest steal time.

Signed-off-by: Naveen N. Rao naveen.n@linux.vnet.ibm.com
---
 fs/proc/array.c       | 6 ++
 include/linux/sched.h | 7 +++
 kernel/fork.c         | 2 +-
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 1295a00..d86f00e 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -363,6 +363,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	unsigned long rsslim = 0;
 	char tcomm[sizeof(task->comm)];
 	unsigned long flags;
+	cputime_t gstime;
 	state = *get_task_state(task);
 	vsize = eip = esp = 0;
@@ -382,6 +383,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	sigemptyset(&sigcatch);
 	cutime = cstime = utime = stime = 0;
 	cgtime = gtime = 0;
+	gstime = 0;
 	if (lock_task_sighand(task, &flags)) {
 		struct signal_struct *sig = task->signal;
@@ -410,6 +412,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 				min_flt += t->min_flt;
 				maj_flt += t->maj_flt;
 				gtime += task_gtime(t);
+				gstime += task_gstime(t);
 			} while_each_thread(task, t);
 			min_flt += sig->min_flt;
@@ -432,6 +435,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 		maj_flt = task->maj_flt;
 		task_cputime_adjusted(task, &utime, &stime);
 		gtime = task_gtime(task);
+		gstime = task_gstime(task);
 	}
 	/* scale priority and nice values from timeslices to -20..20 */
@@ -505,6 +509,8 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	else
 		seq_put_decimal_ll(m, ' ', 0);
+	seq_put_decimal_ull(m, ' ', cputime_to_clock_t(gstime));
+
 	seq_putc(m, '\n');
 	if (mm)
 		mmput(mm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0eabab9..cb57954 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1429,6 +1429,7 @@ struct task_struct {
 	cputime_t utime, stime, utimescaled, stimescaled;
 	cputime_t gtime;
+	cputime_t gstime;
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	struct cputime prev_cputime;
 #endif
@@ -1955,6 +1956,12 @@ static inline cputime_t task_gtime(struct task_struct *t)
 	return t->gtime;
 }
 #endif
+
+static inline cputime_t task_gstime(struct task_struct *t)
+{
+	return t->gstime;
+}
+
 extern void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st);
 extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st);
diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139..529ebe5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1293,7 +1293,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	init_sigpending(&p->pending);
-	p->utime = p->stime = p->gtime = 0;
+	p->utime = p->stime = p->gtime = p->gstime = 0;
 	p->utimescaled = p->stimescaled = 0;
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	p->prev_cputime.utime = p->prev_cputime.stime = 0;
-- 
2.3.5
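
[Editor's sketch] Since this patch appends the new value as the last field of the stat line, a user-space consumer on a kernel carrying exactly this RFC could pick it up by parsing from the end of the line (parsing from the front is fragile because the command name in parentheses may contain spaces). The helper below is a minimal sketch under that assumption; `stat_last_field` is a hypothetical name, and on kernels where other fields follow, "last field" would no longer be the right rule.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse the last space-separated field of a /proc/<pid>/stat style line
 * as an unsigned decimal number. Returns 0 on success, -1 on failure.
 */
static int stat_last_field(const char *line, unsigned long long *out)
{
	const char *end = line + strlen(line);
	const char *start;
	char *stop;

	/* Skip the trailing newline and any trailing spaces. */
	while (end > line && (end[-1] == '\n' || end[-1] == ' '))
		end--;

	/* Walk back to the start of the last field. */
	start = end;
	while (start > line && start[-1] != ' ')
		start--;

	if (start == end || *start < '0' || *start > '9')
		return -1;

	errno = 0;
	*out = strtoull(start, &stop, 10);
	if (errno != 0 || stop != end)
		return -1;

	return 0;
}
```

The value obtained this way would be in clock ticks, given the cputime_to_clock_t conversion in the hunk above.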
Re: [PATCH] KVM: x86: tweak types of fields in kvm_lapic_irq
On 22/04/2015 11:35, Radim Krčmář wrote:
> > Change the level field to bool, since we assign 1 sometimes, but
> > just mask icr_low with APIC_INT_ASSERT in apic_send_ipi.
>
> Would be more consistent to change that assignment instead ...
> If we dropped the idea that struct kvm_lapic_irq fields can be bitORed
> to get the ICR, we could also easily change trig_mode/dest_mode to bool
> level_trig/logical_dest. (I can do a followup patch.)

Right, I thought of both. However, level is something that has an
obviously understandable meaning as a bool, while trig_mode/dest_mode as
you said have to be renamed as well.

You're right on the u8 type for vector, too. But I probably will end up
not committing this patch at all...

Paolo
Re: [PATCH] KVM: x86: tweak types of fields in kvm_lapic_irq
2015-04-21 19:01+0200, Paolo Bonzini:
> Change to u16 if they only contain data in the low 16 bits.
>
> Change the level field to bool, since we assign 1 sometimes, but
> just mask icr_low with APIC_INT_ASSERT in apic_send_ipi.

Would be more consistent to change that assignment instead ...
If we dropped the idea that struct kvm_lapic_irq fields can be bitORed
to get the ICR, we could also easily change trig_mode/dest_mode to bool
level_trig/logical_dest. (I can do a followup patch.)

> Signed-off-by: Paolo Bonzini pbonz...@redhat.com
> ---
>  arch/x86/include/asm/kvm_host.h | 8
>  arch/x86/kvm/lapic.c            | 2 +-
>  2 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 3a19e30f0be0..dc83b43d0850 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -689,10 +689,10 @@ struct msr_data {
>  struct kvm_lapic_irq {
>  	u32 vector;

Vector can be u8.

> -	u32 delivery_mode;
> -	u32 dest_mode;
> -	u32 level;
> -	u32 trig_mode;
> +	u16 delivery_mode;
> +	u16 dest_mode;
> +	bool level;
> +	u16 trig_mode;

I'd prefer to have the u8 vector as well, but it works,

Reviewed-by: Radim Krčmář rkrc...@redhat.com

>  	u32 shorthand;
>  	u32 dest_id;
>  };
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index abf165330881..ba585d0c42c5 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -914,7 +914,7 @@ static void apic_send_ipi(struct kvm_lapic *apic)
>  	irq.vector = icr_low & APIC_VECTOR_MASK;
>  	irq.delivery_mode = icr_low & APIC_MODE_MASK;
>  	irq.dest_mode = icr_low & APIC_DEST_MASK;
> -	irq.level = icr_low & APIC_INT_ASSERT;
> +	irq.level = (icr_low & APIC_INT_ASSERT) != 0;
>  	irq.trig_mode = icr_low & APIC_INT_LEVELTRIG;
>  	irq.shorthand = icr_low & APIC_SHORT_MASK;
>  	if (apic_x2apic_mode(apic))
> --
> 1.8.3.1
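
[Editor's sketch] The effect of narrowing these fields can be tried out in user space. The snippet below is an illustration, not the kernel code: the mask values are written from memory of the x86 APIC headers and should be treated as assumptions, and `struct lapic_irq_sketch` and `decode_icr_low` are hypothetical names. It shows that the masked values still fit in 16 bits, and that the level bit is turned into a plain true/false value instead of keeping the raw mask bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* ICR bit masks as recalled from the x86 APIC headers; treat as assumptions. */
#define APIC_VECTOR_MASK   0x000ffu
#define APIC_MODE_MASK     0x00700u
#define APIC_DEST_MASK     0x00800u
#define APIC_INT_ASSERT    0x04000u
#define APIC_INT_LEVELTRIG 0x08000u

/* Hypothetical cut-down version of the structure after the patch. */
struct lapic_irq_sketch {
	uint32_t vector;
	uint16_t delivery_mode;
	uint16_t dest_mode;
	bool level;
	uint16_t trig_mode;
};

/* Split the low ICR word into fields, keeping the mask bits in place. */
static struct lapic_irq_sketch decode_icr_low(uint32_t icr_low)
{
	struct lapic_irq_sketch irq;

	irq.vector = icr_low & APIC_VECTOR_MASK;
	irq.delivery_mode = (uint16_t)(icr_low & APIC_MODE_MASK);
	irq.dest_mode = (uint16_t)(icr_low & APIC_DEST_MASK);
	irq.level = (icr_low & APIC_INT_ASSERT) != 0; /* 0 or 1, not 0x4000 */
	irq.trig_mode = (uint16_t)(icr_low & APIC_INT_LEVELTRIG);
	return irq;
}
```

Note that an 8-bit integer would not be a safe type for the raw masked level value, since 0x4000 truncates to 0 in a uint8_t; a bool (or an explicit comparison against zero) avoids that pitfall.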
Re: [PATCH v2 06/10] KVM: arm64: guest debug, add SW break point support
Zhichao Huang zhichao.hu...@linaro.org writes:

> On Tue, Mar 31, 2015 at 04:08:04PM +0100, Alex Bennée wrote:
> > This adds support for SW breakpoints inserted by userspace.
> >
> > We do this by trapping all BKPT exceptions in the hypervisor
> > (MDCR_EL2_TDE).
>
> why should we trap all debug exceptions?
>
> The trap for cp14 register r/w seems enough to record relevant
> information to context switch the dbg register while necessary.

Let's think about this case when the SW breakpoint exception occurs:

If KVM doesn't trap it and pass it back to userspace to handle, it would
have to deliver it to the guest. The guest, not having inserted the
breakpoint in the first place, would get very confused. So what we
actually do is re-route the exception to the hypervisor, stop the VM and
return to userspace with the debug information.

Once in QEMU we check to see if the SW breakpoint was one of the ones we
inserted, at which point control is passed back to the host GDB
(attached via the GDB stub in QEMU). If it is not a breakpoint which was
set up by the host then it must be one for the guest, at which point we
need to ensure the exception is delivered to the guest for it to
process.

-- 
Alex Bennée
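
[Editor's sketch] The userspace decision described above (hand the exception to the host debugger if the breakpoint is one that was inserted from the host side, otherwise deliver it to the guest) boils down to a lookup. The code below is only an illustration of that decision under simplifying assumptions; `enum bp_action` and `route_sw_breakpoint` are hypothetical names and do not correspond to QEMU or KVM functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Where a trapped software breakpoint exception should go. */
enum bp_action {
	BP_TO_HOST_DEBUGGER, /* breakpoint was inserted from the host side */
	BP_TO_GUEST          /* breakpoint belongs to the guest itself */
};

/*
 * Decide how to route a software breakpoint hit at address pc, given
 * the list of breakpoint addresses inserted by the host debugger.
 */
static enum bp_action route_sw_breakpoint(const uint64_t *host_bps,
					  size_t n, uint64_t pc)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (host_bps[i] == pc)
			return BP_TO_HOST_DEBUGGER;
	}
	return BP_TO_GUEST;
}
```

A real implementation would also need to inject the exception into the guest in the second case, which is outside the scope of this sketch.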
Re: [PATCH v4 8/8] macvtap/tun: add VNET_BE flag
On Tue, 21 Apr 2015 20:30:23 +0200 Michael S. Tsirkin m...@redhat.com wrote: On Tue, Apr 21, 2015 at 06:22:20PM +0200, Greg Kurz wrote: On Tue, 21 Apr 2015 16:06:33 +0200 Michael S. Tsirkin m...@redhat.com wrote: On Fri, Apr 10, 2015 at 12:20:21PM +0200, Greg Kurz wrote: The VNET_LE flag was introduced to fix accesses to virtio 1.0 headers that are always little-endian. It can also be used to handle the special case of a legacy little-endian device implemented by a big-endian host. Let's add a flag and ioctls for big-endian devices as well. If both flags are set, little-endian wins. Since this is isn't a common usecase, the feature is controlled by a kernel config option (not set by default). Both macvtap and tun are covered by this patch since they share the same API with userland. Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com --- drivers/net/Kconfig | 12 drivers/net/macvtap.c | 60 +- drivers/net/tun.c | 62 ++- include/uapi/linux/if_tun.h |2 + 4 files changed, 134 insertions(+), 2 deletions(-) diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index df51d60..f0e23a0 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -244,6 +244,18 @@ config TUN If you don't know what to use this for, you don't need it. +config TUN_VNET_BE + bool Support for big-endian vnet headers + default n + ---help--- + This option allows TUN/TAP and MACVTAP device drivers to parse + vnet headers that are in big-endian byte order. It is useful + when the headers come from a big-endian legacy virtio driver and + the host is little-endian. + + Unless you have a little-endian system hosting a big-endian virtual + machine with a virtio NIC, you should say N. + should mention cross-endian, not big-endian, right? The current TUN_VNET_LE related code is already doing cross-endian: without this patch, one can already run a LE guest on a BE host... wouldn't it be confusing to mention cross-endian only when the guest is BE ? 
Hmm I think no - LE is also useful for virtio 1 - this is what it was intended for after all.

What about having a completely distinct implementation for cross-endian that doesn't reuse the existing code and defines then ?

I think implementation and interface are fine, just the documentation can be improved a bit. How about: Support for cross-endian vnet headers on little-endian kernels.

Accordingly CONFIG_TUN_VNET_CROSS_LE ?

Sure. And what about also renaming the ioctl to TUNSETVNETCROSSLE then ?

--
Greg
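The precedence rule from the patch description ("If both flags are set, little-endian wins") can be illustrated with a short sketch. It is not the tun/macvtap code: the flag values and function names below are assumptions, and the fallback to host byte order when neither flag is set is a simplification of the legacy case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; the real values live in the driver. */
#define VNET_LE 0x1u
#define VNET_BE 0x2u

/* Detect host byte order at run time by looking at the first byte of 1. */
static bool host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const uint8_t *)&probe == 1;
}

/* Byte order of the vnet header fields for a given set of flags. */
bool vnet_hdr_is_little_endian(unsigned int flags)
{
    if (flags & VNET_LE)
        return true;                 /* LE wins, even if BE is also set */
    if (flags & VNET_BE)
        return false;
    return host_is_little_endian();  /* assumption: no flag means host order */
}

/* Convert a 16-bit header field from vnet byte order to host byte order. */
uint16_t vnet16_to_cpu(unsigned int flags, uint16_t wire)
{
    bool swap = vnet_hdr_is_little_endian(flags) != host_is_little_endian();
    return swap ? (uint16_t)((wire >> 8) | (wire << 8)) : wire;
}
```

On any host, exactly one of VNET_LE and VNET_BE forces a byte swap, which is the cross-endian situation the Kconfig help text is trying to describe.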
Re: [RFC PATCH 0/3] Report guest steal time in host
On 22.04.2015 at 12:24, Naveen N. Rao wrote:

Steal time accounts for the time during which a guest vcpu was ready to run, but was not scheduled to run by the hypervisor. This is particularly relevant in cloud environments, where customers would want to use this as an indicator that their guests are being throttled. However, as it stands today, guest steal time information is not visible from the hypervisor. For cloud service providers, this is problematic since they would want to overcommit cpu resources to achieve optimum resource utilization while at the same time ensuring guests are not throttled.

It is useful for service providers to have access to the guest steal time data so that they can base their overcommit/guest packing decisions on this. Higher guest steal time can be used as a trigger to change how the guests are scheduled, or even to migrate guests out of a system.

This patchset attempts to make the guest steal times available in the host. This is achieved by introducing a new field in per-task statistics (/proc/pid/stat and /proc/pid/task/pid/stat) to accumulate per-vcpu steal time. Programs (such as pidstat) can then be enhanced to report this information on a per-thread basis [If there is a better place/way to expose this, please let me know].
As an example, with pidstat on ppc64:

Guest steal time information using mpstat:
------------------------------------------
[root@rhel7-img ~]# mpstat -P ALL 1
Linux 3.19.0nnr (rhel7-img)  04/15/2015  _ppc64_  (4 CPU)

03:13:23 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:13:24 PM  all   12.25    0.00    1.25    0.00    1.00    2.25   13.75    0.00    0.00   69.50
03:13:24 PM    0   46.53    0.00    0.00    0.00    0.00    4.95   45.54    0.00    0.00    2.97
03:13:24 PM    1    0.00    0.00    0.00    0.00    0.00    4.04    3.03    0.00    0.00   92.93
03:13:24 PM    2    0.00    0.00    0.00    0.00    3.96    0.99    2.97    0.00    0.00   92.08
03:13:24 PM    3    3.00    0.00    4.00    0.00    0.00    0.00    4.00    0.00    0.00   89.00

03:13:24 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:13:25 PM  all   12.59    0.00    0.00    0.00    0.00    0.25   12.35    0.00    0.00   74.81
03:13:25 PM    0   50.00    0.00    0.00    0.00    0.00    0.98   49.02    0.00    0.00    0.00
03:13:25 PM    1    0.98    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.02
03:13:25 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:13:25 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

03:13:25 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:13:26 PM  all   12.99    0.00    0.00    0.00    0.25    0.00   12.75    0.00    0.00   74.02
03:13:26 PM    0   51.96    0.00    0.00    0.00    0.00    0.00   48.04    0.00    0.00    0.00
03:13:26 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:13:26 PM    2    0.00    0.00    0.00    0.00    0.98    0.00    2.94    0.00    0.00   96.08
03:13:26 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

03:13:26 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:13:27 PM  all   12.53    0.00    1.00    0.25    0.00    0.25   12.03    0.00    0.00   73.93
03:13:27 PM    0   51.02    0.00    0.00    0.00    0.00    0.00   48.98    0.00    0.00    0.00
03:13:27 PM    1    0.00    0.00    4.04    0.00    0.00    0.00    0.00    0.00    0.00   95.96
03:13:27 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:13:27 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   12.91    0.00    0.54    0.01    0.04    0.12   12.39    0.00    0.00   74.00
Average:       0   51.36    0.00    0.03    0.00    0.03    0.26   48.27    0.00    0.00    0.05
Average:       1    0.02    0.00    1.54    0.02    0.02    0.15    0.36    0.00    0.00   97.89
Average:       2    0.00    0.00    0.52    0.00    0.09    0.02    0.36    0.00    0.00   99.02
Average:       3    0.05    0.00    0.07    0.00    0.02    0.09    0.34    0.00    0.00   99.43

Steal time information in host using (locally modified) pidstat:
----------------------------------------------------------------
[naveen@xx sysstat]$ ./pidstat -C qemu -tIu 1
Linux 3.19.0nnr (xx.in.ibm.com)  04/15/2015  _ppc64_  (64 CPU)

04:43:20 AM   UID      TGID       TID    %usr %system  %guest    %CPU  %steal   CPU  Command
04:43:22 AM  1008      3001         -    0.00    0.00   54.21    3.39   45.79
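For readers unfamiliar with where the guest-side %steal column comes from: inside the guest it can be derived from the steal counter on the aggregate "cpu" line of /proc/stat, sampled twice. The sketch below shows that arithmetic only. The struct and function names are made up for illustration, and it does not model the new per-task field proposed by this patchset, whose position in /proc/pid/stat is not given in the cover letter.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Counters from the aggregate "cpu" line of /proc/stat, in clock ticks:
 * user, nice, system, idle, iowait, irq, softirq, steal.
 * The trailing guest and guest_nice fields are ignored because guest
 * time is already included in user and nice. */
struct cpu_sample {
    unsigned long long v[8];
};

enum { STEAL_IDX = 7 };

/* Parse the aggregate "cpu ..." line; per-CPU lines ("cpu0 ...") are rejected. */
int parse_cpu_line(const char *line, struct cpu_sample *out)
{
    assert(line != NULL && out != NULL);
    if (strncmp(line, "cpu ", 4) != 0)
        return -1;
    int n = sscanf(line + 4, "%llu %llu %llu %llu %llu %llu %llu %llu",
                   &out->v[0], &out->v[1], &out->v[2], &out->v[3],
                   &out->v[4], &out->v[5], &out->v[6], &out->v[7]);
    return n == 8 ? 0 : -1;
}

/* Share of the interval between two samples that was stolen, in percent. */
double steal_percent(const struct cpu_sample *before,
                     const struct cpu_sample *after)
{
    unsigned long long total = 0;
    for (int i = 0; i < 8; i++)
        total += after->v[i] - before->v[i];
    if (total == 0)
        return 0.0;
    return 100.0 * (double)(after->v[STEAL_IDX] - before->v[STEAL_IDX])
           / (double)total;
}
```

A monitoring loop would read /proc/stat once per interval, keep the previous sample, and report steal_percent(&previous, &current).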
Re: [RFC PATCH 0/3] Report guest steal time in host
On 2015/04/22 01:05PM, Christian Borntraeger wrote:
On 22.04.2015 at 12:24, Naveen N. Rao wrote:

Steal time accounts for the time during which a guest vcpu was ready to run, but was not scheduled to run by the hypervisor. This is particularly relevant in cloud environments, where customers would want to use this as an indicator that their guests are being throttled. However, as it stands today, guest steal time information is not visible from the hypervisor. For cloud service providers, this is problematic since they would want to overcommit cpu resources to achieve optimum resource utilization while at the same time ensuring guests are not throttled.

It is useful for service providers to have access to the guest steal time data so that they can base their overcommit/guest packing decisions on this. Higher guest steal time can be used as a trigger to change how the guests are scheduled, or even to migrate guests out of a system.

This patchset attempts to make the guest steal times available in the host. This is achieved by introducing a new field in per-task statistics (/proc/pid/stat and /proc/pid/task/pid/stat) to accumulate per-vcpu steal time. Programs (such as pidstat) can then be enhanced to report this information on a per-thread basis [If there is a better place/way to expose this, please let me know].
As an example, with pidstat on ppc64:

Guest steal time information using mpstat:
------------------------------------------
[root@rhel7-img ~]# mpstat -P ALL 1
Linux 3.19.0nnr (rhel7-img)  04/15/2015  _ppc64_  (4 CPU)

03:13:23 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:13:24 PM  all   12.25    0.00    1.25    0.00    1.00    2.25   13.75    0.00    0.00   69.50
03:13:24 PM    0   46.53    0.00    0.00    0.00    0.00    4.95   45.54    0.00    0.00    2.97
03:13:24 PM    1    0.00    0.00    0.00    0.00    0.00    4.04    3.03    0.00    0.00   92.93
03:13:24 PM    2    0.00    0.00    0.00    0.00    3.96    0.99    2.97    0.00    0.00   92.08
03:13:24 PM    3    3.00    0.00    4.00    0.00    0.00    0.00    4.00    0.00    0.00   89.00

03:13:24 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:13:25 PM  all   12.59    0.00    0.00    0.00    0.00    0.25   12.35    0.00    0.00   74.81
03:13:25 PM    0   50.00    0.00    0.00    0.00    0.00    0.98   49.02    0.00    0.00    0.00
03:13:25 PM    1    0.98    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.02
03:13:25 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:13:25 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

03:13:25 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:13:26 PM  all   12.99    0.00    0.00    0.00    0.25    0.00   12.75    0.00    0.00   74.02
03:13:26 PM    0   51.96    0.00    0.00    0.00    0.00    0.00   48.04    0.00    0.00    0.00
03:13:26 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:13:26 PM    2    0.00    0.00    0.00    0.00    0.98    0.00    2.94    0.00    0.00   96.08
03:13:26 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

03:13:26 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:13:27 PM  all   12.53    0.00    1.00    0.25    0.00    0.25   12.03    0.00    0.00   73.93
03:13:27 PM    0   51.02    0.00    0.00    0.00    0.00    0.00   48.98    0.00    0.00    0.00
03:13:27 PM    1    0.00    0.00    4.04    0.00    0.00    0.00    0.00    0.00    0.00   95.96
03:13:27 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:13:27 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   12.91    0.00    0.54    0.01    0.04    0.12   12.39    0.00    0.00   74.00
Average:       0   51.36    0.00    0.03    0.00    0.03    0.26   48.27    0.00    0.00    0.05
Average:       1    0.02    0.00    1.54    0.02    0.02    0.15    0.36    0.00    0.00   97.89
Average:       2    0.00    0.00    0.52    0.00    0.09    0.02    0.36    0.00    0.00   99.02
Average:       3    0.05    0.00    0.07    0.00    0.02    0.09    0.34    0.00    0.00   99.43

Steal time information in host using (locally modified) pidstat:
----------------------------------------------------------------
[naveen@xx sysstat]$ ./pidstat -C qemu -tIu 1
Linux 3.19.0nnr (xx.in.ibm.com)  04/15/2015  _ppc64_  (64 CPU)

04:43:20
Re: [PATCH v4 7/8] vhost: feature to set the vring endianness
On Wed, Apr 22, 2015 at 11:08:54AM +0200, Greg Kurz wrote:
On Tue, 21 Apr 2015 20:25:03 +0200 Michael S. Tsirkin m...@redhat.com wrote:

[ ... ]

@@ -630,6 +634,53 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
         return 0;
 }

+#ifdef CONFIG_VHOST_SET_ENDIAN_LEGACY
+static long vhost_set_vring_big_endian(struct vhost_virtqueue *vq,
+                                       int __user *argp)
+{
+        struct vhost_vring_state s;
+
+        if (vq->private_data)
+                return -EBUSY;
+
+        if (copy_from_user(&s, argp, sizeof(s)))
+                return -EFAULT;
+
+        if (s.num && s.num != 1)

s.num & ~0x1

Since s.num is unsigned and I assume this won't change, what about s.num > 1 as suggested by Cornelia ?

I just tried and gcc optimizes s.num != 0 && s.num != 1 to s.num > 1. The former will be more readable once we replace 0 and 1 with defines. So ignore my advice, keep code as is but use defines.

Ok.

[ ... ]

--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -103,6 +103,15 @@ struct vhost_memory {
 /* Get accessor: reads index, writes value in num */
 #define VHOST_GET_VRING_BASE _IOWR(VHOST_VIRTIO, 0x12, struct vhost_vring_state)

+/* Set the vring byte order in num. This is a legacy only API that is simply
+ * ignored when VIRTIO_F_VERSION_1 is set.
+ * 0 to set to little-endian
+ * 1 to set to big-endian

How about defines for these?

Ok. I'll put the defines here so that all the cross-endian stuff lies in the same hunk. Is it ok for you ?

Fine.

+ * other values return EINVAL.

Pls also add a note saying that not all kernel configurations support this ioctl, but all configurations that support SET also support GET.

Ok.

+ */
+#define VHOST_SET_VRING_BIG_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
+#define VHOST_GET_VRING_BIG_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
+
 /* The following ioctls use eventfd file descriptors to signal and poll
  * for events. */

I'm inclined to think VHOST_SET_VRING_ENDIAN is a slightly better name. What do you think?

Or VHOST_SET_VRING_CROSS_ENDIAN ?
I like the idea to keep a hint that this API is for cross-endian only... like the rest of this series.

--
Greg

I think VHOST_SET_VRING_CROSS_ENDIAN is not a good name - it would imply 1 for cross endian, 0 for native endian.

--
MST
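The point about the range check in this thread can be made concrete with a sketch. The define names below are placeholders (the thread had not settled on names), and the helpers are plain userspace C rather than the vhost code. They show the readable form written with defines next to the compact form that relies on the value being unsigned.

```c
#include <assert.h>
#include <errno.h>

/* Placeholder names for the two legal values of the byte-order argument. */
#define VRING_LITTLE_ENDIAN 0u
#define VRING_BIG_ENDIAN    1u

/* Readable form: spell out both legal values. */
int check_vring_endian(unsigned int num)
{
    if (num != VRING_LITTLE_ENDIAN && num != VRING_BIG_ENDIAN)
        return -EINVAL;
    return 0;
}

/* Compact form: equivalent only because num is unsigned and the legal
 * values are exactly 0 and 1. */
int check_vring_endian_compact(unsigned int num)
{
    return num > 1 ? -EINVAL : 0;
}
```

Both functions agree for every unsigned input, so the choice between them is about readability; the explicit form keeps working as documentation if the defines are ever renamed.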
Re: [GSoC] project proposal
On Wed, Apr 22, 2015 at 11:20 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
On Tue, Apr 21, 2015 at 04:07:56PM +0200, Paolo Bonzini wrote:
On 21/04/2015 16:07, Catalin Vasile wrote:

I don't get the part with getting cryptodev upstream. I don't know what getting cryptodev upstream actually implies. From what I know, cryptodev is a finished, functional project that was rejected by the Linux kernel, and there isn't actually a way to get it upstream.

Yes, I agree.

The limitations of AF_ALG need to be addressed somehow, so what is the next step?

Stefan

If we want a mainstream userspace backend that could interact with a lot of crypto engines, we could use OpenSSL (it can actually use cryptodev and AF_ALG as engines).

For now, until mid-June (my diploma project presentation), I still want to use vhost as a backend, for the sole purpose of having a finished backend that I now have a good grasp of. Whether the finished work is good enough to be merged upstream can be discussed later.

As a GSoC project, OpenSSL as a backend would continue the virtio-crypto development, as it's not uncommon to have multiple types of backends. The current work on the virtio-crypto qemu and guest module is pretty backend agnostic, and could allow future development (use of other backends and other features).
Re: [PATCH v4 7/8] vhost: feature to set the vring endianness
On Tue, 21 Apr 2015 20:25:03 +0200 Michael S. Tsirkin m...@redhat.com wrote:

[ ... ]

@@ -630,6 +634,53 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
         return 0;
 }

+#ifdef CONFIG_VHOST_SET_ENDIAN_LEGACY
+static long vhost_set_vring_big_endian(struct vhost_virtqueue *vq,
+                                       int __user *argp)
+{
+        struct vhost_vring_state s;
+
+        if (vq->private_data)
+                return -EBUSY;
+
+        if (copy_from_user(&s, argp, sizeof(s)))
+                return -EFAULT;
+
+        if (s.num && s.num != 1)

s.num & ~0x1

Since s.num is unsigned and I assume this won't change, what about s.num > 1 as suggested by Cornelia ?

I just tried and gcc optimizes s.num != 0 && s.num != 1 to s.num > 1. The former will be more readable once we replace 0 and 1 with defines. So ignore my advice, keep code as is but use defines.

Ok.

[ ... ]

--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -103,6 +103,15 @@ struct vhost_memory {
 /* Get accessor: reads index, writes value in num */
 #define VHOST_GET_VRING_BASE _IOWR(VHOST_VIRTIO, 0x12, struct vhost_vring_state)

+/* Set the vring byte order in num. This is a legacy only API that is simply
+ * ignored when VIRTIO_F_VERSION_1 is set.
+ * 0 to set to little-endian
+ * 1 to set to big-endian

How about defines for these?

Ok. I'll put the defines here so that all the cross-endian stuff lies in the same hunk. Is it ok for you ?

Fine.

+ * other values return EINVAL.

Pls also add a note saying that not all kernel configurations support this ioctl, but all configurations that support SET also support GET.

Ok.

+ */
+#define VHOST_SET_VRING_BIG_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
+#define VHOST_GET_VRING_BIG_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
+
 /* The following ioctls use eventfd file descriptors to signal and poll
  * for events. */

I'm inclined to think VHOST_SET_VRING_ENDIAN is a slightly better name. What do you think?

Or VHOST_SET_VRING_CROSS_ENDIAN ?
I like the idea to keep a hint that this API is for cross-endian only... like the rest of this series.

--
Greg
Re: [GSoC] project proposal
In those examples, the algorithms are used as part of standard protocols, not as standalone algorithms. CryptoAPI itself offers basic primitives such as encryption and authentication, which can be combined however you like. Some combinations can result in other protocol implementations as well.

On Wed, Apr 22, 2015 at 11:27 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
On Tue, Apr 21, 2015 at 05:24:55PM +0300, Catalin Vasile wrote:

Can you give me more details on GnuTLS? I'm going through some documentation and code and I see that it doesn't actually have separate encryption and authentication primitives.

gnutls is a natural choice because QEMU already uses it for TLS, but if it doesn't support the primitives you need, then AF_ALG could be used directly.

http://www.gnutls.org/manual/gnutls.html#Using-GnuTLS-as-a-cryptographic-library

Stefan
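As an illustration of what "using AF_ALG directly" looks like for a single standalone primitive, here is a minimal sketch that computes a SHA-256 digest through the kernel crypto API socket interface. The function name af_alg_sha256 is made up for this example, the choice of a hash (rather than a cipher or an authenticated mode) is only for brevity, and the code assumes a Linux system whose kernel exposes AF_ALG with the "sha256" hash; it returns -1 when that is not the case.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/if_alg.h>

/* Hash len bytes of data with the kernel's sha256 transform.
 * Returns 0 on success, -1 if AF_ALG or the algorithm is unavailable. */
int af_alg_sha256(const void *data, size_t len, unsigned char digest[32])
{
    struct sockaddr_alg sa = {
        .salg_family = AF_ALG,
        .salg_type   = "hash",    /* class of primitive */
        .salg_name   = "sha256",  /* algorithm within that class */
    };
    int rc = -1;

    assert(data != NULL && digest != NULL);

    /* The bound socket names the transform ... */
    int tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
    if (tfm < 0)
        return -1;
    if (bind(tfm, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        close(tfm);
        return -1;
    }

    /* ... and each accept() yields one operation on it. */
    int op = accept(tfm, NULL, 0);
    if (op < 0) {
        close(tfm);
        return -1;
    }

    /* Write the message, then read back the 32-byte digest. */
    if (write(op, data, len) == (ssize_t)len && read(op, digest, 32) == 32)
        rc = 0;

    close(op);
    close(tfm);
    return rc;
}
```

Ciphers follow the same socket/bind/accept pattern with a different salg_type and an extra setsockopt() for the key, which is how separate primitives can be combined by the caller.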