Re: Semantics of -cpu host (was Re: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest)
On 05/07/2012 08:21 PM, Eduardo Habkost wrote:
> Andre? Are you able to help to answer the question below?

Sorry for the delay, the easy answers first:

> I would like to clarify what's the expected behavior of -cpu host to be able to continue working on it.

The purpose of -cpu host is to let guests use ISA features that the host CPU provides. Those would not need any KVM/QEMU intervention, because they work out of the box. Things like AES or SSE4.2, which are used by Linux and glibc, for instance. Users would expect those to be usable, and actually we only need to set the bits and are done.

A second goal is to get rid of the awkward and artificial family/model/stepping settings and just let the guest see the actual CPU model. This has some strings attached, though, but other virtualization solutions do it the same way and they can cope with it.

A third thing currently not really addressed is the more informational bits like cache size and organization and TLB layout. Those are properties of the (core) CPU (except shared L3 cache, maybe) and apply to guests as well. I don't see any reason why this should not be exposed to the guest. Off the top of my head I don't know any prominent user (just glibc reading the cache size, but not really using it), but I guess there are applications which benefit from it.

What clearly is not the intention of -cpu host is to provide access to every feature a certain CPU model provides. So things which require emulation or hypervisor intervention are not in the primary use case. That does not mean that they don't need to be implemented, but that was not the purpose of -cpu host and they should be handled independently.

Regards, Andre.

> I believe the code will need to be fixed in either case, but first we need to figure out what the expectations/requirements are, to know _which_ changes will be needed.
On Tue, Apr 24, 2012 at 02:19:25PM -0300, Eduardo Habkost wrote:
> (CCing Andre Przywara, in case he can help to clarify what's the expected meaning of -cpu host)
[...]
I am not sure I understand what you are proposing. Let me explain the use case I am thinking about:

- Feature FOO is of type (A) (e.g. just a new instruction set that doesn't require additional userspace support)
- User has a Qemu version that doesn't know anything about feature FOO
- User gets a new CPU that supports feature FOO
- User gets a new kernel that supports feature FOO (i.e. has FOO in GET_SUPPORTED_CPUID)
- User does _not_ upgrade Qemu.
- User expects to get feature FOO enabled if using -cpu host, without upgrading Qemu.

The problem here is: to support the above use case, userspace needs a probing mechanism that can differentiate _new_ (previously unknown) features that are in group (A) (safe to blindly enable) from features that are in group (B) (that can't be enabled without a userspace upgrade).

In short, it becomes a problem if we consider the following case:

- Feature BAR is of type (B) (it can't be enabled without extra userspace support)
- User has a Qemu version that doesn't know anything about feature BAR
- User gets a new CPU that supports feature BAR
- User gets a new kernel that supports feature BAR (i.e. has BAR in GET_SUPPORTED_CPUID)
- User does _not_ upgrade Qemu.
- User simply shouldn't get feature BAR enabled, even if using -cpu host, otherwise Qemu would break.

If userspace always limited itself to features it knows about, it would be really easy to implement the feature without any new probing mechanism from the kernel. But that's not how I think users expect -cpu host to work. Maybe I am wrong, I don't know. I am CCing Andre, who introduced the -cpu host feature, in case he can explain the expected semantics in the cases above.
-- Andre Przywara AMD-Operating System Research Center (OSRC), Dresden, Germany -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] qemu-kvm: Build fixes for ppc64-softmmu
On 05/09/2012 05:59 AM, Benjamin Herrenschmidt wrote:
> This is a small collection of build fixes that allow the qemu-kvm tree to build a ppc64-softmmu target with kvm enabled (provided device-assignment is disabled).

Applied, thanks.

-- 
error compiling committee.c: too many arguments to function
Re: Semantics of -cpu host (was Re: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest)
On Wed, May 09, 2012 at 12:07:04AM +0200, Alexander Graf wrote: On 08.05.2012, at 22:14, Eduardo Habkost wrote: On Tue, May 08, 2012 at 02:58:11AM +0200, Alexander Graf wrote: On 07.05.2012, at 20:21, Eduardo Habkost wrote: Andre? Are you able to help to answer the question below? I would like to clarify what's the expected behavior of -cpu host to be able to continue working on it. I believe the code will need to be fixed on either case, but first we need to figure out what are the expectations/requirements, to know _which_ changes will be needed. On Tue, Apr 24, 2012 at 02:19:25PM -0300, Eduardo Habkost wrote: (CCing Andre Przywara, in case he can help to clarify what's the expected meaning of -cpu host) [...] I am not sure I understand what you are proposing. Let me explain the use case I am thinking about: - Feature FOO is of type (A) (e.g. just a new instruction set that doesn't require additional userspace support) - User has a Qemu vesion that doesn't know anything about feature FOO - User gets a new CPU that supports feature FOO - User gets a new kernel that supports feature FOO (i.e. has FOO in GET_SUPPORTED_CPUID) - User does _not_ upgrade Qemu. - User expects to get feature FOO enabled if using -cpu host, without upgrading Qemu. The problem here is: to support the above use-case, userspace need a probing mechanism that can differentiate _new_ (previously unknown) features that are in group (A) (safe to blindly enable) from features that are in group (B) (that can't be enabled without an userspace upgrade). In short, it becomes a problem if we consider the following case: - Feature BAR is of type (B) (it can't be enabled without extra userspace support) - User has a Qemu version that doesn't know anything about feature BAR - User gets a new CPU that supports feature BAR - User gets a new kernel that supports feature BAR (i.e. has BAR in GET_SUPPORTED_CPUID) - User does _not_ upgrade Qemu. 
- User simply shouldn't get feature BAR enabled, even if using -cpu host, otherwise Qemu would break. If userspace always limited itself to features it knows about, it would be really easy to implement the feature without any new probing mechanism from the kernel. But that's not how I think users expect -cpu host to work. Maybe I am wrong, I don't know. I am CCing Andre, who introduced the -cpu host feature, in case he can explain what's the expected semantics on the cases above. Can you think of any feature that'd go into category B? - TSC-deadline: can't be enabled unless userspace takes care to enable the in-kernel irqchip. The kernel can check if in-kernel irqchip has it enabled and otherwise mask it out, no? How should the kernel know that userspace does not emulate it? - x2apic: ditto. Same here. For user space irqchip the kernel side doesn't care. If in-kernel APIC is enabled, check for its capabilities. Same here. Well, technically both of those features can't be implemented in userspace right now since MSRs are terminated in the kernel, but I wouldn't make it into ABI. -- Gleb.
Re: [PATCH unit-tests] Add async page fault test
On 05/08/2012 02:24 PM, Gleb Natapov wrote:
> Signed-off-by: Gleb Natapov g...@redhat.com

Please describe the regression you're testing for. We could even link it to the fix with the commit hash.

>  void vfree(void *mem)
>  {
>      unsigned long size = ((unsigned long *)mem)[-1];
>
> diff --git a/lib/x86/vm.h b/lib/x86/vm.h
> index 71ab4a8..ff4842f 100644
> --- a/lib/x86/vm.h
> +++ b/lib/x86/vm.h
> @@ -22,6 +22,7 @@ void vfree(void *mem);
>  void *vmap(unsigned long long phys, unsigned long size);
>  void *alloc_vpage(void);
>  void *alloc_vpages(ulong nr);
> +unsigned long virt_to_phys_cr3(void *mem);

uint64_t.

> @@ -0,0 +1,98 @@
> +/*
> + * Async PF test. For the test to actually do anything it needs to be
> + * started in a memory cgroup with 512M of memory and with more than 1G
> + * memory provided to the guest.
> + */

Please include instructions or a script on how to do that.

Alternative ways of doing this:
- file-backed memory using FUSE to control paging
- add madvise(MADV_DONTNEED) support to testdev, and have the guest trigger page-in itself.

I'm not asking to change this test, just providing ideas for the future in case fine-grained control is needed. It also doesn't thrash the disk.

> +#include "x86/msr.h"
> +#include "x86/processor.h"
> +#include "x86/apic-defs.h"
> +#include "x86/apic.h"
> +#include "x86/desc.h"
> +#include "x86/isr.h"
> +#include "x86/vm.h"
> +
> +#include "libcflat.h"
> +#include <stdint.h>
> +
> +#define KVM_PV_REASON_PAGE_NOT_PRESENT 1
> +#define KVM_PV_REASON_PAGE_READY 2
> +
> +#define MSR_KVM_ASYNC_PF_EN 0x4b564d02
> +
> +#define KVM_ASYNC_PF_ENABLED (1 << 0)
> +#define KVM_ASYNC_PF_SEND_ALWAYS (1 << 1)
> +
> +volatile uint32_t apf_reason __attribute__((aligned(64)));
> +char *buf;
> +volatile uint64_t i;
> +volatile unsigned long phys;

uint64_t.

> +
> +static void pf_isr(struct ex_regs *r)
> +{
> +    void *virt = (void *)((ulong)(buf + i) & ~4095ul);
> +
> +    switch (get_apf_reason()) {
> +    case 0: default:
> +        printf("unexpected #PF at %p\n", read_cr2());
> +        fail = true;
> +        break;
> +    case KVM_PV_REASON_PAGE_NOT_PRESENT:
> +        phys = virt_to_phys_cr3(virt);
> +        install_pte(phys_to_virt(read_cr3()), 1, virt, phys, 0);
> +        write_cr3(read_cr3());

What's the point of these?

> +        printf("Got not present #PF token %x virt addr %p phys addr %p\n", read_cr2(), virt, phys);
> +        while (phys) {
> +            irq_enable();
> +            halt();

Racy... you need safe_halt() here.

> +            irq_disable();
> +        }
> +        break;
> +    case KVM_PV_REASON_PAGE_READY:
> +        printf("Got present #PF token %x\n", read_cr2());
> +        if ((uint32_t)read_cr2() == ~0)
> +            break;
> +        install_pte(phys_to_virt(read_cr3()), 1, virt, phys | PTE_PRESENT | PTE_WRITE, 0);
> +        write_cr3(read_cr3());
> +        phys = 0;
> +        break;
> +    }
> +}

-- 
error compiling committee.c: too many arguments to function
Re: Semantics of -cpu host (was Re: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest)
On 09.05.2012, at 10:14, Gleb Natapov g...@redhat.com wrote: On Wed, May 09, 2012 at 12:07:04AM +0200, Alexander Graf wrote: On 08.05.2012, at 22:14, Eduardo Habkost wrote: On Tue, May 08, 2012 at 02:58:11AM +0200, Alexander Graf wrote: On 07.05.2012, at 20:21, Eduardo Habkost wrote: Andre? Are you able to help to answer the question below? I would like to clarify what's the expected behavior of -cpu host to be able to continue working on it. I believe the code will need to be fixed on either case, but first we need to figure out what are the expectations/requirements, to know _which_ changes will be needed. On Tue, Apr 24, 2012 at 02:19:25PM -0300, Eduardo Habkost wrote: (CCing Andre Przywara, in case he can help to clarify what's the expected meaning of -cpu host) [...] I am not sure I understand what you are proposing. Let me explain the use case I am thinking about: - Feature FOO is of type (A) (e.g. just a new instruction set that doesn't require additional userspace support) - User has a Qemu vesion that doesn't know anything about feature FOO - User gets a new CPU that supports feature FOO - User gets a new kernel that supports feature FOO (i.e. has FOO in GET_SUPPORTED_CPUID) - User does _not_ upgrade Qemu. - User expects to get feature FOO enabled if using -cpu host, without upgrading Qemu. The problem here is: to support the above use-case, userspace need a probing mechanism that can differentiate _new_ (previously unknown) features that are in group (A) (safe to blindly enable) from features that are in group (B) (that can't be enabled without an userspace upgrade). In short, it becomes a problem if we consider the following case: - Feature BAR is of type (B) (it can't be enabled without extra userspace support) - User has a Qemu version that doesn't know anything about feature BAR - User gets a new CPU that supports feature BAR - User gets a new kernel that supports feature BAR (i.e. has BAR in GET_SUPPORTED_CPUID) - User does _not_ upgrade Qemu. 
- User simply shouldn't get feature BAR enabled, even if using -cpu host, otherwise Qemu would break. If userspace always limited itself to features it knows about, it would be really easy to implement the feature without any new probing mechanism from the kernel. But that's not how I think users expect -cpu host to work. Maybe I am wrong, I don't know. I am CCing Andre, who introduced the -cpu host feature, in case he can explain what's the expected semantics on the cases above. Can you think of any feature that'd go into category B? - TSC-deadline: can't be enabled unless userspace takes care to enable the in-kernel irqchip. The kernel can check if in-kernel irqchip has it enabled and otherwise mask it out, no? How kernel should know that userspace does not emulate it? You have to enable the in-kernel apic to use it, at which point the kernel knows it's in use, right? - x2apic: ditto. Same here. For user space irqchip the kernel side doesn't care. If in-kernel APIC is enabled, check for its capabilities. Same here. Well, technically both of those features can't be implemented in userspace right now since MSRs are terminated in the kernel, but I Doesn't sound like the greatest design - unless you deprecate the non-in-kernel apic case. wouldn't make it into ABI. -- Gleb.
Re: [PATCH unit-tests] Add async page fault test
On Wed, May 09, 2012 at 11:29:11AM +0300, Avi Kivity wrote:
> On 05/08/2012 02:24 PM, Gleb Natapov wrote:
>> Signed-off-by: Gleb Natapov g...@redhat.com
>
> Please describe the regression you're testing for. We could even link it to the fix with the commit hash.

The test does not really test for a regression, because there wasn't one. It tests that prefault does not generate spurious #PFs. Will write that.

>>  void vfree(void *mem)
>>  {
>>      unsigned long size = ((unsigned long *)mem)[-1];
>>
>> diff --git a/lib/x86/vm.h b/lib/x86/vm.h
>> index 71ab4a8..ff4842f 100644
>> --- a/lib/x86/vm.h
>> +++ b/lib/x86/vm.h
>> @@ -22,6 +22,7 @@ void vfree(void *mem);
>>  void *vmap(unsigned long long phys, unsigned long size);
>>  void *alloc_vpage(void);
>>  void *alloc_vpages(ulong nr);
>> +unsigned long virt_to_phys_cr3(void *mem);
>
> uint64_t.

virt_to_phys() is also unsigned long. And so is get_pte(), which virt_to_phys_cr3() uses. I guess the code is not ready for more than 2^32 of memory in a 32bit VM.

>> @@ -0,0 +1,98 @@
>> +/*
>> + * Async PF test. For the test to actually do anything it needs to be
>> + * started in a memory cgroup with 512M of memory and with more than 1G
>> + * memory provided to the guest.
>> + */
>
> Please include instructions or a script on how to do that.

OK.

> Alternative ways of doing this:
> - file-backed memory using FUSE to control paging

Not sure how that can be done.

> - add madvise(MADV_DONTNEED) support to testdev, and have the guest trigger page-in itself.

MADV_DONTNEED will drop the page, not swap it out.

> I'm not asking to change this test, just providing ideas for the future in case fine-grained control is needed. It also doesn't thrash the disk.

>> +#include "x86/msr.h"
>> +#include "x86/processor.h"
>> +#include "x86/apic-defs.h"
>> +#include "x86/apic.h"
>> +#include "x86/desc.h"
>> +#include "x86/isr.h"
>> +#include "x86/vm.h"
>> +
>> +#include "libcflat.h"
>> +#include <stdint.h>
>> +
>> +#define KVM_PV_REASON_PAGE_NOT_PRESENT 1
>> +#define KVM_PV_REASON_PAGE_READY 2
>> +
>> +#define MSR_KVM_ASYNC_PF_EN 0x4b564d02
>> +
>> +#define KVM_ASYNC_PF_ENABLED (1 << 0)
>> +#define KVM_ASYNC_PF_SEND_ALWAYS (1 << 1)
>> +
>> +volatile uint32_t apf_reason __attribute__((aligned(64)));
>> +char *buf;
>> +volatile uint64_t i;
>> +volatile unsigned long phys;
>
> uint64_t.

>> +
>> +static void pf_isr(struct ex_regs *r)
>> +{
>> +    void *virt = (void *)((ulong)(buf + i) & ~4095ul);
>> +
>> +    switch (get_apf_reason()) {
>> +    case 0: default:

I'd rather make default: fail the test. It shouldn't happen.

>> +        printf("unexpected #PF at %p\n", read_cr2());
>> +        fail = true;
>> +        break;
>> +    case KVM_PV_REASON_PAGE_NOT_PRESENT:
>> +        phys = virt_to_phys_cr3(virt);
>> +        install_pte(phys_to_virt(read_cr3()), 1, virt, phys, 0);
>> +        write_cr3(read_cr3());
>
> What's the point of these?

Shouldn't we reload page tables after changing them?

>> +        printf("Got not present #PF token %x virt addr %p phys addr %p\n", read_cr2(), virt, phys);
>> +        while (phys) {
>> +            irq_enable();
>> +            halt();
>
> Racy... you need safe_halt() here.

The code generated is sti; hlt; cli. But I may as well add safe_halt() to do sti; hlt explicitly.

>> +            irq_disable();
>> +        }
>> +        break;
>> +    case KVM_PV_REASON_PAGE_READY:
>> +        printf("Got present #PF token %x\n", read_cr2());
>> +        if ((uint32_t)read_cr2() == ~0)
>> +            break;
>> +        install_pte(phys_to_virt(read_cr3()), 1, virt, phys | PTE_PRESENT | PTE_WRITE, 0);
>> +        write_cr3(read_cr3());
>> +        phys = 0;
>> +        break;
>> +    }
>> +}
>
> -- 
> error compiling committee.c: too many arguments to function

-- 
Gleb.
Re: Semantics of -cpu host (was Re: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest)
On Wed, May 09, 2012 at 10:42:26AM +0200, Alexander Graf wrote: On 09.05.2012, at 10:14, Gleb Natapov g...@redhat.com wrote: On Wed, May 09, 2012 at 12:07:04AM +0200, Alexander Graf wrote: On 08.05.2012, at 22:14, Eduardo Habkost wrote: On Tue, May 08, 2012 at 02:58:11AM +0200, Alexander Graf wrote: On 07.05.2012, at 20:21, Eduardo Habkost wrote: Andre? Are you able to help to answer the question below? I would like to clarify what's the expected behavior of -cpu host to be able to continue working on it. I believe the code will need to be fixed on either case, but first we need to figure out what are the expectations/requirements, to know _which_ changes will be needed. On Tue, Apr 24, 2012 at 02:19:25PM -0300, Eduardo Habkost wrote: (CCing Andre Przywara, in case he can help to clarify what's the expected meaning of -cpu host) [...] I am not sure I understand what you are proposing. Let me explain the use case I am thinking about: - Feature FOO is of type (A) (e.g. just a new instruction set that doesn't require additional userspace support) - User has a Qemu vesion that doesn't know anything about feature FOO - User gets a new CPU that supports feature FOO - User gets a new kernel that supports feature FOO (i.e. has FOO in GET_SUPPORTED_CPUID) - User does _not_ upgrade Qemu. - User expects to get feature FOO enabled if using -cpu host, without upgrading Qemu. The problem here is: to support the above use-case, userspace need a probing mechanism that can differentiate _new_ (previously unknown) features that are in group (A) (safe to blindly enable) from features that are in group (B) (that can't be enabled without an userspace upgrade). In short, it becomes a problem if we consider the following case: - Feature BAR is of type (B) (it can't be enabled without extra userspace support) - User has a Qemu version that doesn't know anything about feature BAR - User gets a new CPU that supports feature BAR - User gets a new kernel that supports feature BAR (i.e. 
has BAR in GET_SUPPORTED_CPUID) - User does _not_ upgrade Qemu. - User simply shouldn't get feature BAR enabled, even if using -cpu host, otherwise Qemu would break. If userspace always limited itself to features it knows about, it would be really easy to implement the feature without any new probing mechanism from the kernel. But that's not how I think users expect -cpu host to work. Maybe I am wrong, I don't know. I am CCing Andre, who introduced the -cpu host feature, in case he can explain what's the expected semantics on the cases above. Can you think of any feature that'd go into category B? - TSC-deadline: can't be enabled unless userspace takes care to enable the in-kernel irqchip. The kernel can check if in-kernel irqchip has it enabled and otherwise mask it out, no? How kernel should know that userspace does not emulate it? You have to enable the in-kernel apic to use it, at which point the kernel knows it's in use, right? - x2apic: ditto. Same here. For user space irqchip the kernel side doesn't care. If in-kernel APIC is enabled, check for its capabilities. Same here. Well, technically both of those features can't be implemented in userspace right now since MSRs are terminated in the kernel, but I Doesn't sound like the greatest design - unless you deprecate the non-in-kernel apic case. You mean terminating MSRs in the kernel does not sound like the greatest design? I do not disagree. That is why IMO the kernel can't filter out TSC-deadline and x2apic like you suggest. wouldn't make it into ABI. -- Gleb. -- Gleb.
Re: [PATCH unit-tests] Add async page fault test
On 05/09/2012 11:41 AM, Gleb Natapov wrote:
>>>  void vfree(void *mem)
>>>  {
>>>      unsigned long size = ((unsigned long *)mem)[-1];
>>>
>>> diff --git a/lib/x86/vm.h b/lib/x86/vm.h
>>> index 71ab4a8..ff4842f 100644
>>> --- a/lib/x86/vm.h
>>> +++ b/lib/x86/vm.h
>>> @@ -22,6 +22,7 @@ void vfree(void *mem);
>>>  void *vmap(unsigned long long phys, unsigned long size);
>>>  void *alloc_vpage(void);
>>>  void *alloc_vpages(ulong nr);
>>> +unsigned long virt_to_phys_cr3(void *mem);
>>
>> uint64_t.
>
> virt_to_phys() is also unsigned long. And so is get_pte(), which virt_to_phys_cr3() uses. I guess the code is not ready for more than 2^32 of memory in a 32bit VM.

It's certainly not enterprise quality yet. But let's not add more problems.

>> Alternative ways of doing this:
>> - file-backed memory using FUSE to control paging
>
> Not sure how that can be done.

>> - add madvise(MADV_DONTNEED) support to testdev, and have the guest trigger page-in itself.
>
> MADV_DONTNEED will drop the page, not swap it out.

Right, but it will have to be reloaded from disk (it has to be file-backed for this to work). If it's dirty, sync it first.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH unit-tests] Add async page fault test
On Wed, May 09, 2012 at 11:52:41AM +0300, Avi Kivity wrote:
> On 05/09/2012 11:41 AM, Gleb Natapov wrote:
>> virt_to_phys() is also unsigned long. And so is get_pte(), which virt_to_phys_cr3() uses. I guess the code is not ready for more than 2^32 of memory in a 32bit VM.
>
> It's certainly not enterprise quality yet. But let's not add more problems.

Okay.

>>> Alternative ways of doing this:
>>> - file-backed memory using FUSE to control paging
>>
>> Not sure how that can be done.

>>> - add madvise(MADV_DONTNEED) support to testdev, and have the guest trigger page-in itself.
>>
>> MADV_DONTNEED will drop the page, not swap it out.
>
> Right, but it will have to be reloaded from disk (it has to be file-backed for this to work). If it's dirty, sync it first.

Hmm, yes, if it is file-backed it may work. Setting up qemu to use file-backed memory is one more complication while running the test, though. I haven't checked, but I am not sure that MADV_DONTNEED will drop the page immediately. It probably puts it on some list to be freed later. Hmm, actually looking at the comments it seems like this is what happens:

/*
 * Application no longer needs these pages. If the pages are dirty,
 * it's OK to just throw them away. The app will be more careful about
 * data it wants to keep. Be sure to free swap resources too. The
 * zap_page_range call sets things up for shrink_active_list to actually free
 * these pages later if no one else has touched them in the meantime,
 * although we could add these pages to a global reuse list for
 * shrink_active_list to pick up before reclaiming other pages.
 */

-- 
Gleb.
[PATCH 07/22] ppc: avoid buffer overrun: use pstrcpy, not strncpy
From: Jim Meyering meyer...@redhat.com

A terminal NUL is required by caller's use of strchr. It's better not to use strncpy at all, since there is no need to zero out hundreds of trailing bytes for each iteration.

Signed-off-by: Jim Meyering meyer...@redhat.com
---
 target-ppc/kvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index c09cc39..fb79e9f 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -587,7 +587,7 @@ static int read_cpuinfo(const char *field, char *value, int len)
             break;
         }
         if (!strncmp(line, field, field_len)) {
-            strncpy(value, line, len);
+            pstrcpy(value, len, line);
             ret = 0;
             break;
         }
-- 
1.7.10.1.487.ga3935e6
Re: Semantics of -cpu host (was Re: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest)
On Wed, May 09, 2012 at 11:05:58AM +0200, Alexander Graf wrote: On 09.05.2012, at 10:51, Gleb Natapov wrote: On Wed, May 09, 2012 at 10:42:26AM +0200, Alexander Graf wrote: On 09.05.2012, at 10:14, Gleb Natapov g...@redhat.com wrote: On Wed, May 09, 2012 at 12:07:04AM +0200, Alexander Graf wrote: On 08.05.2012, at 22:14, Eduardo Habkost wrote: On Tue, May 08, 2012 at 02:58:11AM +0200, Alexander Graf wrote: On 07.05.2012, at 20:21, Eduardo Habkost wrote: Andre? Are you able to help to answer the question below? I would like to clarify what's the expected behavior of -cpu host to be able to continue working on it. I believe the code will need to be fixed on either case, but first we need to figure out what are the expectations/requirements, to know _which_ changes will be needed. On Tue, Apr 24, 2012 at 02:19:25PM -0300, Eduardo Habkost wrote: (CCing Andre Przywara, in case he can help to clarify what's the expected meaning of -cpu host) [...] I am not sure I understand what you are proposing. Let me explain the use case I am thinking about: - Feature FOO is of type (A) (e.g. just a new instruction set that doesn't require additional userspace support) - User has a Qemu vesion that doesn't know anything about feature FOO - User gets a new CPU that supports feature FOO - User gets a new kernel that supports feature FOO (i.e. has FOO in GET_SUPPORTED_CPUID) - User does _not_ upgrade Qemu. - User expects to get feature FOO enabled if using -cpu host, without upgrading Qemu. The problem here is: to support the above use-case, userspace need a probing mechanism that can differentiate _new_ (previously unknown) features that are in group (A) (safe to blindly enable) from features that are in group (B) (that can't be enabled without an userspace upgrade). 
In short, it becomes a problem if we consider the following case: - Feature BAR is of type (B) (it can't be enabled without extra userspace support) - User has a Qemu version that doesn't know anything about feature BAR - User gets a new CPU that supports feature BAR - User gets a new kernel that supports feature BAR (i.e. has BAR in GET_SUPPORTED_CPUID) - User does _not_ upgrade Qemu. - User simply shouldn't get feature BAR enabled, even if using -cpu host, otherwise Qemu would break. If userspace always limited itself to features it knows about, it would be really easy to implement the feature without any new probing mechanism from the kernel. But that's not how I think users expect -cpu host to work. Maybe I am wrong, I don't know. I am CCing Andre, who introduced the -cpu host feature, in case he can explain what's the expected semantics on the cases above. Can you think of any feature that'd go into category B? - TSC-deadline: can't be enabled unless userspace takes care to enable the in-kernel irqchip. The kernel can check if in-kernel irqchip has it enabled and otherwise mask it out, no? How kernel should know that userspace does not emulate it? You have to enable the in-kernel apic to use it, at which point the kernel knows it's in use, right? - x2apic: ditto. Same here. For user space irqchip the kernel side doesn't care. If in-kernel APIC is enabled, check for its capabilities. Same here. Well, technically both of those features can't be implemented in userspace right now since MSRs are terminated in the kernel, but I Doesn't sound like the greatest design - unless you deprecate the non-in-kernel apic case. You mean terminating MSRs in kernel does not sound like the greatest design? I do not disagree. That is why IMO kernel can't filter out TSC-deadline and x2apic like you suggest. I still don't see why it can't. Imagine we would filter TSC-deadline and x2apic by default in the kernel - they are not known to exist yet. 
Now, we implement TSC-deadline in the kernel. We still filter TSC-deadline out in GET_SUPPORTED_CPUID in the kernel. But we provide an interface to user space that says "call me to enable TSC-deadline CPUID, but only if you're using the in-kernel apic". New user space calls that ioctl when it's using the in-kernel apic; it doesn't when it's using the user space apic. Old user space doesn't call that ioctl. First of all, we already have TSC-deadline in GET_SUPPORTED_CPUID without any additional ioctls. And second, I do not see why we need additional ioctls here. Hmm, so maybe I misunderstood you. You propose to mask TSC-deadline and x2apic out from GET_SUPPORTED_CPUID if the irq chip is not in kernel, not from KVM_SET_CPUID? For those two features it may make sense indeed. Not sure there won't be others that are not dependent on irq chip presence. You propose to add additional ioctls to enable them if they appear? So at the end all bits in GET_SUPPORTED_CPUID are consistent with what user
Re: Semantics of -cpu host (was Re: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest)
On 05/09/2012 11:38 AM, Gleb Natapov wrote: On Wed, May 09, 2012 at 11:05:58AM +0200, Alexander Graf wrote: On 09.05.2012, at 10:51, Gleb Natapov wrote: On Wed, May 09, 2012 at 10:42:26AM +0200, Alexander Graf wrote: On 09.05.2012, at 10:14, Gleb Natapovg...@redhat.com wrote: On Wed, May 09, 2012 at 12:07:04AM +0200, Alexander Graf wrote: On 08.05.2012, at 22:14, Eduardo Habkost wrote: On Tue, May 08, 2012 at 02:58:11AM +0200, Alexander Graf wrote: On 07.05.2012, at 20:21, Eduardo Habkost wrote: Andre? Are you able to help to answer the question below? I would like to clarify what's the expected behavior of -cpu host to be able to continue working on it. I believe the code will need to be fixed on either case, but first we need to figure out what are the expectations/requirements, to know _which_ changes will be needed. On Tue, Apr 24, 2012 at 02:19:25PM -0300, Eduardo Habkost wrote: (CCing Andre Przywara, in case he can help to clarify what's the expected meaning of -cpu host) [...] I am not sure I understand what you are proposing. Let me explain the use case I am thinking about: - Feature FOO is of type (A) (e.g. just a new instruction set that doesn't require additional userspace support) - User has a Qemu vesion that doesn't know anything about feature FOO - User gets a new CPU that supports feature FOO - User gets a new kernel that supports feature FOO (i.e. has FOO in GET_SUPPORTED_CPUID) - User does _not_ upgrade Qemu. - User expects to get feature FOO enabled if using -cpu host, without upgrading Qemu. The problem here is: to support the above use-case, userspace need a probing mechanism that can differentiate _new_ (previously unknown) features that are in group (A) (safe to blindly enable) from features that are in group (B) (that can't be enabled without an userspace upgrade). 
In short, it becomes a problem if we consider the following case: - Feature BAR is of type (B) (it can't be enabled without extra userspace support) - User has a Qemu version that doesn't know anything about feature BAR - User gets a new CPU that supports feature BAR - User gets a new kernel that supports feature BAR (i.e. has BAR in GET_SUPPORTED_CPUID) - User does _not_ upgrade Qemu. - User simply shouldn't get feature BAR enabled, even if using -cpu host, otherwise Qemu would break. If userspace always limited itself to features it knows about, it would be really easy to implement the feature without any new probing mechanism from the kernel. But that's not how I think users expect -cpu host to work. Maybe I am wrong, I don't know. I am CCing Andre, who introduced the -cpu host feature, in case he can explain what's the expected semantics on the cases above. Can you think of any feature that'd go into category B? - TSC-deadline: can't be enabled unless userspace takes care to enable the in-kernel irqchip. The kernel can check if in-kernel irqchip has it enabled and otherwise mask it out, no? How kernel should know that userspace does not emulate it? You have to enable the in-kernel apic to use it, at which point the kernel knows it's in use, right? - x2apic: ditto. Same here. For user space irqchip the kernel side doesn't care. If in-kernel APIC is enabled, check for its capabilities. Same here. Well, technically both of those features can't be implemented in userspace right now since MSRs are terminated in the kernel, but I Doesn't sound like the greatest design - unless you deprecate the non-in-kernel apic case. You mean terminating MSRs in kernel does not sound like the greatest design? I do not disagree. That is why IMO kernel can't filter out TSC-deadline and x2apic like you suggest. I still don't see why it can't. Imagine we would filter TSC-deadline and x2apic by default in the kernel - they are not known to exist yet. 
Now, we implement TSC-deadline in the kernel. We still filter TSC-deadline out in GET_SUPPORTED_CPUID in the kernel. But we provide an interface to user space that says "call me to enable TSC-deadline CPUID, but only if you're using the in-kernel apic". New user space calls that ioctl when it's using the in-kernel apic; it doesn't when it's using the user space apic. Old user space doesn't call that ioctl. First of all, we already have TSC-deadline in GET_SUPPORTED_CPUID without any additional ioctls. And second, I do not see why we need additional ioctls here. Yeah, sometimes our ABI is already broken :(. What a shame... Hmm, so maybe I misunderstood you. You propose to mask TSC-deadline and x2apic out from GET_SUPPORTED_CPUID if the irq chip is not in the kernel, not from KVM_SET_CPUID? For those two features it may make sense indeed. Not sure there won't be others that are not dependent on irq chip presence. You propose to add additional ioctls to enable them if they appear? That's the only backwards-compatible way to design this without putting a plethora of knowledge into user space that I can see, yes. So at the end all bits in
Re: 1G huge pages in Linux guest VM
On 05/08/2012 09:38 PM, Sriram Murthy wrote: Hi, I wanted to know if we can pass a hugetlbfs with pagesize=1G created on the host to a Linux guest (the host is running RHEL 6.2, x86_64, on Intel Westmere)? Yes, this is supported. I am using RHEL 6.2 on the host and guest, and the processor that has been passed by KVM to the guest is Intel Westmere and has the pdpe1gb flag set. Does this mean that the guest supports 1G huge pages? Yes. Note that 1GB huge pages in the guest are supported even if not provided by the host (of course the performance improvement will be smaller). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
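For reference, backing the guest with host 1G pages looks roughly like the following. The mount point, page count, and trailing qemu options are illustrative; only the hugetlbfs `pagesize=1G` option, the kernel command-line parameters, and qemu's `-mem-path`/`-mem-prealloc` flags are real interfaces.

```shell
# Host side: 1G pages must be reserved at boot via the kernel command line,
# e.g. "hugepagesz=1G hugepages=4", and /proc/cpuinfo must show pdpe1gb.
mkdir -p /mnt/huge1g
mount -t hugetlbfs -o pagesize=1G none /mnt/huge1g

# Back guest RAM with the 1G-page mount (other options elided):
qemu-system-x86_64 -m 4096 -mem-path /mnt/huge1g -mem-prealloc ...
```

Inside the guest, 1G pages are then a separate, guest-internal decision (its own hugetlbfs mount), which works even without host 1G backing, as noted above.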
[PATCHv2] Add async page fault test
The test checks that when the hypervisor prefaults a swapped-in page, it does not generate spurious #PFs in case the page was unmapped while it was swapped out.

diff --git a/config-x86-common.mak b/config-x86-common.mak
index c8fbda7..6976f78 100644
--- a/config-x86-common.mak
+++ b/config-x86-common.mak
@@ -34,7 +34,7 @@ tests-common = $(TEST_DIR)/vmexit.flat $(TEST_DIR)/tsc.flat \
 	$(TEST_DIR)/realmode.flat $(TEST_DIR)/msr.flat \
 	$(TEST_DIR)/hypercall.flat $(TEST_DIR)/sieve.flat \
 	$(TEST_DIR)/kvmclock_test.flat $(TEST_DIR)/eventinj.flat \
-	$(TEST_DIR)/s3.flat $(TEST_DIR)/pmu.flat
+	$(TEST_DIR)/s3.flat $(TEST_DIR)/pmu.flat $(TEST_DIR)/asyncpf.flat

 ifdef API
 tests-common += api/api-sample
@@ -90,6 +90,8 @@ $(TEST_DIR)/s3.elf: $(cstart.o) $(TEST_DIR)/s3.o

 $(TEST_DIR)/pmu.elf: $(cstart.o) $(TEST_DIR)/pmu.o

+$(TEST_DIR)/asyncpf.elf: $(cstart.o) $(TEST_DIR)/asyncpf.o
+
 arch_clean:
 	$(RM) $(TEST_DIR)/*.o $(TEST_DIR)/*.flat $(TEST_DIR)/*.elf \
 	$(TEST_DIR)/.*.d $(TEST_DIR)/lib/.*.d $(TEST_DIR)/lib/*.o
diff --git a/lib/x86/processor.h b/lib/x86/processor.h
index c7e1afb..e46d8d0 100644
--- a/lib/x86/processor.h
+++ b/lib/x86/processor.h
@@ -302,4 +302,9 @@ static inline void invlpg(void *va)
 {
 	asm volatile("invlpg (%0)" ::"r" (va) : "memory");
 }
+
+static inline void safe_halt(void)
+{
+	asm volatile("sti; hlt");
+}
 #endif
diff --git a/lib/x86/vm.c b/lib/x86/vm.c
index 550ec9b..71b70fd 100644
--- a/lib/x86/vm.c
+++ b/lib/x86/vm.c
@@ -219,6 +219,11 @@ void *vmalloc(unsigned long size)
 	return mem;
 }

+uint64_t virt_to_phys_cr3(void *mem)
+{
+	return (get_pte(phys_to_virt(read_cr3()), mem) & PTE_ADDR) + ((ulong)mem & (PAGE_SIZE - 1));
+}
+
 void vfree(void *mem)
 {
 	unsigned long size = ((unsigned long *)mem)[-1];
diff --git a/lib/x86/vm.h b/lib/x86/vm.h
index 71ab4a8..3473f8d 100644
--- a/lib/x86/vm.h
+++ b/lib/x86/vm.h
@@ -22,6 +22,7 @@ void vfree(void *mem);
 void *vmap(unsigned long long phys, unsigned long size);
 void *alloc_vpage(void);
 void *alloc_vpages(ulong nr);
+uint64_t virt_to_phys_cr3(void *mem);

 void install_pte(unsigned long *cr3, int pte_level,
diff --git a/x86/asyncpf.c b/x86/asyncpf.c
new file mode 100644
index 000..62c0455
--- /dev/null
+++ b/x86/asyncpf.c
@@ -0,0 +1,113 @@
+/*
+ * Async PF test. For the test to actually do anywathing it ineeds to be started
+ * in a memory cgroup with 512M of memory and with more than 1G memory provided
+ * to the guest.
+ *
+ * To create cgroup do as root:
+ * mkdir /dev/cgroup
+ * mount -t cgroup none -omemory /dev/cgroup
+ * chmod a+rxw /dev/cgroup/
+ *
+ * From a shell you will start qemu from:
+ * mkdir /dev/cgroup/1
+ * echo $$ > /dev/cgroup/1/tasks
+ * echo 512M > /dev/cgroup/1/memory.limit_in_bytes
+ *
+ */
+#include "x86/msr.h"
+#include "x86/processor.h"
+#include "x86/apic-defs.h"
+#include "x86/apic.h"
+#include "x86/desc.h"
+#include "x86/isr.h"
+#include "x86/vm.h"
+
+#include "libcflat.h"
+#include <stdint.h>
+
+#define KVM_PV_REASON_PAGE_NOT_PRESENT 1
+#define KVM_PV_REASON_PAGE_READY 2
+
+#define MSR_KVM_ASYNC_PF_EN 0x4b564d02
+
+#define KVM_ASYNC_PF_ENABLED			(1 << 0)
+#define KVM_ASYNC_PF_SEND_ALWAYS		(1 << 1)
+
+volatile uint32_t apf_reason __attribute__((aligned(64)));
+char *buf;
+volatile uint64_t i;
+volatile uint64_t phys;
+bool fail;
+
+static inline uint32_t get_apf_reason(void)
+{
+	uint32_t r = apf_reason;
+	apf_reason = 0;
+	return r;
+}
+
+static void pf_isr(struct ex_regs *r)
+{
+	void* virt = (void*)((ulong)(buf+i) & ~(PAGE_SIZE-1));
+	uint32_t reason = get_apf_reason();
+
+	switch (reason) {
+	case 0:
+		printf("unexpected #PF at %p\n", read_cr2());
+		fail = true;
+		break;
+	case KVM_PV_REASON_PAGE_NOT_PRESENT:
+		phys = virt_to_phys_cr3(virt);
+		install_pte(phys_to_virt(read_cr3()), 1, virt, phys, 0);
+		write_cr3(read_cr3());
+		printf("Got not present #PF token %x virt addr %p phys addr %p\n", read_cr2(), virt, phys);
+		while(phys) {
+			safe_halt(); /* enables irq */
+			irq_disable();
+		}
+		break;
+	case KVM_PV_REASON_PAGE_READY:
+		printf("Got present #PF token %x\n", read_cr2());
+		if ((uint32_t)read_cr2() == ~0)
+			break;
+		install_pte(phys_to_virt(read_cr3()), 1, virt, phys | PTE_PRESENT | PTE_WRITE, 0);
+		write_cr3(read_cr3());
+		phys = 0;
+		break;
+	default:
+		printf("unexpected async pf reason %d\n", reason);
+
[GIT PULL] KVM fixes for 3.4-rc6
Linus, please pull from:

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git master

for you to fetch changes up to 331b646d60b0c3885208e1e02bd9f40319953efc:

  KVM: ia64: fix build due to typo (2012-05-09 13:53:28 +0300)

Two asynchronous page fault fixes (one guest, one host), a powerpc page refcount fix, and an ia64 build fix.

Avi Kivity (1):
      KVM: ia64: fix build due to typo

David Gibson (1):
      KVM: PPC: Book3S HV: Fix refcounting of hugepages

Gleb Natapov (2):
      KVM: ensure async PF event wakes up vcpu from halt
      KVM: Do not take reference to mm during async #PF

 arch/ia64/kvm/kvm-ia64.c            |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 22 +-
 arch/powerpc/kvm/book3s_hv.c        |  2 --
 arch/x86/kernel/kvm.c               |  9 +
 arch/x86/kvm/x86.c                  |  1 +
 5 files changed, 16 insertions(+), 20 deletions(-)

-- error compiling committee.c: too many arguments to function
[PATCH 1/4] kvm tools: fix sigsegv in irq__exit
From: Sasha Levin levinsasha...@gmail.com

We free the structures, but never remove them from the tree or list, so we freed them again the next time we ran through that structure. This patch also simplifies irq__exit a bit.

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/x86/irq.c | 17 +
 1 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/tools/kvm/x86/irq.c b/tools/kvm/x86/irq.c
index dc07f28..e83df99 100644
--- a/tools/kvm/x86/irq.c
+++ b/tools/kvm/x86/irq.c
@@ -179,25 +179,18 @@ int irq__exit(struct kvm *kvm)
 	free(irq_routing);

-	ent = rb_first(&pci_tree);
-	for (;;) {
+	while ((ent = rb_first(&pci_tree))) {
 		struct pci_dev *dev;
-		struct rb_node *next;
 		struct irq_line *line;
-		struct list_head *node, *tmp;
-
-		if (!ent)
-			break;
-
-		next = rb_next(ent);

 		dev = rb_entry(ent, struct pci_dev, node);
-		list_for_each_safe(node, tmp, &dev->lines) {
-			line = list_entry(node, struct irq_line, node);
+		while (!list_empty(&dev->lines)) {
+			line = list_first_entry(&dev->lines, struct irq_line, node);
+			list_del(&line->node);
 			free(line);
 		}

+		rb_erase(&dev->node, &pci_tree);
 		free(dev);
-		ent = next;
 	}

 	return 0;
--
1.7.8.5
[PATCH 3/4] kvm tools: use accessor function for virtio-9p FIDs
From: Sasha Levin levinsasha...@gmail.com

Since the 9p functions don't know the size of the fid array, they might request an FID outside of the allowed range. Use an accessor to prevent that and to hide the internal implementation from them.

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/virtio/9p.c | 66 +++-
 1 files changed, 37 insertions(+), 29 deletions(-)

diff --git a/tools/kvm/virtio/9p.c b/tools/kvm/virtio/9p.c
index e054dff..35e089d 100644
--- a/tools/kvm/virtio/9p.c
+++ b/tools/kvm/virtio/9p.c
@@ -22,6 +22,14 @@ static LIST_HEAD(devs);
 static int compat_id = -1;

+static struct p9_fid *get_fid(struct p9_dev *p9dev, int fid)
+{
+	if (fid >= VIRTIO_9P_MAX_FID)
+		die("virtio-9p max FID (%u) reached!", VIRTIO_9P_MAX_FID);
+
+	return &p9dev->fids[fid];
+}
+
 /* Warning: Immediately use value returned from this function */
 static const char *rel_to_abs(struct p9_dev *p9dev,
			      const char *path, char *abs_path)
@@ -156,7 +164,7 @@ static void virtio_p9_open(struct p9_dev *p9dev,
 	virtio_p9_pdu_readf(pdu, "dd", &fid, &flags);
-	new_fid = &p9dev->fids[fid];
+	new_fid = get_fid(p9dev, fid);

 	if (lstat(new_fid->abs_path, &st) < 0)
 		goto err_out;
@@ -197,7 +205,7 @@ static void virtio_p9_create(struct p9_dev *p9dev,
 	virtio_p9_pdu_readf(pdu, "dsddd", &dfid_val, &name, &flags, &mode, &gid);
-	dfid = &p9dev->fids[dfid_val];
+	dfid = get_fid(p9dev, dfid_val);

 	flags = virtio_p9_openflags(flags);
@@ -245,7 +253,7 @@ static void virtio_p9_mkdir(struct p9_dev *p9dev,
 	virtio_p9_pdu_readf(pdu, "dsdd", &dfid_val, &name, &mode, &gid);
-	dfid = &p9dev->fids[dfid_val];
+	dfid = get_fid(p9dev, dfid_val);

 	sprintf(full_path, "%s/%s", dfid->abs_path, name);
 	ret = mkdir(full_path, mode);
@@ -287,11 +295,11 @@ static void virtio_p9_walk(struct p9_dev *p9dev,
 	virtio_p9_pdu_readf(pdu, "ddw", &fid_val, &newfid_val, &nwname);
-	new_fid = &p9dev->fids[newfid_val];
+	new_fid = get_fid(p9dev, newfid_val);

 	nwqid = 0;
 	if (nwname) {
-		struct p9_fid *fid = &p9dev->fids[fid_val];
+		struct p9_fid *fid = get_fid(p9dev, fid_val);

 		strcpy(new_fid->path, fid->path);
 		/* skip the space for count */
@@ -366,7 +374,7 @@ static void virtio_p9_attach(struct p9_dev *p9dev,
 	stat2qid(&st, &qid);

-	fid = &p9dev->fids[fid_val];
+	fid = get_fid(p9dev, fid_val);
 	fid->fid = fid_val;
 	fid->uid = uid;
 	fid->is_dir = 1;
@@ -418,7 +426,7 @@ static void virtio_p9_read(struct p9_dev *p9dev,
 	rcount = 0;
 	virtio_p9_pdu_readf(pdu, "dqd", &fid_val, &offset, &count);
-	fid = &p9dev->fids[fid_val];
+	fid = get_fid(p9dev, fid_val);

 	iov_base = pdu->in_iov[0].iov_base;
 	iov_len = pdu->in_iov[0].iov_len;
@@ -469,7 +477,7 @@ static void virtio_p9_readdir(struct p9_dev *p9dev,
 	rcount = 0;
 	virtio_p9_pdu_readf(pdu, "dqd", &fid_val, &offset, &count);
-	fid = &p9dev->fids[fid_val];
+	fid = get_fid(p9dev, fid_val);

 	if (!fid->is_dir) {
 		errno = -EINVAL;
@@ -525,7 +533,7 @@ static void virtio_p9_getattr(struct p9_dev *p9dev,
 	struct p9_stat_dotl statl;

 	virtio_p9_pdu_readf(pdu, "dq", &fid_val, &request_mask);
-	fid = &p9dev->fids[fid_val];
+	fid = get_fid(p9dev, fid_val);
 	if (lstat(fid->abs_path, &st) < 0)
 		goto err_out;
@@ -573,7 +581,7 @@ static void virtio_p9_setattr(struct p9_dev *p9dev,
 	struct p9_iattr_dotl p9attr;

 	virtio_p9_pdu_readf(pdu, "dI", &fid_val, &p9attr);
-	fid = &p9dev->fids[fid_val];
+	fid = get_fid(p9dev, fid_val);

 	if (p9attr.valid & ATTR_MODE) {
 		ret = chmod(fid->abs_path, p9attr.mode);
@@ -652,7 +660,7 @@ static void virtio_p9_write(struct p9_dev *p9dev,
 	int twrite_size = sizeof(u32) + sizeof(u64) + sizeof(u32);

 	virtio_p9_pdu_readf(pdu, "dqd", &fid_val, &offset, &count);
-	fid = &p9dev->fids[fid_val];
+	fid = get_fid(p9dev, fid_val);

 	iov_base = pdu->out_iov[0].iov_base;
 	iov_len = pdu->out_iov[0].iov_len;
@@ -691,7 +699,7 @@ static void virtio_p9_remove(struct p9_dev *p9dev,
 	struct p9_fid *fid;

 	virtio_p9_pdu_readf(pdu, "d", &fid_val);
-	fid = &p9dev->fids[fid_val];
+	fid = get_fid(p9dev, fid_val);

 	ret = remove(fid->abs_path);
 	if (ret < 0)
@@ -714,8 +722,8 @@ static void virtio_p9_rename(struct p9_dev *p9dev,
 	char full_path[PATH_MAX], *new_name;

 	virtio_p9_pdu_readf(pdu, "dds", &fid_val, &new_fid_val, &new_name);
-	fid = &p9dev->fids[fid_val];
-	new_fid = &p9dev->fids[new_fid_val];
+	fid = get_fid(p9dev, fid_val);
+	new_fid = get_fid(p9dev, new_fid_val);

 	sprintf(full_path, "%s/%s",
[PATCH 4/4] kvm tools: increase amount of virtio-9p FIDs
From: Sasha Levin levinsasha...@gmail.com

Increase the amount of FIDs since we easily reach the current amount with simple stress tests. This should be changed to use an rbtree or something faster in the future.

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/include/kvm/virtio-9p.h | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/tools/kvm/include/kvm/virtio-9p.h b/tools/kvm/include/kvm/virtio-9p.h
index cd2869a..5902701 100644
--- a/tools/kvm/include/kvm/virtio-9p.h
+++ b/tools/kvm/include/kvm/virtio-9p.h
@@ -12,7 +12,7 @@
 #define VIRTQUEUE_NUM		128
 #define VIRTIO_9P_DEFAULT_TAG	"kvm_9p"
 #define VIRTIO_9P_HDR_LEN	(sizeof(u32)+sizeof(u8)+sizeof(u16))
-#define VIRTIO_9P_MAX_FID	256
+#define VIRTIO_9P_MAX_FID	16384
 #define VIRTIO_9P_VERSION_DOTL	"9P2000.L"
 #define MAX_TAG_LEN		32
--
1.7.8.5
[PATCH 2/4] kvm tools: use bitwise 'and' in kvm-ipc
From: Sasha Levin levinsasha...@gmail.com

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/kvm-ipc.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/tools/kvm/kvm-ipc.c b/tools/kvm/kvm-ipc.c
index 257c806c..70831b8 100644
--- a/tools/kvm/kvm-ipc.c
+++ b/tools/kvm/kvm-ipc.c
@@ -153,7 +153,7 @@ static void *kvm_ipc__thread(void *param)
 				r = kvm_ipc__receive(client);
 			} while (r == 0);

-		} else if (event.events && (EPOLLERR | EPOLLRDHUP | EPOLLHUP)) {
+		} else if (event.events & (EPOLLERR | EPOLLRDHUP | EPOLLHUP)) {
 			kvm_ipc__close_conn(fd);
 		} else {
 			kvm_ipc__receive(fd);
--
1.7.8.5
Re: [PATCHv2] Add async page fault test
On Wed, May 09, 2012 at 02:44:42PM +0200, Igor Mammedov wrote: On 05/09/2012 12:33 PM, Gleb Natapov wrote: The test checks that when hypervisor prefaults swapped in page it does not generates spurious #PFs in case page was unmapped while it was swapped out. [...] + * Async PF test. For the test to actually do anywathing it ineeds to be started typo? Either that, or a new word! -- Gleb.
How to determine the backing host physical memory for a given guest ?
Hello, On an 8-socket Westmere host I am attempting to run a single guest and characterize the virtualization overhead for a system-intensive workload (AIM7-high_systime) as the size of the guest scales (10way/64G, 20way/128G, ... 80way/512G), to do some comparisons between the native vs. guest runs. I have been using numactl to control the cpu-node and memory-node bindings for the qemu instance. For larger guest sizes I end up binding across multiple localities, e.g. for a 40-way guest: numactl --cpunodebind=0,1,2,3 --membind=0,1,2,3 \ qemu-system-x86_64 -smp 40 -m 262144 \ I understand that the actual mappings from a guest virtual address to a host physical address can change. Is there a way to determine [at a given instant] which host NUMA node is providing the backing physical memory for the active guest's kernel and also for the apps actively running in the guest? Guessing that there is a better way (some tool available?) than just diff'ing the per-node memory usage from the before and after output of numactl --hardware on the host. Thanks, Vinod
[PATCH 00/11] KVM fixes for 3.3.5
After a long hiatus, here are a bunch of very delayed fixes for 3.3.5.

Alex Williamson (1):
      KVM: lock slots_lock around device assignment

Avi Kivity (3):
      KVM: Ensure all vcpus are consistent with in-kernel irqchip settings
      KVM: VMX: Fix delayed load of shared MSRs
      KVM: VMX: Fix kvm_set_shared_msr() called in preemptible context

Christian Borntraeger (1):
      KVM: s390: Sanitize fpc registers for KVM_SET_FPU

Gleb Natapov (1):
      KVM: x86 emulator: correctly mask pmc index bits in RDPMC instruction emulation

Jens Freimann (1):
      KVM: s390: do store status after handling STOP_ON_STOP bit

Marcelo Tosatti (1):
      KVM: VMX: vmx_set_cr0 expects kvm->srcu locked

Nadav Har'El (1):
      KVM: nVMX: Fix erroneous exception bitmap check

Takuya Yoshikawa (2):
      KVM: Fix write protection race during dirty logging
      KVM: mmu_notifier: Flush TLBs before releasing mmu_lock

 arch/ia64/kvm/kvm-ia64.c  |  5 +
 arch/s390/kvm/intercept.c | 20
 arch/s390/kvm/kvm-s390.c  |  2 +-
 arch/x86/kvm/pmu.c        |  2 +-
 arch/x86/kvm/vmx.c        | 10 +-
 arch/x86/kvm/x86.c        | 19 +--
 include/linux/kvm_host.h  |  7 +++
 virt/kvm/iommu.c          | 23 +++
 virt/kvm/kvm_main.c       | 23 ++-
 9 files changed, 77 insertions(+), 34 deletions(-)

--
1.7.10.1
[PATCH 01/11] KVM: s390: do store status after handling STOP_ON_STOP bit
From: Jens Freimann jf...@linux.vnet.ibm.com

In handle_stop() handle the stop bit before doing the store status, as described for "Stop and Store Status" in the Principles of Operation. We have to give up the local_int.lock before calling kvm store status since it calls gmap_fault() which might sleep. Since local_int.lock only protects local_int.* and not guest memory, we can give up the lock.

Signed-off-by: Jens Freimann jf...@linux.vnet.ibm.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com
(cherry picked from commit 9e0d5473e2f0ba2d2fe9dab9408edef3060b710e)
---
 arch/s390/kvm/intercept.c | 20
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/arch/s390/kvm/intercept.c b/arch/s390/kvm/intercept.c
index 0243454..a5f6eff 100644
--- a/arch/s390/kvm/intercept.c
+++ b/arch/s390/kvm/intercept.c
@@ -133,13 +133,6 @@ static int handle_stop(struct kvm_vcpu *vcpu)
 	vcpu->stat.exit_stop_request++;
 	spin_lock_bh(&vcpu->arch.local_int.lock);

-	if (vcpu->arch.local_int.action_bits & ACTION_STORE_ON_STOP) {
-		vcpu->arch.local_int.action_bits &= ~ACTION_STORE_ON_STOP;
-		rc = kvm_s390_vcpu_store_status(vcpu,
-						KVM_S390_STORE_STATUS_NOADDR);
-		if (rc >= 0)
-			rc = -EOPNOTSUPP;
-	}

 	if (vcpu->arch.local_int.action_bits & ACTION_RELOADVCPU_ON_STOP) {
 		vcpu->arch.local_int.action_bits &= ~ACTION_RELOADVCPU_ON_STOP;
@@ -155,7 +148,18 @@ static int handle_stop(struct kvm_vcpu *vcpu)
 		rc = -EOPNOTSUPP;
 	}

-	spin_unlock_bh(&vcpu->arch.local_int.lock);
+	if (vcpu->arch.local_int.action_bits & ACTION_STORE_ON_STOP) {
+		vcpu->arch.local_int.action_bits &= ~ACTION_STORE_ON_STOP;
+		/* store status must be called unlocked. Since local_int.lock
+		 * only protects local_int.* and not guest memory we can give
+		 * up the lock here */
+		spin_unlock_bh(&vcpu->arch.local_int.lock);
+		rc = kvm_s390_vcpu_store_status(vcpu,
+						KVM_S390_STORE_STATUS_NOADDR);
+		if (rc >= 0)
+			rc = -EOPNOTSUPP;
+	} else
+		spin_unlock_bh(&vcpu->arch.local_int.lock);
 	return rc;
 }
--
1.7.10.1
[PATCH 02/11] KVM: s390: Sanitize fpc registers for KVM_SET_FPU
From: Christian Borntraeger borntrae...@de.ibm.com

commit 7eef87dc99e419b1cc051e4417c37e4744d7b661 (KVM: s390: fix register setting) added a load of the floating point control register to the KVM_SET_FPU path. Lets make sure that the fpc is valid.

Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com
(cherry picked from commit 851755871c1f3184f4124c466e85881f17fa3226)
---
 arch/s390/kvm/kvm-s390.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index d1c44573..d3cb86c 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -418,7 +418,7 @@ int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
 int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
 	memcpy(&vcpu->arch.guest_fpregs.fprs, &fpu->fprs, sizeof(fpu->fprs));
-	vcpu->arch.guest_fpregs.fpc = fpu->fpc;
+	vcpu->arch.guest_fpregs.fpc = fpu->fpc & FPC_VALID_MASK;
 	restore_fp_regs(&vcpu->arch.guest_fpregs);
 	return 0;
 }
--
1.7.10.1
[PATCH 03/11] KVM: Fix write protection race during dirty logging
From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp

This patch fixes a race introduced by:

  commit 95d4c16ce78cb6b7549a09159c409d52ddd18dae
  KVM: Optimize dirty logging by rmap_write_protect()

During protecting pages for dirty logging, other threads may also try to protect a page in mmu_sync_children() or kvm_mmu_get_page(). In such a case, because get_dirty_log releases mmu_lock before flushing TLB's, the following race condition can happen:

  A (get_dirty_log)     B (another thread)

  lock(mmu_lock)
  clear pte.w
  unlock(mmu_lock)
                        lock(mmu_lock)
                        pte.w is already cleared
                        unlock(mmu_lock)
                        skip TLB flush
                        return
  ...
  TLB flush

Though thread B assumes the page has already been protected when it returns, the remaining TLB entry will break that assumption. This patch fixes this problem by making get_dirty_log hold the mmu_lock until it flushes the TLB's.

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com
(cherry picked from commit 6dbf79e7164e9a86c1e466062c48498142ae6128)
---
 arch/x86/kvm/x86.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9cbfc06..410b6b0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2997,6 +2997,8 @@ static void write_protect_slot(struct kvm *kvm,
			       unsigned long *dirty_bitmap,
			       unsigned long nr_dirty_pages)
 {
+	spin_lock(&kvm->mmu_lock);
+
 	/* Not many dirty pages compared to # of shadow pages. */
 	if (nr_dirty_pages < kvm->arch.n_used_mmu_pages) {
 		unsigned long gfn_offset;
@@ -3004,16 +3006,13 @@ static void write_protect_slot(struct kvm *kvm,
 		for_each_set_bit(gfn_offset, dirty_bitmap, memslot->npages) {
 			unsigned long gfn = memslot->base_gfn + gfn_offset;

-			spin_lock(&kvm->mmu_lock);
 			kvm_mmu_rmap_write_protect(kvm, gfn, memslot);
-			spin_unlock(&kvm->mmu_lock);
 		}
 		kvm_flush_remote_tlbs(kvm);
-	} else {
-		spin_lock(&kvm->mmu_lock);
+	} else
 		kvm_mmu_slot_remove_write_access(kvm, memslot->id);
-		spin_unlock(&kvm->mmu_lock);
-	}
+
+	spin_unlock(&kvm->mmu_lock);
 }

 /*
--
1.7.10.1
[PATCH 04/11] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp

Other threads may process the same page in that small window and skip TLB flush and then return before these functions do flush.

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com
(cherry picked from commit 565f3be2174611f364405bbea2d86e153c2e7e78)
---
 virt/kvm/kvm_main.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c4ac57e..4c68c1e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -289,15 +289,15 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 	 */
 	idx = srcu_read_lock(&kvm->srcu);
 	spin_lock(&kvm->mmu_lock);
+
 	kvm->mmu_notifier_seq++;
 	need_tlb_flush = kvm_unmap_hva(kvm, address) | kvm->tlbs_dirty;
-	spin_unlock(&kvm->mmu_lock);
-	srcu_read_unlock(&kvm->srcu, idx);
-
 	/* we've to flush the tlb before the pages can be freed */
 	if (need_tlb_flush)
 		kvm_flush_remote_tlbs(kvm);
+
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
 }

 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
@@ -335,12 +335,12 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	for (; start < end; start += PAGE_SIZE)
 		need_tlb_flush |= kvm_unmap_hva(kvm, start);
 	need_tlb_flush |= kvm->tlbs_dirty;
-	spin_unlock(&kvm->mmu_lock);
-	srcu_read_unlock(&kvm->srcu, idx);
-
 	/* we've to flush the tlb before the pages can be freed */
 	if (need_tlb_flush)
 		kvm_flush_remote_tlbs(kvm);
+
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
 }

 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
@@ -378,13 +378,14 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 	idx = srcu_read_lock(&kvm->srcu);
 	spin_lock(&kvm->mmu_lock);
-	young = kvm_age_hva(kvm, address);
-	spin_unlock(&kvm->mmu_lock);
-	srcu_read_unlock(&kvm->srcu, idx);
+	young = kvm_age_hva(kvm, address);
 	if (young)
 		kvm_flush_remote_tlbs(kvm);
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
+
 	return young;
 }
--
1.7.10.1
[PATCH 05/11] KVM: x86 emulator: correctly mask pmc index bits in RDPMC instruction emulation
From: Gleb Natapov g...@redhat.com

Signed-off-by: Gleb Natapov g...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com
(cherry picked from commit 270c6c79f4e15e599f47174ecedad932463af7a2)
---
 arch/x86/kvm/pmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 7aad544..3e48c1d 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -413,7 +413,7 @@ int kvm_pmu_read_pmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data)
 	struct kvm_pmc *counters;
 	u64 ctr;
 
-	pmc &= (3u << 30) - 1;
+	pmc &= ~(3u << 30);
 	if (!fixed && pmc >= pmu->nr_arch_gp_counters)
 		return 1;
 	if (fixed && pmc >= pmu->nr_arch_fixed_counters)
-- 
1.7.10.1
[PATCH 07/11] KVM: VMX: Fix delayed load of shared MSRs
Shared MSRs (MSR_*STAR and related) are stored in both vmx->guest_msrs
and in the CPU registers, but vmx_set_msr() only updated memory. Prior
to 46199f33c2953, this didn't matter, since we called
vmx_load_host_state(), which scheduled a vmx_save_host_state(), which
re-synchronized the CPU state, but now we don't, so the CPU state will
not be synchronized until the next exit to host userspace. This mostly
affects nested vmx workloads, which play with these MSRs a lot.

Fix by loading the MSR eagerly.

Signed-off-by: Avi Kivity a...@redhat.com
(cherry picked from commit 9ee73970c03edb68146ceb1ba2a7033c99a5e017)
---
 arch/x86/kvm/vmx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3b4c8d8..fafb325 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2219,6 +2219,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
 		msr = find_msr_entry(vmx, msr_index);
 		if (msr) {
 			msr->data = data;
+			if (msr - vmx->guest_msrs < vmx->save_nmsrs)
+				kvm_set_shared_msr(msr->index, msr->data,
+						   msr->mask);
 			break;
 		}
 		ret = kvm_set_msr_common(vcpu, msr_index, data);
-- 
1.7.10.1
[PATCH 09/11] KVM: VMX: vmx_set_cr0 expects kvm-srcu locked
From: Marcelo Tosatti mtosa...@redhat.com

vmx_set_cr0 is called from vcpu run context, therefore it expects
kvm->srcu to be held (for setting up the real-mode TSS).

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com
(cherry picked from commit 7a4f5ad051e02139a9f1c0f7f4b1acb88915852b)
---
 arch/x86/kvm/vmx.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5d1b0c7..7f33e33 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3918,7 +3918,9 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 		vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
 
 	vmx->vcpu.arch.cr0 = X86_CR0_NW | X86_CR0_CD | X86_CR0_ET;
+	vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
 	vmx_set_cr0(&vmx->vcpu, kvm_read_cr0(vcpu)); /* enter rmode */
+	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
 	vmx_set_cr4(&vmx->vcpu, 0);
 	vmx_set_efer(&vmx->vcpu, 0);
 	vmx_fpu_activate(&vmx->vcpu);
-- 
1.7.10.1
[PATCH 10/11] KVM: VMX: Fix kvm_set_shared_msr() called in preemptible context
kvm_set_shared_msr() may not be called in preemptible context, but
vmx_set_msr() does so:

  BUG: using smp_processor_id() in preemptible [] code: qemu-kvm/22713
  caller is kvm_set_shared_msr+0x32/0xa0 [kvm]
  Pid: 22713, comm: qemu-kvm Not tainted 3.4.0-rc3+ #39
  Call Trace:
   [8131fa82] debug_smp_processor_id+0xe2/0x100
   [a0328ae2] kvm_set_shared_msr+0x32/0xa0 [kvm]
   [a03a103b] vmx_set_msr+0x28b/0x2d0 [kvm_intel]
   ...

Making kvm_set_shared_msr() work in preemptible is cleaner, but it's
used in the fast path. Making two variants is overkill, so this patch
just disables preemption around the call.

Reported-by: Dave Jones da...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
(cherry picked from commit 2225fd56049643c1a7d645c0ce9d499d43c7974e)
---
 arch/x86/kvm/vmx.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 7f33e33..a7a6f60 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2219,9 +2219,12 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
 		msr = find_msr_entry(vmx, msr_index);
 		if (msr) {
 			msr->data = data;
-			if (msr - vmx->guest_msrs < vmx->save_nmsrs)
+			if (msr - vmx->guest_msrs < vmx->save_nmsrs) {
+				preempt_disable();
 				kvm_set_shared_msr(msr->index, msr->data,
 						   msr->mask);
+				preempt_enable();
+			}
 			break;
 		}
 		ret = kvm_set_msr_common(vcpu, msr_index, data);
-- 
1.7.10.1
[PATCH 11/11] KVM: lock slots_lock around device assignment
From: Alex Williamson alex.william...@redhat.com

As pointed out by Jason Baron, when assigning a device to a guest we
first set the iommu domain pointer, which enables mapping and unmapping
of memory slots to the iommu. This leaves a window where this path is
enabled, but we haven't synchronized the iommu mappings to the existing
memory slots. Thus a slot being removed at that point could send us down
unexpected code paths, removing non-existent pinnings and iommu
mappings. Take the slots_lock around creating the iommu domain and the
initial mappings, as well as around iommu teardown, to avoid this race.

Signed-off-by: Alex Williamson alex.william...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
(cherry picked from commit 21a1416a1c945c5aeaeaf791b63c64926018eb77)
---
 virt/kvm/iommu.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/virt/kvm/iommu.c b/virt/kvm/iommu.c
index fec1723..e9fff98 100644
--- a/virt/kvm/iommu.c
+++ b/virt/kvm/iommu.c
@@ -240,9 +240,13 @@ int kvm_iommu_map_guest(struct kvm *kvm)
 		return -ENODEV;
 	}
 
+	mutex_lock(&kvm->slots_lock);
+
 	kvm->arch.iommu_domain = iommu_domain_alloc(&pci_bus_type);
-	if (!kvm->arch.iommu_domain)
-		return -ENOMEM;
+	if (!kvm->arch.iommu_domain) {
+		r = -ENOMEM;
+		goto out_unlock;
+	}
 
 	if (!allow_unsafe_assigned_interrupts &&
 	    !iommu_domain_has_cap(kvm->arch.iommu_domain,
@@ -253,17 +257,16 @@ int kvm_iommu_map_guest(struct kvm *kvm)
 		       " module option.\n", __func__);
 		iommu_domain_free(kvm->arch.iommu_domain);
 		kvm->arch.iommu_domain = NULL;
-		return -EPERM;
+		r = -EPERM;
+		goto out_unlock;
 	}
 
 	r = kvm_iommu_map_memslots(kvm);
 	if (r)
-		goto out_unmap;
-
-	return 0;
+		kvm_iommu_unmap_memslots(kvm);
 
-out_unmap:
-	kvm_iommu_unmap_memslots(kvm);
+out_unlock:
+	mutex_unlock(&kvm->slots_lock);
 	return r;
 }
 
@@ -340,7 +343,11 @@ int kvm_iommu_unmap_guest(struct kvm *kvm)
 	if (!domain)
 		return 0;
 
+	mutex_lock(&kvm->slots_lock);
 	kvm_iommu_unmap_memslots(kvm);
+	kvm->arch.iommu_domain = NULL;
+	mutex_unlock(&kvm->slots_lock);
+
 	iommu_domain_free(domain);
 	return 0;
 }
-- 
1.7.10.1
Re: [PATCH unit-tests] Add async page fault test
On Wed, 9 May 2012 11:59:17 +0300
Gleb Natapov g...@redhat.com wrote:

> Hmm, yes if it is file backed it may work. Setting up qemu to use file
> backed memory is one more complication while running the test though.
> I haven't checked, but I am not sure that MADV_DONTNEED will drop the
> page immediately. It probably puts it on some list to be freed later.

Hmm, actually looking at the comments it seems like this is what happens:

	/*
	 * Application no longer needs these pages.  If the pages are dirty,
	 * it's OK to just throw them away.  The app will be more careful about
	 * data it wants to keep.  Be sure to free swap resources too.  The
	 * zap_page_range call sets things up for shrink_active_list to actually
	 * free these pages later if no one else has touched them in the
	 * meantime, although we could add these pages to a global reuse list
	 * for shrink_active_list to pick up before reclaiming other pages.
	 */

zap_page_range() actually frees these pages, no? Virtio balloon seems to
rely on this.

Thanks,
	Takuya
Re: [PATCH unit-tests] Add async page fault test
On 05/09/2012 04:18 PM, Takuya Yoshikawa wrote:
> Hmm, actually looking at the comments it seems like this is what
> happens: [...]
>
> zap_page_range() actually frees these pages, no? Virtio balloon seems
> to rely on this.

The pages are removed from the user address space.  But if they're not
anonymous, the pages still live in the page cache.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH unit-tests] Add async page fault test
On Wed, 09 May 2012 16:20:23 +0300
Avi Kivity a...@redhat.com wrote:

> > zap_page_range() actually frees these pages, no? Virtio balloon seems
> > to rely on this.
>
> The pages are removed from the user address space.  But if they're not
> anonymous, the pages still live in the page cache.

Ah, about the non-anonymous/file-backed case, I see.

Thanks,
	Takuya
[PATCH] bitops: add _local bitops
kvm needs to update some hypervisor variables atomically in a sense
that the operation can't be interrupted in the middle. However the
hypervisor always runs on the same CPU so it does not need any memory
barrier or lock prefix.

At Peter Anvin's suggestion, add _local bitops for this purpose: define
them as non-atomics for x86 and (for now) atomics for everyone else.

Uses are not restricted to virtualization: they might be useful to
communicate with an interrupt handler if we know that it's running on
the same CPU.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

Link to previous discussion:
http://www.spinics.net/lists/kvm/msg72241.html

 Documentation/atomic_ops.txt              | 19 ++
 arch/x86/include/asm/bitops.h             |  1 +
 include/asm-generic/bitops.h              |  1 +
 include/asm-generic/bitops/local-atomic.h | 92 +
 include/asm-generic/bitops/local.h        | 85 ++
 include/linux/bitops.h                    |  8 +++
 6 files changed, 206 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/bitops/local-atomic.h
 create mode 100644 include/asm-generic/bitops/local.h

diff --git a/Documentation/atomic_ops.txt b/Documentation/atomic_ops.txt
index 27f2b21..b7e3b67 100644
--- a/Documentation/atomic_ops.txt
+++ b/Documentation/atomic_ops.txt
@@ -520,6 +520,25 @@
 The __clear_bit_unlock version is non-atomic, however it still implements
 unlock barrier semantics. This can be useful if the lock itself is protecting
 the other bits in the word.
 
+Local versions of the bitmask operations are also provided. They are used in
+contexts where the operations need to be performed atomically with respect to
+the local CPU, but no other CPU accesses the bitmask. This assumption makes it
+possible to avoid the need for SMP protection and use less expensive atomic
+operations in the implementation.
+They have names similar to the above bitmask operation interfaces,
+except that _local is sufficed to the interface name.
+
+	void set_bit_local(unsigned long nr, volatile unsigned long *addr);
+	void clear_bit_local(unsigned long nr, volatile unsigned long *addr);
+	void change_bit_local(unsigned long nr, volatile unsigned long *addr);
+	int test_and_set_bit_local(unsigned long nr, volatile unsigned long *addr);
+	int test_and_clear_bit_local(unsigned long nr, volatile unsigned long *addr);
+	int test_and_change_bit_local(unsigned long nr, volatile unsigned long *addr);
+
+These local variants are useful for example if the bitmask may be accessed from
+a local intrerrupt, or from a hypervisor on the same CPU if running in a VM.
+These local variants also do not have any special memory barrier semantics.
+
 Finally, there are non-atomic versions of the bitmask operations
 provided. They are used in contexts where some other higher-level SMP
 locking scheme is being used to protect the bitmask, and thus less
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index b97596e..8784cd7 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -509,6 +509,7 @@ static __always_inline int fls64(__u64 x)
 
 #include <asm-generic/bitops/le.h>
 #include <asm-generic/bitops/ext2-atomic-setbit.h>
+#include <asm-generic/bitops/local.h>
 
 #endif /* __KERNEL__ */
 #endif /* _ASM_X86_BITOPS_H */
diff --git a/include/asm-generic/bitops.h b/include/asm-generic/bitops.h
index 280ca7a..d720c9e 100644
--- a/include/asm-generic/bitops.h
+++ b/include/asm-generic/bitops.h
@@ -40,5 +40,6 @@
 #include <asm-generic/bitops/non-atomic.h>
 #include <asm-generic/bitops/le.h>
 #include <asm-generic/bitops/ext2-atomic.h>
+#include <asm-generic/bitops/local-atomic.h>
 
 #endif /* __ASM_GENERIC_BITOPS_H */
diff --git a/include/asm-generic/bitops/local-atomic.h b/include/asm-generic/bitops/local-atomic.h
new file mode 100644
index 000..94ad261
--- /dev/null
+++ b/include/asm-generic/bitops/local-atomic.h
@@ -0,0 +1,92 @@
+#ifndef ASM_GENERIC_BITOPS_LOCAL_ATOMIC_H
+#define ASM_GENERIC_BITOPS_LOCAL_ATOMIC_H
+/**
+ * Local atomic operations
+ *
+ * These operations give no atomicity or ordering guarantees if result
+ * observed from another CPU. Atomicity is guaranteed if result is observed
+ * from the same CPU, e.g. from a local interrupt, or a hypervisor if running
+ * in a VM.
+ * Atomicity is not guaranteed across CPUs: if two examples of these operations
+ * race on different CPUs, one can appear to succeed but actually fail. Use
+ * non-local atomics instead or protect such SMP accesses with a lock.
+ * These operations can be reordered. No memory barrier is implied.
+ */
+
+/**
+ * Implement local operations in terms of atomics.
+ * For use from a local interrupt, this is always safe but suboptimal: many
+ * architectures can use bitops/local.h instead.
+ * For use from a hypervisor, make sure your architecture doesn't
+ * rely on locks for atomics: if it does -
Re: How to determine the backing host physical memory for a given guest ?
On 05/09/2012 04:05 PM, Chegu Vinod wrote:
> Hello,
>
> On an 8 socket Westmere host I am attempting to run a single guest and
> characterize the virtualization overhead for a system intensive
> workload (AIM7-high_systime) as the size of the guest scales
> (10way/64G, 20way/128G, ... 80way/512G), to do some comparisons
> between the native vs. guest runs.
>
> I have been using numactl to control the cpu node / memory node
> bindings for the qemu instance. For larger guest sizes I end up
> binding across multiple localities. For e.g. a 40 way guest:
>
> numactl --cpunodebind=0,1,2,3 --membind=0,1,2,3 \
>     qemu-system-x86_64 -smp 40 -m 262144 \
>
> I understand that actual mappings from a guest virtual address to host
> physical address could change. Is there a way to determine [at a given
> instant] which host NUMA node is providing the backing physical memory
> for the active guest's kernel and also for the apps actively running
> in the guest?
>
> Guessing that there is a better way (some tool available?) than just
> diff'ing the per node memory usage from the before and after output of
> "numactl --hardware" on the host.

Not sure if that's what you want, but there's Documentation/vm/pagemap.txt.
Re: [PATCH] bitops: add _local bitops
On 05/09/2012 06:45 AM, Michael S. Tsirkin wrote:
> kvm needs to update some hypervisor variables atomically in a sense
> that the operation can't be interrupted in the middle. However the
> hypervisor always runs on the same CPU so it does not need any memory
> barrier or lock prefix. [...]
>
> Signed-off-by: Michael S. Tsirkin m...@redhat.com

I don't think you can use the x86 nonatomics as-is, because they don't
contain optimization barriers.

	-hpa
Re: [PATCH] bitops: add _local bitops
On 05/09/2012 04:45 PM, Michael S. Tsirkin wrote:
> +They have names similar to the above bitmask operation interfaces,
> +except that _local is sufficed to the interface name.

suffixed (better: appended)

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH] bitops: add _local bitops
On Wed, May 09, 2012 at 07:03:37AM -0700, H. Peter Anvin wrote:
> I don't think you can use the x86 nonatomics as-is, because they don't
> contain optimization barriers.

You are right of course. So I'll remove bitops/local.h, move the code
to x86/ and open-code it.

-- 
MST
Re: [PATCH] bitops: add _local bitops
On Wed, May 09, 2012 at 07:03:37AM -0700, H. Peter Anvin wrote:
> I don't think you can use the x86 nonatomics as-is, because they don't
> contain optimization barriers.

Just adding a memory clobber to the asm will be enough, right?

-- 
MST
Re: [PATCH] bitops: add _local bitops
On Wed, May 09, 2012 at 07:03:37AM -0700, H. Peter Anvin wrote:
> I don't think you can use the x86 nonatomics as-is, because they don't
> contain optimization barriers.

By the way, clear_bit on x86 does not seem to contain an optimization
barrier - is my reading correct? The lock prefix does not affect the
compiler, right?
Re: [PATCH] bitops: add _local bitops
On 05/09/2012 08:47 AM, Michael S. Tsirkin wrote:
> By the way, clear_bit on x86 does not seem to contain an optimization
> barrier - is my reading correct? The lock prefix does not affect the
> compiler, right?

Yes, as it clearly states in the comment:

 * clear_bit() is atomic and may not be reordered.  However, it does
 * not contain a memory barrier, so if it is used for locking purposes,
 * you should call smp_mb__before_clear_bit() and/or
 * smp_mb__after_clear_bit() in order to ensure changes are visible
 * on other processors.

There is clear_bit_unlock() which has the barrier semantics.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
[PATCHv2] bitops: add _local bitops
kvm needs to update some hypervisor variables atomically in a sense
that the operation can't be interrupted in the middle. However the
hypervisor always runs on the same CPU so it does not need any memory
barrier or lock prefix.

Add _local bitops for this purpose: define them as non-atomics for x86
and (for now) atomics for everyone else.

Uses are not restricted to virtualization: they might be useful to
communicate with an interrupt handler if we know that it's running on
the same CPU.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
Acked-by: Arnd Bergmann a...@arndb.de
---

Changes from v1:
- Fix x86 version so it includes optimization barriers, as suggested by
  Peter Anvin
- Fix documentation as suggested by Avi Kivity
- Remove duplicate files as suggested by Arnd Bergmann

Link to discussion preceding v1:
http://www.spinics.net/lists/kvm/msg72241.html

 Documentation/atomic_ops.txt              |  19 ++
 arch/x86/include/asm/bitops.h             | 132 +
 include/asm-generic/bitops.h              |   1 +
 include/asm-generic/bitops/local-atomic.h | 114 +
 include/linux/bitops.h                    |   8 ++
 5 files changed, 274 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/bitops/local-atomic.h

diff --git a/Documentation/atomic_ops.txt b/Documentation/atomic_ops.txt
index 27f2b21..b45eb12 100644
--- a/Documentation/atomic_ops.txt
+++ b/Documentation/atomic_ops.txt
@@ -520,6 +520,25 @@
 The __clear_bit_unlock version is non-atomic, however it still implements
 unlock barrier semantics. This can be useful if the lock itself is protecting
 the other bits in the word.
 
+Local versions of the bitmask operations are also provided. They are used in
+contexts where the operations need to be performed atomically with respect to
+the local CPU, but no other CPU accesses the bitmask. This assumption makes it
+possible to avoid the need for SMP protection and use less expensive atomic
+operations in the implementation.
+They have names similar to the above bitmask operation interfaces,
+except that _local is appended to the interface name.
+
+	void set_bit_local(unsigned long nr, volatile unsigned long *addr);
+	void clear_bit_local(unsigned long nr, volatile unsigned long *addr);
+	void change_bit_local(unsigned long nr, volatile unsigned long *addr);
+	int test_and_set_bit_local(unsigned long nr, volatile unsigned long *addr);
+	int test_and_clear_bit_local(unsigned long nr, volatile unsigned long *addr);
+	int test_and_change_bit_local(unsigned long nr, volatile unsigned long *addr);
+
+These local variants are useful for example if the bitmask may be accessed from
+a local intrerrupt, or from a hypervisor on the same CPU if running in a VM.
+These local variants also do not have any special memory barrier semantics.
+
 Finally, there are non-atomic versions of the bitmask operations
 provided. They are used in contexts where some other higher-level SMP
 locking scheme is being used to protect the bitmask, and thus less
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index b97596e..4e6d900 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -41,6 +41,20 @@
 #define CONST_MASK_ADDR(nr, addr)	BITOP_ADDR((void *)(addr) + ((nr)>>3))
 #define CONST_MASK(nr)			(1 << ((nr) & 7))
 
+/*
+ * x86 has its own version of local atomic operations. These operations
+ * change memory using a single instruction, but give no atomicity or ordering
+ * guarantees if result observed from another CPU. Atomicity is guaranteed if
+ * result is observed from the same CPU, e.g. from a local interrupt, or a
+ * hypervisor if running in a VM.
+ * Atomicity is not guaranteed across CPUs: if two examples of these operations
+ * race on different CPUs, one can appear to succeed but actually fail.
+ * To get atomicity guarantees across CPUs use non-local atomics instead.
+ * These operations can be reordered. No memory barrier is implied.
+ */
+
+#define ARCH_HAS_BITOPS_LOCAL_ATOMIC 1
+
 /**
  * set_bit - Atomically set a bit in memory
  * @nr: the bit to set
@@ -85,6 +99,19 @@ static inline void __set_bit(int nr, volatile unsigned long *addr)
 }
 
 /**
+ * set_bit_local - Atomically set a bit in memory
+ * @nr: the bit to set
+ * @addr: the address to start counting from
+ *
+ * This operation is atomic with respect to local CPU only. No memory barrier
+ * is implied.
+ */
+static inline void set_bit_local(int nr, volatile unsigned long *addr)
+{
+	asm volatile("bts %1,%0" : ADDR : "Ir" (nr) : "memory");
+}
+
+/**
  * clear_bit - Clears a bit in memory
  * @nr: Bit to clear
  * @addr: Address to start counting from
@@ -149,6 +176,30 @@ static inline void __clear_bit_unlock(unsigned nr, volatile unsigned long *addr)
 #define smp_mb__after_clear_bit()	barrier()
 
 /**
+ * clear_bit - Clears a bit in memory
+ * @nr: Bit to clear
+ *
Re: [PATCH] bitops: add _local bitops
On Wed, May 09, 2012 at 09:24:41AM -0700, H. Peter Anvin wrote:
> Yes, as it clearly states in the comment:
>
>  * clear_bit() is atomic and may not be reordered.  However, it does
>  * not contain a memory barrier, so if it is used for locking purposes,
>  * you should call smp_mb__before_clear_bit() and/or
>  * smp_mb__after_clear_bit() in order to ensure changes are visible
>  * on other processors.
>
> There is clear_bit_unlock() which has the barrier semantics.

Well, it talks about a memory barrier, not an optimization barrier.
If the compiler reorders code, changes will appear in the wrong order
on the current processor, not just on other processors, no?

Sorry if I'm confused about this point; this is what
Documentation/atomic_ops.txt made me believe:

	For example consider the following code:

		while (a > 0)
			do_something();

	If the compiler can prove that do_something() does not store to the
	variable a, then the compiler is within its rights transforming this to
	the following:

		tmp = a;
		if (a > 0)
			for (;;)
				do_something();
Re: [PATCH] bitops: add _local bitops
On 05/09/2012 09:36 AM, Michael S. Tsirkin wrote:
> Well, it talks about a memory barrier, not an optimization barrier.

Same thing.

> If the compiler reorders code, changes will appear in the wrong order
> on the current processor, not just on other processors, no?

Yes.

For your _local I would just copy the atomic bitops but remove the
locks in most cases.

	-hpa
Re: [PATCH] bitops: add _local bitops
On Wed, May 09, 2012 at 09:45:57AM -0700, H. Peter Anvin wrote:
> > Well, it talks about a memory barrier, not an optimization barrier.
>
> Same thing.

I see. So it really should say 'any barrier', right?
Documentation/atomic_ops.txt goes to great lengths to distinguish
between the two, and we probably should not confuse things.

> > If the compiler reorders code, changes will appear in the wrong order
> > on the current processor, not just on other processors, no?
>
> Yes.

So this seems to contradict what the comment says:

	clear_bit() is atomic and may not be reordered.

since you say the compiler *can* reorder it; and below:

	you should call smp_mb__before_clear_bit() and/or
	smp_mb__after_clear_bit() in order to ensure changes are visible
	on other processors.

and in fact this is not enough: you also need to call barrier() to
ensure changes are visible on the same processor in the correct order.

> For your _local I would just copy the atomic bitops but remove the
> locks in most cases.

Right, I sent v2 that does exactly this. My question about the
documentation for change_bit is an unrelated one: to me, it looks like
the documentation for change_bit does not match the implementation, or
at least is somewhat confusing.

-- 
MST
Re: [PATCH] bitops: add _local bitops
On Wed, 9 May 2012 16:45:29 +0300
Michael S. Tsirkin m...@redhat.com wrote:

> kvm needs to update some hypervisor variables atomically in a sense
> that the operation can't be interrupted in the middle. However the
> hypervisor always runs on the same CPU so it does not need any memory
> barrier or lock prefix.

Well.  It adds more complexity, makes the kernel harder to understand
and maintain and introduces more opportunities for developers to add
bugs.  So from that point of view, the best way of handling this patch
is to delete it.

Presumably the patch offers some benefit to offset all those costs.
But you didn't tell us what that benefit is, so we cannot make a
decision.

IOW: numbers, please.  Convincing ones, for realistic test cases.

Secondly: can KVM just use __set_bit() and friends?  I suspect those
interfaces happen to meet your requirements.  At least on architectures
you care about.
Re: [PATCH] bitops: add _local bitops
On 05/09/2012 12:19 PM, Andrew Morton wrote: Secondly: can KVM just use __set_bit() and friends? I suspect those interfaces happen to meet your requirements. At least on architectures you care about. __set_bit() and friends right now don't have optimization barriers, meaning the compiler is free to move them around. -hpa
Re: Semantics of -cpu host (was Re: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest)
On Wed, May 09, 2012 at 12:38:37PM +0300, Gleb Natapov wrote: On Wed, May 09, 2012 at 11:05:58AM +0200, Alexander Graf wrote: On 09.05.2012, at 10:51, Gleb Natapov wrote: On Wed, May 09, 2012 at 10:42:26AM +0200, Alexander Graf wrote: On 09.05.2012, at 10:14, Gleb Natapov g...@redhat.com wrote: On Wed, May 09, 2012 at 12:07:04AM +0200, Alexander Graf wrote: On 08.05.2012, at 22:14, Eduardo Habkost wrote: On Tue, May 08, 2012 at 02:58:11AM +0200, Alexander Graf wrote: On 07.05.2012, at 20:21, Eduardo Habkost wrote: Andre? Are you able to help to answer the question below? I would like to clarify what's the expected behavior of -cpu host to be able to continue working on it. I believe the code will need to be fixed in either case, but first we need to figure out what are the expectations/requirements, to know _which_ changes will be needed. On Tue, Apr 24, 2012 at 02:19:25PM -0300, Eduardo Habkost wrote: (CCing Andre Przywara, in case he can help to clarify what's the expected meaning of -cpu host) [...] I am not sure I understand what you are proposing. Let me explain the use case I am thinking about: - Feature FOO is of type (A) (e.g. just a new instruction set that doesn't require additional userspace support) - User has a Qemu version that doesn't know anything about feature FOO - User gets a new CPU that supports feature FOO - User gets a new kernel that supports feature FOO (i.e. has FOO in GET_SUPPORTED_CPUID) - User does _not_ upgrade Qemu. - User expects to get feature FOO enabled if using -cpu host, without upgrading Qemu. The problem here is: to support the above use-case, userspace needs a probing mechanism that can differentiate _new_ (previously unknown) features that are in group (A) (safe to blindly enable) from features that are in group (B) (that can't be enabled without a userspace upgrade). 
In short, it becomes a problem if we consider the following case: - Feature BAR is of type (B) (it can't be enabled without extra userspace support) - User has a Qemu version that doesn't know anything about feature BAR - User gets a new CPU that supports feature BAR - User gets a new kernel that supports feature BAR (i.e. has BAR in GET_SUPPORTED_CPUID) - User does _not_ upgrade Qemu. - User simply shouldn't get feature BAR enabled, even if using -cpu host, otherwise Qemu would break. If userspace always limited itself to features it knows about, it would be really easy to implement the feature without any new probing mechanism from the kernel. But that's not how I think users expect -cpu host to work. Maybe I am wrong, I don't know. I am CCing Andre, who introduced the -cpu host feature, in case he can explain what's the expected semantics on the cases above. Can you think of any feature that'd go into category B? - TSC-deadline: can't be enabled unless userspace takes care to enable the in-kernel irqchip. The kernel can check if in-kernel irqchip has it enabled and otherwise mask it out, no? How kernel should know that userspace does not emulate it? You have to enable the in-kernel apic to use it, at which point the kernel knows it's in use, right? - x2apic: ditto. Same here. For user space irqchip the kernel side doesn't care. If in-kernel APIC is enabled, check for its capabilities. Same here. Well, technically both of those features can't be implemented in userspace right now since MSRs are terminated in the kernel, but I Doesn't sound like the greatest design - unless you deprecate the non-in-kernel apic case. You mean terminating MSRs in kernel does not sound like the greatest design? I do not disagree. That is why IMO kernel can't filter out TSC-deadline and x2apic like you suggest. I still don't see why it can't. Imagine we would filter TSC-deadline and x2apic by default in the kernel - they are not known to exist yet. 
Now, we implement TSC-deadline in the kernel. We still filter TSC-deadline out in GET_SUPPORTED_CPUID in the kernel. But we provide an interface to user space that says call me to enable TSC-deadline CPUID, but only if you're using the in-kernel apic. We have that interface already, it is called KVM_SET_CPUID. :-) New user space calls that ioctl when it's using the in-kernel apic, it doesn't when it's using the user space apic. Old user space doesn't call that ioctl. First of all we already have TSC-deadline in GET_SUPPORTED_CPUID without any additional ioctls. And second I do not see why we need additional ioctls here. We don't have TSC-deadline set today (and that's what started this thread), but we have x2apic. So what you say is true for x2apic, anyway. Hmm, so maybe I
Re: [PATCH] bitops: add _local bitops
On Wed, May 09, 2012 at 12:19:40PM -0700, Andrew Morton wrote: On Wed, 9 May 2012 16:45:29 +0300 Michael S. Tsirkin m...@redhat.com wrote: kvm needs to update some hypervisor variables atomically in a sense that the operation can't be interrupted in the middle. However the hypervisor always runs on the same CPU so it does not need any memory barrier or lock prefix. Well. It adds more complexity, makes the kernel harder to understand and maintain and introduces more opportunities for developers to add bugs. So from that point of view, the best way of handling this patch is to delete it. Presumably the patch offers some benefit to offset all those costs. But you didn't tell us what that benefit is, so we cannot make a decision. IOW: numbers, please. Convincing ones, for realistic test cases. I can try, but in practice it's not an optimization. What kvm needs is a guarantee that a memory update will be done in a single instruction. Secondly: can KVM just use __set_bit() and friends? I suspect those interfaces happen to meet your requirements. At least on architectures you care about. Sigh. I started with this, but then Avi Kivity said that he's worried about someone changing __test_and_clear_bit on x86 such that the change won't be atomic (single instruction) anymore. So I put inline asm into kvm.c. This drew a comment from Peter that maintaining separate asm code in kvm.c adds too much maintenance overhead, so I should implement _local, add an asm-generic fallback and put it all in generic code. In practice ATM any of the above will work. We probably don't even need to add barrier() calls since what we do afterwards is apic access which has an optimization barrier anyway. But I'm fine with adding them in there just in case if that's what people want. However, since we've come full circle, I'd like to have a discussion on what, exactly, is acceptable to all maintainers. Avi, Andrew, Peter, could you please comment? 
-- MST
Re: [PATCH] bitops: add _local bitops
On 05/09/2012 01:07 PM, Michael S. Tsirkin wrote: In practice ATM any of the above will work. We probably don't even need to add barrier() calls since what we do afterwards is apic access which has an optimization barrier anyway. But I'm fine with adding them in there just in case if that's what people want. If you have the optimization barrier anyway, then I'd be fine with you just using __test_and_clear_bit() and adding a comment in arch/x86/include/asm/bitops.h that KVM needs it to be locally atomic. -hpa
Re: [PATCH] bitops: add _local bitops
On Wed, May 09, 2012 at 01:10:04PM -0700, H. Peter Anvin wrote: On 05/09/2012 01:07 PM, Michael S. Tsirkin wrote: In practice ATM any of the above will work. We probably don't even need to add barrier() calls since what we do afterwards is apic access which has an optimization barrier anyway. But I'm fine with adding them in there just in case if that's what people want. If you have the optimization barrier anyway, then I'd be fine with you just using __test_and_clear_bit() and adding a comment in arch/x86/include/asm/bitops.h that KVM needs it to be locally atomic. -hpa Sounds good. Avi?
Re: Semantics of -cpu host (was Re: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest)
On 09.05.2012, at 21:38, Eduardo Habkost wrote: On Wed, May 09, 2012 at 12:38:37PM +0300, Gleb Natapov wrote: On Wed, May 09, 2012 at 11:05:58AM +0200, Alexander Graf wrote: On 09.05.2012, at 10:51, Gleb Natapov wrote: On Wed, May 09, 2012 at 10:42:26AM +0200, Alexander Graf wrote: On 09.05.2012, at 10:14, Gleb Natapov g...@redhat.com wrote: On Wed, May 09, 2012 at 12:07:04AM +0200, Alexander Graf wrote: On 08.05.2012, at 22:14, Eduardo Habkost wrote: On Tue, May 08, 2012 at 02:58:11AM +0200, Alexander Graf wrote: On 07.05.2012, at 20:21, Eduardo Habkost wrote: Andre? Are you able to help to answer the question below? I would like to clarify what's the expected behavior of -cpu host to be able to continue working on it. I believe the code will need to be fixed in either case, but first we need to figure out what are the expectations/requirements, to know _which_ changes will be needed. On Tue, Apr 24, 2012 at 02:19:25PM -0300, Eduardo Habkost wrote: (CCing Andre Przywara, in case he can help to clarify what's the expected meaning of -cpu host) [...] I am not sure I understand what you are proposing. Let me explain the use case I am thinking about: - Feature FOO is of type (A) (e.g. just a new instruction set that doesn't require additional userspace support) - User has a Qemu version that doesn't know anything about feature FOO - User gets a new CPU that supports feature FOO - User gets a new kernel that supports feature FOO (i.e. has FOO in GET_SUPPORTED_CPUID) - User does _not_ upgrade Qemu. - User expects to get feature FOO enabled if using -cpu host, without upgrading Qemu. The problem here is: to support the above use-case, userspace needs a probing mechanism that can differentiate _new_ (previously unknown) features that are in group (A) (safe to blindly enable) from features that are in group (B) (that can't be enabled without a userspace upgrade). 
In short, it becomes a problem if we consider the following case: - Feature BAR is of type (B) (it can't be enabled without extra userspace support) - User has a Qemu version that doesn't know anything about feature BAR - User gets a new CPU that supports feature BAR - User gets a new kernel that supports feature BAR (i.e. has BAR in GET_SUPPORTED_CPUID) - User does _not_ upgrade Qemu. - User simply shouldn't get feature BAR enabled, even if using -cpu host, otherwise Qemu would break. If userspace always limited itself to features it knows about, it would be really easy to implement the feature without any new probing mechanism from the kernel. But that's not how I think users expect -cpu host to work. Maybe I am wrong, I don't know. I am CCing Andre, who introduced the -cpu host feature, in case he can explain what's the expected semantics on the cases above. Can you think of any feature that'd go into category B? - TSC-deadline: can't be enabled unless userspace takes care to enable the in-kernel irqchip. The kernel can check if in-kernel irqchip has it enabled and otherwise mask it out, no? How kernel should know that userspace does not emulate it? You have to enable the in-kernel apic to use it, at which point the kernel knows it's in use, right? - x2apic: ditto. Same here. For user space irqchip the kernel side doesn't care. If in-kernel APIC is enabled, check for its capabilities. Same here. Well, technically both of those features can't be implemented in userspace right now since MSRs are terminated in the kernel, but I Doesn't sound like the greatest design - unless you deprecate the non-in-kernel apic case. You mean terminating MSRs in kernel does not sound like the greatest design? I do not disagree. That is why IMO kernel can't filter out TSC-deadline and x2apic like you suggest. I still don't see why it can't. Imagine we would filter TSC-deadline and x2apic by default in the kernel - they are not known to exist yet. 
Now, we implement TSC-deadline in the kernel. We still filter TSC-deadline out in GET_SUPPORTED_CPUID in the kernel. But we provide an interface to user space that says call me to enable TSC-deadline CPUID, but only if you're using the in-kernel apic. We have that interface already, it is called KVM_SET_CPUID. :-) New user space calls that ioctl when it's using the in-kernel apic, it doesn't when it's using the user space apic. Old user space doesn't call that ioctl. First of all we already have TSC-deadline in GET_SUPPORTED_CPUID without any additional ioctls. And second I do not see why we need additional ioctls here. We don't have TSC-deadline set today (and that's what started this thread), but we have x2apic. So what you say is true for x2apic, anyway. Hmm, so maybe I misunderstood you. You propose to mask TSC-deadline and x2apic out from GET_SUPPORTED_CPUID if irq chip is not in kernel, not from KVM_SET_CPUID? For those two features it may make
Re: [PATCH 1/4] KVM: Add APIs for unlocked TLB flush
On Tue, May 08, 2012 at 03:39:43PM +0300, Avi Kivity wrote: On 05/08/2012 05:25 AM, Marcelo Tosatti wrote: We need to call kvm_cond_flush_remote_tlbs in rmap_write_protect unconditionally? Yes, that's the point. We change spin_lock(mmu_lock); conditionally flush the tlb; spin_unlock(mmu_lock) to spin_lock(mmu_lock); conditionally mark the tlb as dirty; spin_unlock(mmu_lock); kvm_cond_flush_remote_tlbs() but that means the entire codebase has to be converted. Is there any other site which expects sptes and TLBs to be in sync, other than rmap_write_protect? I wouldn't say rmap_write_protect() expects sptes and TLBs to be in sync. Rather its callers. Please convert the flush_remote_tlbs at the end of set_spte (RW->RO) to mark_dirty. I'd like to take an incremental approach, since there are many paths. I don't have a concrete plan though. Looks good in general (patchset is incomplete though). One thing that is annoying is that there is no guarantee of progress for the flushed_count increment (it can, in theory, always race with a mark_dirty). But since that would be only a performance, and not correctness, aspect, it is fine. We don't have a while () loop in cond_flush(), so it won't be slowed down by the race; whoever caused the race will have to flush on their own, but they do that anyway now. We can regress if vcpu 0 marks the tlb as dirty, and then vcpu 0 and 1 simultaneously flush, even though vcpu 1 did nothing to deserve it. I don't see a way around it except to hope it's a rare event. It has the advantage that it requires code to explicitly document where the TLB must be flushed and the sites which expect sptes to be in sync with TLBs. I'm looking for an idea of how to make the flush in those cases not hold mmu_lock. You can't easily for rmap_write_protect; you must check that the state is unchanged (the write protect operation is still relevant). Current patchset (with corrections to comments by Xiao) is good enough already IMO. 
[PATCH next v2 03/74] target-i386: Pass X86CPU to do_cpu_{init,sipi}()
Allows to use cpu_reset() in place of cpu_state_reset().

Signed-off-by: Andreas Färber afaer...@suse.de
---
 cpu-exec.c           |    4 ++--
 target-i386/cpu.h    |    4 ++--
 target-i386/helper.c |   13 ++++++++-----
 target-i386/kvm.c    |    6 ++++--
 4 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 0344cd5..fbb39cb 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -287,11 +287,11 @@ int cpu_exec(CPUArchState *env)
 #if defined(TARGET_I386)
             if (interrupt_request & CPU_INTERRUPT_INIT) {
                 svm_check_intercept(env, SVM_EXIT_INIT);
-                do_cpu_init(env);
+                do_cpu_init(x86_env_get_cpu(env));
                 env->exception_index = EXCP_HALTED;
                 cpu_loop_exit(env);
             } else if (interrupt_request & CPU_INTERRUPT_SIPI) {
-                do_cpu_sipi(env);
+                do_cpu_sipi(x86_env_get_cpu(env));
             } else if (env->hflags2 & HF2_GIF_MASK) {
                 if ((interrupt_request & CPU_INTERRUPT_SMI) &&
                     !(env->hflags & HF_SMM_MASK)) {
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index b5b9a50..4bff61d 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -1051,8 +1051,8 @@ static inline void cpu_get_tb_cpu_state(CPUX86State *env, target_ulong *pc,
         (env->eflags & (IOPL_MASK | TF_MASK | RF_MASK | VM_MASK));
 }

-void do_cpu_init(CPUX86State *env);
-void do_cpu_sipi(CPUX86State *env);
+void do_cpu_init(X86CPU *cpu);
+void do_cpu_sipi(X86CPU *cpu);

 #define MCE_INJECT_BROADCAST    1
 #define MCE_INJECT_UNCOND_AO    2
diff --git a/target-i386/helper.c b/target-i386/helper.c
index 0b22582..fe181be 100644
--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -1187,27 +1187,30 @@ CPUX86State *cpu_x86_init(const char *cpu_model)
 }

 #if !defined(CONFIG_USER_ONLY)
-void do_cpu_init(CPUX86State *env)
+void do_cpu_init(X86CPU *cpu)
 {
+    CPUX86State *env = &cpu->env;
     int sipi = env->interrupt_request & CPU_INTERRUPT_SIPI;
     uint64_t pat = env->pat;
-    cpu_state_reset(env);
+    cpu_reset(CPU(cpu));
     env->interrupt_request = sipi;
     env->pat = pat;
     apic_init_reset(env->apic_state);
     env->halted = !cpu_is_bsp(env);
 }

-void do_cpu_sipi(CPUX86State *env)
+void do_cpu_sipi(X86CPU *cpu)
 {
+    CPUX86State *env = &cpu->env;
+
     apic_sipi(env->apic_state);
 }
 #else
-void do_cpu_init(CPUX86State *env)
+void do_cpu_init(X86CPU *cpu)
 {
 }

-void do_cpu_sipi(CPUX86State *env)
+void do_cpu_sipi(X86CPU *cpu)
 {
 }
 #endif
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index e74a9e4..0d0d8f6 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1698,6 +1698,8 @@ void kvm_arch_post_run(CPUX86State *env, struct kvm_run *run)

 int kvm_arch_process_async_events(CPUX86State *env)
 {
+    X86CPU *cpu = x86_env_get_cpu(env);
+
     if (env->interrupt_request & CPU_INTERRUPT_MCE) {
         /* We must not raise CPU_INTERRUPT_MCE if it's not supported. */
         assert(env->mcg_cap);
@@ -1732,11 +1734,11 @@ int kvm_arch_process_async_events(CPUX86State *env)
     }
     if (env->interrupt_request & CPU_INTERRUPT_INIT) {
         kvm_cpu_synchronize_state(env);
-        do_cpu_init(env);
+        do_cpu_init(cpu);
     }
     if (env->interrupt_request & CPU_INTERRUPT_SIPI) {
         kvm_cpu_synchronize_state(env);
-        do_cpu_sipi(env);
+        do_cpu_sipi(cpu);
     }
     if (env->interrupt_request & CPU_INTERRUPT_TPR) {
         env->interrupt_request &= ~CPU_INTERRUPT_TPR;
--
1.7.7
[PATCH] KVM: x86: Implement PCID/INVPCID for guests with EPT
This patch handles PCID/INVPCID for guests. Process-context identifiers (PCIDs) are a facility by which a logical processor may cache information for multiple linear-address spaces so that the processor may retain cached information when software switches to a different linear-address space. Refer to section 4.10.1 in IA32 Intel Software Developer's Manual Volume 3A for details. For guests with EPT, the PCID feature is enabled and INVPCID behaves as running natively. For guests without EPT, the PCID feature is disabled and INVPCID triggers #UD.

Signed-off-by: Mao, Junjie junjie@intel.com
---
 arch/x86/include/asm/cpufeature.h      |    1 +
 arch/x86/include/asm/kvm_host.h        |    3 +-
 arch/x86/include/asm/processor-flags.h |    2 +
 arch/x86/include/asm/vmx.h             |    2 +
 arch/x86/kvm/cpuid.c                   |    6 ++-
 arch/x86/kvm/cpuid.h                   |    8 ++++
 arch/x86/kvm/svm.c                     |    6 +++
 arch/x86/kvm/vmx.c                     |   63 ++--
 8 files changed, 85 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 8d67d42..1aedbc0 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -203,6 +203,7 @@
 #define X86_FEATURE_SMEP	(9*32+ 7) /* Supervisor Mode Execution Protection */
 #define X86_FEATURE_BMI2	(9*32+ 8) /* 2nd group bit manipulation extensions */
 #define X86_FEATURE_ERMS	(9*32+ 9) /* Enhanced REP MOVSB/STOSB */
+#define X86_FEATURE_INVPCID	(9*32+10) /* INVPCID instruction */

 #if defined(__KERNEL__) && !defined(__ASSEMBLY__)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 74c9edf..bb9a707 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -52,7 +52,7 @@
 #define CR4_RESERVED_BITS                                               \
 	(~(unsigned long)(X86_CR4_VME | X86_CR4_PVI | X86_CR4_TSD | X86_CR4_DE\
 			  | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE     \
-			  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR  \
+			  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
 			  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_RDWRGSFS \
 			  | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE))
@@ -660,6 +660,7 @@ struct kvm_x86_ops {
 	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
 	int (*get_lpage_level)(void);
 	bool (*rdtscp_supported)(void);
+	bool (*pcid_supported)(void);
 	void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment, bool host);
 	void (*set_tdp_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);
diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h
index f8ab3ea..aea1d1d 100644
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -44,6 +44,7 @@
  */
 #define X86_CR3_PWT	0x00000008 /* Page Write Through */
 #define X86_CR3_PCD	0x00000010 /* Page Cache Disable */
+#define X86_CR3_PCID_MASK 0x00000fff /* PCID Mask */

 /*
  * Intel CPU features in CR4
@@ -61,6 +62,7 @@
 #define X86_CR4_OSXMMEXCPT 0x00000400 /* enable unmasked SSE exceptions */
 #define X86_CR4_VMXE	0x00002000 /* enable VMX virtualization */
 #define X86_CR4_RDWRGSFS 0x00010000 /* enable RDWRGSFS support */
+#define X86_CR4_PCIDE	0x00020000 /* enable PCID support */
 #define X86_CR4_OSXSAVE 0x00040000 /* enable xsave and xrestore */
 #define X86_CR4_SMEP	0x00100000 /* enable SMEP support */
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 31f180c..b81525c 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -60,6 +60,7 @@
 #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040
 #define SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING	0x00000400
+#define SECONDARY_EXEC_ENABLE_INVPCID		0x00001000

 #define PIN_BASED_EXT_INTR_MASK                 0x00000001
@@ -281,6 +282,7 @@ enum vmcs_field {
 #define EXIT_REASON_EPT_MISCONFIG       49
 #define EXIT_REASON_WBINVD              54
 #define EXIT_REASON_XSETBV              55
+#define EXIT_REASON_INVPCID             58

 /*
  * Interruption-information format
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 9fed5be..8d4a361 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -201,6 +201,8 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 	unsigned f_lm = 0;
 #endif
 	unsigned f_rdtscp = kvm_x86_ops->rdtscp_supported() ? F(RDTSCP) : 0;
+	unsigned f_pcid = kvm_x86_ops->pcid_supported() ? F(PCID) : 0;
+	unsigned f_invpcid = kvm_x86_ops->pcid_supported() ? F(INVPCID) : 0;

 	/* cpuid 1.edx */
 	const u32 kvm_supported_word0_x86_features =
@@ -228,7 +230,7
Re: How to determine the backing host physical memory for a given guest ?
On 5/9/2012 6:46 AM, Avi Kivity wrote: On 05/09/2012 04:05 PM, Chegu Vinod wrote: Hello, On an 8 socket Westmere host I am attempting to run a single guest and characterize the virtualization overhead for a system intensive workload (AIM7-high_systime) as the size of the guest scales (10way/64G, 20way/128G, ... 80way/512G), to do some comparisons between the native vs. guest runs. I have been using numactl to control the cpu node and memory node bindings for the qemu instance. For larger guest sizes I end up binding across multiple localities. for e.g. a 40 way guest: numactl --cpunodebind=0,1,2,3 --membind=0,1,2,3 \ qemu-system-x86_64 -smp 40 -m 262144 \ I understand that actual mappings from a guest virtual address to host physical address could change. Is there a way to determine [at a given instant] which host NUMA node is providing the backing physical memory for the active guest's kernel and also for the apps actively running in the guest? Guessing that there is a better way (some tool available?) than just diff'ng the per node memory usage... from the before and after output of numactl --hardware on the host. Not sure if that's what you want, but there's Documentation/vm/pagemap.txt. Thanks for the pointer Avi! Will give it a try... FYI... I tried using the recent version of the crash utility (http://people.redhat.com/anderson/) with the upstream kvm.git kernel (3.4.0-rc4+) and it seems to provide VA -> PA mappings for a given app on a live system. Also looks like there is an extension to this crash utility called qemu-vtop, which is supposed to give the GPA -> HVA -> HPA mappings. Need to give this a try... and see if it works. Thx! Vinod
[PATCH 1/2] KVM: PPC: Book3S: PR: Handle EMUL_ASSIST
In addition to normal privileged instruction traps, we can also receive emulation assist traps on newer hardware that has the HV bit set. Handle that one the same way as a privileged instruction, including the instruction fetching. That way we don't execute old instructions that we happen to still leave in that field when an emul assist trap comes. This fixes -M mac99 / -M g3beige on p7 bare metal for me.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_segment.S |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_segment.S b/arch/powerpc/kvm/book3s_segment.S
index 8b2fc66..e446658 100644
--- a/arch/powerpc/kvm/book3s_segment.S
+++ b/arch/powerpc/kvm/book3s_segment.S
@@ -252,6 +252,12 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
 	beq	ld_last_prev_inst
 	cmpwi	r12, BOOK3S_INTERRUPT_ALIGNMENT
 	beq-	ld_last_inst
+#ifdef CONFIG_PPC64
+BEGIN_FTR_SECTION
+	cmpwi	r12, BOOK3S_INTERRUPT_H_EMUL_ASSIST
+	beq-	ld_last_inst
+END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
+#endif

 	b	no_ld_last_inst
--
1.6.0.2
[PATCH 2/2] KVM: PPC: Book3S: PR: Fix hsrr code
When jumping back into the kernel to code that knows that it would be using HSRR registers instead of SRR registers, we need to make sure we pass it all information on where to jump to in HSRR registers. Unfortunately, we used r10 to store the information to distinguish between the HSRR and SRR case. That register got clobbered in between though, rendering the later comparison invalid. Instead, let's use cr1 to store this information. That way we don't need yet another register and everyone's happy. This fixes PR KVM on POWER7 bare metal for me.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_segment.S |    7 +++----
 1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_segment.S b/arch/powerpc/kvm/book3s_segment.S
index e446658..798491a 100644
--- a/arch/powerpc/kvm/book3s_segment.S
+++ b/arch/powerpc/kvm/book3s_segment.S
@@ -198,8 +198,8 @@ kvmppc_interrupt:
 	/* Save guest PC and MSR */
 #ifdef CONFIG_PPC64
 BEGIN_FTR_SECTION
-	mr	r10, r12
-	andi.	r0,r12,0x2
+	andi.	r0, r12, 0x2
+	cmpwi	cr1, r0, 0
 	beq	1f
 	mfspr	r3,SPRN_HSRR0
 	mfspr	r4,SPRN_HSRR1
@@ -346,8 +346,7 @@ no_dcbz32_off:

 #ifdef CONFIG_PPC64
 BEGIN_FTR_SECTION
-	andi.	r0,r10,0x2
-	beq	1f
+	beq	cr1, 1f
 	mtspr	SPRN_HSRR1, r6
 	mtspr	SPRN_HSRR0, r8
 END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
--
1.6.0.2
qemu kvm: Accept PCID feature
This patch makes Qemu accept the PCID feature specified from configuration or command line options.

Signed-off-by: Mao, Junjie junjie@intel.com
---
 target-i386/cpuid.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c
index b9bfeaf..bc061b0 100644
--- a/target-i386/cpuid.c
+++ b/target-i386/cpuid.c
@@ -48,7 +48,7 @@ static const char *ext_feature_name[] = {
     "ds_cpl", "vmx", "smx", "est",
     "tm2", "ssse3", "cid", NULL,
     "fma", "cx16", "xtpr", "pdcm",
-    NULL, NULL, "dca", "sse4.1|sse4_1",
+    NULL, "pcid", "dca", "sse4.1|sse4_1",
     "sse4.2|sse4_2", "x2apic", "movbe", "popcnt",
     NULL, "aes", "xsave", "osxsave",
     "avx", NULL, NULL, "hypervisor",