Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling
On Thu, Sep 05, 2013 at 02:05:09PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2013-09-03 at 13:53 +0300, Gleb Natapov wrote:
> > > Or supporting all IOMMU links (and leaving emulated stuff as is) in one
> > > "device" is the last thing I have to do and then you'll ack the patch?
> > >
> > I am concerned more about API here. Internal implementation details I
> > leave to powerpc experts :)
>
> So Gleb, I want to step in for a bit here.
>
> While I understand that the new KVM device API is all nice and shiny and
> that this whole thing should probably have been KVM devices in the first
> place (had they existed or had we been told back then), the point is, the
> API for handling HW IOMMUs that Alexey is trying to add is an extension of
> an existing mechanism used for emulated IOMMUs.
>
> The internal data structure is shared, and fundamentally, by forcing him to
> use that new KVM device for the "new stuff", we create an oddball API with
> an ioctl for one type of IOMMU and a KVM device for the other, which makes
> the implementation a complete mess in the kernel (and you should care :-)
>
Is it an unfixable mess? Even if Alexey does what you suggested earlier?

 - Convert *both* existing TCE objects to the new KVM_CREATE_DEVICE, and
   have some backward compat code for the old one.

The point is that an implementation can usually be changed, but an API is
much harder to change.

> So for something completely new, I would tend to agree with you. However, I
> still think that for this specific case, we should just plonk-in the
> original ioctl proposed by Alexey and be done with it.
>
Do you think this is the last extension to the IOMMU code, or will we see
more and use the same justification to continue adding ioctls?

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH 06/11] KVM: PPC: Book3S HV: Support POWER6 compatibility mode on POWER7
On Fri, Sep 06, 2013 at 10:58:16AM +0530, Aneesh Kumar K.V wrote:
> Paul Mackerras writes:
>
> > This enables us to use the Processor Compatibility Register (PCR) on
> > POWER7 to put the processor into architecture 2.05 compatibility mode
> > when running a guest.  In this mode the new instructions and registers
> > that were introduced on POWER7 are disabled in user mode.  This
> > includes all the VSX facilities plus several other instructions such
> > as ldbrx, stdbrx, popcntw, popcntd, etc.
> >
> > To select this mode, we have a new register accessible through the
> > set/get_one_reg interface, called KVM_REG_PPC_ARCH_COMPAT.  Setting
> > this to zero gives the full set of capabilities of the processor.
> > Setting it to one of the "logical" PVR values defined in PAPR puts
> > the vcpu into the compatibility mode for the corresponding
> > architecture level.  The supported values are:
> >
> > 	0x0f000002	Architecture 2.05 (POWER6)
> > 	0x0f000003	Architecture 2.06 (POWER7)
> > 	0x0f100003	Architecture 2.06+ (POWER7+)
> >
> > Since the PCR is per-core, the architecture compatibility level and
> > the corresponding PCR value are stored in the struct kvmppc_vcore, and
> > are therefore shared between all vcpus in a virtual core.
>
> We already have KVM_SET_SREGS taking pvr as argument. Can't we do
> this in kvmppc_set_pvr?  Can you also share the qemu changes?  There I
> guess we need to update the "cpu-version" in the device tree so that
> /proc/cpuinfo shows the right information in the guest.

The discussion on the qemu mailing list pointed out that we aren't
really changing the PVR; the guest still sees the real PVR, and what
we're doing is setting a mode of the CPU rather than changing it into
an older CPU.  So, it seemed better to use something separate from the
PVR.  Also, if we used the pvr setting to convey this information,
then apparently under TCG the guest would see the logical PVR value if
it read the (apparently) real PVR.

Alexey is working on the relevant QEMU patches.
Paul.
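The supported KVM_REG_PPC_ARCH_COMPAT values quoted above can be summarized as a small lookup. This is only an illustrative sketch of the value space — the function name is made up, and in real use the value would be written through the KVM_SET_ONE_REG ioctl rather than any function like this:

```c
#include <assert.h>

/* Sketch of the KVM_REG_PPC_ARCH_COMPAT value space described above.
 * 0 means "no compatibility mode"; the others are the PAPR "logical"
 * PVR values.  Returns the architecture level * 100, or -1 for an
 * unsupported value.  Hypothetical helper, not kernel code. */
static int arch_compat_level(unsigned int logical_pvr)
{
    switch (logical_pvr) {
    case 0:          return 0;    /* native: full capabilities */
    case 0x0f000002: return 205;  /* Architecture 2.05 (POWER6) */
    case 0x0f000003: return 206;  /* Architecture 2.06 (POWER7) */
    case 0x0f100003: return 206;  /* Architecture 2.06+ (POWER7+) */
    default:         return -1;   /* rejected by set_one_reg */
    }
}
```

Note that the POWER7 and POWER7+ logical PVRs both select architecture level 2.06; the "+" variant differs only in the extra facilities it permits.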
Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling
On 09/06/2013 04:01 PM, Gleb Natapov wrote:
> On Fri, Sep 06, 2013 at 09:38:21AM +1000, Alexey Kardashevskiy wrote:
>> On 09/06/2013 04:10 AM, Gleb Natapov wrote:
>>> On Wed, Sep 04, 2013 at 02:01:28AM +1000, Alexey Kardashevskiy wrote:
>>>> On 09/03/2013 08:53 PM, Gleb Natapov wrote:
>>>>> On Mon, Sep 02, 2013 at 01:14:29PM +1000, Alexey Kardashevskiy wrote:
>>>>>> On 09/01/2013 10:06 PM, Gleb Natapov wrote:
>>>>>>> On Wed, Aug 28, 2013 at 06:50:41PM +1000, Alexey Kardashevskiy wrote:
>>>>>>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>>>>>>>> and H_STUFF_TCE requests targeted at an IOMMU TCE table without
>>>>>>>> passing them to user space, which saves time on switching to user
>>>>>>>> space and back.
>>>>>>>>
>>>>>>>> Both real and virtual modes are supported.  The kernel tries to
>>>>>>>> handle a TCE request in real mode; if that fails, it passes the
>>>>>>>> request to the virtual mode to complete the operation.  If the
>>>>>>>> virtual mode handler fails, the request is passed to user space.
>>>>>>>>
>>>>>>>> The first user of this is VFIO on POWER.  Trampolines to the VFIO
>>>>>>>> external user API functions are required for this patch.
>>>>>>>>
>>>>>>>> This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
>>>>>>>> number (LIOBN) with a VFIO IOMMU group fd and enable in-kernel
>>>>>>>> handling of map/unmap requests.  The device supports a single
>>>>>>>> attribute which is a struct with LIOBN and IOMMU fd.  When the
>>>>>>>> attribute is set, the device establishes the connection between
>>>>>>>> KVM and VFIO.
>>>>>>>>
>>>>>>>> Tests show that this patch increases transmission speed from
>>>>>>>> 220MB/s to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb
>>>>>>>> ethernet card).
>>>>>>>>
>>>>>>>> Signed-off-by: Paul Mackerras
>>>>>>>> Signed-off-by: Alexey Kardashevskiy
>>>>>>>>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> Changes:
>>>>>>>> v9:
>>>>>>>> * KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE
>>>>>>>>   IOMMU" KVM device
>>>>>>>> * release_spapr_tce_table() is not shared between different TCE types
>>>>>>>> * reduced the patch size by moving VFIO external API trampolines
>>>>>>>>   to a separate patch
>>>>>>>> * moved documentation from Documentation/virtual/kvm/api.txt to
>>>>>>>>   Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>>>>>>>
>>>>>>>> v8:
>>>>>>>> * fixed warnings from checkpatch.pl
>>>>>>>>
>>>>>>>> 2013/07/11:
>>>>>>>> * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
>>>>>>>>   for KVM_BOOK3S_64
>>>>>>>> * kvmppc_gpa_to_hva_and_get also returns host phys address.  Not
>>>>>>>>   much sense for this here but the next patch for hugepages support
>>>>>>>>   will use it more.
>>>>>>>>
>>>>>>>> 2013/07/06:
>>>>>>>> * added realmode arch_spin_lock to protect TCE table from races
>>>>>>>>   in real and virtual modes
>>>>>>>> * POWERPC IOMMU API is changed to support real mode
>>>>>>>> * iommu_take_ownership and iommu_release_ownership are protected by
>>>>>>>>   iommu_table's locks
>>>>>>>> * VFIO external user API use rewritten
>>>>>>>> * multiple small fixes
>>>>>>>>
>>>>>>>> 2013/06/27:
>>>>>>>> * tce_list page is referenced now in order to protect it from
>>>>>>>>   accidental invalidation during H_PUT_TCE_INDIRECT execution
>>>>>>>> * added use of the external user VFIO API
>>>>>>>>
>>>>>>>> 2013/06/05:
>>>>>>>> * changed capability number
>>>>>>>> * changed ioctl number
>>>>>>>> * update the doc article number
>>>>>>>>
>>>>>>>> 2013/05/20:
>>>>>>>> * removed get_user() from real mode handlers
>>>>>>>> * kvm_vcpu_arch::tce_tmp usage extended.  Now the real mode handler
>>>>>>>>   puts there translated TCEs, tries realmode_get_page() on those and
>>>>>>>>   if it fails, it passes control over to the virtual mode handler
>>>>>>>>   which tries to finish the request handling
>>>>>>>> * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY
>>>>>>>>   bit on a page
>>>>>>>> * the only reason to pass the request to user mode now is when the
>>>>>>>>   user mode did not register a TCE table in the kernel; in all other
>>>>>>>>   cases the virtual mode handler is expected to do the job
>>>>>>>> ---
>>>>>>>>  .../virtual/kvm/devices/spapr_tce_iommu.txt |  37 +++
>>>>>>>>  arch/powerpc/include/asm/kvm_host.h         |   4 +
>>>>>>>>  arch/powerpc/kvm/book3s_64_vio.c            | 310 -
>>>>>>>>  arch/powerpc/kvm/book3s_64_vio_hv.c         | 122
>>>>>>>>  arch/powerpc/kvm/powerpc.c                  |   1 +
>>>>>>>>  include/linux/kvm_host.h                    |   1 +
>>>>>>>>  virt/kvm/kvm_main.c                         |   5 +
>>>>>>>>  7 files changed, 477 insertions(+), 3 deletions(-)
>>>>>>>>  create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>>>>>>>
>>>>>>>> diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt b/Documentation/virtual/kvm/devices/spapr_
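The real-mode → virtual-mode → user-space fallback chain described in the changelog can be sketched as plain logic. This is an illustrative model only — the handler names, the `struct tce_request` fields, and the H_TOO_HARD-style return convention are stand-ins for the actual kernel code, not its API:

```c
#include <assert.h>

/* Sketch of the TCE fallback chain: try real mode first, then virtual
 * mode, and only exit to user space if both in-kernel handlers give up.
 * All names here are hypothetical. */
enum tce_result { TCE_HANDLED, TCE_TOO_HARD };

struct tce_request {
    int needs_page_pin;     /* would require faulting a page in */
    int table_registered;   /* user space registered a TCE table */
};

static enum tce_result realmode_h_put_tce(const struct tce_request *req)
{
    /* real mode cannot sleep or fault pages in */
    return req->needs_page_pin ? TCE_TOO_HARD : TCE_HANDLED;
}

static enum tce_result virtmode_h_put_tce(const struct tce_request *req)
{
    /* virtual mode can, but gives up if no TCE table was registered */
    return req->table_registered ? TCE_HANDLED : TCE_TOO_HARD;
}

/* returns 0 if handled in-kernel, 1 if the exit goes to user space */
static int handle_h_put_tce(const struct tce_request *req)
{
    if (realmode_h_put_tce(req) == TCE_HANDLED)
        return 0;
    if (virtmode_h_put_tce(req) == TCE_HANDLED)
        return 0;
    return 1;
}
```

The speedup reported in the patch comes from how rarely the last branch is taken once the table is registered in-kernel.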
Re: [PATCH 06/11] KVM: PPC: Book3S HV: Support POWER6 compatibility mode on POWER7
Paul Mackerras writes:

> This enables us to use the Processor Compatibility Register (PCR) on
> POWER7 to put the processor into architecture 2.05 compatibility mode
> when running a guest.  In this mode the new instructions and registers
> that were introduced on POWER7 are disabled in user mode.  This
> includes all the VSX facilities plus several other instructions such
> as ldbrx, stdbrx, popcntw, popcntd, etc.
>
> To select this mode, we have a new register accessible through the
> set/get_one_reg interface, called KVM_REG_PPC_ARCH_COMPAT.  Setting
> this to zero gives the full set of capabilities of the processor.
> Setting it to one of the "logical" PVR values defined in PAPR puts
> the vcpu into the compatibility mode for the corresponding
> architecture level.  The supported values are:
>
> 	0x0f000002	Architecture 2.05 (POWER6)
> 	0x0f000003	Architecture 2.06 (POWER7)
> 	0x0f100003	Architecture 2.06+ (POWER7+)
>
> Since the PCR is per-core, the architecture compatibility level and
> the corresponding PCR value are stored in the struct kvmppc_vcore, and
> are therefore shared between all vcpus in a virtual core.

We already have KVM_SET_SREGS taking pvr as argument. Can't we do this
in kvmppc_set_pvr?  Can you also share the qemu changes?  There I guess
we need to update the "cpu-version" in the device tree so that
/proc/cpuinfo shows the right information in the guest.

-aneesh
[RFC PATCH 04/10] KVM: PPC: Book3S HV: Flush the correct number of TLB sets on POWER8
POWER8 has 512 sets in the TLB, compared to 128 for POWER7, so we need
to do more tlbiel instructions when flushing the TLB on POWER8.

Signed-off-by: Paul Mackerras
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 1de0b65..612c9c8 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -667,7 +667,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
 	andc	r7,r7,r0
 	stdcx.	r7,0,r6
 	bne	23b
-	li	r6,128			/* and flush the TLB */
+	/* Flush the TLB of any entries for this LPID */
+	/* use arch 2.07S as a proxy for POWER8 */
+BEGIN_FTR_SECTION
+	li	r6,512			/* POWER8 has 512 sets */
+FTR_SECTION_ELSE
+	li	r6,128			/* POWER7 has 128 sets */
+ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_207S)
 	mtctr	r6
 	li	r7,0x800		/* IS field = 0b10 */
 	ptesync
--
1.8.4.rc3
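Rendered as C, the assembly above amounts to picking the set count from the CPU feature bit and looping once per set. This is an illustrative sketch — `cpu_has_arch_207s` stands in for the kernel's `CPU_FTR_ARCH_207S` feature test, and the loop body is a placeholder for the actual `tlbiel`:

```c
#include <assert.h>

/* Number of TLB congruence-class sets to flush, per the patch above. */
static int tlb_set_count(int cpu_has_arch_207s)
{
    return cpu_has_arch_207s ? 512 : 128;   /* POWER8 : POWER7 */
}

/* Returns how many tlbiel instructions would be issued. */
static int flush_lpid_tlb(int cpu_has_arch_207s)
{
    int flushed = 0;
    for (int set = 0; set < tlb_set_count(cpu_has_arch_207s); set++) {
        /* real code: tlbiel with IS = 0b10, stepping the set index */
        flushed++;
    }
    return flushed;
}
```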
[RFC PATCH 05/10] KVM: PPC: Book3S HV: Add handler for HV facility unavailable
From: Michael Ellerman

At present this should never happen, since the host kernel sets HFSCR
to allow access to all facilities.  It's better to be prepared to
handle it cleanly if it does ever happen, though.

Signed-off-by: Michael Ellerman
Signed-off-by: Paul Mackerras
---
 arch/powerpc/include/asm/kvm_asm.h | 1 +
 arch/powerpc/kvm/book3s_hv.c       | 9 +++++++++
 2 files changed, 10 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_asm.h b/arch/powerpc/include/asm/kvm_asm.h
index 851bac7..f6401eb 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -99,6 +99,7 @@
 #define BOOK3S_INTERRUPT_PERFMON	0xf00
 #define BOOK3S_INTERRUPT_ALTIVEC	0xf20
 #define BOOK3S_INTERRUPT_VSX		0xf40
+#define BOOK3S_INTERRUPT_H_FAC_UNAVAIL	0xf80

 #define BOOK3S_IRQPRIO_SYSTEM_RESET	0
 #define BOOK3S_IRQPRIO_DATA_SEGMENT	1
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 724daa5..da8619e 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -704,6 +704,15 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
 		kvmppc_core_queue_program(vcpu, 0x80000);
 		r = RESUME_GUEST;
 		break;
+	/*
+	 * This occurs if the guest (kernel or userspace) does something that
+	 * is prohibited by HFSCR.  We just generate a program interrupt to
+	 * the guest.
+	 */
+	case BOOK3S_INTERRUPT_H_FAC_UNAVAIL:
+		kvmppc_core_queue_program(vcpu, 0x80000);
+		r = RESUME_GUEST;
+		break;
 	default:
 		kvmppc_dump_regs(vcpu);
 		printk(KERN_EMERG "trap=0x%x | pc=0x%lx | msr=0x%llx\n",
--
1.8.4.rc3
[RFC PATCH 0/10] Support POWER8 in HV KVM
This series adds support for the POWER8 CPU in HV KVM.  POWER8 adds
several new guest-accessible instructions, special-purpose registers,
and other features such as doorbell interrupts and hardware
transactional memory.  It also adds new hypervisor-controlled features
such as relocation-on interrupts, and replaces the DABR/DABRX
registers with the new DAWR/DAWRX registers with expanded capabilities.
The POWER8 CPU has compatibility modes for architecture 2.06
(POWER7/POWER7+) and 2.05 (POWER6).

This series does not yet handle the checkpointed register state of the
guest.

 arch/powerpc/include/asm/kvm_asm.h        |   2 +
 arch/powerpc/include/asm/kvm_book3s_asm.h |   1 +
 arch/powerpc/include/asm/kvm_host.h       |  29 +-
 arch/powerpc/include/asm/reg.h            |  25 +-
 arch/powerpc/include/uapi/asm/kvm.h       |   1 +
 arch/powerpc/kernel/asm-offsets.c         |  28 +-
 arch/powerpc/kvm/book3s_hv.c              | 234 +++++--
 arch/powerpc/kvm/book3s_hv_interrupts.S   |   8 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 538 ++++++++++++++++++
 9 files changed, 694 insertions(+), 172 deletions(-)

Paul.
[RFC PATCH 02/10] KVM: PPC: Book3S HV: Don't set DABR on POWER8
From: Michael Neuling

POWER8 doesn't have the DABR and DABRX registers; instead it has
new DAWR/DAWRX registers, which will be handled in a later patch.

Signed-off-by: Michael Neuling
Signed-off-by: Paul Mackerras
---
 arch/powerpc/kvm/book3s_hv_interrupts.S |  2 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 13 ++---
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_interrupts.S b/arch/powerpc/kvm/book3s_hv_interrupts.S
index 2a85bf6..0be3404 100644
--- a/arch/powerpc/kvm/book3s_hv_interrupts.S
+++ b/arch/powerpc/kvm/book3s_hv_interrupts.S
@@ -57,9 +57,11 @@ BEGIN_FTR_SECTION
 	std	r3, HSTATE_DSCR(r13)
 END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)

+BEGIN_FTR_SECTION
 	/* Save host DABR */
 	mfspr	r3, SPRN_DABR
 	std	r3, HSTATE_DABR(r13)
+END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)

 	/* Hard-disable interrupts */
 	mfmsr	r10
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 0721d2a..d4f6f5f 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -95,11 +95,13 @@ kvmppc_got_guest:

 	/* Back from guest - restore host state and return to caller */

+BEGIN_FTR_SECTION
 	/* Restore host DABR and DABRX */
 	ld	r5,HSTATE_DABR(r13)
 	li	r6,7
 	mtspr	SPRN_DABR,r5
 	mtspr	SPRN_DABRX,r6
+END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)

 	/* Restore SPRG3 */
 	ld	r3,PACA_SPRG3(r13)
@@ -423,15 +425,17 @@ kvmppc_hv_entry:
 	std	r0, PPC_LR_STKOFF(r1)
 	stdu	r1, -112(r1)

+BEGIN_FTR_SECTION
 	/* Set partition DABR */
 	/* Do this before re-enabling PMU to avoid P7 DABR corruption bug */
 	li	r5,3
 	ld	r6,VCPU_DABR(r4)
 	mtspr	SPRN_DABRX,r5
 	mtspr	SPRN_DABR,r6
-BEGIN_FTR_SECTION
+ BEGIN_FTR_SECTION_NESTED(89)
 	isync
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
+ END_FTR_SECTION_NESTED(CPU_FTR_ARCH_206, CPU_FTR_ARCH_206, 89)
+END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)

 	/* Load guest PMU registers */
 	/* R4 is live here (vcpu pointer) */
@@ -1716,6 +1720,9 @@ ignore_hdec:
 	b	fast_guest_return

 _GLOBAL(kvmppc_h_set_dabr)
+BEGIN_FTR_SECTION
+	b	2f
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 	std	r4,VCPU_DABR(r3)
 	/* Work around P7 bug where DABR can get corrupted on mtspr */
 1:	mtspr	SPRN_DABR,r4
@@ -1723,7 +1730,7 @@ _GLOBAL(kvmppc_h_set_dabr)
 	cmpd	r4, r5
 	bne	1b
 	isync
-	li	r3,0
+2:	li	r3,0
 	blr

 _GLOBAL(kvmppc_h_cede)
--
1.8.4.rc3
[RFC PATCH 10/10] KVM: PPC: Book3S HV: Prepare for host using hypervisor doorbells
POWER8 has support for hypervisor doorbell interrupts.  Though the
kernel doesn't use them for IPIs on the powernv platform yet, it
probably will in future, so this makes KVM cope gracefully if a
hypervisor doorbell interrupt arrives while in a guest.

Signed-off-by: Paul Mackerras
---
 arch/powerpc/include/asm/kvm_asm.h      | 1 +
 arch/powerpc/kvm/book3s_hv.c            | 1 +
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 7 +++++++
 3 files changed, 9 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_asm.h b/arch/powerpc/include/asm/kvm_asm.h
index f6401eb..4c2040a 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -96,6 +96,7 @@
 #define BOOK3S_INTERRUPT_H_DATA_STORAGE	0xe00
 #define BOOK3S_INTERRUPT_H_INST_STORAGE	0xe20
 #define BOOK3S_INTERRUPT_H_EMUL_ASSIST	0xe40
+#define BOOK3S_INTERRUPT_H_DOORBELL	0xe80
 #define BOOK3S_INTERRUPT_PERFMON	0xf00
 #define BOOK3S_INTERRUPT_ALTIVEC	0xf20
 #define BOOK3S_INTERRUPT_VSX		0xf40
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 95a635d..dc5ce77 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -644,6 +644,7 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
 		r = RESUME_GUEST;
 		break;
 	case BOOK3S_INTERRUPT_EXTERNAL:
+	case BOOK3S_INTERRUPT_H_DOORBELL:
 		vcpu->stat.ext_intr_exits++;
 		r = RESUME_GUEST;
 		break;
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 4b3a10e..018513a 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -2034,10 +2034,17 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_207S)
 BEGIN_FTR_SECTION
 	cmpwi	r6, 5			/* privileged doorbell? */
 	beq	0f
+	cmpwi	r6, 3			/* hypervisor doorbell? */
+	beq	3f
 END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 	li	r3, 1			/* anything else, return 1 */
 0:	blr

+	/* hypervisor doorbell */
+3:	li	r12, BOOK3S_INTERRUPT_H_DOORBELL
+	li	r3, 1
+	blr
+
 /*
  * Determine what sort of external interrupt is pending (if any).
  * Returns:
--
1.8.4.rc3
[RFC PATCH 03/10] KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs
From: Michael Neuling

This adds fields to the struct kvm_vcpu_arch to store the new
guest-accessible SPRs on POWER8, adds code to the get/set_one_reg
functions to allow userspace to access this state, and adds code to
the guest entry and exit to context-switch these SPRs between host
and guest.

Note that DPDES (Directed Privileged Doorbell Exception State) is
shared between threads on a core; hence we store it in struct
kvmppc_vcore and have the master thread save and restore it.

Signed-off-by: Michael Neuling
Signed-off-by: Paul Mackerras
---
 arch/powerpc/include/asm/kvm_host.h     |  25 +-
 arch/powerpc/include/asm/reg.h          |  17 ++
 arch/powerpc/include/uapi/asm/kvm.h     |   1 +
 arch/powerpc/kernel/asm-offsets.c       |  23 +
 arch/powerpc/kvm/book3s_hv.c            | 153 +++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 144 ++
 6 files changed, 360 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 6a43455..4786bb0 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -300,6 +300,7 @@ struct kvmppc_vcore {
 	u64 tb_offset;		/* guest timebase - host timebase */
 	u32 arch_compat;
 	ulong pcr;
+	ulong dpdes;		/* doorbell state (POWER8) */
 };

 #define VCORE_ENTRY_COUNT(vc)	((vc)->entry_exit_count & 0xff)
@@ -454,6 +455,7 @@ struct kvm_vcpu_arch {
 	ulong pc;
 	ulong ctr;
 	ulong lr;
+	ulong tar;

 	ulong xer;
 	u32 cr;
@@ -463,13 +465,32 @@ struct kvm_vcpu_arch {
 	ulong guest_owned_ext;
 	ulong purr;
 	ulong spurr;
+	ulong ic;
+	ulong vtb;
 	ulong dscr;
 	ulong amr;
 	ulong uamor;
+	ulong iamr;
 	u32 ctrl;
 	ulong dabr;
+	ulong dawr;
+	ulong dawrx;
+	ulong ciabr;
 	ulong cfar;
 	ulong ppr;
+	ulong pspb;
+	ulong fscr;
+	ulong tfhar;
+	ulong tfiar;
+	ulong texasr;
+	ulong ebbhr;
+	ulong ebbrr;
+	ulong bescr;
+	ulong csigr;
+	ulong tacr;
+	ulong tcscr;
+	ulong acop;
+	ulong wort;
 #endif
 	u32 vrsave; /* also USPRG0 */
 	u32 mmucr;
@@ -503,10 +524,12 @@ struct kvm_vcpu_arch {
 	u32 ccr1;
 	u32 dbsr;

-	u64 mmcr[3];
+	u64 mmcr[5];
 	u32 pmc[8];
+	u32 spmc[2];
 	u64 siar;
 	u64 sdar;
+	u64 sier;

 #ifdef CONFIG_KVM_EXIT_TIMING
 	struct mutex exit_timing_lock;
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 52ff962..4ca8b85 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -218,6 +218,11 @@
 #define   CTRL_TE	0x00c00000	/* thread enable */
 #define   CTRL_RUNLATCH	0x1
 #define SPRN_DAWR	0xB4
+#define SPRN_CIABR	0xBB
+#define   CIABR_PRIV		0x3
+#define   CIABR_PRIV_USER	1
+#define   CIABR_PRIV_SUPER	2
+#define   CIABR_PRIV_HYPER	3
 #define SPRN_DAWRX	0xBC
 #define   DAWRX_USER	(1UL << 0)
 #define   DAWRX_KERNEL	(1UL << 1)
@@ -255,6 +260,8 @@
 #define SPRN_HRMOR	0x139	/* Real mode offset register */
 #define SPRN_HSRR0	0x13A	/* Hypervisor Save/Restore 0 */
 #define SPRN_HSRR1	0x13B	/* Hypervisor Save/Restore 1 */
+#define SPRN_IC		0x350	/* Virtual Instruction Count */
+#define SPRN_VTB	0x351	/* Virtual Time Base */
 #define SPRN_FSCR	0x099	/* Facility Status & Control Register */
 #define   FSCR_TAR	(1 << (63-55)) /* Enable Target Address Register */
 #define   FSCR_EBB	(1 << (63-56)) /* Enable Event Based Branching */
@@ -354,6 +361,8 @@
 #define   DER_EBRKE	0x00000002	/* External Breakpoint Interrupt */
 #define   DER_DPIE	0x00000001	/* Dev. Port Nonmaskable Request */
 #define SPRN_DMISS	0x3D0		/* Data TLB Miss Register */
+#define SPRN_DHDES	0x0B1		/* Directed Hyp. Doorbell Exc. State */
+#define SPRN_DPDES	0x0B0		/* Directed Priv. Doorbell Exc. State */
 #define SPRN_EAR	0x11A		/* External Address Register */
 #define SPRN_HASH1	0x3D2		/* Primary Hash Address Register */
 #define SPRN_HASH2	0x3D3		/* Secondary Hash Address Register */
@@ -413,6 +422,7 @@
 #define SPRN_IABR	0x3F2	/* Instruction Address Breakpoint Register */
 #define SPRN_IABR2	0x3FA	/* 83xx */
 #define SPRN_IBCR	0x135	/* 83xx Insn Breakpoint Control Reg */
+#define SPRN_IAMR	0x03D	/* Instr. Authority Mask Reg */
 #define SPRN_HID4	0x3F4	/* 970 HID4 */
 #define  HID4_LPES0	 (1ul << (63-0)) /* LPAR env. sel. bit 0 */
 #define	 HID4_RMLS2_SH	 (63 - 2)	/* Real mode limit bottom 2 bits */
@@ -526,6 +536,7 @@
 #define SPRN_PIR	0x3FF	/* Processor Identification Register
[RFC PATCH 01/10] KVM: PPC: Book3S HV: Align physical CPU thread numbers with virtual
On a threaded processor such as POWER7, we group VCPUs into virtual
cores and arrange that the VCPUs in a virtual core run on the same
physical core.  Currently we don't enforce any correspondence between
virtual thread numbers within a virtual core and physical thread
numbers.  Physical threads are allocated starting at 0 on a first-come
first-served basis to runnable virtual threads (VCPUs).

POWER8 implements a new "msgsndp" instruction which guest kernels can
use to interrupt other threads in the same core or sub-core.  Since
the instruction takes the destination physical thread ID as a
parameter, it becomes necessary to align the physical thread IDs with
the virtual thread IDs, that is, to make sure virtual thread N within
a virtual core always runs on physical thread N.

This means that it's possible that thread 0, which is where we call
__kvmppc_vcore_entry, may end up running some other vcpu than the one
whose task called kvmppc_run_core(), or it may end up running no vcpu
at all, if for example thread 0 of the virtual core is currently
executing in userspace.  Thus we can't rely on thread 0 to be the
master responsible for switching the MMU.  Instead we now have an
explicit 'is_master' flag which is set for the vcpu whose task called
kvmppc_run_core().  The master then has to wait for thread 0 to enter
real mode before switching the MMU.

Also, we no longer pass the vcpu pointer to __kvmppc_vcore_entry, but
instead let the assembly code load it from the PACA.  Since the
assembly code will need to know the kvm pointer and the thread ID for
threads which don't have a vcpu, we move the thread ID into the PACA
and we add a kvm pointer to the virtual core structure.

In the case where thread 0 has no vcpu to run, we arrange for it to go
to nap mode, using a new flag value in the PACA 'napping' field so we
can differentiate it when it wakes from the other nap cases.  We set
the bit for the thread in the vcore 'napping_threads' field so that
when other threads come out of the guest they will send an IPI to
thread 0 to wake it up.  When it does wake up, we clear that bit, see
what caused the wakeup, and either exit back to the kernel, or start
running virtual thread 0 in the case where it now wants to enter the
guest and the other threads are still in the guest.

Signed-off-by: Paul Mackerras
---
 arch/powerpc/include/asm/kvm_book3s_asm.h |   1 +
 arch/powerpc/include/asm/kvm_host.h       |   4 +
 arch/powerpc/kernel/asm-offsets.c         |   5 +-
 arch/powerpc/kvm/book3s_hv.c              |  49 ++++
 arch/powerpc/kvm/book3s_hv_interrupts.S   |   6 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 188 +++++++++++++-
 6 files changed, 192 insertions(+), 61 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h b/arch/powerpc/include/asm/kvm_book3s_asm.h
index 22f4606..8669ac0 100644
--- a/arch/powerpc/include/asm/kvm_book3s_asm.h
+++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
@@ -87,6 +87,7 @@ struct kvmppc_host_state {
 	u8 hwthread_req;
 	u8 hwthread_state;
 	u8 host_ipi;
+	u8 ptid;
 	struct kvm_vcpu *kvm_vcpu;
 	struct kvmppc_vcore *kvm_vcore;
 	unsigned long xics_phys;
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 5a40270..6a43455 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -284,6 +284,8 @@ struct kvmppc_vcore {
 	int n_woken;
 	int nap_count;
 	int napping_threads;
+	int real_mode_threads;
+	int first_vcpuid;
 	u16 pcpu;
 	u16 last_cpu;
 	u8 vcore_state;
@@ -294,6 +296,7 @@ struct kvmppc_vcore {
 	u64 stolen_tb;
 	u64 preempt_tb;
 	struct kvm_vcpu *runner;
+	struct kvm *kvm;
 	u64 tb_offset;		/* guest timebase - host timebase */
 	u32 arch_compat;
 	ulong pcr;
@@ -575,6 +578,7 @@ struct kvm_vcpu_arch {
 	int state;
 	int ptid;
 	bool timer_running;
+	u8 is_master;
 	wait_queue_head_t cpu_run;

 	struct kvm_vcpu_arch_shared *shared;
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 115dd64..51d7eed 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -514,13 +514,15 @@ int main(void)
 	DEFINE(VCPU_FAULT_DAR, offsetof(struct kvm_vcpu, arch.fault_dar));
 	DEFINE(VCPU_LAST_INST, offsetof(struct kvm_vcpu, arch.last_inst));
 	DEFINE(VCPU_TRAP, offsetof(struct kvm_vcpu, arch.trap));
-	DEFINE(VCPU_PTID, offsetof(struct kvm_vcpu, arch.ptid));
+	DEFINE(VCPU_IS_MASTER, offsetof(struct kvm_vcpu, arch.is_master));
 	DEFINE(VCPU_CFAR, offsetof(struct kvm_vcpu, arch.cfar));
 	DEFINE(VCPU_PPR, offsetof(struct kvm_vcpu, arch.ppr));
 	DEFINE(VCORE_ENTRY_EXIT, offsetof(struct kvmppc_vcore, entry_exit_count));
 	DEFINE(VCORE_NAP_COUNT, offsetof(struct kvmppc_vcore, nap_count));
 	DE
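The alignment requirement in the commit message above can be sketched as a one-line mapping. This is an illustrative guess at the arithmetic, not the patch's actual code: `first_vcpuid` is the new kvmppc_vcore field added by the patch, but the struct and function here are made up for illustration:

```c
#include <assert.h>

/* Sketch of virtual/physical thread alignment: virtual thread N within
 * a vcore must run on physical thread N, so a vcpu's physical thread is
 * its position relative to the vcore's first vcpu id.  Hypothetical
 * types and names. */
struct vcore_sketch {
    int first_vcpuid;   /* vcpu id of virtual thread 0 of this vcore */
};

static int physical_thread_for(const struct vcore_sketch *vc, int vcpu_id)
{
    return vcpu_id - vc->first_vcpuid;
}
```

With this fixed mapping, a guest's `msgsndp` destination thread ID is meaningful regardless of which virtual threads happen to be runnable.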
[RFC PATCH 09/10] KVM: PPC: Book3S HV: Handle new LPCR bits on POWER8
POWER8 has a bit in the LPCR to enable or disable the PURR and SPURR registers to count when in the guest. Set this bit. POWER8 has a field in the LPCR called AIL (Alternate Interrupt Location) which is used to enable relocation-on interrupts. Allow userspace to set this field. Signed-off-by: Paul Mackerras --- arch/powerpc/include/asm/reg.h | 2 ++ arch/powerpc/kvm/book3s_hv.c | 6 ++ 2 files changed, 8 insertions(+) diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index 73fdd62..60c2dd8 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -291,8 +291,10 @@ #define LPCR_RMLS0x1C00 /* impl dependent rmo limit sel */ #define LPCR_RMLS_SH (63-37) #define LPCR_ILE 0x0200 /* !HV irqs set MSR:LE */ +#define LPCR_AIL 0x0180 /* Alternate interrupt location */ #define LPCR_AIL_0 0x /* MMU off exception offset 0x0 */ #define LPCR_AIL_3 0x0180 /* MMU on exception offset 0xc00...4xxx */ +#define LPCR_ONL 0x0004 /* online - PURR/SPURR count */ #define LPCR_PECE0x0001f000 /* powersave exit cause enable */ #define LPCR_PECEDP0x0001 /* directed priv dbells cause exit */ #define LPCR_PECEDH0x8000 /* directed hyp dbells cause exit */ diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 217041f..95a635d 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -783,8 +783,11 @@ static void kvmppc_set_lpcr(struct kvm_vcpu *vcpu, u64 new_lpcr) /* * Userspace can only modify DPFD (default prefetch depth), * ILE (interrupt little-endian) and TC (translation control). +* On POWER8 userspace can also modify AIL (alt. interrupt loc.) 
*/ mask = LPCR_DPFD | LPCR_ILE | LPCR_TC; + if (cpu_has_feature(CPU_FTR_ARCH_207S)) + mask |= LPCR_AIL; kvm->arch.lpcr = (kvm->arch.lpcr & ~mask) | (new_lpcr & mask); mutex_unlock(&kvm->lock); } @@ -2169,6 +2172,9 @@ int kvmppc_core_init_vm(struct kvm *kvm) LPCR_VPM0 | LPCR_VPM1; kvm->arch.vrma_slb_v = SLB_VSID_B_1T | (VRMA_VSID << SLB_VSID_SHIFT_1T); + /* On POWER8 turn on online bit to enable PURR/SPURR */ + if (cpu_has_feature(CPU_FTR_ARCH_207S)) + lpcr |= LPCR_ONL; } kvm->arch.lpcr = lpcr; -- 1.8.4.rc3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 06/10] KVM: PPC: Book3S HV: Implement architecture compatibility modes for POWER8
This allows us to select architecture 2.05 (POWER6) or 2.06 (POWER7) compatibility modes on a POWER8 processor. Signed-off-by: Paul Mackerras --- arch/powerpc/include/asm/reg.h | 2 ++ arch/powerpc/kvm/book3s_hv.c | 16 +++- 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index 4ca8b85..483e0a2 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -315,6 +315,8 @@ #define SPRN_PCR 0x152 /* Processor compatibility register */ #define PCR_VEC_DIS (1ul << (63-0)) /* Vec. disable (pre POWER8) */ #define PCR_VSX_DIS (1ul << (63-1)) /* VSX disable (pre POWER8) */ +#define PCR_TM_DIS (1ul << (63-2)) /* Trans. memory disable (POWER8) */ +#define PCR_ARCH_206 0x4 /* Architecture 2.06 */ #define PCR_ARCH_205 0x2 /* Architecture 2.05 */ #define SPRN_HEIR 0x153 /* Hypervisor Emulated Instruction Register */ #define SPRN_TLBINDEXR 0x154 /* P7 TLB control register */ diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index da8619e..217041f 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -177,14 +177,28 @@ int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, u32 arch_compat) switch (arch_compat) { case PVR_ARCH_205: - pcr = PCR_ARCH_205; + /* +* If an arch bit is set in PCR, all the defined +* higher-order arch bits also have to be set. +*/ + pcr = PCR_ARCH_206 | PCR_ARCH_205; break; case PVR_ARCH_206: case PVR_ARCH_206p: + pcr = PCR_ARCH_206; + break; + case PVR_ARCH_207: break; default: return -EINVAL; } + + if (!cpu_has_feature(CPU_FTR_ARCH_207S)) { + /* POWER7 can't emulate POWER8 */ + if (!(pcr & PCR_ARCH_206)) + return -EINVAL; + pcr &= ~PCR_ARCH_206; + } } spin_lock(&vc->lock); -- 1.8.4.rc3
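For reference, the compat-mode selection above can be modelled as a small user-space C sketch. This is a toy model, not the kernel API: the function name, the plain-integer cases standing in for PVR_ARCH_205/206/207, the host_is_207 flag standing in for cpu_has_feature(CPU_FTR_ARCH_207S), and the -1 error return (for -EINVAL) are all illustrative.

```c
#include <assert.h>

#define PCR_ARCH_205 0x2
#define PCR_ARCH_206 0x4

/* Model of kvmppc_set_arch_compat(): if an arch bit is set in PCR, all
 * defined higher-order arch bits must be set too, so selecting 2.05
 * mode also sets the 2.06 bit.  Returns -1 for an invalid request. */
static long compute_pcr(unsigned int compat, int host_is_207)
{
	long pcr;

	switch (compat) {
	case 205:
		pcr = PCR_ARCH_206 | PCR_ARCH_205;
		break;
	case 206:
		pcr = PCR_ARCH_206;
		break;
	case 207:
		pcr = 0;
		break;
	default:
		return -1;
	}
	if (!host_is_207) {
		/* POWER7 can't emulate POWER8 */
		if (!(pcr & PCR_ARCH_206))
			return -1;
		/* the 2.06 bit is implied by running on POWER7 itself */
		pcr &= ~PCR_ARCH_206;
	}
	return pcr;
}
```

Note how the POWER7 branch both rejects a 2.07 request and strips the 2.06 bit that the hardware cannot interpret.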
[RFC PATCH 07/10] KVM: PPC: Book3S HV: Consolidate code that checks reason for wake from nap
Currently in book3s_hv_rmhandlers.S we have three places where we have woken up from nap mode and we check the reason field in SRR1 to see what event woke us up. This consolidates them into a new function, kvmppc_check_wake_reason. It looks at the wake reason field in SRR1, and if it indicates that an external interrupt caused the wakeup, calls kvmppc_read_intr to check what sort of interrupt it was. This also consolidates the two places where we synthesize an external interrupt (0x500 vector) for the guest. Now, if the guest exit code finds that there was an external interrupt which has been handled (i.e. it was an IPI indicating that there is now an interrupt pending for the guest), it jumps to deliver_guest_interrupt, which is in the last part of the guest entry code, where we synthesize guest external and decrementer interrupts. That code has been streamlined a little and now clears LPCR[MER] when appropriate as well as setting it. The extra clearing of any pending IPI on a secondary, offline CPU thread before going back to nap mode has been removed. It is no longer necessary now that we have code to read and acknowledge IPIs in the guest exit path. This fixes a minor bug in the H_CEDE real-mode handling - previously, if we found that other threads were already exiting the guest when we were about to go to nap mode, we would branch to the cede wakeup path and end up looking in SRR1 for a wakeup reason. Now we branch to a point after we have checked the wakeup reason. This also fixes a minor bug in kvmppc_read_intr - previously it could return 0xff rather than 1, in the case where we find that a host IPI is pending after we have cleared the IPI. Now it returns 1. 
Signed-off-by: Paul Mackerras --- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 193 +--- 1 file changed, 78 insertions(+), 115 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index 612c9c8..6351ce2 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -234,8 +234,10 @@ kvm_novcpu_wakeup: stb r0, HSTATE_NAPPING(r13) stb r0, HSTATE_HWTHREAD_REQ(r13) + /* check the wake reason */ + bl kvmppc_check_wake_reason + /* see if any other thread is already exiting */ - li r12, 0 lwz r0, VCORE_ENTRY_EXIT(r5) cmpwi r0, 0x100 bge kvm_novcpu_exit @@ -245,23 +247,14 @@ kvm_novcpu_wakeup: li r0, 1 sld r0, r0, r7 addir6, r5, VCORE_NAPPING_THREADS -4: lwarx r3, 0, r6 - andcr3, r3, r0 - stwcx. r3, 0, r6 +4: lwarx r7, 0, r6 + andcr7, r7, r0 + stwcx. r7, 0, r6 bne 4b - /* Check the wake reason in SRR1 to see why we got here */ - mfspr r3, SPRN_SRR1 - rlwinm r3, r3, 44-31, 0x7 /* extract wake reason field */ - cmpwi r3, 4 /* was it an external interrupt? */ - bne kvm_novcpu_exit /* if not, exit the guest */ - - /* extern interrupt - read and handle it */ - li r12, BOOK3S_INTERRUPT_EXTERNAL - bl kvmppc_read_intr + /* See if the wake reason means we need to exit */ cmpdi r3, 0 bge kvm_novcpu_exit - li r12, 0 /* Got an IPI but other vcpus aren't yet exiting, must be a latecomer */ ld r4, HSTATE_KVM_VCPU(r13) @@ -325,58 +318,23 @@ kvm_start_guest: */ /* Check the wake reason in SRR1 to see why we got here */ - mfspr r3,SPRN_SRR1 - rlwinm r3,r3,44-31,0x7 /* extract wake reason field */ - cmpwi r3,4/* was it an external interrupt? */ - bne 27f /* if not */ - ld r5,HSTATE_XICS_PHYS(r13) - li r7,XICS_XIRR/* if it was an external interrupt, */ - lwzcix r8,r5,r7/* get and ack the interrupt */ - sync - clrldi. r9,r8,40/* get interrupt source ID. */ - beq 28f /* none there? */ - cmpwi r9,XICS_IPI /* was it an IPI? 
*/ - bne 29f - li r0,0xff - li r6,XICS_MFRR - stbcix r0,r5,r6/* clear IPI */ - stwcix r8,r5,r7/* EOI the interrupt */ - sync/* order loading of vcpu after that */ + bl kvmppc_check_wake_reason + cmpdi r3, 0 + bge kvm_no_guest /* get vcpu pointer, NULL if we have no vcpu to run */ ld r4,HSTATE_KVM_VCPU(r13) cmpdi r4,0 /* if we have no vcpu to run, go back to sleep */ beq kvm_no_guest - b 30f -27:/* XXX should handle hypervisor maintenance interrupts etc. here */ - b kvm_no_guest -28:/* SRR1 said external but ICP said
[RFC PATCH 08/10] KVM: PPC: Book3S HV: Handle guest using doorbells for IPIs
* SRR1 wake reason field for system reset interrupt on wakeup from nap is now a 4-bit field on P8, compared to 3 bits on P7. * Set PECEDP in LPCR when napping because of H_CEDE so guest doorbells will wake us up. * Waking up from nap because of a guest doorbell interrupt is not a reason to exit the guest. Signed-off-by: Paul Mackerras --- arch/powerpc/include/asm/reg.h | 4 +++- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 19 +++ 2 files changed, 18 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index 483e0a2..73fdd62 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -293,7 +293,9 @@ #define LPCR_ILE 0x0200 /* !HV irqs set MSR:LE */ #define LPCR_AIL_0 0x /* MMU off exception offset 0x0 */ #define LPCR_AIL_3 0x0180 /* MMU on exception offset 0xc00...4xxx */ -#define LPCR_PECE0x7000 /* powersave exit cause enable */ +#define LPCR_PECE0x0001f000 /* powersave exit cause enable */ +#define LPCR_PECEDP0x0001 /* directed priv dbells cause exit */ +#define LPCR_PECEDH0x8000 /* directed hyp dbells cause exit */ #define LPCR_PECE0 0x4000 /* ext. 
exceptions can cause exit */ #define LPCR_PECE1 0x2000 /* decrementer can cause exit */ #define LPCR_PECE2 0x1000 /* machine check etc can cause exit */ diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index 6351ce2..4b3a10e 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -1894,13 +1894,16 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_206) bl .kvmppc_save_fp /* -* Take a nap until a decrementer or external interrupt occurs, -* with PECE1 (wake on decr) and PECE0 (wake on external) set in LPCR +* Take a nap until a decrementer or external or doorbell interrupt +* occurs, with PECE1, PECE0 and PECEDP set in LPCR */ li r0,1 stb r0,HSTATE_HWTHREAD_REQ(r13) mfspr r5,SPRN_LPCR ori r5,r5,LPCR_PECE0 | LPCR_PECE1 +BEGIN_FTR_SECTION + oris r5,r5,LPCR_PECEDP@h +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S) mtspr SPRN_LPCR,r5 isync li r0, 0 @@ -2016,14 +2019,22 @@ machine_check_realmode: */ kvmppc_check_wake_reason: mfspr r6, SPRN_SRR1 - rlwinm r6, r6, 44-31, 0x7 /* extract wake reason field */ - cmpwi r6, 4 /* was it an external interrupt? */ +BEGIN_FTR_SECTION + rlwinm r6, r6, 45-31, 0xf /* extract wake reason field (P8) */ +FTR_SECTION_ELSE + rlwinm r6, r6, 45-31, 0xe /* P7 wake reason field is 3 bits */ +ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_207S) + cmpwi r6, 8 /* was it an external interrupt? */ li r12, BOOK3S_INTERRUPT_EXTERNAL beq kvmppc_read_intr /* if so, see what it was */ li r3, 0 li r12, 0 cmpwi r6, 6 /* was it the decrementer? */ beq 0f +BEGIN_FTR_SECTION + cmpwi r6, 5 /* privileged doorbell? */ + beq 0f +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S) li r3, 1 /* anything else, return 1 */ 0: blr -- 1.8.4.rc3
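The two rlwinm forms above are easy to misread; as I understand them, they amount to the following C extraction. The shift amount of 18 is my reading of the 45-31 rotate applied to the low word, so treat this as an illustrative sketch rather than an authoritative decode of SRR1.

```c
#include <assert.h>
#include <stdint.h>

/* Wake-reason extraction as done by kvmppc_check_wake_reason: the SRR1
 * wake reason is a 4-bit field on POWER8 (mask 0xf), but only 3 bits on
 * POWER7 (mask 0xe after the same shift), so the same comparisons work
 * on both: 8 = external interrupt, 6 = decrementer, 5 = privileged
 * doorbell (POWER8 only). */
static unsigned int wake_reason(uint64_t srr1, int is_p8)
{
	return (srr1 >> 18) & (is_p8 ? 0xf : 0xe);
}
```

With one shift shared between the two CPU generations, a P7 reason can never have its low bit set, which is why the external-interrupt comparison moved from 4 to 8.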
[PATCH 11/11] KVM: PPC: Book3S HV: Return -EINVAL rather than BUG'ing
From: Michael Ellerman This means that if we do happen to get a trap that we don't know about, we abort the guest rather than crashing the host kernel. Signed-off-by: Michael Ellerman Signed-off-by: Paul Mackerras --- arch/powerpc/kvm/book3s_hv.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 0bb23a9..731e46e 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -709,8 +709,7 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, printk(KERN_EMERG "trap=0x%x | pc=0x%lx | msr=0x%llx\n", vcpu->arch.trap, kvmppc_get_pc(vcpu), vcpu->arch.shregs.msr); - r = RESUME_HOST; - BUG(); + r = -EINVAL; break; } -- 1.8.4.rc3
[PATCH 09/11] KVM: PPC: Book3S HV: Pull out interrupt-reading code into a subroutine
This moves the code in book3s_hv_rmhandlers.S that reads any pending interrupt from the XICS interrupt controller, and works out whether it is an IPI for the guest, an IPI for the host, or a device interrupt, into a new function called kvmppc_read_intr. Later patches will need this. Signed-off-by: Paul Mackerras --- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 117 +++- 1 file changed, 68 insertions(+), 49 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index d9ab139..01515b6 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -868,46 +868,11 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_206) * set, we know the host wants us out so let's do it now */ do_ext_interrupt: - lbz r0, HSTATE_HOST_IPI(r13) - cmpwi r0, 0 - bne ext_interrupt_to_host - - /* Now read the interrupt from the ICP */ - ld r5, HSTATE_XICS_PHYS(r13) - li r7, XICS_XIRR - cmpdi r5, 0 - beq-ext_interrupt_to_host - lwzcix r3, r5, r7 - rlwinm. r0, r3, 0, 0xff - sync - beq 3f /* if nothing pending in the ICP */ - - /* We found something in the ICP... -* -* If it's not an IPI, stash it in the PACA and return to -* the host, we don't (yet) handle directing real external -* interrupts directly to the guest -*/ - cmpwi r0, XICS_IPI - bne ext_stash_for_host - - /* It's an IPI, clear the MFRR and EOI it */ - li r0, 0xff - li r6, XICS_MFRR - stbcix r0, r5, r6 /* clear the IPI */ - stwcix r3, r5, r7 /* EOI it */ - sync - - /* We need to re-check host IPI now in case it got set in the -* meantime. 
If it's clear, we bounce the interrupt to the -* guest -*/ - lbz r0, HSTATE_HOST_IPI(r13) - cmpwi r0, 0 - bne-1f + bl kvmppc_read_intr + cmpdi r3, 0 + bgt ext_interrupt_to_host /* Allright, looks like an IPI for the guest, we need to set MER */ -3: /* Check if any CPU is heading out to the host, if so head out too */ ld r5, HSTATE_KVM_VCORE(r13) lwz r0, VCORE_ENTRY_EXIT(r5) @@ -936,17 +901,6 @@ do_ext_interrupt: mtspr SPRN_LPCR, r8 b fast_guest_return - /* We raced with the host, we need to resend that IPI, bummer */ -1: li r0, IPI_PRIORITY - stbcix r0, r5, r6 /* set the IPI */ - sync - b ext_interrupt_to_host - -ext_stash_for_host: - /* It's not an IPI and it's for the host, stash it in the PACA -* before exit, it will be picked up by the host ICP driver -*/ - stw r3, HSTATE_SAVED_XIRR(r13) ext_interrupt_to_host: guest_exit_cont: /* r9 = vcpu, r12 = trap, r13 = paca */ @@ -1821,6 +1775,71 @@ machine_check_realmode: b fast_interrupt_c_return /* + * Determine what sort of external interrupt is pending (if any). + * Returns: + * 0 if no interrupt is pending + * 1 if an interrupt is pending that needs to be handled by the host + * -1 if there was a guest wakeup IPI (which has now been cleared) + */ +kvmppc_read_intr: + /* see if a host IPI is pending */ + li r3, 1 + lbz r0, HSTATE_HOST_IPI(r13) + cmpwi r0, 0 + bne 1f + + /* Now read the interrupt from the ICP */ + ld r6, HSTATE_XICS_PHYS(r13) + li r7, XICS_XIRR + cmpdi r6, 0 + beq-1f + lwzcix r0, r6, r7 + rlwinm. r3, r0, 0, 0xff + sync + beq 1f /* if nothing pending in the ICP */ + + /* We found something in the ICP... +* +* If it's not an IPI, stash it in the PACA and return to +* the host, we don't (yet) handle directing real external +* interrupts directly to the guest +*/ + cmpwi r3, XICS_IPI/* if there is, is it an IPI? 
*/ + li r3, 1 + bne 42f + + /* It's an IPI, clear the MFRR and EOI it */ + li r3, 0xff + li r8, XICS_MFRR + stbcix r3, r6, r8 /* clear the IPI */ + stwcix r0, r6, r7 /* EOI it */ + sync + + /* We need to re-check host IPI now in case it got set in the +* meantime. If it's clear, we bounce the interrupt to the +* guest +*/ + lbz r0, HSTATE_HOST_IPI(r13) + cmpwi r0, 0 + bne-43f + + /* OK, it's an IPI for us */ + li r3, -1 +1: blr + +42:/* It's not an IPI and it's for the host, stash it in the PACA +* before exit, it will be picked up by the host ICP driver +*/ + stw r0, HSTATE_SAVED_XIRR(r13) + b
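The return-value contract of kvmppc_read_intr can be summarized with a toy C model. This is a sketch, not the real register-level code: the host_ipi_raced flag compresses the re-check of HSTATE_HOST_IPI after the EOI into a parameter, and XICS_IPI = 2 is the conventional XICS IPI source id.

```c
#include <assert.h>

#define XICS_IPI 2

/* Model of kvmppc_read_intr's result:
 *   0  - nothing pending in the ICP
 *   1  - an interrupt the host must handle (host IPI or device irq)
 *  -1  - a guest wakeup IPI, which has been cleared and EOIed
 */
static int read_intr(int host_ipi, unsigned int xirr, int host_ipi_raced)
{
	if (host_ipi)
		return 1;		/* host wants us out */
	if ((xirr & 0xff) == 0)
		return 0;		/* nothing pending */
	if ((xirr & 0xff) != XICS_IPI)
		return 1;		/* device irq: stash XIRR for the host */
	if (host_ipi_raced)
		return 1;		/* host IPI arrived meanwhile: resend ours */
	return -1;			/* it was an IPI for the guest */
}
```

The "raced" branch is the case the commit message calls out: the IPI has already been cleared, so the caller must resend it before bouncing to the host.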
[PATCH 10/11] KVM: PPC: Book3S HV: Avoid unbalanced increments of VPA yield count
The yield count in the VPA is supposed to be incremented every time we enter the guest, and every time we exit the guest, so that its value is even when the vcpu is running in the guest and odd when it isn't. However, it's currently possible that we increment the yield count on the way into the guest but then find that other CPU threads are already exiting the guest, so we go back to nap mode via the secondary_too_late label. In this situation we don't increment the yield count again, breaking the relationship between the LSB of the count and whether the vcpu is in the guest. To fix this, we move the increment of the yield count to a point after we have checked whether other CPU threads are exiting. Signed-off-by: Paul Mackerras --- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index 01515b6..31030f3 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -401,16 +401,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206) /* Save R1 in the PACA */ std r1, HSTATE_HOST_R1(r13) - /* Increment yield count if they have a VPA */ - ld r3, VCPU_VPA(r4) - cmpdi r3, 0 - beq 25f - lwz r5, LPPACA_YIELDCOUNT(r3) - addi r5, r5, 1 - stw r5, LPPACA_YIELDCOUNT(r3) - li r6, 1 - stb r6, VCPU_VPA_DIRTY(r4) -25: /* Load up DAR and DSISR */ ld r5, VCPU_DAR(r4) lwz r6, VCPU_DSISR(r4) @@ -525,6 +515,16 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201) mtspr SPRN_RMOR,r8 isync + /* Increment yield count if they have a VPA */ + ld r3, VCPU_VPA(r4) + cmpdi r3, 0 + beq 25f + lwz r5, LPPACA_YIELDCOUNT(r3) + addi r5, r5, 1 + stw r5, LPPACA_YIELDCOUNT(r3) + li r6, 1 + stb r6, VCPU_VPA_DIRTY(r4) +25: /* Check if HDEC expires soon */ mfspr r3,SPRN_HDEC cmpwi r3,10 -- 1.8.4.rc3
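The parity invariant this patch restores can be captured in a few lines of C. This is a toy model under the stated assumption that enter_guest only bumps the count once the "other threads exiting?" check has passed, mirroring the move of the increment past the secondary_too_late bailout; the names are illustrative, not kernel symbols.

```c
#include <assert.h>

struct vpa { unsigned int yield_count; };

/* Returns -1 when other threads are already exiting (the
 * secondary_too_late case): the count is left untouched, so its
 * parity still says "not in the guest". */
static int enter_guest(struct vpa *v, int others_exiting)
{
	if (others_exiting)
		return -1;
	v->yield_count++;	/* even: vcpu now in the guest */
	return 0;
}

static void exit_guest(struct vpa *v)
{
	v->yield_count++;	/* odd again: vcpu out of the guest */
}

/* One aborted entry plus one full round trip keeps the invariant. */
static int parity_invariant_holds(void)
{
	struct vpa v = { 1 };	/* odd: not in the guest */

	if (enter_guest(&v, 1) != -1 || v.yield_count % 2 != 1)
		return 0;	/* bailout must not change parity */
	if (enter_guest(&v, 0) != 0 || v.yield_count % 2 != 0)
		return 0;	/* in the guest: even */
	exit_guest(&v);
	return v.yield_count % 2 == 1;	/* out again: odd */
}
```

In the pre-patch ordering, the aborted entry would have incremented the count anyway, flipping the LSB without the vcpu ever running.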
[PATCH 08/11] KVM: PPC: Book3S HV: Restructure kvmppc_hv_entry to be a subroutine
We have two paths into and out of the low-level guest entry and exit code: from a vcpu task via kvmppc_hv_entry_trampoline, and from the system reset vector for an offline secondary thread on POWER7 via kvm_start_guest. Currently both just branch to kvmppc_hv_entry to enter the guest, and on guest exit, we test the vcpu physical thread ID to detect which way we came in and thus whether we should return to the vcpu task or go back to nap mode. In order to make the code flow clearer, and to keep the code relating to each flow together, this turns kvmppc_hv_entry into a subroutine that follows the normal conventions for call and return. This means that kvmppc_hv_entry_trampoline() and kvmppc_hv_entry() now establish normal stack frames, and we use the normal stack slots for saving return addresses rather than local_paca->kvm_hstate.vmhandler. Apart from that this is mostly moving code around unchanged. Signed-off-by: Paul Mackerras --- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 344 +--- 1 file changed, 178 insertions(+), 166 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index 023d8600..d9ab139 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -62,8 +62,11 @@ kvmppc_skip_Hinterrupt: * LR = return address to continue at after eventually re-enabling MMU */ _GLOBAL(kvmppc_hv_entry_trampoline) + mflrr0 + std r0, PPC_LR_STKOFF(r1) + stdur1, -112(r1) mfmsr r10 - LOAD_REG_ADDR(r5, kvmppc_hv_entry) + LOAD_REG_ADDR(r5, kvmppc_call_hv_entry) li r0,MSR_RI andcr0,r10,r0 li r6,MSR_IR | MSR_DR @@ -73,11 +76,103 @@ _GLOBAL(kvmppc_hv_entry_trampoline) mtsrr1 r6 RFI -/** - ** - * Entry code * - ** - */ +kvmppc_call_hv_entry: + bl kvmppc_hv_entry + + /* Back from guest - restore host state and return to caller */ + + /* Restore host DABR and DABRX */ + ld r5,HSTATE_DABR(r13) + li r6,7 + mtspr SPRN_DABR,r5 + mtspr SPRN_DABRX,r6 + + /* Restore SPRG3 */ + ld r3,PACA_SPRG3(r13) + 
mtspr SPRN_SPRG3,r3 + + /* +* Reload DEC. HDEC interrupts were disabled when +* we reloaded the host's LPCR value. +*/ + ld r3, HSTATE_DECEXP(r13) + mftbr4 + subfr4, r4, r3 + mtspr SPRN_DEC, r4 + + /* Reload the host's PMU registers */ + ld r3, PACALPPACAPTR(r13) /* is the host using the PMU? */ + lbz r4, LPPACA_PMCINUSE(r3) + cmpwi r4, 0 + beq 23f /* skip if not */ + lwz r3, HSTATE_PMC(r13) + lwz r4, HSTATE_PMC + 4(r13) + lwz r5, HSTATE_PMC + 8(r13) + lwz r6, HSTATE_PMC + 12(r13) + lwz r8, HSTATE_PMC + 16(r13) + lwz r9, HSTATE_PMC + 20(r13) +BEGIN_FTR_SECTION + lwz r10, HSTATE_PMC + 24(r13) + lwz r11, HSTATE_PMC + 28(r13) +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201) + mtspr SPRN_PMC1, r3 + mtspr SPRN_PMC2, r4 + mtspr SPRN_PMC3, r5 + mtspr SPRN_PMC4, r6 + mtspr SPRN_PMC5, r8 + mtspr SPRN_PMC6, r9 +BEGIN_FTR_SECTION + mtspr SPRN_PMC7, r10 + mtspr SPRN_PMC8, r11 +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201) + ld r3, HSTATE_MMCR(r13) + ld r4, HSTATE_MMCR + 8(r13) + ld r5, HSTATE_MMCR + 16(r13) + mtspr SPRN_MMCR1, r4 + mtspr SPRN_MMCRA, r5 + mtspr SPRN_MMCR0, r3 + isync +23: + + /* +* For external and machine check interrupts, we need +* to call the Linux handler to process the interrupt. +* We do that by jumping to absolute address 0x500 for +* external interrupts, or the machine_check_fwnmi label +* for machine checks (since firmware might have patched +* the vector area at 0x200). The [h]rfid at the end of the +* handler will return to the book3s_hv_interrupts.S code. +* For other interrupts we do the rfid to get back +* to the book3s_hv_interrupts.S code here. +*/ + ld r8, 112+PPC_LR_STKOFF(r1) + addir1, r1, 112 + ld r7, HSTATE_HOST_MSR(r13) + + cmpwi cr1, r12, BOOK3S_INTERRUPT_MACHINE_CHECK + cmpwi r12, BOOK3S_INTERRUPT_EXTERNAL +BEGIN_FTR_SECTION + beq 11f +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206) + + /* RFI into the highmem handler, or branch to interrupt handler */ +
[PATCH 07/11] KVM: PPC: Book3S HV: Implement H_CONFER
The H_CONFER hypercall is used when a guest vcpu is spinning on a lock held by another vcpu which has been preempted, and the spinning vcpu wishes to give its timeslice to the lock holder. We implement this in the straightforward way using kvm_vcpu_yield_to(). Signed-off-by: Paul Mackerras --- arch/powerpc/kvm/book3s_hv.c | 9 + 1 file changed, 9 insertions(+) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 1a10afa..0bb23a9 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -567,6 +567,15 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu) } break; case H_CONFER: + target = kvmppc_get_gpr(vcpu, 4); + if (target == -1) + break; + tvcpu = kvmppc_find_vcpu(vcpu->kvm, target); + if (!tvcpu) { + ret = H_PARAMETER; + break; + } + kvm_vcpu_yield_to(tvcpu); break; case H_REGISTER_VPA: ret = do_h_register_vpa(vcpu, kvmppc_get_gpr(vcpu, 4), -- 1.8.4.rc3
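The control flow of the new H_CONFER case is simple enough to model in plain C. In this sketch, vcpu_exists and yielded_to stand in for kvmppc_find_vcpu() and kvm_vcpu_yield_to(); H_SUCCESS and H_PARAMETER use the usual PAPR values (0 and -4). None of these names are the kernel's.

```c
#include <assert.h>

#define H_SUCCESS   0
#define H_PARAMETER (-4)
#define NR_VCPUS    4

static int vcpu_exists[NR_VCPUS] = { 1, 1, 0, 0 };
static long yielded_to = -1;

/* Model of the H_CONFER case: a target of -1 ("confer to any vcpu") is
 * treated as a no-op in this patch; an unknown vcpu id is H_PARAMETER;
 * otherwise the caller's timeslice is donated to the target. */
static long do_h_confer(long target)
{
	if (target == -1)
		return H_SUCCESS;	/* nothing to do */
	if (target < 0 || target >= NR_VCPUS || !vcpu_exists[target])
		return H_PARAMETER;
	yielded_to = target;		/* models kvm_vcpu_yield_to(tvcpu) */
	return H_SUCCESS;
}
```

The real implementation gets the target vcpu id from GPR4, as the hcall ABI passes arguments in r4 and up.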
[PATCH 01/11] KVM: PPC: Book3S HV: Save/restore SIAR and SDAR along with other PMU registers
Currently we are not saving and restoring the SIAR and SDAR registers in the PMU (performance monitor unit) on guest entry and exit. The result is that performance monitoring tools in the guest could get false information about where a program was executing and what data it was accessing at the time of a performance monitor interrupt. This fixes it by saving and restoring these registers along with the other PMU registers on guest entry/exit. This also provides a way for userspace to access these values for a vcpu via the one_reg interface. Signed-off-by: Paul Mackerras --- arch/powerpc/include/asm/kvm_host.h | 2 ++ arch/powerpc/kernel/asm-offsets.c | 2 ++ arch/powerpc/kvm/book3s_hv.c| 12 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 8 4 files changed, 24 insertions(+) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 3328353..91b833d 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -498,6 +498,8 @@ struct kvm_vcpu_arch { u64 mmcr[3]; u32 pmc[8]; + u64 siar; + u64 sdar; #ifdef CONFIG_KVM_EXIT_TIMING struct mutex exit_timing_lock; diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index 26098c2..6a7916d 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -505,6 +505,8 @@ int main(void) DEFINE(VCPU_PRODDED, offsetof(struct kvm_vcpu, arch.prodded)); DEFINE(VCPU_MMCR, offsetof(struct kvm_vcpu, arch.mmcr)); DEFINE(VCPU_PMC, offsetof(struct kvm_vcpu, arch.pmc)); + DEFINE(VCPU_SIAR, offsetof(struct kvm_vcpu, arch.siar)); + DEFINE(VCPU_SDAR, offsetof(struct kvm_vcpu, arch.sdar)); DEFINE(VCPU_SLB, offsetof(struct kvm_vcpu, arch.slb)); DEFINE(VCPU_SLB_MAX, offsetof(struct kvm_vcpu, arch.slb_max)); DEFINE(VCPU_SLB_NR, offsetof(struct kvm_vcpu, arch.slb_nr)); diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 8aadd23..29bdeca2 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ 
b/arch/powerpc/kvm/book3s_hv.c @@ -749,6 +749,12 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id, union kvmppc_one_reg *val) i = id - KVM_REG_PPC_PMC1; *val = get_reg_val(id, vcpu->arch.pmc[i]); break; + case KVM_REG_PPC_SIAR: + *val = get_reg_val(id, vcpu->arch.siar); + break; + case KVM_REG_PPC_SDAR: + *val = get_reg_val(id, vcpu->arch.sdar); + break; #ifdef CONFIG_VSX case KVM_REG_PPC_FPR0 ... KVM_REG_PPC_FPR31: if (cpu_has_feature(CPU_FTR_VSX)) { @@ -833,6 +839,12 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id, union kvmppc_one_reg *val) i = id - KVM_REG_PPC_PMC1; vcpu->arch.pmc[i] = set_reg_val(id, *val); break; + case KVM_REG_PPC_SIAR: + vcpu->arch.siar = set_reg_val(id, *val); + break; + case KVM_REG_PPC_SDAR: + vcpu->arch.sdar = set_reg_val(id, *val); + break; #ifdef CONFIG_VSX case KVM_REG_PPC_FPR0 ... KVM_REG_PPC_FPR31: if (cpu_has_feature(CPU_FTR_VSX)) { diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index 60dce5b..bfb4b0a 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -196,8 +196,12 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201) ld r3, VCPU_MMCR(r4) ld r5, VCPU_MMCR + 8(r4) ld r6, VCPU_MMCR + 16(r4) + ld r7, VCPU_SIAR(r4) + ld r8, VCPU_SDAR(r4) mtspr SPRN_MMCR1, r5 mtspr SPRN_MMCRA, r6 + mtspr SPRN_SIAR, r7 + mtspr SPRN_SDAR, r8 mtspr SPRN_MMCR0, r3 isync @@ -1122,9 +1126,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206) std r3, VCPU_MMCR(r9) /* if not, set saved MMCR0 to FC */ b 22f 21:mfspr r5, SPRN_MMCR1 + mfspr r7, SPRN_SIAR + mfspr r8, SPRN_SDAR std r4, VCPU_MMCR(r9) std r5, VCPU_MMCR + 8(r9) std r6, VCPU_MMCR + 16(r9) + std r7, VCPU_SIAR(r9) + std r8, VCPU_SDAR(r9) mfspr r3, SPRN_PMC1 mfspr r4, SPRN_PMC2 mfspr r5, SPRN_PMC3 -- 1.8.4.rc3
[PATCH 04/11] KVM: PPC: Book3S HV: Add GET/SET_ONE_REG interface for LPCR
This adds the ability for userspace to read and write the LPCR (Logical Partitioning Control Register) value relating to a guest via the GET/SET_ONE_REG interface. There is only one LPCR value for the guest, which can be accessed through any vcpu. Userspace can only modify the following fields of the LPCR value: DPFDDefault prefetch depth ILE Interrupt little-endian TC Translation control (secondary HPT hash group search disable) Signed-off-by: Paul Mackerras --- Documentation/virtual/kvm/api.txt | 1 + arch/powerpc/include/asm/reg.h | 2 ++ arch/powerpc/include/uapi/asm/kvm.h | 1 + arch/powerpc/kvm/book3s_hv.c| 21 + 4 files changed, 25 insertions(+) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index c36ff9af..1030ac9 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -1835,6 +1835,7 @@ registers, find a list below: PPC | KVM_REG_PPC_PID | 64 PPC | KVM_REG_PPC_ACOP | 64 PPC | KVM_REG_PPC_VRSAVE | 32 + PPC | KVM_REG_PPC_LPCR | 64 PPC | KVM_REG_PPC_TM_GPR0 | 64 ... 
PPC | KVM_REG_PPC_TM_GPR31 | 64 diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index 342e4ea..3fc0d06 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -275,6 +275,7 @@ #define LPCR_ISL (1ul << (63-2)) #define LPCR_VC_SH (63-2) #define LPCR_DPFD_SH (63-11) +#define LPCR_DPFD(7ul << LPCR_DPFD_SH) #define LPCR_VRMASD (0x1ful << (63-16)) #define LPCR_VRMA_L (1ul << (63-12)) #define LPCR_VRMA_LP0(1ul << (63-15)) @@ -291,6 +292,7 @@ #define LPCR_PECE2 0x1000 /* machine check etc can cause exit */ #define LPCR_MER 0x0800 /* Mediated External Exception */ #define LPCR_MER_SH 11 +#define LPCR_TC 0x0200 /* Translation control */ #define LPCR_LPES0x000c #define LPCR_LPES0 0x0008 /* LPAR Env selector 0 */ #define LPCR_LPES1 0x0004 /* LPAR Env selector 1 */ diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index b98bf3f..e42127d 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -533,6 +533,7 @@ struct kvm_get_htab_header { #define KVM_REG_PPC_ACOP (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb3) #define KVM_REG_PPC_VRSAVE (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb4) +#define KVM_REG_PPC_LPCR (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb5) /* Transactional Memory checkpointed state: * This is all GPRs, all VSX regs and a subset of SPRs diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index b930caf..9c878d7 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -714,6 +714,21 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu, return 0; } +static void kvmppc_set_lpcr(struct kvm_vcpu *vcpu, u64 new_lpcr) +{ + struct kvm *kvm = vcpu->kvm; + u64 mask; + + mutex_lock(&kvm->lock); + /* +* Userspace can only modify DPFD (default prefetch depth), +* ILE (interrupt little-endian) and TC (translation control). 
+*/ + mask = LPCR_DPFD | LPCR_ILE | LPCR_TC; + kvm->arch.lpcr = (kvm->arch.lpcr & ~mask) | (new_lpcr & mask); + mutex_unlock(&kvm->lock); +} + int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id, union kvmppc_one_reg *val) { int r = 0; @@ -796,6 +811,9 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id, union kvmppc_one_reg *val) case KVM_REG_PPC_TB_OFFSET: *val = get_reg_val(id, vcpu->arch.vcore->tb_offset); break; + case KVM_REG_PPC_LPCR: + *val = get_reg_val(id, vcpu->kvm->arch.lpcr); + break; default: r = -EINVAL; break; @@ -900,6 +918,9 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id, union kvmppc_one_reg *val) vcpu->arch.vcore->tb_offset = ALIGN(set_reg_val(id, *val), 1UL << 24); break; + case KVM_REG_PPC_LPCR: + kvmppc_set_lpcr(vcpu, set_reg_val(id, *val)); + break; default: r = -EINVAL; break; -- 1.8.4.rc3
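The guarded update in kvmppc_set_lpcr() boils down to a classic read-modify-write under a mask. Here is a stand-alone sketch; the bit positions are my reading of reg.h (DPFD at bits 52-54, ILE at bit 25, TC at bit 9), so double-check them against the real header before relying on them.

```c
#include <assert.h>
#include <stdint.h>

#define LPCR_DPFD ((uint64_t)7 << 52)	/* default prefetch depth */
#define LPCR_ILE  ((uint64_t)1 << 25)	/* interrupt little-endian */
#define LPCR_TC   ((uint64_t)1 << 9)	/* translation control */

/* Only the whitelisted fields take the userspace-supplied value; every
 * other LPCR bit keeps its current (kernel-controlled) setting. */
static uint64_t masked_lpcr_update(uint64_t cur, uint64_t new_lpcr)
{
	uint64_t mask = LPCR_DPFD | LPCR_ILE | LPCR_TC;

	return (cur & ~mask) | (new_lpcr & mask);
}
```

This shape is why extending the whitelist (as the later POWER8 patch does for AIL) is a one-line change to the mask.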
[PATCH 03/11] KVM: PPC: Book3S: Add GET/SET_ONE_REG interface for VRSAVE
The VRSAVE register value for a vcpu is accessible through the GET/SET_SREGS interface for Book E processors, but not for Book 3S processors. In order to make this accessible for Book 3S processors, this adds a new register identifier for GET/SET_ONE_REG, and adds the code to implement it. Signed-off-by: Paul Mackerras --- Documentation/virtual/kvm/api.txt | 1 + arch/powerpc/include/uapi/asm/kvm.h | 2 ++ arch/powerpc/kvm/book3s.c | 10 ++ 3 files changed, 13 insertions(+) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 9486e5a..c36ff9af 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -1834,6 +1834,7 @@ registers, find a list below: PPC | KVM_REG_PPC_TCSCR| 64 PPC | KVM_REG_PPC_PID | 64 PPC | KVM_REG_PPC_ACOP | 64 + PPC | KVM_REG_PPC_VRSAVE | 32 PPC | KVM_REG_PPC_TM_GPR0 | 64 ... PPC | KVM_REG_PPC_TM_GPR31 | 64 diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index a8124fe..b98bf3f 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -532,6 +532,8 @@ struct kvm_get_htab_header { #define KVM_REG_PPC_PID(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb2) #define KVM_REG_PPC_ACOP (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb3) +#define KVM_REG_PPC_VRSAVE (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb4) + /* Transactional Memory checkpointed state: * This is all GPRs, all VSX regs and a subset of SPRs */ diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index 700df6f..f97369d 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -528,6 +528,9 @@ int kvm_vcpu_ioctl_get_one_reg(struct kvm_vcpu *vcpu, struct kvm_one_reg *reg) } val = get_reg_val(reg->id, vcpu->arch.vscr.u[3]); break; + case KVM_REG_PPC_VRSAVE: + val = get_reg_val(reg->id, vcpu->arch.vrsave); + break; #endif /* CONFIG_ALTIVEC */ case KVM_REG_PPC_DEBUG_INST: { u32 opcode = INS_TW; @@ -605,6 +608,13 @@ int kvm_vcpu_ioctl_set_one_reg(struct 
kvm_vcpu *vcpu, struct kvm_one_reg *reg) } vcpu->arch.vscr.u[3] = set_reg_val(reg->id, val); break; + case KVM_REG_PPC_VRSAVE: + if (!cpu_has_feature(CPU_FTR_ALTIVEC)) { + r = -ENXIO; + break; + } + vcpu->arch.vrsave = set_reg_val(reg->id, val); + break; #endif /* CONFIG_ALTIVEC */ #ifdef CONFIG_KVM_XICS case KVM_REG_PPC_ICP_STATE: -- 1.8.4.rc3
[PATCH 06/11] KVM: PPC: Book3S HV: Support POWER6 compatibility mode on POWER7
This enables us to use the Processor Compatibility Register (PCR) on POWER7 to put the processor into architecture 2.05 compatibility mode when running a guest. In this mode the new instructions and registers that were introduced on POWER7 are disabled in user mode. This includes all the VSX facilities plus several other instructions such as ldbrx, stdbrx, popcntw, popcntd, etc. To select this mode, we have a new register accessible through the set/get_one_reg interface, called KVM_REG_PPC_ARCH_COMPAT. Setting this to zero gives the full set of capabilities of the processor. Setting it to one of the "logical" PVR values defined in PAPR puts the vcpu into the compatibility mode for the corresponding architecture level. The supported values are: 0x0f02 Architecture 2.05 (POWER6) 0x0f03 Architecture 2.06 (POWER7) 0x0f13 Architecture 2.06+ (POWER7+) Since the PCR is per-core, the architecture compatibility level and the corresponding PCR value are stored in the struct kvmppc_vcore, and are therefore shared between all vcpus in a virtual core. Signed-off-by: Paul Mackerras --- Documentation/virtual/kvm/api.txt | 1 + arch/powerpc/include/asm/kvm_host.h | 2 ++ arch/powerpc/include/asm/reg.h | 11 +++ arch/powerpc/include/uapi/asm/kvm.h | 3 +++ arch/powerpc/kernel/asm-offsets.c | 1 + arch/powerpc/kvm/book3s_hv.c| 35 + arch/powerpc/kvm/book3s_hv_rmhandlers.S | 11 +-- 7 files changed, 62 insertions(+), 2 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 34a32b6..f1f300f 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -1837,6 +1837,7 @@ registers, find a list below: PPC | KVM_REG_PPC_VRSAVE | 32 PPC | KVM_REG_PPC_LPCR | 64 PPC | KVM_REG_PPC_PPR | 64 + PPC | KVM_REG_PPC_ARCH_COMPAT | 32 PPC | KVM_REG_PPC_TM_GPR0 | 64 ... 
PPC | KVM_REG_PPC_TM_GPR31 | 64 diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index b0dcd18..5a40270 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -295,6 +295,8 @@ struct kvmppc_vcore { u64 preempt_tb; struct kvm_vcpu *runner; u64 tb_offset; /* guest timebase - host timebase */ + u32 arch_compat; + ulong pcr; }; #define VCORE_ENTRY_COUNT(vc) ((vc)->entry_exit_count & 0xff) diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index 3fc0d06..52ff962 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -305,6 +305,10 @@ #define LPID_RSVD0x3ff /* Reserved LPID for partn switching */ #defineSPRN_HMER 0x150 /* Hardware m? error recovery */ #defineSPRN_HMEER 0x151 /* Hardware m? enable error recovery */ +#define SPRN_PCR 0x152 /* Processor compatibility register */ +#define PCR_VEC_DIS (1ul << (63-0)) /* Vec. disable (pre POWER8) */ +#define PCR_VSX_DIS (1ul << (63-1)) /* VSX disable (pre POWER8) */ +#define PCR_ARCH_205 0x2 /* Architecture 2.05 */ #defineSPRN_HEIR 0x153 /* Hypervisor Emulated Instruction Register */ #define SPRN_TLBINDEXR 0x154 /* P7 TLB control register */ #define SPRN_TLBVPNR 0x155 /* P7 TLB control register */ @@ -1095,6 +1099,13 @@ #define PVR_BE 0x0070 #define PVR_PA6T 0x0090 +/* "Logical" PVR values defined in PAPR, representing architecture levels */ +#define PVR_ARCH_204 0x0f01 +#define PVR_ARCH_205 0x0f02 +#define PVR_ARCH_206 0x0f03 +#define PVR_ARCH_206p 0x0f13 +#define PVR_ARCH_207 0x0f04 + /* Macros for setting and retrieving special purpose registers */ #ifndef __ASSEMBLY__ #define mfmsr()({unsigned long rval; \ diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index fab6bc1..e420d46 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -536,6 +536,9 @@ struct kvm_get_htab_header { #define KVM_REG_PPC_LPCR (KVM_REG_PPC | 
KVM_REG_SIZE_U32 | 0xb5) #define KVM_REG_PPC_PPR(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb6) +/* Architecture compatibility level */ +#define KVM_REG_PPC_ARCH_COMPAT(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb7) + /* Transactional Memory checkpointed state: * This is all GPRs, all VSX regs and a subset of SPRs */ diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index 5c6ea96..115dd64 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -522,6 +522,7 @@ int main(void) DEFINE(VCORE_IN_GUEST, offsetof(struct kvmppc_vcore, in_guest)); DEFINE(VCORE_NAPPING_THREADS, offsetof(struct kvmppc_vcore, napping_threads)); DEFINE(VCORE_TB_OF
[PATCH 05/11] KVM: PPC: Book3S HV: Add support for guest Program Priority Register
POWER7 and later IBM server processors have a register called the Program Priority Register (PPR), which controls the priority of each hardware CPU SMT thread, and affects how fast it runs compared to other SMT threads. This priority can be controlled by writing to the PPR or by use of a set of instructions of the form or rN,rN,rN which are otherwise no-ops but have been defined to set the priority to particular levels. This adds code to context switch the PPR when entering and exiting guests and to make the PPR value accessible through the SET/GET_ONE_REG interface. When entering the guest, we set the PPR as late as possible, because if we are setting a low thread priority it will make the code run slowly from that point on. Similarly, the first-level interrupt handlers save the PPR value in the PACA very early on, and set the thread priority to the medium level, so that the interrupt handling code runs at a reasonable speed. Signed-off-by: Paul Mackerras --- Documentation/virtual/kvm/api.txt | 1 + arch/powerpc/include/asm/exception-64s.h | 8 arch/powerpc/include/asm/kvm_book3s_asm.h | 1 + arch/powerpc/include/asm/kvm_host.h | 1 + arch/powerpc/include/uapi/asm/kvm.h | 1 + arch/powerpc/kernel/asm-offsets.c | 2 ++ arch/powerpc/kvm/book3s_hv.c | 6 ++ arch/powerpc/kvm/book3s_hv_rmhandlers.S | 12 +++- 8 files changed, 31 insertions(+), 1 deletion(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 1030ac9..34a32b6 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -1836,6 +1836,7 @@ registers, find a list below: PPC | KVM_REG_PPC_ACOP | 64 PPC | KVM_REG_PPC_VRSAVE | 32 PPC | KVM_REG_PPC_LPCR | 64 + PPC | KVM_REG_PPC_PPR | 64 PPC | KVM_REG_PPC_TM_GPR0 | 64 ... 
PPC | KVM_REG_PPC_TM_GPR31 | 64 diff --git a/arch/powerpc/include/asm/exception-64s.h b/arch/powerpc/include/asm/exception-64s.h index 07ca627..b86c4db 100644 --- a/arch/powerpc/include/asm/exception-64s.h +++ b/arch/powerpc/include/asm/exception-64s.h @@ -203,6 +203,10 @@ do_kvm_##n: \ ld r10,area+EX_CFAR(r13); \ std r10,HSTATE_CFAR(r13); \ END_FTR_SECTION_NESTED(CPU_FTR_CFAR,CPU_FTR_CFAR,947); \ + BEGIN_FTR_SECTION_NESTED(948) \ + ld r10,area+EX_PPR(r13); \ + std r10,HSTATE_PPR(r13);\ + END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948);\ ld r10,area+EX_R10(r13); \ stw r9,HSTATE_SCRATCH1(r13);\ ld r9,area+EX_R9(r13); \ @@ -216,6 +220,10 @@ do_kvm_##n: \ ld r10,area+EX_R10(r13); \ beq 89f;\ stw r9,HSTATE_SCRATCH1(r13);\ + BEGIN_FTR_SECTION_NESTED(948) \ + ld r9,area+EX_PPR(r13);\ + std r9,HSTATE_PPR(r13); \ + END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948);\ ld r9,area+EX_R9(r13); \ std r12,HSTATE_SCRATCH0(r13); \ li r12,n; \ diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h b/arch/powerpc/include/asm/kvm_book3s_asm.h index 9039d3c..22f4606 100644 --- a/arch/powerpc/include/asm/kvm_book3s_asm.h +++ b/arch/powerpc/include/asm/kvm_book3s_asm.h @@ -101,6 +101,7 @@ struct kvmppc_host_state { #endif #ifdef CONFIG_PPC_BOOK3S_64 u64 cfar; + u64 ppr; #endif }; diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 9741bf0..b0dcd18 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -464,6 +464,7 @@ struct kvm_vcpu_arch { u32 ctrl; ulong dabr; ulong cfar; + ulong ppr; #endif u32 vrsave; /* also USPRG0 */ u32 mmucr; diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index e42127d..fab6bc1 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -534,6 +534,7 @@ struct kvm_get_htab_header { #define KVM_REG_PPC_VRSAVE (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb4) #define KVM_REG_PPC_LPCR (KVM_REG_PPC | 
KVM_REG_SIZE_U32 | 0xb5) +#define KVM_REG_PPC_PPR
[PATCH 02/11] KVM: PPC: Book3S HV: Implement timebase offset for guests
This allows guests to have a different timebase origin from the host. This is needed for migration, where a guest can migrate from one host to another and the two hosts might have a different timebase origin. However, the timebase seen by the guest must not go backwards, and should go forwards only by a small amount corresponding to the time taken for the migration. Therefore this provides a new per-vcpu value accessed via the one_reg interface using the new KVM_REG_PPC_TB_OFFSET identifier. This value defaults to 0 and is not modified by KVM. On entering the guest, this value is added onto the timebase, and on exiting the guest, it is subtracted from the timebase. This is only supported for recent POWER hardware which has the TBU40 (timebase upper 40 bits) register. Writing to the TBU40 register only alters the upper 40 bits of the timebase, leaving the lower 24 bits unchanged. This provides a way to modify the timebase for guest migration without disturbing the synchronization of the timebase registers across CPU cores. The kernel rounds up the value given to a multiple of 2^24. Timebase values stored in KVM structures (struct kvm_vcpu, struct kvmppc_vcore, etc.) are stored as host timebase values. The timebase values in the dispatch trace log need to be guest timebase values, however, since that is read directly by the guest. This moves the setting of vcpu->arch.dec_expires on guest exit to a point after we have restored the host timebase so that vcpu->arch.dec_expires is a host timebase value. Signed-off-by: Paul Mackerras --- This differs from the previous version of this patch in that the value given to the set_one_reg interface is rounded up, as suggested by David Gibson, rather than truncated. 
Documentation/virtual/kvm/api.txt | 1 + arch/powerpc/include/asm/kvm_host.h | 1 + arch/powerpc/include/asm/reg.h | 1 + arch/powerpc/include/uapi/asm/kvm.h | 3 ++ arch/powerpc/kernel/asm-offsets.c | 1 + arch/powerpc/kvm/book3s_hv.c| 10 ++- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 50 +++-- 7 files changed, 57 insertions(+), 10 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 341058c..9486e5a 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -1810,6 +1810,7 @@ registers, find a list below: PPC | KVM_REG_PPC_TLB3PS | 32 PPC | KVM_REG_PPC_EPTCFG | 32 PPC | KVM_REG_PPC_ICP_STATE | 64 + PPC | KVM_REG_PPC_TB_OFFSET| 64 PPC | KVM_REG_PPC_SPMC1| 32 PPC | KVM_REG_PPC_SPMC2| 32 PPC | KVM_REG_PPC_IAMR | 64 diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 91b833d..9741bf0 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -294,6 +294,7 @@ struct kvmppc_vcore { u64 stolen_tb; u64 preempt_tb; struct kvm_vcpu *runner; + u64 tb_offset; /* guest timebase - host timebase */ }; #define VCORE_ENTRY_COUNT(vc) ((vc)->entry_exit_count & 0xff) diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index 5d7d9c2..342e4ea 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -243,6 +243,7 @@ #define SPRN_TBRU 0x10D /* Time Base Read Upper Register (user, R/O) */ #define SPRN_TBWL 0x11C /* Time Base Lower Register (super, R/W) */ #define SPRN_TBWU 0x11D /* Time Base Upper Register (super, R/W) */ +#define SPRN_TBU40 0x11E /* Timebase upper 40 bits (hyper, R/W) */ #define SPRN_SPURR 0x134 /* Scaled PURR */ #define SPRN_HSPRG00x130 /* Hypervisor Scratch 0 */ #define SPRN_HSPRG10x131 /* Hypervisor Scratch 1 */ diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 7ed41c0..a8124fe 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h 
+++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -504,6 +504,9 @@ struct kvm_get_htab_header { #define KVM_REG_PPC_TLB3PS (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9a) #define KVM_REG_PPC_EPTCFG (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9b) +/* Timebase offset */ +#define KVM_REG_PPC_TB_OFFSET (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x9c) + /* POWER8 registers */ #define KVM_REG_PPC_SPMC1 (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9d) #define KVM_REG_PPC_SPMC2 (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9e) diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index 6a7916d..ccb42cd 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -520,6 +520,7 @@ int main(void) DEFINE(VCORE_NAP_COUNT, offsetof(struct kvmppc_vcore, nap_count)); DEFINE(VCORE_IN_GUEST, offsetof(struct kvmppc_vcore, in_guest)); DEFINE(VCORE_NAPPING_THREADS, offsetof(struct kvmppc_vcore, napping_threads)); + DEFINE(VCORE_TB_OFFSET, offsetof(struct
[PATCH 00/11] HV KVM improvements in preparation for POWER8 support
This series of patches is based on Alex Graf's kvm-ppc-queue branch. It fixes some bugs, makes some more registers accessible through the one_reg interface, and implements some missing features such as support for the compatibility modes in recent POWER cpus and support for the guest having a different timebase origin from the host. These patches are all useful on POWER7 and will be needed for good POWER8 support. Please apply. Documentation/virtual/kvm/api.txt | 5 + arch/powerpc/include/asm/exception-64s.h | 8 + arch/powerpc/include/asm/kvm_book3s_asm.h | 1 + arch/powerpc/include/asm/kvm_host.h | 6 + arch/powerpc/include/asm/reg.h| 14 + arch/powerpc/include/uapi/asm/kvm.h | 10 + arch/powerpc/kernel/asm-offsets.c | 6 + arch/powerpc/kvm/book3s.c | 10 + arch/powerpc/kvm/book3s_hv.c | 96 +- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 544 +- 10 files changed, 469 insertions(+), 231 deletions(-) Paul.
[PATCH v2] kvm-unit-tests: VMX: Test suite for preemption timer
Test cases for preemption timer in nested VMX. Two aspects are tested: 1. Save preemption timer on VMEXIT if relevant bit set in EXIT_CONTROL 2. Test a relevant bug of KVM. The bug will not save preemption timer value if exit L2->L0 for some reason and enter L0->L2. Thus preemption timer will never trigger if the value is large enough. 3. Some other aspects are tested, e.g. preempt without save, preempt when value is 0. Signed-off-by: Arthur Chunqi Li --- ChangeLog to v1: 1. Add test of EXI_SAVE_PREEMPT enable and PIN_PREEMPT disable 2. Add test of PIN_PREEMPT enable and EXI_SAVE_PREEMPT enable/disable 3. Add test of preemption value is 0 x86/vmx.h |3 + x86/vmx_tests.c | 175 +++ 2 files changed, 178 insertions(+) diff --git a/x86/vmx.h b/x86/vmx.h index 28595d8..ebc8cfd 100644 --- a/x86/vmx.h +++ b/x86/vmx.h @@ -210,6 +210,7 @@ enum Encoding { GUEST_ACTV_STATE= 0x4826ul, GUEST_SMBASE= 0x4828ul, GUEST_SYSENTER_CS = 0x482aul, + PREEMPT_TIMER_VALUE = 0x482eul, /* 32-Bit Host State Fields */ HOST_SYSENTER_CS= 0x4c00ul, @@ -331,6 +332,7 @@ enum Ctrl_exi { EXI_LOAD_PERF = 1UL << 12, EXI_INTA= 1UL << 15, EXI_LOAD_EFER = 1UL << 21, + EXI_SAVE_PREEMPT= 1UL << 22, }; enum Ctrl_ent { @@ -342,6 +344,7 @@ enum Ctrl_pin { PIN_EXTINT = 1ul << 0, PIN_NMI = 1ul << 3, PIN_VIRT_NMI= 1ul << 5, + PIN_PREEMPT = 1ul << 6, }; enum Ctrl0 { diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c index c1b39f4..2e32031 100644 --- a/x86/vmx_tests.c +++ b/x86/vmx_tests.c @@ -1,4 +1,30 @@ #include "vmx.h" +#include "msr.h" +#include "processor.h" + +volatile u32 stage; + +static inline void vmcall() +{ + asm volatile("vmcall"); +} + +static inline void set_stage(u32 s) +{ + barrier(); + stage = s; + barrier(); +} + +static inline u32 get_stage() +{ + u32 s; + + barrier(); + s = stage; + barrier(); + return s; +} void basic_init() { @@ -76,6 +102,153 @@ int vmenter_exit_handler() return VMX_TEST_VMEXIT; } +u32 preempt_scale; +volatile unsigned long long tsc_val; +volatile u32 preempt_val; + +void 
preemption_timer_init() +{ + u32 ctrl_exit; + + // Enable EXI_SAVE_PREEMPT with PIN_PREEMPT dieabled + ctrl_exit = (vmcs_read(EXI_CONTROLS) | + EXI_SAVE_PREEMPT) & ctrl_exit_rev.clr; + vmcs_write(EXI_CONTROLS, ctrl_exit); + preempt_val = 1000; + vmcs_write(PREEMPT_TIMER_VALUE, preempt_val); + set_stage(0); + preempt_scale = rdmsr(MSR_IA32_VMX_MISC) & 0x1F; +} + +void preemption_timer_main() +{ + int i, j; + + if (!(ctrl_pin_rev.clr & PIN_PREEMPT)) { + printf("\tPreemption timer is not supported\n"); + return; + } + if (!(ctrl_exit_rev.clr & EXI_SAVE_PREEMPT)) + printf("\tSave preemption value is not supported\n"); + else { + // Test EXI_SAVE_PREEMPT enable and PIN_PREEMPT disable + set_stage(0); + // Consume enough time to let L2->L0->L2 occurs + for(i = 0; i < 10; i++) + for (j = 0; j < 1; j++); + vmcall(); + // Test PIN_PREEMPT enable and EXI_SAVE_PREEMPT enable/disable + set_stage(1); + vmcall(); + // Test both enable + if (get_stage() == 2) + vmcall(); + } + // Test the bug of reseting preempt value when L2->L0->L2 + set_stage(3); + vmcall(); + tsc_val = rdtsc(); + while (1) { + if (((rdtsc() - tsc_val) >> preempt_scale) + > 10 * preempt_val) { + report("Preemption timer timeout", 0); + break; + } + if (get_stage() == 4) + break; + } + // Test preempt val is 0 + set_stage(4); + report("Preemption timer, val=0", 0); +} + +int preemption_timer_exit_handler() +{ + u64 guest_rip; + ulong reason; + u32 insn_len; + u32 ctrl_exit; + u32 ctrl_pin; + + guest_rip = vmcs_read(GUEST_RIP); + reason = vmcs_read(EXI_REASON) & 0xff; + insn_len = vmcs_read(EXI_INST_LEN); + switch (reason) { + case VMX_PREEMPT: + switch (get_stage()) { + case 3: + if (((rdtsc() - tsc_val) >> preempt_scale) < preempt_val) + report("Preemption timer timeout", 0); + else + report("Preemption timer timeout", 1); + set_stage(get_stage() + 1); + b
[PATCH v4] KVM: nVMX: Fully support of nested VMX preemption timer
This patch contains the following two changes: 1. Fix the bug in nested preemption timer support. If vmexit L2->L0 with some reasons not emulated by L1, preemption timer value should be save in such exits. 2. Add support of "Save VMX-preemption timer value" VM-Exit controls to nVMX. With this patch, nested VMX preemption timer features are fully supported. Signed-off-by: Arthur Chunqi Li --- ChangeLog to v3: Move nested_adjust_preemption_timer to the latest place just before vmenter. Some minor changes. arch/x86/include/uapi/asm/msr-index.h |1 + arch/x86/kvm/vmx.c| 49 +++-- 2 files changed, 48 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h index bb04650..b93e09a 100644 --- a/arch/x86/include/uapi/asm/msr-index.h +++ b/arch/x86/include/uapi/asm/msr-index.h @@ -536,6 +536,7 @@ /* MSR_IA32_VMX_MISC bits */ #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL << 29) +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE 0x1F /* AMD-V MSRs */ #define MSR_VM_CR 0xc0010114 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 1f1da43..f364d16 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -374,6 +374,8 @@ struct nested_vmx { */ struct page *apic_access_page; u64 msr_ia32_feature_control; + /* Set if vmexit is L2->L1 */ + bool nested_vmx_exit; }; #define POSTED_INTR_ON 0 @@ -2204,7 +2206,17 @@ static __init void nested_vmx_setup_ctls_msrs(void) #ifdef CONFIG_X86_64 VM_EXIT_HOST_ADDR_SPACE_SIZE | #endif - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT; + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT | + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER; + if (!(nested_vmx_pinbased_ctls_high & + PIN_BASED_VMX_PREEMPTION_TIMER) || + !(nested_vmx_exit_ctls_high & + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) { + nested_vmx_exit_ctls_high &= + (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER); + nested_vmx_pinbased_ctls_high &= + (~PIN_BASED_VMX_PREEMPTION_TIMER); + } nested_vmx_exit_ctls_high |= 
(VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | VM_EXIT_LOAD_IA32_EFER); @@ -6707,6 +6719,24 @@ static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2) *info2 = vmcs_read32(VM_EXIT_INTR_INFO); } +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu) +{ + u64 delta_tsc_l1; + u32 preempt_val_l1, preempt_val_l2, preempt_scale; + + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) & + MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE; + preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE); + delta_tsc_l1 = kvm_x86_ops->read_l1_tsc(vcpu, + native_read_tsc()) - vcpu->arch.last_guest_tsc; + preempt_val_l1 = delta_tsc_l1 >> preempt_scale; + if (preempt_val_l2 <= preempt_val_l1) + preempt_val_l2 = 0; + else + preempt_val_l2 -= preempt_val_l1; + vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, preempt_val_l2); +} + /* * The guest has exited. See if we can fix it or if we need userspace * assistance. @@ -6736,9 +6766,11 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu) vmx->nested.nested_run_pending = 0; if (is_guest_mode(vcpu) && nested_vmx_exit_handled(vcpu)) { + vmx->nested.nested_vmx_exit = true; nested_vmx_vmexit(vcpu); return 1; } + vmx->nested.nested_vmx_exit = false; if (exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) { vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY; @@ -7132,6 +7164,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu) debugctlmsr = get_debugctlmsr(); vmx->__launched = vmx->loaded_vmcs->launched; + if (is_guest_mode(vcpu) && !(vmx->nested.nested_vmx_exit)) + nested_adjust_preemption_timer(vcpu); asm( /* Store host registers */ "push %%" _ASM_DX "; push %%" _ASM_BP ";" @@ -7518,6 +7552,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { struct vcpu_vmx *vmx = to_vmx(vcpu); u32 exec_control; + u32 exit_control; vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector); vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector); @@ -7691,7 +7726,10 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, 
struct vmcs12 *vmcs12) * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER * bits are further modified by vmx_set_efer() below. */ - vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl); +
Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling
On 09/06/2013 04:10 AM, Gleb Natapov wrote: > On Wed, Sep 04, 2013 at 02:01:28AM +1000, Alexey Kardashevskiy wrote: >> On 09/03/2013 08:53 PM, Gleb Natapov wrote: >>> On Mon, Sep 02, 2013 at 01:14:29PM +1000, Alexey Kardashevskiy wrote: On 09/01/2013 10:06 PM, Gleb Natapov wrote: > On Wed, Aug 28, 2013 at 06:50:41PM +1000, Alexey Kardashevskiy wrote: >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT >> and H_STUFF_TCE requests targeted an IOMMU TCE table without passing >> them to user space which saves time on switching to user space and back. >> >> Both real and virtual modes are supported. The kernel tries to >> handle a TCE request in the real mode, if fails it passes the request >> to the virtual mode to complete the operation. If it a virtual mode >> handler fails, the request is passed to user space. >> >> The first user of this is VFIO on POWER. Trampolines to the VFIO external >> user API functions are required for this patch. >> >> This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus >> number (LIOBN) with an VFIO IOMMU group fd and enable in-kernel handling >> of map/unmap requests. The device supports a single attribute which is >> a struct with LIOBN and IOMMU fd. When the attribute is set, the device >> establishes the connection between KVM and VFIO. >> >> Tests show that this patch increases transmission speed from 220MB/s >> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card). 
>> >> Signed-off-by: Paul Mackerras >> Signed-off-by: Alexey Kardashevskiy >> >> --- >> >> Changes: >> v9: >> * KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU" >> KVM device >> * release_spapr_tce_table() is not shared between different TCE types >> * reduced the patch size by moving VFIO external API >> trampolines to separate patche >> * moved documentation from Documentation/virtual/kvm/api.txt to >> Documentation/virtual/kvm/devices/spapr_tce_iommu.txt >> >> v8: >> * fixed warnings from check_patch.pl >> >> 2013/07/11: >> * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled >> for KVM_BOOK3S_64 >> * kvmppc_gpa_to_hva_and_get also returns host phys address. Not much >> sense >> for this here but the next patch for hugepages support will use it more. >> >> 2013/07/06: >> * added realmode arch_spin_lock to protect TCE table from races >> in real and virtual modes >> * POWERPC IOMMU API is changed to support real mode >> * iommu_take_ownership and iommu_release_ownership are protected by >> iommu_table's locks >> * VFIO external user API use rewritten >> * multiple small fixes >> >> 2013/06/27: >> * tce_list page is referenced now in order to protect it from accident >> invalidation during H_PUT_TCE_INDIRECT execution >> * added use of the external user VFIO API >> >> 2013/06/05: >> * changed capability number >> * changed ioctl number >> * update the doc article number >> >> 2013/05/20: >> * removed get_user() from real mode handlers >> * kvm_vcpu_arch::tce_tmp usage extended. 
Now real mode handler puts there >> translated TCEs, tries realmode_get_page() on those and if it fails, it >> passes control over the virtual mode handler which tries to finish >> the request handling >> * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit >> on a page >> * The only reason to pass the request to user mode now is when the user >> mode >> did not register TCE table in the kernel, in all other cases the virtual >> mode >> handler is expected to do the job >> --- >> .../virtual/kvm/devices/spapr_tce_iommu.txt| 37 +++ >> arch/powerpc/include/asm/kvm_host.h| 4 + >> arch/powerpc/kvm/book3s_64_vio.c | 310 >> - >> arch/powerpc/kvm/book3s_64_vio_hv.c| 122 >> arch/powerpc/kvm/powerpc.c | 1 + >> include/linux/kvm_host.h | 1 + >> virt/kvm/kvm_main.c| 5 + >> 7 files changed, 477 insertions(+), 3 deletions(-) >> create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt >> >> diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt >> b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt >> new file mode 100644 >> index 000..4bc8fc3 >> --- /dev/null >> +++ b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt >> @@ -0,0 +1,37 @@ >> +SPAPR TCE IOMMU device >> + >> +Capability: KVM_CAP_SPAPR_TCE_IOMMU >> +Architectures: powerpc >> + >> +Device type supported: KVM_DEV_TYPE_SPAPR_TCE_IOMMU >> + >>
Re: [PATCH 2/3] kvm/ppc: IRQ disabling cleanup
On 06.09.2013, at 00:09, Scott Wood wrote: > On Thu, 2013-07-11 at 01:09 +0200, Alexander Graf wrote: >> On 11.07.2013, at 01:08, Scott Wood wrote: >> >>> On 07/10/2013 06:04:53 PM, Alexander Graf wrote: On 11.07.2013, at 01:01, Benjamin Herrenschmidt wrote: > On Thu, 2013-07-11 at 00:57 +0200, Alexander Graf wrote: >>> #ifdef CONFIG_PPC64 >>> + /* >>> + * To avoid races, the caller must have gone directly from having >>> + * interrupts fully-enabled to hard-disabled. >>> + */ >>> + WARN_ON(local_paca->irq_happened != PACA_IRQ_HARD_DIS); >> >> WARN_ON(lazy_irq_pending()); ? > > Different semantics. What you propose will not catch irq_happened == 0 :-) Right, but we only ever reach here after hard_irq_disable() I think. >>> >>> And the WARN_ON helps us ensure that it stays that way. >> >> Heh - ok :). Works for me. > > What's the status on this patch? IIUC it was ok. Ben, could you please verify? Alex
Re: [PATCH] vfio: fix documentation
On Thu, 2013-09-05 at 15:22 -0700, Zi Shen Lim wrote: > Signed-off-by: Zi Shen Lim > --- Applied. Thanks! Alex > Documentation/vfio.txt | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt > index d7993dc..b9ca023 100644 > --- a/Documentation/vfio.txt > +++ b/Documentation/vfio.txt > @@ -167,8 +167,8 @@ group and can access them as follows: > int container, group, device, i; > struct vfio_group_status group_status = > { .argsz = sizeof(group_status) }; > - struct vfio_iommu_x86_info iommu_info = { .argsz = sizeof(iommu_info) }; > - struct vfio_iommu_x86_dma_map dma_map = { .argsz = sizeof(dma_map) }; > + struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) > }; > + struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) }; > struct vfio_device_info device_info = { .argsz = sizeof(device_info) }; > > /* Create a new container */ > @@ -193,7 +193,7 @@ group and can access them as follows: > ioctl(group, VFIO_GROUP_SET_CONTAINER, &container); > > /* Enable the IOMMU model we want */ > - ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU) > + ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU); > > /* Get addition IOMMU info */ > ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info); > @@ -229,7 +229,7 @@ group and can access them as follows: > > irq.index = i; > > - ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &reg); > + ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq); > > /* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQS */ > }
[PATCH] vfio: fix documentation
Signed-off-by: Zi Shen Lim --- Documentation/vfio.txt | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index d7993dc..b9ca023 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -167,8 +167,8 @@ group and can access them as follows: int container, group, device, i; struct vfio_group_status group_status = { .argsz = sizeof(group_status) }; - struct vfio_iommu_x86_info iommu_info = { .argsz = sizeof(iommu_info) }; - struct vfio_iommu_x86_dma_map dma_map = { .argsz = sizeof(dma_map) }; + struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) }; + struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) }; struct vfio_device_info device_info = { .argsz = sizeof(device_info) }; /* Create a new container */ @@ -193,7 +193,7 @@ group and can access them as follows: ioctl(group, VFIO_GROUP_SET_CONTAINER, &container); /* Enable the IOMMU model we want */ - ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU) + ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU); /* Get addition IOMMU info */ ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info); @@ -229,7 +229,7 @@ group and can access them as follows: irq.index = i; - ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, ®); + ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq); /* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQS */ } -- 1.8.1.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] kvm/ppc: IRQ disabling cleanup
On Thu, 2013-07-11 at 01:09 +0200, Alexander Graf wrote: > On 11.07.2013, at 01:08, Scott Wood wrote: > > > On 07/10/2013 06:04:53 PM, Alexander Graf wrote: > >> On 11.07.2013, at 01:01, Benjamin Herrenschmidt wrote: > >> > On Thu, 2013-07-11 at 00:57 +0200, Alexander Graf wrote: > >> >>> #ifdef CONFIG_PPC64 > >> >>> + /* > >> >>> + * To avoid races, the caller must have gone directly from having > >> >>> + * interrupts fully-enabled to hard-disabled. > >> >>> + */ > >> >>> + WARN_ON(local_paca->irq_happened != PACA_IRQ_HARD_DIS); > >> >> > >> >> WARN_ON(lazy_irq_pending()); ? > >> > > >> > Different semantics. What you propose will not catch irq_happened == 0 > >> > :-) > >> Right, but we only ever reach here after hard_irq_disable() I think. > > > > And the WARN_ON helps us ensure that it stays that way. > > Heh - ok :). Works for me. What's the status on this patch? -Scott -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OpenBSD 5.3 guest on KVM
Hi, Gleb.

On Thursday 05 September 2013 21:00:50 +0300, Gleb Natapov wrote:
> > > > Someone had this problem and could solve it somehow? Is there any
> > > > debug information I can provide to help solve this?
> > > For simple troubleshooting try "info status" from the QEMU monitor.
> > ss01:~# telnet localhost 4046
> > Trying ::1...
> > Connected to localhost.
> > Escape character is '^]'.
> > QEMU 1.1.2 monitor - type 'help' for more information
> > (qemu)
> > (qemu)
> > (qemu) info status
> > VM status: running
> > (qemu)
> > (qemu)
> > (qemu) system_powerdown
> > (qemu)
> > (qemu)
> > (qemu) info status
> > VM status: running
> > (qemu)
> >
> > Then the VM freezes.
> And now, after it freezes, do one more "info status" also "info cpus"
> and "info registers". Also what do you mean by "freezes"? Do you see
> that VM started to shutdown after you issued system_powerdown?

ss01:~# telnet localhost 4046
Trying ::1...
Connected to localhost.
Escape character is '^]'.
QEMU 1.1.2 monitor - type 'help' for more information
(qemu)
(qemu) info status
VM status: running
(qemu)
(qemu) system_powerdown
(qemu)
(qemu) info status
VM status: running
(qemu)
(qemu) info status
VM status: running
(qemu)
(qemu) info cpus
* CPU #0: pc=0x80455278 thread_id=22792
(qemu) info cpus
* CPU #0: pc=0x80455278 thread_id=22792
(qemu) info cpus
* CPU #0: pc=0x80455278 thread_id=22792
(qemu)
(qemu) info registers
RAX= RBX= RCX=0204 RDX=b100
RSI=0022 RDI=80013900 RBP=8e7fcc80 RSP=8e7fcb78
R8 =0004 R9 =0001 R10=0010 R11=80455b90
R12=0006 R13=8e6f2010 R14=80119900 R15=8e6f2000
RIP=80448c60 RFL=0046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0023 bfff 00a0f300 DPL=3 DS16 [-WA]
CS =0008 00a09b00 DPL=0 CS64 [-RA]
SS =0010 00a09300 DPL=0 DS16 [-WA]
DS =0023 bfff 00a0f300 DPL=3 DS16 [-WA]
FS =0023 bfff 00a0f300 DPL=3 DS16 [-WA]
GS =0023 80a27b60 bfff 00a0f300 DPL=3 DS16 [-WA]
LDT=
TR =0030 80011000 0067 8b00 DPL=0 TSS64-busy
GDT= 80011068 003f
IDT= 8001 0fff
CR0=8001003b CR2=1567dc558a60 CR3=00e6 CR4=07b0
DR0= DR1= DR2= DR3=
DR6=0ff0 DR7=0400
EFER=0d01
FCW=037f FSW= [ST=0] FTW=00 MXCSR=1fa0
FPR0= FPR1= FPR2= FPR3=
FPR4= FPR5= FPR6= FPR7=
XMM00=3fa24000 XMM01=3fe2b400 XMM02= XMM03=
XMM04= XMM05= XMM06= XMM07=
XMM08= XMM09= XMM10= XMM11=
XMM12= XMM13= XMM14= XMM15=
(qemu)
(qemu) info registers
RAX= RBX= RCX=0204 RDX=b100
RSI=0022 RDI=80013900 RBP=8e7fcc80 RSP=8e7fcb78
R8 =0004 R9 =0001 R10=0010 R11=80455b90
R12=0006 R13=8e6f2010 R14=80119900 R15=8e6f2000
RIP=80448c60 RFL=0046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0023 bfff 00a0f300 DPL=3 DS16 [-WA]
CS =0008 00a09b00 DPL=0 CS64 [-RA]
SS =0010 00a09300 DPL=0 DS16 [-WA]
DS =0023 bfff 00a0f300 DPL=3 DS16 [-WA]
FS =0023 bfff 00a0f300 DPL=3 DS16 [-WA]
GS =0023 80a27b60 bfff 00a0f300 DPL=3 DS16 [-WA]
LDT=
TR =0030 80011000 0067 8b00 DPL=0 TSS64-busy
GDT= 80011068 003f
IDT= 8001 0fff
CR0=8001003b CR2=1567dc558a60 CR3=00e6 CR4=07b0
DR0= DR1= DR2= DR3=
DR6=0ff0 DR7=0400
EFER=0d01
FCW=037f FSW= [ST=0] FTW=00 MXCSR=1fa0
FPR0= FPR1= FPR2= FPR3=
FPR4= FPR5= FPR6= FPR7=
XMM00=3fa24000 XMM0
Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling
On Wed, Sep 04, 2013 at 02:01:28AM +1000, Alexey Kardashevskiy wrote:
> On 09/03/2013 08:53 PM, Gleb Natapov wrote:
> > On Mon, Sep 02, 2013 at 01:14:29PM +1000, Alexey Kardashevskiy wrote:
> >> On 09/01/2013 10:06 PM, Gleb Natapov wrote:
> >>> On Wed, Aug 28, 2013 at 06:50:41PM +1000, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted at an IOMMU TCE table without passing
> them to user space, which saves time on switching to user space and back.
>
> Both real and virtual modes are supported. The kernel tries to
> handle a TCE request in the real mode; if that fails, it passes the request
> to the virtual mode to complete the operation. If the virtual mode
> handler fails, the request is passed to user space.
>
> The first user of this is VFIO on POWER. Trampolines to the VFIO external
> user API functions are required for this patch.
>
> This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
> number (LIOBN) with a VFIO IOMMU group fd and enable in-kernel handling
> of map/unmap requests. The device supports a single attribute which is
> a struct with LIOBN and IOMMU fd. When the attribute is set, the device
> establishes the connection between KVM and VFIO.
>
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> > Signed-off-by: Paul Mackerras
> > Signed-off-by: Alexey Kardashevskiy
>
> ---
>
> Changes:
> v9:
> * KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU"
> KVM device
> * release_spapr_tce_table() is not shared between different TCE types
> * reduced the patch size by moving VFIO external API
> trampolines to a separate patch
> * moved documentation from Documentation/virtual/kvm/api.txt to
> Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>
> v8:
> * fixed warnings from checkpatch.pl
>
> 2013/07/11:
> * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
> for KVM_BOOK3S_64
> * kvmppc_gpa_to_hva_and_get also returns host phys address. Not much sense
> for this here but the next patch for hugepages support will use it more.
>
> 2013/07/06:
> * added realmode arch_spin_lock to protect TCE table from races
> in real and virtual modes
> * POWERPC IOMMU API is changed to support real mode
> * iommu_take_ownership and iommu_release_ownership are protected by
> iommu_table's locks
> * VFIO external user API use rewritten
> * multiple small fixes
>
> 2013/06/27:
> * tce_list page is referenced now in order to protect it from accidental
> invalidation during H_PUT_TCE_INDIRECT execution
> * added use of the external user VFIO API
>
> 2013/06/05:
> * changed capability number
> * changed ioctl number
> * updated the doc article number
>
> 2013/05/20:
> * removed get_user() from real mode handlers
> * kvm_vcpu_arch::tce_tmp usage extended.
Now real mode handler puts there > translated TCEs, tries realmode_get_page() on those and if it fails, it > passes control over the virtual mode handler which tries to finish > the request handling > * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit > on a page > * The only reason to pass the request to user mode now is when the user > mode > did not register TCE table in the kernel, in all other cases the virtual > mode > handler is expected to do the job > --- > .../virtual/kvm/devices/spapr_tce_iommu.txt| 37 +++ > arch/powerpc/include/asm/kvm_host.h| 4 + > arch/powerpc/kvm/book3s_64_vio.c | 310 > - > arch/powerpc/kvm/book3s_64_vio_hv.c| 122 > arch/powerpc/kvm/powerpc.c | 1 + > include/linux/kvm_host.h | 1 + > virt/kvm/kvm_main.c| 5 + > 7 files changed, 477 insertions(+), 3 deletions(-) > create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt > > diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt > b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt > new file mode 100644 > index 000..4bc8fc3 > --- /dev/null > +++ b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt > @@ -0,0 +1,37 @@ > +SPAPR TCE IOMMU device > + > +Capability: KVM_CAP_SPAPR_TCE_IOMMU > +Architectures: powerpc > + > +Device type supported: KVM_DEV_TYPE_SPAPR_TCE_IOMMU > + > +Groups: > + KVM_DEV_SPAPR_TCE_IOMMU_ATT
Re: OpenBSD 5.3 guest on KVM
On Thu, Sep 05, 2013 at 12:44:58PM -0300, Daniel Bareiro wrote:
> Hi, Paolo.
>
> On Thursday 05 September 2013 10:30:12 +0200, Paolo Bonzini wrote:
> > > Someone had this problem and could solve it somehow? Is there any
> > > debug information I can provide to help solve this?
> > For simple troubleshooting try "info status" from the QEMU monitor.
>
> ss01:~# telnet localhost 4046
> Trying ::1...
> Connected to localhost.
> Escape character is '^]'.
> QEMU 1.1.2 monitor - type 'help' for more information
> (qemu)
> (qemu)
> (qemu) info status
> VM status: running
> (qemu)
> (qemu)
> (qemu) system_powerdown
> (qemu)
> (qemu)
> (qemu) info status
> VM status: running
> (qemu)
>
> Then the VM freezes.
>
And now, after it freezes, do one more "info status", and also "info cpus"
and "info registers". Also, what do you mean by "freezes"? Do you see
that the VM started to shut down after you issued system_powerdown?

--
			Gleb.
Re: [PATCH v2 2/3] kvm tools: remove periodic tick in favour of a polling thread
Hi Sasha,

On 04/09/13 19:01, Sasha Levin wrote:
On 09/04/2013 01:48 PM, Pekka Enberg wrote:
On Wed, Sep 4, 2013 at 8:40 PM, Jonathan Austin wrote:

'top' works on ARM with virtio console. I've just done some new testing
and with the serial console emulation I see the same as you're reporting.
Previously with the 8250 emulation I'd booted to a prompt but didn't
actually test top... I'm looking into fixing this now...

Looks like I need to find the right place from which to call
serial8250_flush_tx now that it isn't getting called every tick. I've
done the following and it works, fixing 'top' with serial8250:

---8<--
diff --git a/tools/kvm/hw/serial.c b/tools/kvm/hw/serial.c
index 931067f..a71e68d 100644
--- a/tools/kvm/hw/serial.c
+++ b/tools/kvm/hw/serial.c
@@ -260,6 +260,7 @@ static bool serial8250_out(struct ioport *ioport, struct kvm *kvm, u16 port,
 		dev->lsr &= ~UART_LSR_TEMT;
 		if (dev->txcnt == FIFO_LEN / 2)
 			dev->lsr &= ~UART_LSR_THRE;
+		serial8250_flush_tx(kvm, dev);
 	} else {
 		/* Should never happpen */
 		dev->lsr &= ~(UART_LSR_TEMT | UART_LSR_THRE);
->8---

I guess it's a shame that we'll be printing each character (admittedly
the rate will always be relatively low...) rather than flushing the
buffer in a batch. Without a timer, though, I'm not sure I see a better
option - every N chars doesn't seem like a good one to me.

If you think that looks about right then I'll fold that into the patch
series, probably also removing the call to serial8250_flush_tx() in
serial8250__receive.

Yeah, looks good to me and makes top work again. We might want to make
sure performance isn't hit with stuff that's intensive on the serial
console.

Indeed, the intention here is very much to reduce overhead... Do you have
an idea already of what you'd like to test?

I've written a little testcase that just prints an incrementing counter
to the console in a tight loop for 5 seconds (I've tested both buffered
and unbuffered output).
The measure of 'performance' is how high we count in those 5 seconds.
These are averages (mean) of 5 runs on x86.

+-----------------+------------+----------+
|                 | unbuffered | buffered |
+-----------------+------------+----------+
| native          |      36880 |    40354 |
+-----------------+------------+----------+
| lkvm - original |      24302 |    25335 |
+-----------------+------------+----------+
| lkvm - no-tick  |      22895 |    28202 |
+-----------------+------------+----------+

I ran these all on the framebuffer console. I found that the numbers on
gnome-terminal seemed to be affected by the activity level of other
gui-ish things, and the numbers were different in gnome-terminal and
xterm. If you want to see more testing then a suggestion of a way we can
take host terminal performance out of the equation would make me more
comfortable with the numbers. I do think that as comparisons to each
other they're reasonable as they are, though.

So at least in this reasonably artificial case it looks like a minor win
for buffered output and a minor loss in the unbuffered case. Happy to try
out other things if you're interested.

Jonny

Thanks,
Sasha
Re: [PATCH 0/2] kvm: fix a bug and remove a redundancy in async_pf
On 04/09/2013 22:32, Radim Krčmář wrote:
> I did not reproduce the bug fixed in [1/2], but there are not that many
> reasons why we could not unload a module, so the spot is quite obvious.
>
> Radim Krčmář (2):
>   kvm: free resources after canceling async_pf
>   kvm: remove .done from struct kvm_async_pf
>
>  include/linux/kvm_host.h | 1 -
>  virt/kvm/async_pf.c      | 8
>  2 files changed, 4 insertions(+), 5 deletions(-)

Reviewed-by: Paolo Bonzini
Re: OpenBSD 5.3 guest on KVM
Hi, Paolo.

On Thursday 05 September 2013 10:30:12 +0200, Paolo Bonzini wrote:
> > Someone had this problem and could solve it somehow? Is there any
> > debug information I can provide to help solve this?
> For simple troubleshooting try "info status" from the QEMU monitor.

ss01:~# telnet localhost 4046
Trying ::1...
Connected to localhost.
Escape character is '^]'.
QEMU 1.1.2 monitor - type 'help' for more information
(qemu)
(qemu)
(qemu) info status
VM status: running
(qemu)
(qemu)
(qemu) system_powerdown
(qemu)
(qemu)
(qemu) info status
VM status: running
(qemu)

Then the VM freezes.

> You can also try this:
>
> http://www.linux-kvm.org/page/Tracing
>
> You will get a large log, you can send it to me offlist or via
> something like Google Drive.

The log has 612 MB (57 MB compressed with bzip2). I guess it must be
because there are other production VMs running on that host. Is there a
way of limiting the log? I got this result by running what I found on the
link you sent me:

# trace-cmd record -b 2 -e kvm
# trace-cmd report

I forgot to mention that I'm using the KVM versions provided by the
repositories of Debian Wheezy:

Linux: 3.2.46-1+deb7u1
qemu-kvm: 1.1.2+dfsg-6

Thanks for your reply.

Regards,
Daniel

--
Fingerprint: BFB3 08D6 B4D1 31B2 72B9 29CE 6696 BF1B 14E6 1D37
Powered by Debian GNU/Linux - Linux user #188.598
Re: [PATCH v3] KVM: nVMX: Fully support of nested VMX preemption timer
On Thu, Sep 5, 2013 at 7:05 PM, Zhang, Yang Z wrote: > Arthur Chunqi Li wrote on 2013-09-05: >> > Arthur Chunqi Li wrote on 2013-09-05: >> >> On Thu, Sep 5, 2013 at 3:45 PM, Zhang, Yang Z >> >> >> >> wrote: >> >> > Arthur Chunqi Li wrote on 2013-09-04: >> >> >> This patch contains the following two changes: >> >> >> 1. Fix the bug in nested preemption timer support. If vmexit >> >> >> L2->L0 with some reasons not emulated by L1, preemption timer >> >> >> value should be save in such exits. >> >> >> 2. Add support of "Save VMX-preemption timer value" VM-Exit >> >> >> controls to nVMX. >> >> >> >> >> >> With this patch, nested VMX preemption timer features are fully >> supported. >> >> >> >> >> >> Signed-off-by: Arthur Chunqi Li >> >> >> --- >> >> >> This series depends on queue. >> >> >> >> >> >> arch/x86/include/uapi/asm/msr-index.h |1 + >> >> >> arch/x86/kvm/vmx.c| 51 >> >> >> ++--- >> >> >> 2 files changed, 48 insertions(+), 4 deletions(-) >> >> >> >> >> >> diff --git a/arch/x86/include/uapi/asm/msr-index.h >> >> >> b/arch/x86/include/uapi/asm/msr-index.h >> >> >> index bb04650..b93e09a 100644 >> >> >> --- a/arch/x86/include/uapi/asm/msr-index.h >> >> >> +++ b/arch/x86/include/uapi/asm/msr-index.h >> >> >> @@ -536,6 +536,7 @@ >> >> >> >> >> >> /* MSR_IA32_VMX_MISC bits */ >> >> >> #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL >> << >> >> 29) >> >> >> +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE 0x1F >> >> >> /* AMD-V MSRs */ >> >> >> >> >> >> #define MSR_VM_CR 0xc0010114 >> >> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index >> >> >> 1f1da43..870caa8 >> >> >> 100644 >> >> >> --- a/arch/x86/kvm/vmx.c >> >> >> +++ b/arch/x86/kvm/vmx.c >> >> >> @@ -2204,7 +2204,14 @@ static __init void >> >> >> nested_vmx_setup_ctls_msrs(void) #ifdef CONFIG_X86_64 >> >> >> VM_EXIT_HOST_ADDR_SPACE_SIZE | #endif >> >> >> - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT; >> >> >> + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT >> | >> >> >> + 
VM_EXIT_SAVE_VMX_PREEMPTION_TIMER; >> >> >> + if (!(nested_vmx_pinbased_ctls_high & >> >> >> PIN_BASED_VMX_PREEMPTION_TIMER)) >> >> >> + nested_vmx_exit_ctls_high &= >> >> >> + >> (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER); >> >> >> + if (!(nested_vmx_exit_ctls_high & >> >> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) >> >> >> + nested_vmx_pinbased_ctls_high &= >> >> >> + (~PIN_BASED_VMX_PREEMPTION_TIMER); >> >> > The following logic is more clearly: >> >> > if(nested_vmx_pinbased_ctls_high & >> >> PIN_BASED_VMX_PREEMPTION_TIMER) >> >> > nested_vmx_exit_ctls_high |= >> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER >> >> Here I have such consideration: this logic is wrong if CPU support >> >> PIN_BASED_VMX_PREEMPTION_TIMER but doesn't support >> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER, though I don't know if this >> does >> >> occurs. So the codes above reads the MSR and reserves the features it >> >> supports, and here I just check if these two features are supported >> >> simultaneously. >> >> >> > No. Only VM_EXIT_SAVE_VMX_PREEMPTION_TIMER depends on >> > PIN_BASED_VMX_PREEMPTION_TIMER. >> PIN_BASED_VMX_PREEMPTION_TIMER is an >> > independent feature >> > >> >> You remind that this piece of codes can write like this: >> >> if (!(nested_vmx_pin_based_ctls_high & >> >> PIN_BASED_VMX_PREEMPTION_TIMER) || >> >> !(nested_vmx_exit_ctls_high & >> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) { >> >> nested_vmx_exit_ctls_high >> >> &=(~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER); >> >> nested_vmx_pinbased_ctls_high &= >> >> (~PIN_BASED_VMX_PREEMPTION_TIMER); >> >> } >> >> >> >> This may reflect the logic I describe above that these two flags >> >> should support simultaneously, and brings less confusion. >> >> > >> >> > BTW: I don't see nested_vmx_setup_ctls_msrs() considers the >> >> > hardware's >> >> capability when expose those vmx features(not just preemption timer) to >> >> L1. 
>> >> The codes just above here, when setting pinbased control for nested >> >> vmx, it firstly "rdmsr MSR_IA32_VMX_PINBASED_CTLS", then use this to >> >> mask the features hardware not support. So does other control fields. >> >> > >> > Yes, I saw it. >> > >> >> >> nested_vmx_exit_ctls_high |= >> >> >> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | >> >> >> >> VM_EXIT_LOAD_IA32_EFER); >> >> >> >> >> >> @@ -6707,6 +6714,23 @@ static void vmx_get_exit_info(struct >> >> >> kvm_vcpu *vcpu, u64 *info1, u64 *info2) >> >> >> *info2 = vmcs_read32(VM_EXIT_INTR_INFO); } >> >> >> >> >> >> +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu) { >> >> >> + u64 delta_tsc_l1; >> >> >> + u32 preempt_val_l1, preempt_val_l2, preempt_scale; >> >> >> + >> >> >> + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) & >> >> >> + >> >> MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE; >> >> >> + p
Re: [Xen-devel] Is fallback vhost_net to qemu for live migrate available?
On Tue, 3 Sep 2013, Michael S. Tsirkin wrote: > On Tue, Sep 03, 2013 at 09:40:48AM +0100, Wei Liu wrote: > > On Tue, Sep 03, 2013 at 09:28:11AM +0800, Qin Chuanyu wrote: > > > On 2013/9/2 15:57, Wei Liu wrote: > > > >On Sat, Aug 31, 2013 at 12:45:11PM +0800, Qin Chuanyu wrote: > > > >>On 2013/8/30 0:08, Anthony Liguori wrote: > > > >>>Hi Qin, > > > >> > > > By change the memory copy and notify mechanism ,currently virtio-net > > > with > > > vhost_net could run on Xen with good performance。 > > > >>> > > > >>>I think the key in doing this would be to implement a property > > > >>>ioeventfd and irqfd interface in the driver domain kernel. Just > > > >>>hacking vhost_net with Xen specific knowledge would be pretty nasty > > > >>>IMHO. > > > >>> > > > >>Yes, I add a kernel module which persist virtio-net pio_addr and > > > >>msix address as what kvm module did. Guest wake up vhost thread by > > > >>adding a hook func in evtchn_interrupt. > > > >> > > > >>>Did you modify the front end driver to do grant table mapping or is > > > >>>this all being done by mapping the domain's memory? > > > >>> > > > >>There is nothing changed in front end driver. Currently I use > > > >>alloc_vm_area to get address space, and map the domain's memory as > > > >>what what qemu did. > > > >> > > > > > > > >You mean you're using xc_map_foreign_range and friends in the backend to > > > >map guest memory? That's not very desirable as it violates Xen's > > > >security model. It would not be too hard to pass grant references > > > >instead of guest physical memory address IMHO. > > > > > > > In fact, I did what virtio-net have done in Qemu. I think security > > > is a pseudo question because Dom0 is under control. Right, but we are trying to move the backends out of Dom0, for scalability and security. Setting up a network driver domain is pretty easy and should work out of the box with Xen 4.3. That said, I agree that using xc_map_foreign_range is a good way to start. 
> > Consider that you might have driver domains. Not every domain is under > > control or trusted. > > I don't see anything that will prevent using driver domains here. Driver domains are not privileged, therefore cannot map random guest pages (unless they have been granted by the guest via the grant table). xc_map_foreign_range can't work from a driver domain. > > Also consider that security model like XSM can be > > used to audit operations to enhance security so your foreign mapping > > approach might not always work. > > It could be nice to have as an option, sure. > XSM is disabled by default though so I don't think lack of support for > that makes it a prototype. There are some security aware Xen based products in the market today that use XSM.
Re: [PATCH v2] KVM: mmu: allow page tables to be in read-only slots
On 09/05/2013 08:21 PM, Paolo Bonzini wrote:
> Page tables in a read-only memory slot will currently cause a triple
> fault when running with shadow paging, because the page walker uses
> gfn_to_hva and it fails on such a slot.
>
> TianoCore uses such a page table. The idea is that, on real hardware,
> the firmware can already run in 64-bit flat mode when setting up the
> memory controller. Real hardware seems to be fine with that as long as
> the accessed/dirty bits are set. Thus, this patch saves whether the
> slot is readonly, and later checks it when updating the accessed and
> dirty bits.
>
> Note that this scenario is not supported by NPT at all, as explained by
> comments in the code.

Reviewed-by: Xiao Guangrong
[PATCH uq/master 2/2] KVM: make XSAVE support more robust
QEMU moves state from CPUArchState to struct kvm_xsave and back when it invokes the KVM_*_XSAVE ioctls. Because it doesn't treat the XSAVE region as an opaque blob, it might be impossible to set some state on the destination if migrating to an older version. This patch blocks migration if it finds that unsupported bits are set in the XSTATE_BV header field. To make this work robustly, QEMU should only report in env->xstate_bv those fields that will actually end up in the migration stream. Signed-off-by: Paolo Bonzini --- target-i386/kvm.c | 3 ++- target-i386/machine.c | 4 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 749aa09..df08a4b 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -1291,7 +1291,8 @@ static int kvm_get_xsave(X86CPU *cpu) sizeof env->fpregs); memcpy(env->xmm_regs, &xsave->region[XSAVE_XMM_SPACE], sizeof env->xmm_regs); -env->xstate_bv = *(uint64_t *)&xsave->region[XSAVE_XSTATE_BV]; +env->xstate_bv = *(uint64_t *)&xsave->region[XSAVE_XSTATE_BV] & +XSTATE_SUPPORTED; memcpy(env->ymmh_regs, &xsave->region[XSAVE_YMMH_SPACE], sizeof env->ymmh_regs); return 0; diff --git a/target-i386/machine.c b/target-i386/machine.c index dc81cde..9e2cfcf 100644 --- a/target-i386/machine.c +++ b/target-i386/machine.c @@ -278,6 +278,10 @@ static int cpu_post_load(void *opaque, int version_id) CPUX86State *env = &cpu->env; int i; +if (env->xstate_bv & ~XSTATE_SUPPORTED) { +return -EINVAL; +} + /* * Real mode guest segments register DPL should be zero. * Older KVM version were setting it wrongly. -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH uq/master 0/2] KVM: issues with XSAVE support
This series fixes two migration bugs concerning KVM's XSAVE ioctls, both
found by code inspection (the second in fact is just theoretical until
AVX512 or MPX support is added to KVM).

Please review.

Paolo Bonzini (2):
  x86: fix migration from pre-version 12
  KVM: make XSAVE support more robust

 target-i386/cpu.c     | 1 +
 target-i386/cpu.h     | 5 +
 target-i386/kvm.c     | 3 ++-
 target-i386/machine.c | 4
 4 files changed, 12 insertions(+), 1 deletion(-)

--
1.8.3.1
[PATCH uq/master 1/2] x86: fix migration from pre-version 12
On KVM, the KVM_SET_XSAVE would be executed with a 0 xstate_bv, and not restore anything. Since FP and SSE data are always valid, set them in xstate_bv at reset time. In fact, that value is the same that KVM_GET_XSAVE returns on pre-XSAVE hosts. Signed-off-by: Paolo Bonzini --- target-i386/cpu.c | 1 + target-i386/cpu.h | 5 + 2 files changed, 6 insertions(+) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index c36345e..ac83106 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -2386,6 +2386,7 @@ static void x86_cpu_reset(CPUState *s) env->fpuc = 0x37f; env->mxcsr = 0x1f80; +env->xstate_bv = XSTATE_FP | XSTATE_SSE; env->pat = 0x0007040600070406ULL; env->msr_ia32_misc_enable = MSR_IA32_MISC_ENABLE_DEFAULT; diff --git a/target-i386/cpu.h b/target-i386/cpu.h index 5723eff..a153078 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -380,6 +380,11 @@ #define MSR_VM_HSAVE_PA 0xc0010117 +#define XSTATE_SUPPORTED (XSTATE_FP|XSTATE_SSE|XSTATE_YMM) +#define XSTATE_FP 1 +#define XSTATE_SSE 2 +#define XSTATE_YMM 4 + /* CPUID feature words */ typedef enum FeatureWord { FEAT_1_EDX, /* CPUID[1].EDX */ -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM: x86: prevent setting unsupported XSAVE states
A guest can still attempt to save and restore XSAVE states even if they have been masked in CPUID leaf 0Dh. This usually is not visible to the guest, but is still wrong: "Any attempt to set a reserved bit (as determined by the contents of EAX and EDX after executing CPUID with EAX=0DH, ECX= 0H) in XCR0 for a given processor will result in a #GP exception". The patch also performs the same checks as __kvm_set_xcr in KVM_SET_XSAVE. This catches migration from newer to older kernel/processor before the guest starts running. Cc: kvm@vger.kernel.org Cc: Gleb Natapov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 2 +- arch/x86/kvm/x86.c | 10 -- arch/x86/kvm/x86.h | 1 + 3 files changed, 10 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index a20ecb5..d7c465d 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -182,7 +182,7 @@ static bool supported_xcr0_bit(unsigned bit) { u64 mask = ((u64)1 << bit); - return mask & (XSTATE_FP | XSTATE_SSE | XSTATE_YMM) & host_xcr0; + return mask & KVM_SUPPORTED_XCR0 & host_xcr0; } #define F(x) bit(X86_FEATURE_##x) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 3625798..801a882 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -586,6 +586,8 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr) return 1; if ((xcr0 & XSTATE_YMM) && !(xcr0 & XSTATE_SSE)) return 1; + if (xcr0 & ~KVM_SUPPORTED_XCR0) + return 1; if (xcr0 & ~host_xcr0) return 1; kvm_put_guest_xcr0(vcpu); @@ -2980,10 +2982,14 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu, u64 xstate_bv = *(u64 *)&guest_xsave->region[XSAVE_HDR_OFFSET / sizeof(u32)]; - if (cpu_has_xsave) + if (cpu_has_xsave) { + if (xstate_bv & ~KVM_SUPPORTED_XCR0) + return -EINVAL; + if (xstate_bv & ~host_xcr0) + return -EINVAL; memcpy(&vcpu->arch.guest_fpu.state->xsave, guest_xsave->region, xstate_size); - else { + } else { if (xstate_bv & ~XSTATE_FPSSE) return -EINVAL; 
memcpy(&vcpu->arch.guest_fpu.state->fxsave, diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index e224f7a..587fb9e 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -122,6 +122,7 @@ int kvm_write_guest_virt_system(struct x86_emulate_ctxt *ctxt, gva_t addr, void *val, unsigned int bytes, struct x86_exception *exception); +#define KVM_SUPPORTED_XCR0 (XSTATE_FP | XSTATE_SSE | XSTATE_YMM) extern u64 host_xcr0; extern struct static_key kvm_no_apic_vcpu; -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] KVM: mmu: allow page tables to be in read-only slots
Page tables in a read-only memory slot will currently cause a triple fault when running with shadow paging, because the page walker uses gfn_to_hva and it fails on such a slot. TianoCore uses such a page table. The idea is that, on real hardware, the firmware can already run in 64-bit flat mode when setting up the memory controller. Real hardware seems to be fine with that as long as the accessed/dirty bits are set. Thus, this patch saves whether the slot is readonly, and later checks it when updating the accessed and dirty bits. Note that this scenario is not supported by NPT at all, as explained by comments in the code. Cc: sta...@vger.kernel.org Cc: kvm@vger.kernel.org Cc: Xiao Guangrong Cc: Gleb Natapov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/paging_tmpl.h | 20 +++- include/linux/kvm_host.h | 1 + virt/kvm/kvm_main.c| 14 +- 3 files changed, 29 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 0433301..aa18aca 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -99,6 +99,7 @@ struct guest_walker { pt_element_t prefetch_ptes[PTE_PREFETCH_NUM]; gpa_t pte_gpa[PT_MAX_FULL_LEVELS]; pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS]; + bool pte_writable[PT_MAX_FULL_LEVELS]; unsigned pt_access; unsigned pte_access; gfn_t gfn; @@ -235,6 +236,22 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu, if (pte == orig_pte) continue; + /* +* If the slot is read-only, simply do not process the accessed +* and dirty bits. This is the correct thing to do if the slot +* is ROM, and page tables in read-as-ROM/write-as-MMIO slots +* are only supported if the accessed and dirty bits are already +* set in the ROM (so that MMIO writes are never needed). +* +* Note that NPT does not allow this at all and faults, since +* it always wants nested page table entries for the guest +* page tables to be writable. 
And EPT works but will simply +* overwrite the read-only memory to set the accessed and dirty +* bits. +*/ + if (unlikely(!walker->pte_writable[level - 1])) + continue; + ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, orig_pte, pte); if (ret) return ret; @@ -309,7 +326,8 @@ retry_walk: goto error; real_gfn = gpa_to_gfn(real_gfn); - host_addr = gfn_to_hva(vcpu->kvm, real_gfn); + host_addr = gfn_to_hva_read(vcpu->kvm, real_gfn, + &walker->pte_writable[walker->level - 1]); if (unlikely(kvm_is_error_hva(host_addr))) goto error; diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index ca645a0..22f9cdf 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -533,6 +533,7 @@ int gfn_to_page_many_atomic(struct kvm *kvm, gfn_t gfn, struct page **pages, struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn); unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn); +unsigned long gfn_to_hva_read(struct kvm *kvm, gfn_t gfn, bool *writable); unsigned long gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn); void kvm_release_page_clean(struct page *page); void kvm_release_page_dirty(struct page *page); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index f7e4334..418d037 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1078,11 +1078,15 @@ unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn) EXPORT_SYMBOL_GPL(gfn_to_hva); /* - * The hva returned by this function is only allowed to be read. - * It should pair with kvm_read_hva() or kvm_read_hva_atomic(). + * If writable is set to false, the hva returned by this function is only + * allowed to be read. 
*/ -static unsigned long gfn_to_hva_read(struct kvm *kvm, gfn_t gfn) +unsigned long gfn_to_hva_read(struct kvm *kvm, gfn_t gfn, bool *writable) { + struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn); + if (writable) + *writable = !memslot_is_readonly(slot); + return __gfn_to_hva_many(gfn_to_memslot(kvm, gfn), gfn, NULL, false); } @@ -1450,7 +1454,7 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset, int r; unsigned long addr; - addr = gfn_to_hva_read(kvm, gfn); + addr = gfn_to_hva_read(kvm, gfn, NULL); if (kvm_is_error_hva(addr)) return -EFAULT; r = kvm_read_hva(data, (void __user *)addr + offset, len); @@ -1488,7 +1492,7 @@ int kvm_read_guest_atomic(struct kvm *kvm, gpa_t gpa, void *data,
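To illustrate the contract this patch introduces, here is a small hedged sketch in userspace C (not the kernel code; `struct memslot`, `slot_to_hva()` and `maybe_set_accessed_dirty()` are simplified stand-ins): the hva lookup additionally reports whether the slot is writable, and the page walker skips the accessed/dirty update when it is not.

```c
#include <stdbool.h>

/* Simplified stand-in for a KVM memory slot. */
struct memslot {
	unsigned long userspace_addr;
	bool readonly;
};

/* Mirrors the idea of gfn_to_hva_read(): return the hva and,
 * optionally, whether writes through it are allowed. */
static unsigned long slot_to_hva(const struct memslot *slot, bool *writable)
{
	if (writable)
		*writable = !slot->readonly;
	return slot->userspace_addr;
}

/* Mirrors update_accessed_dirty_bits(): only touch the A/D bits when
 * the backing slot is writable; on a ROM slot, silently skip them. */
static bool maybe_set_accessed_dirty(unsigned long *pte, bool pte_writable)
{
	if (!pte_writable)
		return false;			/* ROM page table: leave bits alone */
	*pte |= (1UL << 5) | (1UL << 6);	/* accessed and dirty bits on x86 */
	return true;
}
```

This matches the patch's assumption that a ROM-backed guest page table is only usable if its accessed/dirty bits are already set, so skipping the update is harmless.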
RE: [PATCH v3] KVM: nVMX: Fully support of nested VMX preemption timer
Arthur Chunqi Li wrote on 2013-09-05: > > Arthur Chunqi Li wrote on 2013-09-05: > >> On Thu, Sep 5, 2013 at 3:45 PM, Zhang, Yang Z > >> > >> wrote: > >> > Arthur Chunqi Li wrote on 2013-09-04: > >> >> This patch contains the following two changes: > >> >> 1. Fix the bug in nested preemption timer support. If vmexit > >> >> L2->L0 with some reasons not emulated by L1, preemption timer > >> >> value should be save in such exits. > >> >> 2. Add support of "Save VMX-preemption timer value" VM-Exit > >> >> controls to nVMX. > >> >> > >> >> With this patch, nested VMX preemption timer features are fully > supported. > >> >> > >> >> Signed-off-by: Arthur Chunqi Li > >> >> --- > >> >> This series depends on queue. > >> >> > >> >> arch/x86/include/uapi/asm/msr-index.h |1 + > >> >> arch/x86/kvm/vmx.c| 51 > >> >> ++--- > >> >> 2 files changed, 48 insertions(+), 4 deletions(-) > >> >> > >> >> diff --git a/arch/x86/include/uapi/asm/msr-index.h > >> >> b/arch/x86/include/uapi/asm/msr-index.h > >> >> index bb04650..b93e09a 100644 > >> >> --- a/arch/x86/include/uapi/asm/msr-index.h > >> >> +++ b/arch/x86/include/uapi/asm/msr-index.h > >> >> @@ -536,6 +536,7 @@ > >> >> > >> >> /* MSR_IA32_VMX_MISC bits */ > >> >> #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL > << > >> 29) > >> >> +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE 0x1F > >> >> /* AMD-V MSRs */ > >> >> > >> >> #define MSR_VM_CR 0xc0010114 > >> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index > >> >> 1f1da43..870caa8 > >> >> 100644 > >> >> --- a/arch/x86/kvm/vmx.c > >> >> +++ b/arch/x86/kvm/vmx.c > >> >> @@ -2204,7 +2204,14 @@ static __init void > >> >> nested_vmx_setup_ctls_msrs(void) #ifdef CONFIG_X86_64 > >> >> VM_EXIT_HOST_ADDR_SPACE_SIZE | #endif > >> >> - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT; > >> >> + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT > | > >> >> + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER; > >> >> + if (!(nested_vmx_pinbased_ctls_high & > >> >> 
PIN_BASED_VMX_PREEMPTION_TIMER)) > >> >> + nested_vmx_exit_ctls_high &= > >> >> + > (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER); > >> >> + if (!(nested_vmx_exit_ctls_high & > >> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) > >> >> + nested_vmx_pinbased_ctls_high &= > >> >> + (~PIN_BASED_VMX_PREEMPTION_TIMER); > >> > The following logic is more clearly: > >> > if(nested_vmx_pinbased_ctls_high & > >> PIN_BASED_VMX_PREEMPTION_TIMER) > >> > nested_vmx_exit_ctls_high |= > >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER > >> Here I have such consideration: this logic is wrong if CPU support > >> PIN_BASED_VMX_PREEMPTION_TIMER but doesn't support > >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER, though I don't know if this > does > >> occurs. So the codes above reads the MSR and reserves the features it > >> supports, and here I just check if these two features are supported > >> simultaneously. > >> > > No. Only VM_EXIT_SAVE_VMX_PREEMPTION_TIMER depends on > > PIN_BASED_VMX_PREEMPTION_TIMER. > PIN_BASED_VMX_PREEMPTION_TIMER is an > > independent feature > > > >> You remind that this piece of codes can write like this: > >> if (!(nested_vmx_pin_based_ctls_high & > >> PIN_BASED_VMX_PREEMPTION_TIMER) || > >> !(nested_vmx_exit_ctls_high & > >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) { > >> nested_vmx_exit_ctls_high > >> &=(~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER); > >> nested_vmx_pinbased_ctls_high &= > >> (~PIN_BASED_VMX_PREEMPTION_TIMER); > >> } > >> > >> This may reflect the logic I describe above that these two flags > >> should support simultaneously, and brings less confusion. > >> > > >> > BTW: I don't see nested_vmx_setup_ctls_msrs() considers the > >> > hardware's > >> capability when expose those vmx features(not just preemption timer) to L1. > >> The codes just above here, when setting pinbased control for nested > >> vmx, it firstly "rdmsr MSR_IA32_VMX_PINBASED_CTLS", then use this to > >> mask the features hardware not support. So does other control fields. > >> > > > Yes, I saw it. 
> > > >> >> nested_vmx_exit_ctls_high |= > >> >> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | > >> >> > VM_EXIT_LOAD_IA32_EFER); > >> >> > >> >> @@ -6707,6 +6714,23 @@ static void vmx_get_exit_info(struct > >> >> kvm_vcpu *vcpu, u64 *info1, u64 *info2) > >> >> *info2 = vmcs_read32(VM_EXIT_INTR_INFO); } > >> >> > >> >> +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu) { > >> >> + u64 delta_tsc_l1; > >> >> + u32 preempt_val_l1, preempt_val_l2, preempt_scale; > >> >> + > >> >> + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) & > >> >> + > >> MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE; > >> >> + preempt_val_l2 = > vmcs_read32(VMX_PREEMPTION_TIMER_VALUE); > >> >> + delta_tsc_l1 = kvm_x86_ops->read_l1_tsc(vcpu, > >> >> + native_read_tsc()) - > vcpu->arc
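As a rough sketch of the arithmetic `nested_adjust_preemption_timer()` performs (simplified, with hypothetical names; the real code reads the scale field from MSR_IA32_VMX_MISC and the saved value from the VMCS): the VMX preemption timer counts down once every 2^scale TSC cycles, so when L0 handles an L2 exit without switching to L1, the elapsed L1 TSC cycles must be charged against the timer value saved for L1.

```c
#include <stdint.h>

/* The preemption timer ticks once per 2^scale TSC cycles. When L0
 * handles an L2 exit itself, subtract the elapsed ticks from the
 * timer value that will be saved for L1, clamping at zero (a zero
 * value means the timer would already have fired). */
static uint32_t adjust_preemption_timer(uint32_t saved_timer,
					uint64_t delta_tsc,
					uint32_t scale)
{
	uint64_t ticks = delta_tsc >> scale;	/* elapsed timer ticks */

	if (ticks >= saved_timer)
		return 0;
	return saved_timer - (uint32_t)ticks;
}
```

This is only a model of the scaling logic under discussion, not the actual patch; the clamp-at-zero behavior is an assumption about how an expired timer would be reported.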
[PATCH v2 08/15] KVM: MMU: introduce nulls desc
This is like a nulls list, and we use the pte-list pointer itself as the nulls value, which helps us detect whether the "desc" has been moved to another rmap; in that case we re-walk the rmap. kvm->slots_lock is held while we do the lockless walk, which prevents the rmap from being reused (freeing an rmap requires holding that lock), so we cannot see the same nulls value used on two different rmaps.

Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c | 35 +-- 1 file changed, 29 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 08fb4e2..c5f1b27 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -913,6 +913,24 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn) return level - 1; } +static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc *desc) +{ + unsigned long marker; + + marker = (unsigned long)pte_list | 1UL; + desc->more = (struct pte_list_desc *)marker; +} + +static bool desc_is_a_nulls(struct pte_list_desc *desc) +{ + return (unsigned long)desc & 1; +} + +static unsigned long *desc_get_nulls_value(struct pte_list_desc *desc) +{ + return (unsigned long *)((unsigned long)desc & ~1); +} + static int __find_first_free(struct pte_list_desc *desc) { int i; @@ -951,7 +969,7 @@ static int count_spte_number(struct pte_list_desc *desc) first_free = __find_first_free(desc); - for (desc_num = 0; desc->more; desc = desc->more) + for (desc_num = 0; !desc_is_a_nulls(desc->more); desc = desc->more) desc_num++; return first_free + desc_num * PTE_LIST_EXT; @@ -985,6 +1003,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte, desc = mmu_alloc_pte_list_desc(vcpu); desc->sptes[0] = (u64 *)*pte_list; desc->sptes[1] = spte; + desc_mark_nulls(pte_list, desc); *pte_list = (unsigned long)desc | 1; return 1; } @@ -1030,7 +1049,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list, /* * Only one entry existing but still use a desc to store it?
*/ - WARN_ON(!next_desc); + WARN_ON(desc_is_a_nulls(next_desc)); mmu_free_pte_list_desc(first_desc); *pte_list = (unsigned long)next_desc | 1ul; @@ -1041,7 +1060,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list, * Only one entry in this desc, move the entry to the head * then the desc can be freed. */ - if (!first_desc->sptes[1] && !first_desc->more) { + if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) { *pte_list = (unsigned long)first_desc->sptes[0]; mmu_free_pte_list_desc(first_desc); } @@ -1070,7 +1089,7 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list) rmap_printk("pte_list_remove: %p many->many\n", spte); desc = (struct pte_list_desc *)(*pte_list & ~1ul); - while (desc) { + while (!desc_is_a_nulls(desc)) { for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i) if (desc->sptes[i] == spte) { pte_list_desc_remove_entry(pte_list, @@ -1097,11 +1116,13 @@ static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn) return fn((u64 *)*pte_list); desc = (struct pte_list_desc *)(*pte_list & ~1ul); - while (desc) { + while (!desc_is_a_nulls(desc)) { for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i) fn(desc->sptes[i]); desc = desc->more; } + + WARN_ON(desc_get_nulls_value(desc) != pte_list); } static unsigned long *__gfn_to_rmap(gfn_t gfn, int level, @@ -1184,6 +1205,7 @@ static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter) iter->desc = (struct pte_list_desc *)(rmap & ~1ul); iter->pos = 0; + WARN_ON(desc_is_a_nulls(iter->desc)); return iter->desc->sptes[iter->pos]; } @@ -1204,7 +1226,8 @@ static u64 *rmap_get_next(struct rmap_iterator *iter) return sptep; } - iter->desc = iter->desc->more; + iter->desc = desc_is_a_nulls(iter->desc->more) ? + NULL : iter->desc->more; if (iter->desc) { iter->pos = 0; -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 06/15] KVM: MMU: update spte and add it into rmap before dirty log
kvm_vm_ioctl_get_dirty_log() write-protects the spte based on the its dirty bitmap, so we should ensure the writable spte can be found in rmap before the dirty bitmap is visible. Otherwise, we clear the dirty bitmap but fail to write-protect the page which is detailed in the comments in this patch Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c | 84 ++ arch/x86/kvm/x86.c | 10 +++ 2 files changed, 76 insertions(+), 18 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index a983570..8ea54d9 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2428,6 +2428,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, { u64 spte; int ret = 0; + bool remap = is_rmap_spte(*sptep); if (set_mmio_spte(vcpu->kvm, sptep, gfn, pfn, pte_access)) return 0; @@ -2489,12 +2490,73 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, } } - if (pte_access & ACC_WRITE_MASK) - mark_page_dirty(vcpu->kvm, gfn); - set_pte: if (mmu_spte_update(sptep, spte)) kvm_flush_remote_tlbs(vcpu->kvm); + + if (!remap) { + if (rmap_add(vcpu, sptep, gfn) > RMAP_RECYCLE_THRESHOLD) + rmap_recycle(vcpu, sptep, gfn); + + if (level > PT_PAGE_TABLE_LEVEL) + ++vcpu->kvm->stat.lpages; + } + + /* +* The orders we require are: +* 1) set spte to writable __before__ set the dirty bitmap. +*It makes sure that dirty-logging is not missed when do +*live migration at the final step where kvm should stop +*the guest and push the remaining dirty pages got from +*dirty-bitmap to the destination. The similar cases are +*in fast_pf_fix_direct_spte() and kvm_write_guest_page(). +* +* 2) add the spte into rmap __before__ set the dirty bitmap. 
+* +* They can ensure we can find the writable spte on the rmap +* when we do lockless write-protection since +* kvm_vm_ioctl_get_dirty_log() write-protects the pages based +* on its dirty-bitmap, otherwise these cases will happen: +* +* CPU 0 CPU 1 +* kvm ioctl doing get-dirty-pages +* mark_page_dirty(gfn) which +* set the gfn on the dirty maps +* mask = xchg(dirty_bitmap, 0) +* +* try to write-protect gfns which +* are set on "mask" then walk then +* rmap, see no spte on that rmap +* add the spte into rmap +* +* !! Then the page can be freely wrote but not recorded in +* the dirty bitmap. +* +* And: +* +* VCPU 0CPU 1 +*kvm ioctl doing get-dirty-pages +* mark_page_dirty(gfn) which +* set the gfn on the dirty maps +* +* add spte into rmap +*mask = xchg(dirty_bitmap, 0) +* +*try to write-protect gfns which +*are set on "mask" then walk then +*rmap, see spte is on the ramp +*but it is readonly or nonpresent +* Mark spte writable +* +* !! Then the page can be freely wrote but not recorded in the +* dirty bitmap. +* +* See the comments in kvm_vm_ioctl_get_dirty_log(). +*/ + smp_wmb(); + + if (pte_access & ACC_WRITE_MASK) + mark_page_dirty(vcpu->kvm, gfn); done: return ret; } @@ -2504,9 +2566,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep, int level, gfn_t gfn, pfn_t pfn, bool speculative, bool host_writable) { - int was_rmapped = 0; - int rmap_count; - pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__, *sptep, write_fault, gfn); @@ -2528,8 +2587,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep, spte_to_pfn(*sptep), pfn); drop_spte(vcpu->kvm, sptep); kvm_flush_remote_tlbs(vcpu->kvm); - } else - was_rmapped = 1; + } } if (set_spte(vcpu, sptep, pte_access, level, gfn, pfn, speculative, @@ -2547,16 +2605,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep, is_large_pte(*sptep)? "2MB" : "4kB", *sptep & PT_PRESENT_MASK ?"RW":"R", gfn, *sptep, sptep); -
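The ordering requirement in the comment above can be sketched as a toy single-threaded model (not kernel code; the struct and helpers are invented for illustration): the writable spte must be published in the rmap before the dirty bit becomes visible, so that whoever clears the dirty bitmap and then walks the rmap is guaranteed to find the spte to write-protect.

```c
#include <stdbool.h>

/* Toy model of one guest page's state. */
struct page_state {
	bool spte_writable;
	bool spte_in_rmap;
	bool dirty_bit;
};

/* Models the order required in set_spte(): 1) make the spte writable,
 * 2) add it to the rmap, and only then 3) set the dirty bit. In the
 * real code an smp_wmb() separates steps 2) and 3). */
static void set_spte_in_order(struct page_state *p)
{
	p->spte_writable = true;	/* step 1 */
	p->spte_in_rmap = true;		/* step 2 */
	/* smp_wmb() would go here on a real SMP kernel */
	p->dirty_bit = true;		/* step 3 */
}

/* Models kvm_vm_ioctl_get_dirty_log(): any page found dirty must have
 * a writable spte reachable through the rmap, or write protection on
 * it is silently lost. */
static bool dirty_log_can_protect(const struct page_state *p)
{
	if (!p->dirty_bit)
		return true;		/* nothing to write-protect */
	return p->spte_in_rmap && p->spte_writable;
}
```

The two race diagrams in the patch comment are exactly what happens when step 3 is allowed to overtake steps 1 or 2.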
[PATCH v2 03/15] KVM: MMU: lazily drop large spte
Currently, kvm zaps the large spte if write protection is needed, so a later read can fault on that spte. Actually, we can make the large spte readonly instead of making it non-present, so the page fault caused by read access can be avoided. The idea is from Avi:

| As I mentioned before, write-protecting a large spte is a good idea,
| since it moves some work from protect-time to fault-time, so it reduces
| jitter. This removes the need for the return value.

This version fixes the issue reported in 6b73a9606: the reason for that issue is that fast_page_fault() directly sets a readonly large spte to writable but only marks the first page dirty in the dirty-bitmap, which means the other pages are missed. Fix it by allowing only normal sptes (on the PT_PAGE_TABLE_LEVEL level) to be fast-fixed.

Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c | 36 arch/x86/kvm/x86.c | 8 ++-- 2 files changed, 26 insertions(+), 18 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 869f1db..88107ee 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1177,8 +1177,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep) /* * Write-protect on the specified @sptep, @pt_protect indicates whether - * spte writ-protection is caused by protecting shadow page table. - * @flush indicates whether tlb need be flushed. + * spte write-protection is caused by protecting shadow page table. * * Note: write protection is difference between drity logging and spte * protection: @@ -1187,10 +1186,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep) * - for spte protection, the spte can be writable only after unsync-ing * shadow page. * - * Return true if the spte is dropped. + * Return true if tlb need be flushed.
*/ -static bool -spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect) +static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect) { u64 spte = *sptep; @@ -1200,17 +1198,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect) rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep); - if (__drop_large_spte(kvm, sptep)) { - *flush |= true; - return true; - } - if (pt_protect) spte &= ~SPTE_MMU_WRITEABLE; spte = spte & ~PT_WRITABLE_MASK; - *flush |= mmu_spte_update(sptep, spte); - return false; + return mmu_spte_update(sptep, spte); } static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp, @@ -1222,11 +1214,8 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp, for (sptep = rmap_get_first(*rmapp, &iter); sptep;) { BUG_ON(!(*sptep & PT_PRESENT_MASK)); - if (spte_write_protect(kvm, sptep, &flush, pt_protect)) { - sptep = rmap_get_first(*rmapp, &iter); - continue; - } + flush |= spte_write_protect(kvm, sptep, pt_protect); sptep = rmap_get_next(&iter); } @@ -2675,6 +2664,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write, break; } + drop_large_spte(vcpu, iterator.sptep); + if (!is_shadow_present_pte(*iterator.sptep)) { u64 base_addr = iterator.addr; @@ -2876,6 +2867,19 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level, goto exit; /* +* Do not fix write-permission on the large spte since we only dirty +* the first page into the dirty-bitmap in fast_pf_fix_direct_spte() +* that means other pages are missed if its slot is dirty-logged. +* +* Instead, we let the slow page fault path create a normal spte to +* fix the access. +* +* See the comments in kvm_arch_commit_memory_region(). +*/ + if (sp->role.level > PT_PAGE_TABLE_LEVEL) + goto exit; + + /* * Currently, fast page fault only works for direct mapping since * the gfn is not stable for indirect shadow page. 
* See Documentation/virtual/kvm/locking.txt to get more detail. diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index e5ca72a..6ad0c07 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7208,8 +7208,12 @@ void kvm_arch_commit_memory_region(struct kvm *kvm, kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages); /* * Write protect all pages for dirty logging. -* Existing largepage mappings are destroyed here and new ones will -* not be created until the end of the logging. +* +* All the sptes including the large sptes which point to this +* slot are set to readonly. We can not create any new large +* spte on this slot until the end
[PATCH v2 00/15] KVM: MMU: locklessly write-protect
Changelog v2:
- changes from Gleb's review:
  1) fix calculating the number of sptes in pte_list_add()
  2) set iter->desc to NULL when a nulls desc is met, to clean up the code of rmap_get_next()
  3) fix hlist corruption due to accessing sp->hlist out of mmu-lock
  4) use rcu functions to access the rcu-protected pointer
  5) a spte will be missed by the lockless walker if the spte is moved within a desc (removing a spte from an rmap that uses only one desc). Fix it by walking the desc bottom-up
- changes from Paolo's review:
  1) make the order and memory barriers between updating the spte / adding the spte into the rmap and the dirty-log more clear
- changes from Marcelo's review:
  1) let fast page fault only fix sptes on the last level (level = 1)
  2) improve some changelogs and comments
- changes from Takuya's review:
  move the patch "flush tlb if the spte can be locklessly modified" forward so it is more easily merged

Thank all of you very much for your time and patience on this patchset!

Since we use rcu_assign_pointer() to update the pointers in the desc even if dirty log is disabled, I have measured the performance:
Host: Intel(R) Xeon(R) CPU X5690 @ 3.47GHz * 12 + 36G memory
- migrate-perf (benchmark the time of get-dirty-log)
  before: Run 10 times, Avg time: 9009483 ns.
  after: Run 10 times, Avg time: 4807343 ns.
- kernbench
  Guest: 12 VCPUs + 8G memory
  before:
  EPT is enabled: # cat 09-05-origin-ept | grep real real 85.58 real 83.47 real 82.95
  EPT is disabled: # cat 09-05-origin-shadow | grep real real 138.77 real 138.99 real 139.55
  after:
  EPT is enabled: # cat 09-05-lockless-ept | grep real real 83.40 real 82.81 real 83.39
  EPT is disabled: # cat 09-05-lockless-shadow | grep real real 138.91 real 139.71 real 138.94
No performance regression!
Background
==
Currently, when marking a memslot dirty-logged or getting the dirty pages, we need to write-protect a large amount of guest memory. This is heavy work, and in particular we need to hold the mmu-lock, which is also required by the vcpu to fix its page table faults and by the mmu-notifier when a host page is being changed. In a guest with extreme cpu/memory usage, this becomes a scalability issue. This patchset introduces a way to locklessly write-protect guest memory.

Idea
==
These are the challenges we meet and the ideas to resolve them.

1) How to locklessly walk the rmap?
The first idea we had for preventing a "desc" from being freed while we walk the rmap was to use RCU. But when a vcpu runs in shadow page mode or nested mmu mode, it updates the rmap really frequently. So we use SLAB_DESTROY_BY_RCU to manage "desc"s instead, which allows the objects to be reused more quickly. We also store a "nulls" value in the last "desc" (desc->more), which helps us detect whether a "desc" has been moved to another rmap, in which case we re-walk the rmap. I learned this idea from nulls-lists.

Another issue is that when a spte is deleted from a "desc", another spte in the last "desc" is moved to this position to replace the deleted one. If the deleted one has already been visited and we have not yet visited the replacement, the replacement is missed by the lockless walk. To fix this case, we do not move sptes backward; instead, we move the entry forward: when a spte is deleted, we move the entry in the first desc to that position.

2) How to locklessly access the shadow page table?
It is easy if the handler is in vcpu context; in that case we can use walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end(), which disable interrupts to stop shadow pages from being freed. But we are in ioctl context and the paths we are optimizing have a heavy workload, so disabling interrupts is not good for system performance.
We add an indicator to the kvm struct (kvm->arch.rcu_free_shadow_page), then use call_rcu() to free the shadow page if that indicator is set. Setting/clearing the indicator is protected by slots_lock, so it need not be atomic and does not hurt performance or scalability.

3) How to locklessly write-protect guest memory?
Currently, there are two behaviors when we write-protect guest memory: one is clearing the Writable bit on the spte, and the other is dropping the spte when it points to a large page. The former is easy (we only need to atomically clear a bit), but the latter is hard, since we need to remove the spte from the rmap. So we unify these two behaviors by only making the spte readonly. Making a large spte readonly instead of nonpresent is also good for reducing jitter.

And we need to pay more attention to the order of making the spte writable, adding the spte into the rmap, and setting the corresponding bit in the dirty bitmap: since kvm_vm_ioctl_get_dirty_log() write-protects sptes based on the dirty bitmap, we should ensure the writable spte can be found in the rmap before the dirty bitmap becomes visible. Otherwise, we clear the dirty bitmap but fail to write-protect the page.

Performance result
The performance result and the benchmark can be found
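The nulls-desc idea from 1) above can be sketched in userspace C (a simplified model with small invented helpers, not the kernel structures): the low bit of desc->more either tags the rmap head the desc belongs to, or points at the next desc. A lockless walker that reaches a nulls value belonging to a different rmap knows the desc has been moved and must re-walk.

```c
#include <stddef.h>

#define PTE_LIST_EXT 3

struct pte_list_desc {
	unsigned long *sptes[PTE_LIST_EXT];
	struct pte_list_desc *more;	/* next desc, or a tagged nulls value */
};

/* Tag the end of the chain with the rmap head it belongs to. The low
 * bit distinguishes the tag from a real desc pointer, which is always
 * at least word-aligned. */
static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc *desc)
{
	desc->more = (struct pte_list_desc *)((unsigned long)pte_list | 1UL);
}

static int desc_is_a_nulls(struct pte_list_desc *desc)
{
	return (unsigned long)desc & 1;
}

static unsigned long *desc_get_nulls_value(struct pte_list_desc *desc)
{
	return (unsigned long *)((unsigned long)desc & ~1UL);
}

/* Walk to the end of the chain and return the rmap head recorded in
 * the nulls marker. A lockless walker compares this with the rmap it
 * started from and re-walks on mismatch. */
static unsigned long *walk_to_nulls(struct pte_list_desc *desc)
{
	while (!desc_is_a_nulls(desc->more))
		desc = desc->more;
	return desc_get_nulls_value(desc->more);
}
```

The tagging trick works for the same reason the kernel can tag the rmap value itself: desc objects are aligned, so bit 0 is free.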
[PATCH v2 07/15] KVM: MMU: redesign the algorithm of pte_list
Change the algorithm to:
1) always add the new desc to the first desc (pointed to by parent_ptes/rmap), which is good for implementing rcu-nulls-list-like lockless rmap walking
2) always move the entry in the first desc to the position we want to remove when deleting a spte from the parent_ptes/rmap (forward-move)
It is good for us to implement lockless rmap walking since in the current code, when a spte is deleted from a "desc", another spte in the last "desc" is moved to this position to replace the deleted one. If the deleted one has already been visited and we do not visit the replacement, the replacement is missed by the lockless walk. To fix this case, we do not move the spte backward; instead, we move the entry forward: when a spte is deleted, we move the entry in the first desc to that position. Both of these changes also reduce cache misses.

Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c | 180 - 1 file changed, 123 insertions(+), 57 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 8ea54d9..08fb4e2 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -913,6 +913,50 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn) return level - 1; } +static int __find_first_free(struct pte_list_desc *desc) +{ + int i; + + for (i = 0; i < PTE_LIST_EXT; i++) + if (!desc->sptes[i]) + break; + return i; +} + +static int find_first_free(struct pte_list_desc *desc) +{ + int free = __find_first_free(desc); + + WARN_ON(free >= PTE_LIST_EXT); + return free; +} + +static int find_last_used(struct pte_list_desc *desc) +{ + int used = __find_first_free(desc) - 1; + + WARN_ON(used < 0 || used >= PTE_LIST_EXT); + return used; +} + +/* + * TODO: we can encode the desc number into the rmap/parent_ptes + * since at least 10 physical/virtual address bits are reserved + * on x86. It is worthwhile if it shows that the desc walking is + * a performance issue.
+ */ +static int count_spte_number(struct pte_list_desc *desc) +{ + int first_free, desc_num; + + first_free = __find_first_free(desc); + + for (desc_num = 0; desc->more; desc = desc->more) + desc_num++; + + return first_free + desc_num * PTE_LIST_EXT; +} + /* * Pte mapping structures: * @@ -923,99 +967,121 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn) * * Returns the number of pte entries before the spte was added or zero if * the spte was not added. - * */ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte, unsigned long *pte_list) { struct pte_list_desc *desc; - int i, count = 0; + int free_pos; if (!*pte_list) { rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte); *pte_list = (unsigned long)spte; - } else if (!(*pte_list & 1)) { + return 0; + } + + if (!(*pte_list & 1)) { rmap_printk("pte_list_add: %p %llx 1->many\n", spte, *spte); desc = mmu_alloc_pte_list_desc(vcpu); desc->sptes[0] = (u64 *)*pte_list; desc->sptes[1] = spte; *pte_list = (unsigned long)desc | 1; - ++count; - } else { - rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte); - desc = (struct pte_list_desc *)(*pte_list & ~1ul); - while (desc->sptes[PTE_LIST_EXT-1] && desc->more) { - desc = desc->more; - count += PTE_LIST_EXT; - } - if (desc->sptes[PTE_LIST_EXT-1]) { - count += PTE_LIST_EXT; - desc->more = mmu_alloc_pte_list_desc(vcpu); - desc = desc->more; - } - for (i = 0; desc->sptes[i]; ++i) - ++count; - desc->sptes[i] = spte; + return 1; } - return count; + + rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte); + desc = (struct pte_list_desc *)(*pte_list & ~1ul); + + /* No empty entry in the desc. 
*/ + if (desc->sptes[PTE_LIST_EXT - 1]) { + struct pte_list_desc *new_desc; + new_desc = mmu_alloc_pte_list_desc(vcpu); + new_desc->more = desc; + desc = new_desc; + *pte_list = (unsigned long)desc | 1; + } + + free_pos = find_first_free(desc); + desc->sptes[free_pos] = spte; + return count_spte_number(desc) - 1; } static void -pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc, - int i, struct pte_list_desc *prev_desc) +pte_list_desc_remove_entry(unsigned long *pte_list, + struct pte_list_desc *desc, int
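The forward-move deletion this patch introduces can be sketched as a toy userspace model (hypothetical helpers, not the kernel code): on removal, the hole is refilled from the last used slot of the FIRST desc, so a lockless walker proceeding from the first desc toward the tail does not lose entries to a refill taken from a desc it has not visited yet.

```c
#include <stddef.h>

#define PTE_LIST_EXT 3

struct pte_list_desc {
	unsigned long *sptes[PTE_LIST_EXT];
	struct pte_list_desc *more;
};

/* Index of the first free slot in a desc (PTE_LIST_EXT if full).
 * Entries are assumed packed from index 0, as in the kernel. */
static int first_free(struct pte_list_desc *desc)
{
	int i;

	for (i = 0; i < PTE_LIST_EXT; i++)
		if (!desc->sptes[i])
			break;
	return i;
}

/* Remove sptes[idx] of @desc by moving the last used entry of the
 * FIRST desc (@first, the head pointed to by the rmap) into the hole,
 * instead of taking the replacement from the last desc as the old
 * backward-move code did. */
static void remove_entry_forward(struct pte_list_desc *first,
				 struct pte_list_desc *desc, int idx)
{
	int last = first_free(first) - 1;	/* last used slot in head desc */

	desc->sptes[idx] = first->sptes[last];
	first->sptes[last] = NULL;
}
```

When @desc is itself the head and idx is the last used slot, the move degenerates to simply clearing the slot, which is still correct.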
[PATCH v2 04/15] KVM: MMU: flush tlb if the spte can be locklessly modified
Relax the tlb flush condition since we will write-protect the spte out of mmu lock. Note that lockless write-protection only marks the writable spte readonly, and the spte can be writable only if both SPTE_HOST_WRITEABLE and SPTE_MMU_WRITEABLE are set (which is what spte_is_locklessly_modifiable tests).

This patch avoids this kind of race:

VCPU 0                            VCPU 1
lockless write protection:
  set spte.w = 0
                                  lock mmu-lock
                                  write-protect the spte to sync the shadow
                                  page, see spte.w = 0, then do not flush tlb
                                  unlock mmu-lock
                                  !!! At this point, the shadow page can still
                                  be writable due to the stale tlb entry
Flush all TLB

Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 88107ee..7488229 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -595,7 +595,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte) * we always atomicly update it, see the comments in * spte_has_volatile_bits(). */ - if (is_writable_pte(old_spte) && !is_writable_pte(new_spte)) + if (spte_is_locklessly_modifiable(old_spte) && + !is_writable_pte(new_spte)) ret = true; if (!shadow_accessed_mask) -- 1.8.1.4
[PATCH v2 01/15] KVM: MMU: fix the count of spte number
If the desc is the last one and it is full, its sptes are not counted. Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 6e2d2c8..7714fd8 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -948,6 +948,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte, count += PTE_LIST_EXT; } if (desc->sptes[PTE_LIST_EXT-1]) { + count += PTE_LIST_EXT; desc->more = mmu_alloc_pte_list_desc(vcpu); desc = desc->more; } -- 1.8.1.4
[PATCH v2 05/15] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes
Now we can flush all the TLBs out of the mmu lock without TLB corruption when write-protecting the sptes, because:
- we have marked large sptes readonly instead of dropping them, which means we just change the spte from writable to readonly, so we only need to care about the case of changing a spte from present to present (changing a spte from present to nonpresent flushes all the TLBs immediately); in other words, the only case we need to care about is mmu_spte_update()
- in mmu_spte_update(), we check SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE instead of PT_WRITABLE_MASK, which means it does not depend on PT_WRITABLE_MASK anymore

Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c | 18 ++ arch/x86/kvm/x86.c | 9 +++-- 2 files changed, 21 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 7488229..a983570 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -4320,15 +4320,25 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot) if (*rmapp) __rmap_write_protect(kvm, rmapp, false); - if (need_resched() || spin_needbreak(&kvm->mmu_lock)) { - kvm_flush_remote_tlbs(kvm); + if (need_resched() || spin_needbreak(&kvm->mmu_lock)) cond_resched_lock(&kvm->mmu_lock); - } } } - kvm_flush_remote_tlbs(kvm); spin_unlock(&kvm->mmu_lock); + + /* +* We can flush all the TLBs out of the mmu lock without TLB +* corruption since we just change the spte from writable to +* readonly so that we only need to care the case of changing +* spte from present to present (changing the spte from present +* to nonpresent will flush all the TLBs immediately), in other +* words, the only case we care is mmu_spte_update() where we +* haved checked SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE +* instead of PT_WRITABLE_MASK, that means it does not depend +* on PT_WRITABLE_MASK anymore.
+*/ + kvm_flush_remote_tlbs(kvm); } #define BATCH_ZAP_PAGES10 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6ad0c07..72f1487 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3560,11 +3560,16 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) offset = i * BITS_PER_LONG; kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask); } - if (is_dirty) - kvm_flush_remote_tlbs(kvm); spin_unlock(&kvm->mmu_lock); + /* +* All the TLBs can be flushed out of mmu lock, see the comments in +* kvm_mmu_slot_remove_write_access(). +*/ + if (is_dirty) + kvm_flush_remote_tlbs(kvm); + r = -EFAULT; if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n)) goto out; -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 02/15] KVM: MMU: properly check last spte in fast_page_fault()
Using sp->role.level instead of @level since @level is not got from the page table hierarchy There is no issue in current code since the fast page fault currently only fixes the fault caused by dirty-log that is always on the last level (level = 1) This patch makes the code more readable and avoids potential issue in the further development Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 7714fd8..869f1db 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2804,9 +2804,9 @@ static bool page_fault_can_be_fast(u32 error_code) } static bool -fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 spte) +fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, + u64 *sptep, u64 spte) { - struct kvm_mmu_page *sp = page_header(__pa(sptep)); gfn_t gfn; WARN_ON(!sp->role.direct); @@ -2832,6 +2832,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level, u32 error_code) { struct kvm_shadow_walk_iterator iterator; + struct kvm_mmu_page *sp; bool ret = false; u64 spte = 0ull; @@ -2852,7 +2853,8 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level, goto exit; } - if (!is_last_spte(spte, level)) + sp = page_header(__pa(iterator.sptep)); + if (!is_last_spte(spte, sp->role.level)) goto exit; /* @@ -2878,7 +2880,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level, * the gfn is not stable for indirect shadow page. * See Documentation/virtual/kvm/locking.txt to get more detail. 
*/ - ret = fast_pf_fix_direct_spte(vcpu, iterator.sptep, spte); + ret = fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte); exit: trace_fast_page_fault(vcpu, gva, error_code, iterator.sptep, spte, ret); -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 09/15] KVM: MMU: introduce pte-list lockless walker
The basic idea is from nulls list which uses a nulls to indicate whether the desc is moved to different pte-list Note, we should do bottom-up walk in the desc since we always move the bottom entry to the deleted position. A desc only has 3 entries in the current code so it is not a problem now, but the issue will be triggered if we expend the size of desc in the further development Thanks to SLAB_DESTROY_BY_RCU, the desc can be quickly reused Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c | 57 ++ 1 file changed, 53 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index c5f1b27..3e1432f 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -975,6 +975,10 @@ static int count_spte_number(struct pte_list_desc *desc) return first_free + desc_num * PTE_LIST_EXT; } +#define rcu_assign_pte_list(pte_list_p, value) \ + rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p), \ + (unsigned long *)(value)) + /* * Pte mapping structures: * @@ -994,7 +998,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte, if (!*pte_list) { rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte); - *pte_list = (unsigned long)spte; + rcu_assign_pte_list(pte_list, spte); return 0; } @@ -1004,7 +1008,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte, desc->sptes[0] = (u64 *)*pte_list; desc->sptes[1] = spte; desc_mark_nulls(pte_list, desc); - *pte_list = (unsigned long)desc | 1; + rcu_assign_pte_list(pte_list, (unsigned long)desc | 1); return 1; } @@ -1017,7 +1021,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte, new_desc = mmu_alloc_pte_list_desc(vcpu); new_desc->more = desc; desc = new_desc; - *pte_list = (unsigned long)desc | 1; + rcu_assign_pte_list(pte_list, (unsigned long)desc | 1); } free_pos = find_first_free(desc); @@ -1125,6 +1129,51 @@ static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn) WARN_ON(desc_get_nulls_value(desc) != pte_list); } +/* The caller should hold rcu lock. 
*/ +static void pte_list_walk_lockless(unsigned long *pte_list, + pte_list_walk_fn fn) +{ + struct pte_list_desc *desc; + unsigned long pte_list_value; + int i; + +restart: + /* +* Force the pte_list to be reloaded. +* +* See the comments in hlist_nulls_for_each_entry_rcu(). +*/ + barrier(); + pte_list_value = *rcu_dereference(pte_list); + if (!pte_list_value) + return; + + if (!(pte_list_value & 1)) + return fn((u64 *)pte_list_value); + + desc = (struct pte_list_desc *)(pte_list_value & ~1ul); + while (!desc_is_a_nulls(desc)) { + /* +* We should do top-down walk since we always use the higher +* indices to replace the deleted entry if only one desc is +* used in the rmap when a spte is removed. Otherwise the +* moved entry will be missed. +*/ + for (i = PTE_LIST_EXT - 1; i >= 0; i--) + if (desc->sptes[i]) + fn(desc->sptes[i]); + + desc = rcu_dereference(desc->more); + + /* It is being initialized. */ + if (unlikely(!desc)) + goto restart; + } + + if (unlikely(desc_get_nulls_value(desc) != pte_list)) + goto restart; +} + static unsigned long *__gfn_to_rmap(gfn_t gfn, int level, struct kvm_memory_slot *slot) { @@ -4651,7 +4700,7 @@ int kvm_mmu_module_init(void) { pte_list_desc_cache = kmem_cache_create("pte_list_desc", sizeof(struct pte_list_desc), - 0, 0, NULL); + 0, SLAB_DESTROY_BY_RCU, NULL); if (!pte_list_desc_cache) goto nomem; -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
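The nulls-marker restart logic can be modelled in user space. This is a single-threaded sketch: the encoding of the terminator and the high-to-low entry scan mirror the patch, while `walk_lockless()`, `demo_walk()` and the counting callback are illustrative names added for testability:

```c
#include <stddef.h>

#define PTE_LIST_EXT 3

struct pte_list_desc {
	unsigned long *sptes[PTE_LIST_EXT];
	struct pte_list_desc *more;
};

/* The terminator encodes the pte_list head with bit 0 set, exactly
 * like the kernel's desc_mark_nulls(). */
static int desc_is_a_nulls(struct pte_list_desc *desc)
{
	return (unsigned long)desc & 1;
}

static unsigned long *desc_get_nulls_value(struct pte_list_desc *desc)
{
	return (unsigned long *)((unsigned long)desc & ~1ul);
}

static int visited;
static void count_fn(unsigned long *spte)
{
	(void)spte;
	visited++;
}

/*
 * Model of pte_list_walk_lockless(): scan each desc from the highest
 * index down (a removed entry is replaced from the higher indices, so
 * a low-to-high scan could miss the moved entry), then check that the
 * nulls marker still points at the head we started from; if not, the
 * desc chain was moved to another pte_list and the walk restarts.
 */
static void walk_lockless(unsigned long *pte_list,
			  void (*fn)(unsigned long *))
{
	struct pte_list_desc *desc;
	unsigned long value;
	int i;

restart:
	value = *pte_list;
	if (!value)
		return;
	if (!(value & 1)) {	/* a single spte, no desc */
		fn((unsigned long *)value);
		return;
	}

	desc = (struct pte_list_desc *)(value & ~1ul);
	while (!desc_is_a_nulls(desc)) {
		for (i = PTE_LIST_EXT - 1; i >= 0; i--)
			if (desc->sptes[i])
				fn(desc->sptes[i]);
		desc = desc->more;
	}
	if (desc_get_nulls_value(desc) != pte_list)
		goto restart;	/* moved to another list, start over */
}

/* Build a one-desc list holding two sptes and walk it. */
static int demo_walk(void)
{
	static unsigned long spte_a, spte_b;
	static struct pte_list_desc desc;
	unsigned long head;

	desc.sptes[0] = &spte_a;
	desc.sptes[1] = &spte_b;
	desc.sptes[2] = NULL;
	head = (unsigned long)&desc | 1;
	desc.more = (struct pte_list_desc *)((unsigned long)&head | 1);
	visited = 0;
	walk_lockless(&head, count_fn);
	return visited;
}
```

In the kernel the reread of the head additionally needs `rcu_dereference()` plus the `barrier()` the patch comments on; the model elides the memory-ordering details.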
[PATCH v2 10/15] KVM: MMU: initialize the pointers in pte_list_desc properly
Since pte_list_desc will be locklessly accessed we need to atomicly initialize its pointers so that the lockless walker can not get the partial value from the pointer In this patch we use the way of assigning pointer to initialize its pointers which is always atomic instead of using kmem_cache_zalloc Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c | 27 +-- 1 file changed, 21 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 3e1432f..fe80019 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -687,14 +687,15 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu) } static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache, - struct kmem_cache *base_cache, int min) + struct kmem_cache *base_cache, int min, + gfp_t flags) { void *obj; if (cache->nobjs >= min) return 0; while (cache->nobjs < ARRAY_SIZE(cache->objects)) { - obj = kmem_cache_zalloc(base_cache, GFP_KERNEL); + obj = kmem_cache_alloc(base_cache, flags); if (!obj) return -ENOMEM; cache->objects[cache->nobjs++] = obj; @@ -741,14 +742,16 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu) int r; r = mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache, - pte_list_desc_cache, 8 + PTE_PREFETCH_NUM); + pte_list_desc_cache, 8 + PTE_PREFETCH_NUM, + GFP_KERNEL); if (r) goto out; r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 8); if (r) goto out; r = mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache, - mmu_page_header_cache, 4); + mmu_page_header_cache, 4, + GFP_KERNEL | __GFP_ZERO); out: return r; } @@ -913,6 +916,17 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn) return level - 1; } +static void pte_list_desc_ctor(void *p) +{ + struct pte_list_desc *desc = p; + int i; + + for (i = 0; i < PTE_LIST_EXT; i++) + desc->sptes[i] = NULL; + + desc->more = NULL; +} + static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc *desc) { unsigned long marker; @@ -1066,6 +1080,7 @@ 
pte_list_desc_remove_entry(unsigned long *pte_list, */ if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) { *pte_list = (unsigned long)first_desc->sptes[0]; + first_desc->sptes[0] = NULL; mmu_free_pte_list_desc(first_desc); } } @@ -4699,8 +4714,8 @@ static void mmu_destroy_caches(void) int kvm_mmu_module_init(void) { pte_list_desc_cache = kmem_cache_create("pte_list_desc", - sizeof(struct pte_list_desc), - 0, SLAB_DESTROY_BY_RCU, NULL); + sizeof(struct pte_list_desc), + 0, SLAB_DESTROY_BY_RCU, pte_list_desc_ctor); if (!pte_list_desc_cache) goto nomem; -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread
It is easy if the handler is in the vcpu context, in that case we can use walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end() that disable interrupt to stop shadow page being freed. But we are on the ioctl context and the paths we are optimizing for have heavy workload, disabling interrupt is not good for the system performance We add a indicator into kvm struct (kvm->arch.rcu_free_shadow_page), then use call_rcu() to free the shadow page if that indicator is set. Set/Clear the indicator are protected by slot-lock, so it need not be atomic and does not hurt the performance and the scalability Signed-off-by: Xiao Guangrong --- arch/x86/include/asm/kvm_host.h | 6 +- arch/x86/kvm/mmu.c | 32 arch/x86/kvm/mmu.h | 22 ++ 3 files changed, 59 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index c76ff74..8e4ca0d 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -226,7 +226,10 @@ struct kvm_mmu_page { /* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen. 
*/ unsigned long mmu_valid_gen; - DECLARE_BITMAP(unsync_child_bitmap, 512); + union { + DECLARE_BITMAP(unsync_child_bitmap, 512); + struct rcu_head rcu; + }; #ifdef CONFIG_X86_32 /* @@ -554,6 +557,7 @@ struct kvm_arch { */ struct list_head active_mmu_pages; struct list_head zapped_obsolete_pages; + bool rcu_free_shadow_page; struct list_head assigned_dev_head; struct iommu_domain *iommu_domain; diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 2bf450a..f551fc7 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2355,6 +2355,30 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, return ret; } +static void kvm_mmu_isolate_pages(struct list_head *invalid_list) +{ + struct kvm_mmu_page *sp; + + list_for_each_entry(sp, invalid_list, link) + kvm_mmu_isolate_page(sp); +} + +static void free_pages_rcu(struct rcu_head *head) +{ + struct kvm_mmu_page *next, *sp; + + sp = container_of(head, struct kvm_mmu_page, rcu); + while (sp) { + if (!list_empty(&sp->link)) + next = list_first_entry(&sp->link, + struct kvm_mmu_page, link); + else + next = NULL; + kvm_mmu_free_page(sp); + sp = next; + } +} + static void kvm_mmu_commit_zap_page(struct kvm *kvm, struct list_head *invalid_list) { @@ -2375,6 +2399,14 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm, */ kvm_flush_remote_tlbs(kvm); + if (kvm->arch.rcu_free_shadow_page) { + kvm_mmu_isolate_pages(invalid_list); + sp = list_first_entry(invalid_list, struct kvm_mmu_page, link); + list_del_init(invalid_list); + call_rcu(&sp->rcu, free_pages_rcu); + return; + } + list_for_each_entry_safe(sp, nsp, invalid_list, link) { WARN_ON(!sp->role.invalid || sp->root_count); kvm_mmu_isolate_page(sp); diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 77e044a..61217f3 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -117,4 +117,26 @@ static inline bool permission_fault(struct kvm_mmu *mmu, unsigned pte_access, } void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm); + +/* + * 
The caller should ensure that these two functions should be + * serially called. + */ +static inline void kvm_mmu_rcu_free_page_begin(struct kvm *kvm) +{ + rcu_read_lock(); + + kvm->arch.rcu_free_shadow_page = true; + /* Set the indicator before access shadow page. */ + smp_mb(); +} + +static inline void kvm_mmu_rcu_free_page_end(struct kvm *kvm) +{ + /* Make sure that access shadow page has finished. */ + smp_mb(); + kvm->arch.rcu_free_shadow_page = false; + + rcu_read_unlock(); +} #endif -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
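The begin/end pairing can be sketched in user space. Everything here is a stand-in: the RCU stubs are no-ops, `struct kvm_model` replaces `struct kvm_arch`, and `__sync_synchronize()` plays the role of `smp_mb()`:

```c
#include <stdbool.h>

/* User-space stand-ins: the real RCU read-side critical section is
 * what pins the shadow pages that kvm_mmu_commit_zap_page() hands to
 * call_rcu() while the indicator is set. */
static void rcu_read_lock(void) {}
static void rcu_read_unlock(void) {}

struct kvm_model {
	bool rcu_free_shadow_page;
};

static struct kvm_model demo_kvm;

static void kvm_mmu_rcu_free_page_begin(struct kvm_model *kvm)
{
	rcu_read_lock();

	kvm->rcu_free_shadow_page = true;
	/* Set the indicator before accessing any shadow page. */
	__sync_synchronize();
}

static void kvm_mmu_rcu_free_page_end(struct kvm_model *kvm)
{
	/* Make sure all shadow-page accesses have finished. */
	__sync_synchronize();
	kvm->rcu_free_shadow_page = false;

	rcu_read_unlock();
}
```

The serialization requirement in the comment comes from the flag being a plain bool: two concurrent begin/end pairs would race on it, which is why the patch protects it with the slots lock.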
[PATCH v2 13/15] KVM: MMU: locklessly write-protect the page
Currently, when mark memslot dirty logged or get dirty page, we need to write-protect large guest memory, it is the heavy work, especially, we need to hold mmu-lock which is also required by vcpu to fix its page table fault and mmu-notifier when host page is being changed. In the extreme cpu / memory used guest, it becomes a scalability issue This patch introduces a way to locklessly write-protect guest memory Now, lockless rmap walk, lockless shadow page table access and lockless spte wirte-protection are ready, it is the time to implements page write-protection out of mmu-lock Signed-off-by: Xiao Guangrong --- arch/x86/include/asm/kvm_host.h | 4 arch/x86/kvm/mmu.c | 53 + arch/x86/kvm/mmu.h | 6 + arch/x86/kvm/x86.c | 11 - 4 files changed, 49 insertions(+), 25 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 8e4ca0d..00b44b1 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -789,10 +789,6 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask, u64 dirty_mask, u64 nx_mask, u64 x_mask); int kvm_mmu_reset_context(struct kvm_vcpu *vcpu); -void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot); -void kvm_mmu_write_protect_pt_masked(struct kvm *kvm, -struct kvm_memory_slot *slot, -gfn_t gfn_offset, unsigned long mask); void kvm_mmu_zap_all(struct kvm *kvm); void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm); unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm); diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index f551fc7..44b7822 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1376,8 +1376,31 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp, return flush; } -/** - * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages +static void __rmap_write_protect_lockless(u64 *sptep) +{ + u64 spte; + int level = page_header(__pa(sptep))->role.level; + +retry: + spte = mmu_spte_get_lockless(sptep); + if 
(unlikely(!is_last_spte(spte, level) || !is_writable_pte(spte))) + return; + + if (likely(cmpxchg64(sptep, spte, spte & ~PT_WRITABLE_MASK) == spte)) + return; + + goto retry; +} + +static void rmap_write_protect_lockless(unsigned long *rmapp) +{ + pte_list_walk_lockless(rmapp, __rmap_write_protect_lockless); +} + +/* + * kvm_mmu_write_protect_pt_masked_lockless - write protect selected PT level + * pages out of mmu-lock. + * * @kvm: kvm instance * @slot: slot to protect * @gfn_offset: start of the BITS_PER_LONG pages we care about @@ -1386,16 +1409,17 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp, * Used when we do not need to care about huge page mappings: e.g. during dirty * logging we do not have any such mappings. */ -void kvm_mmu_write_protect_pt_masked(struct kvm *kvm, -struct kvm_memory_slot *slot, -gfn_t gfn_offset, unsigned long mask) +void +kvm_mmu_write_protect_pt_masked_lockless(struct kvm *kvm, +struct kvm_memory_slot *slot, +gfn_t gfn_offset, unsigned long mask) { unsigned long *rmapp; while (mask) { rmapp = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask), PT_PAGE_TABLE_LEVEL, slot); - __rmap_write_protect(kvm, rmapp, false); + rmap_write_protect_lockless(rmapp); /* clear the first set bit */ mask &= mask - 1; @@ -4547,7 +4571,7 @@ int kvm_mmu_setup(struct kvm_vcpu *vcpu) return init_kvm_mmu(vcpu); } -void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot) +void kvm_mmu_slot_remove_write_access_lockless(struct kvm *kvm, int slot) { struct kvm_memory_slot *memslot; gfn_t last_gfn; @@ -4556,8 +4580,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot) memslot = id_to_memslot(kvm->memslots, slot); last_gfn = memslot->base_gfn + memslot->npages - 1; - spin_lock(&kvm->mmu_lock); - + kvm_mmu_rcu_free_page_begin(kvm); for (i = PT_PAGE_TABLE_LEVEL; i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) { unsigned long *rmapp; @@ -4567,15 +4590,15 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 
int slot) last_index = gfn_to_index(last_gfn, memslot->base_gfn, i); for (index = 0; index <= last_index; ++index, ++rmapp) { - if (*rmapp) - __rmap_write_protect(kvm, rmapp, false); + rmap_write_protect_lockless(rmapp);
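The retry loop in `__rmap_write_protect_lockless()` is a standard compare-and-swap pattern; a user-space sketch, with GCC's `__atomic` builtins standing in for `cmpxchg64()` and an illustrative bit layout:

```c
#include <stdint.h>

#define PT_WRITABLE_MASK	(1ull << 1)

/*
 * Model of __rmap_write_protect_lockless(): reread the spte and retry
 * until either the writable bit is observed clear or we clear it
 * ourselves without racing with a concurrent update to the spte.
 */
static void write_protect_lockless(uint64_t *sptep)
{
	uint64_t spte;

	do {
		spte = __atomic_load_n(sptep, __ATOMIC_RELAXED);
		if (!(spte & PT_WRITABLE_MASK))
			return;	/* already read-only, nothing to do */
	} while (!__atomic_compare_exchange_n(sptep, &spte,
					      spte & ~PT_WRITABLE_MASK,
					      0, __ATOMIC_SEQ_CST,
					      __ATOMIC_SEQ_CST));
}
```

Only the writable bit changes; every other bit of the spte is preserved, which is what makes it safe to race this against the fast page fault path that sets the bit back with its own cmpxchg.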
[PATCH v2 14/15] KVM: MMU: clean up spte_write_protect
Now, the only user of spte_write_protect is rmap_write_protect which always calls spte_write_protect with pt_protect = true, so drop it and the unused parameter @kvm Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c | 19 --- 1 file changed, 8 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 44b7822..f3f17a0 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1330,8 +1330,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep) } /* - * Write-protect on the specified @sptep, @pt_protect indicates whether - * spte write-protection is caused by protecting shadow page table. + * Write-protect on the specified @sptep. * * Note: write protection is difference between drity logging and spte * protection: @@ -1342,25 +1341,23 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep) * * Return true if tlb need be flushed. */ -static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect) +static bool spte_write_protect(u64 *sptep) { u64 spte = *sptep; if (!is_writable_pte(spte) && - !(pt_protect && spte_is_locklessly_modifiable(spte))) + !spte_is_locklessly_modifiable(spte)) return false; rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep); - if (pt_protect) - spte &= ~SPTE_MMU_WRITEABLE; - spte = spte & ~PT_WRITABLE_MASK; + spte &= ~SPTE_MMU_WRITEABLE; + spte &= ~PT_WRITABLE_MASK; return mmu_spte_update(sptep, spte); } -static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp, -bool pt_protect) +static bool __rmap_write_protect(unsigned long *rmapp) { u64 *sptep; struct rmap_iterator iter; @@ -1369,7 +1366,7 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp, for (sptep = rmap_get_first(*rmapp, &iter); sptep;) { BUG_ON(!(*sptep & PT_PRESENT_MASK)); - flush |= spte_write_protect(kvm, sptep, pt_protect); + flush |= spte_write_protect(sptep); sptep = rmap_get_next(&iter); } @@ -1438,7 +1435,7 @@ static bool rmap_write_protect(struct kvm 
*kvm, u64 gfn) for (i = PT_PAGE_TABLE_LEVEL; i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) { rmapp = __gfn_to_rmap(gfn, i, slot); - write_protected |= __rmap_write_protect(kvm, rmapp, true); + write_protected |= __rmap_write_protect(rmapp); } return write_protected; -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 11/15] KVM: MMU: reintroduce kvm_mmu_isolate_page()
It was removed by commit 834be0d83. Now we will need it to do lockless shadow page walking protected by rcu, so reintroduce it Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c | 23 --- 1 file changed, 20 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index fe80019..2bf450a 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1675,14 +1675,30 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr) percpu_counter_add(&kvm_total_used_mmu_pages, nr); } -static void kvm_mmu_free_page(struct kvm_mmu_page *sp) +/* + * Remove the sp from shadow page cache, after call it, + * we can not find this sp from the cache, and the shadow + * page table is still valid. + * + * It should be under the protection of mmu lock. + */ +static void kvm_mmu_isolate_page(struct kvm_mmu_page *sp) { ASSERT(is_empty_shadow_page(sp->spt)); + hlist_del(&sp->hash_link); - list_del(&sp->link); - free_page((unsigned long)sp->spt); if (!sp->role.direct) free_page((unsigned long)sp->gfns); +} + +/* + * Free the shadow page table and the sp, we can do it + * out of the protection of mmu lock. + */ +static void kvm_mmu_free_page(struct kvm_mmu_page *sp) +{ + list_del(&sp->link); + free_page((unsigned long)sp->spt); kmem_cache_free(mmu_page_header_cache, sp); } @@ -2361,6 +2377,7 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm, list_for_each_entry_safe(sp, nsp, invalid_list, link) { WARN_ON(!sp->role.invalid || sp->root_count); + kvm_mmu_isolate_page(sp); kvm_mmu_free_page(sp); } } -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 15/15] KVM: MMU: use rcu functions to access the pointer
Use rcu_assign_pointer() to update all the pointer in desc and use rcu_dereference() to lockless read the pointer Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c | 46 -- 1 file changed, 28 insertions(+), 18 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index f3f17a0..808c2d9 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -927,12 +927,23 @@ static void pte_list_desc_ctor(void *p) desc->more = NULL; } +#define rcu_assign_pte_list(pte_list_p, value) \ + rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p), \ + (unsigned long *)(value)) + +#define rcu_assign_desc_more(morep, value) \ + rcu_assign_pointer(*(unsigned long __rcu **)&morep, \ + (unsigned long *)value) + +#define rcu_assign_spte(sptep, value) \ + rcu_assign_pointer(*(u64 __rcu **)&sptep, (u64 *)value) + static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc *desc) { unsigned long marker; marker = (unsigned long)pte_list | 1UL; - desc->more = (struct pte_list_desc *)marker; + rcu_assign_desc_more(desc->more, (struct pte_list_desc *)marker); } static bool desc_is_a_nulls(struct pte_list_desc *desc) @@ -989,10 +1000,6 @@ static int count_spte_number(struct pte_list_desc *desc) return first_free + desc_num * PTE_LIST_EXT; } -#define rcu_assign_pte_list(pte_list_p, value) \ - rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p), \ - (unsigned long *)(value)) - /* * Pte mapping structures: * @@ -1019,8 +1026,8 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte, if (!(*pte_list & 1)) { rmap_printk("pte_list_add: %p %llx 1->many\n", spte, *spte); desc = mmu_alloc_pte_list_desc(vcpu); - desc->sptes[0] = (u64 *)*pte_list; - desc->sptes[1] = spte; + rcu_assign_spte(desc->sptes[0], *pte_list); + rcu_assign_spte(desc->sptes[1], spte); desc_mark_nulls(pte_list, desc); rcu_assign_pte_list(pte_list, (unsigned long)desc | 1); return 1; @@ -1033,13 +1040,13 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte, if (desc->sptes[PTE_LIST_EXT - 1]) 
{ struct pte_list_desc *new_desc; new_desc = mmu_alloc_pte_list_desc(vcpu); - new_desc->more = desc; + rcu_assign_desc_more(new_desc->more, desc); desc = new_desc; rcu_assign_pte_list(pte_list, (unsigned long)desc | 1); } free_pos = find_first_free(desc); - desc->sptes[free_pos] = spte; + rcu_assign_spte(desc->sptes[free_pos], spte); return count_spte_number(desc) - 1; } @@ -1057,8 +1064,8 @@ pte_list_desc_remove_entry(unsigned long *pte_list, * Move the entry from the first desc to this position we want * to remove. */ - desc->sptes[i] = first_desc->sptes[last_used]; - first_desc->sptes[last_used] = NULL; + rcu_assign_spte(desc->sptes[i], first_desc->sptes[last_used]); + rcu_assign_spte(first_desc->sptes[last_used], NULL); /* No valid entry in this desc, we can free this desc now. */ if (!first_desc->sptes[0]) { @@ -1070,7 +1077,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list, WARN_ON(desc_is_a_nulls(next_desc)); mmu_free_pte_list_desc(first_desc); - *pte_list = (unsigned long)next_desc | 1ul; + rcu_assign_pte_list(pte_list, (unsigned long)next_desc | 1ul); return; } @@ -1079,8 +1086,8 @@ pte_list_desc_remove_entry(unsigned long *pte_list, * then the desc can be freed. */ if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) { - *pte_list = (unsigned long)first_desc->sptes[0]; - first_desc->sptes[0] = NULL; + rcu_assign_pte_list(pte_list, first_desc->sptes[0]); + rcu_assign_spte(first_desc->sptes[0], NULL); mmu_free_pte_list_desc(first_desc); } } @@ -1102,7 +1109,7 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list) pr_err("pte_list_remove: %p 1->BUG\n", spte); BUG(); } - *pte_list = 0; + rcu_assign_pte_list(pte_list, 0); return; } @@ -1174,9 +1181,12 @@ restart: * used in the rmap when a spte is removed. Otherwise the * moved entry will be missed. */ - for (i = PTE_LIST_EXT - 1; i >= 0; i--) - if (desc->sptes[i]) -
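What `rcu_assign_pointer()`/`rcu_dereference()` buy is publication ordering: a reader that sees the new pointer also sees every store made to the pointee before it was published. A simplified user-space model (the macro names and `struct desc_model` are illustrative, and an acquire load is used as a stronger stand-in for the dependency ordering `rcu_dereference()` relies on):

```c
#include <stddef.h>

/* Release-store the pointer so the pointee's initialization cannot be
 * reordered past the publication; acquire-load on the read side. */
#define publish_ptr(p, v)	__atomic_store_n(&(p), (v), __ATOMIC_RELEASE)
#define read_ptr(p)		__atomic_load_n(&(p), __ATOMIC_ACQUIRE)

struct desc_model {
	unsigned long payload;
};

static struct desc_model *shared;

static unsigned long demo_publish_and_read(void)
{
	static struct desc_model d;
	struct desc_model *seen;

	d.payload = 42;			/* initialize first ... */
	publish_ptr(shared, &d);	/* ... then publish */

	seen = read_ptr(shared);
	return seen ? seen->payload : 0;
}
```

A plain assignment gives the compiler and CPU licence to reorder the payload store after the pointer store, which is exactly the partial-value window the lockless pte-list walker must never observe.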
Re: Live migration makes VM unusable
I have run some experiments as you suggested. Migrating the VM back to the node where it worked fine does not help. I found this in the VM's logs:

Clocksource tsc unstable (delta = 123652847 ns)

It was logged after the first live migration but did not break the VM. I continued migrating it back and forth, and after a few migrations it broke as usual. Nothing more in the logs. I experimented with the clock source as described here: http://liquidat.wordpress.com/2013/04/02/howto-fixing-unstable-clocksource-in-virtualised-centos/ It did not help. I also tried changing the clock source manually, but that still did not restore the VM.

regards
--
Maciej Gałkiewicz
Shelly Cloud Sp. z o. o., Sysadmin
http://shellycloud.com/, mac...@shellycloud.com
KRS: 440358 REGON: 101504426
Re: [PATCH v3] KVM: nVMX: Fully support of nested VMX preemption timer
On Thu, Sep 5, 2013 at 5:24 PM, Zhang, Yang Z wrote: > Arthur Chunqi Li wrote on 2013-09-05: >> On Thu, Sep 5, 2013 at 3:45 PM, Zhang, Yang Z >> wrote: >> > Arthur Chunqi Li wrote on 2013-09-04: >> >> This patch contains the following two changes: >> >> 1. Fix the bug in nested preemption timer support. If vmexit L2->L0 >> >> with some reasons not emulated by L1, preemption timer value should >> >> be save in such exits. >> >> 2. Add support of "Save VMX-preemption timer value" VM-Exit controls >> >> to nVMX. >> >> >> >> With this patch, nested VMX preemption timer features are fully supported. >> >> >> >> Signed-off-by: Arthur Chunqi Li >> >> --- >> >> This series depends on queue. >> >> >> >> arch/x86/include/uapi/asm/msr-index.h |1 + >> >> arch/x86/kvm/vmx.c| 51 >> >> ++--- >> >> 2 files changed, 48 insertions(+), 4 deletions(-) >> >> >> >> diff --git a/arch/x86/include/uapi/asm/msr-index.h >> >> b/arch/x86/include/uapi/asm/msr-index.h >> >> index bb04650..b93e09a 100644 >> >> --- a/arch/x86/include/uapi/asm/msr-index.h >> >> +++ b/arch/x86/include/uapi/asm/msr-index.h >> >> @@ -536,6 +536,7 @@ >> >> >> >> /* MSR_IA32_VMX_MISC bits */ >> >> #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL << >> 29) >> >> +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE 0x1F >> >> /* AMD-V MSRs */ >> >> >> >> #define MSR_VM_CR 0xc0010114 >> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index >> >> 1f1da43..870caa8 >> >> 100644 >> >> --- a/arch/x86/kvm/vmx.c >> >> +++ b/arch/x86/kvm/vmx.c >> >> @@ -2204,7 +2204,14 @@ static __init void >> >> nested_vmx_setup_ctls_msrs(void) #ifdef CONFIG_X86_64 >> >> VM_EXIT_HOST_ADDR_SPACE_SIZE | #endif >> >> - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT; >> >> + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT | >> >> + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER; >> >> + if (!(nested_vmx_pinbased_ctls_high & >> >> PIN_BASED_VMX_PREEMPTION_TIMER)) >> >> + nested_vmx_exit_ctls_high &= >> >> + (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER); >> >> + 
if (!(nested_vmx_exit_ctls_high & >> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) >> >> + nested_vmx_pinbased_ctls_high &= >> >> + (~PIN_BASED_VMX_PREEMPTION_TIMER); >> > The following logic is more clearly: >> > if(nested_vmx_pinbased_ctls_high & >> PIN_BASED_VMX_PREEMPTION_TIMER) >> > nested_vmx_exit_ctls_high |= >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER >> Here I have such consideration: this logic is wrong if CPU support >> PIN_BASED_VMX_PREEMPTION_TIMER but doesn't support >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER, though I don't know if this does >> occurs. So the codes above reads the MSR and reserves the features it >> supports, and here I just check if these two features are supported >> simultaneously. >> > No. Only VM_EXIT_SAVE_VMX_PREEMPTION_TIMER depends on > PIN_BASED_VMX_PREEMPTION_TIMER. PIN_BASED_VMX_PREEMPTION_TIMER is an > independent feature > >> You remind that this piece of codes can write like this: >> if (!(nested_vmx_pin_based_ctls_high & >> PIN_BASED_VMX_PREEMPTION_TIMER) || >> !(nested_vmx_exit_ctls_high & >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) { >> nested_vmx_exit_ctls_high >> &=(~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER); >> nested_vmx_pinbased_ctls_high &= >> (~PIN_BASED_VMX_PREEMPTION_TIMER); >> } >> >> This may reflect the logic I describe above that these two flags should >> support >> simultaneously, and brings less confusion. >> > >> > BTW: I don't see nested_vmx_setup_ctls_msrs() considers the hardware's >> capability when expose those vmx features(not just preemption timer) to L1. >> The codes just above here, when setting pinbased control for nested vmx, it >> firstly "rdmsr MSR_IA32_VMX_PINBASED_CTLS", then use this to mask the >> features hardware not support. So does other control fields. >> > > Yes, I saw it. 
> >> >> nested_vmx_exit_ctls_high |= >> >> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | >> >> VM_EXIT_LOAD_IA32_EFER); >> >> >> >> @@ -6707,6 +6714,23 @@ static void vmx_get_exit_info(struct kvm_vcpu >> >> *vcpu, u64 *info1, u64 *info2) >> >> *info2 = vmcs_read32(VM_EXIT_INTR_INFO); } >> >> >> >> +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu) { >> >> + u64 delta_tsc_l1; >> >> + u32 preempt_val_l1, preempt_val_l2, preempt_scale; >> >> + >> >> + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) & >> >> + >> MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE; >> >> + preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE); >> >> + delta_tsc_l1 = kvm_x86_ops->read_l1_tsc(vcpu, >> >> + native_read_tsc()) - vcpu->arch.last_guest_tsc; >> >> + preempt_val_l1 = delta_tsc_l1 >> preempt_scale; >> >> + if (preempt_val_l2 - preempt_val_l1 < 0) >> >> + preempt_val_l2 = 0; >>
RE: [PATCH v3] KVM: nVMX: Fully support of nested VMX preemption timer
Arthur Chunqi Li wrote on 2013-09-05: > On Thu, Sep 5, 2013 at 3:45 PM, Zhang, Yang Z > wrote: > > Arthur Chunqi Li wrote on 2013-09-04: > >> This patch contains the following two changes: > >> 1. Fix the bug in nested preemption timer support. If vmexit L2->L0 > >> with some reasons not emulated by L1, preemption timer value should > >> be save in such exits. > >> 2. Add support of "Save VMX-preemption timer value" VM-Exit controls > >> to nVMX. > >> > >> With this patch, nested VMX preemption timer features are fully supported. > >> > >> Signed-off-by: Arthur Chunqi Li > >> --- > >> This series depends on queue. > >> > >> arch/x86/include/uapi/asm/msr-index.h |1 + > >> arch/x86/kvm/vmx.c| 51 > >> ++--- > >> 2 files changed, 48 insertions(+), 4 deletions(-) > >> > >> diff --git a/arch/x86/include/uapi/asm/msr-index.h > >> b/arch/x86/include/uapi/asm/msr-index.h > >> index bb04650..b93e09a 100644 > >> --- a/arch/x86/include/uapi/asm/msr-index.h > >> +++ b/arch/x86/include/uapi/asm/msr-index.h > >> @@ -536,6 +536,7 @@ > >> > >> /* MSR_IA32_VMX_MISC bits */ > >> #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL << > 29) > >> +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE 0x1F > >> /* AMD-V MSRs */ > >> > >> #define MSR_VM_CR 0xc0010114 > >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index > >> 1f1da43..870caa8 > >> 100644 > >> --- a/arch/x86/kvm/vmx.c > >> +++ b/arch/x86/kvm/vmx.c > >> @@ -2204,7 +2204,14 @@ static __init void > >> nested_vmx_setup_ctls_msrs(void) #ifdef CONFIG_X86_64 > >> VM_EXIT_HOST_ADDR_SPACE_SIZE | #endif > >> - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT; > >> + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT | > >> + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER; > >> + if (!(nested_vmx_pinbased_ctls_high & > >> PIN_BASED_VMX_PREEMPTION_TIMER)) > >> + nested_vmx_exit_ctls_high &= > >> + (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER); > >> + if (!(nested_vmx_exit_ctls_high & > >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) > >> + 
nested_vmx_pinbased_ctls_high &= > >> + (~PIN_BASED_VMX_PREEMPTION_TIMER); > > The following logic is more clearly: > > if(nested_vmx_pinbased_ctls_high & > PIN_BASED_VMX_PREEMPTION_TIMER) > > nested_vmx_exit_ctls_high |= > VM_EXIT_SAVE_VMX_PREEMPTION_TIMER > Here I have such consideration: this logic is wrong if CPU support > PIN_BASED_VMX_PREEMPTION_TIMER but doesn't support > VM_EXIT_SAVE_VMX_PREEMPTION_TIMER, though I don't know if this does > occurs. So the codes above reads the MSR and reserves the features it > supports, and here I just check if these two features are supported > simultaneously. > No. Only VM_EXIT_SAVE_VMX_PREEMPTION_TIMER depends on PIN_BASED_VMX_PREEMPTION_TIMER. PIN_BASED_VMX_PREEMPTION_TIMER is an independent feature > You remind that this piece of codes can write like this: > if (!(nested_vmx_pin_based_ctls_high & > PIN_BASED_VMX_PREEMPTION_TIMER) || > !(nested_vmx_exit_ctls_high & > VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) { > nested_vmx_exit_ctls_high > &=(~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER); > nested_vmx_pinbased_ctls_high &= > (~PIN_BASED_VMX_PREEMPTION_TIMER); > } > > This may reflect the logic I describe above that these two flags should > support > simultaneously, and brings less confusion. > > > > BTW: I don't see nested_vmx_setup_ctls_msrs() considers the hardware's > capability when expose those vmx features(not just preemption timer) to L1. > The codes just above here, when setting pinbased control for nested vmx, it > firstly "rdmsr MSR_IA32_VMX_PINBASED_CTLS", then use this to mask the > features hardware not support. So does other control fields. > > Yes, I saw it. 
> >> nested_vmx_exit_ctls_high |= > >> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | > >> VM_EXIT_LOAD_IA32_EFER); > >> > >> @@ -6707,6 +6714,23 @@ static void vmx_get_exit_info(struct kvm_vcpu > >> *vcpu, u64 *info1, u64 *info2) > >> *info2 = vmcs_read32(VM_EXIT_INTR_INFO); } > >> > >> +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu) { > >> + u64 delta_tsc_l1; > >> + u32 preempt_val_l1, preempt_val_l2, preempt_scale; > >> + > >> + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) & > >> + > MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE; > >> + preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE); > >> + delta_tsc_l1 = kvm_x86_ops->read_l1_tsc(vcpu, > >> + native_read_tsc()) - vcpu->arch.last_guest_tsc; > >> + preempt_val_l1 = delta_tsc_l1 >> preempt_scale; > >> + if (preempt_val_l2 - preempt_val_l1 < 0) > >> + preempt_val_l2 = 0; > >> + else > >> + preempt_val_l2 -= preempt_val_l1; > >> + vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, > preempt_val_l2); } > >> /* > >> * The guest has e
Re: [PATCH] kvm-unit-tests: VMX: Test suite for preemption timer
Hi Jan, Gleb and Paolo, It suddenly occurred to me: what happens if the guest's PIN_PREEMPT is disabled while EXI_SAVE_PREEMPT_VALUE is enabled? The preempt value in the VMCS will not be affected, yes? This case is not covered by the tests in this patch. Arthur On Wed, Sep 4, 2013 at 11:26 PM, Arthur Chunqi Li wrote: > Test cases for preemption timer in nested VMX. Two aspects are tested: > 1. Save preemption timer on VMEXIT if relevant bit set in EXIT_CONTROL > 2. Test a relevant bug of KVM. The bug will not save preemption timer > value if exit L2->L0 for some reason and enter L0->L2. Thus preemption > timer will never trigger if the value is large enough. > > Signed-off-by: Arthur Chunqi Li > --- > x86/vmx.h |3 ++ > x86/vmx_tests.c | 117 > +++ > 2 files changed, 120 insertions(+) > > diff --git a/x86/vmx.h b/x86/vmx.h > index 28595d8..ebc8cfd 100644 > --- a/x86/vmx.h > +++ b/x86/vmx.h > @@ -210,6 +210,7 @@ enum Encoding { > GUEST_ACTV_STATE= 0x4826ul, > GUEST_SMBASE= 0x4828ul, > GUEST_SYSENTER_CS = 0x482aul, > + PREEMPT_TIMER_VALUE = 0x482eul, > > /* 32-Bit Host State Fields */ > HOST_SYSENTER_CS= 0x4c00ul, > @@ -331,6 +332,7 @@ enum Ctrl_exi { > EXI_LOAD_PERF = 1UL << 12, > EXI_INTA= 1UL << 15, > EXI_LOAD_EFER = 1UL << 21, > + EXI_SAVE_PREEMPT= 1UL << 22, > }; > > enum Ctrl_ent { > @@ -342,6 +344,7 @@ enum Ctrl_pin { > PIN_EXTINT = 1ul << 0, > PIN_NMI = 1ul << 3, > PIN_VIRT_NMI= 1ul << 5, > + PIN_PREEMPT = 1ul << 6, > }; > > enum Ctrl0 { > diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c > index c1b39f4..d358148 100644 > --- a/x86/vmx_tests.c > +++ b/x86/vmx_tests.c > @@ -1,4 +1,30 @@ > #include "vmx.h" > +#include "msr.h" > +#include "processor.h" > + > +volatile u32 stage; > + > +static inline void vmcall() > +{ > + asm volatile("vmcall"); > +} > + > +static inline void set_stage(u32 s) > +{ > + barrier(); > + stage = s; > + barrier(); > +} > + > +static inline u32 get_stage() > +{ > + u32 s; > + > + barrier(); > + s = stage; > + barrier(); > + return s; > +} > > void
basic_init() > { > @@ -76,6 +102,95 @@ int vmenter_exit_handler() > return VMX_TEST_VMEXIT; > } > > +u32 preempt_scale; > +volatile unsigned long long tsc_val; > +volatile u32 preempt_val; > + > +void preemption_timer_init() > +{ > + u32 ctrl_pin; > + > + ctrl_pin = vmcs_read(PIN_CONTROLS) | PIN_PREEMPT; > + ctrl_pin &= ctrl_pin_rev.clr; > + vmcs_write(PIN_CONTROLS, ctrl_pin); > + preempt_val = 1000; > + vmcs_write(PREEMPT_TIMER_VALUE, preempt_val); > + preempt_scale = rdmsr(MSR_IA32_VMX_MISC) & 0x1F; > +} > + > +void preemption_timer_main() > +{ > + tsc_val = rdtsc(); > + if (!(ctrl_pin_rev.clr & PIN_PREEMPT)) { > + printf("\tPreemption timer is not supported\n"); > + return; > + } > + if (!(ctrl_exit_rev.clr & EXI_SAVE_PREEMPT)) > + printf("\tSave preemption value is not supported\n"); > + else { > + set_stage(0); > + vmcall(); > + if (get_stage() == 1) > + vmcall(); > + } > + while (1) { > + if (((rdtsc() - tsc_val) >> preempt_scale) > + > 10 * preempt_val) { > + report("Preemption timer", 0); > + break; > + } > + } > +} > + > +int preemption_timer_exit_handler() > +{ > + u64 guest_rip; > + ulong reason; > + u32 insn_len; > + u32 ctrl_exit; > + > + guest_rip = vmcs_read(GUEST_RIP); > + reason = vmcs_read(EXI_REASON) & 0xff; > + insn_len = vmcs_read(EXI_INST_LEN); > + switch (reason) { > + case VMX_PREEMPT: > + if (((rdtsc() - tsc_val) >> preempt_scale) < preempt_val) > + report("Preemption timer", 0); > + else > + report("Preemption timer", 1); > + return VMX_TEST_VMEXIT; > + case VMX_VMCALL: > + switch (get_stage()) { > + case 0: > + if (vmcs_read(PREEMPT_TIMER_VALUE) != preempt_val) > + report("Save preemption value", 0); > + else { > + set_stage(get_stage() + 1); > + ctrl_exit = (vmcs_read(EXI_CONTROLS) | > + EXI_SAVE_PREEMPT) & ctrl_exit_rev.clr; > + vmcs_write(EXI_CONTROLS, ctrl_exit); > + } > + break; > + case 1: > + if (vmcs_
Re: [PATCH v3] KVM: nVMX: Fully support of nested VMX preemption timer
On Thu, Sep 5, 2013 at 3:45 PM, Zhang, Yang Z wrote: > Arthur Chunqi Li wrote on 2013-09-04: >> This patch contains the following two changes: >> 1. Fix the bug in nested preemption timer support. If vmexit L2->L0 with some >> reasons not emulated by L1, preemption timer value should be save in such >> exits. >> 2. Add support of "Save VMX-preemption timer value" VM-Exit controls to >> nVMX. >> >> With this patch, nested VMX preemption timer features are fully supported. >> >> Signed-off-by: Arthur Chunqi Li >> --- >> This series depends on queue. >> >> arch/x86/include/uapi/asm/msr-index.h |1 + >> arch/x86/kvm/vmx.c| 51 >> ++--- >> 2 files changed, 48 insertions(+), 4 deletions(-) >> >> diff --git a/arch/x86/include/uapi/asm/msr-index.h >> b/arch/x86/include/uapi/asm/msr-index.h >> index bb04650..b93e09a 100644 >> --- a/arch/x86/include/uapi/asm/msr-index.h >> +++ b/arch/x86/include/uapi/asm/msr-index.h >> @@ -536,6 +536,7 @@ >> >> /* MSR_IA32_VMX_MISC bits */ >> #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL << 29) >> +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE 0x1F >> /* AMD-V MSRs */ >> >> #define MSR_VM_CR 0xc0010114 >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 1f1da43..870caa8 >> 100644 >> --- a/arch/x86/kvm/vmx.c >> +++ b/arch/x86/kvm/vmx.c >> @@ -2204,7 +2204,14 @@ static __init void >> nested_vmx_setup_ctls_msrs(void) #ifdef CONFIG_X86_64 >> VM_EXIT_HOST_ADDR_SPACE_SIZE | >> #endif >> - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT; >> + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT | >> + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER; >> + if (!(nested_vmx_pinbased_ctls_high & >> PIN_BASED_VMX_PREEMPTION_TIMER)) >> + nested_vmx_exit_ctls_high &= >> + (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER); >> + if (!(nested_vmx_exit_ctls_high & >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) >> + nested_vmx_pinbased_ctls_high &= >> + (~PIN_BASED_VMX_PREEMPTION_TIMER); > The following logic is more clearly: > if(nested_vmx_pinbased_ctls_high & 
PIN_BASED_VMX_PREEMPTION_TIMER) > nested_vmx_exit_ctls_high |= VM_EXIT_SAVE_VMX_PREEMPTION_TIMER My consideration here is: this logic is wrong if the CPU supports PIN_BASED_VMX_PREEMPTION_TIMER but doesn't support VM_EXIT_SAVE_VMX_PREEMPTION_TIMER, though I don't know if this ever occurs. So the code above reads the MSR and keeps the features it supports, and here I just check that these two features are supported simultaneously. You remind me that this piece of code can be written like this: if (!(nested_vmx_pin_based_ctls_high & PIN_BASED_VMX_PREEMPTION_TIMER) || !(nested_vmx_exit_ctls_high & VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) { nested_vmx_exit_ctls_high &= (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER); nested_vmx_pinbased_ctls_high &= (~PIN_BASED_VMX_PREEMPTION_TIMER); } This reflects the logic I describe above, that these two flags must be supported simultaneously, and causes less confusion. > > BTW: I don't see nested_vmx_setup_ctls_msrs() considers the hardware's > capability when expose those vmx features(not just preemption timer) to L1. The code just above, when setting the pin-based controls for nested VMX, first does "rdmsr MSR_IA32_VMX_PINBASED_CTLS" and then uses the result to mask off the features the hardware does not support. So do the other control fields.
> >> nested_vmx_exit_ctls_high |= >> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | >> VM_EXIT_LOAD_IA32_EFER); >> >> @@ -6707,6 +6714,23 @@ static void vmx_get_exit_info(struct kvm_vcpu >> *vcpu, u64 *info1, u64 *info2) >> *info2 = vmcs_read32(VM_EXIT_INTR_INFO); } >> >> +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu) { >> + u64 delta_tsc_l1; >> + u32 preempt_val_l1, preempt_val_l2, preempt_scale; >> + >> + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) & >> + MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE; >> + preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE); >> + delta_tsc_l1 = kvm_x86_ops->read_l1_tsc(vcpu, >> + native_read_tsc()) - vcpu->arch.last_guest_tsc; >> + preempt_val_l1 = delta_tsc_l1 >> preempt_scale; >> + if (preempt_val_l2 - preempt_val_l1 < 0) >> + preempt_val_l2 = 0; >> + else >> + preempt_val_l2 -= preempt_val_l1; >> + vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, preempt_val_l2); } >> /* >> * The guest has exited. See if we can fix it or if we need userspace >> * assistance. >> @@ -6716,6 +6740,7 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu) >> struct vcpu_vmx *vmx = to_vmx(vcpu); >> u32 exit_reason = vmx->exit_reason; >> u32 vectoring_info = vmx->idt_vectoring_info; >> + int ret; >> >> /* If guest state is invalid, start emulating */ >> if (vmx->emulation_requir
Re: OpenBSD 5.3 guest on KVM
On 05/09/2013 01:31, Daniel Bareiro wrote: > Has anyone had this problem and managed to solve it somehow? Is there any > debug information I can provide to help solve this? For simple troubleshooting, try "info status" from the QEMU monitor. You can also try this: http://www.linux-kvm.org/page/Tracing You will get a large log; you can send it to me off-list or via something like Google Drive. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Bug 60850] BUG: Bad page state in process libvirtd pfn:76000
https://bugzilla.kernel.org/show_bug.cgi?id=60850 Gleb changed: What|Removed |Added CC||g...@redhat.com --- Comment #1 from Gleb --- This is not related to KVM, so I suggest changing the assignee so that the right people see it. -- You are receiving this mail because: You are watching the assignee of the bug.
RE: [PATCH v3] KVM: nVMX: Fully support of nested VMX preemption timer
Arthur Chunqi Li wrote on 2013-09-04: > This patch contains the following two changes: > 1. Fix the bug in nested preemption timer support. If vmexit L2->L0 with some > reasons not emulated by L1, preemption timer value should be save in such > exits. > 2. Add support of "Save VMX-preemption timer value" VM-Exit controls to > nVMX. > > With this patch, nested VMX preemption timer features are fully supported. > > Signed-off-by: Arthur Chunqi Li > --- > This series depends on queue. > > arch/x86/include/uapi/asm/msr-index.h |1 + > arch/x86/kvm/vmx.c| 51 > ++--- > 2 files changed, 48 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/include/uapi/asm/msr-index.h > b/arch/x86/include/uapi/asm/msr-index.h > index bb04650..b93e09a 100644 > --- a/arch/x86/include/uapi/asm/msr-index.h > +++ b/arch/x86/include/uapi/asm/msr-index.h > @@ -536,6 +536,7 @@ > > /* MSR_IA32_VMX_MISC bits */ > #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL << 29) > +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE 0x1F > /* AMD-V MSRs */ > > #define MSR_VM_CR 0xc0010114 > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 1f1da43..870caa8 > 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -2204,7 +2204,14 @@ static __init void > nested_vmx_setup_ctls_msrs(void) #ifdef CONFIG_X86_64 > VM_EXIT_HOST_ADDR_SPACE_SIZE | > #endif > - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT; > + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT | > + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER; > + if (!(nested_vmx_pinbased_ctls_high & > PIN_BASED_VMX_PREEMPTION_TIMER)) > + nested_vmx_exit_ctls_high &= > + (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER); > + if (!(nested_vmx_exit_ctls_high & > VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) > + nested_vmx_pinbased_ctls_high &= > + (~PIN_BASED_VMX_PREEMPTION_TIMER); The following logic is clearer: if(nested_vmx_pinbased_ctls_high & PIN_BASED_VMX_PREEMPTION_TIMER) nested_vmx_exit_ctls_high |= VM_EXIT_SAVE_VMX_PREEMPTION_TIMER BTW: I don't see that
nested_vmx_setup_ctls_msrs() considers the hardware's capability when exposing those VMX features (not just the preemption timer) to L1. > nested_vmx_exit_ctls_high |= > (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | > VM_EXIT_LOAD_IA32_EFER); > > @@ -6707,6 +6714,23 @@ static void vmx_get_exit_info(struct kvm_vcpu > *vcpu, u64 *info1, u64 *info2) > *info2 = vmcs_read32(VM_EXIT_INTR_INFO); } > > +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu) { > + u64 delta_tsc_l1; > + u32 preempt_val_l1, preempt_val_l2, preempt_scale; > + > + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) & > + MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE; > + preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE); > + delta_tsc_l1 = kvm_x86_ops->read_l1_tsc(vcpu, > + native_read_tsc()) - vcpu->arch.last_guest_tsc; > + preempt_val_l1 = delta_tsc_l1 >> preempt_scale; > + if (preempt_val_l2 - preempt_val_l1 < 0) > + preempt_val_l2 = 0; > + else > + preempt_val_l2 -= preempt_val_l1; > + vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, preempt_val_l2); } > /* > * The guest has exited. See if we can fix it or if we need userspace > * assistance. > @@ -6716,6 +6740,7 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu) > struct vcpu_vmx *vmx = to_vmx(vcpu); > u32 exit_reason = vmx->exit_reason; > u32 vectoring_info = vmx->idt_vectoring_info; > + int ret; > > /* If guest state is invalid, start emulating */ > if (vmx->emulation_required) > @@ -6795,12 +6820,15 @@ static int vmx_handle_exit(struct kvm_vcpu > *vcpu) > > if (exit_reason < kvm_vmx_max_exit_handlers > && kvm_vmx_exit_handlers[exit_reason]) > - return kvm_vmx_exit_handlers[exit_reason](vcpu); > + ret = kvm_vmx_exit_handlers[exit_reason](vcpu); > else { > vcpu->run->exit_reason = KVM_EXIT_UNKNOWN; > vcpu->run->hw.hardware_exit_reason = exit_reason; > + ret = 0; > } > - return 0; > + if (is_guest_mode(vcpu)) > + nested_adjust_preemption_timer(vcpu); Move this earlier to avoid the changes to ret.
> + return ret; > } > > static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr) @@ > -7518,6 +7546,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, > struct vmcs12 *vmcs12) { > struct vcpu_vmx *vmx = to_vmx(vcpu); > u32 exec_control; > + u32 exit_control; > > vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector); > vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector); @@ > -7691,7 +7720,10 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, > struct vmcs12 *v