Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

2013-09-05 Thread Gleb Natapov
On Thu, Sep 05, 2013 at 02:05:09PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2013-09-03 at 13:53 +0300, Gleb Natapov wrote:
> > > Or supporting all IOMMU links (and leaving emulated stuff as is) in one
> > > "device" is the last thing I have to do and then you'll ack the patch?
> > > 
> > I am concerned more about API here. Internal implementation details I
> > leave to powerpc experts :)
> 
> So Gleb, I want to step in for a bit here.
> 
> While I understand that the new KVM device API is all nice and shiny, and
> that this whole thing should probably have been KVM devices in the first
> place (had they existed, or had we been told back then), the point is that
> the API for handling HW IOMMUs that Alexey is trying to add is an extension
> of an existing mechanism used for emulated IOMMUs.
> 
> The internal data structure is shared, and fundamentally, by forcing him to
> use that new KVM device for the "new stuff", we create an oddball API with
> an ioctl for one type of IOMMU and a KVM device for the other, which makes
> the implementation a complete mess in the kernel (and you should care :-)
> 
Is it an unfixable mess? Even if Alexey does what you suggested earlier?

  - Convert *both* existing TCE objects to the new
  KVM_CREATE_DEVICE, and have some backward compat code for the old one.

The point is that an implementation can usually be changed, but an API is
much harder to change later.

> So for something completely new, I would tend to agree with you. However, I
> still think that for this specific case, we should just plonk-in the original
> ioctl proposed by Alexey and be done with it.
> 
Do you think this is the last extension to the IOMMU code, or will we see
more, with the same justification used to keep adding ioctls?
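
To make the comparison concrete, the KVM_CREATE_DEVICE route suggested
above would look roughly like this from user space.  This is only a
sketch: KVM_CREATE_DEVICE and KVM_SET_DEVICE_ATTR are generic KVM ioctls,
but the device type constant, the attribute group/number and the
LIOBN-plus-group-fd layout below are assumptions based on the patch
description, not a settled ABI.

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Assumed attribute payload: the LIOBN plus the VFIO IOMMU group fd
 * (per the patch description; the real layout lives in the patch). */
struct spapr_tce_iommu_link {
	uint64_t liobn;
	uint32_t iommu_group_fd;
	uint32_t pad;
};

static int link_liobn_to_group(int vm_fd, uint64_t liobn, int group_fd)
{
	struct kvm_create_device cd;
	struct spapr_tce_iommu_link link;
	struct kvm_device_attr attr;

	memset(&cd, 0, sizeof(cd));
	cd.type = 0;	/* hypothetical KVM_DEV_TYPE_SPAPR_TCE_IOMMU */

	/* Generic ioctl: creates the in-kernel device, returns its fd in cd.fd */
	if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) < 0)
		return -1;

	memset(&link, 0, sizeof(link));
	link.liobn = liobn;
	link.iommu_group_fd = group_fd;

	memset(&attr, 0, sizeof(attr));
	attr.group = 0;				/* assumed attribute group */
	attr.attr = 0;				/* assumed attribute id */
	attr.addr = (uint64_t)(unsigned long)&link;

	/* One generic attribute ioctl instead of a dedicated VM ioctl. */
	return ioctl(cd.fd, KVM_SET_DEVICE_ATTR, &attr);
}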

--
Gleb.


Re: [PATCH 06/11] KVM: PPC: Book3S HV: Support POWER6 compatibility mode on POWER7

2013-09-05 Thread Paul Mackerras
On Fri, Sep 06, 2013 at 10:58:16AM +0530, Aneesh Kumar K.V wrote:
> Paul Mackerras  writes:
> 
> > This enables us to use the Processor Compatibility Register (PCR) on
> > POWER7 to put the processor into architecture 2.05 compatibility mode
> > when running a guest.  In this mode the new instructions and registers
> > that were introduced on POWER7 are disabled in user mode.  This
> > includes all the VSX facilities plus several other instructions such
> > as ldbrx, stdbrx, popcntw, popcntd, etc.
> >
> > To select this mode, we have a new register accessible through the
> > set/get_one_reg interface, called KVM_REG_PPC_ARCH_COMPAT.  Setting
> > this to zero gives the full set of capabilities of the processor.
> > Setting it to one of the "logical" PVR values defined in PAPR puts
> > the vcpu into the compatibility mode for the corresponding
> > architecture level.  The supported values are:
> >
> > 0x0f000002  Architecture 2.05 (POWER6)
> > 0x0f000003  Architecture 2.06 (POWER7)
> > 0x0f100003  Architecture 2.06+ (POWER7+)
> >
> > Since the PCR is per-core, the architecture compatibility level and
> > the corresponding PCR value are stored in the struct kvmppc_vcore, and
> > are therefore shared between all vcpus in a virtual core.
> 
> We already have KVM_SET_SREGS taking the PVR as an argument. Can't we do
> this via kvmppc_set_pvr()? Can you also share the QEMU changes? There I
> guess we need to update the "cpu-version" property in the device tree so
> that /proc/cpuinfo shows the right information in the guest.

The discussion on the qemu mailing list pointed out that we aren't
really changing the PVR; the guest still sees the real PVR, and what
we're doing is setting a mode of the CPU rather than changing it into
an older CPU.  So, it seemed better to use something separate from the
PVR.  Also, if we used the PVR setting to convey this information,
then apparently under TCG the guest would see the logical PVR value if
it read the supposedly real PVR.

Alexey is working on the relevant QEMU patches.
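
For illustration, selecting a compatibility mode from user space is just
a one_reg write on the vcpu fd.  A minimal sketch (not Alexey's QEMU
code): KVM_REG_PPC_ARCH_COMPAT is the register id added by this series,
and 0x0f000002 is the PAPR logical PVR for architecture 2.05 from the
table above.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <asm/kvm.h>	/* KVM_REG_PPC_ARCH_COMPAT, once this series is applied */

/* Put the vcpu into architecture 2.05 (POWER6) compatibility mode.
 * Writing 0 instead restores the processor's full native capabilities. */
static int set_arch_205_compat(int vcpu_fd)
{
	uint32_t compat_pvr = 0x0f000002;	/* PAPR logical PVR for arch 2.05 */
	struct kvm_one_reg reg = {
		.id = KVM_REG_PPC_ARCH_COMPAT,
		.addr = (uint64_t)(unsigned long)&compat_pvr,
	};

	return ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg);
}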

Paul.


Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

2013-09-05 Thread Alexey Kardashevskiy
On 09/06/2013 04:01 PM, Gleb Natapov wrote:
> On Fri, Sep 06, 2013 at 09:38:21AM +1000, Alexey Kardashevskiy wrote:
>> On 09/06/2013 04:10 AM, Gleb Natapov wrote:
>>> On Wed, Sep 04, 2013 at 02:01:28AM +1000, Alexey Kardashevskiy wrote:
 On 09/03/2013 08:53 PM, Gleb Natapov wrote:
> On Mon, Sep 02, 2013 at 01:14:29PM +1000, Alexey Kardashevskiy wrote:
>> On 09/01/2013 10:06 PM, Gleb Natapov wrote:
>>> On Wed, Aug 28, 2013 at 06:50:41PM +1000, Alexey Kardashevskiy wrote:
 This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
 and H_STUFF_TCE requests targeted at an IOMMU TCE table without passing
 them to user space, which saves the time spent switching to user space
 and back.

 Both real and virtual modes are supported. The kernel tries to handle
 a TCE request in real mode; if that fails, it passes the request to
 the virtual mode handler to complete the operation. If the virtual
 mode handler fails as well, the request is passed to user space.

 The first user of this is VFIO on POWER. Trampolines to the VFIO
 external user API functions are required for this patch.

 This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
 number (LIOBN) with a VFIO IOMMU group fd and enable in-kernel handling
 of map/unmap requests. The device supports a single attribute, which is
 a struct with the LIOBN and the IOMMU fd. When the attribute is set, the
 device establishes the connection between KVM and VFIO.

 Tests show that this patch increases transmission speed from 220MB/s
 to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).

 Signed-off-by: Paul Mackerras 
 Signed-off-by: Alexey Kardashevskiy 

 ---

 Changes:
 v9:
 * KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU"
 KVM device
 * release_spapr_tce_table() is not shared between different TCE types
 * reduced the patch size by moving VFIO external API
 trampolines to a separate patch
 * moved documentation from Documentation/virtual/kvm/api.txt to
 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt

 v8:
 * fixed warnings from check_patch.pl

 2013/07/11:
 * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
 for KVM_BOOK3S_64
 * kvmppc_gpa_to_hva_and_get also returns host phys address. Not much 
 sense
 for this here but the next patch for hugepages support will use it 
 more.

 2013/07/06:
 * added realmode arch_spin_lock to protect TCE table from races
 in real and virtual modes
 * POWERPC IOMMU API is changed to support real mode
 * iommu_take_ownership and iommu_release_ownership are protected by
 iommu_table's locks
 * VFIO external user API use rewritten
 * multiple small fixes

 2013/06/27:
 * tce_list page is referenced now in order to protect it from accidental
 invalidation during H_PUT_TCE_INDIRECT execution
 * added use of the external user VFIO API

 2013/06/05:
 * changed capability number
 * changed ioctl number
 * update the doc article number

 2013/05/20:
 * removed get_user() from real mode handlers
 * kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts 
 there
 translated TCEs, tries realmode_get_page() on those and if it fails, it
 passes control over the virtual mode handler which tries to finish
 the request handling
 * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY 
 bit
 on a page
 * The only reason to pass the request to user mode now is when the 
 user mode
 did not register TCE table in the kernel, in all other cases the 
 virtual mode
 handler is expected to do the job
 ---
  .../virtual/kvm/devices/spapr_tce_iommu.txt|  37 +++
  arch/powerpc/include/asm/kvm_host.h|   4 +
  arch/powerpc/kvm/book3s_64_vio.c   | 310 
 -
  arch/powerpc/kvm/book3s_64_vio_hv.c| 122 
  arch/powerpc/kvm/powerpc.c |   1 +
  include/linux/kvm_host.h   |   1 +
  virt/kvm/kvm_main.c|   5 +
  7 files changed, 477 insertions(+), 3 deletions(-)
  create mode 100644 
 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt

 diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt 
 b/Documentation/virtual/kvm/devices/spapr_

Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

2013-09-05 Thread Gleb Natapov
On Fri, Sep 06, 2013 at 09:38:21AM +1000, Alexey Kardashevskiy wrote:
> On 09/06/2013 04:10 AM, Gleb Natapov wrote:
> > On Wed, Sep 04, 2013 at 02:01:28AM +1000, Alexey Kardashevskiy wrote:
> >> On 09/03/2013 08:53 PM, Gleb Natapov wrote:
> >>> On Mon, Sep 02, 2013 at 01:14:29PM +1000, Alexey Kardashevskiy wrote:
>  On 09/01/2013 10:06 PM, Gleb Natapov wrote:
> > On Wed, Aug 28, 2013 at 06:50:41PM +1000, Alexey Kardashevskiy wrote:
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >> and H_STUFF_TCE requests targeted at an IOMMU TCE table without passing
> >> them to user space, which saves the time spent switching to user space
> >> and back.
> >>
> >> Both real and virtual modes are supported. The kernel tries to handle
> >> a TCE request in real mode; if that fails, it passes the request to
> >> the virtual mode handler to complete the operation. If the virtual
> >> mode handler fails as well, the request is passed to user space.
> >>
> >> The first user of this is VFIO on POWER. Trampolines to the VFIO
> >> external user API functions are required for this patch.
> >>
> >> This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
> >> number (LIOBN) with a VFIO IOMMU group fd and enable in-kernel handling
> >> of map/unmap requests. The device supports a single attribute, which is
> >> a struct with the LIOBN and the IOMMU fd. When the attribute is set, the
> >> device establishes the connection between KVM and VFIO.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> >>
> >> Signed-off-by: Paul Mackerras 
> >> Signed-off-by: Alexey Kardashevskiy 
> >>
> >> ---
> >>
> >> Changes:
> >> v9:
> >> * KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU"
> >> KVM device
> >> * release_spapr_tce_table() is not shared between different TCE types
> >> * reduced the patch size by moving VFIO external API
> >> trampolines to a separate patch
> >> * moved documentation from Documentation/virtual/kvm/api.txt to
> >> Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> >>
> >> v8:
> >> * fixed warnings from check_patch.pl
> >>
> >> 2013/07/11:
> >> * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
> >> for KVM_BOOK3S_64
> >> * kvmppc_gpa_to_hva_and_get also returns host phys address. Not much 
> >> sense
> >> for this here but the next patch for hugepages support will use it 
> >> more.
> >>
> >> 2013/07/06:
> >> * added realmode arch_spin_lock to protect TCE table from races
> >> in real and virtual modes
> >> * POWERPC IOMMU API is changed to support real mode
> >> * iommu_take_ownership and iommu_release_ownership are protected by
> >> iommu_table's locks
> >> * VFIO external user API use rewritten
> >> * multiple small fixes
> >>
> >> 2013/06/27:
> >> * tce_list page is referenced now in order to protect it from accidental
> >> invalidation during H_PUT_TCE_INDIRECT execution
> >> * added use of the external user VFIO API
> >>
> >> 2013/06/05:
> >> * changed capability number
> >> * changed ioctl number
> >> * update the doc article number
> >>
> >> 2013/05/20:
> >> * removed get_user() from real mode handlers
> >> * kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts 
> >> there
> >> translated TCEs, tries realmode_get_page() on those and if it fails, it
> >> passes control over the virtual mode handler which tries to finish
> >> the request handling
> >> * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY 
> >> bit
> >> on a page
> >> * The only reason to pass the request to user mode now is when the 
> >> user mode
> >> did not register TCE table in the kernel, in all other cases the 
> >> virtual mode
> >> handler is expected to do the job
> >> ---
> >>  .../virtual/kvm/devices/spapr_tce_iommu.txt|  37 +++
> >>  arch/powerpc/include/asm/kvm_host.h|   4 +
> >>  arch/powerpc/kvm/book3s_64_vio.c   | 310 
> >> -
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c| 122 
> >>  arch/powerpc/kvm/powerpc.c |   1 +
> >>  include/linux/kvm_host.h   |   1 +
> >>  virt/kvm/kvm_main.c|   5 +
> >>  7 files changed, 477 insertions(+), 3 deletions(-)
> >>  create mode 100644 
> >> Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> >>
> >> diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt 
> >> b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> >> new file mode 100644
> >

Re: [PATCH 06/11] KVM: PPC: Book3S HV: Support POWER6 compatibility mode on POWER7

2013-09-05 Thread Aneesh Kumar K.V
Paul Mackerras  writes:

> This enables us to use the Processor Compatibility Register (PCR) on
> POWER7 to put the processor into architecture 2.05 compatibility mode
> when running a guest.  In this mode the new instructions and registers
> that were introduced on POWER7 are disabled in user mode.  This
> includes all the VSX facilities plus several other instructions such
> as ldbrx, stdbrx, popcntw, popcntd, etc.
>
> To select this mode, we have a new register accessible through the
> set/get_one_reg interface, called KVM_REG_PPC_ARCH_COMPAT.  Setting
> this to zero gives the full set of capabilities of the processor.
> Setting it to one of the "logical" PVR values defined in PAPR puts
> the vcpu into the compatibility mode for the corresponding
> architecture level.  The supported values are:
>
> 0x0f000002  Architecture 2.05 (POWER6)
> 0x0f000003  Architecture 2.06 (POWER7)
> 0x0f100003  Architecture 2.06+ (POWER7+)
>
> Since the PCR is per-core, the architecture compatibility level and
> the corresponding PCR value are stored in the struct kvmppc_vcore, and
> are therefore shared between all vcpus in a virtual core.

We already have KVM_SET_SREGS taking the PVR as an argument. Can't we do
this via kvmppc_set_pvr()? Can you also share the QEMU changes? There I
guess we need to update the "cpu-version" property in the device tree so
that /proc/cpuinfo shows the right information in the guest.


-aneesh



[RFC PATCH 04/10] KVM: PPC: Book3S HV: Flush the correct number of TLB sets on POWER8

2013-09-05 Thread Paul Mackerras
POWER8 has 512 sets in the TLB, compared to 128 for POWER7, so we need
to do more tlbiel instructions when flushing the TLB on POWER8.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 1de0b65..612c9c8 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -667,7 +667,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
andcr7,r7,r0
stdcx.  r7,0,r6
bne 23b
-   li  r6,128  /* and flush the TLB */
+   /* Flush the TLB of any entries for this LPID */
+   /* use arch 2.07S as a proxy for POWER8 */
+BEGIN_FTR_SECTION
+   li  r6,512  /* POWER8 has 512 sets */
+FTR_SECTION_ELSE
+   li  r6,128  /* POWER7 has 128 sets */
+ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_207S)
mtctr   r6
li  r7,0x800/* IS field = 0b10 */
ptesync
-- 
1.8.4.rc3



[RFC PATCH 05/10] KVM: PPC: Book3S HV: Add handler for HV facility unavailable

2013-09-05 Thread Paul Mackerras
From: Michael Ellerman 

At present this should never happen, since the host kernel sets
HFSCR to allow access to all facilities.  It's better to be prepared
to handle it cleanly if it does ever happen, though.

Signed-off-by: Michael Ellerman 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_asm.h | 1 +
 arch/powerpc/kvm/book3s_hv.c   | 9 +
 2 files changed, 10 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_asm.h 
b/arch/powerpc/include/asm/kvm_asm.h
index 851bac7..f6401eb 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -99,6 +99,7 @@
 #define BOOK3S_INTERRUPT_PERFMON   0xf00
 #define BOOK3S_INTERRUPT_ALTIVEC   0xf20
 #define BOOK3S_INTERRUPT_VSX   0xf40
+#define BOOK3S_INTERRUPT_H_FAC_UNAVAIL 0xf80
 
 #define BOOK3S_IRQPRIO_SYSTEM_RESET0
 #define BOOK3S_IRQPRIO_DATA_SEGMENT1
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 724daa5..da8619e 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -704,6 +704,15 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
kvmppc_core_queue_program(vcpu, 0x8);
r = RESUME_GUEST;
break;
+   /*
+* This occurs if the guest (kernel or userspace), does something that
+* is prohibited by HFSCR.  We just generate a program interrupt to
+* the guest.
+*/
+   case BOOK3S_INTERRUPT_H_FAC_UNAVAIL:
+   kvmppc_core_queue_program(vcpu, 0x8);
+   r = RESUME_GUEST;
+   break;
default:
kvmppc_dump_regs(vcpu);
printk(KERN_EMERG "trap=0x%x | pc=0x%lx | msr=0x%llx\n",
-- 
1.8.4.rc3



[RFC PATCH 0/10] Support POWER8 in HV KVM

2013-09-05 Thread Paul Mackerras
This series adds support for the POWER8 CPU in HV KVM.  POWER8 adds
several new guest-accessible instructions, special-purpose registers,
and other features such as doorbell interrupts and hardware
transactional memory.  It also adds new hypervisor-controlled features
such as relocation-on interrupts, and replaces the DABR/DABRX
registers with the new DAWR/DAWRX registers with expanded
capabilities.  The POWER8 CPU has compatibility modes for architecture
2.06 (POWER7/POWER7+) and 2.05 (POWER6).

This series does not yet handle the checkpointed register state of
the guest.

 arch/powerpc/include/asm/kvm_asm.h|   2 +
 arch/powerpc/include/asm/kvm_book3s_asm.h |   1 +
 arch/powerpc/include/asm/kvm_host.h   |  29 +-
 arch/powerpc/include/asm/reg.h|  25 +-
 arch/powerpc/include/uapi/asm/kvm.h   |   1 +
 arch/powerpc/kernel/asm-offsets.c |  28 +-
 arch/powerpc/kvm/book3s_hv.c  | 234 +++--
 arch/powerpc/kvm/book3s_hv_interrupts.S   |   8 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 538 ++
 9 files changed, 694 insertions(+), 172 deletions(-)

Paul.


[RFC PATCH 02/10] KVM: PPC: Book3S HV: Don't set DABR on POWER8

2013-09-05 Thread Paul Mackerras
From: Michael Neuling 

POWER8 doesn't have the DABR and DABRX registers; instead it has
new DAWR/DAWRX registers, which will be handled in a later patch.

Signed-off-by: Michael Neuling 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv_interrupts.S |  2 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 13 ++---
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_interrupts.S 
b/arch/powerpc/kvm/book3s_hv_interrupts.S
index 2a85bf6..0be3404 100644
--- a/arch/powerpc/kvm/book3s_hv_interrupts.S
+++ b/arch/powerpc/kvm/book3s_hv_interrupts.S
@@ -57,9 +57,11 @@ BEGIN_FTR_SECTION
std r3, HSTATE_DSCR(r13)
 END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
 
+BEGIN_FTR_SECTION
/* Save host DABR */
mfspr   r3, SPRN_DABR
std r3, HSTATE_DABR(r13)
+END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
 
/* Hard-disable interrupts */
mfmsr   r10
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 0721d2a..d4f6f5f 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -95,11 +95,13 @@ kvmppc_got_guest:
 
/* Back from guest - restore host state and return to caller */
 
+BEGIN_FTR_SECTION
/* Restore host DABR and DABRX */
ld  r5,HSTATE_DABR(r13)
li  r6,7
mtspr   SPRN_DABR,r5
mtspr   SPRN_DABRX,r6
+END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
 
/* Restore SPRG3 */
ld  r3,PACA_SPRG3(r13)
@@ -423,15 +425,17 @@ kvmppc_hv_entry:
std r0, PPC_LR_STKOFF(r1)
stdur1, -112(r1)
 
+BEGIN_FTR_SECTION
/* Set partition DABR */
/* Do this before re-enabling PMU to avoid P7 DABR corruption bug */
li  r5,3
ld  r6,VCPU_DABR(r4)
mtspr   SPRN_DABRX,r5
mtspr   SPRN_DABR,r6
-BEGIN_FTR_SECTION
+ BEGIN_FTR_SECTION_NESTED(89)
isync
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
+ END_FTR_SECTION_NESTED(CPU_FTR_ARCH_206, CPU_FTR_ARCH_206, 89)
+END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
 
/* Load guest PMU registers */
/* R4 is live here (vcpu pointer) */
@@ -1716,6 +1720,9 @@ ignore_hdec:
b   fast_guest_return
 
 _GLOBAL(kvmppc_h_set_dabr)
+BEGIN_FTR_SECTION
+   b   2f
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
std r4,VCPU_DABR(r3)
/* Work around P7 bug where DABR can get corrupted on mtspr */
 1: mtspr   SPRN_DABR,r4
@@ -1723,7 +1730,7 @@ _GLOBAL(kvmppc_h_set_dabr)
cmpdr4, r5
bne 1b
isync
-   li  r3,0
+2: li  r3,0
blr
 
 _GLOBAL(kvmppc_h_cede)
-- 
1.8.4.rc3



[RFC PATCH 10/10] KVM: PPC: Book3S HV: Prepare for host using hypervisor doorbells

2013-09-05 Thread Paul Mackerras
POWER8 has support for hypervisor doorbell interrupts.  Though the
kernel doesn't use them for IPIs on the powernv platform yet, it
probably will in future, so this makes KVM cope gracefully if a
hypervisor doorbell interrupt arrives while in a guest.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_asm.h  | 1 +
 arch/powerpc/kvm/book3s_hv.c| 1 +
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 7 +++
 3 files changed, 9 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_asm.h 
b/arch/powerpc/include/asm/kvm_asm.h
index f6401eb..4c2040a 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -96,6 +96,7 @@
 #define BOOK3S_INTERRUPT_H_DATA_STORAGE0xe00
 #define BOOK3S_INTERRUPT_H_INST_STORAGE0xe20
 #define BOOK3S_INTERRUPT_H_EMUL_ASSIST 0xe40
+#define BOOK3S_INTERRUPT_H_DOORBELL0xe80
 #define BOOK3S_INTERRUPT_PERFMON   0xf00
 #define BOOK3S_INTERRUPT_ALTIVEC   0xf20
 #define BOOK3S_INTERRUPT_VSX   0xf40
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 95a635d..dc5ce77 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -644,6 +644,7 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
r = RESUME_GUEST;
break;
case BOOK3S_INTERRUPT_EXTERNAL:
+   case BOOK3S_INTERRUPT_H_DOORBELL:
vcpu->stat.ext_intr_exits++;
r = RESUME_GUEST;
break;
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 4b3a10e..018513a 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -2034,10 +2034,17 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_207S)
 BEGIN_FTR_SECTION
cmpwi   r6, 5   /* privileged doorbell? */
beq 0f
+   cmpwi   r6, 3   /* hypervisor doorbell? */
+   beq 3f
 END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
li  r3, 1   /* anything else, return 1 */
 0: blr
 
+   /* hypervisor doorbell */
+3: li  r12, BOOK3S_INTERRUPT_H_DOORBELL
+   li  r3, 1
+   blr
+
 /*
  * Determine what sort of external interrupt is pending (if any).
  * Returns:
-- 
1.8.4.rc3



[RFC PATCH 03/10] KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs

2013-09-05 Thread Paul Mackerras
From: Michael Neuling 

This adds fields to the struct kvm_vcpu_arch to store the new
guest-accessible SPRs on POWER8, adds code to the get/set_one_reg
functions to allow userspace to access this state, and adds code to
the guest entry and exit to context-switch these SPRs between host
and guest.

Note that DPDES (Directed Privileged Doorbell Exception State) is
shared between threads on a core; hence we store it in struct
kvmppc_vcore and have the master thread save and restore it.

Signed-off-by: Michael Neuling 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_host.h |  25 +-
 arch/powerpc/include/asm/reg.h  |  17 
 arch/powerpc/include/uapi/asm/kvm.h |   1 +
 arch/powerpc/kernel/asm-offsets.c   |  23 +
 arch/powerpc/kvm/book3s_hv.c| 153 +++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 144 ++
 6 files changed, 360 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 6a43455..4786bb0 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -300,6 +300,7 @@ struct kvmppc_vcore {
u64 tb_offset;  /* guest timebase - host timebase */
u32 arch_compat;
ulong pcr;
+   ulong dpdes;/* doorbell state (POWER8) */
 };
 
 #define VCORE_ENTRY_COUNT(vc)  ((vc)->entry_exit_count & 0xff)
@@ -454,6 +455,7 @@ struct kvm_vcpu_arch {
ulong pc;
ulong ctr;
ulong lr;
+   ulong tar;
 
ulong xer;
u32 cr;
@@ -463,13 +465,32 @@ struct kvm_vcpu_arch {
ulong guest_owned_ext;
ulong purr;
ulong spurr;
+   ulong ic;
+   ulong vtb;
ulong dscr;
ulong amr;
ulong uamor;
+   ulong iamr;
u32 ctrl;
ulong dabr;
+   ulong dawr;
+   ulong dawrx;
+   ulong ciabr;
ulong cfar;
ulong ppr;
+   ulong pspb;
+   ulong fscr;
+   ulong tfhar;
+   ulong tfiar;
+   ulong texasr;
+   ulong ebbhr;
+   ulong ebbrr;
+   ulong bescr;
+   ulong csigr;
+   ulong tacr;
+   ulong tcscr;
+   ulong acop;
+   ulong wort;
 #endif
u32 vrsave; /* also USPRG0 */
u32 mmucr;
@@ -503,10 +524,12 @@ struct kvm_vcpu_arch {
u32 ccr1;
u32 dbsr;
 
-   u64 mmcr[3];
+   u64 mmcr[5];
u32 pmc[8];
+   u32 spmc[2];
u64 siar;
u64 sdar;
+   u64 sier;
 
 #ifdef CONFIG_KVM_EXIT_TIMING
struct mutex exit_timing_lock;
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 52ff962..4ca8b85 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -218,6 +218,11 @@
 #define   CTRL_TE  0x00c0  /* thread enable */
 #define   CTRL_RUNLATCH0x1
 #define SPRN_DAWR  0xB4
+#define SPRN_CIABR 0xBB
+#define   CIABR_PRIV   0x3
+#define   CIABR_PRIV_USER  1
+#define   CIABR_PRIV_SUPER 2
+#define   CIABR_PRIV_HYPER 3
 #define SPRN_DAWRX 0xBC
 #define   DAWRX_USER   (1UL << 0)
 #define   DAWRX_KERNEL (1UL << 1)
@@ -255,6 +260,8 @@
 #define SPRN_HRMOR 0x139   /* Real mode offset register */
 #define SPRN_HSRR0 0x13A   /* Hypervisor Save/Restore 0 */
 #define SPRN_HSRR1 0x13B   /* Hypervisor Save/Restore 1 */
+#define SPRN_IC0x350   /* Virtual Instruction Count */
+#define SPRN_VTB   0x351   /* Virtual Time Base */
 #define SPRN_FSCR  0x099   /* Facility Status & Control Register */
 #define   FSCR_TAR (1 << (63-55)) /* Enable Target Address Register */
 #define   FSCR_EBB (1 << (63-56)) /* Enable Event Based Branching */
@@ -354,6 +361,8 @@
 #define DER_EBRKE  0x0002  /* External Breakpoint Interrupt */
 #define DER_DPIE   0x0001  /* Dev. Port Nonmaskable Request */
 #define SPRN_DMISS 0x3D0   /* Data TLB Miss Register */
+#define SPRN_DHDES 0x0B1   /* Directed Hyp. Doorbell Exc. State */
+#define SPRN_DPDES 0x0B0   /* Directed Priv. Doorbell Exc. State */
 #define SPRN_EAR   0x11A   /* External Address Register */
 #define SPRN_HASH1 0x3D2   /* Primary Hash Address Register */
 #define SPRN_HASH2 0x3D3   /* Secondary Hash Address Resgister */
@@ -413,6 +422,7 @@
 #define SPRN_IABR  0x3F2   /* Instruction Address Breakpoint Register */
 #define SPRN_IABR2 0x3FA   /* 83xx */
 #define SPRN_IBCR  0x135   /* 83xx Insn Breakpoint Control Reg */
+#define SPRN_IAMR  0x03D   /* Instr. Authority Mask Reg */
 #define SPRN_HID4  0x3F4   /* 970 HID4 */
 #define  HID4_LPES0 (1ul << (63-0)) /* LPAR env. sel. bit 0 */
 #define HID4_RMLS2_SH   (63 - 2)   /* Real mode limit bottom 2 
bits */
@@ -526,6 +536,7 @@
 #define SPRN_PIR   0x3FF   /* Processor Identification Register 

[RFC PATCH 01/10] KVM: PPC: Book3S HV: Align physical CPU thread numbers with virtual

2013-09-05 Thread Paul Mackerras
On a threaded processor such as POWER7, we group VCPUs into virtual
cores and arrange that the VCPUs in a virtual core run on the same
physical core.  Currently we don't enforce any correspondence between
virtual thread numbers within a virtual core and physical thread
numbers.  Physical threads are allocated starting at 0 on a first-come
first-served basis to runnable virtual threads (VCPUs).

POWER8 implements a new "msgsndp" instruction which guest kernels can
use to interrupt other threads in the same core or sub-core.  Since
the instruction takes the destination physical thread ID as a parameter,
it becomes necessary to align the physical thread IDs with the virtual
thread IDs, that is, to make sure virtual thread N within a virtual
core always runs on physical thread N.

This means that it's possible that thread 0, which is where we call
__kvmppc_vcore_entry, may end up running some other vcpu than the
one whose task called kvmppc_run_core(), or it may end up running
no vcpu at all, if for example thread 0 of the virtual core is
currently executing in userspace.  Thus we can't rely on thread 0
to be the master responsible for switching the MMU.  Instead we now
have an explicit 'is_master' flag which is set for the vcpu whose
task called kvmppc_run_core().  The master then has to wait for
thread 0 to enter real mode before switching the MMU.  Also, we
no longer pass the vcpu pointer to __kvmppc_vcore_entry, but
instead let the assembly code load it from the PACA.

Since the assembly code will need to know the kvm pointer and the
thread ID for threads which don't have a vcpu, we move the thread
ID into the PACA and we add a kvm pointer to the virtual core
structure.

In the case where thread 0 has no vcpu to run, we arrange for it to
go to nap mode, using a new flag value in the PACA 'napping' field
so we can differentiate it when it wakes from the other nap cases.
We set the bit for the thread in the vcore 'napping_threads' field
so that when other threads come out of the guest they will send an
IPI to thread 0 to wake it up.  When it does wake up, we clear that
bit, see what caused the wakeup, and either exit back to the kernel,
or start running virtual thread 0 in the case where it now wants to
enter the guest and the other threads are still in the guest.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_book3s_asm.h |   1 +
 arch/powerpc/include/asm/kvm_host.h   |   4 +
 arch/powerpc/kernel/asm-offsets.c |   5 +-
 arch/powerpc/kvm/book3s_hv.c  |  49 
 arch/powerpc/kvm/book3s_hv_interrupts.S   |   6 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 188 +-
 6 files changed, 192 insertions(+), 61 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h 
b/arch/powerpc/include/asm/kvm_book3s_asm.h
index 22f4606..8669ac0 100644
--- a/arch/powerpc/include/asm/kvm_book3s_asm.h
+++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
@@ -87,6 +87,7 @@ struct kvmppc_host_state {
u8 hwthread_req;
u8 hwthread_state;
u8 host_ipi;
+   u8 ptid;
struct kvm_vcpu *kvm_vcpu;
struct kvmppc_vcore *kvm_vcore;
unsigned long xics_phys;
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 5a40270..6a43455 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -284,6 +284,8 @@ struct kvmppc_vcore {
int n_woken;
int nap_count;
int napping_threads;
+   int real_mode_threads;
+   int first_vcpuid;
u16 pcpu;
u16 last_cpu;
u8 vcore_state;
@@ -294,6 +296,7 @@ struct kvmppc_vcore {
u64 stolen_tb;
u64 preempt_tb;
struct kvm_vcpu *runner;
+   struct kvm *kvm;
u64 tb_offset;  /* guest timebase - host timebase */
u32 arch_compat;
ulong pcr;
@@ -575,6 +578,7 @@ struct kvm_vcpu_arch {
int state;
int ptid;
bool timer_running;
+   u8 is_master;
wait_queue_head_t cpu_run;
 
struct kvm_vcpu_arch_shared *shared;
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 115dd64..51d7eed 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -514,13 +514,15 @@ int main(void)
DEFINE(VCPU_FAULT_DAR, offsetof(struct kvm_vcpu, arch.fault_dar));
DEFINE(VCPU_LAST_INST, offsetof(struct kvm_vcpu, arch.last_inst));
DEFINE(VCPU_TRAP, offsetof(struct kvm_vcpu, arch.trap));
-   DEFINE(VCPU_PTID, offsetof(struct kvm_vcpu, arch.ptid));
+   DEFINE(VCPU_IS_MASTER, offsetof(struct kvm_vcpu, arch.is_master));
DEFINE(VCPU_CFAR, offsetof(struct kvm_vcpu, arch.cfar));
DEFINE(VCPU_PPR, offsetof(struct kvm_vcpu, arch.ppr));
DEFINE(VCORE_ENTRY_EXIT, offsetof(struct kvmppc_vcore, 
entry_exit_count));
DEFINE(VCORE_NAP_COUNT, offsetof(struct kvmppc_vcore, nap_count));
DE

[RFC PATCH 09/10] KVM: PPC: Book3S HV: Handle new LPCR bits on POWER8

2013-09-05 Thread Paul Mackerras
POWER8 has a bit in the LPCR to enable or disable the PURR and SPURR
registers to count when in the guest.  Set this bit.

POWER8 has a field in the LPCR called AIL (Alternate Interrupt Location)
which is used to enable relocation-on interrupts.  Allow userspace to
set this field.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/reg.h | 2 ++
 arch/powerpc/kvm/book3s_hv.c   | 6 ++
 2 files changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 73fdd62..60c2dd8 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -291,8 +291,10 @@
 #define   LPCR_RMLS    0x1C000000      /* impl dependent rmo limit sel */
 #define  LPCR_RMLS_SH  (63-37)
 #define   LPCR_ILE     0x02000000      /* !HV irqs set MSR:LE */
+#define   LPCR_AIL     0x01800000      /* Alternate interrupt location */
 #define   LPCR_AIL_0   0x00000000      /* MMU off exception offset 0x0 */
 #define   LPCR_AIL_3   0x01800000      /* MMU on exception offset 0xc00...4xxx */
+#define   LPCR_ONL     0x00040000      /* online - PURR/SPURR count */
 #define   LPCR_PECE    0x0001f000      /* powersave exit cause enable */
 #define     LPCR_PECEDP 0x00010000     /* directed priv dbells cause exit */
 #define     LPCR_PECEDH 0x00008000     /* directed hyp dbells cause exit */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 217041f..95a635d 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -783,8 +783,11 @@ static void kvmppc_set_lpcr(struct kvm_vcpu *vcpu, u64 
new_lpcr)
/*
 * Userspace can only modify DPFD (default prefetch depth),
 * ILE (interrupt little-endian) and TC (translation control).
+* On POWER8 userspace can also modify AIL (alt. interrupt loc.)
 */
mask = LPCR_DPFD | LPCR_ILE | LPCR_TC;
+   if (cpu_has_feature(CPU_FTR_ARCH_207S))
+   mask |= LPCR_AIL;
kvm->arch.lpcr = (kvm->arch.lpcr & ~mask) | (new_lpcr & mask);
mutex_unlock(&kvm->lock);
 }
@@ -2169,6 +2172,9 @@ int kvmppc_core_init_vm(struct kvm *kvm)
LPCR_VPM0 | LPCR_VPM1;
kvm->arch.vrma_slb_v = SLB_VSID_B_1T |
(VRMA_VSID << SLB_VSID_SHIFT_1T);
+   /* On POWER8 turn on online bit to enable PURR/SPURR */
+   if (cpu_has_feature(CPU_FTR_ARCH_207S))
+   lpcr |= LPCR_ONL;
}
kvm->arch.lpcr = lpcr;
 
-- 
1.8.4.rc3



[RFC PATCH 06/10] KVM: PPC: Book3S HV: Implement architecture compatibility modes for POWER8

2013-09-05 Thread Paul Mackerras
This allows us to select architecture 2.05 (POWER6) or 2.06 (POWER7)
compatibility modes on a POWER8 processor.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/reg.h |  2 ++
 arch/powerpc/kvm/book3s_hv.c   | 16 +++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 4ca8b85..483e0a2 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -315,6 +315,8 @@
 #define SPRN_PCR   0x152   /* Processor compatibility register */
 #define   PCR_VEC_DIS  (1ul << (63-0)) /* Vec. disable (pre POWER8) */
 #define   PCR_VSX_DIS  (1ul << (63-1)) /* VSX disable (pre POWER8) */
+#define   PCR_TM_DIS   (1ul << (63-2)) /* Trans. memory disable (POWER8) */
+#define   PCR_ARCH_206 0x4 /* Architecture 2.06 */
 #define   PCR_ARCH_205 0x2 /* Architecture 2.05 */
 #defineSPRN_HEIR   0x153   /* Hypervisor Emulated Instruction 
Register */
 #define SPRN_TLBINDEXR 0x154   /* P7 TLB control register */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index da8619e..217041f 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -177,14 +177,28 @@ int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, u32 
arch_compat)
 
switch (arch_compat) {
case PVR_ARCH_205:
-   pcr = PCR_ARCH_205;
+   /*
+* If an arch bit is set in PCR, all the defined
+* higher-order arch bits also have to be set.
+*/
+   pcr = PCR_ARCH_206 | PCR_ARCH_205;
break;
case PVR_ARCH_206:
case PVR_ARCH_206p:
+   pcr = PCR_ARCH_206;
+   break;
+   case PVR_ARCH_207:
break;
default:
return -EINVAL;
}
+
+   if (!cpu_has_feature(CPU_FTR_ARCH_207S)) {
+   /* POWER7 can't emulate POWER8 */
+   if (!(pcr & PCR_ARCH_206))
+   return -EINVAL;
+   pcr &= ~PCR_ARCH_206;
+   }
}
 
spin_lock(&vc->lock);
-- 
1.8.4.rc3



[RFC PATCH 07/10] KVM: PPC: Book3S HV: Consolidate code that checks reason for wake from nap

2013-09-05 Thread Paul Mackerras
Currently in book3s_hv_rmhandlers.S we have three places where we
have woken up from nap mode and we check the reason field in SRR1
to see what event woke us up.  This consolidates them into a new
function, kvmppc_check_wake_reason.  It looks at the wake reason
field in SRR1, and if it indicates that an external interrupt caused
the wakeup, calls kvmppc_read_intr to check what sort of interrupt
it was.

This also consolidates the two places where we synthesize an external
interrupt (0x500 vector) for the guest.  Now, if the guest exit code
finds that there was an external interrupt which has been handled
(i.e. it was an IPI indicating that there is now an interrupt pending
for the guest), it jumps to deliver_guest_interrupt, which is in the
last part of the guest entry code, where we synthesize guest external
and decrementer interrupts.  That code has been streamlined a little
and now clears LPCR[MER] when appropriate as well as setting it.

The extra clearing of any pending IPI on a secondary, offline CPU
thread before going back to nap mode has been removed.  It is no longer
necessary now that we have code to read and acknowledge IPIs in the
guest exit path.

This fixes a minor bug in the H_CEDE real-mode handling - previously,
if we found that other threads were already exiting the guest when we
were about to go to nap mode, we would branch to the cede wakeup path
and end up looking in SRR1 for a wakeup reason.  Now we branch to a
point after we have checked the wakeup reason.

This also fixes a minor bug in kvmppc_read_intr - previously it could
return 0xff rather than 1, in the case where we find that a host IPI
is pending after we have cleared the IPI.  Now it returns 1.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 193 +---
 1 file changed, 78 insertions(+), 115 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 612c9c8..6351ce2 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -234,8 +234,10 @@ kvm_novcpu_wakeup:
stb r0, HSTATE_NAPPING(r13)
stb r0, HSTATE_HWTHREAD_REQ(r13)
 
+   /* check the wake reason */
+   bl  kvmppc_check_wake_reason
+   
/* see if any other thread is already exiting */
-   li  r12, 0
lwz r0, VCORE_ENTRY_EXIT(r5)
cmpwi   r0, 0x100
bge kvm_novcpu_exit
@@ -245,23 +247,14 @@ kvm_novcpu_wakeup:
li  r0, 1
sld r0, r0, r7
addir6, r5, VCORE_NAPPING_THREADS
-4: lwarx   r3, 0, r6
-   andcr3, r3, r0
-   stwcx.  r3, 0, r6
+4: lwarx   r7, 0, r6
+   andcr7, r7, r0
+   stwcx.  r7, 0, r6
bne 4b
 
-   /* Check the wake reason in SRR1 to see why we got here */
-   mfspr   r3, SPRN_SRR1
-   rlwinm  r3, r3, 44-31, 0x7  /* extract wake reason field */
-   cmpwi   r3, 4   /* was it an external interrupt? */
-   bne kvm_novcpu_exit /* if not, exit the guest */
-
-   /* extern interrupt - read and handle it */
-   li  r12, BOOK3S_INTERRUPT_EXTERNAL
-   bl  kvmppc_read_intr
+   /* See if the wake reason means we need to exit */
cmpdi   r3, 0
bge kvm_novcpu_exit
-   li  r12, 0
 
/* Got an IPI but other vcpus aren't yet exiting, must be a latecomer */
ld  r4, HSTATE_KVM_VCPU(r13)
@@ -325,58 +318,23 @@ kvm_start_guest:
 */
 
/* Check the wake reason in SRR1 to see why we got here */
-   mfspr   r3,SPRN_SRR1
-   rlwinm  r3,r3,44-31,0x7 /* extract wake reason field */
-   cmpwi   r3,4/* was it an external interrupt? */
-   bne 27f /* if not */
-   ld  r5,HSTATE_XICS_PHYS(r13)
-   li  r7,XICS_XIRR/* if it was an external interrupt, */
-   lwzcix  r8,r5,r7/* get and ack the interrupt */
-   sync
-   clrldi. r9,r8,40/* get interrupt source ID. */
-   beq 28f /* none there? */
-   cmpwi   r9,XICS_IPI /* was it an IPI? */
-   bne 29f
-   li  r0,0xff
-   li  r6,XICS_MFRR
-   stbcix  r0,r5,r6/* clear IPI */
-   stwcix  r8,r5,r7/* EOI the interrupt */
-   sync/* order loading of vcpu after that */
+   bl  kvmppc_check_wake_reason
+   cmpdi   r3, 0
+   bge kvm_no_guest
 
/* get vcpu pointer, NULL if we have no vcpu to run */
ld  r4,HSTATE_KVM_VCPU(r13)
cmpdi   r4,0
/* if we have no vcpu to run, go back to sleep */
beq kvm_no_guest
-   b   30f
 
-27:/* XXX should handle hypervisor maintenance interrupts etc. here */
-   b   kvm_no_guest
-28:/* SRR1 said external but ICP said

[RFC PATCH 08/10] KVM: PPC: Book3S HV: Handle guest using doorbells for IPIs

2013-09-05 Thread Paul Mackerras
* SRR1 wake reason field for system reset interrupt on wakeup from nap
  is now a 4-bit field on P8, compared to 3 bits on P7.

* Set PECEDP in LPCR when napping because of H_CEDE so guest doorbells
  will wake us up.

* Waking up from nap because of a guest doorbell interrupt is not a
  reason to exit the guest.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/reg.h  |  4 +++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 19 +++
 2 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 483e0a2..73fdd62 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -293,7 +293,9 @@
 #define   LPCR_ILE     0x02000000      /* !HV irqs set MSR:LE */
 #define   LPCR_AIL_0   0x00000000      /* MMU off exception offset 0x0 */
 #define   LPCR_AIL_3   0x01800000      /* MMU on exception offset 0xc00...4xxx */
-#define   LPCR_PECE    0x00007000      /* powersave exit cause enable */
+#define   LPCR_PECE    0x0001f000      /* powersave exit cause enable */
+#define     LPCR_PECEDP 0x00010000     /* directed priv dbells cause exit */
+#define     LPCR_PECEDH 0x00008000     /* directed hyp dbells cause exit */
 #define LPCR_PECE0     0x00004000      /* ext. exceptions can cause exit */
 #define LPCR_PECE1     0x00002000      /* decrementer can cause exit */
 #define LPCR_PECE2     0x00001000      /* machine check etc can cause exit */
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 6351ce2..4b3a10e 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1894,13 +1894,16 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_206)
bl  .kvmppc_save_fp
 
/*
-* Take a nap until a decrementer or external interrupt occurs,
-* with PECE1 (wake on decr) and PECE0 (wake on external) set in LPCR
+* Take a nap until a decrementer or external or doorbell interrupt
+* occurs, with PECE1, PECE0 and PECEDP set in LPCR
 */
li  r0,1
stb r0,HSTATE_HWTHREAD_REQ(r13)
mfspr   r5,SPRN_LPCR
ori r5,r5,LPCR_PECE0 | LPCR_PECE1
+BEGIN_FTR_SECTION
+   orisr5,r5,LPCR_PECEDP@h
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
mtspr   SPRN_LPCR,r5
isync
li  r0, 0
@@ -2016,14 +2019,22 @@ machine_check_realmode:
  */
 kvmppc_check_wake_reason:
mfspr   r6, SPRN_SRR1
-   rlwinm  r6, r6, 44-31, 0x7  /* extract wake reason field */
-   cmpwi   r6, 4   /* was it an external interrupt? */
+BEGIN_FTR_SECTION
+   rlwinm  r6, r6, 45-31, 0xf  /* extract wake reason field (P8) */
+FTR_SECTION_ELSE
+   rlwinm  r6, r6, 45-31, 0xe  /* P7 wake reason field is 3 bits */
+ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_207S)
+   cmpwi   r6, 8   /* was it an external interrupt? */
li  r12, BOOK3S_INTERRUPT_EXTERNAL
beq kvmppc_read_intr/* if so, see what it was */
li  r3, 0
li  r12, 0
cmpwi   r6, 6   /* was it the decrementer? */
beq 0f
+BEGIN_FTR_SECTION
+   cmpwi   r6, 5   /* privileged doorbell? */
+   beq 0f
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
li  r3, 1   /* anything else, return 1 */
 0: blr
 
-- 
1.8.4.rc3



[PATCH 11/11] KVM: PPC: Book3S HV: Return -EINVAL rather than BUG'ing

2013-09-05 Thread Paul Mackerras
From: Michael Ellerman 

This means that if we do happen to get a trap that we don't know
about, we abort the guest rather than crashing the host kernel.

Signed-off-by: Michael Ellerman 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 0bb23a9..731e46e 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -709,8 +709,7 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
printk(KERN_EMERG "trap=0x%x | pc=0x%lx | msr=0x%llx\n",
vcpu->arch.trap, kvmppc_get_pc(vcpu),
vcpu->arch.shregs.msr);
-   r = RESUME_HOST;
-   BUG();
+   r = -EINVAL;
break;
}
 
-- 
1.8.4.rc3



[PATCH 09/11] KVM: PPC: Book3S HV: Pull out interrupt-reading code into a subroutine

2013-09-05 Thread Paul Mackerras
This moves the code in book3s_hv_rmhandlers.S that reads any pending
interrupt from the XICS interrupt controller, and works out whether
it is an IPI for the guest, an IPI for the host, or a device interrupt,
into a new function called kvmppc_read_intr.  Later patches will
need this.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 117 +++-
 1 file changed, 68 insertions(+), 49 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index d9ab139..01515b6 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -868,46 +868,11 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_206)
 * set, we know the host wants us out so let's do it now
 */
 do_ext_interrupt:
-   lbz r0, HSTATE_HOST_IPI(r13)
-   cmpwi   r0, 0
-   bne ext_interrupt_to_host
-
-   /* Now read the interrupt from the ICP */
-   ld  r5, HSTATE_XICS_PHYS(r13)
-   li  r7, XICS_XIRR
-   cmpdi   r5, 0
-   beq-ext_interrupt_to_host
-   lwzcix  r3, r5, r7
-   rlwinm. r0, r3, 0, 0xff
-   sync
-   beq 3f  /* if nothing pending in the ICP */
-
-   /* We found something in the ICP...
-*
-* If it's not an IPI, stash it in the PACA and return to
-* the host, we don't (yet) handle directing real external
-* interrupts directly to the guest
-*/
-   cmpwi   r0, XICS_IPI
-   bne ext_stash_for_host
-
-   /* It's an IPI, clear the MFRR and EOI it */
-   li  r0, 0xff
-   li  r6, XICS_MFRR
-   stbcix  r0, r5, r6  /* clear the IPI */
-   stwcix  r3, r5, r7  /* EOI it */
-   sync
-
-   /* We need to re-check host IPI now in case it got set in the
-* meantime. If it's clear, we bounce the interrupt to the
-* guest
-*/
-   lbz r0, HSTATE_HOST_IPI(r13)
-   cmpwi   r0, 0
-   bne-1f
+   bl  kvmppc_read_intr
+   cmpdi   r3, 0
+   bgt ext_interrupt_to_host
 
/* Allright, looks like an IPI for the guest, we need to set MER */
-3:
/* Check if any CPU is heading out to the host, if so head out too */
ld  r5, HSTATE_KVM_VCORE(r13)
lwz r0, VCORE_ENTRY_EXIT(r5)
@@ -936,17 +901,6 @@ do_ext_interrupt:
mtspr   SPRN_LPCR, r8
b   fast_guest_return
 
-   /* We raced with the host, we need to resend that IPI, bummer */
-1: li  r0, IPI_PRIORITY
-   stbcix  r0, r5, r6  /* set the IPI */
-   sync
-   b   ext_interrupt_to_host
-
-ext_stash_for_host:
-   /* It's not an IPI and it's for the host, stash it in the PACA
-* before exit, it will be picked up by the host ICP driver
-*/
-   stw r3, HSTATE_SAVED_XIRR(r13)
 ext_interrupt_to_host:
 
 guest_exit_cont:   /* r9 = vcpu, r12 = trap, r13 = paca */
@@ -1821,6 +1775,71 @@ machine_check_realmode:
b   fast_interrupt_c_return
 
 /*
+ * Determine what sort of external interrupt is pending (if any).
+ * Returns:
+ * 0 if no interrupt is pending
+ * 1 if an interrupt is pending that needs to be handled by the host
+ * -1 if there was a guest wakeup IPI (which has now been cleared)
+ */
+kvmppc_read_intr:
+   /* see if a host IPI is pending */
+   li  r3, 1
+   lbz r0, HSTATE_HOST_IPI(r13)
+   cmpwi   r0, 0
+   bne 1f
+
+   /* Now read the interrupt from the ICP */
+   ld  r6, HSTATE_XICS_PHYS(r13)
+   li  r7, XICS_XIRR
+   cmpdi   r6, 0
+   beq-1f
+   lwzcix  r0, r6, r7
+   rlwinm. r3, r0, 0, 0xff
+   sync
+   beq 1f  /* if nothing pending in the ICP */
+
+   /* We found something in the ICP...
+*
+* If it's not an IPI, stash it in the PACA and return to
+* the host, we don't (yet) handle directing real external
+* interrupts directly to the guest
+*/
+   cmpwi   r3, XICS_IPI/* if there is, is it an IPI? */
+   li  r3, 1
+   bne 42f
+
+   /* It's an IPI, clear the MFRR and EOI it */
+   li  r3, 0xff
+   li  r8, XICS_MFRR
+   stbcix  r3, r6, r8  /* clear the IPI */
+   stwcix  r0, r6, r7  /* EOI it */
+   sync
+
+   /* We need to re-check host IPI now in case it got set in the
+* meantime. If it's clear, we bounce the interrupt to the
+* guest
+*/
+   lbz r0, HSTATE_HOST_IPI(r13)
+   cmpwi   r0, 0
+   bne-43f
+
+   /* OK, it's an IPI for us */
+   li  r3, -1
+1: blr
+
+42:/* It's not an IPI and it's for the host, stash it in the PACA
+* before exit, it will be picked up by the host ICP driver
+*/
+   stw r0, HSTATE_SAVED_XIRR(r13)
+   b   

[PATCH 10/11] KVM: PPC: Book3S HV: Avoid unbalanced increments of VPA yield count

2013-09-05 Thread Paul Mackerras
The yield count in the VPA is supposed to be incremented every time
we enter the guest, and every time we exit the guest, so that its
value is even when the vcpu is running in the guest and odd when it
isn't.  However, it's currently possible that we increment the yield
count on the way into the guest but then find that other CPU threads
are already exiting the guest, so we go back to nap mode via the
secondary_too_late label.  In this situation we don't increment the
yield count again, breaking the relationship between the LSB of the
count and whether the vcpu is in the guest.

To fix this, we move the increment of the yield count to a point
after we have checked whether other CPU threads are exiting.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 01515b6..31030f3 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -401,16 +401,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
/* Save R1 in the PACA */
std r1, HSTATE_HOST_R1(r13)
 
-   /* Increment yield count if they have a VPA */
-   ld  r3, VCPU_VPA(r4)
-   cmpdi   r3, 0
-   beq 25f
-   lwz r5, LPPACA_YIELDCOUNT(r3)
-   addir5, r5, 1
-   stw r5, LPPACA_YIELDCOUNT(r3)
-   li  r6, 1
-   stb r6, VCPU_VPA_DIRTY(r4)
-25:
/* Load up DAR and DSISR */
ld  r5, VCPU_DAR(r4)
lwz r6, VCPU_DSISR(r4)
@@ -525,6 +515,16 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
mtspr   SPRN_RMOR,r8
isync
 
+   /* Increment yield count if they have a VPA */
+   ld  r3, VCPU_VPA(r4)
+   cmpdi   r3, 0
+   beq 25f
+   lwz r5, LPPACA_YIELDCOUNT(r3)
+   addir5, r5, 1
+   stw r5, LPPACA_YIELDCOUNT(r3)
+   li  r6, 1
+   stb r6, VCPU_VPA_DIRTY(r4)
+25:
/* Check if HDEC expires soon */
mfspr   r3,SPRN_HDEC
cmpwi   r3,10
-- 
1.8.4.rc3



[PATCH 08/11] KVM: PPC: Book3S HV: Restructure kvmppc_hv_entry to be a subroutine

2013-09-05 Thread Paul Mackerras
We have two paths into and out of the low-level guest entry and exit
code: from a vcpu task via kvmppc_hv_entry_trampoline, and from the
system reset vector for an offline secondary thread on POWER7 via
kvm_start_guest.  Currently both just branch to kvmppc_hv_entry to
enter the guest, and on guest exit, we test the vcpu physical thread
ID to detect which way we came in and thus whether we should return
to the vcpu task or go back to nap mode.

In order to make the code flow clearer, and to keep the code relating
to each flow together, this turns kvmppc_hv_entry into a subroutine
that follows the normal conventions for call and return.  This means
that kvmppc_hv_entry_trampoline() and kvmppc_hv_entry() now establish
normal stack frames, and we use the normal stack slots for saving
return addresses rather than local_paca->kvm_hstate.vmhandler.  Apart
from that this is mostly moving code around unchanged.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 344 +---
 1 file changed, 178 insertions(+), 166 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 023d8600..d9ab139 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -62,8 +62,11 @@ kvmppc_skip_Hinterrupt:
  * LR = return address to continue at after eventually re-enabling MMU
  */
 _GLOBAL(kvmppc_hv_entry_trampoline)
+   mflrr0
+   std r0, PPC_LR_STKOFF(r1)
+   stdur1, -112(r1)
mfmsr   r10
-   LOAD_REG_ADDR(r5, kvmppc_hv_entry)
+   LOAD_REG_ADDR(r5, kvmppc_call_hv_entry)
li  r0,MSR_RI
andcr0,r10,r0
li  r6,MSR_IR | MSR_DR
@@ -73,11 +76,103 @@ _GLOBAL(kvmppc_hv_entry_trampoline)
mtsrr1  r6
RFI
 
-/**
- **
- *   Entry code   *
- **
- */
+kvmppc_call_hv_entry:
+   bl  kvmppc_hv_entry
+
+   /* Back from guest - restore host state and return to caller */
+
+   /* Restore host DABR and DABRX */
+   ld  r5,HSTATE_DABR(r13)
+   li  r6,7
+   mtspr   SPRN_DABR,r5
+   mtspr   SPRN_DABRX,r6
+
+   /* Restore SPRG3 */
+   ld  r3,PACA_SPRG3(r13)
+   mtspr   SPRN_SPRG3,r3
+
+   /*
+* Reload DEC.  HDEC interrupts were disabled when
+* we reloaded the host's LPCR value.
+*/
+   ld  r3, HSTATE_DECEXP(r13)
+   mftbr4
+   subfr4, r4, r3
+   mtspr   SPRN_DEC, r4
+
+   /* Reload the host's PMU registers */
+   ld  r3, PACALPPACAPTR(r13)  /* is the host using the PMU? */
+   lbz r4, LPPACA_PMCINUSE(r3)
+   cmpwi   r4, 0
+   beq 23f /* skip if not */
+   lwz r3, HSTATE_PMC(r13)
+   lwz r4, HSTATE_PMC + 4(r13)
+   lwz r5, HSTATE_PMC + 8(r13)
+   lwz r6, HSTATE_PMC + 12(r13)
+   lwz r8, HSTATE_PMC + 16(r13)
+   lwz r9, HSTATE_PMC + 20(r13)
+BEGIN_FTR_SECTION
+   lwz r10, HSTATE_PMC + 24(r13)
+   lwz r11, HSTATE_PMC + 28(r13)
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
+   mtspr   SPRN_PMC1, r3
+   mtspr   SPRN_PMC2, r4
+   mtspr   SPRN_PMC3, r5
+   mtspr   SPRN_PMC4, r6
+   mtspr   SPRN_PMC5, r8
+   mtspr   SPRN_PMC6, r9
+BEGIN_FTR_SECTION
+   mtspr   SPRN_PMC7, r10
+   mtspr   SPRN_PMC8, r11
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
+   ld  r3, HSTATE_MMCR(r13)
+   ld  r4, HSTATE_MMCR + 8(r13)
+   ld  r5, HSTATE_MMCR + 16(r13)
+   mtspr   SPRN_MMCR1, r4
+   mtspr   SPRN_MMCRA, r5
+   mtspr   SPRN_MMCR0, r3
+   isync
+23:
+
+   /*
+* For external and machine check interrupts, we need
+* to call the Linux handler to process the interrupt.
+* We do that by jumping to absolute address 0x500 for
+* external interrupts, or the machine_check_fwnmi label
+* for machine checks (since firmware might have patched
+* the vector area at 0x200).  The [h]rfid at the end of the
+* handler will return to the book3s_hv_interrupts.S code.
+* For other interrupts we do the rfid to get back
+* to the book3s_hv_interrupts.S code here.
+*/
+   ld  r8, 112+PPC_LR_STKOFF(r1)
+   addi    r1, r1, 112
+   ld  r7, HSTATE_HOST_MSR(r13)
+
+   cmpwi   cr1, r12, BOOK3S_INTERRUPT_MACHINE_CHECK
+   cmpwi   r12, BOOK3S_INTERRUPT_EXTERNAL
+BEGIN_FTR_SECTION
+   beq 11f
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
+
+   /* RFI into the highmem handler, or branch to interrupt handler */
+  

[PATCH 07/11] KVM: PPC: Book3S HV: Implement H_CONFER

2013-09-05 Thread Paul Mackerras
The H_CONFER hypercall is used when a guest vcpu is spinning on a lock
held by another vcpu which has been preempted, and the spinning vcpu
wishes to give its timeslice to the lock holder.  We implement this
in the straightforward way using kvm_vcpu_yield_to().

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 1a10afa..0bb23a9 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -567,6 +567,15 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
}
break;
case H_CONFER:
+   target = kvmppc_get_gpr(vcpu, 4);
+   if (target == -1)
+   break;
+   tvcpu = kvmppc_find_vcpu(vcpu->kvm, target);
+   if (!tvcpu) {
+   ret = H_PARAMETER;
+   break;
+   }
+   kvm_vcpu_yield_to(tvcpu);
break;
case H_REGISTER_VPA:
ret = do_h_register_vpa(vcpu, kvmppc_get_gpr(vcpu, 4),
-- 
1.8.4.rc3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/11] KVM: PPC: Book3S HV: Save/restore SIAR and SDAR along with other PMU registers

2013-09-05 Thread Paul Mackerras
Currently we are not saving and restoring the SIAR and SDAR registers in
the PMU (performance monitor unit) on guest entry and exit.  The result
is that performance monitoring tools in the guest could get false
information about where a program was executing and what data it was
accessing at the time of a performance monitor interrupt.  This fixes
it by saving and restoring these registers along with the other PMU
registers on guest entry/exit.

This also provides a way for userspace to access these values for a
vcpu via the one_reg interface.
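
For illustration only, a minimal userspace sketch of reading the new SIAR
register through the one_reg interface once this patch is applied (vcpu_fd
is assumed to be an already-created vcpu file descriptor; error handling
omitted):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Hedged example: fetch the saved guest SIAR value for one vcpu. */
    static uint64_t get_guest_siar(int vcpu_fd)
    {
            uint64_t val = 0;
            struct kvm_one_reg reg = {
                    .id   = KVM_REG_PPC_SIAR,      /* added by this patch */
                    .addr = (uintptr_t)&val,
            };

            ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg); /* KVM fills in 'val' */
            return val;
    }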

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_host.h |  2 ++
 arch/powerpc/kernel/asm-offsets.c   |  2 ++
 arch/powerpc/kvm/book3s_hv.c| 12 
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |  8 
 4 files changed, 24 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 3328353..91b833d 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -498,6 +498,8 @@ struct kvm_vcpu_arch {
 
u64 mmcr[3];
u32 pmc[8];
+   u64 siar;
+   u64 sdar;
 
 #ifdef CONFIG_KVM_EXIT_TIMING
struct mutex exit_timing_lock;
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 26098c2..6a7916d 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -505,6 +505,8 @@ int main(void)
DEFINE(VCPU_PRODDED, offsetof(struct kvm_vcpu, arch.prodded));
DEFINE(VCPU_MMCR, offsetof(struct kvm_vcpu, arch.mmcr));
DEFINE(VCPU_PMC, offsetof(struct kvm_vcpu, arch.pmc));
+   DEFINE(VCPU_SIAR, offsetof(struct kvm_vcpu, arch.siar));
+   DEFINE(VCPU_SDAR, offsetof(struct kvm_vcpu, arch.sdar));
DEFINE(VCPU_SLB, offsetof(struct kvm_vcpu, arch.slb));
DEFINE(VCPU_SLB_MAX, offsetof(struct kvm_vcpu, arch.slb_max));
DEFINE(VCPU_SLB_NR, offsetof(struct kvm_vcpu, arch.slb_nr));
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 8aadd23..29bdeca2 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -749,6 +749,12 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id, 
union kvmppc_one_reg *val)
i = id - KVM_REG_PPC_PMC1;
*val = get_reg_val(id, vcpu->arch.pmc[i]);
break;
+   case KVM_REG_PPC_SIAR:
+   *val = get_reg_val(id, vcpu->arch.siar);
+   break;
+   case KVM_REG_PPC_SDAR:
+   *val = get_reg_val(id, vcpu->arch.sdar);
+   break;
 #ifdef CONFIG_VSX
case KVM_REG_PPC_FPR0 ... KVM_REG_PPC_FPR31:
if (cpu_has_feature(CPU_FTR_VSX)) {
@@ -833,6 +839,12 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id, 
union kvmppc_one_reg *val)
i = id - KVM_REG_PPC_PMC1;
vcpu->arch.pmc[i] = set_reg_val(id, *val);
break;
+   case KVM_REG_PPC_SIAR:
+   vcpu->arch.siar = set_reg_val(id, *val);
+   break;
+   case KVM_REG_PPC_SDAR:
+   vcpu->arch.sdar = set_reg_val(id, *val);
+   break;
 #ifdef CONFIG_VSX
case KVM_REG_PPC_FPR0 ... KVM_REG_PPC_FPR31:
if (cpu_has_feature(CPU_FTR_VSX)) {
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 60dce5b..bfb4b0a 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -196,8 +196,12 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
ld  r3, VCPU_MMCR(r4)
ld  r5, VCPU_MMCR + 8(r4)
ld  r6, VCPU_MMCR + 16(r4)
+   ld  r7, VCPU_SIAR(r4)
+   ld  r8, VCPU_SDAR(r4)
mtspr   SPRN_MMCR1, r5
mtspr   SPRN_MMCRA, r6
+   mtspr   SPRN_SIAR, r7
+   mtspr   SPRN_SDAR, r8
mtspr   SPRN_MMCR0, r3
isync
 
@@ -1122,9 +1126,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
std r3, VCPU_MMCR(r9)   /* if not, set saved MMCR0 to FC */
b   22f
21:    mfspr   r5, SPRN_MMCR1
+   mfspr   r7, SPRN_SIAR
+   mfspr   r8, SPRN_SDAR
std r4, VCPU_MMCR(r9)
std r5, VCPU_MMCR + 8(r9)
std r6, VCPU_MMCR + 16(r9)
+   std r7, VCPU_SIAR(r9)
+   std r8, VCPU_SDAR(r9)
mfspr   r3, SPRN_PMC1
mfspr   r4, SPRN_PMC2
mfspr   r5, SPRN_PMC3
-- 
1.8.4.rc3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/11] KVM: PPC: Book3S HV: Add GET/SET_ONE_REG interface for LPCR

2013-09-05 Thread Paul Mackerras
This adds the ability for userspace to read and write the LPCR
(Logical Partitioning Control Register) value relating to a guest
via the GET/SET_ONE_REG interface.  There is only one LPCR value
for the guest, which can be accessed through any vcpu.  Userspace
can only modify the following fields of the LPCR value:

DPFDDefault prefetch depth
ILE Interrupt little-endian
TC  Translation control (secondary HPT hash group search disable)
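
As an illustration only (not part of the patch), userspace could update just
the ILE bit with a read-modify-write through the new register; bits outside
DPFD, ILE and TC are silently ignored by the kernel. The vcpu_fd and the
GUEST_LPCR_ILE bit value below are assumptions made for the sketch:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    #define GUEST_LPCR_ILE  0x02000000ULL  /* assumed ILE bit position */

    static int set_guest_lpcr_ile(int vcpu_fd)
    {
            uint64_t lpcr = 0;
            struct kvm_one_reg reg = {
                    .id   = KVM_REG_PPC_LPCR,
                    .addr = (uintptr_t)&lpcr,
            };

            ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);  /* fetch current value */
            lpcr |= GUEST_LPCR_ILE;                 /* interrupts set MSR[LE]=1 */
            /* Only the DPFD, ILE and TC fields of this write take effect. */
            return ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg);
    }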

Signed-off-by: Paul Mackerras 
---
 Documentation/virtual/kvm/api.txt   |  1 +
 arch/powerpc/include/asm/reg.h  |  2 ++
 arch/powerpc/include/uapi/asm/kvm.h |  1 +
 arch/powerpc/kvm/book3s_hv.c| 21 +
 4 files changed, 25 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index c36ff9af..1030ac9 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1835,6 +1835,7 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_PID  | 64
   PPC   | KVM_REG_PPC_ACOP | 64
   PPC   | KVM_REG_PPC_VRSAVE   | 32
+  PPC   | KVM_REG_PPC_LPCR | 64
   PPC   | KVM_REG_PPC_TM_GPR0  | 64
   ...
   PPC   | KVM_REG_PPC_TM_GPR31 | 64
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 342e4ea..3fc0d06 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -275,6 +275,7 @@
 #define   LPCR_ISL (1ul << (63-2))
 #define   LPCR_VC_SH   (63-2)
 #define   LPCR_DPFD_SH (63-11)
+#define   LPCR_DPFD(7ul << LPCR_DPFD_SH)
 #define   LPCR_VRMASD  (0x1ful << (63-16))
 #define   LPCR_VRMA_L  (1ul << (63-12))
 #define   LPCR_VRMA_LP0(1ul << (63-15))
@@ -291,6 +292,7 @@
 #define LPCR_PECE2 0x1000  /* machine check etc can cause exit */
 #define   LPCR_MER 0x0800  /* Mediated External Exception */
 #define   LPCR_MER_SH  11
+#define   LPCR_TC  0x0200  /* Translation control */
 #define   LPCR_LPES0x000c
 #define   LPCR_LPES0   0x0008  /* LPAR Env selector 0 */
 #define   LPCR_LPES1   0x0004  /* LPAR Env selector 1 */
diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index b98bf3f..e42127d 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -533,6 +533,7 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_ACOP   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb3)
 
 #define KVM_REG_PPC_VRSAVE (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb4)
+#define KVM_REG_PPC_LPCR   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb5)
 
 /* Transactional Memory checkpointed state:
  * This is all GPRs, all VSX regs and a subset of SPRs
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index b930caf..9c878d7 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -714,6 +714,21 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
return 0;
 }
 
+static void kvmppc_set_lpcr(struct kvm_vcpu *vcpu, u64 new_lpcr)
+{
+   struct kvm *kvm = vcpu->kvm;
+   u64 mask;
+
+   mutex_lock(&kvm->lock);
+   /*
+* Userspace can only modify DPFD (default prefetch depth),
+* ILE (interrupt little-endian) and TC (translation control).
+*/
+   mask = LPCR_DPFD | LPCR_ILE | LPCR_TC;
+   kvm->arch.lpcr = (kvm->arch.lpcr & ~mask) | (new_lpcr & mask);
+   mutex_unlock(&kvm->lock);
+}
+
 int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id, union kvmppc_one_reg 
*val)
 {
int r = 0;
@@ -796,6 +811,9 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id, union 
kvmppc_one_reg *val)
case KVM_REG_PPC_TB_OFFSET:
*val = get_reg_val(id, vcpu->arch.vcore->tb_offset);
break;
+   case KVM_REG_PPC_LPCR:
+   *val = get_reg_val(id, vcpu->kvm->arch.lpcr);
+   break;
default:
r = -EINVAL;
break;
@@ -900,6 +918,9 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id, union 
kvmppc_one_reg *val)
vcpu->arch.vcore->tb_offset =
ALIGN(set_reg_val(id, *val), 1UL << 24);
break;
+   case KVM_REG_PPC_LPCR:
+   kvmppc_set_lpcr(vcpu, set_reg_val(id, *val));
+   break;
default:
r = -EINVAL;
break;
-- 
1.8.4.rc3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 03/11] KVM: PPC: Book3S: Add GET/SET_ONE_REG interface for VRSAVE

2013-09-05 Thread Paul Mackerras
The VRSAVE register value for a vcpu is accessible through the
GET/SET_SREGS interface for Book E processors, but not for Book 3S
processors.  In order to make this accessible for Book 3S processors,
this adds a new register identifier for GET/SET_ONE_REG, and adds
the code to implement it.

Signed-off-by: Paul Mackerras 
---
 Documentation/virtual/kvm/api.txt   |  1 +
 arch/powerpc/include/uapi/asm/kvm.h |  2 ++
 arch/powerpc/kvm/book3s.c   | 10 ++
 3 files changed, 13 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 9486e5a..c36ff9af 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1834,6 +1834,7 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_TCSCR| 64
   PPC   | KVM_REG_PPC_PID  | 64
   PPC   | KVM_REG_PPC_ACOP | 64
+  PPC   | KVM_REG_PPC_VRSAVE   | 32
   PPC   | KVM_REG_PPC_TM_GPR0  | 64
   ...
   PPC   | KVM_REG_PPC_TM_GPR31 | 64
diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index a8124fe..b98bf3f 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -532,6 +532,8 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_PID(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb2)
 #define KVM_REG_PPC_ACOP   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb3)
 
+#define KVM_REG_PPC_VRSAVE (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb4)
+
 /* Transactional Memory checkpointed state:
  * This is all GPRs, all VSX regs and a subset of SPRs
  */
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 700df6f..f97369d 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -528,6 +528,9 @@ int kvm_vcpu_ioctl_get_one_reg(struct kvm_vcpu *vcpu, 
struct kvm_one_reg *reg)
}
val = get_reg_val(reg->id, vcpu->arch.vscr.u[3]);
break;
+   case KVM_REG_PPC_VRSAVE:
+   val = get_reg_val(reg->id, vcpu->arch.vrsave);
+   break;
 #endif /* CONFIG_ALTIVEC */
case KVM_REG_PPC_DEBUG_INST: {
u32 opcode = INS_TW;
@@ -605,6 +608,13 @@ int kvm_vcpu_ioctl_set_one_reg(struct kvm_vcpu *vcpu, 
struct kvm_one_reg *reg)
}
vcpu->arch.vscr.u[3] = set_reg_val(reg->id, val);
break;
+   case KVM_REG_PPC_VRSAVE:
+   if (!cpu_has_feature(CPU_FTR_ALTIVEC)) {
+   r = -ENXIO;
+   break;
+   }
+   vcpu->arch.vrsave = set_reg_val(reg->id, val);
+   break;
 #endif /* CONFIG_ALTIVEC */
 #ifdef CONFIG_KVM_XICS
case KVM_REG_PPC_ICP_STATE:
-- 
1.8.4.rc3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 06/11] KVM: PPC: Book3S HV: Support POWER6 compatibility mode on POWER7

2013-09-05 Thread Paul Mackerras
This enables us to use the Processor Compatibility Register (PCR) on
POWER7 to put the processor into architecture 2.05 compatibility mode
when running a guest.  In this mode the new instructions and registers
that were introduced on POWER7 are disabled in user mode.  This
includes all the VSX facilities plus several other instructions such
as ldbrx, stdbrx, popcntw, popcntd, etc.

To select this mode, we have a new register accessible through the
set/get_one_reg interface, called KVM_REG_PPC_ARCH_COMPAT.  Setting
this to zero gives the full set of capabilities of the processor.
Setting it to one of the "logical" PVR values defined in PAPR puts
the vcpu into the compatibility mode for the corresponding
architecture level.  The supported values are:

0x0f000002  Architecture 2.05 (POWER6)
0x0f000003  Architecture 2.06 (POWER7)
0x0f000013  Architecture 2.06+ (POWER7+)

Since the PCR is per-core, the architecture compatibility level and
the corresponding PCR value are stored in the struct kvmppc_vcore, and
are therefore shared between all vcpus in a virtual core.
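
A hedged userspace sketch of putting a virtual core into POWER6
(architecture 2.05) compatibility mode via the new register; vcpu_fd is
assumed to be an existing vcpu file descriptor and error handling is
omitted:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int set_power6_compat(int vcpu_fd)
    {
            uint32_t compat = 0x0f000002;  /* logical PVR for architecture 2.05 */
            struct kvm_one_reg reg = {
                    .id   = KVM_REG_PPC_ARCH_COMPAT,
                    .addr = (uintptr_t)&compat,
            };

            /* The PCR is per-core, so this affects every vcpu in the vcore. */
            return ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg);
    }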

Signed-off-by: Paul Mackerras 
---
 Documentation/virtual/kvm/api.txt   |  1 +
 arch/powerpc/include/asm/kvm_host.h |  2 ++
 arch/powerpc/include/asm/reg.h  | 11 +++
 arch/powerpc/include/uapi/asm/kvm.h |  3 +++
 arch/powerpc/kernel/asm-offsets.c   |  1 +
 arch/powerpc/kvm/book3s_hv.c| 35 +
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 11 +--
 7 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 34a32b6..f1f300f 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1837,6 +1837,7 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_VRSAVE   | 32
   PPC   | KVM_REG_PPC_LPCR | 64
   PPC   | KVM_REG_PPC_PPR  | 64
+  PPC   | KVM_REG_PPC_ARCH_COMPAT | 32
   PPC   | KVM_REG_PPC_TM_GPR0  | 64
   ...
   PPC   | KVM_REG_PPC_TM_GPR31 | 64
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index b0dcd18..5a40270 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -295,6 +295,8 @@ struct kvmppc_vcore {
u64 preempt_tb;
struct kvm_vcpu *runner;
u64 tb_offset;  /* guest timebase - host timebase */
+   u32 arch_compat;
+   ulong pcr;
 };
 
 #define VCORE_ENTRY_COUNT(vc)  ((vc)->entry_exit_count & 0xff)
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 3fc0d06..52ff962 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -305,6 +305,10 @@
 #define   LPID_RSVD0x3ff   /* Reserved LPID for partn switching */
 #defineSPRN_HMER   0x150   /* Hardware m? error recovery */
 #defineSPRN_HMEER  0x151   /* Hardware m? enable error recovery */
+#define SPRN_PCR   0x152   /* Processor compatibility register */
+#define   PCR_VEC_DIS  (1ul << (63-0)) /* Vec. disable (pre POWER8) */
+#define   PCR_VSX_DIS  (1ul << (63-1)) /* VSX disable (pre POWER8) */
+#define   PCR_ARCH_205 0x2 /* Architecture 2.05 */
 #defineSPRN_HEIR   0x153   /* Hypervisor Emulated Instruction 
Register */
 #define SPRN_TLBINDEXR 0x154   /* P7 TLB control register */
 #define SPRN_TLBVPNR   0x155   /* P7 TLB control register */
@@ -1095,6 +1099,13 @@
 #define PVR_BE 0x0070
 #define PVR_PA6T   0x0090
 
+/* "Logical" PVR values defined in PAPR, representing architecture levels */
+#define PVR_ARCH_204   0x0f000001
+#define PVR_ARCH_205   0x0f000002
+#define PVR_ARCH_206   0x0f000003
+#define PVR_ARCH_206p  0x0f000013
+#define PVR_ARCH_207   0x0f000004
+
 /* Macros for setting and retrieving special purpose registers */
 #ifndef __ASSEMBLY__
 #define mfmsr()({unsigned long rval; \
diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index fab6bc1..e420d46 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -536,6 +536,9 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_LPCR   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb5)
 #define KVM_REG_PPC_PPR(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb6)
 
+/* Architecture compatibility level */
+#define KVM_REG_PPC_ARCH_COMPAT(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb7)
+
 /* Transactional Memory checkpointed state:
  * This is all GPRs, all VSX regs and a subset of SPRs
  */
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 5c6ea96..115dd64 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -522,6 +522,7 @@ int main(void)
DEFINE(VCORE_IN_GUEST, offsetof(struct kvmppc_vcore, in_guest));
DEFINE(VCORE_NAPPING_THREADS, offsetof(struct kvmppc_vcore, 
napping_threads));
DEFINE(VCORE_TB_OF

[PATCH 05/11] KVM: PPC: Book3S HV: Add support for guest Program Priority Register

2013-09-05 Thread Paul Mackerras
POWER7 and later IBM server processors have a register called the
Program Priority Register (PPR), which controls the priority of
each hardware CPU SMT thread, and affects how fast it runs compared
to other SMT threads.  This priority can be controlled by writing to
the PPR or by use of a set of instructions of the form or rN,rN,rN
which are otherwise no-ops but have been defined to set the priority
to particular levels.

This adds code to context switch the PPR when entering and exiting
guests and to make the PPR value accessible through the SET/GET_ONE_REG
interface.  When entering the guest, we set the PPR as late as
possible, because if we are setting a low thread priority it will
make the code run slowly from that point on.  Similarly, the
first-level interrupt handlers save the PPR value in the PACA very
early on, and set the thread priority to the medium level, so that
the interrupt handling code runs at a reasonable speed.
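
For reference, a hedged sketch of the two most common priority-setting
no-op forms mentioned above, as they might appear in C code (which levels
are honoured depends on the processor and the privilege state):

    /* Each "or rN,rN,rN" form is architecturally a no-op that also hints
     * an SMT thread priority. */
    static inline void thread_priority_low(void)
    {
            asm volatile("or 1,1,1");  /* low priority, e.g. in a busy-wait loop */
    }

    static inline void thread_priority_medium(void)
    {
            asm volatile("or 2,2,2");  /* back to medium (normal) priority */
    }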

Signed-off-by: Paul Mackerras 
---
 Documentation/virtual/kvm/api.txt |  1 +
 arch/powerpc/include/asm/exception-64s.h  |  8 
 arch/powerpc/include/asm/kvm_book3s_asm.h |  1 +
 arch/powerpc/include/asm/kvm_host.h   |  1 +
 arch/powerpc/include/uapi/asm/kvm.h   |  1 +
 arch/powerpc/kernel/asm-offsets.c |  2 ++
 arch/powerpc/kvm/book3s_hv.c  |  6 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 12 +++-
 8 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 1030ac9..34a32b6 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1836,6 +1836,7 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_ACOP | 64
   PPC   | KVM_REG_PPC_VRSAVE   | 32
   PPC   | KVM_REG_PPC_LPCR | 64
+  PPC   | KVM_REG_PPC_PPR  | 64
   PPC   | KVM_REG_PPC_TM_GPR0  | 64
   ...
   PPC   | KVM_REG_PPC_TM_GPR31 | 64
diff --git a/arch/powerpc/include/asm/exception-64s.h 
b/arch/powerpc/include/asm/exception-64s.h
index 07ca627..b86c4db 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -203,6 +203,10 @@ do_kvm_##n:
\
ld  r10,area+EX_CFAR(r13);  \
std r10,HSTATE_CFAR(r13);   \
END_FTR_SECTION_NESTED(CPU_FTR_CFAR,CPU_FTR_CFAR,947);  \
+   BEGIN_FTR_SECTION_NESTED(948)   \
+   ld  r10,area+EX_PPR(r13);   \
+   std r10,HSTATE_PPR(r13);\
+   END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948);\
ld  r10,area+EX_R10(r13);   \
stw r9,HSTATE_SCRATCH1(r13);\
ld  r9,area+EX_R9(r13); \
@@ -216,6 +220,10 @@ do_kvm_##n:
\
ld  r10,area+EX_R10(r13);   \
beq 89f;\
stw r9,HSTATE_SCRATCH1(r13);\
+   BEGIN_FTR_SECTION_NESTED(948)   \
+   ld  r9,area+EX_PPR(r13);\
+   std r9,HSTATE_PPR(r13); \
+   END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948);\
ld  r9,area+EX_R9(r13); \
std r12,HSTATE_SCRATCH0(r13);   \
li  r12,n;  \
diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h 
b/arch/powerpc/include/asm/kvm_book3s_asm.h
index 9039d3c..22f4606 100644
--- a/arch/powerpc/include/asm/kvm_book3s_asm.h
+++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
@@ -101,6 +101,7 @@ struct kvmppc_host_state {
 #endif
 #ifdef CONFIG_PPC_BOOK3S_64
u64 cfar;
+   u64 ppr;
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 9741bf0..b0dcd18 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -464,6 +464,7 @@ struct kvm_vcpu_arch {
u32 ctrl;
ulong dabr;
ulong cfar;
+   ulong ppr;
 #endif
u32 vrsave; /* also USPRG0 */
u32 mmucr;
diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index e42127d..fab6bc1 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -534,6 +534,7 @@ struct kvm_get_htab_header {
 
 #define KVM_REG_PPC_VRSAVE (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb4)
 #define KVM_REG_PPC_LPCR   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb5)
+#define KVM_REG_PPC_PPR 

[PATCH 02/11] KVM: PPC: Book3S HV: Implement timebase offset for guests

2013-09-05 Thread Paul Mackerras
This allows guests to have a different timebase origin from the host.
This is needed for migration, where a guest can migrate from one host
to another and the two hosts might have a different timebase origin.
However, the timebase seen by the guest must not go backwards, and
should go forwards only by a small amount corresponding to the time
taken for the migration.

Therefore this provides a new per-vcpu value accessed via the one_reg
interface using the new KVM_REG_PPC_TB_OFFSET identifier.  This value
defaults to 0 and is not modified by KVM.  On entering the guest, this
value is added onto the timebase, and on exiting the guest, it is
subtracted from the timebase.

This is only supported for recent POWER hardware which has the TBU40
(timebase upper 40 bits) register.  Writing to the TBU40 register only
alters the upper 40 bits of the timebase, leaving the lower 24 bits
unchanged.  This provides a way to modify the timebase for guest
migration without disturbing the synchronization of the timebase
registers across CPU cores.  The kernel rounds up the value given
to a multiple of 2^24.
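
To make that rounding concrete, a small illustrative sketch of the
arithmetic (mirroring the ALIGN() call in the diff below); 2^24 is the
granularity because a TBU40 write cannot alter the low 24 bits of the
timebase:

    /* Round a requested timebase offset up to the 2^24 granularity
     * that a TBU40 write can express (sketch, not the kernel code). */
    static unsigned long round_tb_offset(unsigned long requested)
    {
            const unsigned long granule = 1UL << 24;

            return (requested + granule - 1) & ~(granule - 1);
    }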

Timebase values stored in KVM structures (struct kvm_vcpu, struct
kvmppc_vcore, etc.) are stored as host timebase values.  The timebase
values in the dispatch trace log need to be guest timebase values,
however, since that is read directly by the guest.  This moves the
setting of vcpu->arch.dec_expires on guest exit to a point after we
have restored the host timebase so that vcpu->arch.dec_expires is a
host timebase value.

Signed-off-by: Paul Mackerras 
---
This differs from the previous version of this patch in that the value
given to the set_one_reg interface is rounded up, as suggested by David
Gibson, rather than truncated.

 Documentation/virtual/kvm/api.txt   |  1 +
 arch/powerpc/include/asm/kvm_host.h |  1 +
 arch/powerpc/include/asm/reg.h  |  1 +
 arch/powerpc/include/uapi/asm/kvm.h |  3 ++
 arch/powerpc/kernel/asm-offsets.c   |  1 +
 arch/powerpc/kvm/book3s_hv.c| 10 ++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 50 +++--
 7 files changed, 57 insertions(+), 10 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 341058c..9486e5a 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1810,6 +1810,7 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_TLB3PS   | 32
   PPC   | KVM_REG_PPC_EPTCFG   | 32
   PPC   | KVM_REG_PPC_ICP_STATE | 64
+  PPC   | KVM_REG_PPC_TB_OFFSET| 64
   PPC   | KVM_REG_PPC_SPMC1| 32
   PPC   | KVM_REG_PPC_SPMC2| 32
   PPC   | KVM_REG_PPC_IAMR | 64
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 91b833d..9741bf0 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -294,6 +294,7 @@ struct kvmppc_vcore {
u64 stolen_tb;
u64 preempt_tb;
struct kvm_vcpu *runner;
+   u64 tb_offset;  /* guest timebase - host timebase */
 };
 
 #define VCORE_ENTRY_COUNT(vc)  ((vc)->entry_exit_count & 0xff)
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 5d7d9c2..342e4ea 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -243,6 +243,7 @@
 #define SPRN_TBRU  0x10D   /* Time Base Read Upper Register (user, R/O) */
 #define SPRN_TBWL  0x11C   /* Time Base Lower Register (super, R/W) */
 #define SPRN_TBWU  0x11D   /* Time Base Upper Register (super, R/W) */
+#define SPRN_TBU40 0x11E   /* Timebase upper 40 bits (hyper, R/W) */
 #define SPRN_SPURR 0x134   /* Scaled PURR */
 #define SPRN_HSPRG00x130   /* Hypervisor Scratch 0 */
 #define SPRN_HSPRG10x131   /* Hypervisor Scratch 1 */
diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index 7ed41c0..a8124fe 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -504,6 +504,9 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_TLB3PS (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9a)
 #define KVM_REG_PPC_EPTCFG (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9b)
 
+/* Timebase offset */
+#define KVM_REG_PPC_TB_OFFSET  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x9c)
+
 /* POWER8 registers */
 #define KVM_REG_PPC_SPMC1  (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9d)
 #define KVM_REG_PPC_SPMC2  (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9e)
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 6a7916d..ccb42cd 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -520,6 +520,7 @@ int main(void)
DEFINE(VCORE_NAP_COUNT, offsetof(struct kvmppc_vcore, nap_count));
DEFINE(VCORE_IN_GUEST, offsetof(struct kvmppc_vcore, in_guest));
DEFINE(VCORE_NAPPING_THREADS, offsetof(struct kvmppc_vcore, 
napping_threads));
+   DEFINE(VCORE_TB_OFFSET, offsetof(struct 

[PATCH 00/11] HV KVM improvements in preparation for POWER8 support

2013-09-05 Thread Paul Mackerras
This series of patches is based on Alex Graf's kvm-ppc-queue branch.
It fixes some bugs, makes some more registers accessible through the
one_reg interface, and implements some missing features such as
support for the compatibility modes in recent POWER cpus and support
for the guest having a different timebase origin from the host.
These patches are all useful on POWER7 and will be needed for good
POWER8 support.

Please apply.

 Documentation/virtual/kvm/api.txt |   5 +
 arch/powerpc/include/asm/exception-64s.h  |   8 +
 arch/powerpc/include/asm/kvm_book3s_asm.h |   1 +
 arch/powerpc/include/asm/kvm_host.h   |   6 +
 arch/powerpc/include/asm/reg.h|  14 +
 arch/powerpc/include/uapi/asm/kvm.h   |  10 +
 arch/powerpc/kernel/asm-offsets.c |   6 +
 arch/powerpc/kvm/book3s.c |  10 +
 arch/powerpc/kvm/book3s_hv.c  |  96 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 544 +-
 10 files changed, 469 insertions(+), 231 deletions(-)

Paul.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] kvm-unit-tests: VMX: Test suite for preemption timer

2013-09-05 Thread Arthur Chunqi Li
Test cases for the preemption timer in nested VMX. The following aspects
are tested:
1. Saving the preemption timer value on VMEXIT when the relevant bit is
set in the VM-exit controls.
2. A known KVM bug: the preemption timer value is not saved when we exit
L2->L0 for a reason not handled by L1 and then re-enter L0->L2, so the
timer never fires if its value is large enough.
3. Some other corner cases, e.g. preemption without save, and a preemption
timer value of 0.

Signed-off-by: Arthur Chunqi Li 
---
ChangeLog to v1:
1. Add test of EXI_SAVE_PREEMPT enable and PIN_PREEMPT disable
2. Add test of PIN_PREEMPT enable and EXI_SAVE_PREEMPT enable/disable
3. Add test of preemption value is 0

 x86/vmx.h   |3 +
 x86/vmx_tests.c |  175 +++
 2 files changed, 178 insertions(+)

diff --git a/x86/vmx.h b/x86/vmx.h
index 28595d8..ebc8cfd 100644
--- a/x86/vmx.h
+++ b/x86/vmx.h
@@ -210,6 +210,7 @@ enum Encoding {
GUEST_ACTV_STATE= 0x4826ul,
GUEST_SMBASE= 0x4828ul,
GUEST_SYSENTER_CS   = 0x482aul,
+   PREEMPT_TIMER_VALUE = 0x482eul,
 
/* 32-Bit Host State Fields */
HOST_SYSENTER_CS= 0x4c00ul,
@@ -331,6 +332,7 @@ enum Ctrl_exi {
EXI_LOAD_PERF   = 1UL << 12,
EXI_INTA= 1UL << 15,
EXI_LOAD_EFER   = 1UL << 21,
+   EXI_SAVE_PREEMPT= 1UL << 22,
 };
 
 enum Ctrl_ent {
@@ -342,6 +344,7 @@ enum Ctrl_pin {
PIN_EXTINT  = 1ul << 0,
PIN_NMI = 1ul << 3,
PIN_VIRT_NMI= 1ul << 5,
+   PIN_PREEMPT = 1ul << 6,
 };
 
 enum Ctrl0 {
diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
index c1b39f4..2e32031 100644
--- a/x86/vmx_tests.c
+++ b/x86/vmx_tests.c
@@ -1,4 +1,30 @@
 #include "vmx.h"
+#include "msr.h"
+#include "processor.h"
+
+volatile u32 stage;
+
+static inline void vmcall()
+{
+   asm volatile("vmcall");
+}
+ 
+static inline void set_stage(u32 s)
+{
+   barrier();
+   stage = s;
+   barrier();
+}
+
+static inline u32 get_stage()
+{
+   u32 s;
+
+   barrier();
+   s = stage;
+   barrier();
+   return s;
+}
 
 void basic_init()
 {
@@ -76,6 +102,153 @@ int vmenter_exit_handler()
return VMX_TEST_VMEXIT;
 }
 
+u32 preempt_scale;
+volatile unsigned long long tsc_val;
+volatile u32 preempt_val;
+
+void preemption_timer_init()
+{
+   u32 ctrl_exit;
+
+   // Enable EXI_SAVE_PREEMPT with PIN_PREEMPT disabled
+   ctrl_exit = (vmcs_read(EXI_CONTROLS) |
+   EXI_SAVE_PREEMPT) & ctrl_exit_rev.clr;
+   vmcs_write(EXI_CONTROLS, ctrl_exit);
+   preempt_val = 1000;
+   vmcs_write(PREEMPT_TIMER_VALUE, preempt_val);
+   set_stage(0);
+   preempt_scale = rdmsr(MSR_IA32_VMX_MISC) & 0x1F;
+}
+
+void preemption_timer_main()
+{
+   int i, j;
+
+   if (!(ctrl_pin_rev.clr & PIN_PREEMPT)) {
+   printf("\tPreemption timer is not supported\n");
+   return;
+   }
+   if (!(ctrl_exit_rev.clr & EXI_SAVE_PREEMPT))
+   printf("\tSave preemption value is not supported\n");
+   else {
+   // Test EXI_SAVE_PREEMPT enable and PIN_PREEMPT disable
+   set_stage(0);
+   // Consume enough time to let an L2->L0->L2 transition occur
+   for(i = 0; i < 10; i++)
+   for (j = 0; j < 1; j++);
+   vmcall();
+   // Test PIN_PREEMPT enable and EXI_SAVE_PREEMPT enable/disable
+   set_stage(1);
+   vmcall();
+   // Test both enable
+   if (get_stage() == 2)
+   vmcall();
+   }
+   // Test the bug of resetting the preempt value on L2->L0->L2
+   set_stage(3);
+   vmcall();
+   tsc_val = rdtsc();
+   while (1) {
+   if (((rdtsc() - tsc_val) >> preempt_scale)
+   > 10 * preempt_val) {
+   report("Preemption timer timeout", 0);
+   break;
+   }
+   if (get_stage() == 4)
+   break;
+   }
+   // Test preempt val is 0
+   set_stage(4);
+   report("Preemption timer, val=0", 0);
+}
+
+int preemption_timer_exit_handler()
+{
+   u64 guest_rip;
+   ulong reason;
+   u32 insn_len;
+   u32 ctrl_exit;
+   u32 ctrl_pin;
+
+   guest_rip = vmcs_read(GUEST_RIP);
+   reason = vmcs_read(EXI_REASON) & 0xff;
+   insn_len = vmcs_read(EXI_INST_LEN);
+   switch (reason) {
+   case VMX_PREEMPT:
+   switch (get_stage()) {
+   case 3:
+   if (((rdtsc() - tsc_val) >> preempt_scale) < 
preempt_val)
+   report("Preemption timer timeout", 0);
+   else
+   report("Preemption timer timeout", 1);
+   set_stage(get_stage() + 1);
+   b

[PATCH v4] KVM: nVMX: Fully support of nested VMX preemption timer

2013-09-05 Thread Arthur Chunqi Li
This patch contains the following two changes:
1. Fix the bug in nested preemption timer support. If we exit L2->L0
for a reason not emulated by L1, the preemption timer value should
be saved on such exits.
2. Add support for the "Save VMX-preemption timer value" VM-Exit control
in nVMX.

With this patch, nested VMX preemption timer features are fully
supported.

Signed-off-by: Arthur Chunqi Li 
---
ChangeLog to v3:
Move nested_adjust_preemption_timer to the latest place just before vmenter.
Some minor changes.

 arch/x86/include/uapi/asm/msr-index.h |1 +
 arch/x86/kvm/vmx.c|   49 +++--
 2 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/uapi/asm/msr-index.h 
b/arch/x86/include/uapi/asm/msr-index.h
index bb04650..b93e09a 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -536,6 +536,7 @@
 
 /* MSR_IA32_VMX_MISC bits */
 #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL << 29)
+#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE   0x1F
 /* AMD-V MSRs */
 
 #define MSR_VM_CR   0xc0010114
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1f1da43..f364d16 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -374,6 +374,8 @@ struct nested_vmx {
 */
struct page *apic_access_page;
u64 msr_ia32_feature_control;
+   /* Set if vmexit is L2->L1 */
+   bool nested_vmx_exit;
 };
 
 #define POSTED_INTR_ON  0
@@ -2204,7 +2206,17 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 #ifdef CONFIG_X86_64
VM_EXIT_HOST_ADDR_SPACE_SIZE |
 #endif
-   VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT;
+   VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
+   VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
+   if (!(nested_vmx_pinbased_ctls_high &
+   PIN_BASED_VMX_PREEMPTION_TIMER) ||
+   !(nested_vmx_exit_ctls_high &
+   VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) {
+   nested_vmx_exit_ctls_high &=
+   (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
+   nested_vmx_pinbased_ctls_high &=
+   (~PIN_BASED_VMX_PREEMPTION_TIMER);
+   }
nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
  VM_EXIT_LOAD_IA32_EFER);
 
@@ -6707,6 +6719,24 @@ static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u64 
*info1, u64 *info2)
*info2 = vmcs_read32(VM_EXIT_INTR_INFO);
 }
 
+static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu)
+{
+   u64 delta_tsc_l1;
+   u32 preempt_val_l1, preempt_val_l2, preempt_scale;
+
+   preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) &
+   MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE;
+   preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE);
+   delta_tsc_l1 = kvm_x86_ops->read_l1_tsc(vcpu,
+   native_read_tsc()) - vcpu->arch.last_guest_tsc;
+   preempt_val_l1 = delta_tsc_l1 >> preempt_scale;
+   if (preempt_val_l2 <= preempt_val_l1)
+   preempt_val_l2 = 0;
+   else
+   preempt_val_l2 -= preempt_val_l1;
+   vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, preempt_val_l2);
+}
+
 /*
  * The guest has exited.  See if we can fix it or if we need userspace
  * assistance.
@@ -6736,9 +6766,11 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
vmx->nested.nested_run_pending = 0;
 
if (is_guest_mode(vcpu) && nested_vmx_exit_handled(vcpu)) {
+   vmx->nested.nested_vmx_exit = true;
nested_vmx_vmexit(vcpu);
return 1;
}
+   vmx->nested.nested_vmx_exit = false;
 
if (exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) {
vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
@@ -7132,6 +7164,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
debugctlmsr = get_debugctlmsr();
 
vmx->__launched = vmx->loaded_vmcs->launched;
+   if (is_guest_mode(vcpu) && !(vmx->nested.nested_vmx_exit))
+   nested_adjust_preemption_timer(vcpu);
asm(
/* Store host registers */
"push %%" _ASM_DX "; push %%" _ASM_BP ";"
@@ -7518,6 +7552,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
u32 exec_control;
+   u32 exit_control;
 
vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
@@ -7691,7 +7726,10 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12)
 * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER
 * bits are further modified by vmx_set_efer() below.
 */
-   vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
+   

Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

2013-09-05 Thread Alexey Kardashevskiy
On 09/06/2013 04:10 AM, Gleb Natapov wrote:
> On Wed, Sep 04, 2013 at 02:01:28AM +1000, Alexey Kardashevskiy wrote:
>> On 09/03/2013 08:53 PM, Gleb Natapov wrote:
>>> On Mon, Sep 02, 2013 at 01:14:29PM +1000, Alexey Kardashevskiy wrote:
 On 09/01/2013 10:06 PM, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 06:50:41PM +1000, Alexey Kardashevskiy wrote:
>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>> and H_STUFF_TCE requests targeted an IOMMU TCE table without passing
>> them to user space which saves time on switching to user space and back.
>>
>> Both real and virtual modes are supported. The kernel tries to
>> handle a TCE request in the real mode, if fails it passes the request
>> to the virtual mode to complete the operation. If it a virtual mode
>> handler fails, the request is passed to user space.
>>
>> The first user of this is VFIO on POWER. Trampolines to the VFIO external
>> user API functions are required for this patch.
>>
>> This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
>> number (LIOBN) with an VFIO IOMMU group fd and enable in-kernel handling
>> of map/unmap requests. The device supports a single attribute which is
>> a struct with LIOBN and IOMMU fd. When the attribute is set, the device
>> establishes the connection between KVM and VFIO.
>>
>> Tests show that this patch increases transmission speed from 220MB/s
>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>
>> Signed-off-by: Paul Mackerras 
>> Signed-off-by: Alexey Kardashevskiy 
>>
>> ---
>>
>> Changes:
>> v9:
>> * KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU"
>> KVM device
>> * release_spapr_tce_table() is not shared between different TCE types
>> * reduced the patch size by moving VFIO external API
>> trampolines to separate patche
>> * moved documentation from Documentation/virtual/kvm/api.txt to
>> Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>
>> v8:
>> * fixed warnings from check_patch.pl
>>
>> 2013/07/11:
>> * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
>> for KVM_BOOK3S_64
>> * kvmppc_gpa_to_hva_and_get also returns host phys address. Not much 
>> sense
>> for this here but the next patch for hugepages support will use it more.
>>
>> 2013/07/06:
>> * added realmode arch_spin_lock to protect TCE table from races
>> in real and virtual modes
>> * POWERPC IOMMU API is changed to support real mode
>> * iommu_take_ownership and iommu_release_ownership are protected by
>> iommu_table's locks
>> * VFIO external user API use rewritten
>> * multiple small fixes
>>
>> 2013/06/27:
>> * tce_list page is referenced now in order to protect it from accident
>> invalidation during H_PUT_TCE_INDIRECT execution
>> * added use of the external user VFIO API
>>
>> 2013/06/05:
>> * changed capability number
>> * changed ioctl number
>> * update the doc article number
>>
>> 2013/05/20:
>> * removed get_user() from real mode handlers
>> * kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
>> translated TCEs, tries realmode_get_page() on those and if it fails, it
>> passes control over the virtual mode handler which tries to finish
>> the request handling
>> * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
>> on a page
>> * The only reason to pass the request to user mode now is when the user 
>> mode
>> did not register TCE table in the kernel, in all other cases the virtual 
>> mode
>> handler is expected to do the job
>> ---
>>  .../virtual/kvm/devices/spapr_tce_iommu.txt|  37 +++
>>  arch/powerpc/include/asm/kvm_host.h|   4 +
>>  arch/powerpc/kvm/book3s_64_vio.c   | 310 
>> -
>>  arch/powerpc/kvm/book3s_64_vio_hv.c| 122 
>>  arch/powerpc/kvm/powerpc.c |   1 +
>>  include/linux/kvm_host.h   |   1 +
>>  virt/kvm/kvm_main.c|   5 +
>>  7 files changed, 477 insertions(+), 3 deletions(-)
>>  create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>
>> diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt 
>> b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>> new file mode 100644
>> index 000..4bc8fc3
>> --- /dev/null
>> +++ b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>> @@ -0,0 +1,37 @@
>> +SPAPR TCE IOMMU device
>> +
>> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
>> +Architectures: powerpc
>> +
>> +Device type supported: KVM_DEV_TYPE_SPAPR_TCE_IOMMU
>> +
>>

Re: [PATCH 2/3] kvm/ppc: IRQ disabling cleanup

2013-09-05 Thread Alexander Graf

On 06.09.2013, at 00:09, Scott Wood wrote:

> On Thu, 2013-07-11 at 01:09 +0200, Alexander Graf wrote:
>> On 11.07.2013, at 01:08, Scott Wood wrote:
>> 
>>> On 07/10/2013 06:04:53 PM, Alexander Graf wrote:
 On 11.07.2013, at 01:01, Benjamin Herrenschmidt wrote:
> On Thu, 2013-07-11 at 00:57 +0200, Alexander Graf wrote:
>>> #ifdef CONFIG_PPC64
>>> + /*
>>> +  * To avoid races, the caller must have gone directly from having
>>> +  * interrupts fully-enabled to hard-disabled.
>>> +  */
>>> + WARN_ON(local_paca->irq_happened != PACA_IRQ_HARD_DIS);
>> 
>> WARN_ON(lazy_irq_pending()); ?
> 
> Different semantics. What you propose will not catch irq_happened == 0 :-)
 Right, but we only ever reach here after hard_irq_disable() I think.
>>> 
>>> And the WARN_ON helps us ensure that it stays that way.
>> 
>> Heh - ok :). Works for me.
> 
> What's the status on this patch?

IIUC it was ok. Ben, could you please verify?


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vfio: fix documentation

2013-09-05 Thread Alex Williamson
On Thu, 2013-09-05 at 15:22 -0700, Zi Shen Lim wrote:
> Signed-off-by: Zi Shen Lim 
> ---

Applied.  Thanks!

Alex

>  Documentation/vfio.txt | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> index d7993dc..b9ca023 100644
> --- a/Documentation/vfio.txt
> +++ b/Documentation/vfio.txt
> @@ -167,8 +167,8 @@ group and can access them as follows:
>   int container, group, device, i;
>   struct vfio_group_status group_status =
>   { .argsz = sizeof(group_status) };
> - struct vfio_iommu_x86_info iommu_info = { .argsz = sizeof(iommu_info) };
> - struct vfio_iommu_x86_dma_map dma_map = { .argsz = sizeof(dma_map) };
> + struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) 
> };
> + struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) };
>   struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
>  
>   /* Create a new container */
> @@ -193,7 +193,7 @@ group and can access them as follows:
>   ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
>  
>   /* Enable the IOMMU model we want */
> - ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU)
> + ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
>  
>   /* Get addition IOMMU info */
>   ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info);
> @@ -229,7 +229,7 @@ group and can access them as follows:
>  
>   irq.index = i;
>  
> - ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &reg);
> + ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq);
>  
>   /* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQS */
>   }



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] vfio: fix documentation

2013-09-05 Thread Zi Shen Lim
Signed-off-by: Zi Shen Lim 
---
 Documentation/vfio.txt | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index d7993dc..b9ca023 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -167,8 +167,8 @@ group and can access them as follows:
int container, group, device, i;
struct vfio_group_status group_status =
{ .argsz = sizeof(group_status) };
-   struct vfio_iommu_x86_info iommu_info = { .argsz = sizeof(iommu_info) };
-   struct vfio_iommu_x86_dma_map dma_map = { .argsz = sizeof(dma_map) };
+   struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) 
};
+   struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) };
struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
 
/* Create a new container */
@@ -193,7 +193,7 @@ group and can access them as follows:
ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
 
/* Enable the IOMMU model we want */
-   ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU)
+   ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
 
/* Get addition IOMMU info */
ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info);
@@ -229,7 +229,7 @@ group and can access them as follows:
 
irq.index = i;
 
-   ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &reg);
+   ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq);
 
/* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQS */
}
-- 
1.8.1.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] kvm/ppc: IRQ disabling cleanup

2013-09-05 Thread Scott Wood
On Thu, 2013-07-11 at 01:09 +0200, Alexander Graf wrote:
> On 11.07.2013, at 01:08, Scott Wood wrote:
> 
> > On 07/10/2013 06:04:53 PM, Alexander Graf wrote:
> >> On 11.07.2013, at 01:01, Benjamin Herrenschmidt wrote:
> >> > On Thu, 2013-07-11 at 00:57 +0200, Alexander Graf wrote:
> >> >>> #ifdef CONFIG_PPC64
> >> >>> + /*
> >> >>> +  * To avoid races, the caller must have gone directly from having
> >> >>> +  * interrupts fully-enabled to hard-disabled.
> >> >>> +  */
> >> >>> + WARN_ON(local_paca->irq_happened != PACA_IRQ_HARD_DIS);
> >> >>
> >> >> WARN_ON(lazy_irq_pending()); ?
> >> >
> >> > Different semantics. What you propose will not catch irq_happened == 0 
> >> > :-)
> >> Right, but we only ever reach here after hard_irq_disable() I think.
> > 
> > And the WARN_ON helps us ensure that it stays that way.
> 
> Heh - ok :). Works for me.

What's the status on this patch?

-Scott



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OpenBSD 5.3 guest on KVM

2013-09-05 Thread Daniel Bareiro
Hi, Gleb.

On Thursday 05 September 2013 21:00:50 +0300,
Gleb Natapov wrote:

> > > > Someone had this problem and could solve it somehow? There any
> > > > debug information I can provide to help solve this?

> > > For simple troubleshooting try "info status" from the QEMU monitor.

> > ss01:~# telnet localhost 4046
> > Trying ::1...
> > Connected to localhost.
> > Escape character is '^]'.
> > QEMU 1.1.2 monitor - type 'help' for more information
> > (qemu)
> > (qemu)
> > (qemu) info status
> > VM status: running
> > (qemu)
> > (qemu)
> > (qemu) system_powerdown
> > (qemu)
> > (qemu)
> > (qemu) info status
> > VM status: running
> > (qemu)
> > 
> > Then the VM freezes.

> And now, after it freezes, do one more "info status" also "info cpus"
> and "info registers". Also what do you mean by "freezes"? Do you see
> that VM started to shutdown after you issued system_powerdown?

ss01:~# telnet localhost 4046
Trying ::1...
Connected to localhost.
Escape character is '^]'.
QEMU 1.1.2 monitor - type 'help' for more information
(qemu)
(qemu) info status
VM status: running
(qemu)
(qemu) system_powerdown
(qemu)
(qemu) info status
VM status: running
(qemu)
(qemu) info status
VM status: running
(qemu)
(qemu) info cpus
* CPU #0: pc=0x80455278 thread_id=22792
(qemu) info cpus
* CPU #0: pc=0x80455278 thread_id=22792
(qemu) info cpus
* CPU #0: pc=0x80455278 thread_id=22792
(qemu)
(qemu) info registers
RAX= RBX= RCX=0204 
RDX=b100
RSI=0022 RDI=80013900 RBP=8e7fcc80 
RSP=8e7fcb78
R8 =0004 R9 =0001 R10=0010 
R11=80455b90
R12=0006 R13=8e6f2010 R14=80119900 
R15=8e6f2000
RIP=80448c60 RFL=0046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0023  bfff 00a0f300 DPL=3 DS16 [-WA]
CS =0008   00a09b00 DPL=0 CS64 [-RA]
SS =0010   00a09300 DPL=0 DS16 [-WA]
DS =0023  bfff 00a0f300 DPL=3 DS16 [-WA]
FS =0023  bfff 00a0f300 DPL=3 DS16 [-WA]
GS =0023 80a27b60 bfff 00a0f300 DPL=3 DS16 [-WA]
LDT=   
TR =0030 80011000 0067 8b00 DPL=0 TSS64-busy
GDT= 80011068 003f
IDT= 8001 0fff
CR0=8001003b CR2=1567dc558a60 CR3=00e6 CR4=07b0
DR0= DR1= DR2= 
DR3=
DR6=0ff0 DR7=0400
EFER=0d01
FCW=037f FSW= [ST=0] FTW=00 MXCSR=1fa0
FPR0=  FPR1= 
FPR2=  FPR3= 
FPR4=  FPR5= 
FPR6=  FPR7= 
XMM00=3fa24000 XMM01=3fe2b400
XMM02= XMM03=
XMM04= XMM05=
XMM06= XMM07=
XMM08= XMM09=
XMM10= XMM11=
XMM12= XMM13=
XMM14= XMM15=
(qemu)
(qemu) info registers
RAX= RBX= RCX=0204 
RDX=b100
RSI=0022 RDI=80013900 RBP=8e7fcc80 
RSP=8e7fcb78
R8 =0004 R9 =0001 R10=0010 
R11=80455b90
R12=0006 R13=8e6f2010 R14=80119900 
R15=8e6f2000
RIP=80448c60 RFL=0046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0023  bfff 00a0f300 DPL=3 DS16 [-WA]
CS =0008   00a09b00 DPL=0 CS64 [-RA]
SS =0010   00a09300 DPL=0 DS16 [-WA]
DS =0023  bfff 00a0f300 DPL=3 DS16 [-WA]
FS =0023  bfff 00a0f300 DPL=3 DS16 [-WA]
GS =0023 80a27b60 bfff 00a0f300 DPL=3 DS16 [-WA]
LDT=   
TR =0030 80011000 0067 8b00 DPL=0 TSS64-busy
GDT= 80011068 003f
IDT= 8001 0fff
CR0=8001003b CR2=1567dc558a60 CR3=00e6 CR4=07b0
DR0= DR1= DR2= 
DR3=
DR6=0ff0 DR7=0400
EFER=0d01
FCW=037f FSW= [ST=0] FTW=00 MXCSR=1fa0
FPR0=  FPR1= 
FPR2=  FPR3= 
FPR4=  FPR5= 
FPR6=  FPR7= 
XMM00=3fa24000 XMM0

Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

2013-09-05 Thread Gleb Natapov
On Wed, Sep 04, 2013 at 02:01:28AM +1000, Alexey Kardashevskiy wrote:
> On 09/03/2013 08:53 PM, Gleb Natapov wrote:
> > On Mon, Sep 02, 2013 at 01:14:29PM +1000, Alexey Kardashevskiy wrote:
> >> On 09/01/2013 10:06 PM, Gleb Natapov wrote:
> >>> On Wed, Aug 28, 2013 at 06:50:41PM +1000, Alexey Kardashevskiy wrote:
>  This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>  and H_STUFF_TCE requests targeted an IOMMU TCE table without passing
>  them to user space which saves time on switching to user space and back.
> 
>  Both real and virtual modes are supported. The kernel tries to
>  handle a TCE request in the real mode, if fails it passes the request
>  to the virtual mode to complete the operation. If it a virtual mode
>  handler fails, the request is passed to user space.
> 
>  The first user of this is VFIO on POWER. Trampolines to the VFIO external
>  user API functions are required for this patch.
> 
>  This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
>  number (LIOBN) with an VFIO IOMMU group fd and enable in-kernel handling
>  of map/unmap requests. The device supports a single attribute which is
>  a struct with LIOBN and IOMMU fd. When the attribute is set, the device
>  establishes the connection between KVM and VFIO.
> 
>  Tests show that this patch increases transmission speed from 220MB/s
>  to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> 
>  Signed-off-by: Paul Mackerras 
>  Signed-off-by: Alexey Kardashevskiy 
> 
>  ---
> 
>  Changes:
>  v9:
>  * KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU"
>  KVM device
>  * release_spapr_tce_table() is not shared between different TCE types
>  * reduced the patch size by moving VFIO external API
>  trampolines to separate patche
>  * moved documentation from Documentation/virtual/kvm/api.txt to
>  Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> 
>  v8:
>  * fixed warnings from check_patch.pl
> 
>  2013/07/11:
>  * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
>  for KVM_BOOK3S_64
>  * kvmppc_gpa_to_hva_and_get also returns host phys address. Not much 
>  sense
>  for this here but the next patch for hugepages support will use it more.
> 
>  2013/07/06:
>  * added realmode arch_spin_lock to protect TCE table from races
>  in real and virtual modes
>  * POWERPC IOMMU API is changed to support real mode
>  * iommu_take_ownership and iommu_release_ownership are protected by
>  iommu_table's locks
>  * VFIO external user API use rewritten
>  * multiple small fixes
> 
>  2013/06/27:
>  * tce_list page is referenced now in order to protect it from accident
>  invalidation during H_PUT_TCE_INDIRECT execution
>  * added use of the external user VFIO API
> 
>  2013/06/05:
>  * changed capability number
>  * changed ioctl number
>  * update the doc article number
> 
>  2013/05/20:
>  * removed get_user() from real mode handlers
>  * kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
>  translated TCEs, tries realmode_get_page() on those and if it fails, it
>  passes control over the virtual mode handler which tries to finish
>  the request handling
>  * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
>  on a page
>  * The only reason to pass the request to user mode now is when the user 
>  mode
>  did not register TCE table in the kernel, in all other cases the virtual 
>  mode
>  handler is expected to do the job
>  ---
>   .../virtual/kvm/devices/spapr_tce_iommu.txt|  37 +++
>   arch/powerpc/include/asm/kvm_host.h|   4 +
>   arch/powerpc/kvm/book3s_64_vio.c   | 310 
>  -
>   arch/powerpc/kvm/book3s_64_vio_hv.c| 122 
>   arch/powerpc/kvm/powerpc.c |   1 +
>   include/linux/kvm_host.h   |   1 +
>   virt/kvm/kvm_main.c|   5 +
>   7 files changed, 477 insertions(+), 3 deletions(-)
>   create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> 
>  diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt 
>  b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>  new file mode 100644
>  index 000..4bc8fc3
>  --- /dev/null
>  +++ b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>  @@ -0,0 +1,37 @@
>  +SPAPR TCE IOMMU device
>  +
>  +Capability: KVM_CAP_SPAPR_TCE_IOMMU
>  +Architectures: powerpc
>  +
>  +Device type supported: KVM_DEV_TYPE_SPAPR_TCE_IOMMU
>  +
>  +Groups:
>  +  KVM_DEV_SPAPR_TCE_IOMMU_ATT

Re: OpenBSD 5.3 guest on KVM

2013-09-05 Thread Gleb Natapov
On Thu, Sep 05, 2013 at 12:44:58PM -0300, Daniel Bareiro wrote:
> Hi, Pablo.
> 
> On Thursday 05 September 2013 10:30:12 +0200,
> Paolo Bonzini wrote:
> 
> > > Someone had this problem and could solve it somehow? There any
> > > debug information I can provide to help solve this?
> 
> > For simple troubleshooting try "info status" from the QEMU monitor.
> 
> ss01:~# telnet localhost 4046
> Trying ::1...
> Connected to localhost.
> Escape character is '^]'.
> QEMU 1.1.2 monitor - type 'help' for more information
> (qemu)
> (qemu)
> (qemu) info status
> VM status: running
> (qemu)
> (qemu)
> (qemu) system_powerdown
> (qemu)
> (qemu)
> (qemu) info status
> VM status: running
> (qemu)
> 
> Then the VM freezes.
> 
And now, after it freezes, do one more "info status" also "info cpus"
and "info registers". Also what do you mean by "freezes"? Do you see
that VM started to shutdown after you issued system_powerdown?

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] kvm tools: remove periodic tick in favour of a polling thread

2013-09-05 Thread Jonathan Austin

Hi Sasha,

On 04/09/13 19:01, Sasha Levin wrote:

On 09/04/2013 01:48 PM, Pekka Enberg wrote:

On Wed, Sep 4, 2013 at 8:40 PM, Jonathan Austin  wrote:

'top' works on ARM with the virtio console. I've just done some new testing
with the serial console emulation, and I see the same thing you're reporting.
Previously, with the 8250 emulation, I'd booted to a prompt but didn't actually
test top...

I'm looking in to fixing this now... Looks like I need to find the right place
from which to call serial8250_flush_tx now that it isn't getting called every 
tick.

I've done the following and it works fixes 'top' with serial8250:
---8<--
diff --git a/tools/kvm/hw/serial.c b/tools/kvm/hw/serial.c
index 931067f..a71e68d 100644
--- a/tools/kvm/hw/serial.c
+++ b/tools/kvm/hw/serial.c
@@ -260,6 +260,7 @@ static bool serial8250_out(struct ioport *ioport, struct 
kvm *kvm, u16 port,
  dev->lsr &= ~UART_LSR_TEMT;
  if (dev->txcnt == FIFO_LEN / 2)
  dev->lsr &= ~UART_LSR_THRE;
+   serial8250_flush_tx(kvm, dev);
  } else {
  /* Should never happpen */
  dev->lsr &= ~(UART_LSR_TEMT | UART_LSR_THRE);

->8---

I guess it's a shame that we'll be printing each character (admittedly the
rate will always be relatively low...) rather than flushing the buffer in a
batch. Without a timer, though, I'm not sure I see a better option - every N
chars doesn't seem like a good one to me.

If you think that looks about right then I'll fold that into the patch series,
probably also removing the call to serial8250_flush_tx() in
serial8250__receive.


Yeah, looks good to me and makes top work again.


We might want to make sure performance isn't hit with stuff that's intensive on 
the serial console.


Indeed, the intention here is very much to reduce overhead...

Do you have an idea already of what you'd like to test?

I've written a little testcase that just prints an incrementing counter 
to the console in a tight loop for 5 seconds (I've tested both buffered 
and unbuffered output). The measure of 'performance' is how high we 
count in those 5 seconds.
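
For reference, a minimal sketch of what such a testcase could look like (my
reconstruction for illustration only, not the actual program that produced the
numbers below):

#include <stdio.h>
#include <time.h>

int main(void)
{
	unsigned long count = 0;
	time_t start = time(NULL);

	/* For the unbuffered variant: setvbuf(stdout, NULL, _IONBF, 0); */
	while (time(NULL) - start < 5)
		printf("%lu\n", count++);

	fprintf(stderr, "counted to %lu in 5 seconds\n", count);
	return 0;
}

The buffered/unbuffered split then just comes from toggling the setvbuf() line.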


These are averages (mean) of 5 runs on x86.

                 | unbuffered | buffered |
-----------------+------------+----------+
 native          |   36880    |  40354   |
 lkvm - original |   24302    |  25335   |
 lkvm - no-tick  |   22895    |  28202   |

I ran these all on the framebuffer console. I found that the numbers on 
gnome-terminal seemed to be affected by the activity level of other 
gui-ish things, and the numbers were different in gnome-terminal and 
xterm. If you want to see more testing then a suggestion of a way we can 
take host terminal performance out of the equation would make me more 
comfortable with the numbers. I do think that as comparisons to each 
other they're reasonable as they are, though.


So at least in this reasonably artificial case it looks like a minor win 
for buffered output and a minor loss in the unbuffered case.


Happy to try out other things if you're interested.

Jonny


Thanks,
Sasha







Re: [PATCH 0/2] kvm: fix a bug and remove a redundancy in async_pf

2013-09-05 Thread Paolo Bonzini
On 04/09/2013 22:32, Radim Krčmář wrote:
> I did not reproduce the bug fixed in [1/2], but there are not that many
> reasons why we could not unload a module, so the spot is quite obvious.
> 
> 
> Radim Krčmář (2):
>   kvm: free resources after canceling async_pf
>   kvm: remove .done from struct kvm_async_pf
> 
>  include/linux/kvm_host.h | 1 -
>  virt/kvm/async_pf.c  | 8 
>  2 files changed, 4 insertions(+), 5 deletions(-)
> 

Reviewed-by: Paolo Bonzini 


Re: OpenBSD 5.3 guest on KVM

2013-09-05 Thread Daniel Bareiro
Hi, Paolo.

On Thursday 05 September 2013 10:30:12 +0200,
Paolo Bonzini wrote:

> > Someone had this problem and could solve it somehow? There any
> > debug information I can provide to help solve this?

> For simple troubleshooting try "info status" from the QEMU monitor.

ss01:~# telnet localhost 4046
Trying ::1...
Connected to localhost.
Escape character is '^]'.
QEMU 1.1.2 monitor - type 'help' for more information
(qemu)
(qemu)
(qemu) info status
VM status: running
(qemu)
(qemu)
(qemu) system_powerdown
(qemu)
(qemu)
(qemu) info status
VM status: running
(qemu)

Then the VM freezes.

> You can also try this:
> 
> http://www.linux-kvm.org/page/Tracing
> 
> You will get a large log, you can send it to me offlist or via
> something like Google Drive.

The log has 612 MB (57 MB compressed with bzip2). I guess it must be
because there are other production VMs running on that host. Is there a
way of limiting the log? I got this result by running the commands I
found at the link you sent me.

# trace-cmd record -b 2 -e kvm
# trace-cmd report


I forgot to mention that I'm using KVM versions provided by the
repositories of Debian Wheezy:

Linux:3.2.46-1+deb7u1
qemu-kvm: 1.1.2+dfsg-6


Thanks for your reply.

Regards,
Daniel
-- 
Fingerprint: BFB3 08D6 B4D1 31B2 72B9  29CE 6696 BF1B 14E6 1D37
Powered by Debian GNU/Linux - Linux user #188.598




Re: [PATCH v3] KVM: nVMX: Fully support of nested VMX preemption timer

2013-09-05 Thread Arthur Chunqi Li
On Thu, Sep 5, 2013 at 7:05 PM, Zhang, Yang Z  wrote:
> Arthur Chunqi Li wrote on 2013-09-05:
>> > Arthur Chunqi Li wrote on 2013-09-05:
>> >> On Thu, Sep 5, 2013 at 3:45 PM, Zhang, Yang Z
>> >> 
>> >> wrote:
>> >> > Arthur Chunqi Li wrote on 2013-09-04:
>> >> >> This patch contains the following two changes:
>> >> >> 1. Fix the bug in nested preemption timer support. If vmexit
>> >> >> L2->L0 with some reasons not emulated by L1, preemption timer
>> >> >> value should be save in such exits.
>> >> >> 2. Add support of "Save VMX-preemption timer value" VM-Exit
>> >> >> controls to nVMX.
>> >> >>
>> >> >> With this patch, nested VMX preemption timer features are fully
>> supported.
>> >> >>
>> >> >> Signed-off-by: Arthur Chunqi Li 
>> >> >> ---
>> >> >> This series depends on queue.
>> >> >>
>> >> >>  arch/x86/include/uapi/asm/msr-index.h |1 +
>> >> >>  arch/x86/kvm/vmx.c|   51
>> >> >> ++---
>> >> >>  2 files changed, 48 insertions(+), 4 deletions(-)
>> >> >>
>> >> >> diff --git a/arch/x86/include/uapi/asm/msr-index.h
>> >> >> b/arch/x86/include/uapi/asm/msr-index.h
>> >> >> index bb04650..b93e09a 100644
>> >> >> --- a/arch/x86/include/uapi/asm/msr-index.h
>> >> >> +++ b/arch/x86/include/uapi/asm/msr-index.h
>> >> >> @@ -536,6 +536,7 @@
>> >> >>
>> >> >>  /* MSR_IA32_VMX_MISC bits */
>> >> >>  #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL
>> <<
>> >> 29)
>> >> >> +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE   0x1F
>> >> >>  /* AMD-V MSRs */
>> >> >>
>> >> >>  #define MSR_VM_CR   0xc0010114
>> >> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index
>> >> >> 1f1da43..870caa8
>> >> >> 100644
>> >> >> --- a/arch/x86/kvm/vmx.c
>> >> >> +++ b/arch/x86/kvm/vmx.c
>> >> >> @@ -2204,7 +2204,14 @@ static __init void
>> >> >> nested_vmx_setup_ctls_msrs(void)  #ifdef CONFIG_X86_64
>> >> >>   VM_EXIT_HOST_ADDR_SPACE_SIZE |  #endif
>> >> >> - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT;
>> >> >> + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT
>> |
>> >> >> + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
>> >> >> + if (!(nested_vmx_pinbased_ctls_high &
>> >> >> PIN_BASED_VMX_PREEMPTION_TIMER))
>> >> >> + nested_vmx_exit_ctls_high &=
>> >> >> +
>> (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
>> >> >> + if (!(nested_vmx_exit_ctls_high &
>> >> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER))
>> >> >> + nested_vmx_pinbased_ctls_high &=
>> >> >> + (~PIN_BASED_VMX_PREEMPTION_TIMER);
>> >> > The following logic is more clearly:
>> >> > if(nested_vmx_pinbased_ctls_high &
>> >> PIN_BASED_VMX_PREEMPTION_TIMER)
>> >> > nested_vmx_exit_ctls_high |=
>> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER
>> >> Here I have such consideration: this logic is wrong if CPU support
>> >> PIN_BASED_VMX_PREEMPTION_TIMER but doesn't support
>> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER, though I don't know if this
>> does
>> >> occurs. So the codes above reads the MSR and reserves the features it
>> >> supports, and here I just check if these two features are supported
>> >> simultaneously.
>> >>
>> > No. Only VM_EXIT_SAVE_VMX_PREEMPTION_TIMER depends on
>> > PIN_BASED_VMX_PREEMPTION_TIMER.
>> PIN_BASED_VMX_PREEMPTION_TIMER is an
>> > independent feature
>> >
>> >> You remind that this piece of codes can write like this:
>> >> if (!(nested_vmx_pin_based_ctls_high &
>> >> PIN_BASED_VMX_PREEMPTION_TIMER) ||
>> >> !(nested_vmx_exit_ctls_high &
>> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) {
>> >> nested_vmx_exit_ctls_high
>> >> &=(~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
>> >> nested_vmx_pinbased_ctls_high &=
>> >> (~PIN_BASED_VMX_PREEMPTION_TIMER);
>> >> }
>> >>
>> >> This may reflect the logic I describe above that these two flags
>> >> should support simultaneously, and brings less confusion.
>> >> >
>> >> > BTW: I don't see nested_vmx_setup_ctls_msrs() considers the
>> >> > hardware's
>> >> capability when expose those vmx features(not just preemption timer) to 
>> >> L1.
>> >> The codes just above here, when setting pinbased control for nested
>> >> vmx, it firstly "rdmsr MSR_IA32_VMX_PINBASED_CTLS", then use this to
>> >> mask the features hardware not support. So does other control fields.
>> >> >
>> > Yes, I saw it.
>> >
>> >> >>   nested_vmx_exit_ctls_high |=
>> >> >> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>> >> >>
>> VM_EXIT_LOAD_IA32_EFER);
>> >> >>
>> >> >> @@ -6707,6 +6714,23 @@ static void vmx_get_exit_info(struct
>> >> >> kvm_vcpu *vcpu, u64 *info1, u64 *info2)
>> >> >>   *info2 = vmcs_read32(VM_EXIT_INTR_INFO);  }
>> >> >>
>> >> >> +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu) {
>> >> >> + u64 delta_tsc_l1;
>> >> >> + u32 preempt_val_l1, preempt_val_l2, preempt_scale;
>> >> >> +
>> >> >> + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) &
>> >> >> +
>> >> MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE;
>> >> >> + p

Re: [Xen-devel] Is fallback vhost_net to qemu for live migrate available?

2013-09-05 Thread Stefano Stabellini
On Tue, 3 Sep 2013, Michael S. Tsirkin wrote:
> On Tue, Sep 03, 2013 at 09:40:48AM +0100, Wei Liu wrote:
> > On Tue, Sep 03, 2013 at 09:28:11AM +0800, Qin Chuanyu wrote:
> > > On 2013/9/2 15:57, Wei Liu wrote:
> > > >On Sat, Aug 31, 2013 at 12:45:11PM +0800, Qin Chuanyu wrote:
> > > >>On 2013/8/30 0:08, Anthony Liguori wrote:
> > > >>>Hi Qin,
> > > >>
> > > By changing the memory copy and notify mechanism, currently virtio-net
> > > with vhost_net could run on Xen with good performance.
> > > >>>
> > > >>>I think the key in doing this would be to implement a property
> > > >>>ioeventfd and irqfd interface in the driver domain kernel.  Just
> > > >>>hacking vhost_net with Xen specific knowledge would be pretty nasty
> > > >>>IMHO.
> > > >>>
> > > >>Yes, I add a kernel module which persist virtio-net pio_addr and
> > > >>msix address as what kvm module did. Guest wake up vhost thread by
> > > >>adding a hook func in evtchn_interrupt.
> > > >>
> > > >>>Did you modify the front end driver to do grant table mapping or is
> > > >>>this all being done by mapping the domain's memory?
> > > >>>
> > > >>There is nothing changed in front end driver. Currently I use
> > > >>alloc_vm_area to get address space, and map the domain's memory as
> > > >>what what qemu did.
> > > >>
> > > >
> > > >You mean you're using xc_map_foreign_range and friends in the backend to
> > > >map guest memory? That's not very desirable as it violates Xen's
> > > >security model. It would not be too hard to pass grant references
> > > >instead of guest physical memory address IMHO.
> > > >
> > > In fact, I did what virtio-net have done in Qemu. I think security
> > > is a pseudo question because Dom0 is under control.

Right, but we are trying to move the backends out of Dom0, for
scalability and security.
Setting up a network driver domain is pretty easy and should work out of
the box with Xen 4.3.
That said, I agree that using xc_map_foreign_range is a good way to start.


> > Consider that you might have driver domains. Not every domain is under
> > control or trusted.
> 
> I don't see anything that will prevent using driver domains here.

Driver domains are not privileged, therefore cannot map random guest
pages (unless they have been granted by the guest via the grant table).
xc_map_foreign_range can't work from a driver domain.
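
To make the distinction concrete, a rough userspace sketch (the libxc calls
and signatures here are quoted from memory and only approximate, so treat this
as an assumption rather than a reference): a privileged backend can map an
arbitrary guest frame, while an unprivileged driver domain can only map pages
the guest has explicitly granted.

#include <sys/mman.h>
#include <xenctrl.h>

/* Dom0-style: map an arbitrary guest frame by its frame number. */
static void *map_page_privileged(xc_interface *xch, uint32_t domid,
				 unsigned long gmfn)
{
	return xc_map_foreign_range(xch, domid, XC_PAGE_SIZE,
				    PROT_READ | PROT_WRITE, gmfn);
}

/* Driver-domain-friendly: map only a page the guest granted to us. */
static void *map_page_granted(xc_gnttab *xcg, uint32_t domid, uint32_t gref)
{
	return xc_gnttab_map_grant_ref(xcg, domid, gref,
				       PROT_READ | PROT_WRITE);
}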


> > Also consider that security model like XSM can be
> > used to audit operations to enhance security so your foreign mapping
> > approach might not always work.
> 
> It could be nice to have as an option, sure.
> XSM is disabled by default though so I don't think lack of support for
> that makes it a prototype.

There are some security aware Xen based products in the market today
that use XSM.

Re: [PATCH v2] KVM: mmu: allow page tables to be in read-only slots

2013-09-05 Thread Xiao Guangrong
On 09/05/2013 08:21 PM, Paolo Bonzini wrote:
> Page tables in a read-only memory slot will currently cause a triple
> fault when running with shadow paging, because the page walker uses
> gfn_to_hva and it fails on such a slot.
> 
> TianoCore uses such a page table.  The idea is that, on real hardware,
> the firmware can already run in 64-bit flat mode when setting up the
> memory controller.  Real hardware seems to be fine with that as long as
> the accessed/dirty bits are set.  Thus, this patch saves whether the
> slot is readonly, and later checks it when updating the accessed and
> dirty bits.
> 
> Note that this scenario is not supported by NPT at all, as explained by
> comments in the code.

Reviewed-by: Xiao Guangrong 



[PATCH uq/master 2/2] KVM: make XSAVE support more robust

2013-09-05 Thread Paolo Bonzini
QEMU moves state from CPUArchState to struct kvm_xsave and back when it
invokes the KVM_*_XSAVE ioctls.  Because it doesn't treat the XSAVE
region as an opaque blob, it might be impossible to set some state on
the destination if migrating to an older version.

This patch blocks migration if it finds that unsupported bits are set
in the XSTATE_BV header field.  To make this work robustly, QEMU should
only report in env->xstate_bv those fields that will actually end up
in the migration stream.

Signed-off-by: Paolo Bonzini 
---
 target-i386/kvm.c | 3 ++-
 target-i386/machine.c | 4 
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 749aa09..df08a4b 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1291,7 +1291,8 @@ static int kvm_get_xsave(X86CPU *cpu)
 sizeof env->fpregs);
 memcpy(env->xmm_regs, &xsave->region[XSAVE_XMM_SPACE],
 sizeof env->xmm_regs);
-env->xstate_bv = *(uint64_t *)&xsave->region[XSAVE_XSTATE_BV];
+env->xstate_bv = *(uint64_t *)&xsave->region[XSAVE_XSTATE_BV] &
+XSTATE_SUPPORTED;
 memcpy(env->ymmh_regs, &xsave->region[XSAVE_YMMH_SPACE],
 sizeof env->ymmh_regs);
 return 0;
diff --git a/target-i386/machine.c b/target-i386/machine.c
index dc81cde..9e2cfcf 100644
--- a/target-i386/machine.c
+++ b/target-i386/machine.c
@@ -278,6 +278,10 @@ static int cpu_post_load(void *opaque, int version_id)
 CPUX86State *env = &cpu->env;
 int i;
 
+if (env->xstate_bv & ~XSTATE_SUPPORTED) {
+return -EINVAL;
+}
+ 
 /*
  * Real mode guest segments register DPL should be zero.
  * Older KVM version were setting it wrongly.
-- 
1.8.3.1



[PATCH uq/master 0/2] KVM: issues with XSAVE support

2013-09-05 Thread Paolo Bonzini
This series fixes two migration bugs concerning KVM's XSAVE ioctls,
both found by code inspection (the second in fact is just theoretical
until AVX512 or MPX support is added to KVM).

Please review.

Paolo Bonzini (2):
  x86: fix migration from pre-version 12
  KVM: make XSAVE support more robust

 target-i386/cpu.c | 1 +
 target-i386/cpu.h | 5 +
 target-i386/kvm.c | 3 ++-
 target-i386/machine.c | 4 
 4 files changed, 12 insertions(+), 1 deletion(-)

-- 
1.8.3.1



[PATCH uq/master 1/2] x86: fix migration from pre-version 12

2013-09-05 Thread Paolo Bonzini
On KVM, KVM_SET_XSAVE would be executed with a zero xstate_bv
and would not restore anything.

Since FP and SSE data are always valid, set them in xstate_bv at reset
time.  In fact, that value is the same one that KVM_GET_XSAVE returns on
pre-XSAVE hosts.

Signed-off-by: Paolo Bonzini 
---
 target-i386/cpu.c | 1 +
 target-i386/cpu.h | 5 +
 2 files changed, 6 insertions(+)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index c36345e..ac83106 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -2386,6 +2386,7 @@ static void x86_cpu_reset(CPUState *s)
 env->fpuc = 0x37f;
 
 env->mxcsr = 0x1f80;
+env->xstate_bv = XSTATE_FP | XSTATE_SSE;
 
 env->pat = 0x0007040600070406ULL;
 env->msr_ia32_misc_enable = MSR_IA32_MISC_ENABLE_DEFAULT;
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 5723eff..a153078 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -380,6 +380,11 @@
 
 #define MSR_VM_HSAVE_PA 0xc0010117
 
+#define XSTATE_SUPPORTED   (XSTATE_FP|XSTATE_SSE|XSTATE_YMM)
+#define XSTATE_FP  1
+#define XSTATE_SSE 2
+#define XSTATE_YMM 4
+
 /* CPUID feature words */
 typedef enum FeatureWord {
 FEAT_1_EDX, /* CPUID[1].EDX */
-- 
1.8.3.1




[PATCH] KVM: x86: prevent setting unsupported XSAVE states

2013-09-05 Thread Paolo Bonzini
A guest can still attempt to save and restore XSAVE states even if they
have been masked in CPUID leaf 0Dh.  This usually is not visible to
the guest, but is still wrong: "Any attempt to set a reserved bit (as
determined by the contents of EAX and EDX after executing CPUID with
EAX=0DH, ECX= 0H) in XCR0 for a given processor will result in a #GP
exception".

The patch also performs the same checks as __kvm_set_xcr in KVM_SET_XSAVE.
This catches migration from newer to older kernel/processor before the
guest starts running.

Cc: kvm@vger.kernel.org
Cc: Gleb Natapov 
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/cpuid.c |  2 +-
 arch/x86/kvm/x86.c   | 10 --
 arch/x86/kvm/x86.h   |  1 +
 3 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index a20ecb5..d7c465d 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -182,7 +182,7 @@ static bool supported_xcr0_bit(unsigned bit)
 {
u64 mask = ((u64)1 << bit);
 
-   return mask & (XSTATE_FP | XSTATE_SSE | XSTATE_YMM) & host_xcr0;
+   return mask & KVM_SUPPORTED_XCR0 & host_xcr0;
 }
 
 #define F(x) bit(X86_FEATURE_##x)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3625798..801a882 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -586,6 +586,8 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
return 1;
if ((xcr0 & XSTATE_YMM) && !(xcr0 & XSTATE_SSE))
return 1;
+   if (xcr0 & ~KVM_SUPPORTED_XCR0)
+   return 1;
if (xcr0 & ~host_xcr0)
return 1;
kvm_put_guest_xcr0(vcpu);
@@ -2980,10 +2982,14 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu 
*vcpu,
u64 xstate_bv =
*(u64 *)&guest_xsave->region[XSAVE_HDR_OFFSET / sizeof(u32)];
 
-   if (cpu_has_xsave)
+   if (cpu_has_xsave) {
+   if (xstate_bv & ~KVM_SUPPORTED_XCR0)
+   return -EINVAL;
+   if (xstate_bv & ~host_xcr0)
+   return -EINVAL;
memcpy(&vcpu->arch.guest_fpu.state->xsave,
guest_xsave->region, xstate_size);
-   else {
+   } else {
if (xstate_bv & ~XSTATE_FPSSE)
return -EINVAL;
memcpy(&vcpu->arch.guest_fpu.state->fxsave,
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index e224f7a..587fb9e 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -122,6 +122,7 @@ int kvm_write_guest_virt_system(struct x86_emulate_ctxt 
*ctxt,
gva_t addr, void *val, unsigned int bytes,
struct x86_exception *exception);
 
+#define KVM_SUPPORTED_XCR0 (XSTATE_FP | XSTATE_SSE | XSTATE_YMM)
 extern u64 host_xcr0;
 
 extern struct static_key kvm_no_apic_vcpu;
-- 
1.8.3.1



[PATCH v2] KVM: mmu: allow page tables to be in read-only slots

2013-09-05 Thread Paolo Bonzini
Page tables in a read-only memory slot will currently cause a triple
fault when running with shadow paging, because the page walker uses
gfn_to_hva and it fails on such a slot.

TianoCore uses such a page table.  The idea is that, on real hardware,
the firmware can already run in 64-bit flat mode when setting up the
memory controller.  Real hardware seems to be fine with that as long as
the accessed/dirty bits are set.  Thus, this patch saves whether the
slot is readonly, and later checks it when updating the accessed and
dirty bits.

Note that this scenario is not supported by NPT at all, as explained by
comments in the code.

Cc: sta...@vger.kernel.org
Cc: kvm@vger.kernel.org
Cc: Xiao Guangrong 
Cc: Gleb Natapov 
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/paging_tmpl.h | 20 +++-
 include/linux/kvm_host.h   |  1 +
 virt/kvm/kvm_main.c| 14 +-
 3 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 0433301..aa18aca 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -99,6 +99,7 @@ struct guest_walker {
pt_element_t prefetch_ptes[PTE_PREFETCH_NUM];
gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS];
+   bool pte_writable[PT_MAX_FULL_LEVELS];
unsigned pt_access;
unsigned pte_access;
gfn_t gfn;
@@ -235,6 +236,22 @@ static int FNAME(update_accessed_dirty_bits)(struct 
kvm_vcpu *vcpu,
if (pte == orig_pte)
continue;
 
+   /*
+* If the slot is read-only, simply do not process the accessed
+* and dirty bits.  This is the correct thing to do if the slot
+* is ROM, and page tables in read-as-ROM/write-as-MMIO slots
+* are only supported if the accessed and dirty bits are already
+* set in the ROM (so that MMIO writes are never needed).
+*
+* Note that NPT does not allow this at all and faults, since
+* it always wants nested page table entries for the guest
+* page tables to be writable.  And EPT works but will simply
+* overwrite the read-only memory to set the accessed and dirty
+* bits.
+*/
+   if (unlikely(!walker->pte_writable[level - 1]))
+   continue;
+
ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, 
orig_pte, pte);
if (ret)
return ret;
@@ -309,7 +326,8 @@ retry_walk:
goto error;
real_gfn = gpa_to_gfn(real_gfn);
 
-   host_addr = gfn_to_hva(vcpu->kvm, real_gfn);
+   host_addr = gfn_to_hva_read(vcpu->kvm, real_gfn,
+   &walker->pte_writable[walker->level 
- 1]);
if (unlikely(kvm_is_error_hva(host_addr)))
goto error;
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ca645a0..22f9cdf 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -533,6 +533,7 @@ int gfn_to_page_many_atomic(struct kvm *kvm, gfn_t gfn, 
struct page **pages,
 
 struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn);
 unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn);
+unsigned long gfn_to_hva_read(struct kvm *kvm, gfn_t gfn, bool *writable);
 unsigned long gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
 void kvm_release_page_clean(struct page *page);
 void kvm_release_page_dirty(struct page *page);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f7e4334..418d037 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1078,11 +1078,15 @@ unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
 EXPORT_SYMBOL_GPL(gfn_to_hva);
 
 /*
- * The hva returned by this function is only allowed to be read.
- * It should pair with kvm_read_hva() or kvm_read_hva_atomic().
+ * If writable is set to false, the hva returned by this function is only
+ * allowed to be read.
  */
-static unsigned long gfn_to_hva_read(struct kvm *kvm, gfn_t gfn)
+unsigned long gfn_to_hva_read(struct kvm *kvm, gfn_t gfn, bool *writable)
 {
+   struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
+   if (writable)
+   *writable = !memslot_is_readonly(slot);
+
return __gfn_to_hva_many(gfn_to_memslot(kvm, gfn), gfn, NULL, false);
 }
 
@@ -1450,7 +1454,7 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void 
*data, int offset,
int r;
unsigned long addr;
 
-   addr = gfn_to_hva_read(kvm, gfn);
+   addr = gfn_to_hva_read(kvm, gfn, NULL);
if (kvm_is_error_hva(addr))
return -EFAULT;
r = kvm_read_hva(data, (void __user *)addr + offset, len);
@@ -1488,7 +1492,7 @@ int kvm_read_guest_atomic(struct kvm *kvm, gpa_t gpa, 
void *data,
   

RE: [PATCH v3] KVM: nVMX: Fully support of nested VMX preemption timer

2013-09-05 Thread Zhang, Yang Z
Arthur Chunqi Li wrote on 2013-09-05:
> > Arthur Chunqi Li wrote on 2013-09-05:
> >> On Thu, Sep 5, 2013 at 3:45 PM, Zhang, Yang Z
> >> 
> >> wrote:
> >> > Arthur Chunqi Li wrote on 2013-09-04:
> >> >> This patch contains the following two changes:
> >> >> 1. Fix the bug in nested preemption timer support. If vmexit
> >> >> L2->L0 with some reasons not emulated by L1, preemption timer
> >> >> value should be save in such exits.
> >> >> 2. Add support of "Save VMX-preemption timer value" VM-Exit
> >> >> controls to nVMX.
> >> >>
> >> >> With this patch, nested VMX preemption timer features are fully
> supported.
> >> >>
> >> >> Signed-off-by: Arthur Chunqi Li 
> >> >> ---
> >> >> This series depends on queue.
> >> >>
> >> >>  arch/x86/include/uapi/asm/msr-index.h |1 +
> >> >>  arch/x86/kvm/vmx.c|   51
> >> >> ++---
> >> >>  2 files changed, 48 insertions(+), 4 deletions(-)
> >> >>
> >> >> diff --git a/arch/x86/include/uapi/asm/msr-index.h
> >> >> b/arch/x86/include/uapi/asm/msr-index.h
> >> >> index bb04650..b93e09a 100644
> >> >> --- a/arch/x86/include/uapi/asm/msr-index.h
> >> >> +++ b/arch/x86/include/uapi/asm/msr-index.h
> >> >> @@ -536,6 +536,7 @@
> >> >>
> >> >>  /* MSR_IA32_VMX_MISC bits */
> >> >>  #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL
> <<
> >> 29)
> >> >> +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE   0x1F
> >> >>  /* AMD-V MSRs */
> >> >>
> >> >>  #define MSR_VM_CR   0xc0010114
> >> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index
> >> >> 1f1da43..870caa8
> >> >> 100644
> >> >> --- a/arch/x86/kvm/vmx.c
> >> >> +++ b/arch/x86/kvm/vmx.c
> >> >> @@ -2204,7 +2204,14 @@ static __init void
> >> >> nested_vmx_setup_ctls_msrs(void)  #ifdef CONFIG_X86_64
> >> >>   VM_EXIT_HOST_ADDR_SPACE_SIZE |  #endif
> >> >> - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT;
> >> >> + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT
> |
> >> >> + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
> >> >> + if (!(nested_vmx_pinbased_ctls_high &
> >> >> PIN_BASED_VMX_PREEMPTION_TIMER))
> >> >> + nested_vmx_exit_ctls_high &=
> >> >> +
> (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
> >> >> + if (!(nested_vmx_exit_ctls_high &
> >> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER))
> >> >> + nested_vmx_pinbased_ctls_high &=
> >> >> + (~PIN_BASED_VMX_PREEMPTION_TIMER);
> >> > The following logic is more clearly:
> >> > if(nested_vmx_pinbased_ctls_high &
> >> PIN_BASED_VMX_PREEMPTION_TIMER)
> >> > nested_vmx_exit_ctls_high |=
> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER
> >> Here I have such consideration: this logic is wrong if CPU support
> >> PIN_BASED_VMX_PREEMPTION_TIMER but doesn't support
> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER, though I don't know if this
> does
> >> occurs. So the codes above reads the MSR and reserves the features it
> >> supports, and here I just check if these two features are supported
> >> simultaneously.
> >>
> > No. Only VM_EXIT_SAVE_VMX_PREEMPTION_TIMER depends on
> > PIN_BASED_VMX_PREEMPTION_TIMER.
> PIN_BASED_VMX_PREEMPTION_TIMER is an
> > independent feature
> >
> >> You remind that this piece of codes can write like this:
> >> if (!(nested_vmx_pin_based_ctls_high &
> >> PIN_BASED_VMX_PREEMPTION_TIMER) ||
> >> !(nested_vmx_exit_ctls_high &
> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) {
> >> nested_vmx_exit_ctls_high
> >> &=(~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
> >> nested_vmx_pinbased_ctls_high &=
> >> (~PIN_BASED_VMX_PREEMPTION_TIMER);
> >> }
> >>
> >> This may reflect the logic I describe above that these two flags
> >> should support simultaneously, and brings less confusion.
> >> >
> >> > BTW: I don't see nested_vmx_setup_ctls_msrs() considers the
> >> > hardware's
> >> capability when expose those vmx features(not just preemption timer) to L1.
> >> The codes just above here, when setting pinbased control for nested
> >> vmx, it firstly "rdmsr MSR_IA32_VMX_PINBASED_CTLS", then use this to
> >> mask the features hardware not support. So does other control fields.
> >> >
> > Yes, I saw it.
> >
> >> >>   nested_vmx_exit_ctls_high |=
> >> >> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
> >> >>
> VM_EXIT_LOAD_IA32_EFER);
> >> >>
> >> >> @@ -6707,6 +6714,23 @@ static void vmx_get_exit_info(struct
> >> >> kvm_vcpu *vcpu, u64 *info1, u64 *info2)
> >> >>   *info2 = vmcs_read32(VM_EXIT_INTR_INFO);  }
> >> >>
> >> >> +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu) {
> >> >> + u64 delta_tsc_l1;
> >> >> + u32 preempt_val_l1, preempt_val_l2, preempt_scale;
> >> >> +
> >> >> + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) &
> >> >> +
> >> MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE;
> >> >> + preempt_val_l2 =
> vmcs_read32(VMX_PREEMPTION_TIMER_VALUE);
> >> >> + delta_tsc_l1 = kvm_x86_ops->read_l1_tsc(vcpu,
> >> >> + native_read_tsc()) -
> vcpu->arc

[PATCH v2 08/15] KVM: MMU: introduce nulls desc

2013-09-05 Thread Xiao Guangrong
This is like a nulls list: we use the pte-list pointer as the nulls value,
which helps us detect whether the "desc" has been moved to another rmap, in
which case we re-walk the rmap.

kvm->slots_lock is held when we do the lockless walk, which prevents the rmap
from being reused (freeing an rmap needs to hold that lock), so we cannot see
the same nulls used on different rmaps.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 35 +--
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 08fb4e2..c5f1b27 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -913,6 +913,24 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
large_gfn)
return level - 1;
 }
 
+static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc 
*desc)
+{
+   unsigned long marker;
+
+   marker = (unsigned long)pte_list | 1UL;
+   desc->more = (struct pte_list_desc *)marker;
+}
+
+static bool desc_is_a_nulls(struct pte_list_desc *desc)
+{
+   return (unsigned long)desc & 1;
+}
+
+static unsigned long *desc_get_nulls_value(struct pte_list_desc *desc)
+{
+   return (unsigned long *)((unsigned long)desc & ~1);
+}
+
 static int __find_first_free(struct pte_list_desc *desc)
 {
int i;
@@ -951,7 +969,7 @@ static int count_spte_number(struct pte_list_desc *desc)
 
first_free = __find_first_free(desc);
 
-   for (desc_num = 0; desc->more; desc = desc->more)
+   for (desc_num = 0; !desc_is_a_nulls(desc->more); desc = desc->more)
desc_num++;
 
return first_free + desc_num * PTE_LIST_EXT;
@@ -985,6 +1003,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
desc = mmu_alloc_pte_list_desc(vcpu);
desc->sptes[0] = (u64 *)*pte_list;
desc->sptes[1] = spte;
+   desc_mark_nulls(pte_list, desc);
*pte_list = (unsigned long)desc | 1;
return 1;
}
@@ -1030,7 +1049,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
/*
 * Only one entry existing but still use a desc to store it?
 */
-   WARN_ON(!next_desc);
+   WARN_ON(desc_is_a_nulls(next_desc));
 
mmu_free_pte_list_desc(first_desc);
*pte_list = (unsigned long)next_desc | 1ul;
@@ -1041,7 +1060,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
 * Only one entry in this desc, move the entry to the head
 * then the desc can be freed.
 */
-   if (!first_desc->sptes[1] && !first_desc->more) {
+   if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
*pte_list = (unsigned long)first_desc->sptes[0];
mmu_free_pte_list_desc(first_desc);
}
@@ -1070,7 +1089,7 @@ static void pte_list_remove(u64 *spte, unsigned long 
*pte_list)
 
rmap_printk("pte_list_remove:  %p many->many\n", spte);
desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-   while (desc) {
+   while (!desc_is_a_nulls(desc)) {
for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
if (desc->sptes[i] == spte) {
pte_list_desc_remove_entry(pte_list,
@@ -1097,11 +1116,13 @@ static void pte_list_walk(unsigned long *pte_list, 
pte_list_walk_fn fn)
return fn((u64 *)*pte_list);
 
desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-   while (desc) {
+   while (!desc_is_a_nulls(desc)) {
for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
fn(desc->sptes[i]);
desc = desc->more;
}
+
+   WARN_ON(desc_get_nulls_value(desc) != pte_list);
 }
 
 static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
@@ -1184,6 +1205,7 @@ static u64 *rmap_get_first(unsigned long rmap, struct 
rmap_iterator *iter)
 
iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
iter->pos = 0;
+   WARN_ON(desc_is_a_nulls(iter->desc));
return iter->desc->sptes[iter->pos];
 }
 
@@ -1204,7 +1226,8 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
return sptep;
}
 
-   iter->desc = iter->desc->more;
+   iter->desc = desc_is_a_nulls(iter->desc->more) ?
+   NULL : iter->desc->more;
 
if (iter->desc) {
iter->pos = 0;
-- 
1.8.1.4



[PATCH v2 06/15] KVM: MMU: update spte and add it into rmap before dirty log

2013-09-05 Thread Xiao Guangrong
kvm_vm_ioctl_get_dirty_log() write-protects the spte based on its dirty
bitmap, so we should ensure the writable spte can be found in the rmap before
the dirty bitmap is visible. Otherwise, we clear the dirty bitmap but fail to
write-protect the page, as detailed in the comments in this patch.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 84 ++
 arch/x86/kvm/x86.c | 10 +++
 2 files changed, 76 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a983570..8ea54d9 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2428,6 +2428,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 {
u64 spte;
int ret = 0;
+   bool remap = is_rmap_spte(*sptep);
 
if (set_mmio_spte(vcpu->kvm, sptep, gfn, pfn, pte_access))
return 0;
@@ -2489,12 +2490,73 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
}
}
 
-   if (pte_access & ACC_WRITE_MASK)
-   mark_page_dirty(vcpu->kvm, gfn);
-
 set_pte:
if (mmu_spte_update(sptep, spte))
kvm_flush_remote_tlbs(vcpu->kvm);
+
+   if (!remap) {
+   if (rmap_add(vcpu, sptep, gfn) > RMAP_RECYCLE_THRESHOLD)
+   rmap_recycle(vcpu, sptep, gfn);
+
+   if (level > PT_PAGE_TABLE_LEVEL)
+   ++vcpu->kvm->stat.lpages;
+   }
+
+   /*
+* The orders we require are:
+* 1) set spte to writable __before__ set the dirty bitmap.
+*It makes sure that dirty-logging is not missed when do
+*live migration at the final step where kvm should stop
+*the guest and push the remaining dirty pages got from
+*dirty-bitmap to the destination. The similar cases are
+*in fast_pf_fix_direct_spte() and kvm_write_guest_page().
+*
+* 2) add the spte into rmap __before__ set the dirty bitmap.
+*
+* They can ensure we can find the writable spte on the rmap
+* when we do lockless write-protection since
+* kvm_vm_ioctl_get_dirty_log() write-protects the pages based
+* on its dirty-bitmap, otherwise these cases will happen:
+*
+*  CPU 0 CPU 1
+*  kvm ioctl doing get-dirty-pages
+* mark_page_dirty(gfn) which
+* set the gfn on the dirty maps
+*  mask = xchg(dirty_bitmap, 0)
+*
+*  try to write-protect gfns which
+*  are set on "mask" then walk then
+*  rmap, see no spte on that rmap
+* add the spte into rmap
+*
+* !! Then the page can be freely wrote but not recorded in
+* the dirty bitmap.
+*
+* And:
+*
+*  VCPU 0CPU 1
+*kvm ioctl doing get-dirty-pages
+* mark_page_dirty(gfn) which
+* set the gfn on the dirty maps
+*
+* add spte into rmap
+*mask = xchg(dirty_bitmap, 0)
+*
+*try to write-protect gfns which
+*are set on "mask" then walk then
+*rmap, see spte is on the ramp
+*but it is readonly or nonpresent
+* Mark spte writable
+*
+* !! Then the page can be freely wrote but not recorded in the
+* dirty bitmap.
+*
+* See the comments in kvm_vm_ioctl_get_dirty_log().
+*/
+   smp_wmb();
+
+   if (pte_access & ACC_WRITE_MASK)
+   mark_page_dirty(vcpu->kvm, gfn);
 done:
return ret;
 }
@@ -2504,9 +2566,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
 int level, gfn_t gfn, pfn_t pfn, bool speculative,
 bool host_writable)
 {
-   int was_rmapped = 0;
-   int rmap_count;
-
pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
 *sptep, write_fault, gfn);
 
@@ -2528,8 +2587,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
 spte_to_pfn(*sptep), pfn);
drop_spte(vcpu->kvm, sptep);
kvm_flush_remote_tlbs(vcpu->kvm);
-   } else
-   was_rmapped = 1;
+   }
}
 
if (set_spte(vcpu, sptep, pte_access, level, gfn, pfn, speculative,
@@ -2547,16 +2605,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
 is_large_pte(*sptep)? "2MB" : "4kB",
 *sptep & PT_PRESENT_MASK ?"RW":"R", gfn,
 *sptep, sptep);
-  

[PATCH v2 03/15] KVM: MMU: lazily drop large spte

2013-09-05 Thread Xiao Guangrong
Currently, kvm zaps the large spte if write protection is needed, so a later
read can fault on that spte. Actually, we can make the large spte readonly
instead of making it non-present, so the page fault caused by read access can
be avoided.

The idea is from Avi:
| As I mentioned before, write-protecting a large spte is a good idea,
| since it moves some work from protect-time to fault-time, so it reduces
| jitter.  This removes the need for the return value.

This version fixes the issue reported in 6b73a9606; the reason for that
issue is that fast_page_fault() directly sets a readonly large spte to
writable but only marks the first page in the dirty bitmap, which means
other pages are missed. Fix it by allowing only normal sptes (at the
PT_PAGE_TABLE_LEVEL level) to be fast-fixed.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 36 
 arch/x86/kvm/x86.c |  8 ++--
 2 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 869f1db..88107ee 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1177,8 +1177,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
*sptep)
 
 /*
  * Write-protect on the specified @sptep, @pt_protect indicates whether
- * spte writ-protection is caused by protecting shadow page table.
- * @flush indicates whether tlb need be flushed.
+ * spte write-protection is caused by protecting shadow page table.
  *
  * Note: write protection is difference between drity logging and spte
  * protection:
@@ -1187,10 +1186,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
*sptep)
  * - for spte protection, the spte can be writable only after unsync-ing
  *   shadow page.
  *
- * Return true if the spte is dropped.
+ * Return true if tlb need be flushed.
  */
-static bool
-spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
+static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
 {
u64 spte = *sptep;
 
@@ -1200,17 +1198,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool 
*flush, bool pt_protect)
 
rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
 
-   if (__drop_large_spte(kvm, sptep)) {
-   *flush |= true;
-   return true;
-   }
-
if (pt_protect)
spte &= ~SPTE_MMU_WRITEABLE;
spte = spte & ~PT_WRITABLE_MASK;
 
-   *flush |= mmu_spte_update(sptep, spte);
-   return false;
+   return mmu_spte_update(sptep, spte);
 }
 
 static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
@@ -1222,11 +1214,8 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
 
for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
BUG_ON(!(*sptep & PT_PRESENT_MASK));
-   if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
-   sptep = rmap_get_first(*rmapp, &iter);
-   continue;
-   }
 
+   flush |= spte_write_protect(kvm, sptep, pt_protect);
sptep = rmap_get_next(&iter);
}
 
@@ -2675,6 +2664,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, 
int write,
break;
}
 
+   drop_large_spte(vcpu, iterator.sptep);
+
if (!is_shadow_present_pte(*iterator.sptep)) {
u64 base_addr = iterator.addr;
 
@@ -2876,6 +2867,19 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t 
gva, int level,
goto exit;
 
/*
+* Do not fix write-permission on the large spte since we only dirty
+* the first page into the dirty-bitmap in fast_pf_fix_direct_spte()
+* that means other pages are missed if its slot is dirty-logged.
+*
+* Instead, we let the slow page fault path create a normal spte to
+* fix the access.
+*
+* See the comments in kvm_arch_commit_memory_region().
+*/
+   if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+   goto exit;
+
+   /*
 * Currently, fast page fault only works for direct mapping since
 * the gfn is not stable for indirect shadow page.
 * See Documentation/virtual/kvm/locking.txt to get more detail.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e5ca72a..6ad0c07 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7208,8 +7208,12 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
/*
 * Write protect all pages for dirty logging.
-* Existing largepage mappings are destroyed here and new ones will
-* not be created until the end of the logging.
+*
+* All the sptes including the large sptes which point to this
+* slot are set to readonly. We can not create any new large
+* spte on this slot until the end

[PATCH v2 00/15] KVM: MMU: locklessly write-protect

2013-09-05 Thread Xiao Guangrong
Changelog v2:

- the changes from Gleb's review:
  1) fix calculating the number of spte in the pte_list_add()
  2) set iter->desc to NULL if meet a nulls desc to cleanup the code of
 rmap_get_next()
  3) fix hlist corruption due to accessing sp->hlist out of mmu-lock
  4) use rcu functions to access the rcu protected pointer
  5) spte will be missed in lockless walker if the spte is moved in a desc
 (remove a spte from the rmap using only one desc). Fix it by bottom-up
 walking the desc

- the changes from Paolo's review
  1) make the order and memory barriers between update spte / add spte into
 rmap and dirty-log more clear
  
- the changes from Marcelo's review:
  1) let fast page fault only fix the spts on the last level (level = 1)
  2) improve some changelogs and comments

- the changes from Takuya's review:
  move the patch "flush tlb if the spte can be locklessly modified" forward
  so that it is more easily merged

Thank all of you very much for your time and patience on this patchset!
  
Since we use rcu_assign_pointer() to update the pointers in the desc even if
dirty logging is disabled, I have measured the performance:
Host: Intel(R) Xeon(R) CPU   X5690  @ 3.47GHz * 12 + 36G memory

- migrate-perf (benchmark the time of get-dirty-log)
  before: Run 10 times, Avg time:9009483 ns.
  after: Run 10 times, Avg time:4807343 ns.

- kernbench
  Guest: 12 VCPUs + 8G memory
  before:
EPT is enabled:
# cat 09-05-origin-ept | grep real   
real 85.58
real 83.47
real 82.95

EPT is disabled:
# cat 09-05-origin-shadow | grep real
real 138.77
real 138.99
real 139.55

  after:
EPT is enabled:
# cat 09-05-lockless-ept | grep real
real 83.40
real 82.81
real 83.39

EPT is disabled:
# cat 09-05-lockless-shadow | grep real
real 138.91
real 139.71
real 138.94

No performance regression!



Background
==
Currently, when we mark a memslot as dirty-logged or get its dirty pages, we
need to write-protect a large amount of guest memory. This is heavy work,
especially because we need to hold the mmu-lock, which is also required by
vcpus to fix their page table faults and by the mmu-notifier when a host page
is being changed. In a guest with extreme cpu / memory usage, this becomes a
scalability issue.

This patchset introduces a way to locklessly write-protect guest memory.

Idea
==
There are the challenges we meet and the ideas to resolve them.

1) How to locklessly walk rmap?
The first idea we had to prevent the "desc" from being freed while we are
walking the rmap was to use RCU. But when a vcpu runs in shadow page mode or
nested mmu mode, it updates the rmap really frequently.

So we use SLAB_DESTROY_BY_RCU to manage "desc" instead; it allows the object
to be reused more quickly. We also store a "nulls" in the last "desc"
(desc->more), which helps us detect whether the "desc" has been moved to
another rmap, in which case we re-walk the rmap. I learned this idea from
nulls-list.

Another issue is that, when a spte is deleted from the "desc", another spte in
the last "desc" is moved to this position to replace the deleted one. If the
deleted one has already been visited and we do not visit the replaced one, the
replaced one is missed when we do the lockless walk.
To fix this case, we do not move the spte backward; instead, we move the entry
forward: when a spte is deleted, we move the entry in the first desc to that
position.
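
For illustration, a rough sketch of the resulting lockless walk (this is my
simplified reconstruction, not the walker the series actually adds; it assumes
the desc_is_a_nulls()/desc_get_nulls_value() helpers introduced later in this
series):

static void pte_list_walk_lockless_sketch(unsigned long *pte_list,
					  pte_list_walk_fn fn)
{
	struct pte_list_desc *desc;
	unsigned long value;
	int i;

restart:
	value = ACCESS_ONCE(*pte_list);
	if (!value)
		return;

	if (!(value & 1))
		return fn((u64 *)value);	/* single spte, no desc */

	desc = (struct pte_list_desc *)(value & ~1ul);
	while (!desc_is_a_nulls(desc)) {
		for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; i++)
			fn(desc->sptes[i]);
		desc = ACCESS_ONCE(desc->more);
	}

	/*
	 * The nulls no longer points back at our pte_list: the desc was
	 * moved to another rmap while we were walking, so start over.
	 */
	if (desc_get_nulls_value(desc) != pte_list)
		goto restart;
}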

2) How to locklessly access shadow page table?
It is easy if the handler is in the vcpu context; in that case we can use
walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end(), which
disable interrupts to stop shadow pages from being freed. But we are in the
ioctl context and the paths we are optimizing for have a heavy workload, so
disabling interrupts is not good for system performance.

We add an indicator to the kvm struct (kvm->arch.rcu_free_shadow_page), then
use call_rcu() to free the shadow page if that indicator is set. Setting and
clearing the indicator is protected by the slots lock, so it need not be
atomic and does not hurt performance or scalability.
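
In rough pseudo-C, the idea looks like this (a hypothetical sketch only: the
kvm->arch.rcu_free_shadow_page indicator is the one named above, but the
rcu_head field, the callback and the helper names are illustrative, not taken
from the series):

	/* Freeing side: defer the free to RCU only while a lockless walker
	 * may be running. */
	static void sketch_free_shadow_page(struct kvm *kvm,
					    struct kvm_mmu_page *sp)
	{
		if (kvm->arch.rcu_free_shadow_page)
			call_rcu(&sp->rcu_head, sketch_free_rcu); /* hypothetical */
		else
			sketch_free_now(sp);			  /* hypothetical */
	}

	/* Walker side: the indicator is set/cleared under kvm->slots_lock. */
	kvm->arch.rcu_free_shadow_page = true;
	rcu_read_lock();
	/* ... walk the rmaps / shadow page tables locklessly ... */
	rcu_read_unlock();
	kvm->arch.rcu_free_shadow_page = false;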

3) How to locklessly write-protect guest memory?
Currently, there are two behaviors when we write-protect guest memory: one is
clearing the Writable bit on the spte, and the other is dropping the spte when
it points to a large page. The former is easy, since we only need to atomically
clear a bit, but the latter is hard, since we need to remove the spte from the
rmap. So we unify these two behaviors and only make the spte readonly. Making
a large spte readonly instead of nonpresent is also good for reducing jitter.

And we need to pay more attention to the order of making the spte writable,
adding the spte into the rmap, and setting the corresponding bit in the dirty
bitmap. Since kvm_vm_ioctl_get_dirty_log() write-protects sptes based on the
dirty bitmap, we should ensure the writable spte can be found in the rmap
before the dirty bitmap change is visible. Otherwise, we would clear the dirty
bitmap but fail to write-protect the page.
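
To make the pairing concrete, the two sides look roughly like this (a sketch
only: the writer side mirrors what patch 6 adds to set_spte(), while the claim
that the xchg() on the reader side provides the matching full barrier is my
assumption, not text taken from the patches):

	/* Writer: vcpu fault path (set_spte(), patch 6) */
	mmu_spte_update(sptep, spte);		/* 1. make the spte writable   */
	rmap_add(vcpu, sptep, gfn);		/* 2. add the spte into rmap   */
	smp_wmb();				/* order 1/2 before 3          */
	mark_page_dirty(vcpu->kvm, gfn);	/* 3. set the dirty bitmap bit */

	/* Reader: kvm_vm_ioctl_get_dirty_log() */
	mask = xchg(&dirty_bitmap[i], 0);	/* assumed to act as a full barrier */
	kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
						/* the rmap walk now sees the
						 * writable spte added in 2.   */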

Performance result

The performance result and the benchmark can be found

[PATCH v2 07/15] KVM: MMU: redesign the algorithm of pte_list

2013-09-05 Thread Xiao Guangrong
Change the algorithm to:
1) always add a new desc to the first desc (pointed to by parent_ptes/rmap),
   which is good for implementing rcu-nulls-list-like lockless rmap
   walking

2) always move the entry in the first desc to the position we want
   to remove when deleting a spte from the parent_ptes/rmap (backward-move).
   This is good for implementing the lockless rmap walk since, in the current
   code, when a spte is deleted from the "desc", another spte in the last
   "desc" is moved to this position to replace the deleted one. If the
   deleted one has already been visited and we do not visit the replaced one,
   the replaced one is missed during the lockless walk.
   To fix this case, we do not move the spte backward; instead, we move the
   entry forward: when a spte is deleted, we move the entry in the first
   desc to that position

Both of these also reduce cache misses

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 180 -
 1 file changed, 123 insertions(+), 57 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8ea54d9..08fb4e2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -913,6 +913,50 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
large_gfn)
return level - 1;
 }
 
+static int __find_first_free(struct pte_list_desc *desc)
+{
+   int i;
+
+   for (i = 0; i < PTE_LIST_EXT; i++)
+   if (!desc->sptes[i])
+   break;
+   return i;
+}
+
+static int find_first_free(struct pte_list_desc *desc)
+{
+   int free = __find_first_free(desc);
+
+   WARN_ON(free >= PTE_LIST_EXT);
+   return free;
+}
+
+static int find_last_used(struct pte_list_desc *desc)
+{
+   int used = __find_first_free(desc) - 1;
+
+   WARN_ON(used < 0 || used >= PTE_LIST_EXT);
+   return used;
+}
+
+/*
+ * TODO: we can encode the desc number into the rmap/parent_ptes
+ * since at least 10 physical/virtual address bits are reserved
+ * on x86. It is worthwhile if it shows that the desc walking is
+ * a performance issue.
+ */
+static int count_spte_number(struct pte_list_desc *desc)
+{
+   int first_free, desc_num;
+
+   first_free = __find_first_free(desc);
+
+   for (desc_num = 0; desc->more; desc = desc->more)
+   desc_num++;
+
+   return first_free + desc_num * PTE_LIST_EXT;
+}
+
 /*
  * Pte mapping structures:
  *
@@ -923,99 +967,121 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
large_gfn)
  *
  * Returns the number of pte entries before the spte was added or zero if
  * the spte was not added.
- *
  */
 static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
unsigned long *pte_list)
 {
struct pte_list_desc *desc;
-   int i, count = 0;
+   int free_pos;
 
if (!*pte_list) {
rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte);
*pte_list = (unsigned long)spte;
-   } else if (!(*pte_list & 1)) {
+   return 0;
+   }
+
+   if (!(*pte_list & 1)) {
rmap_printk("pte_list_add: %p %llx 1->many\n", spte, *spte);
desc = mmu_alloc_pte_list_desc(vcpu);
desc->sptes[0] = (u64 *)*pte_list;
desc->sptes[1] = spte;
*pte_list = (unsigned long)desc | 1;
-   ++count;
-   } else {
-   rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
-   desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-   while (desc->sptes[PTE_LIST_EXT-1] && desc->more) {
-   desc = desc->more;
-   count += PTE_LIST_EXT;
-   }
-   if (desc->sptes[PTE_LIST_EXT-1]) {
-   count += PTE_LIST_EXT;
-   desc->more = mmu_alloc_pte_list_desc(vcpu);
-   desc = desc->more;
-   }
-   for (i = 0; desc->sptes[i]; ++i)
-   ++count;
-   desc->sptes[i] = spte;
+   return 1;
}
-   return count;
+
+   rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
+   desc = (struct pte_list_desc *)(*pte_list & ~1ul);
+
+   /* No empty entry in the desc. */
+   if (desc->sptes[PTE_LIST_EXT - 1]) {
+   struct pte_list_desc *new_desc;
+   new_desc = mmu_alloc_pte_list_desc(vcpu);
+   new_desc->more = desc;
+   desc = new_desc;
+   *pte_list = (unsigned long)desc | 1;
+   }
+
+   free_pos = find_first_free(desc);
+   desc->sptes[free_pos] = spte;
+   return count_spte_number(desc) - 1;
 }
 
 static void
-pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
-  int i, struct pte_list_desc *prev_desc)
+pte_list_desc_remove_entry(unsigned long *pte_list,
+  struct pte_list_desc *desc, int

[PATCH v2 04/15] KVM: MMU: flush tlb if the spte can be locklessly modified

2013-09-05 Thread Xiao Guangrong
Relax the tlb flush condition since we will write-protect the spte outside of
the mmu lock. Note that lockless write-protection only marks a writable spte
readonly, and the spte can be writable only if both SPTE_HOST_WRITEABLE and
SPTE_MMU_WRITEABLE are set (which is what spte_is_locklessly_modifiable tests)

This patch is used to avoid this kind of race:

  VCPU 0 VCPU 1
lockless write protection:
  set spte.w = 0
 lock mmu-lock

 write protection the spte to sync shadow page,
 see spte.w = 0, then without flush tlb

 unlock mmu-lock

 !!! At this point, the shadow page can still be
 writable due to the corrupt tlb entry
 Flush all TLB

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 88107ee..7488229 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -595,7 +595,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 * we always atomicly update it, see the comments in
 * spte_has_volatile_bits().
 */
-   if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
+   if (spte_is_locklessly_modifiable(old_spte) &&
+ !is_writable_pte(new_spte))
ret = true;
 
if (!shadow_accessed_mask)
-- 
1.8.1.4



[PATCH v2 01/15] KVM: MMU: fix the count of spte number

2013-09-05 Thread Xiao Guangrong
If the desc is the last one and it is full, its sptes are not counted

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6e2d2c8..7714fd8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -948,6 +948,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
count += PTE_LIST_EXT;
}
if (desc->sptes[PTE_LIST_EXT-1]) {
+   count += PTE_LIST_EXT;
desc->more = mmu_alloc_pte_list_desc(vcpu);
desc = desc->more;
}
-- 
1.8.1.4



[PATCH v2 05/15] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes

2013-09-05 Thread Xiao Guangrong
Now we can flush all the TLBs outside of the mmu lock without TLB corruption
when write-protecting the sptes, because:
- we have marked large sptes readonly instead of dropping them, which means we
  just change the spte from writable to readonly, so we only need to care
  about the case of changing a spte from present to present (changing the spte
  from present to nonpresent flushes all the TLBs immediately); in other words,
  the only case we need to care about is mmu_spte_update()

- in mmu_spte_update(), we now check
  SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE instead of PT_WRITABLE_MASK, which
  means it does not depend on PT_WRITABLE_MASK anymore

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 18 ++
 arch/x86/kvm/x86.c |  9 +++--
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7488229..a983570 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4320,15 +4320,25 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 
int slot)
if (*rmapp)
__rmap_write_protect(kvm, rmapp, false);
 
-   if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
-   kvm_flush_remote_tlbs(kvm);
+   if (need_resched() || spin_needbreak(&kvm->mmu_lock))
cond_resched_lock(&kvm->mmu_lock);
-   }
}
}
 
-   kvm_flush_remote_tlbs(kvm);
spin_unlock(&kvm->mmu_lock);
+
+   /*
+* We can flush all the TLBs out of the mmu lock without TLB
+* corruption since we just change the spte from writable to
+* readonly so that we only need to care the case of changing
+* spte from present to present (changing the spte from present
+* to nonpresent will flush all the TLBs immediately), in other
+* words, the only case we care is mmu_spte_update() where we
+* haved checked SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE
+* instead of PT_WRITABLE_MASK, that means it does not depend
+* on PT_WRITABLE_MASK anymore.
+*/
+   kvm_flush_remote_tlbs(kvm);
 }
 
 #define BATCH_ZAP_PAGES10
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6ad0c07..72f1487 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3560,11 +3560,16 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct 
kvm_dirty_log *log)
offset = i * BITS_PER_LONG;
kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
}
-   if (is_dirty)
-   kvm_flush_remote_tlbs(kvm);
 
spin_unlock(&kvm->mmu_lock);
 
+   /*
+* All the TLBs can be flushed out of mmu lock, see the comments in
+* kvm_mmu_slot_remove_write_access().
+*/
+   if (is_dirty)
+   kvm_flush_remote_tlbs(kvm);
+
r = -EFAULT;
if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
goto out;
-- 
1.8.1.4



[PATCH v2 02/15] KVM: MMU: properly check last spte in fast_page_fault()

2013-09-05 Thread Xiao Guangrong
Use sp->role.level instead of @level, since @level is not taken from the
page table hierarchy.

There is no issue in the current code since fast page fault currently only
fixes faults caused by dirty logging, which are always on the last level
(level = 1).

This patch makes the code more readable and avoids potential issues in
further development.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7714fd8..869f1db 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2804,9 +2804,9 @@ static bool page_fault_can_be_fast(u32 error_code)
 }
 
 static bool
-fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 spte)
+fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+   u64 *sptep, u64 spte)
 {
-   struct kvm_mmu_page *sp = page_header(__pa(sptep));
gfn_t gfn;
 
WARN_ON(!sp->role.direct);
@@ -2832,6 +2832,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t 
gva, int level,
u32 error_code)
 {
struct kvm_shadow_walk_iterator iterator;
+   struct kvm_mmu_page *sp;
bool ret = false;
u64 spte = 0ull;
 
@@ -2852,7 +2853,8 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t 
gva, int level,
goto exit;
}
 
-   if (!is_last_spte(spte, level))
+   sp = page_header(__pa(iterator.sptep));
+   if (!is_last_spte(spte, sp->role.level))
goto exit;
 
/*
@@ -2878,7 +2880,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t 
gva, int level,
 * the gfn is not stable for indirect shadow page.
 * See Documentation/virtual/kvm/locking.txt to get more detail.
 */
-   ret = fast_pf_fix_direct_spte(vcpu, iterator.sptep, spte);
+   ret = fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte);
 exit:
trace_fast_page_fault(vcpu, gva, error_code, iterator.sptep,
  spte, ret);
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 09/15] KVM: MMU: introduce pte-list lockless walker

2013-09-05 Thread Xiao Guangrong
The basic idea comes from the nulls list, which uses a nulls value to
indicate whether the desc has been moved to a different pte-list

Note, we should do a bottom-up walk of the desc since we always move
the bottom entry into the deleted position. A desc only has 3 entries
in the current code so it is not a problem now, but the issue will
be triggered if we expand the size of the desc in further development

Thanks to SLAB_DESTROY_BY_RCU, the desc can be quickly reused
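
A rough single-threaded userspace sketch of the nulls idea (the names and
layout here are illustrative, not the kernel code): the list terminator
encodes which pte-list the chain belongs to, so a lockless walker that lands
on a desc recycled onto another pte-list sees a mismatching nulls value and
restarts:

#include <stdio.h>

#define PTE_LIST_EXT 3

struct desc {
	unsigned long *sptes[PTE_LIST_EXT];
	struct desc *more;		/* next desc, or a nulls marker */
};

/* The nulls marker is the pte-list head pointer with the low bit set. */
static struct desc *make_nulls(unsigned long *pte_list)
{
	return (struct desc *)((unsigned long)pte_list | 1UL);
}

static int is_nulls(struct desc *d)
{
	return (unsigned long)d & 1UL;
}

static unsigned long *nulls_value(struct desc *d)
{
	return (unsigned long *)((unsigned long)d & ~1UL);
}

static void walk(unsigned long *pte_list, struct desc *head,
		 void (*fn)(unsigned long *))
{
	struct desc *d;
	int i;

restart:
	d = head;
	while (!is_nulls(d)) {
		/* walk from the highest index down, matching removal order */
		for (i = PTE_LIST_EXT - 1; i >= 0; i--)
			if (d->sptes[i])
				fn(d->sptes[i]);
		d = d->more;
	}
	/* the desc chain was recycled onto a different pte-list: start over */
	if (nulls_value(d) != pte_list)
		goto restart;
}

static void print_spte(unsigned long *sptep)
{
	printf("spte %p = %lx\n", (void *)sptep, *sptep);
}

int main(void)
{
	unsigned long pte_list_head = 0;	/* stands in for the rmap head */
	unsigned long spte_a = 0x1, spte_b = 0x3;
	struct desc d = {
		.sptes = { &spte_a, &spte_b, NULL },
		.more  = make_nulls(&pte_list_head),
	};

	walk(&pte_list_head, &d, print_spte);
	return 0;
}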

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 57 ++
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c5f1b27..3e1432f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -975,6 +975,10 @@ static int count_spte_number(struct pte_list_desc *desc)
return first_free + desc_num * PTE_LIST_EXT;
 }
 
+#define rcu_assign_pte_list(pte_list_p, value) \
+   rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p),   \
+   (unsigned long *)(value))
+
 /*
  * Pte mapping structures:
  *
@@ -994,7 +998,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 
if (!*pte_list) {
rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte);
-   *pte_list = (unsigned long)spte;
+   rcu_assign_pte_list(pte_list, spte);
return 0;
}
 
@@ -1004,7 +1008,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
desc->sptes[0] = (u64 *)*pte_list;
desc->sptes[1] = spte;
desc_mark_nulls(pte_list, desc);
-   *pte_list = (unsigned long)desc | 1;
+   rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
return 1;
}
 
@@ -1017,7 +1021,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
new_desc = mmu_alloc_pte_list_desc(vcpu);
new_desc->more = desc;
desc = new_desc;
-   *pte_list = (unsigned long)desc | 1;
+   rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
}
 
free_pos = find_first_free(desc);
@@ -1125,6 +1129,51 @@ static void pte_list_walk(unsigned long *pte_list, 
pte_list_walk_fn fn)
WARN_ON(desc_get_nulls_value(desc) != pte_list);
 }
 
+/* The caller should hold rcu lock. */
+static void pte_list_walk_lockless(unsigned long *pte_list,
+  pte_list_walk_fn fn)
+{
+   struct pte_list_desc *desc;
+   unsigned long pte_list_value;
+   int i;
+
+restart:
+   /*
+* Force the pte_list to be reloaded.
+*
+* See the comments in hlist_nulls_for_each_entry_rcu().
+*/
+   barrier();
+   pte_list_value = *rcu_dereference(pte_list);
+   if (!pte_list_value)
+   return;
+
+   if (!(pte_list_value & 1))
+   return fn((u64 *)pte_list_value);
+
+   desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
+   while (!desc_is_a_nulls(desc)) {
+   /*
+* We should do top-down walk since we always use the higher
+* indices to replace the deleted entry if only one desc is
+* used in the rmap when a spte is removed. Otherwise the
+* moved entry will be missed.
+*/
+   for (i = PTE_LIST_EXT - 1; i >= 0; i--)
+   if (desc->sptes[i])
+   fn(desc->sptes[i]);
+
+   desc = rcu_dereference(desc->more);
+
+   /* It is being initialized. */
+   if (unlikely(!desc))
+   goto restart;
+   }
+
+   if (unlikely(desc_get_nulls_value(desc) != pte_list))
+   goto restart;
+}
+
 static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
struct kvm_memory_slot *slot)
 {
@@ -4651,7 +4700,7 @@ int kvm_mmu_module_init(void)
 {
pte_list_desc_cache = kmem_cache_create("pte_list_desc",
sizeof(struct pte_list_desc),
-   0, 0, NULL);
+   0, SLAB_DESTROY_BY_RCU, NULL);
if (!pte_list_desc_cache)
goto nomem;
 
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 10/15] KVM: MMU: initialize the pointers in pte_list_desc properly

2013-09-05 Thread Xiao Guangrong
Since pte_list_desc will be accessed locklessly, we need to initialize its
pointers atomically so that the lockless walker can never see a partial value
in a pointer

In this patch we initialize the pointers by plain pointer assignment, which is
always atomic, instead of using kmem_cache_zalloc
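
A minimal userspace sketch of the idea (illustrative only: C11 atomic pointer
stores stand in for the kernel's assumption that an aligned pointer-sized
store is atomic). Each field is initialized with one whole-pointer store, so a
concurrent lockless reader can only observe NULL or a complete pointer, never
a torn value, which a byte-wise memset would not guarantee:

#include <stdatomic.h>
#include <stddef.h>

#define PTE_LIST_EXT 3

struct pte_list_desc {
	_Atomic(unsigned long *) sptes[PTE_LIST_EXT];
	_Atomic(struct pte_list_desc *) more;
};

/* Constructor-style init: one whole-pointer store per field. */
static void pte_list_desc_ctor(struct pte_list_desc *desc)
{
	int i;

	for (i = 0; i < PTE_LIST_EXT; i++)
		atomic_store_explicit(&desc->sptes[i], NULL,
				      memory_order_relaxed);
	atomic_store_explicit(&desc->more, NULL, memory_order_relaxed);
}

int main(void)
{
	struct pte_list_desc desc;

	pte_list_desc_ctor(&desc);
	return 0;
}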

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 27 +--
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3e1432f..fe80019 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -687,14 +687,15 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu 
*vcpu)
 }
 
 static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
- struct kmem_cache *base_cache, int min)
+ struct kmem_cache *base_cache, int min,
+ gfp_t flags)
 {
void *obj;
 
if (cache->nobjs >= min)
return 0;
while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
-   obj = kmem_cache_zalloc(base_cache, GFP_KERNEL);
+   obj = kmem_cache_alloc(base_cache, flags);
if (!obj)
return -ENOMEM;
cache->objects[cache->nobjs++] = obj;
@@ -741,14 +742,16 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
int r;
 
r = mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
-  pte_list_desc_cache, 8 + PTE_PREFETCH_NUM);
+  pte_list_desc_cache, 8 + PTE_PREFETCH_NUM,
+  GFP_KERNEL);
if (r)
goto out;
r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 8);
if (r)
goto out;
r = mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
-  mmu_page_header_cache, 4);
+  mmu_page_header_cache, 4,
+  GFP_KERNEL | __GFP_ZERO);
 out:
return r;
 }
@@ -913,6 +916,17 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
large_gfn)
return level - 1;
 }
 
+static void pte_list_desc_ctor(void *p)
+{
+   struct pte_list_desc *desc = p;
+   int i;
+
+   for (i = 0; i < PTE_LIST_EXT; i++)
+   desc->sptes[i] = NULL;
+
+   desc->more = NULL;
+}
+
 static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc 
*desc)
 {
unsigned long marker;
@@ -1066,6 +1080,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
 */
if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
*pte_list = (unsigned long)first_desc->sptes[0];
+   first_desc->sptes[0] = NULL;
mmu_free_pte_list_desc(first_desc);
}
 }
@@ -4699,8 +4714,8 @@ static void mmu_destroy_caches(void)
 int kvm_mmu_module_init(void)
 {
pte_list_desc_cache = kmem_cache_create("pte_list_desc",
-   sizeof(struct pte_list_desc),
-   0, SLAB_DESTROY_BY_RCU, NULL);
+   sizeof(struct pte_list_desc),
+   0, SLAB_DESTROY_BY_RCU, pte_list_desc_ctor);
if (!pte_list_desc_cache)
goto nomem;
 
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-09-05 Thread Xiao Guangrong
This is easy if the handler runs in vcpu context: in that case we can use
walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end(), which
disable interrupts to stop shadow pages from being freed. But here we are in
ioctl context and the paths we are optimizing for have a heavy workload, so
disabling interrupts is not good for system performance

We add an indicator to the kvm struct (kvm->arch.rcu_free_shadow_page), then
use call_rcu() to free the shadow page if that indicator is set. Setting and
clearing the indicator is protected by the slots lock, so it need not be
atomic and does not hurt performance or scalability

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h |  6 +-
 arch/x86/kvm/mmu.c  | 32 
 arch/x86/kvm/mmu.h  | 22 ++
 3 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c76ff74..8e4ca0d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -226,7 +226,10 @@ struct kvm_mmu_page {
/* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen.  */
unsigned long mmu_valid_gen;
 
-   DECLARE_BITMAP(unsync_child_bitmap, 512);
+   union {
+   DECLARE_BITMAP(unsync_child_bitmap, 512);
+   struct rcu_head rcu;
+   };
 
 #ifdef CONFIG_X86_32
/*
@@ -554,6 +557,7 @@ struct kvm_arch {
 */
struct list_head active_mmu_pages;
struct list_head zapped_obsolete_pages;
+   bool rcu_free_shadow_page;
 
struct list_head assigned_dev_head;
struct iommu_domain *iommu_domain;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2bf450a..f551fc7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2355,6 +2355,30 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, 
struct kvm_mmu_page *sp,
return ret;
 }
 
+static void kvm_mmu_isolate_pages(struct list_head *invalid_list)
+{
+   struct kvm_mmu_page *sp;
+
+   list_for_each_entry(sp, invalid_list, link)
+   kvm_mmu_isolate_page(sp);
+}
+
+static void free_pages_rcu(struct rcu_head *head)
+{
+   struct kvm_mmu_page *next, *sp;
+
+   sp = container_of(head, struct kvm_mmu_page, rcu);
+   while (sp) {
+   if (!list_empty(&sp->link))
+   next = list_first_entry(&sp->link,
+ struct kvm_mmu_page, link);
+   else
+   next = NULL;
+   kvm_mmu_free_page(sp);
+   sp = next;
+   }
+}
+
 static void kvm_mmu_commit_zap_page(struct kvm *kvm,
struct list_head *invalid_list)
 {
@@ -2375,6 +2399,14 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 */
kvm_flush_remote_tlbs(kvm);
 
+   if (kvm->arch.rcu_free_shadow_page) {
+   kvm_mmu_isolate_pages(invalid_list);
+   sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
+   list_del_init(invalid_list);
+   call_rcu(&sp->rcu, free_pages_rcu);
+   return;
+   }
+
list_for_each_entry_safe(sp, nsp, invalid_list, link) {
WARN_ON(!sp->role.invalid || sp->root_count);
kvm_mmu_isolate_page(sp);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 77e044a..61217f3 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -117,4 +117,26 @@ static inline bool permission_fault(struct kvm_mmu *mmu, 
unsigned pte_access,
 }
 
 void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm);
+
+/*
+ * The caller should ensure that these two functions should be
+ * serially called.
+ */
+static inline void kvm_mmu_rcu_free_page_begin(struct kvm *kvm)
+{
+   rcu_read_lock();
+
+   kvm->arch.rcu_free_shadow_page = true;
+   /* Set the indicator before access shadow page. */
+   smp_mb();
+}
+
+static inline void kvm_mmu_rcu_free_page_end(struct kvm *kvm)
+{
+   /* Make sure that access shadow page has finished. */
+   smp_mb();
+   kvm->arch.rcu_free_shadow_page = false;
+
+   rcu_read_unlock();
+}
 #endif
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 13/15] KVM: MMU: locklessly write-protect the page

2013-09-05 Thread Xiao Guangrong
Currently, when marking a memslot as dirty-logged or getting its dirty pages,
we need to write-protect a large amount of guest memory. This is heavy work,
and in particular we need to hold the mmu-lock, which is also required by
vcpus to fix their page table faults and by the mmu-notifier when a host page
is being changed. For guests with extreme cpu / memory usage this becomes a
scalability issue

This patch introduces a way to locklessly write-protect guest memory

Now that lockless rmap walk, lockless shadow page table access and lockless
spte write-protection are ready, it is time to implement page
write-protection outside of the mmu-lock
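
The core of the lockless write-protection is a compare-and-exchange retry loop
on the spte. The sketch below is a userspace analogue (a plain 64-bit word and
GCC's __atomic builtins stand in for the spte and the kernel's cmpxchg64), not
the code of this patch:

#include <stdint.h>
#include <stdio.h>

#define PT_PRESENT_MASK  (1ULL << 0)
#define PT_WRITABLE_MASK (1ULL << 1)

static uint64_t spte = PT_PRESENT_MASK | PT_WRITABLE_MASK;

static void write_protect_lockless(uint64_t *sptep)
{
	uint64_t old;

	for (;;) {
		old = __atomic_load_n(sptep, __ATOMIC_RELAXED);
		if (!(old & PT_WRITABLE_MASK))
			return;			/* already read-only */
		/* Retry if another CPU changed the spte under us. */
		if (__atomic_compare_exchange_n(sptep, &old,
						old & ~PT_WRITABLE_MASK,
						false,
						__ATOMIC_SEQ_CST,
						__ATOMIC_RELAXED))
			return;
	}
}

int main(void)
{
	write_protect_lockless(&spte);
	printf("spte = %llx\n", (unsigned long long)spte);
	return 0;
}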

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h |  4 
 arch/x86/kvm/mmu.c  | 53 +
 arch/x86/kvm/mmu.h  |  6 +
 arch/x86/kvm/x86.c  | 11 -
 4 files changed, 49 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8e4ca0d..00b44b1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -789,10 +789,6 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 
accessed_mask,
u64 dirty_mask, u64 nx_mask, u64 x_mask);
 
 int kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
-void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
-struct kvm_memory_slot *slot,
-gfn_t gfn_offset, unsigned long mask);
 void kvm_mmu_zap_all(struct kvm *kvm);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm);
 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f551fc7..44b7822 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1376,8 +1376,31 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
return flush;
 }
 
-/**
- * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
+static void __rmap_write_protect_lockless(u64 *sptep)
+{
+   u64 spte;
+   int level = page_header(__pa(sptep))->role.level;
+
+retry:
+   spte = mmu_spte_get_lockless(sptep);
+   if (unlikely(!is_last_spte(spte, level) || !is_writable_pte(spte)))
+   return;
+
+   if (likely(cmpxchg64(sptep, spte, spte & ~PT_WRITABLE_MASK) == spte))
+   return;
+
+   goto retry;
+}
+
+static void rmap_write_protect_lockless(unsigned long *rmapp)
+{
+   pte_list_walk_lockless(rmapp, __rmap_write_protect_lockless);
+}
+
+/*
+ * kvm_mmu_write_protect_pt_masked_lockless - write protect selected PT level
+ * pages out of mmu-lock.
+ *
  * @kvm: kvm instance
  * @slot: slot to protect
  * @gfn_offset: start of the BITS_PER_LONG pages we care about
@@ -1386,16 +1409,17 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
  * Used when we do not need to care about huge page mappings: e.g. during dirty
  * logging we do not have any such mappings.
  */
-void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
-struct kvm_memory_slot *slot,
-gfn_t gfn_offset, unsigned long mask)
+void
+kvm_mmu_write_protect_pt_masked_lockless(struct kvm *kvm,
+struct kvm_memory_slot *slot,
+gfn_t gfn_offset, unsigned long mask)
 {
unsigned long *rmapp;
 
while (mask) {
rmapp = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
  PT_PAGE_TABLE_LEVEL, slot);
-   __rmap_write_protect(kvm, rmapp, false);
+   rmap_write_protect_lockless(rmapp);
 
/* clear the first set bit */
mask &= mask - 1;
@@ -4547,7 +4571,7 @@ int kvm_mmu_setup(struct kvm_vcpu *vcpu)
return init_kvm_mmu(vcpu);
 }
 
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
+void kvm_mmu_slot_remove_write_access_lockless(struct kvm *kvm, int slot)
 {
struct kvm_memory_slot *memslot;
gfn_t last_gfn;
@@ -4556,8 +4580,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 
int slot)
memslot = id_to_memslot(kvm->memslots, slot);
last_gfn = memslot->base_gfn + memslot->npages - 1;
 
-   spin_lock(&kvm->mmu_lock);
-
+   kvm_mmu_rcu_free_page_begin(kvm);
for (i = PT_PAGE_TABLE_LEVEL;
 i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
unsigned long *rmapp;
@@ -4567,15 +4590,15 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 
int slot)
last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
 
for (index = 0; index <= last_index; ++index, ++rmapp) {
-   if (*rmapp)
-   __rmap_write_protect(kvm, rmapp, false);
+   rmap_write_protect_lockless(rmapp);
 

[PATCH v2 14/15] KVM: MMU: clean up spte_write_protect

2013-09-05 Thread Xiao Guangrong
Now the only user of spte_write_protect is rmap_write_protect, which
always calls spte_write_protect with pt_protect = true, so drop
pt_protect and the unused parameter @kvm

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 19 ---
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 44b7822..f3f17a0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1330,8 +1330,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
*sptep)
 }
 
 /*
- * Write-protect on the specified @sptep, @pt_protect indicates whether
- * spte write-protection is caused by protecting shadow page table.
+ * Write-protect on the specified @sptep.
  *
  * Note: write protection is difference between drity logging and spte
  * protection:
@@ -1342,25 +1341,23 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
*sptep)
  *
  * Return true if tlb need be flushed.
  */
-static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
+static bool spte_write_protect(u64 *sptep)
 {
u64 spte = *sptep;
 
if (!is_writable_pte(spte) &&
- !(pt_protect && spte_is_locklessly_modifiable(spte)))
+ !spte_is_locklessly_modifiable(spte))
return false;
 
rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
 
-   if (pt_protect)
-   spte &= ~SPTE_MMU_WRITEABLE;
-   spte = spte & ~PT_WRITABLE_MASK;
+   spte &= ~SPTE_MMU_WRITEABLE;
+   spte &= ~PT_WRITABLE_MASK;
 
return mmu_spte_update(sptep, spte);
 }
 
-static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
-bool pt_protect)
+static bool __rmap_write_protect(unsigned long *rmapp)
 {
u64 *sptep;
struct rmap_iterator iter;
@@ -1369,7 +1366,7 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
BUG_ON(!(*sptep & PT_PRESENT_MASK));
 
-   flush |= spte_write_protect(kvm, sptep, pt_protect);
+   flush |= spte_write_protect(sptep);
sptep = rmap_get_next(&iter);
}
 
@@ -1438,7 +1435,7 @@ static bool rmap_write_protect(struct kvm *kvm, u64 gfn)
for (i = PT_PAGE_TABLE_LEVEL;
 i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
rmapp = __gfn_to_rmap(gfn, i, slot);
-   write_protected |= __rmap_write_protect(kvm, rmapp, true);
+   write_protected |= __rmap_write_protect(rmapp);
}
 
return write_protected;
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 11/15] KVM: MMU: reintroduce kvm_mmu_isolate_page()

2013-09-05 Thread Xiao Guangrong
It was removed by commit 834be0d83. Now we will need it to do lockless shadow
page walking protected by rcu, so reintroduce it
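
A tiny sketch of the resulting split (hypothetical userspace structures; a
mutex stands in for mmu_lock and the deferred free for the RCU callback):
unlinking from the lookup structures happens under the lock, while freeing the
memory can happen later, outside it:

#include <pthread.h>
#include <stdlib.h>

struct page {
	struct page *next;	/* hash/list linkage */
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static struct page *hash_head;

/* Phase 1: make the page unreachable; must hold the lock. */
static void isolate_page(struct page *p)
{
	struct page **pp;

	for (pp = &hash_head; *pp; pp = &(*pp)->next) {
		if (*pp == p) {
			*pp = p->next;
			break;
		}
	}
}

/* Phase 2: release the memory; may run later, without the lock. */
static void free_page_struct(struct page *p)
{
	free(p);
}

int main(void)
{
	struct page *p = calloc(1, sizeof(*p));

	hash_head = p;

	pthread_mutex_lock(&lock);
	isolate_page(p);
	pthread_mutex_unlock(&lock);

	free_page_struct(p);	/* e.g. from an RCU callback */
	return 0;
}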

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 23 ---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index fe80019..2bf450a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1675,14 +1675,30 @@ static inline void kvm_mod_used_mmu_pages(struct kvm 
*kvm, int nr)
percpu_counter_add(&kvm_total_used_mmu_pages, nr);
 }
 
-static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
+/*
+ * Remove the sp from shadow page cache, after call it,
+ * we can not find this sp from the cache, and the shadow
+ * page table is still valid.
+ *
+ * It should be under the protection of mmu lock.
+ */
+static void kvm_mmu_isolate_page(struct kvm_mmu_page *sp)
 {
ASSERT(is_empty_shadow_page(sp->spt));
+
hlist_del(&sp->hash_link);
-   list_del(&sp->link);
-   free_page((unsigned long)sp->spt);
if (!sp->role.direct)
free_page((unsigned long)sp->gfns);
+}
+
+/*
+ * Free the shadow page table and the sp, we can do it
+ * out of the protection of mmu lock.
+ */
+static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
+{
+   list_del(&sp->link);
+   free_page((unsigned long)sp->spt);
kmem_cache_free(mmu_page_header_cache, sp);
 }
 
@@ -2361,6 +2377,7 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 
list_for_each_entry_safe(sp, nsp, invalid_list, link) {
WARN_ON(!sp->role.invalid || sp->root_count);
+   kvm_mmu_isolate_page(sp);
kvm_mmu_free_page(sp);
}
 }
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 15/15] KVM: MMU: use rcu functions to access the pointer

2013-09-05 Thread Xiao Guangrong
Use rcu_assign_pointer() to update all the pointers in desc
and use rcu_dereference() to locklessly read the pointers
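
A minimal userspace analogue of the publish/read pair (illustrative only; C11
release and acquire operations stand in for rcu_assign_pointer() and
rcu_dereference(), and no grace-period machinery is modelled):

#include <stdatomic.h>
#include <stdio.h>

struct node {
	int val;
};

static _Atomic(struct node *) shared;

/* Publisher: initialize the node, then publish it with release semantics. */
static void publish(struct node *n, int val)
{
	n->val = val;
	atomic_store_explicit(&shared, n, memory_order_release);
}

/* Reader: load the pointer with acquire semantics before dereferencing. */
static int read_val(void)
{
	struct node *n = atomic_load_explicit(&shared, memory_order_acquire);

	return n ? n->val : -1;
}

int main(void)
{
	static struct node n;

	publish(&n, 42);
	printf("%d\n", read_val());
	return 0;
}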

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 46 --
 1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f3f17a0..808c2d9 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -927,12 +927,23 @@ static void pte_list_desc_ctor(void *p)
desc->more = NULL;
 }
 
+#define rcu_assign_pte_list(pte_list_p, value) \
+   rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p),   \
+ (unsigned long *)(value))
+
+#define rcu_assign_desc_more(morep, value) \
+   rcu_assign_pointer(*(unsigned long __rcu **)&morep, \
+ (unsigned long *)value)
+
+#define rcu_assign_spte(sptep, value)  \
+   rcu_assign_pointer(*(u64 __rcu **)&sptep, (u64 *)value)
+
 static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc 
*desc)
 {
unsigned long marker;
 
marker = (unsigned long)pte_list | 1UL;
-   desc->more = (struct pte_list_desc *)marker;
+   rcu_assign_desc_more(desc->more, (struct pte_list_desc *)marker);
 }
 
 static bool desc_is_a_nulls(struct pte_list_desc *desc)
@@ -989,10 +1000,6 @@ static int count_spte_number(struct pte_list_desc *desc)
return first_free + desc_num * PTE_LIST_EXT;
 }
 
-#define rcu_assign_pte_list(pte_list_p, value) \
-   rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p),   \
-   (unsigned long *)(value))
-
 /*
  * Pte mapping structures:
  *
@@ -1019,8 +1026,8 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
if (!(*pte_list & 1)) {
rmap_printk("pte_list_add: %p %llx 1->many\n", spte, *spte);
desc = mmu_alloc_pte_list_desc(vcpu);
-   desc->sptes[0] = (u64 *)*pte_list;
-   desc->sptes[1] = spte;
+   rcu_assign_spte(desc->sptes[0], *pte_list);
+   rcu_assign_spte(desc->sptes[1], spte);
desc_mark_nulls(pte_list, desc);
rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
return 1;
@@ -1033,13 +1040,13 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 
*spte,
if (desc->sptes[PTE_LIST_EXT - 1]) {
struct pte_list_desc *new_desc;
new_desc = mmu_alloc_pte_list_desc(vcpu);
-   new_desc->more = desc;
+   rcu_assign_desc_more(new_desc->more, desc);
desc = new_desc;
rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
}
 
free_pos = find_first_free(desc);
-   desc->sptes[free_pos] = spte;
+   rcu_assign_spte(desc->sptes[free_pos], spte);
return count_spte_number(desc) - 1;
 }
 
@@ -1057,8 +1064,8 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
 * Move the entry from the first desc to this position we want
 * to remove.
 */
-   desc->sptes[i] = first_desc->sptes[last_used];
-   first_desc->sptes[last_used] = NULL;
+   rcu_assign_spte(desc->sptes[i], first_desc->sptes[last_used]);
+   rcu_assign_spte(first_desc->sptes[last_used], NULL);
 
/* No valid entry in this desc, we can free this desc now. */
if (!first_desc->sptes[0]) {
@@ -1070,7 +1077,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
WARN_ON(desc_is_a_nulls(next_desc));
 
mmu_free_pte_list_desc(first_desc);
-   *pte_list = (unsigned long)next_desc | 1ul;
+   rcu_assign_pte_list(pte_list, (unsigned long)next_desc | 1ul);
return;
}
 
@@ -1079,8 +1086,8 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
 * then the desc can be freed.
 */
if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
-   *pte_list = (unsigned long)first_desc->sptes[0];
-   first_desc->sptes[0] = NULL;
+   rcu_assign_pte_list(pte_list, first_desc->sptes[0]);
+   rcu_assign_spte(first_desc->sptes[0], NULL);
mmu_free_pte_list_desc(first_desc);
}
 }
@@ -1102,7 +1109,7 @@ static void pte_list_remove(u64 *spte, unsigned long 
*pte_list)
pr_err("pte_list_remove:  %p 1->BUG\n", spte);
BUG();
}
-   *pte_list = 0;
+   rcu_assign_pte_list(pte_list, 0);
return;
}
 
@@ -1174,9 +1181,12 @@ restart:
 * used in the rmap when a spte is removed. Otherwise the
 * moved entry will be missed.
 */
-   for (i = PTE_LIST_EXT - 1; i >= 0; i--)
-   if (desc->sptes[i])
-  

Re: Live migration makes VM unusable

2013-09-05 Thread Maciej Gałkiewicz
I have run some experiments as you suggested. Migrating the VM back to the
node where it worked fine does not help. I found this in the VM's logs:

Clocksource tsc unstable (delta = 123652847 ns)

It was logged after the first live migration but did not break the VM. I
continued to migrate it back and forth and after a few migrations it broke
as usual. Nothing more in the logs. I was experimenting with the clock
source:

http://liquidat.wordpress.com/2013/04/02/howto-fixing-unstable-clocksource-in-virtualised-centos/

It did not help. I have tried changing the clock source manually but it
still did not restore the VM's functionality.

regards
-- 
Maciej Gałkiewicz
Shelly Cloud Sp. z o. o., Sysadmin
http://shellycloud.com/, mac...@shellycloud.com
KRS: 440358 REGON: 101504426
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[no subject]

2013-09-05 Thread Allen and Violet Large


Dear Sir/Madam,

 This is my fifth times of written you this email since last year
till date but no response from you.Hope you get this one, as this is a
personal email directed to you. My wife and I won a Jackpot Lottery of
$11.3 million in July and have voluntarily decided to donate the sum of
$500,000.00 USD to you as part of our own charity project to improve the
lot of 10 lucky individuals all over the world. If you have received this
email then you are one of the lucky recipients and all you have to do is
get back with us so that we can send your details to the payout
bank.Please note that you have to contact my private email for more
informations(allenvioletlarge...@yahoo.co.uk )


 You can verify this by visiting the web pages below.

http://www.dailymail.co.uk/news/article-1326473/Canadian-couple-Allen-Violet-
 Large-away-entire-11-2m-lottery-win.html

 Good-luck,
 Allen and Violet Large
 Email:allenvioletlarge...@yahoo.co.uk

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] KVM: nVMX: Fully support of nested VMX preemption timer

2013-09-05 Thread Arthur Chunqi Li
On Thu, Sep 5, 2013 at 5:24 PM, Zhang, Yang Z  wrote:
> Arthur Chunqi Li wrote on 2013-09-05:
>> On Thu, Sep 5, 2013 at 3:45 PM, Zhang, Yang Z 
>> wrote:
>> > Arthur Chunqi Li wrote on 2013-09-04:
>> >> This patch contains the following two changes:
>> >> 1. Fix the bug in nested preemption timer support. If vmexit L2->L0
>> >> with some reasons not emulated by L1, preemption timer value should
>> >> be save in such exits.
>> >> 2. Add support of "Save VMX-preemption timer value" VM-Exit controls
>> >> to nVMX.
>> >>
>> >> With this patch, nested VMX preemption timer features are fully supported.
>> >>
>> >> Signed-off-by: Arthur Chunqi Li 
>> >> ---
>> >> This series depends on queue.
>> >>
>> >>  arch/x86/include/uapi/asm/msr-index.h |1 +
>> >>  arch/x86/kvm/vmx.c|   51
>> >> ++---
>> >>  2 files changed, 48 insertions(+), 4 deletions(-)
>> >>
>> >> diff --git a/arch/x86/include/uapi/asm/msr-index.h
>> >> b/arch/x86/include/uapi/asm/msr-index.h
>> >> index bb04650..b93e09a 100644
>> >> --- a/arch/x86/include/uapi/asm/msr-index.h
>> >> +++ b/arch/x86/include/uapi/asm/msr-index.h
>> >> @@ -536,6 +536,7 @@
>> >>
>> >>  /* MSR_IA32_VMX_MISC bits */
>> >>  #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL <<
>> 29)
>> >> +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE   0x1F
>> >>  /* AMD-V MSRs */
>> >>
>> >>  #define MSR_VM_CR   0xc0010114
>> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index
>> >> 1f1da43..870caa8
>> >> 100644
>> >> --- a/arch/x86/kvm/vmx.c
>> >> +++ b/arch/x86/kvm/vmx.c
>> >> @@ -2204,7 +2204,14 @@ static __init void
>> >> nested_vmx_setup_ctls_msrs(void)  #ifdef CONFIG_X86_64
>> >>   VM_EXIT_HOST_ADDR_SPACE_SIZE |  #endif
>> >> - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT;
>> >> + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
>> >> + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
>> >> + if (!(nested_vmx_pinbased_ctls_high &
>> >> PIN_BASED_VMX_PREEMPTION_TIMER))
>> >> + nested_vmx_exit_ctls_high &=
>> >> + (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
>> >> + if (!(nested_vmx_exit_ctls_high &
>> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER))
>> >> + nested_vmx_pinbased_ctls_high &=
>> >> + (~PIN_BASED_VMX_PREEMPTION_TIMER);
>> > The following logic is more clearly:
>> > if(nested_vmx_pinbased_ctls_high &
>> PIN_BASED_VMX_PREEMPTION_TIMER)
>> > nested_vmx_exit_ctls_high |=
>> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER
>> Here I have such consideration: this logic is wrong if CPU support
>> PIN_BASED_VMX_PREEMPTION_TIMER but doesn't support
>> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER, though I don't know if this does
>> occurs. So the codes above reads the MSR and reserves the features it
>> supports, and here I just check if these two features are supported
>> simultaneously.
>>
> No. Only VM_EXIT_SAVE_VMX_PREEMPTION_TIMER depends on 
> PIN_BASED_VMX_PREEMPTION_TIMER. PIN_BASED_VMX_PREEMPTION_TIMER is an 
> independent feature
>
>> You remind that this piece of codes can write like this:
>> if (!(nested_vmx_pin_based_ctls_high &
>> PIN_BASED_VMX_PREEMPTION_TIMER) ||
>> !(nested_vmx_exit_ctls_high &
>> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) {
>> nested_vmx_exit_ctls_high
>> &=(~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
>> nested_vmx_pinbased_ctls_high &=
>> (~PIN_BASED_VMX_PREEMPTION_TIMER);
>> }
>>
>> This may reflect the logic I describe above that these two flags should 
>> support
>> simultaneously, and brings less confusion.
>> >
>> > BTW: I don't see nested_vmx_setup_ctls_msrs() considers the hardware's
>> capability when expose those vmx features(not just preemption timer) to L1.
>> The codes just above here, when setting pinbased control for nested vmx, it
>> firstly "rdmsr MSR_IA32_VMX_PINBASED_CTLS", then use this to mask the
>> features hardware not support. So does other control fields.
>> >
> Yes, I saw it.
>
>> >>   nested_vmx_exit_ctls_high |=
>> >> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>> >> VM_EXIT_LOAD_IA32_EFER);
>> >>
>> >> @@ -6707,6 +6714,23 @@ static void vmx_get_exit_info(struct kvm_vcpu
>> >> *vcpu, u64 *info1, u64 *info2)
>> >>   *info2 = vmcs_read32(VM_EXIT_INTR_INFO);  }
>> >>
>> >> +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu) {
>> >> + u64 delta_tsc_l1;
>> >> + u32 preempt_val_l1, preempt_val_l2, preempt_scale;
>> >> +
>> >> + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) &
>> >> +
>> MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE;
>> >> + preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE);
>> >> + delta_tsc_l1 = kvm_x86_ops->read_l1_tsc(vcpu,
>> >> + native_read_tsc()) - vcpu->arch.last_guest_tsc;
>> >> + preempt_val_l1 = delta_tsc_l1 >> preempt_scale;
>> >> + if (preempt_val_l2 - preempt_val_l1 < 0)
>> >> + preempt_val_l2 = 0;
>> 

RE: [PATCH v3] KVM: nVMX: Fully support of nested VMX preemption timer

2013-09-05 Thread Zhang, Yang Z
Arthur Chunqi Li wrote on 2013-09-05:
> On Thu, Sep 5, 2013 at 3:45 PM, Zhang, Yang Z 
> wrote:
> > Arthur Chunqi Li wrote on 2013-09-04:
> >> This patch contains the following two changes:
> >> 1. Fix the bug in nested preemption timer support. If vmexit L2->L0
> >> with some reasons not emulated by L1, preemption timer value should
> >> be save in such exits.
> >> 2. Add support of "Save VMX-preemption timer value" VM-Exit controls
> >> to nVMX.
> >>
> >> With this patch, nested VMX preemption timer features are fully supported.
> >>
> >> Signed-off-by: Arthur Chunqi Li 
> >> ---
> >> This series depends on queue.
> >>
> >>  arch/x86/include/uapi/asm/msr-index.h |1 +
> >>  arch/x86/kvm/vmx.c|   51
> >> ++---
> >>  2 files changed, 48 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/arch/x86/include/uapi/asm/msr-index.h
> >> b/arch/x86/include/uapi/asm/msr-index.h
> >> index bb04650..b93e09a 100644
> >> --- a/arch/x86/include/uapi/asm/msr-index.h
> >> +++ b/arch/x86/include/uapi/asm/msr-index.h
> >> @@ -536,6 +536,7 @@
> >>
> >>  /* MSR_IA32_VMX_MISC bits */
> >>  #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL <<
> 29)
> >> +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE   0x1F
> >>  /* AMD-V MSRs */
> >>
> >>  #define MSR_VM_CR   0xc0010114
> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index
> >> 1f1da43..870caa8
> >> 100644
> >> --- a/arch/x86/kvm/vmx.c
> >> +++ b/arch/x86/kvm/vmx.c
> >> @@ -2204,7 +2204,14 @@ static __init void
> >> nested_vmx_setup_ctls_msrs(void)  #ifdef CONFIG_X86_64
> >>   VM_EXIT_HOST_ADDR_SPACE_SIZE |  #endif
> >> - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT;
> >> + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
> >> + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
> >> + if (!(nested_vmx_pinbased_ctls_high &
> >> PIN_BASED_VMX_PREEMPTION_TIMER))
> >> + nested_vmx_exit_ctls_high &=
> >> + (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
> >> + if (!(nested_vmx_exit_ctls_high &
> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER))
> >> + nested_vmx_pinbased_ctls_high &=
> >> + (~PIN_BASED_VMX_PREEMPTION_TIMER);
> > The following logic is more clearly:
> > if(nested_vmx_pinbased_ctls_high &
> PIN_BASED_VMX_PREEMPTION_TIMER)
> > nested_vmx_exit_ctls_high |=
> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER
> Here I have such consideration: this logic is wrong if CPU support
> PIN_BASED_VMX_PREEMPTION_TIMER but doesn't support
> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER, though I don't know if this does
> occurs. So the codes above reads the MSR and reserves the features it
> supports, and here I just check if these two features are supported
> simultaneously.
> 
No. Only VM_EXIT_SAVE_VMX_PREEMPTION_TIMER depends on 
PIN_BASED_VMX_PREEMPTION_TIMER. PIN_BASED_VMX_PREEMPTION_TIMER is an 
independent feature

> You remind that this piece of codes can write like this:
> if (!(nested_vmx_pin_based_ctls_high &
> PIN_BASED_VMX_PREEMPTION_TIMER) ||
> !(nested_vmx_exit_ctls_high &
> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) {
> nested_vmx_exit_ctls_high
> &=(~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
> nested_vmx_pinbased_ctls_high &=
> (~PIN_BASED_VMX_PREEMPTION_TIMER);
> }
> 
> This may reflect the logic I describe above that these two flags should 
> support
> simultaneously, and brings less confusion.
> >
> > BTW: I don't see nested_vmx_setup_ctls_msrs() considers the hardware's
> capability when expose those vmx features(not just preemption timer) to L1.
> The codes just above here, when setting pinbased control for nested vmx, it
> firstly "rdmsr MSR_IA32_VMX_PINBASED_CTLS", then use this to mask the
> features hardware not support. So does other control fields.
> >
Yes, I saw it.

> >>   nested_vmx_exit_ctls_high |=
> >> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
> >> VM_EXIT_LOAD_IA32_EFER);
> >>
> >> @@ -6707,6 +6714,23 @@ static void vmx_get_exit_info(struct kvm_vcpu
> >> *vcpu, u64 *info1, u64 *info2)
> >>   *info2 = vmcs_read32(VM_EXIT_INTR_INFO);  }
> >>
> >> +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu) {
> >> + u64 delta_tsc_l1;
> >> + u32 preempt_val_l1, preempt_val_l2, preempt_scale;
> >> +
> >> + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) &
> >> +
> MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE;
> >> + preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE);
> >> + delta_tsc_l1 = kvm_x86_ops->read_l1_tsc(vcpu,
> >> + native_read_tsc()) - vcpu->arch.last_guest_tsc;
> >> + preempt_val_l1 = delta_tsc_l1 >> preempt_scale;
> >> + if (preempt_val_l2 - preempt_val_l1 < 0)
> >> + preempt_val_l2 = 0;
> >> + else
> >> + preempt_val_l2 -= preempt_val_l1;
> >> + vmcs_write32(VMX_PREEMPTION_TIMER_VALUE,
> preempt_val_l2); }
> >>  /*
> >>   * The guest has e

Re: [PATCH] kvm-unit-tests: VMX: Test suite for preemption timer

2013-09-05 Thread Arthur Chunqi Li
Hi Jan, Gleb and Paolo,

It suddenly occurred to me: what will happen if the guest's PIN_PREEMPT is
disabled while EXI_SAVE_PREEMPT_VALUE is enabled? The preemption timer value
in the vmcs will not be affected, yes?

This case is not covered by the tests in this patch.

Arthur

On Wed, Sep 4, 2013 at 11:26 PM, Arthur Chunqi Li  wrote:
> Test cases for preemption timer in nested VMX. Two aspects are tested:
> 1. Save preemption timer on VMEXIT if relevant bit set in EXIT_CONTROL
> 2. Test a relevant bug of KVM. The bug will not save preemption timer
> value if exit L2->L0 for some reason and enter L0->L2. Thus preemption
> timer will never trigger if the value is large enough.
>
> Signed-off-by: Arthur Chunqi Li 
> ---
>  x86/vmx.h   |3 ++
>  x86/vmx_tests.c |  117 
> +++
>  2 files changed, 120 insertions(+)
>
> diff --git a/x86/vmx.h b/x86/vmx.h
> index 28595d8..ebc8cfd 100644
> --- a/x86/vmx.h
> +++ b/x86/vmx.h
> @@ -210,6 +210,7 @@ enum Encoding {
> GUEST_ACTV_STATE= 0x4826ul,
> GUEST_SMBASE= 0x4828ul,
> GUEST_SYSENTER_CS   = 0x482aul,
> +   PREEMPT_TIMER_VALUE = 0x482eul,
>
> /* 32-Bit Host State Fields */
> HOST_SYSENTER_CS= 0x4c00ul,
> @@ -331,6 +332,7 @@ enum Ctrl_exi {
> EXI_LOAD_PERF   = 1UL << 12,
> EXI_INTA= 1UL << 15,
> EXI_LOAD_EFER   = 1UL << 21,
> +   EXI_SAVE_PREEMPT= 1UL << 22,
>  };
>
>  enum Ctrl_ent {
> @@ -342,6 +344,7 @@ enum Ctrl_pin {
> PIN_EXTINT  = 1ul << 0,
> PIN_NMI = 1ul << 3,
> PIN_VIRT_NMI= 1ul << 5,
> +   PIN_PREEMPT = 1ul << 6,
>  };
>
>  enum Ctrl0 {
> diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
> index c1b39f4..d358148 100644
> --- a/x86/vmx_tests.c
> +++ b/x86/vmx_tests.c
> @@ -1,4 +1,30 @@
>  #include "vmx.h"
> +#include "msr.h"
> +#include "processor.h"
> +
> +volatile u32 stage;
> +
> +static inline void vmcall()
> +{
> +   asm volatile("vmcall");
> +}
> +
> +static inline void set_stage(u32 s)
> +{
> +   barrier();
> +   stage = s;
> +   barrier();
> +}
> +
> +static inline u32 get_stage()
> +{
> +   u32 s;
> +
> +   barrier();
> +   s = stage;
> +   barrier();
> +   return s;
> +}
>
>  void basic_init()
>  {
> @@ -76,6 +102,95 @@ int vmenter_exit_handler()
> return VMX_TEST_VMEXIT;
>  }
>
> +u32 preempt_scale;
> +volatile unsigned long long tsc_val;
> +volatile u32 preempt_val;
> +
> +void preemption_timer_init()
> +{
> +   u32 ctrl_pin;
> +
> +   ctrl_pin = vmcs_read(PIN_CONTROLS) | PIN_PREEMPT;
> +   ctrl_pin &= ctrl_pin_rev.clr;
> +   vmcs_write(PIN_CONTROLS, ctrl_pin);
> +   preempt_val = 1000;
> +   vmcs_write(PREEMPT_TIMER_VALUE, preempt_val);
> +   preempt_scale = rdmsr(MSR_IA32_VMX_MISC) & 0x1F;
> +}
> +
> +void preemption_timer_main()
> +{
> +   tsc_val = rdtsc();
> +   if (!(ctrl_pin_rev.clr & PIN_PREEMPT)) {
> +   printf("\tPreemption timer is not supported\n");
> +   return;
> +   }
> +   if (!(ctrl_exit_rev.clr & EXI_SAVE_PREEMPT))
> +   printf("\tSave preemption value is not supported\n");
> +   else {
> +   set_stage(0);
> +   vmcall();
> +   if (get_stage() == 1)
> +   vmcall();
> +   }
> +   while (1) {
> +   if (((rdtsc() - tsc_val) >> preempt_scale)
> +   > 10 * preempt_val) {
> +   report("Preemption timer", 0);
> +   break;
> +   }
> +   }
> +}
> +
> +int preemption_timer_exit_handler()
> +{
> +   u64 guest_rip;
> +   ulong reason;
> +   u32 insn_len;
> +   u32 ctrl_exit;
> +
> +   guest_rip = vmcs_read(GUEST_RIP);
> +   reason = vmcs_read(EXI_REASON) & 0xff;
> +   insn_len = vmcs_read(EXI_INST_LEN);
> +   switch (reason) {
> +   case VMX_PREEMPT:
> +   if (((rdtsc() - tsc_val) >> preempt_scale) < preempt_val)
> +   report("Preemption timer", 0);
> +   else
> +   report("Preemption timer", 1);
> +   return VMX_TEST_VMEXIT;
> +   case VMX_VMCALL:
> +   switch (get_stage()) {
> +   case 0:
> +   if (vmcs_read(PREEMPT_TIMER_VALUE) != preempt_val)
> +   report("Save preemption value", 0);
> +   else {
> +   set_stage(get_stage() + 1);
> +   ctrl_exit = (vmcs_read(EXI_CONTROLS) |
> +   EXI_SAVE_PREEMPT) & ctrl_exit_rev.clr;
> +   vmcs_write(EXI_CONTROLS, ctrl_exit);
> +   }
> +   break;
> +   case 1:
> +   if (vmcs_

Re: [PATCH v3] KVM: nVMX: Fully support of nested VMX preemption timer

2013-09-05 Thread Arthur Chunqi Li
On Thu, Sep 5, 2013 at 3:45 PM, Zhang, Yang Z  wrote:
> Arthur Chunqi Li wrote on 2013-09-04:
>> This patch contains the following two changes:
>> 1. Fix the bug in nested preemption timer support. If vmexit L2->L0 with some
>> reasons not emulated by L1, preemption timer value should be save in such
>> exits.
>> 2. Add support of "Save VMX-preemption timer value" VM-Exit controls to
>> nVMX.
>>
>> With this patch, nested VMX preemption timer features are fully supported.
>>
>> Signed-off-by: Arthur Chunqi Li 
>> ---
>> This series depends on queue.
>>
>>  arch/x86/include/uapi/asm/msr-index.h |1 +
>>  arch/x86/kvm/vmx.c|   51
>> ++---
>>  2 files changed, 48 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/include/uapi/asm/msr-index.h
>> b/arch/x86/include/uapi/asm/msr-index.h
>> index bb04650..b93e09a 100644
>> --- a/arch/x86/include/uapi/asm/msr-index.h
>> +++ b/arch/x86/include/uapi/asm/msr-index.h
>> @@ -536,6 +536,7 @@
>>
>>  /* MSR_IA32_VMX_MISC bits */
>>  #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL << 29)
>> +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE   0x1F
>>  /* AMD-V MSRs */
>>
>>  #define MSR_VM_CR   0xc0010114
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 1f1da43..870caa8
>> 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -2204,7 +2204,14 @@ static __init void
>> nested_vmx_setup_ctls_msrs(void)  #ifdef CONFIG_X86_64
>>   VM_EXIT_HOST_ADDR_SPACE_SIZE |
>>  #endif
>> - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT;
>> + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
>> + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
>> + if (!(nested_vmx_pinbased_ctls_high &
>> PIN_BASED_VMX_PREEMPTION_TIMER))
>> + nested_vmx_exit_ctls_high &=
>> + (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
>> + if (!(nested_vmx_exit_ctls_high &
>> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER))
>> + nested_vmx_pinbased_ctls_high &=
>> + (~PIN_BASED_VMX_PREEMPTION_TIMER);
> The following logic is more clearly:
> if(nested_vmx_pinbased_ctls_high & PIN_BASED_VMX_PREEMPTION_TIMER)
> nested_vmx_exit_ctls_high |= VM_EXIT_SAVE_VMX_PREEMPTION_TIMER
My consideration here is that this logic is wrong if the CPU supports
PIN_BASED_VMX_PREEMPTION_TIMER but doesn't support
VM_EXIT_SAVE_VMX_PREEMPTION_TIMER, though I don't know whether that ever
occurs. So the code above reads the MSR and keeps only the features the
hardware supports, and here I just check that these two features are
supported simultaneously.

You remind me that this piece of code can be written like this:
if (!(nested_vmx_pin_based_ctls_high & PIN_BASED_VMX_PREEMPTION_TIMER) ||
!(nested_vmx_exit_ctls_high &
VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) {
nested_vmx_exit_ctls_high &=(~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
nested_vmx_pinbased_ctls_high &= (~PIN_BASED_VMX_PREEMPTION_TIMER);
}

This may better reflect the logic I describe above, namely that these two
flags must be supported simultaneously, and it brings less confusion.
>
> BTW: I don't see nested_vmx_setup_ctls_msrs() considers the hardware's 
> capability when expose those vmx features(not just preemption timer) to L1.
The code just above, when setting the pin-based controls for nested
vmx, first does "rdmsr MSR_IA32_VMX_PINBASED_CTLS" and then uses the result
to mask out the features the hardware does not support. The other control
fields do the same.
>
>>   nested_vmx_exit_ctls_high |=
>> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>> VM_EXIT_LOAD_IA32_EFER);
>>
>> @@ -6707,6 +6714,23 @@ static void vmx_get_exit_info(struct kvm_vcpu
>> *vcpu, u64 *info1, u64 *info2)
>>   *info2 = vmcs_read32(VM_EXIT_INTR_INFO);  }
>>
>> +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu) {
>> + u64 delta_tsc_l1;
>> + u32 preempt_val_l1, preempt_val_l2, preempt_scale;
>> +
>> + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) &
>> + MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE;
>> + preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE);
>> + delta_tsc_l1 = kvm_x86_ops->read_l1_tsc(vcpu,
>> + native_read_tsc()) - vcpu->arch.last_guest_tsc;
>> + preempt_val_l1 = delta_tsc_l1 >> preempt_scale;
>> + if (preempt_val_l2 - preempt_val_l1 < 0)
>> + preempt_val_l2 = 0;
>> + else
>> + preempt_val_l2 -= preempt_val_l1;
>> + vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, preempt_val_l2); }
>>  /*
>>   * The guest has exited.  See if we can fix it or if we need userspace
>>   * assistance.
>> @@ -6716,6 +6740,7 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
>>   struct vcpu_vmx *vmx = to_vmx(vcpu);
>>   u32 exit_reason = vmx->exit_reason;
>>   u32 vectoring_info = vmx->idt_vectoring_info;
>> + int ret;
>>
>>   /* If guest state is invalid, start emulating */
>>   if (vmx->emulation_requir

Re: OpenBSD 5.3 guest on KVM

2013-09-05 Thread Paolo Bonzini
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Il 05/09/2013 01:31, Daniel Bareiro ha scritto:
> Someone had this problem and could solve it somehow? There any
> debug information I can provide to help solve this?

For simple troubleshooting try "info status" from the QEMU monitor.

You can also try this:

http://www.linux-kvm.org/page/Tracing

You will get a large log, you can send it to me offlist or via
something like Google Drive.

Paolo

-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.19 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQIcBAEBAgAGBQJSKEEUAAoJEBvWZb6bTYby4+UP/2V99leGzgxccTS6s3LHwAk8
UomjlcPmzXZ6UlC8ppGfHkuzWm/jbhmkMGX8DYrdYOJdctugFpbOrIaPA/pC2Na4
8o2LWZ5burvztUVS8O5d4wmfLHIgOvuIFj8MQ5YJ571HiYUKq3oipQf1yL5kzbtU
1p4/0lrXRwAz++YKXFfZ3A1vIQ5Ms6WwwHljyfZMZNvafymT87cPNskX4P3aO81U
weYeTpIuTLdiBIzkA1Xqun7z3DzZszCB91UxeKTxvxZLiPHnaYocS7bTF8i710mQ
BUfBXFUTH0xt46O1QCLrNHqbojSPaUwr7pkK2x1dCYfPISN9+nmZ9BgHGZCTTVIn
Ga9kn0rdIfalpBXybgTFZBeqHimSYAou/CuVKgx4wLYMUVdE95dOFjVx3K9yOtfv
ulmKqh7yGlWB9ZNRC+qet0BcwA6ssgutcHVMa46V0vE22J0phTcBPWWpYLw18IBl
J2D5oOTEnZ3pqzxhXD2zjeD3pBBtZa+dsDLuhBIhkQI9Qcqv/N1I5oCEBOtbVBBh
aWb0vEOBjdw9+cfadxtl2VQUV6GfwasBbOktFYR7gIG3ji/Asf35ZeC1xHAFv7xw
2pxxfGxLmX0NXdwgvlXz45+V0ITK8ZzuoU4tc2JDEXUk9Q2Slb9vHwxDPtbnDeo/
uybOp9WiNmQvrA4Thg1V
=PwDG
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 60850] BUG: Bad page state in process libvirtd pfn:76000

2013-09-05 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=60850

Gleb  changed:

   What|Removed |Added

 CC||g...@redhat.com

--- Comment #1 from Gleb  ---
This is not related to KVM, so for the right people to see it I suggest
changing the assignee.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v3] KVM: nVMX: Fully support of nested VMX preemption timer

2013-09-05 Thread Zhang, Yang Z
Arthur Chunqi Li wrote on 2013-09-04:
> This patch contains the following two changes:
> 1. Fix the bug in nested preemption timer support. If vmexit L2->L0 with some
> reasons not emulated by L1, preemption timer value should be save in such
> exits.
> 2. Add support of "Save VMX-preemption timer value" VM-Exit controls to
> nVMX.
> 
> With this patch, nested VMX preemption timer features are fully supported.
> 
> Signed-off-by: Arthur Chunqi Li 
> ---
> This series depends on queue.
> 
>  arch/x86/include/uapi/asm/msr-index.h |1 +
>  arch/x86/kvm/vmx.c|   51
> ++---
>  2 files changed, 48 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/uapi/asm/msr-index.h
> b/arch/x86/include/uapi/asm/msr-index.h
> index bb04650..b93e09a 100644
> --- a/arch/x86/include/uapi/asm/msr-index.h
> +++ b/arch/x86/include/uapi/asm/msr-index.h
> @@ -536,6 +536,7 @@
> 
>  /* MSR_IA32_VMX_MISC bits */
>  #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL << 29)
> +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE   0x1F
>  /* AMD-V MSRs */
> 
>  #define MSR_VM_CR   0xc0010114
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 1f1da43..870caa8
> 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2204,7 +2204,14 @@ static __init void
> nested_vmx_setup_ctls_msrs(void)  #ifdef CONFIG_X86_64
>   VM_EXIT_HOST_ADDR_SPACE_SIZE |
>  #endif
> - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT;
> + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
> + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
> + if (!(nested_vmx_pinbased_ctls_high &
> PIN_BASED_VMX_PREEMPTION_TIMER))
> + nested_vmx_exit_ctls_high &=
> + (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
> + if (!(nested_vmx_exit_ctls_high &
> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER))
> + nested_vmx_pinbased_ctls_high &=
> + (~PIN_BASED_VMX_PREEMPTION_TIMER);
The following logic is more clearly:
if(nested_vmx_pinbased_ctls_high & PIN_BASED_VMX_PREEMPTION_TIMER)
nested_vmx_exit_ctls_high |= VM_EXIT_SAVE_VMX_PREEMPTION_TIMER

BTW: I don't see nested_vmx_setup_ctls_msrs() considers the hardware's 
capability when expose those vmx features(not just preemption timer) to L1. 

>   nested_vmx_exit_ctls_high |=
> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
> VM_EXIT_LOAD_IA32_EFER);
> 
> @@ -6707,6 +6714,23 @@ static void vmx_get_exit_info(struct kvm_vcpu
> *vcpu, u64 *info1, u64 *info2)
>   *info2 = vmcs_read32(VM_EXIT_INTR_INFO);  }
> 
> +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu) {
> + u64 delta_tsc_l1;
> + u32 preempt_val_l1, preempt_val_l2, preempt_scale;
> +
> + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) &
> + MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE;
> + preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE);
> + delta_tsc_l1 = kvm_x86_ops->read_l1_tsc(vcpu,
> + native_read_tsc()) - vcpu->arch.last_guest_tsc;
> + preempt_val_l1 = delta_tsc_l1 >> preempt_scale;
> + if (preempt_val_l2 - preempt_val_l1 < 0)
> + preempt_val_l2 = 0;
> + else
> + preempt_val_l2 -= preempt_val_l1;
> + vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, preempt_val_l2); }
>  /*
>   * The guest has exited.  See if we can fix it or if we need userspace
>   * assistance.
> @@ -6716,6 +6740,7 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
>   struct vcpu_vmx *vmx = to_vmx(vcpu);
>   u32 exit_reason = vmx->exit_reason;
>   u32 vectoring_info = vmx->idt_vectoring_info;
> + int ret;
> 
>   /* If guest state is invalid, start emulating */
>   if (vmx->emulation_required)
> @@ -6795,12 +6820,15 @@ static int vmx_handle_exit(struct kvm_vcpu
> *vcpu)
> 
>   if (exit_reason < kvm_vmx_max_exit_handlers
>   && kvm_vmx_exit_handlers[exit_reason])
> - return kvm_vmx_exit_handlers[exit_reason](vcpu);
> + ret = kvm_vmx_exit_handlers[exit_reason](vcpu);
>   else {
>   vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
>   vcpu->run->hw.hardware_exit_reason = exit_reason;
> + ret = 0;
>   }
> - return 0;
> + if (is_guest_mode(vcpu))
> + nested_adjust_preemption_timer(vcpu);
Move this forward to avoid the changes for ret.
> + return ret;
>  }
> 
>  static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr) @@
> -7518,6 +7546,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu,
> struct vmcs12 *vmcs12)  {
>   struct vcpu_vmx *vmx = to_vmx(vcpu);
>   u32 exec_control;
> + u32 exit_control;
> 
>   vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
>   vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector); @@
> -7691,7 +7720,10 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu,
> struct vmcs12 *v