[PATCH 1/2] KVM: x86: Simplify __apic_accept_irq

2013-07-25 Thread Jan Kiszka
If posted interrupts are enabled, we can no longer track if an IRQ was
coalesced based on IRR. So drop this logic also from the classic
software path and simplify apic_test_and_set_irr to apic_set_irr.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 arch/x86/kvm/lapic.c |   23 ---
 1 files changed, 8 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index afc1124..9dc3650 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -331,10 +331,10 @@ void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir)
 }
 EXPORT_SYMBOL_GPL(kvm_apic_update_irr);
 
-static inline int apic_test_and_set_irr(int vec, struct kvm_lapic *apic)
+static inline void apic_set_irr(int vec, struct kvm_lapic *apic)
 {
 	apic->irr_pending = true;
-	return apic_test_and_set_vector(vec, apic->regs + APIC_IRR);
+	apic_set_vector(vec, apic->regs + APIC_IRR);
 }
 
 static inline int apic_search_irr(struct kvm_lapic *apic)
@@ -681,28 +681,21 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
 		if (unlikely(!apic_enabled(apic)))
 			break;
 
+		result = 1;
+
 		if (dest_map)
 			__set_bit(vcpu->vcpu_id, dest_map);
 
-		if (kvm_x86_ops->deliver_posted_interrupt) {
-			result = 1;
+		if (kvm_x86_ops->deliver_posted_interrupt)
 			kvm_x86_ops->deliver_posted_interrupt(vcpu, vector);
-		} else {
-			result = !apic_test_and_set_irr(vector, apic);
-
-			if (!result) {
-				if (trig_mode)
-					apic_debug("level trig mode repeatedly "
-						   "for vector %d", vector);
-				goto out;
-			}
+		else {
+			apic_set_irr(vector, apic);
 
 			kvm_make_request(KVM_REQ_EVENT, vcpu);
 			kvm_vcpu_kick(vcpu);
 		}
-out:
 		trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode,
-					  trig_mode, vector, !result);
+					  trig_mode, vector, false);
 		break;
 
case APIC_DM_REMRD:
-- 
1.7.3.4


[PATCH 2/2] KVM: x86: Drop some unused functions from lapic

2013-07-25 Thread Jan Kiszka
Both have no users anymore.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 arch/x86/kvm/lapic.c |   10 --
 1 files changed, 0 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 9dc3650..c98f054 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -79,16 +79,6 @@ static inline void apic_set_reg(struct kvm_lapic *apic, int reg_off, u32 val)
 	*((u32 *) (apic->regs + reg_off)) = val;
 }
 
-static inline int apic_test_and_set_vector(int vec, void *bitmap)
-{
-   return test_and_set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
-}
-
-static inline int apic_test_and_clear_vector(int vec, void *bitmap)
-{
-   return test_and_clear_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
-}
-
 static inline int apic_test_vector(int vec, void *bitmap)
 {
return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
-- 
1.7.3.4


Re: [PATCH 4/4] x86: properly handle kvm emulation of hyperv

2013-07-25 Thread Paolo Bonzini
On 24/07/2013 23:37, H. Peter Anvin wrote:
 What I'm suggesting is exactly that except that the native hypervisor is 
 later in CPUID space.

Me too actually.

I was just suggesting an implementation of the idea (that takes into
account hypervisors detected by other means than CPUID).

Paolo

 KY Srinivasan k...@microsoft.com wrote:
 As Paolo suggested if there were some priority encoded, the guest could make 
 an
 informed decision. If the guest under question can run on both hypervisors A 
 and B,
 we would rather the guest discover hypervisor A when running on A and
 hypervisor B when running on B. The priority encoding could be as simple as
 surfacing the native hypervisor signature earlier in the CPUID space.
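
A minimal sketch (not from the thread) of the priority idea being discussed:
scan the hypervisor CPUID range and keep the signature found at the highest
base, so a native hypervisor surfaced later in CPUID space wins over an
emulated one. cpuid() is the kernel primitive; signature_is_known() is a
hypothetical helper. Jason's V2 4/4 patch further below implements
essentially this by turning .detect() into a priority.

static uint32_t detect_preferred_hypervisor(void)
{
	uint32_t base, eax, sig[3], best = 0;

	for (base = 0x40000000; base < 0x40010000; base += 0x100) {
		cpuid(base, &eax, &sig[0], &sig[1], &sig[2]);
		if (signature_is_known(sig))	/* hypothetical lookup */
			best = base;		/* a later leaf wins */
	}
	return best;	/* 0 if no known signature was found */
}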



Re: [PATCH] kvm-unit-tests : Basic architecture of VMX nested test case

2013-07-25 Thread Jan Kiszka
On 2013-07-25 07:31, Arthur Chunqi Li wrote:
 This is the first version of the VMX nested environment. It contains the
 basic VMX instruction test cases, including VMXON/VMXOFF/VMXPTRLD/
 VMXPTRST/VMCLEAR/VMLAUNCH/VMRESUME/VMCALL. This patch also tests the
 basic execution routine in the VMX nested environment and lets the VM print
 "Hello World" to indicate that it ran successfully.
 
 The first release also includes a test suite for vmenter (vmlaunch and
 vmresume). Besides, a hypercall mechanism is included; it is currently
 used to trigger a normal VM exit.
 
 New files added:
 x86/vmx.h : contains all VMX-related macro declarations
 x86/vmx.c : main file for VMX nested test case
 
 Signed-off-by: Arthur Chunqi Li yzt...@gmail.com

Don't forget to update your public git as well.

 ---
  config-x86-common.mak |2 +
  config-x86_64.mak |1 +
  lib/x86/msr.h |5 +
  x86/cstart64.S|4 +
  x86/unittests.cfg |6 +
  x86/vmx.c |  712 
 +
  x86/vmx.h |  479 +
  7 files changed, 1209 insertions(+)
  create mode 100644 x86/vmx.c
  create mode 100644 x86/vmx.h
 
 diff --git a/config-x86-common.mak b/config-x86-common.mak
 index 455032b..34a41e1 100644
 --- a/config-x86-common.mak
 +++ b/config-x86-common.mak
 @@ -101,6 +101,8 @@ $(TEST_DIR)/asyncpf.elf: $(cstart.o) $(TEST_DIR)/asyncpf.o
  
  $(TEST_DIR)/pcid.elf: $(cstart.o) $(TEST_DIR)/pcid.o
  
 +$(TEST_DIR)/vmx.elf: $(cstart.o) $(TEST_DIR)/vmx.o
 +
  arch_clean:
   $(RM) $(TEST_DIR)/*.o $(TEST_DIR)/*.flat $(TEST_DIR)/*.elf \
   $(TEST_DIR)/.*.d $(TEST_DIR)/lib/.*.d $(TEST_DIR)/lib/*.o
 diff --git a/config-x86_64.mak b/config-x86_64.mak
 index 4e525f5..bb8ee89 100644
 --- a/config-x86_64.mak
 +++ b/config-x86_64.mak
 @@ -9,5 +9,6 @@ tests = $(TEST_DIR)/access.flat $(TEST_DIR)/apic.flat \
 $(TEST_DIR)/xsave.flat $(TEST_DIR)/rmap_chain.flat \
 $(TEST_DIR)/pcid.flat
  tests += $(TEST_DIR)/svm.flat
 +tests += $(TEST_DIR)/vmx.flat
  
  include config-x86-common.mak
 diff --git a/lib/x86/msr.h b/lib/x86/msr.h
 index 509a421..281255a 100644
 --- a/lib/x86/msr.h
 +++ b/lib/x86/msr.h
 @@ -396,6 +396,11 @@
  #define MSR_IA32_VMX_VMCS_ENUM          0x048a
  #define MSR_IA32_VMX_PROCBASED_CTLS2    0x048b
  #define MSR_IA32_VMX_EPT_VPID_CAP       0x048c
 +#define MSR_IA32_VMX_TRUE_PIN           0x048d
 +#define MSR_IA32_VMX_TRUE_PROC          0x048e
 +#define MSR_IA32_VMX_TRUE_EXIT          0x048f
 +#define MSR_IA32_VMX_TRUE_ENTRY         0x0490
 +
  
  /* AMD-V MSRs */
  
 diff --git a/x86/cstart64.S b/x86/cstart64.S
 index 24df5f8..0fe76da 100644
 --- a/x86/cstart64.S
 +++ b/x86/cstart64.S
 @@ -4,6 +4,10 @@
  .globl boot_idt
  boot_idt = 0
  
 +.globl idt_descr
 +.globl tss_descr
 +.globl gdt64_desc
 +
  ipi_vector = 0x20
  
  max_cpus = 64
 diff --git a/x86/unittests.cfg b/x86/unittests.cfg
 index bc9643e..85c36aa 100644
 --- a/x86/unittests.cfg
 +++ b/x86/unittests.cfg
 @@ -149,3 +149,9 @@ extra_params = --append 1000 `date +%s`
  file = pcid.flat
  extra_params = -cpu qemu64,+pcid
  arch = x86_64
 +
 +[vmx]
 +file = vmx.flat
 +extra_params = -cpu host,+vmx
 +arch = x86_64
 +
 diff --git a/x86/vmx.c b/x86/vmx.c
 new file mode 100644
 index 000..ca3e117
 --- /dev/null
 +++ b/x86/vmx.c
 @@ -0,0 +1,712 @@
 +#include "libcflat.h"
 +#include "processor.h"
 +#include "vm.h"
 +#include "desc.h"
 +#include "vmx.h"
 +#include "msr.h"
 +#include "smp.h"
 +#include "io.h"
 +
 +int fails = 0, tests = 0;
 +u32 *vmxon_region;
 +struct vmcs *vmcs_root;
 +u32 vpid_cnt;
 +void *guest_stack, *guest_syscall_stack;
 +u32 ctrl_pin, ctrl_enter, ctrl_exit, ctrl_cpu[2];
 +ulong fix_cr0_set, fix_cr0_clr;
 +ulong fix_cr4_set, fix_cr4_clr;
 +struct regs regs;
 +struct vmx_test *current;
 +u64 hypercall_field = 0;
 +bool launched = 0;
 +
 +extern u64 gdt64_desc[];
 +extern u64 idt_descr[];
 +extern u64 tss_descr[];
 +extern void *vmx_return;
 +extern void *entry_sysenter;
 +extern void *guest_entry;
 +
 +static void report(const char *name, int result)
 +{
 +	++tests;
 +	if (result)
 +		printf("PASS: %s\n", name);
 +	else {
 +		printf("FAIL: %s\n", name);
 +		++fails;
 +	}
 +}
 +
 +static int vmcs_clear(struct vmcs *vmcs)
 +{
 +	bool ret;
 +	asm volatile ("vmclear %1; setbe %0" : "=q" (ret) : "m" (vmcs) : "cc");
 +	return ret;
 +}
 +
 +static u64 vmcs_read(enum Encoding enc)
 +{
 +	u64 val;
 +	asm volatile ("vmread %1, %0" : "=rm" (val) : "r" ((u64)enc) : "cc");
 +	return val;
 +}
 +
 +static int vmcs_write(enum Encoding enc, u64 val)
 +{
 +	bool ret;
 +	asm volatile ("vmwrite %1, %2; setbe %0"
 +		      : "=q"(ret) : "rm" (val), "r" ((u64)enc) : "cc");
 +	return ret;
 +}
 +
 +static int make_vmcs_current(struct vmcs *vmcs)
 +{
 +	bool ret;
 +
 +	asm volatile ("vmptrld %1; setbe %0" : "=q" (ret) : "m" (vmcs) : "cc");
 +	return ret;
 +}
 +
 +static 

Re: [PATCH 4/4] x86: properly handle kvm emulation of hyperv

2013-07-25 Thread Jason Wang
On 07/25/2013 03:59 PM, Paolo Bonzini wrote:
 On 24/07/2013 23:37, H. Peter Anvin wrote:
 What I'm suggesting is exactly that except that the native hypervisor is 
 later in CPUID space.
 Me too actually.

 I was just suggesting an implementation of the idea (that takes into
 account hypervisors detected by other means than CPUID).

 Paolo

This makes sense, will send V2.

Thanks
 KY Srinivasan k...@microsoft.com wrote:
 As Paolo suggested if there were some priority encoded, the guest could 
 make an
 informed decision. If the guest under question can run on both hypervisors 
 A and B,
 we would rather the guest discover hypervisor A when running on A and
 hypervisor B when running on B. The priority encoding could be as simple as
 surfacing the native hypervisor signature earlier in the CPUID space.



Re: [PATCH 2/2] kvm: powerpc: set cache coherency only for kernel managed pages

2013-07-25 Thread Gleb Natapov
On Wed, Jul 24, 2013 at 03:32:49PM -0500, Scott Wood wrote:
 On 07/24/2013 04:39:59 AM, Alexander Graf wrote:
 
 On 24.07.2013, at 11:35, Gleb Natapov wrote:
 
  On Wed, Jul 24, 2013 at 11:21:11AM +0200, Alexander Graf wrote:
  Are not we going to use page_is_ram() from
 e500_shadow_mas2_attrib() as Scott commented?
 
  Why aren't we using page_is_ram() in kvm_is_mmio_pfn()?
 
 
  Because it is much slower and, IIRC, actually used to build pfn
 map that allow
  us to check quickly for valid pfn.
 
 Then why should we use page_is_ram()? :)
 
 I really don't want the e500 code to diverge too much from what
 the rest of the kvm code is doing.
 
 I don't understand "actually used to build pfn map". What code
 is this?  I don't see any calls to page_is_ram() in the KVM code, or
 in generic mm code.  Is this a statement about what x86 does?
It may not be page_is_ram() directly, but the same info that page_is_ram() is
using. On power both page_is_ram() and do_init_bootmem() walk some kind
of memblock_region data structure. What is important is that pfn_valid()
does not mean that there is memory behind the page structure. See Andrea's
reply.

 
 On PPC page_is_ram() is only called (AFAICT) for determining what
 attributes to set on mmaps.  We want to be sure that KVM always
 makes the same decision.  While pfn_valid() seems like it should be
 equivalent, it's not obvious from the PPC code that it is.
 
Again pfn_valid() is not enough.

 If pfn_valid() is better, why is that not used for mmap?  Why are
 there two different names for the same thing?
 
They are not the same thing. page_is_ram() tells you if a phys address is
ram backed. pfn_valid() tells you if there is a struct page behind the
pfn. PageReserved() tells you if a pfn is marked as reserved. All non
ram pfns should be reserved, but ram pfns can be reserved too. Again,
see Andrea's reply.

Why does ppc use page_is_ram() for mmap? How should I know? But looking at
the function, it does so only as a fallback if
ppc_md.phys_mem_access_prot() is not provided. Making access to MMIO
noncached as a safe fallback makes sense. It also makes sense to allow
noncached access to reserved ram sometimes.
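
A minimal sketch (mine, not from this thread) of how the three predicates
differ; pfn_valid(), pfn_to_page() and PageReserved() are the real kernel
helpers, the wrapper name is made up:

static bool pfn_is_ordinary_ram(unsigned long pfn)
{
	/* pfn_valid(): a struct page exists, says nothing about real memory */
	if (!pfn_valid(pfn))
		return false;
	/* PageReserved(): set for all non-ram pfns, but ram can be reserved too */
	return !PageReserved(pfn_to_page(pfn));
}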

--
Gleb.


[PATCH 2/2] s390/kvm: support collaborative memory management

2013-07-25 Thread Martin Schwidefsky
From: Konstantin Weitz konstantin.we...@gmail.com

This patch enables Collaborative Memory Management (CMM) for kvm
on s390. CMM allows the guest to inform the host about page usage
(see arch/s390/mm/cmm.c). The host uses this information to avoid
swapping in unused pages in the page fault handler. Further, a CPU
provided list of unused invalid pages is processed to reclaim swap
space of not yet accessed unused pages.

[ Martin Schwidefsky: patch reordering and cleanup ]

Signed-off-by: Konstantin Weitz konstantin.we...@gmail.com
Signed-off-by: Martin Schwidefsky schwidef...@de.ibm.com
---
 arch/s390/include/asm/kvm_host.h |5 ++-
 arch/s390/include/asm/pgtable.h  |   24 
 arch/s390/kvm/kvm-s390.c |   25 +
 arch/s390/kvm/kvm-s390.h |2 +
 arch/s390/kvm/priv.c |   41 
 arch/s390/mm/pgtable.c   |   77 ++
 6 files changed, 173 insertions(+), 1 deletion(-)

diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 3238d40..de6450e 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -113,7 +113,9 @@ struct kvm_s390_sie_block {
 	__u64	gbea;			/* 0x0180 */
 	__u8	reserved188[24];	/* 0x0188 */
 	__u32	fac;			/* 0x01a0 */
-	__u8	reserved1a4[92];	/* 0x01a4 */
+	__u8	reserved1a4[20];	/* 0x01a4 */
+	__u64	cbrlo;			/* 0x01b8 */
+	__u8	reserved1c0[64];	/* 0x01c0 */
 } __attribute__((packed));
 
 struct kvm_vcpu_stat {
@@ -149,6 +151,7 @@ struct kvm_vcpu_stat {
u32 instruction_stsi;
u32 instruction_stfl;
u32 instruction_tprot;
+   u32 instruction_essa;
u32 instruction_sigp_sense;
u32 instruction_sigp_sense_running;
u32 instruction_sigp_external_call;
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 75fb726..65d48b8 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -227,6 +227,7 @@ extern unsigned long MODULES_END;
 #define _PAGE_SWR  0x008   /* SW pte referenced bit */
 #define _PAGE_SWW  0x010   /* SW pte write bit */
 #define _PAGE_SPECIAL  0x020   /* SW associated with special page */
+#define _PAGE_UNUSED   0x040   /* SW bit for ptep_clear_flush() */
 #define __HAVE_ARCH_PTE_SPECIAL
 
 /* Set of bits not changed in pte_modify */
@@ -375,6 +376,12 @@ extern unsigned long MODULES_END;
 
 #endif /* CONFIG_64BIT */
 
+/* Guest Page State used for virtualization */
+#define _PGSTE_GPS_ZERO		0x0000000080000000UL
+#define _PGSTE_GPS_USAGE_MASK	0x0000000300000000UL
+#define _PGSTE_GPS_USAGE_STABLE 0x0000000000000000UL
+#define _PGSTE_GPS_USAGE_UNUSED 0x0000000100000000UL
+
 /*
  * A user page table pointer has the space-switch-event bit, the
  * private-space-control bit and the storage-alteration-event-control
@@ -590,6 +597,12 @@ static inline int pte_file(pte_t pte)
 	return (pte_val(pte) & mask) == _PAGE_TYPE_FILE;
 }
 
+static inline int pte_swap(pte_t pte)
+{
+	unsigned long mask = _PAGE_RO | _PAGE_INVALID | _PAGE_SWT | _PAGE_SWX;
+	return (pte_val(pte) & mask) == _PAGE_TYPE_SWAP;
+}
+
 static inline int pte_special(pte_t pte)
 {
 	return (pte_val(pte) & _PAGE_SPECIAL);
@@ -794,6 +807,7 @@ unsigned long gmap_translate(unsigned long address, struct gmap *);
 unsigned long __gmap_fault(unsigned long address, struct gmap *);
 unsigned long gmap_fault(unsigned long address, struct gmap *);
 void gmap_discard(unsigned long from, unsigned long to, struct gmap *);
+void __gmap_zap(unsigned long address, struct gmap *);
 
 void gmap_register_ipte_notifier(struct gmap_notifier *);
 void gmap_unregister_ipte_notifier(struct gmap_notifier *);
@@ -825,6 +839,7 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 
 	if (mm_has_pgste(mm)) {
 		pgste = pgste_get_lock(ptep);
+		pgste_val(pgste) &= ~_PGSTE_GPS_ZERO;
pgste_set_key(ptep, pgste, entry);
pgste_set_pte(ptep, entry);
pgste_set_unlock(ptep, pgste);
@@ -858,6 +873,12 @@ static inline int pte_young(pte_t pte)
return 0;
 }
 
+#define __HAVE_ARCH_PTE_UNUSED
+static inline int pte_unused(pte_t pte)
+{
+	return pte_val(pte) & _PAGE_UNUSED;
+}
+
 /*
  * pgd/pmd/pte modification functions
  */
@@ -1142,6 +1163,9 @@ static inline pte_t ptep_clear_flush(struct vm_area_struct *vma,
 	pte_val(*ptep) = _PAGE_TYPE_EMPTY;
 
 	if (mm_has_pgste(vma->vm_mm)) {
+		if ((pgste_val(pgste) & _PGSTE_GPS_USAGE_MASK) ==
+		    _PGSTE_GPS_USAGE_UNUSED)
+			pte_val(pte) |= _PAGE_UNUSED;
pgste = pgste_update_all(pte, pgste);
pgste_set_unlock(ptep, pgste);
}
diff --git 

[PATCH 1/2] mm: add support for discard of unused ptes

2013-07-25 Thread Martin Schwidefsky
From: Konstantin Weitz konstantin.we...@gmail.com

In a virtualized environment and given an appropriate interface the guest
can mark pages as unused while they are free (for the s390 implementation
see git commit 45e576b1c3d00206 "guest page hinting light"). For the host
the unused state is a property of the pte.

This patch adds the primitive 'pte_unused' and code to the host swap out
handler so that pages marked as unused by all mappers are not swapped out
but discarded instead, thus saving one IO for swap out and potentially
another one for swap in.

[ Martin Schwidefsky: patch reordering and simplification ]

Signed-off-by: Konstantin Weitz konstantin.we...@gmail.com
Signed-off-by: Martin Schwidefsky schwidef...@de.ibm.com
---
 include/asm-generic/pgtable.h |   13 +
 mm/rmap.c |   10 ++
 2 files changed, 23 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 2f47ade..ec540c5 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -193,6 +193,19 @@ static inline int pte_same(pte_t pte_a, pte_t pte_b)
 }
 #endif
 
+#ifndef __HAVE_ARCH_PTE_UNUSED
+/*
+ * Some architectures provide facilities to virtualization guests
+ * so that they can flag allocated pages as unused. This allows the
+ * host to transparently reclaim unused pages. This function returns
+ * whether the pte's page is unused.
+ */
+static inline int pte_unused(pte_t pte)
+{
+   return 0;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PMD_SAME
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
diff --git a/mm/rmap.c b/mm/rmap.c
index cd356df..2291f25 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1234,6 +1234,16 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
}
set_pte_at(mm, address, pte,
   swp_entry_to_pte(make_hwpoison_entry(page)));
+   } else if (pte_unused(pteval)) {
+   /*
+* The guest indicated that the page content is of no
+* interest anymore. Simply discard the pte, vmscan
+* will take care of the rest.
+*/
+   if (PageAnon(page))
+   dec_mm_counter(mm, MM_ANONPAGES);
+   else
+   dec_mm_counter(mm, MM_FILEPAGES);
} else if (PageAnon(page)) {
swp_entry_t entry = { .val = page_private(page) };
 
-- 
1.7.9.5



[RFC][PATCH 0/2] s390/kvm: add kvm support for guest page hinting v2

2013-07-25 Thread Martin Schwidefsky
v1->v2:
 - found a way to simplify the common code patch

Linux on s390 as a guest under z/VM has been using the guest page
hinting interface (alias collaborative memory management) for a long
time. The full version with volatile states has been deemed to be too
complicated (see the old discussion about guest page hinting e.g. on
http://marc.info/?l=linux-mm&m=123816662017742&w=2).
What is currently implemented for the guest is the unused and stable
states to mark unallocated pages as freely available to the host.
This works just fine with z/VM as the host.

The two patches in this series implement the guest page hinting
interface for the unused and stable states in the KVM host.
Most of the code specific to s390 but there is a common memory
management part as well, see patch #1.

The code is working stable now, from my point of view this is ready
for prime-time.

Konstantin Weitz (2):
  mm: add support for discard of unused ptes
  s390/kvm: support collaborative memory management

 arch/s390/include/asm/kvm_host.h |5 ++-
 arch/s390/include/asm/pgtable.h  |   24 
 arch/s390/kvm/kvm-s390.c |   25 +
 arch/s390/kvm/kvm-s390.h |2 +
 arch/s390/kvm/priv.c |   41 
 arch/s390/mm/pgtable.c   |   77 ++
 include/asm-generic/pgtable.h|   13 +++
 mm/rmap.c|   10 +
 8 files changed, 196 insertions(+), 1 deletion(-)

-- 
1.7.9.5



[PATCH V2 4/4] x86: correctly detect hypervisor

2013-07-25 Thread Jason Wang
We try to handle hypervisor compatibility mode by detecting hypervisors
in a specific order. This is not robust, since hypervisors may implement
each other's features.

This patch tries to handle this situation by always choosing the last one in the
CPUID leaves. This is done by letting .detect() return a priority instead of
true/false and just re-using the CPUID leaf where the signature was found as
the priority (or 1 if it was found by DMI). Then we can just pick the hypervisor
with the highest priority. Other, more sophisticated detection methods could
also be implemented on top.

Suggested by H. Peter Anvin and Paolo Bonzini.

Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@redhat.com
Cc: H. Peter Anvin h...@zytor.com
Cc: x...@kernel.org
Cc: K. Y. Srinivasan k...@microsoft.com
Cc: Haiyang Zhang haiya...@microsoft.com
Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com
Cc: Jeremy Fitzhardinge jer...@goop.org
Cc: Doug Covelli dcove...@vmware.com
Cc: Borislav Petkov b...@suse.de
Cc: Dan Hecht dhe...@vmware.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Marcelo Tosatti mtosa...@redhat.com
Cc: Gleb Natapov g...@redhat.com
Cc: Paolo Bonzini pbonz...@redhat.com
Cc: Frederic Weisbecker fweis...@gmail.com
Cc: linux-ker...@vger.kernel.org
Cc: de...@linuxdriverproject.org
Cc: kvm@vger.kernel.org
Cc: xen-de...@lists.xensource.com
Cc: virtualizat...@lists.linux-foundation.org
Signed-off-by: Jason Wang jasow...@redhat.com
---
 arch/x86/include/asm/hypervisor.h |2 +-
 arch/x86/kernel/cpu/hypervisor.c  |   15 +++
 arch/x86/kernel/cpu/mshyperv.c|   13 -
 arch/x86/kernel/cpu/vmware.c  |8 
 arch/x86/kernel/kvm.c |6 ++
 arch/x86/xen/enlighten.c  |9 +++--
 6 files changed, 25 insertions(+), 28 deletions(-)

diff --git a/arch/x86/include/asm/hypervisor.h 
b/arch/x86/include/asm/hypervisor.h
index 2d4b5e6..e42f758 100644
--- a/arch/x86/include/asm/hypervisor.h
+++ b/arch/x86/include/asm/hypervisor.h
@@ -33,7 +33,7 @@ struct hypervisor_x86 {
const char  *name;
 
/* Detection routine */
-   bool(*detect)(void);
+   uint32_t(*detect)(void);
 
/* Adjust CPU feature bits (run once per CPU) */
void(*set_cpu_features)(struct cpuinfo_x86 *);
diff --git a/arch/x86/kernel/cpu/hypervisor.c b/arch/x86/kernel/cpu/hypervisor.c
index 8727921..36ce402 100644
--- a/arch/x86/kernel/cpu/hypervisor.c
+++ b/arch/x86/kernel/cpu/hypervisor.c
@@ -25,11 +25,6 @@
 #include asm/processor.h
 #include asm/hypervisor.h
 
-/*
- * Hypervisor detect order.  This is specified explicitly here because
- * some hypervisors might implement compatibility modes for other
- * hypervisors and therefore need to be detected in specific sequence.
- */
 static const __initconst struct hypervisor_x86 * const hypervisors[] =
 {
 #ifdef CONFIG_XEN_PVHVM
@@ -49,15 +44,19 @@ static inline void __init
 detect_hypervisor_vendor(void)
 {
const struct hypervisor_x86 *h, * const *p;
+   uint32_t pri, max_pri = 0;
 
 	for (p = hypervisors; p < hypervisors + ARRAY_SIZE(hypervisors); p++) {
 		h = *p;
-		if (h->detect()) {
+		pri = h->detect();
+		if (pri != 0 && pri > max_pri) {
+			max_pri = pri;
 			x86_hyper = h;
-			printk(KERN_INFO "Hypervisor detected: %s\n", h->name);
-			break;
 		}
 	}
+
+	if (max_pri)
+		printk(KERN_INFO "Hypervisor detected: %s\n", x86_hyper->name);
 }
 
 void init_hypervisor(struct cpuinfo_x86 *c)
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 8f4be53..71a39f3 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -27,20 +27,23 @@
 struct ms_hyperv_info ms_hyperv;
 EXPORT_SYMBOL_GPL(ms_hyperv);
 
-static bool __init ms_hyperv_platform(void)
+static uint32_t  __init ms_hyperv_platform(void)
 {
u32 eax;
u32 hyp_signature[3];
 
if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
-   return false;
+   return 0;
 
 	cpuid(HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS,
 	      &eax, &hyp_signature[0], &hyp_signature[1], &hyp_signature[2]);
 
-	return eax >= HYPERV_CPUID_MIN &&
-		eax <= HYPERV_CPUID_MAX &&
-		!memcmp("Microsoft Hv", hyp_signature, 12);
+	if (eax >= HYPERV_CPUID_MIN &&
+	    eax <= HYPERV_CPUID_MAX &&
+	    !memcmp("Microsoft Hv", hyp_signature, 12))
+		return HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS;
+
+	return 0;
 }
 
 static cycle_t read_hv_clock(struct clocksource *arg)
diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 7076878..628a059 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -93,7 +93,7 @@ static void __init vmware_platform_setup(void)
  * serial key should be enough, as 

[PATCH V2 1/4] x86: introduce hypervisor_cpuid_base()

2013-07-25 Thread Jason Wang
This patch introduces hypervisor_cpuid_base(), which loops over the hypervisor
CPUID leaf range until the signature matches, and checks the number of leaves
if required. This can be used by Xen/KVM guests to detect the presence of a
hypervisor.

Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@redhat.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Paolo Bonzini pbonz...@redhat.com
Cc: Gleb Natapov g...@redhat.com
Cc: x...@kernel.org
Signed-off-by: Jason Wang jasow...@redhat.com
---
Changes from V1:
- use memcpy() and uint32_t instead of strcmp()
---
 arch/x86/include/asm/processor.h |   15 +++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 24cf5ae..7763307 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -971,6 +971,21 @@ unsigned long calc_aperfmperf_ratio(struct aperfmperf *old,
return ratio;
 }
 
+static inline uint32_t hypervisor_cpuid_base(const char *sig, uint32_t leaves)
+{
+   uint32_t base, eax, signature[3];
+
+	for (base = 0x40000000; base < 0x40010000; base += 0x100) {
+		cpuid(base, &eax, &signature[0], &signature[1], &signature[2]);
+
+		if (!memcmp(sig, signature, 12) &&
+		    (leaves == 0 || ((eax - base) >= leaves)))
+   return base;
+   }
+
+   return 0;
+}
+
 extern unsigned long arch_align_stack(unsigned long sp);
 extern void free_init_pages(char *what, unsigned long begin, unsigned long 
end);
 
-- 
1.7.1



[PATCH V2 2/4] xen: switch to use hypervisor_cpuid_base()

2013-07-25 Thread Jason Wang
Switch to use hypervisor_cpuid_base() to detect Xen.

Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com
Cc: Jeremy Fitzhardinge jer...@goop.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@redhat.com
Cc: H. Peter Anvin h...@zytor.com
Cc: x...@kernel.org
Cc: Paolo Bonzini pbonz...@redhat.com
Cc: xen-de...@lists.xensource.com
Cc: virtualizat...@lists.linux-foundation.org
Signed-off-by: Jason Wang jasow...@redhat.com
---
 arch/x86/include/asm/xen/hypervisor.h |   16 +---
 1 files changed, 1 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/xen/hypervisor.h 
b/arch/x86/include/asm/xen/hypervisor.h
index 125f344..d866959 100644
--- a/arch/x86/include/asm/xen/hypervisor.h
+++ b/arch/x86/include/asm/xen/hypervisor.h
@@ -40,21 +40,7 @@ extern struct start_info *xen_start_info;
 
 static inline uint32_t xen_cpuid_base(void)
 {
-	uint32_t base, eax, ebx, ecx, edx;
-	char signature[13];
-
-	for (base = 0x40000000; base < 0x40010000; base += 0x100) {
-		cpuid(base, &eax, &ebx, &ecx, &edx);
-		*(uint32_t *)(signature + 0) = ebx;
-		*(uint32_t *)(signature + 4) = ecx;
-		*(uint32_t *)(signature + 8) = edx;
-		signature[12] = 0;
-
-		if (!strcmp("XenVMMXenVMM", signature) && ((eax - base) >= 2))
-			return base;
-	}
-
-	return 0;
+	return hypervisor_cpuid_base("XenVMMXenVMM", 2);
 }
 
 #ifdef CONFIG_XEN
-- 
1.7.1



[PATCH V2 3/4] kvm: switch to use hypervisor_cpuid_base()

2013-07-25 Thread Jason Wang
Switch to use hypervisor_cpuid_base() to detect KVM.

Cc: Gleb Natapov g...@redhat.com
Cc: Paolo Bonzini pbonz...@redhat.com
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@redhat.com
Cc: H. Peter Anvin h...@zytor.com
Cc: x...@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Jason Wang jasow...@redhat.com
---
Changes from V1:
- Introduce kvm_cpuid_base() which will be used by next patch.
---
 arch/x86/include/asm/kvm_para.h |   24 +---
 1 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 695399f..0644129 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -85,26 +85,20 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned 
long p1,
return ret;
 }
 
-static inline bool kvm_para_available(void)
+static inline uint32_t kvm_cpuid_base(void)
 {
-	unsigned int eax, ebx, ecx, edx;
-	char signature[13];
-
 	if (boot_cpu_data.cpuid_level < 0)
-		return false;	/* So we don't blow up on old processors */
+		return 0;	/* So we don't blow up on old processors */
 
-	if (cpu_has_hypervisor) {
-		cpuid(KVM_CPUID_SIGNATURE, &eax, &ebx, &ecx, &edx);
-		memcpy(signature + 0, &ebx, 4);
-		memcpy(signature + 4, &ecx, 4);
-		memcpy(signature + 8, &edx, 4);
-		signature[12] = 0;
+	if (cpu_has_hypervisor)
+		return hypervisor_cpuid_base("KVMKVMKVM\0\0\0", 0);
 
-		if (strcmp(signature, "KVMKVMKVM") == 0)
-			return true;
-	}
+	return 0;
+}
 
-	return false;
+static inline bool kvm_para_available(void)
+{
+	return kvm_cpuid_base() != 0;
 }
 
 static inline unsigned int kvm_arch_para_features(void)
-- 
1.7.1



Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-07-25 Thread Raghavendra K T

On 07/24/2013 06:06 PM, Raghavendra K T wrote:

On 07/24/2013 05:36 PM, Gleb Natapov wrote:

On Wed, Jul 24, 2013 at 05:30:20PM +0530, Raghavendra K T wrote:

On 07/24/2013 04:09 PM, Gleb Natapov wrote:

On Wed, Jul 24, 2013 at 03:15:50PM +0530, Raghavendra K T wrote:

On 07/23/2013 08:37 PM, Gleb Natapov wrote:

On Mon, Jul 22, 2013 at 11:50:16AM +0530, Raghavendra K T wrote:

+static void kvm_lock_spinning(struct arch_spinlock *lock,
__ticket_t want)

[...]

+
+/*
+ * halt until it's our turn and kicked. Note that we do safe
halt
+ * for irq enabled case to avoid hang when lock info is
overwritten
+ * in irq spinlock slowpath and no spurious interrupt occur
to save us.
+ */
+if (arch_irqs_disabled_flags(flags))
+halt();
+else
+safe_halt();
+
+out:

So here now interrupts can be either disabled or enabled. Previous
version disabled interrupts here, so are we sure it is safe to
have them
enabled at this point? I do not see any problem yet, will keep
thinking.


If we enable interrupt here, then



+cpumask_clear_cpu(cpu, &waiting_cpus);


and if we start serving lock for an interrupt that came here,
cpumask clear and w->lock = NULL may not happen atomically.
if irq spinlock does not take slow path we would have non null value
for lock, but with no information in waitingcpu.

I am still thinking what would be problem with that.


Exactly, for kicker waiting_cpus and w->lock updates are
non atomic anyway.


+w->lock = NULL;
+local_irq_restore(flags);
+spin_time_accum_blocked(start);
+}
+PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
+
+/* Kick vcpu waiting on @lock->head to reach value @ticket */
+static void kvm_unlock_kick(struct arch_spinlock *lock,
__ticket_t ticket)
+{
+int cpu;
+
+add_stats(RELEASED_SLOW, 1);
+for_each_cpu(cpu, &waiting_cpus) {
+const struct kvm_lock_waiting *w =
&per_cpu(lock_waiting, cpu);
+if (ACCESS_ONCE(w->lock) == lock &&
+ACCESS_ONCE(w->want) == ticket) {
+add_stats(RELEASED_SLOW_KICKED, 1);
+kvm_kick_cpu(cpu);

What about using NMI to wake sleepers? I think it was discussed, but
forgot why it was dismissed.


I think I have missed that discussion. 'll go back and check. so
what is the idea here? we can easily wake up the halted vcpus that
have interrupt disabled?

We can of course. IIRC the objection was that NMI handling path is very
fragile and handling NMI on each wakeup will be more expensive then
waking up a guest without injecting an event, but it is still
interesting
to see the numbers.



Haam, now I remember, We had tried request based mechanism. (new
request like REQ_UNHALT) and process that. It had worked, but had some
complex hacks in vcpu_enter_guest to avoid guest hang in case of
request cleared.  So had left it there..

https://lkml.org/lkml/2012/4/30/67

But I do not remember performance impact though.

No, this is something different. Wakeup with NMI does not need KVM
changes at
all. Instead of kvm_kick_cpu(cpu) in kvm_unlock_kick you send NMI IPI.



True. It was not NMI.
just to confirm, are you talking about something like this to be tried ?

apic->send_IPI_mask(cpumask_of(cpu), APIC_DM_NMI);


When I started the benchmark, I started seeing
"Dazed and confused, but trying to continue" from the unknown nmi error
handling.
Did I miss anything (because we did not register any NMI handler)? or
is it that spurious NMIs are trouble because we could get spurious NMIs
if next waiter already acquired the lock.

(note: I tried sending APIC_DM_REMRD IPI directly, which worked fine
but hypercall way of handling still performed well from the results I
saw).







Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-07-25 Thread Gleb Natapov
On Thu, Jul 25, 2013 at 02:47:37PM +0530, Raghavendra K T wrote:
 On 07/24/2013 06:06 PM, Raghavendra K T wrote:
 On 07/24/2013 05:36 PM, Gleb Natapov wrote:
 On Wed, Jul 24, 2013 at 05:30:20PM +0530, Raghavendra K T wrote:
 On 07/24/2013 04:09 PM, Gleb Natapov wrote:
 On Wed, Jul 24, 2013 at 03:15:50PM +0530, Raghavendra K T wrote:
 On 07/23/2013 08:37 PM, Gleb Natapov wrote:
 On Mon, Jul 22, 2013 at 11:50:16AM +0530, Raghavendra K T wrote:
 +static void kvm_lock_spinning(struct arch_spinlock *lock,
 __ticket_t want)
 [...]
 +
 +/*
 + * halt until it's our turn and kicked. Note that we do safe
 halt
 + * for irq enabled case to avoid hang when lock info is
 overwritten
 + * in irq spinlock slowpath and no spurious interrupt occur
 to save us.
 + */
 +if (arch_irqs_disabled_flags(flags))
 +halt();
 +else
 +safe_halt();
 +
 +out:
 So here now interrupts can be either disabled or enabled. Previous
 version disabled interrupts here, so are we sure it is safe to
 have them
 enabled at this point? I do not see any problem yet, will keep
 thinking.
 
 If we enable interrupt here, then
 
 
 +cpumask_clear_cpu(cpu, &waiting_cpus);
 
 and if we start serving lock for an interrupt that came here,
 cpumask clear and w-lock=null may not happen atomically.
 if irq spinlock does not take slow path we would have non null value
 for lock, but with no information in waitingcpu.
 
 I am still thinking what would be problem with that.
 
 Exactly, for kicker waiting_cpus and w-lock updates are
 non atomic anyway.
 
 +w->lock = NULL;
 +local_irq_restore(flags);
 +spin_time_accum_blocked(start);
 +}
 +PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
 +
 +/* Kick vcpu waiting on @lock->head to reach value @ticket */
 +static void kvm_unlock_kick(struct arch_spinlock *lock,
 __ticket_t ticket)
 +{
 +int cpu;
 +
 +add_stats(RELEASED_SLOW, 1);
 +for_each_cpu(cpu, &waiting_cpus) {
 +const struct kvm_lock_waiting *w =
 &per_cpu(lock_waiting, cpu);
 +if (ACCESS_ONCE(w->lock) == lock &&
 +ACCESS_ONCE(w->want) == ticket) {
 +add_stats(RELEASED_SLOW_KICKED, 1);
 +kvm_kick_cpu(cpu);
 What about using NMI to wake sleepers? I think it was discussed, but
 forgot why it was dismissed.
 
 I think I have missed that discussion. 'll go back and check. so
 what is the idea here? we can easily wake up the halted vcpus that
 have interrupt disabled?
 We can of course. IIRC the objection was that NMI handling path is very
 fragile and handling NMI on each wakeup will be more expensive then
 waking up a guest without injecting an event, but it is still
 interesting
 to see the numbers.
 
 
 Haam, now I remember, We had tried request based mechanism. (new
 request like REQ_UNHALT) and process that. It had worked, but had some
 complex hacks in vcpu_enter_guest to avoid guest hang in case of
 request cleared.  So had left it there..
 
 https://lkml.org/lkml/2012/4/30/67
 
 But I do not remember performance impact though.
 No, this is something different. Wakeup with NMI does not need KVM
 changes at
 all. Instead of kvm_kick_cpu(cpu) in kvm_unlock_kick you send NMI IPI.
 
 
 True. It was not NMI.
 just to confirm, are you talking about something like this to be tried ?
 
 apic->send_IPI_mask(cpumask_of(cpu), APIC_DM_NMI);
 
 When I started benchmark, I started seeing
 Dazed and confused, but trying to continue from unknown nmi error
 handling.
 Did I miss anything (because we did not register any NMI handler)? or
 is it that spurious NMIs are trouble because we could get spurious NMIs
 if next waiter already acquired the lock.
There is a default NMI handler that tries to detect the reason why the NMI
happened (which is not so easy on x86) and prints this message if it
fails. You need to add logic to detect the spinlock slow path there. Check
the bit in waiting_cpus, for instance.
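
A minimal sketch (not from the series) of that suggestion: an NMI callback
that claims the wakeup NMI when this CPU is parked in the pv-spinlock slow
path, so the unknown-NMI fallback stays quiet. waiting_cpus comes from the
quoted patch; the handler name is made up and would be registered with
register_nmi_handler(NMI_UNKNOWN, ...).

static int kvm_pvlock_nmi_handler(unsigned int type, struct pt_regs *regs)
{
	/* were we halted in kvm_lock_spinning() on this cpu? */
	if (cpumask_test_cpu(smp_processor_id(), &waiting_cpus))
		return NMI_HANDLED;	/* it was our wakeup kick */
	return NMI_DONE;		/* not ours, let the default handling run */
}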

 
 (note: I tried sending APIC_DM_REMRD IPI directly, which worked fine
 but hypercall way of handling still performed well from the results I
 saw).
You mean better? This is strange. Have you run the guest with x2apic?

--
Gleb.


Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-07-25 Thread Raghavendra K T

On 07/25/2013 02:45 PM, Gleb Natapov wrote:

On Thu, Jul 25, 2013 at 02:47:37PM +0530, Raghavendra K T wrote:

On 07/24/2013 06:06 PM, Raghavendra K T wrote:

On 07/24/2013 05:36 PM, Gleb Natapov wrote:

On Wed, Jul 24, 2013 at 05:30:20PM +0530, Raghavendra K T wrote:

On 07/24/2013 04:09 PM, Gleb Natapov wrote:

On Wed, Jul 24, 2013 at 03:15:50PM +0530, Raghavendra K T wrote:

On 07/23/2013 08:37 PM, Gleb Natapov wrote:

On Mon, Jul 22, 2013 at 11:50:16AM +0530, Raghavendra K T wrote:

+static void kvm_lock_spinning(struct arch_spinlock *lock,
__ticket_t want)

[...]

+
+/*
+ * halt until it's our turn and kicked. Note that we do safe
halt
+ * for irq enabled case to avoid hang when lock info is
overwritten
+ * in irq spinlock slowpath and no spurious interrupt occur
to save us.
+ */
+if (arch_irqs_disabled_flags(flags))
+halt();
+else
+safe_halt();
+
+out:

So here now interrupts can be either disabled or enabled. Previous
version disabled interrupts here, so are we sure it is safe to
have them
enabled at this point? I do not see any problem yet, will keep
thinking.


If we enable interrupt here, then



+cpumask_clear_cpu(cpu, &waiting_cpus);


and if we start serving lock for an interrupt that came here,
cpumask clear and w-lock=null may not happen atomically.
if irq spinlock does not take slow path we would have non null value
for lock, but with no information in waitingcpu.

I am still thinking what would be problem with that.


Exactly, for kicker waiting_cpus and w-lock updates are
non atomic anyway.


+w->lock = NULL;
+local_irq_restore(flags);
+spin_time_accum_blocked(start);
+}
+PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
+
+/* Kick vcpu waiting on @lock->head to reach value @ticket */
+static void kvm_unlock_kick(struct arch_spinlock *lock,
__ticket_t ticket)
+{
+int cpu;
+
+add_stats(RELEASED_SLOW, 1);
+for_each_cpu(cpu, &waiting_cpus) {
+const struct kvm_lock_waiting *w =
&per_cpu(lock_waiting, cpu);
+if (ACCESS_ONCE(w->lock) == lock &&
+ACCESS_ONCE(w->want) == ticket) {
+add_stats(RELEASED_SLOW_KICKED, 1);
+kvm_kick_cpu(cpu);

What about using NMI to wake sleepers? I think it was discussed, but
forgot why it was dismissed.


I think I have missed that discussion. 'll go back and check. so
what is the idea here? we can easily wake up the halted vcpus that
have interrupt disabled?

We can of course. IIRC the objection was that NMI handling path is very
fragile and handling NMI on each wakeup will be more expensive then
waking up a guest without injecting an event, but it is still
interesting
to see the numbers.



Haam, now I remember, We had tried request based mechanism. (new
request like REQ_UNHALT) and process that. It had worked, but had some
complex hacks in vcpu_enter_guest to avoid guest hang in case of
request cleared.  So had left it there..

https://lkml.org/lkml/2012/4/30/67

But I do not remember performance impact though.

No, this is something different. Wakeup with NMI does not need KVM
changes at
all. Instead of kvm_kick_cpu(cpu) in kvm_unlock_kick you send NMI IPI.



True. It was not NMI.
just to confirm, are you talking about something like this to be tried ?

apic->send_IPI_mask(cpumask_of(cpu), APIC_DM_NMI);


When I started benchmark, I started seeing
Dazed and confused, but trying to continue from unknown nmi error
handling.
Did I miss anything (because we did not register any NMI handler)? or
is it that spurious NMIs are trouble because we could get spurious NMIs
if next waiter already acquired the lock.

There is a default NMI handler that tries to detect the reason why NMI
happened (which is not so easy on x86) and prints this message if it
fails. You need to add logic to detect spinlock slow path there. Check
bit in waiting_cpus for instance.


aha.. Okay. will check that.





(note: I tried sending APIC_DM_REMRD IPI directly, which worked fine
but hypercall way of handling still performed well from the results I
saw).

You mean better? This is strange. Have you ran guest with x2apic?



Had the same doubt, so I ran the full benchmark for dbench.
Here is what I saw now: 1x was neck and neck (0.9% for hypercall vs
0.7% for IPI, which should boil down to no difference considering the noise
factors), but otherwise, by sending IPI I see a few percent gain in the
overcommit cases.







Re: [[Qemu-devel] [PATCH]] nVMX: Initialize IA32_FEATURE_CONTROL MSR in reset and migration

2013-07-25 Thread Gleb Natapov
On Sun, Jul 07, 2013 at 11:13:37PM +0800, Arthur Chunqi Li wrote:
 The recent KVM patch adds IA32_FEATURE_CONTROL support. QEMU needs
 to clear this MSR when resetting the vCPU and keep its value across
 migration. This patch adds this feature.
 
 Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
Applied, thanks.

 ---
  target-i386/cpu.h |2 ++
  target-i386/kvm.c |4 
  target-i386/machine.c |   22 ++
  3 files changed, 28 insertions(+)
 
 diff --git a/target-i386/cpu.h b/target-i386/cpu.h
 index 62e3547..a418e17 100644
 --- a/target-i386/cpu.h
 +++ b/target-i386/cpu.h
 @@ -301,6 +301,7 @@
  #define MSR_IA32_APICBASE_BSP       (1<<8)
  #define MSR_IA32_APICBASE_ENABLE    (1<<11)
  #define MSR_IA32_APICBASE_BASE      (0xfffff<<12)
 +#define MSR_IA32_FEATURE_CONTROL    0x003a
  #define MSR_TSC_ADJUST              0x003b
  #define MSR_IA32_TSCDEADLINE        0x6e0
  
 @@ -813,6 +814,7 @@ typedef struct CPUX86State {
  
  uint64_t mcg_status;
  uint64_t msr_ia32_misc_enable;
 +uint64_t msr_ia32_feature_control;
  
  /* exception/interrupt handling */
  int error_code;
 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index 39f4fbb..3cb2161 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -1122,6 +1122,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
  if (hyperv_vapic_recommended()) {
  kvm_msr_entry_set(&msrs[n++], HV_X64_MSR_APIC_ASSIST_PAGE, 0);
  }
 +kvm_msr_entry_set(&msrs[n++], MSR_IA32_FEATURE_CONTROL,
 env->msr_ia32_feature_control);
  }
  if (env-mcg_cap) {
  int i;
 @@ -1346,6 +1347,7 @@ static int kvm_get_msrs(X86CPU *cpu)
  if (has_msr_misc_enable) {
  msrs[n++].index = MSR_IA32_MISC_ENABLE;
  }
 +msrs[n++].index = MSR_IA32_FEATURE_CONTROL;
  
  if (!env->tsc_valid) {
  msrs[n++].index = MSR_IA32_TSC;
 @@ -1444,6 +1446,8 @@ static int kvm_get_msrs(X86CPU *cpu)
  case MSR_IA32_MISC_ENABLE:
  env->msr_ia32_misc_enable = msrs[i].data;
  break;
 +case MSR_IA32_FEATURE_CONTROL:
 +env->msr_ia32_feature_control = msrs[i].data;
  default:
  if (msrs[i].index >= MSR_MC0_CTL &&
  msrs[i].index < MSR_MC0_CTL + (env->mcg_cap & 0xff) * 4) {
 diff --git a/target-i386/machine.c b/target-i386/machine.c
 index 3659db9..94ca914 100644
 --- a/target-i386/machine.c
 +++ b/target-i386/machine.c
 @@ -399,6 +399,14 @@ static bool misc_enable_needed(void *opaque)
  return env->msr_ia32_misc_enable != MSR_IA32_MISC_ENABLE_DEFAULT;
  }
  
 +static bool feature_control_needed(void *opaque)
 +{
 +X86CPU *cpu = opaque;
 +CPUX86State *env = &cpu->env;
 +
 +return env->msr_ia32_feature_control != 0;
 +}
 +
  static const VMStateDescription vmstate_msr_ia32_misc_enable = {
  .name = "cpu/msr_ia32_misc_enable",
  .version_id = 1,
 @@ -410,6 +418,17 @@ static const VMStateDescription 
 vmstate_msr_ia32_misc_enable = {
  }
  };
  
 +static const VMStateDescription vmstate_msr_ia32_feature_control = {
 +.name = "cpu/msr_ia32_feature_control",
 +.version_id = 1,
 +.minimum_version_id = 1,
 +.minimum_version_id_old = 1,
 +.fields  = (VMStateField []) {
 +VMSTATE_UINT64(env.msr_ia32_feature_control, X86CPU),
 +VMSTATE_END_OF_LIST()
 +}
 +};
 +
  const VMStateDescription vmstate_x86_cpu = {
  .name = "cpu",
  .version_id = 12,
 @@ -535,6 +554,9 @@ const VMStateDescription vmstate_x86_cpu = {
  }, {
  .vmsd = vmstate_msr_ia32_misc_enable,
  .needed = misc_enable_needed,
 +}, {
 +.vmsd = vmstate_msr_ia32_feature_control,
 +.needed = feature_control_needed,
  } , {
  /* empty */
  }
 -- 
 1.7.9.5

--
Gleb.


Re: [PATCH 1/4 v6] powerpc: export debug registers save function for KVM

2013-07-25 Thread Alexander Graf

On 04.07.2013, at 08:57, Bharat Bhushan wrote:

 KVM needs this function when switching from vcpu to user-space
 thread. My subsequent patch will use this function.
 
 Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com

Ben / Michael, please ack.


Alex

 ---
 v5->v6
 - switch_booke_debug_regs() not guarded by the compiler switch
 
 arch/powerpc/include/asm/switch_to.h |1 +
 arch/powerpc/kernel/process.c|3 ++-
 2 files changed, 3 insertions(+), 1 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/switch_to.h 
 b/arch/powerpc/include/asm/switch_to.h
 index 200d763..db68f1d 100644
 --- a/arch/powerpc/include/asm/switch_to.h
 +++ b/arch/powerpc/include/asm/switch_to.h
 @@ -29,6 +29,7 @@ extern void giveup_vsx(struct task_struct *);
 extern void enable_kernel_spe(void);
 extern void giveup_spe(struct task_struct *);
 extern void load_up_spe(struct task_struct *);
 +extern void switch_booke_debug_regs(struct thread_struct *new_thread);
 
 #ifndef CONFIG_SMP
 extern void discard_lazy_cpu_state(void);
 diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
 index 01ff496..da586aa 100644
 --- a/arch/powerpc/kernel/process.c
 +++ b/arch/powerpc/kernel/process.c
 @@ -362,12 +362,13 @@ static void prime_debug_regs(struct thread_struct *thread)
  * debug registers, set the debug registers from the values
  * stored in the new thread.
  */
 -static void switch_booke_debug_regs(struct thread_struct *new_thread)
 +void switch_booke_debug_regs(struct thread_struct *new_thread)
 {
 	if ((current->thread.debug.dbcr0 & DBCR0_IDM)
 		|| (new_thread->debug.dbcr0 & DBCR0_IDM))
   prime_debug_regs(new_thread);
 }
 +EXPORT_SYMBOL_GPL(switch_booke_debug_regs);
 #else /* !CONFIG_PPC_ADV_DEBUG_REGS */
 #ifndef CONFIG_HAVE_HW_BREAKPOINT
 static void set_debug_reg_defaults(struct thread_struct *thread)
 -- 
 1.7.0.4
 
 



Re: [PATCH 1/2] KVM: x86: Simplify __apic_accept_irq

2013-07-25 Thread Gleb Natapov
On Thu, Jul 25, 2013 at 09:58:45AM +0200, Jan Kiszka wrote:
 If posted interrupts are enabled, we can no longer track if an IRQ was
 coalesced based on IRR. So drop this logic also from the classic
 software path and simplify apic_test_and_set_irr to apic_set_irr.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Applied both, thanks.

 ---
  arch/x86/kvm/lapic.c |   23 ---
  1 files changed, 8 insertions(+), 15 deletions(-)
 
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index afc1124..9dc3650 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -331,10 +331,10 @@ void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 
 *pir)
  }
  EXPORT_SYMBOL_GPL(kvm_apic_update_irr);
  
 -static inline int apic_test_and_set_irr(int vec, struct kvm_lapic *apic)
 +static inline void apic_set_irr(int vec, struct kvm_lapic *apic)
  {
 	apic->irr_pending = true;
 - return apic_test_and_set_vector(vec, apic->regs + APIC_IRR);
 + apic_set_vector(vec, apic->regs + APIC_IRR);
  }
  
  static inline int apic_search_irr(struct kvm_lapic *apic)
 @@ -681,28 +681,21 @@ static int __apic_accept_irq(struct kvm_lapic *apic, 
 int delivery_mode,
   if (unlikely(!apic_enabled(apic)))
   break;
  
 + result = 1;
 +
   if (dest_map)
 	__set_bit(vcpu->vcpu_id, dest_map);
 
 - if (kvm_x86_ops->deliver_posted_interrupt) {
 - result = 1;
 + if (kvm_x86_ops->deliver_posted_interrupt)
 	 kvm_x86_ops->deliver_posted_interrupt(vcpu, vector);
 - } else {
 - result = !apic_test_and_set_irr(vector, apic);
 -
 - if (!result) {
 - if (trig_mode)
 - apic_debug("level trig mode repeatedly "
 - "for vector %d", vector);
 - goto out;
 - }
 + else {
 + apic_set_irr(vector, apic);
  
   kvm_make_request(KVM_REQ_EVENT, vcpu);
   kvm_vcpu_kick(vcpu);
   }
 -out:
 	trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode,
 - trig_mode, vector, !result);
 +   trig_mode, vector, false);
   break;
  
   case APIC_DM_REMRD:
 -- 
 1.7.3.4

--
Gleb.


Re: [RFC][PATCH 0/2] s390/kvm: add kvm support for guest page hinting v2

2013-07-25 Thread Christian Borntraeger
On 25/07/13 10:54, Martin Schwidefsky wrote:
 v1->v2:
  - found a way to simplify the common code patch
 
 Linux on s390 as a guest under z/VM has been using the guest page
 hinting interface (alias collaborative memory management) for a long
 time. The full version with volatile states has been deemed to be too
 complicated (see the old discussion about guest page hinting e.g. on
 http://marc.info/?l=linux-mm&m=123816662017742&w=2).
 What is currently implemented for the guest is the unused and stable
 states to mark unallocated pages as freely available to the host.
 This works just fine with z/VM as the host.
 
 The two patches in this series implement the guest page hinting
 interface for the unused and stable states in the KVM host.
 Most of the code specific to s390 but there is a common memory
 management part as well, see patch #1.
 
 The code is working stable now, from my point of view this is ready
 for prime-time.
 
 Konstantin Weitz (2):
   mm: add support for discard of unused ptes
   s390/kvm: support collaborative memory management

Can you also add the patch from our tree that resets the usage states
on reboot (diag 308 subcode 3 and 4)?



Re: [PATCH 1/2] mm: add support for discard of unused ptes

2013-07-25 Thread Christian Borntraeger
On 25/07/13 10:54, Martin Schwidefsky wrote:
 From: Konstantin Weitz konstantin.we...@gmail.com
 
 In a virtualized environment and given an appropriate interface the guest
 can mark pages as unused while they are free (for the s390 implementation
 see git commit 45e576b1c3d00206 guest page hinting light). For the host
 the unused state is a property of the pte.
 
 This patch adds the primitive 'pte_unused' and code to the host swap out
 handler so that pages marked as unused by all mappers are not swapped out
 but discarded instead, thus saving one IO for swap out and potentially
 another one for swap in.
 
 [ Martin Schwidefsky: patch reordering and simplification ]
 
 Signed-off-by: Konstantin Weitz konstantin.we...@gmail.com
 Signed-off-by: Martin Schwidefsky schwidef...@de.ibm.com
Reviewed-by: Christian Borntraeger borntrae...@de.ibm.com

 ---
  include/asm-generic/pgtable.h |   13 +
  mm/rmap.c |   10 ++
  2 files changed, 23 insertions(+)
 
 diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
 index 2f47ade..ec540c5 100644
 --- a/include/asm-generic/pgtable.h
 +++ b/include/asm-generic/pgtable.h
 @@ -193,6 +193,19 @@ static inline int pte_same(pte_t pte_a, pte_t pte_b)
  }
  #endif
 
 +#ifndef __HAVE_ARCH_PTE_UNUSED
 +/*
 + * Some architectures provide facilities to virtualization guests
 + * so that they can flag allocated pages as unused. This allows the
 + * host to transparently reclaim unused pages. This function returns
 + * whether the pte's page is unused.
 + */
 +static inline int pte_unused(pte_t pte)
 +{
 + return 0;
 +}
 +#endif
 +
  #ifndef __HAVE_ARCH_PMD_SAME
  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
  static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 diff --git a/mm/rmap.c b/mm/rmap.c
 index cd356df..2291f25 100644
 --- a/mm/rmap.c
 +++ b/mm/rmap.c
 @@ -1234,6 +1234,16 @@ int try_to_unmap_one(struct page *page, struct 
 vm_area_struct *vma,
   }
   set_pte_at(mm, address, pte,
  swp_entry_to_pte(make_hwpoison_entry(page)));
 + } else if (pte_unused(pteval)) {
 + /*
 +  * The guest indicated that the page content is of no
 +  * interest anymore. Simply discard the pte, vmscan
 +  * will take care of the rest.
 +  */
 + if (PageAnon(page))
 + dec_mm_counter(mm, MM_ANONPAGES);
 + else
 + dec_mm_counter(mm, MM_FILEPAGES);
   } else if (PageAnon(page)) {
   swp_entry_t entry = { .val = page_private(page) };
 



RE: [PATCH V2 4/4] x86: correctly detect hypervisor

2013-07-25 Thread KY Srinivasan


 -Original Message-
 From: Jason Wang [mailto:jasow...@redhat.com]
 Sent: Thursday, July 25, 2013 4:55 AM
 To: t...@linutronix.de; mi...@redhat.com; h...@zytor.com; x...@kernel.org;
 linux-ker...@vger.kernel.org; pbonz...@redhat.com
 Cc: kvm@vger.kernel.org; Jason Wang; KY Srinivasan; Haiyang Zhang; Konrad
 Rzeszutek Wilk; Jeremy Fitzhardinge; Doug Covelli; Borislav Petkov; Dan Hecht;
 Paul Gortmaker; Marcelo Tosatti; Gleb Natapov; Frederic Weisbecker;
 de...@linuxdriverproject.org; xen-de...@lists.xensource.com;
 virtualizat...@lists.linux-foundation.org
 Subject: [PATCH V2 4/4] x86: correctly detect hypervisor
 
 We try to handle hypervisor compatibility mode by detecting hypervisors
 in a specific order. This is not robust, since hypervisors may implement
 each other's features.
 
 This patch handles the situation by always choosing the one found last in
 the CPUID leaves. This is done by letting .detect() return a priority
 instead of true/false, re-using the CPUID leaf where the signature was
 found as the priority (or 1 if it was found by DMI). We can then pick the
 hypervisor with the highest priority. Other, more sophisticated detection
 methods could also be implemented on top.
 
 Suggested by H. Peter Anvin and Paolo Bonzini.
 
 Cc: Thomas Gleixner t...@linutronix.de
 Cc: Ingo Molnar mi...@redhat.com
 Cc: H. Peter Anvin h...@zytor.com
 Cc: x...@kernel.org
 Cc: K. Y. Srinivasan k...@microsoft.com
 Cc: Haiyang Zhang haiya...@microsoft.com
 Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com
 Cc: Jeremy Fitzhardinge jer...@goop.org
 Cc: Doug Covelli dcove...@vmware.com
 Cc: Borislav Petkov b...@suse.de
 Cc: Dan Hecht dhe...@vmware.com
 Cc: Paul Gortmaker paul.gortma...@windriver.com
 Cc: Marcelo Tosatti mtosa...@redhat.com
 Cc: Gleb Natapov g...@redhat.com
 Cc: Paolo Bonzini pbonz...@redhat.com
 Cc: Frederic Weisbecker fweis...@gmail.com
 Cc: linux-ker...@vger.kernel.org
 Cc: de...@linuxdriverproject.org
 Cc: kvm@vger.kernel.org
 Cc: xen-de...@lists.xensource.com
 Cc: virtualizat...@lists.linux-foundation.org
 Signed-off-by: Jason Wang jasow...@redhat.com
Acked-by:  K. Y. Srinivasan k...@microsoft.com

 ---
  arch/x86/include/asm/hypervisor.h |2 +-
  arch/x86/kernel/cpu/hypervisor.c  |   15 +++
  arch/x86/kernel/cpu/mshyperv.c|   13 -
  arch/x86/kernel/cpu/vmware.c  |8 
  arch/x86/kernel/kvm.c |6 ++
  arch/x86/xen/enlighten.c  |9 +++--
  6 files changed, 25 insertions(+), 28 deletions(-)
 
 diff --git a/arch/x86/include/asm/hypervisor.h
 b/arch/x86/include/asm/hypervisor.h
 index 2d4b5e6..e42f758 100644
 --- a/arch/x86/include/asm/hypervisor.h
 +++ b/arch/x86/include/asm/hypervisor.h
 @@ -33,7 +33,7 @@ struct hypervisor_x86 {
   const char  *name;
 
   /* Detection routine */
 - bool(*detect)(void);
 + uint32_t(*detect)(void);
 
   /* Adjust CPU feature bits (run once per CPU) */
   void(*set_cpu_features)(struct cpuinfo_x86 *);
 diff --git a/arch/x86/kernel/cpu/hypervisor.c 
 b/arch/x86/kernel/cpu/hypervisor.c
 index 8727921..36ce402 100644
 --- a/arch/x86/kernel/cpu/hypervisor.c
 +++ b/arch/x86/kernel/cpu/hypervisor.c
 @@ -25,11 +25,6 @@
  #include asm/processor.h
  #include asm/hypervisor.h
 
 -/*
 - * Hypervisor detect order.  This is specified explicitly here because
 - * some hypervisors might implement compatibility modes for other
 - * hypervisors and therefore need to be detected in specific sequence.
 - */
  static const __initconst struct hypervisor_x86 * const hypervisors[] =
  {
  #ifdef CONFIG_XEN_PVHVM
 @@ -49,15 +44,19 @@ static inline void __init
  detect_hypervisor_vendor(void)
  {
   const struct hypervisor_x86 *h, * const *p;
 + uint32_t pri, max_pri = 0;
 
 	for (p = hypervisors; p < hypervisors + ARRAY_SIZE(hypervisors); p++) {
 		h = *p;
 -		if (h->detect()) {
 +		pri = h->detect();
 +		if (pri != 0 && pri > max_pri) {
 +			max_pri = pri;
 			x86_hyper = h;
 -			printk(KERN_INFO "Hypervisor detected: %s\n", h->name);
 -			break;
 		}
 	}
 +
 +	if (max_pri)
 +		printk(KERN_INFO "Hypervisor detected: %s\n", x86_hyper->name);
  }
 
  void init_hypervisor(struct cpuinfo_x86 *c)
 diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
 index 8f4be53..71a39f3 100644
 --- a/arch/x86/kernel/cpu/mshyperv.c
 +++ b/arch/x86/kernel/cpu/mshyperv.c
 @@ -27,20 +27,23 @@
  struct ms_hyperv_info ms_hyperv;
  EXPORT_SYMBOL_GPL(ms_hyperv);
 
 -static bool __init ms_hyperv_platform(void)
 +static uint32_t  __init ms_hyperv_platform(void)
  {
   u32 eax;
   u32 hyp_signature[3];
 
   if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
 - return false;
 + return 0;
 
   cpuid(HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS,
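
To illustrate the "highest CPUID leaf wins" policy outside of kernel context,
here is a hedged userspace sketch (not part of the patch; it assumes an x86
guest, GCC/Clang's <cpuid.h>, and only knows the four signatures handled by
this series):

/* gcc -o hv-priority hv-priority.c ; run inside a guest */
#include <cpuid.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	static const char * const sigs[] = {
		"KVMKVMKVM\0\0\0", "Microsoft Hv", "VMwareVMware", "XenVMMXenVMM"
	};
	uint32_t eax, ebx, ecx, edx, leaf, best_leaf = 0;
	const char *best = "none";
	unsigned int i;
	char sig[13];

	/* Hypervisor signatures live in 0x100-spaced leaves from 0x40000000;
	 * a compatibility interface sits at a lower leaf than the native one,
	 * so the highest matching leaf is treated as the native hypervisor. */
	for (leaf = 0x40000000; leaf <= 0x40010000; leaf += 0x100) {
		__cpuid(leaf, eax, ebx, ecx, edx);
		memcpy(sig + 0, &ebx, 4);
		memcpy(sig + 4, &ecx, 4);
		memcpy(sig + 8, &edx, 4);
		sig[12] = '\0';
		for (i = 0; i < sizeof(sigs) / sizeof(sigs[0]); i++)
			if (!memcmp(sig, sigs[i], 12) && leaf >= best_leaf) {
				best_leaf = leaf;
				best = sigs[i];
			}
	}
	printf("native hypervisor: %s (leaf 0x%x)\n", best, best_leaf);
	return 0;
}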

Re: [PATCH 2/2] s390/kvm: support collaborative memory management

2013-07-25 Thread Christian Borntraeger
On 25/07/13 10:54, Martin Schwidefsky wrote:
 From: Konstantin Weitz konstantin.we...@gmail.com
 
 This patch enables Collaborative Memory Management (CMM) for kvm
 on s390. CMM allows the guest to inform the host about page usage
 (see arch/s390/mm/cmm.c). The host uses this information to avoid
 swapping in unused pages in the page fault handler. Further, a
 CPU-provided list of unused invalid pages is processed to reclaim the swap
 space of not-yet-accessed unused pages.
 
 [ Martin Schwidefsky: patch reordering and cleanup ]
 
 Signed-off-by: Konstantin Weitz konstantin.we...@gmail.com
 Signed-off-by: Martin Schwidefsky schwidef...@de.ibm.com

Two things to consider: live migration and reset.

When we implement live migration, we need to add some additional magic for
userspace to query/set the unused state. But this can be a follow-up patch,
whenever it becomes necessary.

As of today it should be enough to add some code to the diag308 handler to
make reset safe. For other kinds of reset (e.g. those for kdump) we need
to make this accessible to userspace. Again, this can be added later on
when we implement the other missing pieces for kdump and friends.

So

Reviewed-by: Christian Borntraeger borntrae...@de.ibm.com
Tested-by: Christian Borntraeger borntrae...@de.ibm.com



 ---
  arch/s390/include/asm/kvm_host.h |5 ++-
  arch/s390/include/asm/pgtable.h  |   24 
  arch/s390/kvm/kvm-s390.c |   25 +
  arch/s390/kvm/kvm-s390.h |2 +
  arch/s390/kvm/priv.c |   41 
  arch/s390/mm/pgtable.c   |   77 
 ++
  6 files changed, 173 insertions(+), 1 deletion(-)
 
 diff --git a/arch/s390/include/asm/kvm_host.h 
 b/arch/s390/include/asm/kvm_host.h
 index 3238d40..de6450e 100644
 --- a/arch/s390/include/asm/kvm_host.h
 +++ b/arch/s390/include/asm/kvm_host.h
 @@ -113,7 +113,9 @@ struct kvm_s390_sie_block {
   __u64   gbea;   /* 0x0180 */
   __u8reserved188[24];/* 0x0188 */
   __u32   fac;/* 0x01a0 */
 - __u8reserved1a4[92];/* 0x01a4 */
 + __u8reserved1a4[20];/* 0x01a4 */
 + __u64   cbrlo;  /* 0x01b8 */
 + __u8reserved1c0[64];/* 0x01c0 */
  } __attribute__((packed));
 
  struct kvm_vcpu_stat {
 @@ -149,6 +151,7 @@ struct kvm_vcpu_stat {
   u32 instruction_stsi;
   u32 instruction_stfl;
   u32 instruction_tprot;
 + u32 instruction_essa;
   u32 instruction_sigp_sense;
   u32 instruction_sigp_sense_running;
   u32 instruction_sigp_external_call;
 diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
 index 75fb726..65d48b8 100644
 --- a/arch/s390/include/asm/pgtable.h
 +++ b/arch/s390/include/asm/pgtable.h
 @@ -227,6 +227,7 @@ extern unsigned long MODULES_END;
  #define _PAGE_SWR0x008   /* SW pte referenced bit */
  #define _PAGE_SWW0x010   /* SW pte write bit */
  #define _PAGE_SPECIAL0x020   /* SW associated with special 
 page */
 +#define _PAGE_UNUSED 0x040   /* SW bit for ptep_clear_flush() */
  #define __HAVE_ARCH_PTE_SPECIAL
 
  /* Set of bits not changed in pte_modify */
 @@ -375,6 +376,12 @@ extern unsigned long MODULES_END;
 
  #endif /* CONFIG_64BIT */
 
 +/* Guest Page State used for virtualization */
 +#define _PGSTE_GPS_ZERO  0x8000UL
 +#define _PGSTE_GPS_USAGE_MASK0x0300UL
 +#define _PGSTE_GPS_USAGE_STABLE 0xUL
 +#define _PGSTE_GPS_USAGE_UNUSED 0x0100UL
 +
  /*
   * A user page table pointer has the space-switch-event bit, the
   * private-space-control bit and the storage-alteration-event-control
 @@ -590,6 +597,12 @@ static inline int pte_file(pte_t pte)
   return (pte_val(pte)  mask) == _PAGE_TYPE_FILE;
  }
 
 +static inline int pte_swap(pte_t pte)
 +{
 + unsigned long mask = _PAGE_RO | _PAGE_INVALID | _PAGE_SWT | _PAGE_SWX;
 + return (pte_val(pte)  mask) == _PAGE_TYPE_SWAP;
 +}
 +
  static inline int pte_special(pte_t pte)
  {
   return (pte_val(pte)  _PAGE_SPECIAL);
 @@ -794,6 +807,7 @@ unsigned long gmap_translate(unsigned long address, 
 struct gmap *);
  unsigned long __gmap_fault(unsigned long address, struct gmap *);
  unsigned long gmap_fault(unsigned long address, struct gmap *);
  void gmap_discard(unsigned long from, unsigned long to, struct gmap *);
 +void __gmap_zap(unsigned long address, struct gmap *);
 
  void gmap_register_ipte_notifier(struct gmap_notifier *);
  void gmap_unregister_ipte_notifier(struct gmap_notifier *);
 @@ -825,6 +839,7 @@ static inline void set_pte_at(struct mm_struct *mm, 
 unsigned long addr,
 
   if (mm_has_pgste(mm)) {
   pgste = pgste_get_lock(ptep);
 + pgste_val(pgste) = ~_PGSTE_GPS_ZERO;
   pgste_set_key(ptep, pgste, entry);
   pgste_set_pte(ptep, entry);
   pgste_set_unlock(ptep, 

[PATCH v4 05/13] nEPT: make guest's A/D bits depend on guest's paging mode

2013-07-25 Thread Gleb Natapov
EPT uses different shifts for the A/D bits, and the first version of nEPT
does not support them at all.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/paging_tmpl.h |   30 ++
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index fb26ca9..7581395 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -32,6 +32,10 @@
#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
#define PT_LEVEL_BITS PT64_LEVEL_BITS
+   #define PT_GUEST_ACCESSED_MASK PT_ACCESSED_MASK
+   #define PT_GUEST_DIRTY_MASK PT_DIRTY_MASK
+   #define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
+   #define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
#ifdef CONFIG_X86_64
#define PT_MAX_FULL_LEVELS 4
#define CMPXCHG cmpxchg
@@ -49,6 +53,10 @@
#define PT_INDEX(addr, level) PT32_INDEX(addr, level)
#define PT_LEVEL_BITS PT32_LEVEL_BITS
#define PT_MAX_FULL_LEVELS 2
+   #define PT_GUEST_ACCESSED_MASK PT_ACCESSED_MASK
+   #define PT_GUEST_DIRTY_MASK PT_DIRTY_MASK
+   #define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
+   #define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
#define CMPXCHG cmpxchg
 #else
#error Invalid PTTYPE value
@@ -88,7 +96,8 @@ static inline void FNAME(protect_clean_gpte)(unsigned 
*access, unsigned gpte)
 
mask = (unsigned)~ACC_WRITE_MASK;
/* Allow write access to dirty gptes */
-	mask |= (gpte >> (PT_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) & PT_WRITABLE_MASK;
+	mask |= (gpte >> (PT_GUEST_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) &
+		PT_WRITABLE_MASK;
 	*access &= mask;
 }
 
@@ -138,7 +147,7 @@ static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu 
*vcpu,
if (!FNAME(is_present_gpte)(gpte))
goto no_present;
 
-	if (!(gpte & PT_ACCESSED_MASK))
+	if (!(gpte & PT_GUEST_ACCESSED_MASK))
goto no_present;
 
return false;
@@ -174,14 +183,14 @@ static int FNAME(update_accessed_dirty_bits)(struct 
kvm_vcpu *vcpu,
table_gfn = walker-table_gfn[level - 1];
ptep_user = walker-ptep_user[level - 1];
index = offset_in_page(ptep_user) / sizeof(pt_element_t);
-		if (!(pte & PT_ACCESSED_MASK)) {
+		if (!(pte & PT_GUEST_ACCESSED_MASK)) {
 			trace_kvm_mmu_set_accessed_bit(table_gfn, index,
 						       sizeof(pte));
-			pte |= PT_ACCESSED_MASK;
+			pte |= PT_GUEST_ACCESSED_MASK;
 		}
 		if (level == walker->level && write_fault &&
-		    !(pte & PT_DIRTY_MASK)) {
+		    !(pte & PT_GUEST_DIRTY_MASK)) {
 			trace_kvm_mmu_set_dirty_bit(table_gfn, index,
 						    sizeof(pte));
-			pte |= PT_DIRTY_MASK;
+			pte |= PT_GUEST_DIRTY_MASK;
}
if (pte == orig_pte)
continue;
@@ -235,7 +244,7 @@ retry_walk:
ASSERT((!is_long_mode(vcpu)  is_pae(vcpu)) ||
   (mmu-get_cr3(vcpu)  CR3_NONPAE_RESERVED_BITS) == 0);
 
-   accessed_dirty = PT_ACCESSED_MASK;
+   accessed_dirty = PT_GUEST_ACCESSED_MASK;
pt_access = pte_access = ACC_ALL;
++walker-level;
 
@@ -310,7 +319,8 @@ retry_walk:
 * On a write fault, fold the dirty bit into accessed_dirty by
 * shifting it one place right.
 */
-	accessed_dirty &= pte >> (PT_DIRTY_SHIFT - PT_ACCESSED_SHIFT);
+	accessed_dirty &= pte >>
+		(PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);
 
if (unlikely(!accessed_dirty)) {
ret = FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker, 
write_fault);
@@ -886,3 +896,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp)
 #undef gpte_to_gfn
 #undef gpte_to_gfn_lvl
 #undef CMPXCHG
+#undef PT_GUEST_ACCESSED_MASK
+#undef PT_GUEST_DIRTY_MASK
+#undef PT_GUEST_DIRTY_SHIFT
+#undef PT_GUEST_ACCESSED_SHIFT
-- 
1.7.10.4

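For comparison with the 32/64-bit cases patched above, here is a sketch of
what an EPT instantiation of the PT_GUEST_* macros could look like once an
EPT PTTYPE block exists (added later in this series). EPT uses bit 8 for
"accessed" and bit 9 for "dirty"; the NESTED_EPT_HAS_AD_BITS switch below is
purely hypothetical:

#define PT_GUEST_ACCESSED_SHIFT	8
#define PT_GUEST_DIRTY_SHIFT	9
#ifdef NESTED_EPT_HAS_AD_BITS	/* hypothetical; this series does not support A/D */
#define PT_GUEST_ACCESSED_MASK	(1ull << PT_GUEST_ACCESSED_SHIFT)
#define PT_GUEST_DIRTY_MASK	(1ull << PT_GUEST_DIRTY_SHIFT)
#else				/* A/D unsupported: masks are 0, bits never consulted */
#define PT_GUEST_ACCESSED_MASK	0
#define PT_GUEST_DIRTY_MASK	0
#endif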


[PATCH v4 00/13] Nested EPT

2013-07-25 Thread Gleb Natapov
After changing hands several times, I am proud to present a new version of
the Nested EPT patches. Nothing groundbreaking here compared to v3: all
review comments are addressed, some by Yang Zhang and some by yours truly.

Gleb Natapov (1):
  nEPT: make guest's A/D bits depend on guest's paging mode

Nadav Har'El (10):
  nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
  nEPT: Fix cr3 handling in nested exit and entry
  nEPT: Fix wrong test in kvm_set_cr3
  nEPT: Move common code to paging_tmpl.h
  nEPT: Add EPT tables support to paging_tmpl.h
  nEPT: Nested INVEPT
  nEPT: MMU context for nested EPT
  nEPT: Advertise EPT to L1
  nEPT: Some additional comments
  nEPT: Miscellaneous cleanups

Yang Zhang (2):
  nEPT: Redefine EPT-specific link_shadow_page()
  nEPT: Add nEPT violation/misconfiguration support

 arch/x86/include/asm/kvm_host.h |4 +
 arch/x86/include/asm/vmx.h  |3 +
 arch/x86/include/uapi/asm/vmx.h |1 +
 arch/x86/kvm/mmu.c  |  134 ++---
 arch/x86/kvm/mmu.h  |2 +
 arch/x86/kvm/paging_tmpl.h  |  175 
 arch/x86/kvm/vmx.c  |  210 ---
 arch/x86/kvm/x86.c  |   11 --
 8 files changed, 436 insertions(+), 104 deletions(-)

-- 
1.7.10.4



[PATCH v4 12/13] nEPT: Some additional comments

2013-07-25 Thread Gleb Natapov
From: Nadav Har'El n...@il.ibm.com

Some additional comments to preexisting code:
Explain who (L0 or L1) handles EPT violation and misconfiguration exits.
Don't mention shadow on either EPT or shadow as the only two options.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/vmx.c |   13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a77f902..d513ace 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6659,7 +6659,20 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu 
*vcpu)
return nested_cpu_has2(vmcs12,
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
case EXIT_REASON_EPT_VIOLATION:
+   /*
+* L0 always deals with the EPT violation. If nested EPT is
+* used, and the nested mmu code discovers that the address is
+* missing in the guest EPT table (EPT12), the EPT violation
+* will be injected with nested_ept_inject_page_fault()
+*/
+   return 0;
case EXIT_REASON_EPT_MISCONFIG:
+   /*
+* L2 never uses directly L1's EPT, but rather L0's own EPT
+* table (shadow on EPT) or a merged EPT table that L0 built
+* (EPT on EPT). So any problems with the structure of the
+* table is L0's fault.
+*/
return 0;
case EXIT_REASON_PREEMPTION_TIMER:
return vmcs12-pin_based_vm_exec_control 
-- 
1.7.10.4



[PATCH v4 09/13] nEPT: Add nEPT violation/misconfiguration support

2013-07-25 Thread Gleb Natapov
From: Yang Zhang yang.z.zh...@intel.com

Inject nEPT faults into the L1 guest. This patch originally comes from Xinhao Xu.

Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/include/asm/kvm_host.h |4 
 arch/x86/kvm/mmu.c  |   34 ++
 arch/x86/kvm/paging_tmpl.h  |   30 +-
 arch/x86/kvm/vmx.c  |   17 +
 4 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 531f47c..58a17c0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -286,6 +286,7 @@ struct kvm_mmu {
u64 *pae_root;
u64 *lm_root;
u64 rsvd_bits_mask[2][4];
+   u64 bad_mt_xwr;
 
/*
 * Bitmap: bit set = last pte in walk
@@ -512,6 +513,9 @@ struct kvm_vcpu_arch {
 * instruction.
 */
bool write_fault_to_shadow_pgtable;
+
+   /* set at EPT violation at this point */
+   unsigned long exit_qualification;
 };
 
 struct kvm_lpage_info {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3df3ac3..58ae9db 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3521,6 +3521,8 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
int maxphyaddr = cpuid_maxphyaddr(vcpu);
u64 exb_bit_rsvd = 0;
 
+   context-bad_mt_xwr = 0;
+
if (!context-nx)
exb_bit_rsvd = rsvd_bits(63, 63);
switch (context-root_level) {
@@ -3576,6 +3578,38 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
}
 }
 
+static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
+   struct kvm_mmu *context, bool execonly)
+{
+   int maxphyaddr = cpuid_maxphyaddr(vcpu);
+   int pte;
+
+   context-rsvd_bits_mask[0][3] =
+   rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
+   context-rsvd_bits_mask[0][2] =
+   rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
+   context-rsvd_bits_mask[0][1] =
+   rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
+   context-rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
+
+   /* large page */
+   context-rsvd_bits_mask[1][3] = context-rsvd_bits_mask[0][3];
+   context-rsvd_bits_mask[1][2] =
+   rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 29);
+   context-rsvd_bits_mask[1][1] =
+   rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 20);
+   context-rsvd_bits_mask[1][0] = context-rsvd_bits_mask[0][0];
+   
+	for (pte = 0; pte < 64; pte++) {
+		int rwx_bits = pte & 7;
+		int mt = pte >> 3;
+		if (mt == 0x2 || mt == 0x3 || mt == 0x7 ||
+		    rwx_bits == 0x2 || rwx_bits == 0x6 ||
+		    (rwx_bits == 0x4 && !execonly))
+			context->bad_mt_xwr |= (1ull << pte);
+	}
+}
+
 static void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu 
*mmu)
 {
unsigned bit, byte, pfec;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 23a19a5..58d2f87 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -121,14 +121,23 @@ static inline void FNAME(protect_clean_gpte)(unsigned 
*access, unsigned gpte)
 #endif
 }
 
+#if PTTYPE == PTTYPE_EPT
+#define CHECK_BAD_MT_XWR(G) mmu->bad_mt_xwr & (1ull << ((G) & 0x3f));
+#else
+#define CHECK_BAD_MT_XWR(G) 0;
+#endif
+
 static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
 {
int bit7;
 
	bit7 = (gpte >> 7) & 1;
-	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
+	return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) ||
+		CHECK_BAD_MT_XWR(gpte);
 }
 
+#undef CHECK_BAD_MT_XWR
+
 static inline int FNAME(is_present_gpte)(unsigned long pte)
 {
 #if PTTYPE != PTTYPE_EPT
@@ -376,6 +385,25 @@ error:
walker-fault.vector = PF_VECTOR;
walker-fault.error_code_valid = true;
walker-fault.error_code = errcode;
+
+#if PTTYPE == PTTYPE_EPT
+   /*
+* Use PFERR_RSVD_MASK in error_code to tell if EPT
+* misconfiguration requires to be injected. The detection is
+* done by is_rsvd_bits_set() above.
+*
+* We set up the value of exit_qualification to inject:
+* [2:0] - Derive from [2:0] of real exit_qualification at EPT violation
+* [5:3] - Calculated by the page walk of the guest EPT page tables
+* [7:8] - Clear to 0.
+*
+* The other bits are set to 0.
+*/
+	if (!(errcode & PFERR_RSVD_MASK)) {
+		vcpu->arch.exit_qualification &= 0x7;
+		vcpu->arch.exit_qualification |= ((pt_access & pte) & 0x7) << 3;
+	}
+#endif
walker-fault.address = addr;
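
As a quick cross-check of the reset_rsvds_bits_mask_ept() loop above: the low
six bits of an EPT entry are the R/W/X permissions (bits 0-2) and the memory
type (bits 3-5), and bad_mt_xwr caches which of the 64 combinations are
illegal. A standalone restatement of that rule (illustration only; the
function name is not from the patch):

static int ept_low_bits_bad(unsigned int pte_low6, int execonly)
{
	unsigned int rwx = pte_low6 & 7;	/* bit0=R, bit1=W, bit2=X */
	unsigned int mt  = (pte_low6 >> 3) & 7;	/* EPT memory type */

	return mt == 2 || mt == 3 || mt == 7 ||	/* reserved memory types */
	       rwx == 2 || rwx == 6 ||		/* W or W+X without R */
	       (rwx == 4 && !execonly);		/* X-only when unsupported */
}

is_rsvd_bits_set() can then test any gpte with a single bit lookup into
bad_mt_xwr instead of recomputing this.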

[PATCH v4 11/13] nEPT: Advertise EPT to L1

2013-07-25 Thread Gleb Natapov
From: Nadav Har'El n...@il.ibm.com

Advertise the support of EPT to the L1 guest, through the appropriate MSR.

This is the last patch of the basic Nested EPT feature, so as to allow
bisection through this patch series: The guest will not see EPT support until
this last patch, and will not attempt to use the half-applied feature.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/vmx.c |   16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6b79db7..a77f902 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2249,6 +2249,18 @@ static __init void nested_vmx_setup_ctls_msrs(void)
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
SECONDARY_EXEC_WBINVD_EXITING;
 
+   if (enable_ept) {
+   /* nested EPT: emulate EPT also to L1 */
+   nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT;
+   nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
+   nested_vmx_ept_caps |=
+   VMX_EPT_INVEPT_BIT | VMX_EPT_EXTENT_GLOBAL_BIT |
+   VMX_EPT_EXTENT_CONTEXT_BIT |
+   VMX_EPT_EXTENT_INDIVIDUAL_BIT;
+   nested_vmx_ept_caps = vmx_capability.ept;
+   } else
+   nested_vmx_ept_caps = 0;
+
/* miscellaneous data */
rdmsr(MSR_IA32_VMX_MISC, nested_vmx_misc_low, nested_vmx_misc_high);
nested_vmx_misc_low = VMX_MISC_PREEMPTION_TIMER_RATE_MASK |
@@ -2357,8 +2369,8 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 
msr_index, u64 *pdata)
nested_vmx_secondary_ctls_high);
break;
case MSR_IA32_VMX_EPT_VPID_CAP:
-   /* Currently, no nested ept or nested vpid */
-   *pdata = 0;
+   /* Currently, no nested vpid support */
+   *pdata = nested_vmx_ept_caps;
break;
default:
return 0;
-- 
1.7.10.4

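From L1's point of view, the newly advertised capability shows up in the
MSR_IA32_VMX_EPT_VPID_CAP MSR touched above. A hedged sketch of how an L1
hypervisor could probe it, assuming kernel context (rdmsrl() and the bit
names are existing kernel definitions):

static bool l1_has_nested_ept(void)
{
	u64 ept_vpid_cap;

	rdmsrl(MSR_IA32_VMX_EPT_VPID_CAP, ept_vpid_cap);
	return (ept_vpid_cap & VMX_EPT_PAGE_WALK_4_BIT) &&
	       (ept_vpid_cap & VMX_EPT_INVEPT_BIT);
}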


[PATCH v4 02/13] nEPT: Fix cr3 handling in nested exit and entry

2013-07-25 Thread Gleb Natapov
From: Nadav Har'El n...@il.ibm.com

The existing code for handling cr3 and related VMCS fields during nested
exit and entry wasn't correct in all cases:

If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
we forgot to do so. This patch adds this copy.

If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
whoever does control cr3 (L1 or L2) is using PAE, the processor might have
saved PDPTEs and we should also save them in vmcs12 (and restore later).

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/vmx.c |   30 ++
 1 file changed, 30 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1e9437f..89b15df 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7595,6 +7595,17 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12)
kvm_set_cr3(vcpu, vmcs12-guest_cr3);
kvm_mmu_reset_context(vcpu);
 
+   /*
+* Additionally, except when L0 is using shadow page tables, L1 or
+* L2 control guest_cr3 for L2, so they may also have saved PDPTEs
+*/
+   if (enable_ept) {
+   vmcs_write64(GUEST_PDPTR0, vmcs12-guest_pdptr0);
+   vmcs_write64(GUEST_PDPTR1, vmcs12-guest_pdptr1);
+   vmcs_write64(GUEST_PDPTR2, vmcs12-guest_pdptr2);
+   vmcs_write64(GUEST_PDPTR3, vmcs12-guest_pdptr3);
+   }
+
kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12-guest_rsp);
kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12-guest_rip);
 }
@@ -7917,6 +7928,25 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12)
vmcs12-guest_pending_dbg_exceptions =
vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
 
+   /*
+* In some cases (usually, nested EPT), L2 is allowed to change its
+* own CR3 without exiting. If it has changed it, we must keep it.
+* Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
+* by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
+*/
+   if (enable_ept)
+   vmcs12-guest_cr3 = vmcs_read64(GUEST_CR3);
+   /*
+* Additionally, except when L0 is using shadow page tables, L1 or
+* L2 control guest_cr3 for L2, so save their PDPTEs
+*/
+   if (enable_ept) {
+   vmcs12-guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
+   vmcs12-guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
+   vmcs12-guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
+   vmcs12-guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
+   }
+
vmcs12-vm_entry_controls =
(vmcs12-vm_entry_controls  ~VM_ENTRY_IA32E_MODE) |
(vmcs_read32(VM_ENTRY_CONTROLS)  VM_ENTRY_IA32E_MODE);
-- 
1.7.10.4



[PATCH v4 13/13] nEPT: Miscellaneous cleanups

2013-07-25 Thread Gleb Natapov
From: Nadav Har'El n...@il.ibm.com

Some trivial code cleanups not really related to nested EPT.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
Reviewed-by: Paolo Bonzini pbonz...@redhat.com
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/vmx.c |6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d513ace..66d9233 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -715,7 +715,6 @@ static void nested_release_page_clean(struct page *page)
 static u64 construct_eptp(unsigned long root_hpa);
 static void kvm_cpu_vmxon(u64 addr);
 static void kvm_cpu_vmxoff(void);
-static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
 static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
 static void vmx_set_segment(struct kvm_vcpu *vcpu,
struct kvm_segment *var, int seg);
@@ -1040,8 +1039,7 @@ static inline bool nested_cpu_has2(struct vmcs12 *vmcs12, 
u32 bit)
(vmcs12-secondary_vm_exec_control  bit);
 }
 
-static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12,
-   struct kvm_vcpu *vcpu)
+static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12)
 {
return vmcs12-pin_based_vm_exec_control  PIN_BASED_VIRTUAL_NMIS;
 }
@@ -6760,7 +6758,7 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
 
if (unlikely(!cpu_has_virtual_nmis()  vmx-soft_vnmi_blocked 
!(is_guest_mode(vcpu)  nested_cpu_has_virtual_nmis(
-   get_vmcs12(vcpu), vcpu {
+   get_vmcs12(vcpu) {
if (vmx_interrupt_allowed(vcpu)) {
vmx-soft_vnmi_blocked = 0;
} else if (vmx-vnmi_blocked_time  10LL 
-- 
1.7.10.4



[PATCH v4 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1

2013-07-25 Thread Gleb Natapov
From: Nadav Har'El n...@il.ibm.com

Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577,
switches the EFER MSR when EPT is used and the host and guest have different
NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2)
and want to be able to run recent KVM as L1, we need to allow L1 to use this
EFER switching feature.

To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available,
and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
support for the former (the latter is still unsupported).

Nested entry and exit emulation (prepare_vmcs_02 and load_vmcs12_host_state,
respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all
that's left to do in this patch is to properly advertise this feature to L1.

Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using
vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
support this feature, regardless of whether the host supports it.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/vmx.c |   23 ---
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e999dc7..1e9437f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2198,7 +2198,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 #else
nested_vmx_exit_ctls_high = 0;
 #endif
-   nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
+   nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
+ VM_EXIT_LOAD_IA32_EFER);
 
/* entry controls */
rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
@@ -2207,8 +2208,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
nested_vmx_entry_ctls_high =
VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
-   nested_vmx_entry_ctls_high |= VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
-
+   nested_vmx_entry_ctls_high |= (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
+  VM_ENTRY_LOAD_IA32_EFER);
/* cpu-based controls */
rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
@@ -7529,10 +7530,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, 
struct vmcs12 *vmcs12)
vcpu-arch.cr0_guest_owned_bits = ~vmcs12-cr0_guest_host_mask;
vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu-arch.cr0_guest_owned_bits);
 
-   /* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */
-   vmcs_write32(VM_EXIT_CONTROLS,
-   vmcs12-vm_exit_controls | vmcs_config.vmexit_ctrl);
-   vmcs_write32(VM_ENTRY_CONTROLS, vmcs12-vm_entry_controls |
+   /* L2-L1 exit controls are emulated - the hardware exit is to L0 so
+* we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER
+* bits are further modified by vmx_set_efer() below.
+*/
+   vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
+
+   /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are
+* emulated by vmx_set_efer(), below.
+*/
+   vmcs_write32(VM_ENTRY_CONTROLS,
+   (vmcs12-vm_entry_controls  ~VM_ENTRY_LOAD_IA32_EFER 
+   ~VM_ENTRY_IA32E_MODE) |
(vmcs_config.vmentry_ctrl  ~VM_ENTRY_IA32E_MODE));
 
if (vmcs12-vm_entry_controls  VM_ENTRY_LOAD_IA32_PAT)
-- 
1.7.10.4

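The control advertised here is what a nested KVM, running as L1, looks for
before using the EFER-switching path described above. A hedged sketch of that
check in kernel context (rdmsr() and the named constants are existing kernel
definitions; the allowed-1 settings are reported in the high word of the MSR):

static bool l1_can_load_efer_on_exit(void)
{
	u32 low, high;

	rdmsr(MSR_IA32_VMX_EXIT_CTLS, low, high);
	return high & VM_EXIT_LOAD_IA32_EFER;
}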


[PATCH v4 10/13] nEPT: MMU context for nested EPT

2013-07-25 Thread Gleb Natapov
From: Nadav Har'El n...@il.ibm.com

KVM's existing shadow MMU code already supports nested TDP. To use it, we
need to set up a new MMU context for nested EPT, and create a few callbacks
for it (nested_ept_*()). This context should also use the EPT versions of
the page table access functions (defined in the previous patch).
Then, we need to switch back and forth between this nested context and the
regular MMU context when switching between L1 and L2 (when L1 runs this L2
with EPT).

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/mmu.c |   26 ++
 arch/x86/kvm/mmu.h |2 ++
 arch/x86/kvm/vmx.c |   41 -
 3 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 58ae9db..37fff14 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3792,6 +3792,32 @@ int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct 
kvm_mmu *context)
 }
 EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu);
 
+int kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
+   bool execonly)
+{
+   ASSERT(vcpu);
+   ASSERT(!VALID_PAGE(vcpu-arch.mmu.root_hpa));
+
+   context-shadow_root_level = kvm_x86_ops-get_tdp_level();
+
+   context-nx = true;
+   context-new_cr3 = paging_new_cr3;
+   context-page_fault = ept_page_fault;
+   context-gva_to_gpa = ept_gva_to_gpa;
+   context-sync_page = ept_sync_page;
+   context-invlpg = ept_invlpg;
+   context-update_pte = ept_update_pte;
+   context-free = paging_free;
+   context-root_level = context-shadow_root_level;
+   context-root_hpa = INVALID_PAGE;
+   context-direct_map = false;
+
+   reset_rsvds_bits_mask_ept(vcpu, context, execonly);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_init_shadow_ept_mmu);
+
 static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
 {
int r = kvm_init_shadow_mmu(vcpu, vcpu-arch.walk_mmu);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 5b59c57..77e044a 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -71,6 +71,8 @@ enum {
 
 int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool 
direct);
 int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
+int kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
+   bool execonly);
 
 static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index bbfff8d..6b79db7 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1046,6 +1046,11 @@ static inline bool nested_cpu_has_virtual_nmis(struct 
vmcs12 *vmcs12,
return vmcs12-pin_based_vm_exec_control  PIN_BASED_VIRTUAL_NMIS;
 }
 
+static inline int nested_cpu_has_ept(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_EPT);
+}
+
 static inline bool is_exception(u32 intr_info)
 {
return (intr_info  (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -7433,6 +7438,33 @@ static void nested_ept_inject_page_fault(struct kvm_vcpu 
*vcpu,
vmcs12-guest_physical_address = fault-address;
 }
 
+/* Callbacks for nested_ept_init_mmu_context: */
+
+static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
+{
+   /* return the page table to be shadowed - in our case, EPT12 */
+   return get_vmcs12(vcpu)-ept_pointer;
+}
+
+static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
+{
+   int r = kvm_init_shadow_ept_mmu(vcpu, vcpu-arch.mmu,
+   nested_vmx_ept_caps  VMX_EPT_EXECUTE_ONLY_BIT);
+
+   vcpu-arch.mmu.set_cr3   = vmx_set_cr3;
+   vcpu-arch.mmu.get_cr3   = nested_ept_get_cr3;
+   vcpu-arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
+
+   vcpu-arch.walk_mmu  = vcpu-arch.nested_mmu;
+
+   return r;
+}
+
+static void nested_ept_uninit_mmu_context(struct kvm_vcpu *vcpu)
+{
+   vcpu-arch.walk_mmu = vcpu-arch.mmu;
+}
+
 /*
  * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
  * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function merges it
@@ -7653,6 +7685,11 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12)
vmx_flush_tlb(vcpu);
}
 
+   if (nested_cpu_has_ept(vmcs12)) {
+   kvm_mmu_unload(vcpu);
+   nested_ept_init_mmu_context(vcpu);
+   }
+
if (vmcs12-vm_entry_controls  VM_ENTRY_LOAD_IA32_EFER)
vcpu-arch.efer = vmcs12-guest_ia32_efer;
else if (vmcs12-vm_entry_controls  VM_ENTRY_IA32E_MODE)
@@ -8125,7 +8162,9 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
vcpu-arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
 

[PATCH v4 08/13] nEPT: Nested INVEPT

2013-07-25 Thread Gleb Natapov
From: Nadav Har'El n...@il.ibm.com

If we let L1 use EPT, we should probably also support the INVEPT instruction.

In our current nested EPT implementation, when L1 changes its EPT table
for L2 (i.e., EPT12), L0 modifies the shadow EPT table (EPT02), and in
the course of this modification already calls INVEPT. But if the last level
of the shadow page table is unsync, not all of L1's changes to EPT12 are
intercepted, which means the roots need to be synced when L1 calls INVEPT.
Global INVEPT should not be different, since the roots are synced by
kvm_mmu_load() each time EPTP02 changes.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/include/asm/vmx.h  |3 ++
 arch/x86/include/uapi/asm/vmx.h |1 +
 arch/x86/kvm/mmu.c  |2 ++
 arch/x86/kvm/vmx.c  |   68 +++
 4 files changed, 74 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index f3e01a2..c3d74b9 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -387,6 +387,7 @@ enum vmcs_field {
 #define VMX_EPT_EXTENT_INDIVIDUAL_ADDR 0
 #define VMX_EPT_EXTENT_CONTEXT 1
 #define VMX_EPT_EXTENT_GLOBAL  2
+#define VMX_EPT_EXTENT_SHIFT   24
 
 #define VMX_EPT_EXECUTE_ONLY_BIT   (1ull)
 #define VMX_EPT_PAGE_WALK_4_BIT			(1ull << 6)
@@ -394,7 +395,9 @@ enum vmcs_field {
 #define VMX_EPTP_WB_BIT				(1ull << 14)
 #define VMX_EPT_2MB_PAGE_BIT			(1ull << 16)
 #define VMX_EPT_1GB_PAGE_BIT			(1ull << 17)
+#define VMX_EPT_INVEPT_BIT			(1ull << 20)
 #define VMX_EPT_AD_BIT				(1ull << 21)
+#define VMX_EPT_EXTENT_INDIVIDUAL_BIT		(1ull << 24)
 #define VMX_EPT_EXTENT_CONTEXT_BIT		(1ull << 25)
 #define VMX_EPT_EXTENT_GLOBAL_BIT		(1ull << 26)
 
diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index d651082..7a34e8f 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -65,6 +65,7 @@
 #define EXIT_REASON_EOI_INDUCED 45
 #define EXIT_REASON_EPT_VIOLATION   48
 #define EXIT_REASON_EPT_MISCONFIG   49
+#define EXIT_REASON_INVEPT  50
 #define EXIT_REASON_PREEMPTION_TIMER52
 #define EXIT_REASON_WBINVD  54
 #define EXIT_REASON_XSETBV  55
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9e0f467..3df3ac3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3182,6 +3182,7 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
mmu_sync_roots(vcpu);
spin_unlock(vcpu-kvm-mmu_lock);
 }
+EXPORT_SYMBOL_GPL(kvm_mmu_sync_roots);
 
 static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, gva_t vaddr,
  u32 access, struct x86_exception *exception)
@@ -3451,6 +3452,7 @@ void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu)
++vcpu-stat.tlb_flush;
kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
 }
+EXPORT_SYMBOL_GPL(kvm_mmu_flush_tlb);
 
 static void paging_new_cr3(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 56d0066..fc24370 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2156,6 +2156,7 @@ static u32 nested_vmx_pinbased_ctls_low, 
nested_vmx_pinbased_ctls_high;
 static u32 nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high;
 static u32 nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high;
 static u32 nested_vmx_misc_low, nested_vmx_misc_high;
+static u32 nested_vmx_ept_caps;
 static __init void nested_vmx_setup_ctls_msrs(void)
 {
/*
@@ -6270,6 +6271,71 @@ static int handle_vmptrst(struct kvm_vcpu *vcpu)
return 1;
 }
 
+/* Emulate the INVEPT instruction */
+static int handle_invept(struct kvm_vcpu *vcpu)
+{
+   u32 vmx_instruction_info;
+   bool ok;
+   unsigned long type;
+   gva_t gva;
+   struct x86_exception e;
+   struct {
+   u64 eptp, gpa;
+   } operand;
+
+	if (!(nested_vmx_secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT) ||
+	    !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+
+   if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   /* According to the Intel VMX instruction reference, the memory
+* operand is read even if it isn't needed (e.g., for type==global)
+*/
+   vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+   if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+   
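
For completeness, a sketch of what the L1 side of this looks like: a nested
hypervisor issues INVEPT with an invalidation type in a register and a
128-bit descriptor in memory, which is exactly the exit the handler above
emulates. Illustration only; it assumes a 64-bit build and an assembler that
knows the invept mnemonic:

struct invept_desc {
	u64 eptp;
	u64 reserved;
} __attribute__((packed));

static inline void invept_single_context(u64 eptp)
{
	struct invept_desc desc = { .eptp = eptp, .reserved = 0 };
	u64 type = VMX_EPT_EXTENT_CONTEXT;	/* single-context invalidation */

	asm volatile("invept %0, %1"
		     : : "m" (desc), "r" (type)
		     : "cc", "memory");
}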

[PATCH v4 04/13] nEPT: Move common code to paging_tmpl.h

2013-07-25 Thread Gleb Natapov
From: Nadav Har'El n...@il.ibm.com

In preparation, we just move gpte_access(), prefetch_invalid_gpte(),
is_rsvd_bits_set(), protect_clean_gpte() and is_dirty_gpte() from mmu.c
to paging_tmpl.h.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/mmu.c |   55 --
 arch/x86/kvm/paging_tmpl.h |   80 +---
 2 files changed, 68 insertions(+), 67 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3a9493a..4c4274d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -331,11 +331,6 @@ static int is_large_pte(u64 pte)
return pte  PT_PAGE_SIZE_MASK;
 }
 
-static int is_dirty_gpte(unsigned long pte)
-{
-   return pte  PT_DIRTY_MASK;
-}
-
 static int is_rmap_spte(u64 pte)
 {
return is_shadow_present_pte(pte);
@@ -2574,14 +2569,6 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
mmu_free_roots(vcpu);
 }
 
-static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level)
-{
-   int bit7;
-
-	bit7 = (gpte >> 7) & 1;
-	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
-}
-
 static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
 bool no_dirty_log)
 {
@@ -2594,26 +2581,6 @@ static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu 
*vcpu, gfn_t gfn,
return gfn_to_pfn_memslot_atomic(slot, gfn);
 }
 
-static bool prefetch_invalid_gpte(struct kvm_vcpu *vcpu,
- struct kvm_mmu_page *sp, u64 *spte,
- u64 gpte)
-{
-   if (is_rsvd_bits_set(vcpu-arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
-   goto no_present;
-
-   if (!is_present_gpte(gpte))
-   goto no_present;
-
-   if (!(gpte  PT_ACCESSED_MASK))
-   goto no_present;
-
-   return false;
-
-no_present:
-   drop_spte(vcpu-kvm, spte);
-   return true;
-}
-
 static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
struct kvm_mmu_page *sp,
u64 *start, u64 *end)
@@ -3501,18 +3468,6 @@ static void paging_free(struct kvm_vcpu *vcpu)
nonpaging_free(vcpu);
 }
 
-static inline void protect_clean_gpte(unsigned *access, unsigned gpte)
-{
-   unsigned mask;
-
-   BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
-
-   mask = (unsigned)~ACC_WRITE_MASK;
-   /* Allow write access to dirty gptes */
-	mask |= (gpte >> (PT_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) & PT_WRITABLE_MASK;
-	*access &= mask;
-}
-
 static bool sync_mmio_spte(struct kvm *kvm, u64 *sptep, gfn_t gfn,
   unsigned access, int *nr_present)
 {
@@ -3530,16 +3485,6 @@ static bool sync_mmio_spte(struct kvm *kvm, u64 *sptep, 
gfn_t gfn,
return false;
 }
 
-static inline unsigned gpte_access(struct kvm_vcpu *vcpu, u64 gpte)
-{
-   unsigned access;
-
-	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
-	access &= ~(gpte >> PT64_NX_SHIFT);
-
-   return access;
-}
-
 static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned 
gpte)
 {
unsigned index;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 7769699..fb26ca9 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -80,6 +80,31 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
return (gpte  PT_LVL_ADDR_MASK(lvl))  PAGE_SHIFT;
 }
 
+static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
+{
+   unsigned mask;
+
+   BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
+
+   mask = (unsigned)~ACC_WRITE_MASK;
+   /* Allow write access to dirty gptes */
+	mask |= (gpte >> (PT_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) & PT_WRITABLE_MASK;
+	*access &= mask;
+}
+
+static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
+{
+   int bit7;
+
+	bit7 = (gpte >> 7) & 1;
+	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
+}
+
+static inline int FNAME(is_present_gpte)(unsigned long pte)
+{
+   return is_present_gpte(pte);
+}
+
 static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
   pt_element_t __user *ptep_user, unsigned index,
   pt_element_t orig_pte, pt_element_t new_pte)
@@ -103,6 +128,36 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, 
struct kvm_mmu *mmu,
return (ret != orig_pte);
 }
 
+static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
+ struct kvm_mmu_page *sp, u64 *spte,
+ u64 gpte)
+{
+   if 
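
For readers new to paging_tmpl.h: the file is compiled several times, once
per PTTYPE, and FNAME() pastes the type into every function name, which is
why the moved helpers gain the FNAME() wrapper above. A toy standalone
illustration of the same trick (all names invented for the example):

/* toy_tmpl.h -- included once per TOYTYPE, mimicking paging_tmpl.h */
#if TOYTYPE == 32
#define FNAME(name) toy32_##name
#elif TOYTYPE == 64
#define FNAME(name) toy64_##name
#endif

static int FNAME(walk)(void)
{
	return TOYTYPE;		/* becomes toy32_walk() / toy64_walk() */
}

#undef FNAME
#undef TOYTYPE

/* user.c */
#define TOYTYPE 32
#include "toy_tmpl.h"
#define TOYTYPE 64
#include "toy_tmpl.h"

int main(void)
{
	/* both instantiations now exist in this translation unit */
	return toy32_walk() + toy64_walk();
}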

[PATCH v4 07/13] nEPT: Redefine EPT-specific link_shadow_page()

2013-07-25 Thread Gleb Natapov
From: Yang Zhang yang.z.zh...@intel.com

Since nEPT doesn't support A/D bits, we should not set those bits
when building the shadow page table.

Signed-off-by: Yang Zhang yang.z.zh...@intel.com
Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/mmu.c |   12 +---
 arch/x86/kvm/paging_tmpl.h |4 ++--
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b5273c3..9e0f467 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2047,12 +2047,18 @@ static void shadow_walk_next(struct 
kvm_shadow_walk_iterator *iterator)
return __shadow_walk_next(iterator, *iterator-sptep);
 }
 
-static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
+static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, bool 
accessed)
 {
u64 spte;
 
+   BUILD_BUG_ON(VMX_EPT_READABLE_MASK != PT_PRESENT_MASK ||
+   VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
+
spte = __pa(sp-spt) | PT_PRESENT_MASK | PT_WRITABLE_MASK |
-  shadow_user_mask | shadow_x_mask | shadow_accessed_mask;
+  shadow_user_mask | shadow_x_mask;
+
+   if (accessed)
+   spte |= shadow_accessed_mask;
 
mmu_spte_set(sptep, spte);
 }
@@ -2677,7 +2683,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, 
int write,
  iterator.level - 1,
  1, ACC_ALL, iterator.sptep);
 
-   link_shadow_page(iterator.sptep, sp);
+   link_shadow_page(iterator.sptep, sp, true);
}
}
return emulate;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index e38b3c0..23a19a5 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -545,7 +545,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
goto out_gpte_changed;
 
if (sp)
-   link_shadow_page(it.sptep, sp);
+   link_shadow_page(it.sptep, sp, PTTYPE != PTTYPE_EPT);
}
 
for (;
@@ -565,7 +565,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 
sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
  true, direct_access, it.sptep);
-   link_shadow_page(it.sptep, sp);
+   link_shadow_page(it.sptep, sp, PTTYPE != PTTYPE_EPT);
}
 
clear_sp_write_flooding_count(it.sptep);
-- 
1.7.10.4



[PATCH v4 03/13] nEPT: Fix wrong test in kvm_set_cr3

2013-07-25 Thread Gleb Natapov
From: Nadav Har'El n...@il.ibm.com

kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical
address. The problem is that with nested EPT, cr3 is an *L2* physical
address, not an L1 physical address as this test expects.

As the comment above this test explains, it isn't necessary, and doesn't
correspond to anything a real processor would do. So this patch removes it.

Note that this wrong test could have also theoretically caused problems
in nested NPT, not just in nested EPT. However, in practice, the problem
was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the
nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus
circumventing the problem. Additional potential calls to the buggy function
are avoided in that we don't trap cr3 modifications when nested NPT is
enabled. However, because in nested VMX we did want to use kvm_set_cr3()
(as requested in Avi Kivity's review of the original nested VMX patches),
we can't avoid this problem and need to fix it.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/vmx.c |   12 
 arch/x86/kvm/x86.c |   11 ---
 2 files changed, 4 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 89b15df..56d0066 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7596,8 +7596,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12)
kvm_mmu_reset_context(vcpu);
 
/*
-* Additionally, except when L0 is using shadow page tables, L1 or
-* L2 control guest_cr3 for L2, so they may also have saved PDPTEs
+* L1 may access the L2's PDPTR, so save them to construct vmcs12
 */
if (enable_ept) {
vmcs_write64(GUEST_PDPTR0, vmcs12-guest_pdptr0);
@@ -7933,14 +7932,11 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, 
struct vmcs12 *vmcs12)
 * own CR3 without exiting. If it has changed it, we must keep it.
 * Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
 * by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
-*/
-   if (enable_ept)
-   vmcs12-guest_cr3 = vmcs_read64(GUEST_CR3);
-   /*
-* Additionally, except when L0 is using shadow page tables, L1 or
-* L2 control guest_cr3 for L2, so save their PDPTEs
+*
+* Additionally, restore L2's PDPTR to vmcs12.
 */
if (enable_ept) {
+   vmcs12-guest_cr3 = vmcs_read64(GUEST_CR3);
vmcs12-guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
vmcs12-guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
vmcs12-guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d2caeb9..e2fef8b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -682,17 +682,6 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 */
}
 
-   /*
-* Does the new cr3 value map to physical memory? (Note, we
-* catch an invalid cr3 even in real-mode, because it would
-* cause trouble later on when we turn on paging anyway.)
-*
-* A real CPU would silently accept an invalid cr3 and would
-* attempt to use it - with largely undefined (and often hard
-* to debug) behavior on the guest side.
-*/
-   if (unlikely(!gfn_to_memslot(vcpu-kvm, cr3  PAGE_SHIFT)))
-   return 1;
vcpu-arch.cr3 = cr3;
__set_bit(VCPU_EXREG_CR3, (ulong *)vcpu-arch.regs_avail);
vcpu-arch.mmu.new_cr3(vcpu);
-- 
1.7.10.4



Mkinitcpio failing on vfio-vga-reset kernel branch on Arch Linux

2013-07-25 Thread Peter Kay
I've compiled the vfio-vga-reset kernel branch, but every time I run
mkinitcpio with the -k switch, either pointing at the kernel image or
specifying the kernel version, it says there are no modules to add.

Running up-to-date Arch Linux; mkinitcpio works fine with other kernels. Is
there a specific list of kernel options required other than the four VFIO
ones, or another reason it might fail?

Thanks

Peter


Re: [PATCH 1/8] KVM: PPC: Book3S PR: Load up SPRG3 register with guest value on guest entry

2013-07-25 Thread Alexander Graf

On 11.07.2013, at 13:49, Paul Mackerras wrote:

 Unlike the other general-purpose SPRs, SPRG3 can be read by usermode
 code, and is used in recent kernels to store the CPU and NUMA node
 numbers so that they can be read by VDSO functions.  Thus we need to
 load the guest's SPRG3 value into the real SPRG3 register when entering
 the guest, and restore the host's value when exiting the guest.  We don't
 need to save the guest SPRG3 value when exiting the guest as usermode
 code can't modify SPRG3.

This loads SPRG3 on every guest exit, which can happen a lot with instruction 
emulation. Since the kernel doesn't rely on the contents of SPRG3 we only have 
to care about it when not in KVM code, right?

So could we move this to kvmppc_core_vcpu_load/put instead?


Alex
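
A rough sketch of the alternative placement suggested above, i.e. switching
SPRG3 once per vcpu load/put instead of on every guest entry/exit. This is
only an illustration of the idea, not an applied patch; the host_sprg3 field
is hypothetical:

void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
	/* ... existing load code ... */
	vcpu->arch.host_sprg3 = mfspr(SPRN_SPRG3);	/* assumed new field */
	mtspr(SPRN_SPRG3, vcpu->arch.shared->sprg3);
}

void kvmppc_core_vcpu_put(struct kvm_vcpu *vcpu)
{
	/* restore the host value so the VDSO keeps seeing valid cpu/node ids */
	mtspr(SPRN_SPRG3, vcpu->arch.host_sprg3);
	/* ... existing put code ... */
}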

 
 Signed-off-by: Paul Mackerras pau...@samba.org
 ---
 arch/powerpc/kernel/asm-offsets.c|  1 +
 arch/powerpc/kvm/book3s_interrupts.S | 14 ++
 2 files changed, 15 insertions(+)
 
 diff --git a/arch/powerpc/kernel/asm-offsets.c 
 b/arch/powerpc/kernel/asm-offsets.c
 index 6f16ffa..a67c76e 100644
 --- a/arch/powerpc/kernel/asm-offsets.c
 +++ b/arch/powerpc/kernel/asm-offsets.c
 @@ -452,6 +452,7 @@ int main(void)
   DEFINE(VCPU_SPRG2, offsetof(struct kvm_vcpu, arch.shregs.sprg2));
   DEFINE(VCPU_SPRG3, offsetof(struct kvm_vcpu, arch.shregs.sprg3));
 #endif
 + DEFINE(VCPU_SHARED_SPRG3, offsetof(struct kvm_vcpu_arch_shared, sprg3));
   DEFINE(VCPU_SHARED_SPRG4, offsetof(struct kvm_vcpu_arch_shared, sprg4));
   DEFINE(VCPU_SHARED_SPRG5, offsetof(struct kvm_vcpu_arch_shared, sprg5));
   DEFINE(VCPU_SHARED_SPRG6, offsetof(struct kvm_vcpu_arch_shared, sprg6));
 diff --git a/arch/powerpc/kvm/book3s_interrupts.S 
 b/arch/powerpc/kvm/book3s_interrupts.S
 index 48cbbf8..17cfae5 100644
 --- a/arch/powerpc/kvm/book3s_interrupts.S
 +++ b/arch/powerpc/kvm/book3s_interrupts.S
 @@ -92,6 +92,11 @@ kvm_start_lightweight:
   PPC_LL  r3, VCPU_HFLAGS(r4)
   rldicl  r3, r3, 0, 63   /* r3 = 1 */
   stb r3, HSTATE_RESTORE_HID5(r13)
 +
 + /* Load up guest SPRG3 value, since it's user readable */
 + ld  r3, VCPU_SHARED(r4)
 + ld  r3, VCPU_SHARED_SPRG3(r3)
 + mtspr   SPRN_SPRG3, r3
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
   PPC_LL  r4, VCPU_SHADOW_MSR(r4) /* get shadow_msr */
 @@ -123,6 +128,15 @@ kvmppc_handler_highmem:
   /* R7 = vcpu */
   PPC_LL  r7, GPR4(r1)
 
 +#ifdef CONFIG_PPC_BOOK3S_64
 + /*
 +  * Reload kernel SPRG3 value.
 +  * No need to save guest value as usermode can't modify SPRG3.
 +  */
 + ld  r3, PACA_SPRG3(r13)
 + mtspr   SPRN_SPRG3, r3
 +#endif /* CONFIG_PPC_BOOK3S_64 */
 +
   PPC_STL r14, VCPU_GPR(R14)(r7)
   PPC_STL r15, VCPU_GPR(R15)(r7)
   PPC_STL r16, VCPU_GPR(R16)(r7)
 -- 
 1.8.3.1
 


Re: [PATCH 1/8] KVM: PPC: Book3S PR: Load up SPRG3 register with guest value on guest entry

2013-07-25 Thread Alexander Graf

On 25.07.2013, at 15:38, Alexander Graf wrote:

 
 On 11.07.2013, at 13:49, Paul Mackerras wrote:
 
 Unlike the other general-purpose SPRs, SPRG3 can be read by usermode
 code, and is used in recent kernels to store the CPU and NUMA node
 numbers so that they can be read by VDSO functions.  Thus we need to
 load the guest's SPRG3 value into the real SPRG3 register when entering
 the guest, and restore the host's value when exiting the guest.  We don't
 need to save the guest SPRG3 value when exiting the guest as usermode
 code can't modify SPRG3.
 
 This loads SPRG3 on every guest exit, which can happen a lot with instruction 
 emulation. Since the kernel doesn't rely on the contents of SPRG3 we only 
 have to care about it when not in KVM code, right?
 
 So could we move this to kvmppc_core_vcpu_load/put instead?

But then again, if all the shadow copy code is negligible performance-wise,
this probably is too. Applied to kvm-ppc-queue.


Alex



Re: [PATCH 2/8] KVM: PPC: Book3S PR: Keep volatile reg values in vcpu rather than shadow_vcpu

2013-07-25 Thread Alexander Graf

On 11.07.2013, at 13:50, Paul Mackerras wrote:

 Currently PR-style KVM keeps the volatile guest register values
 (R0 - R13, CR, LR, CTR, XER, PC) in a shadow_vcpu struct rather than
 the main kvm_vcpu struct.  For 64-bit, the shadow_vcpu exists in two
 places, a kmalloc'd struct and in the PACA, and it gets copied back
 and forth in kvmppc_core_vcpu_load/put(), because the real-mode code
 can't rely on being able to access the kmalloc'd struct.
 
 This changes the code to copy the volatile values into the shadow_vcpu
 as one of the last things done before entering the guest.  Similarly
 the values are copied back out of the shadow_vcpu to the kvm_vcpu
 immediately after exiting the guest.  We arrange for interrupts to be
 still disabled at this point so that we can't get preempted on 64-bit
 and end up copying values from the wrong PACA.
 
 This means that the accessor functions in kvm_book3s.h for these
 registers are greatly simplified, and are same between PR and HV KVM.
 In places where accesses to shadow_vcpu fields are now replaced by
 accesses to the kvm_vcpu, we can also remove the svcpu_get/put pairs.
 Finally, on 64-bit, we don't need the kmalloc'd struct at all any more.
 
 With this, the time to read the PVR one million times in a loop went
 from 478.2ms to 480.1ms (averages of 4 values), a difference which is
 not statistically significant given the variability of the results.
 
 Signed-off-by: Paul Mackerras pau...@samba.org
 ---
 arch/powerpc/include/asm/kvm_book3s.h | 193 +-
 arch/powerpc/include/asm/kvm_book3s_asm.h |   6 +-
 arch/powerpc/include/asm/kvm_host.h   |   1 +
 arch/powerpc/kernel/asm-offsets.c |   3 +-
 arch/powerpc/kvm/book3s_emulate.c |   8 +-
 arch/powerpc/kvm/book3s_interrupts.S  | 101 
 arch/powerpc/kvm/book3s_pr.c  |  68 +--
 arch/powerpc/kvm/book3s_rmhandlers.S  |   5 -
 arch/powerpc/kvm/trace.h  |   7 +-
 9 files changed, 175 insertions(+), 217 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
 b/arch/powerpc/include/asm/kvm_book3s.h
 index 08891d0..5d68f6c 100644
 --- a/arch/powerpc/include/asm/kvm_book3s.h
 +++ b/arch/powerpc/include/asm/kvm_book3s.h
 @@ -198,149 +198,97 @@ extern void kvm_return_point(void);
 #include asm/kvm_book3s_64.h
 #endif
 
 -#ifdef CONFIG_KVM_BOOK3S_PR
 -
 -static inline unsigned long kvmppc_interrupt_offset(struct kvm_vcpu *vcpu)
 -{
 - return to_book3s(vcpu)-hior;
 -}
 -
 -static inline void kvmppc_update_int_pending(struct kvm_vcpu *vcpu,
 - unsigned long pending_now, unsigned long old_pending)
 -{
 - if (pending_now)
 - vcpu-arch.shared-int_pending = 1;
 - else if (old_pending)
 - vcpu-arch.shared-int_pending = 0;
 -}
 -
 static inline void kvmppc_set_gpr(struct kvm_vcpu *vcpu, int num, ulong val)
 {
 - if ( num  14 ) {
 - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
 - svcpu-gpr[num] = val;
 - svcpu_put(svcpu);
 - to_book3s(vcpu)-shadow_vcpu-gpr[num] = val;
 - } else
 - vcpu-arch.gpr[num] = val;
 + vcpu-arch.gpr[num] = val;
 }
 
 static inline ulong kvmppc_get_gpr(struct kvm_vcpu *vcpu, int num)
 {
 - if ( num  14 ) {
 - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
 - ulong r = svcpu-gpr[num];
 - svcpu_put(svcpu);
 - return r;
 - } else
 - return vcpu-arch.gpr[num];
 + return vcpu-arch.gpr[num];
 }
 
 static inline void kvmppc_set_cr(struct kvm_vcpu *vcpu, u32 val)
 {
 - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
 - svcpu-cr = val;
 - svcpu_put(svcpu);
 - to_book3s(vcpu)-shadow_vcpu-cr = val;
 + vcpu-arch.cr = val;
 }
 
 static inline u32 kvmppc_get_cr(struct kvm_vcpu *vcpu)
 {
 - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
 - u32 r;
 - r = svcpu-cr;
 - svcpu_put(svcpu);
 - return r;
 + return vcpu-arch.cr;
 }
 
 static inline void kvmppc_set_xer(struct kvm_vcpu *vcpu, u32 val)
 {
 - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
 - svcpu-xer = val;
 - to_book3s(vcpu)-shadow_vcpu-xer = val;
 - svcpu_put(svcpu);
 + vcpu-arch.xer = val;
 }
 
 static inline u32 kvmppc_get_xer(struct kvm_vcpu *vcpu)
 {
 - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
 - u32 r;
 - r = svcpu-xer;
 - svcpu_put(svcpu);
 - return r;
 + return vcpu-arch.xer;
 }
 
 static inline void kvmppc_set_ctr(struct kvm_vcpu *vcpu, ulong val)
 {
 - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
 - svcpu-ctr = val;
 - svcpu_put(svcpu);
 + vcpu-arch.ctr = val;
 }
 
 static inline ulong kvmppc_get_ctr(struct kvm_vcpu *vcpu)
 {
 - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
 - ulong r;
 - r = svcpu-ctr;
 - svcpu_put(svcpu);
 - 
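
The commit message above says the volatile values are copied into the
shadow_vcpu as one of the last things done before entering the guest, with
interrupts disabled so the right PACA is used. A minimal sketch of that
copy-in step is below; the shadow_vcpu field names match the accessors
quoted above, but the function name and the exact field list are
illustrative assumptions, not the patch's actual helper.

/* Sketch: copy volatile guest state into the shadow_vcpu before entry. */
static void sketch_copy_to_svcpu(struct kvmppc_book3s_shadow_vcpu *svcpu,
				 struct kvm_vcpu *vcpu)
{
	int i;

	for (i = 0; i < 14; i++)	/* volatile GPRs r0..r13 */
		svcpu->gpr[i] = vcpu->arch.gpr[i];
	svcpu->cr  = vcpu->arch.cr;
	svcpu->xer = vcpu->arch.xer;
	svcpu->ctr = vcpu->arch.ctr;
	/* LR, PC and the other volatile registers named in the commit
	 * message are handled the same way. */
}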

Re: [PATCH 2/2] kvm: powerpc: set cache coherency only for kernel managed pages

2013-07-25 Thread Alexander Graf

On 25.07.2013, at 10:50, Gleb Natapov wrote:

 On Wed, Jul 24, 2013 at 03:32:49PM -0500, Scott Wood wrote:
 On 07/24/2013 04:39:59 AM, Alexander Graf wrote:
 
 On 24.07.2013, at 11:35, Gleb Natapov wrote:
 
 On Wed, Jul 24, 2013 at 11:21:11AM +0200, Alexander Graf wrote:
 Are not we going to use page_is_ram() from
 e500_shadow_mas2_attrib() as Scott commented?
 
  Why aren't we using page_is_ram() in kvm_is_mmio_pfn()?
 
 
 Because it is much slower and, IIRC, actually used to build pfn
 map that allow
 us to check quickly for valid pfn.
 
 Then why should we use page_is_ram()? :)
 
 I really don't want the e500 code to diverge too much from what
 the rest of the kvm code is doing.
 
 I don't understand actually used to build pfn map  What code
 is this?  I don't see any calls to page_is_ram() in the KVM code, or
 in generic mm code.  Is this a statement about what x86 does?
 It may be not page_is_ram() directly, but the same into page_is_ram() is
 using. On power both page_is_ram() and do_init_bootmem() walks some kind
 of memblock_region data structure. What important is that pfn_valid()
 does not mean that there is a memory behind page structure. See Andrea's
 reply.
 
 
 On PPC page_is_ram() is only called (AFAICT) for determining what
 attributes to set on mmaps.  We want to be sure that KVM always
 makes the same decision.  While pfn_valid() seems like it should be
 equivalent, it's not obvious from the PPC code that it is.
 
 Again pfn_valid() is not enough.
 
 If pfn_valid() is better, why is that not used for mmap?  Why are
 there two different names for the same thing?
 
 They are not the same thing. page_is_ram() tells you if phys address is
 ram backed. pfn_valid() tells you if there is struct page behind the
 pfn. PageReserved() tells if you a pfn is marked as reserved. All non
 ram pfns should be reserved, but ram pfns can be reserved too. Again,
 see Andrea's reply.
 
 Why ppc uses page_is_ram() for mmap? How should I know? But looking at

That one's easy. Let's just ask Ben. Ben, is there any particular reason PPC 
uses page_is_ram() rather than what KVM does here to figure out whether a pfn 
is RAM or not? It would be really useful to be able to run the exact same logic 
that figures out whether we're cacheable or not in both TLB writers (KVM and 
linux-mm).


Alex

 the function it does it only as a fallback if
 ppc_md.phys_mem_access_prot() is not provided. Making access to MMIO
 noncached as a safe fallback makes sense. It is also make sense to allow
 noncached access to reserved ram sometimes.
 
 --
   Gleb.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] kvm: powerpc: set cache coherency only for kernel managed pages

2013-07-25 Thread Gleb Natapov
On Thu, Jul 25, 2013 at 06:07:55PM +0200, Alexander Graf wrote:
 
 On 25.07.2013, at 10:50, Gleb Natapov wrote:
 
  On Wed, Jul 24, 2013 at 03:32:49PM -0500, Scott Wood wrote:
  On 07/24/2013 04:39:59 AM, Alexander Graf wrote:
  
  On 24.07.2013, at 11:35, Gleb Natapov wrote:
  
  On Wed, Jul 24, 2013 at 11:21:11AM +0200, Alexander Graf wrote:
  Are not we going to use page_is_ram() from
  e500_shadow_mas2_attrib() as Scott commented?
  
   Why aren't we using page_is_ram() in kvm_is_mmio_pfn()?
  
  
  Because it is much slower and, IIRC, actually used to build pfn
  map that allow
  us to check quickly for valid pfn.
  
  Then why should we use page_is_ram()? :)
  
  I really don't want the e500 code to diverge too much from what
  the rest of the kvm code is doing.
  
  I don't understand actually used to build pfn map  What code
  is this?  I don't see any calls to page_is_ram() in the KVM code, or
  in generic mm code.  Is this a statement about what x86 does?
  It may be not page_is_ram() directly, but the same into page_is_ram() is
  using. On power both page_is_ram() and do_init_bootmem() walks some kind
  of memblock_region data structure. What important is that pfn_valid()
  does not mean that there is a memory behind page structure. See Andrea's
  reply.
  
  
  On PPC page_is_ram() is only called (AFAICT) for determining what
  attributes to set on mmaps.  We want to be sure that KVM always
  makes the same decision.  While pfn_valid() seems like it should be
  equivalent, it's not obvious from the PPC code that it is.
  
  Again pfn_valid() is not enough.
  
  If pfn_valid() is better, why is that not used for mmap?  Why are
  there two different names for the same thing?
  
  They are not the same thing. page_is_ram() tells you if phys address is
  ram backed. pfn_valid() tells you if there is struct page behind the
  pfn. PageReserved() tells if you a pfn is marked as reserved. All non
  ram pfns should be reserved, but ram pfns can be reserved too. Again,
  see Andrea's reply.
  
  Why ppc uses page_is_ram() for mmap? How should I know? But looking at
 
 That one's easy. Let's just ask Ben. Ben, is there any particular reason PPC 
 uses page_is_ram() rather than what KVM does here to figure out whether a pfn 
 is RAM or not? It would be really useful to be able to run the exact same 
 logic that figures out whether we're cacheable or not in both TLB writers 
 (KVM and linux-mm).
 
KVM does not only try to figure out what is RAM and what is not! Look at how
KVM uses the function. Among other things, KVM tries to figure out whether
refcounting needs to be used on this page.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
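
To make the distinction in this thread concrete: pfn_valid() only says
whether a struct page exists for the pfn, PageReserved() whether that page
is marked reserved, and page_is_ram() whether the physical address is
RAM-backed at all. A simplified sketch of an MMIO-pfn check along those
lines (not the exact mainline kvm_is_mmio_pfn()) is:

/* Simplified sketch, not the exact mainline implementation. */
static bool sketch_is_mmio_pfn(unsigned long pfn)
{
	if (pfn_valid(pfn))
		/*
		 * A struct page exists, so refcounting is possible;
		 * treat the pfn as MMIO only if the page is reserved.
		 */
		return PageReserved(pfn_to_page(pfn));

	/* No struct page at all: certainly not kernel-managed RAM. */
	return true;
}

This is why Gleb stresses that KVM's check is about more than "is this
RAM": it also decides whether a struct page can be refcounted.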


Re: [PATCH] kvm-unit-tests : Basic architecture of VMX nested test case

2013-07-25 Thread Arthur Chunqi Li
On Thu, Jul 25, 2013 at 4:11 PM, Jan Kiszka jan.kis...@web.de wrote:
 On 2013-07-25 07:31, Arthur Chunqi Li wrote:
 This is the first version of VMX nested environment. It contains the
 basic VMX instructions test cases, including VMXON/VMXOFF/VMXPTRLD/
 VMXPTRST/VMCLEAR/VMLAUNCH/VMRESUME/VMCALL. This patchalso tests the
 basic execution routine in VMX nested environment andlet the VM print
 Hello World to inform its successfully run.

 The first release also includes a test suite for vmenter (vmlaunch and
 vmresume). Besides, hypercall mechanism is included and currently it is
 used to invoke VM normal exit.

 New files added:
 x86/vmx.h : contains all VMX related macro declerations
 x86/vmx.c : main file for VMX nested test case

 Signed-off-by: Arthur Chunqi Li yzt...@gmail.com

 Don't forget to update your public git as well.

 ---
  config-x86-common.mak |2 +
  config-x86_64.mak |1 +
  lib/x86/msr.h |5 +
  x86/cstart64.S|4 +
  x86/unittests.cfg |6 +
  x86/vmx.c |  712 
 +
  x86/vmx.h |  479 +
  7 files changed, 1209 insertions(+)
  create mode 100644 x86/vmx.c
  create mode 100644 x86/vmx.h

 diff --git a/config-x86-common.mak b/config-x86-common.mak
 index 455032b..34a41e1 100644
 --- a/config-x86-common.mak
 +++ b/config-x86-common.mak
 @@ -101,6 +101,8 @@ $(TEST_DIR)/asyncpf.elf: $(cstart.o) 
 $(TEST_DIR)/asyncpf.o

  $(TEST_DIR)/pcid.elf: $(cstart.o) $(TEST_DIR)/pcid.o

 +$(TEST_DIR)/vmx.elf: $(cstart.o) $(TEST_DIR)/vmx.o
 +
  arch_clean:
   $(RM) $(TEST_DIR)/*.o $(TEST_DIR)/*.flat $(TEST_DIR)/*.elf \
   $(TEST_DIR)/.*.d $(TEST_DIR)/lib/.*.d $(TEST_DIR)/lib/*.o
 diff --git a/config-x86_64.mak b/config-x86_64.mak
 index 4e525f5..bb8ee89 100644
 --- a/config-x86_64.mak
 +++ b/config-x86_64.mak
 @@ -9,5 +9,6 @@ tests = $(TEST_DIR)/access.flat $(TEST_DIR)/apic.flat \
 $(TEST_DIR)/xsave.flat $(TEST_DIR)/rmap_chain.flat \
 $(TEST_DIR)/pcid.flat
  tests += $(TEST_DIR)/svm.flat
 +tests += $(TEST_DIR)/vmx.flat

  include config-x86-common.mak
 diff --git a/lib/x86/msr.h b/lib/x86/msr.h
 index 509a421..281255a 100644
 --- a/lib/x86/msr.h
 +++ b/lib/x86/msr.h
 @@ -396,6 +396,11 @@
  #define MSR_IA32_VMX_VMCS_ENUM  0x048a
  #define MSR_IA32_VMX_PROCBASED_CTLS20x048b
  #define MSR_IA32_VMX_EPT_VPID_CAP   0x048c
 +#define MSR_IA32_VMX_TRUE_PIN0x048d
 +#define MSR_IA32_VMX_TRUE_PROC   0x048e
 +#define MSR_IA32_VMX_TRUE_EXIT   0x048f
 +#define MSR_IA32_VMX_TRUE_ENTRY  0x0490
 +

  /* AMD-V MSRs */

 diff --git a/x86/cstart64.S b/x86/cstart64.S
 index 24df5f8..0fe76da 100644
 --- a/x86/cstart64.S
 +++ b/x86/cstart64.S
 @@ -4,6 +4,10 @@
  .globl boot_idt
  boot_idt = 0

 +.globl idt_descr
 +.globl tss_descr
 +.globl gdt64_desc
 +
  ipi_vector = 0x20

  max_cpus = 64
 diff --git a/x86/unittests.cfg b/x86/unittests.cfg
 index bc9643e..85c36aa 100644
 --- a/x86/unittests.cfg
 +++ b/x86/unittests.cfg
 @@ -149,3 +149,9 @@ extra_params = --append 1000 `date +%s`
  file = pcid.flat
  extra_params = -cpu qemu64,+pcid
  arch = x86_64
 +
 +[vmx]
 +file = vmx.flat
 +extra_params = -cpu host,+vmx
 +arch = x86_64
 +
 diff --git a/x86/vmx.c b/x86/vmx.c
 new file mode 100644
 index 000..ca3e117
 --- /dev/null
 +++ b/x86/vmx.c
 @@ -0,0 +1,712 @@
 +#include libcflat.h
 +#include processor.h
 +#include vm.h
 +#include desc.h
 +#include vmx.h
 +#include msr.h
 +#include smp.h
 +#include io.h
 +
 +int fails = 0, tests = 0;
 +u32 *vmxon_region;
 +struct vmcs *vmcs_root;
 +u32 vpid_cnt;
 +void *guest_stack, *guest_syscall_stack;
 +u32 ctrl_pin, ctrl_enter, ctrl_exit, ctrl_cpu[2];
 +ulong fix_cr0_set, fix_cr0_clr;
 +ulong fix_cr4_set, fix_cr4_clr;
 +struct regs regs;
 +struct vmx_test *current;
 +u64 hypercall_field = 0;
 +bool launched = 0;
 +
 +extern u64 gdt64_desc[];
 +extern u64 idt_descr[];
 +extern u64 tss_descr[];
 +extern void *vmx_return;
 +extern void *entry_sysenter;
 +extern void *guest_entry;
 +
 +static void report(const char *name, int result)
 +{
 + ++tests;
 + if (result)
 + printf(PASS: %s\n, name);
 + else {
 + printf(FAIL: %s\n, name);
 + ++fails;
 + }
 +}
 +
 +static int vmcs_clear(struct vmcs *vmcs)
 +{
 + bool ret;
 + asm volatile (vmclear %1; setbe %0 : =q (ret) : m (vmcs) : cc);
 + return ret;
 +}
 +
 +static u64 vmcs_read(enum Encoding enc)
 +{
 + u64 val;
 + asm volatile (vmread %1, %0 : =rm (val) : r ((u64)enc) : cc);
 + return val;
 +}
 +
 +static int vmcs_write(enum Encoding enc, u64 val)
 +{
 + bool ret;
 + asm volatile (vmwrite %1, %2; setbe %0
 + : =q(ret) : rm (val), r ((u64)enc) : cc);
 + return ret;
 +}
 +
 +static int make_vmcs_current(struct vmcs *vmcs)
 +{
 + bool ret;
 +
 + asm volatile (vmptrld %1; setbe %0 : =q (ret) 
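
As background for the MSR_IA32_VMX_TRUE_* capability MSRs added in the
quoted hunk above: each reports the allowed 0-settings of the corresponding
VMCS control field in its low 32 bits and the allowed 1-settings in its
high 32 bits. A hedged sketch of how a test could clamp a desired control
word against such an MSR follows; rdmsr() is assumed to be the test
library's helper, and the function name is illustrative.

/* Sketch: fold a desired control value against a TRUE_* capability MSR. */
static u32 sketch_adjust_ctrl(u32 want, u32 cap_msr)
{
	u64 cap = rdmsr(cap_msr);

	want |= (u32)cap;		/* low half: bits that must be 1 */
	want &= (u32)(cap >> 32);	/* high half: bits that may be 1 */
	return want;
}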

[PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h

2013-07-25 Thread Gleb Natapov
From: Nadav Har'El n...@il.ibm.com

This is the first patch in a series which adds nested EPT support to KVM's
nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
to set its own cr3 and take its own page faults without either of L0 or L1
getting involved. This often significanlty improves L2's performance over the
previous two alternatives (shadow page tables over EPT, and shadow page
tables over shadow page tables).

This patch adds EPT support to paging_tmpl.h.

paging_tmpl.h contains the code for reading and writing page tables. The code
for 32-bit and 64-bit tables is very similar, but not identical, so
paging_tmpl.h is #include'd twice in mmu.c, once with PTTTYPE=32 and once
with PTTYPE=64, and this generates the two sets of similar functions.

There are subtle but important differences between the format of EPT tables
and that of ordinary x86 64-bit page tables, so for nested EPT we need a
third set of functions to read the guest EPT table and to write the shadow
EPT table.

So this patch adds third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
with EPT) which correctly read and write EPT tables.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/mmu.c |5 +
 arch/x86/kvm/paging_tmpl.h |   43 +++
 2 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4c4274d..b5273c3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3494,6 +3494,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, 
unsigned level, unsigned gp
return mmu-last_pte_bitmap  (1  index);
 }
 
+#define PTTYPE_EPT 18 /* arbitrary */
+#define PTTYPE PTTYPE_EPT
+#include paging_tmpl.h
+#undef PTTYPE
+
 #define PTTYPE 64
 #include paging_tmpl.h
 #undef PTTYPE
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 7581395..e38b3c0 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -58,6 +58,21 @@
#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
#define CMPXCHG cmpxchg
+#elif PTTYPE == PTTYPE_EPT
+   #define pt_element_t u64
+   #define guest_walker guest_walkerEPT
+   #define FNAME(name) ept_##name
+   #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
+   #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+   #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
+   #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
+   #define PT_LEVEL_BITS PT64_LEVEL_BITS
+   #define PT_GUEST_ACCESSED_MASK 0
+   #define PT_GUEST_DIRTY_MASK 0
+   #define PT_GUEST_DIRTY_SHIFT 0
+   #define PT_GUEST_ACCESSED_SHIFT 0
+   #define CMPXCHG cmpxchg64
+   #define PT_MAX_FULL_LEVELS 4
 #else
#error Invalid PTTYPE value
 #endif
@@ -90,6 +105,10 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
 
 static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
 {
+#if PT_GUEST_DIRTY_MASK == 0
+   /* dirty bit is not supported, so no need to track it */
+   return;
+#else
unsigned mask;
 
BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
@@ -99,6 +118,7 @@ static inline void FNAME(protect_clean_gpte)(unsigned 
*access, unsigned gpte)
mask |= (gpte  (PT_GUEST_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) 
PT_WRITABLE_MASK;
*access = mask;
+#endif
 }
 
 static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
@@ -111,7 +131,11 @@ static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, 
u64 gpte, int level)
 
 static inline int FNAME(is_present_gpte)(unsigned long pte)
 {
+#if PTTYPE != PTTYPE_EPT
return is_present_gpte(pte);
+#else
+   return pte  7;
+#endif
 }
 
 static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
@@ -147,7 +171,8 @@ static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu 
*vcpu,
if (!FNAME(is_present_gpte)(gpte))
goto no_present;
 
-   if (!(gpte  PT_GUEST_ACCESSED_MASK))
+   /* if accessed bit is not supported prefetch non accessed gpte */
+   if (PT_GUEST_ACCESSED_MASK  !(gpte  PT_GUEST_ACCESSED_MASK))
goto no_present;
 
return false;
@@ -160,9 +185,14 @@ no_present:
 static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
 {
unsigned access;
-
+#if PTTYPE == PTTYPE_EPT
+   BUILD_BUG_ON(ACC_WRITE_MASK != VMX_EPT_WRITABLE_MASK);
+   access = (gpte  VMX_EPT_WRITABLE_MASK) | ACC_USER_MASK |
+   ((gpte  VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0);
+#else
access = (gpte  (PT_WRITABLE_MASK | 
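
To illustrate the multiple-inclusion scheme the commit message describes,
here is a small standalone sketch (hypothetical function names, not the
kernel's actual walker code) of how redefining PTTYPE and FNAME() before
each inclusion of the template yields one function set per page-table
format, e.g. an EPT present check that accepts any of the R/W/X bits
versus the x86-64 check of the present bit:

/* The paging_tmpl.h pattern in miniature (hypothetical names). */
#define PTTYPE_EPT 18

#define PTTYPE PTTYPE_EPT
#define FNAME(name) ept_##name
static int FNAME(is_present)(unsigned long pte)
{
	return pte & 7;		/* EPT: present if any of R/W/X is set */
}
#undef FNAME
#undef PTTYPE

#define PTTYPE 64
#define FNAME(name) paging64_##name
static int FNAME(is_present)(unsigned long pte)
{
	return pte & 1;		/* x86-64: the present bit is bit 0 */
}
#undef FNAME
#undef PTTYPE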

[PATCH v2] kvm-unit-tests : Basic architecture of VMX nested test case

2013-07-25 Thread Arthur Chunqi Li
This is the first version of the VMX nested test environment. It contains the
basic VMX instruction test cases, covering VMXON/VMXOFF/VMXPTRLD/VMXPTRST/
VMCLEAR/VMLAUNCH/VMRESUME/VMCALL. This patch also tests the basic execution
routine in the VMX nested environment and lets the VM print "Hello World" to
show that it ran successfully.

The first release also includes a test suite for vmenter (vmlaunch and
vmresume). In addition, a hypercall mechanism is included; currently it is
used to trigger a normal VM exit.

New files added:
x86/vmx.h : contains all VMX related macro declarations
x86/vmx.c : main file for VMX nested test case

Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
---
 config-x86-common.mak |2 +
 config-x86_64.mak |1 +
 lib/x86/msr.h |5 +
 x86/cstart64.S|4 +
 x86/unittests.cfg |6 +
 x86/vmx.c |  712 +
 x86/vmx.h |  474 
 7 files changed, 1204 insertions(+)
 create mode 100644 x86/vmx.c
 create mode 100644 x86/vmx.h

diff --git a/config-x86-common.mak b/config-x86-common.mak
index 455032b..34a41e1 100644
--- a/config-x86-common.mak
+++ b/config-x86-common.mak
@@ -101,6 +101,8 @@ $(TEST_DIR)/asyncpf.elf: $(cstart.o) $(TEST_DIR)/asyncpf.o
 
 $(TEST_DIR)/pcid.elf: $(cstart.o) $(TEST_DIR)/pcid.o
 
+$(TEST_DIR)/vmx.elf: $(cstart.o) $(TEST_DIR)/vmx.o
+
 arch_clean:
$(RM) $(TEST_DIR)/*.o $(TEST_DIR)/*.flat $(TEST_DIR)/*.elf \
$(TEST_DIR)/.*.d $(TEST_DIR)/lib/.*.d $(TEST_DIR)/lib/*.o
diff --git a/config-x86_64.mak b/config-x86_64.mak
index 4e525f5..bb8ee89 100644
--- a/config-x86_64.mak
+++ b/config-x86_64.mak
@@ -9,5 +9,6 @@ tests = $(TEST_DIR)/access.flat $(TEST_DIR)/apic.flat \
  $(TEST_DIR)/xsave.flat $(TEST_DIR)/rmap_chain.flat \
  $(TEST_DIR)/pcid.flat
 tests += $(TEST_DIR)/svm.flat
+tests += $(TEST_DIR)/vmx.flat
 
 include config-x86-common.mak
diff --git a/lib/x86/msr.h b/lib/x86/msr.h
index 509a421..281255a 100644
--- a/lib/x86/msr.h
+++ b/lib/x86/msr.h
@@ -396,6 +396,11 @@
 #define MSR_IA32_VMX_VMCS_ENUM  0x048a
 #define MSR_IA32_VMX_PROCBASED_CTLS20x048b
 #define MSR_IA32_VMX_EPT_VPID_CAP   0x048c
+#define MSR_IA32_VMX_TRUE_PIN  0x048d
+#define MSR_IA32_VMX_TRUE_PROC 0x048e
+#define MSR_IA32_VMX_TRUE_EXIT 0x048f
+#define MSR_IA32_VMX_TRUE_ENTRY0x0490
+
 
 /* AMD-V MSRs */
 
diff --git a/x86/cstart64.S b/x86/cstart64.S
index 24df5f8..0fe76da 100644
--- a/x86/cstart64.S
+++ b/x86/cstart64.S
@@ -4,6 +4,10 @@
 .globl boot_idt
 boot_idt = 0
 
+.globl idt_descr
+.globl tss_descr
+.globl gdt64_desc
+
 ipi_vector = 0x20
 
 max_cpus = 64
diff --git a/x86/unittests.cfg b/x86/unittests.cfg
index bc9643e..85c36aa 100644
--- a/x86/unittests.cfg
+++ b/x86/unittests.cfg
@@ -149,3 +149,9 @@ extra_params = --append 1000 `date +%s`
 file = pcid.flat
 extra_params = -cpu qemu64,+pcid
 arch = x86_64
+
+[vmx]
+file = vmx.flat
+extra_params = -cpu host,+vmx
+arch = x86_64
+
diff --git a/x86/vmx.c b/x86/vmx.c
new file mode 100644
index 000..af694e1
--- /dev/null
+++ b/x86/vmx.c
@@ -0,0 +1,712 @@
+#include libcflat.h
+#include processor.h
+#include vm.h
+#include desc.h
+#include vmx.h
+#include msr.h
+#include smp.h
+#include io.h
+
+int fails = 0, tests = 0;
+u32 *vmxon_region;
+struct vmcs *vmcs_root;
+u32 vpid_cnt;
+void *guest_stack, *guest_syscall_stack;
+u32 ctrl_pin, ctrl_enter, ctrl_exit, ctrl_cpu[2];
+ulong fix_cr0_set, fix_cr0_clr;
+ulong fix_cr4_set, fix_cr4_clr;
+struct regs regs;
+struct vmx_test *current;
+u64 hypercall_field = 0;
+bool launched = 0;
+
+extern u64 gdt64_desc[];
+extern u64 idt_descr[];
+extern u64 tss_descr[];
+extern void *vmx_return;
+extern void *entry_sysenter;
+extern void *guest_entry;
+
+static void report(const char *name, int result)
+{
+   ++tests;
+   if (result)
+   printf(PASS: %s\n, name);
+   else {
+   printf(FAIL: %s\n, name);
+   ++fails;
+   }
+}
+
+static int vmcs_clear(struct vmcs *vmcs)
+{
+   bool ret;
+   asm volatile (vmclear %1; setbe %0 : =q (ret) : m (vmcs) : cc);
+   return ret;
+}
+
+static u64 vmcs_read(enum Encoding enc)
+{
+   u64 val;
+   asm volatile (vmread %1, %0 : =rm (val) : r ((u64)enc) : cc);
+   return val;
+}
+
+static int vmcs_write(enum Encoding enc, u64 val)
+{
+   bool ret;
+   asm volatile (vmwrite %1, %2; setbe %0
+   : =q(ret) : rm (val), r ((u64)enc) : cc);
+   return ret;
+}
+
+static int make_vmcs_current(struct vmcs *vmcs)
+{
+   bool ret;
+
+   asm volatile (vmptrld %1; setbe %0 : =q (ret) : m (vmcs) : cc);
+   return ret;
+}
+
+static int save_vmcs(struct vmcs **vmcs)
+{
+   bool ret;
+
+   asm volatile (vmptrst %1; setbe %0 : =q (ret) : m (*vmcs) : cc);
+   return ret;
+}
+
+/* entry_sysenter */
+asm(
+   .align 4, 0x90\n\t
+   .globl 
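
As a usage illustration for the accessors defined above (vmcs_clear(),
make_vmcs_current(), vmcs_write(), vmcs_read(), report()), a test built on
this framework might bring a VMCS into the current state and do a
write/read round-trip roughly as follows. GUEST_RIP is assumed to be one of
the field encodings provided by enum Encoding in x86/vmx.h; this is a
sketch, not code from the patch.

/* Sketch: exercise the VMCS accessors defined earlier in this file. */
static void sketch_vmcs_roundtrip(struct vmcs *vmcs)
{
	report("vmclear", vmcs_clear(vmcs) == 0);
	report("vmptrld", make_vmcs_current(vmcs) == 0);

	/* Write a guest field and read it back through the current VMCS. */
	report("vmwrite GUEST_RIP", vmcs_write(GUEST_RIP, 0x123456ull) == 0);
	report("vmread GUEST_RIP", vmcs_read(GUEST_RIP) == 0x123456ull);
}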

Re: [PATCH v2] kvm-unit-tests : Basic architecture of VMX nested test case

2013-07-25 Thread Bandan Das
Arthur Chunqi Li yzt...@gmail.com writes:

 This is the first version of VMX nested environment. It contains the
 basic VMX instructions test cases, including VMXON/VMXOFF/VMXPTRLD/
 VMXPTRST/VMCLEAR/VMLAUNCH/VMRESUME/VMCALL. This patchalso tests the
 basic execution routine in VMX nested environment andlet the VM print
 Hello World to inform its successfully run.

 The first release also includes a test suite for vmenter (vmlaunch and
 vmresume). Besides, hypercall mechanism is included and currently it is
 used to invoke VM normal exit.

What's the difference between this and the one you posted earlier :
[PATCH] kvm-unit-tests : Basic architecture of VMX nested test case
1374730297-27169-1-git-send-email-yzt...@gmail.com

Can you please mention what's changed in v2 ?

Bandan

 New files added:
 x86/vmx.h : contains all VMX related macro declerations
 x86/vmx.c : main file for VMX nested test case

 Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
 ---
  config-x86-common.mak |2 +
  config-x86_64.mak |1 +
  lib/x86/msr.h |5 +
  x86/cstart64.S|4 +
  x86/unittests.cfg |6 +
  x86/vmx.c |  712 
 +
  x86/vmx.h |  474 
  7 files changed, 1204 insertions(+)
  create mode 100644 x86/vmx.c
  create mode 100644 x86/vmx.h

 diff --git a/config-x86-common.mak b/config-x86-common.mak
 index 455032b..34a41e1 100644
 --- a/config-x86-common.mak
 +++ b/config-x86-common.mak
 @@ -101,6 +101,8 @@ $(TEST_DIR)/asyncpf.elf: $(cstart.o) $(TEST_DIR)/asyncpf.o
  
  $(TEST_DIR)/pcid.elf: $(cstart.o) $(TEST_DIR)/pcid.o
  
 +$(TEST_DIR)/vmx.elf: $(cstart.o) $(TEST_DIR)/vmx.o
 +
  arch_clean:
   $(RM) $(TEST_DIR)/*.o $(TEST_DIR)/*.flat $(TEST_DIR)/*.elf \
   $(TEST_DIR)/.*.d $(TEST_DIR)/lib/.*.d $(TEST_DIR)/lib/*.o
 diff --git a/config-x86_64.mak b/config-x86_64.mak
 index 4e525f5..bb8ee89 100644
 --- a/config-x86_64.mak
 +++ b/config-x86_64.mak
 @@ -9,5 +9,6 @@ tests = $(TEST_DIR)/access.flat $(TEST_DIR)/apic.flat \
 $(TEST_DIR)/xsave.flat $(TEST_DIR)/rmap_chain.flat \
 $(TEST_DIR)/pcid.flat
  tests += $(TEST_DIR)/svm.flat
 +tests += $(TEST_DIR)/vmx.flat
  
  include config-x86-common.mak
 diff --git a/lib/x86/msr.h b/lib/x86/msr.h
 index 509a421..281255a 100644
 --- a/lib/x86/msr.h
 +++ b/lib/x86/msr.h
 @@ -396,6 +396,11 @@
  #define MSR_IA32_VMX_VMCS_ENUM  0x048a
  #define MSR_IA32_VMX_PROCBASED_CTLS20x048b
  #define MSR_IA32_VMX_EPT_VPID_CAP   0x048c
 +#define MSR_IA32_VMX_TRUE_PIN0x048d
 +#define MSR_IA32_VMX_TRUE_PROC   0x048e
 +#define MSR_IA32_VMX_TRUE_EXIT   0x048f
 +#define MSR_IA32_VMX_TRUE_ENTRY  0x0490
 +
  
  /* AMD-V MSRs */
  
 diff --git a/x86/cstart64.S b/x86/cstart64.S
 index 24df5f8..0fe76da 100644
 --- a/x86/cstart64.S
 +++ b/x86/cstart64.S
 @@ -4,6 +4,10 @@
  .globl boot_idt
  boot_idt = 0
  
 +.globl idt_descr
 +.globl tss_descr
 +.globl gdt64_desc
 +
  ipi_vector = 0x20
  
  max_cpus = 64
 diff --git a/x86/unittests.cfg b/x86/unittests.cfg
 index bc9643e..85c36aa 100644
 --- a/x86/unittests.cfg
 +++ b/x86/unittests.cfg
 @@ -149,3 +149,9 @@ extra_params = --append 1000 `date +%s`
  file = pcid.flat
  extra_params = -cpu qemu64,+pcid
  arch = x86_64
 +
 +[vmx]
 +file = vmx.flat
 +extra_params = -cpu host,+vmx
 +arch = x86_64
 +
 diff --git a/x86/vmx.c b/x86/vmx.c
 new file mode 100644
 index 000..af694e1
 --- /dev/null
 +++ b/x86/vmx.c
 @@ -0,0 +1,712 @@
 +#include libcflat.h
 +#include processor.h
 +#include vm.h
 +#include desc.h
 +#include vmx.h
 +#include msr.h
 +#include smp.h
 +#include io.h
 +
 +int fails = 0, tests = 0;
 +u32 *vmxon_region;
 +struct vmcs *vmcs_root;
 +u32 vpid_cnt;
 +void *guest_stack, *guest_syscall_stack;
 +u32 ctrl_pin, ctrl_enter, ctrl_exit, ctrl_cpu[2];
 +ulong fix_cr0_set, fix_cr0_clr;
 +ulong fix_cr4_set, fix_cr4_clr;
 +struct regs regs;
 +struct vmx_test *current;
 +u64 hypercall_field = 0;
 +bool launched = 0;
 +
 +extern u64 gdt64_desc[];
 +extern u64 idt_descr[];
 +extern u64 tss_descr[];
 +extern void *vmx_return;
 +extern void *entry_sysenter;
 +extern void *guest_entry;
 +
 +static void report(const char *name, int result)
 +{
 + ++tests;
 + if (result)
 + printf(PASS: %s\n, name);
 + else {
 + printf(FAIL: %s\n, name);
 + ++fails;
 + }
 +}
 +
 +static int vmcs_clear(struct vmcs *vmcs)
 +{
 + bool ret;
 + asm volatile (vmclear %1; setbe %0 : =q (ret) : m (vmcs) : cc);
 + return ret;
 +}
 +
 +static u64 vmcs_read(enum Encoding enc)
 +{
 + u64 val;
 + asm volatile (vmread %1, %0 : =rm (val) : r ((u64)enc) : cc);
 + return val;
 +}
 +
 +static int vmcs_write(enum Encoding enc, u64 val)
 +{
 + bool ret;
 + asm volatile (vmwrite %1, %2; setbe %0
 + : =q(ret) : rm (val), r ((u64)enc) : cc);
 + return ret;
 +}

Re: [PATCH v2] kvm-unit-tests : Basic architecture of VMX nested test case

2013-07-25 Thread Jan Kiszka
On 2013-07-25 18:51, Bandan Das wrote:
 Arthur Chunqi Li yzt...@gmail.com writes:
 
 This is the first version of VMX nested environment. It contains the
 basic VMX instructions test cases, including VMXON/VMXOFF/VMXPTRLD/
 VMXPTRST/VMCLEAR/VMLAUNCH/VMRESUME/VMCALL. This patchalso tests the
 basic execution routine in VMX nested environment andlet the VM print
 Hello World to inform its successfully run.

 The first release also includes a test suite for vmenter (vmlaunch and
 vmresume). Besides, hypercall mechanism is included and currently it is
 used to invoke VM normal exit.
 
 What's the difference between this and the one you posted earlier :
 [PATCH] kvm-unit-tests : Basic architecture of VMX nested test case
 1374730297-27169-1-git-send-email-yzt...@gmail.com
 
 Can you please mention what's changed in v2 ?

True. A changelog can go...

 
 Bandan
 
 New files added:
 x86/vmx.h : contains all VMX related macro declerations
 x86/vmx.c : main file for VMX nested test case

 Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
 ---

...here, ie. after the --- so that it wont be part of the commit.

Jan




signature.asc
Description: OpenPGP digital signature


Re: [PATCH v2] kvm-unit-tests : Basic architecture of VMX nested test case

2013-07-25 Thread Gmail

 On 2013-07-25 18:51, Bandan Das wrote:
 Arthur Chunqi Li yzt...@gmail.com writes:
 
 This is the first version of VMX nested environment. It contains the
 basic VMX instructions test cases, including VMXON/VMXOFF/VMXPTRLD/
 VMXPTRST/VMCLEAR/VMLAUNCH/VMRESUME/VMCALL. This patchalso tests the
 basic execution routine in VMX nested environment andlet the VM print
 Hello World to inform its successfully run.
 
 The first release also includes a test suite for vmenter (vmlaunch and
 vmresume). Besides, hypercall mechanism is included and currently it is
 used to invoke VM normal exit.
 
 What's the difference between this and the one you posted earlier :
 [PATCH] kvm-unit-tests : Basic architecture of VMX nested test case
 1374730297-27169-1-git-send-email-yzt...@gmail.com
 
 Can you please mention what's changed in v2 ?
 
 True. A changelog can go...
Compared to v1, v2 removes two unused inline functions (vmlaunch/vmresume) in
x86/vmx.h, and adds host rflags to struct regs so that the user can get the
host's rflags in exit_handler.
 
 
 Bandan
 
 New files added:
 x86/vmx.h : contains all VMX related macro declerations
 x86/vmx.c : main file for VMX nested test case
 
 Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
 ---
 
 ...here, ie. after the --- so that it wont be part of the commit.
 
 Jan
 
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 60629] New: Starting a virtual machine ater suspend causes the host system hangup

2013-07-25 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=60629

Bug ID: 60629
   Summary: Starting a virtual machine ater suspend causes the
host system hangup
   Product: Virtualization
   Version: unspecified
Kernel Version: 2.6.35-32 - 3.9.9-1
  Hardware: x86-64
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: high
  Priority: P1
 Component: kvm
  Assignee: virtualization_...@kernel-bugs.osdl.org
  Reporter: ffsi...@yandex.ru
Regression: No

HW:
MB: Gigabyte EP43UD3l
CPU: CdQ Q8400
RAM: 4Gb

uname -a
Linux hostname 3.9.9-1-ARCH #1 SMP PREEMPT Wed Jul 3 22:45:16 CEST 2013 x86_64
GNU/Linux

I have the same problem as in
http://www.linux.org.ru/forum/linux-hardware/7455179
so just google-translate:

To get started: Ubuntu 10.10 64-bit, kernel 2.6.35-32 64-bit, Intel(R) Core
(TM)2 Quad CPU Q8300 @ 2.50GHz, motherboard Gigabyte EP43-DS3L. This processor
supports virtualization; with 3 to 3.5 GB of RAM everything works correctly,
with no problems, in both 32-bit and 64-bit. The problems begin with 4 GB or
more installed and virtualization enabled in the BIOS; there is no problem if
virtualization is disabled. The problem: with virtualization enabled and 4 GB
of RAM or more, after the computer wakes from sleep (suspend-to-RAM) and
VirtualBox or VMware is run, the machine freezes completely; it cannot be
reached via ssh or in any other way, and there is no reaction to sysrq either.
Investigating further, I discovered that before sleep:

cat /proc/cpuinfo | grep 'model name'
model name: Intel(R) Core(TM)2 Quad CPUQ8300  @ 2.50GHz
model name: Intel(R) Core(TM)2 Quad CPUQ8300  @ 2.50GHz
model name: Intel(R) Core(TM)2 Quad CPUQ8300  @ 2.50GHz
model name: Intel(R) Core(TM)2 Quad CPUQ8300  @ 2.50GHz

After sleep:

cat /proc/cpuinfo | grep 'model name'
model name: Intel(R) Core(TM)2 Quad CPUQ8300  @ 2.50GHz
model name: 06/17
model name: 06/17
model name: 06/17

I.e. the model name changes to something strange. With virtualization turned
off (and 4 GB of RAM or more) it does not happen, and it does not occur with
3 GB of RAM whether virtualization is enabled or disabled. Part of the problem
was solved with the help of a script placed in /etc/pm/sleep.d/

#!/bin/sh
case $1 in
 hibernate|suspend)
  echo 0 > /sys/devices/system/cpu/cpu1/online
  echo 0 > /sys/devices/system/cpu/cpu2/online
  echo 0 > /sys/devices/system/cpu/cpu3/online
  ;;
 thaw|resume)
  echo 0 > /sys/devices/system/cpu/cpu1/online
  echo 0 > /sys/devices/system/cpu/cpu2/online
  echo 0 > /sys/devices/system/cpu/cpu3/online
  echo 1 > /sys/devices/system/cpu/cpu1/online
  echo 1 > /sys/devices/system/cpu/cpu2/online
  echo 1 > /sys/devices/system/cpu/cpu3/online
/etc/init.d/microcode.ctl
  echo 0 > /sys/devices/system/cpu/cpu1/online
  echo 0 > /sys/devices/system/cpu/cpu2/online
  echo 0 > /sys/devices/system/cpu/cpu3/online
  echo 1 > /sys/devices/system/cpu/cpu1/online
  echo 1 > /sys/devices/system/cpu/cpu2/online
  echo 1 > /sys/devices/system/cpu/cpu3/online
  ;;
esac

(But the model name still shows '06/17' anyway.)


But there is a nasty catch: if VMware or VirtualBox is left running, then
after waking up the machine freezes again; the script does not get a chance
to run.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/4] powerpc: book3e: _PAGE_LENDIAN must be _PAGE_ENDIAN

2013-07-25 Thread Bharat Bhushan
For book3e, _PAGE_ENDIAN is not defined. In fact, what is defined
is _PAGE_LENDIAN, which is wrong; it should be _PAGE_ENDIAN.
There are no compilation errors because
arch/powerpc/include/asm/pte-common.h defines _PAGE_ENDIAN to 0
when it is not defined anywhere else.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
 arch/powerpc/include/asm/pte-book3e.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/pte-book3e.h 
b/arch/powerpc/include/asm/pte-book3e.h
index 0156702..576ad88 100644
--- a/arch/powerpc/include/asm/pte-book3e.h
+++ b/arch/powerpc/include/asm/pte-book3e.h
@@ -40,7 +40,7 @@
 #define _PAGE_U1   0x01
 #define _PAGE_U0   0x02
 #define _PAGE_ACCESSED 0x04
-#define _PAGE_LENDIAN  0x08
+#define _PAGE_ENDIAN   0x08
 #define _PAGE_GUARDED  0x10
 #define _PAGE_COHERENT 0x20 /* M: enforce memory coherence */
 #define _PAGE_NO_CACHE 0x40 /* I: cache inhibit */
-- 
1.7.0.4


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/4] kvm: powerpc: allow guest control E attribute in mas2

2013-07-25 Thread Bharat Bhushan
The E bit in MAS2 indicates whether the page is accessed
in little-endian or big-endian byte order.
There is no reason to stop the guest from setting E, so allow it.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
 arch/powerpc/kvm/e500.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index c2e5e98..277cb18 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -117,7 +117,7 @@ static inline struct kvmppc_vcpu_e500 *to_e500(struct 
kvm_vcpu *vcpu)
 #define E500_TLB_USER_PERM_MASK (MAS3_UX|MAS3_UR|MAS3_UW)
 #define E500_TLB_SUPER_PERM_MASK (MAS3_SX|MAS3_SR|MAS3_SW)
 #define MAS2_ATTRIB_MASK \
- (MAS2_X0 | MAS2_X1)
+ (MAS2_X0 | MAS2_X1 | MAS2_E)
 #define MAS3_ATTRIB_MASK \
  (MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3 \
   | E500_TLB_USER_PERM_MASK | E500_TLB_SUPER_PERM_MASK)
-- 
1.7.0.4


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/4] kvm: powerpc: allow guest control G attribute in mas2

2013-07-25 Thread Bharat Bhushan
The G bit in MAS2 indicates whether the page is Guarded.
There is no reason to stop the guest from setting G, so allow it.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
 arch/powerpc/kvm/e500.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index 277cb18..4fd9650 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -117,7 +117,7 @@ static inline struct kvmppc_vcpu_e500 *to_e500(struct 
kvm_vcpu *vcpu)
 #define E500_TLB_USER_PERM_MASK (MAS3_UX|MAS3_UR|MAS3_UW)
 #define E500_TLB_SUPER_PERM_MASK (MAS3_SX|MAS3_SR|MAS3_SW)
 #define MAS2_ATTRIB_MASK \
- (MAS2_X0 | MAS2_X1 | MAS2_E)
+ (MAS2_X0 | MAS2_X1 | MAS2_E | MAS2_G)
 #define MAS3_ATTRIB_MASK \
  (MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3 \
   | E500_TLB_USER_PERM_MASK | E500_TLB_SUPER_PERM_MASK)
-- 
1.7.0.4


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/4] kvm: powerpc: set cache coherency only for RAM pages

2013-07-25 Thread Bharat Bhushan
If the page is RAM, map it as cacheable and coherent (set the M bit);
otherwise the page is treated as I/O and mapped as cache inhibited
and guarded (set I + G).

This helps set up a proper MMU mapping for directly assigned devices.

NOTE: There can be devices that require cacheable mapping, which is not yet 
supported.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
 arch/powerpc/kvm/e500_mmu_host.c |   24 +++-
 1 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 1c6a9d7..5cbdc8f 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -64,13 +64,27 @@ static inline u32 e500_shadow_mas3_attrib(u32 mas3, int 
usermode)
return mas3;
 }
 
-static inline u32 e500_shadow_mas2_attrib(u32 mas2, int usermode)
+static inline u32 e500_shadow_mas2_attrib(u32 mas2, pfn_t pfn)
 {
+   u32 mas2_attr;
+
+   mas2_attr = mas2  MAS2_ATTRIB_MASK;
+
+   if (kvm_is_mmio_pfn(pfn)) {
+   /*
+* If page is not RAM then it is treated as I/O page.
+* Map it with cache inhibited and guarded (set I + G).
+*/
+   mas2_attr |= MAS2_I | MAS2_G;
+   return mas2_attr;
+   }
+
+   /* Map RAM pages as cacheable (Not setting I in MAS2) */
 #ifdef CONFIG_SMP
-   return (mas2  MAS2_ATTRIB_MASK) | MAS2_M;
-#else
-   return mas2  MAS2_ATTRIB_MASK;
+   /* Also map as coherent (set M) in SMP */
+   mas2_attr |= MAS2_M;
 #endif
+   return mas2_attr;
 }
 
 /*
@@ -313,7 +327,7 @@ static void kvmppc_e500_setup_stlbe(
/* Force IPROT=0 for all guest mappings. */
stlbe-mas1 = MAS1_TSIZE(tsize) | get_tlb_sts(gtlbe) | MAS1_VALID;
stlbe-mas2 = (gvaddr  MAS2_EPN) |
- e500_shadow_mas2_attrib(gtlbe-mas2, pr);
+ e500_shadow_mas2_attrib(gtlbe-mas2, pfn);
stlbe-mas7_3 = ((u64)pfn  PAGE_SHIFT) |
e500_shadow_mas3_attrib(gtlbe-mas7_3, pr);
 
-- 
1.7.0.4


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4 v6] powerpc: export debug registers save function for KVM

2013-07-25 Thread Alexander Graf

On 04.07.2013, at 08:57, Bharat Bhushan wrote:

 KVM need this function when switching from vcpu to user-space
 thread. My subsequent patch will use this function.
 
 Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com

Ben / Michael, please ack.


Alex

 ---
 v5-v6
 - switch_booke_debug_regs() not guarded by the compiler switch
 
 arch/powerpc/include/asm/switch_to.h |1 +
 arch/powerpc/kernel/process.c|3 ++-
 2 files changed, 3 insertions(+), 1 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/switch_to.h 
 b/arch/powerpc/include/asm/switch_to.h
 index 200d763..db68f1d 100644
 --- a/arch/powerpc/include/asm/switch_to.h
 +++ b/arch/powerpc/include/asm/switch_to.h
 @@ -29,6 +29,7 @@ extern void giveup_vsx(struct task_struct *);
 extern void enable_kernel_spe(void);
 extern void giveup_spe(struct task_struct *);
 extern void load_up_spe(struct task_struct *);
 +extern void switch_booke_debug_regs(struct thread_struct *new_thread);
 
 #ifndef CONFIG_SMP
 extern void discard_lazy_cpu_state(void);
 diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
 index 01ff496..da586aa 100644
 --- a/arch/powerpc/kernel/process.c
 +++ b/arch/powerpc/kernel/process.c
 @@ -362,12 +362,13 @@ static void prime_debug_regs(struct thread_struct 
 *thread)
  * debug registers, set the debug registers from the values
  * stored in the new thread.
  */
 -static void switch_booke_debug_regs(struct thread_struct *new_thread)
 +void switch_booke_debug_regs(struct thread_struct *new_thread)
 {
   if ((current-thread.debug.dbcr0  DBCR0_IDM)
   || (new_thread-debug.dbcr0  DBCR0_IDM))
   prime_debug_regs(new_thread);
 }
 +EXPORT_SYMBOL_GPL(switch_booke_debug_regs);
 #else /* !CONFIG_PPC_ADV_DEBUG_REGS */
 #ifndef CONFIG_HAVE_HW_BREAKPOINT
 static void set_debug_reg_defaults(struct thread_struct *thread)
 -- 
 1.7.0.4
 
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm-ppc in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/8] KVM: PPC: Book3S PR: Load up SPRG3 register with guest value on guest entry

2013-07-25 Thread Alexander Graf

On 11.07.2013, at 13:49, Paul Mackerras wrote:

 Unlike the other general-purpose SPRs, SPRG3 can be read by usermode
 code, and is used in recent kernels to store the CPU and NUMA node
 numbers so that they can be read by VDSO functions.  Thus we need to
 load the guest's SPRG3 value into the real SPRG3 register when entering
 the guest, and restore the host's value when exiting the guest.  We don't
 need to save the guest SPRG3 value when exiting the guest as usermode
 code can't modify SPRG3.

This loads SPRG3 on every guest exit, which can happen a lot with instruction 
emulation. Since the kernel doesn't rely on the contents of SPRG3 we only have 
to care about it when not in KVM code, right?

So could we move this to kvmppc_core_vcpu_load/put instead?


Alex

 
 Signed-off-by: Paul Mackerras pau...@samba.org
 ---
 arch/powerpc/kernel/asm-offsets.c|  1 +
 arch/powerpc/kvm/book3s_interrupts.S | 14 ++
 2 files changed, 15 insertions(+)
 
 diff --git a/arch/powerpc/kernel/asm-offsets.c 
 b/arch/powerpc/kernel/asm-offsets.c
 index 6f16ffa..a67c76e 100644
 --- a/arch/powerpc/kernel/asm-offsets.c
 +++ b/arch/powerpc/kernel/asm-offsets.c
 @@ -452,6 +452,7 @@ int main(void)
   DEFINE(VCPU_SPRG2, offsetof(struct kvm_vcpu, arch.shregs.sprg2));
   DEFINE(VCPU_SPRG3, offsetof(struct kvm_vcpu, arch.shregs.sprg3));
 #endif
 + DEFINE(VCPU_SHARED_SPRG3, offsetof(struct kvm_vcpu_arch_shared, sprg3));
   DEFINE(VCPU_SHARED_SPRG4, offsetof(struct kvm_vcpu_arch_shared, sprg4));
   DEFINE(VCPU_SHARED_SPRG5, offsetof(struct kvm_vcpu_arch_shared, sprg5));
   DEFINE(VCPU_SHARED_SPRG6, offsetof(struct kvm_vcpu_arch_shared, sprg6));
 diff --git a/arch/powerpc/kvm/book3s_interrupts.S 
 b/arch/powerpc/kvm/book3s_interrupts.S
 index 48cbbf8..17cfae5 100644
 --- a/arch/powerpc/kvm/book3s_interrupts.S
 +++ b/arch/powerpc/kvm/book3s_interrupts.S
 @@ -92,6 +92,11 @@ kvm_start_lightweight:
   PPC_LL  r3, VCPU_HFLAGS(r4)
   rldicl  r3, r3, 0, 63   /* r3 = 1 */
   stb r3, HSTATE_RESTORE_HID5(r13)
 +
 + /* Load up guest SPRG3 value, since it's user readable */
 + ld  r3, VCPU_SHARED(r4)
 + ld  r3, VCPU_SHARED_SPRG3(r3)
 + mtspr   SPRN_SPRG3, r3
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
   PPC_LL  r4, VCPU_SHADOW_MSR(r4) /* get shadow_msr */
 @@ -123,6 +128,15 @@ kvmppc_handler_highmem:
   /* R7 = vcpu */
   PPC_LL  r7, GPR4(r1)
 
 +#ifdef CONFIG_PPC_BOOK3S_64
 + /*
 +  * Reload kernel SPRG3 value.
 +  * No need to save guest value as usermode can't modify SPRG3.
 +  */
 + ld  r3, PACA_SPRG3(r13)
 + mtspr   SPRN_SPRG3, r3
 +#endif /* CONFIG_PPC_BOOK3S_64 */
 +
   PPC_STL r14, VCPU_GPR(R14)(r7)
   PPC_STL r15, VCPU_GPR(R15)(r7)
   PPC_STL r16, VCPU_GPR(R16)(r7)
 -- 
 1.8.3.1
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm-ppc in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/8] KVM: PPC: Book3S PR: Load up SPRG3 register with guest value on guest entry

2013-07-25 Thread Alexander Graf

On 25.07.2013, at 15:38, Alexander Graf wrote:

 
 On 11.07.2013, at 13:49, Paul Mackerras wrote:
 
 Unlike the other general-purpose SPRs, SPRG3 can be read by usermode
 code, and is used in recent kernels to store the CPU and NUMA node
 numbers so that they can be read by VDSO functions.  Thus we need to
 load the guest's SPRG3 value into the real SPRG3 register when entering
 the guest, and restore the host's value when exiting the guest.  We don't
 need to save the guest SPRG3 value when exiting the guest as usermode
 code can't modify SPRG3.
 
 This loads SPRG3 on every guest exit, which can happen a lot with instruction 
 emulation. Since the kernel doesn't rely on the contents of SPRG3 we only 
 have to care about it when not in KVM code, right?
 
 So could we move this to kvmppc_core_vcpu_load/put instead?

but then again if all the shadow copy code is negligible performance wise, so 
is this probably. Applied to kvm-ppc-queue.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/8] KVM: PPC: Book3S PR: Keep volatile reg values in vcpu rather than shadow_vcpu

2013-07-25 Thread Alexander Graf

On 11.07.2013, at 13:50, Paul Mackerras wrote:

 Currently PR-style KVM keeps the volatile guest register values
 (R0 - R13, CR, LR, CTR, XER, PC) in a shadow_vcpu struct rather than
 the main kvm_vcpu struct.  For 64-bit, the shadow_vcpu exists in two
 places, a kmalloc'd struct and in the PACA, and it gets copied back
 and forth in kvmppc_core_vcpu_load/put(), because the real-mode code
 can't rely on being able to access the kmalloc'd struct.
 
 This changes the code to copy the volatile values into the shadow_vcpu
 as one of the last things done before entering the guest.  Similarly
 the values are copied back out of the shadow_vcpu to the kvm_vcpu
 immediately after exiting the guest.  We arrange for interrupts to be
 still disabled at this point so that we can't get preempted on 64-bit
 and end up copying values from the wrong PACA.
 
 This means that the accessor functions in kvm_book3s.h for these
 registers are greatly simplified, and are same between PR and HV KVM.
 In places where accesses to shadow_vcpu fields are now replaced by
 accesses to the kvm_vcpu, we can also remove the svcpu_get/put pairs.
 Finally, on 64-bit, we don't need the kmalloc'd struct at all any more.
 
 With this, the time to read the PVR one million times in a loop went
 from 478.2ms to 480.1ms (averages of 4 values), a difference which is
 not statistically significant given the variability of the results.
 
 Signed-off-by: Paul Mackerras pau...@samba.org
 ---
 arch/powerpc/include/asm/kvm_book3s.h | 193 +-
 arch/powerpc/include/asm/kvm_book3s_asm.h |   6 +-
 arch/powerpc/include/asm/kvm_host.h   |   1 +
 arch/powerpc/kernel/asm-offsets.c |   3 +-
 arch/powerpc/kvm/book3s_emulate.c |   8 +-
 arch/powerpc/kvm/book3s_interrupts.S  | 101 
 arch/powerpc/kvm/book3s_pr.c  |  68 +--
 arch/powerpc/kvm/book3s_rmhandlers.S  |   5 -
 arch/powerpc/kvm/trace.h  |   7 +-
 9 files changed, 175 insertions(+), 217 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
 b/arch/powerpc/include/asm/kvm_book3s.h
 index 08891d0..5d68f6c 100644
 --- a/arch/powerpc/include/asm/kvm_book3s.h
 +++ b/arch/powerpc/include/asm/kvm_book3s.h
 @@ -198,149 +198,97 @@ extern void kvm_return_point(void);
 #include asm/kvm_book3s_64.h
 #endif
 
 -#ifdef CONFIG_KVM_BOOK3S_PR
 -
 -static inline unsigned long kvmppc_interrupt_offset(struct kvm_vcpu *vcpu)
 -{
 - return to_book3s(vcpu)-hior;
 -}
 -
 -static inline void kvmppc_update_int_pending(struct kvm_vcpu *vcpu,
 - unsigned long pending_now, unsigned long old_pending)
 -{
 - if (pending_now)
 - vcpu-arch.shared-int_pending = 1;
 - else if (old_pending)
 - vcpu-arch.shared-int_pending = 0;
 -}
 -
 static inline void kvmppc_set_gpr(struct kvm_vcpu *vcpu, int num, ulong val)
 {
 - if ( num  14 ) {
 - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
 - svcpu-gpr[num] = val;
 - svcpu_put(svcpu);
 - to_book3s(vcpu)-shadow_vcpu-gpr[num] = val;
 - } else
 - vcpu-arch.gpr[num] = val;
 + vcpu-arch.gpr[num] = val;
 }
 
 static inline ulong kvmppc_get_gpr(struct kvm_vcpu *vcpu, int num)
 {
 - if ( num  14 ) {
 - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
 - ulong r = svcpu-gpr[num];
 - svcpu_put(svcpu);
 - return r;
 - } else
 - return vcpu-arch.gpr[num];
 + return vcpu-arch.gpr[num];
 }
 
 static inline void kvmppc_set_cr(struct kvm_vcpu *vcpu, u32 val)
 {
 - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
 - svcpu-cr = val;
 - svcpu_put(svcpu);
 - to_book3s(vcpu)-shadow_vcpu-cr = val;
 + vcpu-arch.cr = val;
 }
 
 static inline u32 kvmppc_get_cr(struct kvm_vcpu *vcpu)
 {
 - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
 - u32 r;
 - r = svcpu-cr;
 - svcpu_put(svcpu);
 - return r;
 + return vcpu-arch.cr;
 }
 
 static inline void kvmppc_set_xer(struct kvm_vcpu *vcpu, u32 val)
 {
 - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
 - svcpu-xer = val;
 - to_book3s(vcpu)-shadow_vcpu-xer = val;
 - svcpu_put(svcpu);
 + vcpu-arch.xer = val;
 }
 
 static inline u32 kvmppc_get_xer(struct kvm_vcpu *vcpu)
 {
 - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
 - u32 r;
 - r = svcpu-xer;
 - svcpu_put(svcpu);
 - return r;
 + return vcpu-arch.xer;
 }
 
 static inline void kvmppc_set_ctr(struct kvm_vcpu *vcpu, ulong val)
 {
 - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
 - svcpu-ctr = val;
 - svcpu_put(svcpu);
 + vcpu-arch.ctr = val;
 }
 
 static inline ulong kvmppc_get_ctr(struct kvm_vcpu *vcpu)
 {
 - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
 - ulong r;
 - r = svcpu-ctr;
 - svcpu_put(svcpu);
 - 

Re: [PATCH 2/2] kvm: powerpc: set cache coherency only for kernel managed pages

2013-07-25 Thread Alexander Graf

On 25.07.2013, at 10:50, Gleb Natapov wrote:

 On Wed, Jul 24, 2013 at 03:32:49PM -0500, Scott Wood wrote:
 On 07/24/2013 04:39:59 AM, Alexander Graf wrote:
 
 On 24.07.2013, at 11:35, Gleb Natapov wrote:
 
 On Wed, Jul 24, 2013 at 11:21:11AM +0200, Alexander Graf wrote:
 Are not we going to use page_is_ram() from
 e500_shadow_mas2_attrib() as Scott commented?
 
 rWhy aren't we using page_is_ram() in kvm_is_mmio_pfn()?
 
 
 Because it is much slower and, IIRC, actually used to build pfn
 map that allow
 us to check quickly for valid pfn.
 
 Then why should we use page_is_ram()? :)
 
 I really don't want the e500 code to diverge too much from what
 the rest of the kvm code is doing.
 
 I don't understand actually used to build pfn map  What code
 is this?  I don't see any calls to page_is_ram() in the KVM code, or
 in generic mm code.  Is this a statement about what x86 does?
 It may be not page_is_ram() directly, but the same into page_is_ram() is
 using. On power both page_is_ram() and do_init_bootmem() walks some kind
 of memblock_region data structure. What important is that pfn_valid()
 does not mean that there is a memory behind page structure. See Andrea's
 reply.
 
 
 On PPC page_is_ram() is only called (AFAICT) for determining what
 attributes to set on mmaps.  We want to be sure that KVM always
 makes the same decision.  While pfn_valid() seems like it should be
 equivalent, it's not obvious from the PPC code that it is.
 
 Again pfn_valid() is not enough.
 
 If pfn_valid() is better, why is that not used for mmap?  Why are
 there two different names for the same thing?
 
 They are not the same thing. page_is_ram() tells you if phys address is
 ram backed. pfn_valid() tells you if there is struct page behind the
 pfn. PageReserved() tells if you a pfn is marked as reserved. All non
 ram pfns should be reserved, but ram pfns can be reserved too. Again,
 see Andrea's reply.
 
 Why ppc uses page_is_ram() for mmap? How should I know? But looking at

That one's easy. Let's just ask Ben. Ben, is there any particular reason PPC 
uses page_is_ram() rather than what KVM does here to figure out whether a pfn 
is RAM or not? It would be really useful to be able to run the exact same logic 
that figures out whether we're cacheable or not in both TLB writers (KVM and 
linux-mm).


Alex

 the function it does it only as a fallback if
 ppc_md.phys_mem_access_prot() is not provided. Making access to MMIO
 noncached as a safe fallback makes sense. It is also make sense to allow
 noncached access to reserved ram sometimes.
 
 --
   Gleb.

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] kvm: powerpc: set cache coherency only for kernel managed pages

2013-07-25 Thread Gleb Natapov
On Thu, Jul 25, 2013 at 06:07:55PM +0200, Alexander Graf wrote:
 
 On 25.07.2013, at 10:50, Gleb Natapov wrote:
 
  On Wed, Jul 24, 2013 at 03:32:49PM -0500, Scott Wood wrote:
  On 07/24/2013 04:39:59 AM, Alexander Graf wrote:
  
  On 24.07.2013, at 11:35, Gleb Natapov wrote:
  
  On Wed, Jul 24, 2013 at 11:21:11AM +0200, Alexander Graf wrote:
  Are not we going to use page_is_ram() from
  e500_shadow_mas2_attrib() as Scott commented?
  
  rWhy aren't we using page_is_ram() in kvm_is_mmio_pfn()?
  
  
  Because it is much slower and, IIRC, actually used to build pfn
  map that allow
  us to check quickly for valid pfn.
  
  Then why should we use page_is_ram()? :)
  
  I really don't want the e500 code to diverge too much from what
  the rest of the kvm code is doing.
  
  I don't understand actually used to build pfn map  What code
  is this?  I don't see any calls to page_is_ram() in the KVM code, or
  in generic mm code.  Is this a statement about what x86 does?
  It may be not page_is_ram() directly, but the same into page_is_ram() is
  using. On power both page_is_ram() and do_init_bootmem() walks some kind
  of memblock_region data structure. What important is that pfn_valid()
  does not mean that there is a memory behind page structure. See Andrea's
  reply.
  
  
  On PPC page_is_ram() is only called (AFAICT) for determining what
  attributes to set on mmaps.  We want to be sure that KVM always
  makes the same decision.  While pfn_valid() seems like it should be
  equivalent, it's not obvious from the PPC code that it is.
  
  Again pfn_valid() is not enough.
  
  If pfn_valid() is better, why is that not used for mmap?  Why are
  there two different names for the same thing?
  
  They are not the same thing. page_is_ram() tells you if phys address is
  ram backed. pfn_valid() tells you if there is struct page behind the
  pfn. PageReserved() tells if you a pfn is marked as reserved. All non
  ram pfns should be reserved, but ram pfns can be reserved too. Again,
  see Andrea's reply.
  
  Why ppc uses page_is_ram() for mmap? How should I know? But looking at
 
 That one's easy. Let's just ask Ben. Ben, is there any particular reason PPC 
 uses page_is_ram() rather than what KVM does here to figure out whether a pfn 
 is RAM or not? It would be really useful to be able to run the exact same 
 logic that figures out whether we're cacheable or not in both TLB writers 
 (KVM and linux-mm).
 
KVM does not only try to figure out what is RAM and what is not! Look at how
KVM uses the function. KVM tries to figure out whether refcounting needs to
be used on this page, among other things.

--
Gleb.
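
For context, the generic helper Gleb refers to, kvm_is_mmio_pfn(), boils
down to roughly the sketch below. The real implementation in
virt/kvm/kvm_main.c of that time also has extra handling for transparent
huge pages, so treat this as illustrative only.

/*
 * Simplified sketch: a pfn that has a struct page and is not marked
 * reserved is ordinary, refcounted RAM; anything else (no struct page,
 * or a reserved page) is treated as MMIO-like and is not refcounted.
 */
bool kvm_is_mmio_pfn(pfn_t pfn)
{
	if (pfn_valid(pfn))
		return PageReserved(pfn_to_page(pfn));

	return true;	/* no struct page at all: certainly not ordinary RAM */
}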


[PATCH 1/4] powerpc: book3e: _PAGE_LENDIAN must be _PAGE_ENDIAN

2013-07-25 Thread Bharat Bhushan
For book3e, _PAGE_ENDIAN is not defined. In fact, what is defined
is _PAGE_LENDIAN, which is wrong and should be _PAGE_ENDIAN.
There are no compilation errors because
arch/powerpc/include/asm/pte-common.h defines _PAGE_ENDIAN to 0
when it is not defined anywhere else.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
 arch/powerpc/include/asm/pte-book3e.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/pte-book3e.h b/arch/powerpc/include/asm/pte-book3e.h
index 0156702..576ad88 100644
--- a/arch/powerpc/include/asm/pte-book3e.h
+++ b/arch/powerpc/include/asm/pte-book3e.h
@@ -40,7 +40,7 @@
 #define _PAGE_U1   0x01
 #define _PAGE_U0   0x02
 #define _PAGE_ACCESSED 0x04
-#define _PAGE_LENDIAN  0x08
+#define _PAGE_ENDIAN   0x08
 #define _PAGE_GUARDED  0x10
 #define _PAGE_COHERENT 0x20 /* M: enforce memory coherence */
 #define _PAGE_NO_CACHE 0x40 /* I: cache inhibit */
-- 
1.7.0.4
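
The reason the typo could compile silently is the fallback pattern in
pte-common.h: any attribute bit a platform header does not define is defined
to 0, so the misnamed _PAGE_ENDIAN simply became a no-op instead of a build
error. A sketch of that pattern (assumed to match the actual header closely,
not quoted verbatim):

/* Attribute bits not provided by the platform header default to zero. */
#ifndef _PAGE_ENDIAN
#define _PAGE_ENDIAN	0
#endif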




[PATCH 3/4] kvm: powerpc: allow guest control G attribute in mas2

2013-07-25 Thread Bharat Bhushan
The G bit in MAS2 indicates whether the page is Guarded.
There is no reason to stop the guest from setting G, so allow it.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
 arch/powerpc/kvm/e500.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index 277cb18..4fd9650 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -117,7 +117,7 @@ static inline struct kvmppc_vcpu_e500 *to_e500(struct kvm_vcpu *vcpu)
 #define E500_TLB_USER_PERM_MASK (MAS3_UX|MAS3_UR|MAS3_UW)
 #define E500_TLB_SUPER_PERM_MASK (MAS3_SX|MAS3_SR|MAS3_SW)
 #define MAS2_ATTRIB_MASK \
- (MAS2_X0 | MAS2_X1 | MAS2_E)
+ (MAS2_X0 | MAS2_X1 | MAS2_E | MAS2_G)
 #define MAS3_ATTRIB_MASK \
  (MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3 \
   | E500_TLB_USER_PERM_MASK | E500_TLB_SUPER_PERM_MASK)
-- 
1.7.0.4
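
The mask is applied to the guest-provided MAS2 value in
e500_shadow_mas2_attrib(), so bits outside MAS2_ATTRIB_MASK are silently
dropped from the shadow TLB entry. A small self-contained illustration of
the effect follows; the MAS2 bit values are assumptions based on the Book-E
layout and are only meant to make the filtering visible.

#include <stdio.h>

/* Assumed Book-E MAS2 attribute bit values (illustration only). */
#define MAS2_X0 0x40
#define MAS2_X1 0x20
#define MAS2_I  0x08
#define MAS2_G  0x02
#define MAS2_E  0x01

int main(void)
{
	unsigned int guest_mas2 = MAS2_I | MAS2_G;	/* guest maps an MMIO page */
	unsigned int old_mask = MAS2_X0 | MAS2_X1 | MAS2_E;		/* before this patch */
	unsigned int new_mask = MAS2_X0 | MAS2_X1 | MAS2_E | MAS2_G;	/* after this patch  */

	/* G used to be stripped; now it survives into the shadow entry.
	 * I is outside the mask either way: the host decides cacheability
	 * itself (see PATCH 4/4 in this series). */
	printf("kept before: 0x%02x\n", guest_mas2 & old_mask);	/* prints 0x00 */
	printf("kept after:  0x%02x\n", guest_mas2 & new_mask);	/* prints 0x02 */
	return 0;
}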




[PATCH 4/4] kvm: powerpc: set cache coherency only for RAM pages

2013-07-25 Thread Bharat Bhushan
If the page is RAM, then map it as cacheable and coherent (set the M bit);
otherwise the page is treated as I/O and is mapped as cache-inhibited
and guarded (set I + G).

This helps set up a proper MMU mapping for a directly assigned device.

NOTE: There can be devices that require cacheable mapping, which is not yet 
supported.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
 arch/powerpc/kvm/e500_mmu_host.c |   24 +++-
 1 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 1c6a9d7..5cbdc8f 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -64,13 +64,27 @@ static inline u32 e500_shadow_mas3_attrib(u32 mas3, int usermode)
return mas3;
 }
 
-static inline u32 e500_shadow_mas2_attrib(u32 mas2, int usermode)
+static inline u32 e500_shadow_mas2_attrib(u32 mas2, pfn_t pfn)
 {
+   u32 mas2_attr;
+
+   mas2_attr = mas2 & MAS2_ATTRIB_MASK;
+
+   if (kvm_is_mmio_pfn(pfn)) {
+   /*
+* If page is not RAM then it is treated as I/O page.
+* Map it with cache inhibited and guarded (set I + G).
+*/
+   mas2_attr |= MAS2_I | MAS2_G;
+   return mas2_attr;
+   }
+
+   /* Map RAM pages as cacheable (Not setting I in MAS2) */
 #ifdef CONFIG_SMP
-   return (mas2 & MAS2_ATTRIB_MASK) | MAS2_M;
-#else
-   return mas2 & MAS2_ATTRIB_MASK;
+   /* Also map as coherent (set M) in SMP */
+   mas2_attr |= MAS2_M;
 #endif
+   return mas2_attr;
 }
 
 /*
@@ -313,7 +327,7 @@ static void kvmppc_e500_setup_stlbe(
/* Force IPROT=0 for all guest mappings. */
stlbe->mas1 = MAS1_TSIZE(tsize) | get_tlb_sts(gtlbe) | MAS1_VALID;
stlbe->mas2 = (gvaddr & MAS2_EPN) |
- e500_shadow_mas2_attrib(gtlbe->mas2, pr);
+ e500_shadow_mas2_attrib(gtlbe->mas2, pfn);
stlbe->mas7_3 = ((u64)pfn << PAGE_SHIFT) |
e500_shadow_mas3_attrib(gtlbe->mas7_3, pr);
 
-- 
1.7.0.4
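
The resulting policy can be restated compactly. The sketch below is a
userspace rendering of the new e500_shadow_mas2_attrib() decision, with
is_ram standing in for !kvm_is_mmio_pfn(pfn) and smp for CONFIG_SMP, and
with the same assumed MAS2 bit values as in the sketch after PATCH 3/4; the
guest attribute bits kept via MAS2_ATTRIB_MASK are omitted for brevity.

#include <stdbool.h>
#include <stdio.h>

/* Assumed MAS2 attribute bit values (illustration only). */
#define MAS2_I 0x08
#define MAS2_M 0x04
#define MAS2_G 0x02

/* RAM -> cacheable (plus M, i.e. coherent, on SMP); anything else ->
 * cache-inhibited + guarded. */
static unsigned int shadow_mas2_attrs(bool is_ram, bool smp)
{
	if (!is_ram)
		return MAS2_I | MAS2_G;
	return smp ? MAS2_M : 0;
}

int main(void)
{
	printf("I/O page      : 0x%02x\n", shadow_mas2_attrs(false, true));	/* 0x0a */
	printf("RAM page (SMP): 0x%02x\n", shadow_mas2_attrs(true, true));	/* 0x04 */
	printf("RAM page (UP) : 0x%02x\n", shadow_mas2_attrs(true, false));	/* 0x00 */
	return 0;
}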

