[PATCH 1/2] KVM: x86: Simplify __apic_accept_irq
If posted interrupts are enabled, we can no longer track if an IRQ was
coalesced based on IRR. So drop this logic also from the classic
software path and simplify apic_test_and_set_irr to apic_set_irr.

Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>
---
 arch/x86/kvm/lapic.c | 23 ---
 1 files changed, 8 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index afc1124..9dc3650 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -331,10 +331,10 @@ void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir)
 }
 EXPORT_SYMBOL_GPL(kvm_apic_update_irr);
 
-static inline int apic_test_and_set_irr(int vec, struct kvm_lapic *apic)
+static inline void apic_set_irr(int vec, struct kvm_lapic *apic)
 {
 	apic->irr_pending = true;
-	return apic_test_and_set_vector(vec, apic->regs + APIC_IRR);
+	apic_set_vector(vec, apic->regs + APIC_IRR);
 }
 
 static inline int apic_search_irr(struct kvm_lapic *apic)
@@ -681,28 +681,21 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
 		if (unlikely(!apic_enabled(apic)))
 			break;
 
+		result = 1;
+
 		if (dest_map)
 			__set_bit(vcpu->vcpu_id, dest_map);
 
-		if (kvm_x86_ops->deliver_posted_interrupt) {
-			result = 1;
+		if (kvm_x86_ops->deliver_posted_interrupt)
 			kvm_x86_ops->deliver_posted_interrupt(vcpu, vector);
-		} else {
-			result = !apic_test_and_set_irr(vector, apic);
-
-			if (!result) {
-				if (trig_mode)
-					apic_debug("level trig mode repeatedly "
-						"for vector %d", vector);
-				goto out;
-			}
+		else {
+			apic_set_irr(vector, apic);
 
 			kvm_make_request(KVM_REQ_EVENT, vcpu);
 			kvm_vcpu_kick(vcpu);
 		}
-out:
 		trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode,
-					  trig_mode, vector, !result);
+					  trig_mode, vector, false);
 		break;
 
 	case APIC_DM_REMRD:
--
1.7.3.4
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
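To make the dropped logic concrete, here is a userspace sketch (not the kernel's APIC code; the bitmap layout is simplified, and the names are illustrative) of the difference between the old test-and-set primitive and the plain set that replaces it. The old return value is exactly the coalescing information that posted interrupts make unobservable, since with posted interrupts the CPU updates the pending bits behind the host's back:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 256 interrupt vectors, one bit each, packed into eight 32-bit words */
static uint32_t irr[8];

/* Old behaviour: report whether the vector was already pending, i.e.
 * whether this delivery was coalesced with an undelivered interrupt. */
static bool test_and_set_vector(int vec)
{
	uint32_t mask = 1u << (vec & 31);
	bool was_pending = irr[vec >> 5] & mask;

	irr[vec >> 5] |= mask;
	return was_pending;
}

/* New behaviour after the patch: set the bit unconditionally and
 * report nothing, matching what the hardware path can provide. */
static void set_vector(int vec)
{
	irr[vec >> 5] |= 1u << (vec & 31);
}
```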
[PATCH 2/2] KVM: x86: Drop some unused functions from lapic
Both have no users anymore.

Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>
---
 arch/x86/kvm/lapic.c | 10 --
 1 files changed, 0 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 9dc3650..c98f054 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -79,16 +79,6 @@ static inline void apic_set_reg(struct kvm_lapic *apic, int reg_off, u32 val)
 	*((u32 *) (apic->regs + reg_off)) = val;
 }
 
-static inline int apic_test_and_set_vector(int vec, void *bitmap)
-{
-	return test_and_set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
-}
-
-static inline int apic_test_and_clear_vector(int vec, void *bitmap)
-{
-	return test_and_clear_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
-}
-
 static inline int apic_test_vector(int vec, void *bitmap)
 {
 	return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
--
1.7.3.4
Re: [PATCH 4/4] x86: properly handle kvm emulation of hyperv
On 24/07/2013 23:37, H. Peter Anvin wrote:
> What I'm suggesting is exactly that except that the native hypervisor
> is later in CPUID space.

Me too actually. I was just suggesting an implementation of the idea
(that takes into account hypervisors detected by other means than
CPUID).

Paolo

> KY Srinivasan <k...@microsoft.com> wrote:
>> As Paolo suggested, if there were some priority encoded, the guest
>> could make an informed decision. If the guest in question can run on
>> both hypervisors A and B, we would rather the guest discover
>> hypervisor A when running on A and hypervisor B when running on B.
>> The priority encoding could be as simple as surfacing the native
>> hypervisor signature earlier in the CPUID space.
Re: [PATCH] kvm-unit-tests : Basic architecture of VMX nested test case
On 2013-07-25 07:31, Arthur Chunqi Li wrote:
> This is the first version of VMX nested environment. It contains the
> basic VMX instructions test cases, including VMXON/VMXOFF/VMXPTRLD/
> VMXPTRST/VMCLEAR/VMLAUNCH/VMRESUME/VMCALL. This patch also tests the
> basic execution routine in VMX nested environment and lets the VM
> print "Hello World" to indicate a successful run.
>
> The first release also includes a test suite for vmenter (vmlaunch
> and vmresume). Besides, a hypercall mechanism is included and
> currently it is used to invoke VM normal exit.
>
> New files added:
> x86/vmx.h : contains all VMX related macro declarations
> x86/vmx.c : main file for VMX nested test case
>
> Signed-off-by: Arthur Chunqi Li <yzt...@gmail.com>

Don't forget to update your public git as well.

> ---
>  config-x86-common.mak |    2 +
>  config-x86_64.mak     |    1 +
>  lib/x86/msr.h         |    5 +
>  x86/cstart64.S        |    4 +
>  x86/unittests.cfg     |    6 +
>  x86/vmx.c             |  712 +
>  x86/vmx.h             |  479 +
>  7 files changed, 1209 insertions(+)
>  create mode 100644 x86/vmx.c
>  create mode 100644 x86/vmx.h
>
> diff --git a/config-x86-common.mak b/config-x86-common.mak
> index 455032b..34a41e1 100644
> --- a/config-x86-common.mak
> +++ b/config-x86-common.mak
> @@ -101,6 +101,8 @@ $(TEST_DIR)/asyncpf.elf: $(cstart.o) $(TEST_DIR)/asyncpf.o
>  $(TEST_DIR)/pcid.elf: $(cstart.o) $(TEST_DIR)/pcid.o
>
> +$(TEST_DIR)/vmx.elf: $(cstart.o) $(TEST_DIR)/vmx.o
> +
>  arch_clean:
>  	$(RM) $(TEST_DIR)/*.o $(TEST_DIR)/*.flat $(TEST_DIR)/*.elf \
>  	$(TEST_DIR)/.*.d $(TEST_DIR)/lib/.*.d $(TEST_DIR)/lib/*.o
>
> diff --git a/config-x86_64.mak b/config-x86_64.mak
> index 4e525f5..bb8ee89 100644
> --- a/config-x86_64.mak
> +++ b/config-x86_64.mak
> @@ -9,5 +9,6 @@ tests = $(TEST_DIR)/access.flat $(TEST_DIR)/apic.flat \
>  	$(TEST_DIR)/xsave.flat $(TEST_DIR)/rmap_chain.flat \
>  	$(TEST_DIR)/pcid.flat
>  tests += $(TEST_DIR)/svm.flat
> +tests += $(TEST_DIR)/vmx.flat
>
>  include config-x86-common.mak
>
> diff --git a/lib/x86/msr.h b/lib/x86/msr.h
> index 509a421..281255a 100644
> --- a/lib/x86/msr.h
> +++ b/lib/x86/msr.h
> @@ -396,6 +396,11 @@
>  #define MSR_IA32_VMX_VMCS_ENUM          0x048a
>  #define MSR_IA32_VMX_PROCBASED_CTLS2    0x048b
>  #define MSR_IA32_VMX_EPT_VPID_CAP       0x048c
> +#define MSR_IA32_VMX_TRUE_PIN           0x048d
> +#define MSR_IA32_VMX_TRUE_PROC          0x048e
> +#define MSR_IA32_VMX_TRUE_EXIT          0x048f
> +#define MSR_IA32_VMX_TRUE_ENTRY         0x0490
> +
>  /* AMD-V MSRs */
>
> diff --git a/x86/cstart64.S b/x86/cstart64.S
> index 24df5f8..0fe76da 100644
> --- a/x86/cstart64.S
> +++ b/x86/cstart64.S
> @@ -4,6 +4,10 @@
>  .globl boot_idt
>  boot_idt = 0
>
> +.globl idt_descr
> +.globl tss_descr
> +.globl gdt64_desc
> +
>  ipi_vector = 0x20
>
>  max_cpus = 64
>
> diff --git a/x86/unittests.cfg b/x86/unittests.cfg
> index bc9643e..85c36aa 100644
> --- a/x86/unittests.cfg
> +++ b/x86/unittests.cfg
> @@ -149,3 +149,9 @@ extra_params = --append 1000 `date +%s`
>  file = pcid.flat
>  extra_params = -cpu qemu64,+pcid
>  arch = x86_64
> +
> +[vmx]
> +file = vmx.flat
> +extra_params = -cpu host,+vmx
> +arch = x86_64
> +
>
> diff --git a/x86/vmx.c b/x86/vmx.c
> new file mode 100644
> index 0000000..ca3e117
> --- /dev/null
> +++ b/x86/vmx.c
> @@ -0,0 +1,712 @@
> +#include "libcflat.h"
> +#include "processor.h"
> +#include "vm.h"
> +#include "desc.h"
> +#include "vmx.h"
> +#include "msr.h"
> +#include "smp.h"
> +#include "io.h"
> +
> +int fails = 0, tests = 0;
> +u32 *vmxon_region;
> +struct vmcs *vmcs_root;
> +u32 vpid_cnt;
> +void *guest_stack, *guest_syscall_stack;
> +u32 ctrl_pin, ctrl_enter, ctrl_exit, ctrl_cpu[2];
> +ulong fix_cr0_set, fix_cr0_clr;
> +ulong fix_cr4_set, fix_cr4_clr;
> +struct regs regs;
> +struct vmx_test *current;
> +u64 hypercall_field = 0;
> +bool launched = 0;
> +
> +extern u64 gdt64_desc[];
> +extern u64 idt_descr[];
> +extern u64 tss_descr[];
> +extern void *vmx_return;
> +extern void *entry_sysenter;
> +extern void *guest_entry;
> +
> +static void report(const char *name, int result)
> +{
> +	++tests;
> +	if (result)
> +		printf("PASS: %s\n", name);
> +	else {
> +		printf("FAIL: %s\n", name);
> +		++fails;
> +	}
> +}
> +
> +static int vmcs_clear(struct vmcs *vmcs)
> +{
> +	bool ret;
> +	asm volatile ("vmclear %1; setbe %0" : "=q" (ret) : "m" (vmcs) : "cc");
> +	return ret;
> +}
> +
> +static u64 vmcs_read(enum Encoding enc)
> +{
> +	u64 val;
> +	asm volatile ("vmread %1, %0" : "=rm" (val) : "r" ((u64)enc) : "cc");
> +	return val;
> +}
> +
> +static int vmcs_write(enum Encoding enc, u64 val)
> +{
> +	bool ret;
> +	asm volatile ("vmwrite %1, %2; setbe %0"
> +		: "=q"(ret) : "rm" (val), "r" ((u64)enc) : "cc");
> +	return ret;
> +}
> +
> +static int make_vmcs_current(struct vmcs *vmcs)
> +{
> +	bool ret;
> +
> +	asm volatile ("vmptrld %1; setbe %0" : "=q" (ret) : "m" (vmcs) : "cc");
> +	return ret;
> +}
> +
> +static
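The quoted test framework counts results with a tiny report() helper. Here is a host-compilable copy of that helper (the rest of the framework is stripped away) to show the pass/fail accounting the VMX test cases rely on; the output format matches the patch:

```c
#include <assert.h>
#include <stdio.h>

static int fails = 0, tests = 0;

/* Record one test result and print it in the framework's format. */
static void report(const char *name, int result)
{
	++tests;
	if (result)
		printf("PASS: %s\n", name);
	else {
		printf("FAIL: %s\n", name);
		++fails;
	}
}
```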
Re: [PATCH 4/4] x86: properly handle kvm emulation of hyperv
On 07/25/2013 03:59 PM, Paolo Bonzini wrote:
> On 24/07/2013 23:37, H. Peter Anvin wrote:
>> What I'm suggesting is exactly that except that the native hypervisor
>> is later in CPUID space.
>
> Me too actually. I was just suggesting an implementation of the idea
> (that takes into account hypervisors detected by other means than
> CPUID).
>
> Paolo

This makes sense, will send V2.

Thanks

> KY Srinivasan <k...@microsoft.com> wrote:
>> As Paolo suggested, if there were some priority encoded, the guest
>> could make an informed decision. If the guest in question can run on
>> both hypervisors A and B, we would rather the guest discover
>> hypervisor A when running on A and hypervisor B when running on B.
>> The priority encoding could be as simple as surfacing the native
>> hypervisor signature earlier in the CPUID space.
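The priority scheme being agreed on here can be sketched in plain C. This is an illustrative userspace model, not kernel code: the detect() callbacks are faked, and the names mirror (but are not) the ones in arch/x86/kernel/cpu/hypervisor.c. Each detector returns the CPUID base where its signature was found, and the highest base wins, so array order no longer matters:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct hypervisor_x86 {
	const char *name;
	uint32_t (*detect)(void);	/* CPUID base as priority, 0 if absent */
};

/* Fake detectors: a compatibility signature at the lowest base, the
 * native hypervisor's own signature surfaced later in CPUID space. */
static uint32_t detect_compat(void) { return 0x40000000; }
static uint32_t detect_native(void) { return 0x40000100; }
static uint32_t detect_absent(void) { return 0; }

static const struct hypervisor_x86 hypervisors[] = {
	{ "compat-mode", detect_compat },
	{ "native",      detect_native },
	{ "absent",      detect_absent },
};

/* Pick the hypervisor with the highest priority. */
static const struct hypervisor_x86 *detect_hypervisor_vendor(void)
{
	const struct hypervisor_x86 *winner = NULL;
	uint32_t pri, max_pri = 0;

	for (size_t i = 0; i < sizeof(hypervisors) / sizeof(hypervisors[0]); i++) {
		pri = hypervisors[i].detect();
		if (pri != 0 && pri > max_pri) {
			max_pri = pri;
			winner = &hypervisors[i];
		}
	}
	return winner;
}
```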
Re: [PATCH 2/2] kvm: powerpc: set cache coherency only for kernel managed pages
On Wed, Jul 24, 2013 at 03:32:49PM -0500, Scott Wood wrote:
> On 07/24/2013 04:39:59 AM, Alexander Graf wrote:
>> On 24.07.2013, at 11:35, Gleb Natapov wrote:
>>> On Wed, Jul 24, 2013 at 11:21:11AM +0200, Alexander Graf wrote:
>>>>> Are not we going to use page_is_ram() from e500_shadow_mas2_attrib()
>>>>> as Scott commented?
>>>> Why aren't we using page_is_ram() in kvm_is_mmio_pfn()?
>>> Because it is much slower and, IIRC, actually used to build the pfn
>>> map that allows us to check quickly for a valid pfn.
>> Then why should we use page_is_ram()? :) I really don't want the e500
>> code to diverge too much from what the rest of the kvm code is doing.
> I don't understand "actually used to build pfn map". What code is
> this? I don't see any calls to page_is_ram() in the KVM code, or in
> generic mm code. Is this a statement about what x86 does?

It may be not page_is_ram() directly, but the same info page_is_ram()
is using. On power both page_is_ram() and do_init_bootmem() walk some
kind of memblock_region data structure. What is important is that
pfn_valid() does not mean that there is memory behind the page
structure. See Andrea's reply.

> On PPC page_is_ram() is only called (AFAICT) for determining what
> attributes to set on mmaps. We want to be sure that KVM always makes
> the same decision. While pfn_valid() seems like it should be
> equivalent, it's not obvious from the PPC code that it is.

Again, pfn_valid() is not enough.

> If pfn_valid() is better, why is that not used for mmap? Why are
> there two different names for the same thing?

They are not the same thing. page_is_ram() tells you if a phys address
is ram backed. pfn_valid() tells you if there is a struct page behind
the pfn. PageReserved() tells you if a pfn is marked as reserved. All
non-ram pfns should be reserved, but ram pfns can be reserved too.
Again, see Andrea's reply.

> Why does ppc use page_is_ram() for mmap?

How should I know? But looking at the function, it does it only as a
fallback if ppc_md.phys_mem_access_prot() is not provided. Making
access to MMIO noncached as a safe fallback makes sense. It also makes
sense to allow noncached access to reserved ram sometimes.

--
	Gleb.
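Gleb's three predicates can be captured in a toy model. The flags table below is invented purely for illustration; in the kernel the answers come from memblock data, the memmap, and struct page flags respectively. The invariant function encodes his claim that every valid non-ram pfn should be reserved, while ram pfns may or may not be:

```c
#include <assert.h>
#include <stdbool.h>

enum { PFN_HAS_STRUCT_PAGE = 1, PFN_IS_RAM = 2, PFN_RESERVED = 4 };

/* Hypothetical pfn space: one entry per pfn. */
static const int pfn_flags[] = {
	PFN_HAS_STRUCT_PAGE | PFN_IS_RAM,                /* ordinary ram */
	PFN_HAS_STRUCT_PAGE | PFN_IS_RAM | PFN_RESERVED, /* reserved ram */
	PFN_HAS_STRUCT_PAGE | PFN_RESERVED,              /* mmio with a struct page */
	0,                                               /* mmio hole, no struct page */
};

static bool pfn_valid(int pfn)     { return pfn_flags[pfn] & PFN_HAS_STRUCT_PAGE; }
static bool page_is_ram(int pfn)   { return pfn_flags[pfn] & PFN_IS_RAM; }
static bool page_reserved(int pfn) { return pfn_flags[pfn] & PFN_RESERVED; }

/* "All non ram pfns should be reserved, but ram pfns can be reserved too" */
static bool invariant_holds(void)
{
	for (int pfn = 0; pfn < 4; pfn++)
		if (pfn_valid(pfn) && !page_is_ram(pfn) && !page_reserved(pfn))
			return false;
	return true;
}
```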
[PATCH 2/2] s390/kvm: support collaborative memory management
From: Konstantin Weitz <konstantin.we...@gmail.com>

This patch enables Collaborative Memory Management (CMM) for kvm
on s390. CMM allows the guest to inform the host about page usage
(see arch/s390/mm/cmm.c). The host uses this information to avoid
swapping in unused pages in the page fault handler. Further, a CPU
provided list of unused invalid pages is processed to reclaim swap
space of not yet accessed unused pages.

[ Martin Schwidefsky: patch reordering and cleanup ]

Signed-off-by: Konstantin Weitz <konstantin.we...@gmail.com>
Signed-off-by: Martin Schwidefsky <schwidef...@de.ibm.com>
---
 arch/s390/include/asm/kvm_host.h |    5 ++-
 arch/s390/include/asm/pgtable.h  |   24 
 arch/s390/kvm/kvm-s390.c         |   25 +
 arch/s390/kvm/kvm-s390.h         |    2 +
 arch/s390/kvm/priv.c             |   41 
 arch/s390/mm/pgtable.c           |   77 ++
 6 files changed, 173 insertions(+), 1 deletion(-)

diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 3238d40..de6450e 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -113,7 +113,9 @@ struct kvm_s390_sie_block {
 	__u64	gbea;			/* 0x0180 */
 	__u8	reserved188[24];	/* 0x0188 */
 	__u32	fac;			/* 0x01a0 */
-	__u8	reserved1a4[92];	/* 0x01a4 */
+	__u8	reserved1a4[20];	/* 0x01a4 */
+	__u64	cbrlo;			/* 0x01b8 */
+	__u8	reserved1c0[64];	/* 0x01c0 */
 } __attribute__((packed));
 
 struct kvm_vcpu_stat {
@@ -149,6 +151,7 @@ struct kvm_vcpu_stat {
 	u32 instruction_stsi;
 	u32 instruction_stfl;
 	u32 instruction_tprot;
+	u32 instruction_essa;
 	u32 instruction_sigp_sense;
 	u32 instruction_sigp_sense_running;
 	u32 instruction_sigp_external_call;
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 75fb726..65d48b8 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -227,6 +227,7 @@ extern unsigned long MODULES_END;
 #define _PAGE_SWR	0x008		/* SW pte referenced bit */
 #define _PAGE_SWW	0x010		/* SW pte write bit */
 #define _PAGE_SPECIAL	0x020		/* SW associated with special page */
+#define _PAGE_UNUSED	0x040		/* SW bit for ptep_clear_flush() */
 #define __HAVE_ARCH_PTE_SPECIAL
 
 /* Set of bits not changed in pte_modify */
@@ -375,6 +376,12 @@ extern unsigned long MODULES_END;
 
 #endif /* CONFIG_64BIT */
 
+/* Guest Page State used for virtualization */
+#define _PGSTE_GPS_ZERO		0x0000000080000000UL
+#define _PGSTE_GPS_USAGE_MASK	0x0000000003000000UL
+#define _PGSTE_GPS_USAGE_STABLE	0x0000000000000000UL
+#define _PGSTE_GPS_USAGE_UNUSED	0x0000000001000000UL
+
 /*
  * A user page table pointer has the space-switch-event bit, the
  * private-space-control bit and the storage-alteration-event-control
@@ -590,6 +597,12 @@ static inline int pte_file(pte_t pte)
 	return (pte_val(pte) & mask) == _PAGE_TYPE_FILE;
 }
 
+static inline int pte_swap(pte_t pte)
+{
+	unsigned long mask = _PAGE_RO | _PAGE_INVALID | _PAGE_SWT | _PAGE_SWX;
+	return (pte_val(pte) & mask) == _PAGE_TYPE_SWAP;
+}
+
 static inline int pte_special(pte_t pte)
 {
 	return (pte_val(pte) & _PAGE_SPECIAL);
@@ -794,6 +807,7 @@ unsigned long gmap_translate(unsigned long address, struct gmap *);
 unsigned long __gmap_fault(unsigned long address, struct gmap *);
 unsigned long gmap_fault(unsigned long address, struct gmap *);
 void gmap_discard(unsigned long from, unsigned long to, struct gmap *);
+void __gmap_zap(unsigned long address, struct gmap *);
 
 void gmap_register_ipte_notifier(struct gmap_notifier *);
 void gmap_unregister_ipte_notifier(struct gmap_notifier *);
@@ -825,6 +839,7 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 
 	if (mm_has_pgste(mm)) {
 		pgste = pgste_get_lock(ptep);
+		pgste_val(pgste) &= ~_PGSTE_GPS_ZERO;
 		pgste_set_key(ptep, pgste, entry);
 		pgste_set_pte(ptep, entry);
 		pgste_set_unlock(ptep, pgste);
@@ -858,6 +873,12 @@ static inline int pte_young(pte_t pte)
 	return 0;
 }
 
+#define __HAVE_ARCH_PTE_UNUSED
+static inline int pte_unused(pte_t pte)
+{
+	return pte_val(pte) & _PAGE_UNUSED;
+}
+
 /*
  * pgd/pmd/pte modification functions
  */
@@ -1142,6 +1163,9 @@ static inline pte_t ptep_clear_flush(struct vm_area_struct *vma,
 	pte_val(*ptep) = _PAGE_TYPE_EMPTY;
 
 	if (mm_has_pgste(vma->vm_mm)) {
+		if ((pgste_val(pgste) & _PGSTE_GPS_USAGE_MASK) ==
+		    _PGSTE_GPS_USAGE_UNUSED)
+			pte_val(pte) |= _PAGE_UNUSED;
 		pgste = pgste_update_all(&pte, pgste);
 		pgste_set_unlock(ptep, pgste);
 	}
diff --git
[PATCH 1/2] mm: add support for discard of unused ptes
From: Konstantin Weitz <konstantin.we...@gmail.com>

In a virtualized environment and given an appropriate interface the
guest can mark pages as unused while they are free (for the s390
implementation see git commit 45e576b1c3d00206 "guest page hinting
light"). For the host the unused state is a property of the pte.

This patch adds the primitive 'pte_unused' and code to the host swap
out handler so that pages marked as unused by all mappers are not
swapped out but discarded instead, thus saving one IO for swap out
and potentially another one for swap in.

[ Martin Schwidefsky: patch reordering and simplification ]

Signed-off-by: Konstantin Weitz <konstantin.we...@gmail.com>
Signed-off-by: Martin Schwidefsky <schwidef...@de.ibm.com>
---
 include/asm-generic/pgtable.h | 13 +
 mm/rmap.c                     | 10 ++
 2 files changed, 23 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 2f47ade..ec540c5 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -193,6 +193,19 @@ static inline int pte_same(pte_t pte_a, pte_t pte_b)
 }
 #endif
 
+#ifndef __HAVE_ARCH_PTE_UNUSED
+/*
+ * Some architectures provide facilities to virtualization guests
+ * so that they can flag allocated pages as unused. This allows the
+ * host to transparently reclaim unused pages. This function returns
+ * whether the pte's page is unused.
+ */
+static inline int pte_unused(pte_t pte)
+{
+	return 0;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PMD_SAME
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
diff --git a/mm/rmap.c b/mm/rmap.c
index cd356df..2291f25 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1234,6 +1234,16 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		}
 		set_pte_at(mm, address, pte,
 			   swp_entry_to_pte(make_hwpoison_entry(page)));
+	} else if (pte_unused(pteval)) {
+		/*
+		 * The guest indicated that the page content is of no
+		 * interest anymore. Simply discard the pte, vmscan
+		 * will take care of the rest.
+		 */
+		if (PageAnon(page))
+			dec_mm_counter(mm, MM_ANONPAGES);
+		else
+			dec_mm_counter(mm, MM_FILEPAGES);
 	} else if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page_private(page) };
--
1.7.9.5
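The new branch ordering in try_to_unmap_one() can be modelled as a small decision function. The types below are hypothetical userspace stand-ins, not the kernel's structures; what matters is the precedence: hwpoison first, then the new pte_unused() discard (which skips the swap-out IO entirely), then the normal swap path:

```c
#include <assert.h>
#include <stdbool.h>

struct pte {
	bool hwpoison;	/* page is hardware-poisoned */
	bool unused;	/* guest flagged the page content as uninteresting */
	bool anon;	/* anonymous mapping (affects which counter is decremented) */
};

enum action { SWAP_OUT, POISON, DISCARD };

static enum action unmap_action(const struct pte *pte)
{
	if (pte->hwpoison)
		return POISON;
	if (pte->unused)	/* the branch this patch adds */
		return DISCARD;	/* drop the pte; vmscan frees the page, no IO */
	return SWAP_OUT;	/* previous behaviour for everything else */
}
```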
[RFC][PATCH 0/2] s390/kvm: add kvm support for guest page hinting v2
v1->v2:
 - found a way to simplify the common code patch

Linux on s390 as a guest under z/VM has been using the guest page
hinting interface (alias collaborative memory management) for a long
time. The full version with volatile states has been deemed to be too
complicated (see the old discussion about guest page hinting e.g. on
http://marc.info/?l=linux-mm&m=123816662017742&w=2).

What is currently implemented for the guest is the unused and stable
states to mark unallocated pages as freely available to the host.
This works just fine with z/VM as the host. The two patches in this
series implement the guest page hinting interface for the unused and
stable states in the KVM host. Most of the code is specific to s390
but there is a common memory management part as well, see patch #1.

The code is working stable now; from my point of view this is ready
for prime-time.

Konstantin Weitz (2):
  mm: add support for discard of unused ptes
  s390/kvm: support collaborative memory management

 arch/s390/include/asm/kvm_host.h |    5 ++-
 arch/s390/include/asm/pgtable.h  |   24 
 arch/s390/kvm/kvm-s390.c         |   25 +
 arch/s390/kvm/kvm-s390.h         |    2 +
 arch/s390/kvm/priv.c             |   41 
 arch/s390/mm/pgtable.c           |   77 ++
 include/asm-generic/pgtable.h    |   13 +++
 mm/rmap.c                        |   10 +
 8 files changed, 196 insertions(+), 1 deletion(-)
--
1.7.9.5
[PATCH V2 4/4] x86: correctly detect hypervisor
We try to handle the hypervisor compatibility mode by detecting the
hypervisor through a specific order. This is not robust, since
hypervisors may implement each other's features.

This patch tries to handle this situation by always choosing the last
one in the CPUID leaves. This is done by letting .detect() return a
priority instead of true/false and just re-using the CPUID leaf where
the signature was found as the priority (or 1 if it was found by DMI).
Then we can just pick the hypervisor with the highest priority. Other
sophisticated detection methods could also be implemented on top.

Suggested by H. Peter Anvin and Paolo Bonzini.

Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: H. Peter Anvin <h...@zytor.com>
Cc: x...@kernel.org
Cc: K. Y. Srinivasan <k...@microsoft.com>
Cc: Haiyang Zhang <haiya...@microsoft.com>
Cc: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>
Cc: Jeremy Fitzhardinge <jer...@goop.org>
Cc: Doug Covelli <dcove...@vmware.com>
Cc: Borislav Petkov <b...@suse.de>
Cc: Dan Hecht <dhe...@vmware.com>
Cc: Paul Gortmaker <paul.gortma...@windriver.com>
Cc: Marcelo Tosatti <mtosa...@redhat.com>
Cc: Gleb Natapov <g...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Frederic Weisbecker <fweis...@gmail.com>
Cc: linux-ker...@vger.kernel.org
Cc: de...@linuxdriverproject.org
Cc: kvm@vger.kernel.org
Cc: xen-de...@lists.xensource.com
Cc: virtualizat...@lists.linux-foundation.org
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 arch/x86/include/asm/hypervisor.h |    2 +-
 arch/x86/kernel/cpu/hypervisor.c  |   15 +++
 arch/x86/kernel/cpu/mshyperv.c    |   13 -
 arch/x86/kernel/cpu/vmware.c      |    8 
 arch/x86/kernel/kvm.c             |    6 ++
 arch/x86/xen/enlighten.c          |    9 +++--
 6 files changed, 25 insertions(+), 28 deletions(-)

diff --git a/arch/x86/include/asm/hypervisor.h b/arch/x86/include/asm/hypervisor.h
index 2d4b5e6..e42f758 100644
--- a/arch/x86/include/asm/hypervisor.h
+++ b/arch/x86/include/asm/hypervisor.h
@@ -33,7 +33,7 @@ struct hypervisor_x86 {
 	const char	*name;
 
 	/* Detection routine */
-	bool		(*detect)(void);
+	uint32_t	(*detect)(void);
 
 	/* Adjust CPU feature bits (run once per CPU) */
 	void		(*set_cpu_features)(struct cpuinfo_x86 *);
diff --git a/arch/x86/kernel/cpu/hypervisor.c b/arch/x86/kernel/cpu/hypervisor.c
index 8727921..36ce402 100644
--- a/arch/x86/kernel/cpu/hypervisor.c
+++ b/arch/x86/kernel/cpu/hypervisor.c
@@ -25,11 +25,6 @@
 #include <asm/processor.h>
 #include <asm/hypervisor.h>
 
-/*
- * Hypervisor detect order. This is specified explicitly here because
- * some hypervisors might implement compatibility modes for other
- * hypervisors and therefore need to be detected in specific sequence.
- */
 static const __initconst struct hypervisor_x86 * const hypervisors[] =
 {
 #ifdef CONFIG_XEN_PVHVM
@@ -49,15 +44,19 @@ static inline void __init
 detect_hypervisor_vendor(void)
 {
 	const struct hypervisor_x86 *h, * const *p;
+	uint32_t pri, max_pri = 0;
 
 	for (p = hypervisors; p < hypervisors + ARRAY_SIZE(hypervisors); p++) {
 		h = *p;
-		if (h->detect()) {
+		pri = h->detect();
+		if (pri != 0 && pri > max_pri) {
+			max_pri = pri;
 			x86_hyper = h;
-			printk(KERN_INFO "Hypervisor detected: %s\n", h->name);
-			break;
 		}
 	}
+
+	if (max_pri)
+		printk(KERN_INFO "Hypervisor detected: %s\n", x86_hyper->name);
 }
 
 void init_hypervisor(struct cpuinfo_x86 *c)
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 8f4be53..71a39f3 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -27,20 +27,23 @@
 struct ms_hyperv_info ms_hyperv;
 EXPORT_SYMBOL_GPL(ms_hyperv);
 
-static bool __init ms_hyperv_platform(void)
+static uint32_t __init ms_hyperv_platform(void)
 {
 	u32 eax;
 	u32 hyp_signature[3];
 
 	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
-		return false;
+		return 0;
 
 	cpuid(HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS,
	      &eax, &hyp_signature[0], &hyp_signature[1], &hyp_signature[2]);
 
-	return eax >= HYPERV_CPUID_MIN &&
-		eax <= HYPERV_CPUID_MAX &&
-		!memcmp("Microsoft Hv", hyp_signature, 12);
+	if (eax >= HYPERV_CPUID_MIN && eax <= HYPERV_CPUID_MAX &&
+	    !memcmp("Microsoft Hv", hyp_signature, 12))
+		return HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS;
+
+	return 0;
 }
 
 static cycle_t read_hv_clock(struct clocksource *arg)
diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 7076878..628a059 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -93,7 +93,7 @@ static void __init vmware_platform_setup(void)
  * serial key should be enough, as
[PATCH V2 1/4] x86: introduce hypervisor_cpuid_base()
This patch introduces hypervisor_cpuid_base(), which loops over the
hypervisor CPUID leaves until the signature matches, and checks the
number of leaves if required. This can be used by Xen/KVM guests to
detect the presence of a hypervisor.

Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: H. Peter Anvin <h...@zytor.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Gleb Natapov <g...@redhat.com>
Cc: x...@kernel.org
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
Changes from V1:
- use memcpy() and uint32_t instead of strcmp()
---
 arch/x86/include/asm/processor.h | 15 +++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 24cf5ae..7763307 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -971,6 +971,21 @@ unsigned long calc_aperfmperf_ratio(struct aperfmperf *old,
 	return ratio;
 }
 
+static inline uint32_t hypervisor_cpuid_base(const char *sig, uint32_t leaves)
+{
+	uint32_t base, eax, signature[3];
+
+	for (base = 0x40000000; base < 0x40010000; base += 0x100) {
+		cpuid(base, &eax, &signature[0], &signature[1], &signature[2]);
+
+		if (!memcmp(sig, signature, 12) &&
+		    (leaves == 0 || ((eax - base) >= leaves)))
+			return base;
+	}
+
+	return 0;
+}
+
 extern unsigned long arch_align_stack(unsigned long sp);
 extern void free_init_pages(char *what, unsigned long begin, unsigned long end);
--
1.7.1
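To see the scan logic outside the kernel, here is a userspace model of hypervisor_cpuid_base() where the cpuid instruction is replaced by a lookup in an invented leaf table; the scan range, the 0x100 stride, and the 12-byte signature compare mirror the patch, but the fake signatures and bases are assumptions for the test only:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct leaf { uint32_t eax, signature[3]; };

/* Fake cpuid: a Hyper-V style signature at the first base and a KVM
 * signature surfaced later in the CPUID space. */
static struct leaf fake_cpuid(uint32_t base)
{
	struct leaf l = { 0, { 0, 0, 0 } };

	if (base == 0x40000000) {
		l.eax = base + 2;
		memcpy(l.signature, "Microsoft Hv", 12);
	}
	if (base == 0x40000100) {
		l.eax = base;
		memcpy(l.signature, "KVMKVMKVM\0\0\0", 12);
	}
	return l;
}

static uint32_t hypervisor_cpuid_base(const char *sig, uint32_t leaves)
{
	for (uint32_t base = 0x40000000; base < 0x40010000; base += 0x100) {
		struct leaf l = fake_cpuid(base);

		/* match the 12-byte signature, then the leaf count if requested */
		if (!memcmp(sig, l.signature, 12) &&
		    (leaves == 0 || (l.eax - base) >= leaves))
			return base;
	}
	return 0;
}
```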
[PATCH V2 2/4] xen: switch to use hypervisor_cpuid_base()
Switch to use hypervisor_cpuid_base() to detect Xen.

Cc: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>
Cc: Jeremy Fitzhardinge <jer...@goop.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: H. Peter Anvin <h...@zytor.com>
Cc: x...@kernel.org
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: xen-de...@lists.xensource.com
Cc: virtualizat...@lists.linux-foundation.org
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 arch/x86/include/asm/xen/hypervisor.h | 16 +---
 1 files changed, 1 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/xen/hypervisor.h b/arch/x86/include/asm/xen/hypervisor.h
index 125f344..d866959 100644
--- a/arch/x86/include/asm/xen/hypervisor.h
+++ b/arch/x86/include/asm/xen/hypervisor.h
@@ -40,21 +40,7 @@ extern struct start_info *xen_start_info;
 
 static inline uint32_t xen_cpuid_base(void)
 {
-	uint32_t base, eax, ebx, ecx, edx;
-	char signature[13];
-
-	for (base = 0x40000000; base < 0x40010000; base += 0x100) {
-		cpuid(base, &eax, &ebx, &ecx, &edx);
-		*(uint32_t *)(signature + 0) = ebx;
-		*(uint32_t *)(signature + 4) = ecx;
-		*(uint32_t *)(signature + 8) = edx;
-		signature[12] = 0;
-
-		if (!strcmp("XenVMMXenVMM", signature) && ((eax - base) >= 2))
-			return base;
-	}
-
-	return 0;
+	return hypervisor_cpuid_base("XenVMMXenVMM", 2);
 }
 
 #ifdef CONFIG_XEN
--
1.7.1
[PATCH V2 3/4] kvm: switch to use hypervisor_cpuid_base()
Switch to use hypervisor_cpuid_base() to detect KVM.

Cc: Gleb Natapov <g...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: H. Peter Anvin <h...@zytor.com>
Cc: x...@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
Changes from V1:
- Introduce kvm_cpuid_base() which will be used by next patch.
---
 arch/x86/include/asm/kvm_para.h | 24 +---
 1 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 695399f..0644129 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -85,26 +85,20 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
 	return ret;
 }
 
-static inline bool kvm_para_available(void)
+static inline uint32_t kvm_cpuid_base(void)
 {
-	unsigned int eax, ebx, ecx, edx;
-	char signature[13];
-
 	if (boot_cpu_data.cpuid_level < 0)
-		return false;	/* So we don't blow up on old processors */
+		return 0;	/* So we don't blow up on old processors */
 
-	if (cpu_has_hypervisor) {
-		cpuid(KVM_CPUID_SIGNATURE, &eax, &ebx, &ecx, &edx);
-		memcpy(signature + 0, &ebx, 4);
-		memcpy(signature + 4, &ecx, 4);
-		memcpy(signature + 8, &edx, 4);
-		signature[12] = 0;
+	if (cpu_has_hypervisor)
+		return hypervisor_cpuid_base("KVMKVMKVM\0\0\0", 0);
 
-		if (strcmp(signature, "KVMKVMKVM") == 0)
-			return true;
-	}
+	return 0;
+}
 
-	return false;
+static inline bool kvm_para_available(void)
+{
+	return kvm_cpuid_base() != 0;
 }
 
 static inline unsigned int kvm_arch_para_features(void)
--
1.7.1
Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor
On 07/24/2013 06:06 PM, Raghavendra K T wrote:
> On 07/24/2013 05:36 PM, Gleb Natapov wrote:
>> On Wed, Jul 24, 2013 at 05:30:20PM +0530, Raghavendra K T wrote:
>>> On 07/24/2013 04:09 PM, Gleb Natapov wrote:
>>>> On Wed, Jul 24, 2013 at 03:15:50PM +0530, Raghavendra K T wrote:
>>>>> On 07/23/2013 08:37 PM, Gleb Natapov wrote:
>>>>>> On Mon, Jul 22, 2013 at 11:50:16AM +0530, Raghavendra K T wrote:
>>>>>>> +static void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
>>>>>>> [...]
>>>>>>> +	/*
>>>>>>> +	 * halt until it's our turn and kicked. Note that we do safe halt
>>>>>>> +	 * for irq enabled case to avoid hang when lock info is overwritten
>>>>>>> +	 * in irq spinlock slowpath and no spurious interrupt occur to save us.
>>>>>>> +	 */
>>>>>>> +	if (arch_irqs_disabled_flags(flags))
>>>>>>> +		halt();
>>>>>>> +	else
>>>>>>> +		safe_halt();
>>>>>>> +
>>>>>>> +out:
>>>>>> So here now interrupts can be either disabled or enabled. Previous
>>>>>> version disabled interrupts here, so are we sure it is safe to have
>>>>>> them enabled at this point?
>>>>> I do not see any problem yet, will keep thinking. If we enable
>>>>> interrupt here, then
>>>>>
>>>>> +	cpumask_clear_cpu(cpu, &waiting_cpus);
>>>>>
>>>>> and if we start serving a lock for an interrupt that came here,
>>>>> cpumask clear and w->lock = NULL may not happen atomically. If the
>>>>> irq spinlock does not take the slow path we would have a non-null
>>>>> value for lock, but with no information in waiting_cpus.
>>>>>
>>>>> I am still thinking what would be the problem with that.
>>>> Exactly, for the kicker the waiting_cpus and w->lock updates are
>>>> non-atomic anyway.
>>>>>>> +	w->lock = NULL;
>>>>>>> +	local_irq_restore(flags);
>>>>>>> +	spin_time_accum_blocked(start);
>>>>>>> +}
>>>>>>> +PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
>>>>>>> +
>>>>>>> +/* Kick vcpu waiting on @lock->head to reach value @ticket */
>>>>>>> +static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
>>>>>>> +{
>>>>>>> +	int cpu;
>>>>>>> +
>>>>>>> +	add_stats(RELEASED_SLOW, 1);
>>>>>>> +	for_each_cpu(cpu, &waiting_cpus) {
>>>>>>> +		const struct kvm_lock_waiting *w = &per_cpu(lock_waiting, cpu);
>>>>>>> +		if (ACCESS_ONCE(w->lock) == lock &&
>>>>>>> +		    ACCESS_ONCE(w->want) == ticket) {
>>>>>>> +			add_stats(RELEASED_SLOW_KICKED, 1);
>>>>>>> +			kvm_kick_cpu(cpu);
>>>>>> What about using NMI to wake sleepers? I think it was discussed,
>>>>>> but forgot why it was dismissed.
>>>>> I think I have missed that discussion. I'll go back and check. So
>>>>> what is the idea here? We can easily wake up the halted vcpus that
>>>>> have interrupts disabled?
>>>> We can of course. IIRC the objection was that the NMI handling path
>>>> is very fragile and handling an NMI on each wakeup will be more
>>>> expensive than waking up a guest without injecting an event, but it
>>>> is still interesting to see the numbers.
>>> Haam, now I remember, we had tried a request based mechanism (a new
>>> request like REQ_UNHALT) and processed that. It had worked, but had
>>> some complex hacks in vcpu_enter_guest to avoid guest hang in case
>>> of the request being cleared. So had left it there..
>>>
>>> https://lkml.org/lkml/2012/4/30/67
>>>
>>> But I do not remember the performance impact though.
>> No, this is something different. Wakeup with NMI does not need KVM
>> changes at all. Instead of kvm_kick_cpu(cpu) in kvm_unlock_kick you
>> send an NMI IPI.
> True. It was not NMI.

Just to confirm, are you talking about something like this to be tried?

	apic->send_IPI_mask(cpumask_of(cpu), APIC_DM_NMI);

When I started benchmarking, I started seeing
"Dazed and confused, but trying to continue" from unknown nmi error
handling. Did I miss anything (because we did not register any NMI
handler)? Or is it that spurious NMIs are trouble because we could get
spurious NMIs if the next waiter already acquired the lock?

(note: I tried sending the APIC_DM_REMRD IPI directly, which worked
fine, but the hypercall way of handling still performed well from the
results I saw.)
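For reference, the kick-side scan being debated (whether the wakeup ends in a hypercall kick or an NMI IPI) boils down to matching (lock, ticket) pairs across waiters. A minimal userspace model, with per-cpu state simplified to arrays and the kick recorded in a counter instead of sending an IPI; all names echo the patch but this is not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>

#define NCPUS 4

struct lock_waiting {
	void *lock;	/* which lock this vcpu is blocked on */
	int want;	/* the ticket it is waiting for */
};

static struct lock_waiting lock_waiting[NCPUS];	/* per-cpu slot, as an array */
static bool waiting_cpus[NCPUS];		/* stand-in for the waiting_cpus cpumask */
static int kicked[NCPUS];			/* counts kicks instead of an IPI/NMI */

static void kvm_kick_cpu(int cpu)
{
	kicked[cpu]++;
}

static void kvm_unlock_kick(void *lock, int ticket)
{
	for (int cpu = 0; cpu < NCPUS; cpu++) {
		if (!waiting_cpus[cpu])
			continue;
		const struct lock_waiting *w = &lock_waiting[cpu];
		/* only the waiter holding this ticket on this lock is woken */
		if (w->lock == lock && w->want == ticket)
			kvm_kick_cpu(cpu);
	}
}
```

The spurious-NMI concern in the thread maps onto this model directly: between the match and the wakeup the chosen waiter may have already taken the lock, so the wakeup can arrive when nobody expects it.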
Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor
On Thu, Jul 25, 2013 at 02:47:37PM +0530, Raghavendra K T wrote: [...] When I started the benchmark, I started seeing "Dazed and confused, but trying to continue" from the unknown NMI error handling. Did I miss anything (because we did not register any NMI handler)? or is it that spurious NMIs are trouble because we could get spurious NMIs if the next waiter already acquired the lock. 
There is a default NMI handler that tries to detect the reason why the NMI happened (which is not so easy on x86) and prints this message if it fails. You need to add logic to detect the spinlock slow path there. Check the bit in waiting_cpus, for instance. (note: I tried sending the APIC_DM_REMRD IPI directly, which worked fine but the hypercall way of handling still performed well from the results I saw). You mean better? This is strange. Have you run the guest with x2apic? -- Gleb. 
Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor
On 07/25/2013 02:45 PM, Gleb Natapov wrote: On Thu, Jul 25, 2013 at 02:47:37PM +0530, Raghavendra K T wrote: [...] or is it that spurious NMIs are trouble because we could get spurious NMIs if the next waiter already acquired the lock. 
There is a default NMI handler that tries to detect the reason why the NMI happened (which is not so easy on x86) and prints this message if it fails. You need to add logic to detect the spinlock slow path there. Check the bit in waiting_cpus, for instance. aha.. Okay. will check that. (note: I tried sending the APIC_DM_REMRD IPI directly, which worked fine but the hypercall way of handling still performed well from the results I saw). You mean better? This is strange. Have you run the guest with x2apic? Had the same doubt. So ran the full benchmark for dbench. So here is what I saw now. 1x was neck and neck (0.9% for hypercall vs 0.7% for IPI, which should boil down to no difference considering the noise factors), but otherwise, by sending the IPI I see a few percentage points of gain in the overcommit cases. 
Re: [Qemu-devel] [PATCH] nVMX: Initialize IA32_FEATURE_CONTROL MSR in reset and migration
On Sun, Jul 07, 2013 at 11:13:37PM +0800, Arthur Chunqi Li wrote: The recent KVM patch adds IA32_FEATURE_CONTROL support. QEMU needs to clear this MSR when resetting the vCPU and keep its value across migration. This patch adds this feature. Signed-off-by: Arthur Chunqi Li yzt...@gmail.com Applied, thanks. --- target-i386/cpu.h |2 ++ target-i386/kvm.c |4 target-i386/machine.c | 22 ++ 3 files changed, 28 insertions(+) diff --git a/target-i386/cpu.h b/target-i386/cpu.h index 62e3547..a418e17 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -301,6 +301,7 @@ #define MSR_IA32_APICBASE_BSP (1<<8) #define MSR_IA32_APICBASE_ENABLE (1<<11) #define MSR_IA32_APICBASE_BASE (0xfffff<<12) +#define MSR_IA32_FEATURE_CONTROL 0x003a #define MSR_TSC_ADJUST 0x003b #define MSR_IA32_TSCDEADLINE 0x6e0 @@ -813,6 +814,7 @@ typedef struct CPUX86State { uint64_t mcg_status; uint64_t msr_ia32_misc_enable; +uint64_t msr_ia32_feature_control; /* exception/interrupt handling */ int error_code; diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 39f4fbb..3cb2161 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -1122,6 +1122,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level) if (hyperv_vapic_recommended()) { kvm_msr_entry_set(&msrs[n++], HV_X64_MSR_APIC_ASSIST_PAGE, 0); } +kvm_msr_entry_set(&msrs[n++], MSR_IA32_FEATURE_CONTROL, env->msr_ia32_feature_control); } if (env->mcg_cap) { int i; @@ -1346,6 +1347,7 @@ static int kvm_get_msrs(X86CPU *cpu) if (has_msr_misc_enable) { msrs[n++].index = MSR_IA32_MISC_ENABLE; } +msrs[n++].index = MSR_IA32_FEATURE_CONTROL; if (!env->tsc_valid) { msrs[n++].index = MSR_IA32_TSC; @@ -1444,6 +1446,8 @@ static int kvm_get_msrs(X86CPU *cpu) case MSR_IA32_MISC_ENABLE: env->msr_ia32_misc_enable = msrs[i].data; break; +case MSR_IA32_FEATURE_CONTROL: +env->msr_ia32_feature_control = msrs[i].data; default: if (msrs[i].index >= MSR_MC0_CTL && msrs[i].index < MSR_MC0_CTL + (env->mcg_cap & 0xff) * 4) { diff --git a/target-i386/machine.c b/target-i386/machine.c index 3659db9..94ca914 
100644 --- a/target-i386/machine.c +++ b/target-i386/machine.c @@ -399,6 +399,14 @@ static bool misc_enable_needed(void *opaque) return env->msr_ia32_misc_enable != MSR_IA32_MISC_ENABLE_DEFAULT; } +static bool feature_control_needed(void *opaque) +{ +X86CPU *cpu = opaque; +CPUX86State *env = &cpu->env; + +return env->msr_ia32_feature_control != 0; +} + static const VMStateDescription vmstate_msr_ia32_misc_enable = { .name = "cpu/msr_ia32_misc_enable", .version_id = 1, @@ -410,6 +418,17 @@ static const VMStateDescription vmstate_msr_ia32_misc_enable = { } }; +static const VMStateDescription vmstate_msr_ia32_feature_control = { +.name = "cpu/msr_ia32_feature_control", +.version_id = 1, +.minimum_version_id = 1, +.minimum_version_id_old = 1, +.fields = (VMStateField []) { +VMSTATE_UINT64(env.msr_ia32_feature_control, X86CPU), +VMSTATE_END_OF_LIST() +} +}; + const VMStateDescription vmstate_x86_cpu = { .name = "cpu", .version_id = 12, @@ -535,6 +554,9 @@ const VMStateDescription vmstate_x86_cpu = { }, { .vmsd = &vmstate_msr_ia32_misc_enable, .needed = misc_enable_needed, +}, { +.vmsd = &vmstate_msr_ia32_feature_control, +.needed = feature_control_needed, } , { /* empty */ } -- 1.7.9.5 -- Gleb. 
Re: [PATCH 1/4 v6] powerpc: export debug registers save function for KVM
On 04.07.2013, at 08:57, Bharat Bhushan wrote: KVM needs this function when switching from vcpu to user-space thread. My subsequent patch will use this function. Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com Ben / Michael, please ack. Alex --- v5->v6 - switch_booke_debug_regs() not guarded by the compiler switch arch/powerpc/include/asm/switch_to.h |1 + arch/powerpc/kernel/process.c|3 ++- 2 files changed, 3 insertions(+), 1 deletions(-) diff --git a/arch/powerpc/include/asm/switch_to.h b/arch/powerpc/include/asm/switch_to.h index 200d763..db68f1d 100644 --- a/arch/powerpc/include/asm/switch_to.h +++ b/arch/powerpc/include/asm/switch_to.h @@ -29,6 +29,7 @@ extern void giveup_vsx(struct task_struct *); extern void enable_kernel_spe(void); extern void giveup_spe(struct task_struct *); extern void load_up_spe(struct task_struct *); +extern void switch_booke_debug_regs(struct thread_struct *new_thread); #ifndef CONFIG_SMP extern void discard_lazy_cpu_state(void); diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index 01ff496..da586aa 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -362,12 +362,13 @@ static void prime_debug_regs(struct thread_struct *thread) * debug registers, set the debug registers from the values * stored in the new thread. 
*/ -static void switch_booke_debug_regs(struct thread_struct *new_thread) +void switch_booke_debug_regs(struct thread_struct *new_thread) { if ((current->thread.debug.dbcr0 & DBCR0_IDM) || (new_thread->debug.dbcr0 & DBCR0_IDM)) prime_debug_regs(new_thread); } +EXPORT_SYMBOL_GPL(switch_booke_debug_regs); #else /* !CONFIG_PPC_ADV_DEBUG_REGS */ #ifndef CONFIG_HAVE_HW_BREAKPOINT static void set_debug_reg_defaults(struct thread_struct *thread) -- 1.7.0.4 
Re: [PATCH 1/2] KVM: x86: Simplify __apic_accept_irq
On Thu, Jul 25, 2013 at 09:58:45AM +0200, Jan Kiszka wrote: If posted interrupts are enabled, we can no longer track if an IRQ was coalesced based on IRR. So drop this logic also from the classic software path and simplify apic_test_and_set_irr to apic_set_irr. Signed-off-by: Jan Kiszka jan.kis...@siemens.com Applied both, thanks. --- arch/x86/kvm/lapic.c | 23 --- 1 files changed, 8 insertions(+), 15 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index afc1124..9dc3650 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -331,10 +331,10 @@ void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir) } EXPORT_SYMBOL_GPL(kvm_apic_update_irr); -static inline int apic_test_and_set_irr(int vec, struct kvm_lapic *apic) +static inline void apic_set_irr(int vec, struct kvm_lapic *apic) { apic->irr_pending = true; - return apic_test_and_set_vector(vec, apic->regs + APIC_IRR); + apic_set_vector(vec, apic->regs + APIC_IRR); } static inline int apic_search_irr(struct kvm_lapic *apic) @@ -681,28 +681,21 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode, if (unlikely(!apic_enabled(apic))) break; + result = 1; + if (dest_map) __set_bit(vcpu->vcpu_id, dest_map); - if (kvm_x86_ops->deliver_posted_interrupt) { - result = 1; + if (kvm_x86_ops->deliver_posted_interrupt) kvm_x86_ops->deliver_posted_interrupt(vcpu, vector); - } else { - result = !apic_test_and_set_irr(vector, apic); - - if (!result) { - if (trig_mode) - apic_debug("level trig mode repeatedly for vector %d", vector); - goto out; - } + else { + apic_set_irr(vector, apic); kvm_make_request(KVM_REQ_EVENT, vcpu); kvm_vcpu_kick(vcpu); } -out: trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode, - trig_mode, vector, !result); + trig_mode, vector, false); break; case APIC_DM_REMRD: -- 1.7.3.4 -- Gleb. 
Re: [RFC][PATCH 0/2] s390/kvm: add kvm support for guest page hinting v2
On 25/07/13 10:54, Martin Schwidefsky wrote: v1->v2: - found a way to simplify the common code patch Linux on s390 as a guest under z/VM has been using the guest page hinting interface (alias collaborative memory management) for a long time. The full version with volatile states has been deemed to be too complicated (see the old discussion about guest page hinting e.g. on http://marc.info/?l=linux-mm&m=123816662017742&w=2). What is currently implemented for the guest is the unused and stable states to mark unallocated pages as freely available to the host. This works just fine with z/VM as the host. The two patches in this series implement the guest page hinting interface for the unused and stable states in the KVM host. Most of the code is specific to s390 but there is a common memory management part as well, see patch #1. The code is working stably now; from my point of view this is ready for prime-time. Konstantin Weitz (2): mm: add support for discard of unused ptes s390/kvm: support collaborative memory management Can you also add the patch from our tree that resets the usage states on reboot (diag 308 subcode 3 and 4)? 
Re: [PATCH 1/2] mm: add support for discard of unused ptes
On 25/07/13 10:54, Martin Schwidefsky wrote: From: Konstantin Weitz konstantin.we...@gmail.com In a virtualized environment and given an appropriate interface the guest can mark pages as unused while they are free (for the s390 implementation see git commit 45e576b1c3d00206 guest page hinting light). For the host the unused state is a property of the pte. This patch adds the primitive 'pte_unused' and code to the host swap out handler so that pages marked as unused by all mappers are not swapped out but discarded instead, thus saving one IO for swap out and potentially another one for swap in. [ Martin Schwidefsky: patch reordering and simplification ] Signed-off-by: Konstantin Weitz konstantin.we...@gmail.com Signed-off-by: Martin Schwidefsky schwidef...@de.ibm.com Reviewed-by: Christian Borntraeger borntrae...@de.ibm.com --- include/asm-generic/pgtable.h | 13 + mm/rmap.c | 10 ++ 2 files changed, 23 insertions(+) diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index 2f47ade..ec540c5 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -193,6 +193,19 @@ static inline int pte_same(pte_t pte_a, pte_t pte_b) } #endif +#ifndef __HAVE_ARCH_PTE_UNUSED +/* + * Some architectures provide facilities to virtualization guests + * so that they can flag allocated pages as unused. This allows the + * host to transparently reclaim unused pages. This function returns + * whether the pte's page is unused. 
+ */ +static inline int pte_unused(pte_t pte) +{ + return 0; +} +#endif + #ifndef __HAVE_ARCH_PMD_SAME #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b) diff --git a/mm/rmap.c b/mm/rmap.c index cd356df..2291f25 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1234,6 +1234,16 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, } set_pte_at(mm, address, pte, swp_entry_to_pte(make_hwpoison_entry(page))); + } else if (pte_unused(pteval)) { + /* + * The guest indicated that the page content is of no + * interest anymore. Simply discard the pte, vmscan + * will take care of the rest. + */ + if (PageAnon(page)) + dec_mm_counter(mm, MM_ANONPAGES); + else + dec_mm_counter(mm, MM_FILEPAGES); } else if (PageAnon(page)) { swp_entry_t entry = { .val = page_private(page) }; 
RE: [PATCH V2 4/4] x86: correctly detect hypervisor
-Original Message- From: Jason Wang [mailto:jasow...@redhat.com] Sent: Thursday, July 25, 2013 4:55 AM To: t...@linutronix.de; mi...@redhat.com; h...@zytor.com; x...@kernel.org; linux-ker...@vger.kernel.org; pbonz...@redhat.com Cc: kvm@vger.kernel.org; Jason Wang; KY Srinivasan; Haiyang Zhang; Konrad Rzeszutek Wilk; Jeremy Fitzhardinge; Doug Covelli; Borislav Petkov; Dan Hecht; Paul Gortmaker; Marcelo Tosatti; Gleb Natapov; Frederic Weisbecker; de...@linuxdriverproject.org; xen-de...@lists.xensource.com; virtualizat...@lists.linux-foundation.org Subject: [PATCH V2 4/4] x86: correctly detect hypervisor We try to handle the hypervisor compatibility mode by detecting hypervisor through a specific order. This is not robust, since hypervisors may implement each others features. This patch tries to handle this situation by always choosing the last one in the CPUID leaves. This is done by letting .detect() returns a priority instead of true/false and just re-using the CPUID leaf where the signature were found as the priority (or 1 if it was found by DMI). Then we can just pick hypervisor who has the highest priority. Other sophisticated detection method could also be implemented on top. Suggested by H. Peter Anvin and Paolo Bonzini. Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@redhat.com Cc: H. Peter Anvin h...@zytor.com Cc: x...@kernel.org Cc: K. Y. 
Srinivasan k...@microsoft.com Cc: Haiyang Zhang haiya...@microsoft.com Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com Cc: Jeremy Fitzhardinge jer...@goop.org Cc: Doug Covelli dcove...@vmware.com Cc: Borislav Petkov b...@suse.de Cc: Dan Hecht dhe...@vmware.com Cc: Paul Gortmaker paul.gortma...@windriver.com Cc: Marcelo Tosatti mtosa...@redhat.com Cc: Gleb Natapov g...@redhat.com Cc: Paolo Bonzini pbonz...@redhat.com Cc: Frederic Weisbecker fweis...@gmail.com Cc: linux-ker...@vger.kernel.org Cc: de...@linuxdriverproject.org Cc: kvm@vger.kernel.org Cc: xen-de...@lists.xensource.com Cc: virtualizat...@lists.linux-foundation.org Signed-off-by: Jason Wang jasow...@redhat.com Acked-by: K. Y. Srinivasan k...@microsoft.com --- arch/x86/include/asm/hypervisor.h |2 +- arch/x86/kernel/cpu/hypervisor.c | 15 +++ arch/x86/kernel/cpu/mshyperv.c| 13 - arch/x86/kernel/cpu/vmware.c |8 arch/x86/kernel/kvm.c |6 ++ arch/x86/xen/enlighten.c |9 +++-- 6 files changed, 25 insertions(+), 28 deletions(-) diff --git a/arch/x86/include/asm/hypervisor.h b/arch/x86/include/asm/hypervisor.h index 2d4b5e6..e42f758 100644 --- a/arch/x86/include/asm/hypervisor.h +++ b/arch/x86/include/asm/hypervisor.h @@ -33,7 +33,7 @@ struct hypervisor_x86 { const char *name; /* Detection routine */ - bool(*detect)(void); + uint32_t(*detect)(void); /* Adjust CPU feature bits (run once per CPU) */ void(*set_cpu_features)(struct cpuinfo_x86 *); diff --git a/arch/x86/kernel/cpu/hypervisor.c b/arch/x86/kernel/cpu/hypervisor.c index 8727921..36ce402 100644 --- a/arch/x86/kernel/cpu/hypervisor.c +++ b/arch/x86/kernel/cpu/hypervisor.c @@ -25,11 +25,6 @@ #include asm/processor.h #include asm/hypervisor.h -/* - * Hypervisor detect order. This is specified explicitly here because - * some hypervisors might implement compatibility modes for other - * hypervisors and therefore need to be detected in specific sequence. 
- */ static const __initconst struct hypervisor_x86 * const hypervisors[] = { #ifdef CONFIG_XEN_PVHVM @@ -49,15 +44,19 @@ static inline void __init detect_hypervisor_vendor(void) { const struct hypervisor_x86 *h, * const *p; + uint32_t pri, max_pri = 0; for (p = hypervisors; p < hypervisors + ARRAY_SIZE(hypervisors); p++) { h = *p; - if (h->detect()) { + pri = h->detect(); + if (pri != 0 && pri > max_pri) { + max_pri = pri; x86_hyper = h; - printk(KERN_INFO "Hypervisor detected: %s\n", h->name); - break; } } + + if (max_pri) + printk(KERN_INFO "Hypervisor detected: %s\n", x86_hyper->name); } void init_hypervisor(struct cpuinfo_x86 *c) diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c index 8f4be53..71a39f3 100644 --- a/arch/x86/kernel/cpu/mshyperv.c +++ b/arch/x86/kernel/cpu/mshyperv.c @@ -27,20 +27,23 @@ struct ms_hyperv_info ms_hyperv; EXPORT_SYMBOL_GPL(ms_hyperv); -static bool __init ms_hyperv_platform(void) +static uint32_t __init ms_hyperv_platform(void) { u32 eax; u32 hyp_signature[3]; if (!boot_cpu_has(X86_FEATURE_HYPERVISOR)) - return false; + return 0; cpuid(HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS,
Re: [PATCH 2/2] s390/kvm: support collaborative memory management
On 25/07/13 10:54, Martin Schwidefsky wrote: From: Konstantin Weitz konstantin.we...@gmail.com This patch enables Collaborative Memory Management (CMM) for kvm on s390. CMM allows the guest to inform the host about page usage (see arch/s390/mm/cmm.c). The host uses this information to avoid swapping in unused pages in the page fault handler. Further, a CPU provided list of unused invalid pages is processed to reclaim swap space of not yet accessed unused pages. [ Martin Schwidefsky: patch reordering and cleanup ] Signed-off-by: Konstantin Weitz konstantin.we...@gmail.com Signed-off-by: Martin Schwidefsky schwidef...@de.ibm.com Two things to consider: live migration and reset. When we implement live migration, we need to add some additional magic for userspace to query/set the unused state. But this can be a followup patch, whenever it becomes necessary. As of today it should be enough to add some code to the diag308 handler to make reset safe. For other kinds of reset (e.g. those for kdump) we need to make this accessible to userspace. Again, this can be added later on when we implement the other missing pieces for kdump and friends. 
So Reviewed-by: Christian Borntraeger borntrae...@de.ibm.com Tested-by: Christian Borntraeger borntrae...@de.ibm.com --- arch/s390/include/asm/kvm_host.h |5 ++- arch/s390/include/asm/pgtable.h | 24 arch/s390/kvm/kvm-s390.c | 25 + arch/s390/kvm/kvm-s390.h |2 + arch/s390/kvm/priv.c | 41 arch/s390/mm/pgtable.c | 77 ++ 6 files changed, 173 insertions(+), 1 deletion(-) diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h index 3238d40..de6450e 100644 --- a/arch/s390/include/asm/kvm_host.h +++ b/arch/s390/include/asm/kvm_host.h @@ -113,7 +113,9 @@ struct kvm_s390_sie_block { __u64 gbea; /* 0x0180 */ __u8reserved188[24];/* 0x0188 */ __u32 fac;/* 0x01a0 */ - __u8reserved1a4[92];/* 0x01a4 */ + __u8reserved1a4[20];/* 0x01a4 */ + __u64 cbrlo; /* 0x01b8 */ + __u8reserved1c0[64];/* 0x01c0 */ } __attribute__((packed)); struct kvm_vcpu_stat { @@ -149,6 +151,7 @@ struct kvm_vcpu_stat { u32 instruction_stsi; u32 instruction_stfl; u32 instruction_tprot; + u32 instruction_essa; u32 instruction_sigp_sense; u32 instruction_sigp_sense_running; u32 instruction_sigp_external_call; diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 75fb726..65d48b8 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -227,6 +227,7 @@ extern unsigned long MODULES_END; #define _PAGE_SWR0x008 /* SW pte referenced bit */ #define _PAGE_SWW0x010 /* SW pte write bit */ #define _PAGE_SPECIAL0x020 /* SW associated with special page */ +#define _PAGE_UNUSED 0x040 /* SW bit for ptep_clear_flush() */ #define __HAVE_ARCH_PTE_SPECIAL /* Set of bits not changed in pte_modify */ @@ -375,6 +376,12 @@ extern unsigned long MODULES_END; #endif /* CONFIG_64BIT */ +/* Guest Page State used for virtualization */ +#define _PGSTE_GPS_ZERO 0x8000UL +#define _PGSTE_GPS_USAGE_MASK0x0300UL +#define _PGSTE_GPS_USAGE_STABLE 0xUL +#define _PGSTE_GPS_USAGE_UNUSED 0x0100UL + /* * A user page table pointer has the space-switch-event bit, the * 
private-space-control bit and the storage-alteration-event-control @@ -590,6 +597,12 @@ static inline int pte_file(pte_t pte) return (pte_val(pte) mask) == _PAGE_TYPE_FILE; } +static inline int pte_swap(pte_t pte) +{ + unsigned long mask = _PAGE_RO | _PAGE_INVALID | _PAGE_SWT | _PAGE_SWX; + return (pte_val(pte) mask) == _PAGE_TYPE_SWAP; +} + static inline int pte_special(pte_t pte) { return (pte_val(pte) _PAGE_SPECIAL); @@ -794,6 +807,7 @@ unsigned long gmap_translate(unsigned long address, struct gmap *); unsigned long __gmap_fault(unsigned long address, struct gmap *); unsigned long gmap_fault(unsigned long address, struct gmap *); void gmap_discard(unsigned long from, unsigned long to, struct gmap *); +void __gmap_zap(unsigned long address, struct gmap *); void gmap_register_ipte_notifier(struct gmap_notifier *); void gmap_unregister_ipte_notifier(struct gmap_notifier *); @@ -825,6 +839,7 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr, if (mm_has_pgste(mm)) { pgste = pgste_get_lock(ptep); + pgste_val(pgste) = ~_PGSTE_GPS_ZERO; pgste_set_key(ptep, pgste, entry); pgste_set_pte(ptep, entry); pgste_set_unlock(ptep,
[PATCH v4 05/13] nEPT: make guest's A/D bits depends on guest's paging mode
EPT uses different shifts for A/D bits and the first version of nEPT does not support them at all. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/paging_tmpl.h | 30 ++ 1 file changed, 22 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index fb26ca9..7581395 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -32,6 +32,10 @@ #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl) #define PT_INDEX(addr, level) PT64_INDEX(addr, level) #define PT_LEVEL_BITS PT64_LEVEL_BITS + #define PT_GUEST_ACCESSED_MASK PT_ACCESSED_MASK + #define PT_GUEST_DIRTY_MASK PT_DIRTY_MASK + #define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT + #define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT #ifdef CONFIG_X86_64 #define PT_MAX_FULL_LEVELS 4 #define CMPXCHG cmpxchg @@ -49,6 +53,10 @@ #define PT_INDEX(addr, level) PT32_INDEX(addr, level) #define PT_LEVEL_BITS PT32_LEVEL_BITS #define PT_MAX_FULL_LEVELS 2 + #define PT_GUEST_ACCESSED_MASK PT_ACCESSED_MASK + #define PT_GUEST_DIRTY_MASK PT_DIRTY_MASK + #define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT + #define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT #define CMPXCHG cmpxchg #else #error Invalid PTTYPE value @@ -88,7 +96,8 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte) mask = (unsigned)~ACC_WRITE_MASK; /* Allow write access to dirty gptes */ - mask |= (gpte >> (PT_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) & PT_WRITABLE_MASK; + mask |= (gpte >> (PT_GUEST_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) & + PT_WRITABLE_MASK; *access &= mask; } @@ -138,7 +147,7 @@ static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu, if (!FNAME(is_present_gpte)(gpte)) goto no_present; - if (!(gpte & PT_ACCESSED_MASK)) + if (!(gpte & PT_GUEST_ACCESSED_MASK)) goto no_present; return false; @@ -174,14 +183,14 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu, table_gfn = walker->table_gfn[level - 1]; ptep_user = walker->ptep_user[level - 1]; index = offset_in_page(ptep_user) 
/ sizeof(pt_element_t); - if (!(pte & PT_ACCESSED_MASK)) { + if (!(pte & PT_GUEST_ACCESSED_MASK)) { trace_kvm_mmu_set_accessed_bit(table_gfn, index, sizeof(pte)); - pte |= PT_ACCESSED_MASK; + pte |= PT_GUEST_ACCESSED_MASK; } if (level == walker->level && write_fault && - !(pte & PT_DIRTY_MASK)) { + !(pte & PT_GUEST_DIRTY_MASK)) { trace_kvm_mmu_set_dirty_bit(table_gfn, index, sizeof(pte)); - pte |= PT_DIRTY_MASK; + pte |= PT_GUEST_DIRTY_MASK; } if (pte == orig_pte) continue; @@ -235,7 +244,7 @@ retry_walk: ASSERT((!is_long_mode(vcpu) && is_pae(vcpu)) || (mmu->get_cr3(vcpu) & CR3_NONPAE_RESERVED_BITS) == 0); - accessed_dirty = PT_ACCESSED_MASK; + accessed_dirty = PT_GUEST_ACCESSED_MASK; pt_access = pte_access = ACC_ALL; ++walker->level; @@ -310,7 +319,8 @@ retry_walk: /* * On a write fault, fold the dirty bit into accessed_dirty by * shifting it one place right. */ - accessed_dirty &= pte >> (PT_DIRTY_SHIFT - PT_ACCESSED_SHIFT); + accessed_dirty &= pte >> + (PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT); if (unlikely(!accessed_dirty)) { ret = FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker, write_fault); @@ -886,3 +896,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp) #undef gpte_to_gfn #undef gpte_to_gfn_lvl #undef CMPXCHG +#undef PT_GUEST_ACCESSED_MASK +#undef PT_GUEST_DIRTY_MASK +#undef PT_GUEST_DIRTY_SHIFT +#undef PT_GUEST_ACCESSED_SHIFT -- 1.7.10.4 
[PATCH v4 00/13] Nested EPT
After changing hands several times, I am proud to present a new version of the nested EPT patches. Nothing groundbreaking here compared to v3: all review comments are addressed, some by Yang Zhang and some by yours truly. Gleb Natapov (1): nEPT: make guest's A/D bits depend on guest's paging mode Nadav Har'El (10): nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 nEPT: Fix cr3 handling in nested exit and entry nEPT: Fix wrong test in kvm_set_cr3 nEPT: Move common code to paging_tmpl.h nEPT: Add EPT tables support to paging_tmpl.h nEPT: Nested INVEPT nEPT: MMU context for nested EPT nEPT: Advertise EPT to L1 nEPT: Some additional comments nEPT: Miscellaneous cleanups Yang Zhang (2): nEPT: Redefine EPT-specific link_shadow_page() nEPT: Add nEPT violation/misconfiguration support arch/x86/include/asm/kvm_host.h |4 + arch/x86/include/asm/vmx.h |3 + arch/x86/include/uapi/asm/vmx.h |1 + arch/x86/kvm/mmu.c | 134 ++--- arch/x86/kvm/mmu.h |2 + arch/x86/kvm/paging_tmpl.h | 175 arch/x86/kvm/vmx.c | 210 --- arch/x86/kvm/x86.c | 11 -- 8 files changed, 436 insertions(+), 104 deletions(-) -- 1.7.10.4
[PATCH v4 12/13] nEPT: Some additional comments
From: Nadav Har'El n...@il.ibm.com Some additional comments to preexisting code: Explain who (L0 or L1) handles EPT violation and misconfiguration exits. Don't mention shadow on either EPT or shadow as the only two options. Signed-off-by: Nadav Har'El n...@il.ibm.com Signed-off-by: Jun Nakajima jun.nakaj...@intel.com Signed-off-by: Xinhao Xu xinhao...@intel.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/vmx.c | 13 + 1 file changed, 13 insertions(+) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index a77f902..d513ace 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -6659,7 +6659,20 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu) return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES); case EXIT_REASON_EPT_VIOLATION: + /* +* L0 always deals with the EPT violation. If nested EPT is +* used, and the nested mmu code discovers that the address is +* missing in the guest EPT table (EPT12), the EPT violation +* will be injected with nested_ept_inject_page_fault() +*/ + return 0; case EXIT_REASON_EPT_MISCONFIG: + /* +* L2 never uses directly L1's EPT, but rather L0's own EPT +* table (shadow on EPT) or a merged EPT table that L0 built +* (EPT on EPT). So any problems with the structure of the +* table is L0's fault. +*/ return 0; case EXIT_REASON_PREEMPTION_TIMER: return vmcs12-pin_based_vm_exec_control -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 09/13] nEPT: Add nEPT violation/misconfiguration support
From: Yang Zhang yang.z.zh...@intel.com Inject nEPT fault to L1 guest. This patch is original from Xinhao. Signed-off-by: Jun Nakajima jun.nakaj...@intel.com Signed-off-by: Xinhao Xu xinhao...@intel.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_host.h |4 arch/x86/kvm/mmu.c | 34 ++ arch/x86/kvm/paging_tmpl.h | 30 +- arch/x86/kvm/vmx.c | 17 + 4 files changed, 84 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 531f47c..58a17c0 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -286,6 +286,7 @@ struct kvm_mmu { u64 *pae_root; u64 *lm_root; u64 rsvd_bits_mask[2][4]; + u64 bad_mt_xwr; /* * Bitmap: bit set = last pte in walk @@ -512,6 +513,9 @@ struct kvm_vcpu_arch { * instruction. */ bool write_fault_to_shadow_pgtable; + + /* set at EPT violation at this point */ + unsigned long exit_qualification; }; struct kvm_lpage_info { diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 3df3ac3..58ae9db 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3521,6 +3521,8 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, int maxphyaddr = cpuid_maxphyaddr(vcpu); u64 exb_bit_rsvd = 0; + context-bad_mt_xwr = 0; + if (!context-nx) exb_bit_rsvd = rsvd_bits(63, 63); switch (context-root_level) { @@ -3576,6 +3578,38 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, } } +static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu, + struct kvm_mmu *context, bool execonly) +{ + int maxphyaddr = cpuid_maxphyaddr(vcpu); + int pte; + + context-rsvd_bits_mask[0][3] = + rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7); + context-rsvd_bits_mask[0][2] = + rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6); + context-rsvd_bits_mask[0][1] = + rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6); + context-rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51); + + /* large page */ + context-rsvd_bits_mask[1][3] = 
context-rsvd_bits_mask[0][3]; + context-rsvd_bits_mask[1][2] = + rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 29); + context-rsvd_bits_mask[1][1] = + rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 20); + context-rsvd_bits_mask[1][0] = context-rsvd_bits_mask[0][0]; + + for (pte = 0; pte 64; pte++) { + int rwx_bits = pte 7; + int mt = pte 3; + if (mt == 0x2 || mt == 0x3 || mt == 0x7 || + rwx_bits == 0x2 || rwx_bits == 0x6 || + (rwx_bits == 0x4 !execonly)) + context-bad_mt_xwr |= (1ull pte); + } +} + static void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu) { unsigned bit, byte, pfec; diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 23a19a5..58d2f87 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -121,14 +121,23 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte) #endif } +#if PTTYPE == PTTYPE_EPT +#define CHECK_BAD_MT_XWR(G) mmu-bad_mt_xwr (1ull ((G) 0x3f)); +#else +#define CHECK_BAD_MT_XWR(G) 0; +#endif + static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level) { int bit7; bit7 = (gpte 7) 1; - return (gpte mmu-rsvd_bits_mask[bit7][level-1]) != 0; + return ((gpte mmu-rsvd_bits_mask[bit7][level-1]) != 0) || + CHECK_BAD_MT_XWR(gpte); } +#undef CHECK_BAD_MT_XWR + static inline int FNAME(is_present_gpte)(unsigned long pte) { #if PTTYPE != PTTYPE_EPT @@ -376,6 +385,25 @@ error: walker-fault.vector = PF_VECTOR; walker-fault.error_code_valid = true; walker-fault.error_code = errcode; + +#if PTTYPE == PTTYPE_EPT + /* +* Use PFERR_RSVD_MASK in erorr_code to to tell if EPT +* misconfiguration requires to be injected. The detection is +* done by is_rsvd_bits_set() above. +* +* We set up the value of exit_qualification to inject: +* [2:0] - Derive from [2:0] of real exit_qualification at EPT violation +* [5:3] - Calculated by the page walk of the guest EPT page tables +* [7:8] - Clear to 0. +* +* The other bits are set to 0. 
+*/ + if (!(errcode & PFERR_RSVD_MASK)) { + vcpu->arch.exit_qualification &= 0x7; + vcpu->arch.exit_qualification |= ((pt_access & pte) & 0x7) << 3; + } +#endif walker->fault.address = addr;
[PATCH v4 11/13] nEPT: Advertise EPT to L1
From: Nadav Har'El n...@il.ibm.com Advertise the support of EPT to the L1 guest, through the appropriate MSR. This is the last patch of the basic Nested EPT feature, so as to allow bisection through this patch series: The guest will not see EPT support until this last patch, and will not attempt to use the half-applied feature. Signed-off-by: Nadav Har'El n...@il.ibm.com Signed-off-by: Jun Nakajima jun.nakaj...@intel.com Signed-off-by: Xinhao Xu xinhao...@intel.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/vmx.c | 16 ++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 6b79db7..a77f902 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2249,6 +2249,18 @@ static __init void nested_vmx_setup_ctls_msrs(void) SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | SECONDARY_EXEC_WBINVD_EXITING; + if (enable_ept) { + /* nested EPT: emulate EPT also to L1 */ + nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT; + nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT; + nested_vmx_ept_caps |= + VMX_EPT_INVEPT_BIT | VMX_EPT_EXTENT_GLOBAL_BIT | + VMX_EPT_EXTENT_CONTEXT_BIT | + VMX_EPT_EXTENT_INDIVIDUAL_BIT; + nested_vmx_ept_caps = vmx_capability.ept; + } else + nested_vmx_ept_caps = 0; + /* miscellaneous data */ rdmsr(MSR_IA32_VMX_MISC, nested_vmx_misc_low, nested_vmx_misc_high); nested_vmx_misc_low = VMX_MISC_PREEMPTION_TIMER_RATE_MASK | @@ -2357,8 +2369,8 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata) nested_vmx_secondary_ctls_high); break; case MSR_IA32_VMX_EPT_VPID_CAP: - /* Currently, no nested ept or nested vpid */ - *pdata = 0; + /* Currently, no nested vpid support */ + *pdata = nested_vmx_ept_caps; break; default: return 0; -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at 
http://vger.kernel.org/majordomo-info.html
[PATCH v4 02/13] nEPT: Fix cr3 handling in nested exit and entry
From: Nadav Har'El n...@il.ibm.com The existing code for handling cr3 and related VMCS fields during nested exit and entry wasn't correct in all cases: If L2 is allowed to control cr3 (and this is indeed the case in nested EPT), during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and we forgot to do so. This patch adds this copy. If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and whoever does control cr3 (L1 or L2) is using PAE, the processor might have saved PDPTEs and we should also save them in vmcs12 (and restore later). Signed-off-by: Nadav Har'El n...@il.ibm.com Signed-off-by: Jun Nakajima jun.nakaj...@intel.com Signed-off-by: Xinhao Xu xinhao...@intel.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/vmx.c | 30 ++ 1 file changed, 30 insertions(+) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 1e9437f..89b15df 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -7595,6 +7595,17 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) kvm_set_cr3(vcpu, vmcs12-guest_cr3); kvm_mmu_reset_context(vcpu); + /* +* Additionally, except when L0 is using shadow page tables, L1 or +* L2 control guest_cr3 for L2, so they may also have saved PDPTEs +*/ + if (enable_ept) { + vmcs_write64(GUEST_PDPTR0, vmcs12-guest_pdptr0); + vmcs_write64(GUEST_PDPTR1, vmcs12-guest_pdptr1); + vmcs_write64(GUEST_PDPTR2, vmcs12-guest_pdptr2); + vmcs_write64(GUEST_PDPTR3, vmcs12-guest_pdptr3); + } + kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12-guest_rsp); kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12-guest_rip); } @@ -7917,6 +7928,25 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) vmcs12-guest_pending_dbg_exceptions = vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS); + /* +* In some cases (usually, nested EPT), L2 is allowed to change its +* own CR3 without exiting. If it has changed it, we must keep it. 
+* Of course, if L0 is using shadow page tables, GUEST_CR3 was defined +* by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12. +*/ + if (enable_ept) + vmcs12-guest_cr3 = vmcs_read64(GUEST_CR3); + /* +* Additionally, except when L0 is using shadow page tables, L1 or +* L2 control guest_cr3 for L2, so save their PDPTEs +*/ + if (enable_ept) { + vmcs12-guest_pdptr0 = vmcs_read64(GUEST_PDPTR0); + vmcs12-guest_pdptr1 = vmcs_read64(GUEST_PDPTR1); + vmcs12-guest_pdptr2 = vmcs_read64(GUEST_PDPTR2); + vmcs12-guest_pdptr3 = vmcs_read64(GUEST_PDPTR3); + } + vmcs12-vm_entry_controls = (vmcs12-vm_entry_controls ~VM_ENTRY_IA32E_MODE) | (vmcs_read32(VM_ENTRY_CONTROLS) VM_ENTRY_IA32E_MODE); -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 13/13] nEPT: Miscellaneous cleanups
From: Nadav Har'El n...@il.ibm.com Some trivial code cleanups not really related to nested EPT. Signed-off-by: Nadav Har'El n...@il.ibm.com Signed-off-by: Jun Nakajima jun.nakaj...@intel.com Signed-off-by: Xinhao Xu xinhao...@intel.com Reviewed-by: Paolo Bonzini pbonz...@redhat.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/vmx.c |6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index d513ace..66d9233 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -715,7 +715,6 @@ static void nested_release_page_clean(struct page *page) static u64 construct_eptp(unsigned long root_hpa); static void kvm_cpu_vmxon(u64 addr); static void kvm_cpu_vmxoff(void); -static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3); static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr); static void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg); @@ -1040,8 +1039,7 @@ static inline bool nested_cpu_has2(struct vmcs12 *vmcs12, u32 bit) (vmcs12-secondary_vm_exec_control bit); } -static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12, - struct kvm_vcpu *vcpu) +static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12) { return vmcs12-pin_based_vm_exec_control PIN_BASED_VIRTUAL_NMIS; } @@ -6760,7 +6758,7 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu) if (unlikely(!cpu_has_virtual_nmis() vmx-soft_vnmi_blocked !(is_guest_mode(vcpu) nested_cpu_has_virtual_nmis( - get_vmcs12(vcpu), vcpu { + get_vmcs12(vcpu) { if (vmx_interrupt_allowed(vcpu)) { vmx-soft_vnmi_blocked = 0; } else if (vmx-vnmi_blocked_time 10LL -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
From: Nadav Har'El n...@il.ibm.com Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577 switches the EFER MSR when EPT is used and the host and guest have different NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2) and want to be able to run recent KVM as L1, we need to allow L1 to use this EFER switching feature. To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available, and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds support for the former (the latter is still unsupported). Nested entry and exit emulation (prepare_vmcs_02 and load_vmcs12_host_state, respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all that's left to do in this patch is to properly advertise this feature to L1. Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using vmx_set_efer (which itself sets one of several vmcs02 fields), so we always support this feature, regardless of whether the host supports it. 
Signed-off-by: Nadav Har'El n...@il.ibm.com Signed-off-by: Jun Nakajima jun.nakaj...@intel.com Signed-off-by: Xinhao Xu xinhao...@intel.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/vmx.c | 23 --- 1 file changed, 16 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index e999dc7..1e9437f 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2198,7 +2198,8 @@ static __init void nested_vmx_setup_ctls_msrs(void) #else nested_vmx_exit_ctls_high = 0; #endif - nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR; + nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | + VM_EXIT_LOAD_IA32_EFER); /* entry controls */ rdmsr(MSR_IA32_VMX_ENTRY_CTLS, @@ -2207,8 +2208,8 @@ static __init void nested_vmx_setup_ctls_msrs(void) nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; nested_vmx_entry_ctls_high = VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE; - nested_vmx_entry_ctls_high |= VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; - + nested_vmx_entry_ctls_high |= (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | + VM_ENTRY_LOAD_IA32_EFER); /* cpu-based controls */ rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high); @@ -7529,10 +7530,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) vcpu-arch.cr0_guest_owned_bits = ~vmcs12-cr0_guest_host_mask; vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu-arch.cr0_guest_owned_bits); - /* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */ - vmcs_write32(VM_EXIT_CONTROLS, - vmcs12-vm_exit_controls | vmcs_config.vmexit_ctrl); - vmcs_write32(VM_ENTRY_CONTROLS, vmcs12-vm_entry_controls | + /* L2-L1 exit controls are emulated - the hardware exit is to L0 so +* we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER +* bits are further modified by vmx_set_efer() below. 
+*/ + vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl); + + /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are +* emulated by vmx_set_efer(), below. +*/ + vmcs_write32(VM_ENTRY_CONTROLS, + (vmcs12-vm_entry_controls ~VM_ENTRY_LOAD_IA32_EFER + ~VM_ENTRY_IA32E_MODE) | (vmcs_config.vmentry_ctrl ~VM_ENTRY_IA32E_MODE)); if (vmcs12-vm_entry_controls VM_ENTRY_LOAD_IA32_PAT) -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 10/13] nEPT: MMU context for nested EPT
From: Nadav Har'El n...@il.ibm.com KVM's existing shadow MMU code already supports nested TDP. To use it, we need to set up a new MMU context for nested EPT, and create a few callbacks for it (nested_ept_*()). This context should also use the EPT versions of the page table access functions (defined in the previous patch). Then, we need to switch back and forth between this nested context and the regular MMU context when switching between L1 and L2 (when L1 runs this L2 with EPT). Signed-off-by: Nadav Har'El n...@il.ibm.com Signed-off-by: Jun Nakajima jun.nakaj...@intel.com Signed-off-by: Xinhao Xu xinhao...@intel.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/mmu.c | 26 ++ arch/x86/kvm/mmu.h |2 ++ arch/x86/kvm/vmx.c | 41 - 3 files changed, 68 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 58ae9db..37fff14 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3792,6 +3792,32 @@ int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context) } EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu); +int kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context, + bool execonly) +{ + ASSERT(vcpu); + ASSERT(!VALID_PAGE(vcpu-arch.mmu.root_hpa)); + + context-shadow_root_level = kvm_x86_ops-get_tdp_level(); + + context-nx = true; + context-new_cr3 = paging_new_cr3; + context-page_fault = ept_page_fault; + context-gva_to_gpa = ept_gva_to_gpa; + context-sync_page = ept_sync_page; + context-invlpg = ept_invlpg; + context-update_pte = ept_update_pte; + context-free = paging_free; + context-root_level = context-shadow_root_level; + context-root_hpa = INVALID_PAGE; + context-direct_map = false; + + reset_rsvds_bits_mask_ept(vcpu, context, execonly); + + return 0; +} +EXPORT_SYMBOL_GPL(kvm_init_shadow_ept_mmu); + static int init_kvm_softmmu(struct kvm_vcpu *vcpu) { int r = kvm_init_shadow_mmu(vcpu, vcpu-arch.walk_mmu); diff --git a/arch/x86/kvm/mmu.h 
b/arch/x86/kvm/mmu.h index 5b59c57..77e044a 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -71,6 +71,8 @@ enum { int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct); int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context); +int kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context, + bool execonly); static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm) { diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index bbfff8d..6b79db7 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -1046,6 +1046,11 @@ static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12, return vmcs12-pin_based_vm_exec_control PIN_BASED_VIRTUAL_NMIS; } +static inline int nested_cpu_has_ept(struct vmcs12 *vmcs12) +{ + return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_EPT); +} + static inline bool is_exception(u32 intr_info) { return (intr_info (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK)) @@ -7433,6 +7438,33 @@ static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu, vmcs12-guest_physical_address = fault-address; } +/* Callbacks for nested_ept_init_mmu_context: */ + +static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu) +{ + /* return the page table to be shadowed - in our case, EPT12 */ + return get_vmcs12(vcpu)-ept_pointer; +} + +static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu) +{ + int r = kvm_init_shadow_ept_mmu(vcpu, vcpu-arch.mmu, + nested_vmx_ept_caps VMX_EPT_EXECUTE_ONLY_BIT); + + vcpu-arch.mmu.set_cr3 = vmx_set_cr3; + vcpu-arch.mmu.get_cr3 = nested_ept_get_cr3; + vcpu-arch.mmu.inject_page_fault = nested_ept_inject_page_fault; + + vcpu-arch.walk_mmu = vcpu-arch.nested_mmu; + + return r; +} + +static void nested_ept_uninit_mmu_context(struct kvm_vcpu *vcpu) +{ + vcpu-arch.walk_mmu = vcpu-arch.mmu; +} + /* * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested * L2 guest. 
L1 has a vmcs for L2 (vmcs12), and this function merges it @@ -7653,6 +7685,11 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) vmx_flush_tlb(vcpu); } + if (nested_cpu_has_ept(vmcs12)) { + kvm_mmu_unload(vcpu); + nested_ept_init_mmu_context(vcpu); + } + if (vmcs12-vm_entry_controls VM_ENTRY_LOAD_IA32_EFER) vcpu-arch.efer = vmcs12-guest_ia32_efer; else if (vmcs12-vm_entry_controls VM_ENTRY_IA32E_MODE) @@ -8125,7 +8162,9 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu, vcpu-arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
[PATCH v4 08/13] nEPT: Nested INVEPT
From: Nadav Har'El n...@il.ibm.com If we let L1 use EPT, we should probably also support the INVEPT instruction. In our current nested EPT implementation, when L1 changes its EPT table for L2 (i.e., EPT12), L0 modifies the shadow EPT table (EPT02), and in the course of this modification already calls INVEPT. But if last level of shadow page is unsync not all L1's changes to EPT12 are intercepted, which means roots need to be synced when L1 calls INVEPT. Global INVEPT should not be different since roots are synced by kvm_mmu_load() each time EPTP02 changes. Signed-off-by: Nadav Har'El n...@il.ibm.com Signed-off-by: Jun Nakajima jun.nakaj...@intel.com Signed-off-by: Xinhao Xu xinhao...@intel.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/vmx.h |3 ++ arch/x86/include/uapi/asm/vmx.h |1 + arch/x86/kvm/mmu.c |2 ++ arch/x86/kvm/vmx.c | 68 +++ 4 files changed, 74 insertions(+) diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h index f3e01a2..c3d74b9 100644 --- a/arch/x86/include/asm/vmx.h +++ b/arch/x86/include/asm/vmx.h @@ -387,6 +387,7 @@ enum vmcs_field { #define VMX_EPT_EXTENT_INDIVIDUAL_ADDR 0 #define VMX_EPT_EXTENT_CONTEXT 1 #define VMX_EPT_EXTENT_GLOBAL 2 +#define VMX_EPT_EXTENT_SHIFT 24 #define VMX_EPT_EXECUTE_ONLY_BIT (1ull) #define VMX_EPT_PAGE_WALK_4_BIT(1ull 6) @@ -394,7 +395,9 @@ enum vmcs_field { #define VMX_EPTP_WB_BIT(1ull 14) #define VMX_EPT_2MB_PAGE_BIT (1ull 16) #define VMX_EPT_1GB_PAGE_BIT (1ull 17) +#define VMX_EPT_INVEPT_BIT (1ull 20) #define VMX_EPT_AD_BIT (1ull 21) +#define VMX_EPT_EXTENT_INDIVIDUAL_BIT (1ull 24) #define VMX_EPT_EXTENT_CONTEXT_BIT (1ull 25) #define VMX_EPT_EXTENT_GLOBAL_BIT (1ull 26) diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h index d651082..7a34e8f 100644 --- a/arch/x86/include/uapi/asm/vmx.h +++ b/arch/x86/include/uapi/asm/vmx.h @@ -65,6 +65,7 @@ #define EXIT_REASON_EOI_INDUCED 45 #define 
EXIT_REASON_EPT_VIOLATION 48 #define EXIT_REASON_EPT_MISCONFIG 49 +#define EXIT_REASON_INVEPT 50 #define EXIT_REASON_PREEMPTION_TIMER52 #define EXIT_REASON_WBINVD 54 #define EXIT_REASON_XSETBV 55 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 9e0f467..3df3ac3 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3182,6 +3182,7 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu) mmu_sync_roots(vcpu); spin_unlock(vcpu-kvm-mmu_lock); } +EXPORT_SYMBOL_GPL(kvm_mmu_sync_roots); static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access, struct x86_exception *exception) @@ -3451,6 +3452,7 @@ void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu) ++vcpu-stat.tlb_flush; kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); } +EXPORT_SYMBOL_GPL(kvm_mmu_flush_tlb); static void paging_new_cr3(struct kvm_vcpu *vcpu) { diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 56d0066..fc24370 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2156,6 +2156,7 @@ static u32 nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high; static u32 nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high; static u32 nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high; static u32 nested_vmx_misc_low, nested_vmx_misc_high; +static u32 nested_vmx_ept_caps; static __init void nested_vmx_setup_ctls_msrs(void) { /* @@ -6270,6 +6271,71 @@ static int handle_vmptrst(struct kvm_vcpu *vcpu) return 1; } +/* Emulate the INVEPT instruction */ +static int handle_invept(struct kvm_vcpu *vcpu) +{ + u32 vmx_instruction_info; + bool ok; + unsigned long type; + gva_t gva; + struct x86_exception e; + struct { + u64 eptp, gpa; + } operand; + + if (!(nested_vmx_secondary_ctls_high SECONDARY_EXEC_ENABLE_EPT) || + !(nested_vmx_ept_caps VMX_EPT_INVEPT_BIT)) { + kvm_queue_exception(vcpu, UD_VECTOR); + return 1; + } + + if (!nested_vmx_check_permission(vcpu)) + return 1; + + if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) { + kvm_queue_exception(vcpu, UD_VECTOR); + return 1; + } + + /* 
According to the Intel VMX instruction reference, the memory +* operand is read even if it isn't needed (e.g., for type==global) +*/ + vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO); + if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION), +
[PATCH v4 04/13] nEPT: Move common code to paging_tmpl.h
From: Nadav Har'El n...@il.ibm.com For preparation, we just move gpte_access(), prefetch_invalid_gpte(), s_rsvd_bits_set(), protect_clean_gpte() and is_dirty_gpte() from mmu.c to paging_tmpl.h. Signed-off-by: Nadav Har'El n...@il.ibm.com Signed-off-by: Jun Nakajima jun.nakaj...@intel.com Signed-off-by: Xinhao Xu xinhao...@intel.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com Signed-off-by: Jun Nakajima jun.nakaj...@intel.com Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/mmu.c | 55 -- arch/x86/kvm/paging_tmpl.h | 80 +--- 2 files changed, 68 insertions(+), 67 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 3a9493a..4c4274d 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -331,11 +331,6 @@ static int is_large_pte(u64 pte) return pte PT_PAGE_SIZE_MASK; } -static int is_dirty_gpte(unsigned long pte) -{ - return pte PT_DIRTY_MASK; -} - static int is_rmap_spte(u64 pte) { return is_shadow_present_pte(pte); @@ -2574,14 +2569,6 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu) mmu_free_roots(vcpu); } -static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level) -{ - int bit7; - - bit7 = (gpte 7) 1; - return (gpte mmu-rsvd_bits_mask[bit7][level-1]) != 0; -} - static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, bool no_dirty_log) { @@ -2594,26 +2581,6 @@ static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, return gfn_to_pfn_memslot_atomic(slot, gfn); } -static bool prefetch_invalid_gpte(struct kvm_vcpu *vcpu, - struct kvm_mmu_page *sp, u64 *spte, - u64 gpte) -{ - if (is_rsvd_bits_set(vcpu-arch.mmu, gpte, PT_PAGE_TABLE_LEVEL)) - goto no_present; - - if (!is_present_gpte(gpte)) - goto no_present; - - if (!(gpte PT_ACCESSED_MASK)) - goto no_present; - - return false; - -no_present: - drop_spte(vcpu-kvm, spte); - return true; -} - static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, u64 *start, u64 *end) @@ -3501,18 +3468,6 @@ static void 
paging_free(struct kvm_vcpu *vcpu) nonpaging_free(vcpu); } -static inline void protect_clean_gpte(unsigned *access, unsigned gpte) -{ - unsigned mask; - - BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK); - - mask = (unsigned)~ACC_WRITE_MASK; - /* Allow write access to dirty gptes */ - mask |= (gpte (PT_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) PT_WRITABLE_MASK; - *access = mask; -} - static bool sync_mmio_spte(struct kvm *kvm, u64 *sptep, gfn_t gfn, unsigned access, int *nr_present) { @@ -3530,16 +3485,6 @@ static bool sync_mmio_spte(struct kvm *kvm, u64 *sptep, gfn_t gfn, return false; } -static inline unsigned gpte_access(struct kvm_vcpu *vcpu, u64 gpte) -{ - unsigned access; - - access = (gpte (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK; - access = ~(gpte PT64_NX_SHIFT); - - return access; -} - static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gpte) { unsigned index; diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 7769699..fb26ca9 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -80,6 +80,31 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl) return (gpte PT_LVL_ADDR_MASK(lvl)) PAGE_SHIFT; } +static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte) +{ + unsigned mask; + + BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK); + + mask = (unsigned)~ACC_WRITE_MASK; + /* Allow write access to dirty gptes */ + mask |= (gpte (PT_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) PT_WRITABLE_MASK; + *access = mask; +} + +static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level) +{ + int bit7; + + bit7 = (gpte 7) 1; + return (gpte mmu-rsvd_bits_mask[bit7][level-1]) != 0; +} + +static inline int FNAME(is_present_gpte)(unsigned long pte) +{ + return is_present_gpte(pte); +} + static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, pt_element_t __user *ptep_user, unsigned index, pt_element_t orig_pte, pt_element_t new_pte) @@ -103,6 +128,36 @@ 
static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, return (ret != orig_pte); } +static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu, + struct kvm_mmu_page *sp, u64 *spte, + u64 gpte) +{ + if
[PATCH v4 07/13] nEPT: Redefine EPT-specific link_shadow_page()
From: Yang Zhang yang.z.zh...@intel.com nEPT doesn't support A/D bits, so we should not set those bits when building the shadow page table. Signed-off-by: Yang Zhang yang.z.zh...@intel.com Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/mmu.c | 12 +--- arch/x86/kvm/paging_tmpl.h |4 ++-- 2 files changed, 11 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index b5273c3..9e0f467 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2047,12 +2047,18 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator) return __shadow_walk_next(iterator, *iterator-sptep); } -static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp) +static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, bool accessed) { u64 spte; + BUILD_BUG_ON(VMX_EPT_READABLE_MASK != PT_PRESENT_MASK || + VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK); + spte = __pa(sp-spt) | PT_PRESENT_MASK | PT_WRITABLE_MASK | - shadow_user_mask | shadow_x_mask | shadow_accessed_mask; + shadow_user_mask | shadow_x_mask; + + if (accessed) + spte |= shadow_accessed_mask; mmu_spte_set(sptep, spte); } @@ -2677,7 +2683,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write, iterator.level - 1, 1, ACC_ALL, iterator.sptep); - link_shadow_page(iterator.sptep, sp); + link_shadow_page(iterator.sptep, sp, true); } } return emulate; diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index e38b3c0..23a19a5 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -545,7 +545,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr, goto out_gpte_changed; if (sp) - link_shadow_page(it.sptep, sp); + link_shadow_page(it.sptep, sp, PTTYPE != PTTYPE_EPT); } for (; @@ -565,7 +565,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr, sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1, true, direct_access, it.sptep); - link_shadow_page(it.sptep, sp, PTTYPE != 
PTTYPE_EPT); } clear_sp_write_flooding_count(it.sptep); -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
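The flattened diff above is hard to follow, so here is a userspace sketch of the modified link_shadow_page() logic. The mask values are placeholders, not the real x86 definitions, and make_link_spte() is an illustrative name; only the shape of the change (the new `accessed` parameter gating the accessed bit) comes from the patch.

```c
#include <stdint.h>

typedef uint64_t u64;

/* Placeholder bit positions, NOT the kernel's real mask values. */
#define PT_PRESENT_MASK       (1ULL << 0)
#define PT_WRITABLE_MASK      (1ULL << 1)
#define SHADOW_USER_MASK      (1ULL << 2)
#define SHADOW_X_MASK         (1ULL << 3)
#define SHADOW_ACCESSED_MASK  (1ULL << 5)

/* After the patch, callers pass 'accessed' so EPT shadow walks (where
 * A/D bits are unsupported) can skip setting the accessed bit. */
static u64 make_link_spte(u64 child_pa, int accessed)
{
    u64 spte = child_pa | PT_PRESENT_MASK | PT_WRITABLE_MASK |
               SHADOW_USER_MASK | SHADOW_X_MASK;
    if (accessed)
        spte |= SHADOW_ACCESSED_MASK;
    return spte;
}
```

In the patch, __direct_map() passes `true`, while paging_tmpl.h passes `PTTYPE != PTTYPE_EPT`, so only the EPT instantiation omits the accessed bit.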
[PATCH v4 03/13] nEPT: Fix wrong test in kvm_set_cr3
From: Nadav Har'El n...@il.ibm.com kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical address. The problem is that with nested EPT, cr3 is an *L2* physical address, not an L1 physical address as this test expects. As the comment above this test explains, it isn't necessary, and doesn't correspond to anything a real processor would do. So this patch removes it. Note that this wrong test could have also theoretically caused problems in nested NPT, not just in nested EPT. However, in practice, the problem was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus circumventing the problem. Additional potential calls to the buggy function are avoided in that we don't trap cr3 modifications when nested NPT is enabled. However, because in nested VMX we did want to use kvm_set_cr3() (as requested in Avi Kivity's review of the original nested VMX patches), we can't avoid this problem and need to fix it. 
Signed-off-by: Nadav Har'El n...@il.ibm.com Signed-off-by: Jun Nakajima jun.nakaj...@intel.com Signed-off-by: Xinhao Xu xinhao...@intel.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/vmx.c | 12 arch/x86/kvm/x86.c | 11 --- 2 files changed, 4 insertions(+), 19 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 89b15df..56d0066 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -7596,8 +7596,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) kvm_mmu_reset_context(vcpu); /* -* Additionally, except when L0 is using shadow page tables, L1 or -* L2 control guest_cr3 for L2, so they may also have saved PDPTEs +* L1 may access the L2's PDPTR, so save them to construct vmcs12 */ if (enable_ept) { vmcs_write64(GUEST_PDPTR0, vmcs12-guest_pdptr0); @@ -7933,14 +7932,11 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) * own CR3 without exiting. If it has changed it, we must keep it. * Of course, if L0 is using shadow page tables, GUEST_CR3 was defined * by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12. -*/ - if (enable_ept) - vmcs12-guest_cr3 = vmcs_read64(GUEST_CR3); - /* -* Additionally, except when L0 is using shadow page tables, L1 or -* L2 control guest_cr3 for L2, so save their PDPTEs +* +* Additionally, restore L2's PDPTR to vmcs12. */ if (enable_ept) { + vmcs12-guest_cr3 = vmcs_read64(GUEST_CR3); vmcs12-guest_pdptr0 = vmcs_read64(GUEST_PDPTR0); vmcs12-guest_pdptr1 = vmcs_read64(GUEST_PDPTR1); vmcs12-guest_pdptr2 = vmcs_read64(GUEST_PDPTR2); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index d2caeb9..e2fef8b 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -682,17 +682,6 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3) */ } - /* -* Does the new cr3 value map to physical memory? 
(Note, we -* catch an invalid cr3 even in real-mode, because it would -* cause trouble later on when we turn on paging anyway.) -* -* A real CPU would silently accept an invalid cr3 and would -* attempt to use it - with largely undefined (and often hard -* to debug) behavior on the guest side. -*/ - if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT))) - return 1; vcpu->arch.cr3 = cr3; __set_bit(VCPU_EXREG_CR3, (ulong *)vcpu->arch.regs_avail); vcpu->arch.mmu.new_cr3(vcpu); -- 1.7.10.4
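A miniature model of why the removed check was wrong under nested EPT: L0 only knows L1's memslots, but with nEPT the guest's cr3 holds an L2 physical address. All names and ranges below are illustrative stand-ins, not kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Stand-in: L1's RAM as L0 sees it covers gfns [0, 0x100). */
static bool l1_gfn_has_memslot(uint64_t gfn)
{
    return gfn < 0x100;
}

/* The removed test, in miniature: it only makes sense when cr3 is an
 * L1 physical address, which is false when L2 runs under nested EPT. */
static bool old_cr3_check_rejects(uint64_t cr3)
{
    return !l1_gfn_has_memslot(cr3 >> PAGE_SHIFT);
}
```

A cr3 like 0x200000 can be a perfectly valid address in L2's physical address space, yet the old check rejects it because no L1 memslot covers that gfn.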
Mkinitcpio failing on vfio-vga-reset kernel branch on Arch Linux
I've compiled the vfio-vga-reset kernel branch, but every time I try to run mkinitcpio with the -k switch pointing either to the kernel or specifying the kernel version, it says there are no modules to add. Running up-to-date Arch Linux. Mkinitcpio works fine with other kernels. Is there a specific list of kernel options required other than the four VFIO ones, or another reason it might fail? Thanks Peter
Re: [PATCH 1/8] KVM: PPC: Book3S PR: Load up SPRG3 register with guest value on guest entry
On 11.07.2013, at 13:49, Paul Mackerras wrote: Unlike the other general-purpose SPRs, SPRG3 can be read by usermode code, and is used in recent kernels to store the CPU and NUMA node numbers so that they can be read by VDSO functions. Thus we need to load the guest's SPRG3 value into the real SPRG3 register when entering the guest, and restore the host's value when exiting the guest. We don't need to save the guest SPRG3 value when exiting the guest as usermode code can't modify SPRG3. This loads SPRG3 on every guest exit, which can happen a lot with instruction emulation. Since the kernel doesn't rely on the contents of SPRG3 we only have to care about it when not in KVM code, right? So could we move this to kvmppc_core_vcpu_load/put instead? Alex Signed-off-by: Paul Mackerras pau...@samba.org --- arch/powerpc/kernel/asm-offsets.c| 1 + arch/powerpc/kvm/book3s_interrupts.S | 14 ++ 2 files changed, 15 insertions(+) diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index 6f16ffa..a67c76e 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -452,6 +452,7 @@ int main(void) DEFINE(VCPU_SPRG2, offsetof(struct kvm_vcpu, arch.shregs.sprg2)); DEFINE(VCPU_SPRG3, offsetof(struct kvm_vcpu, arch.shregs.sprg3)); #endif + DEFINE(VCPU_SHARED_SPRG3, offsetof(struct kvm_vcpu_arch_shared, sprg3)); DEFINE(VCPU_SHARED_SPRG4, offsetof(struct kvm_vcpu_arch_shared, sprg4)); DEFINE(VCPU_SHARED_SPRG5, offsetof(struct kvm_vcpu_arch_shared, sprg5)); DEFINE(VCPU_SHARED_SPRG6, offsetof(struct kvm_vcpu_arch_shared, sprg6)); diff --git a/arch/powerpc/kvm/book3s_interrupts.S b/arch/powerpc/kvm/book3s_interrupts.S index 48cbbf8..17cfae5 100644 --- a/arch/powerpc/kvm/book3s_interrupts.S +++ b/arch/powerpc/kvm/book3s_interrupts.S @@ -92,6 +92,11 @@ kvm_start_lightweight: PPC_LL r3, VCPU_HFLAGS(r4) rldicl r3, r3, 0, 63 /* r3 = 1 */ stb r3, HSTATE_RESTORE_HID5(r13) + + /* Load up guest SPRG3 value, since it's user readable */ + ld 
r3, VCPU_SHARED(r4) + ld r3, VCPU_SHARED_SPRG3(r3) + mtspr SPRN_SPRG3, r3 #endif /* CONFIG_PPC_BOOK3S_64 */ PPC_LL r4, VCPU_SHADOW_MSR(r4) /* get shadow_msr */ @@ -123,6 +128,15 @@ kvmppc_handler_highmem: /* R7 = vcpu */ PPC_LL r7, GPR4(r1) +#ifdef CONFIG_PPC_BOOK3S_64 + /* + * Reload kernel SPRG3 value. + * No need to save guest value as usermode can't modify SPRG3. + */ + ld r3, PACA_SPRG3(r13) + mtspr SPRN_SPRG3, r3 +#endif /* CONFIG_PPC_BOOK3S_64 */ + PPC_STL r14, VCPU_GPR(R14)(r7) PPC_STL r15, VCPU_GPR(R15)(r7) PPC_STL r16, VCPU_GPR(R16)(r7) -- 1.8.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
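Alex's suggestion above (switch SPRG3 at kvmppc_core_vcpu_load/put rather than on every guest entry/exit) can be sketched in userspace. The mfspr/mtspr stand-ins below fake the SPR with a plain variable, and the struct layout is illustrative; only the save/restore ordering reflects the discussion.

```c
#include <stdint.h>

typedef uint64_t u64;

/* Fake SPRG3 so the save/restore pattern can run in userspace. */
static u64 fake_sprg3;
static u64 mfspr_sprg3(void)      { return fake_sprg3; }
static void mtspr_sprg3(u64 val)  { fake_sprg3 = val; }

struct vcpu { u64 guest_sprg3; u64 host_sprg3; };

/* On vcpu_load: stash the host value, expose the guest value. */
static void vcpu_load_sketch(struct vcpu *v)
{
    v->host_sprg3 = mfspr_sprg3();
    mtspr_sprg3(v->guest_sprg3);
}

/* On vcpu_put: restore the host value.  No read-back of the guest
 * value is needed, since usermode can't modify SPRG3. */
static void vcpu_put_sketch(struct vcpu *v)
{
    mtspr_sprg3(v->host_sprg3);
}
```

The trade-off being debated: doing this at load/put amortizes the SPR writes across many guest entries, whereas the patch as posted pays the cost on every entry/exit.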
Re: [PATCH 1/8] KVM: PPC: Book3S PR: Load up SPRG3 register with guest value on guest entry
On 25.07.2013, at 15:38, Alexander Graf wrote: On 11.07.2013, at 13:49, Paul Mackerras wrote: Unlike the other general-purpose SPRs, SPRG3 can be read by usermode code, and is used in recent kernels to store the CPU and NUMA node numbers so that they can be read by VDSO functions. Thus we need to load the guest's SPRG3 value into the real SPRG3 register when entering the guest, and restore the host's value when exiting the guest. We don't need to save the guest SPRG3 value when exiting the guest as usermode code can't modify SPRG3. This loads SPRG3 on every guest exit, which can happen a lot with instruction emulation. Since the kernel doesn't rely on the contents of SPRG3 we only have to care about it when not in KVM code, right? So could we move this to kvmppc_core_vcpu_load/put instead? But then again, if all the shadow copy code is negligible performance-wise, so is this, probably. Applied to kvm-ppc-queue. Alex
Re: [PATCH 2/8] KVM: PPC: Book3S PR: Keep volatile reg values in vcpu rather than shadow_vcpu
On 11.07.2013, at 13:50, Paul Mackerras wrote: Currently PR-style KVM keeps the volatile guest register values (R0 - R13, CR, LR, CTR, XER, PC) in a shadow_vcpu struct rather than the main kvm_vcpu struct. For 64-bit, the shadow_vcpu exists in two places, a kmalloc'd struct and in the PACA, and it gets copied back and forth in kvmppc_core_vcpu_load/put(), because the real-mode code can't rely on being able to access the kmalloc'd struct. This changes the code to copy the volatile values into the shadow_vcpu as one of the last things done before entering the guest. Similarly the values are copied back out of the shadow_vcpu to the kvm_vcpu immediately after exiting the guest. We arrange for interrupts to be still disabled at this point so that we can't get preempted on 64-bit and end up copying values from the wrong PACA. This means that the accessor functions in kvm_book3s.h for these registers are greatly simplified, and are same between PR and HV KVM. In places where accesses to shadow_vcpu fields are now replaced by accesses to the kvm_vcpu, we can also remove the svcpu_get/put pairs. Finally, on 64-bit, we don't need the kmalloc'd struct at all any more. With this, the time to read the PVR one million times in a loop went from 478.2ms to 480.1ms (averages of 4 values), a difference which is not statistically significant given the variability of the results. 
Signed-off-by: Paul Mackerras pau...@samba.org --- arch/powerpc/include/asm/kvm_book3s.h | 193 +- arch/powerpc/include/asm/kvm_book3s_asm.h | 6 +- arch/powerpc/include/asm/kvm_host.h | 1 + arch/powerpc/kernel/asm-offsets.c | 3 +- arch/powerpc/kvm/book3s_emulate.c | 8 +- arch/powerpc/kvm/book3s_interrupts.S | 101 arch/powerpc/kvm/book3s_pr.c | 68 +-- arch/powerpc/kvm/book3s_rmhandlers.S | 5 - arch/powerpc/kvm/trace.h | 7 +- 9 files changed, 175 insertions(+), 217 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h index 08891d0..5d68f6c 100644 --- a/arch/powerpc/include/asm/kvm_book3s.h +++ b/arch/powerpc/include/asm/kvm_book3s.h @@ -198,149 +198,97 @@ extern void kvm_return_point(void); #include asm/kvm_book3s_64.h #endif -#ifdef CONFIG_KVM_BOOK3S_PR - -static inline unsigned long kvmppc_interrupt_offset(struct kvm_vcpu *vcpu) -{ - return to_book3s(vcpu)-hior; -} - -static inline void kvmppc_update_int_pending(struct kvm_vcpu *vcpu, - unsigned long pending_now, unsigned long old_pending) -{ - if (pending_now) - vcpu-arch.shared-int_pending = 1; - else if (old_pending) - vcpu-arch.shared-int_pending = 0; -} - static inline void kvmppc_set_gpr(struct kvm_vcpu *vcpu, int num, ulong val) { - if ( num 14 ) { - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu); - svcpu-gpr[num] = val; - svcpu_put(svcpu); - to_book3s(vcpu)-shadow_vcpu-gpr[num] = val; - } else - vcpu-arch.gpr[num] = val; + vcpu-arch.gpr[num] = val; } static inline ulong kvmppc_get_gpr(struct kvm_vcpu *vcpu, int num) { - if ( num 14 ) { - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu); - ulong r = svcpu-gpr[num]; - svcpu_put(svcpu); - return r; - } else - return vcpu-arch.gpr[num]; + return vcpu-arch.gpr[num]; } static inline void kvmppc_set_cr(struct kvm_vcpu *vcpu, u32 val) { - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu); - svcpu-cr = val; - svcpu_put(svcpu); - to_book3s(vcpu)-shadow_vcpu-cr = val; + vcpu-arch.cr = val; } 
static inline u32 kvmppc_get_cr(struct kvm_vcpu *vcpu) { - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu); - u32 r; - r = svcpu-cr; - svcpu_put(svcpu); - return r; + return vcpu-arch.cr; } static inline void kvmppc_set_xer(struct kvm_vcpu *vcpu, u32 val) { - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu); - svcpu-xer = val; - to_book3s(vcpu)-shadow_vcpu-xer = val; - svcpu_put(svcpu); + vcpu-arch.xer = val; } static inline u32 kvmppc_get_xer(struct kvm_vcpu *vcpu) { - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu); - u32 r; - r = svcpu-xer; - svcpu_put(svcpu); - return r; + return vcpu-arch.xer; } static inline void kvmppc_set_ctr(struct kvm_vcpu *vcpu, ulong val) { - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu); - svcpu-ctr = val; - svcpu_put(svcpu); + vcpu-arch.ctr = val; } static inline ulong kvmppc_get_ctr(struct kvm_vcpu *vcpu) { - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu); - ulong r; - r = svcpu-ctr; - svcpu_put(svcpu); -
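For readers skimming the flattened diff, the net effect of the accessor change can be shown with a small userspace toy (struct layout and names simplified; this is not the kernel code). Volatile register state now lives directly in the vcpu, so the old svcpu_get/put round-trip collapses to plain field access:

```c
#include <stdint.h>

typedef uint64_t ulong;

struct vcpu_arch { ulong gpr[32]; uint32_t cr; };
struct vcpu { struct vcpu_arch arch; };

/* Before: if (num < 14) { svcpu = svcpu_get(vcpu);
 *                         svcpu->gpr[num] = val; svcpu_put(svcpu); }
 * After: a single store, identical for PR and HV KVM. */
static void set_gpr(struct vcpu *vcpu, int num, ulong val)
{
    vcpu->arch.gpr[num] = val;
}

static ulong get_gpr(struct vcpu *vcpu, int num)
{
    return vcpu->arch.gpr[num];
}

static void set_cr(struct vcpu *vcpu, uint32_t val)
{
    vcpu->arch.cr = val;
}

static uint32_t get_cr(struct vcpu *vcpu)
{
    return vcpu->arch.cr;
}
```

The copy to/from the real shadow_vcpu then happens only once per guest entry/exit, with interrupts disabled so the PACA can't change underneath the copy.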
Re: [PATCH 2/2] kvm: powerpc: set cache coherency only for kernel managed pages
On 25.07.2013, at 10:50, Gleb Natapov wrote: On Wed, Jul 24, 2013 at 03:32:49PM -0500, Scott Wood wrote: On 07/24/2013 04:39:59 AM, Alexander Graf wrote: On 24.07.2013, at 11:35, Gleb Natapov wrote: On Wed, Jul 24, 2013 at 11:21:11AM +0200, Alexander Graf wrote: Are not we going to use page_is_ram() from e500_shadow_mas2_attrib() as Scott commented? rWhy aren't we using page_is_ram() in kvm_is_mmio_pfn()? Because it is much slower and, IIRC, actually used to build pfn map that allow us to check quickly for valid pfn. Then why should we use page_is_ram()? :) I really don't want the e500 code to diverge too much from what the rest of the kvm code is doing. I don't understand actually used to build pfn map What code is this? I don't see any calls to page_is_ram() in the KVM code, or in generic mm code. Is this a statement about what x86 does? It may be not page_is_ram() directly, but the same into page_is_ram() is using. On power both page_is_ram() and do_init_bootmem() walks some kind of memblock_region data structure. What important is that pfn_valid() does not mean that there is a memory behind page structure. See Andrea's reply. On PPC page_is_ram() is only called (AFAICT) for determining what attributes to set on mmaps. We want to be sure that KVM always makes the same decision. While pfn_valid() seems like it should be equivalent, it's not obvious from the PPC code that it is. Again pfn_valid() is not enough. If pfn_valid() is better, why is that not used for mmap? Why are there two different names for the same thing? They are not the same thing. page_is_ram() tells you if phys address is ram backed. pfn_valid() tells you if there is struct page behind the pfn. PageReserved() tells if you a pfn is marked as reserved. All non ram pfns should be reserved, but ram pfns can be reserved too. Again, see Andrea's reply. Why ppc uses page_is_ram() for mmap? How should I know? But looking at That one's easy. Let's just ask Ben. 
Ben, is there any particular reason PPC uses page_is_ram() rather than what KVM does here to figure out whether a pfn is RAM or not? It would be really useful to be able to run the exact same logic that figures out whether we're cacheable or not in both TLB writers (KVM and linux-mm). Alex the function it does it only as a fallback if ppc_md.phys_mem_access_prot() is not provided. Making access to MMIO noncached as a safe fallback makes sense. It also makes sense to allow noncached access to reserved ram sometimes. -- Gleb.
Re: [PATCH 2/2] kvm: powerpc: set cache coherency only for kernel managed pages
On Thu, Jul 25, 2013 at 06:07:55PM +0200, Alexander Graf wrote: On 25.07.2013, at 10:50, Gleb Natapov wrote: On Wed, Jul 24, 2013 at 03:32:49PM -0500, Scott Wood wrote: On 07/24/2013 04:39:59 AM, Alexander Graf wrote: On 24.07.2013, at 11:35, Gleb Natapov wrote: On Wed, Jul 24, 2013 at 11:21:11AM +0200, Alexander Graf wrote: Are not we going to use page_is_ram() from e500_shadow_mas2_attrib() as Scott commented? rWhy aren't we using page_is_ram() in kvm_is_mmio_pfn()? Because it is much slower and, IIRC, actually used to build pfn map that allow us to check quickly for valid pfn. Then why should we use page_is_ram()? :) I really don't want the e500 code to diverge too much from what the rest of the kvm code is doing. I don't understand actually used to build pfn map What code is this? I don't see any calls to page_is_ram() in the KVM code, or in generic mm code. Is this a statement about what x86 does? It may be not page_is_ram() directly, but the same into page_is_ram() is using. On power both page_is_ram() and do_init_bootmem() walks some kind of memblock_region data structure. What important is that pfn_valid() does not mean that there is a memory behind page structure. See Andrea's reply. On PPC page_is_ram() is only called (AFAICT) for determining what attributes to set on mmaps. We want to be sure that KVM always makes the same decision. While pfn_valid() seems like it should be equivalent, it's not obvious from the PPC code that it is. Again pfn_valid() is not enough. If pfn_valid() is better, why is that not used for mmap? Why are there two different names for the same thing? They are not the same thing. page_is_ram() tells you if phys address is ram backed. pfn_valid() tells you if there is struct page behind the pfn. PageReserved() tells if you a pfn is marked as reserved. All non ram pfns should be reserved, but ram pfns can be reserved too. Again, see Andrea's reply. Why ppc uses page_is_ram() for mmap? How should I know? But looking at That one's easy. 
Let's just ask Ben. Ben, is there any particular reason PPC uses page_is_ram() rather than what KVM does here to figure out whether a pfn is RAM or not? It would be really useful to be able to run the exact same logic that figures out whether we're cacheable or not in both TLB writers (KVM and linux-mm). KVM does not only try to figure out what is RAM or not! Look at how KVM uses the function. KVM tries to figure out if refcounting needs to be used on this page, among other things. -- Gleb.
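The distinction Gleb draws between the three predicates can be sketched as a toy model. The real implementations walk memblock regions and struct page flags; the stubs below just encode the relationships (valid pfns may or may not be reserved; anything without a struct page is treated as MMIO), with made-up boundary values.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pfn_t;

/* Stand-ins with arbitrary boundaries, NOT the kernel predicates:
 * - pfn_valid: is there a struct page behind this pfn?
 * - page_reserved: is that page marked reserved? */
static bool pfn_valid_stub(pfn_t pfn)     { return pfn < 256; }
static bool page_reserved_stub(pfn_t pfn) { return pfn >= 128; }

/* The shape of KVM's check: a pfn is treated as MMIO-like (no struct
 * page refcounting) when it has no struct page or its page is reserved. */
static bool is_mmio_pfn_sketch(pfn_t pfn)
{
    if (pfn_valid_stub(pfn))
        return page_reserved_stub(pfn);
    return true;
}
```

This is why pfn_valid() alone is not enough: a pfn can have a struct page yet still be reserved (e.g. firmware or driver memory), and page_is_ram() answers yet another question, whether the physical address is RAM-backed at all.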
Re: [PATCH] kvm-unit-tests : Basic architecture of VMX nested test case
On Thu, Jul 25, 2013 at 4:11 PM, Jan Kiszka jan.kis...@web.de wrote: On 2013-07-25 07:31, Arthur Chunqi Li wrote: This is the first version of VMX nested environment. It contains the basic VMX instructions test cases, including VMXON/VMXOFF/VMXPTRLD/ VMXPTRST/VMCLEAR/VMLAUNCH/VMRESUME/VMCALL. This patchalso tests the basic execution routine in VMX nested environment andlet the VM print Hello World to inform its successfully run. The first release also includes a test suite for vmenter (vmlaunch and vmresume). Besides, hypercall mechanism is included and currently it is used to invoke VM normal exit. New files added: x86/vmx.h : contains all VMX related macro declerations x86/vmx.c : main file for VMX nested test case Signed-off-by: Arthur Chunqi Li yzt...@gmail.com Don't forget to update your public git as well. --- config-x86-common.mak |2 + config-x86_64.mak |1 + lib/x86/msr.h |5 + x86/cstart64.S|4 + x86/unittests.cfg |6 + x86/vmx.c | 712 + x86/vmx.h | 479 + 7 files changed, 1209 insertions(+) create mode 100644 x86/vmx.c create mode 100644 x86/vmx.h diff --git a/config-x86-common.mak b/config-x86-common.mak index 455032b..34a41e1 100644 --- a/config-x86-common.mak +++ b/config-x86-common.mak @@ -101,6 +101,8 @@ $(TEST_DIR)/asyncpf.elf: $(cstart.o) $(TEST_DIR)/asyncpf.o $(TEST_DIR)/pcid.elf: $(cstart.o) $(TEST_DIR)/pcid.o +$(TEST_DIR)/vmx.elf: $(cstart.o) $(TEST_DIR)/vmx.o + arch_clean: $(RM) $(TEST_DIR)/*.o $(TEST_DIR)/*.flat $(TEST_DIR)/*.elf \ $(TEST_DIR)/.*.d $(TEST_DIR)/lib/.*.d $(TEST_DIR)/lib/*.o diff --git a/config-x86_64.mak b/config-x86_64.mak index 4e525f5..bb8ee89 100644 --- a/config-x86_64.mak +++ b/config-x86_64.mak @@ -9,5 +9,6 @@ tests = $(TEST_DIR)/access.flat $(TEST_DIR)/apic.flat \ $(TEST_DIR)/xsave.flat $(TEST_DIR)/rmap_chain.flat \ $(TEST_DIR)/pcid.flat tests += $(TEST_DIR)/svm.flat +tests += $(TEST_DIR)/vmx.flat include config-x86-common.mak diff --git a/lib/x86/msr.h b/lib/x86/msr.h index 509a421..281255a 100644 --- a/lib/x86/msr.h +++ 
b/lib/x86/msr.h @@ -396,6 +396,11 @@ #define MSR_IA32_VMX_VMCS_ENUM 0x048a #define MSR_IA32_VMX_PROCBASED_CTLS20x048b #define MSR_IA32_VMX_EPT_VPID_CAP 0x048c +#define MSR_IA32_VMX_TRUE_PIN0x048d +#define MSR_IA32_VMX_TRUE_PROC 0x048e +#define MSR_IA32_VMX_TRUE_EXIT 0x048f +#define MSR_IA32_VMX_TRUE_ENTRY 0x0490 + /* AMD-V MSRs */ diff --git a/x86/cstart64.S b/x86/cstart64.S index 24df5f8..0fe76da 100644 --- a/x86/cstart64.S +++ b/x86/cstart64.S @@ -4,6 +4,10 @@ .globl boot_idt boot_idt = 0 +.globl idt_descr +.globl tss_descr +.globl gdt64_desc + ipi_vector = 0x20 max_cpus = 64 diff --git a/x86/unittests.cfg b/x86/unittests.cfg index bc9643e..85c36aa 100644 --- a/x86/unittests.cfg +++ b/x86/unittests.cfg @@ -149,3 +149,9 @@ extra_params = --append 1000 `date +%s` file = pcid.flat extra_params = -cpu qemu64,+pcid arch = x86_64 + +[vmx] +file = vmx.flat +extra_params = -cpu host,+vmx +arch = x86_64 + diff --git a/x86/vmx.c b/x86/vmx.c new file mode 100644 index 000..ca3e117 --- /dev/null +++ b/x86/vmx.c @@ -0,0 +1,712 @@ +#include libcflat.h +#include processor.h +#include vm.h +#include desc.h +#include vmx.h +#include msr.h +#include smp.h +#include io.h + +int fails = 0, tests = 0; +u32 *vmxon_region; +struct vmcs *vmcs_root; +u32 vpid_cnt; +void *guest_stack, *guest_syscall_stack; +u32 ctrl_pin, ctrl_enter, ctrl_exit, ctrl_cpu[2]; +ulong fix_cr0_set, fix_cr0_clr; +ulong fix_cr4_set, fix_cr4_clr; +struct regs regs; +struct vmx_test *current; +u64 hypercall_field = 0; +bool launched = 0; + +extern u64 gdt64_desc[]; +extern u64 idt_descr[]; +extern u64 tss_descr[]; +extern void *vmx_return; +extern void *entry_sysenter; +extern void *guest_entry; + +static void report(const char *name, int result) +{ + ++tests; + if (result) + printf(PASS: %s\n, name); + else { + printf(FAIL: %s\n, name); + ++fails; + } +} + +static int vmcs_clear(struct vmcs *vmcs) +{ + bool ret; + asm volatile (vmclear %1; setbe %0 : =q (ret) : m (vmcs) : cc); + return ret; +} + +static u64 
vmcs_read(enum Encoding enc) +{ + u64 val; + asm volatile (vmread %1, %0 : =rm (val) : r ((u64)enc) : cc); + return val; +} + +static int vmcs_write(enum Encoding enc, u64 val) +{ + bool ret; + asm volatile (vmwrite %1, %2; setbe %0 + : =q(ret) : rm (val), r ((u64)enc) : cc); + return ret; +} + +static int make_vmcs_current(struct vmcs *vmcs) +{ + bool ret; + + asm volatile (vmptrld %1; setbe %0 : =q (ret)
[PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h
From: Nadav Har'El n...@il.ibm.com This is the first patch in a series which adds nested EPT support to KVM's nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest to set its own cr3 and take its own page faults without either of L0 or L1 getting involved. This often significantly improves L2's performance over the previous two alternatives (shadow page tables over EPT, and shadow page tables over shadow page tables). This patch adds EPT support to paging_tmpl.h. paging_tmpl.h contains the code for reading and writing page tables. The code for 32-bit and 64-bit tables is very similar, but not identical, so paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once with PTTYPE=64, and this generates the two sets of similar functions. There are subtle but important differences between the format of EPT tables and that of ordinary x86 64-bit page tables, so for nested EPT we need a third set of functions to read the guest EPT table and to write the shadow EPT table. So this patch adds a third PTTYPE, PTTYPE_EPT, which creates functions (prefixed with EPT) which correctly read and write EPT tables. 
Signed-off-by: Nadav Har'El n...@il.ibm.com Signed-off-by: Jun Nakajima jun.nakaj...@intel.com Signed-off-by: Xinhao Xu xinhao...@intel.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/mmu.c |5 + arch/x86/kvm/paging_tmpl.h | 43 +++ 2 files changed, 44 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 4c4274d..b5273c3 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3494,6 +3494,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp return mmu-last_pte_bitmap (1 index); } +#define PTTYPE_EPT 18 /* arbitrary */ +#define PTTYPE PTTYPE_EPT +#include paging_tmpl.h +#undef PTTYPE + #define PTTYPE 64 #include paging_tmpl.h #undef PTTYPE diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 7581395..e38b3c0 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -58,6 +58,21 @@ #define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT #define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT #define CMPXCHG cmpxchg +#elif PTTYPE == PTTYPE_EPT + #define pt_element_t u64 + #define guest_walker guest_walkerEPT + #define FNAME(name) ept_##name + #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK + #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl) + #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl) + #define PT_INDEX(addr, level) PT64_INDEX(addr, level) + #define PT_LEVEL_BITS PT64_LEVEL_BITS + #define PT_GUEST_ACCESSED_MASK 0 + #define PT_GUEST_DIRTY_MASK 0 + #define PT_GUEST_DIRTY_SHIFT 0 + #define PT_GUEST_ACCESSED_SHIFT 0 + #define CMPXCHG cmpxchg64 + #define PT_MAX_FULL_LEVELS 4 #else #error Invalid PTTYPE value #endif @@ -90,6 +105,10 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl) static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte) { +#if PT_GUEST_DIRTY_MASK == 0 + /* dirty bit is not supported, so no need to track it */ + return; +#else unsigned mask; 
BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK); @@ -99,6 +118,7 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte) mask |= (gpte (PT_GUEST_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) PT_WRITABLE_MASK; *access = mask; +#endif } static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level) @@ -111,7 +131,11 @@ static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level) static inline int FNAME(is_present_gpte)(unsigned long pte) { +#if PTTYPE != PTTYPE_EPT return is_present_gpte(pte); +#else + return pte 7; +#endif } static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, @@ -147,7 +171,8 @@ static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu, if (!FNAME(is_present_gpte)(gpte)) goto no_present; - if (!(gpte PT_GUEST_ACCESSED_MASK)) + /* if accessed bit is not supported prefetch non accessed gpte */ + if (PT_GUEST_ACCESSED_MASK !(gpte PT_GUEST_ACCESSED_MASK)) goto no_present; return false; @@ -160,9 +185,14 @@ no_present: static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte) { unsigned access; - +#if PTTYPE == PTTYPE_EPT + BUILD_BUG_ON(ACC_WRITE_MASK != VMX_EPT_WRITABLE_MASK); + access = (gpte VMX_EPT_WRITABLE_MASK) | ACC_USER_MASK | + ((gpte VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0); +#else access = (gpte (PT_WRITABLE_MASK |
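The include-the-template-twice trick that this patch extends can be condensed into a self-contained sketch. Here the "template" body is inlined rather than living in a separate header, and is_present is the only templated function shown; the `& 1` / `& 7` presence tests mirror the patch (an EPT entry is present if any of its read/write/execute bits is set, which the flattened diff renders as `pte 7` after losing the `&`).

```c
#include <stdint.h>

/* Instantiation 1: ordinary 64-bit page tables. */
#define FNAME(name) paging64_##name
static int FNAME(is_present)(uint64_t pte)
{
    return (pte & 1) != 0;     /* x86: bit 0 is the Present bit */
}
#undef FNAME

/* Instantiation 2: EPT tables, as added by the patch. */
#define PTTYPE_EPT 18          /* arbitrary tag, as in the patch */
#define FNAME(name) ept_##name
static int FNAME(is_present)(uint64_t pte)
{
    return (pte & 7) != 0;     /* EPT: present if any of R/W/X is set */
}
#undef FNAME
```

Each instantiation yields an independently named function set (paging64_*, ept_*), which is exactly how mmu.c ends up with parallel walkers for each page-table format without duplicating the template source.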
[PATCH v2] kvm-unit-tests : Basic architecture of VMX nested test case
This is the first version of VMX nested environment. It contains the basic VMX instructions test cases, including VMXON/VMXOFF/VMXPTRLD/ VMXPTRST/VMCLEAR/VMLAUNCH/VMRESUME/VMCALL. This patchalso tests the basic execution routine in VMX nested environment andlet the VM print Hello World to inform its successfully run. The first release also includes a test suite for vmenter (vmlaunch and vmresume). Besides, hypercall mechanism is included and currently it is used to invoke VM normal exit. New files added: x86/vmx.h : contains all VMX related macro declerations x86/vmx.c : main file for VMX nested test case Signed-off-by: Arthur Chunqi Li yzt...@gmail.com --- config-x86-common.mak |2 + config-x86_64.mak |1 + lib/x86/msr.h |5 + x86/cstart64.S|4 + x86/unittests.cfg |6 + x86/vmx.c | 712 + x86/vmx.h | 474 7 files changed, 1204 insertions(+) create mode 100644 x86/vmx.c create mode 100644 x86/vmx.h diff --git a/config-x86-common.mak b/config-x86-common.mak index 455032b..34a41e1 100644 --- a/config-x86-common.mak +++ b/config-x86-common.mak @@ -101,6 +101,8 @@ $(TEST_DIR)/asyncpf.elf: $(cstart.o) $(TEST_DIR)/asyncpf.o $(TEST_DIR)/pcid.elf: $(cstart.o) $(TEST_DIR)/pcid.o +$(TEST_DIR)/vmx.elf: $(cstart.o) $(TEST_DIR)/vmx.o + arch_clean: $(RM) $(TEST_DIR)/*.o $(TEST_DIR)/*.flat $(TEST_DIR)/*.elf \ $(TEST_DIR)/.*.d $(TEST_DIR)/lib/.*.d $(TEST_DIR)/lib/*.o diff --git a/config-x86_64.mak b/config-x86_64.mak index 4e525f5..bb8ee89 100644 --- a/config-x86_64.mak +++ b/config-x86_64.mak @@ -9,5 +9,6 @@ tests = $(TEST_DIR)/access.flat $(TEST_DIR)/apic.flat \ $(TEST_DIR)/xsave.flat $(TEST_DIR)/rmap_chain.flat \ $(TEST_DIR)/pcid.flat tests += $(TEST_DIR)/svm.flat +tests += $(TEST_DIR)/vmx.flat include config-x86-common.mak diff --git a/lib/x86/msr.h b/lib/x86/msr.h index 509a421..281255a 100644 --- a/lib/x86/msr.h +++ b/lib/x86/msr.h @@ -396,6 +396,11 @@ #define MSR_IA32_VMX_VMCS_ENUM 0x048a #define MSR_IA32_VMX_PROCBASED_CTLS20x048b #define MSR_IA32_VMX_EPT_VPID_CAP 0x048c +#define 
MSR_IA32_VMX_TRUE_PIN 0x048d +#define MSR_IA32_VMX_TRUE_PROC 0x048e +#define MSR_IA32_VMX_TRUE_EXIT 0x048f +#define MSR_IA32_VMX_TRUE_ENTRY0x0490 + /* AMD-V MSRs */ diff --git a/x86/cstart64.S b/x86/cstart64.S index 24df5f8..0fe76da 100644 --- a/x86/cstart64.S +++ b/x86/cstart64.S @@ -4,6 +4,10 @@ .globl boot_idt boot_idt = 0 +.globl idt_descr +.globl tss_descr +.globl gdt64_desc + ipi_vector = 0x20 max_cpus = 64 diff --git a/x86/unittests.cfg b/x86/unittests.cfg index bc9643e..85c36aa 100644 --- a/x86/unittests.cfg +++ b/x86/unittests.cfg @@ -149,3 +149,9 @@ extra_params = --append 1000 `date +%s` file = pcid.flat extra_params = -cpu qemu64,+pcid arch = x86_64 + +[vmx] +file = vmx.flat +extra_params = -cpu host,+vmx +arch = x86_64 + diff --git a/x86/vmx.c b/x86/vmx.c new file mode 100644 index 000..af694e1 --- /dev/null +++ b/x86/vmx.c @@ -0,0 +1,712 @@ +#include libcflat.h +#include processor.h +#include vm.h +#include desc.h +#include vmx.h +#include msr.h +#include smp.h +#include io.h + +int fails = 0, tests = 0; +u32 *vmxon_region; +struct vmcs *vmcs_root; +u32 vpid_cnt; +void *guest_stack, *guest_syscall_stack; +u32 ctrl_pin, ctrl_enter, ctrl_exit, ctrl_cpu[2]; +ulong fix_cr0_set, fix_cr0_clr; +ulong fix_cr4_set, fix_cr4_clr; +struct regs regs; +struct vmx_test *current; +u64 hypercall_field = 0; +bool launched = 0; + +extern u64 gdt64_desc[]; +extern u64 idt_descr[]; +extern u64 tss_descr[]; +extern void *vmx_return; +extern void *entry_sysenter; +extern void *guest_entry; + +static void report(const char *name, int result) +{ + ++tests; + if (result) + printf(PASS: %s\n, name); + else { + printf(FAIL: %s\n, name); + ++fails; + } +} + +static int vmcs_clear(struct vmcs *vmcs) +{ + bool ret; + asm volatile (vmclear %1; setbe %0 : =q (ret) : m (vmcs) : cc); + return ret; +} + +static u64 vmcs_read(enum Encoding enc) +{ + u64 val; + asm volatile (vmread %1, %0 : =rm (val) : r ((u64)enc) : cc); + return val; +} + +static int vmcs_write(enum Encoding enc, u64 
val) +{ + bool ret; + asm volatile (vmwrite %1, %2; setbe %0 + : =q(ret) : rm (val), r ((u64)enc) : cc); + return ret; +} + +static int make_vmcs_current(struct vmcs *vmcs) +{ + bool ret; + + asm volatile (vmptrld %1; setbe %0 : =q (ret) : m (vmcs) : cc); + return ret; +} + +static int save_vmcs(struct vmcs **vmcs) +{ + bool ret; + + asm volatile (vmptrst %1; setbe %0 : =q (ret) : m (*vmcs) : cc); + return ret; +} + +/* entry_sysenter */ +asm( + .align 4, 0x90\n\t + .globl
Re: [PATCH v2] kvm-unit-tests : Basic architecture of VMX nested test case
Arthur Chunqi Li yzt...@gmail.com writes:

This is the first version of the VMX nested environment. It contains the basic VMX instruction test cases, including VMXON/VMXOFF/VMPTRLD/VMPTRST/VMCLEAR/VMLAUNCH/VMRESUME/VMCALL. This patch also tests the basic execution routine in the VMX nested environment and lets the VM print "Hello World" to signal a successful run. The first release also includes a test suite for vmenter (vmlaunch and vmresume). Besides, a hypercall mechanism is included; currently it is used to invoke a VM normal exit.

What's the difference between this and the one you posted earlier:

[PATCH] kvm-unit-tests : Basic architecture of VMX nested test case
1374730297-27169-1-git-send-email-yzt...@gmail.com

Can you please mention what's changed in v2?

Bandan

New files added:
x86/vmx.h : contains all VMX related macro declarations
x86/vmx.c : main file for VMX nested test case

Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
---
 config-x86-common.mak |    2 +
 config-x86_64.mak     |    1 +
 lib/x86/msr.h         |    5 +
 x86/cstart64.S        |    4 +
 x86/unittests.cfg     |    6 +
 x86/vmx.c             |  712 +
 x86/vmx.h             |  474
 7 files changed, 1204 insertions(+)
 create mode 100644 x86/vmx.c
 create mode 100644 x86/vmx.h

diff --git a/config-x86-common.mak b/config-x86-common.mak
index 455032b..34a41e1 100644
--- a/config-x86-common.mak
+++ b/config-x86-common.mak
@@ -101,6 +101,8 @@
 $(TEST_DIR)/asyncpf.elf: $(cstart.o) $(TEST_DIR)/asyncpf.o

 $(TEST_DIR)/pcid.elf: $(cstart.o) $(TEST_DIR)/pcid.o

+$(TEST_DIR)/vmx.elf: $(cstart.o) $(TEST_DIR)/vmx.o
+
 arch_clean:
 	$(RM) $(TEST_DIR)/*.o $(TEST_DIR)/*.flat $(TEST_DIR)/*.elf \
 	$(TEST_DIR)/.*.d $(TEST_DIR)/lib/.*.d $(TEST_DIR)/lib/*.o

diff --git a/config-x86_64.mak b/config-x86_64.mak
index 4e525f5..bb8ee89 100644
--- a/config-x86_64.mak
+++ b/config-x86_64.mak
@@ -9,5 +9,6 @@
 tests = $(TEST_DIR)/access.flat $(TEST_DIR)/apic.flat \
 	$(TEST_DIR)/xsave.flat $(TEST_DIR)/rmap_chain.flat \
 	$(TEST_DIR)/pcid.flat
 tests += $(TEST_DIR)/svm.flat
+tests += $(TEST_DIR)/vmx.flat

 include
config-x86-common.mak diff --git a/lib/x86/msr.h b/lib/x86/msr.h index 509a421..281255a 100644 --- a/lib/x86/msr.h +++ b/lib/x86/msr.h @@ -396,6 +396,11 @@ #define MSR_IA32_VMX_VMCS_ENUM 0x048a #define MSR_IA32_VMX_PROCBASED_CTLS20x048b #define MSR_IA32_VMX_EPT_VPID_CAP 0x048c +#define MSR_IA32_VMX_TRUE_PIN0x048d +#define MSR_IA32_VMX_TRUE_PROC 0x048e +#define MSR_IA32_VMX_TRUE_EXIT 0x048f +#define MSR_IA32_VMX_TRUE_ENTRY 0x0490 + /* AMD-V MSRs */ diff --git a/x86/cstart64.S b/x86/cstart64.S index 24df5f8..0fe76da 100644 --- a/x86/cstart64.S +++ b/x86/cstart64.S @@ -4,6 +4,10 @@ .globl boot_idt boot_idt = 0 +.globl idt_descr +.globl tss_descr +.globl gdt64_desc + ipi_vector = 0x20 max_cpus = 64 diff --git a/x86/unittests.cfg b/x86/unittests.cfg index bc9643e..85c36aa 100644 --- a/x86/unittests.cfg +++ b/x86/unittests.cfg @@ -149,3 +149,9 @@ extra_params = --append 1000 `date +%s` file = pcid.flat extra_params = -cpu qemu64,+pcid arch = x86_64 + +[vmx] +file = vmx.flat +extra_params = -cpu host,+vmx +arch = x86_64 + diff --git a/x86/vmx.c b/x86/vmx.c new file mode 100644 index 000..af694e1 --- /dev/null +++ b/x86/vmx.c @@ -0,0 +1,712 @@ +#include libcflat.h +#include processor.h +#include vm.h +#include desc.h +#include vmx.h +#include msr.h +#include smp.h +#include io.h + +int fails = 0, tests = 0; +u32 *vmxon_region; +struct vmcs *vmcs_root; +u32 vpid_cnt; +void *guest_stack, *guest_syscall_stack; +u32 ctrl_pin, ctrl_enter, ctrl_exit, ctrl_cpu[2]; +ulong fix_cr0_set, fix_cr0_clr; +ulong fix_cr4_set, fix_cr4_clr; +struct regs regs; +struct vmx_test *current; +u64 hypercall_field = 0; +bool launched = 0; + +extern u64 gdt64_desc[]; +extern u64 idt_descr[]; +extern u64 tss_descr[]; +extern void *vmx_return; +extern void *entry_sysenter; +extern void *guest_entry; + +static void report(const char *name, int result) +{ + ++tests; + if (result) + printf(PASS: %s\n, name); + else { + printf(FAIL: %s\n, name); + ++fails; + } +} + +static int vmcs_clear(struct vmcs *vmcs) 
+{
+	bool ret;
+	asm volatile ("vmclear %1; setbe %0" : "=q" (ret) : "m" (vmcs) : "cc");
+	return ret;
+}
+
+static u64 vmcs_read(enum Encoding enc)
+{
+	u64 val;
+	asm volatile ("vmread %1, %0" : "=rm" (val) : "r" ((u64)enc) : "cc");
+	return val;
+}
+
+static int vmcs_write(enum Encoding enc, u64 val)
+{
+	bool ret;
+	asm volatile ("vmwrite %1, %2; setbe %0"
+		      : "=q" (ret) : "rm" (val), "r" ((u64)enc) : "cc");
+	return ret;
+}
Re: [PATCH v2] kvm-unit-tests : Basic architecture of VMX nested test case
On 2013-07-25 18:51, Bandan Das wrote:
Arthur Chunqi Li yzt...@gmail.com writes:
This is the first version of the VMX nested environment. It contains the basic VMX instruction test cases, including VMXON/VMXOFF/VMPTRLD/VMPTRST/VMCLEAR/VMLAUNCH/VMRESUME/VMCALL. This patch also tests the basic execution routine in the VMX nested environment and lets the VM print "Hello World" to signal a successful run. The first release also includes a test suite for vmenter (vmlaunch and vmresume). Besides, a hypercall mechanism is included; currently it is used to invoke a VM normal exit.
What's the difference between this and the one you posted earlier:
[PATCH] kvm-unit-tests : Basic architecture of VMX nested test case
1374730297-27169-1-git-send-email-yzt...@gmail.com
Can you please mention what's changed in v2?

True. A changelog can go...

Bandan
New files added:
x86/vmx.h : contains all VMX related macro declarations
x86/vmx.c : main file for VMX nested test case
Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
---

...here, i.e. after the "---", so that it won't be part of the commit.

Jan
Re: [PATCH v2] kvm-unit-tests : Basic architecture of VMX nested test case
On 2013-07-25 18:51, Bandan Das wrote: Arthur Chunqi Li yzt...@gmail.com writes: This is the first version of VMX nested environment. It contains the basic VMX instructions test cases, including VMXON/VMXOFF/VMXPTRLD/ VMXPTRST/VMCLEAR/VMLAUNCH/VMRESUME/VMCALL. This patchalso tests the basic execution routine in VMX nested environment andlet the VM print Hello World to inform its successfully run. The first release also includes a test suite for vmenter (vmlaunch and vmresume). Besides, hypercall mechanism is included and currently it is used to invoke VM normal exit. What's the difference between this and the one you posted earlier : [PATCH] kvm-unit-tests : Basic architecture of VMX nested test case 1374730297-27169-1-git-send-email-yzt...@gmail.com Can you please mention what's changed in v2 ? True. A changelog can go... Compared to v1, v2 removes two unused inline functions vmlaunch/vmresume in x86/vmx.h, and add host rflags in struct regs so that user can get host's rflags in exit_handler. Bandan New files added: x86/vmx.h : contains all VMX related macro declerations x86/vmx.c : main file for VMX nested test case Signed-off-by: Arthur Chunqi Li yzt...@gmail.com --- ...here, ie. after the --- so that it wont be part of the commit. Jan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Bug 60629] New: Starting a virtual machine after suspend causes the host system to hang
https://bugzilla.kernel.org/show_bug.cgi?id=60629

Bug ID: 60629
Summary: Starting a virtual machine after suspend causes the host system to hang
Product: Virtualization
Version: unspecified
Kernel Version: 2.6.35-32 - 3.9.9-1
Hardware: x86-64
OS: Linux
Tree: Mainline
Status: NEW
Severity: high
Priority: P1
Component: kvm
Assignee: virtualization_...@kernel-bugs.osdl.org
Reporter: ffsi...@yandex.ru
Regression: No

HW: MB: Gigabyte EP43UD3l, CPU: Core 2 Quad Q8400, RAM: 4 GB

uname -a
Linux hostname 3.9.9-1-ARCH #1 SMP PREEMPT Wed Jul 3 22:45:16 CEST 2013 x86_64 GNU/Linux

I have the same problem as in http://www.linux.org.ru/forum/linux-hardware/7455179, so here is a rough translation:

To start with: Ubuntu 10.10 64-bit, kernel 2.6.35-32 64-bit, Intel(R) Core(TM)2 Quad CPU Q8300 @ 2.50GHz, Gigabyte EP43-DS3L motherboard. This processor supports virtualization; with 3-3.5 GB of RAM everything works correctly, both 32-bit and 64-bit. The problems begin with 4 GB or more installed and virtualization enabled in the BIOS; there is no problem with virtualization disabled.

The problem: with virtualization enabled and 4 GB of RAM or more, after the computer resumes from sleep (suspend-to-RAM), starting VirtualBox or VMware leads to a complete freeze — no response via ssh or any other way, and no reaction to SysRq either.

By experimenting, I discovered that before sleep:

cat /proc/cpuinfo | grep 'model name'
model name: Intel(R) Core(TM)2 Quad CPU Q8300 @ 2.50GHz
model name: Intel(R) Core(TM)2 Quad CPU Q8300 @ 2.50GHz
model name: Intel(R) Core(TM)2 Quad CPU Q8300 @ 2.50GHz
model name: Intel(R) Core(TM)2 Quad CPU Q8300 @ 2.50GHz

After sleep:

cat /proc/cpuinfo | grep 'model name'
model name: Intel(R) Core(TM)2 Quad CPU Q8300 @ 2.50GHz
model name: 06/17
model name: 06/17
model name: 06/17

I.e. the model name changes to something strange. This does not happen with virtualization turned off (with 4 GB of RAM or more), and it does not occur at all with 3 GB of RAM, whether virtualization is on or off.
Part of the problem was solved with the help of a script placed in /etc/pm/sleep.d/:

#!/bin/sh
case $1 in
    hibernate|suspend)
        echo 0 > /sys/devices/system/cpu/cpu1/online
        echo 0 > /sys/devices/system/cpu/cpu2/online
        echo 0 > /sys/devices/system/cpu/cpu3/online
        ;;
    thaw|resume)
        echo 0 > /sys/devices/system/cpu/cpu1/online
        echo 0 > /sys/devices/system/cpu/cpu2/online
        echo 0 > /sys/devices/system/cpu/cpu3/online
        echo 1 > /sys/devices/system/cpu/cpu1/online
        echo 1 > /sys/devices/system/cpu/cpu2/online
        echo 1 > /sys/devices/system/cpu/cpu3/online
        /etc/init.d/microcode.ctl
        echo 0 > /sys/devices/system/cpu/cpu1/online
        echo 0 > /sys/devices/system/cpu/cpu2/online
        echo 0 > /sys/devices/system/cpu/cpu3/online
        echo 1 > /sys/devices/system/cpu/cpu1/online
        echo 1 > /sys/devices/system/cpu/cpu2/online
        echo 1 > /sys/devices/system/cpu/cpu3/online
        ;;
esac

(But the model name still reads '06/17' anyway.)

There is still a nasty catch, though: if VMware or VirtualBox is left running, the freeze happens again right after waking up — the script does not get a chance to run.

--
You are receiving this mail because:
You are watching the assignee of the bug.
[PATCH 1/4] powerpc: book3e: _PAGE_LENDIAN must be _PAGE_ENDIAN
For book3e, _PAGE_ENDIAN is not defined. In fact, what is defined is _PAGE_LENDIAN, which is wrong; it should be _PAGE_ENDIAN. There are no compilation errors because arch/powerpc/include/asm/pte-common.h defines _PAGE_ENDIAN to 0 when it is not defined anywhere else.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
 arch/powerpc/include/asm/pte-book3e.h | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/pte-book3e.h b/arch/powerpc/include/asm/pte-book3e.h
index 0156702..576ad88 100644
--- a/arch/powerpc/include/asm/pte-book3e.h
+++ b/arch/powerpc/include/asm/pte-book3e.h
@@ -40,7 +40,7 @@
 #define _PAGE_U1	0x01
 #define _PAGE_U0	0x02
 #define _PAGE_ACCESSED	0x04
-#define _PAGE_LENDIAN	0x08
+#define _PAGE_ENDIAN	0x08
 #define _PAGE_GUARDED	0x10
 #define _PAGE_COHERENT	0x20 /* M: enforce memory coherence */
 #define _PAGE_NO_CACHE	0x40 /* I: cache inhibit */
--
1.7.0.4
[PATCH 2/4] kvm: powerpc: allow guest control E attribute in mas2
The E bit in MAS2 indicates whether the page is accessed in little-endian or big-endian byte order. There is no reason to stop the guest from setting E, so allow it.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
 arch/powerpc/kvm/e500.h | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index c2e5e98..277cb18 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -117,7 +117,7 @@ static inline struct kvmppc_vcpu_e500 *to_e500(struct kvm_vcpu *vcpu)
 #define E500_TLB_USER_PERM_MASK (MAS3_UX|MAS3_UR|MAS3_UW)
 #define E500_TLB_SUPER_PERM_MASK (MAS3_SX|MAS3_SR|MAS3_SW)
 #define MAS2_ATTRIB_MASK \
-	(MAS2_X0 | MAS2_X1)
+	(MAS2_X0 | MAS2_X1 | MAS2_E)
 #define MAS3_ATTRIB_MASK \
 	(MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3 \
 	| E500_TLB_USER_PERM_MASK | E500_TLB_SUPER_PERM_MASK)
--
1.7.0.4
[PATCH 3/4] kvm: powerpc: allow guest control G attribute in mas2
The G bit in MAS2 indicates whether the page is Guarded. There is no reason to stop the guest from setting G, so allow it.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
 arch/powerpc/kvm/e500.h | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index 277cb18..4fd9650 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -117,7 +117,7 @@ static inline struct kvmppc_vcpu_e500 *to_e500(struct kvm_vcpu *vcpu)
 #define E500_TLB_USER_PERM_MASK (MAS3_UX|MAS3_UR|MAS3_UW)
 #define E500_TLB_SUPER_PERM_MASK (MAS3_SX|MAS3_SR|MAS3_SW)
 #define MAS2_ATTRIB_MASK \
-	(MAS2_X0 | MAS2_X1 | MAS2_E)
+	(MAS2_X0 | MAS2_X1 | MAS2_E | MAS2_G)
 #define MAS3_ATTRIB_MASK \
 	(MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3 \
 	| E500_TLB_USER_PERM_MASK | E500_TLB_SUPER_PERM_MASK)
--
1.7.0.4
[PATCH 4/4] kvm: powerpc: set cache coherency only for RAM pages
If the page is RAM, map it as cacheable and coherent (set M); otherwise the page is treated as I/O and mapped cache-inhibited and guarded (set I + G). This helps set up proper MMU mappings for directly assigned devices.

NOTE: There can be devices that require a cacheable mapping, which is not yet supported.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
 arch/powerpc/kvm/e500_mmu_host.c | 24 +++-
 1 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 1c6a9d7..5cbdc8f 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -64,13 +64,27 @@
 	return mas3;
 }
 
-static inline u32 e500_shadow_mas2_attrib(u32 mas2, int usermode)
+static inline u32 e500_shadow_mas2_attrib(u32 mas2, pfn_t pfn)
 {
+	u32 mas2_attr;
+
+	mas2_attr = mas2 & MAS2_ATTRIB_MASK;
+
+	if (kvm_is_mmio_pfn(pfn)) {
+		/*
+		 * If page is not RAM then it is treated as I/O page.
+		 * Map it with cache inhibited and guarded (set I + G).
+		 */
+		mas2_attr |= MAS2_I | MAS2_G;
+		return mas2_attr;
+	}
+
+	/* Map RAM pages as cacheable (not setting I in MAS2) */
 #ifdef CONFIG_SMP
-	return (mas2 & MAS2_ATTRIB_MASK) | MAS2_M;
-#else
-	return mas2 & MAS2_ATTRIB_MASK;
+	/* Also map as coherent (set M) in SMP */
+	mas2_attr |= MAS2_M;
 #endif
+	return mas2_attr;
 }
 
 /*
@@ -313,7 +327,7 @@
 	/* Force IPROT=0 for all guest mappings. */
 	stlbe->mas1 = MAS1_TSIZE(tsize) | get_tlb_sts(gtlbe) | MAS1_VALID;
 	stlbe->mas2 = (gvaddr & MAS2_EPN) |
-		      e500_shadow_mas2_attrib(gtlbe->mas2, pr);
+		      e500_shadow_mas2_attrib(gtlbe->mas2, pfn);
 	stlbe->mas7_3 = ((u64)pfn << PAGE_SHIFT) |
 			e500_shadow_mas3_attrib(gtlbe->mas7_3, pr);
--
1.7.0.4
Re: [PATCH 1/4 v6] powerpc: export debug registers save function for KVM
On 04.07.2013, at 08:57, Bharat Bhushan wrote:

KVM needs this function when switching from a vcpu to a user-space thread; a subsequent patch will use it.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com

Ben / Michael, please ack.

Alex

---
v5->v6
- switch_booke_debug_regs() not guarded by the compiler switch

 arch/powerpc/include/asm/switch_to.h | 1 +
 arch/powerpc/kernel/process.c        | 3 ++-
 2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/switch_to.h b/arch/powerpc/include/asm/switch_to.h
index 200d763..db68f1d 100644
--- a/arch/powerpc/include/asm/switch_to.h
+++ b/arch/powerpc/include/asm/switch_to.h
@@ -29,6 +29,7 @@
 extern void giveup_vsx(struct task_struct *);
 extern void enable_kernel_spe(void);
 extern void giveup_spe(struct task_struct *);
 extern void load_up_spe(struct task_struct *);
+extern void switch_booke_debug_regs(struct thread_struct *new_thread);
 
 #ifndef CONFIG_SMP
 extern void discard_lazy_cpu_state(void);
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 01ff496..da586aa 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -362,12 +362,13 @@
  * debug registers, set the debug registers from the values
  * stored in the new thread.
 */
-static void switch_booke_debug_regs(struct thread_struct *new_thread)
+void switch_booke_debug_regs(struct thread_struct *new_thread)
 {
 	if ((current->thread.debug.dbcr0 & DBCR0_IDM) ||
 		(new_thread->debug.dbcr0 & DBCR0_IDM))
 		prime_debug_regs(new_thread);
 }
+EXPORT_SYMBOL_GPL(switch_booke_debug_regs);
 
 #else	/* !CONFIG_PPC_ADV_DEBUG_REGS */
 #ifndef CONFIG_HAVE_HW_BREAKPOINT
 static void set_debug_reg_defaults(struct thread_struct *thread)
--
1.7.0.4

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/8] KVM: PPC: Book3S PR: Load up SPRG3 register with guest value on guest entry
On 11.07.2013, at 13:49, Paul Mackerras wrote: Unlike the other general-purpose SPRs, SPRG3 can be read by usermode code, and is used in recent kernels to store the CPU and NUMA node numbers so that they can be read by VDSO functions. Thus we need to load the guest's SPRG3 value into the real SPRG3 register when entering the guest, and restore the host's value when exiting the guest. We don't need to save the guest SPRG3 value when exiting the guest as usermode code can't modify SPRG3. This loads SPRG3 on every guest exit, which can happen a lot with instruction emulation. Since the kernel doesn't rely on the contents of SPRG3 we only have to care about it when not in KVM code, right? So could we move this to kvmppc_core_vcpu_load/put instead? Alex Signed-off-by: Paul Mackerras pau...@samba.org --- arch/powerpc/kernel/asm-offsets.c| 1 + arch/powerpc/kvm/book3s_interrupts.S | 14 ++ 2 files changed, 15 insertions(+) diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index 6f16ffa..a67c76e 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -452,6 +452,7 @@ int main(void) DEFINE(VCPU_SPRG2, offsetof(struct kvm_vcpu, arch.shregs.sprg2)); DEFINE(VCPU_SPRG3, offsetof(struct kvm_vcpu, arch.shregs.sprg3)); #endif + DEFINE(VCPU_SHARED_SPRG3, offsetof(struct kvm_vcpu_arch_shared, sprg3)); DEFINE(VCPU_SHARED_SPRG4, offsetof(struct kvm_vcpu_arch_shared, sprg4)); DEFINE(VCPU_SHARED_SPRG5, offsetof(struct kvm_vcpu_arch_shared, sprg5)); DEFINE(VCPU_SHARED_SPRG6, offsetof(struct kvm_vcpu_arch_shared, sprg6)); diff --git a/arch/powerpc/kvm/book3s_interrupts.S b/arch/powerpc/kvm/book3s_interrupts.S index 48cbbf8..17cfae5 100644 --- a/arch/powerpc/kvm/book3s_interrupts.S +++ b/arch/powerpc/kvm/book3s_interrupts.S @@ -92,6 +92,11 @@ kvm_start_lightweight: PPC_LL r3, VCPU_HFLAGS(r4) rldicl r3, r3, 0, 63 /* r3 = 1 */ stb r3, HSTATE_RESTORE_HID5(r13) + + /* Load up guest SPRG3 value, since it's user readable */ + ld 
r3, VCPU_SHARED(r4)
+	ld	r3, VCPU_SHARED_SPRG3(r3)
+	mtspr	SPRN_SPRG3, r3
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 	PPC_LL	r4, VCPU_SHADOW_MSR(r4)	/* get shadow_msr */
@@ -123,6 +128,15 @@ kvmppc_handler_highmem:
 	/* R7 = vcpu */
 	PPC_LL	r7, GPR4(r1)
 
+#ifdef CONFIG_PPC_BOOK3S_64
+	/*
+	 * Reload kernel SPRG3 value.
+	 * No need to save guest value as usermode can't modify SPRG3.
+	 */
+	ld	r3, PACA_SPRG3(r13)
+	mtspr	SPRN_SPRG3, r3
+#endif /* CONFIG_PPC_BOOK3S_64 */
+
 	PPC_STL	r14, VCPU_GPR(R14)(r7)
 	PPC_STL	r15, VCPU_GPR(R15)(r7)
 	PPC_STL	r16, VCPU_GPR(R16)(r7)
--
1.8.3.1
Re: [PATCH 1/8] KVM: PPC: Book3S PR: Load up SPRG3 register with guest value on guest entry
On 25.07.2013, at 15:38, Alexander Graf wrote:
On 11.07.2013, at 13:49, Paul Mackerras wrote:
Unlike the other general-purpose SPRs, SPRG3 can be read by usermode code, and is used in recent kernels to store the CPU and NUMA node numbers so that they can be read by VDSO functions. Thus we need to load the guest's SPRG3 value into the real SPRG3 register when entering the guest, and restore the host's value when exiting the guest. We don't need to save the guest SPRG3 value when exiting the guest as usermode code can't modify SPRG3.
This loads SPRG3 on every guest exit, which can happen a lot with instruction emulation. Since the kernel doesn't rely on the contents of SPRG3 we only have to care about it when not in KVM code, right? So could we move this to kvmppc_core_vcpu_load/put instead?

... but then again, if all the shadow copy code is negligible performance-wise, so is this, probably. Applied to kvm-ppc-queue.

Alex
Re: [PATCH 2/8] KVM: PPC: Book3S PR: Keep volatile reg values in vcpu rather than shadow_vcpu
On 11.07.2013, at 13:50, Paul Mackerras wrote:

Currently PR-style KVM keeps the volatile guest register values (R0 - R13, CR, LR, CTR, XER, PC) in a shadow_vcpu struct rather than the main kvm_vcpu struct. For 64-bit, the shadow_vcpu exists in two places, a kmalloc'd struct and in the PACA, and it gets copied back and forth in kvmppc_core_vcpu_load/put(), because the real-mode code can't rely on being able to access the kmalloc'd struct.

This changes the code to copy the volatile values into the shadow_vcpu as one of the last things done before entering the guest. Similarly the values are copied back out of the shadow_vcpu to the kvm_vcpu immediately after exiting the guest. We arrange for interrupts to be still disabled at this point so that we can't get preempted on 64-bit and end up copying values from the wrong PACA.

This means that the accessor functions in kvm_book3s.h for these registers are greatly simplified, and are the same between PR and HV KVM. In places where accesses to shadow_vcpu fields are now replaced by accesses to the kvm_vcpu, we can also remove the svcpu_get/put pairs. Finally, on 64-bit, we don't need the kmalloc'd struct at all any more.

With this, the time to read the PVR one million times in a loop went from 478.2ms to 480.1ms (averages of 4 values), a difference which is not statistically significant given the variability of the results.
Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_book3s.h     | 193 +-
 arch/powerpc/include/asm/kvm_book3s_asm.h |   6 +-
 arch/powerpc/include/asm/kvm_host.h       |   1 +
 arch/powerpc/kernel/asm-offsets.c         |   3 +-
 arch/powerpc/kvm/book3s_emulate.c         |   8 +-
 arch/powerpc/kvm/book3s_interrupts.S      | 101
 arch/powerpc/kvm/book3s_pr.c              |  68 +--
 arch/powerpc/kvm/book3s_rmhandlers.S      |   5 -
 arch/powerpc/kvm/trace.h                  |   7 +-
 9 files changed, 175 insertions(+), 217 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 08891d0..5d68f6c 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -198,149 +198,97 @@ extern void kvm_return_point(void);
 #include <asm/kvm_book3s_64.h>
 #endif
 
-#ifdef CONFIG_KVM_BOOK3S_PR
-
-static inline unsigned long kvmppc_interrupt_offset(struct kvm_vcpu *vcpu)
-{
-	return to_book3s(vcpu)->hior;
-}
-
-static inline void kvmppc_update_int_pending(struct kvm_vcpu *vcpu,
-			unsigned long pending_now, unsigned long old_pending)
-{
-	if (pending_now)
-		vcpu->arch.shared->int_pending = 1;
-	else if (old_pending)
-		vcpu->arch.shared->int_pending = 0;
-}
-
 static inline void kvmppc_set_gpr(struct kvm_vcpu *vcpu, int num, ulong val)
 {
-	if ( num < 14 ) {
-		struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
-		svcpu->gpr[num] = val;
-		svcpu_put(svcpu);
-		to_book3s(vcpu)->shadow_vcpu->gpr[num] = val;
-	} else
-		vcpu->arch.gpr[num] = val;
+	vcpu->arch.gpr[num] = val;
 }
 
 static inline ulong kvmppc_get_gpr(struct kvm_vcpu *vcpu, int num)
 {
-	if ( num < 14 ) {
-		struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
-		ulong r = svcpu->gpr[num];
-		svcpu_put(svcpu);
-		return r;
-	} else
-		return vcpu->arch.gpr[num];
+	return vcpu->arch.gpr[num];
 }
 
 static inline void kvmppc_set_cr(struct kvm_vcpu *vcpu, u32 val)
 {
-	struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
-	svcpu->cr = val;
-	svcpu_put(svcpu);
-	to_book3s(vcpu)->shadow_vcpu->cr = val;
+	vcpu->arch.cr = val;
 }
 static inline u32 kvmppc_get_cr(struct kvm_vcpu *vcpu)
 {
-	struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
-	u32 r;
-	r = svcpu->cr;
-	svcpu_put(svcpu);
-	return r;
+	return vcpu->arch.cr;
 }
 
 static inline void kvmppc_set_xer(struct kvm_vcpu *vcpu, u32 val)
 {
-	struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
-	svcpu->xer = val;
-	to_book3s(vcpu)->shadow_vcpu->xer = val;
-	svcpu_put(svcpu);
+	vcpu->arch.xer = val;
 }
 
 static inline u32 kvmppc_get_xer(struct kvm_vcpu *vcpu)
 {
-	struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
-	u32 r;
-	r = svcpu->xer;
-	svcpu_put(svcpu);
-	return r;
+	return vcpu->arch.xer;
 }
 
 static inline void kvmppc_set_ctr(struct kvm_vcpu *vcpu, ulong val)
 {
-	struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
-	svcpu->ctr = val;
-	svcpu_put(svcpu);
+	vcpu->arch.ctr = val;
 }
 
 static inline ulong kvmppc_get_ctr(struct kvm_vcpu *vcpu)
 {
-	struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
-	ulong r;
-	r = svcpu->ctr;
-	svcpu_put(svcpu);
-
Re: [PATCH 2/2] kvm: powerpc: set cache coherency only for kernel managed pages
On 25.07.2013, at 10:50, Gleb Natapov wrote: On Wed, Jul 24, 2013 at 03:32:49PM -0500, Scott Wood wrote: On 07/24/2013 04:39:59 AM, Alexander Graf wrote: On 24.07.2013, at 11:35, Gleb Natapov wrote: On Wed, Jul 24, 2013 at 11:21:11AM +0200, Alexander Graf wrote: Are not we going to use page_is_ram() from e500_shadow_mas2_attrib() as Scott commented? rWhy aren't we using page_is_ram() in kvm_is_mmio_pfn()? Because it is much slower and, IIRC, actually used to build pfn map that allow us to check quickly for valid pfn. Then why should we use page_is_ram()? :) I really don't want the e500 code to diverge too much from what the rest of the kvm code is doing. I don't understand actually used to build pfn map What code is this? I don't see any calls to page_is_ram() in the KVM code, or in generic mm code. Is this a statement about what x86 does? It may be not page_is_ram() directly, but the same into page_is_ram() is using. On power both page_is_ram() and do_init_bootmem() walks some kind of memblock_region data structure. What important is that pfn_valid() does not mean that there is a memory behind page structure. See Andrea's reply. On PPC page_is_ram() is only called (AFAICT) for determining what attributes to set on mmaps. We want to be sure that KVM always makes the same decision. While pfn_valid() seems like it should be equivalent, it's not obvious from the PPC code that it is. Again pfn_valid() is not enough. If pfn_valid() is better, why is that not used for mmap? Why are there two different names for the same thing? They are not the same thing. page_is_ram() tells you if phys address is ram backed. pfn_valid() tells you if there is struct page behind the pfn. PageReserved() tells if you a pfn is marked as reserved. All non ram pfns should be reserved, but ram pfns can be reserved too. Again, see Andrea's reply. Why ppc uses page_is_ram() for mmap? How should I know? But looking at That one's easy. Let's just ask Ben. 
Ben, is there any particular reason PPC uses page_is_ram() rather than what KVM does here to figure out whether a pfn is RAM or not? It would be really useful to be able to run the exact same logic that figures out whether we're cacheable or not in both TLB writers (KVM and linux-mm).

Alex

the function, it does it only as a fallback if ppc_md.phys_mem_access_prot() is not provided. Making access to MMIO noncached as a safe fallback makes sense. It also makes sense to allow noncached access to reserved ram sometimes.

--
Gleb.
Re: [PATCH 2/2] kvm: powerpc: set cache coherency only for kernel managed pages
On Thu, Jul 25, 2013 at 06:07:55PM +0200, Alexander Graf wrote: On 25.07.2013, at 10:50, Gleb Natapov wrote: On Wed, Jul 24, 2013 at 03:32:49PM -0500, Scott Wood wrote: On 07/24/2013 04:39:59 AM, Alexander Graf wrote: On 24.07.2013, at 11:35, Gleb Natapov wrote: On Wed, Jul 24, 2013 at 11:21:11AM +0200, Alexander Graf wrote: Are not we going to use page_is_ram() from e500_shadow_mas2_attrib() as Scott commented? rWhy aren't we using page_is_ram() in kvm_is_mmio_pfn()? Because it is much slower and, IIRC, actually used to build pfn map that allow us to check quickly for valid pfn. Then why should we use page_is_ram()? :) I really don't want the e500 code to diverge too much from what the rest of the kvm code is doing. I don't understand actually used to build pfn map What code is this? I don't see any calls to page_is_ram() in the KVM code, or in generic mm code. Is this a statement about what x86 does? It may be not page_is_ram() directly, but the same into page_is_ram() is using. On power both page_is_ram() and do_init_bootmem() walks some kind of memblock_region data structure. What important is that pfn_valid() does not mean that there is a memory behind page structure. See Andrea's reply. On PPC page_is_ram() is only called (AFAICT) for determining what attributes to set on mmaps. We want to be sure that KVM always makes the same decision. While pfn_valid() seems like it should be equivalent, it's not obvious from the PPC code that it is. Again pfn_valid() is not enough. If pfn_valid() is better, why is that not used for mmap? Why are there two different names for the same thing? They are not the same thing. page_is_ram() tells you if phys address is ram backed. pfn_valid() tells you if there is struct page behind the pfn. PageReserved() tells if you a pfn is marked as reserved. All non ram pfns should be reserved, but ram pfns can be reserved too. Again, see Andrea's reply. Why ppc uses page_is_ram() for mmap? How should I know? But looking at That one's easy. 
Let's just ask Ben.

Ben, is there any particular reason PPC uses page_is_ram() rather than what KVM does here to figure out whether a pfn is RAM or not? It would be really useful to be able to run the exact same logic that figures out whether we're cacheable or not in both TLB writers (KVM and linux-mm).

KVM does not only try to figure out what is RAM and what is not! Look at how KVM uses the function. KVM tries to figure out whether refcounting needs to be used on this page, among other things.

--
Gleb.
[PATCH 1/4] powerpc: book3e: _PAGE_LENDIAN must be _PAGE_ENDIAN
For book3e, _PAGE_ENDIAN is not defined. In fact, what is defined is _PAGE_LENDIAN, which is wrong and should be _PAGE_ENDIAN. There are no compilation errors because arch/powerpc/include/asm/pte-common.h defines _PAGE_ENDIAN to 0 when it is not defined anywhere.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
 arch/powerpc/include/asm/pte-book3e.h | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/pte-book3e.h b/arch/powerpc/include/asm/pte-book3e.h
index 0156702..576ad88 100644
--- a/arch/powerpc/include/asm/pte-book3e.h
+++ b/arch/powerpc/include/asm/pte-book3e.h
@@ -40,7 +40,7 @@
 #define _PAGE_U1	0x01
 #define _PAGE_U0	0x02
 #define _PAGE_ACCESSED	0x04
-#define _PAGE_LENDIAN	0x08
+#define _PAGE_ENDIAN	0x08
 #define _PAGE_GUARDED	0x10
 #define _PAGE_COHERENT	0x20 /* M: enforce memory coherence */
 #define _PAGE_NO_CACHE	0x40 /* I: cache inhibit */
--
1.7.0.4
[PATCH 3/4] kvm: powerpc: allow guest control G attribute in mas2
The G bit in MAS2 indicates whether the page is Guarded. There is no reason to stop the guest from setting G, so allow it.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
 arch/powerpc/kvm/e500.h | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index 277cb18..4fd9650 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -117,7 +117,7 @@ static inline struct kvmppc_vcpu_e500 *to_e500(struct kvm_vcpu *vcpu)
 #define E500_TLB_USER_PERM_MASK (MAS3_UX|MAS3_UR|MAS3_UW)
 #define E500_TLB_SUPER_PERM_MASK (MAS3_SX|MAS3_SR|MAS3_SW)
 #define MAS2_ATTRIB_MASK \
-	  (MAS2_X0 | MAS2_X1 | MAS2_E)
+	  (MAS2_X0 | MAS2_X1 | MAS2_E | MAS2_G)
 #define MAS3_ATTRIB_MASK \
	  (MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3 \
	  | E500_TLB_USER_PERM_MASK | E500_TLB_SUPER_PERM_MASK)
--
1.7.0.4
[PATCH 4/4] kvm: powerpc: set cache coherency only for RAM pages
If the page is RAM, then map it as cacheable and coherent (set the M bit); otherwise the page is treated as I/O and mapped cache-inhibited and guarded (set I + G).

This helps set up a proper MMU mapping for a directly assigned device.

NOTE: There can be devices that require a cacheable mapping, which is not yet supported.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
 arch/powerpc/kvm/e500_mmu_host.c | 24 +++-
 1 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 1c6a9d7..5cbdc8f 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -64,13 +64,27 @@ static inline u32 e500_shadow_mas3_attrib(u32 mas3, int usermode)
 	return mas3;
 }

-static inline u32 e500_shadow_mas2_attrib(u32 mas2, int usermode)
+static inline u32 e500_shadow_mas2_attrib(u32 mas2, pfn_t pfn)
 {
+	u32 mas2_attr;
+
+	mas2_attr = mas2 & MAS2_ATTRIB_MASK;
+
+	if (kvm_is_mmio_pfn(pfn)) {
+		/*
+		 * If page is not RAM then it is treated as I/O page.
+		 * Map it with cache inhibited and guarded (set I + G).
+		 */
+		mas2_attr |= MAS2_I | MAS2_G;
+		return mas2_attr;
+	}
+
+	/* Map RAM pages as cacheable (not setting I in MAS2) */
 #ifdef CONFIG_SMP
-	return (mas2 & MAS2_ATTRIB_MASK) | MAS2_M;
-#else
-	return mas2 & MAS2_ATTRIB_MASK;
+	/* Also map as coherent (set M) in SMP */
+	mas2_attr |= MAS2_M;
 #endif
+	return mas2_attr;
 }

 /*
@@ -313,7 +327,7 @@ static void kvmppc_e500_setup_stlbe(
 	/* Force IPROT=0 for all guest mappings. */
 	stlbe->mas1 = MAS1_TSIZE(tsize) | get_tlb_sts(gtlbe) | MAS1_VALID;
 	stlbe->mas2 = (gvaddr & MAS2_EPN) |
-		      e500_shadow_mas2_attrib(gtlbe->mas2, pr);
+		      e500_shadow_mas2_attrib(gtlbe->mas2, pfn);
 	stlbe->mas7_3 = ((u64)pfn << PAGE_SHIFT) |
			e500_shadow_mas3_attrib(gtlbe->mas7_3, pr);
--
1.7.0.4