Re: [PATCH 0/18][RFC] Nested Paging support for Nested SVM (aka NPT-Virtualization)
On Fri, Mar 12, 2010 at 09:36:41AM +0200, Avi Kivity wrote:

On 03/11/2010 10:58 PM, Marcelo Tosatti wrote: Can't you translate l2_gpa -> l1_gpa walking the current l1 nested pagetable, and pass that to the kvm tdp fault path (with the correct context setup)?

If I understand your suggestion correctly, I think that's exactly what's done in the patches. Some words about the design: for nested-nested we need to shadow the l1 nested pagetable on the host. This is done using the vcpu->arch.mmu context, which holds the l1 paging modes while the l2 is running. On an npt-fault from the l2 we just instrument the shadow-pagetable code. This is the common case, because it happens all the time while the l2 is running.

OK, makes sense now. I was missing the fact that the l1 nested pagetable needs to be shadowed, and that l1 translations to it must be write-protected.

Shadow converts (gva -> gpa -> hpa) to (gva -> hpa), or (ngpa -> gpa -> hpa) to (ngpa -> hpa), equally well. In the second case npt still does (ngva -> ngpa).

You should disable out-of-sync shadow so that l1 guest writes to l1 nested pagetables always trap.

Why? The guest is under obligation to flush the tlb if it writes to a page table, and we will resync on that tlb flush.

The guest's hypervisor will not flush the tlb with invlpg for updates of its NPT pagetables. It'll create a new ASID, and KVM will not trap that.

Unsync makes just as much sense for nnpt. Think of khugepaged in the guest eating a page table and spitting out a PDE.

And in the trap case, you'd have to invalidate l2 shadow pagetable entries that used the (now obsolete) l1 nested pagetable entry. Does that happen automatically?

What do you mean by 'l2 shadow ptable entries'? There are the guest's page tables (ordinary direct mapped, unless the guest's guest is also running an npt-enabled hypervisor), and the host page tables. When the guest writes to each page table, we invalidate the shadows.

With 'l2 shadow ptable entries' I mean the shadow pagetables that translate GPA-L2 -> HPA.
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
Selectively control Unmapped Page Cache (nospam version)

From: Balbir Singh bal...@linux.vnet.ibm.com

This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario:

- In a virtualized environment with cache!=none, we see double caching (once in the host and once in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages.
- The option is controlled via a boot option, so the administrator can selectively turn it on, on a need-to-use basis.

A lot of the code is borrowed from the zone_reclaim_mode logic for __zone_reclaim(). One might argue that with ballooning and KSM this feature is not very useful, but even with ballooning we need extra logic to balloon multiple VMs, and it is hard to figure out the correct amount of memory to balloon. With these patches applied, each guest has a sufficient amount of free memory available that can be easily seen and reclaimed by the balloon driver. The additional memory in the guest can be reused for additional applications or used to start additional guests/balance memory in the host. KSM currently does not de-duplicate host and guest page cache.

The goal of this patch is to help automatically balance unmapped page cache when instructed to do so. There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO and the number of pages to reclaim when the unmapped_page_control argument is supplied. These numbers were chosen to avoid reaping page cache too aggressively or too frequently, while still providing control.
The sysctl for min_unmapped_ratio provides further control from within the guest on the amount of unmapped pages to reclaim. The patch is applied against mmotm feb-11-2010.

Tests: Ran 4 VMs in parallel, running kernbench using kvm autotest. Each guest had 2 CPUs with 512M of memory.

Guest usage without boot parameter (memory in KB):

MemFree  Cached  Time
 19900   292912   137
 17540   296196   139
 17900   296124   141
 19356   296660   141

Host usage (memory in KB):

    RSS    Cache  mapped    swap
2788664   781884    3780  359536

Guest usage with boot parameter (memory in KB):

MemFree  Cached  Time
244824    74828   144
237840    81764   143
235880    83044   138
239312    80092   148

Host usage (memory in KB):

    RSS    Cache  mapped    swap
2700184   958012  334848  398412

The key thing to observe is the free memory when the boot parameter is enabled.

TODOs:
1. Balance slab cache as well
2. Invoke the balance routines from the balloon driver

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 include/linux/mmzone.h |    2 -
 include/linux/swap.h   |    3 +
 mm/page_alloc.c        |    9 ++-
 mm/vmscan.c            |  165 +++++++++++++++++++++++++++++++----------
 4 files changed, 134 insertions(+), 45 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ad5abcf..f0b245f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -293,12 +293,12 @@ struct zone {
 	 */
 	unsigned long lowmem_reserve[MAX_NR_ZONES];
+	unsigned long min_unmapped_pages;
 #ifdef CONFIG_NUMA
 	int node;
 	/*
 	 * zone reclaim becomes active if more unmapped pages exist.
 	 */
-	unsigned long min_unmapped_pages;
 	unsigned long min_slab_pages;
 #endif
 	struct per_cpu_pageset __percpu *pageset;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index c2a4295..d0c8176 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -254,10 +254,11 @@
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
+extern bool should_balance_unmapped_pages(struct zone *zone);
+extern int sysctl_min_unmapped_ratio;
 
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
-extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 416b056..1cc5c75 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1578,6 +1578,9 @@ zonelist_scan:
 		unsigned long mark;
 		int ret;
 
+		if (should_balance_unmapped_pages(zone))
+			wakeup_kswapd(zone, order);
+
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 		if
Re: [PATCH 0/18][RFC] Nested Paging support for Nested SVM (aka NPT-Virtualization)
On 03/15/2010 08:27 AM, Marcelo Tosatti wrote:

You should disable out-of-sync shadow so that l1 guest writes to l1 nested pagetables always trap.

Why? The guest is under obligation to flush the tlb if it writes to a page table, and we will resync on that tlb flush.

The guest's hypervisor will not flush the tlb with invlpg for updates of its NPT pagetables. It'll create a new ASID, and KVM will not trap that.

We'll get a kvm_set_cr3() on the next vmrun.

And in the trap case, you'd have to invalidate l2 shadow pagetable entries that used the (now obsolete) l1 nested pagetable entry. Does that happen automatically?

What do you mean by 'l2 shadow ptable entries'? There are the guest's page tables (ordinary direct mapped, unless the guest's guest is also running an npt-enabled hypervisor), and the host page tables. When the guest writes to each page table, we invalidate the shadows.

With 'l2 shadow ptable entries' I mean the shadow pagetables that translate GPA-L2 -> HPA.

kvm_mmu_pte_write() will invalidate those sptes and will also install new translations if possible. Beautiful, isn't it?

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH 15/18] KVM: MMU: Propagate the right fault back to the guest after gva_to_gpa
On 03/03/2010 09:12 PM, Joerg Roedel wrote:

This patch implements logic to make sure that either a page-fault/page-fault-vmexit or a nested-page-fault-vmexit is propagated back to the guest.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/mmu.h         |    1 +
 arch/x86/kvm/paging_tmpl.h |    2 ++
 arch/x86/kvm/x86.c         |   15 ++-
 3 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 64f619b..b42b27e 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -47,6 +47,7 @@
 #define PFERR_USER_MASK (1U << 2)
 #define PFERR_RSVD_MASK (1U << 3)
 #define PFERR_FETCH_MASK (1U << 4)
+#define PFERR_NESTED_MASK (1U << 31)

Why is this needed? Queue an ordinary page fault; the injection code should check the page fault intercept and #VMEXIT if needed.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH v2 25/30] KVM: x86 emulator: fix in/out emulation.
On 03/14/2010 07:35 PM, Gleb Natapov wrote: On Sun, Mar 14, 2010 at 06:54:11PM +0200, Avi Kivity wrote: On 03/14/2010 06:21 PM, Gleb Natapov wrote:

in/out emulation is broken now. The breakage differs depending on where the IO device resides. If it is in userspace, the emulator reports an emulation failure, since it incorrectly interprets the kvm_emulate_pio() return value. If the IO device is in the kernel, emulation of 'in' will do nothing, since kvm_emulate_pio() stores the result directly into the vcpu registers, so the emulator will overwrite the result of emulation during commit of the shadowed register.

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 8f5e4c8..344e17b 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -210,13 +210,13 @@ static u32 opcode_table[256] = {
 	0, 0, 0, 0, 0, 0, 0, 0,
 	/* 0xE0 - 0xE7 */
 	0, 0, 0, 0,
-	ByteOp | SrcImmUByte, SrcImmUByte,
-	ByteOp | SrcImmUByte, SrcImmUByte,
+	ByteOp | SrcImmUByte | DstAcc, SrcImmUByte | DstAcc,
+	ByteOp | SrcImmUByte | DstAcc, SrcImmUByte | DstAcc,

A REX prefix shouldn't expand DstAcc to 64 bits here. Might cause problems further down in the pipeline.

Is a REX prefix allowed with these opcodes? If yes:

if (c->dst.bytes == 8)
	c->dst.bytes = 4;

inside IN/OUT emulation will fix this.

I don't know, but I guess REX is allowed and ignored.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH v2 28/30] KVM: x86 emulator: restart string instruction without going back to a guest.
On 03/14/2010 08:06 PM, Gleb Natapov wrote:

Suggest simply reentering every N executions.

This restart mechanism is, in fact, needed for ins read-ahead to work. After reading ahead from an IO port we need to avoid entering the decoder until the entire cache is consumed, otherwise the decoder will clear the cache and data will be lost. So we can't just enter the guest at arbitrary times, only when the read-ahead cache is empty. Since read-ahead is never done across a page boundary, this is a safe place to re-enter the guest.

Please make the two depend on each other directly then. We can't expect the reader of the emulator code to know that. Have the emulator ask the buffer whether it is empty.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH v2 25/30] KVM: x86 emulator: fix in/out emulation.
On Mon, Mar 15, 2010 at 09:41:51AM +0200, Avi Kivity wrote: On 03/14/2010 07:35 PM, Gleb Natapov wrote: On Sun, Mar 14, 2010 at 06:54:11PM +0200, Avi Kivity wrote: On 03/14/2010 06:21 PM, Gleb Natapov wrote:

in/out emulation is broken now. The breakage differs depending on where the IO device resides. If it is in userspace, the emulator reports an emulation failure, since it incorrectly interprets the kvm_emulate_pio() return value. If the IO device is in the kernel, emulation of 'in' will do nothing, since kvm_emulate_pio() stores the result directly into the vcpu registers, so the emulator will overwrite the result of emulation during commit of the shadowed register.

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 8f5e4c8..344e17b 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -210,13 +210,13 @@ static u32 opcode_table[256] = {
 	0, 0, 0, 0, 0, 0, 0, 0,
 	/* 0xE0 - 0xE7 */
 	0, 0, 0, 0,
-	ByteOp | SrcImmUByte, SrcImmUByte,
-	ByteOp | SrcImmUByte, SrcImmUByte,
+	ByteOp | SrcImmUByte | DstAcc, SrcImmUByte | DstAcc,
+	ByteOp | SrcImmUByte | DstAcc, SrcImmUByte | DstAcc,

A REX prefix shouldn't expand DstAcc to 64 bits here. Might cause problems further down in the pipeline.

Is a REX prefix allowed with these opcodes? If yes:

if (c->dst.bytes == 8)
	c->dst.bytes = 4;

inside IN/OUT emulation will fix this.

I don't know, but I guess REX is allowed and ignored.

Hmm, curious. I'll test.

-- 
Gleb.
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
On 03/15/2010 09:22 AM, Balbir Singh wrote:

Selectively control Unmapped Page Cache (nospam version)

From: Balbir Singh bal...@linux.vnet.ibm.com

This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario: in a virtualized environment with cache!=none, we see double caching (once in the host and once in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages.

Well, for a guest, host page cache is a lot slower than guest page cache.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
* Avi Kivity a...@redhat.com [2010-03-15 09:48:05]:

On 03/15/2010 09:22 AM, Balbir Singh wrote: This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. [...] There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages.

Well, for a guest, host page cache is a lot slower than guest page cache.

Yes, it is a virtio call away, but is the cost of paying twice in terms of memory acceptable? One of the reasons I created a boot parameter was to deal with selective enablement for cases where memory is the most important resource being managed. I do see a hit in performance with my results (please see the data below), but the savings are quite large. The other solution mentioned in the TODOs is to have the balloon driver invoke this path. The sysctl also allows the guest to tune the amount of unmapped page cache if needed. The knobs are for:

1. Selective enablement
2. Selective control of the % of unmapped pages

-- 
Three Cheers,
Balbir
[PATCH] KVM: cleanup: change to use bool return values
Make use of bool as return values.

Signed-off-by: Gui Jianfeng guijianf...@cn.fujitsu.com
---
 arch/x86/kvm/vmx.c | 72 ++--
 1 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 06108f3..cc0628e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -231,65 +231,65 @@ static const u32 vmx_msr_index[] = {
 };
 #define NR_VMX_MSR ARRAY_SIZE(vmx_msr_index)

-static inline int is_page_fault(u32 intr_info)
+static inline bool is_page_fault(u32 intr_info)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK |
 			     INTR_INFO_VALID_MASK)) ==
 		(INTR_TYPE_HARD_EXCEPTION | PF_VECTOR | INTR_INFO_VALID_MASK);
 }

-static inline int is_no_device(u32 intr_info)
+static inline bool is_no_device(u32 intr_info)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK |
 			     INTR_INFO_VALID_MASK)) ==
 		(INTR_TYPE_HARD_EXCEPTION | NM_VECTOR | INTR_INFO_VALID_MASK);
 }

-static inline int is_invalid_opcode(u32 intr_info)
+static inline bool is_invalid_opcode(u32 intr_info)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK |
 			     INTR_INFO_VALID_MASK)) ==
 		(INTR_TYPE_HARD_EXCEPTION | UD_VECTOR | INTR_INFO_VALID_MASK);
 }

-static inline int is_external_interrupt(u32 intr_info)
+static inline bool is_external_interrupt(u32 intr_info)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
 		== (INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK);
 }

-static inline int is_machine_check(u32 intr_info)
+static inline bool is_machine_check(u32 intr_info)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK |
 			     INTR_INFO_VALID_MASK)) ==
 		(INTR_TYPE_HARD_EXCEPTION | MC_VECTOR | INTR_INFO_VALID_MASK);
 }

-static inline int cpu_has_vmx_msr_bitmap(void)
+static inline bool cpu_has_vmx_msr_bitmap(void)
 {
-	return vmcs_config.cpu_based_exec_ctrl & CPU_BASED_USE_MSR_BITMAPS;
+	return !!(vmcs_config.cpu_based_exec_ctrl & CPU_BASED_USE_MSR_BITMAPS);
 }

-static inline int cpu_has_vmx_tpr_shadow(void)
+static
inline bool cpu_has_vmx_tpr_shadow(void)
 {
-	return vmcs_config.cpu_based_exec_ctrl & CPU_BASED_TPR_SHADOW;
+	return !!(vmcs_config.cpu_based_exec_ctrl & CPU_BASED_TPR_SHADOW);
 }

-static inline int vm_need_tpr_shadow(struct kvm *kvm)
+static inline bool vm_need_tpr_shadow(struct kvm *kvm)
 {
 	return (cpu_has_vmx_tpr_shadow()) && (irqchip_in_kernel(kvm));
 }

-static inline int cpu_has_secondary_exec_ctrls(void)
+static inline bool cpu_has_secondary_exec_ctrls(void)
 {
-	return vmcs_config.cpu_based_exec_ctrl &
-		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+	return !!(vmcs_config.cpu_based_exec_ctrl &
+		  CPU_BASED_ACTIVATE_SECONDARY_CONTROLS);
 }

 static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
 {
-	return vmcs_config.cpu_based_2nd_exec_ctrl &
-		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+	return !!(vmcs_config.cpu_based_2nd_exec_ctrl &
+		  SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
 }

 static inline bool cpu_has_vmx_flexpriority(void)
@@ -323,59 +323,59 @@ static inline bool cpu_has_vmx_ept_1g_page(void)
 	return !!(vmx_capability.ept & VMX_EPT_1GB_PAGE_BIT);
 }

-static inline int cpu_has_vmx_invept_individual_addr(void)
+static inline bool cpu_has_vmx_invept_individual_addr(void)
 {
 	return !!(vmx_capability.ept & VMX_EPT_EXTENT_INDIVIDUAL_BIT);
 }

-static inline int cpu_has_vmx_invept_context(void)
+static inline bool cpu_has_vmx_invept_context(void)
 {
 	return !!(vmx_capability.ept & VMX_EPT_EXTENT_CONTEXT_BIT);
 }

-static inline int cpu_has_vmx_invept_global(void)
+static inline bool cpu_has_vmx_invept_global(void)
 {
 	return !!(vmx_capability.ept & VMX_EPT_EXTENT_GLOBAL_BIT);
 }

-static inline int cpu_has_vmx_ept(void)
+static inline bool cpu_has_vmx_ept(void)
 {
-	return vmcs_config.cpu_based_2nd_exec_ctrl &
-		SECONDARY_EXEC_ENABLE_EPT;
+	return !!(vmcs_config.cpu_based_2nd_exec_ctrl &
+		  SECONDARY_EXEC_ENABLE_EPT);
 }

-static inline int cpu_has_vmx_unrestricted_guest(void)
+static inline bool cpu_has_vmx_unrestricted_guest(void)
 {
-	return vmcs_config.cpu_based_2nd_exec_ctrl &
		SECONDARY_EXEC_UNRESTRICTED_GUEST;
+	return !!(vmcs_config.cpu_based_2nd_exec_ctrl &
+		  SECONDARY_EXEC_UNRESTRICTED_GUEST);
 }

-static inline int cpu_has_vmx_ple(void)
+static inline bool cpu_has_vmx_ple(void)
 {
-	return vmcs_config.cpu_based_2nd_exec_ctrl &
-		SECONDARY_EXEC_PAUSE_LOOP_EXITING;
+	return !!(vmcs_config.cpu_based_2nd_exec_ctrl &
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
On 03/15/2010 10:07 AM, Balbir Singh wrote: * Avi Kivity a...@redhat.com [2010-03-15 09:48:05]: On 03/15/2010 09:22 AM, Balbir Singh wrote: This patch implements unmapped page cache control via preferred page cache reclaim. [...] There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages.

Well, for a guest, host page cache is a lot slower than guest page cache.

Yes, it is a virtio call away, but is the cost of paying twice in terms of memory acceptable?

Usually, it isn't, which is why I recommend cache=off.

One of the reasons I created a boot parameter was to deal with selective enablement for cases where memory is the most important resource being managed. I do see a hit in performance with my results (please see the data below), but the savings are quite large. The other solution mentioned in the TODOs is to have the balloon driver invoke this path. The sysctl also allows the guest to tune the amount of unmapped page cache if needed. The knobs are for: 1. Selective enablement 2. Selective control of the % of unmapped pages

An alternative path is to enable KSM for page cache. Then we have direct read-only guest access to host page cache, without any guest modifications required. That will be pretty difficult to achieve, though - it will need a read-only bit in the page cache radix tree, and all paths will have to be taught to honour it.
-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH] KVM: cleanup: change to use bool return values
On 03/15/2010 10:23 AM, Gui Jianfeng wrote:

Make use of bool as return values.

-static inline int cpu_has_vmx_tpr_shadow(void)
+static inline bool cpu_has_vmx_tpr_shadow(void)
 {
-	return vmcs_config.cpu_based_exec_ctrl & CPU_BASED_TPR_SHADOW;
+	return !!(vmcs_config.cpu_based_exec_ctrl & CPU_BASED_TPR_SHADOW);
 }

Those !! are not required - demotion to bool is defined to convert nonzero to true.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH] KVM: x86: Use native_store_idt() instead of kvm_get_idt()
On Fri, Mar 05, 2010 at 12:11:48PM +0800, Wei Yongjun wrote:

This patch uses the generic Linux function native_store_idt() instead of kvm_get_idt(), and also removes the now-useless kvm_get_idt().

Signed-off-by: Wei Yongjun yj...@cn.fujitsu.com
---
 arch/x86/include/asm/kvm_host.h |    5 -
 arch/x86/kvm/vmx.c              |    2 +-
 2 files changed, 1 insertions(+), 6 deletions(-)

Applied, thanks.
Re: [RFC] Moving dirty bitmaps to userspace - Double buffering approach
On Mon, Mar 08, 2010 at 05:22:43PM +0900, Takuya Yoshikawa wrote:

Hi, I would like to hear your comments about the following plan: "Moving dirty bitmaps to userspace - Double buffering approach". Especially, I would be glad to hear some advice about how to keep compatibility. Thanks in advance, Takuya.

---

Overview: Last time, I submitted a patch "make get dirty log ioctl return the first dirty page's position" http://www.spinics.net/lists/kvm/msg29724.html and got some new, better ideas from Avi. As a result, I agreed to try to eliminate the bitmap allocation done in x86 KVM every time we execute "get dirty log", by using a double buffering approach. Here is my plan:

- Move the dirty bitmap allocation to userspace. We allocate bitmaps in userspace and register them by ioctl. Once a bitmap is registered, we do not touch it from userspace and let the kernel modify it directly until we switch to the next bitmap. We use double buffering at this switch point: userspace gives the kernel a new bitmap by ioctl and the kernel switches the bitmap atomically to the new one. After this switch succeeds, we can read the old bitmap freely in userspace and free it if we want; needless to say, we can also reuse it at the next switch.

- Implementation details: Although it may be possible to touch the bitmap from the kernel side without doing kmap, I think kmapping the bitmap is better. So we may use the following functions, paying enough attention to preemption control: get_user_pages(), kmap_atomic().

- Compatibility issues: What I am facing now are the compatibility issues. We have to support both userspace and kernel side bitmap allocations to let the current qemu and KVM work properly. 1. From the kernel side, we have to handle bitmap allocations done in both kvm_vm_ioctl_set_memory_region() and kvm_vm_ioctl_get_dirty_log(). 2. From the userspace side, we have to check the new API's availability and determine which way to use, e.g.
by using the "check extension" ioctl. The most problematic is 1, the kernel side. We have to be able to know by which way the current bitmap allocation is being done, using flags or something. In the case of "set memory region", we have to judge whether we allocate a bitmap, and if not, we have to register a bitmap later by another API: "set memory region" is not restricted to the dirty log issues and needs more care than "get dirty log". Are there any good ways to solve this kind of problem?

You can introduce a new get_dirty_log ioctl that passes the address of the next bitmap in userspace, and use it (after pinning with get_user_pages), instead of vmalloc'ing.
RE: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
+/* The structure to notify the virtqueue for async socket */
+struct vhost_notifier {
+	struct list_head list;
+	struct vhost_virtqueue *vq;
+	int head;
+	int size;
+	int log;
+	void *ctrl;
+	void (*dtor)(struct vhost_notifier *);
+};
+

So IMO, this is not the best interface between vhost and your driver, exposing them to each other unnecessarily. If you think about it, your driver should not care about this structure. It could get e.g. a kiocb (sendmsg already gets one), and call ki_dtor on completion. vhost could save its state in ki_user_data. If your driver needs to add more data to do more tracking, I think it can put the skb pointer in the private pointer.

Then if I remove struct vhost_notifier and just use struct kiocb, but don't use the one from sendmsg or recvmsg, instead allocating one within the page_info structure, and don't implement any aio logic related to it, is that ok? Sorry, I made a patch but don't know how to reply here with a well-formatted patch.

Thanks
Xiaohui
Re: [RFC] Moving dirty bitmaps to userspace - Double buffering approach
On 03/15/2010 10:33 AM, Marcelo Tosatti wrote:

Are there any good ways to solve this kind of problem?

You can introduce a new get_dirty_log ioctl that passes the address of the next bitmap in userspace, and use it (after pinning with get_user_pages), instead of vmalloc'ing.

No pinning please; put_user_bit() or set_bit_user(). (It can be implemented generically using get_user_pages() and kmap_atomic(), but x86 should get an optimized implementation.)

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH 15/18] KVM: MMU: Propagate the right fault back to the guest after gva_to_gpa
On Mon, Mar 15, 2010 at 09:36:52AM +0200, Avi Kivity wrote: On 03/03/2010 09:12 PM, Joerg Roedel wrote:

This patch implements logic to make sure that either a page-fault/page-fault-vmexit or a nested-page-fault-vmexit is propagated back to the guest.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/mmu.h         |    1 +
 arch/x86/kvm/paging_tmpl.h |    2 ++
 arch/x86/kvm/x86.c         |   15 ++-
 3 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 64f619b..b42b27e 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -47,6 +47,7 @@
 #define PFERR_USER_MASK (1U << 2)
 #define PFERR_RSVD_MASK (1U << 3)
 #define PFERR_FETCH_MASK (1U << 4)
+#define PFERR_NESTED_MASK (1U << 31)

Why is this needed? Queue an ordinary page fault; the injection code should check the page fault intercept and #VMEXIT if needed.

This is needed because we could have either a nested page fault or an ordinary page fault which needs to be propagated.

Joerg
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
* Avi Kivity a...@redhat.com [2010-03-15 10:27:45]:

On 03/15/2010 10:07 AM, Balbir Singh wrote: * Avi Kivity a...@redhat.com [2010-03-15 09:48:05]: On 03/15/2010 09:22 AM, Balbir Singh wrote: This patch implements unmapped page cache control via preferred page cache reclaim. [...] There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages.

Well, for a guest, host page cache is a lot slower than guest page cache.

Yes, it is a virtio call away, but is the cost of paying twice in terms of memory acceptable?

Usually, it isn't, which is why I recommend cache=off.

cache=off works only for filesystems that support direct I/O, and my concern is that one of the side effects is that idle VMs can consume a lot of memory (assuming all the memory is available to them). As the number of VMs grows, they could cache a whole lot of memory. In my experiments I found that the total amount of memory cached far exceeded the mapped ratio by a large amount when we had idle VMs. The philosophy of this patch is to move the caching to the _host_ and let the host maintain the cache instead of the guest.

One of the reasons I created a boot parameter was to deal with selective enablement for cases where memory is the most important resource being managed. I do see a hit in performance with my results (please see the data below), but the savings are quite large.
The other solution mentioned in the TODOs is to have the balloon driver invoke this path. The sysctl also allows the guest to tune the amount of unmapped page cache if needed. The knobs are for 1. Selective enablement 2. Selective control of the % of unmapped pages An alternative path is to enable KSM for page cache. Then we have direct read-only guest access to host page cache, without any guest modifications required. That will be pretty difficult to achieve though - will need a readonly bit in the page cache radix tree, and teach all paths to honour it. Yes, it is, I've taken a quick look. I am not sure if de-duplication would be the best approach, may be dropping the page in the page cache might be a good first step. Data consistency would be much easier to maintain that way, as long as the guest is not writing frequently to that page, we don't need the page cache in the host. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- Three Cheers, Balbir -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 15/18] KVM: MMU: Propagate the right fault back to the guest after gva_to_gpa
On 03/15/2010 11:06 AM, Joerg Roedel wrote: On Mon, Mar 15, 2010 at 09:36:52AM +0200, Avi Kivity wrote: On 03/03/2010 09:12 PM, Joerg Roedel wrote: This patch implements logic to make sure that either a page-fault/page-fault-vmexit or a nested-page-fault-vmexit is propagated back to the guest. Signed-off-by: Joerg Roedel joerg.roe...@amd.com --- arch/x86/kvm/mmu.h |1 + arch/x86/kvm/paging_tmpl.h |2 ++ arch/x86/kvm/x86.c | 15 ++- 3 files changed, 17 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 64f619b..b42b27e 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -47,6 +47,7 @@ #define PFERR_USER_MASK (1U << 2) #define PFERR_RSVD_MASK (1U << 3) #define PFERR_FETCH_MASK (1U << 4) +#define PFERR_NESTED_MASK (1U << 31) Why is this needed? Queue an ordinary page fault; the injection code should check the page fault intercept and #VMEXIT if needed. This is needed because we could have a nested page fault or an ordinary page fault which needs to be propagated. Right. Why is pio_copy_data() changed? One would think that it would be an all-or-nothing affair. -- error compiling committee.c: too many arguments to function
Re: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
On Mon, Mar 15, 2010 at 04:46:50PM +0800, Xin, Xiaohui wrote: +/* The structure to notify the virtqueue for async socket */ +struct vhost_notifier { + struct list_head list; + struct vhost_virtqueue *vq; + int head; + int size; + int log; + void *ctrl; + void (*dtor)(struct vhost_notifier *); +}; + So IMO, this is not the best interface between vhost and your driver, exposing them to each other unnecessarily. If you think about it, your driver should not care about this structure. It could get e.g. a kiocb (sendmsg already gets one), and call ki_dtor on completion. vhost could save its state in ki_user_data. If your driver needs to add more data for tracking, I think it can put an skb pointer in the private pointer. Then if I remove the struct vhost_notifier, and just use struct kiocb, but don't use the one got from sendmsg or recvmsg, but allocate one within the page_info structure, and don't implement any aio logic related to it, is that ok? Hmm, not sure I understand. It seems both cleaner and easier to use the iocb passed to sendmsg/recvmsg. No? I am not saying you necessarily must implement full aio directly. Sorry, I made a patch, but don't know how to reply to mail with a well-formatted patch here Thanks Xiaohui Maybe Documentation/email-clients.txt will help? Generally you do it like this (at start of mail): Subject: one line patch summary (overrides mail subject) multiline patch description Signed-off-by: ... --- Free text comes after the --- delimiter, before the patch. diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index a140dad..e830b30 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -106,22 +106,41 @@ static void handle_tx(struct vhost_net *net) -- MST
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On 03/15/2010 11:17 AM, Balbir Singh wrote: * Avi Kivitya...@redhat.com [2010-03-15 10:27:45]: On 03/15/2010 10:07 AM, Balbir Singh wrote: * Avi Kivitya...@redhat.com [2010-03-15 09:48:05]: On 03/15/2010 09:22 AM, Balbir Singh wrote: Selectively control Unmapped Page Cache (nospam version) From: Balbir Singhbal...@linux.vnet.ibm.com This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested for unmapped page control. This is useful in the following scenario - In a virtualized environment with cache!=none, we see double caching - (one in the host and one in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages. Well, for a guest, host page cache is a lot slower than guest page cache. Yes, it is a virtio call away, but is the cost of paying twice in terms of memory acceptable? Usually, it isn't, which is why I recommend cache=off. cache=off works for *direct I/O* supported filesystems and my concern is that one of the side-effects is that idle VM's can consume a lot of memory (assuming all the memory is available to them). As the number of VM's grow, they could cache a whole lot of memory. In my experiments I found that the total amount of memory cached far exceeded the mapped ratio by a large amount when we had idle VM's. The philosophy of this patch is to move the caching to the _host_ and let the host maintain the cache instead of the guest. That's only beneficial if the cache is shared. Otherwise, you could use the balloon to evict cache when memory is tight. Shared cache is mostly a desktop thing where users run similar workloads. For servers, it's much less likely. 
So a modified-guest doesn't help a lot here. One of the reasons I created a boot parameter was to deal with selective enablement for cases where memory is the most important resource being managed. I do see a hit in performance with my results (please see the data below), but the savings are quite large. The other solution mentioned in the TODOs is to have the balloon driver invoke this path. The sysctl also allows the guest to tune the amount of unmapped page cache if needed. The knobs are for 1. Selective enablement 2. Selective control of the % of unmapped pages An alternative path is to enable KSM for page cache. Then we have direct read-only guest access to host page cache, without any guest modifications required. That will be pretty difficult to achieve though - will need a readonly bit in the page cache radix tree, and teach all paths to honour it. Yes, it is, I've taken a quick look. I am not sure if de-duplication would be the best approach, may be dropping the page in the page cache might be a good first step. Data consistency would be much easier to maintain that way, as long as the guest is not writing frequently to that page, we don't need the page cache in the host. Trimming the host page cache should happen automatically under pressure. Since the page is cached by the guest, it won't be re-read, so the host page is not frequently used and then dropped. -- error compiling committee.c: too many arguments to function
[PATCH] KVM: Cleanup: change to use bool return values
Make use of bool return values, and remove some useless bool value conversions. Thanks to Avi for pointing this out. Signed-off-by: Gui Jianfeng guijianf...@cn.fujitsu.com --- arch/x86/kvm/vmx.c | 54 ++-- 1 files changed, 27 insertions(+), 27 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 06108f3..3ddcfc5 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -231,56 +231,56 @@ static const u32 vmx_msr_index[] = { }; #define NR_VMX_MSR ARRAY_SIZE(vmx_msr_index) -static inline int is_page_fault(u32 intr_info) +static inline bool is_page_fault(u32 intr_info) { return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK | INTR_INFO_VALID_MASK)) == (INTR_TYPE_HARD_EXCEPTION | PF_VECTOR | INTR_INFO_VALID_MASK); } -static inline int is_no_device(u32 intr_info) +static inline bool is_no_device(u32 intr_info) { return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK | INTR_INFO_VALID_MASK)) == (INTR_TYPE_HARD_EXCEPTION | NM_VECTOR | INTR_INFO_VALID_MASK); } -static inline int is_invalid_opcode(u32 intr_info) +static inline bool is_invalid_opcode(u32 intr_info) { return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK | INTR_INFO_VALID_MASK)) == (INTR_TYPE_HARD_EXCEPTION | UD_VECTOR | INTR_INFO_VALID_MASK); } -static inline int is_external_interrupt(u32 intr_info) +static inline bool is_external_interrupt(u32 intr_info) { return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK)) == (INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK); } -static inline int is_machine_check(u32 intr_info) +static inline bool is_machine_check(u32 intr_info) { return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK | INTR_INFO_VALID_MASK)) == (INTR_TYPE_HARD_EXCEPTION | MC_VECTOR | INTR_INFO_VALID_MASK); } -static inline int cpu_has_vmx_msr_bitmap(void) +static inline bool cpu_has_vmx_msr_bitmap(void) { return vmcs_config.cpu_based_exec_ctrl & CPU_BASED_USE_MSR_BITMAPS; } -static inline int cpu_has_vmx_tpr_shadow(void) +static inline bool cpu_has_vmx_tpr_shadow(void) { return vmcs_config.cpu_based_exec_ctrl & CPU_BASED_TPR_SHADOW; } -static inline int vm_need_tpr_shadow(struct kvm *kvm) +static inline bool vm_need_tpr_shadow(struct kvm *kvm) { return (cpu_has_vmx_tpr_shadow()) && (irqchip_in_kernel(kvm)); } -static inline int cpu_has_secondary_exec_ctrls(void) +static inline bool cpu_has_secondary_exec_ctrls(void) { return vmcs_config.cpu_based_exec_ctrl & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; @@ -300,80 +300,80 @@ static inline bool cpu_has_vmx_flexpriority(void) static inline bool cpu_has_vmx_ept_execute_only(void) { - return !!(vmx_capability.ept & VMX_EPT_EXECUTE_ONLY_BIT); + return vmx_capability.ept & VMX_EPT_EXECUTE_ONLY_BIT; } static inline bool cpu_has_vmx_eptp_uncacheable(void) { - return !!(vmx_capability.ept & VMX_EPTP_UC_BIT); + return vmx_capability.ept & VMX_EPTP_UC_BIT; } static inline bool cpu_has_vmx_eptp_writeback(void) { - return !!(vmx_capability.ept & VMX_EPTP_WB_BIT); + return vmx_capability.ept & VMX_EPTP_WB_BIT; } static inline bool cpu_has_vmx_ept_2m_page(void) { - return !!(vmx_capability.ept & VMX_EPT_2MB_PAGE_BIT); + return vmx_capability.ept & VMX_EPT_2MB_PAGE_BIT; } static inline bool cpu_has_vmx_ept_1g_page(void) { - return !!(vmx_capability.ept & VMX_EPT_1GB_PAGE_BIT); + return vmx_capability.ept & VMX_EPT_1GB_PAGE_BIT; } -static inline int cpu_has_vmx_invept_individual_addr(void) +static inline bool cpu_has_vmx_invept_individual_addr(void) { - return !!(vmx_capability.ept & VMX_EPT_EXTENT_INDIVIDUAL_BIT); + return vmx_capability.ept & VMX_EPT_EXTENT_INDIVIDUAL_BIT; } -static inline int cpu_has_vmx_invept_context(void) +static inline bool cpu_has_vmx_invept_context(void) { - return !!(vmx_capability.ept & VMX_EPT_EXTENT_CONTEXT_BIT); + return vmx_capability.ept & VMX_EPT_EXTENT_CONTEXT_BIT; } -static inline int cpu_has_vmx_invept_global(void) +static inline bool cpu_has_vmx_invept_global(void) { - return !!(vmx_capability.ept & VMX_EPT_EXTENT_GLOBAL_BIT); + return vmx_capability.ept & VMX_EPT_EXTENT_GLOBAL_BIT; } -static inline int cpu_has_vmx_ept(void) +static inline bool cpu_has_vmx_ept(void) { return vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_ENABLE_EPT; } -static inline int cpu_has_vmx_unrestricted_guest(void) +static inline bool cpu_has_vmx_unrestricted_guest(void) { return vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_UNRESTRICTED_GUEST; } -static inline int
Re: [PATCH 15/18] KVM: MMU: Propagate the right fault back to the guest after gva_to_gpa
On Mon, Mar 15, 2010 at 11:23:07AM +0200, Avi Kivity wrote: On 03/15/2010 11:06 AM, Joerg Roedel wrote: On Mon, Mar 15, 2010 at 09:36:52AM +0200, Avi Kivity wrote: On 03/03/2010 09:12 PM, Joerg Roedel wrote: This patch implements logic to make sure that either a page-fault/page-fault-vmexit or a nested-page-fault-vmexit is propagated back to the guest. Signed-off-by: Joerg Roedel joerg.roe...@amd.com --- arch/x86/kvm/mmu.h |1 + arch/x86/kvm/paging_tmpl.h |2 ++ arch/x86/kvm/x86.c | 15 ++- 3 files changed, 17 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 64f619b..b42b27e 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -47,6 +47,7 @@ #define PFERR_USER_MASK (1U << 2) #define PFERR_RSVD_MASK (1U << 3) #define PFERR_FETCH_MASK (1U << 4) +#define PFERR_NESTED_MASK (1U << 31) Why is this needed? Queue an ordinary page fault; the injection code should check the page fault intercept and #VMEXIT if needed. This is needed because we could have a nested page fault or an ordinary page fault which needs to be propagated. Right. Why is pio_copy_data() changed? One would think that it would be an all-or-nothing affair. It was the only place I found where the PROPAGATE_FAULT value was checked and actually propagated. Joerg
Re: [PATCH v2 28/30] KVM: x86 emulator: restart string instruction without going back to a guest.
On Mon, Mar 15, 2010 at 09:44:26AM +0200, Avi Kivity wrote: On 03/14/2010 08:06 PM, Gleb Natapov wrote: Suggest simply reentering every N executions. This restart mechanism is, in fact, needed for ins read ahead to work. After reading ahead from the IO port we need to avoid entering the decoder until the entire cache is consumed, otherwise the decoder will clear the cache and data will be lost. So we can't just enter the guest at arbitrary times, only when the read ahead cache is empty. Since read ahead is never done across a page boundary, this is a safe place to re-enter the guest. Please make the two depend on each other directly then. We can't expect the reader of the emulator code to know that. We can document that. I wouldn't want to have different conditions for guest re-entry for different opcodes. Have the emulator ask the buffer when it is empty. It will always be empty for all string ops except INS. -- Gleb.
Re: [PATCH v2 28/30] KVM: x86 emulator: restart string instruction without going back to a guest.
On 03/15/2010 11:44 AM, Gleb Natapov wrote: On Mon, Mar 15, 2010 at 09:44:26AM +0200, Avi Kivity wrote: On 03/14/2010 08:06 PM, Gleb Natapov wrote: Suggest simply reentering every N executions. This restart mechanism is, in fact, needed for ins read ahead to work. After reading ahead from the IO port we need to avoid entering the decoder until the entire cache is consumed, otherwise the decoder will clear the cache and data will be lost. So we can't just enter the guest at arbitrary times, only when the read ahead cache is empty. Since read ahead is never done across a page boundary, this is a safe place to re-enter the guest. Please make the two depend on each other directly then. We can't expect the reader of the emulator code to know that. We can document that. I wouldn't want to have different conditions for guest re-entry for different opcodes. We now have a write buffer size of one. It's just a matter of making the emulator know the size of the buffer (extra parameter to ->write_emulated). Have the emulator ask the buffer when it is empty. It will always be empty for all string ops except INS. Or we can make the buffer larger for everyone (outside this patchset though). -- error compiling committee.c: too many arguments to function
Re: [PATCH v2 28/30] KVM: x86 emulator: restart string instruction without going back to a guest.
On Mon, Mar 15, 2010 at 11:56:32AM +0200, Avi Kivity wrote: On 03/15/2010 11:44 AM, Gleb Natapov wrote: On Mon, Mar 15, 2010 at 09:44:26AM +0200, Avi Kivity wrote: On 03/14/2010 08:06 PM, Gleb Natapov wrote: Suggest simply reentering every N executions. This restart mechanism is, in fact, needed for ins read ahead to work. After reading ahead from the IO port we need to avoid entering the decoder until the entire cache is consumed, otherwise the decoder will clear the cache and data will be lost. So we can't just enter the guest at arbitrary times, only when the read ahead cache is empty. Since read ahead is never done across a page boundary, this is a safe place to re-enter the guest. Please make the two depend on each other directly then. We can't expect the reader of the emulator code to know that. We can document that. I wouldn't want to have different conditions for guest re-entry for different opcodes. We now have a write buffer size of one. It's just a matter of making the emulator know the size of the buffer (extra parameter to ->write_emulated). The buffer is maintained inside the emulator, so the emulator knows about it and can check it, but then for all other string instructions except INS we will re-enter the guest on each iteration. Have the emulator ask the buffer when it is empty. It will always be empty for all string ops except INS. Or we can make the buffer larger for everyone (outside this patchset though). I am not sure what you mean here. INS read ahead and the MMIO read cache are different beasts. The former is needed to speed up string pio reads, the latter (not yet implemented) is needed to reread previous MMIO read results in case instruction emulation is restarted due to the need to exit to userspace. The MMIO read cache needs to be invalidated on each iteration of a string instruction. -- Gleb.
Re: [PATCH v2 28/30] KVM: x86 emulator: restart string instruction without going back to a guest.
On 03/15/2010 12:07 PM, Gleb Natapov wrote: Or we can make the buffer larger for everyone (outside this patchset though). I am not sure what do you mean here. INS read ahead and MMIO read cache are different beasts. Former is needed to speed-up string pio reads, later (not yet implemented) is needed to reread previous MMIO read results in case instruction emulation is restarted due to need to exit to userspace. MMIO read cache need to be invalidated on each iteration of string instruction. Instructions with multiple reads or writes need an mmio read/write buffer that can be replayed on re-execution. buffer != cache! A cache can be dropped (perhaps after flushing it to a backing store), but a buffer in general cannot. -- error compiling committee.c: too many arguments to function
Re: [PATCH 0/5] Fix some mmu/emulator atomicity issues (v2)
On Sun, Mar 14, 2010 at 09:03:47AM +0200, Avi Kivity wrote: On 03/10/2010 04:50 PM, Avi Kivity wrote: Currently when we emulate a locked operation into a shadowed guest page table, we perform a write rather than a true atomic. This is indicated by the emulating exchange as write message that shows up in dmesg. In addition, the pte prefetch operation during invlpg suffered from a race. This was fixed by removing the operation. This patchset fixes both issues and reinstates pte prefetch on invlpg. v2: - fix truncated description for patch 1 - add new patch 4, which fixes a bug in patch 5 No comments, but looks like last week's maintainer neglected to merge this. Looks fine. Can you please regenerate against next branch? (just pushed). For the invlpg prefetch it would be good to confirm the original bug is not reproducible.
Re: [PATCH v2 28/30] KVM: x86 emulator: restart string instruction without going back to a guest.
On Mon, Mar 15, 2010 at 12:15:22PM +0200, Avi Kivity wrote: On 03/15/2010 12:07 PM, Gleb Natapov wrote: Or we can make the buffer larger for everyone (outside this patchset though). I am not sure what you mean here. INS read ahead and the MMIO read cache are different beasts. The former is needed to speed up string pio reads, the latter (not yet implemented) is needed to reread previous MMIO read results in case instruction emulation is restarted due to the need to exit to userspace. The MMIO read cache needs to be invalidated on each iteration of a string instruction. Instructions with multiple reads or writes need an mmio read/write buffer that can be replayed on re-execution. buffer != cache! A cache can be dropped (perhaps after flushing it to a backing store), but a buffer in general cannot. That is just naming. Call it a buffer if you want. I still don't understand what you mean by 'Or we can make the buffer larger for everyone'. Who is this everyone? Different instructions need different kinds of buffers. -- Gleb.
Re: [PATCH v2 28/30] KVM: x86 emulator: restart string instruction without going back to a guest.
On 03/15/2010 12:19 PM, Gleb Natapov wrote: On Mon, Mar 15, 2010 at 12:15:22PM +0200, Avi Kivity wrote: On 03/15/2010 12:07 PM, Gleb Natapov wrote: Or we can make the buffer larger for everyone (outside this patchset though). I am not sure what you mean here. INS read ahead and the MMIO read cache are different beasts. The former is needed to speed up string pio reads, the latter (not yet implemented) is needed to reread previous MMIO read results in case instruction emulation is restarted due to the need to exit to userspace. The MMIO read cache needs to be invalidated on each iteration of a string instruction. Instructions with multiple reads or writes need an mmio read/write buffer that can be replayed on re-execution. buffer != cache! A cache can be dropped (perhaps after flushing it to a backing store), but a buffer in general cannot. That is just naming. Call it a buffer if you want. I still don't understand what you mean by 'Or we can make the buffer larger for everyone'. Who is this everyone? Different instructions need different kinds of buffers. Many instructions can issue multiple reads, ins is just one of them. A generic mmio buffer can be used by everyone. -- error compiling committee.c: too many arguments to function
Re: [PATCH v2 28/30] KVM: x86 emulator: restart string instruction without going back to a guest.
On Mon, Mar 15, 2010 at 12:24:43PM +0200, Avi Kivity wrote: On 03/15/2010 12:19 PM, Gleb Natapov wrote: On Mon, Mar 15, 2010 at 12:15:22PM +0200, Avi Kivity wrote: On 03/15/2010 12:07 PM, Gleb Natapov wrote: Or we can make the buffer larger for everyone (outside this patchset though). I am not sure what you mean here. INS read ahead and the MMIO read cache are different beasts. The former is needed to speed up string pio reads, the latter (not yet implemented) is needed to reread previous MMIO read results in case instruction emulation is restarted due to the need to exit to userspace. The MMIO read cache needs to be invalidated on each iteration of a string instruction. Instructions with multiple reads or writes need an mmio read/write buffer that can be replayed on re-execution. buffer != cache! A cache can be dropped (perhaps after flushing it to a backing store), but a buffer in general cannot. That is just naming. Call it a buffer if you want. I still don't understand what you mean by 'Or we can make the buffer larger for everyone'. Who is this everyone? Different instructions need different kinds of buffers. Many instructions can issue multiple reads, ins is just one of them. A generic mmio buffer can be used by everyone. No, ins can issue only _one_ io read during one iteration (i.e. between each pair of reads there is a commit point). But this is slow, so we do a non-architectural hack: do many reads ahead of time into a buffer and use results from this buffer for emulation of subsequent iterations. Other instructions can do multiple reads between instruction fetch and commit of the emulation result, and need a different kind of buffering (actually caching is more appropriate here, since we cache results of reads from past attempts to emulate the same instruction). -- Gleb.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
* Avi Kivity a...@redhat.com [2010-03-15 11:27:56]: The knobs are for 1. Selective enablement 2. Selective control of the % of unmapped pages An alternative path is to enable KSM for page cache. Then we have direct read-only guest access to host page cache, without any guest modifications required. That will be pretty difficult to achieve though - will need a readonly bit in the page cache radix tree, and teach all paths to honour it. Yes, it is, I've taken a quick look. I am not sure if de-duplication would be the best approach, may be dropping the page in the page cache might be a good first step. Data consistency would be much easier to maintain that way, as long as the guest is not writing frequently to that page, we don't need the page cache in the host. Trimming the host page cache should happen automatically under pressure. Since the page is cached by the guest, it won't be re-read, so the host page is not frequently used and then dropped. Yes, agreed, but dropping is easier than tagging cache as read-only and getting everybody to understand read-only cached pages. -- Three Cheers, Balbir
Re: [RFC] Moving dirty bitmaps to userspace - Double buffering approach
Avi Kivity wrote: On 03/15/2010 10:33 AM, Marcelo Tosatti wrote: Are there any good ways to solve this kind of problem? You can introduce a new get_dirty_log ioctl that passes the address of the next bitmap in userspace, and use it (after pinning with get_user_pages), instead of vmalloc'ing. Thank you for your advice! No pinning please, put_user_bit() or set_bit_user(). (can be implemented generically using get_user_pages() and kmap_atomic(), but x86 should get an optimized implementation) Given your advice last time, I started this with my colleague. -- We were just talking about how to handle all the architectures. As per your comment, we'll make the generic implementation, with an optimized one for x86, first. Thanks Takuya
Re: [PATCH v2 06/30] KVM: remove realmode_lmsw function.
Gleb Natapov wrote: Use (get|set)_cr callback to emulate lmsw inside emulator. I see that vmx.c:handle_cr() is the only other user of kvm_lmsw(). If we fix this place similarly to what you did below, we could get rid of kvm_lmsw() entirely. But I am not sure whether it's OK to remove an exported symbol. Regards, Andre. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_host.h |2 -- arch/x86/kvm/emulate.c |4 ++-- arch/x86/kvm/x86.c |7 --- 3 files changed, 2 insertions(+), 11 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index e8e108a..1e15a0a 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -582,8 +582,6 @@ int emulate_instruction(struct kvm_vcpu *vcpu, void kvm_report_emulation_failure(struct kvm_vcpu *cvpu, const char *context); void realmode_lgdt(struct kvm_vcpu *vcpu, u16 size, unsigned long address); void realmode_lidt(struct kvm_vcpu *vcpu, u16 size, unsigned long address); -void realmode_lmsw(struct kvm_vcpu *vcpu, unsigned long msw, - unsigned long *rflags); void kvm_enable_efer_bits(u64); int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data); diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 5b060e4..5e2fa61 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2486,8 +2486,8 @@ twobyte_insn: c->dst.val = ops->get_cr(0, ctxt->vcpu); break; case 6: /* lmsw */ - realmode_lmsw(ctxt->vcpu, (u16)c->src.val, - &ctxt->eflags); + ops->set_cr(0, (ops->get_cr(0, ctxt->vcpu) & ~0x0ful) | + (c->src.val & 0x0f), ctxt->vcpu); c->dst.type = OP_NONE; break; case 7: /* invlpg*/ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index bf714df..b08f8a1 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4045,13 +4045,6 @@ void realmode_lidt(struct kvm_vcpu *vcpu, u16 limit, unsigned long base) kvm_x86_ops->set_idt(vcpu, &dt); } -void realmode_lmsw(struct kvm_vcpu *vcpu, unsigned long msw, - unsigned long *rflags) -{ - kvm_lmsw(vcpu, msw); -
*rflags = kvm_get_rflags(vcpu); -} - static int move_to_next_stateful_cpuid_entry(struct kvm_vcpu *vcpu, int i) { struct kvm_cpuid_entry2 *e = &vcpu->arch.cpuid_entries[i]; -- Andre Przywara AMD-OSRC (Dresden) Tel: x29712
Re: [long] MINIX 3.1.6 works in QEMU-0.12.3 only with KVM disabled
Avi Kivity wrote on 2010-03-10 13:03:25 +0200: On 03/10/2010 12:26 PM, Erik van der Kouwe wrote: I submitted this bug report a week ago: http://sourceforge.net/tracker/?func=detail&aid=2962575&group_id=180599&atid=893831 MINIX is using big real mode which is currently not well supported by kvm on Intel hardware: (qemu) info registers EAX=0010 EBX=0009 ECX=4920 EDX=a796 ESI=0200 EDI=49200200 EBP=0009 ESP=a762 EIP=f4a7 EFL=00023002 [---] CPL=3 II=0 A20=1 SMM=0 HLT=0 ES = f300 CS =f000 000f f300 SS =9492 00094920 f300 DS =97ce 00097cec f300 A ds.base of 0x97cec cannot be translated to a real mode segment. There is some work to get this to work, but it is proceeding really slowly. It should work on AMD hardware though. Hi guys, I searched the issue, and Erik was kind enough to point me to this list where there are knowledgeable people. Erik van der Kouwe wrote in http://groups.google.com/group/minix3/msg/40f44df0c434cfa6: The situation is as follows: The boot monitor runs in real-address mode, but has to copy parts of the boot image into high memory (>= 1 MB) which is not accessible from that mode as only 20 bits are available. It calls the BIOS (int 0x15) to perform the copy. This is done under the ext_copy label in boot/boothead.s. Okay. It is my understanding this is where Minix' involvement stops. The BIOS switches to protected mode, loading a GDT which it receives from the caller. Before returning to the caller, it copies data using the segment descriptors in the GDT and switches back to real-address mode. This is the description of BIOS service 15/87, which has to be implemented (using whatever solution it pleases) by the BIOS. When doing the switch, the cached segment selectors are preserved, which allows one to use protected mode segments in real-address mode (this is called unreal mode). Now this is a by-product of the implementation inside the BIOS.
In fact, even if the BIOS enters unreal mode (or the similar big real, more useful with segmentation-less architectures), before turning back to the client it (should) reset things to normal real mode, as service 15/87 is not a usual way to enter unreal mode (for example, this effect is not even mentioned in Ralf Brown's list). As a result (and, first and foremost, because of 80286 compatibility), instead of directly using unreal or big real mode if possible (as done e.g. in himem.sys), the Minix monitor goes to the great pain of going back to square #1, and since blocks are at most 64 KB in size and several iterations are needed, on the next block Minix sets up the (very similar) GDT then does another call to the same BIOS service 15/87. I knew these parts before, but this is where Avi's answer came in: KVM on Intel does not yet support unreal mode and requires the cached segment descriptors to be valid in real-address mode. I do not know which virtual BIOS KVM is using, but I notice while reading http://bochs.sourceforge.net/cgi-bin/lxr/source/bios/rombios.c: [ Slightly edited to fit the width of my post. AL. ] 3555 case 0x87: 3556 #if BX_CPU < 3 3557 # error Int15 function 87h not supported on 80386 3558 #endif 3559 // +++ should probably have descriptor checks 3560 // +++ should have exception handlers ... 3640 mov eax, cr0 3641 or al, #0x01 3642 mov cr0, eax 3643 ;; far jump to flush CPU queue after transition to prot.
mode 3644 JMP_AP(0x0020, protected_mode) 3645 3646 protected_mode: 3647 ;; GDT points to valid descriptor table, now load SS, DS, ES 3648 mov ax, #0x28 ;; 101 000 = 5th desc.in table, TI=GDT,RPL=00 3649 mov ss, ax 3650 mov ax, #0x10 ;; 010 000 = 2nd desc.in table, TI=GDT,RPL=00 3651 mov ds, ax 3652 mov ax, #0x18 ;; 011 000 = 3rd desc.in table, TI=GDT,RPL=00 3653 mov es, ax 3654 xor si, si 3655 xor di, di 3656 cld 3657 rep 3658 movsw ;; move CX words from DS:SI to ES:DI 3659 3660 ;; make sure DS and ES limits are 64KB 3661 mov ax, #0x28 3662 mov ds, ax 3663 mov es, ax 3664 3665 ;; reset PG bit in CR0 ??? 3666 mov eax, cr0 3667 and al, #0xFE 3668 mov cr0, eax I should be loosing something here... There is no unreal mode at any moment, is it? [ ... some web browsing occuring meanwhile ... Later: ] Okay, now I got another picture. 8-| Until recently, KVM (and qemu) used Bochs BIOS, showed above; but they switched recently to SeaBIOS... where the applicable code is in src/system.c, and looks like (now this is ATT assembly): 83 static void 84 handle_1587(struct bregs *regs) 85 { 86 // +++ should probably have descriptor checks 87 // +++ should have exception handlers 127 // Enable protected mode 128 movl %%cr0, %%eax\n 129 orl $ __stringify(CR0_PE)
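The core of the "big real mode" problem above is simple arithmetic: a real-mode segment register can only describe bases of the form `selector << 4`. A minimal sketch of that constraint (the function names here are illustrative, not KVM's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A real-mode segment selector yields base = selector << 4, so only
 * 16-byte-aligned bases up to 0xffff0 are representable. This is why a
 * cached ds.base of 0x97cec (from the register dump above) cannot be
 * translated back to a real-mode segment, while the ss.base of 0x94920
 * can. */
static bool real_mode_representable(uint32_t base)
{
    return (base & 0xf) == 0 && base <= 0xffff0u;
}

static uint16_t base_to_selector(uint32_t base)
{
    /* Only meaningful when real_mode_representable(base) holds. */
    return (uint16_t)(base >> 4);
}
```

With the values from the `info registers` dump: SS's base 0x94920 maps back to selector 0x9492, but DS's base 0x97cec has a nonzero low nibble and is unrepresentable, which is exactly the state KVM on Intel refuses to virtualize.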
Re: [PATCH v2 06/30] KVM: remove realmode_lmsw function.
On 03/15/2010 01:02 PM, Andre Przywara wrote: Gleb Natapov wrote: Use (get|set)_cr callback to emulate lmsw inside emulator. I see that vmx.c:handle_cr() is the only other user of kvm_lmsw(). If we fix this place similarly to what you did below, we could get rid of kvm_lmsw() entirely. But I am not sure whether it's OK to remove an exported symbol. Exported symbols can be changed or removed at will. -- error compiling committee.c: too many arguments to function
Re: [PATCH 0/5] Fix some mmu/emulator atomicity issues (v2)
On 03/15/2010 12:16 PM, Marcelo Tosatti wrote: On Sun, Mar 14, 2010 at 09:03:47AM +0200, Avi Kivity wrote: On 03/10/2010 04:50 PM, Avi Kivity wrote: Currently when we emulate a locked operation into a shadowed guest page table, we perform a write rather than a true atomic. This is indicated by the emulating exchange as write message that shows up in dmesg. In addition, the pte prefetch operation during invlpg suffered from a race. This was fixed by removing the operation. This patchset fixes both issues and reinstates pte prefetch on invlpg. v2: - fix truncated description for patch 1 - add new patch 4, which fixes a bug in patch 5 No comments, but looks like last week's maintainer neglected to merge this. Looks fine. Can you please regenerate against next branch? (just pushed). Will send out shortly. For the invlpg prefetch it would be good to confirm the original bug is not reproducible. I tried to reproduce the problem with the original revert reverted, but couldn't. -- error compiling committee.c: too many arguments to function
[PATCH 5/5] KVM: MMU: Reinstate pte prefetch on invlpg
Commit fb341f57 removed the pte prefetch on guest invlpg, citing guest races. However, the SDM is adamant that prefetch is allowed: "The processor may create entries in paging-structure caches for translations required for prefetches and for accesses that are a result of speculative execution that would never actually occur in the executed code path." And, in fact, there was a race in the prefetch code: we picked up the pte without the mmu lock held, so an older invlpg could install the pte over a newer invlpg. Reinstate the prefetch logic, but this time note whether another invlpg has executed using a counter. If a race occurred, do not install the pte. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/kvm/mmu.c | 37 +++-- arch/x86/kvm/paging_tmpl.h | 15 +++ 3 files changed, 39 insertions(+), 14 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index ea1b6c6..28826c8 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -389,6 +389,7 @@ struct kvm_arch { unsigned int n_free_mmu_pages; unsigned int n_requested_mmu_pages; unsigned int n_alloc_mmu_pages; + atomic_t invlpg_counter; struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES]; /* * Hash table of struct kvm_mmu_page.
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index f63c9ad..b3edc46 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2609,20 +2609,11 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, int flooded = 0; int npte; int r; + int invlpg_counter; pgprintk(%s: gpa %llx bytes %d\n, __func__, gpa, bytes); - switch (bytes) { - case 4: - gentry = *(const u32 *)new; - break; - case 8: - gentry = *(const u64 *)new; - break; - default: - gentry = 0; - break; - } + invlpg_counter = atomic_read(vcpu-kvm-arch.invlpg_counter); /* * Assume that the pte write on a page table of the same type @@ -2630,16 +2621,34 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, * (might be false while changing modes). Note it is verified later * by update_pte(). */ - if (is_pae(vcpu) bytes == 4) { + if ((is_pae(vcpu) bytes == 4) || !new) { /* Handle a 32-bit guest writing two halves of a 64-bit gpte */ - gpa = ~(gpa_t)7; - r = kvm_read_guest(vcpu-kvm, gpa, gentry, 8); + if (is_pae(vcpu)) { + gpa = ~(gpa_t)7; + bytes = 8; + } + r = kvm_read_guest(vcpu-kvm, gpa, gentry, min(bytes, 8)); if (r) gentry = 0; + new = (const u8 *)gentry; + } + + switch (bytes) { + case 4: + gentry = *(const u32 *)new; + break; + case 8: + gentry = *(const u64 *)new; + break; + default: + gentry = 0; + break; } mmu_guess_page_from_pte_write(vcpu, gpa, gentry); spin_lock(vcpu-kvm-mmu_lock); + if (atomic_read(vcpu-kvm-arch.invlpg_counter) != invlpg_counter) + gentry = 0; kvm_mmu_access_page(vcpu, gfn); kvm_mmu_free_some_pages(vcpu); ++vcpu-kvm-stat.mmu_pte_write; diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 4b37e1a..067797a 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -463,6 +463,7 @@ out_unlock: static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva) { struct kvm_shadow_walk_iterator iterator; + gpa_t pte_gpa = -1; int level; u64 *sptep; int need_flush = 0; @@ -476,6 +477,10 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, 
gva_t gva) if (level == PT_PAGE_TABLE_LEVEL || ((level == PT_DIRECTORY_LEVEL is_large_pte(*sptep))) || ((level == PT_PDPE_LEVEL is_large_pte(*sptep { + struct kvm_mmu_page *sp = page_header(__pa(sptep)); + + pte_gpa = (sp-gfn PAGE_SHIFT); + pte_gpa += (sptep - sp-spt) * sizeof(pt_element_t); if (is_shadow_present_pte(*sptep)) { rmap_remove(vcpu-kvm, sptep); @@ -493,7 +498,17 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva) if (need_flush) kvm_flush_remote_tlbs(vcpu-kvm); + + atomic_inc(vcpu-kvm-arch.invlpg_counter); + spin_unlock(vcpu-kvm-mmu_lock); + + if (pte_gpa == -1) + return; + + if (mmu_topup_memory_caches(vcpu)) + return;
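The counter logic in the (operator-mangled) diff above boils down to an optimistic-concurrency pattern: sample `invlpg_counter` before doing work outside `mmu_lock`, have every emulated invlpg bump it, and discard the speculated gpte if the counter moved. A userspace sketch of that pattern, with illustrative names rather than the kernel's:

```c
#include <assert.h>

/* Sketch of the race detection the patch adds: kvm_mmu_pte_write() samples
 * the counter before working outside mmu_lock; FNAME(invlpg) bumps it;
 * after retaking the lock, the writer drops its pte if the counter moved. */
struct mmu_state {
    int invlpg_counter;                  /* atomic_t in the real code */
};

static void emulated_invlpg(struct mmu_state *s)
{
    s->invlpg_counter++;                 /* atomic_inc() in the patch */
}

/* Returns the gentry to install, or 0 when a concurrent invlpg ran. */
static unsigned long long commit_pte_write(const struct mmu_state *s,
                                           int counter_at_start,
                                           unsigned long long gentry)
{
    if (s->invlpg_counter != counter_at_start)
        return 0;                        /* lost the race: drop the pte */
    return gentry;
}

static int demo_no_race(void)
{
    struct mmu_state s = { 0 };
    int snap = s.invlpg_counter;
    return commit_pte_write(&s, snap, 0xabcULL) == 0xabcULL;
}

static int demo_race(void)
{
    struct mmu_state s = { 0 };
    int snap = s.invlpg_counter;
    emulated_invlpg(&s);                 /* invlpg slips in between */
    return commit_pte_write(&s, snap, 0xabcULL) == 0;
}
```

Installing `gentry = 0` on a detected race is safe because a zero gentry simply skips the speculative update; the next real fault re-walks the guest page table.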
[PATCH 2/5] KVM: Make locked operations truly atomic
Once upon a time, locked operations were emulated while holding the mmu mutex. Since mmu pages were write protected, it was safe to emulate the writes in a non-atomic manner, since there could be no other writer, either in the guest or in the kernel. These days emulation takes place without holding the mmu spinlock, so the write could be preempted by an unshadowing event, which exposes the page to writes by the guest. This may cause corruption of guest page tables. Fix by using an atomic cmpxchg for these operations. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/x86.c | 69 1 files changed, 48 insertions(+), 21 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 9d02cc7..d724a52 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3299,41 +3299,68 @@ int emulator_write_emulated(unsigned long addr, } EXPORT_SYMBOL_GPL(emulator_write_emulated); +#define CMPXCHG_TYPE(t, ptr, old, new) \ + (cmpxchg((t *)(ptr), *(t *)(old), *(t *)(new)) == *(t *)(old)) + +#ifdef CONFIG_X86_64 +# define CMPXCHG64(ptr, old, new) CMPXCHG_TYPE(u64, ptr, old, new) +#else +# define CMPXCHG64(ptr, old, new) \ + (cmpxchg64((u64 *)(ptr), *(u64 *)(old), *(u *)(new)) == *(u64 *)(old)) +#endif + static int emulator_cmpxchg_emulated(unsigned long addr, const void *old, const void *new, unsigned int bytes, struct kvm_vcpu *vcpu) { - printk_once(KERN_WARNING kvm: emulating exchange as write\n); -#ifndef CONFIG_X86_64 - /* guests cmpxchg8b have to be emulated atomically */ - if (bytes == 8) { - gpa_t gpa; - struct page *page; - char *kaddr; - u64 val; + gpa_t gpa; + struct page *page; + char *kaddr; + bool exchanged; - gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, NULL); + /* guests cmpxchg8b have to be emulated atomically */ + if (bytes 8 || (bytes (bytes - 1))) + goto emul_write; - if (gpa == UNMAPPED_GVA || - (gpa PAGE_MASK) == APIC_DEFAULT_PHYS_BASE) - goto emul_write; + gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, NULL); - if (((gpa + bytes - 1) PAGE_MASK) != 
(gpa PAGE_MASK)) - goto emul_write; + if (gpa == UNMAPPED_GVA || + (gpa PAGE_MASK) == APIC_DEFAULT_PHYS_BASE) + goto emul_write; - val = *(u64 *)new; + if (((gpa + bytes - 1) PAGE_MASK) != (gpa PAGE_MASK)) + goto emul_write; - page = gfn_to_page(vcpu-kvm, gpa PAGE_SHIFT); + page = gfn_to_page(vcpu-kvm, gpa PAGE_SHIFT); - kaddr = kmap_atomic(page, KM_USER0); - set_64bit((u64 *)(kaddr + offset_in_page(gpa)), val); - kunmap_atomic(kaddr, KM_USER0); - kvm_release_page_dirty(page); + kaddr = kmap_atomic(page, KM_USER0); + kaddr += offset_in_page(gpa); + switch (bytes) { + case 1: + exchanged = CMPXCHG_TYPE(u8, kaddr, old, new); + break; + case 2: + exchanged = CMPXCHG_TYPE(u16, kaddr, old, new); + break; + case 4: + exchanged = CMPXCHG_TYPE(u32, kaddr, old, new); + break; + case 8: + exchanged = CMPXCHG64(kaddr, old, new); + break; + default: + BUG(); } + kunmap_atomic(kaddr, KM_USER0); + kvm_release_page_dirty(page); + + if (!exchanged) + return X86EMUL_CMPXCHG_FAILED; + emul_write: -#endif + printk_once(KERN_WARNING kvm: emulating exchange as write\n); return emulator_write_emulated(addr, new, bytes, vcpu); } -- 1.7.0.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
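The archived diff above lost its operators; the guard reads `if (bytes > 8 || (bytes & (bytes - 1))) goto emul_write;` in the original, i.e. only power-of-two sizes up to 8 take the atomic path. A userspace sketch of the same size-dispatched compare-and-swap, using the GCC/Clang `__atomic` builtins in place of the kernel's `cmpxchg()`/`cmpxchg64()` on a kmapped guest page:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Sketch of emulator_cmpxchg_emulated()'s core: dispatch on operand size,
 * reject anything that is not a power of two <= 8 (the kernel falls back
 * to a plain emulated write in that case, with the dmesg warning). */
static bool cmpxchg_bytes(void *ptr, const void *old, const void *new_val,
                          unsigned bytes)
{
    if (bytes > 8 || (bytes & (bytes - 1)))
        return false;                         /* would go to emul_write */
    switch (bytes) {
    case 1: { uint8_t  o, n; memcpy(&o, old, 1); memcpy(&n, new_val, 1);
              return __atomic_compare_exchange_n((uint8_t *)ptr, &o, n,
                        false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); }
    case 2: { uint16_t o, n; memcpy(&o, old, 2); memcpy(&n, new_val, 2);
              return __atomic_compare_exchange_n((uint16_t *)ptr, &o, n,
                        false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); }
    case 4: { uint32_t o, n; memcpy(&o, old, 4); memcpy(&n, new_val, 4);
              return __atomic_compare_exchange_n((uint32_t *)ptr, &o, n,
                        false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); }
    case 8: { uint64_t o, n; memcpy(&o, old, 8); memcpy(&n, new_val, 8);
              return __atomic_compare_exchange_n((uint64_t *)ptr, &o, n,
                        false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); }
    }
    return false;                             /* bytes == 0 */
}
```

A failed compare here maps to the patch's `X86EMUL_CMPXCHG_FAILED`, which lets the emulator re-run the guest's locked instruction instead of silently writing.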
[PATCH 4/5] KVM: MMU: Do not instantiate nontrapping spte on unsync page
The update_pte() path currently uses a nontrapping spte when a nonpresent (or nonaccessed) gpte is written. This is fine since at present it is only used on sync pages. However, on an unsync page this will cause an endless fault loop as the guest is under no obligation to invlpg a gpte that transitions from nonpresent to present. Needed for the next patch which reinstates update_pte() on invlpg. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/paging_tmpl.h | 10 -- 1 files changed, 8 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 81eab9a..4b37e1a 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -258,11 +258,17 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *page, pt_element_t gpte; unsigned pte_access; pfn_t pfn; + u64 new_spte; gpte = *(const pt_element_t *)pte; if (~gpte & (PT_PRESENT_MASK | PT_ACCESSED_MASK)) { - if (!is_present_gpte(gpte)) - __set_spte(spte, shadow_notrap_nonpresent_pte); + if (!is_present_gpte(gpte)) { + if (page->unsync) + new_spte = shadow_trap_nonpresent_pte; + else + new_spte = shadow_notrap_nonpresent_pte; + __set_spte(spte, new_spte); + } return; } pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte); -- 1.7.0.2
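The decision the patch adds can be stated in a few lines. The encodings below are illustrative placeholders, not the real vendor-dependent spte values; what matters is the rule:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative encodings only. "trap" forces a vmexit so KVM revisits the
 * gpte; "notrap" reflects the fault straight to the guest with no exit. */
#define SHADOW_TRAP_NONPRESENT   0x0ULL
#define SHADOW_NOTRAP_NONPRESENT 0x7ULL

/* The patch's rule: an unsync page must keep the trapping encoding, because
 * the guest may make the gpte present without ever executing invlpg, and a
 * notrap spte would then loop forever re-delivering the stale fault. */
static uint64_t nonpresent_spte_for(bool page_unsync)
{
    return page_unsync ? SHADOW_TRAP_NONPRESENT
                       : SHADOW_NOTRAP_NONPRESENT;
}
```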
[PATCH 3/5] KVM: Don't follow an atomic operation by a non-atomic one
Currently emulated atomic operations are immediately followed by a non-atomic operation, so that kvm_mmu_pte_write() can be invoked. This updates the mmu but undoes the whole point of doing things atomically. Fix by only performing the atomic operation and the mmu update, and avoiding the non-atomic write. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/x86.c | 21 +++-- 1 files changed, 15 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index d724a52..2c0f632 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3227,7 +3227,8 @@ static int emulator_write_emulated_onepage(unsigned long addr, const void *val, unsigned int bytes, struct kvm_vcpu *vcpu, - bool guest_initiated) + bool guest_initiated, + bool mmu_only) { gpa_t gpa; u32 error_code; @@ -3247,6 +3248,10 @@ static int emulator_write_emulated_onepage(unsigned long addr, if ((gpa PAGE_MASK) == APIC_DEFAULT_PHYS_BASE) goto mmio; + if (mmu_only) { + kvm_mmu_pte_write(vcpu, gpa, val, bytes, 1); + return X86EMUL_CONTINUE; + } if (emulator_write_phys(vcpu, gpa, val, bytes)) return X86EMUL_CONTINUE; @@ -3271,7 +3276,8 @@ int __emulator_write_emulated(unsigned long addr, const void *val, unsigned int bytes, struct kvm_vcpu *vcpu, - bool guest_initiated) + bool guest_initiated, + bool mmu_only) { /* Crossing a page boundary? 
*/ if (((addr + bytes - 1) ^ addr) PAGE_MASK) { @@ -3279,7 +3285,7 @@ int __emulator_write_emulated(unsigned long addr, now = -addr ~PAGE_MASK; rc = emulator_write_emulated_onepage(addr, val, now, vcpu, -guest_initiated); +guest_initiated, mmu_only); if (rc != X86EMUL_CONTINUE) return rc; addr += now; @@ -3287,7 +3293,7 @@ int __emulator_write_emulated(unsigned long addr, bytes -= now; } return emulator_write_emulated_onepage(addr, val, bytes, vcpu, - guest_initiated); + guest_initiated, mmu_only); } int emulator_write_emulated(unsigned long addr, @@ -3295,7 +3301,7 @@ int emulator_write_emulated(unsigned long addr, unsigned int bytes, struct kvm_vcpu *vcpu) { - return __emulator_write_emulated(addr, val, bytes, vcpu, true); + return __emulator_write_emulated(addr, val, bytes, vcpu, true, false); } EXPORT_SYMBOL_GPL(emulator_write_emulated); @@ -3359,6 +3365,8 @@ static int emulator_cmpxchg_emulated(unsigned long addr, if (!exchanged) return X86EMUL_CMPXCHG_FAILED; + return __emulator_write_emulated(addr, new, bytes, vcpu, true, true); + emul_write: printk_once(KERN_WARNING kvm: emulating exchange as write\n); @@ -4013,7 +4021,8 @@ int kvm_fix_hypercall(struct kvm_vcpu *vcpu) kvm_x86_ops-patch_hypercall(vcpu, instruction); - return __emulator_write_emulated(rip, instruction, 3, vcpu, false); + return __emulator_write_emulated(rip, instruction, 3, vcpu, +false, false); } static u64 mk_cr_64(u64 curr_cr, u32 new_val) -- 1.7.0.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/5] KVM: MMU: Consolidate two guest pte reads in kvm_mmu_pte_write()
kvm_mmu_pte_write() reads guest ptes in two different occasions, both to allow a 32-bit pae guest to update a pte with 4-byte writes. Consolidate these into a single read, which also allows us to consolidate another read from an invlpg speculating a gpte into the shadow page table. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/mmu.c | 69 +++ 1 files changed, 31 insertions(+), 38 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index b137515..f63c9ad 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2556,36 +2556,11 @@ static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu) } static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, - const u8 *new, int bytes) + u64 gpte) { gfn_t gfn; - int r; - u64 gpte = 0; pfn_t pfn; - if (bytes != 4 bytes != 8) - return; - - /* -* Assume that the pte write on a page table of the same type -* as the current vcpu paging mode. This is nearly always true -* (might be false while changing modes). Note it is verified later -* by update_pte(). 
-*/ - if (is_pae(vcpu)) { - /* Handle a 32-bit guest writing two halves of a 64-bit gpte */ - if ((bytes == 4) (gpa % 4 == 0)) { - r = kvm_read_guest(vcpu-kvm, gpa ~(u64)7, gpte, 8); - if (r) - return; - memcpy((void *)gpte + (gpa % 8), new, 4); - } else if ((bytes == 8) (gpa % 8 == 0)) { - memcpy((void *)gpte, new, 8); - } - } else { - if ((bytes == 4) (gpa % 4 == 0)) - memcpy((void *)gpte, new, 4); - } if (!is_present_gpte(gpte)) return; gfn = (gpte PT64_BASE_ADDR_MASK) PAGE_SHIFT; @@ -2636,7 +2611,34 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, int r; pgprintk(%s: gpa %llx bytes %d\n, __func__, gpa, bytes); - mmu_guess_page_from_pte_write(vcpu, gpa, new, bytes); + + switch (bytes) { + case 4: + gentry = *(const u32 *)new; + break; + case 8: + gentry = *(const u64 *)new; + break; + default: + gentry = 0; + break; + } + + /* +* Assume that the pte write on a page table of the same type +* as the current vcpu paging mode. This is nearly always true +* (might be false while changing modes). Note it is verified later +* by update_pte(). 
+*/ + if (is_pae(vcpu) bytes == 4) { + /* Handle a 32-bit guest writing two halves of a 64-bit gpte */ + gpa = ~(gpa_t)7; + r = kvm_read_guest(vcpu-kvm, gpa, gentry, 8); + if (r) + gentry = 0; + } + + mmu_guess_page_from_pte_write(vcpu, gpa, gentry); spin_lock(vcpu-kvm-mmu_lock); kvm_mmu_access_page(vcpu, gfn); kvm_mmu_free_some_pages(vcpu); @@ -2701,20 +2703,11 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, continue; } spte = sp-spt[page_offset / sizeof(*spte)]; - if ((gpa (pte_size - 1)) || (bytes pte_size)) { - gentry = 0; - r = kvm_read_guest_atomic(vcpu-kvm, - gpa ~(u64)(pte_size - 1), - gentry, pte_size); - new = (const void *)gentry; - if (r 0) - new = NULL; - } while (npte--) { entry = *spte; mmu_pte_write_zap_pte(vcpu, sp, spte); - if (new) - mmu_pte_write_new_pte(vcpu, sp, spte, new); + if (gentry) + mmu_pte_write_new_pte(vcpu, sp, spte, gentry); mmu_pte_write_flush_tlb(vcpu, entry, *spte); ++spte; } -- 1.7.0.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
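The PAE special case above (with operators restored: `gpa &= ~(gpa_t)7` followed by an 8-byte `kvm_read_guest`) exists because a 32-bit PAE guest updates a 64-bit gpte with two 4-byte writes. Rather than stitching the halves together, the consolidated code aligns the gpa down and rereads the whole gpte. A small sketch, with a byte array standing in for guest physical memory:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the consolidation: on a 4-byte write into a 64-bit gpte, align
 * the gpa down to the 8-byte boundary and read the full gpte back from
 * guest memory (kvm_read_guest in the kernel). guest_ram is a stand-in. */
static uint64_t read_full_gpte(const uint8_t *guest_ram, uint64_t gpa)
{
    uint64_t gentry;
    gpa &= ~(uint64_t)7;                /* gpa &= ~(gpa_t)7 in the patch */
    memcpy(&gentry, guest_ram + gpa, 8);
    return gentry;
}
```

Either half-write (gpa offset 8 or 12 below) yields the same full gpte, which is exactly why the two separate read sites in kvm_mmu_pte_write() could be merged.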
[PATCH 0/5] Fix some mmu/emulator atomicity issues (v2)
Currently when we emulate a locked operation into a shadowed guest page table, we perform a write rather than a true atomic. This is indicated by the "emulating exchange as write" message that shows up in dmesg. In addition, the pte prefetch operation during invlpg suffered from a race. This was fixed by removing the operation. This patchset fixes both issues and reinstates pte prefetch on invlpg. v3: - rebase against next branch (resolves conflicts via hypercall patch) v2: - fix truncated description for patch 1 - add new patch 4, which fixes a bug in patch 5 Avi Kivity (5): KVM: MMU: Consolidate two guest pte reads in kvm_mmu_pte_write() KVM: Make locked operations truly atomic KVM: Don't follow an atomic operation by a non-atomic one KVM: MMU: Do not instantiate nontrapping spte on unsync page KVM: MMU: Reinstate pte prefetch on invlpg arch/x86/include/asm/kvm_host.h |1 + arch/x86/kvm/mmu.c | 78 + arch/x86/kvm/paging_tmpl.h | 25 ++- arch/x86/kvm/x86.c | 90 +++ 4 files changed, 127 insertions(+), 67 deletions(-)
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On 03/10/2010 11:30 PM, Luiz Capitulino wrote: Hi there, Our wiki page for the Summer of Code 2010 is doing quite well: http://wiki.qemu.org/Google_Summer_of_Code_2010 I will add another project - iommu emulation. Could be very useful for doing device assignment to nested guests, which could make testing a lot easier. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On Mon, Mar 15, 2010 at 02:25:41PM +0200, Avi Kivity wrote: On 03/10/2010 11:30 PM, Luiz Capitulino wrote: Hi there, Our wiki page for the Summer of Code 2010 is doing quite well: http://wiki.qemu.org/Google_Summer_of_Code_2010 I will add another project - iommu emulation. Could be very useful for doing device assignment to nested guests, which could make testing a lot easier. Good idea. If there is interest I could help to mentor this project. Joerg
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On 03/15/2010 02:38 PM, Joerg Roedel wrote: On Mon, Mar 15, 2010 at 02:25:41PM +0200, Avi Kivity wrote: On 03/10/2010 11:30 PM, Luiz Capitulino wrote: Hi there, Our wiki page for the Summer of Code 2010 is doing quite well: http://wiki.qemu.org/Google_Summer_of_Code_2010 I will add another project - iommu emulation. Could be very useful for doing device assignment to nested guests, which could make testing a lot easier. Good idea. If there is interest I could help to mentor this project. Thanks. I volunteered Anthony, but he may be a little overcommitted. -- error compiling committee.c: too many arguments to function
Re: [long] MINIX 3.1.6 works in QEMU-0.12.3 only with KVM disabled
On 03/15/2010 12:54 PM, Antoine Leca wrote: When doing switch, the cached segment selectors are preserved, which allows one to use protected mode segments in real-address mode (this is called unreal mode). Now this is a by-product of the implementation inside the BIOS. In fact, even if the BIOS enters unreal mode (or the similar big real, more useful with segmentation-less architectures), before turning back to the client it (should) reset things to normal real mode, as service 15/87 is not an usual way to enter unreal mode (for example, this effect is not even mentionned in Ralf Brown's list). The entry into unreal mode is unintentional; the bios is transitioning to protected mode and 'unreal mode' only exists for a few instructions, IIRC. As a result (and also and foremost because of 80286 compatibility), instead of directly using unreal or big real mode if possible (as done eg. in himem.sys), Minix monitor goes to the great pain to going back to square #1, and since blocks are at most 64 KB in size and several iterations are needed, on the next block Minix sets up the (very similar) GDT then does another call to the same BIOS service 15/87. I knew these parts before, but this is where Avi's answer came in: KVM on Intel does not yet support unreal mode and requires the cached segment descriptors to be valid in real-address mode. I do not know which virtual BIOS is using KVM, but I notice while reading http://bochs.sourceforge.net/cgi-bin/lxr/source/bios/rombios.c: [ Slightly edited to fit the width of my post. AL. ] 3555 case 0x87: 3556 #if BX_CPU 3 3557 # error Int15 function 87h not supported on 80386 3558 #endif 3559 // +++ should probably have descriptor checks 3560 // +++ should have exception handlers ... 3640 mov eax, cr0 3641 or al, #0x01 3642 mov cr0, eax 3643 ;; far jump to flush CPU queue after transition to prot. 
mode 3644 JMP_AP(0x0020, protected_mode) 3645 3646 protected_mode: 3647 ;; GDT points to valid descriptor table, now load SS, DS, ES 3648 mov ax, #0x28 ;; 101 000 = 5th desc.in table, TI=GDT,RPL=00 3649 mov ss, ax 3650 mov ax, #0x10 ;; 010 000 = 2nd desc.in table, TI=GDT,RPL=00 3651 mov ds, ax 3652 mov ax, #0x18 ;; 011 000 = 3rd desc.in table, TI=GDT,RPL=00 3653 mov es, ax 3654 xor si, si 3655 xor di, di 3656 cld 3657 rep 3658 movsw ;; move CX words from DS:SI to ES:DI 3659 3660 ;; make sure DS and ES limits are 64KB 3661 mov ax, #0x28 3662 mov ds, ax 3663 mov es, ax 3664 3665 ;; reset PG bit in CR0 ??? 3666 mov eax, cr0 3667 and al, #0xFE 3668 mov cr0, eax I should be loosing something here... There is no unreal mode at any moment, is it? [ ... some web browsing occuring meanwhile ... Later: ] Okay, now I got another picture. 8-| Until recently, KVM (and qemu) used Bochs BIOS, showed above; but they switched recently to SeaBIOS... where the applicable code is in src/system.c, and looks like (now this is ATT assembly): 83 static void 84 handle_1587(struct bregs *regs) 85 { 86 // +++ should probably have descriptor checks 87 // +++ should have exception handlers 127 // Enable protected mode 128 movl %%cr0, %%eax\n 129 orl $ __stringify(CR0_PE) , %%eax\n 130 movl %%eax, %%cr0\n 131 132 // far jump to flush CPU queue after transition to prot. mode 133 ljmpw $(43), $1f\n 134 135 // GDT points to valid descriptor table, now load DS, ES 136 1:movw $(23), %%ax\n // 2nd descriptor in table, TI=GDT, RPL=00 137 movw %%ax, %%ds\n 138 movw $(33), %%ax\n // 3rd descriptor in table, TI=GDT, RPL=00 139 movw %%ax, %%es\n 140 141 // move CX words from DS:SI to ES:DI 142 xorw %%si, %%si\n 143 xorw %%di, %%di\n 144 rep movsw\n 145 146 // Disable protected mode 147 movl %%cr0, %%eax\n 148 andl $~ __stringify(CR0_PE) , %%eax\n 149 movl %%eax, %%cr0\n Note that while the basic scheme is the same, the cleaning up of lines 3660-3663 make sure DS and ES limits are 64KB is not present. 
IIUC, the virtualized CPU goes back to real mode with those segments set as they are in protected mode, and yes, with the Minix boot monitor they happened to NOT be paragraph-aligned. Is it possible to add back this cleaning up to the BIOS used in KVM? I think so. This is a longstanding kvm bug, but I can't see any downsides to a workaround in the BIOS. -- error compiling committee.c: too many arguments to function
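The "make sure DS and ES limits are 64KB" step the Bochs BIOS performs (and SeaBIOS omitted) reloads DS/ES from a descriptor whose limit is exactly 0xffff with byte granularity, so the cached descriptors look like real-mode ones on exit. A sketch of building such an 8-byte data descriptor, following the standard x86 GDT byte layout:

```c
#include <assert.h>
#include <stdint.h>

/* Pack a flat data-segment GDT descriptor: limit 15:0 in bits 0-15,
 * base 23:0 in bits 16-39, access byte 0x93 (present, DPL 0, data,
 * read/write, accessed) in bits 40-47, limit 19:16 plus flags in bits
 * 48-55 (flags left 0 => byte granularity, 16-bit), base 31:24 in bits
 * 56-63. With limit = 0xffff this is the 64 KiB descriptor the BIOS
 * should leave in DS/ES before returning to real mode. */
static uint64_t make_data_descriptor(uint32_t base, uint32_t limit)
{
    uint64_t d = 0;
    d |= limit & 0xffffULL;                      /* limit 15:0 */
    d |= (uint64_t)(base & 0xffffff) << 16;      /* base 23:0 */
    d |= 0x93ULL << 40;                          /* access byte */
    d |= (uint64_t)((limit >> 16) & 0xf) << 48;  /* limit 19:16 */
    d |= (uint64_t)(base >> 24) << 56;           /* base 31:24 */
    return d;
}
```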
Re: [PATCH 15/18] KVM: MMU: Propagate the right fault back to the guest after gva_to_gpa
On Mon, Mar 15, 2010 at 04:30:47AM +0000, Daniel K. wrote: Joerg Roedel wrote: diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 2883ce8..9f8b02d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -314,6 +314,19 @@ void kvm_inject_page_fault(struct kvm_vcpu *vcpu, unsigned long addr, kvm_queue_exception_e(vcpu, PF_VECTOR, error_code); } +void kvm_propagate_fault(struct kvm_vcpu *vcpu, unsigned long addr, u32 error_code) +{ + u32 nested, error; + + nested = error_code & PFERR_NESTED_MASK; + error = error_code & ~PFERR_NESTED_MASK; + + if (vcpu->arch.mmu.nested & !(error_code & PFERR_NESTED_MASK)) This looks incorrect, nested is unused. At the very least it should be a boolean operation if (vcpu->arch.mmu.nested && !(error_code & PFERR_NESTED_MASK)) which can be simplified to if (vcpu->arch.mmu.nested && !nested) but it seems wrong that the condition is that it is nested and not nested at the same time. Yes, this is already fixed in my local patch-stack. I found it during further testing (while fixing another bug). But thanks for your feedback :-) Joerg
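The masking being reviewed is easy to demonstrate in isolation. The bit position below is illustrative (the real PFERR_NESTED_MASK is defined by the patchset, not reproduced here); the point is splitting the software "nested" flag out of the hardware error code and testing it, per Daniel's simplified form:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PFERR_NESTED_MASK (1u << 31)   /* illustrative bit position */

/* Sketch of the reviewed condition after simplification: split the nested
 * flag out of the error code, then propagate to L1 only when the mmu is in
 * nested mode and the fault did not come from the nested walk itself. */
static bool propagate_to_l1(bool mmu_nested, uint32_t error_code)
{
    uint32_t nested = error_code & PFERR_NESTED_MASK;
    /* uint32_t error = error_code & ~PFERR_NESTED_MASK; -- what would be
     * injected into the guest; unused in this sketch. */
    return mmu_nested && !nested;
}
```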
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On Mon, Mar 15, 2010 at 02:25:41PM +0200, Avi Kivity wrote: On 03/10/2010 11:30 PM, Luiz Capitulino wrote: Hi there, Our wiki page for the Summer of Code 2010 is doing quite well: http://wiki.qemu.org/Google_Summer_of_Code_2010 I will add another project - iommu emulation. Could be very useful for doing device assignment to nested guests, which could make testing a lot easier. Our experiments show that nested device assignment is pretty much required for I/O performance in nested scenarios. Cheers, Muli
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On Mon, Mar 15, 2010 at 05:53:13AM -0700, Muli Ben-Yehuda wrote: On Mon, Mar 15, 2010 at 02:25:41PM +0200, Avi Kivity wrote: On 03/10/2010 11:30 PM, Luiz Capitulino wrote: Hi there, Our wiki page for the Summer of Code 2010 is doing quite well: http://wiki.qemu.org/Google_Summer_of_Code_2010 I will add another project - iommu emulation. Could be very useful for doing device assignment to nested guests, which could make testing a lot easier. Our experiments show that nested device assignment is pretty much required for I/O performance in nested scenarios. Really? I did a small test with virtio-blk in a nested guest (disk read with dd, so not a real benchmark) and got a reasonable read-performance of around 25MB/s from the disk in the l2-guest. Joerg
Re: [PATCH v2 05/30] KVM: Provide callback to get/set control registers in emulator ops.
Gleb Natapov wrote: Use this callback instead of directly calling the kvm function. Also rename realmode_(set|get)_cr to emulator_(set|get)_cr since the function has nothing to do with real mode. Do you mind removing the static before emulator_{set,get}_cr and marking it EXPORT_SYMBOL? Then one could use it in vmx.c (and soon in svm.c ;-) while handling MOV-CR intercepts. Currently most of the code is actually duplicated. Also, shouldn't mk_cr_64() be called mask_cr_64() for better readability? Regards, Andre. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_emulate.h | 3 +- arch/x86/include/asm/kvm_host.h | 2 - arch/x86/kvm/emulate.c | 7 +- arch/x86/kvm/x86.c | 114 ++-- 4 files changed, 63 insertions(+), 63 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index 2666d7a..0c5caa4 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -108,7 +108,8 @@ struct x86_emulate_ops { const void *new, unsigned int bytes, struct kvm_vcpu *vcpu); - + ulong (*get_cr)(int cr, struct kvm_vcpu *vcpu); + void (*set_cr)(int cr, ulong val, struct kvm_vcpu *vcpu); }; /* Type, address-of, and value of an instruction's operand.
*/ diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 3b178d8..e8e108a 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -585,8 +585,6 @@ void realmode_lidt(struct kvm_vcpu *vcpu, u16 size, unsigned long address); void realmode_lmsw(struct kvm_vcpu *vcpu, unsigned long msw, unsigned long *rflags); -unsigned long realmode_get_cr(struct kvm_vcpu *vcpu, int cr); -void realmode_set_cr(struct kvm_vcpu *vcpu, int cr, unsigned long value); void kvm_enable_efer_bits(u64); int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data); int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 91450b5..5b060e4 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2483,7 +2483,7 @@ twobyte_insn: break; case 4: /* smsw */ c->dst.bytes = 2; - c->dst.val = realmode_get_cr(ctxt->vcpu, 0); + c->dst.val = ops->get_cr(0, ctxt->vcpu); break; case 6: /* lmsw */ realmode_lmsw(ctxt->vcpu, (u16)c->src.val, @@ -2519,8 +2519,7 @@ twobyte_insn: case 0x20: /* mov cr, reg */ if (c->modrm_mod != 3) goto cannot_emulate; - c->regs[c->modrm_rm] = - realmode_get_cr(ctxt->vcpu, c->modrm_reg); + c->regs[c->modrm_rm] = ops->get_cr(c->modrm_reg, ctxt->vcpu); c->dst.type = OP_NONE; /* no writeback */ break; case 0x21: /* mov from dr to reg */ @@ -2534,7 +2533,7 @@ twobyte_insn: case 0x22: /* mov reg, cr */ if (c->modrm_mod != 3) goto cannot_emulate; - realmode_set_cr(ctxt->vcpu, c->modrm_reg, c->modrm_val); + ops->set_cr(c->modrm_reg, c->modrm_val, ctxt->vcpu); c->dst.type = OP_NONE; break; case 0x23: /* mov from reg to dr */ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index a1e671a..bf714df 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3370,12 +3370,70 @@ void kvm_report_emulation_failure(struct kvm_vcpu *vcpu, const char *context) } EXPORT_SYMBOL_GPL(kvm_report_emulation_failure); +static u64 mk_cr_64(u64 curr_cr, u32 new_val) +{ + return
(curr_cr & ~((1ULL << 32) - 1)) | new_val; +} + +static unsigned long emulator_get_cr(int cr, struct kvm_vcpu *vcpu) +{ + unsigned long value; + + switch (cr) { + case 0: + value = kvm_read_cr0(vcpu); + break; + case 2: + value = vcpu->arch.cr2; + break; + case 3: + value = vcpu->arch.cr3; + break; + case 4: + value = kvm_read_cr4(vcpu); + break; + case 8: + value = kvm_get_cr8(vcpu); + break; + default: + vcpu_printf(vcpu, "%s: unexpected cr %u\n", __func__, cr); + return 0; + } + + return value; +} + +static void emulator_set_cr(int cr, unsigned long val, struct kvm_vcpu *vcpu) +{ + switch (cr) { + case 0: + kvm_set_cr0(vcpu, mk_cr_64(kvm_read_cr0(vcpu), val)); + break; + case 2: + vcpu->arch.cr2 = val; + break; + case 3: + kvm_set_cr3(vcpu, val); + break;
[PATCH rework] KVM: coalesced_mmio: fix kvm_coalesced_mmio_init()'s error handling
kvm_coalesced_mmio_init() keeps holding the addresses of the coalesced mmio ring page and dev even after it has freed them. Also, if this function fails, though that might be rare, it suggests the system is in a serious state: so we'd better stop the work that follows kvm_create_vm(). This patch fixes these problems. We move the coalesced mmio's initialization out of kvm_create_vm(). This seems natural because it includes a registration which can be done only when the vm is successfully created. Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp --- virt/kvm/coalesced_mmio.c | 2 ++ virt/kvm/kvm_main.c | 12 2 files changed, 10 insertions(+), 4 deletions(-) diff --git a/virt/kvm/coalesced_mmio.c b/virt/kvm/coalesced_mmio.c index 5169736..11776b7 100644 --- a/virt/kvm/coalesced_mmio.c +++ b/virt/kvm/coalesced_mmio.c @@ -119,8 +119,10 @@ int kvm_coalesced_mmio_init(struct kvm *kvm) return ret; out_free_dev: + kvm->coalesced_mmio_dev = NULL; kfree(dev); out_free_page: + kvm->coalesced_mmio_ring = NULL; __free_page(page); out_err: return ret; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index bcd08b8..c7053aa 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -418,9 +418,6 @@ static struct kvm *kvm_create_vm(void) spin_lock(&kvm_lock); list_add(&kvm->vm_list, &vm_list); spin_unlock(&kvm_lock); -#ifdef KVM_COALESCED_MMIO_PAGE_OFFSET - kvm_coalesced_mmio_init(kvm); -#endif out: return kvm; @@ -1748,12 +1745,19 @@ static struct file_operations kvm_vm_fops = { static int kvm_dev_ioctl_create_vm(void) { - int fd; + int fd, r; struct kvm *kvm; kvm = kvm_create_vm(); if (IS_ERR(kvm)) return PTR_ERR(kvm); +#ifdef KVM_COALESCED_MMIO_PAGE_OFFSET + r = kvm_coalesced_mmio_init(kvm); + if (r < 0) { + kvm_put_kvm(kvm); + return r; + } +#endif fd = anon_inode_getfd("kvm-vm", &kvm_vm_fops, kvm, O_RDWR); if (fd < 0) kvm_put_kvm(kvm); -- 1.6.3.3
Re: [PATCH v2 05/30] KVM: Provide callback to get/set control registers in emulator ops.
On Mon, Mar 15, 2010 at 02:06:48PM +0100, Andre Przywara wrote: Gleb Natapov wrote: Use this callback instead of directly calling the kvm function. Also rename realmode_(set|get)_cr to emulator_(set|get)_cr since the function has nothing to do with real mode. Do you mind removing the static before emulator_{set,get}_cr and I don't, but this is not the goal of this patch series. marking it EXPORT_SYMBOL? Then one could use it in vmx.c (and soon in svm.c ;-) while handling MOV-CR intercepts. Currently most of the code is actually duplicated. Also, shouldn't mk_cr_64() be called mask_cr_64() for better readability? This is how it is called now, the patch only moves it. But this code will be reworked by later patches anyway since functions called from the emulator should not inject exceptions behind the emulator's back. Regards, Andre. [...]
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On 03/15/2010 03:03 PM, Joerg Roedel wrote: I will add another project - iommu emulation. Could be very useful for doing device assignment to nested guests, which could make testing a lot easier. Our experiments show that nested device assignment is pretty much required for I/O performance in nested scenarios. Really? I did a small test with virtio-blk in a nested guest (disk read with dd, so not a real benchmark) and got a reasonable read-performance of around 25MB/s from the disk in the l2-guest. Your guest wasn't doing a zillion VMREADs and VMWRITEs every exit. I plan to reduce VMREAD/VMWRITE overhead for kvm, but not much we can do for other guests. -- error compiling committee.c: too many arguments to function
Re: [PATCH v2 05/30] KVM: Provide callback to get/set control registers in emulator ops.
On 03/15/2010 03:06 PM, Andre Przywara wrote: Gleb Natapov wrote: Use this callback instead of directly calling the kvm function. Also rename realmode_(set|get)_cr to emulator_(set|get)_cr since the function has nothing to do with real mode. Do you mind removing the static before emulator_{set,get}_cr and marking it EXPORT_SYMBOL? Then one could use it in vmx.c (and soon in svm.c ;-) while handling MOV-CR intercepts. Currently most of the code is actually duplicated. Just do that in your patch, that's standard practice. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On 03/15/2010 07:42 AM, Avi Kivity wrote: On 03/15/2010 02:38 PM, Joerg Roedel wrote: On Mon, Mar 15, 2010 at 02:25:41PM +0200, Avi Kivity wrote: On 03/10/2010 11:30 PM, Luiz Capitulino wrote: Hi there, Our wiki page for the Summer of Code 2010 is doing quite well: http://wiki.qemu.org/Google_Summer_of_Code_2010 I will add another project - iommu emulation. Could be very useful for doing device assignment to nested guests, which could make testing a lot easier. Good idea. If there is interest I could help to mentor this project. Thanks. I volunteered Anthony, but he may be a little overcommitted. Joerg, feel free to put your name against it too. Regards, Anthony Liguori
Re: [PATCH v2 07/30] KVM: Provide x86_emulate_ctxt callback to get current cpl
Gleb, what is the purpose of this patch? Is this a preparation for something upcoming? I don't see a reason to change this, in my eyes it is not a simplification. Regards, Andre. Gleb Natapov wrote: Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_emulate.h | 1 + arch/x86/kvm/emulate.c | 15 --- arch/x86/kvm/x86.c | 6 ++ 3 files changed, 15 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index 0c5caa4..b048fd2 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -110,6 +110,7 @@ struct x86_emulate_ops { struct kvm_vcpu *vcpu); ulong (*get_cr)(int cr, struct kvm_vcpu *vcpu); void (*set_cr)(int cr, ulong val, struct kvm_vcpu *vcpu); + int (*cpl)(struct kvm_vcpu *vcpu); }; /* Type, address-of, and value of an instruction's operand. */ diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 5e2fa61..8bd0557 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1257,7 +1257,7 @@ static int emulate_popf(struct x86_emulate_ctxt *ctxt, int rc; unsigned long val, change_mask; int iopl = (ctxt->eflags & X86_EFLAGS_IOPL) >> IOPL_SHIFT; - int cpl = kvm_x86_ops->get_cpl(ctxt->vcpu); + int cpl = ops->cpl(ctxt->vcpu); rc = emulate_pop(ctxt, ops, &val, len); if (rc != X86EMUL_CONTINUE) @@ -1758,7 +1758,8 @@ emulate_sysexit(struct x86_emulate_ctxt *ctxt) return X86EMUL_CONTINUE; } -static bool emulator_bad_iopl(struct x86_emulate_ctxt *ctxt) +static bool emulator_bad_iopl(struct x86_emulate_ctxt *ctxt, + struct x86_emulate_ops *ops) { int iopl; if (ctxt->mode == X86EMUL_MODE_REAL) @@ -1766,7 +1767,7 @@ static bool emulator_bad_iopl(struct x86_emulate_ctxt *ctxt) if (ctxt->mode == X86EMUL_MODE_VM86) return true; iopl = (ctxt->eflags & X86_EFLAGS_IOPL) >> IOPL_SHIFT; - return kvm_x86_ops->get_cpl(ctxt->vcpu) > iopl; + return ops->cpl(ctxt->vcpu) > iopl; } static bool emulator_io_port_access_allowed(struct x86_emulate_ctxt *ctxt, @@ -1803,7 +1804,7 @@ static
bool emulator_io_permited(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops, u16 port, u16 len) { - if (emulator_bad_iopl(ctxt)) + if (emulator_bad_iopl(ctxt, ops)) if (!emulator_io_port_access_allowed(ctxt, ops, port, len)) return false; return true; @@ -1842,7 +1843,7 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) } /* Privileged instruction can be executed only in CPL=0 */ - if ((c->d & Priv) && kvm_x86_ops->get_cpl(ctxt->vcpu)) { + if ((c->d & Priv) && ops->cpl(ctxt->vcpu)) { kvm_inject_gp(ctxt->vcpu, 0); goto done; } @@ -2378,7 +2379,7 @@ special_insn: c->dst.type = OP_NONE; /* Disable writeback. */ break; case 0xfa: /* cli */ - if (emulator_bad_iopl(ctxt)) + if (emulator_bad_iopl(ctxt, ops)) kvm_inject_gp(ctxt->vcpu, 0); else { ctxt->eflags &= ~X86_EFLAGS_IF; @@ -2386,7 +2387,7 @@ special_insn: } break; case 0xfb: /* sti */ - if (emulator_bad_iopl(ctxt)) + if (emulator_bad_iopl(ctxt, ops)) kvm_inject_gp(ctxt->vcpu, 0); else { toggle_interruptibility(ctxt, KVM_X86_SHADOW_INT_STI); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b08f8a1..3f2a8d3 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3426,6 +3426,11 @@ static void emulator_set_cr(int cr, unsigned long val, struct kvm_vcpu *vcpu) } } +static int emulator_get_cpl(struct kvm_vcpu *vcpu) +{ + return kvm_x86_ops->get_cpl(vcpu); +} + static struct x86_emulate_ops emulate_ops = { .read_std = kvm_read_guest_virt_system, .fetch = kvm_fetch_guest_virt, @@ -3434,6 +3439,7 @@ static struct x86_emulate_ops emulate_ops = { .cmpxchg_emulated = emulator_cmpxchg_emulated, .get_cr = emulator_get_cr, .set_cr = emulator_set_cr, + .cpl = emulator_get_cpl, }; static void cache_all_regs(struct kvm_vcpu *vcpu) -- Andre Przywara AMD-OSRC (Dresden) Tel: x29712
Re: [PATCH v2 07/30] KVM: Provide x86_emulate_ctxt callback to get current cpl
On Mon, Mar 15, 2010 at 02:16:01PM +0100, Andre Przywara wrote: Gleb, what is the purpose of this patch? Is this a preparation for something upcoming? I don't see a reason to change this, in my eyes it is not a simplification. To make the emulator independent of KVM. All direct calls from the emulator to KVM will be changed to callbacks. Regards, Andre. [...] -- Gleb.
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On 03/15/2010 08:11 AM, Avi Kivity wrote: On 03/15/2010 03:03 PM, Joerg Roedel wrote: I will add another project - iommu emulation. Could be very useful for doing device assignment to nested guests, which could make testing a lot easier. Our experiments show that nested device assignment is pretty much required for I/O performance in nested scenarios. Really? I did a small test with virtio-blk in a nested guest (disk read with dd, so not a real benchmark) and got a reasonable read-performance of around 25MB/s from the disk in the l2-guest. Your guest wasn't doing a zillion VMREADs and VMWRITEs every exit. I plan to reduce VMREAD/VMWRITE overhead for kvm, but not much we can do for other guests. VMREAD/VMWRITEs are generally optimized by hypervisors as they tend to be costly. KVM is a bit unusual in terms of how many times the instructions are executed per exit. Regards, Anthony Liguori
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On Mon, Mar 15, 2010 at 03:11:42PM +0200, Avi Kivity wrote: On 03/15/2010 03:03 PM, Joerg Roedel wrote: I will add another project - iommu emulation. Could be very useful for doing device assignment to nested guests, which could make testing a lot easier. Our experiments show that nested device assignment is pretty much required for I/O performance in nested scenarios. Really? I did a small test with virtio-blk in a nested guest (disk read with dd, so not a real benchmark) and got a reasonable read-performance of around 25MB/s from the disk in the l2-guest. Your guest wasn't doing a zillion VMREADs and VMWRITEs every exit. I plan to reduce VMREAD/VMWRITE overhead for kvm, but not much we can do for other guests. Does it matter for the ept-on-ept case? The initial patchset of nested-vmx implemented it and they reported a performance drop of around 12% between levels which is reasonable. So I expected the loss of io-performance for l2 also reasonable in this case. My small measurement was also done using npt-on-npt. Joerg
Fwd: Corrupted filesystem, possible after livemigration with iSCSI storagebackend.
In our KVM system we have two iSCSI backends (master/slave configuration) with failover and two KVM hosts supporting live migration. The iSCSI volumes are shared by the host as a block device in KVM, and the volumes are available on both frontends. After a reboot one of the KVM guests was not able to start again due to file system corruption. We use XFS and have problems understanding what caused the corruption. We have ruled out the iSCSI backend as both the master and slave data were consistent at the time. Anyone else had similar problems? What is the recommended way to share an iSCSI drive among the two host machines? Should XFS be ok as a file system for live migration? I'm not able to find any documentation stating that a clustered file system (GFS2 etc.) is recommended. Are there any concurrent writes on the two host machines during a live migration?
<disk type='block' device='disk'>
  <driver name='qemu'/>
  <source dev='/dev/disk/by-path/ip-ip:3260-iscsi-test2-lun-0'/>
  <target dev='sda' bus='scsi'/>
  <address type='drive' controller='0' bus='0' unit='0'/>
</disk>
#virsh version Compiled against library: libvir 0.7.6 Using library: libvir 0.7.6 Using API: QEMU 0.7.6 Running hypervisor: QEMU 0.11.0 #uname -a Linux vm01 2.6.32-bpo.2-amd64 #1 SMP Fri Feb 12 16:50:27 UTC 2010 x86_64 GNU/Linux Regards Espen
Re: how to tweak kernel to get the best out of kvm?
On 03/13/10 09:54, Avi Kivity wrote: If the slowdown is indeed due to I/O, LVM (with cache=off) should eliminate it completely. As promised I have installed LVM: The difference is remarkable. My test case (running 8 vhosts in parallel, each building a Linux kernel) just works. There is no blocking job (by now), all vhosts can be pinged, great. Many thanx for your help, and for the nice software, of course. Regards Harri
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On 03/15/2010 08:24 AM, Joerg Roedel wrote: On Mon, Mar 15, 2010 at 03:11:42PM +0200, Avi Kivity wrote: On 03/15/2010 03:03 PM, Joerg Roedel wrote: I will add another project - iommu emulation. Could be very useful for doing device assignment to nested guests, which could make testing a lot easier. Our experiments show that nested device assignment is pretty much required for I/O performance in nested scenarios. Really? I did a small test with virtio-blk in a nested guest (disk read with dd, so not a real benchmark) and got a reasonable read-performance of around 25MB/s from the disk in the l2-guest. Your guest wasn't doing a zillion VMREADs and VMWRITEs every exit. I plan to reduce VMREAD/VMWRITE overhead for kvm, but not much we can do for other guests. Does it matter for the ept-on-ept case? The initial patchset of nested-vmx implemented it and they reported a performance drop of around 12% between levels which is reasonable. So I expected the loss of io-performance for l2 also reasonable in this case. My small measurement was also done using npt-on-npt. But that was something like kernbench IIRC which is actually exit light once ept is enabled. Network IO is typically exit heavy and becomes something more of a pathological work load (both for nested ept and nested npt). Regards, Anthony Liguori
Re: Fwd: Corrupted filesystem, possible after livemigration with iSCSI storagebackend.
On 03/15/2010 08:46 AM, Espen Berg wrote: In our KVM system we have two iSCSI backends (master/slave configuration) with failover and two KVM hosts supporting live migration. The iSCSI volumes are shared by the host as a block device in KVM, and the volumes are available on both frontends. After a reboot one of the KVM guests was not able to start again due to file system corruption. We use XFS and have problems understanding what caused the corruption. We have ruled out the iSCSI backend as both the master and slave data were consistent at the time. Anyone else had similar problems? What is the recommended way to share an iSCSI drive among the two host machines? Should XFS be ok as a file system for live migration? I'm not able to find any documentation stating that a clustered file system (GFS2 etc.) is recommended. Are there any concurrent writes on the two host machines during a live migration?
<disk type='block' device='disk'> <driver name='qemu'/> <source dev='/dev/disk/by-path/ip-ip:3260-iscsi-test2-lun-0'/> <target dev='sda' bus='scsi'/> <address type='drive' controller='0' bus='0' unit='0'/> </disk>
You need to use cache=off if you've got one iscsi drive mounted on two separate physical machines. The additional layer of caching will result in inconsistency because iSCSI doesn't have a mechanism to provide cache coherence between two nodes. Regards, Anthony Liguori
Re: Fwd: Corrupted filesystem, possible after livemigration with iSCSI storagebackend.
On Mon, Mar 15, 2010 at 08:59:10AM -0500, Anthony Liguori wrote: On 03/15/2010 08:46 AM, Espen Berg wrote: In our KVM system we have two iSCSI backends (master/slave configuration) with failover and two KVM hosts supporting live migration. The iSCSI volumes are shared by the host as a block device in KVM, and the volumes are available on both frontends. After a reboot one of the KVM guests was not able to start again due to file system corruption. We use XFS and have problems understanding what caused the corruption. We have ruled out the iSCSI backend as both the master and slave data were consistent at the time. Anyone else had similar problems? What is the recommended way to share an iSCSI drive among the two host machines? Should XFS be ok as a file system for live migration? I'm not able to find any documentation stating that a clustered file system (GFS2 etc.) is recommended. Are there any concurrent writes on the two host machines during a live migration?
<disk type='block' device='disk'> <driver name='qemu'/> <source dev='/dev/disk/by-path/ip-ip:3260-iscsi-test2-lun-0'/> <target dev='sda' bus='scsi'/> <address type='drive' controller='0' bus='0' unit='0'/> </disk>
You need to use cache=off if you've got one iscsi drive mounted on two separate physical machines. FYI, this can be done by changing the disk XML <driver name='qemu'/> to be <driver name='qemu' cache='none'/> Regards, Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
[PATCH v3 04/30] KVM: Remove pointer to rflags from realmode_set_cr parameters.
Mov reg, cr instruction doesn't change flags in any meaningful way, so no need to update rflags after instruction execution. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_host.h | 3 +-- arch/x86/kvm/emulate.c | 3 +-- arch/x86/kvm/x86.c | 4 +--- 3 files changed, 3 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index ea1b6c6..8567107 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -586,8 +586,7 @@ void realmode_lmsw(struct kvm_vcpu *vcpu, unsigned long msw, unsigned long *rflags); unsigned long realmode_get_cr(struct kvm_vcpu *vcpu, int cr); -void realmode_set_cr(struct kvm_vcpu *vcpu, int cr, unsigned long value, - unsigned long *rflags); +void realmode_set_cr(struct kvm_vcpu *vcpu, int cr, unsigned long value); void kvm_enable_efer_bits(u64); int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data); int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 670ca8f..91450b5 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2534,8 +2534,7 @@ twobyte_insn: case 0x22: /* mov reg, cr */ if (c->modrm_mod != 3) goto cannot_emulate; - realmode_set_cr(ctxt->vcpu, - c->modrm_reg, c->modrm_val, &ctxt->eflags); + realmode_set_cr(ctxt->vcpu, c->modrm_reg, c->modrm_val); c->dst.type = OP_NONE; break; case 0x23: /* mov from reg to dr */ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 9d02cc7..56cdaa5 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4043,13 +4043,11 @@ unsigned long realmode_get_cr(struct kvm_vcpu *vcpu, int cr) return value; } -void realmode_set_cr(struct kvm_vcpu *vcpu, int cr, unsigned long val, - unsigned long *rflags) +void realmode_set_cr(struct kvm_vcpu *vcpu, int cr, unsigned long val) { switch (cr) { case 0: kvm_set_cr0(vcpu, mk_cr_64(kvm_read_cr0(vcpu), val)); - *rflags = kvm_get_rflags(vcpu); break; case 2: vcpu->arch.cr2
= val; -- 1.6.5
[PATCH v3 03/30] KVM: x86 emulator: check return value against correct define
Check return value against the correct define instead of open-coding the value. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 4dce805..670ca8f 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -566,7 +566,7 @@ static u32 group2_table[] = { #define insn_fetch(_type, _size, _eip) \ ({ unsigned long _x; \ rc = do_insn_fetch(ctxt, ops, (_eip), &_x, (_size)); \ - if (rc != 0) \ + if (rc != X86EMUL_CONTINUE) \ goto done; \ (_eip) += (_size); \ (_type)_x; \ -- 1.6.5
[PATCH v3 01/30] KVM: x86 emulator: Fix DstAcc decoding.
Set correct operation length. Add RAX (64bit) handling. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 7 +-- 1 files changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 2832a8c..0b70a36 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1194,9 +1194,9 @@ done_prefixes: break; case DstAcc: c->dst.type = OP_REG; - c->dst.bytes = c->op_bytes; + c->dst.bytes = (c->d & ByteOp) ? 1 : c->op_bytes; c->dst.ptr = &c->regs[VCPU_REGS_RAX]; - switch (c->op_bytes) { + switch (c->dst.bytes) { case 1: c->dst.val = *(u8 *)c->dst.ptr; break; @@ -1206,6 +1206,9 @@ done_prefixes: case 4: c->dst.val = *(u32 *)c->dst.ptr; break; + case 8: + c->dst.val = *(u64 *)c->dst.ptr; + break; } c->dst.orig_val = c->dst.val; break; -- 1.6.5
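The DstAcc fix above makes the destination size depend on the ByteOp decode flag instead of always using op_bytes, and adds the missing 64-bit accumulator case. A minimal standalone sketch of that sizing rule follows; the flag value and helper names here are illustrative, not the emulator's real ones:

```c
#include <assert.h>
#include <stdint.h>

#define BYTE_OP 0x01  /* stand-in for the emulator's ByteOp decode flag */

/* Byte ops always use 1 byte; otherwise use the instruction's operand
 * size, which may now be 8 for a REX.W-prefixed access to RAX. */
static unsigned dst_acc_bytes(unsigned d_flags, unsigned op_bytes)
{
    return (d_flags & BYTE_OP) ? 1 : op_bytes;
}

/* Read the low dst_bytes of a 64-bit accumulator, as the switch does. */
static uint64_t read_acc(uint64_t rax, unsigned bytes)
{
    switch (bytes) {
    case 1: return (uint8_t)rax;
    case 2: return (uint16_t)rax;
    case 4: return (uint32_t)rax;
    case 8: return rax;           /* the newly added 64-bit case */
    }
    return 0;
}
```

Before the patch, a REX.W in/out through the accumulator would have truncated via the 4-byte case; with the size keyed off dst.bytes the full register round-trips.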
[PATCH v3 00/30] emulator cleanup
This is the first series of patches that tries to clean up the emulator code. It is a mix of bug fixes and of moving emulation code from x86.c to emulator.c while making it KVM-independent. The status of the patches: works for me. The realtime.flat test now also passes where it failed before. ChangeLog: v1->v2: - a couple of new bugs fixed - cpl is now an x86_emulator_ops callback - during string instructions, re-enter the guest on each page boundary - retain the fast path for pio out (do not go through the emulator) v2->v3: - use correct operand length for pio instructions with a REX prefix - check for string instruction before decrementing ecx - change the guest re-entry condition for string instructions Gleb Natapov (30): KVM: x86 emulator: Fix DstAcc decoding. KVM: x86 emulator: fix RCX access during rep emulation KVM: x86 emulator: check return value against correct define KVM: Remove pointer to rflags from realmode_set_cr parameters. KVM: Provide callback to get/set control registers in emulator ops. KVM: remove realmode_lmsw function. KVM: Provide x86_emulate_ctxt callback to get current cpl KVM: Provide current eip as part of emulator context. KVM: x86 emulator: fix mov r/m, sreg emulation. KVM: x86 emulator: fix 0f 01 /5 emulation KVM: x86 emulator: 0f (20|21|22|23) ignore mod bits. KVM: x86 emulator: inject #UD on access to non-existing CR KVM: x86 emulator: fix mov dr to inject #UD when needed. KVM: x86 emulator: fix return values of syscall/sysenter/sysexit emulations KVM: x86 emulator: do not call writeback if msr access fails. KVM: x86 emulator: If LOCK prefix is used dest arg should be memory. KVM: x86 emulator: cleanup grp3 return value KVM: x86 emulator: Provide more callbacks for x86 emulator. KVM: x86 emulator: Emulate task switch in emulator.c KVM: x86 emulator: Use load_segment_descriptor() instead of kvm_load_segment_descriptor() KVM: Use task switch from emulator.c KVM: x86 emulator: populate OP_MEM operand during decoding.
KVM: x86 emulator: add decoding of X,Y parameters from Intel SDM KVM: x86 emulator: during rep emulation decrement ECX only if emulation succeeded KVM: x86 emulator: fix in/out emulation. KVM: x86 emulator: Move string pio emulation into emulator.c KVM: x86 emulator: remove saved_eip KVM: x86 emulator: restart string instruction without going back to a guest. KVM: x86 emulator: introduce pio in string read ahead. KVM: small kvm_arch_vcpu_ioctl_run() cleanup. arch/x86/include/asm/kvm_emulate.h | 41 ++- arch/x86/include/asm/kvm_host.h | 16 +- arch/x86/kvm/emulate.c | 1062 ++- arch/x86/kvm/svm.c | 20 +- arch/x86/kvm/vmx.c | 18 +- arch/x86/kvm/x86.c | 1121 +--- 6 files changed, 1146 insertions(+), 1132 deletions(-)
[PATCH v3 10/30] KVM: x86 emulator: fix 0f 01 /5 emulation
It is undefined and should generate #UD. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index c3b9334..7c7debb 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2490,6 +2490,9 @@ twobyte_insn: (c->src.val & 0x0f), ctxt->vcpu); c->dst.type = OP_NONE; break; + case 5: /* not defined */ + kvm_queue_exception(ctxt->vcpu, UD_VECTOR); + goto done; case 7: /* invlpg*/ emulate_invlpg(ctxt->vcpu, memop); /* Disable writeback. */ -- 1.6.5
[PATCH v3 06/30] KVM: remove realmode_lmsw function.
Use the (get|set)_cr callback to emulate lmsw inside the emulator. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_host.h | 2 -- arch/x86/kvm/emulate.c | 4 ++-- arch/x86/kvm/x86.c | 7 --- 3 files changed, 2 insertions(+), 11 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 9725856..72997aa 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -582,8 +582,6 @@ int emulate_instruction(struct kvm_vcpu *vcpu, void kvm_report_emulation_failure(struct kvm_vcpu *cvpu, const char *context); void realmode_lgdt(struct kvm_vcpu *vcpu, u16 size, unsigned long address); void realmode_lidt(struct kvm_vcpu *vcpu, u16 size, unsigned long address); -void realmode_lmsw(struct kvm_vcpu *vcpu, unsigned long msw, - unsigned long *rflags); void kvm_enable_efer_bits(u64); int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data); diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 5b060e4..5e2fa61 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2486,8 +2486,8 @@ twobyte_insn: c->dst.val = ops->get_cr(0, ctxt->vcpu); break; case 6: /* lmsw */ - realmode_lmsw(ctxt->vcpu, (u16)c->src.val, - &ctxt->eflags); + ops->set_cr(0, (ops->get_cr(0, ctxt->vcpu) & ~0x0ful) | + (c->src.val & 0x0f), ctxt->vcpu); c->dst.type = OP_NONE; break; case 7: /* invlpg*/ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fb00ed5..b139334 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4061,13 +4061,6 @@ void realmode_lidt(struct kvm_vcpu *vcpu, u16 limit, unsigned long base) kvm_x86_ops->set_idt(vcpu, &dt); } -void realmode_lmsw(struct kvm_vcpu *vcpu, unsigned long msw, - unsigned long *rflags) -{ - kvm_lmsw(vcpu, msw); - *rflags = kvm_get_rflags(vcpu); -} - static int move_to_next_stateful_cpuid_entry(struct kvm_vcpu *vcpu, int i) { struct kvm_cpuid_entry2 *e = &vcpu->arch.cpuid_entries[i]; -- 1.6.5
[PATCH v3 09/30] KVM: x86 emulator: fix mov r/m, sreg emulation.
mov r/m, sreg generates #UD if sreg is incorrect. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 7 +++---- 1 files changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 2c27aa4..c3b9334 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2126,12 +2126,11 @@ special_insn: case 0x8c: { /* mov r/m, sreg */ struct kvm_segment segreg; - if (c->modrm_reg <= 5) + if (c->modrm_reg <= VCPU_SREG_GS) kvm_get_segment(ctxt->vcpu, &segreg, c->modrm_reg); else { - printk(KERN_INFO "0x8c: Invalid segreg in modrm byte 0x%02x\n", - c->modrm); - goto cannot_emulate; + kvm_queue_exception(ctxt->vcpu, UD_VECTOR); + goto done; } c->dst.val = segreg.selector; break; -- 1.6.5
[PATCH v3 11/30] KVM: x86 emulator: 0f (20|21|22|23) ignore mod bits.
Recent spec says that for 0f (20|21|22|23) the 2 bits in the mod field are ignored. Interestingly enough, the older spec says that 11 is the only valid encoding. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 8 -------- 1 files changed, 0 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 7c7debb..fa4604e 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2520,28 +2520,20 @@ twobyte_insn: c->dst.type = OP_NONE; break; case 0x20: /* mov cr, reg */ - if (c->modrm_mod != 3) - goto cannot_emulate; c->regs[c->modrm_rm] = ops->get_cr(c->modrm_reg, ctxt->vcpu); c->dst.type = OP_NONE; /* no writeback */ break; case 0x21: /* mov from dr to reg */ - if (c->modrm_mod != 3) - goto cannot_emulate; if (emulator_get_dr(ctxt, c->modrm_reg, &c->regs[c->modrm_rm])) goto cannot_emulate; rc = X86EMUL_CONTINUE; c->dst.type = OP_NONE; /* no writeback */ break; case 0x22: /* mov reg, cr */ - if (c->modrm_mod != 3) - goto cannot_emulate; ops->set_cr(c->modrm_reg, c->modrm_val, ctxt->vcpu); c->dst.type = OP_NONE; break; case 0x23: /* mov from reg to dr */ - if (c->modrm_mod != 3) - goto cannot_emulate; if (emulator_set_dr(ctxt, c->modrm_reg, c->regs[c->modrm_rm])) goto cannot_emulate; rc = X86EMUL_CONTINUE; -- 1.6.5
[PATCH v3 12/30] KVM: x86 emulator: inject #UD on access to non-existing CR
Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 7 +++++++ 1 files changed, 7 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index fa4604e..836e97b 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2520,6 +2520,13 @@ twobyte_insn: c->dst.type = OP_NONE; break; case 0x20: /* mov cr, reg */ + switch (c->modrm_reg) { + case 1: + case 5 ... 7: + case 9 ... 15: + kvm_queue_exception(ctxt->vcpu, UD_VECTOR); + goto done; + } c->regs[c->modrm_rm] = ops->get_cr(c->modrm_reg, ctxt->vcpu); c->dst.type = OP_NONE; /* no writeback */ break; -- 1.6.5
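The case ranges above reject by exclusion: CR1, CR5-CR7 and CR9-CR15 are not architecturally defined, so reading them must raise #UD. The same predicate written positively, as a hedged sketch (the helper name `cr_exists` is made up for illustration):

```c
#include <assert.h>

/* Control registers reachable via 0f 20 (mov cr, reg); everything
 * else (CR1, CR5-CR7, CR9-CR15) does not exist and should yield #UD. */
static int cr_exists(int cr)
{
    switch (cr) {
    case 0: case 2: case 3: case 4: case 8:
        return 1;
    default:
        return 0;
    }
}
```

Writing the check as an exclusion list in the patch keeps the fall-through to the existing `ops->get_cr()` call untouched, which is why the kernel hunk is insertion-only.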
[PATCH v3 14/30] KVM: x86 emulator: fix return values of syscall/sysenter/sysexit emulations
Return X86EMUL_PROPAGATE_FAULT if a fault was injected. Also inject #UD for those instructions when appropriate. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 17 +++++++++++------ 1 files changed, 11 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 5afddcf..1393bf0 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1600,8 +1600,11 @@ emulate_syscall(struct x86_emulate_ctxt *ctxt) u64 msr_data; /* syscall is not available in real mode */ - if (ctxt->mode == X86EMUL_MODE_REAL || ctxt->mode == X86EMUL_MODE_VM86) - return X86EMUL_UNHANDLEABLE; + if (ctxt->mode == X86EMUL_MODE_REAL || + ctxt->mode == X86EMUL_MODE_VM86) { + kvm_queue_exception(ctxt->vcpu, UD_VECTOR); + return X86EMUL_PROPAGATE_FAULT; + } setup_syscalls_segments(ctxt, &cs, &ss); @@ -1651,14 +1654,16 @@ emulate_sysenter(struct x86_emulate_ctxt *ctxt) /* inject #GP if in real mode */ if (ctxt->mode == X86EMUL_MODE_REAL) { kvm_inject_gp(ctxt->vcpu, 0); - return X86EMUL_UNHANDLEABLE; + return X86EMUL_PROPAGATE_FAULT; } /* XXX sysenter/sysexit have not been tested in 64bit mode. * Therefore, we inject an #UD. */ - if (ctxt->mode == X86EMUL_MODE_PROT64) - return X86EMUL_UNHANDLEABLE; + if (ctxt->mode == X86EMUL_MODE_PROT64) { + kvm_queue_exception(ctxt->vcpu, UD_VECTOR); + return X86EMUL_PROPAGATE_FAULT; + } setup_syscalls_segments(ctxt, &cs, &ss); @@ -1713,7 +1718,7 @@ emulate_sysexit(struct x86_emulate_ctxt *ctxt) if (ctxt->mode == X86EMUL_MODE_REAL || ctxt->mode == X86EMUL_MODE_VM86) { kvm_inject_gp(ctxt->vcpu, 0); - return X86EMUL_UNHANDLEABLE; + return X86EMUL_PROPAGATE_FAULT; } setup_syscalls_segments(ctxt, &cs, &ss); -- 1.6.5
[PATCH v3 05/30] KVM: Provide callback to get/set control registers in emulator ops.
Use this callback instead of directly call kvm function. Also rename realmode_(set|get)_cr to emulator_(set|get)_cr since function has nothing to do with real mode. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_emulate.h |3 +- arch/x86/include/asm/kvm_host.h|2 - arch/x86/kvm/emulate.c |7 +- arch/x86/kvm/x86.c | 114 ++-- 4 files changed, 63 insertions(+), 63 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index 2666d7a..0c5caa4 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -108,7 +108,8 @@ struct x86_emulate_ops { const void *new, unsigned int bytes, struct kvm_vcpu *vcpu); - + ulong (*get_cr)(int cr, struct kvm_vcpu *vcpu); + void (*set_cr)(int cr, ulong val, struct kvm_vcpu *vcpu); }; /* Type, address-of, and value of an instruction's operand. */ diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 8567107..9725856 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -585,8 +585,6 @@ void realmode_lidt(struct kvm_vcpu *vcpu, u16 size, unsigned long address); void realmode_lmsw(struct kvm_vcpu *vcpu, unsigned long msw, unsigned long *rflags); -unsigned long realmode_get_cr(struct kvm_vcpu *vcpu, int cr); -void realmode_set_cr(struct kvm_vcpu *vcpu, int cr, unsigned long value); void kvm_enable_efer_bits(u64); int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data); int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 91450b5..5b060e4 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2483,7 +2483,7 @@ twobyte_insn: break; case 4: /* smsw */ c-dst.bytes = 2; - c-dst.val = realmode_get_cr(ctxt-vcpu, 0); + c-dst.val = ops-get_cr(0, ctxt-vcpu); break; case 6: /* lmsw */ realmode_lmsw(ctxt-vcpu, (u16)c-src.val, @@ -2519,8 +2519,7 @@ twobyte_insn: case 0x20: /* mov cr, reg */ if 
(c-modrm_mod != 3) goto cannot_emulate; - c-regs[c-modrm_rm] = - realmode_get_cr(ctxt-vcpu, c-modrm_reg); + c-regs[c-modrm_rm] = ops-get_cr(c-modrm_reg, ctxt-vcpu); c-dst.type = OP_NONE; /* no writeback */ break; case 0x21: /* mov from dr to reg */ @@ -2534,7 +2533,7 @@ twobyte_insn: case 0x22: /* mov reg, cr */ if (c-modrm_mod != 3) goto cannot_emulate; - realmode_set_cr(ctxt-vcpu, c-modrm_reg, c-modrm_val); + ops-set_cr(c-modrm_reg, c-modrm_val, ctxt-vcpu); c-dst.type = OP_NONE; break; case 0x23: /* mov from reg to dr */ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 56cdaa5..fb00ed5 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3386,12 +3386,70 @@ void kvm_report_emulation_failure(struct kvm_vcpu *vcpu, const char *context) } EXPORT_SYMBOL_GPL(kvm_report_emulation_failure); +static u64 mk_cr_64(u64 curr_cr, u32 new_val) +{ + return (curr_cr ~((1ULL 32) - 1)) | new_val; +} + +static unsigned long emulator_get_cr(int cr, struct kvm_vcpu *vcpu) +{ + unsigned long value; + + switch (cr) { + case 0: + value = kvm_read_cr0(vcpu); + break; + case 2: + value = vcpu-arch.cr2; + break; + case 3: + value = vcpu-arch.cr3; + break; + case 4: + value = kvm_read_cr4(vcpu); + break; + case 8: + value = kvm_get_cr8(vcpu); + break; + default: + vcpu_printf(vcpu, %s: unexpected cr %u\n, __func__, cr); + return 0; + } + + return value; +} + +static void emulator_set_cr(int cr, unsigned long val, struct kvm_vcpu *vcpu) +{ + switch (cr) { + case 0: + kvm_set_cr0(vcpu, mk_cr_64(kvm_read_cr0(vcpu), val)); + break; + case 2: + vcpu-arch.cr2 = val; + break; + case 3: + kvm_set_cr3(vcpu, val); + break; + case 4: + kvm_set_cr4(vcpu, mk_cr_64(kvm_read_cr4(vcpu), val)); + break; + case 8: + kvm_set_cr8(vcpu, val 0xfUL); + break; + default: + vcpu_printf(vcpu, %s: unexpected cr %u\n, __func__, cr); + } +} + static struct x86_emulate_ops emulate_ops = {
[PATCH v3 25/30] KVM: x86 emulator: fix in/out emulation.
in/out emulation is broken now. The breakage is different depending on where IO device resides. If it is in userspace emulator reports emulation failure since it incorrectly interprets kvm_emulate_pio() return value. If IO device is in the kernel emulation of 'in' will do nothing since kvm_emulate_pio() stores result directly into vcpu registers, so emulator will overwrite result of emulation during commit of shadowed register. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_emulate.h |7 + arch/x86/include/asm/kvm_host.h|3 +- arch/x86/kvm/emulate.c | 49 - arch/x86/kvm/svm.c | 20 +-- arch/x86/kvm/vmx.c | 18 ++-- arch/x86/kvm/x86.c | 213 ++-- 6 files changed, 177 insertions(+), 133 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index bd46929..679245c 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -119,6 +119,13 @@ struct x86_emulate_ops { const void *new, unsigned int bytes, struct kvm_vcpu *vcpu); + + int (*pio_in_emulated)(int size, unsigned short port, void *val, + unsigned int count, struct kvm_vcpu *vcpu); + + int (*pio_out_emulated)(int size, unsigned short port, const void *val, + unsigned int count, struct kvm_vcpu *vcpu); + bool (*get_cached_descriptor)(struct desc_struct *desc, int seg, struct kvm_vcpu *vcpu); void (*set_cached_descriptor)(struct desc_struct *desc, diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 72997aa..4a4fb8d 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -589,8 +589,7 @@ int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); struct x86_emulate_ctxt; -int kvm_emulate_pio(struct kvm_vcpu *vcpu, int in, -int size, unsigned port); +int kvm_fast_pio_out(struct kvm_vcpu *vcpu, int size, unsigned short port); int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, int in, int size, unsigned long count, int down, gva_t address, int rep, unsigned 
port); diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index a166235..873da58 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -210,13 +210,13 @@ static u32 opcode_table[256] = { 0, 0, 0, 0, 0, 0, 0, 0, /* 0xE0 - 0xE7 */ 0, 0, 0, 0, - ByteOp | SrcImmUByte, SrcImmUByte, - ByteOp | SrcImmUByte, SrcImmUByte, + ByteOp | SrcImmUByte | DstAcc, SrcImmUByte | DstAcc, + ByteOp | SrcImmUByte | DstAcc, SrcImmUByte | DstAcc, /* 0xE8 - 0xEF */ SrcImm | Stack, SrcImm | ImplicitOps, SrcImmU | Src2Imm16 | No64, SrcImmByte | ImplicitOps, - SrcNone | ByteOp | ImplicitOps, SrcNone | ImplicitOps, - SrcNone | ByteOp | ImplicitOps, SrcNone | ImplicitOps, + SrcNone | ByteOp | DstAcc, SrcNone | DstAcc, + SrcNone | ByteOp | DstAcc, SrcNone | DstAcc, /* 0xF0 - 0xF7 */ 0, 0, 0, 0, ImplicitOps | Priv, ImplicitOps, Group | Group3_Byte, Group | Group3, @@ -2422,8 +2422,6 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) u64 msr_data; unsigned long saved_eip = 0; struct decode_cache *c = ctxt-decode; - unsigned int port; - int io_dir_in; int rc = X86EMUL_CONTINUE; ctxt-interruptibility = 0; @@ -2819,14 +2817,10 @@ special_insn: break; case 0xe4: /* inb */ case 0xe5: /* in */ - port = c-src.val; - io_dir_in = 1; - goto do_io; + goto do_io_in; case 0xe6: /* outb */ case 0xe7: /* out */ - port = c-src.val; - io_dir_in = 0; - goto do_io; + goto do_io_out; case 0xe8: /* call (near) */ { long int rel = c-src.val; c-src.val = (unsigned long) c-eip; @@ -2851,25 +2845,28 @@ special_insn: break; case 0xec: /* in al,dx */ case 0xed: /* in (e/r)ax,dx */ - port = c-regs[VCPU_REGS_RDX]; - io_dir_in = 1; - goto do_io; + c-src.val = c-regs[VCPU_REGS_RDX]; + do_io_in: + c-dst.bytes = min(c-dst.bytes, 4u); + if (!emulator_io_permited(ctxt, ops, c-src.val, c-dst.bytes)) { + kvm_inject_gp(ctxt-vcpu, 0); + goto done; + } + ops-pio_in_emulated(c-dst.bytes, c-src.val, c-dst.val, 1, +ctxt-vcpu); +
[PATCH v3 17/30] KVM: x86 emulator: cleanup grp3 return value
When x86_emulate_insn() does not know how to emulate an instruction it exits via the cannot_emulate label in all cases except when emulating grp3. Fix that. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 12 ++++-------- 1 files changed, 4 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 46a7ee3..d696cbd 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1397,7 +1397,6 @@ static inline int emulate_grp3(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) { struct decode_cache *c = &ctxt->decode; - int rc = X86EMUL_CONTINUE; switch (c->modrm_reg) { case 0 ... 1: /* test */ @@ -1410,11 +1409,9 @@ static inline int emulate_grp3(struct x86_emulate_ctxt *ctxt, emulate_1op("neg", c->dst, ctxt->eflags); break; default: - DPRINTF("Cannot emulate %02x\n", c->b); - rc = X86EMUL_UNHANDLEABLE; - break; + return 0; } - return rc; + return 1; } static inline int emulate_grp45(struct x86_emulate_ctxt *ctxt, @@ -2374,9 +2371,8 @@ special_insn: c->dst.type = OP_NONE; /* Disable writeback. */ break; case 0xf6 ... 0xf7: /* Grp3 */ - rc = emulate_grp3(ctxt, ops); - if (rc != X86EMUL_CONTINUE) - goto done; + if (!emulate_grp3(ctxt, ops)) + goto cannot_emulate; break; case 0xf8: /* clc */ ctxt->eflags &= ~EFLG_CF; -- 1.6.5
[PATCH v3 29/30] KVM: x86 emulator: introduce pio in string read ahead.
To optimize rep ins instruction do IO in big chunks ahead of time instead of doing it only when required during instruction emulation. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_emulate.h |7 ++ arch/x86/kvm/emulate.c | 43 +++ 2 files changed, 45 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index 7fda16f..b5e12c5 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -151,6 +151,12 @@ struct fetch_cache { unsigned long end; }; +struct read_cache { + u8 data[1024]; + unsigned long pos; + unsigned long end; +}; + struct decode_cache { u8 twobyte; u8 b; @@ -178,6 +184,7 @@ struct decode_cache { void *modrm_ptr; unsigned long modrm_val; struct fetch_cache fetch; + struct read_cache io_read; }; struct x86_emulate_ctxt { diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index c4da60e..d9cf93b 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1257,6 +1257,34 @@ done: return (rc == X86EMUL_UNHANDLEABLE) ? -1 : 0; } +static int pio_in_emulated(struct x86_emulate_ctxt *ctxt, + struct x86_emulate_ops *ops, + unsigned int size, unsigned short port, + void *dest) +{ + struct read_cache *rc = ctxt-decode.io_read; + + if (rc-pos == rc-end) { /* refill pio read ahead */ + struct decode_cache *c = ctxt-decode; + unsigned int in_page, n; + unsigned int count = c-rep_prefix ? + address_mask(c, c-regs[VCPU_REGS_RCX]) : 1; + in_page = (ctxt-eflags EFLG_DF) ? 
+ offset_in_page(c-regs[VCPU_REGS_RDI]) : + PAGE_SIZE - offset_in_page(c-regs[VCPU_REGS_RDI]); + n = min(min(in_page, (unsigned int)sizeof(rc-data)) / size, + count); + rc-pos = rc-end = 0; + if (!ops-pio_in_emulated(size, port, rc-data, n, ctxt-vcpu)) + return 0; + rc-end = n * size; + } + + memcpy(dest, rc-data + rc-pos, size); + rc-pos += size; + return 1; +} + static u32 desc_limit_scaled(struct desc_struct *desc) { u32 limit = get_desc_limit(desc); @@ -2618,8 +2646,8 @@ special_insn: kvm_inject_gp(ctxt-vcpu, 0); goto done; } - if (!ops-pio_in_emulated(c-dst.bytes, c-regs[VCPU_REGS_RDX], - c-dst.val, 1, ctxt-vcpu)) + if (!pio_in_emulated(ctxt, ops, c-dst.bytes, +c-regs[VCPU_REGS_RDX], c-dst.val)) goto done; /* IO is needed, skip writeback */ break; case 0x6e: /* outsb */ @@ -2835,8 +2863,7 @@ special_insn: kvm_inject_gp(ctxt-vcpu, 0); goto done; } - ops-pio_in_emulated(c-dst.bytes, c-src.val, c-dst.val, 1, -ctxt-vcpu); + pio_in_emulated(ctxt, ops, c-dst.bytes, c-src.val, c-dst.val); break; case 0xee: /* out al,dx */ case 0xef: /* out (e/r)ax,dx */ @@ -2923,8 +2950,14 @@ writeback: string_addr_inc(ctxt, es_base(ctxt), VCPU_REGS_RDI, c-dst); if (c-rep_prefix (c-d String)) { + struct read_cache *rc = ctxt-decode.io_read; register_address_increment(c, c-regs[VCPU_REGS_RCX], -1); - if (!(c-regs[VCPU_REGS_RCX] 0x3ff)) + /* +* Re-enter guest when pio read ahead buffer is empty or, +* if it is not used, after each 1024 iteration. +*/ + if ((rc-end == 0 !(c-regs[VCPU_REGS_RCX] 0x3ff)) || + (rc-end != 0 rc-end == rc-pos)) ctxt-restart = false; } -- 1.6.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
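The refill hunk above sizes each read-ahead burst so it never crosses the page the guest's string buffer occupies, never overflows the 1024-byte cache, and never exceeds the remaining REP count. A self-contained sketch of just that arithmetic; names and the direction-flag handling are paraphrased from the hunk, not taken verbatim:

```c
#include <stdint.h>

#define MY_PAGE_SIZE 4096u   /* assumed 4 KiB pages, as on x86 */
#define RC_BUF       1024u   /* size of the read_cache data array */

static unsigned min_u(unsigned a, unsigned b) { return a < b ? a : b; }

/* How many elements to read ahead for `rep ins`: bounded by the space
 * left in the destination page (direction-dependent via EFLAGS.DF),
 * the cache size, and the remaining iteration count in RCX. */
static unsigned readahead_count(unsigned long rdi, unsigned size,
                                unsigned long rcx, int df)
{
    unsigned off = rdi & (MY_PAGE_SIZE - 1);
    unsigned in_page = df ? off : MY_PAGE_SIZE - off;
    return min_u(min_u(in_page, RC_BUF) / size, (unsigned)rcx);
}
```

With a 4-byte element and a page-aligned RDI this yields 256 elements per burst, so a long `rep ins` does one emulated I/O call per kilobyte instead of one per element.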
[PATCH v3 22/30] KVM: x86 emulator: populate OP_MEM operand during decoding.
All struct operand fields are initialized during decoding for all operand types except OP_MEM, but there is no reason for that. Move OP_MEM operand initialization into decoding stage for consistency. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 66 +--- 1 files changed, 29 insertions(+), 37 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 702bfff..55b8a8b 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1057,6 +1057,10 @@ done_prefixes: if (c-ad_bytes != 8) c-modrm_ea = (u32)c-modrm_ea; + + if (c-rip_relative) + c-modrm_ea += c-eip; + /* * Decode and fetch the source operand: register, memory * or immediate. @@ -1091,6 +1095,8 @@ done_prefixes: break; } c-src.type = OP_MEM; + c-src.ptr = (unsigned long *)c-modrm_ea; + c-src.val = 0; break; case SrcImm: case SrcImmU: @@ -1169,8 +1175,10 @@ done_prefixes: c-src2.val = 1; break; case Src2Mem16: - c-src2.bytes = 2; c-src2.type = OP_MEM; + c-src2.bytes = 2; + c-src2.ptr = (unsigned long *)(c-modrm_ea + c-src.bytes); + c-src2.val = 0; break; } @@ -1192,6 +1200,15 @@ done_prefixes: break; } c-dst.type = OP_MEM; + c-dst.ptr = (unsigned long *)c-modrm_ea; + c-dst.bytes = (c-d ByteOp) ? 1 : c-op_bytes; + c-dst.val = 0; + if (c-d BitOp) { + unsigned long mask = ~(c-dst.bytes * 8 - 1); + + c-dst.ptr = (void *)c-dst.ptr + + (c-src.val mask) / 8; + } break; case DstAcc: c-dst.type = OP_REG; @@ -1215,9 +1232,6 @@ done_prefixes: break; } - if (c-rip_relative) - c-modrm_ea += c-eip; - done: return (rc == X86EMUL_UNHANDLEABLE) ? 
-1 : 0; } @@ -1638,14 +1652,13 @@ static inline int emulate_grp45(struct x86_emulate_ctxt *ctxt, } static inline int emulate_grp9(struct x86_emulate_ctxt *ctxt, - struct x86_emulate_ops *ops, - unsigned long memop) + struct x86_emulate_ops *ops) { struct decode_cache *c = ctxt-decode; u64 old, new; int rc; - rc = ops-read_emulated(memop, old, 8, ctxt-vcpu); + rc = ops-read_emulated(c-modrm_ea, old, 8, ctxt-vcpu); if (rc != X86EMUL_CONTINUE) return rc; @@ -1660,7 +1673,7 @@ static inline int emulate_grp9(struct x86_emulate_ctxt *ctxt, new = ((u64)c-regs[VCPU_REGS_RCX] 32) | (u32) c-regs[VCPU_REGS_RBX]; - rc = ops-cmpxchg_emulated(memop, old, new, 8, ctxt-vcpu); + rc = ops-cmpxchg_emulated(c-modrm_ea, old, new, 8, ctxt-vcpu); if (rc != X86EMUL_CONTINUE) return rc; ctxt-eflags |= EFLG_ZF; @@ -2378,7 +2391,6 @@ int emulator_task_switch(struct x86_emulate_ctxt *ctxt, int x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) { - unsigned long memop = 0; u64 msr_data; unsigned long saved_eip = 0; struct decode_cache *c = ctxt-decode; @@ -2413,9 +2425,6 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) goto done; } - if (((c-d ModRM) (c-modrm_mod != 3)) || (c-d MemAbs)) - memop = c-modrm_ea; - if (c-rep_prefix (c-d String)) { /* All REP prefixes have the same first termination condition */ if (address_mask(c, c-regs[VCPU_REGS_RCX]) == 0) { @@ -2447,8 +2456,6 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) } if (c-src.type == OP_MEM) { - c-src.ptr = (unsigned long *)memop; - c-src.val = 0; rc = ops-read_emulated((unsigned long)c-src.ptr, c-src.val, c-src.bytes, @@ -2459,8 +2466,6 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) } if (c-src2.type == OP_MEM) { - c-src2.ptr = (unsigned long *)(memop + c-src.bytes); - c-src2.val = 0; rc = ops-read_emulated((unsigned long)c-src2.ptr, c-src2.val, c-src2.bytes, @@ -2473,25 +2478,12 @@ x86_emulate_insn(struct 
x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops)
[PATCH v3 24/30] KVM: x86 emulator: during rep emulation decrement ECX only if emulation succeeded
Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 15 ++++++++------- 1 files changed, 8 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 6ebd642..a166235 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2407,13 +2407,13 @@ int emulator_task_switch(struct x86_emulate_ctxt *ctxt, } static void string_addr_inc(struct x86_emulate_ctxt *ctxt, unsigned long base, - int reg, unsigned long **ptr) + int reg, struct operand *op) { struct decode_cache *c = &ctxt->decode; int df = (ctxt->eflags & EFLG_DF) ? -1 : 1; - register_address_increment(c, &c->regs[reg], df * c->src.bytes); - *ptr = (unsigned long *)register_address(c, base, c->regs[reg]); + register_address_increment(c, &c->regs[reg], df * op->bytes); + op->ptr = (unsigned long *)register_address(c, base, c->regs[reg]); } int @@ -2479,7 +2479,6 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) goto done; } } - register_address_increment(c, &c->regs[VCPU_REGS_RCX], -1); c->eip = ctxt->eip; } @@ -2932,11 +2931,13 @@ writeback: if ((c->d & SrcMask) == SrcSI) string_addr_inc(ctxt, seg_override_base(ctxt, c), VCPU_REGS_RSI, - &c->src.ptr); + &c->src); if ((c->d & DstMask) == DstDI) - string_addr_inc(ctxt, es_base(ctxt), VCPU_REGS_RDI, - &c->dst.ptr); + string_addr_inc(ctxt, es_base(ctxt), VCPU_REGS_RDI, &c->dst); + + if (c->rep_prefix && (c->d & String)) + register_address_increment(c, &c->regs[VCPU_REGS_RCX], -1); /* Commit shadow register state. */ memcpy(ctxt->vcpu->arch.regs, c->regs, sizeof c->regs); -- 1.6.5
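The ordering the patch enforces can be shown in miniature: the iteration count must drop only after an iteration fully succeeds, so a fault in the middle of an iteration leaves RCX still counting it and the restart resumes correctly. This sketch is illustrative only; the names and the fault-injection mechanism are invented stand-ins:

```c
#include <assert.h>

static int budget;                       /* successes left before a "fault" */
static int iter_ok(void) { return budget-- > 0; }

/* Mirror of the REP loop discipline: decrement RCX only after the
 * instruction body (including writeback) succeeded. */
static unsigned long rep_emulate(unsigned long rcx, int (*one_iter)(void))
{
    while (rcx) {
        if (!one_iter())
            break;      /* fault: this iteration still counts as pending */
        rcx--;          /* decrement only after successful completion */
    }
    return rcx;
}
```

Decrementing before the body, as the old code did, would have under-counted the remaining iterations whenever emulation of one step failed and had to be retried.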
[PATCH v3 21/30] KVM: Use task switch from emulator.c
Remove old task switch code from x86.c Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/x86.c | 557 ++-- 1 files changed, 17 insertions(+), 540 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 2ef83db..7d1b481 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4795,553 +4795,30 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu, return 0; } -static void seg_desct_to_kvm_desct(struct desc_struct *seg_desc, u16 selector, - struct kvm_segment *kvm_desct) -{ - kvm_desct-base = get_desc_base(seg_desc); - kvm_desct-limit = get_desc_limit(seg_desc); - if (seg_desc-g) { - kvm_desct-limit = 12; - kvm_desct-limit |= 0xfff; - } - kvm_desct-selector = selector; - kvm_desct-type = seg_desc-type; - kvm_desct-present = seg_desc-p; - kvm_desct-dpl = seg_desc-dpl; - kvm_desct-db = seg_desc-d; - kvm_desct-s = seg_desc-s; - kvm_desct-l = seg_desc-l; - kvm_desct-g = seg_desc-g; - kvm_desct-avl = seg_desc-avl; - if (!selector) - kvm_desct-unusable = 1; - else - kvm_desct-unusable = 0; - kvm_desct-padding = 0; -} - -static void get_segment_descriptor_dtable(struct kvm_vcpu *vcpu, - u16 selector, - struct desc_ptr *dtable) -{ - if (selector 1 2) { - struct kvm_segment kvm_seg; - - kvm_get_segment(vcpu, kvm_seg, VCPU_SREG_LDTR); - - if (kvm_seg.unusable) - dtable-size = 0; - else - dtable-size = kvm_seg.limit; - dtable-address = kvm_seg.base; - } - else - kvm_x86_ops-get_gdt(vcpu, dtable); -} - -/* allowed just for 8 bytes segments */ -static int load_guest_segment_descriptor(struct kvm_vcpu *vcpu, u16 selector, -struct desc_struct *seg_desc) -{ - struct desc_ptr dtable; - u16 index = selector 3; - int ret; - u32 err; - gva_t addr; - - get_segment_descriptor_dtable(vcpu, selector, dtable); - - if (dtable.size index * 8 + 7) { - kvm_queue_exception_e(vcpu, GP_VECTOR, selector 0xfffc); - return X86EMUL_PROPAGATE_FAULT; - } - addr = dtable.address + index * 8; - ret = kvm_read_guest_virt_system(addr, seg_desc, sizeof(*seg_desc), -vcpu, 
err); - if (ret == X86EMUL_PROPAGATE_FAULT) - kvm_inject_page_fault(vcpu, addr, err); - - return ret; -} - -/* allowed just for 8 bytes segments */ -static int save_guest_segment_descriptor(struct kvm_vcpu *vcpu, u16 selector, -struct desc_struct *seg_desc) -{ - struct desc_ptr dtable; - u16 index = selector 3; - - get_segment_descriptor_dtable(vcpu, selector, dtable); - - if (dtable.size index * 8 + 7) - return 1; - return kvm_write_guest_virt(dtable.address + index*8, seg_desc, sizeof(*seg_desc), vcpu, NULL); -} - -static gpa_t get_tss_base_addr_write(struct kvm_vcpu *vcpu, - struct desc_struct *seg_desc) -{ - u32 base_addr = get_desc_base(seg_desc); - - return kvm_mmu_gva_to_gpa_write(vcpu, base_addr, NULL); -} - -static gpa_t get_tss_base_addr_read(struct kvm_vcpu *vcpu, -struct desc_struct *seg_desc) -{ - u32 base_addr = get_desc_base(seg_desc); - - return kvm_mmu_gva_to_gpa_read(vcpu, base_addr, NULL); -} - -static u16 get_segment_selector(struct kvm_vcpu *vcpu, int seg) -{ - struct kvm_segment kvm_seg; - - kvm_get_segment(vcpu, kvm_seg, seg); - return kvm_seg.selector; -} - -static int kvm_load_realmode_segment(struct kvm_vcpu *vcpu, u16 selector, int seg) -{ - struct kvm_segment segvar = { - .base = selector 4, - .limit = 0x, - .selector = selector, - .type = 3, - .present = 1, - .dpl = 3, - .db = 0, - .s = 1, - .l = 0, - .g = 0, - .avl = 0, - .unusable = 0, - }; - kvm_x86_ops-set_segment(vcpu, segvar, seg); - return X86EMUL_CONTINUE; -} - -static int is_vm86_segment(struct kvm_vcpu *vcpu, int seg) -{ - return (seg != VCPU_SREG_LDTR) - (seg != VCPU_SREG_TR) - (kvm_get_rflags(vcpu) X86_EFLAGS_VM); -} - -int kvm_load_segment_descriptor(struct kvm_vcpu *vcpu, u16 selector, int seg) -{ - struct kvm_segment kvm_seg; - struct desc_struct seg_desc; - u8 dpl, rpl, cpl; - unsigned err_vec = GP_VECTOR; - u32 err_code = 0; - bool
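The helpers deleted above converted a raw GDT/LDT descriptor into a `kvm_segment`, including scaling the limit by the granularity bit. A minimal model of that scaling (illustrative names, not the kernel helpers — the real code lives in `seg_desct_to_kvm_desct()` / `desc_limit_scaled()`):

```c
#include <assert.h>
#include <stdint.h>

/* When the descriptor's G bit is set, the 20-bit limit field counts
 * 4K pages, so the byte-granular limit is (limit << 12) | 0xfff;
 * otherwise the limit is already in bytes. */
static uint32_t scaled_seg_limit(uint32_t raw_limit, int g_bit)
{
    return g_bit ? (raw_limit << 12) | 0xfff : raw_limit;
}
```

With the maximum 20-bit raw limit and G=1 this yields the full 4 GiB limit, `0xffffffff`.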
[PATCH v3 30/30] KVM: small kvm_arch_vcpu_ioctl_run() cleanup.
Unify all conditions that get us back into emulator after returning from userspace. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/x86.c | 32 ++-- 1 files changed, 6 insertions(+), 26 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index cd0043a..1c00c06 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4505,33 +4505,13 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) if (!irqchip_in_kernel(vcpu-kvm)) kvm_set_cr8(vcpu, kvm_run-cr8); - if (vcpu-arch.pio.count) { - vcpu-srcu_idx = srcu_read_lock(vcpu-kvm-srcu); - r = emulate_instruction(vcpu, 0, 0, EMULTYPE_NO_DECODE); - srcu_read_unlock(vcpu-kvm-srcu, vcpu-srcu_idx); - if (r == EMULATE_DO_MMIO) { - r = 0; - goto out; + if (vcpu-arch.pio.count || vcpu-mmio_needed || + vcpu-arch.emulate_ctxt.restart) { + if (vcpu-mmio_needed) { + memcpy(vcpu-mmio_data, kvm_run-mmio.data, 8); + vcpu-mmio_read_completed = 1; + vcpu-mmio_needed = 0; } - } - if (vcpu-mmio_needed) { - memcpy(vcpu-mmio_data, kvm_run-mmio.data, 8); - vcpu-mmio_read_completed = 1; - vcpu-mmio_needed = 0; - - vcpu-srcu_idx = srcu_read_lock(vcpu-kvm-srcu); - r = emulate_instruction(vcpu, vcpu-arch.mmio_fault_cr2, 0, - EMULTYPE_NO_DECODE); - srcu_read_unlock(vcpu-kvm-srcu, vcpu-srcu_idx); - if (r == EMULATE_DO_MMIO) { - /* -* Read-modify-write. Back to userspace. -*/ - r = 0; - goto out; - } - } - if (vcpu-arch.emulate_ctxt.restart) { vcpu-srcu_idx = srcu_read_lock(vcpu-kvm-srcu); r = emulate_instruction(vcpu, 0, 0, EMULTYPE_NO_DECODE); srcu_read_unlock(vcpu-kvm-srcu, vcpu-srcu_idx); -- 1.6.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 28/30] KVM: x86 emulator: restart string instruction without going back to a guest.
Currently when string instruction is only partially complete we go back to a guest mode, guest tries to reexecute instruction and exits again and at this point emulation continues. Avoid all of this by restarting instruction without going back to a guest mode, but return to a guest mode each 1024 iterations to allow interrupt injection. Pending exception causes immediate guest entry too. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_emulate.h |1 + arch/x86/kvm/emulate.c | 34 +++--- arch/x86/kvm/x86.c | 19 ++- 3 files changed, 42 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index 679245c..7fda16f 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -193,6 +193,7 @@ struct x86_emulate_ctxt { /* interruptibility state, as a result of execution of STI or MOV SS */ int interruptibility; + bool restart; /* restart string instruction after writeback */ /* decode cache */ struct decode_cache decode; }; diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 541f3c9..c4da60e 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -927,8 +927,11 @@ x86_decode_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) int mode = ctxt-mode; int def_op_bytes, def_ad_bytes, group; - /* Shadow copy of register state. Committed on successful emulation. */ + /* we cannot decode insn before we complete previous rep insn */ + WARN_ON(ctxt-restart); + + /* Shadow copy of register state. Committed on successful emulation. 
*/ memset(c, 0, sizeof(struct decode_cache)); c-eip = ctxt-eip; ctxt-cs_base = seg_base(ctxt, VCPU_SREG_CS); @@ -2422,6 +2425,7 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) u64 msr_data; struct decode_cache *c = ctxt-decode; int rc = X86EMUL_CONTINUE; + int saved_dst_type = c-dst.type; ctxt-interruptibility = 0; @@ -2450,8 +2454,11 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) } if (c-rep_prefix (c-d String)) { + ctxt-restart = true; /* All REP prefixes have the same first termination condition */ if (address_mask(c, c-regs[VCPU_REGS_RCX]) == 0) { + string_done: + ctxt-restart = false; kvm_rip_write(ctxt-vcpu, c-eip); goto done; } @@ -2463,17 +2470,13 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) * - if REPNE/REPNZ and ZF = 1 then done */ if ((c-b == 0xa6) || (c-b == 0xa7) || - (c-b == 0xae) || (c-b == 0xaf)) { + (c-b == 0xae) || (c-b == 0xaf)) { if ((c-rep_prefix == REPE_PREFIX) - ((ctxt-eflags EFLG_ZF) == 0)) { - kvm_rip_write(ctxt-vcpu, c-eip); - goto done; - } + ((ctxt-eflags EFLG_ZF) == 0)) + goto string_done; if ((c-rep_prefix == REPNE_PREFIX) - ((ctxt-eflags EFLG_ZF) == EFLG_ZF)) { - kvm_rip_write(ctxt-vcpu, c-eip); - goto done; - } + ((ctxt-eflags EFLG_ZF) == EFLG_ZF)) + goto string_done; } c-eip = ctxt-eip; } @@ -2906,6 +2909,12 @@ writeback: if (rc != X86EMUL_CONTINUE) goto done; + /* +* restore dst type in case the decoding will be reused +* (happens for string instruction ) +*/ + c-dst.type = saved_dst_type; + if ((c-d SrcMask) == SrcSI) string_addr_inc(ctxt, seg_override_base(ctxt, c), VCPU_REGS_RSI, c-src); @@ -2913,8 +2922,11 @@ writeback: if ((c-d DstMask) == DstDI) string_addr_inc(ctxt, es_base(ctxt), VCPU_REGS_RDI, c-dst); - if (c-rep_prefix (c-d String)) + if (c-rep_prefix (c-d String)) { register_address_increment(c, c-regs[VCPU_REGS_RCX], -1); + if (!(c-regs[VCPU_REGS_RCX] 0x3ff)) + ctxt-restart = false; + } /* Commit shadow register state. 
*/ memcpy(ctxt->vcpu->arch.regs, c->regs, sizeof c->regs); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b8237ac..cd0043a 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3718,6 +3718,7 @@ int
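The patch's back-off rule can be modeled in a few lines: each iteration decrements RCX, and emulation drops the `restart` flag (returning to the guest-entry path) whenever the low 10 bits of RCX are all zero, bounding work to 1024 iterations per entry. This is an illustrative standalone model, not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Count how many string iterations run before the emulator yields,
 * using the same `!(rcx & 0x3ff)` mask as the patch. */
static unsigned iterations_before_yield(uint64_t rcx)
{
    unsigned n = 0;
    bool restart = true;

    while (restart && rcx) {
        rcx--;               /* register_address_increment(..., -1) */
        n++;
        if (!(rcx & 0x3ff))  /* low 10 bits clear: yield for irq injection */
            restart = false;
    }
    return n;
}
```

So a count of 2000 yields once after 976 iterations (when RCX first hits the multiple-of-1024 boundary at 1024) and finishes the rest on the next entry.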
[PATCH v3 26/30] KVM: x86 emulator: Move string pio emulation into emulator.c
Currently emulation is done outside of emulator so things like doing ins/outs to/from mmio are broken it also makes it hard (if not impossible) to implement single stepping in the future. The implementation in this patch is not efficient since it exits to userspace for each IO while previous implementation did 'ins' in batches. Further patch that implements pio in string read ahead address this problem. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_host.h |8 -- arch/x86/kvm/emulate.c | 48 +++-- arch/x86/kvm/x86.c | 204 +++ 3 files changed, 31 insertions(+), 229 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 4a4fb8d..c072401 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -224,14 +224,9 @@ struct kvm_pv_mmu_op_buffer { struct kvm_pio_request { unsigned long count; - int cur_count; - gva_t guest_gva; int in; int port; int size; - int string; - int down; - int rep; }; /* @@ -590,9 +585,6 @@ int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); struct x86_emulate_ctxt; int kvm_fast_pio_out(struct kvm_vcpu *vcpu, int size, unsigned short port); -int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, int in, - int size, unsigned long count, int down, - gva_t address, int rep, unsigned port); void kvm_emulate_cpuid(struct kvm_vcpu *vcpu); int kvm_emulate_halt(struct kvm_vcpu *vcpu); int emulate_invlpg(struct kvm_vcpu *vcpu, gva_t address); diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 873da58..1bedbb6 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -153,8 +153,8 @@ static u32 opcode_table[256] = { 0, 0, 0, 0, /* 0x68 - 0x6F */ SrcImm | Mov | Stack, 0, SrcImmByte | Mov | Stack, 0, - SrcNone | ByteOp | ImplicitOps, SrcNone | ImplicitOps, /* insb, insw/insd */ - SrcNone | ByteOp | ImplicitOps, SrcNone | ImplicitOps, /* outsb, outsw/outsd */ + DstDI | ByteOp | Mov | String, DstDI | Mov | String, /* insb, insw/insd 
*/ + SrcSI | ByteOp | ImplicitOps | String, SrcSI | ImplicitOps | String, /* outsb, outsw/outsd */ /* 0x70 - 0x77 */ SrcImmByte, SrcImmByte, SrcImmByte, SrcImmByte, SrcImmByte, SrcImmByte, SrcImmByte, SrcImmByte, @@ -2611,47 +2611,29 @@ special_insn: break; case 0x6c: /* insb */ case 0x6d: /* insw/insd */ + c-dst.bytes = min(c-dst.bytes, 4u); if (!emulator_io_permited(ctxt, ops, c-regs[VCPU_REGS_RDX], - (c-d ByteOp) ? 1 : c-op_bytes)) { + c-dst.bytes)) { kvm_inject_gp(ctxt-vcpu, 0); goto done; } - if (kvm_emulate_pio_string(ctxt-vcpu, - 1, - (c-d ByteOp) ? 1 : c-op_bytes, - c-rep_prefix ? - address_mask(c, c-regs[VCPU_REGS_RCX]) : 1, - (ctxt-eflags EFLG_DF), - register_address(c, es_base(ctxt), -c-regs[VCPU_REGS_RDI]), - c-rep_prefix, - c-regs[VCPU_REGS_RDX]) == 0) { - c-eip = saved_eip; - return -1; - } - return 0; + if (!ops-pio_in_emulated(c-dst.bytes, c-regs[VCPU_REGS_RDX], + c-dst.val, 1, ctxt-vcpu)) + goto done; /* IO is needed, skip writeback */ + break; case 0x6e: /* outsb */ case 0x6f: /* outsw/outsd */ + c-src.bytes = min(c-src.bytes, 4u); if (!emulator_io_permited(ctxt, ops, c-regs[VCPU_REGS_RDX], - (c-d ByteOp) ? 1 : c-op_bytes)) { + c-src.bytes)) { kvm_inject_gp(ctxt-vcpu, 0); goto done; } - if (kvm_emulate_pio_string(ctxt-vcpu, - 0, - (c-d ByteOp) ? 1 : c-op_bytes, - c-rep_prefix ? - address_mask(c, c-regs[VCPU_REGS_RCX]) : 1, - (ctxt-eflags EFLG_DF), -register_address(c, - seg_override_base(ctxt, c), -c-regs[VCPU_REGS_RSI]), -
[PATCH v3 18/30] KVM: x86 emulator: Provide more callbacks for x86 emulator.
Provide get_cached_descriptor(), set_cached_descriptor(), get_segment_selector(), set_segment_selector(), get_gdt(), write_std() callbacks. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_emulate.h | 16 + arch/x86/kvm/x86.c | 130 +++ 2 files changed, 131 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index 0765725..f901467 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -63,6 +63,15 @@ struct x86_emulate_ops { unsigned int bytes, struct kvm_vcpu *vcpu, u32 *error); /* +* write_std: Write bytes of standard (non-emulated/special) memory. +*Used for descriptor writing. +* @addr: [IN ] Linear address to which to write. +* @val: [OUT] Value write to memory, zero-extended to 'u_long'. +* @bytes: [IN ] Number of bytes to write to memory. +*/ + int (*write_std)(unsigned long addr, void *val, +unsigned int bytes, struct kvm_vcpu *vcpu, u32 *error); + /* * fetch: Read bytes of standard (non-emulated/special) memory. *Used for instruction fetch. * @addr: [IN ] Linear address from which to read. 
@@ -108,6 +117,13 @@ struct x86_emulate_ops { const void *new, unsigned int bytes, struct kvm_vcpu *vcpu); + bool (*get_cached_descriptor)(struct desc_struct *desc, + int seg, struct kvm_vcpu *vcpu); + void (*set_cached_descriptor)(struct desc_struct *desc, + int seg, struct kvm_vcpu *vcpu); + u16 (*get_segment_selector)(int seg, struct kvm_vcpu *vcpu); + void (*set_segment_selector)(u16 sel, int seg, struct kvm_vcpu *vcpu); + void (*get_gdt)(struct desc_ptr *dt, struct kvm_vcpu *vcpu); ulong (*get_cr)(int cr, struct kvm_vcpu *vcpu); void (*set_cr)(int cr, ulong val, struct kvm_vcpu *vcpu); int (*cpl)(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 022d28e..2ef83db 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3050,6 +3050,18 @@ static int vcpu_mmio_read(struct kvm_vcpu *vcpu, gpa_t addr, int len, void *v) return kvm_io_bus_read(vcpu-kvm, KVM_MMIO_BUS, addr, len, v); } +static void kvm_set_segment(struct kvm_vcpu *vcpu, + struct kvm_segment *var, int seg) +{ + kvm_x86_ops-set_segment(vcpu, var, seg); +} + +void kvm_get_segment(struct kvm_vcpu *vcpu, +struct kvm_segment *var, int seg) +{ + kvm_x86_ops-get_segment(vcpu, var, seg); +} + gpa_t kvm_mmu_gva_to_gpa_read(struct kvm_vcpu *vcpu, gva_t gva, u32 *error) { u32 access = (kvm_x86_ops-get_cpl(vcpu) == 3) ? 
PFERR_USER_MASK : 0; @@ -3130,14 +3142,18 @@ static int kvm_read_guest_virt_system(gva_t addr, void *val, unsigned int bytes, return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, 0, error); } -static int kvm_write_guest_virt(gva_t addr, void *val, unsigned int bytes, - struct kvm_vcpu *vcpu, u32 *error) +static int kvm_write_guest_virt_helper(gva_t addr, void *val, + unsigned int bytes, + struct kvm_vcpu *vcpu, u32 access, + u32 *error) { void *data = val; int r = X86EMUL_CONTINUE; + access |= PFERR_WRITE_MASK; + while (bytes) { - gpa_t gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, error); + gpa_t gpa = vcpu-arch.mmu.gva_to_gpa(vcpu, addr, access, error); unsigned offset = addr (PAGE_SIZE-1); unsigned towrite = min(bytes, (unsigned)PAGE_SIZE - offset); int ret; @@ -3160,6 +3176,19 @@ out: return r; } +static int kvm_write_guest_virt(gva_t addr, void *val, unsigned int bytes, + struct kvm_vcpu *vcpu, u32 *error) +{ + u32 access = (kvm_x86_ops-get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; + return kvm_write_guest_virt_helper(addr, val, bytes, vcpu, access, error); +} + +static int kvm_write_guest_virt_system(gva_t addr, void *val, + unsigned int bytes, + struct kvm_vcpu *vcpu, u32 *error) +{ + return kvm_write_guest_virt_helper(addr, val, bytes, vcpu, 0, error); +} static int emulator_read_emulated(unsigned long addr, void *val, @@ -3447,12 +3476,95 @@ static int emulator_get_cpl(struct kvm_vcpu *vcpu) return kvm_x86_ops-get_cpl(vcpu); } +static void emulator_get_gdt(struct desc_ptr *dt, struct kvm_vcpu *vcpu) +{ +
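The point of these new callbacks is that emulate.c never touches `kvm_x86_ops` directly; everything goes through the `x86_emulate_ops` table that x86.c fills in. A stripped-down sketch of the pattern (simplified stand-in types, not the kernel's actual signatures):

```c
#include <assert.h>

/* Minimal model of the ops-table indirection: the emulator core only
 * sees function pointers, so the backend can be swapped for tests. */
struct emu_ops {
    unsigned short (*get_segment_selector)(int seg);
    int (*cpl)(void);
};

static unsigned short fake_selector(int seg) { return (unsigned short)(0x10 + seg * 8); }
static int fake_cpl(void) { return 3; }

static const struct emu_ops test_ops = {
    .get_segment_selector = fake_selector,
    .cpl                  = fake_cpl,
};

/* Example consumer: privileged instructions are legal only at CPL 0,
 * and the core learns the CPL through the table, never directly. */
static int priv_insn_faults(const struct emu_ops *ops)
{
    return ops->cpl() != 0;
}
```

Here the fake backend reports CPL 3, so a privileged instruction would fault.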
[PATCH v3 27/30] KVM: x86 emulator: remove saved_eip
c-eip is never written back in case of emulation failure, so no need to set it to old value. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c |9 + 1 files changed, 1 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 1bedbb6..541f3c9 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2420,7 +2420,6 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) { u64 msr_data; - unsigned long saved_eip = 0; struct decode_cache *c = ctxt-decode; int rc = X86EMUL_CONTINUE; @@ -2432,7 +2431,6 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) */ memcpy(c-regs, ctxt-vcpu-arch.regs, sizeof c-regs); - saved_eip = c-eip; if (ctxt-mode == X86EMUL_MODE_PROT64 (c-d No64)) { kvm_queue_exception(ctxt-vcpu, UD_VECTOR); @@ -2923,11 +2921,7 @@ writeback: kvm_rip_write(ctxt-vcpu, c-eip); done: - if (rc == X86EMUL_UNHANDLEABLE) { - c-eip = saved_eip; - return -1; - } - return 0; + return (rc == X86EMUL_UNHANDLEABLE) ? -1 : 0; twobyte_insn: switch (c-b) { @@ -3204,6 +3198,5 @@ twobyte_insn: cannot_emulate: DPRINTF(Cannot emulate %02x\n, c-b); - c-eip = saved_eip; return -1; } -- 1.6.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 19/30] KVM: x86 emulator: Emulate task switch in emulator.c
Implement emulation of 16/32 bit task switch in emulator.c Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_emulate.h |5 + arch/x86/kvm/emulate.c | 563 2 files changed, 568 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index f901467..bd46929 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -11,6 +11,8 @@ #ifndef _ASM_X86_KVM_X86_EMULATE_H #define _ASM_X86_KVM_X86_EMULATE_H +#include asm/desc_defs.h + struct x86_emulate_ctxt; /* @@ -210,5 +212,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops); int x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops); +int emulator_task_switch(struct x86_emulate_ctxt *ctxt, +struct x86_emulate_ops *ops, +u16 tss_selector, int reason); #endif /* _ASM_X86_KVM_X86_EMULATE_H */ diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index d696cbd..db4776c 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -33,6 +33,7 @@ #include asm/kvm_emulate.h #include x86.h +#include tss.h /* * Opcode effective-address decode tables. @@ -1221,6 +1222,198 @@ done: return (rc == X86EMUL_UNHANDLEABLE) ? -1 : 0; } +static u32 desc_limit_scaled(struct desc_struct *desc) +{ + u32 limit = get_desc_limit(desc); + + return desc-g ? (limit 12) | 0xfff : limit; +} + +static void get_descriptor_table_ptr(struct x86_emulate_ctxt *ctxt, +struct x86_emulate_ops *ops, +u16 selector, struct desc_ptr *dt) +{ + if (selector 1 2) { + struct desc_struct desc; + memset (dt, 0, sizeof *dt); + if (!ops-get_cached_descriptor(desc, VCPU_SREG_LDTR, ctxt-vcpu)) + return; + + dt-size = desc_limit_scaled(desc); /* what if limit 65535? 
*/ + dt-address = get_desc_base(desc); + } else + ops-get_gdt(dt, ctxt-vcpu); +} + +/* allowed just for 8 bytes segments */ +static int read_segment_descriptor(struct x86_emulate_ctxt *ctxt, + struct x86_emulate_ops *ops, + u16 selector, struct desc_struct *desc) +{ + struct desc_ptr dt; + u16 index = selector 3; + int ret; + u32 err; + ulong addr; + + get_descriptor_table_ptr(ctxt, ops, selector, dt); + + if (dt.size index * 8 + 7) { + kvm_inject_gp(ctxt-vcpu, selector 0xfffc); + return X86EMUL_PROPAGATE_FAULT; + } + addr = dt.address + index * 8; + ret = ops-read_std(addr, desc, sizeof *desc, ctxt-vcpu, err); + if (ret == X86EMUL_PROPAGATE_FAULT) + kvm_inject_page_fault(ctxt-vcpu, addr, err); + + return ret; +} + +/* allowed just for 8 bytes segments */ +static int write_segment_descriptor(struct x86_emulate_ctxt *ctxt, + struct x86_emulate_ops *ops, + u16 selector, struct desc_struct *desc) +{ + struct desc_ptr dt; + u16 index = selector 3; + u32 err; + ulong addr; + int ret; + + get_descriptor_table_ptr(ctxt, ops, selector, dt); + + if (dt.size index * 8 + 7) { + kvm_inject_gp(ctxt-vcpu, selector 0xfffc); + return X86EMUL_PROPAGATE_FAULT; + } + + addr = dt.address + index * 8; + ret = ops-write_std(addr, desc, sizeof *desc, ctxt-vcpu, err); + if (ret == X86EMUL_PROPAGATE_FAULT) + kvm_inject_page_fault(ctxt-vcpu, addr, err); + + return ret; +} + +static int load_segment_descriptor(struct x86_emulate_ctxt *ctxt, + struct x86_emulate_ops *ops, + u16 selector, int seg) +{ + struct desc_struct seg_desc; + u8 dpl, rpl, cpl; + unsigned err_vec = GP_VECTOR; + u32 err_code = 0; + bool null_selector = !(selector ~0x3); /* -0003 are null */ + int ret; + + memset(seg_desc, 0, sizeof seg_desc); + + if ((seg = VCPU_SREG_GS ctxt-mode == X86EMUL_MODE_VM86) + || ctxt-mode == X86EMUL_MODE_REAL) { + /* set real mode segment descriptor */ + set_desc_base(seg_desc, selector 4); + set_desc_limit(seg_desc, 0x); + seg_desc.type = 3; + seg_desc.p = 1; + seg_desc.s = 1; + goto load; + } 
+ + /* NULL selector is not valid for TR, CS and SS */ + if ((seg == VCPU_SREG_CS || seg == VCPU_SREG_SS || seg == VCPU_SREG_TR) +null_selector) + goto exception; + + /* TR should be in GDT only
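The descriptor lookups above all rest on the same selector arithmetic: bits 15:3 index an 8-byte table entry, bit 2 chooses LDT over GDT, and values 0x0–0x3 are null selectors. A hedged model of that arithmetic and the table limit check (helper names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

static int selector_index(uint16_t sel)    { return sel >> 3; }
static int selector_uses_ldt(uint16_t sel) { return (sel >> 2) & 1; }
static int selector_is_null(uint16_t sel)  { return !(sel & ~0x3); }

/* Address of the 8-byte descriptor, or 0 where the real code raises
 * #GP: the entry's last byte must lie within the table, i.e.
 * dt_size >= index * 8 + 7. */
static uint64_t descriptor_addr(uint64_t dt_base, uint32_t dt_size, uint16_t sel)
{
    uint32_t index = sel >> 3;

    if (dt_size < index * 8 + 7)
        return 0;
    return dt_base + index * 8;
}
```

For example, selector 0x23 names GDT entry 4 at RPL 3, and a 15-byte table cannot hold entry 2.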
[PATCH v3 02/30] KVM: x86 emulator: fix RCX access during rep emulation
During rep emulation the access width of RCX depends on the current address mode. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 0b70a36..4dce805 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1852,7 +1852,7 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) if (c->rep_prefix && (c->d & String)) { /* All REP prefixes have the same first termination condition */ - if (c->regs[VCPU_REGS_RCX] == 0) { + if (address_mask(c, c->regs[VCPU_REGS_RCX]) == 0) { kvm_rip_write(ctxt->vcpu, c->eip); goto done; } @@ -1876,7 +1876,7 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) goto done; } } - c->regs[VCPU_REGS_RCX]--; + register_address_increment(c, &c->regs[VCPU_REGS_RCX], -1); c->eip = kvm_rip_read(ctxt->vcpu); } -- 1.6.5
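The fix matters because in 16- or 32-bit address modes only the low `ad_bytes * 8` bits of RCX participate; the upper bits may hold stale data. A model of `address_mask()` under that assumption (standalone sketch, not the kernel function):

```c
#include <assert.h>
#include <stdint.h>

/* Mask a count/index register down to the current address size.
 * ad_bytes is 2, 4 or 8 for 16-, 32- and 64-bit address modes. */
static uint64_t address_mask(int ad_bytes, uint64_t reg)
{
    if (ad_bytes == 8)
        return reg;  /* avoid shifting by 64, which is undefined in C */
    return reg & ((1ULL << (ad_bytes * 8)) - 1);
}
```

So with CX == 0 but garbage in the high bits of RCX, a 16-bit-mode REP correctly terminates immediately instead of looping on the stale upper bits.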
[PATCH v3 16/30] KVM: x86 emulator: If LOCK prefix is used dest arg should be memory.
If the LOCK prefix is used, the destination operand must be memory; otherwise the instruction should generate #UD. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index b89a8f2..46a7ee3 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1842,7 +1842,7 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) } /* LOCK prefix is allowed only with some instructions */ - if (c->lock_prefix && !(c->d & Lock)) { + if (c->lock_prefix && (!(c->d & Lock) || c->dst.type != OP_MEM)) { kvm_queue_exception(ctxt->vcpu, UD_VECTOR); goto done; } -- 1.6.5
[PATCH v3 23/30] KVM: x86 emulator: add decoding of X,Y parameters from Intel SDM
Add decoding of X,Y parameters from Intel SDM which are used by string instruction to specify source and destination. Use this new decoding to implement movs, cmps, stos, lods in a generic way. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 125 +--- 1 files changed, 44 insertions(+), 81 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 55b8a8b..6ebd642 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -51,6 +51,7 @@ #define DstReg (21) /* Register operand. */ #define DstMem (31) /* Memory operand. */ #define DstAcc (41) /* Destination Accumulator */ +#define DstDI (51) /* Destination is in ES:(E)DI */ #define DstMask (71) /* Source operand type. */ #define SrcNone (04) /* No source operand. */ @@ -64,6 +65,7 @@ #define SrcOne (74) /* Implied '1' */ #define SrcImmUByte (84) /* 8-bit unsigned immediate operand. */ #define SrcImmU (94) /* Immediate operand, unsigned */ +#define SrcSI (0xa4) /* Source is in the DS:RSI */ #define SrcMask (0xf4) /* Generic ModRM decode. 
*/ #define ModRM (18) @@ -177,12 +179,12 @@ static u32 opcode_table[256] = { /* 0xA0 - 0xA7 */ ByteOp | DstReg | SrcMem | Mov | MemAbs, DstReg | SrcMem | Mov | MemAbs, ByteOp | DstMem | SrcReg | Mov | MemAbs, DstMem | SrcReg | Mov | MemAbs, - ByteOp | ImplicitOps | Mov | String, ImplicitOps | Mov | String, - ByteOp | ImplicitOps | String, ImplicitOps | String, + ByteOp | SrcSI | DstDI | Mov | String, SrcSI | DstDI | Mov | String, + ByteOp | SrcSI | DstDI | String, SrcSI | DstDI | String, /* 0xA8 - 0xAF */ - 0, 0, ByteOp | ImplicitOps | Mov | String, ImplicitOps | Mov | String, - ByteOp | ImplicitOps | Mov | String, ImplicitOps | Mov | String, - ByteOp | ImplicitOps | String, ImplicitOps | String, + 0, 0, ByteOp | DstDI | Mov | String, DstDI | Mov | String, + ByteOp | SrcSI | DstAcc | Mov | String, SrcSI | DstAcc | Mov | String, + ByteOp | DstDI | String, DstDI | String, /* 0xB0 - 0xB7 */ ByteOp | DstReg | SrcImm | Mov, ByteOp | DstReg | SrcImm | Mov, ByteOp | DstReg | SrcImm | Mov, ByteOp | DstReg | SrcImm | Mov, @@ -1145,6 +1147,14 @@ done_prefixes: c-src.bytes = 1; c-src.val = 1; break; + case SrcSI: + c-src.type = OP_MEM; + c-src.bytes = (c-d ByteOp) ? 1 : c-op_bytes; + c-src.ptr = (unsigned long *) + register_address(c, seg_override_base(ctxt, c), +c-regs[VCPU_REGS_RSI]); + c-src.val = 0; + break; } /* @@ -1230,6 +1240,14 @@ done_prefixes: } c-dst.orig_val = c-dst.val; break; + case DstDI: + c-dst.type = OP_MEM; + c-dst.bytes = (c-d ByteOp) ? 1 : c-op_bytes; + c-dst.ptr = (unsigned long *) + register_address(c, es_base(ctxt), +c-regs[VCPU_REGS_RDI]); + c-dst.val = 0; + break; } done: @@ -2388,6 +2406,16 @@ int emulator_task_switch(struct x86_emulate_ctxt *ctxt, return rc; } +static void string_addr_inc(struct x86_emulate_ctxt *ctxt, unsigned long base, + int reg, unsigned long **ptr) +{ + struct decode_cache *c = ctxt-decode; + int df = (ctxt-eflags EFLG_DF) ? 
-1 : 1; + + register_address_increment(c, c-regs[reg], df * c-src.bytes); + *ptr = (unsigned long *)register_address(c, base, c-regs[reg]); +} + int x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) { @@ -2750,89 +2778,16 @@ special_insn: c-dst.val = (unsigned long)c-regs[VCPU_REGS_RAX]; break; case 0xa4 ... 0xa5: /* movs */ - c-dst.type = OP_MEM; - c-dst.bytes = (c-d ByteOp) ? 1 : c-op_bytes; - c-dst.ptr = (unsigned long *)register_address(c, - es_base(ctxt), - c-regs[VCPU_REGS_RDI]); - rc = ops-read_emulated(register_address(c, - seg_override_base(ctxt, c), - c-regs[VCPU_REGS_RSI]), - c-dst.val, - c-dst.bytes, ctxt-vcpu); - if (rc != X86EMUL_CONTINUE) - goto done; - register_address_increment(c, c-regs[VCPU_REGS_RSI], - (ctxt-eflags EFLG_DF) ? -c-dst.bytes -
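The new `string_addr_inc()` helper encodes the one rule every string instruction shares: RSI/RDI advance by the operand size, forward when EFLAGS.DF is clear and backward when it is set. A self-contained model of just that step computation (illustrative, not the kernel helper):

```c
#include <assert.h>
#include <stdint.h>

#define EFLG_DF (1u << 10)  /* direction flag, bit 10 of EFLAGS */

/* Signed per-iteration displacement applied to RSI/RDI. */
static int64_t string_step(uint32_t eflags, unsigned op_bytes)
{
    int df = (eflags & EFLG_DF) ? -1 : 1;

    return (int64_t)df * (int64_t)op_bytes;
}
```

A dword MOVS with DF clear moves the pointers by +4 per iteration; a word MOVS after STD moves them by -2.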
[PATCH v3 07/30] KVM: Provide x86_emulate_ctxt callback to get current cpl
Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_emulate.h |1 + arch/x86/kvm/emulate.c | 15 --- arch/x86/kvm/x86.c |6 ++ 3 files changed, 15 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index 0c5caa4..b048fd2 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -110,6 +110,7 @@ struct x86_emulate_ops { struct kvm_vcpu *vcpu); ulong (*get_cr)(int cr, struct kvm_vcpu *vcpu); void (*set_cr)(int cr, ulong val, struct kvm_vcpu *vcpu); + int (*cpl)(struct kvm_vcpu *vcpu); }; /* Type, address-of, and value of an instruction's operand. */ diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 5e2fa61..8bd0557 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1257,7 +1257,7 @@ static int emulate_popf(struct x86_emulate_ctxt *ctxt, int rc; unsigned long val, change_mask; int iopl = (ctxt-eflags X86_EFLAGS_IOPL) IOPL_SHIFT; - int cpl = kvm_x86_ops-get_cpl(ctxt-vcpu); + int cpl = ops-cpl(ctxt-vcpu); rc = emulate_pop(ctxt, ops, val, len); if (rc != X86EMUL_CONTINUE) @@ -1758,7 +1758,8 @@ emulate_sysexit(struct x86_emulate_ctxt *ctxt) return X86EMUL_CONTINUE; } -static bool emulator_bad_iopl(struct x86_emulate_ctxt *ctxt) +static bool emulator_bad_iopl(struct x86_emulate_ctxt *ctxt, + struct x86_emulate_ops *ops) { int iopl; if (ctxt-mode == X86EMUL_MODE_REAL) @@ -1766,7 +1767,7 @@ static bool emulator_bad_iopl(struct x86_emulate_ctxt *ctxt) if (ctxt-mode == X86EMUL_MODE_VM86) return true; iopl = (ctxt-eflags X86_EFLAGS_IOPL) IOPL_SHIFT; - return kvm_x86_ops-get_cpl(ctxt-vcpu) iopl; + return ops-cpl(ctxt-vcpu) iopl; } static bool emulator_io_port_access_allowed(struct x86_emulate_ctxt *ctxt, @@ -1803,7 +1804,7 @@ static bool emulator_io_permited(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops, u16 port, u16 len) { - if (emulator_bad_iopl(ctxt)) + if (emulator_bad_iopl(ctxt, ops)) if 
(!emulator_io_port_access_allowed(ctxt, ops, port, len)) return false; return true; @@ -1842,7 +1843,7 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) } /* Privileged instruction can be executed only in CPL=0 */ - if ((c-d Priv) kvm_x86_ops-get_cpl(ctxt-vcpu)) { + if ((c-d Priv) ops-cpl(ctxt-vcpu)) { kvm_inject_gp(ctxt-vcpu, 0); goto done; } @@ -2378,7 +2379,7 @@ special_insn: c-dst.type = OP_NONE; /* Disable writeback. */ break; case 0xfa: /* cli */ - if (emulator_bad_iopl(ctxt)) + if (emulator_bad_iopl(ctxt, ops)) kvm_inject_gp(ctxt-vcpu, 0); else { ctxt-eflags = ~X86_EFLAGS_IF; @@ -2386,7 +2387,7 @@ special_insn: } break; case 0xfb: /* sti */ - if (emulator_bad_iopl(ctxt)) + if (emulator_bad_iopl(ctxt, ops)) kvm_inject_gp(ctxt-vcpu, 0); else { toggle_interruptibility(ctxt, KVM_X86_SHADOW_INT_STI); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b139334..3b6848e 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3442,6 +3442,11 @@ static void emulator_set_cr(int cr, unsigned long val, struct kvm_vcpu *vcpu) } } +static int emulator_get_cpl(struct kvm_vcpu *vcpu) +{ + return kvm_x86_ops-get_cpl(vcpu); +} + static struct x86_emulate_ops emulate_ops = { .read_std= kvm_read_guest_virt_system, .fetch = kvm_fetch_guest_virt, @@ -3450,6 +3455,7 @@ static struct x86_emulate_ops emulate_ops = { .cmpxchg_emulated= emulator_cmpxchg_emulated, .get_cr = emulator_get_cr, .set_cr = emulator_set_cr, + .cpl = emulator_get_cpl, }; static void cache_all_regs(struct kvm_vcpu *vcpu) -- 1.6.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
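The `emulator_bad_iopl()` logic this patch reroutes through the new `cpl` callback is easy to state: real mode always passes, vm86 mode always trips the check (falling back to the TSS permission bitmap), and protected mode fails when CPL exceeds IOPL. A standalone model under those assumptions (the mode constants here are illustrative, not the kernel's `X86EMUL_MODE_*` values):

```c
#include <assert.h>
#include <stdbool.h>

enum mode { MODE_REAL, MODE_VM86, MODE_PROT };

/* True when an IOPL-sensitive operation (IN/OUT, CLI/STI) must take
 * the slow path: fault, or consult the I/O permission bitmap. */
static bool bad_iopl(enum mode m, int cpl, int iopl)
{
    if (m == MODE_REAL)
        return false;
    if (m == MODE_VM86)
        return true;
    return cpl > iopl;
}
```

So userspace (CPL 3) with IOPL 0 fails the check, while kernel code (CPL 0) always passes it.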
[PATCH v3 13/30] KVM: x86 emulator: fix mov dr to inject #UD when needed.
If CR4.DE=1, access to registers DR4/DR5 causes #UD. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 18 ++++++++++++------ 1 files changed, 12 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 836e97b..5afddcf 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2531,9 +2531,12 @@ twobyte_insn: c->dst.type = OP_NONE; /* no writeback */ break; case 0x21: /* mov from dr to reg */ - if (emulator_get_dr(ctxt, c->modrm_reg, &c->regs[c->modrm_rm])) - goto cannot_emulate; - rc = X86EMUL_CONTINUE; + if ((ops->get_cr(4, ctxt->vcpu) & X86_CR4_DE) && + (c->modrm_reg == 4 || c->modrm_reg == 5)) { + kvm_queue_exception(ctxt->vcpu, UD_VECTOR); + goto done; + } + emulator_get_dr(ctxt, c->modrm_reg, &c->regs[c->modrm_rm]); c->dst.type = OP_NONE; /* no writeback */ break; case 0x22: /* mov reg, cr */ @@ -2541,9 +2544,12 @@ twobyte_insn: c->dst.type = OP_NONE; break; case 0x23: /* mov from reg to dr */ - if (emulator_set_dr(ctxt, c->modrm_reg, &c->regs[c->modrm_rm])) - goto cannot_emulate; - rc = X86EMUL_CONTINUE; + if ((ops->get_cr(4, ctxt->vcpu) & X86_CR4_DE) && + (c->modrm_reg == 4 || c->modrm_reg == 5)) { + kvm_queue_exception(ctxt->vcpu, UD_VECTOR); + goto done; + } + emulator_set_dr(ctxt, c->modrm_reg, &c->regs[c->modrm_rm]); c->dst.type = OP_NONE; /* no writeback */ break; case 0x30: -- 1.6.5
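The predicate the patch adds is small enough to model directly: with CR4.DE set, `mov` to or from DR4/DR5 raises #UD instead of aliasing to DR6/DR7. A sketch of just that check (standalone, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X86_CR4_DE (1u << 3)  /* debugging-extensions bit of CR4 */

/* True when a DR access must raise #UD rather than proceed. */
static bool dr_access_ud(uint32_t cr4, int dr)
{
    return (cr4 & X86_CR4_DE) && (dr == 4 || dr == 5);
}
```

With DE clear, DR4/DR5 accesses instead alias to DR6/DR7 for legacy software, which is why only the DE=1 case faults.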
[PATCH v3 08/30] KVM: Provide current eip as part of emulator context.
Eliminate the need to call back into KVM to get it from the emulator.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/include/asm/kvm_emulate.h |    3 ++-
 arch/x86/kvm/emulate.c             |   12 ++++++------
 arch/x86/kvm/x86.c                 |    1 +
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index b048fd2..0765725 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -141,7 +141,7 @@ struct decode_cache {
	u8 seg_override;
	unsigned int d;
	unsigned long regs[NR_VCPU_REGS];
-	unsigned long eip, eip_orig;
+	unsigned long eip;
	/* modrm */
	u8 modrm;
	u8 modrm_mod;
@@ -160,6 +160,7 @@ struct x86_emulate_ctxt {
	struct kvm_vcpu *vcpu;

	unsigned long eflags;
+	unsigned long eip; /* eip before instruction emulation */
	/* Emulated execution mode, represented by an X86EMUL_MODE value. */
	int mode;
	u32 cs_base;
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 8bd0557..2c27aa4 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -667,7 +667,7 @@ static int do_insn_fetch(struct x86_emulate_ctxt *ctxt,
	int rc;

	/* x86 instructions are limited to 15 bytes. */
-	if (eip + size - ctxt->decode.eip_orig > 15)
+	if (eip + size - ctxt->eip > 15)
		return X86EMUL_UNHANDLEABLE;
	eip += ctxt->cs_base;
	while (size--) {
@@ -927,7 +927,7 @@ x86_decode_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops)
	/* Shadow copy of register state. Committed on successful emulation. */
	memset(c, 0, sizeof(struct decode_cache));
-	c->eip = c->eip_orig = kvm_rip_read(ctxt->vcpu);
+	c->eip = ctxt->eip;
	ctxt->cs_base = seg_base(ctxt, VCPU_SREG_CS);
	memcpy(c->regs, ctxt->vcpu->arch.regs, sizeof c->regs);
@@ -1878,7 +1878,7 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops)
			}
		}
		register_address_increment(c, &c->regs[VCPU_REGS_RCX], -1);
-		c->eip = kvm_rip_read(ctxt->vcpu);
+		c->eip = ctxt->eip;
	}

	if (c->src.type == OP_MEM) {
@@ -2447,7 +2447,7 @@ twobyte_insn:
			goto done;

		/* Let the processor re-execute the fixed hypercall */
-		c->eip = kvm_rip_read(ctxt->vcpu);
+		c->eip = ctxt->eip;
		/* Disable writeback. */
		c->dst.type = OP_NONE;
		break;
@@ -2551,7 +2551,7 @@ twobyte_insn:
			| ((u64)c->regs[VCPU_REGS_RDX] << 32);
		if (kvm_set_msr(ctxt->vcpu, c->regs[VCPU_REGS_RCX], msr_data)) {
			kvm_inject_gp(ctxt->vcpu, 0);
-			c->eip = kvm_rip_read(ctxt->vcpu);
+			c->eip = ctxt->eip;
		}
		rc = X86EMUL_CONTINUE;
		c->dst.type = OP_NONE;
@@ -2560,7 +2560,7 @@ twobyte_insn:
		/* rdmsr */
		if (kvm_get_msr(ctxt->vcpu, c->regs[VCPU_REGS_RCX], &msr_data)) {
			kvm_inject_gp(ctxt->vcpu, 0);
-			c->eip = kvm_rip_read(ctxt->vcpu);
+			c->eip = ctxt->eip;
		} else {
			c->regs[VCPU_REGS_RAX] = (u32)msr_data;
			c->regs[VCPU_REGS_RDX] = msr_data >> 32;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3b6848e..022d28e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3494,6 +3494,7 @@ int emulate_instruction(struct kvm_vcpu *vcpu,

	vcpu->arch.emulate_ctxt.vcpu = vcpu;
	vcpu->arch.emulate_ctxt.eflags = kvm_x86_ops->get_rflags(vcpu);
+	vcpu->arch.emulate_ctxt.eip = kvm_rip_read(vcpu);
	vcpu->arch.emulate_ctxt.mode =
		(!is_protmode(vcpu)) ? X86EMUL_MODE_REAL :
		(vcpu->arch.emulate_ctxt.eflags & X86_EFLAGS_VM)
--
1.6.5
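One consumer of the cached starting eip is the fetch-limit check in `do_insn_fetch()`: since x86 instructions can be at most 15 bytes, decode must stop once the bytes consumed from the instruction start would exceed that. A standalone sketch of that arithmetic (function name is illustrative, not the kernel's):

```c
#include <assert.h>

/* Sketch of the do_insn_fetch() length guard: with the starting eip
 * cached in the emulator context, a fetch of `size` bytes at `eip`
 * is rejected once the total decoded length would exceed the
 * architectural 15-byte maximum instruction length. */
static int fetch_would_overflow(unsigned long eip, int size,
				unsigned long start_eip)
{
	return eip + size - start_eip > 15;
}
```

Caching the value in `ctxt->eip` means the emulator core never has to call back into KVM (`kvm_rip_read()`) during decode, which is one step toward making emulate.c KVM-independent.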
[PATCH v3 20/30] KVM: x86 emulator: Use load_segment_descriptor() instead of kvm_load_segment_descriptor()
Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/emulate.c |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index db4776c..702bfff 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1508,7 +1508,7 @@ static int emulate_pop_sreg(struct x86_emulate_ctxt *ctxt,
	if (rc != X86EMUL_CONTINUE)
		return rc;

-	rc = kvm_load_segment_descriptor(ctxt->vcpu, (u16)selector, seg);
+	rc = load_segment_descriptor(ctxt, ops, (u16)selector, seg);
	return rc;
}
@@ -1683,7 +1683,7 @@ static int emulate_ret_far(struct x86_emulate_ctxt *ctxt,
	rc = emulate_pop(ctxt, ops, &cs, c->op_bytes);
	if (rc != X86EMUL_CONTINUE)
		return rc;
-	rc = kvm_load_segment_descriptor(ctxt->vcpu, (u16)cs, VCPU_SREG_CS);
+	rc = load_segment_descriptor(ctxt, ops, (u16)cs, VCPU_SREG_CS);
	return rc;
}
@@ -2717,7 +2717,7 @@ special_insn:
		if (c->modrm_reg == VCPU_SREG_SS)
			toggle_interruptibility(ctxt, KVM_X86_SHADOW_INT_MOV_SS);
-		rc = kvm_load_segment_descriptor(ctxt->vcpu, sel, c->modrm_reg);
+		rc = load_segment_descriptor(ctxt, ops, sel, c->modrm_reg);

		c->dst.type = OP_NONE;  /* Disable writeback. */
		break;
@@ -2892,8 +2892,8 @@ special_insn:
		goto jmp;
	case 0xea: /* jmp far */
	jump_far:
-		if (kvm_load_segment_descriptor(ctxt->vcpu, c->src2.val,
-						VCPU_SREG_CS))
+		if (load_segment_descriptor(ctxt, ops, c->src2.val,
+					    VCPU_SREG_CS))
			goto done;

		c->eip = c->src.val;
--
1.6.5
[PATCH v3 15/30] KVM: x86 emulator: do not call writeback if msr access fails.
Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/emulate.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 1393bf0..b89a8f2 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2563,7 +2563,7 @@ twobyte_insn:
			| ((u64)c->regs[VCPU_REGS_RDX] << 32);
		if (kvm_set_msr(ctxt->vcpu, c->regs[VCPU_REGS_RCX], msr_data)) {
			kvm_inject_gp(ctxt->vcpu, 0);
-			c->eip = ctxt->eip;
+			goto done;
		}
		rc = X86EMUL_CONTINUE;
		c->dst.type = OP_NONE;
@@ -2572,7 +2572,7 @@ twobyte_insn:
		/* rdmsr */
		if (kvm_get_msr(ctxt->vcpu, c->regs[VCPU_REGS_RCX], &msr_data)) {
			kvm_inject_gp(ctxt->vcpu, 0);
-			c->eip = ctxt->eip;
+			goto done;
		} else {
			c->regs[VCPU_REGS_RAX] = (u32)msr_data;
			c->regs[VCPU_REGS_RDX] = msr_data >> 32;
--
1.6.5
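The `goto done` replaces the old eip fixup so that when the MSR access injects #GP, control skips the writeback stage entirely — the shadowed register state of a faulting instruction must never be committed. A toy model of that rule (names and structure are illustrative, not the kernel's):

```c
#include <assert.h>

/* Toy model of the writeback rule the patch enforces: emulating
 * rdmsr either commits the MSR value to RAX (success) or leaves RAX
 * untouched because the fault path bypasses writeback ("goto done"). */
static int rax_after_rdmsr(int msr_access_ok, int msr_val, int old_rax)
{
	if (!msr_access_ok)
		return old_rax;	/* #GP injected: writeback skipped */
	return msr_val;		/* writeback commits shadow copy */
}
```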
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On Mon, Mar 15, 2010 at 02:03:11PM +0100, Joerg Roedel wrote:
> On Mon, Mar 15, 2010 at 05:53:13AM -0700, Muli Ben-Yehuda wrote:
> > On Mon, Mar 15, 2010 at 02:25:41PM +0200, Avi Kivity wrote:
> > > On 03/10/2010 11:30 PM, Luiz Capitulino wrote:
> > > > Hi there,
> > > > Our wiki page for the Summer of Code 2010 is doing quite well:
> > > > http://wiki.qemu.org/Google_Summer_of_Code_2010
> > > I will add another project - iommu emulation. Could be very
> > > useful for doing device assignment to nested guests, which could
> > > make testing a lot easier.
> > Our experiments show that nested device assignment is pretty much
> > required for I/O performance in nested scenarios.
> Really? I did a small test with virtio-blk in a nested guest (disk
> read with dd, so not a real benchmark) and got a reasonable
> read-performance of around 25MB/s from the disk in the l2-guest.

Netperf running in L1 with direct access: ~950 Mbps throughput with 25%
CPU utilization. Netperf running in L2 with virtio between L2 and L1 and
direct assignment between L1 and L0: roughly the same throughput, but
over 90% CPU utilization! Now extrapolate to 10GbE.

Cheers, Muli
Re: [PATCH v3 00/30] emulator cleanup
On 03/15/2010 04:38 PM, Gleb Natapov wrote:
> This is the first series of patches that tries to clean up the
> emulator code. This is a mix of bug fixes and moving code that does
> emulation from x86.c to emulator.c while making it KVM independent.
> The status of the patches: works for me. The realtime.flat test now
> also passes where it failed before.

Reviewed-by: Avi Kivity a...@redhat.com

-- 
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On 03/15/2010 03:23 PM, Anthony Liguori wrote:
> On 03/15/2010 08:11 AM, Avi Kivity wrote:
> > On 03/15/2010 03:03 PM, Joerg Roedel wrote:
> > > > > I will add another project - iommu emulation. Could be very
> > > > > useful for doing device assignment to nested guests, which
> > > > > could make testing a lot easier.
> > > > Our experiments show that nested device assignment is pretty
> > > > much required for I/O performance in nested scenarios.
> > > Really? I did a small test with virtio-blk in a nested guest
> > > (disk read with dd, so not a real benchmark) and got a reasonable
> > > read-performance of around 25MB/s from the disk in the l2-guest.
> > Your guest wasn't doing a zillion VMREADs and VMWRITEs every exit.
> > I plan to reduce VMREAD/VMWRITE overhead for kvm, but not much we
> > can do for other guests.
> VMREAD/VMWRITEs are generally optimized by hypervisors as they tend
> to be costly. KVM is a bit unusual in terms of how many times the
> instructions are executed per exit.

Do you know offhand of any unnecessary read/writes? There's
update_cr8_intercept(), but on normal exits, I don't see what else we
can remove.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH v3 16/30] KVM: x86 emulator: If LOCK prefix is used dest arg should be memory.
Gleb Natapov wrote:
> If LOCK prefix is used dest arg should be memory, otherwise
> instruction should generate #UD.

Well, there is one exception: there is an AMD-specific "lock mov cr0"
== "mov cr8" equivalence, where no memory is involved (and we intercept
this). I am not sure if anyone actually uses this code sequence, but it
is definitely legal.

Regards,
Andre.

> Signed-off-by: Gleb Natapov g...@redhat.com
> ---
>  arch/x86/kvm/emulate.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index b89a8f2..46a7ee3 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -1842,7 +1842,7 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops)
>  	}
>
>  	/* LOCK prefix is allowed only with some instructions */
> -	if (c->lock_prefix && !(c->d & Lock)) {
> +	if (c->lock_prefix && (!(c->d & Lock) || c->dst.type != OP_MEM)) {
>  		kvm_queue_exception(ctxt->vcpu, UD_VECTOR);
>  		goto done;
>  	}

-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712
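The decode rule in the patch above can be stated on its own: a LOCK prefix is legal only on a lockable instruction whose destination operand is memory; anything else is #UD. A standalone sketch under those assumptions — the `Lock` flag and enum are illustrative stand-ins for the kernel's decode table, and the AMD "lock mov cr0" alias Andre mentions is deliberately not modeled:

```c
#include <assert.h>
#include <stdbool.h>

#define Lock (1 << 8)	/* hypothetical decode flag: instruction is lockable */
enum op_type { OP_NONE, OP_REG, OP_MEM };

/* Sketch of the #UD rule: LOCK requires both a lockable opcode
 * (decode flag set) and a memory destination. */
static bool lock_prefix_causes_ud(bool lock_prefix, unsigned int d,
				  enum op_type dst)
{
	return lock_prefix && (!(d & Lock) || dst != OP_MEM);
}
```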