Re: [PATCH 0/18][RFC] Nested Paging support for Nested SVM (aka NPT-Virtualization)
On Fri, Mar 12, 2010 at 09:36:41AM +0200, Avi Kivity wrote:

On 03/11/2010 10:58 PM, Marcelo Tosatti wrote: Can't you translate l2_gpa -> l1_gpa walking the current l1 nested pagetable, and pass that to the kvm tdp fault path (with the correct context setup)?

If I understand your suggestion correctly, I think that's exactly what's done in the patches. Some words about the design: for nested-nested we need to shadow the l1 nested pagetable on the host. This is done using the vcpu->arch.mmu context, which holds the l1 paging modes while the l2 is running. On an npt-fault from the l2 we just instrument the shadow-pagetable code. This is the common case, because it happens all the time while the l2 is running.

OK, makes sense now. I was missing the fact that the l1 nested pagetable needs to be shadowed, and that l1 translations to it must be write-protected.

Shadow converts (gva -> gpa -> hpa) to (gva -> hpa), or (ngpa -> gpa -> hpa) to (ngpa -> hpa), equally well. In the second case npt still does (ngva -> ngpa).

You should disable out-of-sync shadow so that l1 guest writes to l1 nested pagetables always trap.

Why? The guest is under obligation to flush the tlb if it writes to a page table, and we will resync on that tlb flush.

The guest's hypervisor will not flush the tlb with invlpg for updates of its NPT pagetables. It'll create a new ASID, and KVM will not trap that.

Unsync makes just as much sense for nnpt. Think of khugepaged in the guest eating a page table and spitting out a PDE.

And in the trap case, you'd have to invalidate l2 shadow pagetable entries that used the (now obsolete) l1 nested pagetable entry. Does that happen automatically?

What do you mean by 'l2 shadow ptable entries'? There are the guest's page tables (ordinary direct mapped, unless the guest's guest is also running an npt-enabled hypervisor), and the host page tables. When the guest writes to each page table, we invalidate the shadows.

With 'l2 shadow ptable entries' I mean the shadow pagetables that translate GPA-L2 -> HPA.
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
Selectively control Unmapped Page Cache (nospam version)

From: Balbir Singh bal...@linux.vnet.ibm.com

This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario:

- In a virtualized environment with cache!=none, we see double caching (once in the host and once in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages.
- The option is controlled via a boot option, so the administrator can selectively turn it on, on a need-to-use basis.

A lot of the code is borrowed from the zone_reclaim_mode logic for __zone_reclaim(). One might argue that with ballooning and KSM this feature is not very useful, but even with ballooning we need extra logic to balloon multiple VMs, and it is hard to figure out the correct amount of memory to balloon. With these patches applied, each guest has a sufficient amount of free memory available that can be easily seen and reclaimed by the balloon driver. The additional memory in the guest can be reused for additional applications or used to start additional guests/balance memory in the host. KSM currently does not de-duplicate host and guest page cache.

The goal of this patch is to help automatically balance unmapped page cache when instructed to do so. There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO and the number of pages to reclaim when the unmapped_page_control argument is supplied. These numbers were chosen to avoid reaping page cache too aggressively or too frequently, while still providing control.
The sysctl for min_unmapped_ratio provides further control from within the guest on the amount of unmapped pages to reclaim. The patch is applied against mmotm feb-11-2010.

Tests: Ran 4 VMs in parallel, running kernbench using kvm autotest. Each guest had 2 CPUs with 512M of memory.

Guest usage without boot parameter (memory in KB):

MemFree  Cached  Time
 19900   292912   137
 17540   296196   139
 17900   296124   141
 19356   296660   141

Host usage (memory in KB):

    RSS    Cache  mapped    swap
2788664   781884    3780  359536

Guest usage with boot parameter (memory in KB):

MemFree  Cached  Time
244824    74828   144
237840    81764   143
235880    83044   138
239312    80092   148

Host usage (memory in KB):

    RSS    Cache  mapped    swap
2700184   958012  334848  398412

The key thing to observe is the free memory when the boot parameter is enabled.

TODOs:
1. Balance slab cache as well
2. Invoke the balance routines from the balloon driver

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 include/linux/mmzone.h |    2 -
 include/linux/swap.h   |    3 +
 mm/page_alloc.c        |    9 ++-
 mm/vmscan.c            |  165 +++++++++++++++++++++++++++++++----------
 4 files changed, 134 insertions(+), 45 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ad5abcf..f0b245f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -293,12 +293,12 @@ struct zone {
 	 */
 	unsigned long lowmem_reserve[MAX_NR_ZONES];
+	unsigned long min_unmapped_pages;
 #ifdef CONFIG_NUMA
 	int node;
 	/*
 	 * zone reclaim becomes active if more unmapped pages exist.
 	 */
-	unsigned long min_unmapped_pages;
 	unsigned long min_slab_pages;
 #endif
 	struct per_cpu_pageset __percpu *pageset;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index c2a4295..d0c8176 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -254,10 +254,11 @@
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
+extern bool should_balance_unmapped_pages(struct zone *zone);
+extern int sysctl_min_unmapped_ratio;
 
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
-extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 416b056..1cc5c75 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1578,6 +1578,9 @@ zonelist_scan:
 		unsigned long mark;
 		int ret;
 
+		if (should_balance_unmapped_pages(zone))
+			wakeup_kswapd(zone, order);
+
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 		if
Re: [PATCH 0/18][RFC] Nested Paging support for Nested SVM (aka NPT-Virtualization)
On 03/15/2010 08:27 AM, Marcelo Tosatti wrote:

You should disable out-of-sync shadow so that l1 guest writes to l1 nested pagetables always trap.

Why? The guest is under obligation to flush the tlb if it writes to a page table, and we will resync on that tlb flush.

The guest's hypervisor will not flush the tlb with invlpg for updates of its NPT pagetables. It'll create a new ASID, and KVM will not trap that.

We'll get a kvm_set_cr3() on the next vmrun.

And in the trap case, you'd have to invalidate l2 shadow pagetable entries that used the (now obsolete) l1 nested pagetable entry. Does that happen automatically?

What do you mean by 'l2 shadow ptable entries'? There are the guest's page tables (ordinary direct mapped, unless the guest's guest is also running an npt-enabled hypervisor), and the host page tables. When the guest writes to each page table, we invalidate the shadows.

With 'l2 shadow ptable entries' I mean the shadow pagetables that translate GPA-L2 -> HPA.

kvm_mmu_pte_write() will invalidate those sptes and will also install new translations if possible. Beautiful, isn't it?

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH 15/18] KVM: MMU: Propagate the right fault back to the guest after gva_to_gpa
On 03/03/2010 09:12 PM, Joerg Roedel wrote:

This patch implements logic to make sure that either a page-fault/page-fault-vmexit or a nested-page-fault-vmexit is propagated back to the guest.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/mmu.h         |    1 +
 arch/x86/kvm/paging_tmpl.h |    2 ++
 arch/x86/kvm/x86.c         |   15 ++-
 3 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 64f619b..b42b27e 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -47,6 +47,7 @@
 #define PFERR_USER_MASK (1U << 2)
 #define PFERR_RSVD_MASK (1U << 3)
 #define PFERR_FETCH_MASK (1U << 4)
+#define PFERR_NESTED_MASK (1U << 31)

Why is this needed? Queue an ordinary page fault; the injection code should check the page fault intercept and #VMEXIT if needed.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH v2 25/30] KVM: x86 emulator: fix in/out emulation.
On 03/14/2010 07:35 PM, Gleb Natapov wrote: On Sun, Mar 14, 2010 at 06:54:11PM +0200, Avi Kivity wrote: On 03/14/2010 06:21 PM, Gleb Natapov wrote:

in/out emulation is broken now. The breakage differs depending on where the IO device resides. If it is in userspace, the emulator reports an emulation failure, since it incorrectly interprets the kvm_emulate_pio() return value. If the IO device is in the kernel, emulation of 'in' will do nothing, since kvm_emulate_pio() stores the result directly into the vcpu registers, so the emulator will overwrite the result of emulation during commit of the shadowed register.

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 8f5e4c8..344e17b 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -210,13 +210,13 @@ static u32 opcode_table[256] = {
 	0, 0, 0, 0, 0, 0, 0, 0,
 	/* 0xE0 - 0xE7 */
 	0, 0, 0, 0,
-	ByteOp | SrcImmUByte, SrcImmUByte,
-	ByteOp | SrcImmUByte, SrcImmUByte,
+	ByteOp | SrcImmUByte | DstAcc, SrcImmUByte | DstAcc,
+	ByteOp | SrcImmUByte | DstAcc, SrcImmUByte | DstAcc,

A REX prefix shouldn't expand DstAcc to 64 bits here. Might cause problems further down in the pipeline.

Is a REX prefix allowed with these opcodes? If yes:

if (c->dst.bytes == 8)
	c->dst.bytes = 4;

inside IN/OUT emulation will fix this.

I don't know, but I guess REX is allowed and ignored.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH v2 28/30] KVM: x86 emulator: restart string instruction without going back to a guest.
On 03/14/2010 08:06 PM, Gleb Natapov wrote:

Suggest simply reentering every N executions.

This restart mechanism is, in fact, needed for ins read-ahead to work. After reading ahead from an IO port we need to avoid entering the decoder until the entire cache is consumed, otherwise the decoder will clear the cache and data will be lost. So we can't just enter the guest at arbitrary times, only when the read-ahead cache is empty. Since read-ahead is never done across a page boundary, this is a safe place to re-enter the guest.

Please make the two depend on each other directly then. We can't expect the reader of the emulator code to know that. Have the emulator ask the buffer whether it is empty.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH v2 25/30] KVM: x86 emulator: fix in/out emulation.
On Mon, Mar 15, 2010 at 09:41:51AM +0200, Avi Kivity wrote: On 03/14/2010 07:35 PM, Gleb Natapov wrote: On Sun, Mar 14, 2010 at 06:54:11PM +0200, Avi Kivity wrote: On 03/14/2010 06:21 PM, Gleb Natapov wrote:

in/out emulation is broken now. The breakage differs depending on where the IO device resides. If it is in userspace, the emulator reports an emulation failure, since it incorrectly interprets the kvm_emulate_pio() return value. If the IO device is in the kernel, emulation of 'in' will do nothing, since kvm_emulate_pio() stores the result directly into the vcpu registers, so the emulator will overwrite the result of emulation during commit of the shadowed register.

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 8f5e4c8..344e17b 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -210,13 +210,13 @@ static u32 opcode_table[256] = {
 	0, 0, 0, 0, 0, 0, 0, 0,
 	/* 0xE0 - 0xE7 */
 	0, 0, 0, 0,
-	ByteOp | SrcImmUByte, SrcImmUByte,
-	ByteOp | SrcImmUByte, SrcImmUByte,
+	ByteOp | SrcImmUByte | DstAcc, SrcImmUByte | DstAcc,
+	ByteOp | SrcImmUByte | DstAcc, SrcImmUByte | DstAcc,

A REX prefix shouldn't expand DstAcc to 64 bits here. Might cause problems further down in the pipeline.

Is a REX prefix allowed with these opcodes? If yes:

if (c->dst.bytes == 8)
	c->dst.bytes = 4;

inside IN/OUT emulation will fix this.

I don't know, but I guess REX is allowed and ignored.

Hmm, curious. I'll test.

-- 
Gleb.
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
On 03/15/2010 09:22 AM, Balbir Singh wrote:

Selectively control Unmapped Page Cache (nospam version)

From: Balbir Singh bal...@linux.vnet.ibm.com

This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario: in a virtualized environment with cache!=none, we see double caching (once in the host and once in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages.

Well, for a guest, host page cache is a lot slower than guest page cache.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
* Avi Kivity a...@redhat.com [2010-03-15 09:48:05]:

On 03/15/2010 09:22 AM, Balbir Singh wrote: This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. [...] There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages.

Well, for a guest, host page cache is a lot slower than guest page cache.

Yes, it is a virtio call away, but is the cost of paying twice in terms of memory acceptable? One of the reasons I created a boot parameter was to deal with selective enablement for cases where memory is the most important resource being managed. I do see a hit in performance with my results (please see the data below), but the savings are quite large. The other solution mentioned in the TODOs is to have the balloon driver invoke this path. The sysctl also allows the guest to tune the amount of unmapped page cache if needed. The knobs are for:

1. Selective enablement
2. Selective control of the % of unmapped pages

-- 
Three Cheers,
Balbir
[PATCH] KVM: cleanup: change to use bool return values
Make use of bool as return values.

Signed-off-by: Gui Jianfeng guijianf...@cn.fujitsu.com
---
 arch/x86/kvm/vmx.c | 72 ++--
 1 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 06108f3..cc0628e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -231,65 +231,65 @@ static const u32 vmx_msr_index[] = {
 };
 #define NR_VMX_MSR ARRAY_SIZE(vmx_msr_index)

-static inline int is_page_fault(u32 intr_info)
+static inline bool is_page_fault(u32 intr_info)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK |
 			     INTR_INFO_VALID_MASK)) ==
 		(INTR_TYPE_HARD_EXCEPTION | PF_VECTOR | INTR_INFO_VALID_MASK);
 }

-static inline int is_no_device(u32 intr_info)
+static inline bool is_no_device(u32 intr_info)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK |
 			     INTR_INFO_VALID_MASK)) ==
 		(INTR_TYPE_HARD_EXCEPTION | NM_VECTOR | INTR_INFO_VALID_MASK);
 }

-static inline int is_invalid_opcode(u32 intr_info)
+static inline bool is_invalid_opcode(u32 intr_info)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK |
 			     INTR_INFO_VALID_MASK)) ==
 		(INTR_TYPE_HARD_EXCEPTION | UD_VECTOR | INTR_INFO_VALID_MASK);
 }

-static inline int is_external_interrupt(u32 intr_info)
+static inline bool is_external_interrupt(u32 intr_info)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
 		== (INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK);
 }

-static inline int is_machine_check(u32 intr_info)
+static inline bool is_machine_check(u32 intr_info)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK |
 			     INTR_INFO_VALID_MASK)) ==
 		(INTR_TYPE_HARD_EXCEPTION | MC_VECTOR | INTR_INFO_VALID_MASK);
 }

-static inline int cpu_has_vmx_msr_bitmap(void)
+static inline bool cpu_has_vmx_msr_bitmap(void)
 {
-	return vmcs_config.cpu_based_exec_ctrl & CPU_BASED_USE_MSR_BITMAPS;
+	return !!(vmcs_config.cpu_based_exec_ctrl & CPU_BASED_USE_MSR_BITMAPS);
 }

-static inline int cpu_has_vmx_tpr_shadow(void)
+static
inline bool cpu_has_vmx_tpr_shadow(void)
 {
-	return vmcs_config.cpu_based_exec_ctrl & CPU_BASED_TPR_SHADOW;
+	return !!(vmcs_config.cpu_based_exec_ctrl & CPU_BASED_TPR_SHADOW);
 }

-static inline int vm_need_tpr_shadow(struct kvm *kvm)
+static inline bool vm_need_tpr_shadow(struct kvm *kvm)
 {
 	return (cpu_has_vmx_tpr_shadow()) && (irqchip_in_kernel(kvm));
 }

-static inline int cpu_has_secondary_exec_ctrls(void)
+static inline bool cpu_has_secondary_exec_ctrls(void)
 {
-	return vmcs_config.cpu_based_exec_ctrl &
-		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+	return !!(vmcs_config.cpu_based_exec_ctrl &
+		  CPU_BASED_ACTIVATE_SECONDARY_CONTROLS);
 }

 static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
 {
-	return vmcs_config.cpu_based_2nd_exec_ctrl &
-		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+	return !!(vmcs_config.cpu_based_2nd_exec_ctrl &
+		  SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
 }

 static inline bool cpu_has_vmx_flexpriority(void)
@@ -323,59 +323,59 @@ static inline bool cpu_has_vmx_ept_1g_page(void)
 	return !!(vmx_capability.ept & VMX_EPT_1GB_PAGE_BIT);
 }

-static inline int cpu_has_vmx_invept_individual_addr(void)
+static inline bool cpu_has_vmx_invept_individual_addr(void)
 {
 	return !!(vmx_capability.ept & VMX_EPT_EXTENT_INDIVIDUAL_BIT);
 }

-static inline int cpu_has_vmx_invept_context(void)
+static inline bool cpu_has_vmx_invept_context(void)
 {
 	return !!(vmx_capability.ept & VMX_EPT_EXTENT_CONTEXT_BIT);
 }

-static inline int cpu_has_vmx_invept_global(void)
+static inline bool cpu_has_vmx_invept_global(void)
 {
 	return !!(vmx_capability.ept & VMX_EPT_EXTENT_GLOBAL_BIT);
 }

-static inline int cpu_has_vmx_ept(void)
+static inline bool cpu_has_vmx_ept(void)
 {
-	return vmcs_config.cpu_based_2nd_exec_ctrl &
-		SECONDARY_EXEC_ENABLE_EPT;
+	return !!(vmcs_config.cpu_based_2nd_exec_ctrl &
+		  SECONDARY_EXEC_ENABLE_EPT);
 }

-static inline int cpu_has_vmx_unrestricted_guest(void)
+static inline bool cpu_has_vmx_unrestricted_guest(void)
 {
-	return vmcs_config.cpu_based_2nd_exec_ctrl &
		SECONDARY_EXEC_UNRESTRICTED_GUEST;
+	return !!(vmcs_config.cpu_based_2nd_exec_ctrl &
+		  SECONDARY_EXEC_UNRESTRICTED_GUEST);
 }

-static inline int cpu_has_vmx_ple(void)
+static inline bool cpu_has_vmx_ple(void)
 {
-	return vmcs_config.cpu_based_2nd_exec_ctrl &
-		SECONDARY_EXEC_PAUSE_LOOP_EXITING;
+	return !!(vmcs_config.cpu_based_2nd_exec_ctrl &
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
On 03/15/2010 10:07 AM, Balbir Singh wrote: * Avi Kivity a...@redhat.com [2010-03-15 09:48:05]: On 03/15/2010 09:22 AM, Balbir Singh wrote: This patch implements unmapped page cache control via preferred page cache reclaim. [...] There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages.

Well, for a guest, host page cache is a lot slower than guest page cache.

Yes, it is a virtio call away, but is the cost of paying twice in terms of memory acceptable?

Usually, it isn't, which is why I recommend cache=off.

One of the reasons I created a boot parameter was to deal with selective enablement for cases where memory is the most important resource being managed. I do see a hit in performance with my results (please see the data below), but the savings are quite large. The other solution mentioned in the TODOs is to have the balloon driver invoke this path. The sysctl also allows the guest to tune the amount of unmapped page cache if needed. The knobs are for: 1. Selective enablement 2. Selective control of the % of unmapped pages

An alternative path is to enable KSM for page cache. Then we have direct read-only guest access to host page cache, without any guest modifications required. That will be pretty difficult to achieve, though - it will need a read-only bit in the page cache radix tree, and all paths will have to be taught to honour it.
-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH] KVM: cleanup: change to use bool return values
On 03/15/2010 10:23 AM, Gui Jianfeng wrote:

Make use of bool as return values.

-static inline int cpu_has_vmx_tpr_shadow(void)
+static inline bool cpu_has_vmx_tpr_shadow(void)
 {
-	return vmcs_config.cpu_based_exec_ctrl & CPU_BASED_TPR_SHADOW;
+	return !!(vmcs_config.cpu_based_exec_ctrl & CPU_BASED_TPR_SHADOW);
 }

Those !! are not required - demotion to bool is defined to convert nonzero to true.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH] KVM: x86: Use native_store_idt() instead of kvm_get_idt()
On Fri, Mar 05, 2010 at 12:11:48PM +0800, Wei Yongjun wrote:

This patch uses the generic Linux function native_store_idt() instead of kvm_get_idt(), and also removes the now-useless kvm_get_idt().

Signed-off-by: Wei Yongjun yj...@cn.fujitsu.com
---
 arch/x86/include/asm/kvm_host.h |    5 -
 arch/x86/kvm/vmx.c              |    2 +-
 2 files changed, 1 insertions(+), 6 deletions(-)

Applied, thanks.
Re: [RFC] Moving dirty bitmaps to userspace - Double buffering approach
On Mon, Mar 08, 2010 at 05:22:43PM +0900, Takuya Yoshikawa wrote:

Hi, I would like to hear your comments about the following plan: "Moving dirty bitmaps to userspace - Double buffering approach". Especially, I would be glad to hear some advice about how to keep compatibility. Thanks in advance, Takuya.

---

Overview: Last time, I submitted a patch "make get dirty log ioctl return the first dirty page's position" http://www.spinics.net/lists/kvm/msg29724.html and got some new, better ideas from Avi. As a result, I agreed to try to eliminate the bitmap allocation done in x86 KVM every time we execute "get dirty log", by using a double buffering approach. Here is my plan:

- Move the dirty bitmap allocation to userspace. We allocate bitmaps in userspace and register them by ioctl. Once a bitmap is registered, we do not touch it from userspace and let the kernel modify it directly until we switch to the next bitmap. We use double buffering at this switch point: userspace gives the kernel a new bitmap by ioctl and the kernel switches the bitmap atomically to the new one. After this switch succeeds, we can read the old bitmap freely in userspace and free it if we want; needless to say, we can also reuse it at the next switch.

- Implementation details: Although it may be possible to touch the bitmap from the kernel side without doing kmap, I think kmapping the bitmap is better. So we may use the following functions, paying enough attention to preemption control: get_user_pages(), kmap_atomic().

- Compatibility issues: What I am facing now are the compatibility issues. We have to support both userspace and kernel side bitmap allocations to let the current qemu and KVM work properly. 1. From the kernel side, we have to handle bitmap allocations done in both kvm_vm_ioctl_set_memory_region() and kvm_vm_ioctl_get_dirty_log(). 2. From the userspace side, we have to check the new API's availability and determine which way to use, e.g.
by using the "check extension" ioctl. The most problematic is 1, the kernel side. We have to be able to know by which way the current bitmap allocation is being done, using flags or something. In the case of "set memory region", we have to judge whether we allocate a bitmap, and if not, we have to register a bitmap later by another API: "set memory region" is not restricted to the dirty log issues and needs more care than "get dirty log". Are there any good ways to solve this kind of problem?

You can introduce a new get_dirty_log ioctl that passes the address of the next bitmap in userspace, and use it (after pinning with get_user_pages), instead of vmalloc'ing.
RE: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
+/* The structure to notify the virtqueue for async socket */
+struct vhost_notifier {
+	struct list_head list;
+	struct vhost_virtqueue *vq;
+	int head;
+	int size;
+	int log;
+	void *ctrl;
+	void (*dtor)(struct vhost_notifier *);
+};
+

So IMO, this is not the best interface between vhost and your driver, exposing them to each other unnecessarily. If you think about it, your driver should not care about this structure. It could get e.g. a kiocb (sendmsg already gets one), and call ki_dtor on completion. vhost could save its state in ki_user_data. If your driver needs to add more data to do more tracking, I think it can put the skb pointer in the private pointer.

Then if I remove struct vhost_notifier and just use struct kiocb, but don't use the one from sendmsg or recvmsg, instead allocating one within the page_info structure, and don't implement any aio logic related to it, is that ok? Sorry, I made a patch but don't know how to reply here with a well-formatted patch.

Thanks
Xiaohui
Re: [RFC] Moving dirty bitmaps to userspace - Double buffering approach
On 03/15/2010 10:33 AM, Marcelo Tosatti wrote:

Are there any good ways to solve this kind of problem?

You can introduce a new get_dirty_log ioctl that passes the address of the next bitmap in userspace, and use it (after pinning with get_user_pages), instead of vmalloc'ing.

No pinning please; put_user_bit() or set_bit_user(). (It can be implemented generically using get_user_pages() and kmap_atomic(), but x86 should get an optimized implementation.)

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH 15/18] KVM: MMU: Propagate the right fault back to the guest after gva_to_gpa
On Mon, Mar 15, 2010 at 09:36:52AM +0200, Avi Kivity wrote: On 03/03/2010 09:12 PM, Joerg Roedel wrote:

This patch implements logic to make sure that either a page-fault/page-fault-vmexit or a nested-page-fault-vmexit is propagated back to the guest.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/mmu.h         |    1 +
 arch/x86/kvm/paging_tmpl.h |    2 ++
 arch/x86/kvm/x86.c         |   15 ++-
 3 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 64f619b..b42b27e 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -47,6 +47,7 @@
 #define PFERR_USER_MASK (1U << 2)
 #define PFERR_RSVD_MASK (1U << 3)
 #define PFERR_FETCH_MASK (1U << 4)
+#define PFERR_NESTED_MASK (1U << 31)

Why is this needed? Queue an ordinary page fault; the injection code should check the page fault intercept and #VMEXIT if needed.

This is needed because we could have either a nested page fault or an ordinary page fault which needs to be propagated.

Joerg
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
* Avi Kivity a...@redhat.com [2010-03-15 10:27:45]:

On 03/15/2010 10:07 AM, Balbir Singh wrote: * Avi Kivity a...@redhat.com [2010-03-15 09:48:05]: On 03/15/2010 09:22 AM, Balbir Singh wrote: This patch implements unmapped page cache control via preferred page cache reclaim. [...] There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages.

Well, for a guest, host page cache is a lot slower than guest page cache.

Yes, it is a virtio call away, but is the cost of paying twice in terms of memory acceptable?

Usually, it isn't, which is why I recommend cache=off.

cache=off works only for filesystems that support direct I/O, and my concern is that one of the side effects is that idle VMs can consume a lot of memory (assuming all the memory is available to them). As the number of VMs grows, they could cache a whole lot of memory. In my experiments I found that the total amount of memory cached far exceeded the mapped ratio by a large amount when we had idle VMs. The philosophy of this patch is to move the caching to the _host_ and let the host maintain the cache instead of the guest.

One of the reasons I created a boot parameter was to deal with selective enablement for cases where memory is the most important resource being managed. I do see a hit in performance with my results (please see the data below), but the savings are quite large.
The other solution mentioned in the TODOs is to have the balloon driver invoke this path. The sysctl also allows the guest to tune the amount of unmapped page cache if needed. The knobs are for 1. Selective enablement 2. Selective control of the % of unmapped pages An alternative path is to enable KSM for page cache. Then we have direct read-only guest access to host page cache, without any guest modifications required. That will be pretty difficult to achieve though - will need a readonly bit in the page cache radix tree, and teach all paths to honour it. Yes, it is, I've taken a quick look. I am not sure if de-duplication would be the best approach, may be dropping the page in the page cache might be a good first step. Data consistency would be much easier to maintain that way, as long as the guest is not writing frequently to that page, we don't need the page cache in the host. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- Three Cheers, Balbir -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 15/18] KVM: MMU: Propagate the right fault back to the guest after gva_to_gpa
On 03/15/2010 11:06 AM, Joerg Roedel wrote: On Mon, Mar 15, 2010 at 09:36:52AM +0200, Avi Kivity wrote: On 03/03/2010 09:12 PM, Joerg Roedel wrote: This patch implements logic to make sure that either a page-fault/page-fault-vmexit or a nested-page-fault-vmexit is propagated back to the guest. Signed-off-by: Joerg Roedel joerg.roe...@amd.com --- arch/x86/kvm/mmu.h |1 + arch/x86/kvm/paging_tmpl.h |2 ++ arch/x86/kvm/x86.c | 15 ++- 3 files changed, 17 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 64f619b..b42b27e 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -47,6 +47,7 @@ #define PFERR_USER_MASK (1U << 2) #define PFERR_RSVD_MASK (1U << 3) #define PFERR_FETCH_MASK (1U << 4) +#define PFERR_NESTED_MASK (1U << 31) Why is this needed? Queue an ordinary page fault; the injection code should check the page fault intercept and #VMEXIT if needed. This is needed because we could have a nested page fault or an ordinary page fault which needs to be propagated. Right. Why is pio_copy_data() changed? One would think that it would be an all-or-nothing affair. -- error compiling committee.c: too many arguments to function
Re: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
On Mon, Mar 15, 2010 at 04:46:50PM +0800, Xin, Xiaohui wrote: +/* The structure to notify the virtqueue for async socket */ +struct vhost_notifier { + struct list_head list; + struct vhost_virtqueue *vq; + int head; + int size; + int log; + void *ctrl; + void (*dtor)(struct vhost_notifier *); +}; + So IMO, this is not the best interface between vhost and your driver, exposing them to each other unnecessarily. If you think about it, your driver should not care about this structure. It could get e.g. a kiocb (sendmsg already gets one), and call ki_dtor on completion. vhost could save its state in ki_user_data. If your driver needs to add more data for tracking, I think it can put an skb pointer in the private pointer. Then if I remove the struct vhost_notifier, and just use struct kiocb, but don't use the one got from sendmsg or recvmsg, but allocate one within the page_info structure, and don't implement any aio logic related to it, is that ok? Hmm, not sure I understand. It seems both cleaner and easier to use the iocb passed to sendmsg/recvmsg. No? I am not saying you necessarily must implement full aio directly. Sorry, I made a patch, but don't know how to reply to mail with a well-formatted patch here Thanks Xiaohui Maybe Documentation/email-clients.txt will help? Generally you do it like this (at start of mail): Subject: one line patch summary (overrides mail subject) multiline patch description Signed-off-by: ... --- Free text comes after the --- delimiter, before the patch. diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index a140dad..e830b30 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -106,22 +106,41 @@ static void handle_tx(struct vhost_net *net) -- MST
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On 03/15/2010 11:17 AM, Balbir Singh wrote: * Avi Kivitya...@redhat.com [2010-03-15 10:27:45]: On 03/15/2010 10:07 AM, Balbir Singh wrote: * Avi Kivitya...@redhat.com [2010-03-15 09:48:05]: On 03/15/2010 09:22 AM, Balbir Singh wrote: Selectively control Unmapped Page Cache (nospam version) From: Balbir Singhbal...@linux.vnet.ibm.com This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested for unmapped page control. This is useful in the following scenario - In a virtualized environment with cache!=none, we see double caching - (one in the host and one in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages. Well, for a guest, host page cache is a lot slower than guest page cache. Yes, it is a virtio call away, but is the cost of paying twice in terms of memory acceptable? Usually, it isn't, which is why I recommend cache=off. cache=off works for *direct I/O* supported filesystems and my concern is that one of the side-effects is that idle VM's can consume a lot of memory (assuming all the memory is available to them). As the number of VM's grow, they could cache a whole lot of memory. In my experiments I found that the total amount of memory cached far exceeded the mapped ratio by a large amount when we had idle VM's. The philosophy of this patch is to move the caching to the _host_ and let the host maintain the cache instead of the guest. That's only beneficial if the cache is shared. Otherwise, you could use the balloon to evict cache when memory is tight. Shared cache is mostly a desktop thing where users run similar workloads. For servers, it's much less likely. 
So a modified-guest doesn't help a lot here. One of the reasons I created a boot parameter was to deal with selective enablement for cases where memory is the most important resource being managed. I do see a hit in performance with my results (please see the data below), but the savings are quite large. The other solution mentioned in the TODOs is to have the balloon driver invoke this path. The sysctl also allows the guest to tune the amount of unmapped page cache if needed. The knobs are for 1. Selective enablement 2. Selective control of the % of unmapped pages An alternative path is to enable KSM for page cache. Then we have direct read-only guest access to host page cache, without any guest modifications required. That will be pretty difficult to achieve though - will need a readonly bit in the page cache radix tree, and teach all paths to honour it. Yes, it is, I've taken a quick look. I am not sure if de-duplication would be the best approach, may be dropping the page in the page cache might be a good first step. Data consistency would be much easier to maintain that way, as long as the guest is not writing frequently to that page, we don't need the page cache in the host. Trimming the host page cache should happen automatically under pressure. Since the page is cached by the guest, it won't be re-read, so the host page is not frequently used and then dropped. -- error compiling committee.c: too many arguments to function
[PATCH] KVM: Cleanup: change to use bool return values
Make use of bool return values, and remove some useless bool value conversions. Thanks to Avi for pointing this out. Signed-off-by: Gui Jianfeng guijianf...@cn.fujitsu.com --- arch/x86/kvm/vmx.c | 54 ++-- 1 files changed, 27 insertions(+), 27 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 06108f3..3ddcfc5 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -231,56 +231,56 @@ static const u32 vmx_msr_index[] = { }; #define NR_VMX_MSR ARRAY_SIZE(vmx_msr_index) -static inline int is_page_fault(u32 intr_info) +static inline bool is_page_fault(u32 intr_info) { return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK | INTR_INFO_VALID_MASK)) == (INTR_TYPE_HARD_EXCEPTION | PF_VECTOR | INTR_INFO_VALID_MASK); } -static inline int is_no_device(u32 intr_info) +static inline bool is_no_device(u32 intr_info) { return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK | INTR_INFO_VALID_MASK)) == (INTR_TYPE_HARD_EXCEPTION | NM_VECTOR | INTR_INFO_VALID_MASK); } -static inline int is_invalid_opcode(u32 intr_info) +static inline bool is_invalid_opcode(u32 intr_info) { return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK | INTR_INFO_VALID_MASK)) == (INTR_TYPE_HARD_EXCEPTION | UD_VECTOR | INTR_INFO_VALID_MASK); } -static inline int is_external_interrupt(u32 intr_info) +static inline bool is_external_interrupt(u32 intr_info) { return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK)) == (INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK); } -static inline int is_machine_check(u32 intr_info) +static inline bool is_machine_check(u32 intr_info) { return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK | INTR_INFO_VALID_MASK)) == (INTR_TYPE_HARD_EXCEPTION | MC_VECTOR | INTR_INFO_VALID_MASK); } -static inline int cpu_has_vmx_msr_bitmap(void) +static inline bool cpu_has_vmx_msr_bitmap(void) { return vmcs_config.cpu_based_exec_ctrl & CPU_BASED_USE_MSR_BITMAPS; } -static inline int cpu_has_vmx_tpr_shadow(void) +static inline bool cpu_has_vmx_tpr_shadow(void) { return vmcs_config.cpu_based_exec_ctrl & CPU_BASED_TPR_SHADOW; } -static inline int vm_need_tpr_shadow(struct kvm *kvm) +static inline bool vm_need_tpr_shadow(struct kvm *kvm) { return (cpu_has_vmx_tpr_shadow()) && (irqchip_in_kernel(kvm)); } -static inline int cpu_has_secondary_exec_ctrls(void) +static inline bool cpu_has_secondary_exec_ctrls(void) { return vmcs_config.cpu_based_exec_ctrl & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; @@ -300,80 +300,80 @@ static inline bool cpu_has_vmx_flexpriority(void) static inline bool cpu_has_vmx_ept_execute_only(void) { - return !!(vmx_capability.ept & VMX_EPT_EXECUTE_ONLY_BIT); + return vmx_capability.ept & VMX_EPT_EXECUTE_ONLY_BIT; } static inline bool cpu_has_vmx_eptp_uncacheable(void) { - return !!(vmx_capability.ept & VMX_EPTP_UC_BIT); + return vmx_capability.ept & VMX_EPTP_UC_BIT; } static inline bool cpu_has_vmx_eptp_writeback(void) { - return !!(vmx_capability.ept & VMX_EPTP_WB_BIT); + return vmx_capability.ept & VMX_EPTP_WB_BIT; } static inline bool cpu_has_vmx_ept_2m_page(void) { - return !!(vmx_capability.ept & VMX_EPT_2MB_PAGE_BIT); + return vmx_capability.ept & VMX_EPT_2MB_PAGE_BIT; } static inline bool cpu_has_vmx_ept_1g_page(void) { - return !!(vmx_capability.ept & VMX_EPT_1GB_PAGE_BIT); + return vmx_capability.ept & VMX_EPT_1GB_PAGE_BIT; } -static inline int cpu_has_vmx_invept_individual_addr(void) +static inline bool cpu_has_vmx_invept_individual_addr(void) { - return !!(vmx_capability.ept & VMX_EPT_EXTENT_INDIVIDUAL_BIT); + return vmx_capability.ept & VMX_EPT_EXTENT_INDIVIDUAL_BIT; } -static inline int cpu_has_vmx_invept_context(void) +static inline bool cpu_has_vmx_invept_context(void) { - return !!(vmx_capability.ept & VMX_EPT_EXTENT_CONTEXT_BIT); + return vmx_capability.ept & VMX_EPT_EXTENT_CONTEXT_BIT; } -static inline int cpu_has_vmx_invept_global(void) +static inline bool cpu_has_vmx_invept_global(void) { - return !!(vmx_capability.ept & VMX_EPT_EXTENT_GLOBAL_BIT); + return vmx_capability.ept & VMX_EPT_EXTENT_GLOBAL_BIT; } -static inline int cpu_has_vmx_ept(void) +static inline bool cpu_has_vmx_ept(void) { return vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_ENABLE_EPT; } -static inline int cpu_has_vmx_unrestricted_guest(void) +static inline bool cpu_has_vmx_unrestricted_guest(void) { return vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_UNRESTRICTED_GUEST; } -static inline int
Re: [PATCH 15/18] KVM: MMU: Propagate the right fault back to the guest after gva_to_gpa
On Mon, Mar 15, 2010 at 11:23:07AM +0200, Avi Kivity wrote: On 03/15/2010 11:06 AM, Joerg Roedel wrote: On Mon, Mar 15, 2010 at 09:36:52AM +0200, Avi Kivity wrote: On 03/03/2010 09:12 PM, Joerg Roedel wrote: This patch implements logic to make sure that either a page-fault/page-fault-vmexit or a nested-page-fault-vmexit is propagated back to the guest. Signed-off-by: Joerg Roedel joerg.roe...@amd.com --- arch/x86/kvm/mmu.h |1 + arch/x86/kvm/paging_tmpl.h |2 ++ arch/x86/kvm/x86.c | 15 ++- 3 files changed, 17 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 64f619b..b42b27e 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -47,6 +47,7 @@ #define PFERR_USER_MASK (1U << 2) #define PFERR_RSVD_MASK (1U << 3) #define PFERR_FETCH_MASK (1U << 4) +#define PFERR_NESTED_MASK (1U << 31) Why is this needed? Queue an ordinary page fault; the injection code should check the page fault intercept and #VMEXIT if needed. This is needed because we could have a nested page fault or an ordinary page fault which needs to be propagated. Right. Why is pio_copy_data() changed? One would think that it would be an all-or-nothing affair. It was the only place I found where the PROPAGATE_FAULT value was checked and actually propagated. Joerg
Re: [PATCH v2 28/30] KVM: x86 emulator: restart string instruction without going back to a guest.
On Mon, Mar 15, 2010 at 09:44:26AM +0200, Avi Kivity wrote: On 03/14/2010 08:06 PM, Gleb Natapov wrote: Suggest simply reentering every N executions. This restart mechanism is, in fact, needed for ins read ahead to work. After reading ahead from the IO port we need to avoid entering the decoder until the entire cache is consumed, otherwise the decoder will clear the cache and data will be lost. So we can't just enter the guest at arbitrary times, only when the read ahead cache is empty. Since read ahead is never done across a page boundary, this is a safe place to re-enter the guest. Please make the two depend on each other directly then. We can't expect the reader of the emulator code to know that. We can document that. I wouldn't want to have different conditions for guest re-entry for different opcodes. Have the emulator ask the buffer when it is empty. It will always be empty for all string ops except INS. -- Gleb.
Re: [PATCH v2 28/30] KVM: x86 emulator: restart string instruction without going back to a guest.
On 03/15/2010 11:44 AM, Gleb Natapov wrote: On Mon, Mar 15, 2010 at 09:44:26AM +0200, Avi Kivity wrote: On 03/14/2010 08:06 PM, Gleb Natapov wrote: Suggest simply reentering every N executions. This restart mechanism is, in fact, needed for ins read ahead to work. After reading ahead from the IO port we need to avoid entering the decoder until the entire cache is consumed, otherwise the decoder will clear the cache and data will be lost. So we can't just enter the guest at arbitrary times, only when the read ahead cache is empty. Since read ahead is never done across a page boundary, this is a safe place to re-enter the guest. Please make the two depend on each other directly then. We can't expect the reader of the emulator code to know that. We can document that. I wouldn't want to have different conditions for guest re-entry for different opcodes. We now have a write buffer size of one. It's just a matter of making the emulator know the size of the buffer (extra parameter to ->write_emulated). Have the emulator ask the buffer when it is empty. It will always be empty for all string ops except INS. Or we can make the buffer larger for everyone (outside this patchset though). -- error compiling committee.c: too many arguments to function
Re: [PATCH v2 28/30] KVM: x86 emulator: restart string instruction without going back to a guest.
On Mon, Mar 15, 2010 at 11:56:32AM +0200, Avi Kivity wrote: On 03/15/2010 11:44 AM, Gleb Natapov wrote: On Mon, Mar 15, 2010 at 09:44:26AM +0200, Avi Kivity wrote: On 03/14/2010 08:06 PM, Gleb Natapov wrote: Suggest simply reentering every N executions. This restart mechanism is, in fact, needed for ins read ahead to work. After reading ahead from the IO port we need to avoid entering the decoder until the entire cache is consumed, otherwise the decoder will clear the cache and data will be lost. So we can't just enter the guest at arbitrary times, only when the read ahead cache is empty. Since read ahead is never done across a page boundary, this is a safe place to re-enter the guest. Please make the two depend on each other directly then. We can't expect the reader of the emulator code to know that. We can document that. I wouldn't want to have different conditions for guest re-entry for different opcodes. We now have a write buffer size of one. It's just a matter of making the emulator know the size of the buffer (extra parameter to ->write_emulated). The buffer is maintained inside the emulator, so the emulator knows about it and can check it, but then for all other string instructions except INS we will re-enter the guest on each iteration. Have the emulator ask the buffer when it is empty. It will always be empty for all string ops except INS. Or we can make the buffer larger for everyone (outside this patchset though). I am not sure what you mean here. INS read ahead and the MMIO read cache are different beasts. The former is needed to speed up string pio reads, the latter (not yet implemented) is needed to reread previous MMIO read results in case instruction emulation is restarted due to the need to exit to userspace. The MMIO read cache needs to be invalidated on each iteration of a string instruction. -- Gleb.
Re: [PATCH v2 28/30] KVM: x86 emulator: restart string instruction without going back to a guest.
On 03/15/2010 12:07 PM, Gleb Natapov wrote: Or we can make the buffer larger for everyone (outside this patchset though). I am not sure what do you mean here. INS read ahead and MMIO read cache are different beasts. Former is needed to speed-up string pio reads, later (not yet implemented) is needed to reread previous MMIO read results in case instruction emulation is restarted due to need to exit to userspace. MMIO read cache need to be invalidated on each iteration of string instruction. Instructions with multiple reads or writes need an mmio read/write buffer that can be replayed on re-execution. buffer != cache! A cache can be dropped (perhaps after flushing it to a backing store), but a buffer in general cannot. -- error compiling committee.c: too many arguments to function
Re: [PATCH 0/5] Fix some mmu/emulator atomicity issues (v2)
On Sun, Mar 14, 2010 at 09:03:47AM +0200, Avi Kivity wrote: On 03/10/2010 04:50 PM, Avi Kivity wrote: Currently when we emulate a locked operation into a shadowed guest page table, we perform a write rather than a true atomic. This is indicated by the emulating exchange as write message that shows up in dmesg. In addition, the pte prefetch operation during invlpg suffered from a race. This was fixed by removing the operation. This patchset fixes both issues and reinstates pte prefetch on invlpg. v2: - fix truncated description for patch 1 - add new patch 4, which fixes a bug in patch 5 No comments, but looks like last week's maintainer neglected to merge this. Looks fine. Can you please regenerate against next branch? (just pushed). For the invlpg prefetch it would be good to confirm the original bug is not reproducible.
Re: [PATCH v2 28/30] KVM: x86 emulator: restart string instruction without going back to a guest.
On Mon, Mar 15, 2010 at 12:15:22PM +0200, Avi Kivity wrote: On 03/15/2010 12:07 PM, Gleb Natapov wrote: Or we can make the buffer larger for everyone (outside this patchset though). I am not sure what you mean here. INS read ahead and the MMIO read cache are different beasts. The former is needed to speed up string pio reads, the latter (not yet implemented) is needed to reread previous MMIO read results in case instruction emulation is restarted due to the need to exit to userspace. The MMIO read cache needs to be invalidated on each iteration of a string instruction. Instructions with multiple reads or writes need an mmio read/write buffer that can be replayed on re-execution. buffer != cache! A cache can be dropped (perhaps after flushing it to a backing store), but a buffer in general cannot. That is just naming. Call it a buffer if you want. I still don't understand what you mean by 'Or we can make the buffer larger for everyone'. Who is this everyone? Different instructions need different kinds of buffers. -- Gleb.
Re: [PATCH v2 28/30] KVM: x86 emulator: restart string instruction without going back to a guest.
On 03/15/2010 12:19 PM, Gleb Natapov wrote: On Mon, Mar 15, 2010 at 12:15:22PM +0200, Avi Kivity wrote: On 03/15/2010 12:07 PM, Gleb Natapov wrote: Or we can make the buffer larger for everyone (outside this patchset though). I am not sure what you mean here. INS read ahead and the MMIO read cache are different beasts. The former is needed to speed up string pio reads, the latter (not yet implemented) is needed to reread previous MMIO read results in case instruction emulation is restarted due to the need to exit to userspace. The MMIO read cache needs to be invalidated on each iteration of a string instruction. Instructions with multiple reads or writes need an mmio read/write buffer that can be replayed on re-execution. buffer != cache! A cache can be dropped (perhaps after flushing it to a backing store), but a buffer in general cannot. That is just naming. Call it a buffer if you want. I still don't understand what you mean by 'Or we can make the buffer larger for everyone'. Who is this everyone? Different instructions need different kinds of buffers. Many instructions can issue multiple reads, ins is just one of them. A generic mmio buffer can be used by everyone. -- error compiling committee.c: too many arguments to function
Re: [PATCH v2 28/30] KVM: x86 emulator: restart string instruction without going back to a guest.
On Mon, Mar 15, 2010 at 12:24:43PM +0200, Avi Kivity wrote: On 03/15/2010 12:19 PM, Gleb Natapov wrote: On Mon, Mar 15, 2010 at 12:15:22PM +0200, Avi Kivity wrote: On 03/15/2010 12:07 PM, Gleb Natapov wrote: Or we can make the buffer larger for everyone (outside this patchset though). I am not sure what you mean here. INS read ahead and the MMIO read cache are different beasts. The former is needed to speed up string pio reads, the latter (not yet implemented) is needed to reread previous MMIO read results in case instruction emulation is restarted due to the need to exit to userspace. The MMIO read cache needs to be invalidated on each iteration of a string instruction. Instructions with multiple reads or writes need an mmio read/write buffer that can be replayed on re-execution. buffer != cache! A cache can be dropped (perhaps after flushing it to a backing store), but a buffer in general cannot. That is just naming. Call it a buffer if you want. I still don't understand what you mean by 'Or we can make the buffer larger for everyone'. Who is this everyone? Different instructions need different kinds of buffers. Many instructions can issue multiple reads, ins is just one of them. A generic mmio buffer can be used by everyone. No, ins can issue only _one_ io read during one iteration (i.e. between each pair of reads there is a commit point). But this is slow, so we do a non-architectural hack: do many reads ahead of time into a buffer and use results from this buffer for emulation of subsequent iterations. Other instructions can do multiple reads between instruction fetch and commit of the emulation result, and need a different kind of buffering (actually caching is more appropriate here, since we cache results of reads from past attempts to emulate the same instruction). -- Gleb.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
* Avi Kivity a...@redhat.com [2010-03-15 11:27:56]: The knobs are for 1. Selective enablement 2. Selective control of the % of unmapped pages An alternative path is to enable KSM for page cache. Then we have direct read-only guest access to host page cache, without any guest modifications required. That will be pretty difficult to achieve though - will need a readonly bit in the page cache radix tree, and teach all paths to honour it. Yes, it is, I've taken a quick look. I am not sure if de-duplication would be the best approach, may be dropping the page in the page cache might be a good first step. Data consistency would be much easier to maintain that way, as long as the guest is not writing frequently to that page, we don't need the page cache in the host. Trimming the host page cache should happen automatically under pressure. Since the page is cached by the guest, it won't be re-read, so the host page is not frequently used and then dropped. Yes, agreed, but dropping is easier than tagging cache as read-only and getting everybody to understand read-only cached pages. -- Three Cheers, Balbir
Re: [RFC] Moving dirty bitmaps to userspace - Double buffering approach
Avi Kivity wrote: On 03/15/2010 10:33 AM, Marcelo Tosatti wrote: Are there any good ways to solve this kind of problem? You can introduce a new get_dirty_log ioctl that passes the address of the next bitmap in userspace, and use it (after pinning with get_user_pages), instead of vmalloc'ing. Thank you for your advice! No pinning please, put_user_bit() or set_bit_user(). (can be implemented generically using get_user_pages() and kmap_atomic(), but x86 should get an optimized implementation) Given your advice last time, I started this with my colleague. -- We were just talking about how to handle all the architectures. As per your comment, we'll make the generic implementation, with an optimized one for x86, first. Thanks Takuya
Re: [PATCH v2 06/30] KVM: remove realmode_lmsw function.
Gleb Natapov wrote: Use (get|set)_cr callback to emulate lmsw inside emulator. I see that vmx.c:handle_cr() is the only other user of kvm_lmsw(). If we fix this place similarly to what you did below, we could get rid of kvm_lmsw() entirely. But I am not sure whether it's OK to remove an exported symbol. Regards, Andre. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_host.h |2 -- arch/x86/kvm/emulate.c |4 ++-- arch/x86/kvm/x86.c |7 --- 3 files changed, 2 insertions(+), 11 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index e8e108a..1e15a0a 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -582,8 +582,6 @@ int emulate_instruction(struct kvm_vcpu *vcpu, void kvm_report_emulation_failure(struct kvm_vcpu *cvpu, const char *context); void realmode_lgdt(struct kvm_vcpu *vcpu, u16 size, unsigned long address); void realmode_lidt(struct kvm_vcpu *vcpu, u16 size, unsigned long address); -void realmode_lmsw(struct kvm_vcpu *vcpu, unsigned long msw, - unsigned long *rflags); void kvm_enable_efer_bits(u64); int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data); diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 5b060e4..5e2fa61 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2486,8 +2486,8 @@ twobyte_insn: c->dst.val = ops->get_cr(0, ctxt->vcpu); break; case 6: /* lmsw */ - realmode_lmsw(ctxt->vcpu, (u16)c->src.val, - &ctxt->eflags); + ops->set_cr(0, (ops->get_cr(0, ctxt->vcpu) & ~0x0ful) | + (c->src.val & 0x0f), ctxt->vcpu); c->dst.type = OP_NONE; break; case 7: /* invlpg*/ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index bf714df..b08f8a1 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4045,13 +4045,6 @@ void realmode_lidt(struct kvm_vcpu *vcpu, u16 limit, unsigned long base) kvm_x86_ops->set_idt(vcpu, &dt); } -void realmode_lmsw(struct kvm_vcpu *vcpu, unsigned long msw, - unsigned long *rflags) -{ - kvm_lmsw(vcpu, msw); -
*rflags = kvm_get_rflags(vcpu); -} - static int move_to_next_stateful_cpuid_entry(struct kvm_vcpu *vcpu, int i) { struct kvm_cpuid_entry2 *e = &vcpu->arch.cpuid_entries[i]; -- Andre Przywara AMD-OSRC (Dresden) Tel: x29712
Re: [long] MINIX 3.1.6 works in QEMU-0.12.3 only with KVM disabled
Avi Kivity wrote on 2010-03-10 13:03:25 +0200: On 03/10/2010 12:26 PM, Erik van der Kouwe wrote: I submitted this bug report a week ago: http://sourceforge.net/tracker/?func=detail&aid=2962575&group_id=180599&atid=893831 MINIX is using big real mode which is currently not well supported by kvm on Intel hardware: (qemu) info registers EAX=0010 EBX=0009 ECX=4920 EDX=a796 ESI=0200 EDI=49200200 EBP=0009 ESP=a762 EIP=f4a7 EFL=00023002 [---] CPL=3 II=0 A20=1 SMM=0 HLT=0 ES = f300 CS =f000 000f f300 SS =9492 00094920 f300 DS =97ce 00097cec f300 A ds.base of 0x97cec cannot be translated to a real mode segment. There is some work to get this to work, but it is proceeding really slowly. It should work on AMD hardware though. Hi guys, I searched the issue, and Erik was kind enough to point me to this list where there are knowledgeable people. Erik van der Kouwe wrote in http://groups.google.com/group/minix3/msg/40f44df0c434cfa6: The situation is as follows: The boot monitor runs in real-address mode, but has to copy parts of the boot image into high memory (>= 1 MB) which is not accessible from that mode as only 20 bits are available. It calls the BIOS (int 0x15) to perform the copy. This is done under the ext_copy label in boot/boothead.s. Okay. It is my understanding this is where Minix' involvement stops. The BIOS switches to protected mode, loading a GDT which it receives from the caller. Before returning to the caller, it copies data using the segment descriptors in the GDT and switches back to real-address mode. This is the description of BIOS service 15/87, which has to be implemented (using whatever solution it pleases) by the BIOS. When doing the switch, the cached segment selectors are preserved, which allows one to use protected mode segments in real-address mode (this is called unreal mode). Now this is a by-product of the implementation inside the BIOS.
In fact, even if the BIOS enters unreal mode (or the similar big real, more useful with segmentation-less architectures), before turning back to the client it (should) reset things to normal real mode, as service 15/87 is not a usual way to enter unreal mode (for example, this effect is not even mentioned in Ralf Brown's list). As a result (and, first and foremost, because of 80286 compatibility), instead of directly using unreal or big real mode if possible (as done e.g. in himem.sys), the Minix monitor goes to the great pain of going back to square #1, and since blocks are at most 64 KB in size and several iterations are needed, on the next block Minix sets up the (very similar) GDT then does another call to the same BIOS service 15/87. I knew these parts before, but this is where Avi's answer came in: KVM on Intel does not yet support unreal mode and requires the cached segment descriptors to be valid in real-address mode. I do not know which virtual BIOS KVM is using, but I notice while reading http://bochs.sourceforge.net/cgi-bin/lxr/source/bios/rombios.c: [ Slightly edited to fit the width of my post. AL. ] 3555 case 0x87: 3556 #if BX_CPU < 3 3557 # error Int15 function 87h not supported on 80386 3558 #endif 3559 // +++ should probably have descriptor checks 3560 // +++ should have exception handlers ... 3640 mov eax, cr0 3641 or al, #0x01 3642 mov cr0, eax 3643 ;; far jump to flush CPU queue after transition to prot.
mode 3644 JMP_AP(0x0020, protected_mode) 3645 3646 protected_mode: 3647 ;; GDT points to valid descriptor table, now load SS, DS, ES 3648 mov ax, #0x28 ;; 101 000 = 5th desc.in table, TI=GDT,RPL=00 3649 mov ss, ax 3650 mov ax, #0x10 ;; 010 000 = 2nd desc.in table, TI=GDT,RPL=00 3651 mov ds, ax 3652 mov ax, #0x18 ;; 011 000 = 3rd desc.in table, TI=GDT,RPL=00 3653 mov es, ax 3654 xor si, si 3655 xor di, di 3656 cld 3657 rep 3658 movsw ;; move CX words from DS:SI to ES:DI 3659 3660 ;; make sure DS and ES limits are 64KB 3661 mov ax, #0x28 3662 mov ds, ax 3663 mov es, ax 3664 3665 ;; reset PG bit in CR0 ??? 3666 mov eax, cr0 3667 and al, #0xFE 3668 mov cr0, eax I should be loosing something here... There is no unreal mode at any moment, is it? [ ... some web browsing occuring meanwhile ... Later: ] Okay, now I got another picture. 8-| Until recently, KVM (and qemu) used Bochs BIOS, showed above; but they switched recently to SeaBIOS... where the applicable code is in src/system.c, and looks like (now this is ATT assembly): 83 static void 84 handle_1587(struct bregs *regs) 85 { 86 // +++ should probably have descriptor checks 87 // +++ should have exception handlers 127 // Enable protected mode 128 movl %%cr0, %%eax\n 129 orl $ __stringify(CR0_PE)
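The core of the "big real mode" problem above is simple arithmetic: a real-mode segment register can only describe bases of the form `selector << 4`. A minimal sketch of that constraint (the function names here are illustrative, not KVM's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A real-mode segment selector yields base = selector << 4, so only
 * 16-byte-aligned bases up to 0xffff0 are representable. This is why a
 * cached ds.base of 0x97cec (from the register dump above) cannot be
 * translated back to a real-mode segment, while the ss.base of 0x94920
 * can. */
static bool real_mode_representable(uint32_t base)
{
    return (base & 0xf) == 0 && base <= 0xffff0u;
}

static uint16_t base_to_selector(uint32_t base)
{
    /* Only meaningful when real_mode_representable(base) holds. */
    return (uint16_t)(base >> 4);
}
```

With the values from the `info registers` dump: SS's base 0x94920 maps back to selector 0x9492, but DS's base 0x97cec has a nonzero low nibble and is unrepresentable, which is exactly the state KVM on Intel refuses to virtualize.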
Re: [PATCH v2 06/30] KVM: remove realmode_lmsw function.
On 03/15/2010 01:02 PM, Andre Przywara wrote: Gleb Natapov wrote: Use (get|set)_cr callback to emulate lmsw inside emulator. I see that vmx.c:handle_cr() is the only other user of kvm_lmsw(). If we fix this place similarly to what you did below, we could get rid of kvm_lmsw() entirely. But I am not sure whether it's OK to remove an exported symbol. Exported symbols can be changed or removed at will. -- error compiling committee.c: too many arguments to function
Re: [PATCH 0/5] Fix some mmu/emulator atomicity issues (v2)
On 03/15/2010 12:16 PM, Marcelo Tosatti wrote: On Sun, Mar 14, 2010 at 09:03:47AM +0200, Avi Kivity wrote: On 03/10/2010 04:50 PM, Avi Kivity wrote: Currently when we emulate a locked operation into a shadowed guest page table, we perform a write rather than a true atomic. This is indicated by the emulating exchange as write message that shows up in dmesg. In addition, the pte prefetch operation during invlpg suffered from a race. This was fixed by removing the operation. This patchset fixes both issues and reinstates pte prefetch on invlpg. v2: - fix truncated description for patch 1 - add new patch 4, which fixes a bug in patch 5 No comments, but looks like last week's maintainer neglected to merge this. Looks fine. Can you please regenerate against next branch? (just pushed). Will send out shortly. For the invlpg prefetch it would be good to confirm the original bug is not reproducible. I tried to reproduce the problem with the original revert reverted, but couldn't. -- error compiling committee.c: too many arguments to function
[PATCH 5/5] KVM: MMU: Reinstate pte prefetch on invlpg
Commit fb341f57 removed the pte prefetch on guest invlpg, citing guest races. However, the SDM is adamant that prefetch is allowed: "The processor may create entries in paging-structure caches for translations required for prefetches and for accesses that are a result of speculative execution that would never actually occur in the executed code path." And, in fact, there was a race in the prefetch code: we picked up the pte without the mmu lock held, so an older invlpg could install the pte over a newer invlpg. Reinstate the prefetch logic, but this time note whether another invlpg has executed using a counter. If a race occurred, do not install the pte. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/kvm/mmu.c | 37 +++-- arch/x86/kvm/paging_tmpl.h | 15 +++ 3 files changed, 39 insertions(+), 14 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index ea1b6c6..28826c8 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -389,6 +389,7 @@ struct kvm_arch { unsigned int n_free_mmu_pages; unsigned int n_requested_mmu_pages; unsigned int n_alloc_mmu_pages; + atomic_t invlpg_counter; struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES]; /* * Hash table of struct kvm_mmu_page.
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index f63c9ad..b3edc46 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2609,20 +2609,11 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, int flooded = 0; int npte; int r; + int invlpg_counter; pgprintk(%s: gpa %llx bytes %d\n, __func__, gpa, bytes); - switch (bytes) { - case 4: - gentry = *(const u32 *)new; - break; - case 8: - gentry = *(const u64 *)new; - break; - default: - gentry = 0; - break; - } + invlpg_counter = atomic_read(vcpu-kvm-arch.invlpg_counter); /* * Assume that the pte write on a page table of the same type @@ -2630,16 +2621,34 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, * (might be false while changing modes). Note it is verified later * by update_pte(). */ - if (is_pae(vcpu) bytes == 4) { + if ((is_pae(vcpu) bytes == 4) || !new) { /* Handle a 32-bit guest writing two halves of a 64-bit gpte */ - gpa = ~(gpa_t)7; - r = kvm_read_guest(vcpu-kvm, gpa, gentry, 8); + if (is_pae(vcpu)) { + gpa = ~(gpa_t)7; + bytes = 8; + } + r = kvm_read_guest(vcpu-kvm, gpa, gentry, min(bytes, 8)); if (r) gentry = 0; + new = (const u8 *)gentry; + } + + switch (bytes) { + case 4: + gentry = *(const u32 *)new; + break; + case 8: + gentry = *(const u64 *)new; + break; + default: + gentry = 0; + break; } mmu_guess_page_from_pte_write(vcpu, gpa, gentry); spin_lock(vcpu-kvm-mmu_lock); + if (atomic_read(vcpu-kvm-arch.invlpg_counter) != invlpg_counter) + gentry = 0; kvm_mmu_access_page(vcpu, gfn); kvm_mmu_free_some_pages(vcpu); ++vcpu-kvm-stat.mmu_pte_write; diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 4b37e1a..067797a 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -463,6 +463,7 @@ out_unlock: static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva) { struct kvm_shadow_walk_iterator iterator; + gpa_t pte_gpa = -1; int level; u64 *sptep; int need_flush = 0; @@ -476,6 +477,10 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, 
gva_t gva) if (level == PT_PAGE_TABLE_LEVEL || ((level == PT_DIRECTORY_LEVEL is_large_pte(*sptep))) || ((level == PT_PDPE_LEVEL is_large_pte(*sptep { + struct kvm_mmu_page *sp = page_header(__pa(sptep)); + + pte_gpa = (sp-gfn PAGE_SHIFT); + pte_gpa += (sptep - sp-spt) * sizeof(pt_element_t); if (is_shadow_present_pte(*sptep)) { rmap_remove(vcpu-kvm, sptep); @@ -493,7 +498,17 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva) if (need_flush) kvm_flush_remote_tlbs(vcpu-kvm); + + atomic_inc(vcpu-kvm-arch.invlpg_counter); + spin_unlock(vcpu-kvm-mmu_lock); + + if (pte_gpa == -1) + return; + + if (mmu_topup_memory_caches(vcpu)) + return;
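The counter logic in the (operator-mangled) diff above boils down to an optimistic-concurrency pattern: sample `invlpg_counter` before doing work outside `mmu_lock`, have every emulated invlpg bump it, and discard the speculated gpte if the counter moved. A userspace sketch of that pattern, with illustrative names rather than the kernel's:

```c
#include <assert.h>

/* Sketch of the race detection the patch adds: kvm_mmu_pte_write() samples
 * the counter before working outside mmu_lock; FNAME(invlpg) bumps it;
 * after retaking the lock, the writer drops its pte if the counter moved. */
struct mmu_state {
    int invlpg_counter;                  /* atomic_t in the real code */
};

static void emulated_invlpg(struct mmu_state *s)
{
    s->invlpg_counter++;                 /* atomic_inc() in the patch */
}

/* Returns the gentry to install, or 0 when a concurrent invlpg ran. */
static unsigned long long commit_pte_write(const struct mmu_state *s,
                                           int counter_at_start,
                                           unsigned long long gentry)
{
    if (s->invlpg_counter != counter_at_start)
        return 0;                        /* lost the race: drop the pte */
    return gentry;
}

static int demo_no_race(void)
{
    struct mmu_state s = { 0 };
    int snap = s.invlpg_counter;
    return commit_pte_write(&s, snap, 0xabcULL) == 0xabcULL;
}

static int demo_race(void)
{
    struct mmu_state s = { 0 };
    int snap = s.invlpg_counter;
    emulated_invlpg(&s);                 /* invlpg slips in between */
    return commit_pte_write(&s, snap, 0xabcULL) == 0;
}
```

Installing `gentry = 0` on a detected race is safe because a zero gentry simply skips the speculative update; the next real fault re-walks the guest page table.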
[PATCH 2/5] KVM: Make locked operations truly atomic
Once upon a time, locked operations were emulated while holding the mmu mutex. Since mmu pages were write protected, it was safe to emulate the writes in a non-atomic manner, since there could be no other writer, either in the guest or in the kernel. These days emulation takes place without holding the mmu spinlock, so the write could be preempted by an unshadowing event, which exposes the page to writes by the guest. This may cause corruption of guest page tables. Fix by using an atomic cmpxchg for these operations. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/x86.c | 69 1 files changed, 48 insertions(+), 21 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 9d02cc7..d724a52 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3299,41 +3299,68 @@ int emulator_write_emulated(unsigned long addr, } EXPORT_SYMBOL_GPL(emulator_write_emulated); +#define CMPXCHG_TYPE(t, ptr, old, new) \ + (cmpxchg((t *)(ptr), *(t *)(old), *(t *)(new)) == *(t *)(old)) + +#ifdef CONFIG_X86_64 +# define CMPXCHG64(ptr, old, new) CMPXCHG_TYPE(u64, ptr, old, new) +#else +# define CMPXCHG64(ptr, old, new) \ + (cmpxchg64((u64 *)(ptr), *(u64 *)(old), *(u *)(new)) == *(u64 *)(old)) +#endif + static int emulator_cmpxchg_emulated(unsigned long addr, const void *old, const void *new, unsigned int bytes, struct kvm_vcpu *vcpu) { - printk_once(KERN_WARNING kvm: emulating exchange as write\n); -#ifndef CONFIG_X86_64 - /* guests cmpxchg8b have to be emulated atomically */ - if (bytes == 8) { - gpa_t gpa; - struct page *page; - char *kaddr; - u64 val; + gpa_t gpa; + struct page *page; + char *kaddr; + bool exchanged; - gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, NULL); + /* guests cmpxchg8b have to be emulated atomically */ + if (bytes 8 || (bytes (bytes - 1))) + goto emul_write; - if (gpa == UNMAPPED_GVA || - (gpa PAGE_MASK) == APIC_DEFAULT_PHYS_BASE) - goto emul_write; + gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, NULL); - if (((gpa + bytes - 1) PAGE_MASK) != 
(gpa PAGE_MASK)) - goto emul_write; + if (gpa == UNMAPPED_GVA || + (gpa PAGE_MASK) == APIC_DEFAULT_PHYS_BASE) + goto emul_write; - val = *(u64 *)new; + if (((gpa + bytes - 1) PAGE_MASK) != (gpa PAGE_MASK)) + goto emul_write; - page = gfn_to_page(vcpu-kvm, gpa PAGE_SHIFT); + page = gfn_to_page(vcpu-kvm, gpa PAGE_SHIFT); - kaddr = kmap_atomic(page, KM_USER0); - set_64bit((u64 *)(kaddr + offset_in_page(gpa)), val); - kunmap_atomic(kaddr, KM_USER0); - kvm_release_page_dirty(page); + kaddr = kmap_atomic(page, KM_USER0); + kaddr += offset_in_page(gpa); + switch (bytes) { + case 1: + exchanged = CMPXCHG_TYPE(u8, kaddr, old, new); + break; + case 2: + exchanged = CMPXCHG_TYPE(u16, kaddr, old, new); + break; + case 4: + exchanged = CMPXCHG_TYPE(u32, kaddr, old, new); + break; + case 8: + exchanged = CMPXCHG64(kaddr, old, new); + break; + default: + BUG(); } + kunmap_atomic(kaddr, KM_USER0); + kvm_release_page_dirty(page); + + if (!exchanged) + return X86EMUL_CMPXCHG_FAILED; + emul_write: -#endif + printk_once(KERN_WARNING kvm: emulating exchange as write\n); return emulator_write_emulated(addr, new, bytes, vcpu); } -- 1.7.0.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
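The archived diff above lost its operators; the guard reads `if (bytes > 8 || (bytes & (bytes - 1))) goto emul_write;` in the original, i.e. only power-of-two sizes up to 8 take the atomic path. A userspace sketch of the same size-dispatched compare-and-swap, using the GCC/Clang `__atomic` builtins in place of the kernel's `cmpxchg()`/`cmpxchg64()` on a kmapped guest page:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Sketch of emulator_cmpxchg_emulated()'s core: dispatch on operand size,
 * reject anything that is not a power of two <= 8 (the kernel falls back
 * to a plain emulated write in that case, with the dmesg warning). */
static bool cmpxchg_bytes(void *ptr, const void *old, const void *new_val,
                          unsigned bytes)
{
    if (bytes > 8 || (bytes & (bytes - 1)))
        return false;                         /* would go to emul_write */
    switch (bytes) {
    case 1: { uint8_t  o, n; memcpy(&o, old, 1); memcpy(&n, new_val, 1);
              return __atomic_compare_exchange_n((uint8_t *)ptr, &o, n,
                        false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); }
    case 2: { uint16_t o, n; memcpy(&o, old, 2); memcpy(&n, new_val, 2);
              return __atomic_compare_exchange_n((uint16_t *)ptr, &o, n,
                        false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); }
    case 4: { uint32_t o, n; memcpy(&o, old, 4); memcpy(&n, new_val, 4);
              return __atomic_compare_exchange_n((uint32_t *)ptr, &o, n,
                        false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); }
    case 8: { uint64_t o, n; memcpy(&o, old, 8); memcpy(&n, new_val, 8);
              return __atomic_compare_exchange_n((uint64_t *)ptr, &o, n,
                        false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); }
    }
    return false;                             /* bytes == 0 */
}
```

A failed compare here maps to the patch's `X86EMUL_CMPXCHG_FAILED`, which lets the emulator re-run the guest's locked instruction instead of silently writing.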
[PATCH 4/5] KVM: MMU: Do not instantiate nontrapping spte on unsync page
The update_pte() path currently uses a nontrapping spte when a nonpresent (or nonaccessed) gpte is written. This is fine since at present it is only used on sync pages. However, on an unsync page this will cause an endless fault loop as the guest is under no obligation to invlpg a gpte that transitions from nonpresent to present. Needed for the next patch which reinstates update_pte() on invlpg. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/paging_tmpl.h | 10 -- 1 files changed, 8 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 81eab9a..4b37e1a 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -258,11 +258,17 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *page, pt_element_t gpte; unsigned pte_access; pfn_t pfn; + u64 new_spte; gpte = *(const pt_element_t *)pte; if (~gpte & (PT_PRESENT_MASK | PT_ACCESSED_MASK)) { - if (!is_present_gpte(gpte)) - __set_spte(spte, shadow_notrap_nonpresent_pte); + if (!is_present_gpte(gpte)) { + if (page->unsync) + new_spte = shadow_trap_nonpresent_pte; + else + new_spte = shadow_notrap_nonpresent_pte; + __set_spte(spte, new_spte); + } return; } pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte); -- 1.7.0.2
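The decision the patch adds can be stated in a few lines. The encodings below are illustrative placeholders, not the real vendor-dependent spte values; what matters is the rule:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative encodings only. "trap" forces a vmexit so KVM revisits the
 * gpte; "notrap" reflects the fault straight to the guest with no exit. */
#define SHADOW_TRAP_NONPRESENT   0x0ULL
#define SHADOW_NOTRAP_NONPRESENT 0x7ULL

/* The patch's rule: an unsync page must keep the trapping encoding, because
 * the guest may make the gpte present without ever executing invlpg, and a
 * notrap spte would then loop forever re-delivering the stale fault. */
static uint64_t nonpresent_spte_for(bool page_unsync)
{
    return page_unsync ? SHADOW_TRAP_NONPRESENT
                       : SHADOW_NOTRAP_NONPRESENT;
}
```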
[PATCH 3/5] KVM: Don't follow an atomic operation by a non-atomic one
Currently emulated atomic operations are immediately followed by a non-atomic operation, so that kvm_mmu_pte_write() can be invoked. This updates the mmu but undoes the whole point of doing things atomically. Fix by only performing the atomic operation and the mmu update, and avoiding the non-atomic write. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/x86.c | 21 +++-- 1 files changed, 15 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index d724a52..2c0f632 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3227,7 +3227,8 @@ static int emulator_write_emulated_onepage(unsigned long addr, const void *val, unsigned int bytes, struct kvm_vcpu *vcpu, - bool guest_initiated) + bool guest_initiated, + bool mmu_only) { gpa_t gpa; u32 error_code; @@ -3247,6 +3248,10 @@ static int emulator_write_emulated_onepage(unsigned long addr, if ((gpa PAGE_MASK) == APIC_DEFAULT_PHYS_BASE) goto mmio; + if (mmu_only) { + kvm_mmu_pte_write(vcpu, gpa, val, bytes, 1); + return X86EMUL_CONTINUE; + } if (emulator_write_phys(vcpu, gpa, val, bytes)) return X86EMUL_CONTINUE; @@ -3271,7 +3276,8 @@ int __emulator_write_emulated(unsigned long addr, const void *val, unsigned int bytes, struct kvm_vcpu *vcpu, - bool guest_initiated) + bool guest_initiated, + bool mmu_only) { /* Crossing a page boundary? 
*/ if (((addr + bytes - 1) ^ addr) PAGE_MASK) { @@ -3279,7 +3285,7 @@ int __emulator_write_emulated(unsigned long addr, now = -addr ~PAGE_MASK; rc = emulator_write_emulated_onepage(addr, val, now, vcpu, -guest_initiated); +guest_initiated, mmu_only); if (rc != X86EMUL_CONTINUE) return rc; addr += now; @@ -3287,7 +3293,7 @@ int __emulator_write_emulated(unsigned long addr, bytes -= now; } return emulator_write_emulated_onepage(addr, val, bytes, vcpu, - guest_initiated); + guest_initiated, mmu_only); } int emulator_write_emulated(unsigned long addr, @@ -3295,7 +3301,7 @@ int emulator_write_emulated(unsigned long addr, unsigned int bytes, struct kvm_vcpu *vcpu) { - return __emulator_write_emulated(addr, val, bytes, vcpu, true); + return __emulator_write_emulated(addr, val, bytes, vcpu, true, false); } EXPORT_SYMBOL_GPL(emulator_write_emulated); @@ -3359,6 +3365,8 @@ static int emulator_cmpxchg_emulated(unsigned long addr, if (!exchanged) return X86EMUL_CMPXCHG_FAILED; + return __emulator_write_emulated(addr, new, bytes, vcpu, true, true); + emul_write: printk_once(KERN_WARNING kvm: emulating exchange as write\n); @@ -4013,7 +4021,8 @@ int kvm_fix_hypercall(struct kvm_vcpu *vcpu) kvm_x86_ops-patch_hypercall(vcpu, instruction); - return __emulator_write_emulated(rip, instruction, 3, vcpu, false); + return __emulator_write_emulated(rip, instruction, 3, vcpu, +false, false); } static u64 mk_cr_64(u64 curr_cr, u32 new_val) -- 1.7.0.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/5] KVM: MMU: Consolidate two guest pte reads in kvm_mmu_pte_write()
kvm_mmu_pte_write() reads guest ptes in two different occasions, both to allow a 32-bit pae guest to update a pte with 4-byte writes. Consolidate these into a single read, which also allows us to consolidate another read from an invlpg speculating a gpte into the shadow page table. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/mmu.c | 69 +++ 1 files changed, 31 insertions(+), 38 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index b137515..f63c9ad 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2556,36 +2556,11 @@ static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu) } static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, - const u8 *new, int bytes) + u64 gpte) { gfn_t gfn; - int r; - u64 gpte = 0; pfn_t pfn; - if (bytes != 4 bytes != 8) - return; - - /* -* Assume that the pte write on a page table of the same type -* as the current vcpu paging mode. This is nearly always true -* (might be false while changing modes). Note it is verified later -* by update_pte(). 
-*/ - if (is_pae(vcpu)) { - /* Handle a 32-bit guest writing two halves of a 64-bit gpte */ - if ((bytes == 4) (gpa % 4 == 0)) { - r = kvm_read_guest(vcpu-kvm, gpa ~(u64)7, gpte, 8); - if (r) - return; - memcpy((void *)gpte + (gpa % 8), new, 4); - } else if ((bytes == 8) (gpa % 8 == 0)) { - memcpy((void *)gpte, new, 8); - } - } else { - if ((bytes == 4) (gpa % 4 == 0)) - memcpy((void *)gpte, new, 4); - } if (!is_present_gpte(gpte)) return; gfn = (gpte PT64_BASE_ADDR_MASK) PAGE_SHIFT; @@ -2636,7 +2611,34 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, int r; pgprintk(%s: gpa %llx bytes %d\n, __func__, gpa, bytes); - mmu_guess_page_from_pte_write(vcpu, gpa, new, bytes); + + switch (bytes) { + case 4: + gentry = *(const u32 *)new; + break; + case 8: + gentry = *(const u64 *)new; + break; + default: + gentry = 0; + break; + } + + /* +* Assume that the pte write on a page table of the same type +* as the current vcpu paging mode. This is nearly always true +* (might be false while changing modes). Note it is verified later +* by update_pte(). 
+*/ + if (is_pae(vcpu) bytes == 4) { + /* Handle a 32-bit guest writing two halves of a 64-bit gpte */ + gpa = ~(gpa_t)7; + r = kvm_read_guest(vcpu-kvm, gpa, gentry, 8); + if (r) + gentry = 0; + } + + mmu_guess_page_from_pte_write(vcpu, gpa, gentry); spin_lock(vcpu-kvm-mmu_lock); kvm_mmu_access_page(vcpu, gfn); kvm_mmu_free_some_pages(vcpu); @@ -2701,20 +2703,11 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, continue; } spte = sp-spt[page_offset / sizeof(*spte)]; - if ((gpa (pte_size - 1)) || (bytes pte_size)) { - gentry = 0; - r = kvm_read_guest_atomic(vcpu-kvm, - gpa ~(u64)(pte_size - 1), - gentry, pte_size); - new = (const void *)gentry; - if (r 0) - new = NULL; - } while (npte--) { entry = *spte; mmu_pte_write_zap_pte(vcpu, sp, spte); - if (new) - mmu_pte_write_new_pte(vcpu, sp, spte, new); + if (gentry) + mmu_pte_write_new_pte(vcpu, sp, spte, gentry); mmu_pte_write_flush_tlb(vcpu, entry, *spte); ++spte; } -- 1.7.0.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
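The PAE special case above (with operators restored: `gpa &= ~(gpa_t)7` followed by an 8-byte `kvm_read_guest`) exists because a 32-bit PAE guest updates a 64-bit gpte with two 4-byte writes. Rather than stitching the halves together, the consolidated code aligns the gpa down and rereads the whole gpte. A small sketch, with a byte array standing in for guest physical memory:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the consolidation: on a 4-byte write into a 64-bit gpte, align
 * the gpa down to the 8-byte boundary and read the full gpte back from
 * guest memory (kvm_read_guest in the kernel). guest_ram is a stand-in. */
static uint64_t read_full_gpte(const uint8_t *guest_ram, uint64_t gpa)
{
    uint64_t gentry;
    gpa &= ~(uint64_t)7;                /* gpa &= ~(gpa_t)7 in the patch */
    memcpy(&gentry, guest_ram + gpa, 8);
    return gentry;
}
```

Either half-write (gpa offset 8 or 12 below) yields the same full gpte, which is exactly why the two separate read sites in kvm_mmu_pte_write() could be merged.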
[PATCH 0/5] Fix some mmu/emulator atomicity issues (v2)
Currently when we emulate a locked operation into a shadowed guest page table, we perform a write rather than a true atomic. This is indicated by the "emulating exchange as write" message that shows up in dmesg. In addition, the pte prefetch operation during invlpg suffered from a race. This was fixed by removing the operation. This patchset fixes both issues and reinstates pte prefetch on invlpg. v3: - rebase against next branch (resolves conflicts via hypercall patch) v2: - fix truncated description for patch 1 - add new patch 4, which fixes a bug in patch 5 Avi Kivity (5): KVM: MMU: Consolidate two guest pte reads in kvm_mmu_pte_write() KVM: Make locked operations truly atomic KVM: Don't follow an atomic operation by a non-atomic one KVM: MMU: Do not instantiate nontrapping spte on unsync page KVM: MMU: Reinstate pte prefetch on invlpg arch/x86/include/asm/kvm_host.h |1 + arch/x86/kvm/mmu.c | 78 + arch/x86/kvm/paging_tmpl.h | 25 ++- arch/x86/kvm/x86.c | 90 +++ 4 files changed, 127 insertions(+), 67 deletions(-)
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On 03/10/2010 11:30 PM, Luiz Capitulino wrote: Hi there, Our wiki page for the Summer of Code 2010 is doing quite well: http://wiki.qemu.org/Google_Summer_of_Code_2010 I will add another project - iommu emulation. Could be very useful for doing device assignment to nested guests, which could make testing a lot easier. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On Mon, Mar 15, 2010 at 02:25:41PM +0200, Avi Kivity wrote: On 03/10/2010 11:30 PM, Luiz Capitulino wrote: Hi there, Our wiki page for the Summer of Code 2010 is doing quite well: http://wiki.qemu.org/Google_Summer_of_Code_2010 I will add another project - iommu emulation. Could be very useful for doing device assignment to nested guests, which could make testing a lot easier. Good idea. If there is interest I could help to mentor this project. Joerg
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On 03/15/2010 02:38 PM, Joerg Roedel wrote: On Mon, Mar 15, 2010 at 02:25:41PM +0200, Avi Kivity wrote: On 03/10/2010 11:30 PM, Luiz Capitulino wrote: Hi there, Our wiki page for the Summer of Code 2010 is doing quite well: http://wiki.qemu.org/Google_Summer_of_Code_2010 I will add another project - iommu emulation. Could be very useful for doing device assignment to nested guests, which could make testing a lot easier. Good idea. If there is interest I could help to mentor this project. Thanks. I volunteered Anthony, but he may be a little overcommitted. -- error compiling committee.c: too many arguments to function
Re: [long] MINIX 3.1.6 works in QEMU-0.12.3 only with KVM disabled
On 03/15/2010 12:54 PM, Antoine Leca wrote: When doing switch, the cached segment selectors are preserved, which allows one to use protected mode segments in real-address mode (this is called unreal mode). Now this is a by-product of the implementation inside the BIOS. In fact, even if the BIOS enters unreal mode (or the similar big real, more useful with segmentation-less architectures), before turning back to the client it (should) reset things to normal real mode, as service 15/87 is not an usual way to enter unreal mode (for example, this effect is not even mentionned in Ralf Brown's list). The entry into unreal mode is unintentional; the bios is transitioning to protected mode and 'unreal mode' only exists for a few instructions, IIRC. As a result (and also and foremost because of 80286 compatibility), instead of directly using unreal or big real mode if possible (as done eg. in himem.sys), Minix monitor goes to the great pain to going back to square #1, and since blocks are at most 64 KB in size and several iterations are needed, on the next block Minix sets up the (very similar) GDT then does another call to the same BIOS service 15/87. I knew these parts before, but this is where Avi's answer came in: KVM on Intel does not yet support unreal mode and requires the cached segment descriptors to be valid in real-address mode. I do not know which virtual BIOS is using KVM, but I notice while reading http://bochs.sourceforge.net/cgi-bin/lxr/source/bios/rombios.c: [ Slightly edited to fit the width of my post. AL. ] 3555 case 0x87: 3556 #if BX_CPU 3 3557 # error Int15 function 87h not supported on 80386 3558 #endif 3559 // +++ should probably have descriptor checks 3560 // +++ should have exception handlers ... 3640 mov eax, cr0 3641 or al, #0x01 3642 mov cr0, eax 3643 ;; far jump to flush CPU queue after transition to prot. 
mode 3644 JMP_AP(0x0020, protected_mode) 3645 3646 protected_mode: 3647 ;; GDT points to valid descriptor table, now load SS, DS, ES 3648 mov ax, #0x28 ;; 101 000 = 5th desc.in table, TI=GDT,RPL=00 3649 mov ss, ax 3650 mov ax, #0x10 ;; 010 000 = 2nd desc.in table, TI=GDT,RPL=00 3651 mov ds, ax 3652 mov ax, #0x18 ;; 011 000 = 3rd desc.in table, TI=GDT,RPL=00 3653 mov es, ax 3654 xor si, si 3655 xor di, di 3656 cld 3657 rep 3658 movsw ;; move CX words from DS:SI to ES:DI 3659 3660 ;; make sure DS and ES limits are 64KB 3661 mov ax, #0x28 3662 mov ds, ax 3663 mov es, ax 3664 3665 ;; reset PG bit in CR0 ??? 3666 mov eax, cr0 3667 and al, #0xFE 3668 mov cr0, eax I should be loosing something here... There is no unreal mode at any moment, is it? [ ... some web browsing occuring meanwhile ... Later: ] Okay, now I got another picture. 8-| Until recently, KVM (and qemu) used Bochs BIOS, showed above; but they switched recently to SeaBIOS... where the applicable code is in src/system.c, and looks like (now this is ATT assembly): 83 static void 84 handle_1587(struct bregs *regs) 85 { 86 // +++ should probably have descriptor checks 87 // +++ should have exception handlers 127 // Enable protected mode 128 movl %%cr0, %%eax\n 129 orl $ __stringify(CR0_PE) , %%eax\n 130 movl %%eax, %%cr0\n 131 132 // far jump to flush CPU queue after transition to prot. mode 133 ljmpw $(43), $1f\n 134 135 // GDT points to valid descriptor table, now load DS, ES 136 1:movw $(23), %%ax\n // 2nd descriptor in table, TI=GDT, RPL=00 137 movw %%ax, %%ds\n 138 movw $(33), %%ax\n // 3rd descriptor in table, TI=GDT, RPL=00 139 movw %%ax, %%es\n 140 141 // move CX words from DS:SI to ES:DI 142 xorw %%si, %%si\n 143 xorw %%di, %%di\n 144 rep movsw\n 145 146 // Disable protected mode 147 movl %%cr0, %%eax\n 148 andl $~ __stringify(CR0_PE) , %%eax\n 149 movl %%eax, %%cr0\n Note that while the basic scheme is the same, the cleaning up of lines 3660-3663 make sure DS and ES limits are 64KB is not present. 
IIUC, the virtualized CPU goes back to real mode with those segments set as they are in protected mode, and yes, with the Minix boot monitor they happened to NOT be paragraph-aligned. Is it possible to add back this cleaning up to the BIOS used in KVM? I think so. This is a longstanding kvm bug, but I can't see any downsides to a workaround in the BIOS. -- error compiling committee.c: too many arguments to function
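The "make sure DS and ES limits are 64KB" step the Bochs BIOS performs (and SeaBIOS omitted) reloads DS/ES from a descriptor whose limit is exactly 0xffff with byte granularity, so the cached descriptors look like real-mode ones on exit. A sketch of building such an 8-byte data descriptor, following the standard x86 GDT byte layout:

```c
#include <assert.h>
#include <stdint.h>

/* Pack a flat data-segment GDT descriptor: limit 15:0 in bits 0-15,
 * base 23:0 in bits 16-39, access byte 0x93 (present, DPL 0, data,
 * read/write, accessed) in bits 40-47, limit 19:16 plus flags in bits
 * 48-55 (flags left 0 => byte granularity, 16-bit), base 31:24 in bits
 * 56-63. With limit = 0xffff this is the 64 KiB descriptor the BIOS
 * should leave in DS/ES before returning to real mode. */
static uint64_t make_data_descriptor(uint32_t base, uint32_t limit)
{
    uint64_t d = 0;
    d |= limit & 0xffffULL;                      /* limit 15:0 */
    d |= (uint64_t)(base & 0xffffff) << 16;      /* base 23:0 */
    d |= 0x93ULL << 40;                          /* access byte */
    d |= (uint64_t)((limit >> 16) & 0xf) << 48;  /* limit 19:16 */
    d |= (uint64_t)(base >> 24) << 56;           /* base 31:24 */
    return d;
}
```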
Re: [PATCH 15/18] KVM: MMU: Propagate the right fault back to the guest after gva_to_gpa
On Mon, Mar 15, 2010 at 04:30:47AM +0000, Daniel K. wrote: Joerg Roedel wrote: diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 2883ce8..9f8b02d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -314,6 +314,19 @@ void kvm_inject_page_fault(struct kvm_vcpu *vcpu, unsigned long addr, kvm_queue_exception_e(vcpu, PF_VECTOR, error_code); } +void kvm_propagate_fault(struct kvm_vcpu *vcpu, unsigned long addr, u32 error_code) +{ + u32 nested, error; + + nested = error_code & PFERR_NESTED_MASK; + error = error_code & ~PFERR_NESTED_MASK; + + if (vcpu->arch.mmu.nested & !(error_code & PFERR_NESTED_MASK)) This looks incorrect, nested is unused. At the very least it should be a boolean operation if (vcpu->arch.mmu.nested && !(error_code & PFERR_NESTED_MASK)) which can be simplified to if (vcpu->arch.mmu.nested && !nested) but it seems wrong that the condition is that it is nested and not nested at the same time. Yes, this is already fixed in my local patch-stack. I found it during further testing (while fixing another bug). But thanks for your feedback :-) Joerg
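The masking being reviewed is easy to demonstrate in isolation. The bit position below is illustrative (the real PFERR_NESTED_MASK is defined by the patchset, not reproduced here); the point is splitting the software "nested" flag out of the hardware error code and testing it, per Daniel's simplified form:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PFERR_NESTED_MASK (1u << 31)   /* illustrative bit position */

/* Sketch of the reviewed condition after simplification: split the nested
 * flag out of the error code, then propagate to L1 only when the mmu is in
 * nested mode and the fault did not come from the nested walk itself. */
static bool propagate_to_l1(bool mmu_nested, uint32_t error_code)
{
    uint32_t nested = error_code & PFERR_NESTED_MASK;
    /* uint32_t error = error_code & ~PFERR_NESTED_MASK; -- what would be
     * injected into the guest; unused in this sketch. */
    return mmu_nested && !nested;
}
```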
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On Mon, Mar 15, 2010 at 02:25:41PM +0200, Avi Kivity wrote: On 03/10/2010 11:30 PM, Luiz Capitulino wrote: Hi there, Our wiki page for the Summer of Code 2010 is doing quite well: http://wiki.qemu.org/Google_Summer_of_Code_2010 I will add another project - iommu emulation. Could be very useful for doing device assignment to nested guests, which could make testing a lot easier. Our experiments show that nested device assignment is pretty much required for I/O performance in nested scenarios. Cheers, Muli
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On Mon, Mar 15, 2010 at 05:53:13AM -0700, Muli Ben-Yehuda wrote: On Mon, Mar 15, 2010 at 02:25:41PM +0200, Avi Kivity wrote: On 03/10/2010 11:30 PM, Luiz Capitulino wrote: Hi there, Our wiki page for the Summer of Code 2010 is doing quite well: http://wiki.qemu.org/Google_Summer_of_Code_2010 I will add another project - iommu emulation. Could be very useful for doing device assignment to nested guests, which could make testing a lot easier. Our experiments show that nested device assignment is pretty much required for I/O performance in nested scenarios. Really? I did a small test with virtio-blk in a nested guest (disk read with dd, so not a real benchmark) and got a reasonable read-performance of around 25MB/s from the disk in the l2-guest. Joerg
Re: [PATCH v2 05/30] KVM: Provide callback to get/set control registers in emulator ops.
Gleb Natapov wrote: Use this callback instead of directly calling the kvm function. Also rename realmode_(set|get)_cr to emulator_(set|get)_cr since the function has nothing to do with real mode. Do you mind removing the static before emulator_{set,get}_cr and marking it EXPORT_SYMBOL? Then one could use it in vmx.c (and soon in svm.c ;-) while handling MOV-CR intercepts. Currently most of the code is actually duplicated. Also, shouldn't mk_cr_64() be called mask_cr_64() for better readability? Regards, Andre. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_emulate.h | 3 +- arch/x86/include/asm/kvm_host.h | 2 - arch/x86/kvm/emulate.c | 7 +- arch/x86/kvm/x86.c | 114 ++-- 4 files changed, 63 insertions(+), 63 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index 2666d7a..0c5caa4 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -108,7 +108,8 @@ struct x86_emulate_ops { const void *new, unsigned int bytes, struct kvm_vcpu *vcpu); - + ulong (*get_cr)(int cr, struct kvm_vcpu *vcpu); + void (*set_cr)(int cr, ulong val, struct kvm_vcpu *vcpu); }; /* Type, address-of, and value of an instruction's operand.
*/ diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 3b178d8..e8e108a 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -585,8 +585,6 @@ void realmode_lidt(struct kvm_vcpu *vcpu, u16 size, unsigned long address); void realmode_lmsw(struct kvm_vcpu *vcpu, unsigned long msw, unsigned long *rflags); -unsigned long realmode_get_cr(struct kvm_vcpu *vcpu, int cr); -void realmode_set_cr(struct kvm_vcpu *vcpu, int cr, unsigned long value); void kvm_enable_efer_bits(u64); int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data); int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 91450b5..5b060e4 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2483,7 +2483,7 @@ twobyte_insn: break; case 4: /* smsw */ c->dst.bytes = 2; - c->dst.val = realmode_get_cr(ctxt->vcpu, 0); + c->dst.val = ops->get_cr(0, ctxt->vcpu); break; case 6: /* lmsw */ realmode_lmsw(ctxt->vcpu, (u16)c->src.val, @@ -2519,8 +2519,7 @@ twobyte_insn: case 0x20: /* mov cr, reg */ if (c->modrm_mod != 3) goto cannot_emulate; - c->regs[c->modrm_rm] = - realmode_get_cr(ctxt->vcpu, c->modrm_reg); + c->regs[c->modrm_rm] = ops->get_cr(c->modrm_reg, ctxt->vcpu); c->dst.type = OP_NONE; /* no writeback */ break; case 0x21: /* mov from dr to reg */ @@ -2534,7 +2533,7 @@ twobyte_insn: case 0x22: /* mov reg, cr */ if (c->modrm_mod != 3) goto cannot_emulate; - realmode_set_cr(ctxt->vcpu, c->modrm_reg, c->modrm_val); + ops->set_cr(c->modrm_reg, c->modrm_val, ctxt->vcpu); c->dst.type = OP_NONE; break; case 0x23: /* mov from reg to dr */ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index a1e671a..bf714df 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3370,12 +3370,70 @@ void kvm_report_emulation_failure(struct kvm_vcpu *vcpu, const char *context) } EXPORT_SYMBOL_GPL(kvm_report_emulation_failure); +static u64 mk_cr_64(u64 curr_cr, u32 new_val) +{ + return
(curr_cr & ~((1ULL << 32) - 1)) | new_val; +} + +static unsigned long emulator_get_cr(int cr, struct kvm_vcpu *vcpu) +{ + unsigned long value; + + switch (cr) { + case 0: + value = kvm_read_cr0(vcpu); + break; + case 2: + value = vcpu->arch.cr2; + break; + case 3: + value = vcpu->arch.cr3; + break; + case 4: + value = kvm_read_cr4(vcpu); + break; + case 8: + value = kvm_get_cr8(vcpu); + break; + default: + vcpu_printf(vcpu, "%s: unexpected cr %u\n", __func__, cr); + return 0; + } + + return value; +} + +static void emulator_set_cr(int cr, unsigned long val, struct kvm_vcpu *vcpu) +{ + switch (cr) { + case 0: + kvm_set_cr0(vcpu, mk_cr_64(kvm_read_cr0(vcpu), val)); + break; + case 2: + vcpu->arch.cr2 = val; + break; + case 3: + kvm_set_cr3(vcpu, val); + break;
[PATCH rework] KVM: coalesced_mmio: fix kvm_coalesced_mmio_init()'s error handling
kvm_coalesced_mmio_init() keeps holding the addresses of the coalesced mmio ring page and dev even after it has freed them. Also, if this function fails, though that might be rare, it suggests the system is in a serious state: so we'd better stop the work that follows kvm_create_vm(). This patch fixes these problems. We move the coalesced mmio's initialization out of kvm_create_vm(). This seems natural because it includes a registration which can be done only when the vm is successfully created. Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp --- virt/kvm/coalesced_mmio.c | 2 ++ virt/kvm/kvm_main.c | 12 2 files changed, 10 insertions(+), 4 deletions(-) diff --git a/virt/kvm/coalesced_mmio.c b/virt/kvm/coalesced_mmio.c index 5169736..11776b7 100644 --- a/virt/kvm/coalesced_mmio.c +++ b/virt/kvm/coalesced_mmio.c @@ -119,8 +119,10 @@ int kvm_coalesced_mmio_init(struct kvm *kvm) return ret; out_free_dev: + kvm->coalesced_mmio_dev = NULL; kfree(dev); out_free_page: + kvm->coalesced_mmio_ring = NULL; __free_page(page); out_err: return ret; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index bcd08b8..c7053aa 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -418,9 +418,6 @@ static struct kvm *kvm_create_vm(void) spin_lock(&kvm_lock); list_add(&kvm->vm_list, &vm_list); spin_unlock(&kvm_lock); -#ifdef KVM_COALESCED_MMIO_PAGE_OFFSET - kvm_coalesced_mmio_init(kvm); -#endif out: return kvm; @@ -1748,12 +1745,19 @@ static struct file_operations kvm_vm_fops = { static int kvm_dev_ioctl_create_vm(void) { - int fd; + int fd, r; struct kvm *kvm; kvm = kvm_create_vm(); if (IS_ERR(kvm)) return PTR_ERR(kvm); +#ifdef KVM_COALESCED_MMIO_PAGE_OFFSET + r = kvm_coalesced_mmio_init(kvm); + if (r < 0) { + kvm_put_kvm(kvm); + return r; + } +#endif fd = anon_inode_getfd("kvm-vm", &kvm_vm_fops, kvm, O_RDWR); if (fd < 0) kvm_put_kvm(kvm); -- 1.6.3.3
Re: [PATCH v2 05/30] KVM: Provide callback to get/set control registers in emulator ops.
On Mon, Mar 15, 2010 at 02:06:48PM +0100, Andre Przywara wrote: Gleb Natapov wrote: Use this callback instead of directly calling the kvm function. Also rename realmode_(set|get)_cr to emulator_(set|get)_cr since the function has nothing to do with real mode. Do you mind removing the static before emulator_{set,get}_cr and I don't, but this is not the goal of this patch series. marking it EXPORT_SYMBOL? Then one could use it in vmx.c (and soon in svm.c ;-) while handling MOV-CR intercepts. Currently most of the code is actually duplicated. Also, shouldn't mk_cr_64() be called mask_cr_64() for better readability? This is how it is called now, the patch only moves it. But this code will be reworked by later patches anyway since functions called from the emulator should not inject exceptions behind the emulator's back. Regards, Andre. [...]
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On 03/15/2010 03:03 PM, Joerg Roedel wrote: I will add another project - iommu emulation. Could be very useful for doing device assignment to nested guests, which could make testing a lot easier. Our experiments show that nested device assignment is pretty much required for I/O performance in nested scenarios. Really? I did a small test with virtio-blk in a nested guest (disk read with dd, so not a real benchmark) and got a reasonable read-performance of around 25MB/s from the disk in the l2-guest. Your guest wasn't doing a zillion VMREADs and VMWRITEs every exit. I plan to reduce VMREAD/VMWRITE overhead for kvm, but not much we can do for other guests. -- error compiling committee.c: too many arguments to function
Re: [PATCH v2 05/30] KVM: Provide callback to get/set control registers in emulator ops.
On 03/15/2010 03:06 PM, Andre Przywara wrote: Gleb Natapov wrote: Use this callback instead of directly calling the kvm function. Also rename realmode_(set|get)_cr to emulator_(set|get)_cr since the function has nothing to do with real mode. Do you mind removing the static before emulator_{set,get}_cr and marking it EXPORT_SYMBOL? Then one could use it in vmx.c (and soon in svm.c ;-) while handling MOV-CR intercepts. Currently most of the code is actually duplicated. Just do that in your patch, that's standard practice. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On 03/15/2010 07:42 AM, Avi Kivity wrote: On 03/15/2010 02:38 PM, Joerg Roedel wrote: On Mon, Mar 15, 2010 at 02:25:41PM +0200, Avi Kivity wrote: On 03/10/2010 11:30 PM, Luiz Capitulino wrote: Hi there, Our wiki page for the Summer of Code 2010 is doing quite well: http://wiki.qemu.org/Google_Summer_of_Code_2010 I will add another project - iommu emulation. Could be very useful for doing device assignment to nested guests, which could make testing a lot easier. Good idea. If there is interest I could help to mentor this project. Thanks. I volunteered Anthony, but he may be a little overcommitted. Joerg, feel free to put your name against it too. Regards, Anthony Liguori
Re: [PATCH v2 07/30] KVM: Provide x86_emulate_ctxt callback to get current cpl
Gleb, what is the purpose of this patch? Is this a preparation for something upcoming? I don't see a reason to change this, in my eyes it is not a simplification. Regards, Andre. Gleb Natapov wrote: Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_emulate.h | 1 + arch/x86/kvm/emulate.c | 15 --- arch/x86/kvm/x86.c | 6 ++ 3 files changed, 15 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index 0c5caa4..b048fd2 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -110,6 +110,7 @@ struct x86_emulate_ops { struct kvm_vcpu *vcpu); ulong (*get_cr)(int cr, struct kvm_vcpu *vcpu); void (*set_cr)(int cr, ulong val, struct kvm_vcpu *vcpu); + int (*cpl)(struct kvm_vcpu *vcpu); }; /* Type, address-of, and value of an instruction's operand. */ diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 5e2fa61..8bd0557 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1257,7 +1257,7 @@ static int emulate_popf(struct x86_emulate_ctxt *ctxt, int rc; unsigned long val, change_mask; int iopl = (ctxt->eflags & X86_EFLAGS_IOPL) >> IOPL_SHIFT; - int cpl = kvm_x86_ops->get_cpl(ctxt->vcpu); + int cpl = ops->cpl(ctxt->vcpu); rc = emulate_pop(ctxt, ops, &val, len); if (rc != X86EMUL_CONTINUE) @@ -1758,7 +1758,8 @@ emulate_sysexit(struct x86_emulate_ctxt *ctxt) return X86EMUL_CONTINUE; } -static bool emulator_bad_iopl(struct x86_emulate_ctxt *ctxt) +static bool emulator_bad_iopl(struct x86_emulate_ctxt *ctxt, + struct x86_emulate_ops *ops) { int iopl; if (ctxt->mode == X86EMUL_MODE_REAL) @@ -1766,7 +1767,7 @@ static bool emulator_bad_iopl(struct x86_emulate_ctxt *ctxt) if (ctxt->mode == X86EMUL_MODE_VM86) return true; iopl = (ctxt->eflags & X86_EFLAGS_IOPL) >> IOPL_SHIFT; - return kvm_x86_ops->get_cpl(ctxt->vcpu) > iopl; + return ops->cpl(ctxt->vcpu) > iopl; } static bool emulator_io_port_access_allowed(struct x86_emulate_ctxt *ctxt, @@ -1803,7 +1804,7 @@ static
bool emulator_io_permited(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops, u16 port, u16 len) { - if (emulator_bad_iopl(ctxt)) + if (emulator_bad_iopl(ctxt, ops)) if (!emulator_io_port_access_allowed(ctxt, ops, port, len)) return false; return true; @@ -1842,7 +1843,7 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) } /* Privileged instruction can be executed only in CPL=0 */ - if ((c->d & Priv) && kvm_x86_ops->get_cpl(ctxt->vcpu)) { + if ((c->d & Priv) && ops->cpl(ctxt->vcpu)) { kvm_inject_gp(ctxt->vcpu, 0); goto done; } @@ -2378,7 +2379,7 @@ special_insn: c->dst.type = OP_NONE; /* Disable writeback. */ break; case 0xfa: /* cli */ - if (emulator_bad_iopl(ctxt)) + if (emulator_bad_iopl(ctxt, ops)) kvm_inject_gp(ctxt->vcpu, 0); else { ctxt->eflags &= ~X86_EFLAGS_IF; @@ -2386,7 +2387,7 @@ special_insn: } break; case 0xfb: /* sti */ - if (emulator_bad_iopl(ctxt)) + if (emulator_bad_iopl(ctxt, ops)) kvm_inject_gp(ctxt->vcpu, 0); else { toggle_interruptibility(ctxt, KVM_X86_SHADOW_INT_STI); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b08f8a1..3f2a8d3 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3426,6 +3426,11 @@ static void emulator_set_cr(int cr, unsigned long val, struct kvm_vcpu *vcpu) } } +static int emulator_get_cpl(struct kvm_vcpu *vcpu) +{ + return kvm_x86_ops->get_cpl(vcpu); +} + static struct x86_emulate_ops emulate_ops = { .read_std = kvm_read_guest_virt_system, .fetch = kvm_fetch_guest_virt, @@ -3434,6 +3439,7 @@ static struct x86_emulate_ops emulate_ops = { .cmpxchg_emulated = emulator_cmpxchg_emulated, .get_cr = emulator_get_cr, .set_cr = emulator_set_cr, + .cpl = emulator_get_cpl, }; static void cache_all_regs(struct kvm_vcpu *vcpu) -- Andre Przywara AMD-OSRC (Dresden) Tel: x29712
Re: [PATCH v2 07/30] KVM: Provide x86_emulate_ctxt callback to get current cpl
On Mon, Mar 15, 2010 at 02:16:01PM +0100, Andre Przywara wrote: Gleb, what is the purpose of this patch? Is this a preparation for something upcoming? I don't see a reason to change this, in my eyes it is not a simplification. To make the emulator independent of KVM. All direct calls from the emulator to KVM will be changed to callbacks. Regards, Andre. [...] -- Gleb.
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On 03/15/2010 08:11 AM, Avi Kivity wrote: On 03/15/2010 03:03 PM, Joerg Roedel wrote: I will add another project - iommu emulation. Could be very useful for doing device assignment to nested guests, which could make testing a lot easier. Our experiments show that nested device assignment is pretty much required for I/O performance in nested scenarios. Really? I did a small test with virtio-blk in a nested guest (disk read with dd, so not a real benchmark) and got a reasonable read-performance of around 25MB/s from the disk in the l2-guest. Your guest wasn't doing a zillion VMREADs and VMWRITEs every exit. I plan to reduce VMREAD/VMWRITE overhead for kvm, but not much we can do for other guests. VMREAD/VMWRITEs are generally optimized by hypervisors as they tend to be costly. KVM is a bit unusual in terms of how many times the instructions are executed per exit. Regards, Anthony Liguori
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On Mon, Mar 15, 2010 at 03:11:42PM +0200, Avi Kivity wrote: On 03/15/2010 03:03 PM, Joerg Roedel wrote: I will add another project - iommu emulation. Could be very useful for doing device assignment to nested guests, which could make testing a lot easier. Our experiments show that nested device assignment is pretty much required for I/O performance in nested scenarios. Really? I did a small test with virtio-blk in a nested guest (disk read with dd, so not a real benchmark) and got a reasonable read-performance of around 25MB/s from the disk in the l2-guest. Your guest wasn't doing a zillion VMREADs and VMWRITEs every exit. I plan to reduce VMREAD/VMWRITE overhead for kvm, but not much we can do for other guests. Does it matter for the ept-on-ept case? The initial patchset of nested-vmx implemented it and they reported a performance drop of around 12% between levels which is reasonable. So I expected the loss of io-performance for l2 also reasonable in this case. My small measurement was also done using npt-on-npt. Joerg
Fwd: Corrupted filesystem, possible after livemigration with iSCSI storagebackend.
In our KVM system we have two iSCSI backends (master/slave configuration) with failover and two KVM hosts supporting live migration. The iSCSI volumes are shared by the host as a block device in KVM, and the volumes are available on both frontends. After a reboot one of the KVM guests was not able to start again due to file system corruption. We use XFS and have problems understanding what caused the corruption. We have ruled out the iSCSI backend as both the master and slave data were consistent at the time. Anyone else had similar problems? What is the recommended way to share an iSCSI drive among the two host machines? Should XFS be ok as a file system for live migration? I'm not able to find any documentation stating that a clustered file system (GFS2 etc.) is recommended. Are there any concurrent writes on the two host machines during a live migration?
<disk type='block' device='disk'>
  <driver name='qemu'/>
  <source dev='/dev/disk/by-path/ip-ip:3260-iscsi-test2-lun-0'/>
  <target dev='sda' bus='scsi'/>
  <address type='drive' controller='0' bus='0' unit='0'/>
</disk>
#virsh version Compiled against library: libvir 0.7.6 Using library: libvir 0.7.6 Using API: QEMU 0.7.6 Running hypervisor: QEMU 0.11.0 #uname -a Linux vm01 2.6.32-bpo.2-amd64 #1 SMP Fri Feb 12 16:50:27 UTC 2010 x86_64 GNU/Linux Regards Espen
Re: how to tweak kernel to get the best out of kvm?
On 03/13/10 09:54, Avi Kivity wrote: If the slowdown is indeed due to I/O, LVM (with cache=off) should eliminate it completely. As promised I have installed LVM: The difference is remarkable. My test case (running 8 vhosts in parallel, each building a Linux kernel) just works. There is no blocking job (by now), all vhosts can be pinged, great. Many thanx for your help, and for the nice software, of course. Regards Harri
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On 03/15/2010 08:24 AM, Joerg Roedel wrote: On Mon, Mar 15, 2010 at 03:11:42PM +0200, Avi Kivity wrote: On 03/15/2010 03:03 PM, Joerg Roedel wrote: I will add another project - iommu emulation. Could be very useful for doing device assignment to nested guests, which could make testing a lot easier. Our experiments show that nested device assignment is pretty much required for I/O performance in nested scenarios. Really? I did a small test with virtio-blk in a nested guest (disk read with dd, so not a real benchmark) and got a reasonable read-performance of around 25MB/s from the disk in the l2-guest. Your guest wasn't doing a zillion VMREADs and VMWRITEs every exit. I plan to reduce VMREAD/VMWRITE overhead for kvm, but not much we can do for other guests. Does it matter for the ept-on-ept case? The initial patchset of nested-vmx implemented it and they reported a performance drop of around 12% between levels which is reasonable. So I expected the loss of io-performance for l2 also reasonable in this case. My small measurement was also done using npt-on-npt. But that was something like kernbench IIRC which is actually exit light once ept is enabled. Network IO is typically exit heavy and becomes something more of a pathological work load (both for nested ept and nested npt). Regards, Anthony Liguori
Re: Fwd: Corrupted filesystem, possible after livemigration with iSCSI storagebackend.
On 03/15/2010 08:46 AM, Espen Berg wrote: In our KVM system we have two iSCSI backends (master/slave configuration) with failover and two KVM hosts supporting live migration. The iSCSI volumes are shared by the host as a block device in KVM, and the volumes are available on both frontends. After a reboot one of the KVM guests was not able to start again due to file system corruption. We use XFS and have problems understanding what caused the corruption. We have ruled out the iSCSI backend as both the master and slave data were consistent at the time. Anyone else had similar problems? What is the recommended way to share an iSCSI drive among the two host machines? Should XFS be ok as a file system for live migration? I'm not able to find any documentation stating that a clustered file system (GFS2 etc.) is recommended. Are there any concurrent writes on the two host machines during a live migration?
<disk type='block' device='disk'> <driver name='qemu'/> <source dev='/dev/disk/by-path/ip-ip:3260-iscsi-test2-lun-0'/> <target dev='sda' bus='scsi'/> <address type='drive' controller='0' bus='0' unit='0'/> </disk>
You need to use cache=off if you've got one iscsi drive mounted on two separate physical machines. The additional layer of caching will result in inconsistency because iSCSI doesn't have a mechanism to provide cache coherence between two nodes. Regards, Anthony Liguori
Re: Fwd: Corrupted filesystem, possible after livemigration with iSCSI storagebackend.
On Mon, Mar 15, 2010 at 08:59:10AM -0500, Anthony Liguori wrote: On 03/15/2010 08:46 AM, Espen Berg wrote: In our KVM system we have two iSCSI backends (master/slave configuration) with failover and two KVM hosts supporting live migration. The iSCSI volumes are shared by the host as a block device in KVM, and the volumes are available on both frontends. After a reboot one of the KVM guests was not able to start again due to file system corruption. We use XFS and have problems understanding what caused the corruption. We have ruled out the iSCSI backend as both the master and slave data were consistent at the time. Anyone else had similar problems? What is the recommended way to share an iSCSI drive among the two host machines? Should XFS be ok as a file system for live migration? I'm not able to find any documentation stating that a clustered file system (GFS2 etc.) is recommended. Are there any concurrent writes on the two host machines during a live migration?
<disk type='block' device='disk'> <driver name='qemu'/> <source dev='/dev/disk/by-path/ip-ip:3260-iscsi-test2-lun-0'/> <target dev='sda' bus='scsi'/> <address type='drive' controller='0' bus='0' unit='0'/> </disk>
You need to use cache=off if you've got one iscsi drive mounted on two separate physical machines. FYI, this can be done by changing the disk XML <driver name='qemu'/> to be <driver name='qemu' cache='none'/> Regards, Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
[PATCH v3 04/30] KVM: Remove pointer to rflags from realmode_set_cr parameters.
Mov reg, cr instruction doesn't change flags in any meaningful way, so no need to update rflags after instruction execution. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_host.h | 3 +-- arch/x86/kvm/emulate.c | 3 +-- arch/x86/kvm/x86.c | 4 +--- 3 files changed, 3 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index ea1b6c6..8567107 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -586,8 +586,7 @@ void realmode_lmsw(struct kvm_vcpu *vcpu, unsigned long msw, unsigned long *rflags); unsigned long realmode_get_cr(struct kvm_vcpu *vcpu, int cr); -void realmode_set_cr(struct kvm_vcpu *vcpu, int cr, unsigned long value, - unsigned long *rflags); +void realmode_set_cr(struct kvm_vcpu *vcpu, int cr, unsigned long value); void kvm_enable_efer_bits(u64); int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data); int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 670ca8f..91450b5 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2534,8 +2534,7 @@ twobyte_insn: case 0x22: /* mov reg, cr */ if (c->modrm_mod != 3) goto cannot_emulate; - realmode_set_cr(ctxt->vcpu, - c->modrm_reg, c->modrm_val, &ctxt->eflags); + realmode_set_cr(ctxt->vcpu, c->modrm_reg, c->modrm_val); c->dst.type = OP_NONE; break; case 0x23: /* mov from reg to dr */ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 9d02cc7..56cdaa5 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4043,13 +4043,11 @@ unsigned long realmode_get_cr(struct kvm_vcpu *vcpu, int cr) return value; } -void realmode_set_cr(struct kvm_vcpu *vcpu, int cr, unsigned long val, - unsigned long *rflags) +void realmode_set_cr(struct kvm_vcpu *vcpu, int cr, unsigned long val) { switch (cr) { case 0: kvm_set_cr0(vcpu, mk_cr_64(kvm_read_cr0(vcpu), val)); - *rflags = kvm_get_rflags(vcpu); break; case 2: vcpu->arch.cr2
= val; -- 1.6.5
[PATCH v3 03/30] KVM: x86 emulator: check return value against correct define
Check return value against the correct define instead of open-coding the value. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 4dce805..670ca8f 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -566,7 +566,7 @@ static u32 group2_table[] = { #define insn_fetch(_type, _size, _eip) \ ({ unsigned long _x; \ rc = do_insn_fetch(ctxt, ops, (_eip), &_x, (_size)); \ - if (rc != 0) \ + if (rc != X86EMUL_CONTINUE) \ goto done; \ (_eip) += (_size); \ (_type)_x; \ -- 1.6.5
[PATCH v3 01/30] KVM: x86 emulator: Fix DstAcc decoding.
Set correct operation length. Add RAX (64bit) handling. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 7 +-- 1 files changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 2832a8c..0b70a36 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1194,9 +1194,9 @@ done_prefixes: break; case DstAcc: c->dst.type = OP_REG; - c->dst.bytes = c->op_bytes; + c->dst.bytes = (c->d & ByteOp) ? 1 : c->op_bytes; c->dst.ptr = &c->regs[VCPU_REGS_RAX]; - switch (c->op_bytes) { + switch (c->dst.bytes) { case 1: c->dst.val = *(u8 *)c->dst.ptr; break; @@ -1206,6 +1206,9 @@ done_prefixes: case 4: c->dst.val = *(u32 *)c->dst.ptr; break; + case 8: + c->dst.val = *(u64 *)c->dst.ptr; + break; } c->dst.orig_val = c->dst.val; break; -- 1.6.5
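The DstAcc fix above makes the destination size depend on the ByteOp decode flag instead of always using op_bytes, and adds the missing 64-bit accumulator case. A minimal standalone sketch of that sizing rule follows; the flag value and helper names here are illustrative, not the emulator's real ones:

```c
#include <assert.h>
#include <stdint.h>

#define BYTE_OP 0x01  /* stand-in for the emulator's ByteOp decode flag */

/* Byte ops always use 1 byte; otherwise use the instruction's operand
 * size, which may now be 8 for a REX.W-prefixed access to RAX. */
static unsigned dst_acc_bytes(unsigned d_flags, unsigned op_bytes)
{
    return (d_flags & BYTE_OP) ? 1 : op_bytes;
}

/* Read the low dst_bytes of a 64-bit accumulator, as the switch does. */
static uint64_t read_acc(uint64_t rax, unsigned bytes)
{
    switch (bytes) {
    case 1: return (uint8_t)rax;
    case 2: return (uint16_t)rax;
    case 4: return (uint32_t)rax;
    case 8: return rax;           /* the newly added 64-bit case */
    }
    return 0;
}
```

Before the patch, a REX.W in/out through the accumulator would have truncated via the 4-byte case; with the size keyed off dst.bytes the full register round-trips.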
[PATCH v3 00/30] emulator cleanup
This is the first series of patches that tries to clean up the emulator code. It is a mix of bug fixes and of moving emulation code from x86.c to emulator.c while making it KVM-independent. The status of the patches: works for me. The realtime.flat test now also passes where it failed before. ChangeLog: v1->v2: - a couple of new bugs fixed - cpl is now an x86_emulator_ops callback - during string instructions, re-enter the guest on each page boundary - retain the fast path for pio out (do not go through the emulator) v2->v3: - use correct operand length for pio instructions with a REX prefix - check for string instruction before decrementing ecx - change the guest re-entry condition for string instructions Gleb Natapov (30): KVM: x86 emulator: Fix DstAcc decoding. KVM: x86 emulator: fix RCX access during rep emulation KVM: x86 emulator: check return value against correct define KVM: Remove pointer to rflags from realmode_set_cr parameters. KVM: Provide callback to get/set control registers in emulator ops. KVM: remove realmode_lmsw function. KVM: Provide x86_emulate_ctxt callback to get current cpl KVM: Provide current eip as part of emulator context. KVM: x86 emulator: fix mov r/m, sreg emulation. KVM: x86 emulator: fix 0f 01 /5 emulation KVM: x86 emulator: 0f (20|21|22|23) ignore mod bits. KVM: x86 emulator: inject #UD on access to non-existing CR KVM: x86 emulator: fix mov dr to inject #UD when needed. KVM: x86 emulator: fix return values of syscall/sysenter/sysexit emulations KVM: x86 emulator: do not call writeback if msr access fails. KVM: x86 emulator: If LOCK prefix is used dest arg should be memory. KVM: x86 emulator: cleanup grp3 return value KVM: x86 emulator: Provide more callbacks for x86 emulator. KVM: x86 emulator: Emulate task switch in emulator.c KVM: x86 emulator: Use load_segment_descriptor() instead of kvm_load_segment_descriptor() KVM: Use task switch from emulator.c KVM: x86 emulator: populate OP_MEM operand during decoding.
KVM: x86 emulator: add decoding of X,Y parameters from Intel SDM KVM: x86 emulator: during rep emulation decrement ECX only if emulation succeeded KVM: x86 emulator: fix in/out emulation. KVM: x86 emulator: Move string pio emulation into emulator.c KVM: x86 emulator: remove saved_eip KVM: x86 emulator: restart string instruction without going back to a guest. KVM: x86 emulator: introduce pio in string read ahead. KVM: small kvm_arch_vcpu_ioctl_run() cleanup. arch/x86/include/asm/kvm_emulate.h | 41 ++- arch/x86/include/asm/kvm_host.h | 16 +- arch/x86/kvm/emulate.c | 1062 ++- arch/x86/kvm/svm.c | 20 +- arch/x86/kvm/vmx.c | 18 +- arch/x86/kvm/x86.c | 1121 +--- 6 files changed, 1146 insertions(+), 1132 deletions(-)
[PATCH v3 10/30] KVM: x86 emulator: fix 0f 01 /5 emulation
It is undefined and should generate #UD. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index c3b9334..7c7debb 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2490,6 +2490,9 @@ twobyte_insn: (c->src.val & 0x0f), ctxt->vcpu); c->dst.type = OP_NONE; break; + case 5: /* not defined */ + kvm_queue_exception(ctxt->vcpu, UD_VECTOR); + goto done; case 7: /* invlpg*/ emulate_invlpg(ctxt->vcpu, memop); /* Disable writeback. */ -- 1.6.5
[PATCH v3 06/30] KVM: remove realmode_lmsw function.
Use the (get|set)_cr callback to emulate lmsw inside the emulator. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_host.h | 2 -- arch/x86/kvm/emulate.c | 4 ++-- arch/x86/kvm/x86.c | 7 --- 3 files changed, 2 insertions(+), 11 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 9725856..72997aa 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -582,8 +582,6 @@ int emulate_instruction(struct kvm_vcpu *vcpu, void kvm_report_emulation_failure(struct kvm_vcpu *cvpu, const char *context); void realmode_lgdt(struct kvm_vcpu *vcpu, u16 size, unsigned long address); void realmode_lidt(struct kvm_vcpu *vcpu, u16 size, unsigned long address); -void realmode_lmsw(struct kvm_vcpu *vcpu, unsigned long msw, - unsigned long *rflags); void kvm_enable_efer_bits(u64); int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data); diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 5b060e4..5e2fa61 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2486,8 +2486,8 @@ twobyte_insn: c->dst.val = ops->get_cr(0, ctxt->vcpu); break; case 6: /* lmsw */ - realmode_lmsw(ctxt->vcpu, (u16)c->src.val, - &ctxt->eflags); + ops->set_cr(0, (ops->get_cr(0, ctxt->vcpu) & ~0x0ful) | + (c->src.val & 0x0f), ctxt->vcpu); c->dst.type = OP_NONE; break; case 7: /* invlpg*/ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fb00ed5..b139334 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4061,13 +4061,6 @@ void realmode_lidt(struct kvm_vcpu *vcpu, u16 limit, unsigned long base) kvm_x86_ops->set_idt(vcpu, &dt); } -void realmode_lmsw(struct kvm_vcpu *vcpu, unsigned long msw, - unsigned long *rflags) -{ - kvm_lmsw(vcpu, msw); - *rflags = kvm_get_rflags(vcpu); -} - static int move_to_next_stateful_cpuid_entry(struct kvm_vcpu *vcpu, int i) { struct kvm_cpuid_entry2 *e = &vcpu->arch.cpuid_entries[i]; -- 1.6.5
[PATCH v3 09/30] KVM: x86 emulator: fix mov r/m, sreg emulation.
mov r/m, sreg generates #UD if sreg is incorrect. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 7 +++---- 1 files changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 2c27aa4..c3b9334 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2126,12 +2126,11 @@ special_insn: case 0x8c: { /* mov r/m, sreg */ struct kvm_segment segreg; - if (c->modrm_reg <= 5) + if (c->modrm_reg <= VCPU_SREG_GS) kvm_get_segment(ctxt->vcpu, &segreg, c->modrm_reg); else { - printk(KERN_INFO "0x8c: Invalid segreg in modrm byte 0x%02x\n", - c->modrm); - goto cannot_emulate; + kvm_queue_exception(ctxt->vcpu, UD_VECTOR); + goto done; } c->dst.val = segreg.selector; break; -- 1.6.5
[PATCH v3 11/30] KVM: x86 emulator: 0f (20|21|22|23) ignore mod bits.
Recent spec says that for 0f (20|21|22|23) the 2 bits in the mod field are ignored. Interestingly enough, the older spec says that 11 is the only valid encoding. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 8 -------- 1 files changed, 0 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 7c7debb..fa4604e 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2520,28 +2520,20 @@ twobyte_insn: c->dst.type = OP_NONE; break; case 0x20: /* mov cr, reg */ - if (c->modrm_mod != 3) - goto cannot_emulate; c->regs[c->modrm_rm] = ops->get_cr(c->modrm_reg, ctxt->vcpu); c->dst.type = OP_NONE; /* no writeback */ break; case 0x21: /* mov from dr to reg */ - if (c->modrm_mod != 3) - goto cannot_emulate; if (emulator_get_dr(ctxt, c->modrm_reg, &c->regs[c->modrm_rm])) goto cannot_emulate; rc = X86EMUL_CONTINUE; c->dst.type = OP_NONE; /* no writeback */ break; case 0x22: /* mov reg, cr */ - if (c->modrm_mod != 3) - goto cannot_emulate; ops->set_cr(c->modrm_reg, c->modrm_val, ctxt->vcpu); c->dst.type = OP_NONE; break; case 0x23: /* mov from reg to dr */ - if (c->modrm_mod != 3) - goto cannot_emulate; if (emulator_set_dr(ctxt, c->modrm_reg, c->regs[c->modrm_rm])) goto cannot_emulate; rc = X86EMUL_CONTINUE; -- 1.6.5
[PATCH v3 12/30] KVM: x86 emulator: inject #UD on access to non-existing CR
Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 7 +++++++ 1 files changed, 7 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index fa4604e..836e97b 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2520,6 +2520,13 @@ twobyte_insn: c->dst.type = OP_NONE; break; case 0x20: /* mov cr, reg */ + switch (c->modrm_reg) { + case 1: + case 5 ... 7: + case 9 ... 15: + kvm_queue_exception(ctxt->vcpu, UD_VECTOR); + goto done; + } c->regs[c->modrm_rm] = ops->get_cr(c->modrm_reg, ctxt->vcpu); c->dst.type = OP_NONE; /* no writeback */ break; -- 1.6.5
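The case ranges above reject by exclusion: CR1, CR5-CR7 and CR9-CR15 are not architecturally defined, so reading them must raise #UD. The same predicate written positively, as a hedged sketch (the helper name `cr_exists` is made up for illustration):

```c
#include <assert.h>

/* Control registers reachable via 0f 20 (mov cr, reg); everything
 * else (CR1, CR5-CR7, CR9-CR15) does not exist and should yield #UD. */
static int cr_exists(int cr)
{
    switch (cr) {
    case 0: case 2: case 3: case 4: case 8:
        return 1;
    default:
        return 0;
    }
}
```

Writing the check as an exclusion list in the patch keeps the fall-through to the existing `ops->get_cr()` call untouched, which is why the kernel hunk is insertion-only.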
[PATCH v3 14/30] KVM: x86 emulator: fix return values of syscall/sysenter/sysexit emulations
Return X86EMUL_PROPAGATE_FAULT if a fault was injected. Also inject #UD for those instructions when appropriate. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 17 +++++++++++------ 1 files changed, 11 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 5afddcf..1393bf0 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1600,8 +1600,11 @@ emulate_syscall(struct x86_emulate_ctxt *ctxt) u64 msr_data; /* syscall is not available in real mode */ - if (ctxt->mode == X86EMUL_MODE_REAL || ctxt->mode == X86EMUL_MODE_VM86) - return X86EMUL_UNHANDLEABLE; + if (ctxt->mode == X86EMUL_MODE_REAL || + ctxt->mode == X86EMUL_MODE_VM86) { + kvm_queue_exception(ctxt->vcpu, UD_VECTOR); + return X86EMUL_PROPAGATE_FAULT; + } setup_syscalls_segments(ctxt, &cs, &ss); @@ -1651,14 +1654,16 @@ emulate_sysenter(struct x86_emulate_ctxt *ctxt) /* inject #GP if in real mode */ if (ctxt->mode == X86EMUL_MODE_REAL) { kvm_inject_gp(ctxt->vcpu, 0); - return X86EMUL_UNHANDLEABLE; + return X86EMUL_PROPAGATE_FAULT; } /* XXX sysenter/sysexit have not been tested in 64bit mode. * Therefore, we inject an #UD. */ - if (ctxt->mode == X86EMUL_MODE_PROT64) - return X86EMUL_UNHANDLEABLE; + if (ctxt->mode == X86EMUL_MODE_PROT64) { + kvm_queue_exception(ctxt->vcpu, UD_VECTOR); + return X86EMUL_PROPAGATE_FAULT; + } setup_syscalls_segments(ctxt, &cs, &ss); @@ -1713,7 +1718,7 @@ emulate_sysexit(struct x86_emulate_ctxt *ctxt) if (ctxt->mode == X86EMUL_MODE_REAL || ctxt->mode == X86EMUL_MODE_VM86) { kvm_inject_gp(ctxt->vcpu, 0); - return X86EMUL_UNHANDLEABLE; + return X86EMUL_PROPAGATE_FAULT; } setup_syscalls_segments(ctxt, &cs, &ss); -- 1.6.5
[PATCH v3 05/30] KVM: Provide callback to get/set control registers in emulator ops.
Use this callback instead of directly call kvm function. Also rename realmode_(set|get)_cr to emulator_(set|get)_cr since function has nothing to do with real mode. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_emulate.h |3 +- arch/x86/include/asm/kvm_host.h|2 - arch/x86/kvm/emulate.c |7 +- arch/x86/kvm/x86.c | 114 ++-- 4 files changed, 63 insertions(+), 63 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index 2666d7a..0c5caa4 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -108,7 +108,8 @@ struct x86_emulate_ops { const void *new, unsigned int bytes, struct kvm_vcpu *vcpu); - + ulong (*get_cr)(int cr, struct kvm_vcpu *vcpu); + void (*set_cr)(int cr, ulong val, struct kvm_vcpu *vcpu); }; /* Type, address-of, and value of an instruction's operand. */ diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 8567107..9725856 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -585,8 +585,6 @@ void realmode_lidt(struct kvm_vcpu *vcpu, u16 size, unsigned long address); void realmode_lmsw(struct kvm_vcpu *vcpu, unsigned long msw, unsigned long *rflags); -unsigned long realmode_get_cr(struct kvm_vcpu *vcpu, int cr); -void realmode_set_cr(struct kvm_vcpu *vcpu, int cr, unsigned long value); void kvm_enable_efer_bits(u64); int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data); int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 91450b5..5b060e4 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2483,7 +2483,7 @@ twobyte_insn: break; case 4: /* smsw */ c-dst.bytes = 2; - c-dst.val = realmode_get_cr(ctxt-vcpu, 0); + c-dst.val = ops-get_cr(0, ctxt-vcpu); break; case 6: /* lmsw */ realmode_lmsw(ctxt-vcpu, (u16)c-src.val, @@ -2519,8 +2519,7 @@ twobyte_insn: case 0x20: /* mov cr, reg */ if 
(c-modrm_mod != 3) goto cannot_emulate; - c-regs[c-modrm_rm] = - realmode_get_cr(ctxt-vcpu, c-modrm_reg); + c-regs[c-modrm_rm] = ops-get_cr(c-modrm_reg, ctxt-vcpu); c-dst.type = OP_NONE; /* no writeback */ break; case 0x21: /* mov from dr to reg */ @@ -2534,7 +2533,7 @@ twobyte_insn: case 0x22: /* mov reg, cr */ if (c-modrm_mod != 3) goto cannot_emulate; - realmode_set_cr(ctxt-vcpu, c-modrm_reg, c-modrm_val); + ops-set_cr(c-modrm_reg, c-modrm_val, ctxt-vcpu); c-dst.type = OP_NONE; break; case 0x23: /* mov from reg to dr */ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 56cdaa5..fb00ed5 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3386,12 +3386,70 @@ void kvm_report_emulation_failure(struct kvm_vcpu *vcpu, const char *context) } EXPORT_SYMBOL_GPL(kvm_report_emulation_failure); +static u64 mk_cr_64(u64 curr_cr, u32 new_val) +{ + return (curr_cr ~((1ULL 32) - 1)) | new_val; +} + +static unsigned long emulator_get_cr(int cr, struct kvm_vcpu *vcpu) +{ + unsigned long value; + + switch (cr) { + case 0: + value = kvm_read_cr0(vcpu); + break; + case 2: + value = vcpu-arch.cr2; + break; + case 3: + value = vcpu-arch.cr3; + break; + case 4: + value = kvm_read_cr4(vcpu); + break; + case 8: + value = kvm_get_cr8(vcpu); + break; + default: + vcpu_printf(vcpu, %s: unexpected cr %u\n, __func__, cr); + return 0; + } + + return value; +} + +static void emulator_set_cr(int cr, unsigned long val, struct kvm_vcpu *vcpu) +{ + switch (cr) { + case 0: + kvm_set_cr0(vcpu, mk_cr_64(kvm_read_cr0(vcpu), val)); + break; + case 2: + vcpu-arch.cr2 = val; + break; + case 3: + kvm_set_cr3(vcpu, val); + break; + case 4: + kvm_set_cr4(vcpu, mk_cr_64(kvm_read_cr4(vcpu), val)); + break; + case 8: + kvm_set_cr8(vcpu, val 0xfUL); + break; + default: + vcpu_printf(vcpu, %s: unexpected cr %u\n, __func__, cr); + } +} + static struct x86_emulate_ops emulate_ops = {
[PATCH v3 25/30] KVM: x86 emulator: fix in/out emulation.
in/out emulation is broken now. The breakage is different depending on where IO device resides. If it is in userspace emulator reports emulation failure since it incorrectly interprets kvm_emulate_pio() return value. If IO device is in the kernel emulation of 'in' will do nothing since kvm_emulate_pio() stores result directly into vcpu registers, so emulator will overwrite result of emulation during commit of shadowed register. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_emulate.h |7 + arch/x86/include/asm/kvm_host.h|3 +- arch/x86/kvm/emulate.c | 49 - arch/x86/kvm/svm.c | 20 +-- arch/x86/kvm/vmx.c | 18 ++-- arch/x86/kvm/x86.c | 213 ++-- 6 files changed, 177 insertions(+), 133 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index bd46929..679245c 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -119,6 +119,13 @@ struct x86_emulate_ops { const void *new, unsigned int bytes, struct kvm_vcpu *vcpu); + + int (*pio_in_emulated)(int size, unsigned short port, void *val, + unsigned int count, struct kvm_vcpu *vcpu); + + int (*pio_out_emulated)(int size, unsigned short port, const void *val, + unsigned int count, struct kvm_vcpu *vcpu); + bool (*get_cached_descriptor)(struct desc_struct *desc, int seg, struct kvm_vcpu *vcpu); void (*set_cached_descriptor)(struct desc_struct *desc, diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 72997aa..4a4fb8d 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -589,8 +589,7 @@ int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); struct x86_emulate_ctxt; -int kvm_emulate_pio(struct kvm_vcpu *vcpu, int in, -int size, unsigned port); +int kvm_fast_pio_out(struct kvm_vcpu *vcpu, int size, unsigned short port); int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, int in, int size, unsigned long count, int down, gva_t address, int rep, unsigned 
port); diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index a166235..873da58 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -210,13 +210,13 @@ static u32 opcode_table[256] = { 0, 0, 0, 0, 0, 0, 0, 0, /* 0xE0 - 0xE7 */ 0, 0, 0, 0, - ByteOp | SrcImmUByte, SrcImmUByte, - ByteOp | SrcImmUByte, SrcImmUByte, + ByteOp | SrcImmUByte | DstAcc, SrcImmUByte | DstAcc, + ByteOp | SrcImmUByte | DstAcc, SrcImmUByte | DstAcc, /* 0xE8 - 0xEF */ SrcImm | Stack, SrcImm | ImplicitOps, SrcImmU | Src2Imm16 | No64, SrcImmByte | ImplicitOps, - SrcNone | ByteOp | ImplicitOps, SrcNone | ImplicitOps, - SrcNone | ByteOp | ImplicitOps, SrcNone | ImplicitOps, + SrcNone | ByteOp | DstAcc, SrcNone | DstAcc, + SrcNone | ByteOp | DstAcc, SrcNone | DstAcc, /* 0xF0 - 0xF7 */ 0, 0, 0, 0, ImplicitOps | Priv, ImplicitOps, Group | Group3_Byte, Group | Group3, @@ -2422,8 +2422,6 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) u64 msr_data; unsigned long saved_eip = 0; struct decode_cache *c = ctxt-decode; - unsigned int port; - int io_dir_in; int rc = X86EMUL_CONTINUE; ctxt-interruptibility = 0; @@ -2819,14 +2817,10 @@ special_insn: break; case 0xe4: /* inb */ case 0xe5: /* in */ - port = c-src.val; - io_dir_in = 1; - goto do_io; + goto do_io_in; case 0xe6: /* outb */ case 0xe7: /* out */ - port = c-src.val; - io_dir_in = 0; - goto do_io; + goto do_io_out; case 0xe8: /* call (near) */ { long int rel = c-src.val; c-src.val = (unsigned long) c-eip; @@ -2851,25 +2845,28 @@ special_insn: break; case 0xec: /* in al,dx */ case 0xed: /* in (e/r)ax,dx */ - port = c-regs[VCPU_REGS_RDX]; - io_dir_in = 1; - goto do_io; + c-src.val = c-regs[VCPU_REGS_RDX]; + do_io_in: + c-dst.bytes = min(c-dst.bytes, 4u); + if (!emulator_io_permited(ctxt, ops, c-src.val, c-dst.bytes)) { + kvm_inject_gp(ctxt-vcpu, 0); + goto done; + } + ops-pio_in_emulated(c-dst.bytes, c-src.val, c-dst.val, 1, +ctxt-vcpu); +
[PATCH v3 17/30] KVM: x86 emulator: cleanup grp3 return value
When x86_emulate_insn() does not know how to emulate an instruction it exits via the cannot_emulate label in all cases except when emulating grp3. Fix that. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 12 ++++-------- 1 files changed, 4 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 46a7ee3..d696cbd 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1397,7 +1397,6 @@ static inline int emulate_grp3(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) { struct decode_cache *c = &ctxt->decode; - int rc = X86EMUL_CONTINUE; switch (c->modrm_reg) { case 0 ... 1: /* test */ @@ -1410,11 +1409,9 @@ static inline int emulate_grp3(struct x86_emulate_ctxt *ctxt, emulate_1op("neg", c->dst, ctxt->eflags); break; default: - DPRINTF("Cannot emulate %02x\n", c->b); - rc = X86EMUL_UNHANDLEABLE; - break; + return 0; } - return rc; + return 1; } static inline int emulate_grp45(struct x86_emulate_ctxt *ctxt, @@ -2374,9 +2371,8 @@ special_insn: c->dst.type = OP_NONE; /* Disable writeback. */ break; case 0xf6 ... 0xf7: /* Grp3 */ - rc = emulate_grp3(ctxt, ops); - if (rc != X86EMUL_CONTINUE) - goto done; + if (!emulate_grp3(ctxt, ops)) + goto cannot_emulate; break; case 0xf8: /* clc */ ctxt->eflags &= ~EFLG_CF; -- 1.6.5
[PATCH v3 29/30] KVM: x86 emulator: introduce pio in string read ahead.
To optimize rep ins instruction do IO in big chunks ahead of time instead of doing it only when required during instruction emulation. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_emulate.h |7 ++ arch/x86/kvm/emulate.c | 43 +++ 2 files changed, 45 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index 7fda16f..b5e12c5 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -151,6 +151,12 @@ struct fetch_cache { unsigned long end; }; +struct read_cache { + u8 data[1024]; + unsigned long pos; + unsigned long end; +}; + struct decode_cache { u8 twobyte; u8 b; @@ -178,6 +184,7 @@ struct decode_cache { void *modrm_ptr; unsigned long modrm_val; struct fetch_cache fetch; + struct read_cache io_read; }; struct x86_emulate_ctxt { diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index c4da60e..d9cf93b 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1257,6 +1257,34 @@ done: return (rc == X86EMUL_UNHANDLEABLE) ? -1 : 0; } +static int pio_in_emulated(struct x86_emulate_ctxt *ctxt, + struct x86_emulate_ops *ops, + unsigned int size, unsigned short port, + void *dest) +{ + struct read_cache *rc = ctxt-decode.io_read; + + if (rc-pos == rc-end) { /* refill pio read ahead */ + struct decode_cache *c = ctxt-decode; + unsigned int in_page, n; + unsigned int count = c-rep_prefix ? + address_mask(c, c-regs[VCPU_REGS_RCX]) : 1; + in_page = (ctxt-eflags EFLG_DF) ? 
+ offset_in_page(c-regs[VCPU_REGS_RDI]) : + PAGE_SIZE - offset_in_page(c-regs[VCPU_REGS_RDI]); + n = min(min(in_page, (unsigned int)sizeof(rc-data)) / size, + count); + rc-pos = rc-end = 0; + if (!ops-pio_in_emulated(size, port, rc-data, n, ctxt-vcpu)) + return 0; + rc-end = n * size; + } + + memcpy(dest, rc-data + rc-pos, size); + rc-pos += size; + return 1; +} + static u32 desc_limit_scaled(struct desc_struct *desc) { u32 limit = get_desc_limit(desc); @@ -2618,8 +2646,8 @@ special_insn: kvm_inject_gp(ctxt-vcpu, 0); goto done; } - if (!ops-pio_in_emulated(c-dst.bytes, c-regs[VCPU_REGS_RDX], - c-dst.val, 1, ctxt-vcpu)) + if (!pio_in_emulated(ctxt, ops, c-dst.bytes, +c-regs[VCPU_REGS_RDX], c-dst.val)) goto done; /* IO is needed, skip writeback */ break; case 0x6e: /* outsb */ @@ -2835,8 +2863,7 @@ special_insn: kvm_inject_gp(ctxt-vcpu, 0); goto done; } - ops-pio_in_emulated(c-dst.bytes, c-src.val, c-dst.val, 1, -ctxt-vcpu); + pio_in_emulated(ctxt, ops, c-dst.bytes, c-src.val, c-dst.val); break; case 0xee: /* out al,dx */ case 0xef: /* out (e/r)ax,dx */ @@ -2923,8 +2950,14 @@ writeback: string_addr_inc(ctxt, es_base(ctxt), VCPU_REGS_RDI, c-dst); if (c-rep_prefix (c-d String)) { + struct read_cache *rc = ctxt-decode.io_read; register_address_increment(c, c-regs[VCPU_REGS_RCX], -1); - if (!(c-regs[VCPU_REGS_RCX] 0x3ff)) + /* +* Re-enter guest when pio read ahead buffer is empty or, +* if it is not used, after each 1024 iteration. +*/ + if ((rc-end == 0 !(c-regs[VCPU_REGS_RCX] 0x3ff)) || + (rc-end != 0 rc-end == rc-pos)) ctxt-restart = false; } -- 1.6.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
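The refill hunk above sizes each read-ahead burst so it never crosses the page the guest's string buffer occupies, never overflows the 1024-byte cache, and never exceeds the remaining REP count. A self-contained sketch of just that arithmetic; names and the direction-flag handling are paraphrased from the hunk, not taken verbatim:

```c
#include <stdint.h>

#define MY_PAGE_SIZE 4096u   /* assumed 4 KiB pages, as on x86 */
#define RC_BUF       1024u   /* size of the read_cache data array */

static unsigned min_u(unsigned a, unsigned b) { return a < b ? a : b; }

/* How many elements to read ahead for `rep ins`: bounded by the space
 * left in the destination page (direction-dependent via EFLAGS.DF),
 * the cache size, and the remaining iteration count in RCX. */
static unsigned readahead_count(unsigned long rdi, unsigned size,
                                unsigned long rcx, int df)
{
    unsigned off = rdi & (MY_PAGE_SIZE - 1);
    unsigned in_page = df ? off : MY_PAGE_SIZE - off;
    return min_u(min_u(in_page, RC_BUF) / size, (unsigned)rcx);
}
```

With a 4-byte element and a page-aligned RDI this yields 256 elements per burst, so a long `rep ins` does one emulated I/O call per kilobyte instead of one per element.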
[PATCH v3 22/30] KVM: x86 emulator: populate OP_MEM operand during decoding.
All struct operand fields are initialized during decoding for all operand types except OP_MEM, but there is no reason for that. Move OP_MEM operand initialization into decoding stage for consistency. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 66 +--- 1 files changed, 29 insertions(+), 37 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 702bfff..55b8a8b 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1057,6 +1057,10 @@ done_prefixes: if (c-ad_bytes != 8) c-modrm_ea = (u32)c-modrm_ea; + + if (c-rip_relative) + c-modrm_ea += c-eip; + /* * Decode and fetch the source operand: register, memory * or immediate. @@ -1091,6 +1095,8 @@ done_prefixes: break; } c-src.type = OP_MEM; + c-src.ptr = (unsigned long *)c-modrm_ea; + c-src.val = 0; break; case SrcImm: case SrcImmU: @@ -1169,8 +1175,10 @@ done_prefixes: c-src2.val = 1; break; case Src2Mem16: - c-src2.bytes = 2; c-src2.type = OP_MEM; + c-src2.bytes = 2; + c-src2.ptr = (unsigned long *)(c-modrm_ea + c-src.bytes); + c-src2.val = 0; break; } @@ -1192,6 +1200,15 @@ done_prefixes: break; } c-dst.type = OP_MEM; + c-dst.ptr = (unsigned long *)c-modrm_ea; + c-dst.bytes = (c-d ByteOp) ? 1 : c-op_bytes; + c-dst.val = 0; + if (c-d BitOp) { + unsigned long mask = ~(c-dst.bytes * 8 - 1); + + c-dst.ptr = (void *)c-dst.ptr + + (c-src.val mask) / 8; + } break; case DstAcc: c-dst.type = OP_REG; @@ -1215,9 +1232,6 @@ done_prefixes: break; } - if (c-rip_relative) - c-modrm_ea += c-eip; - done: return (rc == X86EMUL_UNHANDLEABLE) ? 
-1 : 0; } @@ -1638,14 +1652,13 @@ static inline int emulate_grp45(struct x86_emulate_ctxt *ctxt, } static inline int emulate_grp9(struct x86_emulate_ctxt *ctxt, - struct x86_emulate_ops *ops, - unsigned long memop) + struct x86_emulate_ops *ops) { struct decode_cache *c = ctxt-decode; u64 old, new; int rc; - rc = ops-read_emulated(memop, old, 8, ctxt-vcpu); + rc = ops-read_emulated(c-modrm_ea, old, 8, ctxt-vcpu); if (rc != X86EMUL_CONTINUE) return rc; @@ -1660,7 +1673,7 @@ static inline int emulate_grp9(struct x86_emulate_ctxt *ctxt, new = ((u64)c-regs[VCPU_REGS_RCX] 32) | (u32) c-regs[VCPU_REGS_RBX]; - rc = ops-cmpxchg_emulated(memop, old, new, 8, ctxt-vcpu); + rc = ops-cmpxchg_emulated(c-modrm_ea, old, new, 8, ctxt-vcpu); if (rc != X86EMUL_CONTINUE) return rc; ctxt-eflags |= EFLG_ZF; @@ -2378,7 +2391,6 @@ int emulator_task_switch(struct x86_emulate_ctxt *ctxt, int x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) { - unsigned long memop = 0; u64 msr_data; unsigned long saved_eip = 0; struct decode_cache *c = ctxt-decode; @@ -2413,9 +2425,6 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) goto done; } - if (((c-d ModRM) (c-modrm_mod != 3)) || (c-d MemAbs)) - memop = c-modrm_ea; - if (c-rep_prefix (c-d String)) { /* All REP prefixes have the same first termination condition */ if (address_mask(c, c-regs[VCPU_REGS_RCX]) == 0) { @@ -2447,8 +2456,6 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) } if (c-src.type == OP_MEM) { - c-src.ptr = (unsigned long *)memop; - c-src.val = 0; rc = ops-read_emulated((unsigned long)c-src.ptr, c-src.val, c-src.bytes, @@ -2459,8 +2466,6 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) } if (c-src2.type == OP_MEM) { - c-src2.ptr = (unsigned long *)(memop + c-src.bytes); - c-src2.val = 0; rc = ops-read_emulated((unsigned long)c-src2.ptr, c-src2.val, c-src2.bytes, @@ -2473,25 +2478,12 @@ x86_emulate_insn(struct 
x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops)
[PATCH v3 24/30] KVM: x86 emulator: during rep emulation decrement ECX only if emulation succeeded
Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 15 ++++++++------- 1 files changed, 8 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 6ebd642..a166235 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2407,13 +2407,13 @@ int emulator_task_switch(struct x86_emulate_ctxt *ctxt, } static void string_addr_inc(struct x86_emulate_ctxt *ctxt, unsigned long base, - int reg, unsigned long **ptr) + int reg, struct operand *op) { struct decode_cache *c = &ctxt->decode; int df = (ctxt->eflags & EFLG_DF) ? -1 : 1; - register_address_increment(c, &c->regs[reg], df * c->src.bytes); - *ptr = (unsigned long *)register_address(c, base, c->regs[reg]); + register_address_increment(c, &c->regs[reg], df * op->bytes); + op->ptr = (unsigned long *)register_address(c, base, c->regs[reg]); } int @@ -2479,7 +2479,6 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) goto done; } } - register_address_increment(c, &c->regs[VCPU_REGS_RCX], -1); c->eip = ctxt->eip; } @@ -2932,11 +2931,13 @@ writeback: if ((c->d & SrcMask) == SrcSI) string_addr_inc(ctxt, seg_override_base(ctxt, c), VCPU_REGS_RSI, - &c->src.ptr); + &c->src); if ((c->d & DstMask) == DstDI) - string_addr_inc(ctxt, es_base(ctxt), VCPU_REGS_RDI, - &c->dst.ptr); + string_addr_inc(ctxt, es_base(ctxt), VCPU_REGS_RDI, &c->dst); + + if (c->rep_prefix && (c->d & String)) + register_address_increment(c, &c->regs[VCPU_REGS_RCX], -1); /* Commit shadow register state. */ memcpy(ctxt->vcpu->arch.regs, c->regs, sizeof c->regs); -- 1.6.5
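The ordering the patch enforces can be shown in miniature: the iteration count must drop only after an iteration fully succeeds, so a fault in the middle of an iteration leaves RCX still counting it and the restart resumes correctly. This sketch is illustrative only; the names and the fault-injection mechanism are invented stand-ins:

```c
#include <assert.h>

static int budget;                       /* successes left before a "fault" */
static int iter_ok(void) { return budget-- > 0; }

/* Mirror of the REP loop discipline: decrement RCX only after the
 * instruction body (including writeback) succeeded. */
static unsigned long rep_emulate(unsigned long rcx, int (*one_iter)(void))
{
    while (rcx) {
        if (!one_iter())
            break;      /* fault: this iteration still counts as pending */
        rcx--;          /* decrement only after successful completion */
    }
    return rcx;
}
```

Decrementing before the body, as the old code did, would have under-counted the remaining iterations whenever emulation of one step failed and had to be retried.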
[PATCH v3 21/30] KVM: Use task switch from emulator.c
Remove old task switch code from x86.c Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/x86.c | 557 ++-- 1 files changed, 17 insertions(+), 540 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 2ef83db..7d1b481 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4795,553 +4795,30 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu, return 0; } -static void seg_desct_to_kvm_desct(struct desc_struct *seg_desc, u16 selector, - struct kvm_segment *kvm_desct) -{ - kvm_desct-base = get_desc_base(seg_desc); - kvm_desct-limit = get_desc_limit(seg_desc); - if (seg_desc-g) { - kvm_desct-limit = 12; - kvm_desct-limit |= 0xfff; - } - kvm_desct-selector = selector; - kvm_desct-type = seg_desc-type; - kvm_desct-present = seg_desc-p; - kvm_desct-dpl = seg_desc-dpl; - kvm_desct-db = seg_desc-d; - kvm_desct-s = seg_desc-s; - kvm_desct-l = seg_desc-l; - kvm_desct-g = seg_desc-g; - kvm_desct-avl = seg_desc-avl; - if (!selector) - kvm_desct-unusable = 1; - else - kvm_desct-unusable = 0; - kvm_desct-padding = 0; -} - -static void get_segment_descriptor_dtable(struct kvm_vcpu *vcpu, - u16 selector, - struct desc_ptr *dtable) -{ - if (selector 1 2) { - struct kvm_segment kvm_seg; - - kvm_get_segment(vcpu, kvm_seg, VCPU_SREG_LDTR); - - if (kvm_seg.unusable) - dtable-size = 0; - else - dtable-size = kvm_seg.limit; - dtable-address = kvm_seg.base; - } - else - kvm_x86_ops-get_gdt(vcpu, dtable); -} - -/* allowed just for 8 bytes segments */ -static int load_guest_segment_descriptor(struct kvm_vcpu *vcpu, u16 selector, -struct desc_struct *seg_desc) -{ - struct desc_ptr dtable; - u16 index = selector 3; - int ret; - u32 err; - gva_t addr; - - get_segment_descriptor_dtable(vcpu, selector, dtable); - - if (dtable.size index * 8 + 7) { - kvm_queue_exception_e(vcpu, GP_VECTOR, selector 0xfffc); - return X86EMUL_PROPAGATE_FAULT; - } - addr = dtable.address + index * 8; - ret = kvm_read_guest_virt_system(addr, seg_desc, sizeof(*seg_desc), -vcpu, 
err); - if (ret == X86EMUL_PROPAGATE_FAULT) - kvm_inject_page_fault(vcpu, addr, err); - - return ret; -} - -/* allowed just for 8 bytes segments */ -static int save_guest_segment_descriptor(struct kvm_vcpu *vcpu, u16 selector, -struct desc_struct *seg_desc) -{ - struct desc_ptr dtable; - u16 index = selector 3; - - get_segment_descriptor_dtable(vcpu, selector, dtable); - - if (dtable.size index * 8 + 7) - return 1; - return kvm_write_guest_virt(dtable.address + index*8, seg_desc, sizeof(*seg_desc), vcpu, NULL); -} - -static gpa_t get_tss_base_addr_write(struct kvm_vcpu *vcpu, - struct desc_struct *seg_desc) -{ - u32 base_addr = get_desc_base(seg_desc); - - return kvm_mmu_gva_to_gpa_write(vcpu, base_addr, NULL); -} - -static gpa_t get_tss_base_addr_read(struct kvm_vcpu *vcpu, -struct desc_struct *seg_desc) -{ - u32 base_addr = get_desc_base(seg_desc); - - return kvm_mmu_gva_to_gpa_read(vcpu, base_addr, NULL); -} - -static u16 get_segment_selector(struct kvm_vcpu *vcpu, int seg) -{ - struct kvm_segment kvm_seg; - - kvm_get_segment(vcpu, kvm_seg, seg); - return kvm_seg.selector; -} - -static int kvm_load_realmode_segment(struct kvm_vcpu *vcpu, u16 selector, int seg) -{ - struct kvm_segment segvar = { - .base = selector 4, - .limit = 0x, - .selector = selector, - .type = 3, - .present = 1, - .dpl = 3, - .db = 0, - .s = 1, - .l = 0, - .g = 0, - .avl = 0, - .unusable = 0, - }; - kvm_x86_ops-set_segment(vcpu, segvar, seg); - return X86EMUL_CONTINUE; -} - -static int is_vm86_segment(struct kvm_vcpu *vcpu, int seg) -{ - return (seg != VCPU_SREG_LDTR) - (seg != VCPU_SREG_TR) - (kvm_get_rflags(vcpu) X86_EFLAGS_VM); -} - -int kvm_load_segment_descriptor(struct kvm_vcpu *vcpu, u16 selector, int seg) -{ - struct kvm_segment kvm_seg; - struct desc_struct seg_desc; - u8 dpl, rpl, cpl; - unsigned err_vec = GP_VECTOR; - u32 err_code = 0; - bool
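The helpers deleted above converted a raw GDT/LDT descriptor into a `kvm_segment`, including scaling the limit by the granularity bit. A minimal model of that scaling (illustrative names, not the kernel helpers — the real code lives in `seg_desct_to_kvm_desct()` / `desc_limit_scaled()`):

```c
#include <assert.h>
#include <stdint.h>

/* When the descriptor's G bit is set, the 20-bit limit field counts
 * 4K pages, so the byte-granular limit is (limit << 12) | 0xfff;
 * otherwise the limit is already in bytes. */
static uint32_t scaled_seg_limit(uint32_t raw_limit, int g_bit)
{
    return g_bit ? (raw_limit << 12) | 0xfff : raw_limit;
}
```

With the maximum 20-bit raw limit and G=1 this yields the full 4 GiB limit, `0xffffffff`.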
[PATCH v3 30/30] KVM: small kvm_arch_vcpu_ioctl_run() cleanup.
Unify all conditions that get us back into emulator after returning from userspace. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/x86.c | 32 ++-- 1 files changed, 6 insertions(+), 26 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index cd0043a..1c00c06 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4505,33 +4505,13 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) if (!irqchip_in_kernel(vcpu-kvm)) kvm_set_cr8(vcpu, kvm_run-cr8); - if (vcpu-arch.pio.count) { - vcpu-srcu_idx = srcu_read_lock(vcpu-kvm-srcu); - r = emulate_instruction(vcpu, 0, 0, EMULTYPE_NO_DECODE); - srcu_read_unlock(vcpu-kvm-srcu, vcpu-srcu_idx); - if (r == EMULATE_DO_MMIO) { - r = 0; - goto out; + if (vcpu-arch.pio.count || vcpu-mmio_needed || + vcpu-arch.emulate_ctxt.restart) { + if (vcpu-mmio_needed) { + memcpy(vcpu-mmio_data, kvm_run-mmio.data, 8); + vcpu-mmio_read_completed = 1; + vcpu-mmio_needed = 0; } - } - if (vcpu-mmio_needed) { - memcpy(vcpu-mmio_data, kvm_run-mmio.data, 8); - vcpu-mmio_read_completed = 1; - vcpu-mmio_needed = 0; - - vcpu-srcu_idx = srcu_read_lock(vcpu-kvm-srcu); - r = emulate_instruction(vcpu, vcpu-arch.mmio_fault_cr2, 0, - EMULTYPE_NO_DECODE); - srcu_read_unlock(vcpu-kvm-srcu, vcpu-srcu_idx); - if (r == EMULATE_DO_MMIO) { - /* -* Read-modify-write. Back to userspace. -*/ - r = 0; - goto out; - } - } - if (vcpu-arch.emulate_ctxt.restart) { vcpu-srcu_idx = srcu_read_lock(vcpu-kvm-srcu); r = emulate_instruction(vcpu, 0, 0, EMULTYPE_NO_DECODE); srcu_read_unlock(vcpu-kvm-srcu, vcpu-srcu_idx); -- 1.6.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 28/30] KVM: x86 emulator: restart string instruction without going back to a guest.
Currently when string instruction is only partially complete we go back to a guest mode, guest tries to reexecute instruction and exits again and at this point emulation continues. Avoid all of this by restarting instruction without going back to a guest mode, but return to a guest mode each 1024 iterations to allow interrupt injection. Pending exception causes immediate guest entry too. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_emulate.h |1 + arch/x86/kvm/emulate.c | 34 +++--- arch/x86/kvm/x86.c | 19 ++- 3 files changed, 42 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index 679245c..7fda16f 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -193,6 +193,7 @@ struct x86_emulate_ctxt { /* interruptibility state, as a result of execution of STI or MOV SS */ int interruptibility; + bool restart; /* restart string instruction after writeback */ /* decode cache */ struct decode_cache decode; }; diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 541f3c9..c4da60e 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -927,8 +927,11 @@ x86_decode_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) int mode = ctxt-mode; int def_op_bytes, def_ad_bytes, group; - /* Shadow copy of register state. Committed on successful emulation. */ + /* we cannot decode insn before we complete previous rep insn */ + WARN_ON(ctxt-restart); + + /* Shadow copy of register state. Committed on successful emulation. 
*/ memset(c, 0, sizeof(struct decode_cache)); c-eip = ctxt-eip; ctxt-cs_base = seg_base(ctxt, VCPU_SREG_CS); @@ -2422,6 +2425,7 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) u64 msr_data; struct decode_cache *c = ctxt-decode; int rc = X86EMUL_CONTINUE; + int saved_dst_type = c-dst.type; ctxt-interruptibility = 0; @@ -2450,8 +2454,11 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) } if (c-rep_prefix (c-d String)) { + ctxt-restart = true; /* All REP prefixes have the same first termination condition */ if (address_mask(c, c-regs[VCPU_REGS_RCX]) == 0) { + string_done: + ctxt-restart = false; kvm_rip_write(ctxt-vcpu, c-eip); goto done; } @@ -2463,17 +2470,13 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) * - if REPNE/REPNZ and ZF = 1 then done */ if ((c-b == 0xa6) || (c-b == 0xa7) || - (c-b == 0xae) || (c-b == 0xaf)) { + (c-b == 0xae) || (c-b == 0xaf)) { if ((c-rep_prefix == REPE_PREFIX) - ((ctxt-eflags EFLG_ZF) == 0)) { - kvm_rip_write(ctxt-vcpu, c-eip); - goto done; - } + ((ctxt-eflags EFLG_ZF) == 0)) + goto string_done; if ((c-rep_prefix == REPNE_PREFIX) - ((ctxt-eflags EFLG_ZF) == EFLG_ZF)) { - kvm_rip_write(ctxt-vcpu, c-eip); - goto done; - } + ((ctxt-eflags EFLG_ZF) == EFLG_ZF)) + goto string_done; } c-eip = ctxt-eip; } @@ -2906,6 +2909,12 @@ writeback: if (rc != X86EMUL_CONTINUE) goto done; + /* +* restore dst type in case the decoding will be reused +* (happens for string instruction ) +*/ + c-dst.type = saved_dst_type; + if ((c-d SrcMask) == SrcSI) string_addr_inc(ctxt, seg_override_base(ctxt, c), VCPU_REGS_RSI, c-src); @@ -2913,8 +2922,11 @@ writeback: if ((c-d DstMask) == DstDI) string_addr_inc(ctxt, es_base(ctxt), VCPU_REGS_RDI, c-dst); - if (c-rep_prefix (c-d String)) + if (c-rep_prefix (c-d String)) { register_address_increment(c, c-regs[VCPU_REGS_RCX], -1); + if (!(c-regs[VCPU_REGS_RCX] 0x3ff)) + ctxt-restart = false; + } /* Commit shadow register state. 
*/ memcpy(ctxt->vcpu->arch.regs, c->regs, sizeof c->regs); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b8237ac..cd0043a 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3718,6 +3718,7 @@ int
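The patch's back-off rule can be modeled in a few lines: each iteration decrements RCX, and emulation drops the `restart` flag (returning to the guest-entry path) whenever the low 10 bits of RCX are all zero, bounding work to 1024 iterations per entry. This is an illustrative standalone model, not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Count how many string iterations run before the emulator yields,
 * using the same `!(rcx & 0x3ff)` mask as the patch. */
static unsigned iterations_before_yield(uint64_t rcx)
{
    unsigned n = 0;
    bool restart = true;

    while (restart && rcx) {
        rcx--;               /* register_address_increment(..., -1) */
        n++;
        if (!(rcx & 0x3ff))  /* low 10 bits clear: yield for irq injection */
            restart = false;
    }
    return n;
}
```

So a count of 2000 yields once after 976 iterations (when RCX first hits the multiple-of-1024 boundary at 1024) and finishes the rest on the next entry.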
[PATCH v3 26/30] KVM: x86 emulator: Move string pio emulation into emulator.c
Currently emulation is done outside of emulator so things like doing ins/outs to/from mmio are broken it also makes it hard (if not impossible) to implement single stepping in the future. The implementation in this patch is not efficient since it exits to userspace for each IO while previous implementation did 'ins' in batches. Further patch that implements pio in string read ahead address this problem. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_host.h |8 -- arch/x86/kvm/emulate.c | 48 +++-- arch/x86/kvm/x86.c | 204 +++ 3 files changed, 31 insertions(+), 229 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 4a4fb8d..c072401 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -224,14 +224,9 @@ struct kvm_pv_mmu_op_buffer { struct kvm_pio_request { unsigned long count; - int cur_count; - gva_t guest_gva; int in; int port; int size; - int string; - int down; - int rep; }; /* @@ -590,9 +585,6 @@ int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); struct x86_emulate_ctxt; int kvm_fast_pio_out(struct kvm_vcpu *vcpu, int size, unsigned short port); -int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, int in, - int size, unsigned long count, int down, - gva_t address, int rep, unsigned port); void kvm_emulate_cpuid(struct kvm_vcpu *vcpu); int kvm_emulate_halt(struct kvm_vcpu *vcpu); int emulate_invlpg(struct kvm_vcpu *vcpu, gva_t address); diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 873da58..1bedbb6 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -153,8 +153,8 @@ static u32 opcode_table[256] = { 0, 0, 0, 0, /* 0x68 - 0x6F */ SrcImm | Mov | Stack, 0, SrcImmByte | Mov | Stack, 0, - SrcNone | ByteOp | ImplicitOps, SrcNone | ImplicitOps, /* insb, insw/insd */ - SrcNone | ByteOp | ImplicitOps, SrcNone | ImplicitOps, /* outsb, outsw/outsd */ + DstDI | ByteOp | Mov | String, DstDI | Mov | String, /* insb, insw/insd 
*/ + SrcSI | ByteOp | ImplicitOps | String, SrcSI | ImplicitOps | String, /* outsb, outsw/outsd */ /* 0x70 - 0x77 */ SrcImmByte, SrcImmByte, SrcImmByte, SrcImmByte, SrcImmByte, SrcImmByte, SrcImmByte, SrcImmByte, @@ -2611,47 +2611,29 @@ special_insn: break; case 0x6c: /* insb */ case 0x6d: /* insw/insd */ + c-dst.bytes = min(c-dst.bytes, 4u); if (!emulator_io_permited(ctxt, ops, c-regs[VCPU_REGS_RDX], - (c-d ByteOp) ? 1 : c-op_bytes)) { + c-dst.bytes)) { kvm_inject_gp(ctxt-vcpu, 0); goto done; } - if (kvm_emulate_pio_string(ctxt-vcpu, - 1, - (c-d ByteOp) ? 1 : c-op_bytes, - c-rep_prefix ? - address_mask(c, c-regs[VCPU_REGS_RCX]) : 1, - (ctxt-eflags EFLG_DF), - register_address(c, es_base(ctxt), -c-regs[VCPU_REGS_RDI]), - c-rep_prefix, - c-regs[VCPU_REGS_RDX]) == 0) { - c-eip = saved_eip; - return -1; - } - return 0; + if (!ops-pio_in_emulated(c-dst.bytes, c-regs[VCPU_REGS_RDX], + c-dst.val, 1, ctxt-vcpu)) + goto done; /* IO is needed, skip writeback */ + break; case 0x6e: /* outsb */ case 0x6f: /* outsw/outsd */ + c-src.bytes = min(c-src.bytes, 4u); if (!emulator_io_permited(ctxt, ops, c-regs[VCPU_REGS_RDX], - (c-d ByteOp) ? 1 : c-op_bytes)) { + c-src.bytes)) { kvm_inject_gp(ctxt-vcpu, 0); goto done; } - if (kvm_emulate_pio_string(ctxt-vcpu, - 0, - (c-d ByteOp) ? 1 : c-op_bytes, - c-rep_prefix ? - address_mask(c, c-regs[VCPU_REGS_RCX]) : 1, - (ctxt-eflags EFLG_DF), -register_address(c, - seg_override_base(ctxt, c), -c-regs[VCPU_REGS_RSI]), -
[PATCH v3 18/30] KVM: x86 emulator: Provide more callbacks for x86 emulator.
Provide get_cached_descriptor(), set_cached_descriptor(), get_segment_selector(), set_segment_selector(), get_gdt(), write_std() callbacks. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_emulate.h | 16 + arch/x86/kvm/x86.c | 130 +++ 2 files changed, 131 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index 0765725..f901467 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -63,6 +63,15 @@ struct x86_emulate_ops { unsigned int bytes, struct kvm_vcpu *vcpu, u32 *error); /* +* write_std: Write bytes of standard (non-emulated/special) memory. +*Used for descriptor writing. +* @addr: [IN ] Linear address to which to write. +* @val: [OUT] Value write to memory, zero-extended to 'u_long'. +* @bytes: [IN ] Number of bytes to write to memory. +*/ + int (*write_std)(unsigned long addr, void *val, +unsigned int bytes, struct kvm_vcpu *vcpu, u32 *error); + /* * fetch: Read bytes of standard (non-emulated/special) memory. *Used for instruction fetch. * @addr: [IN ] Linear address from which to read. 
@@ -108,6 +117,13 @@ struct x86_emulate_ops { const void *new, unsigned int bytes, struct kvm_vcpu *vcpu); + bool (*get_cached_descriptor)(struct desc_struct *desc, + int seg, struct kvm_vcpu *vcpu); + void (*set_cached_descriptor)(struct desc_struct *desc, + int seg, struct kvm_vcpu *vcpu); + u16 (*get_segment_selector)(int seg, struct kvm_vcpu *vcpu); + void (*set_segment_selector)(u16 sel, int seg, struct kvm_vcpu *vcpu); + void (*get_gdt)(struct desc_ptr *dt, struct kvm_vcpu *vcpu); ulong (*get_cr)(int cr, struct kvm_vcpu *vcpu); void (*set_cr)(int cr, ulong val, struct kvm_vcpu *vcpu); int (*cpl)(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 022d28e..2ef83db 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3050,6 +3050,18 @@ static int vcpu_mmio_read(struct kvm_vcpu *vcpu, gpa_t addr, int len, void *v) return kvm_io_bus_read(vcpu-kvm, KVM_MMIO_BUS, addr, len, v); } +static void kvm_set_segment(struct kvm_vcpu *vcpu, + struct kvm_segment *var, int seg) +{ + kvm_x86_ops-set_segment(vcpu, var, seg); +} + +void kvm_get_segment(struct kvm_vcpu *vcpu, +struct kvm_segment *var, int seg) +{ + kvm_x86_ops-get_segment(vcpu, var, seg); +} + gpa_t kvm_mmu_gva_to_gpa_read(struct kvm_vcpu *vcpu, gva_t gva, u32 *error) { u32 access = (kvm_x86_ops-get_cpl(vcpu) == 3) ? 
PFERR_USER_MASK : 0; @@ -3130,14 +3142,18 @@ static int kvm_read_guest_virt_system(gva_t addr, void *val, unsigned int bytes, return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, 0, error); } -static int kvm_write_guest_virt(gva_t addr, void *val, unsigned int bytes, - struct kvm_vcpu *vcpu, u32 *error) +static int kvm_write_guest_virt_helper(gva_t addr, void *val, + unsigned int bytes, + struct kvm_vcpu *vcpu, u32 access, + u32 *error) { void *data = val; int r = X86EMUL_CONTINUE; + access |= PFERR_WRITE_MASK; + while (bytes) { - gpa_t gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, error); + gpa_t gpa = vcpu-arch.mmu.gva_to_gpa(vcpu, addr, access, error); unsigned offset = addr (PAGE_SIZE-1); unsigned towrite = min(bytes, (unsigned)PAGE_SIZE - offset); int ret; @@ -3160,6 +3176,19 @@ out: return r; } +static int kvm_write_guest_virt(gva_t addr, void *val, unsigned int bytes, + struct kvm_vcpu *vcpu, u32 *error) +{ + u32 access = (kvm_x86_ops-get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; + return kvm_write_guest_virt_helper(addr, val, bytes, vcpu, access, error); +} + +static int kvm_write_guest_virt_system(gva_t addr, void *val, + unsigned int bytes, + struct kvm_vcpu *vcpu, u32 *error) +{ + return kvm_write_guest_virt_helper(addr, val, bytes, vcpu, 0, error); +} static int emulator_read_emulated(unsigned long addr, void *val, @@ -3447,12 +3476,95 @@ static int emulator_get_cpl(struct kvm_vcpu *vcpu) return kvm_x86_ops-get_cpl(vcpu); } +static void emulator_get_gdt(struct desc_ptr *dt, struct kvm_vcpu *vcpu) +{ +
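The point of these new callbacks is that emulate.c never touches `kvm_x86_ops` directly; everything goes through the `x86_emulate_ops` table that x86.c fills in. A stripped-down sketch of the pattern (simplified stand-in types, not the kernel's actual signatures):

```c
#include <assert.h>

/* Minimal model of the ops-table indirection: the emulator core only
 * sees function pointers, so the backend can be swapped for tests. */
struct emu_ops {
    unsigned short (*get_segment_selector)(int seg);
    int (*cpl)(void);
};

static unsigned short fake_selector(int seg) { return (unsigned short)(0x10 + seg * 8); }
static int fake_cpl(void) { return 3; }

static const struct emu_ops test_ops = {
    .get_segment_selector = fake_selector,
    .cpl                  = fake_cpl,
};

/* Example consumer: privileged instructions are legal only at CPL 0,
 * and the core learns the CPL through the table, never directly. */
static int priv_insn_faults(const struct emu_ops *ops)
{
    return ops->cpl() != 0;
}
```

Here the fake backend reports CPL 3, so a privileged instruction would fault.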
[PATCH v3 27/30] KVM: x86 emulator: remove saved_eip
c-eip is never written back in case of emulation failure, so no need to set it to old value. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c |9 + 1 files changed, 1 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 1bedbb6..541f3c9 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2420,7 +2420,6 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) { u64 msr_data; - unsigned long saved_eip = 0; struct decode_cache *c = ctxt-decode; int rc = X86EMUL_CONTINUE; @@ -2432,7 +2431,6 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) */ memcpy(c-regs, ctxt-vcpu-arch.regs, sizeof c-regs); - saved_eip = c-eip; if (ctxt-mode == X86EMUL_MODE_PROT64 (c-d No64)) { kvm_queue_exception(ctxt-vcpu, UD_VECTOR); @@ -2923,11 +2921,7 @@ writeback: kvm_rip_write(ctxt-vcpu, c-eip); done: - if (rc == X86EMUL_UNHANDLEABLE) { - c-eip = saved_eip; - return -1; - } - return 0; + return (rc == X86EMUL_UNHANDLEABLE) ? -1 : 0; twobyte_insn: switch (c-b) { @@ -3204,6 +3198,5 @@ twobyte_insn: cannot_emulate: DPRINTF(Cannot emulate %02x\n, c-b); - c-eip = saved_eip; return -1; } -- 1.6.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 19/30] KVM: x86 emulator: Emulate task switch in emulator.c
Implement emulation of 16/32 bit task switch in emulator.c Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_emulate.h |5 + arch/x86/kvm/emulate.c | 563 2 files changed, 568 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index f901467..bd46929 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -11,6 +11,8 @@ #ifndef _ASM_X86_KVM_X86_EMULATE_H #define _ASM_X86_KVM_X86_EMULATE_H +#include asm/desc_defs.h + struct x86_emulate_ctxt; /* @@ -210,5 +212,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops); int x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops); +int emulator_task_switch(struct x86_emulate_ctxt *ctxt, +struct x86_emulate_ops *ops, +u16 tss_selector, int reason); #endif /* _ASM_X86_KVM_X86_EMULATE_H */ diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index d696cbd..db4776c 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -33,6 +33,7 @@ #include asm/kvm_emulate.h #include x86.h +#include tss.h /* * Opcode effective-address decode tables. @@ -1221,6 +1222,198 @@ done: return (rc == X86EMUL_UNHANDLEABLE) ? -1 : 0; } +static u32 desc_limit_scaled(struct desc_struct *desc) +{ + u32 limit = get_desc_limit(desc); + + return desc-g ? (limit 12) | 0xfff : limit; +} + +static void get_descriptor_table_ptr(struct x86_emulate_ctxt *ctxt, +struct x86_emulate_ops *ops, +u16 selector, struct desc_ptr *dt) +{ + if (selector 1 2) { + struct desc_struct desc; + memset (dt, 0, sizeof *dt); + if (!ops-get_cached_descriptor(desc, VCPU_SREG_LDTR, ctxt-vcpu)) + return; + + dt-size = desc_limit_scaled(desc); /* what if limit 65535? 
*/ + dt-address = get_desc_base(desc); + } else + ops-get_gdt(dt, ctxt-vcpu); +} + +/* allowed just for 8 bytes segments */ +static int read_segment_descriptor(struct x86_emulate_ctxt *ctxt, + struct x86_emulate_ops *ops, + u16 selector, struct desc_struct *desc) +{ + struct desc_ptr dt; + u16 index = selector 3; + int ret; + u32 err; + ulong addr; + + get_descriptor_table_ptr(ctxt, ops, selector, dt); + + if (dt.size index * 8 + 7) { + kvm_inject_gp(ctxt-vcpu, selector 0xfffc); + return X86EMUL_PROPAGATE_FAULT; + } + addr = dt.address + index * 8; + ret = ops-read_std(addr, desc, sizeof *desc, ctxt-vcpu, err); + if (ret == X86EMUL_PROPAGATE_FAULT) + kvm_inject_page_fault(ctxt-vcpu, addr, err); + + return ret; +} + +/* allowed just for 8 bytes segments */ +static int write_segment_descriptor(struct x86_emulate_ctxt *ctxt, + struct x86_emulate_ops *ops, + u16 selector, struct desc_struct *desc) +{ + struct desc_ptr dt; + u16 index = selector 3; + u32 err; + ulong addr; + int ret; + + get_descriptor_table_ptr(ctxt, ops, selector, dt); + + if (dt.size index * 8 + 7) { + kvm_inject_gp(ctxt-vcpu, selector 0xfffc); + return X86EMUL_PROPAGATE_FAULT; + } + + addr = dt.address + index * 8; + ret = ops-write_std(addr, desc, sizeof *desc, ctxt-vcpu, err); + if (ret == X86EMUL_PROPAGATE_FAULT) + kvm_inject_page_fault(ctxt-vcpu, addr, err); + + return ret; +} + +static int load_segment_descriptor(struct x86_emulate_ctxt *ctxt, + struct x86_emulate_ops *ops, + u16 selector, int seg) +{ + struct desc_struct seg_desc; + u8 dpl, rpl, cpl; + unsigned err_vec = GP_VECTOR; + u32 err_code = 0; + bool null_selector = !(selector ~0x3); /* -0003 are null */ + int ret; + + memset(seg_desc, 0, sizeof seg_desc); + + if ((seg = VCPU_SREG_GS ctxt-mode == X86EMUL_MODE_VM86) + || ctxt-mode == X86EMUL_MODE_REAL) { + /* set real mode segment descriptor */ + set_desc_base(seg_desc, selector 4); + set_desc_limit(seg_desc, 0x); + seg_desc.type = 3; + seg_desc.p = 1; + seg_desc.s = 1; + goto load; + } 
+ + /* NULL selector is not valid for TR, CS and SS */ + if ((seg == VCPU_SREG_CS || seg == VCPU_SREG_SS || seg == VCPU_SREG_TR) +null_selector) + goto exception; + + /* TR should be in GDT only
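The descriptor lookups above all rest on the same selector arithmetic: bits 15:3 index an 8-byte table entry, bit 2 chooses LDT over GDT, and values 0x0–0x3 are null selectors. A hedged model of that arithmetic and the table limit check (helper names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

static int selector_index(uint16_t sel)    { return sel >> 3; }
static int selector_uses_ldt(uint16_t sel) { return (sel >> 2) & 1; }
static int selector_is_null(uint16_t sel)  { return !(sel & ~0x3); }

/* Address of the 8-byte descriptor, or 0 where the real code raises
 * #GP: the entry's last byte must lie within the table, i.e.
 * dt_size >= index * 8 + 7. */
static uint64_t descriptor_addr(uint64_t dt_base, uint32_t dt_size, uint16_t sel)
{
    uint32_t index = sel >> 3;

    if (dt_size < index * 8 + 7)
        return 0;
    return dt_base + index * 8;
}
```

For example, selector 0x23 names GDT entry 4 at RPL 3, and a 15-byte table cannot hold entry 2.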
[PATCH v3 02/30] KVM: x86 emulator: fix RCX access during rep emulation
During rep emulation the access width of RCX depends on the current address mode. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 0b70a36..4dce805 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1852,7 +1852,7 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) if (c->rep_prefix && (c->d & String)) { /* All REP prefixes have the same first termination condition */ - if (c->regs[VCPU_REGS_RCX] == 0) { + if (address_mask(c, c->regs[VCPU_REGS_RCX]) == 0) { kvm_rip_write(ctxt->vcpu, c->eip); goto done; } @@ -1876,7 +1876,7 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) goto done; } } - c->regs[VCPU_REGS_RCX]--; + register_address_increment(c, &c->regs[VCPU_REGS_RCX], -1); c->eip = kvm_rip_read(ctxt->vcpu); } -- 1.6.5
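The fix matters because in 16- or 32-bit address modes only the low `ad_bytes * 8` bits of RCX participate; the upper bits may hold stale data. A model of `address_mask()` under that assumption (standalone sketch, not the kernel function):

```c
#include <assert.h>
#include <stdint.h>

/* Mask a count/index register down to the current address size.
 * ad_bytes is 2, 4 or 8 for 16-, 32- and 64-bit address modes. */
static uint64_t address_mask(int ad_bytes, uint64_t reg)
{
    if (ad_bytes == 8)
        return reg;  /* avoid shifting by 64, which is undefined in C */
    return reg & ((1ULL << (ad_bytes * 8)) - 1);
}
```

So with CX == 0 but garbage in the high bits of RCX, a 16-bit-mode REP correctly terminates immediately instead of looping on the stale upper bits.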
[PATCH v3 16/30] KVM: x86 emulator: If LOCK prefix is used dest arg should be memory.
If the LOCK prefix is used, the destination operand must be memory; otherwise the instruction should generate #UD. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index b89a8f2..46a7ee3 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1842,7 +1842,7 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) } /* LOCK prefix is allowed only with some instructions */ - if (c->lock_prefix && !(c->d & Lock)) { + if (c->lock_prefix && (!(c->d & Lock) || c->dst.type != OP_MEM)) { kvm_queue_exception(ctxt->vcpu, UD_VECTOR); goto done; } -- 1.6.5
[PATCH v3 23/30] KVM: x86 emulator: add decoding of X,Y parameters from Intel SDM
Add decoding of X,Y parameters from Intel SDM which are used by string instruction to specify source and destination. Use this new decoding to implement movs, cmps, stos, lods in a generic way. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 125 +--- 1 files changed, 44 insertions(+), 81 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 55b8a8b..6ebd642 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -51,6 +51,7 @@ #define DstReg (21) /* Register operand. */ #define DstMem (31) /* Memory operand. */ #define DstAcc (41) /* Destination Accumulator */ +#define DstDI (51) /* Destination is in ES:(E)DI */ #define DstMask (71) /* Source operand type. */ #define SrcNone (04) /* No source operand. */ @@ -64,6 +65,7 @@ #define SrcOne (74) /* Implied '1' */ #define SrcImmUByte (84) /* 8-bit unsigned immediate operand. */ #define SrcImmU (94) /* Immediate operand, unsigned */ +#define SrcSI (0xa4) /* Source is in the DS:RSI */ #define SrcMask (0xf4) /* Generic ModRM decode. 
*/ #define ModRM (18) @@ -177,12 +179,12 @@ static u32 opcode_table[256] = { /* 0xA0 - 0xA7 */ ByteOp | DstReg | SrcMem | Mov | MemAbs, DstReg | SrcMem | Mov | MemAbs, ByteOp | DstMem | SrcReg | Mov | MemAbs, DstMem | SrcReg | Mov | MemAbs, - ByteOp | ImplicitOps | Mov | String, ImplicitOps | Mov | String, - ByteOp | ImplicitOps | String, ImplicitOps | String, + ByteOp | SrcSI | DstDI | Mov | String, SrcSI | DstDI | Mov | String, + ByteOp | SrcSI | DstDI | String, SrcSI | DstDI | String, /* 0xA8 - 0xAF */ - 0, 0, ByteOp | ImplicitOps | Mov | String, ImplicitOps | Mov | String, - ByteOp | ImplicitOps | Mov | String, ImplicitOps | Mov | String, - ByteOp | ImplicitOps | String, ImplicitOps | String, + 0, 0, ByteOp | DstDI | Mov | String, DstDI | Mov | String, + ByteOp | SrcSI | DstAcc | Mov | String, SrcSI | DstAcc | Mov | String, + ByteOp | DstDI | String, DstDI | String, /* 0xB0 - 0xB7 */ ByteOp | DstReg | SrcImm | Mov, ByteOp | DstReg | SrcImm | Mov, ByteOp | DstReg | SrcImm | Mov, ByteOp | DstReg | SrcImm | Mov, @@ -1145,6 +1147,14 @@ done_prefixes: c-src.bytes = 1; c-src.val = 1; break; + case SrcSI: + c-src.type = OP_MEM; + c-src.bytes = (c-d ByteOp) ? 1 : c-op_bytes; + c-src.ptr = (unsigned long *) + register_address(c, seg_override_base(ctxt, c), +c-regs[VCPU_REGS_RSI]); + c-src.val = 0; + break; } /* @@ -1230,6 +1240,14 @@ done_prefixes: } c-dst.orig_val = c-dst.val; break; + case DstDI: + c-dst.type = OP_MEM; + c-dst.bytes = (c-d ByteOp) ? 1 : c-op_bytes; + c-dst.ptr = (unsigned long *) + register_address(c, es_base(ctxt), +c-regs[VCPU_REGS_RDI]); + c-dst.val = 0; + break; } done: @@ -2388,6 +2406,16 @@ int emulator_task_switch(struct x86_emulate_ctxt *ctxt, return rc; } +static void string_addr_inc(struct x86_emulate_ctxt *ctxt, unsigned long base, + int reg, unsigned long **ptr) +{ + struct decode_cache *c = ctxt-decode; + int df = (ctxt-eflags EFLG_DF) ? 
-1 : 1; + + register_address_increment(c, c-regs[reg], df * c-src.bytes); + *ptr = (unsigned long *)register_address(c, base, c-regs[reg]); +} + int x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) { @@ -2750,89 +2778,16 @@ special_insn: c-dst.val = (unsigned long)c-regs[VCPU_REGS_RAX]; break; case 0xa4 ... 0xa5: /* movs */ - c-dst.type = OP_MEM; - c-dst.bytes = (c-d ByteOp) ? 1 : c-op_bytes; - c-dst.ptr = (unsigned long *)register_address(c, - es_base(ctxt), - c-regs[VCPU_REGS_RDI]); - rc = ops-read_emulated(register_address(c, - seg_override_base(ctxt, c), - c-regs[VCPU_REGS_RSI]), - c-dst.val, - c-dst.bytes, ctxt-vcpu); - if (rc != X86EMUL_CONTINUE) - goto done; - register_address_increment(c, c-regs[VCPU_REGS_RSI], - (ctxt-eflags EFLG_DF) ? -c-dst.bytes -
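The new `string_addr_inc()` helper encodes the one rule every string instruction shares: RSI/RDI advance by the operand size, forward when EFLAGS.DF is clear and backward when it is set. A self-contained model of just that step computation (illustrative, not the kernel helper):

```c
#include <assert.h>
#include <stdint.h>

#define EFLG_DF (1u << 10)  /* direction flag, bit 10 of EFLAGS */

/* Signed per-iteration displacement applied to RSI/RDI. */
static int64_t string_step(uint32_t eflags, unsigned op_bytes)
{
    int df = (eflags & EFLG_DF) ? -1 : 1;

    return (int64_t)df * (int64_t)op_bytes;
}
```

A dword MOVS with DF clear moves the pointers by +4 per iteration; a word MOVS after STD moves them by -2.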
[PATCH v3 07/30] KVM: Provide x86_emulate_ctxt callback to get current cpl
Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/kvm_emulate.h |1 + arch/x86/kvm/emulate.c | 15 --- arch/x86/kvm/x86.c |6 ++ 3 files changed, 15 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index 0c5caa4..b048fd2 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -110,6 +110,7 @@ struct x86_emulate_ops { struct kvm_vcpu *vcpu); ulong (*get_cr)(int cr, struct kvm_vcpu *vcpu); void (*set_cr)(int cr, ulong val, struct kvm_vcpu *vcpu); + int (*cpl)(struct kvm_vcpu *vcpu); }; /* Type, address-of, and value of an instruction's operand. */ diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 5e2fa61..8bd0557 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1257,7 +1257,7 @@ static int emulate_popf(struct x86_emulate_ctxt *ctxt, int rc; unsigned long val, change_mask; int iopl = (ctxt-eflags X86_EFLAGS_IOPL) IOPL_SHIFT; - int cpl = kvm_x86_ops-get_cpl(ctxt-vcpu); + int cpl = ops-cpl(ctxt-vcpu); rc = emulate_pop(ctxt, ops, val, len); if (rc != X86EMUL_CONTINUE) @@ -1758,7 +1758,8 @@ emulate_sysexit(struct x86_emulate_ctxt *ctxt) return X86EMUL_CONTINUE; } -static bool emulator_bad_iopl(struct x86_emulate_ctxt *ctxt) +static bool emulator_bad_iopl(struct x86_emulate_ctxt *ctxt, + struct x86_emulate_ops *ops) { int iopl; if (ctxt-mode == X86EMUL_MODE_REAL) @@ -1766,7 +1767,7 @@ static bool emulator_bad_iopl(struct x86_emulate_ctxt *ctxt) if (ctxt-mode == X86EMUL_MODE_VM86) return true; iopl = (ctxt-eflags X86_EFLAGS_IOPL) IOPL_SHIFT; - return kvm_x86_ops-get_cpl(ctxt-vcpu) iopl; + return ops-cpl(ctxt-vcpu) iopl; } static bool emulator_io_port_access_allowed(struct x86_emulate_ctxt *ctxt, @@ -1803,7 +1804,7 @@ static bool emulator_io_permited(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops, u16 port, u16 len) { - if (emulator_bad_iopl(ctxt)) + if (emulator_bad_iopl(ctxt, ops)) if 
(!emulator_io_port_access_allowed(ctxt, ops, port, len)) return false; return true; @@ -1842,7 +1843,7 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) } /* Privileged instruction can be executed only in CPL=0 */ - if ((c-d Priv) kvm_x86_ops-get_cpl(ctxt-vcpu)) { + if ((c-d Priv) ops-cpl(ctxt-vcpu)) { kvm_inject_gp(ctxt-vcpu, 0); goto done; } @@ -2378,7 +2379,7 @@ special_insn: c-dst.type = OP_NONE; /* Disable writeback. */ break; case 0xfa: /* cli */ - if (emulator_bad_iopl(ctxt)) + if (emulator_bad_iopl(ctxt, ops)) kvm_inject_gp(ctxt-vcpu, 0); else { ctxt-eflags = ~X86_EFLAGS_IF; @@ -2386,7 +2387,7 @@ special_insn: } break; case 0xfb: /* sti */ - if (emulator_bad_iopl(ctxt)) + if (emulator_bad_iopl(ctxt, ops)) kvm_inject_gp(ctxt-vcpu, 0); else { toggle_interruptibility(ctxt, KVM_X86_SHADOW_INT_STI); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b139334..3b6848e 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3442,6 +3442,11 @@ static void emulator_set_cr(int cr, unsigned long val, struct kvm_vcpu *vcpu) } } +static int emulator_get_cpl(struct kvm_vcpu *vcpu) +{ + return kvm_x86_ops-get_cpl(vcpu); +} + static struct x86_emulate_ops emulate_ops = { .read_std= kvm_read_guest_virt_system, .fetch = kvm_fetch_guest_virt, @@ -3450,6 +3455,7 @@ static struct x86_emulate_ops emulate_ops = { .cmpxchg_emulated= emulator_cmpxchg_emulated, .get_cr = emulator_get_cr, .set_cr = emulator_set_cr, + .cpl = emulator_get_cpl, }; static void cache_all_regs(struct kvm_vcpu *vcpu) -- 1.6.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
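The `emulator_bad_iopl()` logic this patch reroutes through the new `cpl` callback is easy to state: real mode always passes, vm86 mode always trips the check (falling back to the TSS permission bitmap), and protected mode fails when CPL exceeds IOPL. A standalone model under those assumptions (the mode constants here are illustrative, not the kernel's `X86EMUL_MODE_*` values):

```c
#include <assert.h>
#include <stdbool.h>

enum mode { MODE_REAL, MODE_VM86, MODE_PROT };

/* True when an IOPL-sensitive operation (IN/OUT, CLI/STI) must take
 * the slow path: fault, or consult the I/O permission bitmap. */
static bool bad_iopl(enum mode m, int cpl, int iopl)
{
    if (m == MODE_REAL)
        return false;
    if (m == MODE_VM86)
        return true;
    return cpl > iopl;
}
```

So userspace (CPL 3) with IOPL 0 fails the check, while kernel code (CPL 0) always passes it.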
[PATCH v3 13/30] KVM: x86 emulator: fix mov dr to inject #UD when needed.
If CR4.DE=1, access to registers DR4/DR5 causes #UD. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/emulate.c | 18 ++++++++++++------ 1 files changed, 12 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 836e97b..5afddcf 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2531,9 +2531,12 @@ twobyte_insn: c->dst.type = OP_NONE; /* no writeback */ break; case 0x21: /* mov from dr to reg */ - if (emulator_get_dr(ctxt, c->modrm_reg, &c->regs[c->modrm_rm])) - goto cannot_emulate; - rc = X86EMUL_CONTINUE; + if ((ops->get_cr(4, ctxt->vcpu) & X86_CR4_DE) && + (c->modrm_reg == 4 || c->modrm_reg == 5)) { + kvm_queue_exception(ctxt->vcpu, UD_VECTOR); + goto done; + } + emulator_get_dr(ctxt, c->modrm_reg, &c->regs[c->modrm_rm]); c->dst.type = OP_NONE; /* no writeback */ break; case 0x22: /* mov reg, cr */ @@ -2541,9 +2544,12 @@ twobyte_insn: c->dst.type = OP_NONE; break; case 0x23: /* mov from reg to dr */ - if (emulator_set_dr(ctxt, c->modrm_reg, &c->regs[c->modrm_rm])) - goto cannot_emulate; - rc = X86EMUL_CONTINUE; + if ((ops->get_cr(4, ctxt->vcpu) & X86_CR4_DE) && + (c->modrm_reg == 4 || c->modrm_reg == 5)) { + kvm_queue_exception(ctxt->vcpu, UD_VECTOR); + goto done; + } + emulator_set_dr(ctxt, c->modrm_reg, &c->regs[c->modrm_rm]); c->dst.type = OP_NONE; /* no writeback */ break; case 0x30: -- 1.6.5
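The predicate the patch adds is small enough to model directly: with CR4.DE set, `mov` to or from DR4/DR5 raises #UD instead of aliasing to DR6/DR7. A sketch of just that check (standalone, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X86_CR4_DE (1u << 3)  /* debugging-extensions bit of CR4 */

/* True when a DR access must raise #UD rather than proceed. */
static bool dr_access_ud(uint32_t cr4, int dr)
{
    return (cr4 & X86_CR4_DE) && (dr == 4 || dr == 5);
}
```

With DE clear, DR4/DR5 accesses instead alias to DR6/DR7 for legacy software, which is why only the DE=1 case faults.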
[PATCH v3 08/30] KVM: Provide current eip as part of emulator context.
Eliminate the need to call back into KVM to get it from the emulator.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/include/asm/kvm_emulate.h |    3 ++-
 arch/x86/kvm/emulate.c             |   12 ++++++------
 arch/x86/kvm/x86.c                 |    1 +
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index b048fd2..0765725 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -141,7 +141,7 @@ struct decode_cache {
	u8 seg_override;
	unsigned int d;
	unsigned long regs[NR_VCPU_REGS];
-	unsigned long eip, eip_orig;
+	unsigned long eip;
	/* modrm */
	u8 modrm;
	u8 modrm_mod;
@@ -160,6 +160,7 @@ struct x86_emulate_ctxt {
	struct kvm_vcpu *vcpu;

	unsigned long eflags;
+	unsigned long eip; /* eip before instruction emulation */
	/* Emulated execution mode, represented by an X86EMUL_MODE value. */
	int mode;
	u32 cs_base;
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 8bd0557..2c27aa4 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -667,7 +667,7 @@ static int do_insn_fetch(struct x86_emulate_ctxt *ctxt,
	int rc;

	/* x86 instructions are limited to 15 bytes. */
-	if (eip + size - ctxt->decode.eip_orig > 15)
+	if (eip + size - ctxt->eip > 15)
		return X86EMUL_UNHANDLEABLE;
	eip += ctxt->cs_base;
	while (size--) {
@@ -927,7 +927,7 @@ x86_decode_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops)
	/* Shadow copy of register state. Committed on successful emulation. */
	memset(c, 0, sizeof(struct decode_cache));
-	c->eip = c->eip_orig = kvm_rip_read(ctxt->vcpu);
+	c->eip = ctxt->eip;
	ctxt->cs_base = seg_base(ctxt, VCPU_SREG_CS);
	memcpy(c->regs, ctxt->vcpu->arch.regs, sizeof c->regs);
@@ -1878,7 +1878,7 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops)
			}
		}
		register_address_increment(c, &c->regs[VCPU_REGS_RCX], -1);
-		c->eip = kvm_rip_read(ctxt->vcpu);
+		c->eip = ctxt->eip;
	}

	if (c->src.type == OP_MEM) {
@@ -2447,7 +2447,7 @@ twobyte_insn:
			goto done;

		/* Let the processor re-execute the fixed hypercall */
-		c->eip = kvm_rip_read(ctxt->vcpu);
+		c->eip = ctxt->eip;
		/* Disable writeback. */
		c->dst.type = OP_NONE;
		break;
@@ -2551,7 +2551,7 @@ twobyte_insn:
			| ((u64)c->regs[VCPU_REGS_RDX] << 32);
		if (kvm_set_msr(ctxt->vcpu, c->regs[VCPU_REGS_RCX], msr_data)) {
			kvm_inject_gp(ctxt->vcpu, 0);
-			c->eip = kvm_rip_read(ctxt->vcpu);
+			c->eip = ctxt->eip;
		}
		rc = X86EMUL_CONTINUE;
		c->dst.type = OP_NONE;
@@ -2560,7 +2560,7 @@ twobyte_insn:
		/* rdmsr */
		if (kvm_get_msr(ctxt->vcpu, c->regs[VCPU_REGS_RCX], &msr_data)) {
			kvm_inject_gp(ctxt->vcpu, 0);
-			c->eip = kvm_rip_read(ctxt->vcpu);
+			c->eip = ctxt->eip;
		} else {
			c->regs[VCPU_REGS_RAX] = (u32)msr_data;
			c->regs[VCPU_REGS_RDX] = msr_data >> 32;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3b6848e..022d28e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3494,6 +3494,7 @@ int emulate_instruction(struct kvm_vcpu *vcpu,

	vcpu->arch.emulate_ctxt.vcpu = vcpu;
	vcpu->arch.emulate_ctxt.eflags = kvm_x86_ops->get_rflags(vcpu);
+	vcpu->arch.emulate_ctxt.eip = kvm_rip_read(vcpu);
	vcpu->arch.emulate_ctxt.mode =
		(!is_protmode(vcpu)) ? X86EMUL_MODE_REAL :
		(vcpu->arch.emulate_ctxt.eflags & X86_EFLAGS_VM)
--
1.6.5
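One consumer of the cached starting eip is the fetch-limit check in `do_insn_fetch()`: since x86 instructions can be at most 15 bytes, decode must stop once the bytes consumed from the instruction start would exceed that. A standalone sketch of that arithmetic (function name is illustrative, not the kernel's):

```c
#include <assert.h>

/* Sketch of the do_insn_fetch() length guard: with the starting eip
 * cached in the emulator context, a fetch of `size` bytes at `eip`
 * is rejected once the total decoded length would exceed the
 * architectural 15-byte maximum instruction length. */
static int fetch_would_overflow(unsigned long eip, int size,
				unsigned long start_eip)
{
	return eip + size - start_eip > 15;
}
```

Caching the value in `ctxt->eip` means the emulator core never has to call back into KVM (`kvm_rip_read()`) during decode, which is one step toward making emulate.c KVM-independent.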
[PATCH v3 20/30] KVM: x86 emulator: Use load_segment_descriptor() instead of kvm_load_segment_descriptor()
Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/emulate.c |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index db4776c..702bfff 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1508,7 +1508,7 @@ static int emulate_pop_sreg(struct x86_emulate_ctxt *ctxt,
	if (rc != X86EMUL_CONTINUE)
		return rc;

-	rc = kvm_load_segment_descriptor(ctxt->vcpu, (u16)selector, seg);
+	rc = load_segment_descriptor(ctxt, ops, (u16)selector, seg);
	return rc;
}
@@ -1683,7 +1683,7 @@ static int emulate_ret_far(struct x86_emulate_ctxt *ctxt,
	rc = emulate_pop(ctxt, ops, &cs, c->op_bytes);
	if (rc != X86EMUL_CONTINUE)
		return rc;
-	rc = kvm_load_segment_descriptor(ctxt->vcpu, (u16)cs, VCPU_SREG_CS);
+	rc = load_segment_descriptor(ctxt, ops, (u16)cs, VCPU_SREG_CS);
	return rc;
}
@@ -2717,7 +2717,7 @@ special_insn:
		if (c->modrm_reg == VCPU_SREG_SS)
			toggle_interruptibility(ctxt, KVM_X86_SHADOW_INT_MOV_SS);
-		rc = kvm_load_segment_descriptor(ctxt->vcpu, sel, c->modrm_reg);
+		rc = load_segment_descriptor(ctxt, ops, sel, c->modrm_reg);

		c->dst.type = OP_NONE;  /* Disable writeback. */
		break;
@@ -2892,8 +2892,8 @@ special_insn:
		goto jmp;
	case 0xea: /* jmp far */
	jump_far:
-		if (kvm_load_segment_descriptor(ctxt->vcpu, c->src2.val,
-						VCPU_SREG_CS))
+		if (load_segment_descriptor(ctxt, ops, c->src2.val,
+					    VCPU_SREG_CS))
			goto done;

		c->eip = c->src.val;
--
1.6.5
[PATCH v3 15/30] KVM: x86 emulator: do not call writeback if msr access fails.
Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/emulate.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 1393bf0..b89a8f2 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2563,7 +2563,7 @@ twobyte_insn:
			| ((u64)c->regs[VCPU_REGS_RDX] << 32);
		if (kvm_set_msr(ctxt->vcpu, c->regs[VCPU_REGS_RCX], msr_data)) {
			kvm_inject_gp(ctxt->vcpu, 0);
-			c->eip = ctxt->eip;
+			goto done;
		}
		rc = X86EMUL_CONTINUE;
		c->dst.type = OP_NONE;
@@ -2572,7 +2572,7 @@ twobyte_insn:
		/* rdmsr */
		if (kvm_get_msr(ctxt->vcpu, c->regs[VCPU_REGS_RCX], &msr_data)) {
			kvm_inject_gp(ctxt->vcpu, 0);
-			c->eip = ctxt->eip;
+			goto done;
		} else {
			c->regs[VCPU_REGS_RAX] = (u32)msr_data;
			c->regs[VCPU_REGS_RDX] = msr_data >> 32;
--
1.6.5
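The `goto done` replaces the old eip fixup so that when the MSR access injects #GP, control skips the writeback stage entirely — the shadowed register state of a faulting instruction must never be committed. A toy model of that rule (names and structure are illustrative, not the kernel's):

```c
#include <assert.h>

/* Toy model of the writeback rule the patch enforces: emulating
 * rdmsr either commits the MSR value to RAX (success) or leaves RAX
 * untouched because the fault path bypasses writeback ("goto done"). */
static int rax_after_rdmsr(int msr_access_ok, int msr_val, int old_rax)
{
	if (!msr_access_ok)
		return old_rax;	/* #GP injected: writeback skipped */
	return msr_val;		/* writeback commits shadow copy */
}
```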
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On Mon, Mar 15, 2010 at 02:03:11PM +0100, Joerg Roedel wrote:
> On Mon, Mar 15, 2010 at 05:53:13AM -0700, Muli Ben-Yehuda wrote:
> > On Mon, Mar 15, 2010 at 02:25:41PM +0200, Avi Kivity wrote:
> > > On 03/10/2010 11:30 PM, Luiz Capitulino wrote:
> > > > Hi there,
> > > > Our wiki page for the Summer of Code 2010 is doing quite well:
> > > > http://wiki.qemu.org/Google_Summer_of_Code_2010
> > > I will add another project - iommu emulation. Could be very
> > > useful for doing device assignment to nested guests, which could
> > > make testing a lot easier.
> > Our experiments show that nested device assignment is pretty much
> > required for I/O performance in nested scenarios.
> Really? I did a small test with virtio-blk in a nested guest (disk
> read with dd, so not a real benchmark) and got a reasonable
> read-performance of around 25MB/s from the disk in the l2-guest.

Netperf running in L1 with direct access: ~950 Mbps throughput with 25%
CPU utilization. Netperf running in L2 with virtio between L2 and L1 and
direct assignment between L1 and L0: roughly the same throughput, but
over 90% CPU utilization! Now extrapolate to 10GbE.

Cheers, Muli
Re: [PATCH v3 00/30] emulator cleanup
On 03/15/2010 04:38 PM, Gleb Natapov wrote:
> This is the first series of patches that tries to clean up the
> emulator code. This is a mix of bug fixes and moving code that does
> emulation from x86.c to emulator.c while making it KVM independent.
> The status of the patches: works for me. The realtime.flat test now
> also passes where it failed before.

Reviewed-by: Avi Kivity a...@redhat.com

-- 
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Ideas wiki for GSoC 2010
On 03/15/2010 03:23 PM, Anthony Liguori wrote:
> On 03/15/2010 08:11 AM, Avi Kivity wrote:
> > On 03/15/2010 03:03 PM, Joerg Roedel wrote:
> > > > > I will add another project - iommu emulation. Could be very
> > > > > useful for doing device assignment to nested guests, which
> > > > > could make testing a lot easier.
> > > > Our experiments show that nested device assignment is pretty
> > > > much required for I/O performance in nested scenarios.
> > > Really? I did a small test with virtio-blk in a nested guest
> > > (disk read with dd, so not a real benchmark) and got a reasonable
> > > read-performance of around 25MB/s from the disk in the l2-guest.
> > Your guest wasn't doing a zillion VMREADs and VMWRITEs every exit.
> > I plan to reduce VMREAD/VMWRITE overhead for kvm, but not much we
> > can do for other guests.
> VMREAD/VMWRITEs are generally optimized by hypervisors as they tend
> to be costly. KVM is a bit unusual in terms of how many times the
> instructions are executed per exit.

Do you know offhand of any unnecessary read/writes? There's
update_cr8_intercept(), but on normal exits, I don't see what else we
can remove.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH v3 16/30] KVM: x86 emulator: If LOCK prefix is used dest arg should be memory.
Gleb Natapov wrote:
> If LOCK prefix is used dest arg should be memory, otherwise
> instruction should generate #UD.

Well, there is one exception: there is an AMD-specific "lock mov cr0"
== "mov cr8" equivalence, where no memory is involved (and we intercept
this). I am not sure if anyone actually uses this code sequence, but it
is definitely legal.

Regards,
Andre.

> Signed-off-by: Gleb Natapov g...@redhat.com
> ---
>  arch/x86/kvm/emulate.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index b89a8f2..46a7ee3 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -1842,7 +1842,7 @@ x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops)
>  	}
>
>  	/* LOCK prefix is allowed only with some instructions */
> -	if (c->lock_prefix && !(c->d & Lock)) {
> +	if (c->lock_prefix && (!(c->d & Lock) || c->dst.type != OP_MEM)) {
>  		kvm_queue_exception(ctxt->vcpu, UD_VECTOR);
>  		goto done;
>  	}

-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712
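The decode rule in the patch above can be stated on its own: a LOCK prefix is legal only on a lockable instruction whose destination operand is memory; anything else is #UD. A standalone sketch under those assumptions — the `Lock` flag and enum are illustrative stand-ins for the kernel's decode table, and the AMD "lock mov cr0" alias Andre mentions is deliberately not modeled:

```c
#include <assert.h>
#include <stdbool.h>

#define Lock (1 << 8)	/* hypothetical decode flag: instruction is lockable */
enum op_type { OP_NONE, OP_REG, OP_MEM };

/* Sketch of the #UD rule: LOCK requires both a lockable opcode
 * (decode flag set) and a memory destination. */
static bool lock_prefix_causes_ud(bool lock_prefix, unsigned int d,
				  enum op_type dst)
{
	return lock_prefix && (!(d & Lock) || dst != OP_MEM);
}
```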