Re: ESXi, KVM or Xen?

2010-07-03 Thread Jan Kiszka
Peter Chacko wrote:
 Did you consider VirtualBox? Of course VMware is the market leader
 NOW, but if you plan to invest in future open-source platforms, you
 should choose KVM (which is now Linux-native) or Xen (it's unlikely to
 be killed). KVM still lags behind in terms of enterprise-class
 features, but count on it for future investment. So I think you should
 just start off with Xen or VirtualBox, with a migration plan to KVM in
 the future.

VBox surely has its strengths on non-Linux hosts and on hosts without
virtualization acceleration. But I would carefully evaluate its
performance under relevant load:

http://permalink.gmane.org/gmane.comp.emulators.virtualbox.devel/2796


I recently learned from someone doing Xen consulting that it's still
troublesome to get it running on non-certified hardware. This may have
an impact on the hardware choice.


For hosting Linux-on-Linux, I would also consider containers, e.g. lxc
or OpenVZ. Performance-wise, that's generally the most efficient approach.

Jan





Re: ESXi, KVM or Xen?

2010-07-03 Thread Jan Kiszka
Emmanuel Noobadmin wrote:
 if by 'put storage on the network' you mean using a block-level
 protocol (iSCSI, FCoE, AoE, NBD, DRBD...), then you should by all
 means initiate on the host OS (Dom0 in Xen) and present to the VM as
 if it were local storage.  it's far faster and more stable that way.
 in that case, storage wouldn't add to the VM's network load, which
 might or might not make those (old) scenarios irrelevant
 
 Thanks for that tip :)
 
 in any case, yes; Xen does have more maturity on big hosting
 deployments.  but most third parties are betting on KVM for the
 future.  the biggest examples are Redhat, Canonical, libvirt (which is
 sponsored by Redhat), and Eucalyptus (which reimplements Amazon's EC2
 with either Xen or KVM, focusing on the latter), so the gap is closing.
 
 This is what I figured too, hence not a straightforward choice. I
 don't need top-notch performance for most of the servers targeted for
 virtualization. Loads are usually low except on the mail servers, and
 often only when there's a mail loop problem. So if the performance hit
 under worst-case conditions is only 10~20%, it's something I can live
 with. Especially since the intended VM servers (i5/i7) will be
 significantly faster than the current ones (P4/C2D) I'm basing my
 estimates on.
 
 But I need to do my due diligence and have justification ready to
 show that current performance/reliability/security is at least good
 enough, instead of "I like where KVM is going and think it'll be the
 platform of choice in the years to come." Bosses and clients tend to
 frown on that kind of thing :D

How much customization will you apply to your virtualization
infrastructure? If you can manage to do the majority via a proper
hypervisor abstraction, specifically libvirt, you will actually have
quite some freedom in choosing the platform. If not, I would very
carefully look at the management interfaces of all those hypervisors:
how much they conform to standard administration procedures, and what
specialties they require, both on the host and the guest side.
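To make the libvirt point concrete: a guest is described once in a
hypervisor-neutral XML document, and ideally only the domain `type`
attribute (plus the emulator/driver bits) ties it to KVM or Xen. All
names and paths below are made up for the example:

```xml
<domain type='kvm'>  <!-- 'xen' here, with few other changes, targets Xen -->
  <name>mailsrv1</name>
  <memory>1048576</memory>  <!-- KiB -->
  <vcpu>2</vcpu>
  <os>
    <type arch='x86_64'>hvm</type>
  </os>
  <devices>
    <disk type='file' device='disk'>
      <source file='/var/lib/libvirt/images/mailsrv1.img'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <interface type='bridge'>
      <source bridge='br0'/>
    </interface>
  </devices>
</domain>
```

Tooling then talks to virsh or the libvirt API through a connection URI
(e.g. qemu:///system for KVM, or a Xen URI) rather than to
hypervisor-specific commands, which is what preserves the freedom to
switch platforms later.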

 
 and finally, even if right now the 'best' deployment on Xen definitely
 outperforms KVM by a measurable margin; when things are not optimal
 Xen degrades a lot quicker than KVM.  in part because the Xen core
 scheduler is far from the maturity of Linux kernel's scheduler.
 
 The problem is finding stats to back that up if my clients/boss ask
 about it. So far most of the available comparisons/data seem rather
 dated, mostly 2007 and 2008. The most professional looking one, in
 that PDF I linked to, seems to indicate the opposite, i.e. KVM
 degrades faster when things go south. That graph with the Apache
 problem is especially damning because our primary product/services are
 web-based applications, infrastructure being a supplement
 service/product.
 
 In addition, I remember reading a thread on this list where an Intel
 developer pointed out that the Linux scheduler causes a performance hit,
 about 8x~10x slower, when the physical processors are heavily loaded,
 there are more vCPUs than pCPUs, and it puts the same VM's vCPUs
 onto the same physical core.

That's only relevant if you run SMP guests on over-committed hosts. What
will your guests look like?

 
 So I am a little worried, since 8~10x is a massive difference, esp. if
 some process goes awry, starts chewing up processor cycles, and the VM
 starts to lag because of this. A vicious cycle that makes it even
 harder to fix things without killing the VM.
 
 Of course if I could honestly tell my clients/boss "This, this and
 this are rare situations we will almost never encounter...", then it's
 a different thing. Hence asking about this here :)

All solutions have weak points. The point is indeed to estimate whether
your use cases will trigger them. Even then, the question remains whether
some weakness is inherent to the solution's design or likely to be fixed
before you actually hit it. And weaknesses are not only
performance aspects.

Jan





[PATCH] KVM: VMX: fix tlb flush with invalid root

2010-07-03 Thread Xiao Guangrong
Commit 341d9b535b6c simplified the reload logic on guest-mode entry; it
avoids an unnecessary sync-root when KVM_REQ_MMU_RELOAD and
KVM_REQ_MMU_SYNC are both set.

But it causes an issue: when we handle KVM_REQ_TLB_FLUSH, the
root may be invalid. It was triggered during my test:

Kernel BUG at a00212b8 [verbose debug info unavailable]
..

 [810f5caf] ? fget_light+0x111/0x28e
 [81103963] sys_ioctl+0x47/0x6a
 [81002c1b] system_call_fastpath+0x16/0x1b
Code: f0 eb 21 f7 c2 00 00 00 04 74 22 48 8d 45 f0 48 c7 45 f0 00 00 00 00 48 
c7 45 f8 00 00 00 00 b9 02 00 00 00 66 0f 38 80 08 77 02 0f 0b c9 c3 55 48 89 
e5 0f 1f 44 00 00 ba 00 68 00 00 48 8b 8f 
RIP  [a00212b8] vmx_flush_tlb+0xdf/0xe3 [kvm_intel]
 RSP 8800be269ca8

Fix by returning directly if the root is not ready.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/include/asm/kvm_host.h |2 ++
 arch/x86/kvm/mmu.c  |2 --
 arch/x86/kvm/vmx.c  |5 -
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2bda624..8f522ec 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -40,6 +40,8 @@
  0xFF00ULL)
 
 #define INVALID_PAGE (~(hpa_t)0)
+#define VALID_PAGE(x) ((x) != INVALID_PAGE)
+
 #define UNMAPPED_GVA (~(gpa_t)0)
 
 /* KVM Hugepage definitions for x86 */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a0c5c31..399ddb0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -92,8 +92,6 @@ module_param(oos_shadow, bool, 0644);
 #define PT_FIRST_AVAIL_BITS_SHIFT 9
 #define PT64_SECOND_AVAIL_BITS_SHIFT 52
 
-#define VALID_PAGE(x) ((x) != INVALID_PAGE)
-
 #define PT64_LEVEL_BITS 9
 
 #define PT64_LEVEL_SHIFT(level) \
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 806ab12..ebaaeaf 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1831,8 +1831,11 @@ static void exit_lmode(struct kvm_vcpu *vcpu)
 static void vmx_flush_tlb(struct kvm_vcpu *vcpu)
 {
vpid_sync_context(to_vmx(vcpu));
-   if (enable_ept)
+   if (enable_ept) {
+   if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
+   return;
ept_sync_context(construct_eptp(vcpu->arch.mmu.root_hpa));
+   }
 }
 
 static void vmx_decache_cr0_guest_bits(struct kvm_vcpu *vcpu)
-- 
1.6.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/6] KVM: MMU: prefetch ptes when intercepted guest #PF

2010-07-03 Thread Xiao Guangrong


Marcelo Tosatti wrote:

 +
 +if (is_rsvd_bits_set(vcpu, gpte, PT_PAGE_TABLE_LEVEL))
 +break;
 
 BTW, doesnt sync_page also lack reserved bit checking? (unrelated to
 this patch).
 

I think it doesn't, since if EPT is enabled no unsync pages exist, so the
sync-page path can't be triggered. :-)


Re: [PATCH v2] KVM: IOAPIC: only access APIC registers one dword at a time

2010-07-03 Thread Xiao Guangrong


Marcelo Tosatti wrote:

 +if (len != 4) {
 +printk(KERN_WARNING "ioapic: wrong length %d\n", len);
 +return 0;
 +}
 +
 
 Just remove the printks please, as guests can flood the host's dmesg.
 

OK, please review this one, thanks, Marcelo!

Subject: [PATCH v3] KVM: IOAPIC: only access APIC registers one dword at a time

The IOAPIC spec says:
When accessing these registers, accesses must be done one dword at a time.
For example, software should never access byte 2 from the Data register before
accessing bytes 0 and 1. The hardware will not attempt to recover from a bad
programming model in this case.

So this patch removes other access widths.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 virt/kvm/ioapic.c |   22 ++
 1 files changed, 6 insertions(+), 16 deletions(-)

diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index 1149c60..aed91fd 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -288,6 +288,9 @@ static int ioapic_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
ioapic_debug("addr %lx\n", (unsigned long)addr);
ASSERT(!(addr & 0xf));  /* check alignment */
 
+   if (len != 4)
+   return 0;
+
addr &= 0xff;
spin_lock(&ioapic->lock);
switch (addr) {
@@ -305,18 +308,7 @@ static int ioapic_mmio_read(struct kvm_io_device *this, 
gpa_t addr, int len,
}
spin_unlock(&ioapic->lock);
 
-   switch (len) {
-   case 8:
-   *(u64 *) val = result;
-   break;
-   case 1:
-   case 2:
-   case 4:
-   memcpy(val, (char *)&result, len);
-   break;
-   default:
-   printk(KERN_WARNING "ioapic: wrong length %d\n", len);
-   }
+   *(u32 *) val = result;
return 0;
 }
 
@@ -332,12 +324,10 @@ static int ioapic_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
ioapic_debug("ioapic_mmio_write addr=%p len=%d val=%p\n",
 (void*)addr, len, val);
ASSERT(!(addr & 0xf));  /* check alignment */
 
-   if (len == 4 || len == 8)
+   if (len == 4)
data = *(u32 *) val;
-   else {
-   printk(KERN_WARNING "ioapic: Unsupported size %d\n", len);
+   else
return 0;
-   }
 
addr &= 0xff;
spin_lock(&ioapic->lock);
-- 
1.6.1.2



Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-07-03 Thread Herbert Xu
On Mon, Jun 28, 2010 at 05:56:07PM +0800, Xin, Xiaohui wrote:

 OK, if I understand you correctly then I don't think we have a
 problem.  With your current patch-set you have exactly the same
 situation when the skb->data is reallocated as a kernel buffer.
 
 When will skb->data be reallocated? Could you point me to the code path?

Anything that calls pskb_expand_head.

 This is OK because as you correctly argue, it is a rare situation.
 
 With my proposal you will need to get this extra external buffer
 in even less cases, because you'd only need to do it if the skb
 head grows, which only happens if it becomes encapsulated.
 So let me explain it in a bit more detail:
 
 Our packet starts out as a purely non-linear skb, i.e., skb->head
 contains nothing and all the page frags come from the guest.
 
 During host processing we may pull data into skb->head but the
 first frag will remain unless we pull all of it.  If we did do
 that then you would have a free external buffer anyway.
 
 Now in the common case the header may be modified or pulled, but
 it very rarely grows.  So you can just copy the header back into
 the first frag just before we give it to the guest.
 
 Since the data is still there, recomputing the page offset and size is OK,
 right?

Right, you just move the page offset back and increase the size.
However, to do this safely we'd need to have a way of knowing
whether the skb head has been modified.

It may well turn out to be just as effective to do something like

if (memcmp(skb->data, page frag head, skb_headlen))
memcpy(page frag head, skb->data, skb_headlen)

As the page frag head should be in cache since it would've been
used to populate skb->data.

It'd be good to run some benchmarks with this to see whether
adding a bit to sk_buff to avoid the memcmp is worth it or not.

Cheers,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH v2] KVM: IOAPIC: only access APIC registers one dword at a time

2010-07-03 Thread Avi Kivity

On 07/02/2010 11:00 AM, Xiao Guangrong wrote:

The IOAPIC spec says:

When accessing these registers, accesses must be done one dword at a time.
For example, software should never access byte 2 from the Data register before
accessing bytes 0 and 1. The hardware will not attempt to recover from a bad
programming model in this case.

So this patch removes other access widths.

   


The ioapic code also implements the ia64 iosapic.  I'm guessing that 
does support 64-bit accesses.  Please check the iosapic documentation.


There might be guests that use incorrect access despite the 
documentation; if real hardware supports it, it should work.  So we need 
to start with just a warning, and allow the access.  Later we can drop 
the invalid access.



@@ -288,6 +288,11 @@ static int ioapic_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
ioapic_debug("addr %lx\n", (unsigned long)addr);
ASSERT(!(addr & 0xf));  /* check alignment */

+   if (len != 4) {
+   printk(KERN_WARNING "ioapic: wrong length %d\n", len);
+   return 0;
+   }
+
   


Guest triggered, so needs to be rate limited.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: qemu-kvm.git unittest failures

2010-07-03 Thread Avi Kivity

On 07/03/2010 01:44 AM, Marcelo Tosatti wrote:



Config entry:

[access]
file = access.flat

Massive log, compressed and attached.
 

run
test pde.p user: FAIL: error code 5 expected 4
test pte.rw pde.p user: FAIL: error code 5 expected 4
test pte.user pde.p user: FAIL: error code 5 expected 4
test pte.rw pte.user pde.p user: FAIL: error code 5 expected 4
test pte.a pde.p user: FAIL: error code 5 expected 4
test pte.rw pte.a pde.p user: FAIL: error code 5 expected 4
test pte.user pte.a pde.p user: FAIL: error code 5 expected 4

P flag (bit 0).
This flag is 0 if there is no valid translation for the linear address
because the P
flag was 0 in one of the paging-structure entries used to translate that
address.

Avi, a walk ignoring access permissions should be done to properly set
the P flag on error code. Does anybody care?
   


Probably not, but best to be compatible with real hardware.

We could simply continue the existing walk, just set a flag indicating 
that it can't succeed.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: random crash in post_kvm_run()

2010-07-03 Thread Avi Kivity

On 07/02/2010 10:08 PM, BuraphaLinux Server wrote:

Hello,

I tried my best to do the bisection, and the result after many kernels was:

Bisecting: 0 revisions left to test after this (roughly 0 steps)
[3ce672d48400e0112fec7a3cb6bb2120493c6e11] KVM: SVM: init_vmcb():
remove redundant save-cr0 initialization

So what do I do next?



You still need to test that last kernel.  "0 revisions left" means that
after this test (and the last "git bisect good" or "git bisect bad")
you'll have the answer.



I did not 'make mrproper' between each build -
should I have done that?
   


No need.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH v4 5/6] KVM: MMU: combine guest pte read between walk and pte prefetch

2010-07-03 Thread Xiao Guangrong


Marcelo Tosatti wrote:
 On Thu, Jul 01, 2010 at 09:55:56PM +0800, Xiao Guangrong wrote:
 Combine guest pte read between guest pte walk and pte prefetch

 Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
 ---
  arch/x86/kvm/paging_tmpl.h |   48 
 ++-
  1 files changed, 33 insertions(+), 15 deletions(-)
 
 Can't do this, it can miss invlpg:
 
 vcpu0                      vcpu1
 read guest ptes
                            modify guest pte
                            invlpg
 instantiate stale
 guest pte

Ah, oops, sorry :-(

 
 See how the pte is reread inside fetch with mmu_lock held.
 

It looks like something is broken in 'fetch' functions, this patch will
fix it.

Subject: [PATCH] KVM: MMU: fix last level broken in FNAME(fetch)

We read the guest page table outside of 'mmu_lock'; sometimes the host
mapping becomes inconsistent. Consider this case:

VCPU0:  VCPU1

Read guest mapping, assume the mapping is:
GLV3 -> GLV2 -> GLV1 -> GFNA,
And in the host, the corresponding mapping is
HLV3 -> HLV2 -> HLV1 (P=0)

   Write GLV1 and cause the
   mapping to point to GFNB
   (may occur in the pte_write or
   invlpg path)

Mapping GLV1 to GFNA

This issue only occurs in the last-level indirect mapping: if a middle-level
mapping is changed, the mapping will be zapped and then detected
in the FNAME(fetch) path, but when it maps the last level, it is not checked.

Fix by also checking the last level.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/paging_tmpl.h |   32 +---
 1 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 3350c02..e617e93 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -291,6 +291,20 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 gpte_to_gfn(gpte), pfn, true, true);
 }
 
+static bool FNAME(check_level_mapping)(struct kvm_vcpu *vcpu,
+  struct guest_walker *gw, int level)
+{
+   pt_element_t curr_pte;
+   int r;
+
+   r = kvm_read_guest_atomic(vcpu->kvm, gw->pte_gpa[level - 1],
+&curr_pte, sizeof(curr_pte));
+   if (r || curr_pte != gw->ptes[level - 1])
+   return false;
+
+   return true;
+}
+
 /*
  * Fetch a shadow pte for a specific level in the paging hierarchy.
  */
@@ -304,11 +318,9 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
u64 spte, *sptep = NULL;
int direct;
gfn_t table_gfn;
-   int r;
int level;
-   bool dirty = is_dirty_gpte(gw->ptes[gw->level - 1]);
+   bool dirty = is_dirty_gpte(gw->ptes[gw->level - 1]), check = true;
unsigned direct_access;
-   pt_element_t curr_pte;
struct kvm_shadow_walk_iterator iterator;
 
if (!is_present_gpte(gw->ptes[gw->level - 1]))
@@ -322,6 +334,12 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
level = iterator.level;
sptep = iterator.sptep;
if (iterator.level == hlevel) {
+   if (check && level == gw->level &&
+ !FNAME(check_level_mapping)(vcpu, gw, hlevel)) {
+   kvm_release_pfn_clean(pfn);
+   break;
+   }
+
mmu_set_spte(vcpu, sptep, access,
 gw->pte_access & access,
 user_fault, write_fault,
@@ -376,10 +394,10 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
sp = kvm_mmu_get_page(vcpu, table_gfn, addr, level-1,
   direct, access, sptep);
if (!direct) {
-   r = kvm_read_guest_atomic(vcpu->kvm,
- gw->pte_gpa[level - 2],
- &curr_pte, sizeof(curr_pte));
-   if (r || curr_pte != gw->ptes[level - 2]) {
+   if (hlevel == level - 1)
+   check = false;
+
+   if (!FNAME(check_level_mapping)(vcpu, gw, level - 1)) {
kvm_mmu_put_page(sp, sptep);
kvm_release_pfn_clean(pfn);
sptep = NULL;
-- 
1.6.1.2





Re: [PATCH v4 5/6] KVM: MMU: combine guest pte read between walk and pte prefetch

2010-07-03 Thread Avi Kivity

On 07/02/2010 08:03 PM, Marcelo Tosatti wrote:

On Thu, Jul 01, 2010 at 09:55:56PM +0800, Xiao Guangrong wrote:
   

Combine guest pte read between guest pte walk and pte prefetch

Signed-off-by: Xiao Guangrongxiaoguangr...@cn.fujitsu.com
---
  arch/x86/kvm/paging_tmpl.h |   48 ++-
  1 files changed, 33 insertions(+), 15 deletions(-)
 

Can't do this, it can miss invlpg:

vcpu0                      vcpu1
read guest ptes
                           modify guest pte
                           invlpg
instantiate stale
guest pte

See how the pte is reread inside fetch with mmu_lock held.
   


Note, this is fine if the pte is unsync, since vcpu0 will soon invlpg 
it.  It's only broken for sync ptes.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



[no subject]

2010-07-03 Thread 黄煜
subscribe kvm


Re: [PATCH v4 5/6] KVM: MMU: combine guest pte read between walk and pte prefetch

2010-07-03 Thread Avi Kivity

On 07/03/2010 01:31 PM, Xiao Guangrong wrote:



See how the pte is reread inside fetch with mmu_lock held.

 

It looks like something is broken in 'fetch' functions, this patch will
fix it.

Subject: [PATCH] KVM: MMU: fix last level broken in FNAME(fetch)

We read the guest page table outside of 'mmu_lock'; sometimes the host
mapping becomes inconsistent. Consider this case:

VCPU0:  VCPU1

Read guest mapping, assume the mapping is:
GLV3 -> GLV2 -> GLV1 -> GFNA,
And in the host, the corresponding mapping is
HLV3 -> HLV2 -> HLV1 (P=0)

Write GLV1 and cause the
mapping to point to GFNB
(may occur in the pte_write or
   invlpg path)

Mapping GLV1 to GFNA

This issue only occurs in the last-level indirect mapping: if a middle-level
mapping is changed, the mapping will be zapped and then detected
in the FNAME(fetch) path, but when it maps the last level, it is not checked.

Fix by also checking the last level.

   


I don't really see what is fixed.  We already check the gpte.  What's 
special about the new scenario?



@@ -322,6 +334,12 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
level = iterator.level;
sptep = iterator.sptep;
if (iterator.level == hlevel) {
+   if (check && level == gw->level &&
+ !FNAME(check_level_mapping)(vcpu, gw, hlevel)) {
+   kvm_release_pfn_clean(pfn);
+   break;
+   }
+
   


Now we check here...


mmu_set_spte(vcpu, sptep, access,
 gw->pte_access & access,
 user_fault, write_fault,
@@ -376,10 +394,10 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
sp = kvm_mmu_get_page(vcpu, table_gfn, addr, level-1,
   direct, access, sptep);
if (!direct) {
-   r = kvm_read_guest_atomic(vcpu->kvm,
- gw->pte_gpa[level - 2],
-   &curr_pte, sizeof(curr_pte));
-   if (r || curr_pte != gw->ptes[level - 2]) {
+   if (hlevel == level - 1)
+   check = false;
+
+   if (!FNAME(check_level_mapping)(vcpu, gw, level - 1)) {
   


... and here?  Why?


(looking at the code, we have a call to kvm_host_page_size() on every 
page fault, that takes mmap_sem... that's got to impact scaling)


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH v4 5/6] KVM: MMU: combine guest pte read between walk and pte prefetch

2010-07-03 Thread Xiao Guangrong


Avi Kivity wrote:
 On 07/03/2010 01:31 PM, Xiao Guangrong wrote:

 See how the pte is reread inside fetch with mmu_lock held.

  
 It looks like something is broken in 'fetch' functions, this patch will
 fix it.

 Subject: [PATCH] KVM: MMU: fix last level broken in FNAME(fetch)

 We read the guest page table outside of 'mmu_lock'; sometimes the host
 mapping becomes inconsistent. Consider this case:

 VCPU0:  VCPU1

 Read guest mapping, assume the mapping is:
 GLV3 -> GLV2 -> GLV1 -> GFNA,
 And in the host, the corresponding mapping is
 HLV3 -> HLV2 -> HLV1 (P=0)

                         Write GLV1 and cause the
                         mapping to point to GFNB
                         (may occur in the pte_write
                         or invlpg path)

 Mapping GLV1 to GFNA

 This issue only occurs in the last-level indirect mapping: if a middle-level
 mapping is changed, the mapping will be zapped and then detected
 in the FNAME(fetch) path, but when it maps the last level, it is not checked.

 Fix by also checking the last level.


 
 I don't really see what is fixed.  We already check the gpte.  What's
 special about the new scenario?
 

What I mean is: when we map the last level, we directly set the pfn, but the
pfn was obtained by walk_addr; by this time, the guest mapping may have changed.

What does 'We already check the gpte' mean? I think I missed something :-(



Re: [PATCH v4 5/6] KVM: MMU: combine guest pte read between walk and pte prefetch

2010-07-03 Thread Avi Kivity

On 07/03/2010 03:16 PM, Xiao Guangrong wrote:


Avi Kivity wrote:
   

On 07/03/2010 01:31 PM, Xiao Guangrong wrote:
 
   

See how the pte is reread inside fetch with mmu_lock held.


 

It looks like something is broken in 'fetch' functions, this patch will
fix it.

Subject: [PATCH] KVM: MMU: fix last level broken in FNAME(fetch)

We read the guest page table outside of 'mmu_lock'; sometimes the host
mapping becomes inconsistent. Consider this case:

VCPU0:  VCPU1

Read guest mapping, assume the mapping is:
GLV3 -> GLV2 -> GLV1 -> GFNA,
And in the host, the corresponding mapping is
HLV3 -> HLV2 -> HLV1 (P=0)

                        Write GLV1 and cause the
                        mapping to point to GFNB
                        (may occur in the pte_write
                        or invlpg path)

Mapping GLV1 to GFNA

This issue only occurs in the last-level indirect mapping: if a middle-level
mapping is changed, the mapping will be zapped and then detected
in the FNAME(fetch) path, but when it maps the last level, it is not checked.

Fix by also checking the last level.


   

I don't really see what is fixed.  We already check the gpte.  What's
special about the new scenario?

 

I mean is: while we map the last level, we will directly set to the pfn but
the pfn is got by walk_addr, at this time, the guest mapping may be changed.

What is the 'We already check the gpte' mean? i think i miss something :-(
   


if (!direct) {
r = kvm_read_guest_atomic(vcpu->kvm,
  gw->pte_gpa[level - 2],
&curr_pte, sizeof(curr_pte));
if (r || curr_pte != gw->ptes[level - 2]) {
kvm_mmu_put_page(shadow_page, sptep);
kvm_release_pfn_clean(pfn);
sptep = NULL;
break;
}
}

the code you moved... under what scenario is it not sufficient?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.





Re: [PATCH v4 5/6] KVM: MMU: combine guest pte read between walk and pte prefetch

2010-07-03 Thread Avi Kivity

On 07/03/2010 03:31 PM, Xiao Guangrong wrote:


Avi Kivity wrote:

   


   

 if (!direct) {
 r = kvm_read_guest_atomic(vcpu->kvm,
   gw->pte_gpa[level - 2],
&curr_pte, sizeof(curr_pte));
 if (r || curr_pte != gw->ptes[level - 2]) {
 kvm_mmu_put_page(shadow_page, sptep);
 kvm_release_pfn_clean(pfn);
 sptep = NULL;
 break;
 }
 }

the code you moved... under what scenario is it not sufficient?

 

I didn't move that code, I just used a common function instead,
FNAME(check_level_mapping)(); it does the same work.

And this check is not sufficient, since it only runs if the
mapping is zapped or does not exist; in other words, only when this
condition is broken:
is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)

but if the middle level is present and is not a large mapping,
this check is skipped.
   



Well, in the description, it looked like everything was using small 
pages (in kvm, level=1 means PTE level, we need to change this one 
day).  Please describe again and say exactly when the guest or host uses 
large pages.



--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH v4 5/6] KVM: MMU: combine guest pte read between walk and pte prefetch

2010-07-03 Thread Avi Kivity

On 07/03/2010 03:44 PM, Avi Kivity wrote:

On 07/03/2010 03:31 PM, Xiao Guangrong wrote:


Avi Kivity wrote:




 if (!direct) {
 r = kvm_read_guest_atomic(vcpu->kvm,
   gw->pte_gpa[level - 2],
&curr_pte, sizeof(curr_pte));
 if (r || curr_pte != gw->ptes[level - 2]) {
 kvm_mmu_put_page(shadow_page, sptep);
 kvm_release_pfn_clean(pfn);
 sptep = NULL;
 break;
 }
 }

the code you moved... under what scenario is it not sufficient?


I didn't move that code, I just used a common function instead,
FNAME(check_level_mapping)(); it does the same work.

And this check is not sufficient, since it only runs if the
mapping is zapped or does not exist; in other words, only when this
condition is broken:
is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)

but if the middle level is present and is not a large mapping,
this check is skipped.



Well, in the description, it looked like everything was using small 
pages (in kvm, level=1 means PTE level, we need to change this one 
day).  Please describe again and say exactly when the guest or host 
uses large pages.





Oh, I see what you mean.

Regarding the patch, is it possible just to move the check before, 
instead of adding the 'check' variable?


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH v4 5/6] KVM: MMU: combine guest pte read between walk and pte prefetch

2010-07-03 Thread Xiao Guangrong


Avi Kivity wrote:

 And this check is not sufficient, since it only runs if the
 mapping is zapped or does not exist; in other words, only when this
 condition is broken:
 is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)

 but if the middle level is present and is not a large mapping,
 this check is skipped.

 
 
 Well, in the description, it looked like everything was using small
 pages (in kvm, level=1 means PTE level, we need to change this one
 day).  Please describe again and say exactly when the guest or host uses
 large pages.
 

Avi, sorry for my poor English; I didn't mean everything was using small pages.
I don't know what caused the confusion :-(







Re: [PATCH v4 5/6] KVM: MMU: combine guest pte read between walk and pte prefetch

2010-07-03 Thread Xiao Guangrong


Avi Kivity wrote:



 Well, in the description, it looked like everything was using small
 pages (in kvm, level=1 means PTE level, we need to change this one
 day).  Please describe again and say exactly when the guest or host
 uses large pages.


 
 Oh, I see what you mean.
 
 Regarding the patch, is it possible just to move the check before,
 instead of adding the 'check' variable?
 

Umm, if we move the check before the judgment, it'll run for every
level. Actually, only the newly opened mapping and the last level need
to be checked, so for performance reasons it may be better to keep the
two check points.



Re: random crash in post_kvm_run()

2010-07-03 Thread BuraphaLinux Server
OK, I kept going like you said. Here is what it said:

$ git bisect good
44ea2b1758d88ad822e65b1c4c21ca6164494e27 is the first bad commit
commit 44ea2b1758d88ad822e65b1c4c21ca6164494e27
Author: Avi Kivity a...@redhat.com
Date:   Sun Sep 6 15:55:37 2009 +0300

KVM: VMX: Move MSR_KERNEL_GS_BASE out of the vmx autoload msr area

Currently MSR_KERNEL_GS_BASE is saved and restored as part of the
guest/host msr reloading.  Since we wish to lazy-restore all the other
msrs, save and reload MSR_KERNEL_GS_BASE explicitly instead of using
the common code.

Signed-off-by: Avi Kivity a...@redhat.com

:040000 040000 fcf14f9e5578a996430650f7806a54bcc8184ef6 24f4b80719c5b7931a5ed5604f3554f78352ed67 M	arch
$



On 7/3/10, Avi Kivity a...@redhat.com wrote:
 On 07/02/2010 10:08 PM, BuraphaLinux Server wrote:
 Hello,

 I tried my best to do the bisection, and the result after many kernels
 was:

 Bisecting: 0 revisions left to test after this (roughly 0 steps)
 [3ce672d48400e0112fec7a3cb6bb2120493c6e11] KVM: SVM: init_vmcb():
 remove redundant save-cr0 initialization

 So what do I do next?


 You still need to test that last kernel.  0 revisions left means that
 after this test (and the last git bisect good or git bisect bad)
 you'll have the answer.

 I did not 'make mrproper' between each build -
 should I have done that?


 No need.

 --
 I have a truly marvellous patch that fixes the bug which this
 signature is too narrow to contain.




Re: [PATCH 27/27] KVM: PPC: Add Documentation about PV interface

2010-07-03 Thread Benjamin Herrenschmidt
On Fri, 2010-07-02 at 18:27 +0200, Segher Boessenkool wrote:
  +To find out if we're running on KVM or not, we overlay the PVR  
  register. Usually
  +the PVR register contains an id that identifies your CPU type. If,  
  however, you
  +pass KVM_PVR_PARA in the register that you want the PVR result in,  
  the register
  +still contains KVM_PVR_PARA after the mfpvr call.
  +
  +   LOAD_REG_IMM(r5, KVM_PVR_PARA)
  +   mfpvr   r5
  +   [r5 still contains KVM_PVR_PARA]
 
 I love this part :-)

Me not :-)

It should be in the device-tree instead, or something like that. Enough
games with PVR...

Ben.

  +   __u64 scratch3;
  +   __u64 critical; /* Guest may not get interrupts if == r1 */
  +   __u64 sprg0;
  +   __u64 sprg1;
  +   __u64 sprg2;
  +   __u64 sprg3;
  +   __u64 srr0;
  +   __u64 srr1;
  +   __u64 dar;
  +   __u64 msr;
  +   __u32 dsisr;
  +   __u32 int_pending;  /* Tells the guest if we have an interrupt */
  +};
  +
  +Additions to the page must only occur at the end. Struct fields  
  are always 32
  +bit aligned.
 
 The u64s are 64-bit aligned, should they always be?
 
  +The ld and std instructions are transformed to lwz and stw  
  instructions
  +respectively on 32 bit systems with an added offset of 4 to  
  accommodate for big
  +endianness.
 
 Will this add never overflow?  Is there anything that checks for it?
 
  +mtmsrd rX, 0   b   special mtmsr section
  +mtmsr  b   special mtmsr section
 
 mtmsr rX
 
 
 Segher
 
 ___
 Linuxppc-dev mailing list
 linuxppc-...@lists.ozlabs.org
 https://lists.ozlabs.org/listinfo/linuxppc-dev




Re: [PATCH 27/27] KVM: PPC: Add Documentation about PV interface

2010-07-03 Thread Benjamin Herrenschmidt
On Fri, 2010-07-02 at 20:41 +0200, Alexander Graf wrote:
 The u64s are 64-bit aligned, should they always be?
 
 That's obvious, isn't it? And the ABI only specifies u64s to be 32 bit
 aligned, no? At least that's what ld and std specify.

No, the PowerPC ABI specifies u64's to be 64-bit aligned, even for
32-bit binaries.

Ben.
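[Editor's sketch] Ben's alignment point can be checked with a small C
fragment. The struct below is an illustrative tail of the shared page
using field names from the layout quoted in this thread, not the real
definition; struct shared_page_tail is a hypothetical name:

```c
#include <stddef.h>
#include <stdint.h>

/* With u64 given natural 8-byte alignment (as the PowerPC ABI
 * requires even for 32-bit binaries), pairing the two u32 fields
 * keeps the structure free of hidden padding. */
struct shared_page_tail {
	uint64_t dar;
	uint64_t msr;
	uint32_t dsisr;
	uint32_t int_pending;
};

_Static_assert(offsetof(struct shared_page_tail, msr) == 8,
	       "u64 fields stay 8-byte aligned");
_Static_assert(offsetof(struct shared_page_tail, int_pending) == 20,
	       "paired u32s leave no padding");
_Static_assert(sizeof(struct shared_page_tail) == 24,
	       "no trailing padding");
```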

  
  +The ld and std instructions are transformed to lwz and stw
 instructions
  +respectively on 32 bit systems with an added offset of 4 to
 accommodate for big
  +endianness.
  
  Will this add never overflow?  Is there anything that checks for it?
 
 It basically means that to access dar, we either do
 
 ld  rX, DAR(0)
 
 or
 
 lwz rX, DAR+4(0)
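[Editor's sketch] The +4 trick can be simulated in portable C (this is
not the real instruction-patching code; the helper names are
hypothetical): in a big-endian u64 the low word occupies bytes 4..7, so
a 32-bit load at offset+4 yields the value a 64-bit load would be
truncated to.

```c
#include <stddef.h>
#include <stdint.h>

/* Store a u64 into a byte buffer in big-endian order, the layout the
 * shared page uses. */
static void store_u64_be(uint8_t *p, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Big-endian 32-bit load, as lwz does. */
static uint32_t load_u32_be(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8)  | (uint32_t)p[3];
}

/* "lwz rX, OFF+4(base)": picks up the low 32 bits of the u64 at OFF. */
static uint32_t lwz_low_word(const uint8_t *base, size_t off)
{
	return load_u32_be(base + off + 4);
}
```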
 
 
  
  +mtmsrd  rX, 0   b   special mtmsr section
  +mtmsr   b   special mtmsr section
  
  mtmsr rX
 
 Nod. 




