Re: [PATCH v5 08/13] ARM: KVM: Handle guest faults in KVM

2011-12-13 Thread Avi Kivity
On 12/12/2011 09:53 PM, Christoffer Dall wrote:
 
  A bigger problem is that you pin all memory; what are the plans wrt mmu
  notifiers?
 
 hmm, I have no plans (yet).

 I haven't looked into either the MMU shrinker or the MMU notifiers.

 As I see it, the problems of consuming too much memory just for the
 page tables should be solved by somehow reclaiming pages used for the
 second stage mappings, 

That's what the shrinker does.

 the question is just which mappings are the
 most efficient to reclaim.

Do you have accessed bits in those PTEs?

It's not really critical to have efficient reclaim here, since it
happens so rarely.  It just needs to do something.

 The other problem, the actual guest memory consuming too much memory,
 I assumed this limit would be set by the user when creating his/her
 VM, or can we do something smarter? (again, forgive my ignorance).
 What is the alternative to pinning actual guest pages

mmu notifiers - pages aren't pinned; instead, Linux calls back into kvm
when modifying a host pte, and kvm responds by dropping or modifying its
translation (second stage pte in your case).

  - as far as I
 know it's not common to have swap space on ARM architectures, but I
 could be wrong.

It will become common once you start doing servers.

mmu notifiers are also useful for other optimizations, like ksm,
ballooning, and transparent huge pages.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Android-virt] [PATCH v5 08/13] ARM: KVM: Handle guest faults in KVM

2011-12-13 Thread Christoffer Dall
On Tue, Dec 13, 2011 at 4:45 AM, Avi Kivity a...@redhat.com wrote:
 On 12/12/2011 09:53 PM, Christoffer Dall wrote:
 
  A bigger problem is that you pin all memory; what are the plans wrt mmu
  notifiers?
 
 hmm, I have no plans (yet).

 I haven't looked into either the MMU shrinker or the MMU notifiers.

 As I see it, the problems of consuming too much memory just for the
 page tables should be solved by somehow reclaiming pages used for the
 second stage mappings,

 That's what the shrinker does.


ok, that's what I thought.

 the question is just which mappings are the
 most efficient to reclaim.

 Do you have accessed bits in those PTEs?


nope. We can protect the underlying target pages though, but...


 It's not really critical to have efficient reclaim here, since it
 happens so rarely.  It just needs to do something.


when would you trigger it - when it reaches a certain limit, or? And
then what, free the lot and re-allocate what's needed?

 The other problem, the actual guest memory consuming too much memory,
 I assumed this limit would be set by the user when creating his/her
 VM, or can we do something smarter? (again, forgive my ignorance).
 What is the alternative to pinning actual guest pages

 mmu notifiers - pages aren't pinned; instead, Linux calls back into kvm
 when modifying a host pte, and kvm responds by dropping or modifying its
 translation (second stage pte in your case).


ah ok, so this works across VM boundary. Based on hyper-calls I presume?

  - as far as I
 know it's not common to have swap space on ARM architectures, but I
 could be wrong.

 It will become common once you start doing servers.


I think so too, but I am not sure if it's completely supported for
ARM. Is it all arch-independent or do we miss arm-specific pieces?
Marc?

 mmu notifiers are also useful for other optimizations, like ksm,
 ballooning, and transparent huge pages.


I know those features have to be supported eventually. The question is
if all this must be in place before a merge upstream?


Thanks,
Christoffer


Re: [Android-virt] [PATCH v5 08/13] ARM: KVM: Handle guest faults in KVM

2011-12-13 Thread Marc Zyngier
On 13/12/11 13:10, Christoffer Dall wrote:
 On Tue, Dec 13, 2011 at 4:45 AM, Avi Kivity a...@redhat.com wrote:
 On 12/12/2011 09:53 PM, Christoffer Dall wrote:
  - as far as I
 know it's not common to have swap space on ARM architectures, but I
 could be wrong.

 It will become common once you start doing servers.

 
 I think so too, but I am not sure if it's completely supported for
 ARM. Is it all arch-independent or do we miss arm-specific pieces?
 Marc?

Swapping definitely works as expected on ARM (and if it doesn't, it's a
major bug and should be tackled immediately).

M.
-- 
Jazz is not dead. It just smells funny...



Re: [Android-virt] [PATCH v5 08/13] ARM: KVM: Handle guest faults in KVM

2011-12-13 Thread Avi Kivity
On 12/13/2011 03:10 PM, Christoffer Dall wrote:
  the question is just which mappings are the
  most efficient to reclaim.
 
  Do you have accessed bits in those PTEs?
 

 nope. We can protect the underlying target pages though, but...

Yeah, we have the same issue with one of the vendors.  Fortunately only
90% of the market is affected.

  It's not really critical to have efficient reclaim here, since it
  happens so rarely.  It just needs to do something.
 

 when would you trigger it - when it reaches a certain limit, or? And
 then what, free the lot and re-allocate what's needed?

The kernel triggers it based on internal pressure.  It tells you how
much pressure to apply, so you just translate it to a number of pages to
free.
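
As a rough illustration of that translation (a sketch only, not the x86
mmu_shrinker code: the sketch_* names and the oldest-first list are
assumptions made for this example), a shrink callback can treat the
requested scan count as the number of page-table pages to free and report
how many remain:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one page-table page owned by a guest. */
struct sketch_pt_page {
	struct sketch_pt_page *next;
};

/* Oldest page at head, newest at tail: a crude LRU approximation. */
struct sketch_pt_cache {
	struct sketch_pt_page *head;
	struct sketch_pt_page *tail;
	size_t count;
};

/* Track a newly allocated page-table page; returns 0, or -1 on failure. */
static int sketch_cache_add(struct sketch_pt_cache *c)
{
	struct sketch_pt_page *p = malloc(sizeof(*p));

	if (!p)
		return -1;
	p->next = NULL;
	if (c->tail)
		c->tail->next = p;
	else
		c->head = p;
	c->tail = p;
	c->count++;
	return 0;
}

/*
 * Free up to nr_to_scan pages, oldest first, and return how many are
 * left. A count of zero frees nothing and only reports the size.
 */
static size_t sketch_shrink(struct sketch_pt_cache *c, size_t nr_to_scan)
{
	size_t i;

	for (i = 0; i < nr_to_scan && c->head; i++) {
		struct sketch_pt_page *victim = c->head;

		c->head = victim->next;
		if (!c->head)
			c->tail = NULL;
		free(victim);
		c->count--;
	}
	return c->count;
}
```

A real implementation would also have to tear down the mappings that point
at each freed table; the sketch only shows the bookkeeping.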


  The other problem, the actual guest memory consuming too much memory,
  I assumed this limit would be set by the user when creating his/her
  VM, or can we do something smarter? (again, forgive my ignorance).
  What is the alternative to pinning actual guest pages
 
  mmu notifiers - pages aren't pinned; instead, Linux calls back into kvm
  when modifying a host pte, and kvm responds by dropping or modifying its
  translation (second stage pte in your case).
 

 ah ok, so this works across VM boundary. Based on hyper-calls I presume?

No, it's completely internal to the host.

See for example kvm_mmu_notifier_invalidate_page() (in common code). 
It's called when Linux-as-host wants to change a pte (say to swap a
page).  kvm responds by translating the host virtual address into a
guest physical address (via the memory slots table), then zapping the
relevant pte and flushing any TLBs which may have cached the pte.
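
A minimal user-space sketch of that lookup, assuming a simplified memory
slot layout and a toy flat stage-2 table (the sketch_* types and functions
are hypothetical and are not the KVM data structures):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT	12
#define SKETCH_NR_GFNS		64

/* Simplified memory slot: a run of guest frames backed by host VAs. */
struct sketch_memslot {
	uint64_t base_gfn;		/* first guest frame number */
	uint64_t npages;		/* number of pages in the slot */
	uint64_t userspace_addr;	/* host virtual address of base_gfn */
};

struct sketch_vm {
	struct sketch_memslot slots[4];
	size_t nslots;
	uint64_t stage2[SKETCH_NR_GFNS];	/* toy table indexed by gfn */
	unsigned int tlb_flushes;		/* counts simulated flushes */
};

/* Translate a host virtual address to a guest physical address. */
static bool sketch_hva_to_gpa(const struct sketch_memslot *slots,
			      size_t nslots, uint64_t hva, uint64_t *gpa)
{
	size_t i;

	for (i = 0; i < nslots; i++) {
		uint64_t start = slots[i].userspace_addr;
		uint64_t end = start + (slots[i].npages << SKETCH_PAGE_SHIFT);

		if (hva >= start && hva < end) {
			*gpa = (slots[i].base_gfn << SKETCH_PAGE_SHIFT) +
			       (hva - start);
			return true;
		}
	}
	return false;
}

/* Host is about to change the pte for hva: drop the guest translation. */
static bool sketch_invalidate_page(struct sketch_vm *vm, uint64_t hva)
{
	uint64_t gpa, gfn;

	if (!sketch_hva_to_gpa(vm->slots, vm->nslots, hva, &gpa))
		return false;		/* not guest memory, nothing to do */
	gfn = gpa >> SKETCH_PAGE_SHIFT;
	if (gfn >= SKETCH_NR_GFNS)
		return false;
	vm->stage2[gfn] = 0;		/* zap the second stage entry */
	vm->tlb_flushes++;		/* stand-in for a TLB flush */
	return true;
}
```

The guest takes a fresh stage-2 fault on its next access and the mapping is
rebuilt from whatever the host pte says at that point.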

  mmu notifiers are also useful for other optimizations, like ksm,
  ballooning, and transparent huge pages.
 

 I know those features have to be supported eventually. The question is
 if all this must be in place before a merge upstream?

It doesn't have to be there for the merge but I recommend giving it high
priority.  At least read and understand the code so the addition will
follow naturally.

-- 
error compiling committee.c: too many arguments to function



Re: [Android-virt] [PATCH v5 08/13] ARM: KVM: Handle guest faults in KVM

2011-12-13 Thread Christoffer Dall
On Tue, Dec 13, 2011 at 8:23 AM, Avi Kivity a...@redhat.com wrote:
 On 12/13/2011 03:10 PM, Christoffer Dall wrote:
  the question is just which mappings are the
  most efficient to reclaim.
 
  Do you have accessed bits in those PTEs?
 

 nope. We can protect the underlying target pages though, but...

 Yeah, we have the same issue with one of the vendors.  Fortunately only
 90% of the market is affected.


:)

  It's not really critical to have efficient reclaim here, since it
  happens so rarely.  It just needs to do something.
 

 when would you trigger it - when it reaches a certain limit, or? And
 then what, free the lot and re-allocate what's needed?

 The kernel triggers it based on internal pressure.  It tells you how
 much pressure to apply, so you just translate it to a number of pages to
 free.



ok, so we pick those pages at random? (perhaps trying to avoid hitting
the guest kernel at least for Linux, or...?)

  The other problem, the actual guest memory consuming too much memory,
  I assumed this limit would be set by the user when creating his/her
  VM, or can we do something smarter? (again, forgive my ignorance).
  What is the alternative to pinning actual guest pages
 
  mmu notifiers - pages aren't pinned; instead, Linux calls back into kvm
  when modifying a host pte, and kvm responds by dropping or modifying its
  translation (second stage pte in your case).
 

 ah ok, so this works across VM boundary. Based on hyper-calls I presume?

 No, it's completely internal to the host.


ok, got you. I got thrown off by the "Linux calls back into kvm" statement.

 See for example kvm_mmu_notifier_invalidate_page() (in common code).
 It's called when Linux-as-host wants to change a pte (say to swap a
 page).  kvm responds by translating the host virtual address into a
 guest physical address (via the memory slots table), then zapping the
 relevant pte and flushing any TLBs which may have cached the pte.

  mmu notifiers are also useful for other optimizations, like ksm,
  ballooning, and transparent huge pages.
 

 I know those features have to be supported eventually. The question is
 if all this must be in place before a merge upstream?

 It doesn't have to be there for the merge but I recommend giving it high
 priority.  At least read and understand the code so the addition will
 follow naturally.

will do - I will make it a Christmas activity.


Re: [Android-virt] [PATCH v5 08/13] ARM: KVM: Handle guest faults in KVM

2011-12-13 Thread Avi Kivity
On 12/13/2011 03:44 PM, Christoffer Dall wrote:
   It's not really critical to have efficient reclaim here, since it
   happens so rarely.  It just needs to do something.
  
 
  when would you trigger it - when it reaches a certain limit, or? And
  then what, free the lot and re-allocate what's needed?
 
  The kernel triggers it based on internal pressure.  It tells you how
  much pressure to apply, so you just translate it to a number of pages to
  free.
 
 

 ok, so we pick those pages at random? (perhaps trying to avoid hitting
 the guest kernel at least for Linux, or...?)

x86 has a sort of poorly managed LRU; it's wildly inaccurate but doesn't
hurt in practice since it only triggers under severe memory pressure anyway.

  It doesn't have to be there for the merge but I recommend giving it high
  priority.  At least read and understand the code so the addition will
  follow naturally.
 
 will do - I will make it a Christmas activity.

I was hoping to get the ARM port as a present...

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH v5 08/13] ARM: KVM: Handle guest faults in KVM

2011-12-12 Thread Christoffer Dall
On Mon, Dec 12, 2011 at 10:05 AM, Avi Kivity a...@redhat.com wrote:
 On 12/11/2011 12:25 PM, Christoffer Dall wrote:
 From: Christoffer Dall cd...@cs.columbia.edu

 Handles the guest faults in KVM by mapping in corresponding user pages
 in the 2nd stage page tables.

 Introduces new ARM-specific kernel memory types, PAGE_KVM_GUEST and
 pgprot_guest variables used to map 2nd stage memory for KVM guests.


 +static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 +                       gfn_t gfn, struct kvm_memory_slot *memslot)
 +{
 +     pfn_t pfn;
 +     pgd_t *pgd;
 +     pud_t *pud;
 +     pmd_t *pmd;
 +     pte_t *pte, new_pte;
 +
 +     pfn = gfn_to_pfn(vcpu->kvm, gfn);
 +
 +     if (is_error_pfn(pfn)) {

 put_page()


ack

 +             kvm_err(-EFAULT, "Guest gfn %u (0x%08lx) does not have "
 +                             "corresponding host mapping",
 +                             gfn, gfn << PAGE_SHIFT);
 +             return -EFAULT;
 +     }
 +
 +     /* Create 2nd stage page table mapping - Level 1 */
 +     pgd = vcpu->kvm->arch.pgd + pgd_index(fault_ipa);
 +     pud = pud_offset(pgd, fault_ipa);
 +     if (pud_none(*pud)) {
 +             pmd = pmd_alloc_one(NULL, fault_ipa);
 +             if (!pmd) {
 +                     kvm_err(-ENOMEM, "Cannot allocate 2nd stage pmd");

 put_page()


ack

 +                     return -ENOMEM;
 +             }
 +             pud_populate(NULL, pud, pmd);
 +             pmd += pmd_index(fault_ipa);
 +     } else
 +             pmd = pmd_offset(pud, fault_ipa);
 +
 +     /* Create 2nd stage page table mapping - Level 2 */
 +     if (pmd_none(*pmd)) {
 +             pte = pte_alloc_one_kernel(NULL, fault_ipa);
 +             if (!pte) {
 +                     kvm_err(-ENOMEM, "Cannot allocate 2nd stage pte");
 +                     return -ENOMEM;
 +             }
 +             pmd_populate_kernel(NULL, pmd, pte);
 +             pte += pte_index(fault_ipa);
 +     } else
 +             pte = pte_offset_kernel(pmd, fault_ipa);
 +
 +     /* Create 2nd stage page table mapping - Level 3 */
 +     new_pte = pfn_pte(pfn, PAGE_KVM_GUEST);
 +     set_pte_ext(pte, new_pte, 0);
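
The put_page() remarks on the error paths above can be sketched as a single
exit label that drops the page reference. This is a user-space model only:
the sketch_* helpers and the refcount struct are hypothetical stand-ins, and
keeping the reference on success is an assumption that mirrors the pinning
discussed in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a host page with a reference count. */
struct sketch_page {
	int refcount;
};

static void sketch_get_page(struct sketch_page *p) { p->refcount++; }
static void sketch_put_page(struct sketch_page *p) { p->refcount--; }

/*
 * Model of a fault handler: take a page reference, allocate a table,
 * and release the reference again on every failure path.
 * fail_alloc forces the allocation to fail so the path can be exercised.
 */
static int sketch_map_page(struct sketch_page *page, bool fail_alloc,
			   void **table_out)
{
	void *table;
	int ret;

	/* Stands in for the reference taken when the pfn is looked up. */
	sketch_get_page(page);

	table = fail_alloc ? NULL : malloc(64);
	if (!table) {
		ret = -ENOMEM;
		goto out_put;
	}

	*table_out = table;
	return 0;	/* success: the mapping keeps the reference */

out_put:
	sketch_put_page(page);	/* no mapping was created, so drop it */
	return ret;
}
```

With one label, adding a further allocation step later does not require
remembering a separate put on each new return.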


 With LPAE and 40-bit addresses, a guest can cause 2GBs worth of page
 tables to be pinned in host memory; this can be used as a denial of
 service attack.  x86 handles this by having a shrinker that can
 dynamically free page tables, see mmu_shrinker.

 An alternative way may be to impose RLIMIT_AS on the sum of a guest's
 memory slots; though I prefer having a shrinker.

 A bigger problem is that you pin all memory; what are the plans wrt mmu
 notifiers?

hmm, I have no plans (yet).

I haven't looked into either the MMU shrinker or the MMU notifiers.

As I see it, the problems of consuming too much memory just for the
page tables should be solved by somehow reclaiming pages used for the
second stage mappings, the question is just which mappings are the
most efficient to reclaim.

The other problem, the actual guest memory consuming too much memory,
I assumed this limit would be set by the user when creating his/her
VM, or can we do something smarter? (again, forgive my ignorance).
What is the alternative to pinning actual guest pages - as far as I
know it's not common to have swap space on ARM architectures, but I
could be wrong.


Re: [PATCH v5 08/13] ARM: KVM: Handle guest faults in KVM

2011-12-12 Thread Avi Kivity
On 12/11/2011 12:25 PM, Christoffer Dall wrote:
 From: Christoffer Dall cd...@cs.columbia.edu

 Handles the guest faults in KVM by mapping in corresponding user pages
 in the 2nd stage page tables.

 Introduces new ARM-specific kernel memory types, PAGE_KVM_GUEST and
 pgprot_guest variables used to map 2nd stage memory for KVM guests.

  
 +static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 +                       gfn_t gfn, struct kvm_memory_slot *memslot)
 +{
 +     pfn_t pfn;
 +     pgd_t *pgd;
 +     pud_t *pud;
 +     pmd_t *pmd;
 +     pte_t *pte, new_pte;
 +
 +     pfn = gfn_to_pfn(vcpu->kvm, gfn);
 +
 +     if (is_error_pfn(pfn)) {

put_page()

 +             kvm_err(-EFAULT, "Guest gfn %u (0x%08lx) does not have "
 +                             "corresponding host mapping",
 +                             gfn, gfn << PAGE_SHIFT);
 +             return -EFAULT;
 +     }
 +
 +     /* Create 2nd stage page table mapping - Level 1 */
 +     pgd = vcpu->kvm->arch.pgd + pgd_index(fault_ipa);
 +     pud = pud_offset(pgd, fault_ipa);
 +     if (pud_none(*pud)) {
 +             pmd = pmd_alloc_one(NULL, fault_ipa);
 +             if (!pmd) {
 +                     kvm_err(-ENOMEM, "Cannot allocate 2nd stage pmd");

put_page()

 +                     return -ENOMEM;
 +             }
 +             pud_populate(NULL, pud, pmd);
 +             pmd += pmd_index(fault_ipa);
 +     } else
 +             pmd = pmd_offset(pud, fault_ipa);
 +
 +     /* Create 2nd stage page table mapping - Level 2 */
 +     if (pmd_none(*pmd)) {
 +             pte = pte_alloc_one_kernel(NULL, fault_ipa);
 +             if (!pte) {
 +                     kvm_err(-ENOMEM, "Cannot allocate 2nd stage pte");
 +                     return -ENOMEM;
 +             }
 +             pmd_populate_kernel(NULL, pmd, pte);
 +             pte += pte_index(fault_ipa);
 +     } else
 +             pte = pte_offset_kernel(pmd, fault_ipa);
 +
 +     /* Create 2nd stage page table mapping - Level 3 */
 +     new_pte = pfn_pte(pfn, PAGE_KVM_GUEST);
 +     set_pte_ext(pte, new_pte, 0);


With LPAE and 40-bit addresses, a guest can cause 2GBs worth of page
tables to be pinned in host memory; this can be used as a denial of
service attack.  x86 handles this by having a shrinker that can
dynamically free page tables, see mmu_shrinker.
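
The 2GB figure can be checked with a little arithmetic, assuming 4KB pages
and 8-byte LPAE descriptors (the helper names below are made up for the
illustration):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of last-level tables needed to map an entire guest physical
 * space of ipa_bits bits with pages of (1 << page_shift) bytes and
 * descriptors of desc_bytes bytes each.
 */
static uint64_t sketch_leaf_table_bytes(unsigned int ipa_bits,
					unsigned int page_shift,
					unsigned int desc_bytes)
{
	uint64_t npages = 1ULL << (ipa_bits - page_shift);

	return npages * desc_bytes;
}

/*
 * Bytes of tables one level up: one descriptor for every table page
 * in the level below.
 */
static uint64_t sketch_parent_table_bytes(uint64_t child_bytes,
					  unsigned int page_shift,
					  unsigned int desc_bytes)
{
	return (child_bytes >> page_shift) * desc_bytes;
}
```

For 40 bits that is 2^28 pages times 8 bytes, i.e. 2^31 bytes (2GB) of
last-level tables, plus about 4MB for the level above, so the last level
dominates.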

An alternative way may be to impose RLIMIT_AS on the sum of a guest's
memory slots; though I prefer having a shrinker.

A bigger problem is that you pin all memory; what are the plans wrt mmu
notifiers?

-- 
error compiling committee.c: too many arguments to function



[PATCH v5 08/13] ARM: KVM: Handle guest faults in KVM

2011-12-11 Thread Christoffer Dall
From: Christoffer Dall cd...@cs.columbia.edu

Handles the guest faults in KVM by mapping in corresponding user pages
in the 2nd stage page tables.

Introduces new ARM-specific kernel memory types, PAGE_KVM_GUEST and
pgprot_guest variables used to map 2nd stage memory for KVM guests.

Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com
---
 arch/arm/include/asm/pgtable-3level.h |    8 ++
 arch/arm/include/asm/pgtable.h        |    4 +
 arch/arm/kvm/mmu.c                    |  107 -
 arch/arm/mm/mmu.c                     |    3 +
 4 files changed, 120 insertions(+), 2 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index edc3cb9..6dc5331 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -104,6 +104,14 @@
  */
 #define L_PGD_SWAPPER		(_AT(pgdval_t, 1) << 55)	/* swapper_pg_dir entry */
 
+/*
+ * 2-nd stage PTE definitions for LPAE.
+ */
+#define L_PTE2_READ		(_AT(pteval_t, 1) << 6)	/* HAP[0] */
+#define L_PTE2_WRITE		(_AT(pteval_t, 1) << 7)	/* HAP[1] */
+#define L_PTE2_NORM_WB		(_AT(pteval_t, 3) << 4)	/* MemAttr[3:2] */
+#define L_PTE2_INNER_WB		(_AT(pteval_t, 3) << 2)	/* MemAttr[1:0] */
+
 #ifndef __ASSEMBLY__
 
 #define pud_none(pud)  (!pud_val(pud))
diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index 20025cc..778856b 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -76,6 +76,7 @@ extern void __pgd_error(const char *file, int line, pgd_t);
 
 extern pgprot_t		pgprot_user;
 extern pgprot_t		pgprot_kernel;
+extern pgprot_t		pgprot_guest;
 
 #define _MOD_PROT(p, b)	__pgprot(pgprot_val(p) | (b))
 
@@ -89,6 +90,9 @@ extern pgprot_t   pgprot_kernel;
 #define PAGE_KERNEL		_MOD_PROT(pgprot_kernel, L_PTE_XN)
 #define PAGE_KERNEL_EXEC	pgprot_kernel
 #define PAGE_HYP		_MOD_PROT(pgprot_kernel, L_PTE_USER)
+#define PAGE_KVM_GUEST		_MOD_PROT(pgprot_guest, L_PTE2_READ | \
+					  L_PTE2_WRITE | L_PTE2_NORM_WB | \
+					  L_PTE2_INNER_WB)
 
 #define __PAGE_NONE		__pgprot(_L_PTE_DEFAULT | L_PTE_RDONLY | L_PTE_XN)
 #define __PAGE_SHARED		__pgprot(_L_PTE_DEFAULT | L_PTE_USER | L_PTE_XN)
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index f7a7b17..d468238 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -229,8 +229,111 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
 	kvm->arch.pgd = NULL;
 }
 
+static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
+			  gfn_t gfn, struct kvm_memory_slot *memslot)
+{
+	pfn_t pfn;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte, new_pte;
+
+	pfn = gfn_to_pfn(vcpu->kvm, gfn);
+
+	if (is_error_pfn(pfn)) {
+		kvm_err(-EFAULT, "Guest gfn %u (0x%08lx) does not have "
+				"corresponding host mapping",
+				gfn, gfn << PAGE_SHIFT);
+		return -EFAULT;
+	}
+
+	/* Create 2nd stage page table mapping - Level 1 */
+	pgd = vcpu->kvm->arch.pgd + pgd_index(fault_ipa);
+	pud = pud_offset(pgd, fault_ipa);
+	if (pud_none(*pud)) {
+		pmd = pmd_alloc_one(NULL, fault_ipa);
+		if (!pmd) {
+			kvm_err(-ENOMEM, "Cannot allocate 2nd stage pmd");
+			return -ENOMEM;
+		}
+		pud_populate(NULL, pud, pmd);
+		pmd += pmd_index(fault_ipa);
+	} else
+		pmd = pmd_offset(pud, fault_ipa);
+
+	/* Create 2nd stage page table mapping - Level 2 */
+	if (pmd_none(*pmd)) {
+		pte = pte_alloc_one_kernel(NULL, fault_ipa);
+		if (!pte) {
+			kvm_err(-ENOMEM, "Cannot allocate 2nd stage pte");
+			return -ENOMEM;
+		}
+		pmd_populate_kernel(NULL, pmd, pte);
+		pte += pte_index(fault_ipa);
+	} else
+		pte = pte_offset_kernel(pmd, fault_ipa);
+
+	/* Create 2nd stage page table mapping - Level 3 */
+	new_pte = pfn_pte(pfn, PAGE_KVM_GUEST);
+	set_pte_ext(pte, new_pte, 0);
+
+	return 0;
+}
+
+#define HSR_ABT_FS (0x3f)
+#define HPFAR_MASK (~0xf)
+
+/**
+ * kvm_handle_guest_abort - handles all 2nd stage aborts
+ * @vcpu:  the VCPU pointer
+ * @run:   the kvm_run structure
+ *
+ * Any abort that gets to the host is almost guaranteed to be caused by a
+ * missing second stage translation table entry, which can mean that either the
+ * guest simply needs more memory and we must allocate an appropriate page or it
+ * can mean that the guest tried to access I/O memory, which is emulated by user
+ * space. The