Re: [PATCH -v3] Monitor command: x-gpa2hva, translate guest physical address to host virtual address

2011-04-28 Thread Huang Ying
On 04/28/2011 10:04 PM, Marcelo Tosatti wrote:
 On Thu, Apr 28, 2011 at 08:00:19AM -0500, Anthony Liguori wrote:
 On 04/27/2011 06:06 PM, Marcelo Tosatti wrote:
 On Fri, Nov 19, 2010 at 04:17:35PM +0800, Huang Ying wrote:
 On Tue, 2010-11-16 at 10:23 +0800, Huang Ying wrote:
 Author: Max Asbock masb...@linux.vnet.ibm.com

 Add command x-gpa2hva to translate a guest physical address to a host
 virtual address. Because the gpa-to-hva translation is not consistent,
 this command is intended for debugging only.

 The x-gpa2hva command provides one step in a chain of translations from
 guest virtual to guest physical to host virtual to host physical. Host
 physical is then used to inject a machine check error. As a
 consequence the HWPOISON code on the host and the MCE injection code
 in qemu-kvm are exercised.
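
For reference, the gpa-to-hva step is conceptually the same lookup that
qemu_get_ram_ptr() performs: find the RAMBlock that covers the address and
add the offset into the block's host mapping. A minimal sketch only (not
the patch itself); it assumes the guest physical address is RAM-backed and
has already been resolved to a ram_addr_t, and reuses the RAMBlock fields
shown in the qemu_ram_remap() patches further down:

/* Sketch only: hva lookup for a RAM-backed address. */
static void *gpa2hva_sketch(ram_addr_t ram_addr)
{
    RAMBlock *block;

    QLIST_FOREACH(block, &ram_list.blocks, next) {
        if (ram_addr - block->offset < block->length) {
            return block->host + (ram_addr - block->offset);
        }
    }
    return NULL; /* not guest RAM (e.g. MMIO), no host mapping */
}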

 v3:

 - Rename to x-gpa2hva
 - Remove QMP support, because gpa2hva is not consistent

 Is this patch an acceptable solution for now? This command is useful for
 our testing.

 Anthony?

 Yeah, but it should come through qemu-devel, no?
 
 Yes, Huang Ying, can you please resend?

Via QEMU git or uq/master branch of KVM git?

Best Regards,
Huang Ying


Re: [PATCH uq/master -v2 2/2] KVM, MCE, unpoison memory address across reboot

2011-02-10 Thread Huang Ying
On Thu, 2011-02-10 at 16:52 +0800, Jan Kiszka wrote:
 On 2011-02-10 01:27, Huang Ying wrote:
  @@ -1882,6 +1919,7 @@ int kvm_arch_on_sigbus_vcpu(CPUState *en
   hardware_memory_error();
   }
   }
  +kvm_hwpoison_page_add(ram_addr);
   
   if (code == BUS_MCEERR_AR) {
   /* Fake an Intel architectural Data Load SRAR UCR */
  @@ -1926,6 +1964,7 @@ int kvm_arch_on_sigbus(int code, void *a
   "QEMU itself instead of guest system!: %p\n", addr);
   return 0;
   }
  +kvm_hwpoison_page_add(ram_addr);
   kvm_mce_inj_srao_memscrub2(first_cpu, paddr);
   } else
   #endif
 
 
 
  Looks fine otherwise. Unless that simplification makes sense, I could
  offer to include this into my MCE rework (there is some minor conflict).
  If all goes well, that series should be posted during this week.
 
 Please have a look at
 
 git://git.kiszka.org/qemu-kvm.git queues/kvm-upstream
 
 and tell me if it works for you and your signed-off still applies.

Thanks!  Works as expected in my testing!

Best Regards,
Huang Ying




Re: [PATCH uq/master -v2 2/2] KVM, MCE, unpoison memory address across reboot

2011-02-09 Thread Huang Ying
On Wed, 2011-02-09 at 16:00 +0800, Jan Kiszka wrote:
 On 2011-02-09 04:00, Huang Ying wrote:
  In Linux kernel HWPoison processing implementation, the virtual
  address in processes mapping the error physical memory page is marked
  as HWPoison.  So that, the further accessing to the virtual
  address will kill corresponding processes with SIGBUS.
  
  If the error physical memory page is used by a KVM guest, the SIGBUS
  will be sent to QEMU, and QEMU will simulate a MCE to report that
  memory error to the guest OS.  If the guest OS can not recover from
  the error (for example, the page is accessed by kernel code), guest OS
  will reboot the system.  But because the underlying host virtual
  address backing the guest physical memory is still poisoned, if the
  guest system accesses the corresponding guest physical memory even
  after rebooting, the SIGBUS will still be sent to QEMU and MCE will be
  simulated.  That is, guest system can not recover via rebooting.
 
 Yeah, saw this already during my test...
 
  
  In fact, across rebooting, the contents of guest physical memory page
  need not to be kept.  We can allocate a new host physical page to
  back the corresponding guest physical address.
 
 I just wondering what would be architecturally suboptimal if we simply
 remapped on SIGBUS directly. Would save us at least the bookkeeping.

We cannot silently change the contents of memory while the guest OS is
running: doing so may corrupt guest OS data structures and even ruin
on-disk data.  During rebooting, however, all guest OS state is
discarded.

[snip]
  @@ -1882,6 +1919,7 @@ int kvm_arch_on_sigbus_vcpu(CPUState *en
   hardware_memory_error();
   }
   }
  +kvm_hwpoison_page_add(ram_addr);
   
   if (code == BUS_MCEERR_AR) {
   /* Fake an Intel architectural Data Load SRAR UCR */
  @@ -1926,6 +1964,7 @@ int kvm_arch_on_sigbus(int code, void *a
   "QEMU itself instead of guest system!: %p\n", addr);
   return 0;
   }
  +kvm_hwpoison_page_add(ram_addr);
   kvm_mce_inj_srao_memscrub2(first_cpu, paddr);
   } else
   #endif
  
  
 
 Looks fine otherwise. Unless that simplification makes sense, I could
 offer to include this into my MCE rework (there is some minor conflict).
 If all goes well, that series should be posted during this week.

Thanks.

Best Regards,
Huang Ying



[PATCH uq/master -v2 1/2] Add qemu_ram_remap

2011-02-08 Thread Huang Ying
qemu_ram_remap() unmaps the specified RAM pages, then re-maps these
pages again.  This is used by KVM HWPoison support to clear HWPoisoned
page tables across guest rebooting, so that a new page may be
allocated later to recover the memory error.
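
The core of the remap, stripped of the mem_path and S390/KVM special cases
handled in the patch below (and of the QEMU_MADV_MERGEABLE hint it re-applies),
is just an munmap() followed by an anonymous MAP_FIXED mmap() at the same host
virtual address; the poisoned page is dropped and a fresh one is faulted in on
the next access. A minimal standalone sketch of that idea, not the patch:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Sketch only: drop a poisoned anonymous mapping and re-create it at the
 * same host virtual address, so a clean page is faulted in later. */
static void remap_anonymous(void *vaddr, size_t length)
{
    void *area;

    munmap(vaddr, length);
    area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
                MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area != vaddr) {
        perror("mmap");
        abort();
    }
}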

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 cpu-all.h|4 +++
 cpu-common.h |1 
 exec.c   |   61 ++-
 3 files changed, 65 insertions(+), 1 deletion(-)

--- a/cpu-all.h
+++ b/cpu-all.h
@@ -863,10 +863,14 @@ target_phys_addr_t cpu_get_phys_page_deb
 extern int phys_ram_fd;
 extern ram_addr_t ram_size;
 
+/* RAM is pre-allocated and passed into qemu_ram_alloc_from_ptr */
+#define RAM_PREALLOC_MASK  (1 << 0)
+
 typedef struct RAMBlock {
 uint8_t *host;
 ram_addr_t offset;
 ram_addr_t length;
+uint32_t flags;
 char idstr[256];
 QLIST_ENTRY(RAMBlock) next;
 #if defined(__linux__) && !defined(TARGET_S390X)
--- a/exec.c
+++ b/exec.c
@@ -2837,6 +2837,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(Devic
 
 if (host) {
 new_block->host = host;
+new_block->flags |= RAM_PREALLOC_MASK;
 } else {
 if (mem_path) {
 #if defined (__linux__) && !defined(TARGET_S390X)
@@ -2890,7 +2891,9 @@ void qemu_ram_free(ram_addr_t addr)
 QLIST_FOREACH(block, &ram_list.blocks, next) {
 if (addr == block->offset) {
 QLIST_REMOVE(block, next);
-if (mem_path) {
+if (block->flags & RAM_PREALLOC_MASK) {
+;
+} else if (mem_path) {
 #if defined (__linux__) && !defined(TARGET_S390X)
 if (block->fd) {
 munmap(block->host, block->length);
@@ -2913,6 +2916,62 @@ void qemu_ram_free(ram_addr_t addr)
 
 }
 
+void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
+{
+RAMBlock *block;
+ram_addr_t offset;
+int flags;
+void *area, *vaddr;
+
+QLIST_FOREACH(block, &ram_list.blocks, next) {
+offset = addr - block->offset;
+if (offset < block->length) {
+vaddr = block->host + offset;
+if (block->flags & RAM_PREALLOC_MASK) {
+;
+} else {
+flags = MAP_FIXED;
+munmap(vaddr, length);
+if (mem_path) {
+#if defined (__linux__) && !defined(TARGET_S390X)
+if (block->fd) {
+#ifdef MAP_POPULATE
+flags |= mem_prealloc ? MAP_POPULATE | MAP_SHARED :
+MAP_PRIVATE;
+#else
+flags |= MAP_PRIVATE;
+#endif
+area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
+flags, block->fd, offset);
+} else {
+flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
+flags, -1, 0);
+}
+#endif
+} else {
+#if defined(TARGET_S390X) && defined(CONFIG_KVM)
+flags |= MAP_SHARED | MAP_ANONYMOUS;
+area = mmap(vaddr, length, PROT_EXEC|PROT_READ|PROT_WRITE,
+flags, -1, 0);
+#else
+flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
+flags, -1, 0);
+#endif
+}
+if (area != vaddr) {
+fprintf(stderr, "Could not remap addr: %lx@%lx\n",
+length, addr);
+exit(1);
+}
+qemu_madvise(vaddr, length, QEMU_MADV_MERGEABLE);
+}
+return;
+}
+}
+}
+
 /* Return a host pointer to ram allocated with qemu_ram_alloc.
With the exception of the softmmu code in this file, this should
only be used for local memory (e.g. video ram) that the device owns,
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -50,6 +50,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(Devic
 ram_addr_t size, void *host);
 ram_addr_t qemu_ram_alloc(DeviceState *dev, const char *name, ram_addr_t size);
 void qemu_ram_free(ram_addr_t addr);
+void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
 /* This should only be used for ram local to a device.  */
 void *qemu_get_ram_ptr(ram_addr_t addr);
 /* Same but slower, to use for migration, where the order of




[PATCH uq/master -v2 2/2] KVM, MCE, unpoison memory address across reboot

2011-02-08 Thread Huang Ying
In the Linux kernel HWPoison processing implementation, the virtual
address mapping the faulty physical memory page is marked as HWPoison
in every process that maps it, so that any further access to that
virtual address kills the corresponding process with SIGBUS.

If the faulty physical memory page is used by a KVM guest, the SIGBUS
is sent to QEMU, and QEMU simulates an MCE to report the memory error
to the guest OS.  If the guest OS cannot recover from the error (for
example, because the page is accessed by kernel code), the guest OS
reboots the system.  But because the underlying host virtual address
backing the guest physical memory is still poisoned, if the guest
accesses the corresponding guest physical memory even after rebooting,
the SIGBUS is sent to QEMU again and another MCE is simulated.  That
is, the guest system cannot recover by rebooting.

In fact, across a reboot the contents of the guest physical memory page
need not be kept.  We can allocate a new host physical page to back the
corresponding guest physical address.

This patch fixes the issue in QEMU-KVM by calling qemu_ram_remap() to
clear the corresponding page table entry, so that a new page can be
allocated later to recover from the memory error.
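
For readers following the series, the overall flow across the two patches is
roughly the following (an illustrative summary in comment form, using the
names from the patches; it is not code from the patch):

/*
 * Illustrative flow only:
 *
 *   host MCE -> kernel HWPoison -> SIGBUS delivered to QEMU
 *       kvm_arch_on_sigbus_vcpu()/kvm_arch_on_sigbus():
 *           kvm_hwpoison_page_add(ram_addr);    remember the poisoned page
 *           inject an SRAO/SRAR MCE into the guest
 *
 *   guest cannot recover -> guest reboots -> QEMU system reset
 *       kvm_unpoison_all():
 *           for each recorded page:
 *               qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
 *           the poisoned host page is dropped and a clean page is
 *           allocated on the next guest access
 */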

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 target-i386/kvm.c |   39 +++
 1 file changed, 39 insertions(+)

--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -508,6 +508,42 @@ static int kvm_get_supported_msrs(KVMSta
 return ret;
 }
 
+struct HWPoisonPage;
+typedef struct HWPoisonPage HWPoisonPage;
+struct HWPoisonPage
+{
+ram_addr_t ram_addr;
+QLIST_ENTRY(HWPoisonPage) list;
+};
+
+static QLIST_HEAD(hwpoison_page_list, HWPoisonPage) hwpoison_page_list =
+QLIST_HEAD_INITIALIZER(hwpoison_page_list);
+
+static void kvm_unpoison_all(void *param)
+{
+HWPoisonPage *page, *next_page;
+
+QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
+QLIST_REMOVE(page, list);
+qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+qemu_free(page);
+}
+}
+
+static void kvm_hwpoison_page_add(ram_addr_t ram_addr)
+{
+HWPoisonPage *page;
+
+QLIST_FOREACH(page, &hwpoison_page_list, list) {
+if (page->ram_addr == ram_addr)
+return;
+}
+
+page = qemu_malloc(sizeof(HWPoisonPage));
+page->ram_addr = ram_addr;
+QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
+}
+
 int kvm_arch_init(KVMState *s)
 {
 uint64_t identity_base = 0xfffbc000;
@@ -556,6 +592,7 @@ int kvm_arch_init(KVMState *s)
 fprintf(stderr, "e820_add_entry() table is full\n");
 return ret;
 }
+qemu_register_reset(kvm_unpoison_all, NULL);
 
 return 0;
 }
@@ -1882,6 +1919,7 @@ int kvm_arch_on_sigbus_vcpu(CPUState *en
 hardware_memory_error();
 }
 }
+kvm_hwpoison_page_add(ram_addr);
 
 if (code == BUS_MCEERR_AR) {
 /* Fake an Intel architectural Data Load SRAR UCR */
@@ -1926,6 +1964,7 @@ int kvm_arch_on_sigbus(int code, void *a
 "QEMU itself instead of guest system!: %p\n", addr);
 return 0;
 }
+kvm_hwpoison_page_add(ram_addr);
 kvm_mce_inj_srao_memscrub2(first_cpu, paddr);
 } else
 #endif




[PATCH -v2 0/3] KVM, Replace is_hwpoison_address with __get_user_pages

2011-01-29 Thread Huang Ying
v2:

- Export __get_user_pages, because we need other get_user_pages variants too.


[PATCH -v2 1/3] mm, export __get_user_pages
[PATCH -v2 2/3] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page 
optionally
[PATCH -v2 3/3] KVM, Replace is_hwpoison_address with __get_user_pages


[PATCH -v2 2/3] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally

2011-01-29 Thread Huang Ying
Make __get_user_pages return -EHWPOISON for an HWPOISON page only if
FOLL_HWPOISON is specified.  With this patch, interested callers can
distinguish HWPOISON pages from general FAULT pages, while other
callers still get -EFAULT for both, so the user space interface need
not be changed.

This feature is needed by KVM, where a UCR MCE should be relayed to
the guest for an HWPOISON page, while instruction emulation and MMIO
are tried for a general FAULT page.

The idea comes from Andrew Morton.
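
A caller-side sketch of the intended semantics (not part of this patch; it
assumes the __get_user_pages export from patch 1/3, and the helper name is
purely illustrative):

/* Sketch only: tell a hardware-poisoned page apart from an ordinary
 * fault.  Caller must hold current->mm->mmap_sem for read. */
static int user_page_is_hwpoisoned(unsigned long addr)
{
	int ret;

	ret = __get_user_pages(current, current->mm, addr, 1,
			       FOLL_HWPOISON | FOLL_WRITE,
			       NULL, NULL, NULL);
	if (ret == -EHWPOISON)
		return 1;	/* relay a UCR MCE to the guest */
	return 0;		/* -EFAULT etc.: try emulation/MMIO instead */
}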

Signed-off-by: Huang Ying ying.hu...@intel.com
Cc: Andrew Morton a...@linux-foundation.org
---
 include/asm-generic/errno.h |2 ++
 include/linux/mm.h  |1 +
 mm/memory.c |   13 ++---
 3 files changed, 13 insertions(+), 3 deletions(-)

--- a/include/asm-generic/errno.h
+++ b/include/asm-generic/errno.h
@@ -108,4 +108,6 @@
 
 #define ERFKILL         132     /* Operation not possible due to RF-kill */
 
+#define EHWPOISON  133 /* Memory page has hardware error */
+
 #endif
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1534,6 +1534,7 @@ struct page *follow_page(struct vm_area_
 #define FOLL_FORCE 0x10/* get_user_pages read/write w/o permission */
 #define FOLL_MLOCK 0x40/* mark page as mlocked */
 #define FOLL_SPLIT 0x80/* don't return transhuge pages, split them */
+#define FOLL_HWPOISON  0x100   /* check page is hwpoisoned */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1576,9 +1576,16 @@ int __get_user_pages(struct task_struct
			if (ret & VM_FAULT_ERROR) {
				if (ret & VM_FAULT_OOM)
					return i ? i : -ENOMEM;
-				if (ret &
-				    (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE|
-				     VM_FAULT_SIGBUS))
+				if (ret & (VM_FAULT_HWPOISON |
+					   VM_FAULT_HWPOISON_LARGE)) {
+					if (i)
+						return i;
+					else if (gup_flags & FOLL_HWPOISON)
+						return -EHWPOISON;
+					else
+						return -EFAULT;
+				}
+				if (ret & VM_FAULT_SIGBUS)
					return i ? i : -EFAULT;
				BUG();
			}


[PATCH -v2 3/3] KVM, Replace is_hwpoison_address with __get_user_pages

2011-01-29 Thread Huang Ying
is_hwpoison_address only checks whether the page table entry is
hwpoisoned, regardless of the memory page mapped, while
__get_user_pages checks both.

QEMU clears the poisoned page table entry (via unmap/map) to make it
possible to allocate a new memory page for the virtual address across
guest rebooting.  But it is also possible that the underlying memory
page stays poisoned even after the corresponding page table entry is
cleared, that is, a new memory page cannot be allocated.
__get_user_pages can catch these situations.

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 include/linux/mm.h  |8 
 mm/memory-failure.c |   32 
 virt/kvm/kvm_main.c |   11 ++-
 3 files changed, 10 insertions(+), 41 deletions(-)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1627,14 +1627,6 @@ extern int sysctl_memory_failure_recover
 extern void shake_page(struct page *p, int access);
 extern atomic_long_t mce_bad_pages;
 extern int soft_offline_page(struct page *page, int flags);
-#ifdef CONFIG_MEMORY_FAILURE
-int is_hwpoison_address(unsigned long addr);
-#else
-static inline int is_hwpoison_address(unsigned long addr)
-{
-   return 0;
-}
-#endif
 
 extern void dump_page(struct page *page);
 
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1437,35 +1437,3 @@ done:
/* keep elevated page count for bad page */
return ret;
 }
-
-/*
- * The caller must hold current->mm->mmap_sem in read mode.
- */
-int is_hwpoison_address(unsigned long addr)
-{
-   pgd_t *pgdp;
-   pud_t pud, *pudp;
-   pmd_t pmd, *pmdp;
-   pte_t pte, *ptep;
-   swp_entry_t entry;
-
-   pgdp = pgd_offset(current->mm, addr);
-   if (!pgd_present(*pgdp))
-   return 0;
-   pudp = pud_offset(pgdp, addr);
-   pud = *pudp;
-   if (!pud_present(pud) || pud_large(pud))
-   return 0;
-   pmdp = pmd_offset(pudp, addr);
-   pmd = *pmdp;
-   if (!pmd_present(pmd) || pmd_large(pmd))
-   return 0;
-   ptep = pte_offset_map(pmdp, addr);
-   pte = *ptep;
-   pte_unmap(ptep);
-   if (!is_swap_pte(pte))
-   return 0;
-   entry = pte_to_swp_entry(pte);
-   return is_hwpoison_entry(entry);
-}
-EXPORT_SYMBOL_GPL(is_hwpoison_address);
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1029,6 +1029,15 @@ static pfn_t get_fault_pfn(void)
return fault_pfn;
 }
 
+static inline int check_user_page_hwpoison(unsigned long addr)
+{
+   int rc, flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_WRITE;
+
+   rc = __get_user_pages(current, current->mm, addr, 1,
+ flags, NULL, NULL, NULL);
+   return rc == -EHWPOISON;
+}
+
 static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr, bool atomic,
bool *async, bool write_fault, bool *writable)
 {
@@ -1076,7 +1085,7 @@ static pfn_t hva_to_pfn(struct kvm *kvm,
return get_fault_pfn();
 
 down_read(&current->mm->mmap_sem);
-   if (is_hwpoison_address(addr)) {
+   if (check_user_page_hwpoison(addr)) {
 up_read(&current->mm->mmap_sem);
get_page(hwpoison_page);
return page_to_pfn(hwpoison_page);


[PATCH -v2 1/3] mm, export __get_user_pages

2011-01-29 Thread Huang Ying
In most cases, get_user_pages and get_user_pages_fast should be used
to pin user pages in memory.  But sometimes flags other than FOLL_GET,
FOLL_WRITE and FOLL_FORCE are needed, for example FOLL_HWPOISON in the
following KVM patch.  To support such users, __get_user_pages is
exported directly.

There were symbol name conflicts in the InfiniBand drivers; those are
fixed too.
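
For example, with the export in place a caller can pass a flag combination
that plain get_user_pages() cannot express. Sketch only; this mirrors what
the KVM patch at the end of this series does, and the function name is
purely illustrative:

/* Sketch only: probe one user page with extra FOLL_* flags.
 * Returns 1 if the page could be faulted in, or a negative errno. */
static int probe_user_page(unsigned long addr)
{
	int ret;

	down_read(&current->mm->mmap_sem);
	ret = __get_user_pages(current, current->mm, addr, 1,
			       FOLL_TOUCH | FOLL_HWPOISON | FOLL_WRITE,
			       NULL, NULL, NULL);
	up_read(&current->mm->mmap_sem);
	return ret;
}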

Signed-off-by: Huang Ying ying.hu...@intel.com
CC: Andrew Morton a...@linux-foundation.org
CC: Michel Lespinasse wal...@google.com
CC: Roland Dreier rol...@kernel.org
CC: Ralph Campbell infinip...@qlogic.com
---
 drivers/infiniband/hw/ipath/ipath_user_pages.c |6 +--
 drivers/infiniband/hw/qib/qib_user_pages.c |6 +--
 include/linux/mm.h |4 ++
 mm/internal.h  |5 --
 mm/memory.c|   50 +
 5 files changed, 60 insertions(+), 11 deletions(-)

--- a/drivers/infiniband/hw/ipath/ipath_user_pages.c
+++ b/drivers/infiniband/hw/ipath/ipath_user_pages.c
@@ -53,8 +53,8 @@ static void __ipath_release_user_pages(s
 }
 
 /* call with current->mm->mmap_sem held */
-static int __get_user_pages(unsigned long start_page, size_t num_pages,
-   struct page **p, struct vm_area_struct **vma)
+static int __ipath_get_user_pages(unsigned long start_page, size_t num_pages,
+ struct page **p, struct vm_area_struct **vma)
 {
unsigned long lock_limit;
size_t got;
@@ -165,7 +165,7 @@ int ipath_get_user_pages(unsigned long s
 
	down_write(&current->mm->mmap_sem);
 
-   ret = __get_user_pages(start_page, num_pages, p, NULL);
+   ret = __ipath_get_user_pages(start_page, num_pages, p, NULL);
 
	up_write(&current->mm->mmap_sem);
 
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -51,8 +51,8 @@ static void __qib_release_user_pages(str
 /*
  * Call with current->mm->mmap_sem held.
  */
-static int __get_user_pages(unsigned long start_page, size_t num_pages,
-   struct page **p, struct vm_area_struct **vma)
+static int __qib_get_user_pages(unsigned long start_page, size_t num_pages,
+   struct page **p, struct vm_area_struct **vma)
 {
unsigned long lock_limit;
size_t got;
@@ -136,7 +136,7 @@ int qib_get_user_pages(unsigned long sta
 
	down_write(&current->mm->mmap_sem);
 
-   ret = __get_user_pages(start_page, num_pages, p, NULL);
+   ret = __qib_get_user_pages(start_page, num_pages, p, NULL);
 
	up_write(&current->mm->mmap_sem);
 
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -965,6 +965,10 @@ static inline int handle_mm_fault(struct
 extern int make_pages_present(unsigned long addr, unsigned long end);
 extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void 
*buf, int len, int write);
 
+int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+unsigned long start, int len, unsigned int foll_flags,
+struct page **pages, struct vm_area_struct **vmas,
+int *nonblocking);
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int nr_pages, int write, int force,
struct page **pages, struct vm_area_struct **vmas);
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -245,11 +245,6 @@ static inline void mminit_validate_memmo
 }
 #endif /* CONFIG_SPARSEMEM */
 
-int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-unsigned long start, int len, unsigned int foll_flags,
-struct page **pages, struct vm_area_struct **vmas,
-int *nonblocking);
-
 #define ZONE_RECLAIM_NOSCAN-2
 #define ZONE_RECLAIM_FULL  -1
 #define ZONE_RECLAIM_SOME  0
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1410,6 +1410,55 @@ no_page_table:
return page;
 }
 
+/**
+ * __get_user_pages() - pin user pages in memory
+ * @tsk:   task_struct of target task
+ * @mm:mm_struct of target mm
+ * @start: starting user address
+ * @nr_pages:  number of pages from start to pin
+ * @gup_flags: flags modifying pin behaviour
+ * @pages: array that receives pointers to the pages pinned.
+ * Should be at least nr_pages long. Or NULL, if caller
+ * only intends to ensure the pages are faulted in.
+ * @vmas:  array of pointers to vmas corresponding to each page.
+ * Or NULL if the caller does not require them.
+ * @nonblocking: whether waiting for disk IO or mmap_sem contention
+ *
+ * Returns number of pages pinned. This may be fewer than the number
+ * requested. If nr_pages is 0 or negative, returns 0. If no pages
+ * were pinned, returns -errno. Each page returned must be released
+ * with a put_page() call when it is finished with. vmas will only

Re: [PATCH 1/2] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally

2011-01-26 Thread Huang Ying
Hi, Andrew,

On Thu, 2011-01-20 at 23:50 +0800, Marcelo Tosatti wrote:
 On Mon, Jan 17, 2011 at 08:47:39AM +0800, Huang Ying wrote:
  Hi, Andrew,
  
  On Sun, 2011-01-16 at 23:35 +0800, Avi Kivity wrote:
   On 01/14/2011 03:37 AM, Huang Ying wrote:
On Thu, 2011-01-13 at 18:43 +0800, Avi Kivity wrote:
  On 01/13/2011 10:42 AM, Huang Ying wrote:
Make __get_user_pages return -EHWPOISON for HWPOISON page only if
FOLL_HWPOISON is specified.  With this patch, the interested 
 callers
can distinguish HWPOISON page from general FAULT page, while other
callers will still get -EFAULT for pages, so the user space 
 interface
need not to be changed.
  
get_user_pages_hwpoison is added as a variant of get_user_pages 
 that
can return -EHWPOISON for HWPOISON page.
  
This feature is needed by KVM, where UCR MCE should be relayed to
guest for HWPOISON page, while instruction emulation and MMIO 
 will be
tried for general FAULT page.
  
The idea comes from Andrew Morton.
  
 Signed-off-by: Huang Ying ying.hu...@intel.com
 Cc: Andrew Morton a...@linux-foundation.org
  
+#ifdef CONFIG_MEMORY_FAILURE
+int get_user_pages_hwpoison(struct task_struct *tsk, struct 
 mm_struct *mm,
+ unsigned long start, int nr_pages, int 
 write,
+ int force, struct page **pages,
+ struct vm_area_struct **vmas);
+#else

  Since we'd also like to add get_user_pages_noio(), perhaps adding a
  flags field to get_user_pages() is better.
   
Or export __get_user_pages()?
   
   That's better, yes.
  
  Do you think it is a good idea to export __get_user_pages() instead of
  adding get_user_pages_noio() and get_user_pages_hwpoison()?
 
 Better Andrew and/or Linus should decide it.

We really need your comments about this.

Best Regards,
Huang Ying




Re: [PATCH 1/2] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally

2011-01-16 Thread Huang Ying
Hi, Andrew,

On Sun, 2011-01-16 at 23:35 +0800, Avi Kivity wrote:
 On 01/14/2011 03:37 AM, Huang Ying wrote:
  On Thu, 2011-01-13 at 18:43 +0800, Avi Kivity wrote:
On 01/13/2011 10:42 AM, Huang Ying wrote:
  Make __get_user_pages return -EHWPOISON for HWPOISON page only if
  FOLL_HWPOISON is specified.  With this patch, the interested callers
  can distinguish HWPOISON page from general FAULT page, while other
  callers will still get -EFAULT for pages, so the user space interface
  need not to be changed.

  get_user_pages_hwpoison is added as a variant of get_user_pages that
  can return -EHWPOISON for HWPOISON page.

  This feature is needed by KVM, where UCR MCE should be relayed to
  guest for HWPOISON page, while instruction emulation and MMIO will be
  tried for general FAULT page.

  The idea comes from Andrew Morton.

  Signed-off-by: Huang Ying ying.hu...@intel.com
  Cc: Andrew Morton a...@linux-foundation.org

  +#ifdef CONFIG_MEMORY_FAILURE
  +int get_user_pages_hwpoison(struct task_struct *tsk, struct 
   mm_struct *mm,
  + unsigned long start, int nr_pages, int 
   write,
  + int force, struct page **pages,
  + struct vm_area_struct **vmas);
  +#else
  
Since we'd also like to add get_user_pages_noio(), perhaps adding a
flags field to get_user_pages() is better.
 
  Or export __get_user_pages()?
 
 That's better, yes.

Do you think it is a good idea to export __get_user_pages() instead of
adding get_user_pages_noio() and get_user_pages_hwpoison()?

Best Regards,
Huang Ying




Re: [PATCH] KVM: x86: Remove user space triggerable MCE error message

2011-01-16 Thread Huang Ying
On Sat, 2011-01-15 at 17:00 +0800, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 This case is a pure user space error we do not need to record. Moreover,
 it can be misused to flood the kernel log. Remove it.

I don't think this is a pure user space error.  This happens on real
hardware too, if the Machine Check exception is raised during the early
boot stage, or if a second MC exception is raised before the first one
is processed/cleared.

So I use printk here to help debug these issues.

To avoid flooding the kernel log, we can use ratelimiting.
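
A minimal sketch of that, using the kernel's printk_ratelimited() helper
(illustrative only; the message text is the one the patch removes):

/* Sketch only: keep the diagnostic but prevent user space (or a guest)
 * from flooding the kernel log. */
printk_ratelimited(KERN_DEBUG "kvm: set_mce: "
		   "injects mce exception while "
		   "previous one is in progress!\n");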

Best Regards,
Huang Ying

 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
  arch/x86/kvm/x86.c |3 ---
  1 files changed, 0 insertions(+), 3 deletions(-)
 
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 9dda70d..7f7e4a5 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -2575,9 +2575,6 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu 
 *vcpu,
   if (mce->status & MCI_STATUS_UC) {
   if ((vcpu->arch.mcg_status & MCG_STATUS_MCIP) ||
   !kvm_read_cr4_bits(vcpu, X86_CR4_MCE)) {
  - printk(KERN_DEBUG "kvm: set_mce: "
  -        "injects mce exception while "
  -        "previous one is in progress!\n");
   kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
   return 0;
   }




Re: [PATCH uq/master 2/2] MCE, unpoison memory address across reboot

2011-01-16 Thread Huang Ying
On Fri, 2011-01-14 at 16:38 +0800, Jan Kiszka wrote:
 Am 14.01.2011 02:51, Huang Ying wrote:
  On Thu, 2011-01-13 at 17:01 +0800, Jan Kiszka wrote:
  Am 13.01.2011 09:34, Huang Ying wrote:
[snip]
  +
  +void kvm_unpoison_all(void *param)
 
  Minor nit: This can be static now.
  
  In uq/master, it can be make static.  But in kvm/master, kvm_arch_init
  is not compiled because of conditional compiling, so we will get warning
  and error for unused symbol.  Should we consider kvm/master in this
  patch?
 
 qemu-kvm is very close to switching to upstream kvm_*init. As long as it
 requires this service in its own modules, it will have to patch this
 detail. It does this for other functions already.

OK.  I will change this.

[snip]
  As indicated, I'm sitting on lots of fixes and refactorings of the MCE
  user space code. How do you test your patches? Any suggestions how to do
  this efficiently would be warmly welcome.
  
  We use a self-made test script to test.  Repository is at:
  
  git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
  
  The kvm test script is in kvm sub-directory.
  
  The qemu patch attached is need by the test script.
  
 
 Yeah, I already found this yesterday and started reading. I was just
 searching for p2v in qemu, but now it's clear where it comes from. Will
 have a look (if you want to preview my changes:
 git://git.kiszka.org/qemu-kvm.git queues/kvm-upstream).
 
 I was almost about to use MADV_HWPOISON instead of the injection module.
 Is there a way to recover the fake corruption afterward? I think that
 would allow to move some of the test logic into qemu and avoid p2v which
 - IIRC - was disliked upstream.

I don't know how to fully recover from MADV_HWPOISON.  You can recover
the virtual address space via qemu_ram_remap() introduced in 1/2 of this
patchset, but you will lose one or several physical pages for each test
run.  I think that may not be a big issue for a testing machine.

Ccing Andi and Fengguang; they know more than I do about MADV_HWPOISON.

 Also, is there a way to simulate corrected errors (BUS_MCEERR_AO)?

BUS_MCEERR_AO is a recoverable uncorrected error, not a corrected
error.

The test script covers both BUS_MCEERR_AO and BUS_MCEERR_AR.  To see
the effect of pure BUS_MCEERR_AO, just remove the memory access loop
(memset) in tools/simple_process/simple_process.c.
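
For reference, a minimal sketch of the MADV_HWPOISON-style injection Jan
mentions (not from the test suite; it needs CONFIG_MEMORY_FAILURE and
CAP_SYS_ADMIN, and the poisoned page really is offlined, so run it on a
test machine only):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_HWPOISON
#define MADV_HWPOISON 100       /* from asm-generic/mman-common.h */
#endif

/* Sketch only: poison one anonymous page, then touch it and expect the
 * kernel to kill us with SIGBUS (BUS_MCEERR_AR) on the access. */
int main(void)
{
    long psize = sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, psize, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    memset(p, 0, psize);                /* make sure the page is populated */
    if (madvise(p, psize, MADV_HWPOISON)) {
        perror("madvise(MADV_HWPOISON)");
        return 1;
    }
    printf("page poisoned, touching it...\n");
    fflush(stdout);
    p[0] = 1;                           /* SIGBUS expected here */
    return 0;
}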

Best Regards,
Huang Ying




Re: [PATCH] KVM: x86: Remove user space triggerable MCE error message

2011-01-16 Thread Huang Ying
On Mon, 2011-01-17 at 15:11 +0800, Jan Kiszka wrote:
 On 2011-01-17 01:54, Huang Ying wrote:
  On Sat, 2011-01-15 at 17:00 +0800, Jan Kiszka wrote:
  From: Jan Kiszka jan.kis...@siemens.com
 
  This case is a pure user space error we do not need to record. Moreover,
  it can be misused to flood the kernel log. Remove it.
  
  I don't think this is a pure user space error.  This happens on real
  hardware too, if the Machine Check exception is raised during early boot
  stage or the second MC exception is raised before the first MC exception
  is processed/cleared.
  
  So I use printk here to help debugging these issues.
  
  To avoid flooding the kernel log, we can use ratelimit.
 
 With user space I meant qemu, and maybe error was the wrong term. This
 code path is only triggered if qemu decides to.

It is not only decided by qemu, but also by the guest OS.  If the guest
OS does not clear the MSR, or does not set the X86_CR4_MCE bit in CR4,
the triple fault will be triggered.

 And there you may also
 print this event (and you already do).

Sorry, which print do you mean?  I cannot find a similar print in user
space.

 Another reason to not rely on catching this case here: KVM_X86_SET_MCE
 is obsolete on current kernels. Qemu will use a combination of
 KVM_SET_MSRS and KVM_SET_VCPU_EVENTS in the future, only falling back to
 this interface on pre-vcpu-events kernels. Then you need to debug this
 in user space anyway as the triple fault will no longer make it to the
 kernel.

OK.  Then I think it will be helpful for debugging if we print
something like this in the user space implementation.

Best Regards,
Huang Ying




[PATCH uq/master 2/2] MCE, unpoison memory address across reboot

2011-01-13 Thread Huang Ying
In the Linux kernel HWPoison processing implementation, the virtual
address mapping the faulty physical memory page is marked as HWPoison
in every process that maps it, so that any further access to that
virtual address kills the corresponding process with SIGBUS.

If the faulty physical memory page is used by a KVM guest, the SIGBUS
is sent to QEMU, and QEMU simulates an MCE to report the memory error
to the guest OS.  If the guest OS cannot recover from the error (for
example, because the page is accessed by kernel code), the guest OS
reboots the system.  But because the underlying host virtual address
backing the guest physical memory is still poisoned, if the guest
accesses the corresponding guest physical memory even after rebooting,
the SIGBUS is sent to QEMU again and another MCE is simulated.  That
is, the guest system cannot recover by rebooting.

In fact, across a reboot the contents of the guest physical memory page
need not be kept.  We can allocate a new host physical page to back the
corresponding guest physical address.

This patch fixes the issue in QEMU by calling qemu_ram_remap() to clear
the corresponding page table entry, so that a new page can be allocated
later to recover from the memory error.

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 kvm.h |2 ++
 target-i386/kvm.c |   39 +++
 2 files changed, 41 insertions(+)

--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -580,6 +580,42 @@ static int kvm_get_supported_msrs(void)
 return ret;
 }
 
+struct HWPoisonPage;
+typedef struct HWPoisonPage HWPoisonPage;
+struct HWPoisonPage
+{
+ram_addr_t ram_addr;
+QLIST_ENTRY(HWPoisonPage) list;
+};
+
+static QLIST_HEAD(hwpoison_page_list, HWPoisonPage) hwpoison_page_list =
+QLIST_HEAD_INITIALIZER(hwpoison_page_list);
+
+void kvm_unpoison_all(void *param)
+{
+HWPoisonPage *page, *next_page;
+
+QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
+QLIST_REMOVE(page, list);
+qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+qemu_free(page);
+}
+}
+
+static void kvm_hwpoison_page_add(ram_addr_t ram_addr)
+{
+HWPoisonPage *page;
+
+QLIST_FOREACH(page, &hwpoison_page_list, list) {
+if (page->ram_addr == ram_addr)
+return;
+}
+
+page = qemu_malloc(sizeof(HWPoisonPage));
+page->ram_addr = ram_addr;
+QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
+}
+
 int kvm_arch_init(void)
 {
 uint64_t identity_base = 0xfffbc000;
@@ -632,6 +668,7 @@ int kvm_arch_init(void)
 fprintf(stderr, "e820_add_entry() table is full\n");
 return ret;
 }
+qemu_register_reset(kvm_unpoison_all, NULL);
 
 return 0;
 }
@@ -1940,6 +1977,7 @@ int kvm_on_sigbus_vcpu(CPUState *env, in
 hardware_memory_error();
 }
 }
+kvm_hwpoison_page_add(ram_addr);
 
 if (code == BUS_MCEERR_AR) {
 /* Fake an Intel architectural Data Load SRAR UCR */
@@ -1984,6 +2022,7 @@ int kvm_on_sigbus(int code, void *addr)
 "QEMU itself instead of guest system!: %p\n", addr);
 return 0;
 }
+kvm_hwpoison_page_add(ram_addr);
 kvm_mce_inj_srao_memscrub2(first_cpu, paddr);
 } else
 #endif
--- a/kvm.h
+++ b/kvm.h
@@ -188,6 +188,8 @@ int kvm_physical_memory_addr_from_ram(ra
   target_phys_addr_t *phys_addr);
 #endif
 
+void kvm_unpoison_all(void *param);
+
 #endif
 int kvm_set_ioeventfd_mmio_long(int fd, uint32_t adr, uint32_t val, bool 
assign);
 




[PATCH uq/master 1/2] Add qemu_ram_remap

2011-01-13 Thread Huang Ying
qemu_ram_remap() unmaps the specified RAM pages, then re-maps these
pages again.  This is used by KVM HWPoison support to clear HWPoisoned
page tables across guest rebooting, so that a new page may be
allocated later to recover the memory error.

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 cpu-all.h|4 +++
 cpu-common.h |1 
 exec.c   |   61 ++-
 3 files changed, 65 insertions(+), 1 deletion(-)

--- a/cpu-all.h
+++ b/cpu-all.h
@@ -863,10 +863,14 @@ target_phys_addr_t cpu_get_phys_page_deb
 extern int phys_ram_fd;
 extern ram_addr_t ram_size;
 
+/* RAM is pre-allocated and passed into qemu_ram_alloc_from_ptr */
+#define RAM_PREALLOC_MASK  (1 << 0)
+
 typedef struct RAMBlock {
 uint8_t *host;
 ram_addr_t offset;
 ram_addr_t length;
+uint32_t flags;
 char idstr[256];
 QLIST_ENTRY(RAMBlock) next;
 #if defined(__linux__) && !defined(TARGET_S390X)
--- a/exec.c
+++ b/exec.c
@@ -2830,6 +2830,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(Devic
 
 if (host) {
 new_block->host = host;
+new_block->flags |= RAM_PREALLOC_MASK;
 } else {
 if (mem_path) {
 #if defined (__linux__) && !defined(TARGET_S390X)
@@ -2883,7 +2884,9 @@ void qemu_ram_free(ram_addr_t addr)
 QLIST_FOREACH(block, &ram_list.blocks, next) {
 if (addr == block->offset) {
 QLIST_REMOVE(block, next);
-if (mem_path) {
+if (block->flags & RAM_PREALLOC_MASK)
+;
+else if (mem_path) {
 #if defined (__linux__) && !defined(TARGET_S390X)
 if (block->fd) {
 munmap(block->host, block->length);
@@ -2906,6 +2909,62 @@ void qemu_ram_free(ram_addr_t addr)
 
 }
 
+void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
+{
+RAMBlock *block;
+ram_addr_t offset;
+int flags;
+void *area, *vaddr;
+
+QLIST_FOREACH(block, &ram_list.blocks, next) {
+offset = addr - block->offset;
+if (offset < block->length) {
+vaddr = block->host + offset;
+if (block->flags & RAM_PREALLOC_MASK) {
+;
+} else {
+flags = MAP_FIXED;
+munmap(vaddr, length);
+if (mem_path) {
+#if defined (__linux__) && !defined(TARGET_S390X)
+if (block->fd) {
+#ifdef MAP_POPULATE
+flags |= mem_prealloc ? MAP_POPULATE | MAP_SHARED :
+MAP_PRIVATE;
+#else
+flags |= MAP_PRIVATE;
+#endif
+area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
+flags, block->fd, offset);
+} else {
+flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
+flags, -1, 0);
+}
+#endif
+} else {
+#if defined(TARGET_S390X) && defined(CONFIG_KVM)
+flags |= MAP_SHARED | MAP_ANONYMOUS;
+area = mmap(vaddr, length, PROT_EXEC|PROT_READ|PROT_WRITE,
+flags, -1, 0);
+#else
+flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
+flags, -1, 0);
+#endif
+}
+if (area != vaddr) {
+fprintf(stderr, "Could not remap addr: %lx@%lx\n",
+length, addr);
+exit(1);
+}
+qemu_madvise(vaddr, length, QEMU_MADV_MERGEABLE);
+}
+return;
+}
+}
+}
+
 /* Return a host pointer to ram allocated with qemu_ram_alloc.
With the exception of the softmmu code in this file, this should
only be used for local memory (e.g. video ram) that the device owns,
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -50,6 +50,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(Devic
 ram_addr_t size, void *host);
 ram_addr_t qemu_ram_alloc(DeviceState *dev, const char *name, ram_addr_t size);
 void qemu_ram_free(ram_addr_t addr);
+void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
 /* This should only be used for ram local to a device.  */
 void *qemu_get_ram_ptr(ram_addr_t addr);
 /* Same but slower, to use for migration, where the order of




[PATCH 1/2] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally

2011-01-13 Thread Huang Ying
Make __get_user_pages return -EHWPOISON for an HWPOISON page only if
FOLL_HWPOISON is specified.  With this patch, interested callers can
distinguish HWPOISON pages from general FAULT pages, while other
callers still get -EFAULT for these pages, so the user space interface
need not be changed.

get_user_pages_hwpoison is added as a variant of get_user_pages that
can return -EHWPOISON for an HWPOISON page.

This feature is needed by KVM, where a UCR MCE should be relayed to
the guest for an HWPOISON page, while instruction emulation and MMIO
are tried for a general FAULT page.

The idea comes from Andrew Morton.

Signed-off-by: Huang Ying ying.hu...@intel.com
Cc: Andrew Morton a...@linux-foundation.org
---
 include/asm-generic/errno.h |2 +
 include/linux/mm.h  |   17 +
 mm/memory.c |   55 +---
 3 files changed, 71 insertions(+), 3 deletions(-)

--- a/include/asm-generic/errno.h
+++ b/include/asm-generic/errno.h
@@ -108,4 +108,6 @@
 
 #define ERFKILL         132     /* Operation not possible due to RF-kill */
 
+#define EHWPOISON  133 /* Memory page has hardware error */
+
 #endif
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -860,6 +860,22 @@ int get_user_pages(struct task_struct *t
struct page **pages, struct vm_area_struct **vmas);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
+#ifdef CONFIG_MEMORY_FAILURE
+int get_user_pages_hwpoison(struct task_struct *tsk, struct mm_struct *mm,
+   unsigned long start, int nr_pages, int write,
+   int force, struct page **pages,
+   struct vm_area_struct **vmas);
+#else
+static inline int get_user_pages_hwpoison(struct task_struct *tsk,
+ struct mm_struct *mm,
+ unsigned long start, int nr_pages,
+ int write, int force,
+ struct page **pages,
+ struct vm_area_struct **vmas) {
+   return get_user_pages(tsk, mm, start, nr_pages,
+ write, force, pages, vmas);
+}
+#endif
 struct page *get_dump_page(unsigned long addr);
 
 extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
@@ -1415,6 +1431,7 @@ struct page *follow_page(struct vm_area_
 #define FOLL_GET   0x04/* do get_page on page */
 #define FOLL_DUMP  0x08/* give error on hole if it would be zero */
 #define FOLL_FORCE 0x10/* get_user_pages read/write w/o permission */
+#define FOLL_HWPOISON  0x20/* check page is hwpoisoned */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1449,9 +1449,16 @@ int __get_user_pages(struct task_struct
			if (ret & VM_FAULT_ERROR) {
				if (ret & VM_FAULT_OOM)
					return i ? i : -ENOMEM;
-				if (ret &
-				    (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE|
-				     VM_FAULT_SIGBUS))
+				if (ret & (VM_FAULT_HWPOISON |
+					   VM_FAULT_HWPOISON_LARGE)) {
+					if (i)
+						return i;
+					else if (gup_flags & FOLL_HWPOISON)
+						return -EHWPOISON;
+					else
+						return -EFAULT;
+				}
+				if (ret & VM_FAULT_SIGBUS)
					return i ? i : -EFAULT;
				BUG();
			}
@@ -1563,6 +1570,48 @@ int get_user_pages(struct task_struct *t
 }
 EXPORT_SYMBOL(get_user_pages);
 
+#ifdef CONFIG_MEMORY_FAILURE
+/**
+ * get_user_pages_hwpoison() - pin user pages in memory, return hwpoison status
+ * @tsk:   task_struct of target task
+ * @mm:mm_struct of target mm
+ * @start: starting user address
+ * @nr_pages:  number of pages from start to pin
+ * @write: whether pages will be written to by the caller
+ * @force: whether to force write access even if user mapping is
+ * readonly. This will result in the page being COWed even
+ * in MAP_SHARED mappings. You do not want this.
+ * @pages: array that receives pointers to the pages pinned.
+ * Should be at least nr_pages long. Or NULL

[PATCH 2/2] KVM, Replace is_hwpoison_address with get_user_pages_hwpoison

2011-01-13 Thread Huang Ying
is_hwpoison_address only checks whether the page table entry is
hwpoisoned, regardless of the memory page mapped, while
get_user_pages_hwpoison checks both.

QEMU clears the poisoned page table entry (via unmap/map) to make it
possible to allocate a new memory page for the virtual address across
guest rebooting.  But it is also possible that the underlying memory
page stays poisoned even after the corresponding page table entry is
cleared, that is, a new memory page cannot be allocated.
get_user_pages_hwpoison can catch these situations.

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 include/linux/mm.h  |8 
 mm/memory-failure.c |   32 
 virt/kvm/kvm_main.c |4 +++-
 3 files changed, 3 insertions(+), 41 deletions(-)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1524,14 +1524,6 @@ extern int sysctl_memory_failure_recover
 extern void shake_page(struct page *p, int access);
 extern atomic_long_t mce_bad_pages;
 extern int soft_offline_page(struct page *page, int flags);
-#ifdef CONFIG_MEMORY_FAILURE
-int is_hwpoison_address(unsigned long addr);
-#else
-static inline int is_hwpoison_address(unsigned long addr)
-{
-   return 0;
-}
-#endif
 
 extern void dump_page(struct page *page);
 
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1433,35 +1433,3 @@ done:
/* keep elevated page count for bad page */
return ret;
 }
-
-/*
- * The caller must hold current->mm->mmap_sem in read mode.
- */
-int is_hwpoison_address(unsigned long addr)
-{
-   pgd_t *pgdp;
-   pud_t pud, *pudp;
-   pmd_t pmd, *pmdp;
-   pte_t pte, *ptep;
-   swp_entry_t entry;
-
-   pgdp = pgd_offset(current->mm, addr);
-   if (!pgd_present(*pgdp))
-   return 0;
-   pudp = pud_offset(pgdp, addr);
-   pud = *pudp;
-   if (!pud_present(pud) || pud_large(pud))
-   return 0;
-   pmdp = pmd_offset(pudp, addr);
-   pmd = *pmdp;
-   if (!pmd_present(pmd) || pmd_large(pmd))
-   return 0;
-   ptep = pte_offset_map(pmdp, addr);
-   pte = *ptep;
-   pte_unmap(ptep);
-   if (!is_swap_pte(pte))
-   return 0;
-   entry = pte_to_swp_entry(pte);
-   return is_hwpoison_entry(entry);
-}
-EXPORT_SYMBOL_GPL(is_hwpoison_address);
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -966,7 +966,9 @@ static pfn_t hva_to_pfn(struct kvm *kvm,
goto return_fault_page;
 
	down_read(&current->mm->mmap_sem);
-	if (is_hwpoison_address(addr)) {
+	npages = get_user_pages_hwpoison(current, current->mm,
+addr, 1, 1, 0, page, NULL);
+   if (npages == -EHWPOISON) {
	up_read(&current->mm->mmap_sem);
get_page(hwpoison_page);
return page_to_pfn(hwpoison_page);




Re: [PATCH 1/2] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally

2011-01-13 Thread Huang Ying
On Thu, 2011-01-13 at 18:43 +0800, Avi Kivity wrote:
 On 01/13/2011 10:42 AM, Huang Ying wrote:
  Make __get_user_pages return -EHWPOISON for HWPOISON page only if
  FOLL_HWPOISON is specified.  With this patch, the interested callers
  can distinguish HWPOISON page from general FAULT page, while other
  callers will still get -EFAULT for pages, so the user space interface
  need not to be changed.
 
  get_user_pages_hwpoison is added as a variant of get_user_pages that
  can return -EHWPOISON for HWPOISON page.
 
  This feature is needed by KVM, where UCR MCE should be relayed to
  guest for HWPOISON page, while instruction emulation and MMIO will be
  tried for general FAULT page.
 
  The idea comes from Andrew Morton.
 
   Signed-off-by: Huang Ying ying.hu...@intel.com
   Cc: Andrew Morton a...@linux-foundation.org
 
  +#ifdef CONFIG_MEMORY_FAILURE
  +int get_user_pages_hwpoison(struct task_struct *tsk, struct mm_struct *mm,
  +   unsigned long start, int nr_pages, int write,
  +   int force, struct page **pages,
  +   struct vm_area_struct **vmas);
  +#else
 
 Since we'd also like to add get_user_pages_noio(), perhaps adding a 
 flags field to get_user_pages() is better.

Or export __get_user_pages()?

Best Regards,
Huang Ying




Re: [PATCH uq/master 2/2] MCE, unpoison memory address across reboot

2011-01-13 Thread Huang Ying
On Thu, 2011-01-13 at 17:01 +0800, Jan Kiszka wrote:
 Am 13.01.2011 09:34, Huang Ying wrote:
  In Linux kernel HWPoison processing implementation, the virtual
  address in processes mapping the error physical memory page is marked
  as HWPoison.  So that, the further accessing to the virtual
  address will kill corresponding processes with SIGBUS.
  
  If the error physical memory page is used by a KVM guest, the SIGBUS
  will be sent to QEMU, and QEMU will simulate a MCE to report that
  memory error to the guest OS.  If the guest OS can not recover from
  the error (for example, the page is accessed by kernel code), guest OS
  will reboot the system.  But because the underlying host virtual
  address backing the guest physical memory is still poisoned, if the
  guest system accesses the corresponding guest physical memory even
  after rebooting, the SIGBUS will still be sent to QEMU and MCE will be
  simulated.  That is, guest system can not recover via rebooting.
  
  In fact, across rebooting, the contents of guest physical memory page
  need not to be kept.  We can allocate a new host physical page to
  back the corresponding guest physical address.
  
  This patch fixes this issue in QEMU via calling qemu_ram_remap() to
  clear the corresponding page table entry, so that make it possible to
  allocate a new page to recover the issue.
  
  Signed-off-by: Huang Ying ying.hu...@intel.com
  ---
   kvm.h |2 ++
   target-i386/kvm.c |   39 +++
   2 files changed, 41 insertions(+)
  
  --- a/target-i386/kvm.c
  +++ b/target-i386/kvm.c
  @@ -580,6 +580,42 @@ static int kvm_get_supported_msrs(void)
   return ret;
   }
   
  +struct HWPoisonPage;
  +typedef struct HWPoisonPage HWPoisonPage;
  +struct HWPoisonPage
  +{
  +ram_addr_t ram_addr;
  +QLIST_ENTRY(HWPoisonPage) list;
  +};
  +
  +static QLIST_HEAD(hwpoison_page_list, HWPoisonPage) hwpoison_page_list =
  +QLIST_HEAD_INITIALIZER(hwpoison_page_list);
  +
  +void kvm_unpoison_all(void *param)
 
 Minor nit: This can be static now.

In uq/master it can be made static.  But in kvm/master, kvm_arch_init
is not compiled due to conditional compilation, so we would get a
warning and an error for the unused symbol.  Should we consider
kvm/master in this patch?

  +{
  +HWPoisonPage *page, *next_page;
  +
  +QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
  +QLIST_REMOVE(page, list);
  +qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
  +qemu_free(page);
  +}
  +}
  +
  +static void kvm_hwpoison_page_add(ram_addr_t ram_addr)
  +{
  +HWPoisonPage *page;
  +
  +QLIST_FOREACH(page, &hwpoison_page_list, list) {
  +if (page->ram_addr == ram_addr)
  +return;
  +}
  +
  +page = qemu_malloc(sizeof(HWPoisonPage));
  +page->ram_addr = ram_addr;
  +QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
  +}
  +
   int kvm_arch_init(void)
   {
   uint64_t identity_base = 0xfffbc000;
  @@ -632,6 +668,7 @@ int kvm_arch_init(void)
  fprintf(stderr, "e820_add_entry() table is full\n");
   return ret;
   }
  +qemu_register_reset(kvm_unpoison_all, NULL);
   
   return 0;
   }
  @@ -1940,6 +1977,7 @@ int kvm_on_sigbus_vcpu(CPUState *env, in
   hardware_memory_error();
   }
   }
  +kvm_hwpoison_page_add(ram_addr);
   
   if (code == BUS_MCEERR_AR) {
   /* Fake an Intel architectural Data Load SRAR UCR */
  @@ -1984,6 +2022,7 @@ int kvm_on_sigbus(int code, void *addr)
  "QEMU itself instead of guest system!: %p\n", addr);
   return 0;
   }
  +kvm_hwpoison_page_add(ram_addr);
   kvm_mce_inj_srao_memscrub2(first_cpu, paddr);
   } else
   #endif
  --- a/kvm.h
  +++ b/kvm.h
  @@ -188,6 +188,8 @@ int kvm_physical_memory_addr_from_ram(ra
 target_phys_addr_t *phys_addr);
   #endif
   
  +void kvm_unpoison_all(void *param);
  +
 
 To be removed if kvm_unpoison_all is static.
 
   #endif
   int kvm_set_ioeventfd_mmio_long(int fd, uint32_t adr, uint32_t val, bool 
  assign);
   
  
 
 As indicated, I'm sitting on lots of fixes and refactorings of the MCE
 user space code. How do you test your patches? Any suggestions how to do
 this efficiently would be warmly welcome.

We use a self-made test script to test.  Repository is at:

git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git

The kvm test script is in kvm sub-directory.

The attached qemu patch is needed by the test script.

Best Regards,
Huang Ying

Author: Max Asbock masb...@linux.vnet.ibm.com
Subject: [PATCH -v3] Monitor command: x-gpa2hva, translate guest physical address to host virtual address

Add command x-gpa2hva to translate a guest physical address to a host
virtual address. Because the gpa-to-hva translation is not consistent,
this command is intended for debugging only.

The x-gpa2hva command

Re: [PATCH v3 12/21] kvm: x86: Drop MCE MSRs write back restrictions

2011-01-05 Thread Huang Ying
On Wed, 2011-01-05 at 16:07 +0800, Jan Kiszka wrote:
 Am 05.01.2011 07:42, Huang Ying wrote:
  On Tue, 2011-01-04 at 16:32 +0800, Jan Kiszka wrote:
  From: Jan Kiszka jan.kis...@siemens.com
 
  There is no need to restrict writing back MCE MSRs to reset or full
  state updates as setting their values has no side effects.
  
  Sorry for late.
 
 Don't worry.
 
  
  The MCE MSRs contents is sticky for warm reset except MCG_STATUS, so
  their content should be kept.  And the following sequence may set
  uncorrected value in MCE registers.
  
  savevm - loadvm - (OS clear MCE registers) - reset - (MCE registers
  has new (uncorrected) value)
 
 Sorry, I can't follow. Unless I miss some subtle detail, the question is
 not when we transfer the mcg_* CPUState fields to the kernel, but when
 and how we manipulate them in user space, e.g. on reset. Where are those
 fields touched incorrectly between get and put msrs so that we cannot
 write them back?

If my understanding is correct, the MSRs are not saved to user space
(env->mce_banks) during reset in the current code.  So if all MCE MSRs
are restored to the kernel, their user space contents from a previous
loadvm may be put into the kernel after reset.

Best Regards,
Huang Ying





Re: [PATCH v3 12/21] kvm: x86: Drop MCE MSRs write back restrictions

2011-01-04 Thread Huang Ying
On Tue, 2011-01-04 at 16:32 +0800, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 There is no need to restrict writing back MCE MSRs to reset or full
 state updates as setting their values has no side effects.

Sorry for late.

The MCE MSRs' contents are sticky across a warm reset, except for
MCG_STATUS, so their contents should be kept.  And the following
sequence may put incorrect (stale) values back into the MCE registers:

savevm -> loadvm -> (OS clears MCE registers) -> reset -> (MCE
registers hold the old, uncleared values again)

Best Regards,
Huang Ying

 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 CC: Huang Ying ying.hu...@intel.com
 ---
  target-i386/kvm.c |   12 
  1 files changed, 4 insertions(+), 8 deletions(-)
 
 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index 8267655..1789bff 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -863,14 +863,10 @@ static int kvm_put_msrs(CPUState *env, int level)
   if (env->mcg_cap) {
   int i;
 
  -if (level == KVM_PUT_RESET_STATE) {
  -kvm_msr_entry_set(msrs[n++], MSR_MCG_STATUS, env->mcg_status);
  -} else if (level == KVM_PUT_FULL_STATE) {
  -kvm_msr_entry_set(msrs[n++], MSR_MCG_STATUS, env->mcg_status);
  -kvm_msr_entry_set(msrs[n++], MSR_MCG_CTL, env->mcg_ctl);
  -for (i = 0; i < (env->mcg_cap & 0xff) * 4; i++) {
  -kvm_msr_entry_set(msrs[n++], MSR_MC0_CTL + i,
  -env->mce_banks[i]);
  -}
  +kvm_msr_entry_set(msrs[n++], MSR_MCG_STATUS, env->mcg_status);
  +kvm_msr_entry_set(msrs[n++], MSR_MCG_CTL, env->mcg_ctl);
  +for (i = 0; i < (env->mcg_cap & 0xff) * 4; i++) {
  +kvm_msr_entry_set(msrs[n++], MSR_MC0_CTL + i,
  +env->mce_banks[i]);
   }
  }
  #endif




Re: [RFC 2/2] KVM, MCE, unpoison memory address across reboot

2011-01-04 Thread Huang Ying
On Fri, 2010-12-31 at 17:10 +0800, Jan Kiszka wrote:
 Am 31.12.2010 06:22, Huang Ying wrote:
  In Linux kernel HWPoison processing implementation, the virtual
  address in processes mapping the error physical memory page is marked
  as HWPoison.  So that, the further accessing to the virtual
  address will kill corresponding processes with SIGBUS.
  
  If the error physical memory page is used by a KVM guest, the SIGBUS
  will be sent to QEMU, and QEMU will simulate a MCE to report that
  memory error to the guest OS.  If the guest OS can not recover from
  the error (for example, the page is accessed by kernel code), guest OS
  will reboot the system.  But because the underlying host virtual
  address backing the guest physical memory is still poisoned, if the
  guest system accesses the corresponding guest physical memory even
  after rebooting, the SIGBUS will still be sent to QEMU and MCE will be
  simulated.  That is, guest system can not recover via rebooting.
  
  In fact, across rebooting, the contents of guest physical memory page
  need not to be kept.  We can allocate a new host physical page to
  back the corresponding guest physical address.
  
  This patch fixes this issue in QEMU-KVM via calling qemu_ram_remap()
  to clear the corresponding page table entry, so that make it possible
  to allocate a new page to recover the issue.
  
  Signed-off-by: Huang Ying ying.hu...@intel.com
  ---
   kvm.h |2 ++
   qemu-kvm.c|   37 +
 
 What's missing in upstream to make this a uq/master patch? We are still
 piling up features and fixes in qemu-kvm* that should better target
 upstream directly. That's work needlessly done twice.

OK, I will do that. Is basing it on uq/master sufficient to make it an
upstream patch?

 Is this infrastructure really arch-independent? Will there be other
 users besides x86? If not, better keep it in target-i386/kvm.c.

No.  It is used only in x86.  I will move it into target-i386/kvm.c.

Best Regards,
Huang Ying




[RFC 1/2] Add qemu_ram_remap

2010-12-30 Thread Huang Ying
qemu_ram_remap() unmaps the specified RAM pages and then maps them
again.  This is used by the KVM HWPoison support to clear hwpoisoned
page table entries across guest reboot, so that a new page may be
allocated later to recover from the memory error.

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 cpu-all.h|4 +++
 cpu-common.h |1 
 exec.c   |   61 ++-
 3 files changed, 65 insertions(+), 1 deletion(-)

--- a/cpu-all.h
+++ b/cpu-all.h
@@ -861,10 +861,14 @@ target_phys_addr_t cpu_get_phys_page_deb
 extern int phys_ram_fd;
 extern ram_addr_t ram_size;
 
+/* RAM is pre-allocated and passed into qemu_ram_alloc_from_ptr */
+#define RAM_PREALLOC_MASK  (1  0)
+
 typedef struct RAMBlock {
 uint8_t *host;
 ram_addr_t offset;
 ram_addr_t length;
+uint32_t flags;
 char idstr[256];
 QLIST_ENTRY(RAMBlock) next;
 #if defined(__linux__)  !defined(TARGET_S390X)
--- a/exec.c
+++ b/exec.c
@@ -2845,6 +2845,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(Devic
 
 if (host) {
 new_block-host = host;
+new_block-flags |= RAM_PREALLOC_MASK;
 } else {
 if (mem_path) {
 #if defined (__linux__)  !defined(TARGET_S390X)
@@ -2911,7 +2912,9 @@ void qemu_ram_free(ram_addr_t addr)
 QLIST_FOREACH(block, ram_list.blocks, next) {
 if (addr == block-offset) {
 QLIST_REMOVE(block, next);
-if (mem_path) {
+if (block-flags  RAM_PREALLOC_MASK)
+;
+else if (mem_path) {
 #if defined (__linux__)  !defined(TARGET_S390X)
 if (block-fd) {
 munmap(block-host, block-length);
@@ -2934,6 +2937,62 @@ void qemu_ram_free(ram_addr_t addr)
 
 }
 
+void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
+{
+    RAMBlock *block;
+    ram_addr_t offset;
+    int flags;
+    void *area, *vaddr;
+
+    QLIST_FOREACH(block, &ram_list.blocks, next) {
+        offset = addr - block->offset;
+        if (offset < block->length) {
+            vaddr = block->host + offset;
+            if (block->flags & RAM_PREALLOC_MASK)
+                ;
+            else {
+                flags = MAP_FIXED;
+                munmap(vaddr, length);
+                if (mem_path) {
+#if defined (__linux__) && !defined(TARGET_S390X)
+                    if (block->fd) {
+#ifdef MAP_POPULATE
+                        flags |= mem_prealloc ? MAP_POPULATE | MAP_SHARED :
+                            MAP_PRIVATE;
+#else
+                        flags |= MAP_PRIVATE;
+#endif
+                        area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
+                                    flags, block->fd, offset);
+                    } else {
+                        flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+                        area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
+                                    flags, -1, 0);
+                    }
+#endif
+                } else {
+#if defined(TARGET_S390X) && defined(CONFIG_KVM)
+                    flags |= MAP_SHARED | MAP_ANONYMOUS;
+                    area = mmap(vaddr, length, PROT_EXEC|PROT_READ|PROT_WRITE,
+                                flags, -1, 0);
+#else
+                    flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+                    area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
+                                flags, -1, 0);
+#endif
+                }
+                if (area != vaddr) {
+                    fprintf(stderr, "Could not remap addr: %...@%lx\n",
+                            length, addr);
+                    exit(1);
+                }
+                qemu_madvise(vaddr, length, QEMU_MADV_MERGEABLE);
+            }
+            return;
+        }
+    }
+}
+
 /* Return a host pointer to ram allocated with qemu_ram_alloc.
With the exception of the softmmu code in this file, this should
only be used for local memory (e.g. video ram) that the device owns,
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -45,6 +45,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(Devic
 ram_addr_t qemu_ram_alloc(DeviceState *dev, const char *name, ram_addr_t size);
 void qemu_ram_unmap(ram_addr_t addr);
 void qemu_ram_free(ram_addr_t addr);
+void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
 /* This should only be used for ram local to a device.  */
 void *qemu_get_ram_ptr(ram_addr_t addr);
 /* This should not be used by devices.  */




[RFC 2/2] KVM, MCE, unpoison memory address across reboot

2010-12-30 Thread Huang Ying
In Linux kernel HWPoison processing implementation, the virtual
address in processes mapping the error physical memory page is marked
as HWPoison.  As a result, further accesses to that virtual
address will kill the corresponding processes with SIGBUS.

If the error physical memory page is used by a KVM guest, the SIGBUS
will be sent to QEMU, and QEMU will simulate a MCE to report that
memory error to the guest OS.  If the guest OS can not recover from
the error (for example, the page is accessed by kernel code), guest OS
will reboot the system.  But because the underlying host virtual
address backing the guest physical memory is still poisoned, if the
guest system accesses the corresponding guest physical memory even
after rebooting, the SIGBUS will still be sent to QEMU and MCE will be
simulated.  That is, guest system can not recover via rebooting.

In fact, across a reboot the contents of the guest physical memory page
need not be kept.  We can allocate a new host physical page to
back the corresponding guest physical address.

This patch fixes this issue in QEMU-KVM by calling qemu_ram_remap()
to clear the corresponding page table entry, making it possible
to allocate a new page and recover from the error.

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 kvm.h |2 ++
 qemu-kvm.c|   37 +
 target-i386/kvm.c |2 ++
 3 files changed, 41 insertions(+)

--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1803,6 +1803,7 @@ int kvm_on_sigbus_vcpu(CPUState *env, in
 hardware_memory_error();
 }
 }
+kvm_hwpoison_page_add(ram_addr);
 mce.addr = paddr;
 r = kvm_set_mce(env, mce);
 if (r  0) {
@@ -1841,6 +1842,7 @@ int kvm_on_sigbus(int code, void *addr)
 QEMU itself instead of guest system!: %p\n, addr);
 return 0;
 }
+kvm_hwpoison_page_add(ram_addr);
 status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
 | MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
 | 0xc0;
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -1619,6 +1619,42 @@ int kvm_arch_init_irq_routing(void)
 }
 #endif
 
+struct HWPoisonPage;
+typedef struct HWPoisonPage HWPoisonPage;
+struct HWPoisonPage
+{
+ram_addr_t ram_addr;
+QLIST_ENTRY(HWPoisonPage) list;
+};
+
+static QLIST_HEAD(hwpoison_page_list, HWPoisonPage) hwpoison_page_list =
+QLIST_HEAD_INITIALIZER(hwpoison_page_list);
+
+static void kvm_unpoison_all(void *param)
+{
+    HWPoisonPage *page, *next_page;
+
+    QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
+        QLIST_REMOVE(page, list);
+        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+        qemu_free(page);
+    }
+}
+
+void kvm_hwpoison_page_add(ram_addr_t ram_addr)
+{
+    HWPoisonPage *page;
+
+    QLIST_FOREACH(page, &hwpoison_page_list, list) {
+        if (page->ram_addr == ram_addr)
+            return;
+    }
+
+    page = qemu_malloc(sizeof(HWPoisonPage));
+    page->ram_addr = ram_addr;
+    QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
+}
+
 extern int no_hpet;
 
 static int kvm_create_context(void)
@@ -1703,6 +1739,7 @@ static int kvm_create_context(void)
 }
 #endif
 }
+qemu_register_reset(kvm_unpoison_all, NULL);
 
 return 0;
 }
--- a/kvm.h
+++ b/kvm.h
@@ -221,4 +221,6 @@ int kvm_irqchip_in_kernel(void);
 
 int kvm_set_irq(int irq, int level, int *status);
 
+void kvm_hwpoison_page_add(ram_addr_t ram_addr);
+
 #endif




Re: QEMU, MCE, unpoison memory address across reboot

2010-12-28 Thread Huang Ying
On Tue, 2010-12-28 at 16:27 +0800, Gleb Natapov wrote:
 On Mon, Dec 27, 2010 at 07:27:54PM -0200, Marcelo Tosatti wrote:
  On Sun, Dec 26, 2010 at 02:27:26PM +0200, Avi Kivity wrote:
 +static void kvm_unpoison_all(void *param)
 +{
 +HWPoisonPage *page, *next_page;
 +unsigned long address;
 +KVMState *s = param;
 +
 +QLIST_FOREACH_SAFE(page,hwpoison_page_list, list, next_page) {
 +address = (unsigned long)page-vaddr;
 +QLIST_REMOVE(page, list);
 +kvm_vm_ioctl(s, KVM_UNPOISON_ADDRESS, address);
 +qemu_free(page);
 +}
 +}
   
   Can't you free and reallocate all guest memory instead, on reboot, if
   there's a hwpoisoned page? Then you don't need this interface.
   
   
   Alternatively, MADV_DONTNEED?  We already use it for ballooning.
  
  Does not work for hugetlbfs.
  
 Don't we break huge page to 4k pages during poisoning?

Yes.  That has not been implemented yet.  So in fact, we can not deal
with hwpoison+hugetlb in kvm now.

Best Regards,
Huang Ying




Re: QEMU, MCE, unpoison memory address across reboot

2010-12-27 Thread Huang Ying
On Tue, 2010-12-28 at 05:27 +0800, Marcelo Tosatti wrote:
 On Sun, Dec 26, 2010 at 02:27:26PM +0200, Avi Kivity wrote:
+static void kvm_unpoison_all(void *param)
+{
+HWPoisonPage *page, *next_page;
+unsigned long address;
+KVMState *s = param;
+
+QLIST_FOREACH_SAFE(page,hwpoison_page_list, list, next_page) {
+address = (unsigned long)page-vaddr;
+QLIST_REMOVE(page, list);
+kvm_vm_ioctl(s, KVM_UNPOISON_ADDRESS, address);
+qemu_free(page);
+}
+}
  
  Can't you free and reallocate all guest memory instead, on reboot, if
  there's a hwpoisoned page? Then you don't need this interface.
  
  
  Alternatively, MADV_DONTNEED?  We already use it for ballooning.
 
 Does not work for hugetlbfs.

Yes.  And I think zapping the page range is just an implementation
detail; the semantics are those of MADV_DONTNEED.

On the other hand, isn't qemu_vmalloc implemented via posix_memalign on
Linux?  If it is, we cannot guarantee that the corresponding page table
entries are zapped after a qemu_vfree/qemu_vmalloc cycle; that is a glibc
implementation detail.
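
For reference, the MADV_DONTNEED variant under discussion would look
roughly like the sketch below (the helper name is made up). Whether this
reliably drops a hwpoisoned PTE is exactly the open question, and it
certainly does not help the hugetlbfs case:

    #include <sys/mman.h>
    #include <stdio.h>

    /* Sketch only: discard the poisoned guest page explicitly instead of
     * relying on what glibc does inside qemu_vfree()/qemu_vmalloc(). */
    static int discard_guest_page(void *hva, size_t page_size)
    {
        if (madvise(hva, page_size, MADV_DONTNEED) < 0) {
            perror("madvise(MADV_DONTNEED)");
            return -1;
        }
        /* On an anonymous private mapping the next access faults in a
         * fresh zero page -- assuming the kernel dropped the poison entry. */
        return 0;
    }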

Best Regards,
Huang Ying




Re: QEMU, MCE, unpoison memory address across reboot

2010-12-23 Thread Huang Ying
Hi, Andi,

On Fri, 2010-12-24 at 00:57 +0800, Andi Kleen wrote:
  Can't you free and reallocate all guest memory instead, on reboot, if
  there's a hwpoisoned page? Then you don't need this interface.
 
 I think that would be more efficient. You can potentially save a lot
 of memory if the new guest doesn't need as much as the old one.

So you suggest freeing/reallocating all guest memory on every reboot,
regardless of whether there is a hwpoisoned page?

Best Regards,
Huang Ying




Re: QEMU, MCE, unpoison memory address across reboot

2010-12-23 Thread Huang Ying
On Thu, 2010-12-23 at 22:28 +0800, Marcelo Tosatti wrote:
 Can't you free and reallocate all guest memory instead, on reboot, if
 there's a hwpoisoned page? Then you don't need this interface.

I have considered this method.  It seems that some guest RAM is not
allocated in qemu_ram_alloc_from_ptr(); that is, the host parameter is
allocated elsewhere and passed in.  I found two cases:

- assigned_dev_register_regions() in hw/device-assignment.c
- create_shared_memory_BAR() and ivshmem_read() in hw/ivshmem.c

There is no general way to reallocate this memory so far.  We may need a
flag in struct RAMBlock to track such memory and ignore it during
reallocation.  But if there are hwpoisoned pages in this memory, we cannot
recover.  Do you think that is acceptable?
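
Roughly what I have in mind for the flag (a sketch only; RAM_PREALLOC_MASK
would be the new flag bit in struct RAMBlock):

    /* Sketch only: skip blocks whose host memory QEMU did not allocate
     * itself when freeing/reallocating guest RAM across a reboot. */
    QLIST_FOREACH(block, &ram_list.blocks, next) {
        if (block->flags & RAM_PREALLOC_MASK) {
            continue;   /* host-provided memory, cannot be reallocated */
        }
        /* ... free and reallocate block->host here ... */
    }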

Best Regards,
Huang Ying




Re: [RFC 1/3] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally

2010-12-22 Thread Huang Ying
Hi, Andrew,

Thanks for comments.

On Thu, 2010-12-23 at 07:07 +0800, Andrew Morton wrote:
 On Wed, 22 Dec 2010 10:51:55 +0800
 Huang Ying ying.hu...@intel.com wrote:
 
  Make __get_user_pages return -EHWPOISON for HWPOISON page only if
  FOLL_HWPOISON is specified.  With this patch, the interested callers
  can distinguish HWPOISON page from general FAULT page, while other
  callers will still get -EFAULT for pages, so the user space interface
  need not to be changed.
  
  get_user_pages_hwpoison is added as a variant of get_user_pages that
  can return -EHWPOISON for HWPOISON page.
  
  This feature is needed by KVM, where UCR MCE should be relayed to
  guest for HWPOISON page, while instruction emulation and MMIO will be
  tried for general FAULT page.
  
  The idea comes from Andrew Morton.
 
 hm, I don't remember that.  I suspect it came from someone else?

That was long ago.  I used another solution at that time, but now I
think yours is better.

http://marc.info/?l=linux-kernelm=127241995131034w=2

  Signed-off-by: Huang Ying ying.hu...@intel.com
  Cc: Andrew Morton a...@linux-foundation.org
  ---
   include/asm-generic/errno.h |2 +
   include/linux/mm.h  |5 
   mm/memory.c |   53 
  +---
   3 files changed, 57 insertions(+), 3 deletions(-)
  
  --- a/include/asm-generic/errno.h
  +++ b/include/asm-generic/errno.h
  @@ -108,4 +108,6 @@
   
   #define ERFKILL132 /* Operation not possible due to 
  RF-kill */
   
  +#define EHWPOISON  133 /* Memory page has hardware error */
 
 So this can be propagated to userspace?
 
 If so, which syscalls might return EHWPOISON and are there any manpage
 implications?

No.  -EHWPOISON will only be returned if FOLL_HWPOISON is specified in
the gup_flags parameter of __get_user_pages.
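
To illustrate, a sketch of the kind of caller I have in mind (the helper
name below is made up; patch 2/3 changes hva_to_pfn() to use it for real):

    /* Sketch only: a caller that opts in with FOLL_HWPOISON (through
     * get_user_pages_hwpoison) sees -EHWPOISON; everyone else still
     * gets -EFAULT. */
    static int map_guest_page(struct mm_struct *mm, unsigned long hva,
                              struct page **pages)
    {
        int npages;

        down_read(&mm->mmap_sem);
        npages = get_user_pages_hwpoison(current, mm, hva, 1, 1, 0,
                                         pages, NULL);
        up_read(&mm->mmap_sem);

        if (npages == -EHWPOISON)
            return -EHWPOISON;  /* relay an SRAR MCE to the guest */
        if (npages != 1)
            return -EFAULT;     /* fall back to emulation/MMIO */
        return 0;
    }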

   #endif
  --- a/include/linux/mm.h
  +++ b/include/linux/mm.h
  @@ -860,6 +860,10 @@ int get_user_pages(struct task_struct *t
  struct page **pages, struct vm_area_struct **vmas);
   int get_user_pages_fast(unsigned long start, int nr_pages, int write,
  struct page **pages);
  +int get_user_pages_hwpoison(struct task_struct *tsk, struct mm_struct *mm,
  +   unsigned long start, int nr_pages, int write,
  +   int force, struct page **pages,
  +   struct vm_area_struct **vmas);
   struct page *get_dump_page(unsigned long addr);
   
   extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
  @@ -1415,6 +1419,7 @@ struct page *follow_page(struct vm_area_
   #define FOLL_GET   0x04/* do get_page on page */
   #define FOLL_DUMP  0x08/* give error on hole if it would be zero */
   #define FOLL_FORCE 0x10/* get_user_pages read/write w/o permission */
  +#define FOLL_HWPOISON  0x20/* check page is hwpoisoned */
   
   typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
  void *data);
  --- a/mm/memory.c
  +++ b/mm/memory.c
  @@ -1449,9 +1449,16 @@ int __get_user_pages(struct task_struct
  if (ret  VM_FAULT_ERROR) {
  if (ret  VM_FAULT_OOM)
  return i ? i : -ENOMEM;
  -   if (ret 
  -   
  (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE|
  -VM_FAULT_SIGBUS))
  +   if (ret  (VM_FAULT_HWPOISON |
  +  VM_FAULT_HWPOISON_LARGE)) {
  +   if (i)
  +   return i;
  +   else if (gup_flags  
  FOLL_HWPOISON)
  +   return -EHWPOISON;
  +   else
  +   return -EFAULT;
  +   }
  +   if (ret  VM_FAULT_SIGBUS)
 
 hm, that function is getting a bit unweildy.

Yes. Do you think the following code is better?

return i ? i : (gup_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;

   /**
  + * get_user_pages_hwpoison() - pin user pages in memory, return hwpoison 
  status
  + * @tsk:   task_struct of target task
  + * @mm:mm_struct of target mm
  + * @start: starting user address
  + * @nr_pages:  number of pages from start to pin
  + * @write: whether pages will be written to by the caller
  + * @force: whether to force write access even if user mapping is
  + * readonly. This will result in the page being COWed even
  + * in MAP_SHARED mappings. You do not want this.
  + * @pages: array that receives pointers to the pages pinned.
  + * Should be at least nr_pages long. Or NULL, if caller
  + * only intends

[RFC 0/3] KVM, HWPoison, unpoison address across rebooting

2010-12-21 Thread Huang Ying
Unpoison the address across rebooting, so that a new memory page can be
allocated and the guest system can successfully reboot.

[RFC 1/3] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page 
optionally
[RFC 2/3] KVM, Replace is_hwpoison_address with get_user_pages_hwpoison
[RFC 3/3] KVM, HWPoison, unpoison address across rebooting


[RFC 1/3] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally

2010-12-21 Thread Huang Ying
Make __get_user_pages return -EHWPOISON for HWPOISON page only if
FOLL_HWPOISON is specified.  With this patch, the interested callers
can distinguish a HWPOISON page from a general FAULT page, while other
callers will still get -EFAULT for such pages, so the user space interface
need not be changed.

get_user_pages_hwpoison is added as a variant of get_user_pages that
can return -EHWPOISON for HWPOISON page.

This feature is needed by KVM, where UCR MCE should be relayed to
guest for HWPOISON page, while instruction emulation and MMIO will be
tried for general FAULT page.

The idea comes from Andrew Morton.

Signed-off-by: Huang Ying ying.hu...@intel.com
Cc: Andrew Morton a...@linux-foundation.org
---
 include/asm-generic/errno.h |2 +
 include/linux/mm.h  |5 
 mm/memory.c |   53 +---
 3 files changed, 57 insertions(+), 3 deletions(-)

--- a/include/asm-generic/errno.h
+++ b/include/asm-generic/errno.h
@@ -108,4 +108,6 @@
 
 #define ERFKILL132 /* Operation not possible due to 
RF-kill */
 
+#define EHWPOISON  133 /* Memory page has hardware error */
+
 #endif
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -860,6 +860,10 @@ int get_user_pages(struct task_struct *t
struct page **pages, struct vm_area_struct **vmas);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
+int get_user_pages_hwpoison(struct task_struct *tsk, struct mm_struct *mm,
+   unsigned long start, int nr_pages, int write,
+   int force, struct page **pages,
+   struct vm_area_struct **vmas);
 struct page *get_dump_page(unsigned long addr);
 
 extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
@@ -1415,6 +1419,7 @@ struct page *follow_page(struct vm_area_
 #define FOLL_GET   0x04/* do get_page on page */
 #define FOLL_DUMP  0x08/* give error on hole if it would be zero */
 #define FOLL_FORCE 0x10/* get_user_pages read/write w/o permission */
+#define FOLL_HWPOISON  0x20/* check page is hwpoisoned */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1449,9 +1449,16 @@ int __get_user_pages(struct task_struct
                        if (ret & VM_FAULT_ERROR) {
                                if (ret & VM_FAULT_OOM)
                                        return i ? i : -ENOMEM;
-                               if (ret &
-                                   (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE|
-                                    VM_FAULT_SIGBUS))
+                               if (ret & (VM_FAULT_HWPOISON |
+                                          VM_FAULT_HWPOISON_LARGE)) {
+                                       if (i)
+                                               return i;
+                                       else if (gup_flags & FOLL_HWPOISON)
+                                               return -EHWPOISON;
+                                       else
+                                               return -EFAULT;
+                               }
+                               if (ret & VM_FAULT_SIGBUS)
                                        return i ? i : -EFAULT;
                                BUG();
                        }
@@ -1564,6 +1571,46 @@ int get_user_pages(struct task_struct *t
 EXPORT_SYMBOL(get_user_pages);
 
 /**
+ * get_user_pages_hwpoison() - pin user pages in memory, return hwpoison status
+ * @tsk:   task_struct of target task
+ * @mm:mm_struct of target mm
+ * @start: starting user address
+ * @nr_pages:  number of pages from start to pin
+ * @write: whether pages will be written to by the caller
+ * @force: whether to force write access even if user mapping is
+ * readonly. This will result in the page being COWed even
+ * in MAP_SHARED mappings. You do not want this.
+ * @pages: array that receives pointers to the pages pinned.
+ * Should be at least nr_pages long. Or NULL, if caller
+ * only intends to ensure the pages are faulted in.
+ * @vmas:  array of pointers to vmas corresponding to each page.
+ * Or NULL if the caller does not require them.
+ *
+ * Returns number of pages pinned.
+ *
+ * If the page table or memory page is hwpoisoned, return -EHWPOISON.
+ *
+ * Otherwise, same as get_user_pages.
+ */
+int get_user_pages_hwpoison(struct task_struct *tsk, struct mm_struct *mm,
+   unsigned long start, int nr_pages, int write,
+   int force, struct page **pages,
+   struct

[RFC 3/3] KVM, HWPoison, unpoison address across rebooting

2010-12-21 Thread Huang Ying
In HWPoison processing code, not only the struct page corresponding
the error physical memory page is marked as HWPoison, but also the
virtual address in processes mapping the error physical memory page is
marked as HWPoison.  As a result, further accesses to that virtual
address will kill the corresponding processes with SIGBUS.

If the error physical memory page is used by a KVM guest, the SIGBUS
will be sent to QEMU, and QEMU will simulate a MCE to report that
memory error to the guest OS.  If the guest OS can not recover from
the error (for example, the page is accessed by kernel code), guest OS
will reboot the system.  But because the underlying host virtual
address backing the guest physical memory is still poisoned, if the
guest system accesses the corresponding guest physical memory even
after rebooting, the SIGBUS will still be sent to QEMU and MCE will be
simulated.  That is, guest system can not recover via rebooting.

In fact, across a reboot the contents of the guest physical memory page
need not be kept.  We can allocate a new host physical page to
back the corresponding guest physical address.

To do that, a mechanism is added to KVM to unpoison a poisoned virtual
address by clearing the corresponding PTE.  With it, when rebooting,
QEMU can unpoison the poisoned virtual addresses, and when an unpoisoned
memory page is accessed again, a new physical page can be allocated if
possible.

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 include/linux/kvm.h |1 +
 include/linux/mm.h  |8 
 mm/memory-failure.c |   39 +++
 virt/kvm/kvm_main.c |   14 ++
 4 files changed, 62 insertions(+)

--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -676,6 +676,7 @@ struct kvm_clock_data {
 #define KVM_SET_PIT2  _IOW(KVMIO,  0xa0, struct kvm_pit_state2)
 /* Available with KVM_CAP_PPC_GET_PVINFO */
 #define KVM_PPC_GET_PVINFO   _IOW(KVMIO,  0xa1, struct kvm_ppc_pvinfo)
+#define KVM_UNPOISON_ADDRESS _IO(KVMIO,  0xa2)
 
 /*
  * ioctls for vcpu fds
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1512,6 +1512,14 @@ extern int sysctl_memory_failure_recover
 extern void shake_page(struct page *p, int access);
 extern atomic_long_t mce_bad_pages;
 extern int soft_offline_page(struct page *page, int flags);
+#ifdef CONFIG_MEMORY_FAILURE
+int unpoison_address(unsigned long addr);
+#else
+static inline int unpoison_address(unsigned long addr)
+{
+   return -EINVAL;
+}
+#endif
 
 extern void dump_page(struct page *page);
 
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1433,3 +1433,42 @@ done:
/* keep elevated page count for bad page */
return ret;
 }
+
+int unpoison_address(unsigned long addr)
+{
+   struct mm_struct *mm;
+   pgd_t *pgdp;
+   pud_t pud, *pudp;
+   pmd_t pmd, *pmdp;
+   pte_t pte, *ptep;
+   spinlock_t *ptl;
+   swp_entry_t entry;
+   int rc;
+
+   mm = current->mm;
+   pgdp = pgd_offset(mm, addr);
+   if (!pgd_present(*pgdp))
+   return -EINVAL;
+   pudp = pud_offset(pgdp, addr);
+   pud = *pudp;
+   if (!pud_present(pud) || pud_large(pud))
+   return -EINVAL;
+   pmdp = pmd_offset(pudp, addr);
+   pmd = *pmdp;
+   /* can not unpoison huge page yet */
+   if (!pmd_present(pmd) || pmd_large(pmd))
+   return -EINVAL;
+   ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+   pte = *ptep;
+   rc = -EINVAL;
+   if (!is_swap_pte(pte))
+   goto out;
+   entry = pte_to_swp_entry(pte);
+   if (!is_hwpoison_entry(entry))
+   goto out;
+   pte_clear(mm, addr, ptep);
+out:
+   pte_unmap_unlock(ptep, ptl);
+   return rc;
+}
+EXPORT_SYMBOL_GPL(unpoison_address);
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -774,6 +774,17 @@ int kvm_vm_ioctl_set_memory_region(struc
return kvm_set_memory_region(kvm, mem, user_alloc);
 }
 
+static int kvm_unpoison_address(struct kvm *kvm, unsigned long address)
+{
+   int r;
+
+   down_read(&current->mm->mmap_sem);
+   r = unpoison_address(address);
+   up_read(&current->mm->mmap_sem);
+
+   return r;
+}
+
 int kvm_get_dirty_log(struct kvm *kvm,
struct kvm_dirty_log *log, int *is_dirty)
 {
@@ -1728,6 +1739,9 @@ static long kvm_vm_ioctl(struct file *fi
mutex_unlock(&kvm->lock);
break;
 #endif
+   case KVM_UNPOISON_ADDRESS:
+   r = kvm_unpoison_address(kvm, arg);
+   break;
default:
r = kvm_arch_vm_ioctl(filp, ioctl, arg);
if (r == -ENOTTY)


QEMU, MCE, unpoison memory address across reboot

2010-12-21 Thread Huang Ying
In Linux kernel HWPoison processing implementation, the virtual
address in processes mapping the error physical memory page is marked
as HWPoison.  As a result, further accesses to that virtual
address will kill the corresponding processes with SIGBUS.

If the error physical memory page is used by a KVM guest, the SIGBUS
will be sent to QEMU, and QEMU will simulate a MCE to report that
memory error to the guest OS.  If the guest OS can not recover from
the error (for example, the page is accessed by kernel code), guest OS
will reboot the system.  But because the underlying host virtual
address backing the guest physical memory is still poisoned, if the
guest system accesses the corresponding guest physical memory even
after rebooting, the SIGBUS will still be sent to QEMU and MCE will be
simulated.  That is, guest system can not recover via rebooting.

In fact, across a reboot the contents of the guest physical memory page
need not be kept.  We can allocate a new host physical page to
back the corresponding guest physical address.

This patch fixes this issue in QEMU-KVM by invoking the unpoison
mechanism implemented in the Linux kernel to clear the corresponding page
table entry, making it possible to allocate a new page and recover
from the error.

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 kvm.h   |2 ++
 kvm/include/linux/kvm.h |2 ++
 qemu-kvm.c  |   40 
 target-i386/kvm.c   |2 ++
 4 files changed, 46 insertions(+)

--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1803,6 +1803,7 @@ int kvm_on_sigbus_vcpu(CPUState *env, in
 hardware_memory_error();
 }
 }
+kvm_hwpoison_page_add(vaddr);
 mce.addr = paddr;
 r = kvm_set_mce(env, mce);
 if (r  0) {
@@ -1841,6 +1842,7 @@ int kvm_on_sigbus(int code, void *addr)
 QEMU itself instead of guest system!: %p\n, addr);
 return 0;
 }
+kvm_hwpoison_page_add(vaddr);
 status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
 | MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
 | 0xc0;
--- a/kvm/include/linux/kvm.h
+++ b/kvm/include/linux/kvm.h
@@ -663,6 +663,8 @@ struct kvm_clock_data {
 /* Available with KVM_CAP_PIT_STATE2 */
 #define KVM_GET_PIT2  _IOR(KVMIO,  0x9f, struct kvm_pit_state2)
 #define KVM_SET_PIT2  _IOW(KVMIO,  0xa0, struct kvm_pit_state2)
+#define KVM_PPC_GET_PVINFO_IOW(KVMIO,  0xa1, struct kvm_ppc_pvinfo)
+#define KVM_UNPOISON_ADDRESS  _IO(KVMIO,  0xa2)
 
 /*
  * ioctls for vcpu fds
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -1619,6 +1619,45 @@ int kvm_arch_init_irq_routing(void)
 }
 #endif
 
+struct HWPoisonPage;
+typedef struct HWPoisonPage HWPoisonPage;
+struct HWPoisonPage
+{
+void *vaddr;
+QLIST_ENTRY(HWPoisonPage) list;
+};
+
+static QLIST_HEAD(hwpoison_page_list, HWPoisonPage) hwpoison_page_list =
+QLIST_HEAD_INITIALIZER(hwpoison_page_list);
+
+static void kvm_unpoison_all(void *param)
+{
+    HWPoisonPage *page, *next_page;
+    unsigned long address;
+    KVMState *s = param;
+
+    QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
+        address = (unsigned long)page->vaddr;
+        QLIST_REMOVE(page, list);
+        kvm_vm_ioctl(s, KVM_UNPOISON_ADDRESS, address);
+        qemu_free(page);
+    }
+}
+
+void kvm_hwpoison_page_add(void *vaddr)
+{
+    HWPoisonPage *page;
+
+    QLIST_FOREACH(page, &hwpoison_page_list, list) {
+        if (page->vaddr == vaddr)
+            return;
+    }
+
+    page = qemu_malloc(sizeof(HWPoisonPage));
+    page->vaddr = vaddr;
+    QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
+}
+
 extern int no_hpet;
 
 static int kvm_create_context(void)
@@ -1703,6 +1742,7 @@ static int kvm_create_context(void)
 }
 #endif
 }
+qemu_register_reset(kvm_unpoison_all, kvm_state);
 
 return 0;
 }
--- a/kvm.h
+++ b/kvm.h
@@ -221,4 +221,6 @@ int kvm_irqchip_in_kernel(void);
 
 int kvm_set_irq(int irq, int level, int *status);
 
+void kvm_hwpoison_page_add(void *vaddr);
+
 #endif




[RFC 2/3] KVM, Replace is_hwpoison_address with get_user_pages_hwpoison

2010-12-21 Thread Huang Ying
is_hwpoison_address only checks whether the page table entry is
hwpoisoned, regardless of the memory page mapped, while
get_user_pages_hwpoison checks both.

In a following patch, we will introduce unpoison_address, which will
clear the poisoned page table entry to make it possible to allocate a
new memory page for the virtual address.  But it is also possible that
the underlying memory page is kept poisoned even after the
corresponding page table entry is cleared, that is, a new memory page
can not be allocated.  get_user_pages_hwpoison can catch these
situations.

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 include/linux/mm.h  |8 
 mm/memory-failure.c |   32 
 virt/kvm/kvm_main.c |4 +++-
 3 files changed, 3 insertions(+), 41 deletions(-)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1512,14 +1512,6 @@ extern int sysctl_memory_failure_recover
 extern void shake_page(struct page *p, int access);
 extern atomic_long_t mce_bad_pages;
 extern int soft_offline_page(struct page *page, int flags);
-#ifdef CONFIG_MEMORY_FAILURE
-int is_hwpoison_address(unsigned long addr);
-#else
-static inline int is_hwpoison_address(unsigned long addr)
-{
-   return 0;
-}
-#endif
 
 extern void dump_page(struct page *page);
 
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1433,35 +1433,3 @@ done:
/* keep elevated page count for bad page */
return ret;
 }
-
-/*
- * The caller must hold current-mm-mmap_sem in read mode.
- */
-int is_hwpoison_address(unsigned long addr)
-{
-   pgd_t *pgdp;
-   pud_t pud, *pudp;
-   pmd_t pmd, *pmdp;
-   pte_t pte, *ptep;
-   swp_entry_t entry;
-
-   pgdp = pgd_offset(current-mm, addr);
-   if (!pgd_present(*pgdp))
-   return 0;
-   pudp = pud_offset(pgdp, addr);
-   pud = *pudp;
-   if (!pud_present(pud) || pud_large(pud))
-   return 0;
-   pmdp = pmd_offset(pudp, addr);
-   pmd = *pmdp;
-   if (!pmd_present(pmd) || pmd_large(pmd))
-   return 0;
-   ptep = pte_offset_map(pmdp, addr);
-   pte = *ptep;
-   pte_unmap(ptep);
-   if (!is_swap_pte(pte))
-   return 0;
-   entry = pte_to_swp_entry(pte);
-   return is_hwpoison_entry(entry);
-}
-EXPORT_SYMBOL_GPL(is_hwpoison_address);
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -966,7 +966,9 @@ static pfn_t hva_to_pfn(struct kvm *kvm,
goto return_fault_page;
 
        down_read(&current->mm->mmap_sem);
-       if (is_hwpoison_address(addr)) {
+       npages = get_user_pages_hwpoison(current, current->mm,
+                                        addr, 1, 1, 0, page, NULL);
+       if (npages == -EHWPOISON) {
                up_read(&current->mm->mmap_sem);
get_page(hwpoison_page);
return page_to_pfn(hwpoison_page);


Re: [PATCH 3/3] Check processor version for broadcast.

2010-11-30 Thread Huang Ying
Hi, Dongming,

On Tue, 2010-11-30 at 16:15 +0800, Jin Dongming wrote:
 Broadcast MCA signal is not supported by the CPUs whose version is
 below 06H_EH.
 
 Signed-off-by: Jin Dongming jin.dongm...@np.css.fujitsu.com
 ---
  target-i386/helper.c |   13 +
  1 files changed, 13 insertions(+), 0 deletions(-)
 
 diff --git a/target-i386/helper.c b/target-i386/helper.c
 index 7e07ebd..437290b 100644
 --- a/target-i386/helper.c
 +++ b/target-i386/helper.c
 @@ -1077,10 +1077,23 @@ void cpu_inject_x86_mce(CPUState *cenv, int bank, uint64_t status,
      unsigned bank_num = cenv->mcg_cap & 0xff;
      CPUState *env;
      int flag = 0;
 +    int family, model, cpuver = first_cpu->cpuid_version;
  
      if (bank >= bank_num || !(status & MCI_STATUS_VAL))
          return;
  
 +    if (broadcast) {
 +        family = (cpuver >> 8) & 0x0f;
 +        model = ((cpuver >> 12) & 0xf0) + ((cpuver >> 4) & 0x0f);
 +
 +        if ((family == 6 && model >= 14) || family > 6)
 +            broadcast = 1;
 +        else {
 +            fprintf(stderr, "Current CPU does not support broadcast\n");
 +            return;
 +        }
 +    }
 +
      if (kvm_enabled()) {
          if (broadcast)
              flag |= 0x02; /* bit 1: 1(broadcast); 0(not broadcast) */

Why not wrap this into a function? I think it may be used by other
functions too.
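
For example, something along these lines (a sketch only; the function name
is just an illustration, pick whatever fits helper.c):

    /* Sketch only: factor the family/model check out of
     * cpu_inject_x86_mce() so other MCE paths can reuse it. */
    static int cpu_x86_support_mce_broadcast(CPUState *env)
    {
        int family = (env->cpuid_version >> 8) & 0x0f;
        int model = ((env->cpuid_version >> 12) & 0xf0) +
                    ((env->cpuid_version >> 4) & 0x0f);

        return (family == 6 && model >= 14) || family > 6;
    }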

Best Regards,
Huang Ying




Re: [PATCH] Add function for clearing the requested VCPUs' mce registers.

2010-11-24 Thread Huang Ying
Hi, Dongming,

On Thu, 2010-11-25 at 09:14 +0800, Jin Dongming wrote:
 In some case of mce test, the injected data can be remained
 in the registers and ca affect to the result of following test cases.
 So add codes for clearing mce registers of given VCPU.
 
 mce registers of give VCPU could be cleared from kernel by calling
 the function in this patch.  What need to be paid attention is that
 the status and mcg_status of mce must be set with 0. If not, mce
 registers will not be cleared.

Why do you need this? To use mce=3 in guest Linux MCE testing?

If so, why not use a full reboot? MCE registers are not cleared on a real
machine either, and it is not that painful to reboot the guest.

Best Regards,
Huang Ying




Re: [PATCH 1/2] Add broadcast option for mce command.

2010-11-24 Thread Huang Ying
On Thu, 2010-11-25 at 09:19 +0800, Jin Dongming wrote:
[...]

 --- a/hmp-commands.hx
 +++ b/hmp-commands.hx
 @@ -1053,9 +1053,15 @@ ETEXI
  
  {
  .name   = mce,
 +#if defined(KVM_CAP_MCE)
 +.args_type  = 
 cpu_index:i,bank:i,status:l,mcg_status:l,addr:l,misc:l,broadcast:s?,
 +.params = cpu bank status mcgstatus addr misc [broadcast|b],
 +.help   = inject a MCE on the given CPU [and broadcast to other 
 CPUs],
 +#else
  .args_type  = 
 cpu_index:i,bank:i,status:l,mcg_status:l,addr:l,misc:l,
  .params = cpu bank status mcgstatus addr misc,
  .help   = inject a MCE on the given CPU,
 +#endif

Can broadcast not be used by QEMU-TCG too?

[...]

Best Regards,
Huang Ying




Re: [PATCH] Add function for clearing the requested VCPUs' mce registers.

2010-11-24 Thread Huang Ying
On Thu, 2010-11-25 at 13:30 +0800, Jin Dongming wrote:
 Hi, Huang-san
 
 (2010/11/25 10:27), Huang Ying wrote:
  Hi, Dongming,
  
  On Thu, 2010-11-25 at 09:14 +0800, Jin Dongming wrote:
  In some case of mce test, the injected data can be remained
  in the registers and ca affect to the result of following test cases.
  So add codes for clearing mce registers of given VCPU.
 
  mce registers of give VCPU could be cleared from kernel by calling
  the function in this patch.  What need to be paid attention is that
  the status and mcg_status of mce must be set with 0. If not, mce
  registers will not be cleared.
  
  Why do you need this? To use mce=3 in guest Linux MCE testing?
 When I set fake_panic==1 on Guest OS and tested mce via qemu-monitor,
 I got the wrong result sometimes. The reason why I got the wrong result
 was that mce registers were not cleared. So I made this patch.

This can be done in QEMU via KVM_SET_MSRS too.
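
A rough sketch of what I mean, reusing the existing kvm_put_msrs() helpers
(the wrapper name below is made up):

    /* Sketch only: zero MCG_STATUS and all bank MSRs through KVM_SET_MSRS
     * instead of adding a new kernel-side clear operation. */
    static int kvm_clear_mce_msrs(CPUState *env)
    {
        struct {
            struct kvm_msrs info;
            struct kvm_msr_entry entries[100];
        } msr_data;
        struct kvm_msr_entry *msrs = msr_data.entries;
        int i, n = 0;

        kvm_msr_entry_set(&msrs[n++], MSR_MCG_STATUS, 0);
        for (i = 0; i < (env->mcg_cap & 0xff) * 4; i++) {
            kvm_msr_entry_set(&msrs[n++], MSR_MC0_CTL + i, 0);
        }
        msr_data.info.nmsrs = n;
        return kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data);
    }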
 
  If it is, why not use full reboot? MCE registers are not cleared in real
  machine too. And it is not so pain to reboot the guest.
 Though we know that mce registers could be cleared by rebooting the Guest
 Machine and the reboot of Guest Machine could save much more time than
 the reboot of Host Machine, it is still slower than setting
 fake_panic==1 to mce test.

Best Regards,
Huang Ying




Re: [PATCH 2/2] reset mce registers of the given VCPU or all VCPUs with mce command.

2010-11-24 Thread Huang Ying
On Thu, 2010-11-25 at 09:20 +0800, Jin Dongming wrote:
 --- a/monitor.c
 +++ b/monitor.c
 @@ -60,6 +60,7 @@
  #include trace.h
  #endif
  #include qemu-kvm.h
 +#include kvm_x86.h
  
  //#define DEBUG
  //#define DEBUG_COMPLETION
 @@ -2277,7 +2278,19 @@ static void do_inject_mce(Monitor *mon, const QDict 
 *qdict)
  #endif
  
  for (cenv = first_cpu; cenv != NULL; cenv = cenv->next_cpu) {
 -    if (cenv->cpu_index == cpu_index && cenv->mcg_cap) {
 +    if (!cenv->mcg_cap)
 +        continue;
 +
 +#if defined(KVM_CAP_MCE)
 +    if (!status && !mcg_status) {
 +        if (cenv->cpu_index == cpu_index || broadcast)
 +            kvm_inject_x86_mce(cenv, 0, 0, 0, 0, 0, 0);
 +
 +        continue;
 +    }
 +#endif
 +
 +    if (cenv->cpu_index == cpu_index) {
          cpu_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc);
  #if defined(KVM_CAP_MCE)
          if (broadcast)

Are you sure there are no test cases that require the machine check
registers not to be cleared? I have had a test case before that injects
another MCE while MCIP in MCG_STATUS is set, to check whether the system
goes into reset.

It may be better to make clearing the MC registers an explicit operation.

Best Regards,
Huang Ying




Re: [PATCH -v3] Monitor command: x-gpa2hva, translate guest physical address to host virtual address

2010-11-19 Thread Huang Ying
On Tue, 2010-11-16 at 10:23 +0800, Huang Ying wrote:
 Author: Max Asbock masb...@linux.vnet.ibm.com
 
 Add command x-gpa2hva to translate guest physical address to host
 virtual address. Because gpa to hva translation is not consistent, so
 this command is only used for debugging.
 
 The x-gpa2hva command provides one step in a chain of translations from
 guest virtual to guest physical to host virtual to host physical. Host
 physical is then used to inject a machine check error. As a
 consequence the HWPOISON code on the host and the MCE injection code
 in qemu-kvm are exercised.
 
 v3:
 
 - Rename to x-gpa2hva
 - Remove QMP support, because gpa2hva is not consistent

Is this patch an acceptable solution for now? This command is useful for
our testing.

Best Regards,
Huang Ying




[PATCH -v3] Monitor command: x-gpa2hva, translate guest physical address to host virtual address

2010-11-15 Thread Huang Ying
Author: Max Asbock masb...@linux.vnet.ibm.com

Add command x-gpa2hva to translate guest physical address to host
virtual address. Because the gpa to hva translation is not consistent,
this command is only used for debugging.

The x-gpa2hva command provides one step in a chain of translations from
guest virtual to guest physical to host virtual to host physical. Host
physical is then used to inject a machine check error. As a
consequence the HWPOISON code on the host and the MCE injection code
in qemu-kvm are exercised.

v3:

- Rename to x-gpa2hva
- Remove QMP support, because gpa2hva is not consistent

v2:

- Add QMP support

Signed-off-by: Max Asbock masb...@linux.vnet.ibm.com
Signed-off-by: Jiajia Zheng jiajia.zh...@intel.com
Signed-off-by: Huang Ying ying.hu...@intel.com
---
 hmp-commands.hx |   15 +++
 monitor.c   |   22 ++
 2 files changed, 37 insertions(+)

--- a/monitor.c
+++ b/monitor.c
@@ -2272,6 +2272,28 @@ static void do_inject_mce(Monitor *mon,
 }
 #endif
 
+static void do_gpa2hva_print(Monitor *mon, const QObject *data)
+{
+    QInt *qint;
+
+    qint = qobject_to_qint(data);
+    monitor_printf(mon, "0x%lx\n", (unsigned long)qint->value);
+}
+
+static int do_gpa2hva(Monitor *mon, const QDict *qdict, QObject **ret_data)
+{
+    target_phys_addr_t paddr;
+    target_phys_addr_t size = TARGET_PAGE_SIZE;
+    void *vaddr;
+
+    paddr = qdict_get_int(qdict, "addr");
+    vaddr = cpu_physical_memory_map(paddr, &size, 0);
+    cpu_physical_memory_unmap(vaddr, size, 0, 0);
+    *ret_data = qobject_from_jsonf("%ld", (unsigned long)vaddr);
+
+    return 0;
+}
+
 static int do_getfd(Monitor *mon, const QDict *qdict, QObject **ret_data)
 {
 const char *fdname = qdict_get_str(qdict, fdname);
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -293,6 +293,21 @@ Start gdbserver session (default @var{po
 ETEXI
 
 {
+.name   = x-gpa2hva,
+.args_type  = fmt:/,addr:l,
+.params = /fmt addr,
+.help   = translate guest physical 'addr' to host virtual 
address, only for debugging,
+.user_print = do_gpa2hva_print,
+.mhandler.cmd_new = do_gpa2hva,
+},
+
+STEXI
+...@item x-gpa2hva @var{addr}
+...@findex x-gpa2hva
+Translate guest physical @var{addr} to host virtual address, only for 
debugging.
+ETEXI
+
+{
 .name   = x,
 .args_type  = fmt:/,addr:l,
 .params = /fmt addr,




Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address

2010-11-14 Thread Huang Ying
On Sun, 2010-11-14 at 19:08 +0800, Avi Kivity wrote:
 On 11/12/2010 03:16 AM, Huang Ying wrote:
  On Thu, 2010-11-11 at 17:39 +0800, Avi Kivity wrote:
On 11/11/2010 02:56 AM, Huang Ying wrote:
  On Thu, 2010-11-11 at 00:49 +0800, Anthony Liguori wrote:
 On 11/10/2010 02:34 AM, Avi Kivity wrote:
Why the gpa -hva mapping is not
consistent for RAM if -mempath is not used?
 
Video RAM in the range a-b and PCI mapped RAM can 
   change gpas
(while remaining in the same hva).
 
Even for ordinary RAM, which doesn't normally change gpa/hva, 
   I'm not
sure we want to guarantee that it won't.
  
 We can't universally either.  Memory hot remove potentially 
   breaks the
 mapping and some non-x86 architectures (like ARM) can alias RAM 
   via a
 guest triggered mechanism.

  Thanks for clarification. Now I think we have two options.

  1) Clearly mark gpa2hva (pfa2hva now, should renamed to gpa2hva) as a
  testing only interface, and should be used only on restricted
  environment (normal memory, without hot-remove, maybe only on x86).

  2) Find some way to lock the gpa -   hva mapping during operating. 
   Such
  as gpa2hva_begin and gpa2hva_end and lock the mapping in between.

  I think 2) may be possible. But if there are no other users, why do 
   that
  for a test case? So I think 1) may be the better option.

  
3) Move the poisoning code into qemu, so the command becomes
  
posison-addressaddr
  
(though physical addresses aren't stable either)
 
  Poisoning includes:
 
  a) gva -  gpa
  b) gpa -  hva
  c) hva -  hpa
  d) inject the MCE into host via some external tool
 
  poison-address need to do b, c, d. Do you think it is good to do all
  these inside qemu?
 
 If d is not too complicated (an ioctl?), we might to it in qemu.

The issue with d) is that there are multiple ways to inject an MCE:
currently one software based, one APEI based, and maybe others in the
future. They all use different interfaces, and as debug interfaces they
are not considered kernel ABI either (some are in debugfs). So I think it
is better to use these interfaces only in a test suite.

Best Regards,
Huang Ying




Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address

2010-11-11 Thread Huang Ying
On Thu, 2010-11-11 at 17:39 +0800, Avi Kivity wrote:
 On 11/11/2010 02:56 AM, Huang Ying wrote:
  On Thu, 2010-11-11 at 00:49 +0800, Anthony Liguori wrote:
On 11/10/2010 02:34 AM, Avi Kivity wrote:
  Why the gpa -   hva mapping is not
  consistent for RAM if -mempath is not used?

  Video RAM in the range a-b and PCI mapped RAM can change gpas
  (while remaining in the same hva).

  Even for ordinary RAM, which doesn't normally change gpa/hva, I'm not
  sure we want to guarantee that it won't.
  
We can't universally either.  Memory hot remove potentially breaks the
mapping and some non-x86 architectures (like ARM) can alias RAM via a
guest triggered mechanism.
 
  Thanks for clarification. Now I think we have two options.
 
  1) Clearly mark gpa2hva (pfa2hva now, should renamed to gpa2hva) as a
  testing only interface, and should be used only on restricted
  environment (normal memory, without hot-remove, maybe only on x86).
 
  2) Find some way to lock the gpa -  hva mapping during operating. Such
  as gpa2hva_begin and gpa2hva_end and lock the mapping in between.
 
  I think 2) may be possible. But if there are no other users, why do that
  for a test case? So I think 1) may be the better option.
 
 
 3) Move the poisoning code into qemu, so the command becomes
 
 posison-address addr
 
 (though physical addresses aren't stable either)

Poisoning includes:

a) gva -> gpa
b) gpa -> hva
c) hva -> hpa
d) inject the MCE into the host via some external tool

poison-address needs to do b), c) and d). Do you think it is good to do
all of these inside qemu?
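
For what it is worth, steps b) and c) are not much code; a sketch of c) on
the QEMU side (assuming 4k pages, the page being present, and enough
privileges to read /proc/self/pagemap):

    #include <stdint.h>
    #include <unistd.h>
    #include <fcntl.h>

    /* Sketch only: hva -> hpa via /proc/self/pagemap (bit 63 = present,
     * bits 0-54 = page frame number). */
    static int hva_to_hpa(void *hva, uint64_t *hpa)
    {
        uint64_t entry;
        long psize = sysconf(_SC_PAGESIZE);
        off_t offset = ((uintptr_t)hva / psize) * sizeof(entry);
        int fd = open("/proc/self/pagemap", O_RDONLY);

        if (fd < 0)
            return -1;
        if (pread(fd, &entry, sizeof(entry), offset) != sizeof(entry) ||
            !(entry & (1ULL << 63))) {
            close(fd);
            return -1;
        }
        close(fd);
        *hpa = (entry & ((1ULL << 55) - 1)) * psize + (uintptr_t)hva % psize;
        return 0;
    }

Step d) would still have to go through an external, interface-specific
injection tool.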

 4) Use -mempath on /dev/shm and poison a page in the backing file

If we can poison a page in the backing file, how do we know the
corresponding gpa and hpa?

Best Regards,
Huang Ying




Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address

2010-11-10 Thread Huang Ying
On Wed, 2010-11-10 at 14:56 +0800, Avi Kivity wrote:
 On 11/08/2010 05:39 AM, Anthony Liguori wrote:
  Yes. The main usage of the interface is automated testing.
 
  That's precisely what the command should not be used for.
 
  You can't assume a gpa - hva mapping is consistent in an external 
  application.  If you want to implement an interface for testing, you 
  have to push more of the logic into QEMU to avoid the race.
 
 An alternative is to use -mempath.  Does poisoning work for tmpfs?

Sorry for the stupid question: why is the gpa -> hva mapping not
consistent for RAM if -mempath is not used?

Best Regards,
Huang Ying




Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address

2010-11-10 Thread Huang Ying
On Wed, 2010-11-10 at 16:28 +0800, Avi Kivity wrote:
 On 11/10/2010 09:41 AM, Huang Ying wrote:
  On Wed, 2010-11-10 at 14:59 +0800, Avi Kivity wrote:
On 11/10/2010 08:56 AM, Avi Kivity wrote:
  On 11/08/2010 05:39 AM, Anthony Liguori wrote:
  Yes. The main usage of the interface is automated testing.

  That's precisely what the command should not be used for.

  You can't assume a gpa -  hva mapping is consistent in an external
  application.  If you want to implement an interface for testing, you
  have to push more of the logic into QEMU to avoid the race.

  An alternative is to use -mempath.  Does poisoning work for tmpfs?

Or hugetlbfs - I think it does?
 
  The QEMU support for hugetlbfs has some issues now. Because it is hard
  for QEMU to deal with 2M poisoned page reported by host OS. Although it
  is possible for QEMU to relay 2M poisoned page as MCE to guest OS, the
  guest OS may not work properly for this kind of MCE.
 
 
 If we get a full address (rather than just a frame number) then we can 
 identify the 4k page and send an mce just for that frame?

We need the host kernel to break down the 2M huge page into 4k pages and
then send SIGBUS to QEMU for just the poisoned 4k page. Otherwise the host
kernel will poison the whole 2M virtual address range, and the other 4k
pages inside the 2M page cannot be accessed in the guest (accessing them
will trigger SIGBUS and an SRAR MCE).

Best Regards,
Huang Ying




Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address

2010-11-10 Thread Huang Ying
On Thu, 2010-11-11 at 00:49 +0800, Anthony Liguori wrote:
 On 11/10/2010 02:34 AM, Avi Kivity wrote:
  Why the gpa -  hva mapping is not
  consistent for RAM if -mempath is not used?
 
  Video RAM in the range a-b and PCI mapped RAM can change gpas 
  (while remaining in the same hva).
 
  Even for ordinary RAM, which doesn't normally change gpa/hva, I'm not 
  sure we want to guarantee that it won't.
 
 We can't universally either.  Memory hot remove potentially breaks the 
 mapping and some non-x86 architectures (like ARM) can alias RAM via a 
 guest triggered mechanism.

Thanks for the clarification. Now I think we have two options.

1) Clearly mark gpa2hva (pfa2hva now; it should be renamed to gpa2hva) as a
testing-only interface, to be used only in a restricted environment
(normal memory, no hot-remove, maybe only on x86).

2) Find some way to lock the gpa -> hva mapping while operating on it, such
as gpa2hva_begin and gpa2hva_end that lock the mapping in between.

I think 2) may be possible. But if there are no other users, why do that
for a test case? So I think 1) may be the better option.

Best Regards,
Huang Ying




Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address

2010-11-09 Thread Huang Ying
On Wed, 2010-11-10 at 14:59 +0800, Avi Kivity wrote:
 On 11/10/2010 08:56 AM, Avi Kivity wrote:
  On 11/08/2010 05:39 AM, Anthony Liguori wrote:
  Yes. The main usage of the interface is automated testing.
 
  That's precisely what the command should not be used for.
 
  You can't assume a gpa - hva mapping is consistent in an external 
  application.  If you want to implement an interface for testing, you 
  have to push more of the logic into QEMU to avoid the race.
 
  An alternative is to use -mempath.  Does poisoning work for tmpfs?
 
 Or hugetlbfs - I think it does?

The QEMU support for hugetlbfs has some issues now, because it is hard
for QEMU to deal with a 2M poisoned page reported by the host OS. Although
it is possible for QEMU to relay a 2M poisoned page as an MCE to the guest
OS, the guest OS may not handle this kind of MCE properly.

Best Regards,
Huang Ying




Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address

2010-11-08 Thread Huang Ying
Hi, Anthony,

On Mon, 2010-11-08 at 11:39 +0800, Anthony Liguori wrote:
 On 11/07/2010 07:29 PM, Huang Ying wrote:
  On Sun, 2010-11-07 at 00:24 +0800, Avi Kivity wrote:
 
  On 11/01/2010 03:22 PM, Anthony Liguori wrote:
   
  On 11/01/2010 02:20 PM, Huang Ying wrote:
 
  Yes. As general interface, it may not work so well, but as test
  interface, it works quite well and useful.
 
  Do we have any mechanism to add a test only interface?
   
  I'd like to see what Luiz/Markus think but definitely only a human
  monitor interface and probably prefix the command with a 'x-' prefix
  to indicate that it's not a supported interface.
 
 
  Automated testing will want to use qmp.
   
  Yes. The main usage of the interface is automated testing.
 
 
 That's precisely what the command should not be used for.
 
 You can't assume a gpa - hva mapping is consistent in an external 
 application.  If you want to implement an interface for testing, you 
 have to push more of the logic into QEMU to avoid the race.

Yes. pfa2hva (it should be renamed to gpa2hva) in the patch does not work
well in the general case, but it works well for our testing. So we do not
want to make it a general interface, but a testing-only interface, perhaps
by putting it into a special namespace like x-. Do you agree?

Best Regards,
Huang Ying




Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address

2010-11-08 Thread Huang Ying
On Mon, 2010-11-08 at 11:39 +0800, Anthony Liguori wrote:
 On 11/07/2010 07:29 PM, Huang Ying wrote:
  On Sun, 2010-11-07 at 00:24 +0800, Avi Kivity wrote:
 
  On 11/01/2010 03:22 PM, Anthony Liguori wrote:
   
  On 11/01/2010 02:20 PM, Huang Ying wrote:
 
  Yes. As general interface, it may not work so well, but as test
  interface, it works quite well and useful.
 
  Do we have any mechanism to add a test only interface?
   
  I'd like to see what Luiz/Markus think but definitely only a human
  monitor interface and probably prefix the command with a 'x-' prefix
  to indicate that it's not a supported interface.
 
 
  Automated testing will want to use qmp.
   
  Yes. The main usage of the interface is automated testing.
 
 
 That's precisely what the command should not be used for.
 
 You can't assume a gpa - hva mapping is consistent in an external 
 application.  If you want to implement an interface for testing, you 
 have to push more of the logic into QEMU to avoid the race.

From the code of cpu_physical_memory_map(), it seems that if 'addr' is a
physical address in RAM, the mapping should be consistent, at least on
x86, shouldn't it?

Best Regards,
Huang Ying




Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address

2010-11-07 Thread Huang Ying
On Sun, 2010-11-07 at 00:24 +0800, Avi Kivity wrote:
 On 11/01/2010 03:22 PM, Anthony Liguori wrote:
  On 11/01/2010 02:20 PM, Huang Ying wrote:
  Yes. As general interface, it may not work so well, but as test
  interface, it works quite well and useful.
 
  Do we have any mechanism to add a test only interface?
 
  I'd like to see what Luiz/Markus think but definitely only a human 
  monitor interface and probably prefix the command with a 'x-' prefix 
  to indicate that it's not a supported interface.
 
 
 Automated testing will want to use qmp.

Yes. The main usage of the interface is automated testing.

  The documentation should be very clear about the limitations of the 
  interface and the intended use-case.
 
 
 Perhaps a { execute: 'enable-version-specific-commands', arguments: 
 ['pfa2hva'] } ?

Best Regards,
Huang Ying


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address

2010-11-01 Thread Huang Ying
On Mon, 2010-11-01 at 11:49 -0700, Anthony Liguori wrote:
 On 11/01/2010 11:09 AM, Marcelo Tosatti wrote:
  On Tue, Oct 26, 2010 at 10:39:48AM +0800, Huang Ying wrote:
 
  Author: Max Asbockmasb...@linux.vnet.ibm.com
 
  Add command pfa2hva to translate guest physical address to host
  virtual address.
 
  The pfa2hva command provides one step in a chain of translations from
  guest virtual to guest physical to host virtual to host physical. Host
  physical is then used to inject a machine check error. As a
  consequence the HWPOISON code on the host and the MCE injection code
  in qemu-kvm are exercised.
 
  v2:
 
  - Add QMP support
   
 
  Looks good to me. Anthony?
 
 
 The translation is not stable so this is really a bad interface.
 
 It definitely shouldn't be a QMP command and I don't think it's 
 generally useful enough that it should be a monitor command.
 
 It's impossible to use correctly as a general interface.
 
 I gave this feedback when it was initially submitted.

Yes. As a general interface it may not work so well, but as a test
interface it works quite well and is useful.

Do we have any mechanism to add a test only interface?

Best Regards,
Huang Ying


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address

2010-10-25 Thread Huang Ying
Author: Max Asbock masb...@linux.vnet.ibm.com

Add command pfa2hva to translate guest physical address to host
virtual address.

The pfa2hva command provides one step in a chain of translations from
guest virtual to guest physical to host virtual to host physical. Host
physical is then used to inject a machine check error. As a
consequence the HWPOISON code on the host and the MCE injection code
in qemu-kvm are exercised.

v2:

- Add QMP support

Signed-off-by: Max Asbock masb...@linux.vnet.ibm.com
Signed-off-by: Jiajia Zheng jiajia.zh...@intel.com
Signed-off-by: Huang Ying ying.hu...@intel.com
---
 hmp-commands.hx |   15 +++
 monitor.c   |   22 ++
 qmp-commands.hx |   27 +++
 3 files changed, 64 insertions(+)

--- a/monitor.c
+++ b/monitor.c
@@ -2272,6 +2272,28 @@ static void do_inject_mce(Monitor *mon,
 }
 #endif
 
+static void do_pfa2hva_print(Monitor *mon, const QObject *data)
+{
+QInt *qint;
+
+qint = qobject_to_qint(data);
+monitor_printf(mon, 0x%lx\n, (unsigned long)qint-value);
+}
+
+static int do_pfa2hva(Monitor *mon, const QDict *qdict, QObject **ret_data)
+{
+target_phys_addr_t paddr;
+target_phys_addr_t size = TARGET_PAGE_SIZE;
+void *vaddr;
+
+paddr = qdict_get_int(qdict, addr);
+vaddr = cpu_physical_memory_map(paddr, size, 0);
+cpu_physical_memory_unmap(vaddr, size, 0, 0);
+*ret_data = qobject_from_jsonf(%ld, (unsigned long)vaddr);
+
+return 0;
+}
+
 static int do_getfd(Monitor *mon, const QDict *qdict, QObject **ret_data)
 {
 const char *fdname = qdict_get_str(qdict, fdname);
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -293,6 +293,21 @@ Start gdbserver session (default @var{po
 ETEXI
 
 {
+.name   = pfa2hva,
+.args_type  = fmt:/,addr:l,
+.params = /fmt addr,
+.help   = translate guest physical 'addr' to host virtual 
address,
+.user_print = do_pfa2hva_print,
+.mhandler.cmd_new = do_pfa2hva,
+},
+
+STEXI
@item pfa2hva @var{addr}
@findex pfa2hva
+Translate guest physical @var{addr} to host virtual address.
+ETEXI
+
+{
 .name   = x,
 .args_type  = fmt:/,addr:l,
 .params = /fmt addr,
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -738,6 +738,33 @@ Example:
 EQMP
 
 {
+.name   = pfa2hva,
+.args_type  = addr:l,
+.params = addr,
+.help   = translate guest physical 'addr' to host virtual 
address,
+.user_print = do_pfa2hva_print,
+.mhandler.cmd_new = do_pfa2hva,
+},
+
+SQMP
+pfa2hva
+---
+
+Translate guest physical 'addr' to host virtual address.
+
+Arguments:
+
+- addr: the guest physical address
+
+Example:
+
+- { execute: pfa2hva,
+   arguments: { addr: 196608 } }
+- { return: 139888084717568 }
+
+EQMP
+
+{
 .name   = qmp_capabilities,
 .args_type  = ,
 .params = ,


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Re: [PATCH 11/11] kvm, x86: broadcast mce depending on the cpu version

2010-10-14 Thread Huang Ying
On Fri, 2010-10-15 at 09:52 +0800, Hidetoshi Seto wrote:
 (2010/10/15 10:06), Marcelo Tosatti wrote:
  On Thu, Oct 14, 2010 at 05:55:28PM +0900, Jin Dongming wrote:
  There is no reason why SRAO event received by the main thread
  is the only one that being broadcasted.
 
  According to the x86 ASDM vol.3A 15.10.4.1,
  MCE signal is broadcast on processor version 06H_EH or later.
 
  This change is required to handle SRAR in the guest.
 
  Signed-off-by: Hidetoshi Seto seto.hideto...@jp.fujitsu.com
  Tested-by: Jin Dongming jin.dongm...@np.css.fujitsu.com
  ---
   qemu-kvm.c |   63 
  +--
   1 files changed, 31 insertions(+), 32 deletions(-)
  
  Why is this necessary? _AO SIGBUS should be sent to all vcpu threads and
  main thread.
 
 Humm? If you are right, vcpu threads will receive same SRAO event twice,
 one is that received by itself and another is that received by main thread
 and forwarded by the broadcast.
 
 My understanding is (Jin, please correct me if something wrong):
  - _AO SIGBUS is sent to main thread only, and then SRAO event is
broadcasted to all vcpu threads.

Yes. It is.

  - _AR SIGBUS is sent to a vcpu thread that tried to touch the
unmapped poisoned page, and SRAR event is posted to the vcpu.
 
 One problem here is that SRAR is not broadcasted.
 The guest might observe the event differently, like some cpus
 don't enter machine check.

Yes. Broadcasting SRAR follows the spec better.

Best Regards,
Huang Ying


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] kvm, MCE, Add MCG_SER_P into KVM_MCE_CAP_SUPPORTED

2010-10-08 Thread Huang Ying
Now that we have MCG_SER_P (and the corresponding SRAO/SRAR MCE) support
in the kernel and QEMU-KVM, MCG_SER_P should be added into
KVM_MCE_CAP_SUPPORTED to make all this code really work.

Reported-by: Dean Nelson dnel...@redhat.com
Signed-off-by: Huang Ying ying.hu...@intel.com
---
 arch/x86/kvm/x86.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -71,7 +71,7 @@
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
 
 #define KVM_MAX_MCE_BANKS 32
-#define KVM_MCE_CAP_SUPPORTED MCG_CTL_P
+#define KVM_MCE_CAP_SUPPORTED (MCG_CTL_P | MCG_SER_P)
 
 /* EFER defaults:
  * - enable syscall per default because its emulated by KVM


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] kvm, MCE, Send SRAR SIGBUS directly

2010-10-08 Thread Huang Ying
Originally, the SRAR SIGBUS was sent to QEMU-KVM by touching the poisoned
page. But commit 96054569190bdec375fe824e48ca1f4e3b53dd36 prevents the
signal from being sent. So now the signal is sent via
force_sig_info_fault directly.

Reported-by: Dean Nelson dnel...@redhat.com
Signed-off-by: Huang Ying ying.hu...@intel.com
---
 arch/x86/include/asm/signal.h |3 +++
 arch/x86/kvm/mmu.c|   15 +++
 arch/x86/mm/fault.c   |6 +++---
 3 files changed, 9 insertions(+), 15 deletions(-)

--- a/arch/x86/include/asm/signal.h
+++ b/arch/x86/include/asm/signal.h
@@ -258,6 +258,9 @@ struct pt_regs;
 
 #define ptrace_signal_deliver(regs, cookie) do { } while (0)
 
+void force_sig_info_fault(int si_signo, int si_code, unsigned long address,
+ struct task_struct *tsk);
+
 #endif /* __KERNEL__ */
 #endif /* __ASSEMBLY__ */
 
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -158,9 +158,8 @@ is_prefetch(struct pt_regs *regs, unsign
return prefetch;
 }
 
-static void
-force_sig_info_fault(int si_signo, int si_code, unsigned long address,
-struct task_struct *tsk)
+void force_sig_info_fault(int si_signo, int si_code, unsigned long address,
+ struct task_struct *tsk)
 {
siginfo_t info;
 
@@ -172,6 +171,7 @@ force_sig_info_fault(int si_signo, int s
 
force_sig_info(si_signo, info, tsk);
 }
+EXPORT_SYMBOL_GPL(force_sig_info_fault);
 
 DEFINE_SPINLOCK(pgd_lock);
 LIST_HEAD(pgd_list);
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -39,6 +39,7 @@
 #include asm/cmpxchg.h
 #include asm/io.h
 #include asm/vmx.h
+#include asm/signal.h
 
 /*
  * When setting this variable to true it enables Two-Dimensional-Paging
@@ -2104,22 +2105,12 @@ static int __direct_map(struct kvm_vcpu
return pt_write;
 }
 
-static void kvm_send_hwpoison_signal(struct kvm *kvm, gfn_t gfn)
-{
-   char buf[1];
-   void __user *hva;
-   int r;
-
-   /* Touch the page, so send SIGBUS */
-   hva = (void __user *)gfn_to_hva(kvm, gfn);
-   r = copy_from_user(buf, hva, 1);
-}
-
 static int kvm_handle_bad_page(struct kvm *kvm, gfn_t gfn, pfn_t pfn)
 {
kvm_release_pfn_clean(pfn);
if (is_hwpoison_pfn(pfn)) {
-   kvm_send_hwpoison_signal(kvm, gfn);
+   force_sig_info_fault(SIGBUS, BUS_MCEERR_AR,
+gfn_to_hva(kvm, gfn), current);
return 0;
} else if (is_fault_pfn(pfn))
return -EFAULT;


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Fix SRAO/SRAR MCE injecting on guest without MCG_SER_P

2010-10-08 Thread Huang Ying
On a real machine, if MCG_SER_P in MSR_MCG_CAP is not set, SRAO/SRAR MCEs
should not be raised by the hardware. This patch makes QEMU-KVM follow
the same rule.

Reported-by: Hidetoshi Seto seto.hideto...@jp.fujitsu.com
Signed-off-by: Huang Ying ying.hu...@intel.com
---
 qemu-kvm.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -1134,7 +1134,7 @@ static void sigbus_handler(int n, struct
void *ctx)
 {
 #if defined(KVM_CAP_MCE)  defined(TARGET_I386)
-if (first_cpu-mcg_cap  siginfo-ssi_addr
+if ((first_cpu-mcg_cap  MCG_SER_P)  siginfo-ssi_addr
  siginfo-ssi_code == BUS_MCEERR_AO) {
 uint64_t status;
 void *vaddr;
@@ -1324,7 +1324,7 @@ static void kvm_on_sigbus(CPUState *env,
 unsigned long paddr;
 int r;
 
-if (env-mcg_cap  siginfo-si_addr
+if ((env-mcg_cap  MCG_SER_P)  siginfo-si_addr
  (siginfo-si_code == BUS_MCEERR_AR
 || siginfo-si_code == BUS_MCEERR_AO)) {
 if (siginfo-si_code == BUS_MCEERR_AR) {
@@ -1356,7 +1356,7 @@ static void kvm_on_sigbus(CPUState *env,
 if (do_qemu_ram_addr_from_host(vaddr, ram_addr) ||
 !kvm_physical_memory_addr_from_ram(kvm_state, ram_addr, 
(target_phys_addr_t *)paddr)) {
 fprintf(stderr, Hardware memory error for memory used by 
-QEMU itself instaed of guest system!\n);
+QEMU itself instead of guest system!\n);
 /* Hope we are lucky for AO MCE */
 if (siginfo-si_code == BUS_MCEERR_AO) {
 return;


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] kvm, MCE, Send SRAR SIGBUS directly

2010-10-08 Thread Huang Ying
On Sat, 2010-10-09 at 04:25 +0800, Marcelo Tosatti wrote:
 On Fri, Oct 08, 2010 at 04:24:15PM +0800, Huang Ying wrote:
  Originally, SRAR SIGBUS is sent to QEMU-KVM via touching the poisoned
  page. But commit 96054569190bdec375fe824e48ca1f4e3b53dd36 prevents the
  signal from being sent. So now the signal is sent via
  force_sig_info_fault directly.
  
  Reported-by: Dean Nelson dnel...@redhat.com
  Signed-off-by: Huang Ying ying.hu...@intel.com
  ---
   arch/x86/include/asm/signal.h |3 +++
   arch/x86/kvm/mmu.c|   15 +++
   arch/x86/mm/fault.c   |6 +++---
   3 files changed, 9 insertions(+), 15 deletions(-)
  
  --- a/arch/x86/include/asm/signal.h
  +++ b/arch/x86/include/asm/signal.h
  @@ -258,6 +258,9 @@ struct pt_regs;
   
   #define ptrace_signal_deliver(regs, cookie) do { } while (0)
   
  +void force_sig_info_fault(int si_signo, int si_code, unsigned long address,
  + struct task_struct *tsk);
  +
   #endif /* __KERNEL__ */
   #endif /* __ASSEMBLY__ */
   
  --- a/arch/x86/mm/fault.c
  +++ b/arch/x86/mm/fault.c
  @@ -158,9 +158,8 @@ is_prefetch(struct pt_regs *regs, unsign
  return prefetch;
   }
   
  -static void
  -force_sig_info_fault(int si_signo, int si_code, unsigned long address,
  -struct task_struct *tsk)
  +void force_sig_info_fault(int si_signo, int si_code, unsigned long address,
  + struct task_struct *tsk)
   {
  siginfo_t info;
   
  @@ -172,6 +171,7 @@ force_sig_info_fault(int si_signo, int s
   
  force_sig_info(si_signo, info, tsk);
   }
  +EXPORT_SYMBOL_GPL(force_sig_info_fault);
   
   DEFINE_SPINLOCK(pgd_lock);
   LIST_HEAD(pgd_list);
  --- a/arch/x86/kvm/mmu.c
  +++ b/arch/x86/kvm/mmu.c
  @@ -39,6 +39,7 @@
   #include asm/cmpxchg.h
   #include asm/io.h
   #include asm/vmx.h
  +#include asm/signal.h
   
   /*
* When setting this variable to true it enables Two-Dimensional-Paging
  @@ -2104,22 +2105,12 @@ static int __direct_map(struct kvm_vcpu
  return pt_write;
   }
   
  -static void kvm_send_hwpoison_signal(struct kvm *kvm, gfn_t gfn)
  -{
  -   char buf[1];
  -   void __user *hva;
  -   int r;
  -
  -   /* Touch the page, so send SIGBUS */
  -   hva = (void __user *)gfn_to_hva(kvm, gfn);
  -   r = copy_from_user(buf, hva, 1);
  -}
  -
   static int kvm_handle_bad_page(struct kvm *kvm, gfn_t gfn, pfn_t pfn)
   {
  kvm_release_pfn_clean(pfn);
  if (is_hwpoison_pfn(pfn)) {
  -   kvm_send_hwpoison_signal(kvm, gfn);
  +   force_sig_info_fault(SIGBUS, BUS_MCEERR_AR,
  +gfn_to_hva(kvm, gfn), current);
  return 0;
  } else if (is_fault_pfn(pfn))
  return -EFAULT;
 
 Better fill siginfo and use force_sig_info, which is already exported.

It seems that force_sig_info is not exported for now.

Best Regards,
Huang Ying


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch uq/master 7/8] MCE: Relay UCR MCE to guest

2010-10-07 Thread Huang Ying
On Thu, 2010-10-07 at 00:05 +0800, Marcelo Tosatti wrote:
 On Wed, Oct 06, 2010 at 10:58:36AM +0900, Hidetoshi Seto wrote:
  I got some more question:
  
  (2010/10/05 3:54), Marcelo Tosatti wrote:
   Index: qemu/target-i386/cpu.h
   ===
   --- qemu.orig/target-i386/cpu.h
   +++ qemu/target-i386/cpu.h
   @@ -250,16 +250,32 @@
#define PG_ERROR_RSVD_MASK 0x08
#define PG_ERROR_I_D_MASK  0x10

   -#define MCG_CTL_P(1UL8)   /* MCG_CAP register available */
   +#define MCG_CTL_P(1ULL8)   /* MCG_CAP register available */
   +#define MCG_SER_P(1ULL24) /* MCA recovery/new status bits */

   -#define MCE_CAP_DEF  MCG_CTL_P
   +#define MCE_CAP_DEF  (MCG_CTL_P|MCG_SER_P)
#define MCE_BANKS_DEF10

  
  It seems that current kvm doesn't support SER_P, so injecting SRAO
  to guest will mean that guest receives VAL|UC|!PCC and RIPV event
  from virtual processor that doesn't have SER_P.
 
 Dean also noted this. I don't think it was deliberate choice to not
 expose SER_P. Huang?

In fact, that should be a BUG. I will fix it as soon as possible.

Best Regards,
Huang Ying


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch uq/master 7/8] MCE: Relay UCR MCE to guest

2010-10-07 Thread Huang Ying
, and the issue remains.
 
 Definitely.
 I guess Huang has some plan or hint for rework this point.

Yes. This should be fixed. The SRAR SIGBUS should be sent directly
instead of being sent by touching the poisoned virtual address.
 
  I would think that if the the bad page can't be sidelined, such that
  the newly booting guest can't use it, then the new guest shouldn't be
  allowed to boot. But perhaps there is some merit in letting it try to
  boot and see if one gets 'lucky'.
 
 In case of booting a real machine in real world, hardware and firmware
 usually (or often) do self-test before passing control to OS.
 Some platform can boot OS with degraded configuration (for example,
 fewer memory) if it has trouble on its component.  Some BIOS may
 stop booting and show messages like please reseat [component] on the
 screen.  So we could implement/request qemu to have such mechanism.
 
 I can understand the merit you mentioned here, in some degree. But I
 think it is hard to say unlucky to customer in business...

Because the contents of poisoned pages are not relevant after a reboot,
QEMU can replace the poisoned pages with good pages when the guest is
rebooted. Do you think that is a good approach?
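
A minimal sketch of that idea, assuming QEMU remembers the host virtual
addresses it received hardware memory error SIGBUS for and maps fresh
anonymous pages over them when the guest is reset (the names and the
fixed-size list below are made up for illustration; error handling is
omitted):

#include <sys/mman.h>
#include <unistd.h>

#define MAX_POISONED 16

static void *poisoned_hva[MAX_POISONED];   /* hypothetical bookkeeping */
static int nr_poisoned;

/* Called from the SIGBUS path: remember which host page went bad. */
static void poison_list_add(void *hva)
{
    if (nr_poisoned < MAX_POISONED)
        poisoned_hva[nr_poisoned++] = hva;
}

/* Called on guest reset: map a fresh zero page over each poisoned page so
 * the rebooted guest never touches the bad frame again; the old contents
 * are irrelevant after a reset anyway. */
static void poison_list_unpoison(void)
{
    long psize = sysconf(_SC_PAGESIZE);
    int i;

    for (i = 0; i < nr_poisoned; i++) {
        mmap(poisoned_hva[i], psize, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    }
    nr_poisoned = 0;
}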

Best Regards,
Huang Ying


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Monitor command to translate guest physical address to host virtual address

2010-09-30 Thread Huang Ying
From: Max Asbock masb...@linux.vnet.ibm.com

Add command p2v to translate guest physical address to host virtual
address.

The p2v command provides one step in a chain of translations from
guest virtual to guest physical to host virtual to host physical. Host
physical is then used to inject a machine check error. As a
consequence the HWPOISON code on the host and the MCE injection code
in qemu-kvm are exercised.

Signed-off-by: Max Asbock masb...@linux.vnet.ibm.com
Signed-off-by: Jiajia Zheng jiajia.zh...@intel.com
Signed-off-by: Huang Ying ying.hu...@intel.com
---
 monitor.c   |   11 +++
 qemu-monitor.hx |   13 +
 2 files changed, 24 insertions(+)

--- a/monitor.c
+++ b/monitor.c
@@ -2301,6 +2301,17 @@ static void do_inject_mce(Monitor *mon,
 }
 #endif
 
+static void do_p2v(Monitor *mon, const QDict *qdict)
+{
+target_phys_addr_t size = TARGET_PAGE_SIZE;
+target_phys_addr_t addr = qdict_get_int(qdict, addr);
+void *vaddr;
+
+vaddr = cpu_physical_memory_map(addr, size, 0);
+monitor_printf(mon, Guest physical address %p is mapped at 
+   host virtual address %p\n, (void *)addr, vaddr);
+}
+
 static int do_getfd(Monitor *mon, const QDict *qdict, QObject **ret_data)
 {
 const char *fdname = qdict_get_str(qdict, fdname);
--- a/qemu-monitor.hx
+++ b/qemu-monitor.hx
@@ -459,6 +459,19 @@ Start gdbserver session (default @var{po
 ETEXI
 
 {
+.name   = p2v,
+.args_type  = fmt:/,addr:l,
+.params = /fmt addr,
+.help   = translate guest physical 'addr' to host virtual 
address,
+.mhandler.cmd = do_p2v,
+},
+STEXI
@item p2v @var{addr}
@findex mce
+Translate guest physical @var{addr} to host virtual address.
+ETEXI
+
+{
 .name   = x,
 .args_type  = fmt:/,addr:l,
 .params = /fmt addr,


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Add RAM - physical addr mapping in MCE simulation

2010-09-26 Thread Huang Ying
Hi, Marcelo,

On Tue, 2010-09-21 at 00:10 +0800, Marcelo Tosatti wrote:
 On Sun, Sep 19, 2010 at 09:00:33AM +0800, Huang Ying wrote:
  In QEMU-KVM, physical address != RAM address. While MCE simulation
  needs physical address instead of RAM address. So
  kvm_physical_memory_addr_from_ram() is implemented to do the
  conversion, and it is invoked before being filled in the IA32_MCi_ADDR
  MSR.
  
  Reported-by: Dean Nelson dnel...@redhat.com
  Signed-off-by: Huang Ying ying.hu...@intel.com
 
 Applied, thanks.
 
 Can you please submit the code necessary to run the test suite upstream?
 Is there anything else needed other than p2v monitor command?

The only missing part is the p2v monitor command. I will re-post it.

Do you want to merge the test suite into the kvm autotest tool? If so,
Shaohui can help to do that, but it seems that he is busy with some
other project now.

Best Regards,
Huang Ying


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Add RAM - physical addr mapping in MCE simulation

2010-09-18 Thread Huang Ying
In QEMU-KVM, physical address != RAM address, while MCE simulation
needs the physical address instead of the RAM address. So
kvm_physical_memory_addr_from_ram() is implemented to do the
conversion, and it is invoked before the address is filled into the
IA32_MCi_ADDR MSR.

Reported-by: Dean Nelson dnel...@redhat.com
Signed-off-by: Huang Ying ying.hu...@intel.com
---
 kvm-all.c  |   18 ++
 kvm.h  |3 +++
 qemu-kvm.c |   13 ++---
 3 files changed, 31 insertions(+), 3 deletions(-)

--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -1137,12 +1137,15 @@ static void sigbus_handler(int n, struct
 if (first_cpu-mcg_cap  siginfo-ssi_addr
  siginfo-ssi_code == BUS_MCEERR_AO) {
 uint64_t status;
+void *vaddr;
+ram_addr_t ram_addr;
 unsigned long paddr;
 CPUState *cenv;
 
 /* Hope we are lucky for AO MCE */
-if (do_qemu_ram_addr_from_host((void *)(intptr_t)siginfo-ssi_addr,
-  paddr)) {
+vaddr = (void *)(intptr_t)siginfo-ssi_addr;
+if (do_qemu_ram_addr_from_host(vaddr, ram_addr) ||
+!kvm_physical_memory_addr_from_ram(kvm_state, ram_addr, paddr)) {
 fprintf(stderr, Hardware memory error for memory used by 
 QEMU itself instead of guest system!: %llx\n,
 (unsigned long long)siginfo-ssi_addr);
@@ -1316,6 +1319,8 @@ static void kvm_on_sigbus(CPUState *env,
 struct kvm_x86_mce mce = {
 .bank = 9,
 };
+void *vaddr;
+ram_addr_t ram_addr;
 unsigned long paddr;
 int r;
 
@@ -1347,7 +1352,9 @@ static void kvm_on_sigbus(CPUState *env,
 mce.misc = (MCM_ADDR_PHYS  6) | 0xc;
 mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_RIPV;
 }
-if (do_qemu_ram_addr_from_host((void *)siginfo-si_addr, paddr)) {
+vaddr = (void *)siginfo-si_addr;
+if (do_qemu_ram_addr_from_host(vaddr, ram_addr) ||
+!kvm_physical_memory_addr_from_ram(kvm_state, ram_addr, paddr)) {
 fprintf(stderr, Hardware memory error for memory used by 
 QEMU itself instaed of guest system!\n);
 /* Hope we are lucky for AO MCE */
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -141,6 +141,24 @@ static KVMSlot *kvm_lookup_overlapping_s
 return found;
 }
 
+int kvm_physical_memory_addr_from_ram(KVMState *s, ram_addr_t ram_addr,
+  target_phys_addr_t *phys_addr)
+{
+int i;
+
+for (i = 0; i  ARRAY_SIZE(s-slots); i++) {
+KVMSlot *mem = s-slots[i];
+
+if (ram_addr = mem-phys_offset 
+ram_addr  mem-phys_offset + mem-memory_size) {
+*phys_addr = mem-start_addr + (ram_addr - mem-phys_offset);
+return 1;
+}
+}
+
+return 0;
+}
+
 static int kvm_set_user_memory_region(KVMState *s, KVMSlot *slot)
 {
 struct kvm_userspace_memory_region mem;
--- a/kvm.h
+++ b/kvm.h
@@ -195,4 +195,7 @@ int kvm_set_irqfd(int gsi, int fd, bool
 #endif
 
 int kvm_set_ioeventfd_pio_word(int fd, uint16_t adr, uint16_t val, bool 
assign);
+
+int kvm_physical_memory_addr_from_ram(KVMState *s, ram_addr_t ram_addr,
+  target_phys_addr_t *phys_addr);
 #endif


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] KVM: Expose MCE control MSRs to userspace

2010-07-11 Thread Huang Ying
Sorry for the late reply.

On Thu, 2010-07-08 at 17:12 +0800, Avi Kivity wrote:
[..snip..]
  And if we clear MSR_IA32_MCG_CTL, the machine check reporting is
  disabled according to SDM Vol 3A, section 15.3.1.3
 
 
  Won't the kernel reenable MCE?  In my testing, the sequence
  MCE-reset-MCE worked after the patch (whereas it would fail without it).
   
  Yes. Because kernel will reenable it. But if we only clear
  MSR_IA32_MCG_STATUS only, MCE-reset-MCE should work too.
 
 
 What happens if we reboot into a kernel that doesn't enable MCE?
 
 I guess it doesn't matter: the new kernel will keep cr4.mce cleared and 
 thus MCE will be blocked.
 
 I'd like to keep the patch as is, so live migration works for MCE (we'll 
 need to add bank support).  I think there's no problem clearing _CTL on 
 reset.  If there is, we can patch qemu not to clear the MSR.  Is that 
 acceptable?

Yes. At least for Linux that is OK. I don't know whether Windows will
set _CTL during boot.

Best Regards,
Huang Ying



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] KVM: Expose MCE control MSRs to userspace

2010-07-08 Thread Huang Ying
On Thu, 2010-07-08 at 15:43 +0800, Avi Kivity wrote:
 On 07/08/2010 05:07 AM, Huang Ying wrote:
 
static u32 emulated_msrs[] = {
 MSR_IA32_MISC_ENABLE,
  +  MSR_IA32_MCG_STATUS,
  +  MSR_IA32_MCG_CTL,
   
  We need only clear MSR_IA32_MCG_STATUS during reset, but should not
  clear MSR_IA32_MCG_CTL.
 
 
 
 Why not?

According to the Intel 64 and IA-32 Architectures Software Developer's
Manual (SDM) Vol. 3A (Table 9-1), the machine check MSRs should be sticky
across reset, except that we need some special processing for
MSR_IA32_MCG_STATUS.

And if we clear MSR_IA32_MCG_CTL, machine check reporting is disabled,
according to SDM Vol. 3A, section 15.3.1.3.
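
So on the reset path the handling would look roughly like this (a sketch
only, not the actual qemu-kvm code; the mcg_status/mcg_ctl/mce_banks
fields are the CPUState members used by the savevm patches later in this
archive):

/* Per the SDM, the machine check MSRs are sticky across reset, so only
 * MCG_STATUS is cleared here; MCG_CTL and the per-bank registers keep
 * their values. */
static void mce_cpu_reset_sketch(CPUState *env)
{
    env->mcg_status = 0;   /* e.g. MCIP must not survive a reset */
    /* env->mcg_ctl and env->mce_banks[] are deliberately left untouched */
}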

Best Regards,
Huang Ying


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] KVM: Expose MCE control MSRs to userspace

2010-07-07 Thread Huang Ying
Hi, Avi,

On Wed, 2010-07-07 at 19:09 +0800, Avi Kivity wrote:
 Userspace needs to reset and save/restore these MSRs.
 
 The MCE banks are not exposed since their number varies from vcpu to vcpu.
 
 Signed-off-by: Avi Kivity a...@redhat.com
 ---
  arch/x86/kvm/x86.c |2 ++
  1 files changed, 2 insertions(+), 0 deletions(-)
 
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 7070b41..1e12cc5 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -744,6 +744,8 @@ static unsigned num_msrs_to_save;
  
  static u32 emulated_msrs[] = {
   MSR_IA32_MISC_ENABLE,
 + MSR_IA32_MCG_STATUS,
 + MSR_IA32_MCG_CTL,

We only need to clear MSR_IA32_MCG_STATUS during reset; we should not
clear MSR_IA32_MCG_CTL.

  };
  
  static int set_efer(struct kvm_vcpu *vcpu, u64 efer)

Best Regards,
Huang Ying


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[BUGFIX] kvm, Fix a race condition for usage of is_hwpoison_address

2010-06-22 Thread Huang Ying
is_hwpoison_address accesses the page table, so the caller must hold
current->mm->mmap_sem in read mode. So fix its usage in hva_to_pfn of
kvm accordingly.

A comment is added to is_hwpoison_address to remind other callers.

Reported-by: Avi Kivity a...@redhat.com
Signed-off-by: Huang Ying ying.hu...@intel.com
---
 mm/memory-failure.c |3 +++
 virt/kvm/kvm_main.c |3 ++-
 2 files changed, 5 insertions(+), 1 deletion(-)

--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1298,6 +1298,9 @@ done:
return ret;
 }
 
+/*
+ * The caller must hold current-mm-mmap_sem in read mode.
+ */
 int is_hwpoison_address(unsigned long addr)
 {
pgd_t *pgdp;
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -955,12 +955,13 @@ static pfn_t hva_to_pfn(struct kvm *kvm,
if (unlikely(npages != 1)) {
struct vm_area_struct *vma;
 
+   down_read(current-mm-mmap_sem);
if (is_hwpoison_address(addr)) {
+   up_read(current-mm-mmap_sem);
get_page(hwpoison_page);
return page_to_pfn(hwpoison_page);
}
 
-   down_read(current-mm-mmap_sem);
vma = find_vma(current-mm, addr);
 
if (vma == NULL || addr  vma-vm_start ||


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/6] KVM: MMU: introduce gfn_to_page_atomic() and gfn_to_pfn_atomic()

2010-06-16 Thread huang ying
On Tue, Jun 15, 2010 at 7:22 PM, Avi Kivity a...@redhat.com wrote:
 btw, is_hwpoison_address() is racy.  While it looks up the address, some
 other task can unmap the page tables under us.

 Andi/Huang?

 One way of fixing it is get_user_pages_ptes_fast(), which also returns the
 pte, also atomically.  I want it for other reasons as well (respond to a
 read fault by gupping the page for read, but allowing write access if the
 pte indicates it is writeable).

Yes. is_hwpoison_address() is racy. But I think it is not absolutely
necessary to call is_hwpoison_address() in hva_to_pfn_atomic(), is it?

For is_hwpoison_address() in hva_to_pfn(), we can protect it with mmap_sem.

Best Regards,
Huang Ying
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH -v3] KVM, Fix QEMU-KVM is killed by guest SRAO MCE (resend)

2010-05-31 Thread Huang Ying
In common cases, a guest SRAO MCE will cause the corresponding poisoned
page to be unmapped and a SIGBUS to be sent to QEMU-KVM, and QEMU-KVM
will then relay the MCE to the guest OS.

But it is reported that if the poisoned page is accessed in the guest
after it is unmapped and before the MCE is relayed to the guest OS,
QEMU-KVM will be killed.

The reason is as follows. Because the poisoned page has been unmapped,
a guest access will cause a guest exit and kvm_mmu_page_fault will be
called. kvm_mmu_page_fault cannot get the poisoned page for the fault
address, so kernel and user space MMIO processing are tried in turn. In
the user space MMIO processing, the poisoned page is accessed again,
and QEMU-KVM is killed by force_sig_info.

To fix the bug, kvm_mmu_page_fault sends the HWPOISON signal to QEMU-KVM
and does not try kernel or user space MMIO processing for the poisoned
page.


Changelog:

v3:

- Make is_hwpoison_address work for pud_large or pmd_large addresses.

v2:

- Use a page table walker to determine whether the virtual address is
  poisoned, to avoid changing the user space interface (via changing
  get_user_pages).

- Wrap bad page processing into kvm_handle_bad_page to avoid code
  duplication.

Reported-by: Max Asbock masb...@linux.vnet.ibm.com
Signed-off-by: Huang Ying ying.hu...@intel.com
---
 arch/x86/kvm/mmu.c |   34 ++
 arch/x86/kvm/paging_tmpl.h |7 ++-
 include/linux/kvm_host.h   |1 +
 include/linux/mm.h |8 
 mm/memory-failure.c|   30 ++
 virt/kvm/kvm_main.c|   30 --
 6 files changed, 95 insertions(+), 15 deletions(-)

--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -32,6 +32,7 @@
 #include linux/compiler.h
 #include linux/srcu.h
 #include linux/slab.h
+#include linux/uaccess.h
 
 #include asm/page.h
 #include asm/cmpxchg.h
@@ -1953,6 +1954,27 @@ static int __direct_map(struct kvm_vcpu
return pt_write;
 }
 
+static void kvm_send_hwpoison_signal(struct kvm *kvm, gfn_t gfn)
+{
+   char buf[1];
+   void __user *hva;
+   int r;
+
+   /* Touch the page, so send SIGBUS */
+   hva = (void __user *)gfn_to_hva(kvm, gfn);
+   r = copy_from_user(buf, hva, 1);
+}
+
+static int kvm_handle_bad_page(struct kvm *kvm, gfn_t gfn, pfn_t pfn)
+{
+   kvm_release_pfn_clean(pfn);
+   if (is_hwpoison_pfn(pfn)) {
+   kvm_send_hwpoison_signal(kvm, gfn);
+   return 0;
+   }
+   return 1;
+}
+
 static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn)
 {
int r;
@@ -1976,10 +1998,8 @@ static int nonpaging_map(struct kvm_vcpu
pfn = gfn_to_pfn(vcpu-kvm, gfn);
 
/* mmio */
-   if (is_error_pfn(pfn)) {
-   kvm_release_pfn_clean(pfn);
-   return 1;
-   }
+   if (is_error_pfn(pfn))
+   return kvm_handle_bad_page(vcpu-kvm, gfn, pfn);
 
spin_lock(vcpu-kvm-mmu_lock);
if (mmu_notifier_retry(vcpu, mmu_seq))
@@ -2191,10 +2211,8 @@ static int tdp_page_fault(struct kvm_vcp
mmu_seq = vcpu-kvm-mmu_notifier_seq;
smp_rmb();
pfn = gfn_to_pfn(vcpu-kvm, gfn);
-   if (is_error_pfn(pfn)) {
-   kvm_release_pfn_clean(pfn);
-   return 1;
-   }
+   if (is_error_pfn(pfn))
+   return kvm_handle_bad_page(vcpu-kvm, gfn, pfn);
spin_lock(vcpu-kvm-mmu_lock);
if (mmu_notifier_retry(vcpu, mmu_seq))
goto out_unlock;
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -430,11 +430,8 @@ static int FNAME(page_fault)(struct kvm_
pfn = gfn_to_pfn(vcpu-kvm, walker.gfn);
 
/* mmio */
-   if (is_error_pfn(pfn)) {
-   pgprintk(gfn %lx is mmio\n, walker.gfn);
-   kvm_release_pfn_clean(pfn);
-   return 1;
-   }
+   if (is_error_pfn(pfn))
+   return kvm_handle_bad_page(vcpu-kvm, walker.gfn, pfn);
 
spin_lock(vcpu-kvm-mmu_lock);
if (mmu_notifier_retry(vcpu, mmu_seq))
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -266,6 +266,7 @@ extern pfn_t bad_pfn;
 
 int is_error_page(struct page *page);
 int is_error_pfn(pfn_t pfn);
+int is_hwpoison_pfn(pfn_t pfn);
 int kvm_is_error_hva(unsigned long addr);
 int kvm_set_memory_region(struct kvm *kvm,
  struct kvm_userspace_memory_region *mem,
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1465,6 +1465,14 @@ extern int sysctl_memory_failure_recover
 extern void shake_page(struct page *p, int access);
 extern atomic_long_t mce_bad_pages;
 extern int soft_offline_page(struct page *page, int flags);
+#ifdef CONFIG_MEMORY_FAILURE
+int is_hwpoison_address(unsigned long addr);
+#else
+static inline int is_hwpoison_address(unsigned long addr)
+{
+   return 0;
+}
+#endif
 
 extern void dump_page(struct page *page);
 
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -45,6 +45,7 @@
 #include linux/page-isolation.h

[PATCH -v3] KVM, Fix QEMU-KVM is killed by guest SRAO MCE

2010-05-17 Thread Huang Ying
In common cases, a guest SRAO MCE will cause the corresponding poisoned
page to be unmapped and a SIGBUS to be sent to QEMU-KVM, and QEMU-KVM
will then relay the MCE to the guest OS.

But it is reported that if the poisoned page is accessed in the guest
after it is unmapped and before the MCE is relayed to the guest OS,
QEMU-KVM will be killed.

The reason is as follows. Because the poisoned page has been unmapped,
a guest access will cause a guest exit and kvm_mmu_page_fault will be
called. kvm_mmu_page_fault cannot get the poisoned page for the fault
address, so kernel and user space MMIO processing are tried in turn. In
the user space MMIO processing, the poisoned page is accessed again,
and QEMU-KVM is killed by force_sig_info.

To fix the bug, kvm_mmu_page_fault sends the HWPOISON signal to QEMU-KVM
and does not try kernel or user space MMIO processing for the poisoned
page.


Changelog:

v3:

- Make is_hwpoison_address work for pud_large or pmd_large addresses.

v2:

- Use a page table walker to determine whether the virtual address is
  poisoned, to avoid changing the user space interface (via changing
  get_user_pages).

- Wrap bad page processing into kvm_handle_bad_page to avoid code
  duplication.

Reported-by: Max Asbock masb...@linux.vnet.ibm.com
Signed-off-by: Huang Ying ying.hu...@intel.com
---
 arch/x86/kvm/mmu.c |   34 ++
 arch/x86/kvm/paging_tmpl.h |7 ++-
 include/linux/kvm_host.h   |1 +
 include/linux/mm.h |8 
 mm/memory-failure.c|   30 ++
 virt/kvm/kvm_main.c|   30 --
 6 files changed, 95 insertions(+), 15 deletions(-)

--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -32,6 +32,7 @@
 #include linux/compiler.h
 #include linux/srcu.h
 #include linux/slab.h
+#include linux/uaccess.h
 
 #include asm/page.h
 #include asm/cmpxchg.h
@@ -1975,6 +1976,27 @@ static int __direct_map(struct kvm_vcpu
return pt_write;
 }
 
+static void kvm_send_hwpoison_signal(struct kvm *kvm, gfn_t gfn)
+{
+   char buf[1];
+   void __user *hva;
+   int r;
+
+   /* Touch the page, so send SIGBUS */
+   hva = (void __user *)gfn_to_hva(kvm, gfn);
+   r = copy_from_user(buf, hva, 1);
+}
+
+static int kvm_handle_bad_page(struct kvm *kvm, gfn_t gfn, pfn_t pfn)
+{
+   kvm_release_pfn_clean(pfn);
+   if (is_hwpoison_pfn(pfn)) {
+   kvm_send_hwpoison_signal(kvm, gfn);
+   return 0;
+   }
+   return 1;
+}
+
 static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn)
 {
int r;
@@ -1998,10 +2020,8 @@ static int nonpaging_map(struct kvm_vcpu
pfn = gfn_to_pfn(vcpu-kvm, gfn);
 
/* mmio */
-   if (is_error_pfn(pfn)) {
-   kvm_release_pfn_clean(pfn);
-   return 1;
-   }
+   if (is_error_pfn(pfn))
+   return kvm_handle_bad_page(vcpu-kvm, gfn, pfn);
 
spin_lock(vcpu-kvm-mmu_lock);
if (mmu_notifier_retry(vcpu, mmu_seq))
@@ -2204,10 +2224,8 @@ static int tdp_page_fault(struct kvm_vcp
mmu_seq = vcpu-kvm-mmu_notifier_seq;
smp_rmb();
pfn = gfn_to_pfn(vcpu-kvm, gfn);
-   if (is_error_pfn(pfn)) {
-   kvm_release_pfn_clean(pfn);
-   return 1;
-   }
+   if (is_error_pfn(pfn))
+   return kvm_handle_bad_page(vcpu-kvm, gfn, pfn);
spin_lock(vcpu-kvm-mmu_lock);
if (mmu_notifier_retry(vcpu, mmu_seq))
goto out_unlock;
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -424,11 +424,8 @@ static int FNAME(page_fault)(struct kvm_
pfn = gfn_to_pfn(vcpu-kvm, walker.gfn);
 
/* mmio */
-   if (is_error_pfn(pfn)) {
-   pgprintk(gfn %lx is mmio\n, walker.gfn);
-   kvm_release_pfn_clean(pfn);
-   return 1;
-   }
+   if (is_error_pfn(pfn))
+   return kvm_handle_bad_page(vcpu-kvm, walker.gfn, pfn);
 
spin_lock(vcpu-kvm-mmu_lock);
if (mmu_notifier_retry(vcpu, mmu_seq))
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -254,6 +254,7 @@ extern pfn_t bad_pfn;
 
 int is_error_page(struct page *page);
 int is_error_pfn(pfn_t pfn);
+int is_hwpoison_pfn(pfn_t pfn);
 int kvm_is_error_hva(unsigned long addr);
 int kvm_set_memory_region(struct kvm *kvm,
  struct kvm_userspace_memory_region *mem,
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1464,6 +1464,14 @@ extern int sysctl_memory_failure_recover
 extern void shake_page(struct page *p, int access);
 extern atomic_long_t mce_bad_pages;
 extern int soft_offline_page(struct page *page, int flags);
+#ifdef CONFIG_MEMORY_FAILURE
+int is_hwpoison_address(unsigned long addr);
+#else
+static inline int is_hwpoison_address(unsigned long addr)
+{
+   return 0;
+}
+#endif
 
 extern void dump_page(struct page *page);
 
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -45,6 +45,7 @@
 #include linux/page-isolation.h

Re: [PATCH -v2] KVM, Fix QEMU-KVM is killed by guest SRAO MCE

2010-05-13 Thread Huang Ying
On Fri, 2010-05-14 at 05:43 +0800, Marcelo Tosatti wrote:
 On Wed, May 12, 2010 at 02:44:03PM +0800, Huang Ying wrote:
  @@ -1975,6 +1976,27 @@ static int __direct_map(struct kvm_vcpu
  return pt_write;
   }
   
  +static void kvm_send_hwpoison_signal(struct kvm *kvm, gfn_t gfn)
  +{
  +   char buf[1];
  +   void __user *hva;
  +   int r;
  +
  +   /* Touch the page, so send SIGBUS */
  +   hva = (void __user *)gfn_to_hva(kvm, gfn);
  +   r = copy_from_user(buf, hva, 1);
  +}
 
 A SIGBUS signal has been raised by memory poisoning already, so i don't
 see why this is needed?
 
 To avoid the MMIO processing in userspace before the MCE is sent to the
 guest you can just return -EAGAIN from the page fault handlers back to
 kvm_mmu_page_fault.

The SIGBUS signal is necessary here because this SIGBUS is SRAR (Action
Required) while the previously sent one is SRAO (Action Optional). They
have different semantics and are processed differently in QEMU-KVM and
the guest OS.

For example, the guest Linux kernel may ignore an SRAO MCE (raised by
QEMU-KVM after receiving an SRAO SIGBUS), because it is optional, but for
an SRAR MCE (raised by QEMU-KVM after receiving an SRAR SIGBUS) the guest
Linux kernel must kill the corresponding application or panic.
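
For reference, the difference boils down to how qemu-kvm fakes the MCE
status for the two SIGBUS flavours. A condensed, self-contained sketch
(the bit positions are the standard x86 MCA encodings; 0xc0 and 0x134 are
assumed here to be the architectural error codes for memory scrubbing and
data-load errors, so treat the exact values as illustrative):

#include <stdint.h>

#define MCI_STATUS_VAL   (1ULL << 63)
#define MCI_STATUS_UC    (1ULL << 61)
#define MCI_STATUS_EN    (1ULL << 60)
#define MCI_STATUS_MISCV (1ULL << 59)
#define MCI_STATUS_ADDRV (1ULL << 58)
#define MCI_STATUS_S     (1ULL << 56)
#define MCI_STATUS_AR    (1ULL << 55)
#define MCG_STATUS_RIPV  (1ULL << 0)
#define MCG_STATUS_EIPV  (1ULL << 1)
#define MCG_STATUS_MCIP  (1ULL << 2)

/* action_required != 0 means the SIGBUS si_code was BUS_MCEERR_AR (SRAR);
 * otherwise it was BUS_MCEERR_AO (SRAO). */
static void fake_ucr_mce(int action_required,
                         uint64_t *status, uint64_t *mcg_status)
{
    *status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
            | MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S;

    if (action_required) {
        /* SRAR: the guest must act (kill the task, or panic for a
         * kernel-mode access). */
        *status |= MCI_STATUS_AR | 0x134;
        *mcg_status = MCG_STATUS_MCIP | MCG_STATUS_EIPV;
    } else {
        /* SRAO: the guest may simply log the error and continue. */
        *status |= 0xc0;
        *mcg_status = MCG_STATUS_MCIP | MCG_STATUS_RIPV;
    }
}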

  +int is_hwpoison_address(unsigned long addr)
  +{
  +   pgd_t *pgdp;
  +   pud_t *pudp;
  +   pmd_t *pmdp;
  +   pte_t pte, *ptep;
  +   swp_entry_t entry;
  +
  +   pgdp = pgd_offset(current-mm, addr);
  +   if (!pgd_present(*pgdp))
  +   return 0;
  +   pudp = pud_offset(pgdp, addr);
  +   if (!pud_present(*pudp))
  +   return 0;
  +   pmdp = pmd_offset(pudp, addr);
  +   if (!pmd_present(*pmdp))
  +   return 0;
 
 Need to bail out if pmd is huge.

Yes. I will change this.

Best Regards,
Huang Ying


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH -v2] KVM, Fix QEMU-KVM is killed by guest SRAO MCE

2010-05-12 Thread Huang Ying
In common cases, a guest SRAO MCE will cause the corresponding poisoned
page to be unmapped and a SIGBUS to be sent to QEMU-KVM, and QEMU-KVM
will then relay the MCE to the guest OS.

But it is reported that if the poisoned page is accessed in the guest
after it is unmapped and before the MCE is relayed to the guest OS,
QEMU-KVM will be killed.

The reason is as follows. Because the poisoned page has been unmapped,
a guest access will cause a guest exit and kvm_mmu_page_fault will be
called. kvm_mmu_page_fault cannot get the poisoned page for the fault
address, so kernel and user space MMIO processing are tried in turn. In
the user space MMIO processing, the poisoned page is accessed again,
and QEMU-KVM is killed by force_sig_info.

To fix the bug, kvm_mmu_page_fault sends the HWPOISON signal to QEMU-KVM
and does not try kernel or user space MMIO processing for the poisoned
page.


Changelog:

v2:

- Use a page table walker to determine whether the virtual address is
  poisoned, to avoid changing the user space interface (via changing
  get_user_pages).

- Wrap bad page processing into kvm_handle_bad_page to avoid code
  duplication.

Reported-by: Max Asbock masb...@linux.vnet.ibm.com
Signed-off-by: Huang Ying ying.hu...@intel.com
---
 arch/x86/kvm/mmu.c |   34 ++
 arch/x86/kvm/paging_tmpl.h |7 ++-
 include/linux/kvm_host.h   |1 +
 include/linux/mm.h |8 
 mm/memory-failure.c|   28 
 virt/kvm/kvm_main.c|   30 --
 6 files changed, 93 insertions(+), 15 deletions(-)

--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -32,6 +32,7 @@
 #include linux/compiler.h
 #include linux/srcu.h
 #include linux/slab.h
+#include linux/uaccess.h
 
 #include asm/page.h
 #include asm/cmpxchg.h
@@ -1975,6 +1976,27 @@ static int __direct_map(struct kvm_vcpu
return pt_write;
 }
 
+static void kvm_send_hwpoison_signal(struct kvm *kvm, gfn_t gfn)
+{
+   char buf[1];
+   void __user *hva;
+   int r;
+
+   /* Touch the page, so send SIGBUS */
+   hva = (void __user *)gfn_to_hva(kvm, gfn);
+   r = copy_from_user(buf, hva, 1);
+}
+
+static int kvm_handle_bad_page(struct kvm *kvm, gfn_t gfn, pfn_t pfn)
+{
+   kvm_release_pfn_clean(pfn);
+   if (is_hwpoison_pfn(pfn)) {
+   kvm_send_hwpoison_signal(kvm, gfn);
+   return 0;
+   }
+   return 1;
+}
+
 static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn)
 {
int r;
@@ -1998,10 +2020,8 @@ static int nonpaging_map(struct kvm_vcpu
pfn = gfn_to_pfn(vcpu-kvm, gfn);
 
/* mmio */
-   if (is_error_pfn(pfn)) {
-   kvm_release_pfn_clean(pfn);
-   return 1;
-   }
+   if (is_error_pfn(pfn))
+   return kvm_handle_bad_page(vcpu-kvm, gfn, pfn);
 
spin_lock(vcpu-kvm-mmu_lock);
if (mmu_notifier_retry(vcpu, mmu_seq))
@@ -2204,10 +2224,8 @@ static int tdp_page_fault(struct kvm_vcp
mmu_seq = vcpu-kvm-mmu_notifier_seq;
smp_rmb();
pfn = gfn_to_pfn(vcpu-kvm, gfn);
-   if (is_error_pfn(pfn)) {
-   kvm_release_pfn_clean(pfn);
-   return 1;
-   }
+   if (is_error_pfn(pfn))
+   return kvm_handle_bad_page(vcpu-kvm, gfn, pfn);
spin_lock(vcpu-kvm-mmu_lock);
if (mmu_notifier_retry(vcpu, mmu_seq))
goto out_unlock;
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -424,11 +424,8 @@ static int FNAME(page_fault)(struct kvm_
pfn = gfn_to_pfn(vcpu-kvm, walker.gfn);
 
/* mmio */
-   if (is_error_pfn(pfn)) {
-   pgprintk(gfn %lx is mmio\n, walker.gfn);
-   kvm_release_pfn_clean(pfn);
-   return 1;
-   }
+   if (is_error_pfn(pfn))
+   return kvm_handle_bad_page(vcpu-kvm, walker.gfn, pfn);
 
spin_lock(vcpu-kvm-mmu_lock);
if (mmu_notifier_retry(vcpu, mmu_seq))
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -254,6 +254,7 @@ extern pfn_t bad_pfn;
 
 int is_error_page(struct page *page);
 int is_error_pfn(pfn_t pfn);
+int is_hwpoison_pfn(pfn_t pfn);
 int kvm_is_error_hva(unsigned long addr);
 int kvm_set_memory_region(struct kvm *kvm,
  struct kvm_userspace_memory_region *mem,
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -92,6 +92,9 @@ static bool kvm_rebooting;
 
 static bool largepages_enabled = true;
 
+struct page *hwpoison_page;
+pfn_t hwpoison_pfn;
+
 inline int kvm_is_mmio_pfn(pfn_t pfn)
 {
if (pfn_valid(pfn)) {
@@ -809,16 +812,22 @@ EXPORT_SYMBOL_GPL(kvm_disable_largepages
 
 int is_error_page(struct page *page)
 {
-   return page == bad_page;
+   return page == bad_page || page == hwpoison_page;
 }
 EXPORT_SYMBOL_GPL(is_error_page);
 
 int is_error_pfn(pfn_t pfn)
 {
-   return pfn == bad_pfn;
+   return pfn == bad_pfn || pfn == hwpoison_pfn;
 }
 EXPORT_SYMBOL_GPL(is_error_pfn);
 
+int

Re: [PATCH] Ignore SRAO MCE if another MCE is being processed

2010-04-28 Thread Huang Ying
On Wed, 2010-04-28 at 00:12 +0800, Marcelo Tosatti wrote:
 On Tue, Apr 27, 2010 at 03:10:49PM +0800, Huang Ying wrote:
  In common cases, guest SRAO MCE will cause corresponding poisoned page
  be un-mapped in host and SIGBUS be sent to QEMU-KVM, then QEMU-KVM
  will relay the MCE to guest OS.
  
  But it is possible that the poisoned page is accessed in guest after
  un-mapped in host and before MCE is relayed to guest OS. So that, the
  SRAR SIGBUS is sent to QEMU-KVM before the SRAO SIGBUS, and if
  QEMU-KVM relays them to guest OS one by one, guest system may reset,
  because the SRAO MCE may be triggered while the SRAR MCE is being
  processed. In fact, the SRAO MCE can be ignored in this situation, so
  that the guest system is given opportunity to survive.
  
  Signed-off-by: Huang Ying ying.hu...@intel.com
  ---
   qemu-kvm.c |   28 
   1 file changed, 28 insertions(+)
  
  --- a/qemu-kvm.c
  +++ b/qemu-kvm.c
  @@ -1610,6 +1610,19 @@ static void flush_queued_work(CPUState *
   pthread_cond_broadcast(qemu_work_cond);
   }
   
  +static int kvm_mce_in_exception(CPUState *env)
  +{
  +struct kvm_msr_entry msr_mcg_status = {
  +.index = MSR_MCG_STATUS,
  +};
  +int r;
  +
  +r = kvm_get_msrs(env, msr_mcg_status, 1);
  +if (r == -1 || r == 0)
  +return -1;
  +return !!(msr_mcg_status.data  MCG_STATUS_MCIP);
  +}
  +
   static void kvm_on_sigbus(CPUState *env, siginfo_t *siginfo)
   {
   #if defined(KVM_CAP_MCE)  defined(TARGET_I386)
  @@ -1630,6 +1643,15 @@ static void kvm_on_sigbus(CPUState *env,
   mce.misc = (MCM_ADDR_PHYS  6) | 0xc;
   mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_EIPV;
   } else {
  +/*
  + * If there is an MCE excpetion being processed, ignore
  + * this SRAO MCE
  + */
  +r = kvm_mce_in_exception(env);
  +if (r == -1)
  +fprintf(stderr, Failed to get MCE status\n);
  +else if (r)
  +return;
   /* Fake an Intel architectural Memory scrubbing UCR */
   mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
   | MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
  @@ -2475,6 +2497,12 @@ static void kvm_do_inject_x86_mce(void *
   struct kvm_x86_mce_data *data = _data;
   int r;
   
  +/* If there is an MCE excpetion being processed, ignore this SRAO MCE 
  */
  +r = kvm_mce_in_exception(data-env);
  +if (r == -1)
  +fprintf(stderr, Failed to get MCE status\n);
  +else if (r  !(data-mce-status  MCI_STATUS_AR))
  +return;
 
 Don't you need to set the OVER bit in the MCI_STATUS register when 
 this happens?

The OVER bit is set when an uncorrected error overwrites a corrected
error. There is no specification of the OVER bit for this situation, and
I just don't see a benefit in setting it.

 Unrelated to this patch, it would be nice if you can share the testing
 code.

There are some test scripts and documentation for this in:

git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git

The test scripts are in the kvm directory; the testing document is
kvm/README.

Best Regards,
Huang Ying


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] KVM, Fix QEMU-KVM is killed by guest SRAO MCE

2010-04-28 Thread Huang Ying
On Wed, 2010-04-28 at 17:47 +0800, Avi Kivity wrote:
 On 04/28/2010 05:56 AM, Huang Ying wrote:
 
 
  Just want to use the side effect of copy_from_user, SIGBUS will be sent
  to current process because the page touched is marked as poisoned. That
  is, failure is expected, so the return value is not checked.
 
 
  What if the failure doesn't happen?  Say, someone mmap()ed over the page.
   
  Sorry, not get your idea clearly. hva is re-mmap()ed? We just read the
  hva, not write, so I think it should be OK here.
 
 
 
 We don't generate a signal in this case.  Does the code continue to work 
 correctly (not sure what correctly is in this case... should probably 
 just continue).
 
 There's also the possibility of -EFAULT.

I think the signal should be generated for copy_from_user, because the
hva is poisoned now. The signal will not be generated only if the hva is
re-mmap()ped to some other physical page, but that should be impossible
unless we have memory hot-add/hot-remove in KVM.

If the signal is not generated, or is lost or overwritten, the guest will
continue, and if the hva is still poisoned, the page fault will be
triggered again; if the hva is not poisoned, there will be no further
page fault.

Best Regards,
Huang Ying



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Ignore SRAO MCE if another MCE is being processed

2010-04-27 Thread Huang Ying
In common cases, a guest SRAO MCE will cause the corresponding poisoned
page to be unmapped in the host and a SIGBUS to be sent to QEMU-KVM, and
QEMU-KVM will then relay the MCE to the guest OS.

But it is possible that the poisoned page is accessed in the guest after
it is unmapped in the host and before the MCE is relayed to the guest
OS. In that case, the SRAR SIGBUS is sent to QEMU-KVM before the SRAO
SIGBUS, and if QEMU-KVM relays them to the guest OS one by one, the
guest system may reset, because the SRAO MCE may be triggered while the
SRAR MCE is being processed. In fact, the SRAO MCE can be ignored in
this situation, so that the guest system is given an opportunity to
survive.

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 qemu-kvm.c |   28 
 1 file changed, 28 insertions(+)

--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -1610,6 +1610,19 @@ static void flush_queued_work(CPUState *
 pthread_cond_broadcast(qemu_work_cond);
 }
 
+static int kvm_mce_in_exception(CPUState *env)
+{
+struct kvm_msr_entry msr_mcg_status = {
+.index = MSR_MCG_STATUS,
+};
+int r;
+
+r = kvm_get_msrs(env, msr_mcg_status, 1);
+if (r == -1 || r == 0)
+return -1;
+return !!(msr_mcg_status.data  MCG_STATUS_MCIP);
+}
+
 static void kvm_on_sigbus(CPUState *env, siginfo_t *siginfo)
 {
 #if defined(KVM_CAP_MCE)  defined(TARGET_I386)
@@ -1630,6 +1643,15 @@ static void kvm_on_sigbus(CPUState *env,
 mce.misc = (MCM_ADDR_PHYS  6) | 0xc;
 mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_EIPV;
 } else {
+/*
+ * If there is an MCE exception being processed, ignore
+ * this SRAO MCE
+ */
+r = kvm_mce_in_exception(env);
+if (r == -1)
+fprintf(stderr, Failed to get MCE status\n);
+else if (r)
+return;
 /* Fake an Intel architectural Memory scrubbing UCR */
 mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
 | MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
@@ -2475,6 +2497,12 @@ static void kvm_do_inject_x86_mce(void *
 struct kvm_x86_mce_data *data = _data;
 int r;
 
+/* If there is an MCE exception being processed, ignore this SRAO MCE */
+r = kvm_mce_in_exception(data-env);
+if (r == -1)
+fprintf(stderr, Failed to get MCE status\n);
+else if (r  !(data-mce-status  MCI_STATUS_AR))
+return;
 r = kvm_set_mce(data-env, data-mce);
 if (r  0) {
 perror(kvm_set_mce FAILED);


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] KVM, Fix QEMU-KVM is killed by guest SRAO MCE

2010-04-27 Thread Huang Ying
On Tue, 2010-04-27 at 15:47 +0800, Avi Kivity wrote:
 (please copy kvm@vger.kernel.org on kvm patches)

Sorry, will do that for all future patches.

 On 04/27/2010 10:04 AM, Huang Ying wrote:
 
  +static void kvm_send_hwpoison_signal(struct kvm *kvm, gfn_t gfn)
  +{
  +   char buf[1];
  +   void __user *hva;
  +   int r;
  +
  +   /* Touch the page, so send SIGBUS */
  +   hva = (void __user *)gfn_to_hva(kvm, gfn);
  +   r = copy_from_user(buf, hva, 1);
 
 
 No error check?  What will a copy_from_user() of poisoned page expected 
 to return?
 
 Best to return -EFAULT on failure for consistency.

I just want to use the side effect of copy_from_user: a SIGBUS will be
sent to the current process because the touched page is marked as
poisoned. That is, failure is expected, so the return value is not
checked.

  +}
  +
static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t 
  gfn)
{
  int r;
  @@ -1997,7 +2009,11 @@ static int nonpaging_map(struct kvm_vcpu
  /* mmio */
  if (is_error_pfn(pfn)) {
  kvm_release_pfn_clean(pfn);
  -   return 1;
  +   if (is_hwpoison_pfn(pfn)) {
  +   kvm_send_hwpoison_signal(vcpu-kvm, gfn);
  +   return 0;
  +   } else
  +   return 1;
  }
 
 
 This is duplicated several times.  Please introduce a kvm_handle_bad_page():
 
  if (is_error_pfn(pfn))
  return kvm_handle_bad_page(vcpu-kvm, gfn, pfn);

OK. Will do that.

Best Regards,
Huang Ying


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] KVM, Fix QEMU-KVM is killed by guest SRAO MCE

2010-04-27 Thread Huang Ying
On Tue, 2010-04-27 at 17:30 +0800, Avi Kivity wrote:
 On 04/27/2010 12:25 PM, Huang Ying wrote:
 
 
  On 04/27/2010 10:04 AM, Huang Ying wrote:
   
  +static void kvm_send_hwpoison_signal(struct kvm *kvm, gfn_t gfn)
  +{
  + char buf[1];
  + void __user *hva;
  + int r;
  +
  + /* Touch the page, so send SIGBUS */
  + hva = (void __user *)gfn_to_hva(kvm, gfn);
  + r = copy_from_user(buf, hva, 1);
 
 
  No error check?  What will a copy_from_user() of poisoned page expected
  to return?
 
  Best to return -EFAULT on failure for consistency.
   
  Just want to use the side effect of copy_from_user, SIGBUS will be sent
  to current process because the page touched is marked as poisoned. That
  is, failure is expected, so the return value is not checked.
 
 
 What if the failure doesn't happen?  Say, someone mmap()ed over the page.

Sorry, I don't quite get your point. You mean the hva is re-mmap()ed? We
just read the hva, not write it, so I think it should be OK here.

 btw, better to use (void)copy_from_user(...) instead to avoid the 
 initialized but not used warning the compiler may generate.

OK. Will do that.

Best Regards,
Huang Ying


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH -v3] Add savevm/loadvm support for MCE

2010-03-03 Thread Huang Ying
MCE registers are saved/loaded into/from CPUState in
kvm_arch_save/load_regs. To simulate the clearing of MCG_STATUS upon
reset, MSR_MCG_STATUS is set to 0 for KVM_PUT_RESET_STATE.

v3:

 - use msrs[] in kvm_arch_load/save_regs and get_msr_entry directly.

v2:

 - Rebased on new CPU registers save/load framework.

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 qemu-kvm-x86.c |   36 
 1 file changed, 36 insertions(+)

--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -748,7 +748,22 @@ static int get_msr_entry(struct kvm_msr_
     case MSR_KVM_WALL_CLOCK:
         env->wall_clock_msr = entry->data;
         break;
+#ifdef KVM_CAP_MCE
+    case MSR_MCG_STATUS:
+        env->mcg_status = entry->data;
+        break;
+    case MSR_MCG_CTL:
+        env->mcg_ctl = entry->data;
+        break;
+#endif
     default:
+#ifdef KVM_CAP_MCE
+        if (entry->index >= MSR_MC0_CTL && \
+            entry->index < MSR_MC0_CTL + (env->mcg_cap & 0xff) * 4) {
+            env->mce_banks[entry->index - MSR_MC0_CTL] = entry->data;
+            break;
+        }
+#endif
         printf("Warning unknown msr index 0x%x\n", entry->index);
         return 1;
     }
@@ -979,6 +994,18 @@ void kvm_arch_load_regs(CPUState *env, i
         set_msr_entry(&msrs[n++], MSR_KVM_SYSTEM_TIME, env->system_time_msr);
         set_msr_entry(&msrs[n++], MSR_KVM_WALL_CLOCK, env->wall_clock_msr);
     }
+#ifdef KVM_CAP_MCE
+    if (env->mcg_cap) {
+        if (level == KVM_PUT_RESET_STATE)
+            set_msr_entry(&msrs[n++], MSR_MCG_STATUS, env->mcg_status);
+        else if (level == KVM_PUT_FULL_STATE) {
+            set_msr_entry(&msrs[n++], MSR_MCG_STATUS, env->mcg_status);
+            set_msr_entry(&msrs[n++], MSR_MCG_CTL, env->mcg_ctl);
+            for (i = 0; i < (env->mcg_cap & 0xff); i++)
+                set_msr_entry(&msrs[n++], MSR_MC0_CTL + i, env->mce_banks[i]);
+        }
+    }
+#endif
 
     rc = kvm_set_msrs(env, msrs, n);
     if (rc == -1)
@@ -1144,6 +1171,15 @@ void kvm_arch_save_regs(CPUState *env)
     msrs[n++].index = MSR_KVM_SYSTEM_TIME;
     msrs[n++].index = MSR_KVM_WALL_CLOCK;
 
+#ifdef KVM_CAP_MCE
+    if (env->mcg_cap) {
+        msrs[n++].index = MSR_MCG_STATUS;
+        msrs[n++].index = MSR_MCG_CTL;
+        for (i = 0; i < (env->mcg_cap & 0xff) * 4; i++)
+            msrs[n++].index = MSR_MC0_CTL + i;
+    }
+#endif
+
     rc = kvm_get_msrs(env, msrs, n);
     if (rc == -1) {
         perror("kvm_get_msrs FAILED");




Re: [PATCH -v2] Add savevm/loadvm support for MCE

2010-03-02 Thread Huang Ying
On Tue, 2010-03-02 at 16:09 +0800, Jan Kiszka wrote:
 Huang Ying wrote:
  MCE registers are saved/load into/from CPUState in
  kvm_arch_save/load_regs. To simulate the MCG_STATUS clearing upon
  reset, MSR_MCG_STATUS is set to 0 for KVM_PUT_RESET_STATE.
  
  v2:
  
   - Rebased on new CPU registers save/load framework.
 
 Yep, much closer. :)
 
  
  Signed-off-by: Huang Ying ying.hu...@intel.com
  ---
   qemu-kvm-x86.c |   43 +++
   1 file changed, 43 insertions(+)
  
  --- a/qemu-kvm-x86.c
  +++ b/qemu-kvm-x86.c
  @@ -979,11 +979,33 @@ void kvm_arch_load_regs(CPUState *env, i
   set_msr_entry(msrs[n++], MSR_KVM_SYSTEM_TIME, 
  env-system_time_msr);
   set_msr_entry(msrs[n++], MSR_KVM_WALL_CLOCK, env-wall_clock_msr);
   }
  +#ifdef KVM_CAP_MCE
  +if (env-mcg_cap  level == KVM_PUT_RESET_STATE) {
  +/*
  + * MCG_STATUS should reset to 0 after reset, while other MCE
  + * registers should be unchanged
  + */
  +set_msr_entry(msrs[n++], MSR_MCG_STATUS, 0);
 
 For the sake of consistency, just write mcg_status here (it's properly
 updated in cpu_reset).

OK.

  +}
  +#endif
   
   rc = kvm_set_msrs(env, msrs, n);
   if (rc == -1)
   perror(kvm_set_msrs FAILED);
   
  +#ifdef KVM_CAP_MCE
  +if (env-mcg_cap  level == KVM_PUT_FULL_STATE) {
  +n = 0;
  +set_msr_entry(msrs[n++], MSR_MCG_STATUS, env-mcg_status);
  +set_msr_entry(msrs[n++], MSR_MCG_CTL, env-mcg_ctl);
 
 You can move this block up, reusing the kvm_set_msrs above. But...
 
  +for (i = 0; i  (env-mcg_cap  0xff); i++)
 
 ...this requires some care. We have space for writing up to 100
 registers in our msrs array. You may have to extend it unless this
 number is much smaller in reality.

The default number of MCE banks is 10, which means 2 + 10 * 4 = 42 entries
(MCG_STATUS, MCG_CTL, plus four MSRs per bank). So I think it is safer to
use a separate kvm_set_msrs call. And the stack space is limited too.

  +set_msr_entry(msrs[n++], MSR_MC0_CTL + i, env-mce_banks[i]);
  +rc = kvm_set_msrs(env, msrs, n);
  +if (rc == -1)
  +perror(kvm_set_msrs FAILED);
  +}
  +#endif
  +
   if (level = KVM_PUT_RESET_STATE) {
   kvm_arch_load_mpstate(env);
   kvm_load_lapic(env);
  @@ -1155,6 +1177,27 @@ void kvm_arch_save_regs(CPUState *env)
   return;
   }
   }
  +
  +#ifdef KVM_CAP_MCE
  +if (env-mcg_cap) {
 
 No need to check for msg_cap, the kernel will ignore unknown MSRs.

But then an error message will be printed in the kernel log for each unknown MSR. Is that OK?

  +msrs[0].index = MSR_MCG_STATUS;
  +msrs[1].index = MSR_MCG_CTL;
  +n = (env-mcg_cap  0xff) * 4;
  +for (i = 0; i  n; i++)
 
 Same are above, we may run out of array space.
 
  +msrs[2 + i].index = MSR_MC0_CTL + i;
  +
  +rc = kvm_get_msrs(env, msrs, n + 2);
  +if (rc == -1)
  +perror(kvm_get_msrs FAILED);
  +else {
  +env-mcg_status = msrs[0].data;
  +env-mcg_ctl = msrs[1].data;
  +for (i = 0; i  n; i++)
  +env-mce_banks[i] = msrs[2 + i].data;
  +}
 
 Please split this block into setup and MSR transfer, and then merge it
 into the existing MSR readout to avoid calling kvm_get_msrs twice.

I will do that if the stack space is not an issue.

Best Regards,
Huang Ying




[PATCH -v2] Add savevm/loadvm support for MCE

2010-03-01 Thread Huang Ying
MCE registers are saved/load into/from CPUState in
kvm_arch_save/load_regs. To simulate the MCG_STATUS clearing upon
reset, MSR_MCG_STATUS is set to 0 for KVM_PUT_RESET_STATE.

v2:

 - Rebased on new CPU registers save/load framework.

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 qemu-kvm-x86.c |   43 +++
 1 file changed, 43 insertions(+)

--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -979,11 +979,33 @@ void kvm_arch_load_regs(CPUState *env, i
 set_msr_entry(msrs[n++], MSR_KVM_SYSTEM_TIME, env-system_time_msr);
 set_msr_entry(msrs[n++], MSR_KVM_WALL_CLOCK, env-wall_clock_msr);
 }
+#ifdef KVM_CAP_MCE
+if (env-mcg_cap  level == KVM_PUT_RESET_STATE) {
+/*
+ * MCG_STATUS should reset to 0 after reset, while other MCE
+ * registers should be unchanged
+ */
+set_msr_entry(msrs[n++], MSR_MCG_STATUS, 0);
+}
+#endif
 
 rc = kvm_set_msrs(env, msrs, n);
 if (rc == -1)
 perror(kvm_set_msrs FAILED);
 
+#ifdef KVM_CAP_MCE
+if (env-mcg_cap  level == KVM_PUT_FULL_STATE) {
+n = 0;
+set_msr_entry(msrs[n++], MSR_MCG_STATUS, env-mcg_status);
+set_msr_entry(msrs[n++], MSR_MCG_CTL, env-mcg_ctl);
+for (i = 0; i  (env-mcg_cap  0xff); i++)
+set_msr_entry(msrs[n++], MSR_MC0_CTL + i, env-mce_banks[i]);
+rc = kvm_set_msrs(env, msrs, n);
+if (rc == -1)
+perror(kvm_set_msrs FAILED);
+}
+#endif
+
 if (level = KVM_PUT_RESET_STATE) {
 kvm_arch_load_mpstate(env);
 kvm_load_lapic(env);
@@ -1155,6 +1177,27 @@ void kvm_arch_save_regs(CPUState *env)
 return;
 }
 }
+
+#ifdef KVM_CAP_MCE
+if (env-mcg_cap) {
+msrs[0].index = MSR_MCG_STATUS;
+msrs[1].index = MSR_MCG_CTL;
+n = (env-mcg_cap  0xff) * 4;
+for (i = 0; i  n; i++)
+msrs[2 + i].index = MSR_MC0_CTL + i;
+
+rc = kvm_get_msrs(env, msrs, n + 2);
+if (rc == -1)
+perror(kvm_get_msrs FAILED);
+else {
+env-mcg_status = msrs[0].data;
+env-mcg_ctl = msrs[1].data;
+for (i = 0; i  n; i++)
+env-mce_banks[i] = msrs[2 + i].data;
+}
+}
+#endif
+
 kvm_arch_save_mpstate(env);
 kvm_save_lapic(env);
 kvm_get_vcpu_events(env);




Add savevm/loadvm support for MCE

2010-02-25 Thread Huang Ying
MCE registers are saved/load into/from CPUState in
kvm_arch_save/load_regs. Because all MCE registers except for
MCG_STATUS should be preserved, MCE registers are saved before
kvm_arch_load_regs in kvm_arch_cpu_reset. To simulate the MCG_STATUS
clearing upon reset, env->mcg_status is set to 0 after saving.

Signed-off-by: Huang Ying ying.hu...@intel.com

---
 qemu-kvm-x86.c |   54 ++
 1 file changed, 54 insertions(+)

--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -803,6 +803,27 @@ static void get_seg(SegmentCache *lhs, c
| (rhs-avl * DESC_AVL_MASK);
 }
 
+static void kvm_load_mce_regs(CPUState *env)
+{
+#ifdef KVM_CAP_MCE
+struct kvm_msr_entry msrs[100];
+int rc, n, i;
+
+if (!env-mcg_cap)
+   return;
+
+n = 0;
+set_msr_entry(msrs[n++], MSR_MCG_STATUS, env-mcg_status);
+set_msr_entry(msrs[n++], MSR_MCG_CTL, env-mcg_ctl);
+for (i = 0; i  (env-mcg_cap  0xff) * 4; i++)
+set_msr_entry(msrs[n++], MSR_MC0_CTL + i, env-mce_banks[i]);
+
+rc = kvm_set_msrs(env, msrs, n);
+if (rc == -1)
+perror(kvm_set_msrs FAILED);
+#endif
+}
+
 void kvm_arch_load_regs(CPUState *env)
 {
 struct kvm_regs regs;
@@ -922,6 +943,8 @@ void kvm_arch_load_regs(CPUState *env)
 if (rc == -1)
 perror(kvm_set_msrs FAILED);
 
+kvm_load_mce_regs(env);
+
 /*
  * Kernels before 2.6.33 (which correlates with !kvm_has_vcpu_events())
  * overwrote flags.TF injected via SET_GUEST_DEBUG while updating GP regs.
@@ -991,6 +1014,33 @@ void kvm_arch_load_mpstate(CPUState *env
 #endif
 }
 
+static void kvm_save_mce_regs(CPUState *env)
+{
+#ifdef KVM_CAP_MCE
+struct kvm_msr_entry msrs[100];
+int rc, n, i;
+
+if (!env-mcg_cap)
+   return;
+
+msrs[0].index = MSR_MCG_STATUS;
+msrs[1].index = MSR_MCG_CTL;
+n = (env-mcg_cap  0xff) * 4;
+for (i = 0; i  n; i++)
+msrs[2 + i].index = MSR_MC0_CTL + i;
+
+rc = kvm_get_msrs(env, msrs, n + 2);
+if (rc == -1)
+perror(kvm_set_msrs FAILED);
+else {
+env-mcg_status = msrs[0].data;
+env-mcg_ctl = msrs[1].data;
+for (i = 0; i  n; i++)
+env-mce_banks[i] = msrs[2 + i].data;
+}
+#endif
+}
+
 void kvm_arch_save_regs(CPUState *env)
 {
 struct kvm_regs regs;
@@ -1148,6 +1198,7 @@ void kvm_arch_save_regs(CPUState *env)
 }
 }
 kvm_arch_save_mpstate(env);
+kvm_save_mce_regs(env);
 }
 
 static void do_cpuid_ent(struct kvm_cpuid_entry2 *e, uint32_t function,
@@ -1385,6 +1436,9 @@ void kvm_arch_push_nmi(void *opaque)
 void kvm_arch_cpu_reset(CPUState *env)
 {
 kvm_arch_reset_vcpu(env);
+/* MCE registers except MCG_STATUS should be unchanged across reset */
+kvm_save_mce_regs(env);
+env-mcg_status = 0;
 kvm_arch_load_regs(env);
 kvm_put_vcpu_events(env);
 if (!cpu_is_bsp(env)) {




Re: [BUGFIX] MCE: Fix bug of IA32_MCG_STATUS after system reset

2010-01-06 Thread Huang Ying
On Wed, 2010-01-06 at 16:03 +0800, Avi Kivity wrote:
 On 01/06/2010 09:05 AM, Huang Ying wrote:
  @@ -1015,6 +1015,7 @@ void kvm_arch_load_regs(CPUState *env)
 #endif
 set_msr_entry(msrs[n++], MSR_KVM_SYSTEM_TIME,  
  env-system_time_msr);
 set_msr_entry(msrs[n++], MSR_KVM_WALL_CLOCK,  
  env-wall_clock_msr);
  +set_msr_entry(msrs[n++], MSR_MCG_STATUS, 0);
 
 
 
  Not sure why you reset this in kvm_arch_load_regs().  Shouldn't this be
  in the cpu reset code?
   
  I found kvm_arch_load_regs() is called by kvm_arch_cpu_reset(), which is
  called by qemu_kvm_system_reset(). It is not in cpu reset path?
 
 
 It is, but it is also called from many other places, which could cause 
 this msr to be zeroed.
 
 A better solution is to allocate it a field in CPUState, load and save 
 it in kvm_arch_*_regs, and zero it during reset.

Yes. You are right. I will fix this.

Best Regards,
Huang Ying




Re: [BUGFIX] MCE: Fix bug of IA32_MCG_STATUS after system reset

2010-01-06 Thread Huang Ying
On Tue, 2010-01-05 at 18:44 +0800, Andi Kleen wrote:
  --- a/qemu-kvm-x86.c
  +++ b/qemu-kvm-x86.c
  @@ -1015,6 +1015,7 @@ void kvm_arch_load_regs(CPUState *env)
   #endif
   set_msr_entry(msrs[n++], MSR_KVM_SYSTEM_TIME,  env-system_time_msr);
   set_msr_entry(msrs[n++], MSR_KVM_WALL_CLOCK,  env-wall_clock_msr);
  +set_msr_entry(msrs[n++], MSR_MCG_STATUS, 0);
 
 Still need to keep EIPV and possibly RIPV, don't we?

It appears that EIPV and RIPV are meaningless outside of the MCE exception
handler. But I will try to check the real hardware behavior.
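
For reference, the MCG_STATUS bits being discussed (as defined in the
kernel/qemu headers this series relies on, reproduced here only to make
the discussion easier to follow):

#define MCG_STATUS_RIPV  (1ULL << 0)   /* restart IP valid */
#define MCG_STATUS_EIPV  (1ULL << 1)   /* error IP valid */
#define MCG_STATUS_MCIP  (1ULL << 2)   /* machine check in progress */

Only MCIP is what makes the second injection reset the machine instead of
panicking; whether RIPV/EIPV also need special handling across reset is
exactly what remains to be checked against real hardware.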

Best Regards,
Huang Ying




[BUGFIX] MCE: Fix bug of IA32_MCG_STATUS after system reset

2010-01-05 Thread Huang Ying
Now, if we inject a fatal MCE into the guest OS, for example Linux, the
guest will panic and then reboot. But if we inject another MCE after that,
the system resets directly instead of panicking first, because
MCG_STATUS.MCIP is set to 1 and not cleared after the reboot. This does
not follow the behavior of real hardware.

This patch fixes this via set IA32_MCG_STATUS to 0 during system reset.

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 qemu-kvm-x86.c |1 +
 1 file changed, 1 insertion(+)

--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -1015,6 +1015,7 @@ void kvm_arch_load_regs(CPUState *env)
 #endif
 set_msr_entry(msrs[n++], MSR_KVM_SYSTEM_TIME,  env-system_time_msr);
 set_msr_entry(msrs[n++], MSR_KVM_WALL_CLOCK,  env-wall_clock_msr);
+set_msr_entry(msrs[n++], MSR_MCG_STATUS, 0);
 
 rc = kvm_set_msrs(env, msrs, n);
 if (rc == -1)




Re: [BUGFIX] MCE: Fix bug of IA32_MCG_STATUS after system reset

2010-01-05 Thread Huang Ying
Hi, Avi,

On Tue, 2010-01-05 at 18:50 +0800, Avi Kivity wrote:
 On 01/05/2010 10:34 AM, Huang Ying wrote:
  Now, if we inject a fatal MCE into guest OS, for example Linux, Linux
  will go panic and then reboot. But if we inject another MCE now,
  system will reset directly instead of go panic firstly, because
  MCG_STATUS.MCIP is set to 1 and not cleared after reboot. This is does
  not follow the behavior in real hardware.
 
  This patch fixes this via set IA32_MCG_STATUS to 0 during system reset.
 
  Signed-off-by: Huang Yingying.hu...@intel.com
  ---
qemu-kvm-x86.c |1 +
1 file changed, 1 insertion(+)
 
  --- a/qemu-kvm-x86.c
  +++ b/qemu-kvm-x86.c
  @@ -1015,6 +1015,7 @@ void kvm_arch_load_regs(CPUState *env)
#endif
set_msr_entry(msrs[n++], MSR_KVM_SYSTEM_TIME,  env-system_time_msr);
set_msr_entry(msrs[n++], MSR_KVM_WALL_CLOCK,  env-wall_clock_msr);
  +set_msr_entry(msrs[n++], MSR_MCG_STATUS, 0);
 
 
 
 Not sure why you reset this in kvm_arch_load_regs().  Shouldn't this be 
 in the cpu reset code?

I found kvm_arch_load_regs() is called by kvm_arch_cpu_reset(), which is
called by qemu_kvm_system_reset(). It is not in cpu reset path?

Best Regards,
Huang Ying




[PATCH] Fix parameters of prctl

2009-12-15 Thread Huang Ying
Because the kernel prctl implementation checks that arg4 and arg5 are 0
for PR_MCE_KILL, qemu-kvm should invoke the prctl syscall accordingly.

Reported-by: Max Asbock masb...@linux.vnet.ibm.com
Signed-off-by: Huang Ying ying.hu...@intel.com

---
 qemu-kvm.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -1984,7 +1984,7 @@ int kvm_init_ap(void)
 action.sa_flags = SA_SIGINFO;
 action.sa_sigaction = (void (*)(int, siginfo_t*, void*))sigbus_handler;
 sigaction(SIGBUS, &action, NULL);
-prctl(PR_MCE_KILL, 1, 1);
+prctl(PR_MCE_KILL, 1, 1, 0, 0);
 return 0;
 }
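
For illustration, the corrected call wrapped with error checking; this is
only a sketch, and enable_early_mce_kill() is a made-up name, not part of
qemu-kvm:

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_MCE_KILL
#define PR_MCE_KILL 33
#endif

/* Ask the kernel to deliver SIGBUS early for poisoned memory touched by
 * this process.  The kernel rejects the call with EINVAL unless arg4 and
 * arg5 are 0, which is the bug this patch fixes. */
static int enable_early_mce_kill(void)
{
    if (prctl(PR_MCE_KILL, 1, 1, 0, 0) < 0) {
        perror("prctl(PR_MCE_KILL)");
        return -1;
    }
    return 0;
}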
 




Re: [PATCH -v4] QEMU-KVM: MCE: Relay UCR MCE to guest

2009-09-21 Thread Huang Ying
On Mon, 2009-09-21 at 18:08 +0800, Avi Kivity wrote: 
 On 09/21/2009 05:43 AM, Huang Ying wrote:
  UCR (uncorrected recovery) MCE is supported in recent Intel CPUs,
  where some hardware error such as some memory error can be reported
  without PCC (processor context corrupted). To recover from such MCE,
  the corresponding memory will be unmapped, and all processes accessing
  the memory will be killed via SIGBUS.
 
  For KVM, if QEMU/KVM is killed, all guest processes will be killed
  too. So we relay SIGBUS from host OS to guest system via a UCR MCE
  injection. Then guest OS can isolate corresponding memory and kill
  necessary guest processes only. SIGBUS sent to main thread (not VCPU
  threads) will be broadcast to all VCPU threads as UCR MCE.
 
 
 
  --- a/qemu-kvm.c
  +++ b/qemu-kvm.c
  @@ -27,10 +27,23 @@
#includesys/mman.h
#includesys/ioctl.h
#includesignal.h
  +#includesys/signalfd.h
 
 
 This causes a build failure, since not all hosts have sys/signalfd.h, 
 but more importantly:

Maybe we can just add the necessary fields to struct qemu_signalfd_siginfo.
But that may not be portable.

  +
  +static void sigbus_handler(int n, struct signalfd_siginfo *siginfo, void 
  *ctx)
  +{
 
 
 Here you accept signalfd_siginfo, while
 
  +
  +memset(action, 0, sizeof(action));
  +action.sa_flags = SA_SIGINFO;
  +action.sa_sigaction = (void (*)(int, siginfo_t*, void*))sigbus_handler;
  +sigaction(SIGBUS,action, NULL);
  +prctl(PR_MCE_KILL, 1, 1);
return 0;
 
 
 here you arm the function with something that will send it a siginfo_t.  
 So it looks like this is broken if a signal is ever received directly?  
 But can this happen due to signalfd?

Because SIGBUS is blocked, I think the signal handler will not be called
directly, but from sigfd_handler.
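
Roughly, the arrangement looks like the following sketch (illustrative
only; the real code uses qemu-kvm's qemu_signalfd()/sigfd_handler()
helpers, and the function name here is made up):

#include <pthread.h>
#include <signal.h>
#include <sys/signalfd.h>

/* Block SIGBUS so it is never delivered asynchronously to a random
 * thread; instead it is read back as a struct signalfd_siginfo from the
 * returned fd and dispatched by the I/O loop. */
static int setup_sigbus_fd(void)
{
    sigset_t set;

    sigemptyset(&set);
    sigaddset(&set, SIGBUS);
    pthread_sigmask(SIG_BLOCK, &set, NULL);

    return signalfd(-1, &set, 0);
}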

}
 
  @@ -1962,7 +2116,10 @@ static void sigfd_handler(void *opaque)
}
 
sigaction(info.ssi_signo, NULL,action);
  -if (action.sa_handler)
  +if ((action.sa_flags  SA_SIGINFO)  action.sa_sigaction)
  +action.sa_sigaction(info.ssi_signo,
  +(siginfo_t *)info, NULL);
  +   else if (action.sa_handler)
action.sa_handler(info.ssi_signo);
 
 
 The whole extract handler from sigaction and call it was a hack.

The hack above (signalfd_siginfo vs siginfo_t) exists to extract the handler
from sigaction and call it, too. So I suggest replacing it by calling the
handler directly.

Best Regards,
Huang Ying




Re: [PATCH -v3] QEMU-KVM: MCE: Relay UCR MCE to guest

2009-09-20 Thread Huang Ying
On Sat, 2009-09-19 at 01:05 +0800, Marcelo Tosatti wrote: 
 On Fri, Sep 18, 2009 at 03:58:59PM +0800, Huang Ying wrote:
  UCR (uncorrected recovery) MCE is supported in recent Intel CPUs,
  where some hardware error such as some memory error can be reported
  without PCC (processor context corrupted). To recover from such MCE,
  the corresponding memory will be unmapped, and all processes accessing
  the memory will be killed via SIGBUS.
 
  For KVM, if QEMU/KVM is killed, all guest processes will be killed
  too. So we relay SIGBUS from host OS to guest system via a UCR MCE
  injection. Then guest OS can isolate corresponding memory and kill
  necessary guest processes only. SIGBUS sent to main thread (not VCPU
  threads) will be broadcast to all VCPU threads as UCR MCE.
 
  v3:
 
  - Re-raise SIGBUS for SIGBUS not from MCE
  - Kill itself for error in kvm_inject_x86_mce
 
 This is broken, non-MCE SIGBUS causes qemu-kvm to call the sigbus
 handler in a loop.

Sorry. I tested the wrong branch in previous development. I will fix
this.

 BTW, how are you testing this and what guests have been tested?

I use a self-made kernel module to simulate the SIGBUS that the MCE handler
would send to qemu. Until now, only Linux guests have been tested.
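
For reference, the core of such a test module boils down to sending a
BUS_MCEERR_AO SIGBUS with a fault address into the qemu process. A rough
sketch against the 2.6.3x-era kernel APIs used in this thread; the pid/addr
parameters and all names are made up for illustration, this is not the
module actually used:

#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/sched.h>
#include <linux/signal.h>
#include <linux/string.h>
#include <linux/pid.h>
#include <linux/rcupdate.h>
#include <linux/errno.h>
#include <asm/page.h>

static int pid;
static unsigned long addr;
module_param(pid, int, 0);
module_param(addr, ulong, 0);

/* Deliver an "action optional" hwpoison SIGBUS to the given task, the way
 * the host MCE handler would for a patrol-scrub (SRAO) error. */
static int __init fake_ao_sigbus_init(void)
{
    struct siginfo info;
    struct task_struct *task;
    int ret = -ESRCH;

    memset(&info, 0, sizeof(info));
    info.si_signo = SIGBUS;
    info.si_code = BUS_MCEERR_AO;
    info.si_addr = (void __user *)addr;
    info.si_addr_lsb = PAGE_SHIFT;

    rcu_read_lock();
    task = pid_task(find_vpid(pid), PIDTYPE_PID);
    if (task)
        ret = send_sig_info(SIGBUS, &info, task);
    rcu_read_unlock();

    return ret;
}

static void __exit fake_ao_sigbus_exit(void)
{
}

module_init(fake_ao_sigbus_init);
module_exit(fake_ao_sigbus_exit);
MODULE_LICENSE("GPL");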

Best Regards,
Huang Ying



[PATCH -v4] QEMU-KVM: MCE: Relay UCR MCE to guest

2009-09-20 Thread Huang Ying
UCR (uncorrected recovery) MCE is supported in recent Intel CPUs,
where some hardware error such as some memory error can be reported
without PCC (processor context corrupted). To recover from such MCE,
the corresponding memory will be unmapped, and all processes accessing
the memory will be killed via SIGBUS.

For KVM, if QEMU/KVM is killed, all guest processes will be killed
too. So we relay SIGBUS from host OS to guest system via a UCR MCE
injection. Then guest OS can isolate corresponding memory and kill
necessary guest processes only. SIGBUS sent to main thread (not VCPU
threads) will be broadcast to all VCPU threads as UCR MCE.

v4:

- Fix SIGBUS re-raise implementation.

v3:

- Re-raise SIGBUS for SIGBUS not from MCE
- Kill itself for error in kvm_inject_x86_mce

v2:

- Use qemu_ram_addr_from_host instead of a self-made one to convert from
  host address to guest RAM address. Thanks Anthony Liguori.

Signed-off-by: Huang Ying ying.hu...@intel.com

---
 cpu-common.h |1 
 exec.c   |   20 +++--
 qemu-kvm.c   |  195 +++
 qemu-kvm.h   |9 +-
 target-i386/cpu.h|   20 -
 target-i386/helper.c |2 
 6 files changed, 225 insertions(+), 22 deletions(-)

--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -27,10 +27,23 @@
 #include sys/mman.h
 #include sys/ioctl.h
 #include signal.h
+#include sys/signalfd.h
+#include sys/prctl.h
 
 #define false 0
 #define true 1
 
+#ifndef PR_MCE_KILL
+#define PR_MCE_KILL 33
+#endif
+
+#ifndef BUS_MCEERR_AR
+#define BUS_MCEERR_AR 4
+#endif
+#ifndef BUS_MCEERR_AO
+#define BUS_MCEERR_AO 5
+#endif
+
 #define EXPECTED_KVM_API_VERSION 12
 
 #if EXPECTED_KVM_API_VERSION != KVM_API_VERSION
@@ -1509,6 +1522,66 @@ static void sig_ipi_handler(int n)
 {
 }
 
+static void hardware_memory_error(void)
+{
+    fprintf(stderr, "Hardware memory error!\n");
+    exit(1);
+}
+
+static void sigbus_reraise(void)
+{
+    sigset_t set;
+    struct sigaction action;
+
+    memset(&action, 0, sizeof(action));
+    action.sa_handler = SIG_DFL;
+    if (!sigaction(SIGBUS, &action, NULL)) {
+        raise(SIGBUS);
+        sigemptyset(&set);
+        sigaddset(&set, SIGBUS);
+        sigprocmask(SIG_UNBLOCK, &set, NULL);
+    }
+    perror("Failed to re-raise SIGBUS!\n");
+    abort();
+}
+
+static void sigbus_handler(int n, struct signalfd_siginfo *siginfo, void *ctx)
+{
+#if defined(KVM_CAP_MCE) && defined(TARGET_I386)
+    if (first_cpu->mcg_cap && siginfo->ssi_addr
+        && siginfo->ssi_code == BUS_MCEERR_AO) {
+        uint64_t status;
+        unsigned long paddr;
+        CPUState *cenv;
+
+        /* Hope we are lucky for AO MCE */
+        if (do_qemu_ram_addr_from_host((void *)siginfo->ssi_addr, &paddr)) {
+            fprintf(stderr, "Hardware memory error for memory used by "
+                    "QEMU itself instead of guest system!: %llx\n",
+                    (unsigned long long)siginfo->ssi_addr);
+            return;
+        }
+        status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
+            | MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
+            | 0xc0;
+        kvm_inject_x86_mce(first_cpu, 9, status,
+                           MCG_STATUS_MCIP | MCG_STATUS_RIPV, paddr,
+                           (MCM_ADDR_PHYS << 6) | 0xc, 1);
+        for (cenv = first_cpu->next_cpu; cenv != NULL; cenv = cenv->next_cpu)
+            kvm_inject_x86_mce(cenv, 1, MCI_STATUS_VAL | MCI_STATUS_UC,
+                               MCG_STATUS_MCIP | MCG_STATUS_RIPV, 0, 0, 1);
+    } else
+#endif
+    {
+        if (siginfo->ssi_code == BUS_MCEERR_AO)
+            return;
+        else if (siginfo->ssi_code == BUS_MCEERR_AR)
+            hardware_memory_error();
+        else
+            sigbus_reraise();
+    }
+}
+
 static void on_vcpu(CPUState *env, void (*func)(void *data), void *data)
 {
 struct qemu_work_item wi;
@@ -1666,29 +1739,101 @@ static void flush_queued_work(CPUState *
 pthread_cond_broadcast(qemu_work_cond);
 }
 
+static void kvm_on_sigbus(CPUState *env, siginfo_t *siginfo)
+{
+#if defined(KVM_CAP_MCE)  defined(TARGET_I386)
+struct kvm_x86_mce mce = {
+.bank = 9,
+};
+unsigned long paddr;
+int r;
+
+if (env-mcg_cap  siginfo-si_addr
+ (siginfo-si_code == BUS_MCEERR_AR
+|| siginfo-si_code == BUS_MCEERR_AO)) {
+if (siginfo-si_code == BUS_MCEERR_AR) {
+/* Fake an Intel architectural Data Load SRAR UCR */
+mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
+| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
+| MCI_STATUS_AR | 0x134;
+mce.misc = (MCM_ADDR_PHYS  6) | 0xc;
+mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_EIPV;
+} else {
+/* Fake an Intel architectural Memory scrubbing UCR */
+mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
+| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S

[PATCH -v3] QEMU-KVM: MCE: Relay UCR MCE to guest

2009-09-18 Thread Huang Ying
UCR (uncorrected recovery) MCE is supported in recent Intel CPUs,
where some hardware error such as some memory error can be reported
without PCC (processor context corrupted). To recover from such MCE,
the corresponding memory will be unmapped, and all processes accessing
the memory will be killed via SIGBUS.

For KVM, if QEMU/KVM is killed, all guest processes will be killed
too. So we relay SIGBUS from host OS to guest system via a UCR MCE
injection. Then guest OS can isolate corresponding memory and kill
necessary guest processes only. SIGBUS sent to main thread (not VCPU
threads) will be broadcast to all VCPU threads as UCR MCE.

v3:

- Re-raise SIGBUS for SIGBUS not from MCE
- Kill itself for error in kvm_inject_x86_mce

v2:

- Use qemu_ram_addr_from_host instead of a self-made one to convert from
  host address to guest RAM address. Thanks Anthony Liguori.

Signed-off-by: Huang Ying ying.hu...@intel.com

---
 cpu-common.h |1 
 exec.c   |   20 -
 qemu-kvm.c   |  178 +++
 qemu-kvm.h   |9 ++
 target-i386/cpu.h|   20 +
 target-i386/helper.c |2 
 6 files changed, 208 insertions(+), 22 deletions(-)

--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -27,10 +27,23 @@
 #include sys/mman.h
 #include sys/ioctl.h
 #include signal.h
+#include sys/signalfd.h
+#include sys/prctl.h
 
 #define false 0
 #define true 1
 
+#ifndef PR_MCE_KILL
+#define PR_MCE_KILL 33
+#endif
+
+#ifndef BUS_MCEERR_AR
+#define BUS_MCEERR_AR 4
+#endif
+#ifndef BUS_MCEERR_AO
+#define BUS_MCEERR_AO 5
+#endif
+
 #define EXPECTED_KVM_API_VERSION 12
 
 #if EXPECTED_KVM_API_VERSION != KVM_API_VERSION
@@ -1509,6 +1522,49 @@ static void sig_ipi_handler(int n)
 {
 }
 
+static void hardware_memory_error(void)
+{
+fprintf(stderr, Hardware memory error!\n);
+exit(1);
+}
+
+static void sigbus_handler(int n, struct signalfd_siginfo *siginfo, void *ctx)
+{
+#if defined(KVM_CAP_MCE)  defined(TARGET_I386)
+if (first_cpu-mcg_cap  siginfo-ssi_addr
+ siginfo-ssi_code == BUS_MCEERR_AO) {
+uint64_t status;
+unsigned long paddr;
+CPUState *cenv;
+
+/* Hope we are lucky for AO MCE */
+if (do_qemu_ram_addr_from_host((void *)siginfo-ssi_addr, paddr)) {
+fprintf(stderr, Hardware memory error for memory used by 
+QEMU itself instead of guest system!: %llx\n,
+(unsigned long long)siginfo-ssi_addr);
+return;
+}
+status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
+| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
+| 0xc0;
+kvm_inject_x86_mce(first_cpu, 9, status,
+   MCG_STATUS_MCIP | MCG_STATUS_RIPV, paddr,
+   (MCM_ADDR_PHYS  6) | 0xc, 1);
+for (cenv = first_cpu-next_cpu; cenv != NULL; cenv = cenv-next_cpu)
+kvm_inject_x86_mce(cenv, 1, MCI_STATUS_VAL | MCI_STATUS_UC,
+   MCG_STATUS_MCIP | MCG_STATUS_RIPV, 0, 0, 1);
+} else
+#endif
+{
+if (siginfo-ssi_code == BUS_MCEERR_AO)
+return;
+else if (siginfo-ssi_code == BUS_MCEERR_AR)
+hardware_memory_error();
+else
+raise(SIGBUS);
+}
+}
+
 static void on_vcpu(CPUState *env, void (*func)(void *data), void *data)
 {
 struct qemu_work_item wi;
@@ -1666,29 +1722,101 @@ static void flush_queued_work(CPUState *
 pthread_cond_broadcast(qemu_work_cond);
 }
 
+static void kvm_on_sigbus(CPUState *env, siginfo_t *siginfo)
+{
+#if defined(KVM_CAP_MCE)  defined(TARGET_I386)
+struct kvm_x86_mce mce = {
+.bank = 9,
+};
+unsigned long paddr;
+int r;
+
+if (env-mcg_cap  siginfo-si_addr
+ (siginfo-si_code == BUS_MCEERR_AR
+|| siginfo-si_code == BUS_MCEERR_AO)) {
+if (siginfo-si_code == BUS_MCEERR_AR) {
+/* Fake an Intel architectural Data Load SRAR UCR */
+mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
+| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
+| MCI_STATUS_AR | 0x134;
+mce.misc = (MCM_ADDR_PHYS  6) | 0xc;
+mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_EIPV;
+} else {
+/* Fake an Intel architectural Memory scrubbing UCR */
+mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
+| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
+| 0xc0;
+mce.misc = (MCM_ADDR_PHYS  6) | 0xc;
+mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_RIPV;
+}
+if (do_qemu_ram_addr_from_host((void *)siginfo-si_addr, paddr)) {
+fprintf(stderr, Hardware memory error for memory used by 
+QEMU itself instaed of guest system!\n);
+/* Hope we are lucky for AO MCE */
+if (siginfo-si_code == BUS_MCEERR_AO

Re: [PATCH -v2] QEMU-KVM: MCE: Relay UCR MCE to guest

2009-09-17 Thread Huang Ying
On Fri, 2009-09-18 at 05:36 +0800, Marcelo Tosatti wrote: 
+} else if (siginfo-ssi_code == BUS_MCEERR_AR)
+fprintf(stderr, Hardware memory error!\n);
+else
+fprintf(stderr, Internal error in QEMU!\n);
   
   Can you re-raise SIGBUS so you we get a coredump on non-MCE SIGBUS as
   usual?
  
  We discuss this before. Copied below, please comment the comments
  below, :)
  
  Avi:
  (also, I if we can't handle guest-mode SIGBUS I think it would be nice 
  to raise it again so the process terminates due to the SIGBUS).
  
  Huang Ying:
  For SIGBUS we can not relay to guest as MCE, we can either abort or
  reset SIGBUS to SIGDFL and re-raise it. Both are OK for me. You prefer
  the latter one?
  
  Andi:
  I think a suitable error message and exit would be better than a plain 
  signal kill. It shouldn't look like qemu crashed due to a software
  bug. Ideally a error message in a way that it can be parsed by libvirt
  etc. and reported in a suitable way.
  
  However qemu getting killed itself is very unlikely, it doesn't
  have much memory foot print compared to the guest and other data. 
  So this should be a very rare condition.
  
  Avi:
  libvirt etc. can/should wait() for qemu to terminate abnormally and 
  report the reason why.  However it doesn't seem there is a way to get 
  extended signal information from wait(), so it looks like internal 
  handling by qemu is better.
 
 I'm not talking about SIGBUS generated by MCE.
 
 What i mean is, for SIGBUS signals that are not due to MCE errors, the
 current behaviour is to generate a core dump (which is useful
 information for debugging). 
 
 With your patch, qemu-kvm handles the signal, prints a message before
 exiting.
 
 This is annoying. It seems the discussion above is about SIGBUS
 initiated by MCE errors.

OK. I will re-raise for SIGBUS not initiated by MCE.

Best Regards,
Huang Ying



Re: [PATCH -v2] QEMU-KVM: MCE: Relay UCR MCE to guest

2009-09-16 Thread Huang Ying
On Thu, 2009-09-17 at 01:59 +0800, Marcelo Tosatti wrote: 
 On Wed, Sep 09, 2009 at 10:28:02AM +0800, Huang Ying wrote:
  UCR (uncorrected recovery) MCE is supported in recent Intel CPUs,
  where some hardware error such as some memory error can be reported
  without PCC (processor context corrupted). To recover from such MCE,
  the corresponding memory will be unmapped, and all processes accessing
  the memory will be killed via SIGBUS.
  
  For KVM, if QEMU/KVM is killed, all guest processes will be killed
  too. So we relay SIGBUS from host OS to guest system via a UCR MCE
  injection. Then guest OS can isolate corresponding memory and kill
  necessary guest processes only. SIGBUS sent to main thread (not VCPU
  threads) will be broadcast to all VCPU threads as UCR MCE.
  
  v2:
  
  - Use qemu_ram_addr_from_host instead of a self-made one to convert from
host address to guest RAM address. Thanks Anthony Liguori.
  
  Signed-off-by: Huang Ying ying.hu...@intel.com
  
  ---
   cpu-common.h  |1 
   exec.c|   20 +--
   qemu-kvm.c|  154 
  ++
   target-i386/cpu.h |   20 ++-
   4 files changed, 178 insertions(+), 17 deletions(-)
  
  --- a/qemu-kvm.c
  +++ b/qemu-kvm.c
  @@ -27,10 +27,23 @@
   #include sys/mman.h
   #include sys/ioctl.h
   #include signal.h
  +#include sys/signalfd.h
  +#include sys/prctl.h
   
   #define false 0
   #define true 1
   
  +#ifndef PR_MCE_KILL
  +#define PR_MCE_KILL 33
  +#endif
  +
  +#ifndef BUS_MCEERR_AR
  +#define BUS_MCEERR_AR 4
  +#endif
  +#ifndef BUS_MCEERR_AO
  +#define BUS_MCEERR_AO 5
  +#endif
  +
   #define EXPECTED_KVM_API_VERSION 12
   
   #if EXPECTED_KVM_API_VERSION != KVM_API_VERSION
  @@ -1507,6 +1520,37 @@ static void sig_ipi_handler(int n)
   {
   }
   
  +static void sigbus_handler(int n, struct signalfd_siginfo *siginfo, void 
  *ctx)
  +{
  +if (siginfo-ssi_code == BUS_MCEERR_AO) {
  +uint64_t status;
  +unsigned long paddr;
  +CPUState *cenv;
  +
  +/* Hope we are lucky for AO MCE */
  +if (do_qemu_ram_addr_from_host((void *)siginfo-ssi_addr, paddr)) 
  {
  +fprintf(stderr, Hardware memory error for memory used by 
  +QEMU itself instead of guest system!: %llx\n,
  +(unsigned long long)siginfo-ssi_addr);
  +return;
 
 qemu-kvm should die here?

There are two kinds of UCR MCE. One is triggered by a user space/guest
read or write; the other is triggered by an asynchronously detected error
(e.g. patrol scrubbing). The latter is reported as an AO (Action Optional)
MCE and has nothing to do with the current execution path. So if we are
lucky enough, we can survive. And when we finally touch the error memory
reported by the AO MCE, an AR (Action Required) MCE will be triggered,
giving us another chance to deal with it.

  +}
  +status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
  +| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
  +| 0xc0;
  +kvm_inject_x86_mce(first_cpu, 9, status,
  +   MCG_STATUS_MCIP | MCG_STATUS_RIPV, paddr,
  +   (MCM_ADDR_PHYS  6) | 0xc);
  +for (cenv = first_cpu-next_cpu; cenv != NULL; cenv = 
  cenv-next_cpu)
  +kvm_inject_x86_mce(cenv, 1, MCI_STATUS_VAL | MCI_STATUS_UC,
  +   MCG_STATUS_MCIP | MCG_STATUS_RIPV, 0, 0);
  +return;
 
 Should abort if kvm_inject_x86_mce fails?

kvm_inject_x86_mce will abort by itself.

  +} else if (siginfo-ssi_code == BUS_MCEERR_AR)
  +fprintf(stderr, Hardware memory error!\n);
  +else
  +fprintf(stderr, Internal error in QEMU!\n);
 
 Can you re-raise SIGBUS so you we get a coredump on non-MCE SIGBUS as
 usual?

We discussed this before. Copied below; please comment on the remarks
below. :)

Avi:
(also, I if we can't handle guest-mode SIGBUS I think it would be nice 
to raise it again so the process terminates due to the SIGBUS).

Huang Ying:
For SIGBUS we can not relay to guest as MCE, we can either abort or
reset SIGBUS to SIGDFL and re-raise it. Both are OK for me. You prefer
the latter one?

Andi:
I think a suitable error message and exit would be better than a plain 
signal kill. It shouldn't look like qemu crashed due to a software
bug. Ideally a error message in a way that it can be parsed by libvirt
etc. and reported in a suitable way.

However qemu getting killed itself is very unlikely, it doesn't
have much memory foot print compared to the guest and other data. 
So this should be a very rare condition.

Avi:
libvirt etc. can/should wait() for qemu to terminate abnormally and 
report the reason why.  However it doesn't seem there is a way to get 
extended signal information from wait(), so it looks like internal 
handling by qemu is better.

  +exit(1);
  +}
  +
   static void on_vcpu(CPUState *env, void (*func)(void *data), void *data

Re: [PATCH -v2] QEMU-KVM: MCE: Relay UCR MCE to guest

2009-09-15 Thread Huang Ying
On Mon, 2009-09-14 at 13:10 +0800, Avi Kivity wrote: 
 On 09/14/2009 05:55 AM, Huang Ying wrote:
  Hi, Avi,
 
  On Thu, 2009-09-10 at 17:35 +0800, Andi Kleen wrote:
 
  (also, I if we can't handle guest-mode SIGBUS I think it would be nice
  to raise it again so the process terminates due to the SIGBUS).
   
  For SIGBUS we can not relay to guest as MCE, we can either abort or
  reset SIGBUS to SIGDFL and re-raise it. Both are OK for me. You prefer
  the latter one?
 
  I think a suitable error message and exit would be better than a plain
  signal kill. It shouldn't look like qemu crashed due to a software
  bug. Ideally a error message in a way that it can be parsed by libvirt etc.
  and reported in a suitable way.
   
  Do you agree with us about SIGBUS processing?
 
 
 Yes.

Do you have some other issue for this patch?

Best Regards,
Huang Ying


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v2] QEMU-KVM: MCE: Relay UCR MCE to guest

2009-09-13 Thread Huang Ying
Hi, Avi,

On Thu, 2009-09-10 at 17:35 +0800, Andi Kleen wrote: 
   (also, I if we can't handle guest-mode SIGBUS I think it would be nice 
   to raise it again so the process terminates due to the SIGBUS).
  
  For SIGBUS we can not relay to guest as MCE, we can either abort or
  reset SIGBUS to SIGDFL and re-raise it. Both are OK for me. You prefer
  the latter one?
 
 I think a suitable error message and exit would be better than a plain 
 signal kill. It shouldn't look like qemu crashed due to a software
 bug. Ideally a error message in a way that it can be parsed by libvirt etc.
 and reported in a suitable way.

Do you agree with us about SIGBUS processing?

Best Regards,
Huang Ying



Re: [PATCH] QEMU-KVM: MCE: Relay UCR MCE to guest

2009-09-08 Thread Huang Ying
On Tue, 2009-09-08 at 14:44 +0800, Avi Kivity wrote: 
 On 09/07/2009 11:32 AM, Huang Ying wrote:
  UCR (uncorrected recovery) MCE is supported in recent Intel CPUs,
  where some hardware error such as some memory error can be reported
  without PCC (processor context corrupted). To recover from such MCE,
  the corresponding memory will be unmapped, and all processes accessing
  the memory will be killed via SIGBUS.
 
  For KVM, if QEMU/KVM is killed, all guest processes will be killed
  too. So we relay SIGBUS from host OS to guest system via a UCR MCE
  injection. Then guest OS can isolate corresponding memory and kill
  necessary guest processes only. SIGBUS sent to main thread (not VCPU
  threads) will be broadcast to all VCPU threads as UCR MCE.
 
 
 Won't the guest be confused by the broadcast?  How does real hardware work?

We do broadcasting to follow the hardware behavior.

  +static void sigbus_handler(int n, struct signalfd_siginfo *siginfo, void 
  *ctx)
  +{
  +if (siginfo-ssi_code == BUS_MCEERR_AO) {
  +uint64_t status;
  +unsigned long paddr;
  +CPUState *cenv;
  +
  +/* Hope we are lucky for AO MCE */
  +if (kvm_addr_userspace_to_phys((unsigned long)siginfo-ssi_addr,
  +paddr)) {
  +fprintf(stderr, Hardware memory error for memory used by 
  +QEMU itself instead of guest system!: %llx\n,
  +(unsigned long long)siginfo-ssi_addr);
  +return;
  +}
  +status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
  +| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
  +| 0xc0;
  +kvm_inject_x86_mce(first_cpu, 9, status,
  +   MCG_STATUS_MCIP | MCG_STATUS_RIPV, paddr,
  +   (MCM_ADDR_PHYS  6) | 0xc);
 
 
 This is a vcpu ioctl, yes?  if so it must be called from the vcpu thread.

No. kvm_inject_x86_mce will call on_vcpu to do the real vcpu ioctl.

Best Regards,
Huang Ying




Re: [PATCH] QEMU-KVM: MCE: Relay UCR MCE to guest

2009-09-08 Thread Huang Ying
On Tue, 2009-09-08 at 14:41 +0800, Avi Kivity wrote: 
 On 09/07/2009 11:48 PM, Anthony Liguori wrote:
 
   #ifdef KVM_CAP_IRQCHIP
 
   int kvm_set_irq_level(kvm_context_t kvm, int irq, int level, int 
  *status)
  @@ -1515,6 +1546,38 @@ static void sig_ipi_handler(int n)
   {
   }
 
  +static void sigbus_handler(int n, struct signalfd_siginfo *siginfo, 
  void *ctx)
  +{
  +if (siginfo-ssi_code == BUS_MCEERR_AO) {
  +uint64_t status;
  +unsigned long paddr;
  +CPUState *cenv;
  +
  +/* Hope we are lucky for AO MCE */
 
  Even if the error was limited to guest memory, it could have been 
  generated by either the kernel or userspace reading guest memory, no?
 
  Does this potentially open a security hole for us?  Consider the 
  following:
 
  1) We happen to read guest memory and that causes an MCE.  For 
  instance, say we're in virtio.c and we read the virtio ring.
  2) That should trigger the kernel to generate a sigbus.
  3) We catch sigbus, and queue an MCE for delivery.
  4) After sigbus handler completes, we're back in virtio.c, what was 
  the value of the memory operation we just completed?
 
  If the instruction gets skipped, we may be leaking host memory because 
  the access never happened.
 
 
 I think it's a lot safer to only report guest mode accesses to the 
 guest, and let user mode accesses terminate qemu.  The guest wouldn't 
 expect 100% recovery; for example if an uncorrectable error hit a vital 
 kernel data structure.

Yes, this is the current behavior. If the MCE is caused by a user-mode
access, QEMU/KVM will be killed by force_sig_info; only MCEs caused by
guest-mode accesses are captured by the SIGBUS signal handler.

Best Regards,
Huang Ying




[PATCH -v2] QEMU-KVM: MCE: Relay UCR MCE to guest

2009-09-08 Thread Huang Ying
UCR (uncorrected recovery) MCE is supported in recent Intel CPUs,
where some hardware error such as some memory error can be reported
without PCC (processor context corrupted). To recover from such MCE,
the corresponding memory will be unmapped, and all processes accessing
the memory will be killed via SIGBUS.

For KVM, if QEMU/KVM is killed, all guest processes will be killed
too. So we relay SIGBUS from host OS to guest system via a UCR MCE
injection. Then guest OS can isolate corresponding memory and kill
necessary guest processes only. SIGBUS sent to main thread (not VCPU
threads) will be broadcast to all VCPU threads as UCR MCE.

v2:

- Use qemu_ram_addr_from_host instead of a self-made one to convert from
  host address to guest RAM address. Thanks Anthony Liguori.

Signed-off-by: Huang Ying ying.hu...@intel.com

---
 cpu-common.h  |1 
 exec.c|   20 +--
 qemu-kvm.c|  154 ++
 target-i386/cpu.h |   20 ++-
 4 files changed, 178 insertions(+), 17 deletions(-)

--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -27,10 +27,23 @@
 #include sys/mman.h
 #include sys/ioctl.h
 #include signal.h
+#include sys/signalfd.h
+#include sys/prctl.h
 
 #define false 0
 #define true 1
 
+#ifndef PR_MCE_KILL
+#define PR_MCE_KILL 33
+#endif
+
+#ifndef BUS_MCEERR_AR
+#define BUS_MCEERR_AR 4
+#endif
+#ifndef BUS_MCEERR_AO
+#define BUS_MCEERR_AO 5
+#endif
+
 #define EXPECTED_KVM_API_VERSION 12
 
 #if EXPECTED_KVM_API_VERSION != KVM_API_VERSION
@@ -1507,6 +1520,37 @@ static void sig_ipi_handler(int n)
 {
 }
 
+static void sigbus_handler(int n, struct signalfd_siginfo *siginfo, void *ctx)
+{
+if (siginfo-ssi_code == BUS_MCEERR_AO) {
+uint64_t status;
+unsigned long paddr;
+CPUState *cenv;
+
+/* Hope we are lucky for AO MCE */
+if (do_qemu_ram_addr_from_host((void *)siginfo-ssi_addr, paddr)) {
+fprintf(stderr, Hardware memory error for memory used by 
+QEMU itself instead of guest system!: %llx\n,
+(unsigned long long)siginfo-ssi_addr);
+return;
+}
+status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
+| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
+| 0xc0;
+kvm_inject_x86_mce(first_cpu, 9, status,
+   MCG_STATUS_MCIP | MCG_STATUS_RIPV, paddr,
+   (MCM_ADDR_PHYS  6) | 0xc);
+for (cenv = first_cpu-next_cpu; cenv != NULL; cenv = cenv-next_cpu)
+kvm_inject_x86_mce(cenv, 1, MCI_STATUS_VAL | MCI_STATUS_UC,
+   MCG_STATUS_MCIP | MCG_STATUS_RIPV, 0, 0);
+return;
+} else if (siginfo-ssi_code == BUS_MCEERR_AR)
+fprintf(stderr, Hardware memory error!\n);
+else
+fprintf(stderr, Internal error in QEMU!\n);
+exit(1);
+}
+
 static void on_vcpu(CPUState *env, void (*func)(void *data), void *data)
 {
 struct qemu_work_item wi;
@@ -1649,29 +1693,102 @@ static void flush_queued_work(CPUState *
 pthread_cond_broadcast(qemu_work_cond);
 }
 
+static void kvm_on_sigbus(CPUState *env, siginfo_t *siginfo)
+{
+#if defined(KVM_CAP_MCE)  defined(TARGET_I386)
+struct kvm_x86_mce mce = {
+.bank = 9,
+};
+unsigned long paddr;
+int r;
+
+if (env-mcg_cap  siginfo-si_addr
+ (siginfo-si_code == BUS_MCEERR_AR
+|| siginfo-si_code == BUS_MCEERR_AO)) {
+if (siginfo-si_code == BUS_MCEERR_AR) {
+/* Fake an Intel architectural Data Load SRAR UCR */
+mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
+| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
+| MCI_STATUS_AR | 0x134;
+mce.misc = (MCM_ADDR_PHYS  6) | 0xc;
+mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_EIPV;
+} else {
+/* Fake an Intel architectural Memory scrubbing UCR */
+mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
+| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
+| 0xc0;
+mce.misc = (MCM_ADDR_PHYS  6) | 0xc;
+mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_RIPV;
+}
+if (do_qemu_ram_addr_from_host((void *)siginfo-si_addr, paddr)) {
+fprintf(stderr, Hardware memory error for memory used by 
+QEMU itself instaed of guest system!\n);
+/* Hope we are lucky for AO MCE */
+if (siginfo-si_code == BUS_MCEERR_AO)
+return;
+else
+exit(1);
+}
+mce.addr = paddr;
+r = kvm_set_mce(env-kvm_cpu_state.vcpu_ctx, mce);
+if (r  0) {
+fprintf(stderr, kvm_set_mce: %s\n, strerror(errno));
+exit(1);
+}
+} else
+#endif
+{
+if (siginfo-si_code == BUS_MCEERR_AO)
+return;
+if (siginfo-si_code

[PATCH] QEMU-KVM: MCE: Relay UCR MCE to guest

2009-09-07 Thread Huang Ying
UCR (uncorrected recovery) MCE is supported in recent Intel CPUs,
where some hardware error such as some memory error can be reported
without PCC (processor context corrupted). To recover from such MCE,
the corresponding memory will be unmapped, and all processes accessing
the memory will be killed via SIGBUS.

For KVM, if QEMU/KVM is killed, all guest processes will be killed
too. So we relay SIGBUS from host OS to guest system via a UCR MCE
injection. Then guest OS can isolate corresponding memory and kill
necessary guest processes only. SIGBUS sent to main thread (not VCPU
threads) will be broadcast to all VCPU threads as UCR MCE.

Signed-off-by: Huang Ying ying.hu...@intel.com

---
 qemu-kvm.c|  173 ++
 target-i386/cpu.h |   20 +-
 2 files changed, 181 insertions(+), 12 deletions(-)

--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -27,10 +27,23 @@
 #include sys/mman.h
 #include sys/ioctl.h
 #include signal.h
+#include sys/signalfd.h
+#include sys/prctl.h
 
 #define false 0
 #define true 1
 
+#ifndef PR_MCE_KILL
+#define PR_MCE_KILL 33
+#endif
+
+#ifndef BUS_MCEERR_AR
+#define BUS_MCEERR_AR 4
+#endif
+#ifndef BUS_MCEERR_AO
+#define BUS_MCEERR_AO 5
+#endif
+
 #define EXPECTED_KVM_API_VERSION 12
 
 #if EXPECTED_KVM_API_VERSION != KVM_API_VERSION
@@ -702,6 +715,24 @@ int kvm_get_dirty_pages_range(kvm_contex
 return 0;
 }
 
+static int kvm_addr_userspace_to_phys(unsigned long userspace_addr,
+   unsigned long *phys_addr)
+{
+   int i;
+   struct slot_info *slot;
+
+   for (i = 0; i  KVM_MAX_NUM_MEM_REGIONS; ++i) {
+   slot = slots[i];
+   if (slot-len  slot-userspace_addr = userspace_addr 
+   (slot-userspace_addr + slot-len)  userspace_addr) {
+   *phys_addr = userspace_addr - slot-userspace_addr +
+   slot-phys_addr;
+   return 0;
+   }
+   }
+   return -1;
+}
+
 #ifdef KVM_CAP_IRQCHIP
 
 int kvm_set_irq_level(kvm_context_t kvm, int irq, int level, int *status)
@@ -1515,6 +1546,38 @@ static void sig_ipi_handler(int n)
 {
 }
 
+static void sigbus_handler(int n, struct signalfd_siginfo *siginfo, void *ctx)
+{
+if (siginfo-ssi_code == BUS_MCEERR_AO) {
+uint64_t status;
+unsigned long paddr;
+CPUState *cenv;
+
+/* Hope we are lucky for AO MCE */
+if (kvm_addr_userspace_to_phys((unsigned long)siginfo-ssi_addr,
+   paddr)) {
+fprintf(stderr, Hardware memory error for memory used by 
+QEMU itself instead of guest system!: %llx\n,
+(unsigned long long)siginfo-ssi_addr);
+return;
+}
+status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
+| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
+| 0xc0;
+kvm_inject_x86_mce(first_cpu, 9, status,
+   MCG_STATUS_MCIP | MCG_STATUS_RIPV, paddr,
+   (MCM_ADDR_PHYS  6) | 0xc);
+for (cenv = first_cpu-next_cpu; cenv != NULL; cenv = cenv-next_cpu)
+kvm_inject_x86_mce(cenv, 1, MCI_STATUS_VAL | MCI_STATUS_UC,
+   MCG_STATUS_MCIP | MCG_STATUS_RIPV, 0, 0);
+return;
+} else if (siginfo-ssi_code == BUS_MCEERR_AR)
+fprintf(stderr, Hardware memory error!\n);
+else
+fprintf(stderr, Internal error in QEMU!\n);
+exit(1);
+}
+
 static void on_vcpu(CPUState *env, void (*func)(void *data), void *data)
 {
 struct qemu_work_item wi;
@@ -1657,29 +1720,102 @@ static void flush_queued_work(CPUState *
 pthread_cond_broadcast(qemu_work_cond);
 }
 
+static void kvm_on_sigbus(CPUState *env, siginfo_t *siginfo)
+{
+#if defined(KVM_CAP_MCE)  defined(TARGET_I386)
+struct kvm_x86_mce mce = {
+.bank = 9,
+};
+unsigned long paddr;
+int r;
+
+if (env-mcg_cap  siginfo-si_addr
+ (siginfo-si_code == BUS_MCEERR_AR
+|| siginfo-si_code == BUS_MCEERR_AO)) {
+if (siginfo-si_code == BUS_MCEERR_AR) {
+mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
+| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
+| MCI_STATUS_AR;
+mce.misc = (MCM_ADDR_PHYS  6) | 0xc;
+mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_EIPV;
+} else {
+/* Fake an Intel architectural Memory scrubbing UCR */
+mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
+| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
+| 0xc0;
+mce.misc = (MCM_ADDR_PHYS  6) | 0xc;
+mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_RIPV;
+}
+if (kvm_addr_userspace_to_phys((unsigned long)siginfo-si_addr,
+   paddr)) {
+fprintf(stderr

Re: [PATCH] QEMU-KVM: MCE: Relay UCR MCE to guest

2009-09-07 Thread Huang Ying
On Tue, 2009-09-08 at 04:48 +0800, Anthony Liguori wrote: 
 Hi Huang,
 
 Huang Ying wrote:
  UCR (uncorrected recovery) MCE is supported in recent Intel CPUs,
  where some hardware error such as some memory error can be reported
  without PCC (processor context corrupted). To recover from such MCE,
  the corresponding memory will be unmapped, and all processes accessing
  the memory will be killed via SIGBUS.
 
  For KVM, if QEMU/KVM is killed, all guest processes will be killed
  too. So we relay SIGBUS from host OS to guest system via a UCR MCE
  injection. Then guest OS can isolate corresponding memory and kill
  necessary guest processes only. SIGBUS sent to main thread (not VCPU
  threads) will be broadcast to all VCPU threads as UCR MCE.
 
  Signed-off-by: Huang Ying ying.hu...@intel.com
 
  ---
   qemu-kvm.c|  173 
  ++
   target-i386/cpu.h |   20 +-
   2 files changed, 181 insertions(+), 12 deletions(-)
 
  --- a/qemu-kvm.c
  +++ b/qemu-kvm.c
  @@ -27,10 +27,23 @@
   #include sys/mman.h
   #include sys/ioctl.h
   #include signal.h
  +#include sys/signalfd.h
  +#include sys/prctl.h
 
   #define false 0
   #define true 1
 
  +#ifndef PR_MCE_KILL
  +#define PR_MCE_KILL 33
  +#endif
  +
  +#ifndef BUS_MCEERR_AR
  +#define BUS_MCEERR_AR 4
  +#endif
  +#ifndef BUS_MCEERR_AO
  +#define BUS_MCEERR_AO 5
  +#endif
  +
   #define EXPECTED_KVM_API_VERSION 12
 
   #if EXPECTED_KVM_API_VERSION != KVM_API_VERSION
  @@ -702,6 +715,24 @@ int kvm_get_dirty_pages_range(kvm_contex
   return 0;
   }
 
  +static int kvm_addr_userspace_to_phys(unsigned long userspace_addr,
  +   unsigned long *phys_addr)
  +{
  +   int i;
  +   struct slot_info *slot;
  +
  +   for (i = 0; i  KVM_MAX_NUM_MEM_REGIONS; ++i) {
  +   slot = slots[i];
  +   if (slot-len  slot-userspace_addr = userspace_addr 
  +   (slot-userspace_addr + slot-len)  userspace_addr) {
  +   *phys_addr = userspace_addr - slot-userspace_addr +
  +   slot-phys_addr;
  +   return 0;
  +   }
  +   }
  +   return -1;
  +}
  +

 
 The slot mapping is actually a copy of the qemu's ram_blocks structure 
 (see exec.c).  If you base your check on that, it will Just Work for 
 QEMU too.

I find there is already a function named qemu_ram_addr_from_host which
translates a user-space virtual address into a qemu RAM address. But I
need a function that returns an error code instead of aborting when no
RAM address corresponds to the given user-space virtual address. So I
plan to use the following code to deal with that.

int do_qemu_ram_addr_from_host(void *ptr, ram_addr_t *ram_addr);
ram_addr_t qemu_ram_addr_from_host(void *ptr);

Does this follow the coding style of qemu?
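
Concretely, a sketch of the split I have in mind, placed inside exec.c
next to the existing helper; the RAMBlock fields and the ram_blocks list
mirror the current exec.c and are assumptions here:

int do_qemu_ram_addr_from_host(void *ptr, ram_addr_t *ram_addr)
{
    RAMBlock *block;
    uint8_t *host = ptr;

    /* Walk the registered RAM blocks; return an error instead of
     * aborting when ptr does not point into guest RAM. */
    for (block = ram_blocks; block != NULL; block = block->next) {
        if (host >= block->host && host < block->host + block->length) {
            *ram_addr = block->offset + (host - block->host);
            return 0;
        }
    }
    return -1;
}

ram_addr_t qemu_ram_addr_from_host(void *ptr)
{
    ram_addr_t ram_addr;

    if (do_qemu_ram_addr_from_host(ptr, &ram_addr)) {
        fprintf(stderr, "Bad ram pointer %p\n", ptr);
        abort();
    }
    return ram_addr;
}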

   #ifdef KVM_CAP_IRQCHIP
 
   int kvm_set_irq_level(kvm_context_t kvm, int irq, int level, int *status)
  @@ -1515,6 +1546,38 @@ static void sig_ipi_handler(int n)
   {
   }
 
  +static void sigbus_handler(int n, struct signalfd_siginfo *siginfo, void 
  *ctx)
  +{
  +if (siginfo-ssi_code == BUS_MCEERR_AO) {
  +uint64_t status;
  +unsigned long paddr;
  +CPUState *cenv;
  +
  +/* Hope we are lucky for AO MCE */

 
 Even if the error was limited to guest memory, it could have been 
 generated by either the kernel or userspace reading guest memory, no?
 
 Does this potentially open a security hole for us?  Consider the following:
 
 1) We happen to read guest memory and that causes an MCE.  For instance, 
 say we're in virtio.c and we read the virtio ring.
 2) That should trigger the kernel to generate a sigbus.
 3) We catch sigbus, and queue an MCE for delivery.
 4) After sigbus handler completes, we're back in virtio.c, what was the 
 value of the memory operation we just completed?
 
 If the instruction gets skipped, we may be leaking host memory because 
 the access never happened.

There are two kinds of recoverable MCE, named SRAO (Software Recoverable
Action Optional) and SRAR (Software Recoverable Action Required). Your
example is an SRAR error: the kernel will munmap the error page and send
SIGBUS to qemu via force_sig_info, which unblocks SIGBUS and resets its
action to SIG_DFL, so qemu will be terminated.

If guest mode is interrupted instead, then because of the signal mask
handling in the KVM kernel part, the SIGBUS can be captured by qemu.

For more details of recoverable MCE (SRAO and SRAR), you can refer to
latest Intel software developer's manual.

Best Regards,
Huang Ying




[PATCH -v6] QEMU-KVM: MCE: Add MCE simulation support to qemu/kvm

2009-07-15 Thread Huang Ying
KVM ioctls are used to initialize MCE simulation and inject MCE. The
real MCE simulation is implemented in Linux kernel. The Kernel part
has been merged.


ChangeLog:

v6:

- Re-based on latest qemu-kvm.git

v5:

- Re-based on latest qemu-kvm.git

v3:

- Re-based on qemu/tcg MCE support patch

v2:

- Use new kernel MCE capability exportion interface.


Signed-off-by: Huang Ying ying.hu...@intel.com

---
 kvm/include/linux/kvm.h   |   20 
 kvm/include/x86/asm/kvm.h |1 
 libkvm-all.h  |4 +++
 qemu-kvm-x86.c|   55 ++
 qemu-kvm.c|   40 +
 qemu-kvm.h|3 ++
 target-i386/helper.c  |5 
 target-i386/machine.c |4 +--
 8 files changed, 130 insertions(+), 2 deletions(-)

--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -444,6 +444,39 @@ int kvm_set_msrs(kvm_vcpu_context_t vcpu
 return r;
 }
 
+int kvm_get_mce_cap_supported(kvm_context_t kvm, uint64_t *mce_cap,
+  int *max_banks)
+{
+#ifdef KVM_CAP_MCE
+int r;
+
+r = ioctl(kvm-fd, KVM_CHECK_EXTENSION, KVM_CAP_MCE);
+if (r  0) {
+*max_banks = r;
+return ioctl(kvm-fd, KVM_X86_GET_MCE_CAP_SUPPORTED, mce_cap);
+}
+#endif
+return -ENOSYS;
+}
+
+int kvm_setup_mce(kvm_vcpu_context_t vcpu, uint64_t *mcg_cap)
+{
+#ifdef KVM_CAP_MCE
+return ioctl(vcpu-fd, KVM_X86_SETUP_MCE, mcg_cap);
+#else
+return -ENOSYS;
+#endif
+}
+
+int kvm_set_mce(kvm_vcpu_context_t vcpu, struct kvm_x86_mce *m)
+{
+#ifdef KVM_CAP_MCE
+return ioctl(vcpu-fd, KVM_X86_SET_MCE, m);
+#else
+return -ENOSYS;
+#endif
+}
+
 static void print_seg(FILE *file, const char *name, struct kvm_segment *seg)
 {
fprintf(stderr,
@@ -1301,6 +1334,28 @@ int kvm_arch_qemu_init_env(CPUState *cen
 
 kvm_setup_cpuid2(cenv-kvm_cpu_state.vcpu_ctx, cpuid_nent, cpuid_ent);
 
+#ifdef KVM_CAP_MCE
+if (((cenv-cpuid_version  8)0xF) = 6
+ (cenv-cpuid_features(CPUID_MCE|CPUID_MCA)) == 
(CPUID_MCE|CPUID_MCA)
+ kvm_check_extension(kvm_context, KVM_CAP_MCE)  0) {
+uint64_t mcg_cap;
+int banks;
+
+if (kvm_get_mce_cap_supported(kvm_context, mcg_cap, banks))
+perror(kvm_get_mce_cap_supported FAILED);
+else {
+if (banks  MCE_BANKS_DEF)
+banks = MCE_BANKS_DEF;
+mcg_cap = MCE_CAP_DEF;
+mcg_cap |= banks;
+if (kvm_setup_mce(cenv-kvm_cpu_state.vcpu_ctx, mcg_cap))
+perror(kvm_setup_mce FAILED);
+else
+cenv-mcg_cap = mcg_cap;
+}
+}
+#endif
+
 return 0;
 }
 
--- a/qemu-kvm.h
+++ b/qemu-kvm.h
@@ -241,4 +241,7 @@ static inline int kvm_set_migration_log(
 return kvm_physical_memory_set_dirty_tracking(enable);
 }
 
+void kvm_inject_x86_mce(CPUState *cenv, int bank, uint64_t status,
+uint64_t mcg_status, uint64_t addr, uint64_t misc);
+
 #endif
--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -1509,6 +1509,11 @@ void cpu_inject_x86_mce(CPUState *cenv, 
     unsigned bank_num = mcg_cap & 0xff;
     uint64_t *banks = cenv->mce_banks;
 
+    if (kvm_enabled()) {
+        kvm_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc);
+        return;
+    }
+
     if (bank >= bank_num || !(status & MCI_STATUS_VAL))
         return;
 
--- a/kvm/include/linux/kvm.h
+++ b/kvm/include/linux/kvm.h
@@ -463,6 +463,9 @@ struct kvm_trace_rec {
 #define KVM_CAP_ASSIGN_DEV_IRQ 29
 /* Another bug in KVM_SET_USER_MEMORY_REGION fixed: */
 #define KVM_CAP_JOIN_MEMORY_REGIONS_WORKS 30
+#ifdef __KVM_HAVE_MCE
+#define KVM_CAP_MCE 31
+#endif
 #define KVM_CAP_PIT2 33
 #define KVM_CAP_PIT_STATE2 35
 
@@ -504,6 +507,19 @@ struct kvm_irq_routing {
 
 #endif
 
+#ifdef KVM_CAP_MCE
+/* x86 MCE */
+struct kvm_x86_mce {
+   __u64 status;
+   __u64 addr;
+   __u64 misc;
+   __u64 mcg_status;
+   __u8 bank;
+   __u8 pad1[7];
+   __u64 pad2[3];
+};
+#endif
+
 /*
  * ioctls for VM fds
  */
@@ -592,6 +608,10 @@ struct kvm_irq_routing {
 #define KVM_NMI   _IO(KVMIO,  0x9a)
 /* Available with KVM_CAP_SET_GUEST_DEBUG */
 #define KVM_SET_GUEST_DEBUG   _IOW(KVMIO,  0x9b, struct kvm_guest_debug)
+/* MCE for x86 */
+#define KVM_X86_SETUP_MCE _IOW(KVMIO,  0x9c, __u64)
+#define KVM_X86_GET_MCE_CAP_SUPPORTED _IOR(KVMIO,  0x9d, __u64)
+#define KVM_X86_SET_MCE   _IOW(KVMIO,  0x9e, struct kvm_x86_mce)
 
 /*
  * Deprecated interfaces
--- a/kvm/include/x86/asm/kvm.h
+++ b/kvm/include/x86/asm/kvm.h
@@ -57,6 +57,7 @@
 #define __KVM_HAVE_USER_NMI
 #define __KVM_HAVE_GUEST_DEBUG
 #define __KVM_HAVE_MSIX
+#define __KVM_HAVE_MCE
 
 /* Architectural interrupt line count. */
 #define KVM_NR_INTERRUPTS 256
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -2845,3 +2845,43 @@ int kvm_set_boot_cpu_id(uint32_t id)
 {
return kvm_set_boot_vcpu_id(kvm_context, id);
 }
+
+#ifdef
