Re: [PATCH -v3] Monitor command: x-gpa2hva, translate guest physical address to host virtual address
On 04/28/2011 10:04 PM, Marcelo Tosatti wrote: On Thu, Apr 28, 2011 at 08:00:19AM -0500, Anthony Liguori wrote: On 04/27/2011 06:06 PM, Marcelo Tosatti wrote: On Fri, Nov 19, 2010 at 04:17:35PM +0800, Huang Ying wrote: On Tue, 2010-11-16 at 10:23 +0800, Huang Ying wrote: Author: Max Asbockmasb...@linux.vnet.ibm.com Add command x-gpa2hva to translate guest physical address to host virtual address. Because gpa to hva translation is not consistent, so this command is only used for debugging. The x-gpa2hva command provides one step in a chain of translations from guest virtual to guest physical to host virtual to host physical. Host physical is then used to inject a machine check error. As a consequence the HWPOISON code on the host and the MCE injection code in qemu-kvm are exercised. v3: - Rename to x-gpa2hva - Remove QMP support, because gpa2hva is not consistent Is this patch an acceptable solution for now? This command is useful for our testing. Anthony? Yeah, but it should come through qemu-devel, no? Yes, Huang Ying, can you please resend? Via QEMU git or uq/master branch of KVM git? Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH uq/master -v2 2/2] KVM, MCE, unpoison memory address across reboot
On Thu, 2011-02-10 at 16:52 +0800, Jan Kiszka wrote: On 2011-02-10 01:27, Huang Ying wrote: @@ -1882,6 +1919,7 @@ int kvm_arch_on_sigbus_vcpu(CPUState *en hardware_memory_error(); } } +kvm_hwpoison_page_add(ram_addr); if (code == BUS_MCEERR_AR) { /* Fake an Intel architectural Data Load SRAR UCR */ @@ -1926,6 +1964,7 @@ int kvm_arch_on_sigbus(int code, void *a QEMU itself instead of guest system!: %p\n, addr); return 0; } +kvm_hwpoison_page_add(ram_addr); kvm_mce_inj_srao_memscrub2(first_cpu, paddr); } else #endif Looks fine otherwise. Unless that simplification makes sense, I could offer to include this into my MCE rework (there is some minor conflict). If all goes well, that series should be posted during this week. Please have a look at git://git.kiszka.org/qemu-kvm.git queues/kvm-upstream and tell me if it works for you and your signed-off still applies. Thanks! Works as expected in my testing! Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH uq/master -v2 2/2] KVM, MCE, unpoison memory address across reboot
On Wed, 2011-02-09 at 16:00 +0800, Jan Kiszka wrote: On 2011-02-09 04:00, Huang Ying wrote: In Linux kernel HWPoison processing implementation, the virtual address in processes mapping the error physical memory page is marked as HWPoison. So that, the further accessing to the virtual address will kill corresponding processes with SIGBUS. If the error physical memory page is used by a KVM guest, the SIGBUS will be sent to QEMU, and QEMU will simulate a MCE to report that memory error to the guest OS. If the guest OS can not recover from the error (for example, the page is accessed by kernel code), guest OS will reboot the system. But because the underlying host virtual address backing the guest physical memory is still poisoned, if the guest system accesses the corresponding guest physical memory even after rebooting, the SIGBUS will still be sent to QEMU and MCE will be simulated. That is, guest system can not recover via rebooting. Yeah, saw this already during my test... In fact, across rebooting, the contents of guest physical memory page need not to be kept. We can allocate a new host physical page to back the corresponding guest physical address. I just wondering what would be architecturally suboptimal if we simply remapped on SIGBUS directly. Would save us at least the bookkeeping. Because we can not change the content of memory silently during guest OS running, this may corrupts guest OS data structure and even ruins disk contents. But during rebooting, all guest OS state are discarded. [snip] @@ -1882,6 +1919,7 @@ int kvm_arch_on_sigbus_vcpu(CPUState *en hardware_memory_error(); } } +kvm_hwpoison_page_add(ram_addr); if (code == BUS_MCEERR_AR) { /* Fake an Intel architectural Data Load SRAR UCR */ @@ -1926,6 +1964,7 @@ int kvm_arch_on_sigbus(int code, void *a QEMU itself instead of guest system!: %p\n, addr); return 0; } +kvm_hwpoison_page_add(ram_addr); kvm_mce_inj_srao_memscrub2(first_cpu, paddr); } else #endif Looks fine otherwise. Unless that simplification makes sense, I could offer to include this into my MCE rework (there is some minor conflict). If all goes well, that series should be posted during this week. Thanks. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH uq/master -v2 1/2] Add qemu_ram_remap
qemu_ram_remap() unmaps the specified RAM pages, then re-maps these pages again. This is used by KVM HWPoison support to clear HWPoisoned page tables across guest rebooting, so that a new page may be allocated later to recover the memory error. Signed-off-by: Huang Ying ying.hu...@intel.com --- cpu-all.h|4 +++ cpu-common.h |1 exec.c | 61 ++- 3 files changed, 65 insertions(+), 1 deletion(-) --- a/cpu-all.h +++ b/cpu-all.h @@ -863,10 +863,14 @@ target_phys_addr_t cpu_get_phys_page_deb extern int phys_ram_fd; extern ram_addr_t ram_size; +/* RAM is pre-allocated and passed into qemu_ram_alloc_from_ptr */ +#define RAM_PREALLOC_MASK (1 0) + typedef struct RAMBlock { uint8_t *host; ram_addr_t offset; ram_addr_t length; +uint32_t flags; char idstr[256]; QLIST_ENTRY(RAMBlock) next; #if defined(__linux__) !defined(TARGET_S390X) --- a/exec.c +++ b/exec.c @@ -2837,6 +2837,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(Devic if (host) { new_block-host = host; +new_block-flags |= RAM_PREALLOC_MASK; } else { if (mem_path) { #if defined (__linux__) !defined(TARGET_S390X) @@ -2890,7 +2891,9 @@ void qemu_ram_free(ram_addr_t addr) QLIST_FOREACH(block, ram_list.blocks, next) { if (addr == block-offset) { QLIST_REMOVE(block, next); -if (mem_path) { +if (block-flags RAM_PREALLOC_MASK) { +; +} else if (mem_path) { #if defined (__linux__) !defined(TARGET_S390X) if (block-fd) { munmap(block-host, block-length); @@ -2913,6 +2916,62 @@ void qemu_ram_free(ram_addr_t addr) } +void qemu_ram_remap(ram_addr_t addr, ram_addr_t length) +{ +RAMBlock *block; +ram_addr_t offset; +int flags; +void *area, *vaddr; + +QLIST_FOREACH(block, ram_list.blocks, next) { +offset = addr - block-offset; +if (offset block-length) { +vaddr = block-host + offset; +if (block-flags RAM_PREALLOC_MASK) { +; +} else { +flags = MAP_FIXED; +munmap(vaddr, length); +if (mem_path) { +#if defined (__linux__) !defined(TARGET_S390X) +if (block-fd) { +#ifdef MAP_POPULATE +flags |= mem_prealloc ? MAP_POPULATE | MAP_SHARED : +MAP_PRIVATE; +#else +flags |= MAP_PRIVATE; +#endif +area = mmap(vaddr, length, PROT_READ | PROT_WRITE, +flags, block-fd, offset); +} else { +flags |= MAP_PRIVATE | MAP_ANONYMOUS; +area = mmap(vaddr, length, PROT_READ | PROT_WRITE, +flags, -1, 0); +} +#endif +} else { +#if defined(TARGET_S390X) defined(CONFIG_KVM) +flags |= MAP_SHARED | MAP_ANONYMOUS; +area = mmap(vaddr, length, PROT_EXEC|PROT_READ|PROT_WRITE, +flags, -1, 0); +#else +flags |= MAP_PRIVATE | MAP_ANONYMOUS; +area = mmap(vaddr, length, PROT_READ | PROT_WRITE, +flags, -1, 0); +#endif +} +if (area != vaddr) { +fprintf(stderr, Could not remap addr: %lx@%lx\n, +length, addr); +exit(1); +} +qemu_madvise(vaddr, length, QEMU_MADV_MERGEABLE); +} +return; +} +} +} + /* Return a host pointer to ram allocated with qemu_ram_alloc. With the exception of the softmmu code in this file, this should only be used for local memory (e.g. video ram) that the device owns, --- a/cpu-common.h +++ b/cpu-common.h @@ -50,6 +50,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(Devic ram_addr_t size, void *host); ram_addr_t qemu_ram_alloc(DeviceState *dev, const char *name, ram_addr_t size); void qemu_ram_free(ram_addr_t addr); +void qemu_ram_remap(ram_addr_t addr, ram_addr_t length); /* This should only be used for ram local to a device. */ void *qemu_get_ram_ptr(ram_addr_t addr); /* Same but slower, to use for migration, where the order of -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH uq/master -v2 2/2] KVM, MCE, unpoison memory address across reboot
In Linux kernel HWPoison processing implementation, the virtual address in processes mapping the error physical memory page is marked as HWPoison. So that, the further accessing to the virtual address will kill corresponding processes with SIGBUS. If the error physical memory page is used by a KVM guest, the SIGBUS will be sent to QEMU, and QEMU will simulate a MCE to report that memory error to the guest OS. If the guest OS can not recover from the error (for example, the page is accessed by kernel code), guest OS will reboot the system. But because the underlying host virtual address backing the guest physical memory is still poisoned, if the guest system accesses the corresponding guest physical memory even after rebooting, the SIGBUS will still be sent to QEMU and MCE will be simulated. That is, guest system can not recover via rebooting. In fact, across rebooting, the contents of guest physical memory page need not to be kept. We can allocate a new host physical page to back the corresponding guest physical address. This patch fixes this issue in QEMU-KVM via calling qemu_ram_remap() to clear the corresponding page table entry, so that make it possible to allocate a new page to recover the issue. Signed-off-by: Huang Ying ying.hu...@intel.com --- target-i386/kvm.c | 39 +++ 1 file changed, 39 insertions(+) --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -508,6 +508,42 @@ static int kvm_get_supported_msrs(KVMSta return ret; } +struct HWPoisonPage; +typedef struct HWPoisonPage HWPoisonPage; +struct HWPoisonPage +{ +ram_addr_t ram_addr; +QLIST_ENTRY(HWPoisonPage) list; +}; + +static QLIST_HEAD(hwpoison_page_list, HWPoisonPage) hwpoison_page_list = +QLIST_HEAD_INITIALIZER(hwpoison_page_list); + +static void kvm_unpoison_all(void *param) +{ +HWPoisonPage *page, *next_page; + +QLIST_FOREACH_SAFE(page, hwpoison_page_list, list, next_page) { +QLIST_REMOVE(page, list); +qemu_ram_remap(page-ram_addr, TARGET_PAGE_SIZE); +qemu_free(page); +} +} + +static void kvm_hwpoison_page_add(ram_addr_t ram_addr) +{ +HWPoisonPage *page; + +QLIST_FOREACH(page, hwpoison_page_list, list) { +if (page-ram_addr == ram_addr) +return; +} + +page = qemu_malloc(sizeof(HWPoisonPage)); +page-ram_addr = ram_addr; +QLIST_INSERT_HEAD(hwpoison_page_list, page, list); +} + int kvm_arch_init(KVMState *s) { uint64_t identity_base = 0xfffbc000; @@ -556,6 +592,7 @@ int kvm_arch_init(KVMState *s) fprintf(stderr, e820_add_entry() table is full\n); return ret; } +qemu_register_reset(kvm_unpoison_all, NULL); return 0; } @@ -1882,6 +1919,7 @@ int kvm_arch_on_sigbus_vcpu(CPUState *en hardware_memory_error(); } } +kvm_hwpoison_page_add(ram_addr); if (code == BUS_MCEERR_AR) { /* Fake an Intel architectural Data Load SRAR UCR */ @@ -1926,6 +1964,7 @@ int kvm_arch_on_sigbus(int code, void *a QEMU itself instead of guest system!: %p\n, addr); return 0; } +kvm_hwpoison_page_add(ram_addr); kvm_mce_inj_srao_memscrub2(first_cpu, paddr); } else #endif -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -v2 0/3] KVM, Replace is_hwpoison_address with __get_user_pages
v2: - Export __get_user_pages, because we need other get_user_pages variants too. [PATCH -v2 1/3] mm, export __get_user_pages [PATCH -v2 2/3] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally [PATCH -v2 3/3] KVM, Replace is_hwpoison_address with __get_user_pages -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -v2 2/3] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally
Make __get_user_pages return -EHWPOISON for HWPOISON page only if FOLL_HWPOISON is specified. With this patch, the interested callers can distinguish HWPOISON pages from general FAULT pages, while other callers will still get -EFAULT for all these pages, so the user space interface need not to be changed. This feature is needed by KVM, where UCR MCE should be relayed to guest for HWPOISON page, while instruction emulation and MMIO will be tried for general FAULT page. The idea comes from Andrew Morton. Signed-off-by: Huang Ying ying.hu...@intel.com Cc: Andrew Morton a...@linux-foundation.org --- include/asm-generic/errno.h |2 ++ include/linux/mm.h |1 + mm/memory.c | 13 ++--- 3 files changed, 13 insertions(+), 3 deletions(-) --- a/include/asm-generic/errno.h +++ b/include/asm-generic/errno.h @@ -108,4 +108,6 @@ #define ERFKILL132 /* Operation not possible due to RF-kill */ +#define EHWPOISON 133 /* Memory page has hardware error */ + #endif --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1534,6 +1534,7 @@ struct page *follow_page(struct vm_area_ #define FOLL_FORCE 0x10/* get_user_pages read/write w/o permission */ #define FOLL_MLOCK 0x40/* mark page as mlocked */ #define FOLL_SPLIT 0x80/* don't return transhuge pages, split them */ +#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */ typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr, void *data); --- a/mm/memory.c +++ b/mm/memory.c @@ -1576,9 +1576,16 @@ int __get_user_pages(struct task_struct if (ret VM_FAULT_ERROR) { if (ret VM_FAULT_OOM) return i ? i : -ENOMEM; - if (ret - (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE| -VM_FAULT_SIGBUS)) + if (ret (VM_FAULT_HWPOISON | + VM_FAULT_HWPOISON_LARGE)) { + if (i) + return i; + else if (gup_flags FOLL_HWPOISON) + return -EHWPOISON; + else + return -EFAULT; + } + if (ret VM_FAULT_SIGBUS) return i ? i : -EFAULT; BUG(); } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -v2 3/3] KVM, Replace is_hwpoison_address with __get_user_pages
is_hwpoison_address only checks whether the page table entry is hwpoisoned, regardless the memory page mapped. While __get_user_pages will check both. QEMU will clear the poisoned page table entry (via unmap/map) to make it possible to allocate a new memory page for the virtual address across guest rebooting. But it is also possible that the underlying memory page is kept poisoned even after the corresponding page table entry is cleared, that is, a new memory page can not be allocated. __get_user_pages can catch these situations. Signed-off-by: Huang Ying ying.hu...@intel.com --- include/linux/mm.h |8 mm/memory-failure.c | 32 virt/kvm/kvm_main.c | 11 ++- 3 files changed, 10 insertions(+), 41 deletions(-) --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1627,14 +1627,6 @@ extern int sysctl_memory_failure_recover extern void shake_page(struct page *p, int access); extern atomic_long_t mce_bad_pages; extern int soft_offline_page(struct page *page, int flags); -#ifdef CONFIG_MEMORY_FAILURE -int is_hwpoison_address(unsigned long addr); -#else -static inline int is_hwpoison_address(unsigned long addr) -{ - return 0; -} -#endif extern void dump_page(struct page *page); --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1437,35 +1437,3 @@ done: /* keep elevated page count for bad page */ return ret; } - -/* - * The caller must hold current-mm-mmap_sem in read mode. - */ -int is_hwpoison_address(unsigned long addr) -{ - pgd_t *pgdp; - pud_t pud, *pudp; - pmd_t pmd, *pmdp; - pte_t pte, *ptep; - swp_entry_t entry; - - pgdp = pgd_offset(current-mm, addr); - if (!pgd_present(*pgdp)) - return 0; - pudp = pud_offset(pgdp, addr); - pud = *pudp; - if (!pud_present(pud) || pud_large(pud)) - return 0; - pmdp = pmd_offset(pudp, addr); - pmd = *pmdp; - if (!pmd_present(pmd) || pmd_large(pmd)) - return 0; - ptep = pte_offset_map(pmdp, addr); - pte = *ptep; - pte_unmap(ptep); - if (!is_swap_pte(pte)) - return 0; - entry = pte_to_swp_entry(pte); - return is_hwpoison_entry(entry); -} -EXPORT_SYMBOL_GPL(is_hwpoison_address); --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1029,6 +1029,15 @@ static pfn_t get_fault_pfn(void) return fault_pfn; } +static inline int check_user_page_hwpoison(unsigned long addr) +{ + int rc, flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_WRITE; + + rc = __get_user_pages(current, current-mm, addr, 1, + flags, NULL, NULL, NULL); + return rc == -EHWPOISON; +} + static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr, bool atomic, bool *async, bool write_fault, bool *writable) { @@ -1076,7 +1085,7 @@ static pfn_t hva_to_pfn(struct kvm *kvm, return get_fault_pfn(); down_read(current-mm-mmap_sem); - if (is_hwpoison_address(addr)) { + if (check_user_page_hwpoison(addr)) { up_read(current-mm-mmap_sem); get_page(hwpoison_page); return page_to_pfn(hwpoison_page); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -v2 1/3] mm, export __get_user_pages
In most cases, get_user_pages and get_user_pages_fast should be used to pin user pages in memory. But sometimes, some special flags except FOLL_GET, FOLL_WRITE and FOLL_FORCE are needed, for example in following patch, KVM needs FOLL_HWPOISON. To support these users, __get_user_pages is exported directly. There are some symbol name conflicts in infiniband driver, fixed them too. Signed-off-by: Huang Ying ying.hu...@intel.com CC: Andrew Morton a...@linux-foundation.org CC: Michel Lespinasse wal...@google.com CC: Roland Dreier rol...@kernel.org CC: Ralph Campbell infinip...@qlogic.com --- drivers/infiniband/hw/ipath/ipath_user_pages.c |6 +-- drivers/infiniband/hw/qib/qib_user_pages.c |6 +-- include/linux/mm.h |4 ++ mm/internal.h |5 -- mm/memory.c| 50 + 5 files changed, 60 insertions(+), 11 deletions(-) --- a/drivers/infiniband/hw/ipath/ipath_user_pages.c +++ b/drivers/infiniband/hw/ipath/ipath_user_pages.c @@ -53,8 +53,8 @@ static void __ipath_release_user_pages(s } /* call with current-mm-mmap_sem held */ -static int __get_user_pages(unsigned long start_page, size_t num_pages, - struct page **p, struct vm_area_struct **vma) +static int __ipath_get_user_pages(unsigned long start_page, size_t num_pages, + struct page **p, struct vm_area_struct **vma) { unsigned long lock_limit; size_t got; @@ -165,7 +165,7 @@ int ipath_get_user_pages(unsigned long s down_write(current-mm-mmap_sem); - ret = __get_user_pages(start_page, num_pages, p, NULL); + ret = __ipath_get_user_pages(start_page, num_pages, p, NULL); up_write(current-mm-mmap_sem); --- a/drivers/infiniband/hw/qib/qib_user_pages.c +++ b/drivers/infiniband/hw/qib/qib_user_pages.c @@ -51,8 +51,8 @@ static void __qib_release_user_pages(str /* * Call with current-mm-mmap_sem held. */ -static int __get_user_pages(unsigned long start_page, size_t num_pages, - struct page **p, struct vm_area_struct **vma) +static int __qib_get_user_pages(unsigned long start_page, size_t num_pages, + struct page **p, struct vm_area_struct **vma) { unsigned long lock_limit; size_t got; @@ -136,7 +136,7 @@ int qib_get_user_pages(unsigned long sta down_write(current-mm-mmap_sem); - ret = __get_user_pages(start_page, num_pages, p, NULL); + ret = __qib_get_user_pages(start_page, num_pages, p, NULL); up_write(current-mm-mmap_sem); --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -965,6 +965,10 @@ static inline int handle_mm_fault(struct extern int make_pages_present(unsigned long addr, unsigned long end); extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write); +int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, +unsigned long start, int len, unsigned int foll_flags, +struct page **pages, struct vm_area_struct **vmas, +int *nonblocking); int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, int nr_pages, int write, int force, struct page **pages, struct vm_area_struct **vmas); --- a/mm/internal.h +++ b/mm/internal.h @@ -245,11 +245,6 @@ static inline void mminit_validate_memmo } #endif /* CONFIG_SPARSEMEM */ -int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, -unsigned long start, int len, unsigned int foll_flags, -struct page **pages, struct vm_area_struct **vmas, -int *nonblocking); - #define ZONE_RECLAIM_NOSCAN-2 #define ZONE_RECLAIM_FULL -1 #define ZONE_RECLAIM_SOME 0 --- a/mm/memory.c +++ b/mm/memory.c @@ -1410,6 +1410,55 @@ no_page_table: return page; } +/** + * __get_user_pages() - pin user pages in memory + * @tsk: task_struct of target task + * @mm:mm_struct of target mm + * @start: starting user address + * @nr_pages: number of pages from start to pin + * @gup_flags: flags modifying pin behaviour + * @pages: array that receives pointers to the pages pinned. + * Should be at least nr_pages long. Or NULL, if caller + * only intends to ensure the pages are faulted in. + * @vmas: array of pointers to vmas corresponding to each page. + * Or NULL if the caller does not require them. + * @nonblocking: whether waiting for disk IO or mmap_sem contention + * + * Returns number of pages pinned. This may be fewer than the number + * requested. If nr_pages is 0 or negative, returns 0. If no pages + * were pinned, returns -errno. Each page returned must be released + * with a put_page() call when it is finished with. vmas will only
Re: [PATCH 1/2] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally
Hi, Andrew, On Thu, 2011-01-20 at 23:50 +0800, Marcelo Tosatti wrote: On Mon, Jan 17, 2011 at 08:47:39AM +0800, Huang Ying wrote: Hi, Andrew, On Sun, 2011-01-16 at 23:35 +0800, Avi Kivity wrote: On 01/14/2011 03:37 AM, Huang Ying wrote: On Thu, 2011-01-13 at 18:43 +0800, Avi Kivity wrote: On 01/13/2011 10:42 AM, Huang Ying wrote: Make __get_user_pages return -EHWPOISON for HWPOISON page only if FOLL_HWPOISON is specified. With this patch, the interested callers can distinguish HWPOISON page from general FAULT page, while other callers will still get -EFAULT for pages, so the user space interface need not to be changed. get_user_pages_hwpoison is added as a variant of get_user_pages that can return -EHWPOISON for HWPOISON page. This feature is needed by KVM, where UCR MCE should be relayed to guest for HWPOISON page, while instruction emulation and MMIO will be tried for general FAULT page. The idea comes from Andrew Morton. Signed-off-by: Huang Yingying.hu...@intel.com Cc: Andrew Mortona...@linux-foundation.org +#ifdef CONFIG_MEMORY_FAILURE +int get_user_pages_hwpoison(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, int nr_pages, int write, + int force, struct page **pages, + struct vm_area_struct **vmas); +#else Since we'd also like to add get_user_pages_noio(), perhaps adding a flags field to get_user_pages() is better. Or export __get_user_pages()? That's better, yes. Do you think it is a good idea to export __get_user_pages() instead of adding get_user_pages_noio() and get_user_pages_hwpoison()? Better Andrew and/or Linus should decide it. We really need your comments about this. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally
Hi, Andrew, On Sun, 2011-01-16 at 23:35 +0800, Avi Kivity wrote: On 01/14/2011 03:37 AM, Huang Ying wrote: On Thu, 2011-01-13 at 18:43 +0800, Avi Kivity wrote: On 01/13/2011 10:42 AM, Huang Ying wrote: Make __get_user_pages return -EHWPOISON for HWPOISON page only if FOLL_HWPOISON is specified. With this patch, the interested callers can distinguish HWPOISON page from general FAULT page, while other callers will still get -EFAULT for pages, so the user space interface need not to be changed. get_user_pages_hwpoison is added as a variant of get_user_pages that can return -EHWPOISON for HWPOISON page. This feature is needed by KVM, where UCR MCE should be relayed to guest for HWPOISON page, while instruction emulation and MMIO will be tried for general FAULT page. The idea comes from Andrew Morton. Signed-off-by: Huang Yingying.hu...@intel.com Cc: Andrew Mortona...@linux-foundation.org +#ifdef CONFIG_MEMORY_FAILURE +int get_user_pages_hwpoison(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, int nr_pages, int write, + int force, struct page **pages, + struct vm_area_struct **vmas); +#else Since we'd also like to add get_user_pages_noio(), perhaps adding a flags field to get_user_pages() is better. Or export __get_user_pages()? That's better, yes. Do you think it is a good idea to export __get_user_pages() instead of adding get_user_pages_noio() and get_user_pages_hwpoison()? Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: x86: Remove user space triggerable MCE error message
On Sat, 2011-01-15 at 17:00 +0800, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com This case is a pure user space error we do not need to record. Moreover, it can be misused to flood the kernel log. Remove it. I don't think this is a pure user space error. This happens on real hardware too, if the Machine Check exception is raised during early boot stage or the second MC exception is raised before the first MC exception is processed/cleared. So I use printk here to help debugging these issues. To avoid flooding the kernel log, we can use ratelimit. Best Regards, Huang Ying Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- arch/x86/kvm/x86.c |3 --- 1 files changed, 0 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 9dda70d..7f7e4a5 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2575,9 +2575,6 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu, if (mce-status MCI_STATUS_UC) { if ((vcpu-arch.mcg_status MCG_STATUS_MCIP) || !kvm_read_cr4_bits(vcpu, X86_CR4_MCE)) { - printk(KERN_DEBUG kvm: set_mce: -injects mce exception while -previous one is in progress!\n); kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); return 0; } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH uq/master 2/2] MCE, unpoison memory address across reboot
On Fri, 2011-01-14 at 16:38 +0800, Jan Kiszka wrote: Am 14.01.2011 02:51, Huang Ying wrote: On Thu, 2011-01-13 at 17:01 +0800, Jan Kiszka wrote: Am 13.01.2011 09:34, Huang Ying wrote: [snip] + +void kvm_unpoison_all(void *param) Minor nit: This can be static now. In uq/master, it can be make static. But in kvm/master, kvm_arch_init is not compiled because of conditional compiling, so we will get warning and error for unused symbol. Should we consider kvm/master in this patch? qemu-kvm is very close to switching to upstream kvm_*init. As long as it requires this service in its own modules, it will have to patch this detail. It does this for other functions already. OK. I will change this. [snip] As indicated, I'm sitting on lots of fixes and refactorings of the MCE user space code. How do you test your patches? Any suggestions how to do this efficiently would be warmly welcome. We use a self-made test script to test. Repository is at: git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git The kvm test script is in kvm sub-directory. The qemu patch attached is need by the test script. Yeah, I already found this yesterday and started reading. I was just searching for p2v in qemu, but now it's clear where it comes from. Will have a look (if you want to preview my changes: git://git.kiszka.org/qemu-kvm.git queues/kvm-upstream). I was almost about to use MADV_HWPOISON instead of the injection module. Is there a way to recover the fake corruption afterward? I think that would allow to move some of the test logic into qemu and avoid p2v which - IIRC - was disliked upstream. I don't know how to fully recover from MADV_HWPOISON. You can recover the virtual address space via qemu_ram_remap() introduced in 1/2 of this patchset. But you will lose one or several physical pages for each testing. I think that may be not a big issue for a testing machine. Ccing Andi and Fengguang, they know more than me about MADV_HWPOISON. Also, is there a way to simulate corrected errors (BUS_MCEERR_AO)? BUS_MCEERR_AO is recoverable uncorrected error instead of corrected error. The test script is for BUS_MCEERR_AO and BUS_MCEERR_AR. To see the effect of pure BUS_MCEERR_AO, just remove the memory accessing loop (memset) in tools/simple_process/simple_process.c. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: x86: Remove user space triggerable MCE error message
On Mon, 2011-01-17 at 15:11 +0800, Jan Kiszka wrote: On 2011-01-17 01:54, Huang Ying wrote: On Sat, 2011-01-15 at 17:00 +0800, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com This case is a pure user space error we do not need to record. Moreover, it can be misused to flood the kernel log. Remove it. I don't think this is a pure user space error. This happens on real hardware too, if the Machine Check exception is raised during early boot stage or the second MC exception is raised before the first MC exception is processed/cleared. So I use printk here to help debugging these issues. To avoid flooding the kernel log, we can use ratelimit. With user space I meant qemu, and maybe error was the wrong term. This code path is only triggered if qemu decides to. Not only decided by qemu, but also decided by guest OS. If guest OS does not clear the MSR or guest OS does not set the X86_CR4_MCE bit in the cr4, the triple fault will be triggered. And there you may also print this event (and you already do). Sorry, which print do you mean? I can not find similar print in user space. Another reason to not rely on catching this case here: KVM_X86_SET_MCE is obsolete on current kernels. Qemu will use a combination of KVM_SET_MSRS and KVM_SET_VCPU_EVENTS in the future, only falling back to this interface on pre-vcpu-events kernels. Then you need to debug this in user space anyway as the triple fault will no longer make it to the kernel. OK. Then, I think it will be helpful for debugging if we can print something like this in user space implementation. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH uq/master 2/2] MCE, unpoison memory address across reboot
In Linux kernel HWPoison processing implementation, the virtual address in processes mapping the error physical memory page is marked as HWPoison. So that, the further accessing to the virtual address will kill corresponding processes with SIGBUS. If the error physical memory page is used by a KVM guest, the SIGBUS will be sent to QEMU, and QEMU will simulate a MCE to report that memory error to the guest OS. If the guest OS can not recover from the error (for example, the page is accessed by kernel code), guest OS will reboot the system. But because the underlying host virtual address backing the guest physical memory is still poisoned, if the guest system accesses the corresponding guest physical memory even after rebooting, the SIGBUS will still be sent to QEMU and MCE will be simulated. That is, guest system can not recover via rebooting. In fact, across rebooting, the contents of guest physical memory page need not to be kept. We can allocate a new host physical page to back the corresponding guest physical address. This patch fixes this issue in QEMU via calling qemu_ram_remap() to clear the corresponding page table entry, so that make it possible to allocate a new page to recover the issue. Signed-off-by: Huang Ying ying.hu...@intel.com --- kvm.h |2 ++ target-i386/kvm.c | 39 +++ 2 files changed, 41 insertions(+) --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -580,6 +580,42 @@ static int kvm_get_supported_msrs(void) return ret; } +struct HWPoisonPage; +typedef struct HWPoisonPage HWPoisonPage; +struct HWPoisonPage +{ +ram_addr_t ram_addr; +QLIST_ENTRY(HWPoisonPage) list; +}; + +static QLIST_HEAD(hwpoison_page_list, HWPoisonPage) hwpoison_page_list = +QLIST_HEAD_INITIALIZER(hwpoison_page_list); + +void kvm_unpoison_all(void *param) +{ +HWPoisonPage *page, *next_page; + +QLIST_FOREACH_SAFE(page, hwpoison_page_list, list, next_page) { +QLIST_REMOVE(page, list); +qemu_ram_remap(page-ram_addr, TARGET_PAGE_SIZE); +qemu_free(page); +} +} + +static void kvm_hwpoison_page_add(ram_addr_t ram_addr) +{ +HWPoisonPage *page; + +QLIST_FOREACH(page, hwpoison_page_list, list) { +if (page-ram_addr == ram_addr) +return; +} + +page = qemu_malloc(sizeof(HWPoisonPage)); +page-ram_addr = ram_addr; +QLIST_INSERT_HEAD(hwpoison_page_list, page, list); +} + int kvm_arch_init(void) { uint64_t identity_base = 0xfffbc000; @@ -632,6 +668,7 @@ int kvm_arch_init(void) fprintf(stderr, e820_add_entry() table is full\n); return ret; } +qemu_register_reset(kvm_unpoison_all, NULL); return 0; } @@ -1940,6 +1977,7 @@ int kvm_on_sigbus_vcpu(CPUState *env, in hardware_memory_error(); } } +kvm_hwpoison_page_add(ram_addr); if (code == BUS_MCEERR_AR) { /* Fake an Intel architectural Data Load SRAR UCR */ @@ -1984,6 +2022,7 @@ int kvm_on_sigbus(int code, void *addr) QEMU itself instead of guest system!: %p\n, addr); return 0; } +kvm_hwpoison_page_add(ram_addr); kvm_mce_inj_srao_memscrub2(first_cpu, paddr); } else #endif --- a/kvm.h +++ b/kvm.h @@ -188,6 +188,8 @@ int kvm_physical_memory_addr_from_ram(ra target_phys_addr_t *phys_addr); #endif +void kvm_unpoison_all(void *param); + #endif int kvm_set_ioeventfd_mmio_long(int fd, uint32_t adr, uint32_t val, bool assign); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH uq/master 1/2] Add qemu_ram_remap
qemu_ram_remap() unmaps the specified RAM pages, then re-maps these pages again. This is used by KVM HWPoison support to clear HWPoisoned page tables across guest rebooting, so that a new page may be allocated later to recover the memory error. Signed-off-by: Huang Ying ying.hu...@intel.com --- cpu-all.h|4 +++ cpu-common.h |1 exec.c | 61 ++- 3 files changed, 65 insertions(+), 1 deletion(-) --- a/cpu-all.h +++ b/cpu-all.h @@ -863,10 +863,14 @@ target_phys_addr_t cpu_get_phys_page_deb extern int phys_ram_fd; extern ram_addr_t ram_size; +/* RAM is pre-allocated and passed into qemu_ram_alloc_from_ptr */ +#define RAM_PREALLOC_MASK (1 0) + typedef struct RAMBlock { uint8_t *host; ram_addr_t offset; ram_addr_t length; +uint32_t flags; char idstr[256]; QLIST_ENTRY(RAMBlock) next; #if defined(__linux__) !defined(TARGET_S390X) --- a/exec.c +++ b/exec.c @@ -2830,6 +2830,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(Devic if (host) { new_block-host = host; +new_block-flags |= RAM_PREALLOC_MASK; } else { if (mem_path) { #if defined (__linux__) !defined(TARGET_S390X) @@ -2883,7 +2884,9 @@ void qemu_ram_free(ram_addr_t addr) QLIST_FOREACH(block, ram_list.blocks, next) { if (addr == block-offset) { QLIST_REMOVE(block, next); -if (mem_path) { +if (block-flags RAM_PREALLOC_MASK) +; +else if (mem_path) { #if defined (__linux__) !defined(TARGET_S390X) if (block-fd) { munmap(block-host, block-length); @@ -2906,6 +2909,62 @@ void qemu_ram_free(ram_addr_t addr) } +void qemu_ram_remap(ram_addr_t addr, ram_addr_t length) +{ +RAMBlock *block; +ram_addr_t offset; +int flags; +void *area, *vaddr; + +QLIST_FOREACH(block, ram_list.blocks, next) { +offset = addr - block-offset; +if (offset block-length) { +vaddr = block-host + offset; +if (block-flags RAM_PREALLOC_MASK) { +; +} else { +flags = MAP_FIXED; +munmap(vaddr, length); +if (mem_path) { +#if defined (__linux__) !defined(TARGET_S390X) +if (block-fd) { +#ifdef MAP_POPULATE +flags |= mem_prealloc ? MAP_POPULATE | MAP_SHARED : +MAP_PRIVATE; +#else +flags |= MAP_PRIVATE; +#endif +area = mmap(vaddr, length, PROT_READ | PROT_WRITE, +flags, block-fd, offset); +} else { +flags |= MAP_PRIVATE | MAP_ANONYMOUS; +area = mmap(vaddr, length, PROT_READ | PROT_WRITE, +flags, -1, 0); +} +#endif +} else { +#if defined(TARGET_S390X) defined(CONFIG_KVM) +flags |= MAP_SHARED | MAP_ANONYMOUS; +area = mmap(vaddr, length, PROT_EXEC|PROT_READ|PROT_WRITE, +flags, -1, 0); +#else +flags |= MAP_PRIVATE | MAP_ANONYMOUS; +area = mmap(vaddr, length, PROT_READ | PROT_WRITE, +flags, -1, 0); +#endif +} +if (area != vaddr) { +fprintf(stderr, Could not remap addr: %lx@%lx\n, +length, addr); +exit(1); +} +qemu_madvise(vaddr, length, QEMU_MADV_MERGEABLE); +} +return; +} +} +} + /* Return a host pointer to ram allocated with qemu_ram_alloc. With the exception of the softmmu code in this file, this should only be used for local memory (e.g. video ram) that the device owns, --- a/cpu-common.h +++ b/cpu-common.h @@ -50,6 +50,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(Devic ram_addr_t size, void *host); ram_addr_t qemu_ram_alloc(DeviceState *dev, const char *name, ram_addr_t size); void qemu_ram_free(ram_addr_t addr); +void qemu_ram_remap(ram_addr_t addr, ram_addr_t length); /* This should only be used for ram local to a device. */ void *qemu_get_ram_ptr(ram_addr_t addr); /* Same but slower, to use for migration, where the order of -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally
Make __get_user_pages return -EHWPOISON for HWPOISON page only if FOLL_HWPOISON is specified. With this patch, the interested callers can distinguish HWPOISON page from general FAULT page, while other callers will still get -EFAULT for pages, so the user space interface need not to be changed. get_user_pages_hwpoison is added as a variant of get_user_pages that can return -EHWPOISON for HWPOISON page. This feature is needed by KVM, where UCR MCE should be relayed to guest for HWPOISON page, while instruction emulation and MMIO will be tried for general FAULT page. The idea comes from Andrew Morton. Signed-off-by: Huang Ying ying.hu...@intel.com Cc: Andrew Morton a...@linux-foundation.org --- include/asm-generic/errno.h |2 + include/linux/mm.h | 17 + mm/memory.c | 55 +--- 3 files changed, 71 insertions(+), 3 deletions(-) --- a/include/asm-generic/errno.h +++ b/include/asm-generic/errno.h @@ -108,4 +108,6 @@ #define ERFKILL132 /* Operation not possible due to RF-kill */ +#define EHWPOISON 133 /* Memory page has hardware error */ + #endif --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -860,6 +860,22 @@ int get_user_pages(struct task_struct *t struct page **pages, struct vm_area_struct **vmas); int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages); +#ifdef CONFIG_MEMORY_FAILURE +int get_user_pages_hwpoison(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, int nr_pages, int write, + int force, struct page **pages, + struct vm_area_struct **vmas); +#else +static inline int get_user_pages_hwpoison(struct task_struct *tsk, + struct mm_struct *mm, + unsigned long start, int nr_pages, + int write, int force, + struct page **pages, + struct vm_area_struct **vmas) { + return get_user_pages(tsk, mm, start, nr_pages, + write, force, pages, vmas); +} +#endif struct page *get_dump_page(unsigned long addr); extern int try_to_release_page(struct page * page, gfp_t gfp_mask); @@ -1415,6 +1431,7 @@ struct page *follow_page(struct vm_area_ #define FOLL_GET 0x04/* do get_page on page */ #define FOLL_DUMP 0x08/* give error on hole if it would be zero */ #define FOLL_FORCE 0x10/* get_user_pages read/write w/o permission */ +#define FOLL_HWPOISON 0x20/* check page is hwpoisoned */ typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr, void *data); --- a/mm/memory.c +++ b/mm/memory.c @@ -1449,9 +1449,16 @@ int __get_user_pages(struct task_struct if (ret VM_FAULT_ERROR) { if (ret VM_FAULT_OOM) return i ? i : -ENOMEM; - if (ret - (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE| -VM_FAULT_SIGBUS)) + if (ret (VM_FAULT_HWPOISON | + VM_FAULT_HWPOISON_LARGE)) { + if (i) + return i; + else if (gup_flags FOLL_HWPOISON) + return -EHWPOISON; + else + return -EFAULT; + } + if (ret VM_FAULT_SIGBUS) return i ? i : -EFAULT; BUG(); } @@ -1563,6 +1570,48 @@ int get_user_pages(struct task_struct *t } EXPORT_SYMBOL(get_user_pages); +#ifdef CONFIG_MEMORY_FAILURE +/** + * get_user_pages_hwpoison() - pin user pages in memory, return hwpoison status + * @tsk: task_struct of target task + * @mm:mm_struct of target mm + * @start: starting user address + * @nr_pages: number of pages from start to pin + * @write: whether pages will be written to by the caller + * @force: whether to force write access even if user mapping is + * readonly. This will result in the page being COWed even + * in MAP_SHARED mappings. You do not want this. + * @pages: array that receives pointers to the pages pinned. + * Should be at least nr_pages long. Or NULL
[PATCH 2/2] KVM, Replace is_hwpoison_address with get_user_pages_hwpoison
is_hwpoison_address only checks whether the page table entry is hwpoisoned, regardless the memory page mapped. While get_user_pages_hwpoison will check both. QEMU will clear the poisoned page table entry (via unmap/map) to make it possible to allocate a new memory page for the virtual address across guest rebooting. But it is also possible that the underlying memory page is kept poisoned even after the corresponding page table entry is cleared, that is, a new memory page can not be allocated. get_user_pages_hwpoison can catch these situations. Signed-off-by: Huang Ying ying.hu...@intel.com --- include/linux/mm.h |8 mm/memory-failure.c | 32 virt/kvm/kvm_main.c |4 +++- 3 files changed, 3 insertions(+), 41 deletions(-) --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1524,14 +1524,6 @@ extern int sysctl_memory_failure_recover extern void shake_page(struct page *p, int access); extern atomic_long_t mce_bad_pages; extern int soft_offline_page(struct page *page, int flags); -#ifdef CONFIG_MEMORY_FAILURE -int is_hwpoison_address(unsigned long addr); -#else -static inline int is_hwpoison_address(unsigned long addr) -{ - return 0; -} -#endif extern void dump_page(struct page *page); --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1433,35 +1433,3 @@ done: /* keep elevated page count for bad page */ return ret; } - -/* - * The caller must hold current-mm-mmap_sem in read mode. - */ -int is_hwpoison_address(unsigned long addr) -{ - pgd_t *pgdp; - pud_t pud, *pudp; - pmd_t pmd, *pmdp; - pte_t pte, *ptep; - swp_entry_t entry; - - pgdp = pgd_offset(current-mm, addr); - if (!pgd_present(*pgdp)) - return 0; - pudp = pud_offset(pgdp, addr); - pud = *pudp; - if (!pud_present(pud) || pud_large(pud)) - return 0; - pmdp = pmd_offset(pudp, addr); - pmd = *pmdp; - if (!pmd_present(pmd) || pmd_large(pmd)) - return 0; - ptep = pte_offset_map(pmdp, addr); - pte = *ptep; - pte_unmap(ptep); - if (!is_swap_pte(pte)) - return 0; - entry = pte_to_swp_entry(pte); - return is_hwpoison_entry(entry); -} -EXPORT_SYMBOL_GPL(is_hwpoison_address); --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -966,7 +966,9 @@ static pfn_t hva_to_pfn(struct kvm *kvm, goto return_fault_page; down_read(current-mm-mmap_sem); - if (is_hwpoison_address(addr)) { + npages = get_user_pages_hwpoison(current, current-mm, +addr, 1, 1, 0, page, NULL); + if (npages == -EHWPOISON) { up_read(current-mm-mmap_sem); get_page(hwpoison_page); return page_to_pfn(hwpoison_page); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally
On Thu, 2011-01-13 at 18:43 +0800, Avi Kivity wrote: On 01/13/2011 10:42 AM, Huang Ying wrote: Make __get_user_pages return -EHWPOISON for HWPOISON page only if FOLL_HWPOISON is specified. With this patch, the interested callers can distinguish HWPOISON page from general FAULT page, while other callers will still get -EFAULT for pages, so the user space interface need not to be changed. get_user_pages_hwpoison is added as a variant of get_user_pages that can return -EHWPOISON for HWPOISON page. This feature is needed by KVM, where UCR MCE should be relayed to guest for HWPOISON page, while instruction emulation and MMIO will be tried for general FAULT page. The idea comes from Andrew Morton. Signed-off-by: Huang Yingying.hu...@intel.com Cc: Andrew Mortona...@linux-foundation.org +#ifdef CONFIG_MEMORY_FAILURE +int get_user_pages_hwpoison(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, int nr_pages, int write, + int force, struct page **pages, + struct vm_area_struct **vmas); +#else Since we'd also like to add get_user_pages_noio(), perhaps adding a flags field to get_user_pages() is better. Or export __get_user_pages()? Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH uq/master 2/2] MCE, unpoison memory address across reboot
On Thu, 2011-01-13 at 17:01 +0800, Jan Kiszka wrote: Am 13.01.2011 09:34, Huang Ying wrote: In Linux kernel HWPoison processing implementation, the virtual address in processes mapping the error physical memory page is marked as HWPoison. So that, the further accessing to the virtual address will kill corresponding processes with SIGBUS. If the error physical memory page is used by a KVM guest, the SIGBUS will be sent to QEMU, and QEMU will simulate a MCE to report that memory error to the guest OS. If the guest OS can not recover from the error (for example, the page is accessed by kernel code), guest OS will reboot the system. But because the underlying host virtual address backing the guest physical memory is still poisoned, if the guest system accesses the corresponding guest physical memory even after rebooting, the SIGBUS will still be sent to QEMU and MCE will be simulated. That is, guest system can not recover via rebooting. In fact, across rebooting, the contents of guest physical memory page need not to be kept. We can allocate a new host physical page to back the corresponding guest physical address. This patch fixes this issue in QEMU via calling qemu_ram_remap() to clear the corresponding page table entry, so that make it possible to allocate a new page to recover the issue. Signed-off-by: Huang Ying ying.hu...@intel.com --- kvm.h |2 ++ target-i386/kvm.c | 39 +++ 2 files changed, 41 insertions(+) --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -580,6 +580,42 @@ static int kvm_get_supported_msrs(void) return ret; } +struct HWPoisonPage; +typedef struct HWPoisonPage HWPoisonPage; +struct HWPoisonPage +{ +ram_addr_t ram_addr; +QLIST_ENTRY(HWPoisonPage) list; +}; + +static QLIST_HEAD(hwpoison_page_list, HWPoisonPage) hwpoison_page_list = +QLIST_HEAD_INITIALIZER(hwpoison_page_list); + +void kvm_unpoison_all(void *param) Minor nit: This can be static now. In uq/master, it can be make static. But in kvm/master, kvm_arch_init is not compiled because of conditional compiling, so we will get warning and error for unused symbol. Should we consider kvm/master in this patch? +{ +HWPoisonPage *page, *next_page; + +QLIST_FOREACH_SAFE(page, hwpoison_page_list, list, next_page) { +QLIST_REMOVE(page, list); +qemu_ram_remap(page-ram_addr, TARGET_PAGE_SIZE); +qemu_free(page); +} +} + +static void kvm_hwpoison_page_add(ram_addr_t ram_addr) +{ +HWPoisonPage *page; + +QLIST_FOREACH(page, hwpoison_page_list, list) { +if (page-ram_addr == ram_addr) +return; +} + +page = qemu_malloc(sizeof(HWPoisonPage)); +page-ram_addr = ram_addr; +QLIST_INSERT_HEAD(hwpoison_page_list, page, list); +} + int kvm_arch_init(void) { uint64_t identity_base = 0xfffbc000; @@ -632,6 +668,7 @@ int kvm_arch_init(void) fprintf(stderr, e820_add_entry() table is full\n); return ret; } +qemu_register_reset(kvm_unpoison_all, NULL); return 0; } @@ -1940,6 +1977,7 @@ int kvm_on_sigbus_vcpu(CPUState *env, in hardware_memory_error(); } } +kvm_hwpoison_page_add(ram_addr); if (code == BUS_MCEERR_AR) { /* Fake an Intel architectural Data Load SRAR UCR */ @@ -1984,6 +2022,7 @@ int kvm_on_sigbus(int code, void *addr) QEMU itself instead of guest system!: %p\n, addr); return 0; } +kvm_hwpoison_page_add(ram_addr); kvm_mce_inj_srao_memscrub2(first_cpu, paddr); } else #endif --- a/kvm.h +++ b/kvm.h @@ -188,6 +188,8 @@ int kvm_physical_memory_addr_from_ram(ra target_phys_addr_t *phys_addr); #endif +void kvm_unpoison_all(void *param); + To be removed if kvm_unpoison_all is static. #endif int kvm_set_ioeventfd_mmio_long(int fd, uint32_t adr, uint32_t val, bool assign); As indicated, I'm sitting on lots of fixes and refactorings of the MCE user space code. How do you test your patches? Any suggestions how to do this efficiently would be warmly welcome. We use a self-made test script to test. Repository is at: git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git The kvm test script is in kvm sub-directory. The qemu patch attached is need by the test script. Best Regards, Huang Ying Author: Max Asbock masb...@linux.vnet.ibm.com Subject: [PATCH -v3] Monitor command: x-gpa2hva, translate guest physical address to host virtual address Add command x-gpa2hva to translate guest physical address to host virtual address. Because gpa to hva translation is not consistent, so this command is only used for debugging. The x-gpa2hva command
Re: [PATCH v3 12/21] kvm: x86: Drop MCE MSRs write back restrictions
On Wed, 2011-01-05 at 16:07 +0800, Jan Kiszka wrote: Am 05.01.2011 07:42, Huang Ying wrote: On Tue, 2011-01-04 at 16:32 +0800, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com There is no need to restrict writing back MCE MSRs to reset or full state updates as setting their values has no side effects. Sorry for late. Don't worry. The MCE MSRs contents is sticky for warm reset except MCG_STATUS, so their content should be kept. And the following sequence may set uncorrected value in MCE registers. savevm - loadvm - (OS clear MCE registers) - reset - (MCE registers has new (uncorrected) value) Sorry, I can't follow. Unless I miss some subtle detail, the question is not when we transfer the mcg_* CPUState fields to the kernel, but when and how we manipulate them in user space, e.g. on reset. Where are those fields touched incorrectly between get and put msrs so that we cannot write them back? If my understanding is correct, MSRs are not saved to user space (env-mce_banks) during reset in current code. So if all MCE MSRs are restored to kernel, their user space contents from previous loadvm may be put into kernel after reset. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 12/21] kvm: x86: Drop MCE MSRs write back restrictions
On Tue, 2011-01-04 at 16:32 +0800, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com There is no need to restrict writing back MCE MSRs to reset or full state updates as setting their values has no side effects. Sorry for late. The MCE MSRs contents is sticky for warm reset except MCG_STATUS, so their content should be kept. And the following sequence may set uncorrected value in MCE registers. savevm - loadvm - (OS clear MCE registers) - reset - (MCE registers has new (uncorrected) value) Best Regards, Huang Ying Signed-off-by: Jan Kiszka jan.kis...@siemens.com CC: Huang Ying ying.hu...@intel.com --- target-i386/kvm.c | 12 1 files changed, 4 insertions(+), 8 deletions(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 8267655..1789bff 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -863,14 +863,10 @@ static int kvm_put_msrs(CPUState *env, int level) if (env-mcg_cap) { int i; -if (level == KVM_PUT_RESET_STATE) { -kvm_msr_entry_set(msrs[n++], MSR_MCG_STATUS, env-mcg_status); -} else if (level == KVM_PUT_FULL_STATE) { -kvm_msr_entry_set(msrs[n++], MSR_MCG_STATUS, env-mcg_status); -kvm_msr_entry_set(msrs[n++], MSR_MCG_CTL, env-mcg_ctl); -for (i = 0; i (env-mcg_cap 0xff) * 4; i++) { -kvm_msr_entry_set(msrs[n++], MSR_MC0_CTL + i, env-mce_banks[i]); -} +kvm_msr_entry_set(msrs[n++], MSR_MCG_STATUS, env-mcg_status); +kvm_msr_entry_set(msrs[n++], MSR_MCG_CTL, env-mcg_ctl); +for (i = 0; i (env-mcg_cap 0xff) * 4; i++) { +kvm_msr_entry_set(msrs[n++], MSR_MC0_CTL + i, env-mce_banks[i]); } } #endif -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 2/2] KVM, MCE, unpoison memory address across reboot
On Fri, 2010-12-31 at 17:10 +0800, Jan Kiszka wrote: Am 31.12.2010 06:22, Huang Ying wrote: In Linux kernel HWPoison processing implementation, the virtual address in processes mapping the error physical memory page is marked as HWPoison. So that, the further accessing to the virtual address will kill corresponding processes with SIGBUS. If the error physical memory page is used by a KVM guest, the SIGBUS will be sent to QEMU, and QEMU will simulate a MCE to report that memory error to the guest OS. If the guest OS can not recover from the error (for example, the page is accessed by kernel code), guest OS will reboot the system. But because the underlying host virtual address backing the guest physical memory is still poisoned, if the guest system accesses the corresponding guest physical memory even after rebooting, the SIGBUS will still be sent to QEMU and MCE will be simulated. That is, guest system can not recover via rebooting. In fact, across rebooting, the contents of guest physical memory page need not to be kept. We can allocate a new host physical page to back the corresponding guest physical address. This patch fixes this issue in QEMU-KVM via calling qemu_ram_remap() to clear the corresponding page table entry, so that make it possible to allocate a new page to recover the issue. Signed-off-by: Huang Ying ying.hu...@intel.com --- kvm.h |2 ++ qemu-kvm.c| 37 + What's missing in upstream to make this a uq/master patch? We are still piling up features and fixes in qemu-kvm* that should better target upstream directly. That's work needlessly done twice. OK. I will do that. Just based on uq/master is sufficient to make it an upstream patch? Is this infrastructure really arch-independent? Will there be other users besides x86? If not, better keep it in target-i386/kvm.c. No. It is used only in x86. I will move it into target-i386/kvm.c. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC 1/2] Add qemu_ram_remap
qemu_ram_remap() unmaps the specified RAM pages, then re-maps these pages again. This is used by KVM HWPoison support to clear HWPoisoned page tables across guest rebooting, so that a new page may be allocated later to recover the memory error. Signed-off-by: Huang Ying ying.hu...@intel.com --- cpu-all.h|4 +++ cpu-common.h |1 exec.c | 61 ++- 3 files changed, 65 insertions(+), 1 deletion(-) --- a/cpu-all.h +++ b/cpu-all.h @@ -861,10 +861,14 @@ target_phys_addr_t cpu_get_phys_page_deb extern int phys_ram_fd; extern ram_addr_t ram_size; +/* RAM is pre-allocated and passed into qemu_ram_alloc_from_ptr */ +#define RAM_PREALLOC_MASK (1 0) + typedef struct RAMBlock { uint8_t *host; ram_addr_t offset; ram_addr_t length; +uint32_t flags; char idstr[256]; QLIST_ENTRY(RAMBlock) next; #if defined(__linux__) !defined(TARGET_S390X) --- a/exec.c +++ b/exec.c @@ -2845,6 +2845,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(Devic if (host) { new_block-host = host; +new_block-flags |= RAM_PREALLOC_MASK; } else { if (mem_path) { #if defined (__linux__) !defined(TARGET_S390X) @@ -2911,7 +2912,9 @@ void qemu_ram_free(ram_addr_t addr) QLIST_FOREACH(block, ram_list.blocks, next) { if (addr == block-offset) { QLIST_REMOVE(block, next); -if (mem_path) { +if (block-flags RAM_PREALLOC_MASK) +; +else if (mem_path) { #if defined (__linux__) !defined(TARGET_S390X) if (block-fd) { munmap(block-host, block-length); @@ -2934,6 +2937,62 @@ void qemu_ram_free(ram_addr_t addr) } +void qemu_ram_remap(ram_addr_t addr, ram_addr_t length) +{ +RAMBlock *block; +ram_addr_t offset; +int flags; +void *area, *vaddr; + +QLIST_FOREACH(block, ram_list.blocks, next) { +offset = addr - block-offset; +if (offset block-length) { +vaddr = block-host + offset; +if (block-flags RAM_PREALLOC_MASK) +; +else { +flags = MAP_FIXED; +munmap(vaddr, length); +if (mem_path) { +#if defined (__linux__) !defined(TARGET_S390X) +if (block-fd) { +#ifdef MAP_POPULATE +flags |= mem_prealloc ? MAP_POPULATE | MAP_SHARED : +MAP_PRIVATE; +#else +flags |= MAP_PRIVATE; +#endif +area = mmap(vaddr, length, PROT_READ | PROT_WRITE, +flags, block-fd, offset); +} else { +flags |= MAP_PRIVATE | MAP_ANONYMOUS; +area = mmap(vaddr, length, PROT_READ | PROT_WRITE, +flags, -1, 0); +} +#endif +} else { +#if defined(TARGET_S390X) defined(CONFIG_KVM) +flags |= MAP_SHARED | MAP_ANONYMOUS; +area = mmap(vaddr, length, PROT_EXEC|PROT_READ|PROT_WRITE, +flags, -1, 0); +#else +flags |= MAP_PRIVATE | MAP_ANONYMOUS; +area = mmap(vaddr, length, PROT_READ | PROT_WRITE, +flags, -1, 0); +#endif +} +if (area != vaddr) { +fprintf(stderr, Could not remap addr: %...@%lx\n, +length, addr); +exit(1); +} +qemu_madvise(vaddr, length, QEMU_MADV_MERGEABLE); +} +return; +} +} +} + /* Return a host pointer to ram allocated with qemu_ram_alloc. With the exception of the softmmu code in this file, this should only be used for local memory (e.g. video ram) that the device owns, --- a/cpu-common.h +++ b/cpu-common.h @@ -45,6 +45,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(Devic ram_addr_t qemu_ram_alloc(DeviceState *dev, const char *name, ram_addr_t size); void qemu_ram_unmap(ram_addr_t addr); void qemu_ram_free(ram_addr_t addr); +void qemu_ram_remap(ram_addr_t addr, ram_addr_t length); /* This should only be used for ram local to a device. */ void *qemu_get_ram_ptr(ram_addr_t addr); /* This should not be used by devices. */ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC 2/2] KVM, MCE, unpoison memory address across reboot
In Linux kernel HWPoison processing implementation, the virtual address in processes mapping the error physical memory page is marked as HWPoison. So that, the further accessing to the virtual address will kill corresponding processes with SIGBUS. If the error physical memory page is used by a KVM guest, the SIGBUS will be sent to QEMU, and QEMU will simulate a MCE to report that memory error to the guest OS. If the guest OS can not recover from the error (for example, the page is accessed by kernel code), guest OS will reboot the system. But because the underlying host virtual address backing the guest physical memory is still poisoned, if the guest system accesses the corresponding guest physical memory even after rebooting, the SIGBUS will still be sent to QEMU and MCE will be simulated. That is, guest system can not recover via rebooting. In fact, across rebooting, the contents of guest physical memory page need not to be kept. We can allocate a new host physical page to back the corresponding guest physical address. This patch fixes this issue in QEMU-KVM via calling qemu_ram_remap() to clear the corresponding page table entry, so that make it possible to allocate a new page to recover the issue. Signed-off-by: Huang Ying ying.hu...@intel.com --- kvm.h |2 ++ qemu-kvm.c| 37 + target-i386/kvm.c |2 ++ 3 files changed, 41 insertions(+) --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -1803,6 +1803,7 @@ int kvm_on_sigbus_vcpu(CPUState *env, in hardware_memory_error(); } } +kvm_hwpoison_page_add(ram_addr); mce.addr = paddr; r = kvm_set_mce(env, mce); if (r 0) { @@ -1841,6 +1842,7 @@ int kvm_on_sigbus(int code, void *addr) QEMU itself instead of guest system!: %p\n, addr); return 0; } +kvm_hwpoison_page_add(ram_addr); status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN | MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S | 0xc0; --- a/qemu-kvm.c +++ b/qemu-kvm.c @@ -1619,6 +1619,42 @@ int kvm_arch_init_irq_routing(void) } #endif +struct HWPoisonPage; +typedef struct HWPoisonPage HWPoisonPage; +struct HWPoisonPage +{ +ram_addr_t ram_addr; +QLIST_ENTRY(HWPoisonPage) list; +}; + +static QLIST_HEAD(hwpoison_page_list, HWPoisonPage) hwpoison_page_list = +QLIST_HEAD_INITIALIZER(hwpoison_page_list); + +static void kvm_unpoison_all(void *param) +{ +HWPoisonPage *page, *next_page; + +QLIST_FOREACH_SAFE(page, hwpoison_page_list, list, next_page) { +QLIST_REMOVE(page, list); +qemu_ram_remap(page-ram_addr, TARGET_PAGE_SIZE); +qemu_free(page); +} +} + +void kvm_hwpoison_page_add(ram_addr_t ram_addr) +{ +HWPoisonPage *page; + +QLIST_FOREACH(page, hwpoison_page_list, list) { +if (page-ram_addr == ram_addr) +return; +} + +page = qemu_malloc(sizeof(HWPoisonPage)); +page-ram_addr = ram_addr; +QLIST_INSERT_HEAD(hwpoison_page_list, page, list); +} + extern int no_hpet; static int kvm_create_context(void) @@ -1703,6 +1739,7 @@ static int kvm_create_context(void) } #endif } +qemu_register_reset(kvm_unpoison_all, NULL); return 0; } --- a/kvm.h +++ b/kvm.h @@ -221,4 +221,6 @@ int kvm_irqchip_in_kernel(void); int kvm_set_irq(int irq, int level, int *status); +void kvm_hwpoison_page_add(ram_addr_t ram_addr); + #endif -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: QEMU, MCE, unpoison memory address across reboot
On Tue, 2010-12-28 at 16:27 +0800, Gleb Natapov wrote: On Mon, Dec 27, 2010 at 07:27:54PM -0200, Marcelo Tosatti wrote: On Sun, Dec 26, 2010 at 02:27:26PM +0200, Avi Kivity wrote: +static void kvm_unpoison_all(void *param) +{ +HWPoisonPage *page, *next_page; +unsigned long address; +KVMState *s = param; + +QLIST_FOREACH_SAFE(page,hwpoison_page_list, list, next_page) { +address = (unsigned long)page-vaddr; +QLIST_REMOVE(page, list); +kvm_vm_ioctl(s, KVM_UNPOISON_ADDRESS, address); +qemu_free(page); +} +} Can't you free and reallocate all guest memory instead, on reboot, if there's a hwpoisoned page? Then you don't need this interface. Alternatively, MADV_DONTNEED? We already use it for ballooning. Does not work for hugetlbfs. Don't we break huge page to 4k pages during poisoning? Yes. That has not been implemented yet. So in fact, we can not deal with hwpoison+hugetlb in kvm now. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: QEMU, MCE, unpoison memory address across reboot
On Tue, 2010-12-28 at 05:27 +0800, Marcelo Tosatti wrote: On Sun, Dec 26, 2010 at 02:27:26PM +0200, Avi Kivity wrote: +static void kvm_unpoison_all(void *param) +{ +HWPoisonPage *page, *next_page; +unsigned long address; +KVMState *s = param; + +QLIST_FOREACH_SAFE(page,hwpoison_page_list, list, next_page) { +address = (unsigned long)page-vaddr; +QLIST_REMOVE(page, list); +kvm_vm_ioctl(s, KVM_UNPOISON_ADDRESS, address); +qemu_free(page); +} +} Can't you free and reallocate all guest memory instead, on reboot, if there's a hwpoisoned page? Then you don't need this interface. Alternatively, MADV_DONTNEED? We already use it for ballooning. Does not work for hugetlbfs. Yes. And I think zap the page range is just the implementation detail but semantics of MADV_DONTNEED. But on the other hand, whether qemu_vmalloc is implemented via posix_memalign on Linux? If it is, we can not guarantee that corresponding page table is zapped after qemu_vfree and qemu_vmalloc? That is glibc implementation details. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: QEMU, MCE, unpoison memory address across reboot
Hi, Andi, On Fri, 2010-12-24 at 00:57 +0800, Andi Kleen wrote: Can't you free and reallocate all guest memory instead, on reboot, if there's a hwpoisoned page? Then you don't need this interface. I think that would be more efficient. You can potentially save a lot of memory if the new guest doesn't need as much as the old one. So you suggest to free/reallocate all guest memory across every reboot regardless whether there's hwpoisoned page? Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: QEMU, MCE, unpoison memory address across reboot
On Thu, 2010-12-23 at 22:28 +0800, Marcelo Tosatti wrote: Can't you free and reallocate all guest memory instead, on reboot, if there's a hwpoisoned page? Then you don't need this interface. Consider about this method. It seems that some guest RAMs are not allocated in qemu_ram_alloc_from_ptr(), that is, host parameter is allocated elsewhere and passed in. I found two: - assigned_dev_register_regions() in hw/device-assignment.c - create_shared_memory_BAR() and ivshmem_read() in hw/ivshmem.c There is no general method to reallocate these memory so far. We may need a flag in struct RAMBlock to track these memory, and ignore them during reallocation. But if there are hwpoisoned pages in these memory, we can not recover. Do you think that is OK? Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 1/3] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally
Hi, Andrew, Thanks for comments. On Thu, 2010-12-23 at 07:07 +0800, Andrew Morton wrote: On Wed, 22 Dec 2010 10:51:55 +0800 Huang Ying ying.hu...@intel.com wrote: Make __get_user_pages return -EHWPOISON for HWPOISON page only if FOLL_HWPOISON is specified. With this patch, the interested callers can distinguish HWPOISON page from general FAULT page, while other callers will still get -EFAULT for pages, so the user space interface need not to be changed. get_user_pages_hwpoison is added as a variant of get_user_pages that can return -EHWPOISON for HWPOISON page. This feature is needed by KVM, where UCR MCE should be relayed to guest for HWPOISON page, while instruction emulation and MMIO will be tried for general FAULT page. The idea comes from Andrew Morton. hm, I don't remember that. I suspect it came from someone else? It's long long ago. I use another solution at that time, but now I think your one is better. http://marc.info/?l=linux-kernelm=127241995131034w=2 Signed-off-by: Huang Ying ying.hu...@intel.com Cc: Andrew Morton a...@linux-foundation.org --- include/asm-generic/errno.h |2 + include/linux/mm.h |5 mm/memory.c | 53 +--- 3 files changed, 57 insertions(+), 3 deletions(-) --- a/include/asm-generic/errno.h +++ b/include/asm-generic/errno.h @@ -108,4 +108,6 @@ #define ERFKILL132 /* Operation not possible due to RF-kill */ +#define EHWPOISON 133 /* Memory page has hardware error */ So this can be propagated to userspace? If so, which syscalls might return EHWPOISON and are there any manpage implications? No. EHWPOISON will only be returned if FOLL_HWPOISON is specified in parameter flags of __get_user_pages. #endif --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -860,6 +860,10 @@ int get_user_pages(struct task_struct *t struct page **pages, struct vm_area_struct **vmas); int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages); +int get_user_pages_hwpoison(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, int nr_pages, int write, + int force, struct page **pages, + struct vm_area_struct **vmas); struct page *get_dump_page(unsigned long addr); extern int try_to_release_page(struct page * page, gfp_t gfp_mask); @@ -1415,6 +1419,7 @@ struct page *follow_page(struct vm_area_ #define FOLL_GET 0x04/* do get_page on page */ #define FOLL_DUMP 0x08/* give error on hole if it would be zero */ #define FOLL_FORCE 0x10/* get_user_pages read/write w/o permission */ +#define FOLL_HWPOISON 0x20/* check page is hwpoisoned */ typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr, void *data); --- a/mm/memory.c +++ b/mm/memory.c @@ -1449,9 +1449,16 @@ int __get_user_pages(struct task_struct if (ret VM_FAULT_ERROR) { if (ret VM_FAULT_OOM) return i ? i : -ENOMEM; - if (ret - (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE| -VM_FAULT_SIGBUS)) + if (ret (VM_FAULT_HWPOISON | + VM_FAULT_HWPOISON_LARGE)) { + if (i) + return i; + else if (gup_flags FOLL_HWPOISON) + return -EHWPOISON; + else + return -EFAULT; + } + if (ret VM_FAULT_SIGBUS) hm, that function is getting a bit unweildy. Yes. Do you think the following code is better? return i ? i : (gup_flags FOLL_HWPOISON) ? -EHWPOISON : -EFAULT; /** + * get_user_pages_hwpoison() - pin user pages in memory, return hwpoison status + * @tsk: task_struct of target task + * @mm:mm_struct of target mm + * @start: starting user address + * @nr_pages: number of pages from start to pin + * @write: whether pages will be written to by the caller + * @force: whether to force write access even if user mapping is + * readonly. This will result in the page being COWed even + * in MAP_SHARED mappings. You do not want this. + * @pages: array that receives pointers to the pages pinned. + * Should be at least nr_pages long. Or NULL, if caller + * only intends
[RFC 0/3] KVM, HWPoison, unpoison address across rebooting
Unpoison address across rebooting, to make it possible that a new memory page can be allocated, so that guest system can successfully reboot. [RFC 1/3] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally [RFC 2/3] KVM, Replace is_hwpoison_address with get_user_pages_hwpoison [RFC 3/3] KVM, HWPoison, unpoison address across rebooting -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC 1/3] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally
Make __get_user_pages return -EHWPOISON for HWPOISON page only if FOLL_HWPOISON is specified. With this patch, the interested callers can distinguish HWPOISON page from general FAULT page, while other callers will still get -EFAULT for pages, so the user space interface need not to be changed. get_user_pages_hwpoison is added as a variant of get_user_pages that can return -EHWPOISON for HWPOISON page. This feature is needed by KVM, where UCR MCE should be relayed to guest for HWPOISON page, while instruction emulation and MMIO will be tried for general FAULT page. The idea comes from Andrew Morton. Signed-off-by: Huang Ying ying.hu...@intel.com Cc: Andrew Morton a...@linux-foundation.org --- include/asm-generic/errno.h |2 + include/linux/mm.h |5 mm/memory.c | 53 +--- 3 files changed, 57 insertions(+), 3 deletions(-) --- a/include/asm-generic/errno.h +++ b/include/asm-generic/errno.h @@ -108,4 +108,6 @@ #define ERFKILL132 /* Operation not possible due to RF-kill */ +#define EHWPOISON 133 /* Memory page has hardware error */ + #endif --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -860,6 +860,10 @@ int get_user_pages(struct task_struct *t struct page **pages, struct vm_area_struct **vmas); int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages); +int get_user_pages_hwpoison(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, int nr_pages, int write, + int force, struct page **pages, + struct vm_area_struct **vmas); struct page *get_dump_page(unsigned long addr); extern int try_to_release_page(struct page * page, gfp_t gfp_mask); @@ -1415,6 +1419,7 @@ struct page *follow_page(struct vm_area_ #define FOLL_GET 0x04/* do get_page on page */ #define FOLL_DUMP 0x08/* give error on hole if it would be zero */ #define FOLL_FORCE 0x10/* get_user_pages read/write w/o permission */ +#define FOLL_HWPOISON 0x20/* check page is hwpoisoned */ typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr, void *data); --- a/mm/memory.c +++ b/mm/memory.c @@ -1449,9 +1449,16 @@ int __get_user_pages(struct task_struct if (ret VM_FAULT_ERROR) { if (ret VM_FAULT_OOM) return i ? i : -ENOMEM; - if (ret - (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE| -VM_FAULT_SIGBUS)) + if (ret (VM_FAULT_HWPOISON | + VM_FAULT_HWPOISON_LARGE)) { + if (i) + return i; + else if (gup_flags FOLL_HWPOISON) + return -EHWPOISON; + else + return -EFAULT; + } + if (ret VM_FAULT_SIGBUS) return i ? i : -EFAULT; BUG(); } @@ -1564,6 +1571,46 @@ int get_user_pages(struct task_struct *t EXPORT_SYMBOL(get_user_pages); /** + * get_user_pages_hwpoison() - pin user pages in memory, return hwpoison status + * @tsk: task_struct of target task + * @mm:mm_struct of target mm + * @start: starting user address + * @nr_pages: number of pages from start to pin + * @write: whether pages will be written to by the caller + * @force: whether to force write access even if user mapping is + * readonly. This will result in the page being COWed even + * in MAP_SHARED mappings. You do not want this. + * @pages: array that receives pointers to the pages pinned. + * Should be at least nr_pages long. Or NULL, if caller + * only intends to ensure the pages are faulted in. + * @vmas: array of pointers to vmas corresponding to each page. + * Or NULL if the caller does not require them. + * + * Returns number of pages pinned. + * + * If the page table or memory page is hwpoisoned, return -EHWPOISON. + * + * Otherwise, same as get_user_pages. + */ +int get_user_pages_hwpoison(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, int nr_pages, int write, + int force, struct page **pages, + struct
[RFC 3/3] KVM, HWPoison, unpoison address across rebooting
In HWPoison processing code, not only the struct page corresponding the error physical memory page is marked as HWPoison, but also the virtual address in processes mapping the error physical memory page is marked as HWPoison. So that, the further accessing to the virtual address will kill corresponding processes with SIGBUS. If the error physical memory page is used by a KVM guest, the SIGBUS will be sent to QEMU, and QEMU will simulate a MCE to report that memory error to the guest OS. If the guest OS can not recover from the error (for example, the page is accessed by kernel code), guest OS will reboot the system. But because the underlying host virtual address backing the guest physical memory is still poisoned, if the guest system accesses the corresponding guest physical memory even after rebooting, the SIGBUS will still be sent to QEMU and MCE will be simulated. That is, guest system can not recover via rebooting. In fact, across rebooting, the contents of guest physical memory page need not to be kept. We can allocate a new host physical page to back the corresponding guest physical address. To do that, a mechanism in KVM to unpoison poisoned virtual address by clearing the corresponding PTE is provided. So that, when doing rebooting, QEMU can unpoison the poisoned virtual address, and when the unpoisoned memory page is accessed, a new physical memory may be allocated if possible. Signed-off-by: Huang Ying ying.hu...@intel.com --- include/linux/kvm.h |1 + include/linux/mm.h |8 mm/memory-failure.c | 39 +++ virt/kvm/kvm_main.c | 14 ++ 4 files changed, 62 insertions(+) --- a/include/linux/kvm.h +++ b/include/linux/kvm.h @@ -676,6 +676,7 @@ struct kvm_clock_data { #define KVM_SET_PIT2 _IOW(KVMIO, 0xa0, struct kvm_pit_state2) /* Available with KVM_CAP_PPC_GET_PVINFO */ #define KVM_PPC_GET_PVINFO _IOW(KVMIO, 0xa1, struct kvm_ppc_pvinfo) +#define KVM_UNPOISON_ADDRESS _IO(KVMIO, 0xa2) /* * ioctls for vcpu fds --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1512,6 +1512,14 @@ extern int sysctl_memory_failure_recover extern void shake_page(struct page *p, int access); extern atomic_long_t mce_bad_pages; extern int soft_offline_page(struct page *page, int flags); +#ifdef CONFIG_MEMORY_FAILURE +int unpoison_address(unsigned long addr); +#else +static inline int unpoison_address(unsigned long addr) +{ + return -EINVAL; +} +#endif extern void dump_page(struct page *page); --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1433,3 +1433,42 @@ done: /* keep elevated page count for bad page */ return ret; } + +int unpoison_address(unsigned long addr) +{ + struct mm_struct *mm; + pgd_t *pgdp; + pud_t pud, *pudp; + pmd_t pmd, *pmdp; + pte_t pte, *ptep; + spinlock_t *ptl; + swp_entry_t entry; + int rc; + + mm = current-mm; + pgdp = pgd_offset(mm, addr); + if (!pgd_present(*pgdp)) + return -EINVAL; + pudp = pud_offset(pgdp, addr); + pud = *pudp; + if (!pud_present(pud) || pud_large(pud)) + return -EINVAL; + pmdp = pmd_offset(pudp, addr); + pmd = *pmdp; + /* can not unpoison huge page yet */ + if (!pmd_present(pmd) || pmd_large(pmd)) + return -EINVAL; + ptep = pte_offset_map_lock(mm, pmdp, addr, ptl); + pte = *ptep; + rc = -EINVAL; + if (!is_swap_pte(pte)) + goto out; + entry = pte_to_swp_entry(pte); + if (!is_hwpoison_entry(entry)) + goto out; + pte_clear(mm, addr, ptep); +out: + pte_unmap_unlock(ptep, ptl); + return rc; +} +EXPORT_SYMBOL_GPL(unpoison_address); --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -774,6 +774,17 @@ int kvm_vm_ioctl_set_memory_region(struc return kvm_set_memory_region(kvm, mem, user_alloc); } +static int kvm_unpoison_address(struct kvm *kvm, unsigned long address) +{ + int r; + + down_read(current-mm-mmap_sem); + r = unpoison_address(address); + up_read(current-mm-mmap_sem); + + return r; +} + int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log, int *is_dirty) { @@ -1728,6 +1739,9 @@ static long kvm_vm_ioctl(struct file *fi mutex_unlock(kvm-lock); break; #endif + case KVM_UNPOISON_ADDRESS: + r = kvm_unpoison_address(kvm, arg); + break; default: r = kvm_arch_vm_ioctl(filp, ioctl, arg); if (r == -ENOTTY) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
QEMU, MCE, unpoison memory address across reboot
In Linux kernel HWPoison processing implementation, the virtual address in processes mapping the error physical memory page is marked as HWPoison. So that, the further accessing to the virtual address will kill corresponding processes with SIGBUS. If the error physical memory page is used by a KVM guest, the SIGBUS will be sent to QEMU, and QEMU will simulate a MCE to report that memory error to the guest OS. If the guest OS can not recover from the error (for example, the page is accessed by kernel code), guest OS will reboot the system. But because the underlying host virtual address backing the guest physical memory is still poisoned, if the guest system accesses the corresponding guest physical memory even after rebooting, the SIGBUS will still be sent to QEMU and MCE will be simulated. That is, guest system can not recover via rebooting. In fact, across rebooting, the contents of guest physical memory page need not to be kept. We can allocate a new host physical page to back the corresponding guest physical address. This patch fixes this issue in QEMU-KVM via invoke the unpoison mechanism implemented in Linux kernel to clear the corresponding page table entry, so that make it possible to allocate a new page to recover the issue. Signed-off-by: Huang Ying ying.hu...@intel.com --- kvm.h |2 ++ kvm/include/linux/kvm.h |2 ++ qemu-kvm.c | 40 target-i386/kvm.c |2 ++ 4 files changed, 46 insertions(+) --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -1803,6 +1803,7 @@ int kvm_on_sigbus_vcpu(CPUState *env, in hardware_memory_error(); } } +kvm_hwpoison_page_add(vaddr); mce.addr = paddr; r = kvm_set_mce(env, mce); if (r 0) { @@ -1841,6 +1842,7 @@ int kvm_on_sigbus(int code, void *addr) QEMU itself instead of guest system!: %p\n, addr); return 0; } +kvm_hwpoison_page_add(vaddr); status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN | MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S | 0xc0; --- a/kvm/include/linux/kvm.h +++ b/kvm/include/linux/kvm.h @@ -663,6 +663,8 @@ struct kvm_clock_data { /* Available with KVM_CAP_PIT_STATE2 */ #define KVM_GET_PIT2 _IOR(KVMIO, 0x9f, struct kvm_pit_state2) #define KVM_SET_PIT2 _IOW(KVMIO, 0xa0, struct kvm_pit_state2) +#define KVM_PPC_GET_PVINFO_IOW(KVMIO, 0xa1, struct kvm_ppc_pvinfo) +#define KVM_UNPOISON_ADDRESS _IO(KVMIO, 0xa2) /* * ioctls for vcpu fds --- a/qemu-kvm.c +++ b/qemu-kvm.c @@ -1619,6 +1619,45 @@ int kvm_arch_init_irq_routing(void) } #endif +struct HWPoisonPage; +typedef struct HWPoisonPage HWPoisonPage; +struct HWPoisonPage +{ +void *vaddr; +QLIST_ENTRY(HWPoisonPage) list; +}; + +static QLIST_HEAD(hwpoison_page_list, HWPoisonPage) hwpoison_page_list = +QLIST_HEAD_INITIALIZER(hwpoison_page_list); + +static void kvm_unpoison_all(void *param) +{ +HWPoisonPage *page, *next_page; +unsigned long address; +KVMState *s = param; + +QLIST_FOREACH_SAFE(page, hwpoison_page_list, list, next_page) { +address = (unsigned long)page-vaddr; +QLIST_REMOVE(page, list); +kvm_vm_ioctl(s, KVM_UNPOISON_ADDRESS, address); +qemu_free(page); +} +} + +void kvm_hwpoison_page_add(void *vaddr) +{ +HWPoisonPage *page; + +QLIST_FOREACH(page, hwpoison_page_list, list) { +if (page-vaddr == vaddr) +return; +} + +page = qemu_malloc(sizeof(HWPoisonPage)); +page-vaddr = vaddr; +QLIST_INSERT_HEAD(hwpoison_page_list, page, list); +} + extern int no_hpet; static int kvm_create_context(void) @@ -1703,6 +1742,7 @@ static int kvm_create_context(void) } #endif } +qemu_register_reset(kvm_unpoison_all, kvm_state); return 0; } --- a/kvm.h +++ b/kvm.h @@ -221,4 +221,6 @@ int kvm_irqchip_in_kernel(void); int kvm_set_irq(int irq, int level, int *status); +void kvm_hwpoison_page_add(void *vaddr); + #endif -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC 2/3] KVM, Replace is_hwpoison_address with get_user_pages_hwpoison
is_hwpoison_address only checks whether the page table entry is hwpoisoned, regardless the memory page mapped. While get_user_pages_hwpoison will check both. In a following patch, we will introduce unpoison_address, which will clear the poisoned page table entry to make it possible to allocate a new memory page for the virtual address. But it is also possible that the underlying memory page is kept poisoned even after the corresponding page table entry is cleared, that is, a new memory page can not be allocated. get_user_pages_hwpoison can catch these situations. Signed-off-by: Huang Ying ying.hu...@intel.com --- include/linux/mm.h |8 mm/memory-failure.c | 32 virt/kvm/kvm_main.c |4 +++- 3 files changed, 3 insertions(+), 41 deletions(-) --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1512,14 +1512,6 @@ extern int sysctl_memory_failure_recover extern void shake_page(struct page *p, int access); extern atomic_long_t mce_bad_pages; extern int soft_offline_page(struct page *page, int flags); -#ifdef CONFIG_MEMORY_FAILURE -int is_hwpoison_address(unsigned long addr); -#else -static inline int is_hwpoison_address(unsigned long addr) -{ - return 0; -} -#endif extern void dump_page(struct page *page); --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1433,35 +1433,3 @@ done: /* keep elevated page count for bad page */ return ret; } - -/* - * The caller must hold current-mm-mmap_sem in read mode. - */ -int is_hwpoison_address(unsigned long addr) -{ - pgd_t *pgdp; - pud_t pud, *pudp; - pmd_t pmd, *pmdp; - pte_t pte, *ptep; - swp_entry_t entry; - - pgdp = pgd_offset(current-mm, addr); - if (!pgd_present(*pgdp)) - return 0; - pudp = pud_offset(pgdp, addr); - pud = *pudp; - if (!pud_present(pud) || pud_large(pud)) - return 0; - pmdp = pmd_offset(pudp, addr); - pmd = *pmdp; - if (!pmd_present(pmd) || pmd_large(pmd)) - return 0; - ptep = pte_offset_map(pmdp, addr); - pte = *ptep; - pte_unmap(ptep); - if (!is_swap_pte(pte)) - return 0; - entry = pte_to_swp_entry(pte); - return is_hwpoison_entry(entry); -} -EXPORT_SYMBOL_GPL(is_hwpoison_address); --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -966,7 +966,9 @@ static pfn_t hva_to_pfn(struct kvm *kvm, goto return_fault_page; down_read(current-mm-mmap_sem); - if (is_hwpoison_address(addr)) { + npages = get_user_pages_hwpoison(current, current-mm, +addr, 1, 1, 0, page, NULL); + if (npages == -EHWPOISON) { up_read(current-mm-mmap_sem); get_page(hwpoison_page); return page_to_pfn(hwpoison_page); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] Check processor version for broadcast.
Hi, Dongming, On Tue, 2010-11-30 at 16:15 +0800, Jin Dongming wrote: Broadcast MCA signal is not supported by the CPUs whose version is below 06H_EH. Signed-off-by: Jin Dongming jin.dongm...@np.css.fujitsu.com --- target-i386/helper.c | 13 + 1 files changed, 13 insertions(+), 0 deletions(-) diff --git a/target-i386/helper.c b/target-i386/helper.c index 7e07ebd..437290b 100644 --- a/target-i386/helper.c +++ b/target-i386/helper.c @@ -1077,10 +1077,23 @@ void cpu_inject_x86_mce(CPUState *cenv, int bank, uint64_t status, unsigned bank_num = cenv-mcg_cap 0xff; CPUState *env; int flag = 0; +int family, model, cpuver = first_cpu-cpuid_version; if (bank = bank_num || !(status MCI_STATUS_VAL)) return; +if (broadcast) { +family = (cpuver 8) 0x0f; +model = ((cpuver 12) 0xf0) + ((cpuver 4) 0x0f); + +if ((family == 6 model = 14) || family 6) +broadcast = 1; +else { +fprintf(stderr, Current CPU does not support broadcast\n); +return; +} +} + if (kvm_enabled()) { if (broadcast) flag |= 0x02; /* bit 1: 1(broadcast); 0(not broadcast) */ Why not wrap this into a function? I think it may be used by other functions too. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Add function for clearing the requested VCPUs' mce registers.
Hi, Dongming, On Thu, 2010-11-25 at 09:14 +0800, Jin Dongming wrote: In some case of mce test, the injected data can be remained in the registers and ca affect to the result of following test cases. So add codes for clearing mce registers of given VCPU. mce registers of give VCPU could be cleared from kernel by calling the function in this patch. What need to be paid attention is that the status and mcg_status of mce must be set with 0. If not, mce registers will not be cleared. Why do you need this? To use mce=3 in guest Linux MCE testing? If it is, why not use full reboot? MCE registers are not cleared in real machine too. And it is not so pain to reboot the guest. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] Add broadcast option for mce command.
On Thu, 2010-11-25 at 09:19 +0800, Jin Dongming wrote: [...] --- a/hmp-commands.hx +++ b/hmp-commands.hx @@ -1053,9 +1053,15 @@ ETEXI { .name = mce, +#if defined(KVM_CAP_MCE) +.args_type = cpu_index:i,bank:i,status:l,mcg_status:l,addr:l,misc:l,broadcast:s?, +.params = cpu bank status mcgstatus addr misc [broadcast|b], +.help = inject a MCE on the given CPU [and broadcast to other CPUs], +#else .args_type = cpu_index:i,bank:i,status:l,mcg_status:l,addr:l,misc:l, .params = cpu bank status mcgstatus addr misc, .help = inject a MCE on the given CPU, +#endif Broadcast can not be used by QEMU-TCG? [...] Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Add function for clearing the requested VCPUs' mce registers.
On Thu, 2010-11-25 at 13:30 +0800, Jin Dongming wrote: Hi, Huang-san (2010/11/25 10:27), Huang Ying wrote: Hi, Dongming, On Thu, 2010-11-25 at 09:14 +0800, Jin Dongming wrote: In some case of mce test, the injected data can be remained in the registers and ca affect to the result of following test cases. So add codes for clearing mce registers of given VCPU. mce registers of give VCPU could be cleared from kernel by calling the function in this patch. What need to be paid attention is that the status and mcg_status of mce must be set with 0. If not, mce registers will not be cleared. Why do you need this? To use mce=3 in guest Linux MCE testing? When I set fake_panic==1 on Guest OS and tested mce via qemu-monitor, I got the wrong result sometimes. The reason why I got the wrong result was that mce registers were not cleared. So I made this patch. This can be done in QEMU via KVM_SET_MSRS too. If it is, why not use full reboot? MCE registers are not cleared in real machine too. And it is not so pain to reboot the guest. Though we know that mce registers could be cleared by rebooting the Guest Machine and the reboot of Guest Machine could save much more time than the reboot of Host Machine, it is still slower than setting fake_panic==1 to mce test. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] reset mce registers of the given VCPU or all VCPUs with mce command.
On Thu, 2010-11-25 at 09:20 +0800, Jin Dongming wrote: --- a/monitor.c +++ b/monitor.c @@ -60,6 +60,7 @@ #include trace.h #endif #include qemu-kvm.h +#include kvm_x86.h //#define DEBUG //#define DEBUG_COMPLETION @@ -2277,7 +2278,19 @@ static void do_inject_mce(Monitor *mon, const QDict *qdict) #endif for (cenv = first_cpu; cenv != NULL; cenv = cenv-next_cpu) { -if (cenv-cpu_index == cpu_index cenv-mcg_cap) { +if (!cenv-mcg_cap) +continue; + +#if defined(KVM_CAP_MCE) +if (!status !mcg_status) { +if (cenv-cpu_index == cpu_index || broadcast) +kvm_inject_x86_mce(cenv, 0, 0, 0, 0, 0, 0); + +continue; +} +#endif + +if (cenv-cpu_index == cpu_index) { cpu_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc); #if defined(KVM_CAP_MCE) if (broadcast) Are you sure there is no test cases that require the Machine Check registers not cleared? I have had a test case before that injects another MCE when MCIP in MCGSTATUS is set to check whether system will go reset. It may be better to clear MC registers in an explicit mode. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v3] Monitor command: x-gpa2hva, translate guest physical address to host virtual address
On Tue, 2010-11-16 at 10:23 +0800, Huang Ying wrote: Author: Max Asbock masb...@linux.vnet.ibm.com Add command x-gpa2hva to translate guest physical address to host virtual address. Because gpa to hva translation is not consistent, so this command is only used for debugging. The x-gpa2hva command provides one step in a chain of translations from guest virtual to guest physical to host virtual to host physical. Host physical is then used to inject a machine check error. As a consequence the HWPOISON code on the host and the MCE injection code in qemu-kvm are exercised. v3: - Rename to x-gpa2hva - Remove QMP support, because gpa2hva is not consistent Is this patch an acceptable solution for now? This command is useful for our testing. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -v3] Monitor command: x-gpa2hva, translate guest physical address to host virtual address
Author: Max Asbock masb...@linux.vnet.ibm.com Add command x-gpa2hva to translate guest physical address to host virtual address. Because gpa to hva translation is not consistent, so this command is only used for debugging. The x-gpa2hva command provides one step in a chain of translations from guest virtual to guest physical to host virtual to host physical. Host physical is then used to inject a machine check error. As a consequence the HWPOISON code on the host and the MCE injection code in qemu-kvm are exercised. v3: - Rename to x-gpa2hva - Remove QMP support, because gpa2hva is not consistent v2: - Add QMP support Signed-off-by: Max Asbock masb...@linux.vnet.ibm.com Signed-off-by: Jiajia Zheng jiajia.zh...@intel.com Signed-off-by: Huang Ying ying.hu...@intel.com --- hmp-commands.hx | 15 +++ monitor.c | 22 ++ 2 files changed, 37 insertions(+) --- a/monitor.c +++ b/monitor.c @@ -2272,6 +2272,28 @@ static void do_inject_mce(Monitor *mon, } #endif +static void do_gpa2hva_print(Monitor *mon, const QObject *data) +{ +QInt *qint; + +qint = qobject_to_qint(data); +monitor_printf(mon, 0x%lx\n, (unsigned long)qint-value); +} + +static int do_gpa2hva(Monitor *mon, const QDict *qdict, QObject **ret_data) +{ +target_phys_addr_t paddr; +target_phys_addr_t size = TARGET_PAGE_SIZE; +void *vaddr; + +paddr = qdict_get_int(qdict, addr); +vaddr = cpu_physical_memory_map(paddr, size, 0); +cpu_physical_memory_unmap(vaddr, size, 0, 0); +*ret_data = qobject_from_jsonf(%ld, (unsigned long)vaddr); + +return 0; +} + static int do_getfd(Monitor *mon, const QDict *qdict, QObject **ret_data) { const char *fdname = qdict_get_str(qdict, fdname); --- a/hmp-commands.hx +++ b/hmp-commands.hx @@ -293,6 +293,21 @@ Start gdbserver session (default @var{po ETEXI { +.name = x-gpa2hva, +.args_type = fmt:/,addr:l, +.params = /fmt addr, +.help = translate guest physical 'addr' to host virtual address, only for debugging, +.user_print = do_gpa2hva_print, +.mhandler.cmd_new = do_gpa2hva, +}, + +STEXI +...@item x-gpa2hva @var{addr} +...@findex x-gpa2hva +Translate guest physical @var{addr} to host virtual address, only for debugging. +ETEXI + +{ .name = x, .args_type = fmt:/,addr:l, .params = /fmt addr, -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address
On Sun, 2010-11-14 at 19:08 +0800, Avi Kivity wrote: On 11/12/2010 03:16 AM, Huang Ying wrote: On Thu, 2010-11-11 at 17:39 +0800, Avi Kivity wrote: On 11/11/2010 02:56 AM, Huang Ying wrote: On Thu, 2010-11-11 at 00:49 +0800, Anthony Liguori wrote: On 11/10/2010 02:34 AM, Avi Kivity wrote: Why the gpa -hva mapping is not consistent for RAM if -mempath is not used? Video RAM in the range a-b and PCI mapped RAM can change gpas (while remaining in the same hva). Even for ordinary RAM, which doesn't normally change gpa/hva, I'm not sure we want to guarantee that it won't. We can't universally either. Memory hot remove potentially breaks the mapping and some non-x86 architectures (like ARM) can alias RAM via a guest triggered mechanism. Thanks for clarification. Now I think we have two options. 1) Clearly mark gpa2hva (pfa2hva now, should renamed to gpa2hva) as a testing only interface, and should be used only on restricted environment (normal memory, without hot-remove, maybe only on x86). 2) Find some way to lock the gpa - hva mapping during operating. Such as gpa2hva_begin and gpa2hva_end and lock the mapping in between. I think 2) may be possible. But if there are no other users, why do that for a test case? So I think 1) may be the better option. 3) Move the poisoning code into qemu, so the command becomes posison-addressaddr (though physical addresses aren't stable either) Poisoning includes: a) gva - gpa b) gpa - hva c) hva - hpa d) inject the MCE into host via some external tool poison-address need to do b, c, d. Do you think it is good to do all these inside qemu? If d is not too complicated (an ioctl?), we might to it in qemu. The issue of d) is that there are multiple ways to inject MCE. Now one software based, one APEI based, and maybe some others in the future. They all use different interfaces. And as debug interface, there are not considered kernel ABI too (some are in debugfs). So I think it is better to use these ABI only in some test suite. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address
On Thu, 2010-11-11 at 17:39 +0800, Avi Kivity wrote: On 11/11/2010 02:56 AM, Huang Ying wrote: On Thu, 2010-11-11 at 00:49 +0800, Anthony Liguori wrote: On 11/10/2010 02:34 AM, Avi Kivity wrote: Why the gpa - hva mapping is not consistent for RAM if -mempath is not used? Video RAM in the range a-b and PCI mapped RAM can change gpas (while remaining in the same hva). Even for ordinary RAM, which doesn't normally change gpa/hva, I'm not sure we want to guarantee that it won't. We can't universally either. Memory hot remove potentially breaks the mapping and some non-x86 architectures (like ARM) can alias RAM via a guest triggered mechanism. Thanks for clarification. Now I think we have two options. 1) Clearly mark gpa2hva (pfa2hva now, should renamed to gpa2hva) as a testing only interface, and should be used only on restricted environment (normal memory, without hot-remove, maybe only on x86). 2) Find some way to lock the gpa - hva mapping during operating. Such as gpa2hva_begin and gpa2hva_end and lock the mapping in between. I think 2) may be possible. But if there are no other users, why do that for a test case? So I think 1) may be the better option. 3) Move the poisoning code into qemu, so the command becomes posison-address addr (though physical addresses aren't stable either) Poisoning includes: a) gva - gpa b) gpa - hva c) hva - hpa d) inject the MCE into host via some external tool poison-address need to do b, c, d. Do you think it is good to do all these inside qemu? 4) Use -mempath on /dev/shm and poison a page in the backing file If we can poison a page in the backing file, how do we know the corresponding gpa and hpa? Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address
On Wed, 2010-11-10 at 14:56 +0800, Avi Kivity wrote: On 11/08/2010 05:39 AM, Anthony Liguori wrote: Yes. The main usage of the interface is automated testing. That's precisely what the command should not be used for. You can't assume a gpa - hva mapping is consistent in an external application. If you want to implement an interface for testing, you have to push more of the logic into QEMU to avoid the race. An alternative is to use -mempath. Does poisoning work for tmpfs? Sorry for my stupid question. Why the gpa - hva mapping is not consistent for RAM if -mempath is not used? Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address
On Wed, 2010-11-10 at 16:28 +0800, Avi Kivity wrote: On 11/10/2010 09:41 AM, Huang Ying wrote: On Wed, 2010-11-10 at 14:59 +0800, Avi Kivity wrote: On 11/10/2010 08:56 AM, Avi Kivity wrote: On 11/08/2010 05:39 AM, Anthony Liguori wrote: Yes. The main usage of the interface is automated testing. That's precisely what the command should not be used for. You can't assume a gpa - hva mapping is consistent in an external application. If you want to implement an interface for testing, you have to push more of the logic into QEMU to avoid the race. An alternative is to use -mempath. Does poisoning work for tmpfs? Or hugetlbfs - I think it does? The QEMU support for hugetlbfs has some issues now. Because it is hard for QEMU to deal with 2M poisoned page reported by host OS. Although it is possible for QEMU to relay 2M poisoned page as MCE to guest OS, the guest OS may not work properly for this kind of MCE. If we get a full address (rather than just a frame number) then we can identify the 4k page and send an mce just for that frame? We need host kernel to break down the 2M huge page into 4k pages. Then send SIGBUS to QEMU with the poisoned 4k page. Because host kernel will poison the whole 2M virtual address space otherwise, and other 4k pages inside the 2M page can not used accessed in guest (will trigger SIGBUS and SRAR MCE). Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address
On Thu, 2010-11-11 at 00:49 +0800, Anthony Liguori wrote: On 11/10/2010 02:34 AM, Avi Kivity wrote: Why the gpa - hva mapping is not consistent for RAM if -mempath is not used? Video RAM in the range a-b and PCI mapped RAM can change gpas (while remaining in the same hva). Even for ordinary RAM, which doesn't normally change gpa/hva, I'm not sure we want to guarantee that it won't. We can't universally either. Memory hot remove potentially breaks the mapping and some non-x86 architectures (like ARM) can alias RAM via a guest triggered mechanism. Thanks for clarification. Now I think we have two options. 1) Clearly mark gpa2hva (pfa2hva now, should renamed to gpa2hva) as a testing only interface, and should be used only on restricted environment (normal memory, without hot-remove, maybe only on x86). 2) Find some way to lock the gpa - hva mapping during operating. Such as gpa2hva_begin and gpa2hva_end and lock the mapping in between. I think 2) may be possible. But if there are no other users, why do that for a test case? So I think 1) may be the better option. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address
On Wed, 2010-11-10 at 14:59 +0800, Avi Kivity wrote: On 11/10/2010 08:56 AM, Avi Kivity wrote: On 11/08/2010 05:39 AM, Anthony Liguori wrote: Yes. The main usage of the interface is automated testing. That's precisely what the command should not be used for. You can't assume a gpa - hva mapping is consistent in an external application. If you want to implement an interface for testing, you have to push more of the logic into QEMU to avoid the race. An alternative is to use -mempath. Does poisoning work for tmpfs? Or hugetlbfs - I think it does? The QEMU support for hugetlbfs has some issues now. Because it is hard for QEMU to deal with 2M poisoned page reported by host OS. Although it is possible for QEMU to relay 2M poisoned page as MCE to guest OS, the guest OS may not work properly for this kind of MCE. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address
Hi, Anthony, On Mon, 2010-11-08 at 11:39 +0800, Anthony Liguori wrote: On 11/07/2010 07:29 PM, Huang Ying wrote: On Sun, 2010-11-07 at 00:24 +0800, Avi Kivity wrote: On 11/01/2010 03:22 PM, Anthony Liguori wrote: On 11/01/2010 02:20 PM, Huang Ying wrote: Yes. As general interface, it may not work so well, but as test interface, it works quite well and useful. Do we have any mechanism to add a test only interface? I'd like to see what Luiz/Markus think but definitely only a human monitor interface and probably prefix the command with a 'x-' prefix to indicate that it's not a supported interface. Automated testing will want to use qmp. Yes. The main usage of the interface is automated testing. That's precisely what the command should not be used for. You can't assume a gpa - hva mapping is consistent in an external application. If you want to implement an interface for testing, you have to push more of the logic into QEMU to avoid the race. Yes. pfa2hva (should be renamed to gpa2hva) in the patch works not well in general cases, but it works well for our testing. So we do not want to make it a general interface, but a testing only interface. Maybe via putting it into a special name space like x-. Do you agree? Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address
On Mon, 2010-11-08 at 11:39 +0800, Anthony Liguori wrote: On 11/07/2010 07:29 PM, Huang Ying wrote: On Sun, 2010-11-07 at 00:24 +0800, Avi Kivity wrote: On 11/01/2010 03:22 PM, Anthony Liguori wrote: On 11/01/2010 02:20 PM, Huang Ying wrote: Yes. As general interface, it may not work so well, but as test interface, it works quite well and useful. Do we have any mechanism to add a test only interface? I'd like to see what Luiz/Markus think but definitely only a human monitor interface and probably prefix the command with a 'x-' prefix to indicate that it's not a supported interface. Automated testing will want to use qmp. Yes. The main usage of the interface is automated testing. That's precisely what the command should not be used for. You can't assume a gpa - hva mapping is consistent in an external application. If you want to implement an interface for testing, you have to push more of the logic into QEMU to avoid the race. From the code of cpu_physical_memory_map(), it seems that if the 'addr' is physical address in RAM, the mapping should be consistent at least for x86, doesn't it? Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address
On Sun, 2010-11-07 at 00:24 +0800, Avi Kivity wrote: On 11/01/2010 03:22 PM, Anthony Liguori wrote: On 11/01/2010 02:20 PM, Huang Ying wrote: Yes. As general interface, it may not work so well, but as test interface, it works quite well and useful. Do we have any mechanism to add a test only interface? I'd like to see what Luiz/Markus think but definitely only a human monitor interface and probably prefix the command with a 'x-' prefix to indicate that it's not a supported interface. Automated testing will want to use qmp. Yes. The main usage of the interface is automated testing. The documentation should be very clear about the limitations of the interface and the intended use-case. Perhaps a { execute: 'enable-version-specific-commands', arguments: ['pfa2hva'] } ? Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address
On Mon, 2010-11-01 at 11:49 -0700, Anthony Liguori wrote: On 11/01/2010 11:09 AM, Marcelo Tosatti wrote: On Tue, Oct 26, 2010 at 10:39:48AM +0800, Huang Ying wrote: Author: Max Asbockmasb...@linux.vnet.ibm.com Add command pfa2hva to translate guest physical address to host virtual address. The pfa2hva command provides one step in a chain of translations from guest virtual to guest physical to host virtual to host physical. Host physical is then used to inject a machine check error. As a consequence the HWPOISON code on the host and the MCE injection code in qemu-kvm are exercised. v2: - Add QMP support Looks good to me. Anthony? The translation is not stable so this is really a bad interface. It definitely shouldn't be a QMP command and I don't think it's generally useful enough that it should be a monitor command. It's impossible to use correctly as a general interface. I gave this feedback when it was initially submitted. Yes. As general interface, it may not work so well, but as test interface, it works quite well and useful. Do we have any mechanism to add a test only interface? Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address
Author: Max Asbock masb...@linux.vnet.ibm.com Add command pfa2hva to translate guest physical address to host virtual address. The pfa2hva command provides one step in a chain of translations from guest virtual to guest physical to host virtual to host physical. Host physical is then used to inject a machine check error. As a consequence the HWPOISON code on the host and the MCE injection code in qemu-kvm are exercised. v2: - Add QMP support Signed-off-by: Max Asbock masb...@linux.vnet.ibm.com Signed-off-by: Jiajia Zheng jiajia.zh...@intel.com Signed-off-by: Huang Ying ying.hu...@intel.com --- hmp-commands.hx | 15 +++ monitor.c | 22 ++ qmp-commands.hx | 27 +++ 3 files changed, 64 insertions(+) --- a/monitor.c +++ b/monitor.c @@ -2272,6 +2272,28 @@ static void do_inject_mce(Monitor *mon, } #endif +static void do_pfa2hva_print(Monitor *mon, const QObject *data) +{ +QInt *qint; + +qint = qobject_to_qint(data); +monitor_printf(mon, 0x%lx\n, (unsigned long)qint-value); +} + +static int do_pfa2hva(Monitor *mon, const QDict *qdict, QObject **ret_data) +{ +target_phys_addr_t paddr; +target_phys_addr_t size = TARGET_PAGE_SIZE; +void *vaddr; + +paddr = qdict_get_int(qdict, addr); +vaddr = cpu_physical_memory_map(paddr, size, 0); +cpu_physical_memory_unmap(vaddr, size, 0, 0); +*ret_data = qobject_from_jsonf(%ld, (unsigned long)vaddr); + +return 0; +} + static int do_getfd(Monitor *mon, const QDict *qdict, QObject **ret_data) { const char *fdname = qdict_get_str(qdict, fdname); --- a/hmp-commands.hx +++ b/hmp-commands.hx @@ -293,6 +293,21 @@ Start gdbserver session (default @var{po ETEXI { +.name = pfa2hva, +.args_type = fmt:/,addr:l, +.params = /fmt addr, +.help = translate guest physical 'addr' to host virtual address, +.user_print = do_pfa2hva_print, +.mhandler.cmd_new = do_pfa2hva, +}, + +STEXI +...@item pfa2hva @var{addr} +...@findex pfa2hva +Translate guest physical @var{addr} to host virtual address. +ETEXI + +{ .name = x, .args_type = fmt:/,addr:l, .params = /fmt addr, --- a/qmp-commands.hx +++ b/qmp-commands.hx @@ -738,6 +738,33 @@ Example: EQMP { +.name = pfa2hva, +.args_type = addr:l, +.params = addr, +.help = translate guest physical 'addr' to host virtual address, +.user_print = do_pfa2hva_print, +.mhandler.cmd_new = do_pfa2hva, +}, + +SQMP +pfa2hva +--- + +Translate guest physical 'addr' to host virtual address. + +Arguments: + +- addr: the guest physical address + +Example: + +- { execute: pfa2hva, + arguments: { addr: 196608 } } +- { return: 139888084717568 } + +EQMP + +{ .name = qmp_capabilities, .args_type = , .params = , -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 11/11] kvm, x86: broadcast mce depending on the cpu version
On Fri, 2010-10-15 at 09:52 +0800, Hidetoshi Seto wrote: (2010/10/15 10:06), Marcelo Tosatti wrote: On Thu, Oct 14, 2010 at 05:55:28PM +0900, Jin Dongming wrote: There is no reason why SRAO event received by the main thread is the only one that being broadcasted. According to the x86 ASDM vol.3A 15.10.4.1, MCE signal is broadcast on processor version 06H_EH or later. This change is required to handle SRAR in the guest. Signed-off-by: Hidetoshi Seto seto.hideto...@jp.fujitsu.com Tested-by: Jin Dongming jin.dongm...@np.css.fujitsu.com --- qemu-kvm.c | 63 +-- 1 files changed, 31 insertions(+), 32 deletions(-) Why is this necessary? _AO SIGBUS should be sent to all vcpu threads and main thread. Humm? If you are right, vcpu threads will receive same SRAO event twice, one is that received by itself and another is that received by main thread and forwarded by the broadcast. My understanding is (Jin, please correct me if something wrong): - _AO SIGBUS is sent to main thread only, and then SRAO event is broadcasted to all vcpu threads. Yes. It is. - _AR SIGBUS is sent to a vcpu thread that tried to touch the unmapped poisoned page, and SRAR event is posted to the vcpu. One problem here is that SRAR is not broadcasted. The guest might observe the event differently, like some cpus don't enter machine check. Yes. SRAR Broadcast follows spec better. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] kvm, MCE, Add MCG_SER_P into KVM_MCE_CAP_SUPPORTED
Now we have MCG_SER_P (and corresponding SRAO/SRAR MCE) support in kernel and QEMU-KVM, the MCG_SER_P should be added into KVM_MCE_CAP_SUPPORTED to make all these code really works. Reported-by: Dean Nelson dnel...@redhat.com Signed-off-by: Huang Ying ying.hu...@intel.com --- arch/x86/kvm/x86.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -71,7 +71,7 @@ #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR) #define KVM_MAX_MCE_BANKS 32 -#define KVM_MCE_CAP_SUPPORTED MCG_CTL_P +#define KVM_MCE_CAP_SUPPORTED (MCG_CTL_P | MCG_SER_P) /* EFER defaults: * - enable syscall per default because its emulated by KVM -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] kvm, MCE, Send SRAR SIGBUS directly
Originally, SRAR SIGBUS is sent to QEMU-KVM via touching the poisoned page. But commit 96054569190bdec375fe824e48ca1f4e3b53dd36 prevents the signal from being sent. So now the signal is sent via force_sig_info_fault directly. Reported-by: Dean Nelson dnel...@redhat.com Signed-off-by: Huang Ying ying.hu...@intel.com --- arch/x86/include/asm/signal.h |3 +++ arch/x86/kvm/mmu.c| 15 +++ arch/x86/mm/fault.c |6 +++--- 3 files changed, 9 insertions(+), 15 deletions(-) --- a/arch/x86/include/asm/signal.h +++ b/arch/x86/include/asm/signal.h @@ -258,6 +258,9 @@ struct pt_regs; #define ptrace_signal_deliver(regs, cookie) do { } while (0) +void force_sig_info_fault(int si_signo, int si_code, unsigned long address, + struct task_struct *tsk); + #endif /* __KERNEL__ */ #endif /* __ASSEMBLY__ */ --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -158,9 +158,8 @@ is_prefetch(struct pt_regs *regs, unsign return prefetch; } -static void -force_sig_info_fault(int si_signo, int si_code, unsigned long address, -struct task_struct *tsk) +void force_sig_info_fault(int si_signo, int si_code, unsigned long address, + struct task_struct *tsk) { siginfo_t info; @@ -172,6 +171,7 @@ force_sig_info_fault(int si_signo, int s force_sig_info(si_signo, info, tsk); } +EXPORT_SYMBOL_GPL(force_sig_info_fault); DEFINE_SPINLOCK(pgd_lock); LIST_HEAD(pgd_list); --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -39,6 +39,7 @@ #include asm/cmpxchg.h #include asm/io.h #include asm/vmx.h +#include asm/signal.h /* * When setting this variable to true it enables Two-Dimensional-Paging @@ -2104,22 +2105,12 @@ static int __direct_map(struct kvm_vcpu return pt_write; } -static void kvm_send_hwpoison_signal(struct kvm *kvm, gfn_t gfn) -{ - char buf[1]; - void __user *hva; - int r; - - /* Touch the page, so send SIGBUS */ - hva = (void __user *)gfn_to_hva(kvm, gfn); - r = copy_from_user(buf, hva, 1); -} - static int kvm_handle_bad_page(struct kvm *kvm, gfn_t gfn, pfn_t pfn) { kvm_release_pfn_clean(pfn); if (is_hwpoison_pfn(pfn)) { - kvm_send_hwpoison_signal(kvm, gfn); + force_sig_info_fault(SIGBUS, BUS_MCEERR_AR, +gfn_to_hva(kvm, gfn), current); return 0; } else if (is_fault_pfn(pfn)) return -EFAULT; -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Fix SRAO/SRAR MCE injecting on guest without MCG_SER_P
On real machine, if MCG_SER_P in MSR_MCG_CAP is not set, SRAO/SRAR MCE should not be raised by hardware. This patch makes QEMU-KVM to follow the same rule. Reported-by: Hidetoshi Seto seto.hideto...@jp.fujitsu.com Signed-off-by: Huang Ying ying.hu...@intel.com --- qemu-kvm.c |6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) --- a/qemu-kvm.c +++ b/qemu-kvm.c @@ -1134,7 +1134,7 @@ static void sigbus_handler(int n, struct void *ctx) { #if defined(KVM_CAP_MCE) defined(TARGET_I386) -if (first_cpu-mcg_cap siginfo-ssi_addr +if ((first_cpu-mcg_cap MCG_SER_P) siginfo-ssi_addr siginfo-ssi_code == BUS_MCEERR_AO) { uint64_t status; void *vaddr; @@ -1324,7 +1324,7 @@ static void kvm_on_sigbus(CPUState *env, unsigned long paddr; int r; -if (env-mcg_cap siginfo-si_addr +if ((env-mcg_cap MCG_SER_P) siginfo-si_addr (siginfo-si_code == BUS_MCEERR_AR || siginfo-si_code == BUS_MCEERR_AO)) { if (siginfo-si_code == BUS_MCEERR_AR) { @@ -1356,7 +1356,7 @@ static void kvm_on_sigbus(CPUState *env, if (do_qemu_ram_addr_from_host(vaddr, ram_addr) || !kvm_physical_memory_addr_from_ram(kvm_state, ram_addr, (target_phys_addr_t *)paddr)) { fprintf(stderr, Hardware memory error for memory used by -QEMU itself instaed of guest system!\n); +QEMU itself instead of guest system!\n); /* Hope we are lucky for AO MCE */ if (siginfo-si_code == BUS_MCEERR_AO) { return; -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] kvm, MCE, Send SRAR SIGBUS directly
On Sat, 2010-10-09 at 04:25 +0800, Marcelo Tosatti wrote: On Fri, Oct 08, 2010 at 04:24:15PM +0800, Huang Ying wrote: Originally, SRAR SIGBUS is sent to QEMU-KVM via touching the poisoned page. But commit 96054569190bdec375fe824e48ca1f4e3b53dd36 prevents the signal from being sent. So now the signal is sent via force_sig_info_fault directly. Reported-by: Dean Nelson dnel...@redhat.com Signed-off-by: Huang Ying ying.hu...@intel.com --- arch/x86/include/asm/signal.h |3 +++ arch/x86/kvm/mmu.c| 15 +++ arch/x86/mm/fault.c |6 +++--- 3 files changed, 9 insertions(+), 15 deletions(-) --- a/arch/x86/include/asm/signal.h +++ b/arch/x86/include/asm/signal.h @@ -258,6 +258,9 @@ struct pt_regs; #define ptrace_signal_deliver(regs, cookie) do { } while (0) +void force_sig_info_fault(int si_signo, int si_code, unsigned long address, + struct task_struct *tsk); + #endif /* __KERNEL__ */ #endif /* __ASSEMBLY__ */ --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -158,9 +158,8 @@ is_prefetch(struct pt_regs *regs, unsign return prefetch; } -static void -force_sig_info_fault(int si_signo, int si_code, unsigned long address, -struct task_struct *tsk) +void force_sig_info_fault(int si_signo, int si_code, unsigned long address, + struct task_struct *tsk) { siginfo_t info; @@ -172,6 +171,7 @@ force_sig_info_fault(int si_signo, int s force_sig_info(si_signo, info, tsk); } +EXPORT_SYMBOL_GPL(force_sig_info_fault); DEFINE_SPINLOCK(pgd_lock); LIST_HEAD(pgd_list); --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -39,6 +39,7 @@ #include asm/cmpxchg.h #include asm/io.h #include asm/vmx.h +#include asm/signal.h /* * When setting this variable to true it enables Two-Dimensional-Paging @@ -2104,22 +2105,12 @@ static int __direct_map(struct kvm_vcpu return pt_write; } -static void kvm_send_hwpoison_signal(struct kvm *kvm, gfn_t gfn) -{ - char buf[1]; - void __user *hva; - int r; - - /* Touch the page, so send SIGBUS */ - hva = (void __user *)gfn_to_hva(kvm, gfn); - r = copy_from_user(buf, hva, 1); -} - static int kvm_handle_bad_page(struct kvm *kvm, gfn_t gfn, pfn_t pfn) { kvm_release_pfn_clean(pfn); if (is_hwpoison_pfn(pfn)) { - kvm_send_hwpoison_signal(kvm, gfn); + force_sig_info_fault(SIGBUS, BUS_MCEERR_AR, +gfn_to_hva(kvm, gfn), current); return 0; } else if (is_fault_pfn(pfn)) return -EFAULT; Better fill siginfo and use force_sig_info, which is already exported. It seems that force_sig_info is not exported for now. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch uq/master 7/8] MCE: Relay UCR MCE to guest
On Thu, 2010-10-07 at 00:05 +0800, Marcelo Tosatti wrote: On Wed, Oct 06, 2010 at 10:58:36AM +0900, Hidetoshi Seto wrote: I got some more question: (2010/10/05 3:54), Marcelo Tosatti wrote: Index: qemu/target-i386/cpu.h === --- qemu.orig/target-i386/cpu.h +++ qemu/target-i386/cpu.h @@ -250,16 +250,32 @@ #define PG_ERROR_RSVD_MASK 0x08 #define PG_ERROR_I_D_MASK 0x10 -#define MCG_CTL_P(1UL8) /* MCG_CAP register available */ +#define MCG_CTL_P(1ULL8) /* MCG_CAP register available */ +#define MCG_SER_P(1ULL24) /* MCA recovery/new status bits */ -#define MCE_CAP_DEF MCG_CTL_P +#define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P) #define MCE_BANKS_DEF10 It seems that current kvm doesn't support SER_P, so injecting SRAO to guest will mean that guest receives VAL|UC|!PCC and RIPV event from virtual processor that doesn't have SER_P. Dean also noted this. I don't think it was deliberate choice to not expose SER_P. Huang? In fact, that should be a BUG. I will fix it as soon as possible. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch uq/master 7/8] MCE: Relay UCR MCE to guest
, and the issue remains. Definitely. I guess Huang has some plan or hint for rework this point. Yes. This should be fixed. The SRAR SIGBUS should be sent directly instead of being sent via touching poisoned virtual address. I would think that if the the bad page can't be sidelined, such that the newly booting guest can't use it, then the new guest shouldn't be allowed to boot. But perhaps there is some merit in letting it try to boot and see if one gets 'lucky'. In case of booting a real machine in real world, hardware and firmware usually (or often) do self-test before passing control to OS. Some platform can boot OS with degraded configuration (for example, fewer memory) if it has trouble on its component. Some BIOS may stop booting and show messages like please reseat [component] on the screen. So we could implement/request qemu to have such mechanism. I can understand the merit you mentioned here, in some degree. But I think it is hard to say unlucky to customer in business... Because the contents of poisoned pages are not relevant after reboot. Qemu can replace the poisoned pages with good pages when reboot guest. Do you think that is good. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Monitor command to translate guest physical address to host virtual address
From: Max Asbock masb...@linux.vnet.ibm.com Add command p2v to translate guest physical address to host virtual address. The p2v command provides one step in a chain of translations from guest virtual to guest physical to host virtual to host physical. Host physical is then used to inject a machine check error. As a consequence the HWPOISON code on the host and the MCE injection code in qemu-kvm are exercised. Signed-off-by: Max Asbock masb...@linux.vnet.ibm.com Signed-off-by: Jiajia Zheng jiajia.zh...@intel.com Signed-off-by: Huang Ying ying.hu...@intel.com --- monitor.c | 11 +++ qemu-monitor.hx | 13 + 2 files changed, 24 insertions(+) --- a/monitor.c +++ b/monitor.c @@ -2301,6 +2301,17 @@ static void do_inject_mce(Monitor *mon, } #endif +static void do_p2v(Monitor *mon, const QDict *qdict) +{ +target_phys_addr_t size = TARGET_PAGE_SIZE; +target_phys_addr_t addr = qdict_get_int(qdict, addr); +void *vaddr; + +vaddr = cpu_physical_memory_map(addr, size, 0); +monitor_printf(mon, Guest physical address %p is mapped at + host virtual address %p\n, (void *)addr, vaddr); +} + static int do_getfd(Monitor *mon, const QDict *qdict, QObject **ret_data) { const char *fdname = qdict_get_str(qdict, fdname); --- a/qemu-monitor.hx +++ b/qemu-monitor.hx @@ -459,6 +459,19 @@ Start gdbserver session (default @var{po ETEXI { +.name = p2v, +.args_type = fmt:/,addr:l, +.params = /fmt addr, +.help = translate guest physical 'addr' to host virtual address, +.mhandler.cmd = do_p2v, +}, +STEXI +...@item p2v @var{addr} +...@findex mce +Translate guest physical @var{addr} to host virtual address. +ETEXI + +{ .name = x, .args_type = fmt:/,addr:l, .params = /fmt addr, -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Add RAM - physical addr mapping in MCE simulation
Hi, Marcelo, On Tue, 2010-09-21 at 00:10 +0800, Marcelo Tosatti wrote: On Sun, Sep 19, 2010 at 09:00:33AM +0800, Huang Ying wrote: In QEMU-KVM, physical address != RAM address. While MCE simulation needs physical address instead of RAM address. So kvm_physical_memory_addr_from_ram() is implemented to do the conversion, and it is invoked before being filled in the IA32_MCi_ADDR MSR. Reported-by: Dean Nelson dnel...@redhat.com Signed-off-by: Huang Ying ying.hu...@intel.com Applied, thanks. Can you please submit the code necessary to run the test suite upstream? Is there anything else needed other than p2v monitor command? The only missing part is p2v monitor command. I will re-post it. Do you want to merge the test suite into kvm autotest tool? If so, Shaohui can help to do that. But it seems that he is busy on some other project now. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Add RAM - physical addr mapping in MCE simulation
In QEMU-KVM, physical address != RAM address. While MCE simulation needs physical address instead of RAM address. So kvm_physical_memory_addr_from_ram() is implemented to do the conversion, and it is invoked before being filled in the IA32_MCi_ADDR MSR. Reported-by: Dean Nelson dnel...@redhat.com Signed-off-by: Huang Ying ying.hu...@intel.com --- kvm-all.c | 18 ++ kvm.h |3 +++ qemu-kvm.c | 13 ++--- 3 files changed, 31 insertions(+), 3 deletions(-) --- a/qemu-kvm.c +++ b/qemu-kvm.c @@ -1137,12 +1137,15 @@ static void sigbus_handler(int n, struct if (first_cpu-mcg_cap siginfo-ssi_addr siginfo-ssi_code == BUS_MCEERR_AO) { uint64_t status; +void *vaddr; +ram_addr_t ram_addr; unsigned long paddr; CPUState *cenv; /* Hope we are lucky for AO MCE */ -if (do_qemu_ram_addr_from_host((void *)(intptr_t)siginfo-ssi_addr, - paddr)) { +vaddr = (void *)(intptr_t)siginfo-ssi_addr; +if (do_qemu_ram_addr_from_host(vaddr, ram_addr) || +!kvm_physical_memory_addr_from_ram(kvm_state, ram_addr, paddr)) { fprintf(stderr, Hardware memory error for memory used by QEMU itself instead of guest system!: %llx\n, (unsigned long long)siginfo-ssi_addr); @@ -1316,6 +1319,8 @@ static void kvm_on_sigbus(CPUState *env, struct kvm_x86_mce mce = { .bank = 9, }; +void *vaddr; +ram_addr_t ram_addr; unsigned long paddr; int r; @@ -1347,7 +1352,9 @@ static void kvm_on_sigbus(CPUState *env, mce.misc = (MCM_ADDR_PHYS 6) | 0xc; mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_RIPV; } -if (do_qemu_ram_addr_from_host((void *)siginfo-si_addr, paddr)) { +vaddr = (void *)siginfo-si_addr; +if (do_qemu_ram_addr_from_host(vaddr, ram_addr) || +!kvm_physical_memory_addr_from_ram(kvm_state, ram_addr, paddr)) { fprintf(stderr, Hardware memory error for memory used by QEMU itself instaed of guest system!\n); /* Hope we are lucky for AO MCE */ --- a/kvm-all.c +++ b/kvm-all.c @@ -141,6 +141,24 @@ static KVMSlot *kvm_lookup_overlapping_s return found; } +int kvm_physical_memory_addr_from_ram(KVMState *s, ram_addr_t ram_addr, + target_phys_addr_t *phys_addr) +{ +int i; + +for (i = 0; i ARRAY_SIZE(s-slots); i++) { +KVMSlot *mem = s-slots[i]; + +if (ram_addr = mem-phys_offset +ram_addr mem-phys_offset + mem-memory_size) { +*phys_addr = mem-start_addr + (ram_addr - mem-phys_offset); +return 1; +} +} + +return 0; +} + static int kvm_set_user_memory_region(KVMState *s, KVMSlot *slot) { struct kvm_userspace_memory_region mem; --- a/kvm.h +++ b/kvm.h @@ -195,4 +195,7 @@ int kvm_set_irqfd(int gsi, int fd, bool #endif int kvm_set_ioeventfd_pio_word(int fd, uint16_t adr, uint16_t val, bool assign); + +int kvm_physical_memory_addr_from_ram(KVMState *s, ram_addr_t ram_addr, + target_phys_addr_t *phys_addr); #endif -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] KVM: Expose MCE control MSRs to userspace
Sorry for late. On Thu, 2010-07-08 at 17:12 +0800, Avi Kivity wrote: [..snip..] And if we clear MSR_IA32_MCG_CTL, the machine check reporting is disabled according to SDM Vol 3A, section 15.3.1.3 Won't the kernel reenable MCE? In my testing, the sequence MCE-reset-MCE worked after the patch (whereas it would fail without it). Yes. Because kernel will reenable it. But if we only clear MSR_IA32_MCG_STATUS only, MCE-reset-MCE should work too. What happens if we reboot into a kernel that doesn't enable MCE? I guess it doesn't matter: the new kernel will keep cr4.mce cleared and thus MCE will be blocked. I'd like to keep the patch as is, so live migration works for MCE (we'll need to add bank support). I think there's no problem clearing _CTL on reset. If there is, we can patch qemu not to clear the MSR. Is that acceptable? Yes. At least for Linux that is OK. Don't know whether Windows will set _CTL during boot. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] KVM: Expose MCE control MSRs to userspace
On Thu, 2010-07-08 at 15:43 +0800, Avi Kivity wrote: On 07/08/2010 05:07 AM, Huang Ying wrote: static u32 emulated_msrs[] = { MSR_IA32_MISC_ENABLE, + MSR_IA32_MCG_STATUS, + MSR_IA32_MCG_CTL, We need only clear MSR_IA32_MCG_STATUS during reset, but should not clear MSR_IA32_MCG_CTL. Why not? According to Intel 64 and IA32 Architectures Software Developer's Manual (SDM) Vol 3A (Table 9-1), machine check MSRs should be sticky across reset. Except we need some special processing for MSR_IA32_MCG_STATUS. And if we clear MSR_IA32_MCG_CTL, the machine check reporting is disabled according to SDM Vol 3A, section 15.3.1.3 Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] KVM: Expose MCE control MSRs to userspace
Hi, Avi, On Wed, 2010-07-07 at 19:09 +0800, Avi Kivity wrote: Userspace needs to reset and save/restore these MSRs. The MCE banks are not exposed since their number varies from vcpu to vcpu. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/x86.c |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 7070b41..1e12cc5 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -744,6 +744,8 @@ static unsigned num_msrs_to_save; static u32 emulated_msrs[] = { MSR_IA32_MISC_ENABLE, + MSR_IA32_MCG_STATUS, + MSR_IA32_MCG_CTL, We need only clear MSR_IA32_MCG_STATUS during reset, but should not clear MSR_IA32_MCG_CTL. }; static int set_efer(struct kvm_vcpu *vcpu, u64 efer) Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[BUGFIX] kvm, Fix a race condition for usage of is_hwpoison_address
is_hwpoison_address accesses the page table, so the caller must hold current-mm-mmap_sem in read mode. So fix its usage in hav_to_pfn of kvm accordingly. Comments on is_hwpoison_address are added to remind other users. Reported-by: Avi Kivity a...@redhat.com Signed-off-by: Huang Ying ying.hu...@intel.com --- mm/memory-failure.c |3 +++ virt/kvm/kvm_main.c |3 ++- 2 files changed, 5 insertions(+), 1 deletion(-) --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1298,6 +1298,9 @@ done: return ret; } +/* + * The caller must hold current-mm-mmap_sem in read mode. + */ int is_hwpoison_address(unsigned long addr) { pgd_t *pgdp; --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -955,12 +955,13 @@ static pfn_t hva_to_pfn(struct kvm *kvm, if (unlikely(npages != 1)) { struct vm_area_struct *vma; + down_read(current-mm-mmap_sem); if (is_hwpoison_address(addr)) { + up_read(current-mm-mmap_sem); get_page(hwpoison_page); return page_to_pfn(hwpoison_page); } - down_read(current-mm-mmap_sem); vma = find_vma(current-mm, addr); if (vma == NULL || addr vma-vm_start || -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/6] KVM: MMU: introduce gfn_to_page_atomic() and gfn_to_pfn_atomic()
On Tue, Jun 15, 2010 at 7:22 PM, Avi Kivity a...@redhat.com wrote: btw, is_hwpoison_address() is racy. Â While it looks up the address, some other task can unmap the page tables under us. Andi/Huang? One way of fixing it is get_user_pages_ptes_fast(), which also returns the pte, also atomically. Â I want it for other reasons as well (respond to a read fault by gupping the page for read, but allowing write access if the pte indicates it is writeable). Yes. is_hwpoison_address() is racy. But I think it is not absolutely necessary to call is_hwpoison_address() in hva_to_pfn_atomic(), is it? For is_hwpoison_address() in hva_to_pfn(), we can protect it with mmap_sem. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -v3] KVM, Fix QEMU-KVM is killed by guest SRAO MCE (resend)
In common cases, guest SRAO MCE will cause corresponding poisoned page be un-mapped and SIGBUS be sent to QEMU-KVM, then QEMU-KVM will relay the MCE to guest OS. But it is reported that if the poisoned page is accessed in guest after un-mapped and before MCE is relayed to guest OS, QEMU-KVM will be killed. The reason is as follow. Because poisoned page has been un-mapped, guest access will cause guest exit and kvm_mmu_page_fault will be called. kvm_mmu_page_fault can not get the poisoned page for fault address, so kernel and user space MMIO processing is tried in turn. In user MMIO processing, poisoned page is accessed again, then QEMU-KVM is killed by force_sig_info. To fix the bug, kvm_mmu_page_fault send HWPOISON signal to QEMU-KVM and do not try kernel and user space MMIO processing for poisoned page. Changelog: v3: - Make is_hwpoison_address workable for pud_large or pmd_large address. v2: - Use page table walker to determine whether the virtual address is poisoned to avoid change user space interface (via changing get_user_pages). - Wrap bad page processing into kvm_handle_bad_page to avoid code duplicating. Reported-by: Max Asbock masb...@linux.vnet.ibm.com Signed-off-by: Huang Ying ying.hu...@intel.com --- arch/x86/kvm/mmu.c | 34 ++ arch/x86/kvm/paging_tmpl.h |7 ++- include/linux/kvm_host.h |1 + include/linux/mm.h |8 mm/memory-failure.c| 30 ++ virt/kvm/kvm_main.c| 30 -- 6 files changed, 95 insertions(+), 15 deletions(-) --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -32,6 +32,7 @@ #include linux/compiler.h #include linux/srcu.h #include linux/slab.h +#include linux/uaccess.h #include asm/page.h #include asm/cmpxchg.h @@ -1953,6 +1954,27 @@ static int __direct_map(struct kvm_vcpu return pt_write; } +static void kvm_send_hwpoison_signal(struct kvm *kvm, gfn_t gfn) +{ + char buf[1]; + void __user *hva; + int r; + + /* Touch the page, so send SIGBUS */ + hva = (void __user *)gfn_to_hva(kvm, gfn); + r = copy_from_user(buf, hva, 1); +} + +static int kvm_handle_bad_page(struct kvm *kvm, gfn_t gfn, pfn_t pfn) +{ + kvm_release_pfn_clean(pfn); + if (is_hwpoison_pfn(pfn)) { + kvm_send_hwpoison_signal(kvm, gfn); + return 0; + } + return 1; +} + static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn) { int r; @@ -1976,10 +1998,8 @@ static int nonpaging_map(struct kvm_vcpu pfn = gfn_to_pfn(vcpu-kvm, gfn); /* mmio */ - if (is_error_pfn(pfn)) { - kvm_release_pfn_clean(pfn); - return 1; - } + if (is_error_pfn(pfn)) + return kvm_handle_bad_page(vcpu-kvm, gfn, pfn); spin_lock(vcpu-kvm-mmu_lock); if (mmu_notifier_retry(vcpu, mmu_seq)) @@ -2191,10 +2211,8 @@ static int tdp_page_fault(struct kvm_vcp mmu_seq = vcpu-kvm-mmu_notifier_seq; smp_rmb(); pfn = gfn_to_pfn(vcpu-kvm, gfn); - if (is_error_pfn(pfn)) { - kvm_release_pfn_clean(pfn); - return 1; - } + if (is_error_pfn(pfn)) + return kvm_handle_bad_page(vcpu-kvm, gfn, pfn); spin_lock(vcpu-kvm-mmu_lock); if (mmu_notifier_retry(vcpu, mmu_seq)) goto out_unlock; --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -430,11 +430,8 @@ static int FNAME(page_fault)(struct kvm_ pfn = gfn_to_pfn(vcpu-kvm, walker.gfn); /* mmio */ - if (is_error_pfn(pfn)) { - pgprintk(gfn %lx is mmio\n, walker.gfn); - kvm_release_pfn_clean(pfn); - return 1; - } + if (is_error_pfn(pfn)) + return kvm_handle_bad_page(vcpu-kvm, walker.gfn, pfn); spin_lock(vcpu-kvm-mmu_lock); if (mmu_notifier_retry(vcpu, mmu_seq)) --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -266,6 +266,7 @@ extern pfn_t bad_pfn; int is_error_page(struct page *page); int is_error_pfn(pfn_t pfn); +int is_hwpoison_pfn(pfn_t pfn); int kvm_is_error_hva(unsigned long addr); int kvm_set_memory_region(struct kvm *kvm, struct kvm_userspace_memory_region *mem, --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1465,6 +1465,14 @@ extern int sysctl_memory_failure_recover extern void shake_page(struct page *p, int access); extern atomic_long_t mce_bad_pages; extern int soft_offline_page(struct page *page, int flags); +#ifdef CONFIG_MEMORY_FAILURE +int is_hwpoison_address(unsigned long addr); +#else +static inline int is_hwpoison_address(unsigned long addr) +{ + return 0; +} +#endif extern void dump_page(struct page *page); --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -45,6 +45,7 @@ #include linux/page-isolation.h
[PATCH -v3] KVM, Fix QEMU-KVM is killed by guest SRAO MCE
In common cases, guest SRAO MCE will cause corresponding poisoned page be un-mapped and SIGBUS be sent to QEMU-KVM, then QEMU-KVM will relay the MCE to guest OS. But it is reported that if the poisoned page is accessed in guest after un-mapped and before MCE is relayed to guest OS, QEMU-KVM will be killed. The reason is as follow. Because poisoned page has been un-mapped, guest access will cause guest exit and kvm_mmu_page_fault will be called. kvm_mmu_page_fault can not get the poisoned page for fault address, so kernel and user space MMIO processing is tried in turn. In user MMIO processing, poisoned page is accessed again, then QEMU-KVM is killed by force_sig_info. To fix the bug, kvm_mmu_page_fault send HWPOISON signal to QEMU-KVM and do not try kernel and user space MMIO processing for poisoned page. Changelog: v3: - Make is_hwpoison_address workable for pud_large or pmd_large address. v2: - Use page table walker to determine whether the virtual address is poisoned to avoid change user space interface (via changing get_user_pages). - Wrap bad page processing into kvm_handle_bad_page to avoid code duplicating. Reported-by: Max Asbock masb...@linux.vnet.ibm.com Signed-off-by: Huang Ying ying.hu...@intel.com --- arch/x86/kvm/mmu.c | 34 ++ arch/x86/kvm/paging_tmpl.h |7 ++- include/linux/kvm_host.h |1 + include/linux/mm.h |8 mm/memory-failure.c| 30 ++ virt/kvm/kvm_main.c| 30 -- 6 files changed, 95 insertions(+), 15 deletions(-) --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -32,6 +32,7 @@ #include linux/compiler.h #include linux/srcu.h #include linux/slab.h +#include linux/uaccess.h #include asm/page.h #include asm/cmpxchg.h @@ -1975,6 +1976,27 @@ static int __direct_map(struct kvm_vcpu return pt_write; } +static void kvm_send_hwpoison_signal(struct kvm *kvm, gfn_t gfn) +{ + char buf[1]; + void __user *hva; + int r; + + /* Touch the page, so send SIGBUS */ + hva = (void __user *)gfn_to_hva(kvm, gfn); + r = copy_from_user(buf, hva, 1); +} + +static int kvm_handle_bad_page(struct kvm *kvm, gfn_t gfn, pfn_t pfn) +{ + kvm_release_pfn_clean(pfn); + if (is_hwpoison_pfn(pfn)) { + kvm_send_hwpoison_signal(kvm, gfn); + return 0; + } + return 1; +} + static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn) { int r; @@ -1998,10 +2020,8 @@ static int nonpaging_map(struct kvm_vcpu pfn = gfn_to_pfn(vcpu-kvm, gfn); /* mmio */ - if (is_error_pfn(pfn)) { - kvm_release_pfn_clean(pfn); - return 1; - } + if (is_error_pfn(pfn)) + return kvm_handle_bad_page(vcpu-kvm, gfn, pfn); spin_lock(vcpu-kvm-mmu_lock); if (mmu_notifier_retry(vcpu, mmu_seq)) @@ -2204,10 +2224,8 @@ static int tdp_page_fault(struct kvm_vcp mmu_seq = vcpu-kvm-mmu_notifier_seq; smp_rmb(); pfn = gfn_to_pfn(vcpu-kvm, gfn); - if (is_error_pfn(pfn)) { - kvm_release_pfn_clean(pfn); - return 1; - } + if (is_error_pfn(pfn)) + return kvm_handle_bad_page(vcpu-kvm, gfn, pfn); spin_lock(vcpu-kvm-mmu_lock); if (mmu_notifier_retry(vcpu, mmu_seq)) goto out_unlock; --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -424,11 +424,8 @@ static int FNAME(page_fault)(struct kvm_ pfn = gfn_to_pfn(vcpu-kvm, walker.gfn); /* mmio */ - if (is_error_pfn(pfn)) { - pgprintk(gfn %lx is mmio\n, walker.gfn); - kvm_release_pfn_clean(pfn); - return 1; - } + if (is_error_pfn(pfn)) + return kvm_handle_bad_page(vcpu-kvm, walker.gfn, pfn); spin_lock(vcpu-kvm-mmu_lock); if (mmu_notifier_retry(vcpu, mmu_seq)) --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -254,6 +254,7 @@ extern pfn_t bad_pfn; int is_error_page(struct page *page); int is_error_pfn(pfn_t pfn); +int is_hwpoison_pfn(pfn_t pfn); int kvm_is_error_hva(unsigned long addr); int kvm_set_memory_region(struct kvm *kvm, struct kvm_userspace_memory_region *mem, --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1464,6 +1464,14 @@ extern int sysctl_memory_failure_recover extern void shake_page(struct page *p, int access); extern atomic_long_t mce_bad_pages; extern int soft_offline_page(struct page *page, int flags); +#ifdef CONFIG_MEMORY_FAILURE +int is_hwpoison_address(unsigned long addr); +#else +static inline int is_hwpoison_address(unsigned long addr) +{ + return 0; +} +#endif extern void dump_page(struct page *page); --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -45,6 +45,7 @@ #include linux/page-isolation.h
Re: [PATCH -v2] KVM, Fix QEMU-KVM is killed by guest SRAO MCE
On Fri, 2010-05-14 at 05:43 +0800, Marcelo Tosatti wrote: On Wed, May 12, 2010 at 02:44:03PM +0800, Huang Ying wrote: @@ -1975,6 +1976,27 @@ static int __direct_map(struct kvm_vcpu return pt_write; } +static void kvm_send_hwpoison_signal(struct kvm *kvm, gfn_t gfn) +{ + char buf[1]; + void __user *hva; + int r; + + /* Touch the page, so send SIGBUS */ + hva = (void __user *)gfn_to_hva(kvm, gfn); + r = copy_from_user(buf, hva, 1); +} A SIGBUS signal has been raised by memory poisoning already, so i don't see why this is needed? To avoid the MMIO processing in userspace before the MCE is sent to the guest you can just return -EAGAIN from the page fault handlers back to kvm_mmu_page_fault. The SIGBUS signal is necessary here because this SIGBUS is SRAR (Action Requirement) while the previously sent one is SRAO (Action Optional). They have different semantics and processed differently in QEMU-KVM and guest OS. For example the guest Linux kernel may ignore SRAO MCE (raised by QEMU-KVM after receiving SRAO SIGBUS), because it is optional, but for SRAR MCE (raised by QEMU-KVM after receiving SRAR SIGBUS) the guest Linux kernel must kill corresponding application or go panic. +int is_hwpoison_address(unsigned long addr) +{ + pgd_t *pgdp; + pud_t *pudp; + pmd_t *pmdp; + pte_t pte, *ptep; + swp_entry_t entry; + + pgdp = pgd_offset(current-mm, addr); + if (!pgd_present(*pgdp)) + return 0; + pudp = pud_offset(pgdp, addr); + if (!pud_present(*pudp)) + return 0; + pmdp = pmd_offset(pudp, addr); + if (!pmd_present(*pmdp)) + return 0; Need to bail out if pmd is huge. Yes. I will change this. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -v2] KVM, Fix QEMU-KVM is killed by guest SRAO MCE
In common cases, guest SRAO MCE will cause corresponding poisoned page be un-mapped and SIGBUS be sent to QEMU-KVM, then QEMU-KVM will relay the MCE to guest OS. But it is reported that if the poisoned page is accessed in guest after un-mapped and before MCE is relayed to guest OS, QEMU-KVM will be killed. The reason is as follow. Because poisoned page has been un-mapped, guest access will cause guest exit and kvm_mmu_page_fault will be called. kvm_mmu_page_fault can not get the poisoned page for fault address, so kernel and user space MMIO processing is tried in turn. In user MMIO processing, poisoned page is accessed again, then QEMU-KVM is killed by force_sig_info. To fix the bug, kvm_mmu_page_fault send HWPOISON signal to QEMU-KVM and do not try kernel and user space MMIO processing for poisoned page. Changelog: v2: - Use page table walker to determine whether the virtual address is poisoned to avoid change user space interface (via changing get_user_pages). - Wrap bad page processing into kvm_handle_bad_page to avoid code duplicating. Reported-by: Max Asbock masb...@linux.vnet.ibm.com Signed-off-by: Huang Ying ying.hu...@intel.com --- arch/x86/kvm/mmu.c | 34 ++ arch/x86/kvm/paging_tmpl.h |7 ++- include/linux/kvm_host.h |1 + include/linux/mm.h |8 mm/memory-failure.c| 28 virt/kvm/kvm_main.c| 30 -- 6 files changed, 93 insertions(+), 15 deletions(-) --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -32,6 +32,7 @@ #include linux/compiler.h #include linux/srcu.h #include linux/slab.h +#include linux/uaccess.h #include asm/page.h #include asm/cmpxchg.h @@ -1975,6 +1976,27 @@ static int __direct_map(struct kvm_vcpu return pt_write; } +static void kvm_send_hwpoison_signal(struct kvm *kvm, gfn_t gfn) +{ + char buf[1]; + void __user *hva; + int r; + + /* Touch the page, so send SIGBUS */ + hva = (void __user *)gfn_to_hva(kvm, gfn); + r = copy_from_user(buf, hva, 1); +} + +static int kvm_handle_bad_page(struct kvm *kvm, gfn_t gfn, pfn_t pfn) +{ + kvm_release_pfn_clean(pfn); + if (is_hwpoison_pfn(pfn)) { + kvm_send_hwpoison_signal(kvm, gfn); + return 0; + } + return 1; +} + static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn) { int r; @@ -1998,10 +2020,8 @@ static int nonpaging_map(struct kvm_vcpu pfn = gfn_to_pfn(vcpu-kvm, gfn); /* mmio */ - if (is_error_pfn(pfn)) { - kvm_release_pfn_clean(pfn); - return 1; - } + if (is_error_pfn(pfn)) + return kvm_handle_bad_page(vcpu-kvm, gfn, pfn); spin_lock(vcpu-kvm-mmu_lock); if (mmu_notifier_retry(vcpu, mmu_seq)) @@ -2204,10 +2224,8 @@ static int tdp_page_fault(struct kvm_vcp mmu_seq = vcpu-kvm-mmu_notifier_seq; smp_rmb(); pfn = gfn_to_pfn(vcpu-kvm, gfn); - if (is_error_pfn(pfn)) { - kvm_release_pfn_clean(pfn); - return 1; - } + if (is_error_pfn(pfn)) + return kvm_handle_bad_page(vcpu-kvm, gfn, pfn); spin_lock(vcpu-kvm-mmu_lock); if (mmu_notifier_retry(vcpu, mmu_seq)) goto out_unlock; --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -424,11 +424,8 @@ static int FNAME(page_fault)(struct kvm_ pfn = gfn_to_pfn(vcpu-kvm, walker.gfn); /* mmio */ - if (is_error_pfn(pfn)) { - pgprintk(gfn %lx is mmio\n, walker.gfn); - kvm_release_pfn_clean(pfn); - return 1; - } + if (is_error_pfn(pfn)) + return kvm_handle_bad_page(vcpu-kvm, walker.gfn, pfn); spin_lock(vcpu-kvm-mmu_lock); if (mmu_notifier_retry(vcpu, mmu_seq)) --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -254,6 +254,7 @@ extern pfn_t bad_pfn; int is_error_page(struct page *page); int is_error_pfn(pfn_t pfn); +int is_hwpoison_pfn(pfn_t pfn); int kvm_is_error_hva(unsigned long addr); int kvm_set_memory_region(struct kvm *kvm, struct kvm_userspace_memory_region *mem, --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -92,6 +92,9 @@ static bool kvm_rebooting; static bool largepages_enabled = true; +struct page *hwpoison_page; +pfn_t hwpoison_pfn; + inline int kvm_is_mmio_pfn(pfn_t pfn) { if (pfn_valid(pfn)) { @@ -809,16 +812,22 @@ EXPORT_SYMBOL_GPL(kvm_disable_largepages int is_error_page(struct page *page) { - return page == bad_page; + return page == bad_page || page == hwpoison_page; } EXPORT_SYMBOL_GPL(is_error_page); int is_error_pfn(pfn_t pfn) { - return pfn == bad_pfn; + return pfn == bad_pfn || pfn == hwpoison_pfn; } EXPORT_SYMBOL_GPL(is_error_pfn); +int
Re: [PATCH] Ignore SRAO MCE if another MCE is being processed
On Wed, 2010-04-28 at 00:12 +0800, Marcelo Tosatti wrote: On Tue, Apr 27, 2010 at 03:10:49PM +0800, Huang Ying wrote: In common cases, guest SRAO MCE will cause corresponding poisoned page be un-mapped in host and SIGBUS be sent to QEMU-KVM, then QEMU-KVM will relay the MCE to guest OS. But it is possible that the poisoned page is accessed in guest after un-mapped in host and before MCE is relayed to guest OS. So that, the SRAR SIGBUS is sent to QEMU-KVM before the SRAO SIGBUS, and if QEMU-KVM relays them to guest OS one by one, guest system may reset, because the SRAO MCE may be triggered while the SRAR MCE is being processed. In fact, the SRAO MCE can be ignored in this situation, so that the guest system is given opportunity to survive. Signed-off-by: Huang Ying ying.hu...@intel.com --- qemu-kvm.c | 28 1 file changed, 28 insertions(+) --- a/qemu-kvm.c +++ b/qemu-kvm.c @@ -1610,6 +1610,19 @@ static void flush_queued_work(CPUState * pthread_cond_broadcast(qemu_work_cond); } +static int kvm_mce_in_exception(CPUState *env) +{ +struct kvm_msr_entry msr_mcg_status = { +.index = MSR_MCG_STATUS, +}; +int r; + +r = kvm_get_msrs(env, msr_mcg_status, 1); +if (r == -1 || r == 0) +return -1; +return !!(msr_mcg_status.data MCG_STATUS_MCIP); +} + static void kvm_on_sigbus(CPUState *env, siginfo_t *siginfo) { #if defined(KVM_CAP_MCE) defined(TARGET_I386) @@ -1630,6 +1643,15 @@ static void kvm_on_sigbus(CPUState *env, mce.misc = (MCM_ADDR_PHYS 6) | 0xc; mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_EIPV; } else { +/* + * If there is an MCE excpetion being processed, ignore + * this SRAO MCE + */ +r = kvm_mce_in_exception(env); +if (r == -1) +fprintf(stderr, Failed to get MCE status\n); +else if (r) +return; /* Fake an Intel architectural Memory scrubbing UCR */ mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN | MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S @@ -2475,6 +2497,12 @@ static void kvm_do_inject_x86_mce(void * struct kvm_x86_mce_data *data = _data; int r; +/* If there is an MCE excpetion being processed, ignore this SRAO MCE */ +r = kvm_mce_in_exception(data-env); +if (r == -1) +fprintf(stderr, Failed to get MCE status\n); +else if (r !(data-mce-status MCI_STATUS_AR)) +return; Don't you need to set the OVER bit in the MCI_STATUS register when this happens? The OVER bit is set when uncorrected error overwrite the corrected error. There is no specification for OVER bit for this situation. I just don't find benefit for it. Unrelated to this patch, it would be nice if you can share the testing code. There is some test script and document for this in: git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git test script is in kvm directory, testing document is kvm/README Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] KVM, Fix QEMU-KVM is killed by guest SRAO MCE
On Wed, 2010-04-28 at 17:47 +0800, Avi Kivity wrote: On 04/28/2010 05:56 AM, Huang Ying wrote: Just want to use the side effect of copy_from_user, SIGBUS will be sent to current process because the page touched is marked as poisoned. That is, failure is expected, so the return value is not checked. What if the failure doesn't happen? Say, someone mmap()ed over the page. Sorry, not get your idea clearly. hva is re-mmap()ed? We just read the hva, not write, so I think it should be OK here. We don't generate a signal in this case. Does the code continue to work correctly (not sure what correctly is in this case... should probably just continue). There's also the possibility of -EFAULT. I think signal should be generated for copy_from_user, because the hva is poisoned now. The signal will not generated only if the hva is re-mmap()ped to some other physical page, but this should be impossible unless we have memory hotadd/hotremove in KVM. If the signal is not generated, lost or overwritten, guest will continue, and if the hva is still poisoned, the page fault will be triggered again; if the hva is not poisoned, there will be no further page fault. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Ignore SRAO MCE if another MCE is being processed
In common cases, guest SRAO MCE will cause corresponding poisoned page be un-mapped in host and SIGBUS be sent to QEMU-KVM, then QEMU-KVM will relay the MCE to guest OS. But it is possible that the poisoned page is accessed in guest after un-mapped in host and before MCE is relayed to guest OS. So that, the SRAR SIGBUS is sent to QEMU-KVM before the SRAO SIGBUS, and if QEMU-KVM relays them to guest OS one by one, guest system may reset, because the SRAO MCE may be triggered while the SRAR MCE is being processed. In fact, the SRAO MCE can be ignored in this situation, so that the guest system is given opportunity to survive. Signed-off-by: Huang Ying ying.hu...@intel.com --- qemu-kvm.c | 28 1 file changed, 28 insertions(+) --- a/qemu-kvm.c +++ b/qemu-kvm.c @@ -1610,6 +1610,19 @@ static void flush_queued_work(CPUState * pthread_cond_broadcast(qemu_work_cond); } +static int kvm_mce_in_exception(CPUState *env) +{ +struct kvm_msr_entry msr_mcg_status = { +.index = MSR_MCG_STATUS, +}; +int r; + +r = kvm_get_msrs(env, msr_mcg_status, 1); +if (r == -1 || r == 0) +return -1; +return !!(msr_mcg_status.data MCG_STATUS_MCIP); +} + static void kvm_on_sigbus(CPUState *env, siginfo_t *siginfo) { #if defined(KVM_CAP_MCE) defined(TARGET_I386) @@ -1630,6 +1643,15 @@ static void kvm_on_sigbus(CPUState *env, mce.misc = (MCM_ADDR_PHYS 6) | 0xc; mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_EIPV; } else { +/* + * If there is an MCE excpetion being processed, ignore + * this SRAO MCE + */ +r = kvm_mce_in_exception(env); +if (r == -1) +fprintf(stderr, Failed to get MCE status\n); +else if (r) +return; /* Fake an Intel architectural Memory scrubbing UCR */ mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN | MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S @@ -2475,6 +2497,12 @@ static void kvm_do_inject_x86_mce(void * struct kvm_x86_mce_data *data = _data; int r; +/* If there is an MCE excpetion being processed, ignore this SRAO MCE */ +r = kvm_mce_in_exception(data-env); +if (r == -1) +fprintf(stderr, Failed to get MCE status\n); +else if (r !(data-mce-status MCI_STATUS_AR)) +return; r = kvm_set_mce(data-env, data-mce); if (r 0) { perror(kvm_set_mce FAILED); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] KVM, Fix QEMU-KVM is killed by guest SRAO MCE
On Tue, 2010-04-27 at 15:47 +0800, Avi Kivity wrote: (please copy kvm@vger.kernel.org on kvm patches) Sorry, will do that for all future patches. On 04/27/2010 10:04 AM, Huang Ying wrote: +static void kvm_send_hwpoison_signal(struct kvm *kvm, gfn_t gfn) +{ + char buf[1]; + void __user *hva; + int r; + + /* Touch the page, so send SIGBUS */ + hva = (void __user *)gfn_to_hva(kvm, gfn); + r = copy_from_user(buf, hva, 1); No error check? What will a copy_from_user() of poisoned page expected to return? Best to return -EFAULT on failure for consistency. Just want to use the side effect of copy_from_user, SIGBUS will be sent to current process because the page touched is marked as poisoned. That is, failure is expected, so the return value is not checked. +} + static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn) { int r; @@ -1997,7 +2009,11 @@ static int nonpaging_map(struct kvm_vcpu /* mmio */ if (is_error_pfn(pfn)) { kvm_release_pfn_clean(pfn); - return 1; + if (is_hwpoison_pfn(pfn)) { + kvm_send_hwpoison_signal(vcpu-kvm, gfn); + return 0; + } else + return 1; } This is duplicated several times. Please introduce a kvm_handle_bad_page(): if (is_error_pfn(pfn)) return kvm_handle_bad_page(vcpu-kvm, gfn, pfn); OK. Will do that. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] KVM, Fix QEMU-KVM is killed by guest SRAO MCE
On Tue, 2010-04-27 at 17:30 +0800, Avi Kivity wrote: On 04/27/2010 12:25 PM, Huang Ying wrote: On 04/27/2010 10:04 AM, Huang Ying wrote: +static void kvm_send_hwpoison_signal(struct kvm *kvm, gfn_t gfn) +{ + char buf[1]; + void __user *hva; + int r; + + /* Touch the page, so send SIGBUS */ + hva = (void __user *)gfn_to_hva(kvm, gfn); + r = copy_from_user(buf, hva, 1); No error check? What will a copy_from_user() of poisoned page expected to return? Best to return -EFAULT on failure for consistency. Just want to use the side effect of copy_from_user, SIGBUS will be sent to current process because the page touched is marked as poisoned. That is, failure is expected, so the return value is not checked. What if the failure doesn't happen? Say, someone mmap()ed over the page. Sorry, not get your idea clearly. hva is re-mmap()ed? We just read the hva, not write, so I think it should be OK here. btw, better to use (void)copy_from_user(...) instead to avoid the initialized but not used warning the compiler may generate. OK. Will do that. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -v3] Add savevm/loadvm support for MCE
MCE registers are saved/load into/from CPUState in kvm_arch_save/load_regs. To simulate the MCG_STATUS clearing upon reset, MSR_MCG_STATUS is set to 0 for KVM_PUT_RESET_STATE. v3: - use msrs[] in kvm_arch_load/save_regs and get_msr_entry directly. v2: - Rebased on new CPU registers save/load framework. Signed-off-by: Huang Ying ying.hu...@intel.com --- qemu-kvm-x86.c | 36 1 file changed, 36 insertions(+) --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -748,7 +748,22 @@ static int get_msr_entry(struct kvm_msr_ case MSR_KVM_WALL_CLOCK: env-wall_clock_msr = entry-data; break; +#ifdef KVM_CAP_MCE +case MSR_MCG_STATUS: +env-mcg_status = entry-data; +break; +case MSR_MCG_CTL: +env-mcg_ctl = entry-data; +break; +#endif default: +#ifdef KVM_CAP_MCE +if (entry-index = MSR_MC0_CTL \ +entry-index MSR_MC0_CTL + (env-mcg_cap 0xff) * 4) { +env-mce_banks[entry-index - MSR_MC0_CTL] = entry-data; +break; +} +#endif printf(Warning unknown msr index 0x%x\n, entry-index); return 1; } @@ -979,6 +994,18 @@ void kvm_arch_load_regs(CPUState *env, i set_msr_entry(msrs[n++], MSR_KVM_SYSTEM_TIME, env-system_time_msr); set_msr_entry(msrs[n++], MSR_KVM_WALL_CLOCK, env-wall_clock_msr); } +#ifdef KVM_CAP_MCE +if (env-mcg_cap) { +if (level == KVM_PUT_RESET_STATE) +set_msr_entry(msrs[n++], MSR_MCG_STATUS, env-mcg_status); +else if (level == KVM_PUT_FULL_STATE) { +set_msr_entry(msrs[n++], MSR_MCG_STATUS, env-mcg_status); +set_msr_entry(msrs[n++], MSR_MCG_CTL, env-mcg_ctl); +for (i = 0; i (env-mcg_cap 0xff); i++) +set_msr_entry(msrs[n++], MSR_MC0_CTL + i, env-mce_banks[i]); +} +} +#endif rc = kvm_set_msrs(env, msrs, n); if (rc == -1) @@ -1144,6 +1171,15 @@ void kvm_arch_save_regs(CPUState *env) msrs[n++].index = MSR_KVM_SYSTEM_TIME; msrs[n++].index = MSR_KVM_WALL_CLOCK; +#ifdef KVM_CAP_MCE +if (env-mcg_cap) { +msrs[n++].index = MSR_MCG_STATUS; +msrs[n++].index = MSR_MCG_CTL; +for (i = 0; i (env-mcg_cap 0xff) * 4; i++) +msrs[n++].index = MSR_MC0_CTL + i; +} +#endif + rc = kvm_get_msrs(env, msrs, n); if (rc == -1) { perror(kvm_get_msrs FAILED); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2] Add savevm/loadvm support for MCE
On Tue, 2010-03-02 at 16:09 +0800, Jan Kiszka wrote: Huang Ying wrote: MCE registers are saved/load into/from CPUState in kvm_arch_save/load_regs. To simulate the MCG_STATUS clearing upon reset, MSR_MCG_STATUS is set to 0 for KVM_PUT_RESET_STATE. v2: - Rebased on new CPU registers save/load framework. Yep, much closer. :) Signed-off-by: Huang Ying ying.hu...@intel.com --- qemu-kvm-x86.c | 43 +++ 1 file changed, 43 insertions(+) --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -979,11 +979,33 @@ void kvm_arch_load_regs(CPUState *env, i set_msr_entry(msrs[n++], MSR_KVM_SYSTEM_TIME, env-system_time_msr); set_msr_entry(msrs[n++], MSR_KVM_WALL_CLOCK, env-wall_clock_msr); } +#ifdef KVM_CAP_MCE +if (env-mcg_cap level == KVM_PUT_RESET_STATE) { +/* + * MCG_STATUS should reset to 0 after reset, while other MCE + * registers should be unchanged + */ +set_msr_entry(msrs[n++], MSR_MCG_STATUS, 0); For the sake of consistency, just write mcg_status here (it's properly updated in cpu_reset). OK. +} +#endif rc = kvm_set_msrs(env, msrs, n); if (rc == -1) perror(kvm_set_msrs FAILED); +#ifdef KVM_CAP_MCE +if (env-mcg_cap level == KVM_PUT_FULL_STATE) { +n = 0; +set_msr_entry(msrs[n++], MSR_MCG_STATUS, env-mcg_status); +set_msr_entry(msrs[n++], MSR_MCG_CTL, env-mcg_ctl); You can move this block up, reusing the kvm_set_msrs above. But... +for (i = 0; i (env-mcg_cap 0xff); i++) ...this requires some care. We have space for writing up to 100 registers in our msrs array. You may have to extend it unless this number is much smaller in reality. The default number of MCE banks is 10, this means 42 entries. So I think it is safer to use another kvm_set_msrs. And the stack space is limited too. +set_msr_entry(msrs[n++], MSR_MC0_CTL + i, env-mce_banks[i]); +rc = kvm_set_msrs(env, msrs, n); +if (rc == -1) +perror(kvm_set_msrs FAILED); +} +#endif + if (level = KVM_PUT_RESET_STATE) { kvm_arch_load_mpstate(env); kvm_load_lapic(env); @@ -1155,6 +1177,27 @@ void kvm_arch_save_regs(CPUState *env) return; } } + +#ifdef KVM_CAP_MCE +if (env-mcg_cap) { No need to check for msg_cap, the kernel will ignore unknown MSRs. And some error message will be printed in kernel log. Is it OK? +msrs[0].index = MSR_MCG_STATUS; +msrs[1].index = MSR_MCG_CTL; +n = (env-mcg_cap 0xff) * 4; +for (i = 0; i n; i++) Same are above, we may run out of array space. +msrs[2 + i].index = MSR_MC0_CTL + i; + +rc = kvm_get_msrs(env, msrs, n + 2); +if (rc == -1) +perror(kvm_get_msrs FAILED); +else { +env-mcg_status = msrs[0].data; +env-mcg_ctl = msrs[1].data; +for (i = 0; i n; i++) +env-mce_banks[i] = msrs[2 + i].data; +} Please split this block into setup and MSR transfer, and then merge it into the existing MSR readout to avoid calling kvm_get_msrs twice. I will do that if the stack space is not an issue. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -v2] Add savevm/loadvm support for MCE
MCE registers are saved/load into/from CPUState in kvm_arch_save/load_regs. To simulate the MCG_STATUS clearing upon reset, MSR_MCG_STATUS is set to 0 for KVM_PUT_RESET_STATE. v2: - Rebased on new CPU registers save/load framework. Signed-off-by: Huang Ying ying.hu...@intel.com --- qemu-kvm-x86.c | 43 +++ 1 file changed, 43 insertions(+) --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -979,11 +979,33 @@ void kvm_arch_load_regs(CPUState *env, i set_msr_entry(msrs[n++], MSR_KVM_SYSTEM_TIME, env-system_time_msr); set_msr_entry(msrs[n++], MSR_KVM_WALL_CLOCK, env-wall_clock_msr); } +#ifdef KVM_CAP_MCE +if (env-mcg_cap level == KVM_PUT_RESET_STATE) { +/* + * MCG_STATUS should reset to 0 after reset, while other MCE + * registers should be unchanged + */ +set_msr_entry(msrs[n++], MSR_MCG_STATUS, 0); +} +#endif rc = kvm_set_msrs(env, msrs, n); if (rc == -1) perror(kvm_set_msrs FAILED); +#ifdef KVM_CAP_MCE +if (env-mcg_cap level == KVM_PUT_FULL_STATE) { +n = 0; +set_msr_entry(msrs[n++], MSR_MCG_STATUS, env-mcg_status); +set_msr_entry(msrs[n++], MSR_MCG_CTL, env-mcg_ctl); +for (i = 0; i (env-mcg_cap 0xff); i++) +set_msr_entry(msrs[n++], MSR_MC0_CTL + i, env-mce_banks[i]); +rc = kvm_set_msrs(env, msrs, n); +if (rc == -1) +perror(kvm_set_msrs FAILED); +} +#endif + if (level = KVM_PUT_RESET_STATE) { kvm_arch_load_mpstate(env); kvm_load_lapic(env); @@ -1155,6 +1177,27 @@ void kvm_arch_save_regs(CPUState *env) return; } } + +#ifdef KVM_CAP_MCE +if (env-mcg_cap) { +msrs[0].index = MSR_MCG_STATUS; +msrs[1].index = MSR_MCG_CTL; +n = (env-mcg_cap 0xff) * 4; +for (i = 0; i n; i++) +msrs[2 + i].index = MSR_MC0_CTL + i; + +rc = kvm_get_msrs(env, msrs, n + 2); +if (rc == -1) +perror(kvm_get_msrs FAILED); +else { +env-mcg_status = msrs[0].data; +env-mcg_ctl = msrs[1].data; +for (i = 0; i n; i++) +env-mce_banks[i] = msrs[2 + i].data; +} +} +#endif + kvm_arch_save_mpstate(env); kvm_save_lapic(env); kvm_get_vcpu_events(env); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Add savevm/loadvm support for MCE
MCE registers are saved/load into/from CPUState in kvm_arch_save/load_regs. Because all MCE registers except for MCG_STATUS should be preserved, MCE registers are saved before kvm_arch_load_regs in kvm_arch_cpu_reset. To simulate the MCG_STATUS clearing upon reset, env-mcg_status is set to 0 after saving. Signed-off-by: Huang Ying ying.hu...@intel.com --- qemu-kvm-x86.c | 54 ++ 1 file changed, 54 insertions(+) --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -803,6 +803,27 @@ static void get_seg(SegmentCache *lhs, c | (rhs-avl * DESC_AVL_MASK); } +static void kvm_load_mce_regs(CPUState *env) +{ +#ifdef KVM_CAP_MCE +struct kvm_msr_entry msrs[100]; +int rc, n, i; + +if (!env-mcg_cap) + return; + +n = 0; +set_msr_entry(msrs[n++], MSR_MCG_STATUS, env-mcg_status); +set_msr_entry(msrs[n++], MSR_MCG_CTL, env-mcg_ctl); +for (i = 0; i (env-mcg_cap 0xff) * 4; i++) +set_msr_entry(msrs[n++], MSR_MC0_CTL + i, env-mce_banks[i]); + +rc = kvm_set_msrs(env, msrs, n); +if (rc == -1) +perror(kvm_set_msrs FAILED); +#endif +} + void kvm_arch_load_regs(CPUState *env) { struct kvm_regs regs; @@ -922,6 +943,8 @@ void kvm_arch_load_regs(CPUState *env) if (rc == -1) perror(kvm_set_msrs FAILED); +kvm_load_mce_regs(env); + /* * Kernels before 2.6.33 (which correlates with !kvm_has_vcpu_events()) * overwrote flags.TF injected via SET_GUEST_DEBUG while updating GP regs. @@ -991,6 +1014,33 @@ void kvm_arch_load_mpstate(CPUState *env #endif } +static void kvm_save_mce_regs(CPUState *env) +{ +#ifdef KVM_CAP_MCE +struct kvm_msr_entry msrs[100]; +int rc, n, i; + +if (!env-mcg_cap) + return; + +msrs[0].index = MSR_MCG_STATUS; +msrs[1].index = MSR_MCG_CTL; +n = (env-mcg_cap 0xff) * 4; +for (i = 0; i n; i++) +msrs[2 + i].index = MSR_MC0_CTL + i; + +rc = kvm_get_msrs(env, msrs, n + 2); +if (rc == -1) +perror(kvm_set_msrs FAILED); +else { +env-mcg_status = msrs[0].data; +env-mcg_ctl = msrs[1].data; +for (i = 0; i n; i++) +env-mce_banks[i] = msrs[2 + i].data; +} +#endif +} + void kvm_arch_save_regs(CPUState *env) { struct kvm_regs regs; @@ -1148,6 +1198,7 @@ void kvm_arch_save_regs(CPUState *env) } } kvm_arch_save_mpstate(env); +kvm_save_mce_regs(env); } static void do_cpuid_ent(struct kvm_cpuid_entry2 *e, uint32_t function, @@ -1385,6 +1436,9 @@ void kvm_arch_push_nmi(void *opaque) void kvm_arch_cpu_reset(CPUState *env) { kvm_arch_reset_vcpu(env); +/* MCE registers except MCG_STATUS should be unchanged across reset */ +kvm_save_mce_regs(env); +env-mcg_status = 0; kvm_arch_load_regs(env); kvm_put_vcpu_events(env); if (!cpu_is_bsp(env)) { -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUGFIX] MCE: Fix bug of IA32_MCG_STATUS after system reset
On Wed, 2010-01-06 at 16:03 +0800, Avi Kivity wrote: On 01/06/2010 09:05 AM, Huang Ying wrote: @@ -1015,6 +1015,7 @@ void kvm_arch_load_regs(CPUState *env) #endif set_msr_entry(msrs[n++], MSR_KVM_SYSTEM_TIME, env-system_time_msr); set_msr_entry(msrs[n++], MSR_KVM_WALL_CLOCK, env-wall_clock_msr); +set_msr_entry(msrs[n++], MSR_MCG_STATUS, 0); Not sure why you reset this in kvm_arch_load_regs(). Shouldn't this be in the cpu reset code? I found kvm_arch_load_regs() is called by kvm_arch_cpu_reset(), which is called by qemu_kvm_system_reset(). It is not in cpu reset path? It is, but it is also called from many other places, which could cause this msr to be zeroed. A better solution is to allocate it a field in CPUState, load and save it in kvm_arch_*_regs, and zero it during reset. Yes. You are right. I will fix this. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUGFIX] MCE: Fix bug of IA32_MCG_STATUS after system reset
On Tue, 2010-01-05 at 18:44 +0800, Andi Kleen wrote: --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -1015,6 +1015,7 @@ void kvm_arch_load_regs(CPUState *env) #endif set_msr_entry(msrs[n++], MSR_KVM_SYSTEM_TIME, env-system_time_msr); set_msr_entry(msrs[n++], MSR_KVM_WALL_CLOCK, env-wall_clock_msr); +set_msr_entry(msrs[n++], MSR_MCG_STATUS, 0); Still need to keep EIPV and possibly RIPV, don't we? It appears that EIPV and RIPV is meaningless outside of MCE exception handler. But I will try to check the real hardware behavior. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[BUGFIX] MCE: Fix bug of IA32_MCG_STATUS after system reset
Now, if we inject a fatal MCE into guest OS, for example Linux, Linux will go panic and then reboot. But if we inject another MCE now, system will reset directly instead of go panic firstly, because MCG_STATUS.MCIP is set to 1 and not cleared after reboot. This is does not follow the behavior in real hardware. This patch fixes this via set IA32_MCG_STATUS to 0 during system reset. Signed-off-by: Huang Ying ying.hu...@intel.com --- qemu-kvm-x86.c |1 + 1 file changed, 1 insertion(+) --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -1015,6 +1015,7 @@ void kvm_arch_load_regs(CPUState *env) #endif set_msr_entry(msrs[n++], MSR_KVM_SYSTEM_TIME, env-system_time_msr); set_msr_entry(msrs[n++], MSR_KVM_WALL_CLOCK, env-wall_clock_msr); +set_msr_entry(msrs[n++], MSR_MCG_STATUS, 0); rc = kvm_set_msrs(env, msrs, n); if (rc == -1) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUGFIX] MCE: Fix bug of IA32_MCG_STATUS after system reset
Hi, Avi, On Tue, 2010-01-05 at 18:50 +0800, Avi Kivity wrote: On 01/05/2010 10:34 AM, Huang Ying wrote: Now, if we inject a fatal MCE into guest OS, for example Linux, Linux will go panic and then reboot. But if we inject another MCE now, system will reset directly instead of go panic firstly, because MCG_STATUS.MCIP is set to 1 and not cleared after reboot. This is does not follow the behavior in real hardware. This patch fixes this via set IA32_MCG_STATUS to 0 during system reset. Signed-off-by: Huang Yingying.hu...@intel.com --- qemu-kvm-x86.c |1 + 1 file changed, 1 insertion(+) --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -1015,6 +1015,7 @@ void kvm_arch_load_regs(CPUState *env) #endif set_msr_entry(msrs[n++], MSR_KVM_SYSTEM_TIME, env-system_time_msr); set_msr_entry(msrs[n++], MSR_KVM_WALL_CLOCK, env-wall_clock_msr); +set_msr_entry(msrs[n++], MSR_MCG_STATUS, 0); Not sure why you reset this in kvm_arch_load_regs(). Shouldn't this be in the cpu reset code? I found kvm_arch_load_regs() is called by kvm_arch_cpu_reset(), which is called by qemu_kvm_system_reset(). It is not in cpu reset path? Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Fix parameters of prctl
Because kenrel prctl implementation checks whether arg4 and arg5 are 0 for PR_MCE_KILL, qmeu-kvm should invoke prctl syscall as that. Reported-by: Max Asbock masb...@linux.vnet.ibm.com Signed-off-by: Huang Ying ying.hu...@intel.com --- qemu-kvm.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/qemu-kvm.c +++ b/qemu-kvm.c @@ -1984,7 +1984,7 @@ int kvm_init_ap(void) action.sa_flags = SA_SIGINFO; action.sa_sigaction = (void (*)(int, siginfo_t*, void*))sigbus_handler; sigaction(SIGBUS, action, NULL); -prctl(PR_MCE_KILL, 1, 1); +prctl(PR_MCE_KILL, 1, 1, 0, 0); return 0; } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v4] QEMU-KVM: MCE: Relay UCR MCE to guest
On Mon, 2009-09-21 at 18:08 +0800, Avi Kivity wrote: On 09/21/2009 05:43 AM, Huang Ying wrote: UCR (uncorrected recovery) MCE is supported in recent Intel CPUs, where some hardware error such as some memory error can be reported without PCC (processor context corrupted). To recover from such MCE, the corresponding memory will be unmapped, and all processes accessing the memory will be killed via SIGBUS. For KVM, if QEMU/KVM is killed, all guest processes will be killed too. So we relay SIGBUS from host OS to guest system via a UCR MCE injection. Then guest OS can isolate corresponding memory and kill necessary guest processes only. SIGBUS sent to main thread (not VCPU threads) will be broadcast to all VCPU threads as UCR MCE. --- a/qemu-kvm.c +++ b/qemu-kvm.c @@ -27,10 +27,23 @@ #includesys/mman.h #includesys/ioctl.h #includesignal.h +#includesys/signalfd.h This causes a build failure, since not all hosts have sys/signalfd.h, but more importantly: Maybe we can just add necessary fields to struct qemu_signalfd_siginfo. But this may be not portable. + +static void sigbus_handler(int n, struct signalfd_siginfo *siginfo, void *ctx) +{ Here you accept signalfd_siginfo, while + +memset(action, 0, sizeof(action)); +action.sa_flags = SA_SIGINFO; +action.sa_sigaction = (void (*)(int, siginfo_t*, void*))sigbus_handler; +sigaction(SIGBUS,action, NULL); +prctl(PR_MCE_KILL, 1, 1); return 0; here you arm the function with something that will send it a siginfo_t. So it looks like this is broken if a signal is ever received directly? But can this happen due to signalfd? Because SIGBUS is blocked, I think the signal handler will not be called directly, but from sigfd_handler. } @@ -1962,7 +2116,10 @@ static void sigfd_handler(void *opaque) } sigaction(info.ssi_signo, NULL,action); -if (action.sa_handler) +if ((action.sa_flags SA_SIGINFO) action.sa_sigaction) +action.sa_sigaction(info.ssi_signo, +(siginfo_t *)info, NULL); + else if (action.sa_handler) action.sa_handler(info.ssi_signo); The whole extract handler from sigaction and call it was a hack. The hack above (signalfd_siginfo vs siginfo_t) is for extract handler from sigaction and call it too. So I suggest to replace it with calling handler directly. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v3] QEMU-KVM: MCE: Relay UCR MCE to guest
On Sat, 2009-09-19 at 01:05 +0800, Marcelo Tosatti wrote: On Fri, Sep 18, 2009 at 03:58:59PM +0800, Huang Ying wrote: UCR (uncorrected recovery) MCE is supported in recent Intel CPUs, where some hardware error such as some memory error can be reported without PCC (processor context corrupted). To recover from such MCE, the corresponding memory will be unmapped, and all processes accessing the memory will be killed via SIGBUS. For KVM, if QEMU/KVM is killed, all guest processes will be killed too. So we relay SIGBUS from host OS to guest system via a UCR MCE injection. Then guest OS can isolate corresponding memory and kill necessary guest processes only. SIGBUS sent to main thread (not VCPU threads) will be broadcast to all VCPU threads as UCR MCE. v3: - Re-raise SIGBUS for SIGBUS not from MCE - Kill itself for error in kvm_inject_x86_mce This is broken, non-MCE SIGBUS causes qemu-kvm to call the sigbus handler in a loop. Sorry. I tested the wrong branch in previous development. I will fix this. BTW, how are you testing this and what guests have been tested? I use a self-made kernel module to simulate SIGBUS from MCE handler to qemu. Until now, only Linux guest has been tested. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -v4] QEMU-KVM: MCE: Relay UCR MCE to guest
UCR (uncorrected recovery) MCE is supported in recent Intel CPUs, where some hardware error such as some memory error can be reported without PCC (processor context corrupted). To recover from such MCE, the corresponding memory will be unmapped, and all processes accessing the memory will be killed via SIGBUS. For KVM, if QEMU/KVM is killed, all guest processes will be killed too. So we relay SIGBUS from host OS to guest system via a UCR MCE injection. Then guest OS can isolate corresponding memory and kill necessary guest processes only. SIGBUS sent to main thread (not VCPU threads) will be broadcast to all VCPU threads as UCR MCE. v4: - Fix SIGBUS re-raise implementation. v3: - Re-raise SIGBUS for SIGBUS not from MCE - Kill itself for error in kvm_inject_x86_mce v2: - Use qemu_ram_addr_from_host instead of self made one to covert from host address to guest RAM address. Thanks Anthony Liguori. Signed-off-by: Huang Ying ying.hu...@intel.com --- cpu-common.h |1 exec.c | 20 +++-- qemu-kvm.c | 195 +++ qemu-kvm.h |9 +- target-i386/cpu.h| 20 - target-i386/helper.c |2 6 files changed, 225 insertions(+), 22 deletions(-) --- a/qemu-kvm.c +++ b/qemu-kvm.c @@ -27,10 +27,23 @@ #include sys/mman.h #include sys/ioctl.h #include signal.h +#include sys/signalfd.h +#include sys/prctl.h #define false 0 #define true 1 +#ifndef PR_MCE_KILL +#define PR_MCE_KILL 33 +#endif + +#ifndef BUS_MCEERR_AR +#define BUS_MCEERR_AR 4 +#endif +#ifndef BUS_MCEERR_AO +#define BUS_MCEERR_AO 5 +#endif + #define EXPECTED_KVM_API_VERSION 12 #if EXPECTED_KVM_API_VERSION != KVM_API_VERSION @@ -1509,6 +1522,66 @@ static void sig_ipi_handler(int n) { } +static void hardware_memory_error(void) +{ +fprintf(stderr, Hardware memory error!\n); +exit(1); +} + +static void sigbus_reraise(void) +{ +sigset_t set; +struct sigaction action; + +memset(action, 0, sizeof(action)); +action.sa_handler = SIG_DFL; +if (!sigaction(SIGBUS, action, NULL)) { +raise(SIGBUS); +sigemptyset(set); +sigaddset(set, SIGBUS); +sigprocmask(SIG_UNBLOCK, set, NULL); +} +perror(Failed to re-raise SIGBUS!\n); +abort(); +} + +static void sigbus_handler(int n, struct signalfd_siginfo *siginfo, void *ctx) +{ +#if defined(KVM_CAP_MCE) defined(TARGET_I386) +if (first_cpu-mcg_cap siginfo-ssi_addr + siginfo-ssi_code == BUS_MCEERR_AO) { +uint64_t status; +unsigned long paddr; +CPUState *cenv; + +/* Hope we are lucky for AO MCE */ +if (do_qemu_ram_addr_from_host((void *)siginfo-ssi_addr, paddr)) { +fprintf(stderr, Hardware memory error for memory used by +QEMU itself instead of guest system!: %llx\n, +(unsigned long long)siginfo-ssi_addr); +return; +} +status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN +| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S +| 0xc0; +kvm_inject_x86_mce(first_cpu, 9, status, + MCG_STATUS_MCIP | MCG_STATUS_RIPV, paddr, + (MCM_ADDR_PHYS 6) | 0xc, 1); +for (cenv = first_cpu-next_cpu; cenv != NULL; cenv = cenv-next_cpu) +kvm_inject_x86_mce(cenv, 1, MCI_STATUS_VAL | MCI_STATUS_UC, + MCG_STATUS_MCIP | MCG_STATUS_RIPV, 0, 0, 1); +} else +#endif +{ +if (siginfo-ssi_code == BUS_MCEERR_AO) +return; +else if (siginfo-ssi_code == BUS_MCEERR_AR) +hardware_memory_error(); +else +sigbus_reraise(); +} +} + static void on_vcpu(CPUState *env, void (*func)(void *data), void *data) { struct qemu_work_item wi; @@ -1666,29 +1739,101 @@ static void flush_queued_work(CPUState * pthread_cond_broadcast(qemu_work_cond); } +static void kvm_on_sigbus(CPUState *env, siginfo_t *siginfo) +{ +#if defined(KVM_CAP_MCE) defined(TARGET_I386) +struct kvm_x86_mce mce = { +.bank = 9, +}; +unsigned long paddr; +int r; + +if (env-mcg_cap siginfo-si_addr + (siginfo-si_code == BUS_MCEERR_AR +|| siginfo-si_code == BUS_MCEERR_AO)) { +if (siginfo-si_code == BUS_MCEERR_AR) { +/* Fake an Intel architectural Data Load SRAR UCR */ +mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN +| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S +| MCI_STATUS_AR | 0x134; +mce.misc = (MCM_ADDR_PHYS 6) | 0xc; +mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_EIPV; +} else { +/* Fake an Intel architectural Memory scrubbing UCR */ +mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN +| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
[PATCH -v3] QEMU-KVM: MCE: Relay UCR MCE to guest
UCR (uncorrected recovery) MCE is supported in recent Intel CPUs, where some hardware error such as some memory error can be reported without PCC (processor context corrupted). To recover from such MCE, the corresponding memory will be unmapped, and all processes accessing the memory will be killed via SIGBUS. For KVM, if QEMU/KVM is killed, all guest processes will be killed too. So we relay SIGBUS from host OS to guest system via a UCR MCE injection. Then guest OS can isolate corresponding memory and kill necessary guest processes only. SIGBUS sent to main thread (not VCPU threads) will be broadcast to all VCPU threads as UCR MCE. v3: - Re-raise SIGBUS for SIGBUS not from MCE - Kill itself for error in kvm_inject_x86_mce v2: - Use qemu_ram_addr_from_host instead of self made one to covert from host address to guest RAM address. Thanks Anthony Liguori. Signed-off-by: Huang Ying ying.hu...@intel.com --- cpu-common.h |1 exec.c | 20 - qemu-kvm.c | 178 +++ qemu-kvm.h |9 ++ target-i386/cpu.h| 20 + target-i386/helper.c |2 6 files changed, 208 insertions(+), 22 deletions(-) --- a/qemu-kvm.c +++ b/qemu-kvm.c @@ -27,10 +27,23 @@ #include sys/mman.h #include sys/ioctl.h #include signal.h +#include sys/signalfd.h +#include sys/prctl.h #define false 0 #define true 1 +#ifndef PR_MCE_KILL +#define PR_MCE_KILL 33 +#endif + +#ifndef BUS_MCEERR_AR +#define BUS_MCEERR_AR 4 +#endif +#ifndef BUS_MCEERR_AO +#define BUS_MCEERR_AO 5 +#endif + #define EXPECTED_KVM_API_VERSION 12 #if EXPECTED_KVM_API_VERSION != KVM_API_VERSION @@ -1509,6 +1522,49 @@ static void sig_ipi_handler(int n) { } +static void hardware_memory_error(void) +{ +fprintf(stderr, Hardware memory error!\n); +exit(1); +} + +static void sigbus_handler(int n, struct signalfd_siginfo *siginfo, void *ctx) +{ +#if defined(KVM_CAP_MCE) defined(TARGET_I386) +if (first_cpu-mcg_cap siginfo-ssi_addr + siginfo-ssi_code == BUS_MCEERR_AO) { +uint64_t status; +unsigned long paddr; +CPUState *cenv; + +/* Hope we are lucky for AO MCE */ +if (do_qemu_ram_addr_from_host((void *)siginfo-ssi_addr, paddr)) { +fprintf(stderr, Hardware memory error for memory used by +QEMU itself instead of guest system!: %llx\n, +(unsigned long long)siginfo-ssi_addr); +return; +} +status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN +| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S +| 0xc0; +kvm_inject_x86_mce(first_cpu, 9, status, + MCG_STATUS_MCIP | MCG_STATUS_RIPV, paddr, + (MCM_ADDR_PHYS 6) | 0xc, 1); +for (cenv = first_cpu-next_cpu; cenv != NULL; cenv = cenv-next_cpu) +kvm_inject_x86_mce(cenv, 1, MCI_STATUS_VAL | MCI_STATUS_UC, + MCG_STATUS_MCIP | MCG_STATUS_RIPV, 0, 0, 1); +} else +#endif +{ +if (siginfo-ssi_code == BUS_MCEERR_AO) +return; +else if (siginfo-ssi_code == BUS_MCEERR_AR) +hardware_memory_error(); +else +raise(SIGBUS); +} +} + static void on_vcpu(CPUState *env, void (*func)(void *data), void *data) { struct qemu_work_item wi; @@ -1666,29 +1722,101 @@ static void flush_queued_work(CPUState * pthread_cond_broadcast(qemu_work_cond); } +static void kvm_on_sigbus(CPUState *env, siginfo_t *siginfo) +{ +#if defined(KVM_CAP_MCE) defined(TARGET_I386) +struct kvm_x86_mce mce = { +.bank = 9, +}; +unsigned long paddr; +int r; + +if (env-mcg_cap siginfo-si_addr + (siginfo-si_code == BUS_MCEERR_AR +|| siginfo-si_code == BUS_MCEERR_AO)) { +if (siginfo-si_code == BUS_MCEERR_AR) { +/* Fake an Intel architectural Data Load SRAR UCR */ +mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN +| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S +| MCI_STATUS_AR | 0x134; +mce.misc = (MCM_ADDR_PHYS 6) | 0xc; +mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_EIPV; +} else { +/* Fake an Intel architectural Memory scrubbing UCR */ +mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN +| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S +| 0xc0; +mce.misc = (MCM_ADDR_PHYS 6) | 0xc; +mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_RIPV; +} +if (do_qemu_ram_addr_from_host((void *)siginfo-si_addr, paddr)) { +fprintf(stderr, Hardware memory error for memory used by +QEMU itself instaed of guest system!\n); +/* Hope we are lucky for AO MCE */ +if (siginfo-si_code == BUS_MCEERR_AO
Re: [PATCH -v2] QEMU-KVM: MCE: Relay UCR MCE to guest
On Fri, 2009-09-18 at 05:36 +0800, Marcelo Tosatti wrote: +} else if (siginfo-ssi_code == BUS_MCEERR_AR) +fprintf(stderr, Hardware memory error!\n); +else +fprintf(stderr, Internal error in QEMU!\n); Can you re-raise SIGBUS so you we get a coredump on non-MCE SIGBUS as usual? We discuss this before. Copied below, please comment the comments below, :) Avi: (also, I if we can't handle guest-mode SIGBUS I think it would be nice to raise it again so the process terminates due to the SIGBUS). Huang Ying: For SIGBUS we can not relay to guest as MCE, we can either abort or reset SIGBUS to SIGDFL and re-raise it. Both are OK for me. You prefer the latter one? Andi: I think a suitable error message and exit would be better than a plain signal kill. It shouldn't look like qemu crashed due to a software bug. Ideally a error message in a way that it can be parsed by libvirt etc. and reported in a suitable way. However qemu getting killed itself is very unlikely, it doesn't have much memory foot print compared to the guest and other data. So this should be a very rare condition. Avi: libvirt etc. can/should wait() for qemu to terminate abnormally and report the reason why. However it doesn't seem there is a way to get extended signal information from wait(), so it looks like internal handling by qemu is better. I'm not talking about SIGBUS generated by MCE. What i mean is, for SIGBUS signals that are not due to MCE errors, the current behaviour is to generate a core dump (which is useful information for debugging). With your patch, qemu-kvm handles the signal, prints a message before exiting. This is annoying. It seems the discussion above is about SIGBUS initiated by MCE errors. OK. I will re-raise for SIGBUS not initiated by MCE. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2] QEMU-KVM: MCE: Relay UCR MCE to guest
On Thu, 2009-09-17 at 01:59 +0800, Marcelo Tosatti wrote: On Wed, Sep 09, 2009 at 10:28:02AM +0800, Huang Ying wrote: UCR (uncorrected recovery) MCE is supported in recent Intel CPUs, where some hardware error such as some memory error can be reported without PCC (processor context corrupted). To recover from such MCE, the corresponding memory will be unmapped, and all processes accessing the memory will be killed via SIGBUS. For KVM, if QEMU/KVM is killed, all guest processes will be killed too. So we relay SIGBUS from host OS to guest system via a UCR MCE injection. Then guest OS can isolate corresponding memory and kill necessary guest processes only. SIGBUS sent to main thread (not VCPU threads) will be broadcast to all VCPU threads as UCR MCE. v2: - Use qemu_ram_addr_from_host instead of self made one to covert from host address to guest RAM address. Thanks Anthony Liguori. Signed-off-by: Huang Ying ying.hu...@intel.com --- cpu-common.h |1 exec.c| 20 +-- qemu-kvm.c| 154 ++ target-i386/cpu.h | 20 ++- 4 files changed, 178 insertions(+), 17 deletions(-) --- a/qemu-kvm.c +++ b/qemu-kvm.c @@ -27,10 +27,23 @@ #include sys/mman.h #include sys/ioctl.h #include signal.h +#include sys/signalfd.h +#include sys/prctl.h #define false 0 #define true 1 +#ifndef PR_MCE_KILL +#define PR_MCE_KILL 33 +#endif + +#ifndef BUS_MCEERR_AR +#define BUS_MCEERR_AR 4 +#endif +#ifndef BUS_MCEERR_AO +#define BUS_MCEERR_AO 5 +#endif + #define EXPECTED_KVM_API_VERSION 12 #if EXPECTED_KVM_API_VERSION != KVM_API_VERSION @@ -1507,6 +1520,37 @@ static void sig_ipi_handler(int n) { } +static void sigbus_handler(int n, struct signalfd_siginfo *siginfo, void *ctx) +{ +if (siginfo-ssi_code == BUS_MCEERR_AO) { +uint64_t status; +unsigned long paddr; +CPUState *cenv; + +/* Hope we are lucky for AO MCE */ +if (do_qemu_ram_addr_from_host((void *)siginfo-ssi_addr, paddr)) { +fprintf(stderr, Hardware memory error for memory used by +QEMU itself instead of guest system!: %llx\n, +(unsigned long long)siginfo-ssi_addr); +return; qemu-kvm should die here? There are two kinds of UCR MCE. One is triggered by user space/guest read/write, the other is triggered by asynchronously detected error (e.g. patrol scrubbing). The latter one is reported as AO (Action Optional) MCE, and it has nothing to do with current path. So if we are lucky enough, we can survive. And when we finally touch the error memory reported by AO MCE, another AR (Action Required) MCE will be triggered. We have another chance to deal with it. +} +status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN +| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S +| 0xc0; +kvm_inject_x86_mce(first_cpu, 9, status, + MCG_STATUS_MCIP | MCG_STATUS_RIPV, paddr, + (MCM_ADDR_PHYS 6) | 0xc); +for (cenv = first_cpu-next_cpu; cenv != NULL; cenv = cenv-next_cpu) +kvm_inject_x86_mce(cenv, 1, MCI_STATUS_VAL | MCI_STATUS_UC, + MCG_STATUS_MCIP | MCG_STATUS_RIPV, 0, 0); +return; Should abort if kvm_inject_x86_mce fails? kvm_inject_x86_mce will abort by itself. +} else if (siginfo-ssi_code == BUS_MCEERR_AR) +fprintf(stderr, Hardware memory error!\n); +else +fprintf(stderr, Internal error in QEMU!\n); Can you re-raise SIGBUS so you we get a coredump on non-MCE SIGBUS as usual? We discuss this before. Copied below, please comment the comments below, :) Avi: (also, I if we can't handle guest-mode SIGBUS I think it would be nice to raise it again so the process terminates due to the SIGBUS). Huang Ying: For SIGBUS we can not relay to guest as MCE, we can either abort or reset SIGBUS to SIGDFL and re-raise it. Both are OK for me. You prefer the latter one? Andi: I think a suitable error message and exit would be better than a plain signal kill. It shouldn't look like qemu crashed due to a software bug. Ideally a error message in a way that it can be parsed by libvirt etc. and reported in a suitable way. However qemu getting killed itself is very unlikely, it doesn't have much memory foot print compared to the guest and other data. So this should be a very rare condition. Avi: libvirt etc. can/should wait() for qemu to terminate abnormally and report the reason why. However it doesn't seem there is a way to get extended signal information from wait(), so it looks like internal handling by qemu is better. +exit(1); +} + static void on_vcpu(CPUState *env, void (*func)(void *data), void *data
Re: [PATCH -v2] QEMU-KVM: MCE: Relay UCR MCE to guest
On Mon, 2009-09-14 at 13:10 +0800, Avi Kivity wrote: On 09/14/2009 05:55 AM, Huang Ying wrote: Hi, Avi, On Thu, 2009-09-10 at 17:35 +0800, Andi Kleen wrote: (also, I if we can't handle guest-mode SIGBUS I think it would be nice to raise it again so the process terminates due to the SIGBUS). For SIGBUS we can not relay to guest as MCE, we can either abort or reset SIGBUS to SIGDFL and re-raise it. Both are OK for me. You prefer the latter one? I think a suitable error message and exit would be better than a plain signal kill. It shouldn't look like qemu crashed due to a software bug. Ideally a error message in a way that it can be parsed by libvirt etc. and reported in a suitable way. Do you agree with us about SIGBUS processing? Yes. Do you have some other issue for this patch? Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2] QEMU-KVM: MCE: Relay UCR MCE to guest
Hi, Avi, On Thu, 2009-09-10 at 17:35 +0800, Andi Kleen wrote: (also, I if we can't handle guest-mode SIGBUS I think it would be nice to raise it again so the process terminates due to the SIGBUS). For SIGBUS we can not relay to guest as MCE, we can either abort or reset SIGBUS to SIGDFL and re-raise it. Both are OK for me. You prefer the latter one? I think a suitable error message and exit would be better than a plain signal kill. It shouldn't look like qemu crashed due to a software bug. Ideally a error message in a way that it can be parsed by libvirt etc. and reported in a suitable way. Do you agree with us about SIGBUS processing? Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] QEMU-KVM: MCE: Relay UCR MCE to guest
On Tue, 2009-09-08 at 14:44 +0800, Avi Kivity wrote: On 09/07/2009 11:32 AM, Huang Ying wrote: UCR (uncorrected recovery) MCE is supported in recent Intel CPUs, where some hardware error such as some memory error can be reported without PCC (processor context corrupted). To recover from such MCE, the corresponding memory will be unmapped, and all processes accessing the memory will be killed via SIGBUS. For KVM, if QEMU/KVM is killed, all guest processes will be killed too. So we relay SIGBUS from host OS to guest system via a UCR MCE injection. Then guest OS can isolate corresponding memory and kill necessary guest processes only. SIGBUS sent to main thread (not VCPU threads) will be broadcast to all VCPU threads as UCR MCE. Won't the guest be confused by the broadcast? How does real hardware work? We do broadcasting to follow the hardware behavior. +static void sigbus_handler(int n, struct signalfd_siginfo *siginfo, void *ctx) +{ +if (siginfo-ssi_code == BUS_MCEERR_AO) { +uint64_t status; +unsigned long paddr; +CPUState *cenv; + +/* Hope we are lucky for AO MCE */ +if (kvm_addr_userspace_to_phys((unsigned long)siginfo-ssi_addr, +paddr)) { +fprintf(stderr, Hardware memory error for memory used by +QEMU itself instead of guest system!: %llx\n, +(unsigned long long)siginfo-ssi_addr); +return; +} +status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN +| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S +| 0xc0; +kvm_inject_x86_mce(first_cpu, 9, status, + MCG_STATUS_MCIP | MCG_STATUS_RIPV, paddr, + (MCM_ADDR_PHYS 6) | 0xc); This is a vcpu ioctl, yes? if so it must be called from the vcpu thread. No. kvm_inject_x86_mce will call on_vcpu to do the real vcpu ioctl. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] QEMU-KVM: MCE: Relay UCR MCE to guest
On Tue, 2009-09-08 at 14:41 +0800, Avi Kivity wrote: On 09/07/2009 11:48 PM, Anthony Liguori wrote: #ifdef KVM_CAP_IRQCHIP int kvm_set_irq_level(kvm_context_t kvm, int irq, int level, int *status) @@ -1515,6 +1546,38 @@ static void sig_ipi_handler(int n) { } +static void sigbus_handler(int n, struct signalfd_siginfo *siginfo, void *ctx) +{ +if (siginfo-ssi_code == BUS_MCEERR_AO) { +uint64_t status; +unsigned long paddr; +CPUState *cenv; + +/* Hope we are lucky for AO MCE */ Even if the error was limited to guest memory, it could have been generated by either the kernel or userspace reading guest memory, no? Does this potentially open a security hole for us? Consider the following: 1) We happen to read guest memory and that causes an MCE. For instance, say we're in virtio.c and we read the virtio ring. 2) That should trigger the kernel to generate a sigbus. 3) We catch sigbus, and queue an MCE for delivery. 4) After sigbus handler completes, we're back in virtio.c, what was the value of the memory operation we just completed? If the instruction gets skipped, we may be leaking host memory because the access never happened. I think it's a lot safer to only report guest mode accesses to the guest, and let user mode accesses terminate qemu. The guest wouldn't expect 100% recovery; for example if an uncorrectable error hit a vital kernel data structure. Yes, this is the current behavior. If MCE is caused by user mode accessing, the KVM will be killed by force_sig_info, only MCE caused by guest mode accessing will be captured by SIGBUS signal handler. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -v2] QEMU-KVM: MCE: Relay UCR MCE to guest
UCR (uncorrected recovery) MCE is supported in recent Intel CPUs, where some hardware error such as some memory error can be reported without PCC (processor context corrupted). To recover from such MCE, the corresponding memory will be unmapped, and all processes accessing the memory will be killed via SIGBUS. For KVM, if QEMU/KVM is killed, all guest processes will be killed too. So we relay SIGBUS from host OS to guest system via a UCR MCE injection. Then guest OS can isolate corresponding memory and kill necessary guest processes only. SIGBUS sent to main thread (not VCPU threads) will be broadcast to all VCPU threads as UCR MCE. v2: - Use qemu_ram_addr_from_host instead of self made one to covert from host address to guest RAM address. Thanks Anthony Liguori. Signed-off-by: Huang Ying ying.hu...@intel.com --- cpu-common.h |1 exec.c| 20 +-- qemu-kvm.c| 154 ++ target-i386/cpu.h | 20 ++- 4 files changed, 178 insertions(+), 17 deletions(-) --- a/qemu-kvm.c +++ b/qemu-kvm.c @@ -27,10 +27,23 @@ #include sys/mman.h #include sys/ioctl.h #include signal.h +#include sys/signalfd.h +#include sys/prctl.h #define false 0 #define true 1 +#ifndef PR_MCE_KILL +#define PR_MCE_KILL 33 +#endif + +#ifndef BUS_MCEERR_AR +#define BUS_MCEERR_AR 4 +#endif +#ifndef BUS_MCEERR_AO +#define BUS_MCEERR_AO 5 +#endif + #define EXPECTED_KVM_API_VERSION 12 #if EXPECTED_KVM_API_VERSION != KVM_API_VERSION @@ -1507,6 +1520,37 @@ static void sig_ipi_handler(int n) { } +static void sigbus_handler(int n, struct signalfd_siginfo *siginfo, void *ctx) +{ +if (siginfo-ssi_code == BUS_MCEERR_AO) { +uint64_t status; +unsigned long paddr; +CPUState *cenv; + +/* Hope we are lucky for AO MCE */ +if (do_qemu_ram_addr_from_host((void *)siginfo-ssi_addr, paddr)) { +fprintf(stderr, Hardware memory error for memory used by +QEMU itself instead of guest system!: %llx\n, +(unsigned long long)siginfo-ssi_addr); +return; +} +status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN +| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S +| 0xc0; +kvm_inject_x86_mce(first_cpu, 9, status, + MCG_STATUS_MCIP | MCG_STATUS_RIPV, paddr, + (MCM_ADDR_PHYS 6) | 0xc); +for (cenv = first_cpu-next_cpu; cenv != NULL; cenv = cenv-next_cpu) +kvm_inject_x86_mce(cenv, 1, MCI_STATUS_VAL | MCI_STATUS_UC, + MCG_STATUS_MCIP | MCG_STATUS_RIPV, 0, 0); +return; +} else if (siginfo-ssi_code == BUS_MCEERR_AR) +fprintf(stderr, Hardware memory error!\n); +else +fprintf(stderr, Internal error in QEMU!\n); +exit(1); +} + static void on_vcpu(CPUState *env, void (*func)(void *data), void *data) { struct qemu_work_item wi; @@ -1649,29 +1693,102 @@ static void flush_queued_work(CPUState * pthread_cond_broadcast(qemu_work_cond); } +static void kvm_on_sigbus(CPUState *env, siginfo_t *siginfo) +{ +#if defined(KVM_CAP_MCE) defined(TARGET_I386) +struct kvm_x86_mce mce = { +.bank = 9, +}; +unsigned long paddr; +int r; + +if (env-mcg_cap siginfo-si_addr + (siginfo-si_code == BUS_MCEERR_AR +|| siginfo-si_code == BUS_MCEERR_AO)) { +if (siginfo-si_code == BUS_MCEERR_AR) { +/* Fake an Intel architectural Data Load SRAR UCR */ +mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN +| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S +| MCI_STATUS_AR | 0x134; +mce.misc = (MCM_ADDR_PHYS 6) | 0xc; +mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_EIPV; +} else { +/* Fake an Intel architectural Memory scrubbing UCR */ +mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN +| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S +| 0xc0; +mce.misc = (MCM_ADDR_PHYS 6) | 0xc; +mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_RIPV; +} +if (do_qemu_ram_addr_from_host((void *)siginfo-si_addr, paddr)) { +fprintf(stderr, Hardware memory error for memory used by +QEMU itself instaed of guest system!\n); +/* Hope we are lucky for AO MCE */ +if (siginfo-si_code == BUS_MCEERR_AO) +return; +else +exit(1); +} +mce.addr = paddr; +r = kvm_set_mce(env-kvm_cpu_state.vcpu_ctx, mce); +if (r 0) { +fprintf(stderr, kvm_set_mce: %s\n, strerror(errno)); +exit(1); +} +} else +#endif +{ +if (siginfo-si_code == BUS_MCEERR_AO) +return; +if (siginfo-si_code
[PATCH] QEMU-KVM: MCE: Relay UCR MCE to guest
UCR (uncorrected recovery) MCE is supported in recent Intel CPUs, where some hardware error such as some memory error can be reported without PCC (processor context corrupted). To recover from such MCE, the corresponding memory will be unmapped, and all processes accessing the memory will be killed via SIGBUS. For KVM, if QEMU/KVM is killed, all guest processes will be killed too. So we relay SIGBUS from host OS to guest system via a UCR MCE injection. Then guest OS can isolate corresponding memory and kill necessary guest processes only. SIGBUS sent to main thread (not VCPU threads) will be broadcast to all VCPU threads as UCR MCE. Signed-off-by: Huang Ying ying.hu...@intel.com --- qemu-kvm.c| 173 ++ target-i386/cpu.h | 20 +- 2 files changed, 181 insertions(+), 12 deletions(-) --- a/qemu-kvm.c +++ b/qemu-kvm.c @@ -27,10 +27,23 @@ #include sys/mman.h #include sys/ioctl.h #include signal.h +#include sys/signalfd.h +#include sys/prctl.h #define false 0 #define true 1 +#ifndef PR_MCE_KILL +#define PR_MCE_KILL 33 +#endif + +#ifndef BUS_MCEERR_AR +#define BUS_MCEERR_AR 4 +#endif +#ifndef BUS_MCEERR_AO +#define BUS_MCEERR_AO 5 +#endif + #define EXPECTED_KVM_API_VERSION 12 #if EXPECTED_KVM_API_VERSION != KVM_API_VERSION @@ -702,6 +715,24 @@ int kvm_get_dirty_pages_range(kvm_contex return 0; } +static int kvm_addr_userspace_to_phys(unsigned long userspace_addr, + unsigned long *phys_addr) +{ + int i; + struct slot_info *slot; + + for (i = 0; i KVM_MAX_NUM_MEM_REGIONS; ++i) { + slot = slots[i]; + if (slot-len slot-userspace_addr = userspace_addr + (slot-userspace_addr + slot-len) userspace_addr) { + *phys_addr = userspace_addr - slot-userspace_addr + + slot-phys_addr; + return 0; + } + } + return -1; +} + #ifdef KVM_CAP_IRQCHIP int kvm_set_irq_level(kvm_context_t kvm, int irq, int level, int *status) @@ -1515,6 +1546,38 @@ static void sig_ipi_handler(int n) { } +static void sigbus_handler(int n, struct signalfd_siginfo *siginfo, void *ctx) +{ +if (siginfo-ssi_code == BUS_MCEERR_AO) { +uint64_t status; +unsigned long paddr; +CPUState *cenv; + +/* Hope we are lucky for AO MCE */ +if (kvm_addr_userspace_to_phys((unsigned long)siginfo-ssi_addr, + paddr)) { +fprintf(stderr, Hardware memory error for memory used by +QEMU itself instead of guest system!: %llx\n, +(unsigned long long)siginfo-ssi_addr); +return; +} +status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN +| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S +| 0xc0; +kvm_inject_x86_mce(first_cpu, 9, status, + MCG_STATUS_MCIP | MCG_STATUS_RIPV, paddr, + (MCM_ADDR_PHYS 6) | 0xc); +for (cenv = first_cpu-next_cpu; cenv != NULL; cenv = cenv-next_cpu) +kvm_inject_x86_mce(cenv, 1, MCI_STATUS_VAL | MCI_STATUS_UC, + MCG_STATUS_MCIP | MCG_STATUS_RIPV, 0, 0); +return; +} else if (siginfo-ssi_code == BUS_MCEERR_AR) +fprintf(stderr, Hardware memory error!\n); +else +fprintf(stderr, Internal error in QEMU!\n); +exit(1); +} + static void on_vcpu(CPUState *env, void (*func)(void *data), void *data) { struct qemu_work_item wi; @@ -1657,29 +1720,102 @@ static void flush_queued_work(CPUState * pthread_cond_broadcast(qemu_work_cond); } +static void kvm_on_sigbus(CPUState *env, siginfo_t *siginfo) +{ +#if defined(KVM_CAP_MCE) defined(TARGET_I386) +struct kvm_x86_mce mce = { +.bank = 9, +}; +unsigned long paddr; +int r; + +if (env-mcg_cap siginfo-si_addr + (siginfo-si_code == BUS_MCEERR_AR +|| siginfo-si_code == BUS_MCEERR_AO)) { +if (siginfo-si_code == BUS_MCEERR_AR) { +mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN +| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S +| MCI_STATUS_AR; +mce.misc = (MCM_ADDR_PHYS 6) | 0xc; +mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_EIPV; +} else { +/* Fake an Intel architectural Memory scrubbing UCR */ +mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN +| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S +| 0xc0; +mce.misc = (MCM_ADDR_PHYS 6) | 0xc; +mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_RIPV; +} +if (kvm_addr_userspace_to_phys((unsigned long)siginfo-si_addr, + paddr)) { +fprintf(stderr
Re: [PATCH] QEMU-KVM: MCE: Relay UCR MCE to guest
On Tue, 2009-09-08 at 04:48 +0800, Anthony Liguori wrote: Hi Huang, Huang Ying wrote: UCR (uncorrected recovery) MCE is supported in recent Intel CPUs, where some hardware error such as some memory error can be reported without PCC (processor context corrupted). To recover from such MCE, the corresponding memory will be unmapped, and all processes accessing the memory will be killed via SIGBUS. For KVM, if QEMU/KVM is killed, all guest processes will be killed too. So we relay SIGBUS from host OS to guest system via a UCR MCE injection. Then guest OS can isolate corresponding memory and kill necessary guest processes only. SIGBUS sent to main thread (not VCPU threads) will be broadcast to all VCPU threads as UCR MCE. Signed-off-by: Huang Ying ying.hu...@intel.com --- qemu-kvm.c| 173 ++ target-i386/cpu.h | 20 +- 2 files changed, 181 insertions(+), 12 deletions(-) --- a/qemu-kvm.c +++ b/qemu-kvm.c @@ -27,10 +27,23 @@ #include sys/mman.h #include sys/ioctl.h #include signal.h +#include sys/signalfd.h +#include sys/prctl.h #define false 0 #define true 1 +#ifndef PR_MCE_KILL +#define PR_MCE_KILL 33 +#endif + +#ifndef BUS_MCEERR_AR +#define BUS_MCEERR_AR 4 +#endif +#ifndef BUS_MCEERR_AO +#define BUS_MCEERR_AO 5 +#endif + #define EXPECTED_KVM_API_VERSION 12 #if EXPECTED_KVM_API_VERSION != KVM_API_VERSION @@ -702,6 +715,24 @@ int kvm_get_dirty_pages_range(kvm_contex return 0; } +static int kvm_addr_userspace_to_phys(unsigned long userspace_addr, + unsigned long *phys_addr) +{ + int i; + struct slot_info *slot; + + for (i = 0; i KVM_MAX_NUM_MEM_REGIONS; ++i) { + slot = slots[i]; + if (slot-len slot-userspace_addr = userspace_addr + (slot-userspace_addr + slot-len) userspace_addr) { + *phys_addr = userspace_addr - slot-userspace_addr + + slot-phys_addr; + return 0; + } + } + return -1; +} + The slot mapping is actually a copy of the qemu's ram_blocks structure (see exec.c). If you base your check on that, it will Just Work for QEMU too. I find there is already a function named qemu_ram_addr_from_host which translate from user space virtual address into qemu RAM address. But I need function to return a error code instead of abort in case of no RAM address corresponding specified user space virtual address. So I plan to use following code to deal with that. int do_qemu_ram_addr_from_host(void *ptr, ram_addr_t *ram_addr); ram_addr_t qemu_ram_addr_from_host(void *ptr); Does this follow the coding style of qemu? #ifdef KVM_CAP_IRQCHIP int kvm_set_irq_level(kvm_context_t kvm, int irq, int level, int *status) @@ -1515,6 +1546,38 @@ static void sig_ipi_handler(int n) { } +static void sigbus_handler(int n, struct signalfd_siginfo *siginfo, void *ctx) +{ +if (siginfo-ssi_code == BUS_MCEERR_AO) { +uint64_t status; +unsigned long paddr; +CPUState *cenv; + +/* Hope we are lucky for AO MCE */ Even if the error was limited to guest memory, it could have been generated by either the kernel or userspace reading guest memory, no? Does this potentially open a security hole for us? Consider the following: 1) We happen to read guest memory and that causes an MCE. For instance, say we're in virtio.c and we read the virtio ring. 2) That should trigger the kernel to generate a sigbus. 3) We catch sigbus, and queue an MCE for delivery. 4) After sigbus handler completes, we're back in virtio.c, what was the value of the memory operation we just completed? If the instruction gets skipped, we may be leaking host memory because the access never happened. There are two kinds of recoverable MCE named SRAO (Software Recoverable Action Optional) and SRAR (Software Recoverable Action Required). For your example, it is a SRAR error. Where kernel will munmap the error page and send SIGBUS to qemu via force_sig_info, which will unblock SIGBUS and reset its action to SIG_DFL, so qemu will be terminated. If the guest mode is interrupted, because signal mask processing of KVM kernel part, SIGBUS can be captured by qemu. For more details of recoverable MCE (SRAO and SRAR), you can refer to latest Intel software developer's manual. Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -v6] QEMU-KVM: MCE: Add MCE simulation support to qemu/kvm
KVM ioctls are used to initialize MCE simulation and inject MCE. The real MCE simulation is implemented in Linux kernel. The Kernel part has been merged. ChangeLog: v6: - Re-based on latest qemu-kvm.git v5: - Re-based on latest qemu-kvm.git v3: - Re-based on qemu/tcg MCE support patch v2: - Use new kernel MCE capability exportion interface. Signed-off-by: Huang Ying ying.hu...@intel.com --- kvm/include/linux/kvm.h | 20 kvm/include/x86/asm/kvm.h |1 libkvm-all.h |4 +++ qemu-kvm-x86.c| 55 ++ qemu-kvm.c| 40 + qemu-kvm.h|3 ++ target-i386/helper.c |5 target-i386/machine.c |4 +-- 8 files changed, 130 insertions(+), 2 deletions(-) --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -444,6 +444,39 @@ int kvm_set_msrs(kvm_vcpu_context_t vcpu return r; } +int kvm_get_mce_cap_supported(kvm_context_t kvm, uint64_t *mce_cap, + int *max_banks) +{ +#ifdef KVM_CAP_MCE +int r; + +r = ioctl(kvm-fd, KVM_CHECK_EXTENSION, KVM_CAP_MCE); +if (r 0) { +*max_banks = r; +return ioctl(kvm-fd, KVM_X86_GET_MCE_CAP_SUPPORTED, mce_cap); +} +#endif +return -ENOSYS; +} + +int kvm_setup_mce(kvm_vcpu_context_t vcpu, uint64_t *mcg_cap) +{ +#ifdef KVM_CAP_MCE +return ioctl(vcpu-fd, KVM_X86_SETUP_MCE, mcg_cap); +#else +return -ENOSYS; +#endif +} + +int kvm_set_mce(kvm_vcpu_context_t vcpu, struct kvm_x86_mce *m) +{ +#ifdef KVM_CAP_MCE +return ioctl(vcpu-fd, KVM_X86_SET_MCE, m); +#else +return -ENOSYS; +#endif +} + static void print_seg(FILE *file, const char *name, struct kvm_segment *seg) { fprintf(stderr, @@ -1301,6 +1334,28 @@ int kvm_arch_qemu_init_env(CPUState *cen kvm_setup_cpuid2(cenv-kvm_cpu_state.vcpu_ctx, cpuid_nent, cpuid_ent); +#ifdef KVM_CAP_MCE +if (((cenv-cpuid_version 8)0xF) = 6 + (cenv-cpuid_features(CPUID_MCE|CPUID_MCA)) == (CPUID_MCE|CPUID_MCA) + kvm_check_extension(kvm_context, KVM_CAP_MCE) 0) { +uint64_t mcg_cap; +int banks; + +if (kvm_get_mce_cap_supported(kvm_context, mcg_cap, banks)) +perror(kvm_get_mce_cap_supported FAILED); +else { +if (banks MCE_BANKS_DEF) +banks = MCE_BANKS_DEF; +mcg_cap = MCE_CAP_DEF; +mcg_cap |= banks; +if (kvm_setup_mce(cenv-kvm_cpu_state.vcpu_ctx, mcg_cap)) +perror(kvm_setup_mce FAILED); +else +cenv-mcg_cap = mcg_cap; +} +} +#endif + return 0; } --- a/qemu-kvm.h +++ b/qemu-kvm.h @@ -241,4 +241,7 @@ static inline int kvm_set_migration_log( return kvm_physical_memory_set_dirty_tracking(enable); } +void kvm_inject_x86_mce(CPUState *cenv, int bank, uint64_t status, +uint64_t mcg_status, uint64_t addr, uint64_t misc); + #endif --- a/target-i386/helper.c +++ b/target-i386/helper.c @@ -1509,6 +1509,11 @@ void cpu_inject_x86_mce(CPUState *cenv, unsigned bank_num = mcg_cap 0xff; uint64_t *banks = cenv-mce_banks; +if (kvm_enabled()) { +kvm_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc); +return; +} + if (bank = bank_num || !(status MCI_STATUS_VAL)) return; --- a/kvm/include/linux/kvm.h +++ b/kvm/include/linux/kvm.h @@ -463,6 +463,9 @@ struct kvm_trace_rec { #define KVM_CAP_ASSIGN_DEV_IRQ 29 /* Another bug in KVM_SET_USER_MEMORY_REGION fixed: */ #define KVM_CAP_JOIN_MEMORY_REGIONS_WORKS 30 +#ifdef __KVM_HAVE_MCE +#define KVM_CAP_MCE 31 +#endif #define KVM_CAP_PIT2 33 #define KVM_CAP_PIT_STATE2 35 @@ -504,6 +507,19 @@ struct kvm_irq_routing { #endif +#ifdef KVM_CAP_MCE +/* x86 MCE */ +struct kvm_x86_mce { + __u64 status; + __u64 addr; + __u64 misc; + __u64 mcg_status; + __u8 bank; + __u8 pad1[7]; + __u64 pad2[3]; +}; +#endif + /* * ioctls for VM fds */ @@ -592,6 +608,10 @@ struct kvm_irq_routing { #define KVM_NMI _IO(KVMIO, 0x9a) /* Available with KVM_CAP_SET_GUEST_DEBUG */ #define KVM_SET_GUEST_DEBUG _IOW(KVMIO, 0x9b, struct kvm_guest_debug) +/* MCE for x86 */ +#define KVM_X86_SETUP_MCE _IOW(KVMIO, 0x9c, __u64) +#define KVM_X86_GET_MCE_CAP_SUPPORTED _IOR(KVMIO, 0x9d, __u64) +#define KVM_X86_SET_MCE _IOW(KVMIO, 0x9e, struct kvm_x86_mce) /* * Deprecated interfaces --- a/kvm/include/x86/asm/kvm.h +++ b/kvm/include/x86/asm/kvm.h @@ -57,6 +57,7 @@ #define __KVM_HAVE_USER_NMI #define __KVM_HAVE_GUEST_DEBUG #define __KVM_HAVE_MSIX +#define __KVM_HAVE_MCE /* Architectural interrupt line count. */ #define KVM_NR_INTERRUPTS 256 --- a/qemu-kvm.c +++ b/qemu-kvm.c @@ -2845,3 +2845,43 @@ int kvm_set_boot_cpu_id(uint32_t id) { return kvm_set_boot_vcpu_id(kvm_context, id); } + +#ifdef