Re: [PATCH v3 2/2] mm, hwpoison: When copy-on-write hits poison, take page offline
On 2022/10/29 0:13, Luck, Tony wrote:
>>> Cannot call memory_failure() directly from the fault handler because
>>> mmap_lock (and others) are held.
>>
>> Could you please explain which lock makes it unfeasible to call
>> memory_failure() directly and why? I'm somewhat confused. But I agree
>> using memory_failure_queue() should be a good idea.
>
> I tried calling memory_failure() directly, and my system just hung. I made
> the assumption that it had deadlocked based somewhat on the comments in
> mm/memory.c about mmap_lock being held ... but I didn't dig into what had
> gone wrong.

I see. Thanks for your explanation. :)

> -Tony
Re: [PATCH v3 2/2] mm, hwpoison: When copy-on-write hits poison, take page offline
On 2022/10/22 4:01, Tony Luck wrote:
> Cannot call memory_failure() directly from the fault handler because
> mmap_lock (and others) are held.

Could you please explain which lock makes it unfeasible to call
memory_failure() directly and why? I'm somewhat confused. But I agree
using memory_failure_queue() should be a good idea.

> It is important, but not urgent, to mark the source page as h/w poisoned
> and unmap it from other tasks.
>
> Use memory_failure_queue() to request a call to memory_failure() for the
> page with the error.
>
> Also provide a stub version for CONFIG_MEMORY_FAILURE=n
>
> Signed-off-by: Tony Luck
> ---
>  include/linux/mm.h | 5 ++++-
>  mm/memory.c        | 4 +++-
>  2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8bbcccbc5565..03ced659eb58 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3268,7 +3268,6 @@ enum mf_flags {
>  int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>                unsigned long count, int mf_flags);
>  extern int memory_failure(unsigned long pfn, int flags);
> -extern void memory_failure_queue(unsigned long pfn, int flags);
>  extern void memory_failure_queue_kick(int cpu);
>  extern int unpoison_memory(unsigned long pfn);
>  extern int sysctl_memory_failure_early_kill;
> @@ -3277,8 +3276,12 @@ extern void shake_page(struct page *p);
>  extern atomic_long_t num_poisoned_pages __read_mostly;
>  extern int soft_offline_page(unsigned long pfn, int flags);
>  #ifdef CONFIG_MEMORY_FAILURE
> +extern void memory_failure_queue(unsigned long pfn, int flags);
>  extern int __get_huge_page_for_hwpoison(unsigned long pfn, int flags);
>  #else
> +static inline void memory_failure_queue(unsigned long pfn, int flags)
> +{
> +}
>  static inline int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
>  {
>      return 0;
> diff --git a/mm/memory.c b/mm/memory.c
> index b6056eef2f72..eae242351726 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2866,8 +2866,10 @@
> static inline int __wp_page_copy_user(struct page *dst, struct page *src,
>      unsigned long addr = vmf->address;
>
>      if (likely(src)) {
> -        if (copy_mc_user_highpage(dst, src, addr, vma))
> +        if (copy_mc_user_highpage(dst, src, addr, vma)) {
> +            memory_failure_queue(page_to_pfn(src), 0);

It seems MF_ACTION_REQUIRED is not needed for memory_failure_queue() here.
Thanks for your patch.

Reviewed-by: Miaohe Lin

Thanks,
Miaohe Lin
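For readers following the thread, what memory_failure_queue() buys the fault path can be modeled in a few lines of userspace C: errors are recorded in a small ring from a context that must not sleep or take mmap_lock, and a separate drain step invokes the heavyweight handler later. This is an illustrative sketch only (the real implementation uses a per-CPU kfifo plus a work_struct; `mf_queue`/`mf_drain` are invented names):

```c
#include <assert.h>

#define MF_RING 16

struct mf_entry {
    unsigned long pfn;
    int flags;
};

static struct mf_entry ring[MF_RING];
static unsigned int head, tail;

/* "Fault path" side: O(1), no allocation, nothing that can sleep. */
int mf_queue(unsigned long pfn, int flags)
{
    if (head - tail == MF_RING)
        return -1;                 /* full: this error report is dropped */
    ring[head % MF_RING] = (struct mf_entry){ pfn, flags };
    head++;
    return 0;
}

/* "Worker" side, run later from a safe context: returns 1 and one queued
 * pfn (which the kernel would pass to memory_failure()), or 0 if empty. */
int mf_drain(unsigned long *pfn_out)
{
    if (head == tail)
        return 0;
    *pfn_out = ring[tail % MF_RING].pfn;
    tail++;
    return 1;
}
```

The bounded ring is the key design choice: enqueueing can never block, at the cost of possibly dropping reports under a burst of errors.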
Re: [PATCH v3 1/2] mm, hwpoison: Try to recover from copy-on write faults
On 2022/10/22 4:01, Tony Luck wrote:
> If the kernel is copying a page as the result of a copy-on-write
> fault and runs into an uncorrectable error, Linux will crash because
> it does not have recovery code for this case where poison is consumed
> by the kernel.
>
> It is easy to set up a test case. Just inject an error into a private
> page, fork(2), and have the child process write to the page.
>
> I wrapped that neatly into a test at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
>
> just enable ACPI error injection and run:
>
>   # ./einj_mem-uc -f copy-on-write
>
> Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
> on architectures where that is available (currently x86 and powerpc).
> When an error is detected during the page copy, return VM_FAULT_HWPOISON
> to caller of wp_page_copy(). This propagates up the call stack. Both x86
> and powerpc have code in their fault handler to deal with this code by
> sending a SIGBUS to the application.
>
> Note that this patch avoids a system crash and signals the process that
> triggered the copy-on-write action. It does not take any action for the
> memory error that is still in the shared page. To handle that a call to
> memory_failure() is needed. But this cannot be done from wp_page_copy()
> because it holds mmap_lock(). Perhaps the architecture fault handlers
> can deal with this loose end in a subsequent patch?
>
> On Intel/x86 this loose end will often be handled automatically because
> the memory controller provides an additional notification of the h/w
> poison in memory, the handler for this will call memory_failure(). This
> isn't a 100% solution. If there are multiple errors, not all may be
> logged in this way.
>
> Reviewed-by: Dan Williams
> Signed-off-by: Tony Luck

Thanks for your work, Tony.
> ---
> Changes in V3:
>   Dan Williams:
>     Rename copy_user_highpage_mc() to copy_mc_user_highpage() for
>     consistency with Linus' discussion on names of functions that
>     check for machine check.
>     Write complete functions for the have/have-not copy_mc_to_kernel
>     cases (so grep shows there are two versions)
>     Change __wp_page_copy_user() to return 0 for success, negative for fail
>     [I picked -EAGAIN for both non-EHWPOISON cases]
>
> Changes in V2:
>   Naoya Horiguchi:
>     1) Use -EHWPOISON error code instead of minus one.
>     2) Poison path needs also to deal with old_page
>   Tony Luck:
>     Rewrote commit message
>     Added some powerpc folks to Cc: list
> ---
>  include/linux/highmem.h | 24 ++++++++++++++++++++++++
>  mm/memory.c             | 30 ++++++++++++++++++++----------
>  2 files changed, 44 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index e9912da5441b..a32c64681f03 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -319,6 +319,30 @@ static inline void copy_user_highpage(struct page *to, struct page *from,
>
>  #endif
>
> +#ifdef copy_mc_to_kernel
> +static inline int copy_mc_user_highpage(struct page *to, struct page *from,
> +                    unsigned long vaddr, struct vm_area_struct *vma)
> +{
> +    unsigned long ret;
> +    char *vfrom, *vto;
> +
> +    vfrom = kmap_local_page(from);
> +    vto = kmap_local_page(to);
> +    ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);

In copy_user_highpage(), kmsan_unpoison_memory(page_address(to), PAGE_SIZE)
is done after the copy when __HAVE_ARCH_COPY_USER_HIGHPAGE isn't defined.
Do we need to do something similar here? But I'm not familiar with kmsan,
so I can easily be wrong.

Anyway, this patch looks good to me. Thanks.

Reviewed-by: Miaohe Lin

Thanks,
Miaohe Lin
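The calling convention the quoted hunk relies on (copy_mc_to_kernel() returns the number of bytes NOT copied, 0 meaning success, and the COW helper maps any short copy to -EHWPOISON) can be sketched in userspace. The fault injection below is simulated; the real copy_mc_to_kernel() aborts when the CPU machine-checks on a poisoned cache line, and the names `copy_mc_sim`/`cow_copy_result` are invented for this sketch:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096
#ifndef EHWPOISON
#define EHWPOISON 133          /* Linux errno value, for portability */
#endif

static char src_page[PAGE_SIZE], dst_page[PAGE_SIZE];

/* Like copy_mc_to_kernel(): copies until the simulated poisoned byte and
 * returns how many bytes were left uncopied (0 on full success). Pass
 * poison_at >= len for a clean page. */
static size_t copy_mc_sim(void *dst, const void *src, size_t len,
                          size_t poison_at)
{
    size_t n = poison_at < len ? poison_at : len;

    memcpy(dst, src, n);       /* the copy stops at the "poisoned" byte */
    return len - n;
}

/* Like the __wp_page_copy_user() hunk: any short copy becomes -EHWPOISON,
 * which the caller translates to VM_FAULT_HWPOISON / SIGBUS. */
int cow_copy_result(size_t poison_at)
{
    if (copy_mc_sim(dst_page, src_page, PAGE_SIZE, poison_at))
        return -EHWPOISON;
    return 0;
}
```

The "bytes not copied" return style is the same convention copy_from_user() uses, which is presumably why the machine-check-safe variant adopted it.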
Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults
> diff --git a/mm/memory.c b/mm/memory.c
> index b6056eef2f72..4a1304cf1f4e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2848,6 +2848,37 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
>      return same;
>  }
>
> +#ifdef CONFIG_MEMORY_FAILURE
> +struct pfn_work {
> +    struct work_struct work;
> +    unsigned long pfn;
> +};
> +
> +static void do_sched_memory_failure(struct work_struct *w)
> +{
> +    struct pfn_work *p = container_of(w, struct pfn_work, work);
> +
> +    memory_failure(p->pfn, 0);
> +    kfree(p);
> +}
> +
> +static void sched_memory_failure(unsigned long pfn)
> +{
> +    struct pfn_work *p;
> +
> +    p = kmalloc(sizeof *p, GFP_KERNEL);
> +    if (!p)
> +        return;
> +    INIT_WORK(&p->work, do_sched_memory_failure);
> +    p->pfn = pfn;
> +    schedule_work(&p->work);

There is already memory_failure_queue() that can do this. Can we use it
directly?

Thanks,
Miaohe Lin

> +}
> +#else
> +static void sched_memory_failure(unsigned long pfn)
> +{
> +}
> +#endif
> +
>  /*
>   * Return:
>   *   0: copied succeeded
> @@ -2866,8 +2897,10 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
>      unsigned long addr = vmf->address;
>
>      if (likely(src)) {
> -        if (copy_mc_user_highpage(dst, src, addr, vma))
> +        if (copy_mc_user_highpage(dst, src, addr, vma)) {
> +            sched_memory_failure(page_to_pfn(src));
>              return -EHWPOISON;
> +        }
>          return 0;
>      }
>
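The embed-a-work_struct pattern in the quoted v2 patch (later replaced by memory_failure_queue()) hinges on container_of() recovering the outer struct from the embedded member. A userspace sketch, with the kernel types and scheduler replaced by minimal stand-ins so it compiles outside the kernel:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal stand-ins: nothing here is real kernel API. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct work_struct {
    void (*func)(struct work_struct *w);
};

struct pfn_work {
    struct work_struct work;   /* embedded generic work item */
    unsigned long pfn;         /* private payload */
};

static unsigned long handled_pfn;

static void do_sched_memory_failure(struct work_struct *w)
{
    /* Recover the outer pfn_work from the embedded work member. */
    struct pfn_work *p = container_of(w, struct pfn_work, work);

    handled_pfn = p->pfn;      /* the kernel would call memory_failure() */
    free(p);
}

/* Queue and immediately "run" the work; a real workqueue would run
 * do_sched_memory_failure() later from a worker thread. */
unsigned long run_deferred(unsigned long pfn)
{
    struct pfn_work *p = malloc(sizeof(*p));

    if (!p)
        return 0;
    p->work.func = do_sched_memory_failure;
    p->pfn = pfn;
    p->work.func(&p->work);    /* stand-in for schedule_work(&p->work) */
    return handled_pfn;
}
```

Miaohe's point stands: memory_failure_queue() already packages exactly this deferral, without a GFP_KERNEL allocation on the error path.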
Re: [PATCH v11 04/13] mm/ioremap: rename ioremap_*_range to vmap_*_range
Hi:

On 2021/1/26 12:45, Nicholas Piggin wrote:
> This will be used as a generic kernel virtual mapping function, so
> re-name it in preparation.

Looks good to me. Thanks.

Reviewed-by: Miaohe Lin

> Signed-off-by: Nicholas Piggin
> ---
>  mm/ioremap.c | 64 +++++++++++++++++++++++++++++---------------------------------
>  1 file changed, 33 insertions(+), 31 deletions(-)
>
> diff --git a/mm/ioremap.c b/mm/ioremap.c
> index 5fa1ab41d152..3f4d36f9745a 100644
> --- a/mm/ioremap.c
> +++ b/mm/ioremap.c
> @@ -61,9 +61,9 @@ static inline int ioremap_pud_enabled(void) { return 0; }
>  static inline int ioremap_pmd_enabled(void) { return 0; }
>  #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
>
> -static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
> -        unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
> -        pgtbl_mod_mask *mask)
> +static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> +            phys_addr_t phys_addr, pgprot_t prot,
> +            pgtbl_mod_mask *mask)
>  {
>      pte_t *pte;
>      u64 pfn;
> @@ -81,9 +81,8 @@ static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
>      return 0;
>  }
>
> -static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr,
> -                unsigned long end, phys_addr_t phys_addr,
> -                pgprot_t prot)
> +static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
> +            phys_addr_t phys_addr, pgprot_t prot)
>  {
>      if (!ioremap_pmd_enabled())
>          return 0;
> @@ -103,9 +102,9 @@ static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr,
>      return pmd_set_huge(pmd, phys_addr, prot);
>  }
>
> -static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
> -        unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
> -        pgtbl_mod_mask *mask)
> +static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> +            phys_addr_t phys_addr, pgprot_t prot,
> +            pgtbl_mod_mask *mask)
>  {
>      pmd_t *pmd;
>      unsigned long next;
> @@ -116,20 +115,19 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
>      do {
>          next = pmd_addr_end(addr, end);
>
> -        if
> (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
> +        if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
>              *mask |= PGTBL_PMD_MODIFIED;
>              continue;
>          }
>
> -        if (ioremap_pte_range(pmd, addr, next, phys_addr, prot, mask))
> +        if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
>              return -ENOMEM;
>      } while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
>      return 0;
>  }
>
> -static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr,
> -                unsigned long end, phys_addr_t phys_addr,
> -                pgprot_t prot)
> +static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
> +            phys_addr_t phys_addr, pgprot_t prot)
>  {
>      if (!ioremap_pud_enabled())
>          return 0;
> @@ -149,9 +147,9 @@ static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr,
>      return pud_set_huge(pud, phys_addr, prot);
>  }
>
> -static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
> -        unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
> -        pgtbl_mod_mask *mask)
> +static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> +            phys_addr_t phys_addr, pgprot_t prot,
> +            pgtbl_mod_mask *mask)
>  {
>      pud_t *pud;
>      unsigned long next;
> @@ -162,20 +160,19 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
>      do {
>          next = pud_addr_end(addr, end);
>
> -        if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
> +        if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
>              *mask |= PGTBL_PUD_MODIFIED;
>              continue;
>          }
>
> -        if (ioremap_pmd_range(pud, addr, next, phys_addr, prot, mask))
> +        if (vmap_pmd_range(pud, addr, next, phys_addr, prot, mask))
>              return -ENOMEM;
>      } while (pud++, phys_addr += (next - addr), addr = next, addr != end);
>      return 0;
>  }
>
> -static int ioremap_try_huge_p4d(p4
Re: [PATCH v11 02/13] mm: apply_to_pte_range warn and fail if a large pte is encountered
Hi:

On 2021/1/26 12:44, Nicholas Piggin wrote:
> apply_to_pte_range might mistake a large pte for bad, or treat it as a
> page table, resulting in a crash or corruption. Add a test to warn and
> return error if large entries are found.
>
> Reviewed-by: Christoph Hellwig
> Signed-off-by: Nicholas Piggin
> ---
>  mm/memory.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 49 insertions(+), 17 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index feff48e1465a..672e39a72788 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2440,13 +2440,21 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
>      }
>      do {
>          next = pmd_addr_end(addr, end);
> -        if (create || !pmd_none_or_clear_bad(pmd)) {
> -            err = apply_to_pte_range(mm, pmd, addr, next, fn, data,
> -                         create, mask);
> -            if (err)
> -                break;
> +        if (pmd_none(*pmd) && !create)
> +            continue;
> +        if (WARN_ON_ONCE(pmd_leaf(*pmd)))
> +            return -EINVAL;
> +        if (!pmd_none(*pmd) && WARN_ON_ONCE(pmd_bad(*pmd))) {
> +            if (!create)
> +                continue;
> +            pmd_clear_bad(pmd);
>          }
> +        err = apply_to_pte_range(mm, pmd, addr, next,
> +                     fn, data, create, mask);
> +        if (err)
> +            break;
>      } while (pmd++, addr = next, addr != end);
> +
>      return err;
>  }
>
> @@ -2468,13 +2476,21 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
>      }
>      do {
>          next = pud_addr_end(addr, end);
> -        if (create || !pud_none_or_clear_bad(pud)) {
> -            err = apply_to_pmd_range(mm, pud, addr, next, fn, data,
> -                         create, mask);
> -            if (err)
> -                break;
> +        if (pud_none(*pud) && !create)
> +            continue;
> +        if (WARN_ON_ONCE(pud_leaf(*pud)))
> +            return -EINVAL;
> +        if (!pud_none(*pud) && WARN_ON_ONCE(pud_bad(*pud))) {
> +            if (!create)
> +                continue;
> +            pud_clear_bad(pud);
>          }
> +        err = apply_to_pmd_range(mm, pud, addr, next,
> +                     fn, data, create, mask);
> +        if (err)
> +            break;
>      } while (pud++, addr = next, addr != end);
> +
>      return err;
>  }
>
> @@ -2496,13 +2512,21 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
>      }
>      do {
>          next =
> p4d_addr_end(addr, end);
> -        if (create || !p4d_none_or_clear_bad(p4d)) {
> -            err = apply_to_pud_range(mm, p4d, addr, next, fn, data,
> -                         create, mask);
> -            if (err)
> -                break;
> +        if (p4d_none(*p4d) && !create)
> +            continue;
> +        if (WARN_ON_ONCE(p4d_leaf(*p4d)))
> +            return -EINVAL;
> +        if (!p4d_none(*p4d) && WARN_ON_ONCE(p4d_bad(*p4d))) {
> +            if (!create)
> +                continue;
> +            p4d_clear_bad(p4d);
>          }
> +        err = apply_to_pud_range(mm, p4d, addr, next,
> +                     fn, data, create, mask);
> +        if (err)
> +            break;
>      } while (p4d++, addr = next, addr != end);
> +
>      return err;
>  }
>
> @@ -2522,9 +2546,17 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>      pgd = pgd_offset(mm, addr);
>      do {
>          next = pgd_addr_end(addr, end);
> -        if (!create && pgd_none_or_clear_bad(pgd))
> +        if (pgd_none(*pgd) && !create)
>              continue;
> -        err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create,
> -                     &mask);
> +        if (WARN_ON_ONCE(pgd_leaf(*pgd)))
> +            return -EINVAL;
> +        if (!pgd_none(*pgd) && WARN_ON_ONCE(pgd_bad(*pgd))) {
> +            if (!create)
> +                continue;
> +            pgd_clear_bad(pgd);
> +        }
> +        err = apply_to_p4d_range(mm, pgd, addr, next,
> +                     fn, data, create, &mask);
>          if (err)
>              break;
>      } while (pgd++, addr = next, addr != end);
>

Looks good to me, thanks.

Reviewed-by: Miaohe Lin
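The patch applies the same check ordering at every level of the walk: skip empty entries (unless creating), refuse large leaf entries with -EINVAL, and clear bad entries only when creating. A self-contained userspace model of that per-entry decision, with an invented `enum entry` standing in for the real page-table entry tests:

```c
#include <assert.h>
#include <errno.h>

/* Illustrative model only: ENTRY_* stands in for p*d_none()/p*d_leaf()/
 * p*d_bad() on a real entry. Returns 0 on success (descend, or skip with
 * *skipped = 1) and -EINVAL for a large leaf, mirroring the patch. */
enum entry { ENTRY_NONE, ENTRY_TABLE, ENTRY_LEAF, ENTRY_BAD };

int apply_one_level(enum entry e, int create, int *skipped)
{
    *skipped = 0;
    if (e == ENTRY_NONE && !create) {
        *skipped = 1;          /* continue: nothing mapped here */
        return 0;
    }
    if (e == ENTRY_LEAF)
        return -EINVAL;        /* WARN_ON_ONCE(p*d_leaf()): can't descend
                                * into a huge mapping as if it were a table */
    if (e == ENTRY_BAD) {
        if (!create) {
            *skipped = 1;      /* continue: don't touch a bad entry */
            return 0;
        }
        /* create path: p*d_clear_bad() would reset the entry here */
    }
    return 0;                  /* descend to the next level */
}
```

Note the ordering matters: the leaf test must come before the bad test, because on some architectures a huge entry would also trip p*d_bad(), which is exactly the confusion the patch description warns about.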
Re: [PATCH v11 01/13] mm/vmalloc: fix HUGE_VMAP regression by enabling huge pages in vmalloc_to_page
Hi:

On 2021/1/26 12:44, Nicholas Piggin wrote:
> vmalloc_to_page returns NULL for addresses mapped by larger pages[*].
> Whether or not a vmap is huge depends on the architecture details,
> alignments, boot options, etc., which the caller can not be expected
> to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page.
>
> This change teaches vmalloc_to_page about larger pages, and returns
> the struct page that corresponds to the offset within the large page.
> This makes the API agnostic to mapping implementation details.
>
> [*] As explained by commit 029c54b095995 ("mm/vmalloc.c: huge-vmap:
>     fail gracefully on unexpected huge vmap mappings")
>
> Reviewed-by: Christoph Hellwig
> Signed-off-by: Nicholas Piggin
> ---
>  mm/vmalloc.c | 41 ++++++++++++++++++++++++++++---------------
>  1 file changed, 26 insertions(+), 15 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index e6f352bf0498..62372f9e0167 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -34,7 +34,7 @@
>  #include <linux/bitops.h>
>  #include <linux/rbtree_augmented.h>
>  #include <linux/overflow.h>
> -
> +#include <linux/pgtable.h>
>  #include <asm/io.h>
>  #include <asm/tlbflush.h>
>  #include <asm/shmparam.h>
> @@ -343,7 +343,9 @@ int is_vmalloc_or_module_addr(const void *x)
>  }
>
>  /*
> - * Walk a vmap address to the struct page it maps.
> + * Walk a vmap address to the struct page it maps. Huge vmap mappings will
> + * return the tail page that corresponds to the base page address, which
> + * matches small vmap mappings.
>   */
>  struct page *vmalloc_to_page(const void *vmalloc_addr)
>  {
> @@ -363,25 +365,33 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>
>      if (pgd_none(*pgd))
>          return NULL;
> +    if (WARN_ON_ONCE(pgd_leaf(*pgd)))
> +        return NULL; /* XXX: no allowance for huge pgd */
> +    if (WARN_ON_ONCE(pgd_bad(*pgd)))
> +        return NULL;
> +
>      p4d = p4d_offset(pgd, addr);
>      if (p4d_none(*p4d))
>          return NULL;
> -    pud = pud_offset(p4d, addr);
> +    if (p4d_leaf(*p4d))
> +        return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
> +    if (WARN_ON_ONCE(p4d_bad(*p4d)))
> +        return NULL;
>
> -    /*
> -     * Don't dereference bad PUD or PMD (below) entries. This will also
> -     * identify huge mappings, which we may encounter on architectures
> -     * that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be
> -     * identified as vmalloc addresses by is_vmalloc_addr(), but are
> -     * not [unambiguously] associated with a struct page, so there is
> -     * no correct value to return for them.
> -     */
> -    WARN_ON_ONCE(pud_bad(*pud));
> -    if (pud_none(*pud) || pud_bad(*pud))
> +    pud = pud_offset(p4d, addr);
> +    if (pud_none(*pud))
> +        return NULL;
> +    if (pud_leaf(*pud))
> +        return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> +    if (WARN_ON_ONCE(pud_bad(*pud)))
>          return NULL;
> +
>      pmd = pmd_offset(pud, addr);
> -    WARN_ON_ONCE(pmd_bad(*pmd));
> -    if (pmd_none(*pmd) || pmd_bad(*pmd))
> +    if (pmd_none(*pmd))
> +        return NULL;
> +    if (pmd_leaf(*pmd))
> +        return pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> +    if (WARN_ON_ONCE(pmd_bad(*pmd)))
>          return NULL;
>
>      ptep = pte_offset_map(pmd, addr);
> @@ -389,6 +399,7 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>      if (pte_present(pte))
>          page = pte_page(pte);
>      pte_unmap(ptep);
> +
>      return page;
>  }
>  EXPORT_SYMBOL(vmalloc_to_page);

LGTM. Thanks.

Reviewed-by: Miaohe Lin
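The huge-page cases added by this patch all compute the same thing: how many base pages into the large mapping a given address falls, so the correct tail page can be returned. The arithmetic can be checked standalone; the constants below assume x86-64 defaults (4 KiB base pages, 2 MiB PMD mappings) purely for illustration, and `pages_into_pmd` is an invented helper name:

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PMD_SHIFT  21
#define PMD_MASK   (~((1UL << PMD_SHIFT) - 1))

/* Mirrors the patch's: pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT)
 * addr & ~PMD_MASK keeps the offset within the 2 MiB mapping; shifting by
 * PAGE_SHIFT converts that byte offset into an index of 4 KiB base pages. */
unsigned long pages_into_pmd(unsigned long addr)
{
    return (addr & ~PMD_MASK) >> PAGE_SHIFT;
}
```

The P4D and PUD cases in the diff are identical except for the mask, which is why the resulting API is "agnostic to mapping implementation details": the caller always gets the struct page backing its exact address.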