Re: [PATCH v3 2/2] mm, hwpoison: When copy-on-write hits poison, take page offline

2022-10-28 Thread Miaohe Lin
On 2022/10/29 0:13, Luck, Tony wrote:
>>> Cannot call memory_failure() directly from the fault handler because
>>> mmap_lock (and others) are held.
>>
>> Could you please explain which lock makes it infeasible to call
>> memory_failure() directly, and why? I'm somewhat confused. But I agree that
>> using memory_failure_queue() is a good idea.
> 
> I tried calling memory_failure() directly, and my system just hung. I made
> the assumption that it had deadlocked based somewhat on the comments in
> mm/memory.c about mmap_lock being held ... but I didn't dig into what had
> gone wrong.

I see. Thanks for your explanation. :)

> 
> -Tony
> 
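For readers landing on this sub-thread first: the fix only queues the pfn while
mmap_lock is held, and the queued work runs memory_failure() later from a
workqueue, outside the fault path. The v3 call site (quoted in full further
down) is roughly:

    /* In __wp_page_copy_user(), with mmap_lock held: defer, don't handle. */
    if (copy_mc_user_highpage(dst, src, addr, vma)) {
            memory_failure_queue(page_to_pfn(src), 0);
            return -EHWPOISON;   /* mapped to VM_FAULT_HWPOISON up the stack */
    }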



Re: [PATCH v3 2/2] mm, hwpoison: When copy-on-write hits poison, take page offline

2022-10-27 Thread Miaohe Lin
On 2022/10/22 4:01, Tony Luck wrote:
> Cannot call memory_failure() directly from the fault handler because
> mmap_lock (and others) are held.

Could you please explain which lock makes it infeasible to call
memory_failure() directly, and why? I'm somewhat confused. But I agree that
using memory_failure_queue() is a good idea.

> 
> It is important, but not urgent, to mark the source page as h/w poisoned
> and unmap it from other tasks.
> 
> Use memory_failure_queue() to request a call to memory_failure() for the
> page with the error.
> 
> Also provide a stub version for CONFIG_MEMORY_FAILURE=n
> 
> Signed-off-by: Tony Luck 
> ---
>  include/linux/mm.h | 5 +++-
>  mm/memory.c        | 4 +++-
>  2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8bbcccbc5565..03ced659eb58 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3268,7 +3268,6 @@ enum mf_flags {
>  int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> unsigned long count, int mf_flags);
>  extern int memory_failure(unsigned long pfn, int flags);
> -extern void memory_failure_queue(unsigned long pfn, int flags);
>  extern void memory_failure_queue_kick(int cpu);
>  extern int unpoison_memory(unsigned long pfn);
>  extern int sysctl_memory_failure_early_kill;
> @@ -3277,8 +3276,12 @@ extern void shake_page(struct page *p);
>  extern atomic_long_t num_poisoned_pages __read_mostly;
>  extern int soft_offline_page(unsigned long pfn, int flags);
>  #ifdef CONFIG_MEMORY_FAILURE
> +extern void memory_failure_queue(unsigned long pfn, int flags);
>  extern int __get_huge_page_for_hwpoison(unsigned long pfn, int flags);
>  #else
> +static inline void memory_failure_queue(unsigned long pfn, int flags)
> +{
> +}
>  static inline int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
>  {
>   return 0;
> diff --git a/mm/memory.c b/mm/memory.c
> index b6056eef2f72..eae242351726 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2866,8 +2866,10 @@ static inline int __wp_page_copy_user(struct page 
> *dst, struct page *src,
>   unsigned long addr = vmf->address;
>  
>   if (likely(src)) {
> - if (copy_mc_user_highpage(dst, src, addr, vma))
> + if (copy_mc_user_highpage(dst, src, addr, vma)) {
> +     memory_failure_queue(page_to_pfn(src), 0);

It seems MF_ACTION_REQUIRED is not needed for memory_failure_queue() here, so
passing 0 for flags looks right. Thanks for your patch.

Reviewed-by: Miaohe Lin 

Thanks,
Miaohe Lin



Re: [PATCH v3 1/2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-27 Thread Miaohe Lin
On 2022/10/22 4:01, Tony Luck wrote:
> If the kernel is copying a page as the result of a copy-on-write
> fault and runs into an uncorrectable error, Linux will crash because
> it does not have recovery code for this case where poison is consumed
> by the kernel.
> 
> It is easy to set up a test case. Just inject an error into a private
> page, fork(2), and have the child process write to the page.
> 
> I wrapped that neatly into a test at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
> 
> just enable ACPI error injection and run:
> 
>   # ./einj_mem-uc -f copy-on-write
> 
> Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
> on architectures where that is available (currently x86 and powerpc).
> When an error is detected during the page copy, return VM_FAULT_HWPOISON
> to the caller of wp_page_copy(). This propagates up the call stack. Both x86
> and powerpc have code in their fault handlers to deal with this return code
> by sending a SIGBUS to the application.
> 
> Note that this patch avoids a system crash and signals the process that
> triggered the copy-on-write action. It does not take any action for the
> memory error that is still in the shared page. To handle that, a call to
> memory_failure() is needed. But this cannot be done from wp_page_copy()
> because it holds mmap_lock. Perhaps the architecture fault handlers
> can deal with this loose end in a subsequent patch?
> 
> On Intel/x86 this loose end will often be handled automatically because
> the memory controller provides an additional notification of the h/w
> poison in memory; the handler for this will call memory_failure(). This
> isn't a 100% solution. If there are multiple errors, not all may be
> logged in this way.
> 
> Reviewed-by: Dan Williams 
> Signed-off-by: Tony Luck 

Thanks for your work, Tony.

> 
> ---
> Changes in V3:
> Dan Williams
>   Rename copy_user_highpage_mc() to copy_mc_user_highpage() for
>   consistency with Linus' discussion on names of functions that
>   check for machine check.
>   Write complete functions for the have/have-not copy_mc_to_kernel
>   cases (so grep shows there are two versions)
>   Change __wp_page_copy_user() to return 0 for success, negative for fail
>   [I picked -EAGAIN for both non-EHWPOISON cases]
> 
> Changes in V2:
>Naoya Horiguchi:
>   1) Use -EHWPOISON error code instead of minus one.
>   2) Poison path needs also to deal with old_page
>Tony Luck:
>   Rewrote commit message
>   Added some powerpc folks to Cc: list
> ---
>  include/linux/highmem.h | 24 ++++++++++++++++++++++++
>  mm/memory.c             | 30 ++++++++++++++++++++----------
>  2 files changed, 44 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index e9912da5441b..a32c64681f03 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -319,6 +319,30 @@ static inline void copy_user_highpage(struct page *to, 
> struct page *from,
>  
>  #endif
>  
> +#ifdef copy_mc_to_kernel
> +static inline int copy_mc_user_highpage(struct page *to, struct page *from,
> + unsigned long vaddr, struct 
> vm_area_struct *vma)
> +{
> + unsigned long ret;
> + char *vfrom, *vto;
> +
> + vfrom = kmap_local_page(from);
> + vto = kmap_local_page(to);
> + ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);

In copy_user_highpage(), kmsan_unpoison_memory(page_address(to), PAGE_SIZE) is
done after the copy when __HAVE_ARCH_COPY_USER_HIGHPAGE isn't defined. Do we
need to do something similar here? But I'm not familiar with kmsan, so I can
easily be wrong.
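To make the question concrete, a sketch of what I mean (illustrative only,
since whether the machine-check copy path needs it at all is exactly what I'm
unsure about):

    vfrom = kmap_local_page(from);
    vto = kmap_local_page(to);
    ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
    if (!ret)
            /* mirror what the generic copy_user_highpage() does for KMSAN */
            kmsan_unpoison_memory(page_address(to), PAGE_SIZE);
    kunmap_local(vto);
    kunmap_local(vfrom);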

Anyway, this patch looks good to me. Thanks.

Reviewed-by: Miaohe Lin 

Thanks,
Miaohe Lin
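For anyone who wants to see the failure mode without pulling ras-tools, the
scenario from the commit message reduces to a few lines of userspace. A minimal
sketch follows; the error-injection step itself is platform-specific ACPI EINJ
work (which is what einj_mem-uc automates), so it appears only as a placeholder
comment, and without real injection the child simply exits 0:

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void on_sigbus(int sig, siginfo_t *si, void *uc)
    {
            /* Poison consumed during the COW copy is reported as SIGBUS. */
            _exit(42);
    }

    int main(void)
    {
            struct sigaction sa;
            char *page;
            int status;

            memset(&sa, 0, sizeof(sa));
            sa.sa_sigaction = on_sigbus;
            sa.sa_flags = SA_SIGINFO;
            sigaction(SIGBUS, &sa, NULL);

            page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (page == MAP_FAILED)
                    return 1;
            memset(page, 0xaa, 4096);  /* fault the page in, make it dirty */

            /* real test: inject an uncorrectable error into 'page' here */

            if (fork() == 0) {
                    page[0] = 1;  /* COW: kernel copies the (poisoned) page */
                    _exit(0);
            }
            wait(&status);
            printf("child exit status: %d\n", WEXITSTATUS(status));
            return 0;
    }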




Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Miaohe Lin
> diff --git a/mm/memory.c b/mm/memory.c
> index b6056eef2f72..4a1304cf1f4e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2848,6 +2848,37 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
>   return same;
>  }
>  
> +#ifdef CONFIG_MEMORY_FAILURE
> +struct pfn_work {
> + struct work_struct work;
> + unsigned long pfn;
> +};
> +
> +static void do_sched_memory_failure(struct work_struct *w)
> +{
> + struct pfn_work *p = container_of(w, struct pfn_work, work);
> +
> + memory_failure(p->pfn, 0);
> + kfree(p);
> +}
> +
> +static void sched_memory_failure(unsigned long pfn)
> +{
> + struct pfn_work *p;
> +
> + p = kmalloc(sizeof *p, GFP_KERNEL);
> + if (!p)
> + return;
> + INIT_WORK(&p->work, do_sched_memory_failure);
> + p->pfn = pfn;
> + schedule_work(&p->work);

There is already memory_failure_queue() that can do this. Can we use it 
directly?

Thanks,
Miaohe Lin


> +}
> +#else
> +static void sched_memory_failure(unsigned long pfn)
> +{
> +}
> +#endif
> +
>  /*
>   * Return:
>   *   0:  copied succeeded
> @@ -2866,8 +2897,10 @@ static inline int __wp_page_copy_user(struct page 
> *dst, struct page *src,
>   unsigned long addr = vmf->address;
>  
>   if (likely(src)) {
> - if (copy_mc_user_highpage(dst, src, addr, vma))
> + if (copy_mc_user_highpage(dst, src, addr, vma)) {
> + sched_memory_failure(page_to_pfn(src));
>   return -EHWPOISON;
> + }
>   return 0;
>   }
>  
> 



Re: [PATCH v11 04/13] mm/ioremap: rename ioremap_*_range to vmap_*_range

2021-01-27 Thread Miaohe Lin
Hi:
On 2021/1/26 12:45, Nicholas Piggin wrote:
> This will be used as a generic kernel virtual mapping function, so
> re-name it in preparation.
> 

Looks good to me. Thanks.

Reviewed-by: Miaohe Lin 

> Signed-off-by: Nicholas Piggin 
> ---
>  mm/ioremap.c | 64 +++-
>  1 file changed, 33 insertions(+), 31 deletions(-)
> 
> diff --git a/mm/ioremap.c b/mm/ioremap.c
> index 5fa1ab41d152..3f4d36f9745a 100644
> --- a/mm/ioremap.c
> +++ b/mm/ioremap.c
> @@ -61,9 +61,9 @@ static inline int ioremap_pud_enabled(void) { return 0; }
>  static inline int ioremap_pmd_enabled(void) { return 0; }
>  #endif   /* CONFIG_HAVE_ARCH_HUGE_VMAP */
>  
> -static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
> - unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
> - pgtbl_mod_mask *mask)
> +static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> + phys_addr_t phys_addr, pgprot_t prot,
> + pgtbl_mod_mask *mask)
>  {
>   pte_t *pte;
>   u64 pfn;
> @@ -81,9 +81,8 @@ static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
>   return 0;
>  }
>  
> -static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr,
> - unsigned long end, phys_addr_t phys_addr,
> - pgprot_t prot)
> +static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long 
> end,
> + phys_addr_t phys_addr, pgprot_t prot)
>  {
>   if (!ioremap_pmd_enabled())
>   return 0;
> @@ -103,9 +102,9 @@ static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long 
> addr,
>   return pmd_set_huge(pmd, phys_addr, prot);
>  }
>  
> -static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
> - unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
> - pgtbl_mod_mask *mask)
> +static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> + phys_addr_t phys_addr, pgprot_t prot,
> + pgtbl_mod_mask *mask)
>  {
>   pmd_t *pmd;
>   unsigned long next;
> @@ -116,20 +115,19 @@ static inline int ioremap_pmd_range(pud_t *pud, 
> unsigned long addr,
>   do {
>   next = pmd_addr_end(addr, end);
>  
> - if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
> + if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
>   *mask |= PGTBL_PMD_MODIFIED;
>   continue;
>   }
>  
> - if (ioremap_pte_range(pmd, addr, next, phys_addr, prot, mask))
> + if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
>   return -ENOMEM;
>   } while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
>   return 0;
>  }
>  
> -static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr,
> - unsigned long end, phys_addr_t phys_addr,
> - pgprot_t prot)
> +static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long 
> end,
> + phys_addr_t phys_addr, pgprot_t prot)
>  {
>   if (!ioremap_pud_enabled())
>   return 0;
> @@ -149,9 +147,9 @@ static int ioremap_try_huge_pud(pud_t *pud, unsigned long 
> addr,
>   return pud_set_huge(pud, phys_addr, prot);
>  }
>  
> -static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
> - unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
> - pgtbl_mod_mask *mask)
> +static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> + phys_addr_t phys_addr, pgprot_t prot,
> + pgtbl_mod_mask *mask)
>  {
>   pud_t *pud;
>   unsigned long next;
> @@ -162,20 +160,19 @@ static inline int ioremap_pud_range(p4d_t *p4d, 
> unsigned long addr,
>   do {
>   next = pud_addr_end(addr, end);
>  
> - if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
> + if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
>   *mask |= PGTBL_PUD_MODIFIED;
>   continue;
>   }
>  
> - if (ioremap_pmd_range(pud, addr, next, phys_addr, prot, mask))
> + if (vmap_pmd_range(pud, addr, next, phys_addr, prot, mask))
>   return -ENOMEM;
>   } while (pud++, phys_addr += (next - addr), addr = next, addr != end);
>   return 0;
>  }
>  
> -static int ioremap_try_huge_p4d(p4

Re: [PATCH v11 02/13] mm: apply_to_pte_range warn and fail if a large pte is encountered

2021-01-26 Thread Miaohe Lin
Hi:
On 2021/1/26 12:44, Nicholas Piggin wrote:
> apply_to_pte_range might mistake a large pte for bad, or treat it as a
> page table, resulting in a crash or corruption. Add a test to warn and
> return error if large entries are found.
> 
> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Nicholas Piggin 
> ---
>  mm/memory.c | 66 +++--
>  1 file changed, 49 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index feff48e1465a..672e39a72788 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2440,13 +2440,21 @@ static int apply_to_pmd_range(struct mm_struct *mm, 
> pud_t *pud,
>   }
>   do {
>   next = pmd_addr_end(addr, end);
> - if (create || !pmd_none_or_clear_bad(pmd)) {
> - err = apply_to_pte_range(mm, pmd, addr, next, fn, data,
> -  create, mask);
> - if (err)
> - break;
> + if (pmd_none(*pmd) && !create)
> + continue;
> + if (WARN_ON_ONCE(pmd_leaf(*pmd)))
> + return -EINVAL;
> + if (!pmd_none(*pmd) && WARN_ON_ONCE(pmd_bad(*pmd))) {
> + if (!create)
> + continue;
> + pmd_clear_bad(pmd);
>   }
> + err = apply_to_pte_range(mm, pmd, addr, next,
> +  fn, data, create, mask);
> + if (err)
> + break;
>   } while (pmd++, addr = next, addr != end);
> +
>   return err;
>  }
>  
> @@ -2468,13 +2476,21 @@ static int apply_to_pud_range(struct mm_struct *mm, 
> p4d_t *p4d,
>   }
>   do {
>   next = pud_addr_end(addr, end);
> - if (create || !pud_none_or_clear_bad(pud)) {
> - err = apply_to_pmd_range(mm, pud, addr, next, fn, data,
> -  create, mask);
> - if (err)
> - break;
> + if (pud_none(*pud) && !create)
> + continue;
> + if (WARN_ON_ONCE(pud_leaf(*pud)))
> + return -EINVAL;
> + if (!pud_none(*pud) && WARN_ON_ONCE(pud_bad(*pud))) {
> + if (!create)
> + continue;
> + pud_clear_bad(pud);
>   }
> + err = apply_to_pmd_range(mm, pud, addr, next,
> +  fn, data, create, mask);
> + if (err)
> + break;
>   } while (pud++, addr = next, addr != end);
> +
>   return err;
>  }
>  
> @@ -2496,13 +2512,21 @@ static int apply_to_p4d_range(struct mm_struct *mm, 
> pgd_t *pgd,
>   }
>   do {
>   next = p4d_addr_end(addr, end);
> - if (create || !p4d_none_or_clear_bad(p4d)) {
> - err = apply_to_pud_range(mm, p4d, addr, next, fn, data,
> -  create, mask);
> - if (err)
> - break;
> + if (p4d_none(*p4d) && !create)
> + continue;
> + if (WARN_ON_ONCE(p4d_leaf(*p4d)))
> + return -EINVAL;
> + if (!p4d_none(*p4d) && WARN_ON_ONCE(p4d_bad(*p4d))) {
> + if (!create)
> + continue;
> + p4d_clear_bad(p4d);
>   }
> + err = apply_to_pud_range(mm, p4d, addr, next,
> +  fn, data, create, mask);
> + if (err)
> + break;
>   } while (p4d++, addr = next, addr != end);
> +
>   return err;
>  }
>  
> @@ -2522,9 +2546,17 @@ static int __apply_to_page_range(struct mm_struct *mm, 
> unsigned long addr,
>   pgd = pgd_offset(mm, addr);
>   do {
>   next = pgd_addr_end(addr, end);
> - if (!create && pgd_none_or_clear_bad(pgd))
> + if (pgd_none(*pgd) && !create)
>   continue;
> - err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create, &mask);
> + if (WARN_ON_ONCE(pgd_leaf(*pgd)))
> + return -EINVAL;
> + if (!pgd_none(*pgd) && WARN_ON_ONCE(pgd_bad(*pgd))) {
> +         if (!create)
> + continue;
> + pgd_clear_bad(pgd);
> + }
> + err = apply_to_p4d_range(mm, pgd, addr, next,
> +  fn, data, create, &mask);
>   if (err)
>   break;
>   } while (pgd++, addr = next, addr != end);
> 

Looks good to me, thanks.

Reviewed-by: Miaohe Lin 
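As a usage-side illustration (a sketch with invented names, not code from this
series): callers of apply_to_page_range() hand it a pte_fn_t that expects to be
given only base-page PTEs. With this patch, a walk that unexpectedly runs into
a huge leaf entry now fails with -EINVAL and a one-time warning instead of
treating the leaf as a page-table page:

    /* Illustrative only: count present base-page PTEs in [addr, addr + size). */
    static int count_present_pte(pte_t *pte, unsigned long addr, void *data)
    {
            unsigned long *count = data;

            if (pte_present(*pte))
                    (*count)++;
            return 0;               /* non-zero would abort the walk */
    }

    static unsigned long count_mapped_pages(struct mm_struct *mm,
                                            unsigned long addr,
                                            unsigned long size)
    {
            unsigned long count = 0;
            int err;

            err = apply_to_page_range(mm, addr, size, count_present_pte, &count);
            /* err is now -EINVAL (plus a warning) if a huge entry was hit. */
            return err ? 0 : count;
    }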


Re: [PATCH v11 01/13] mm/vmalloc: fix HUGE_VMAP regression by enabling huge pages in vmalloc_to_page

2021-01-26 Thread Miaohe Lin
Hi:
On 2021/1/26 12:44, Nicholas Piggin wrote:
> vmalloc_to_page returns NULL for addresses mapped by larger pages[*].
> Whether or not a vmap is huge depends on the architecture details,
> alignments, boot options, etc., which the caller can not be expected
> to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page.
> 
> This change teaches vmalloc_to_page about larger pages, and returns
> the struct page that corresponds to the offset within the large page.
> This makes the API agnostic to mapping implementation details.
> 
> [*] As explained by commit 029c54b095995 ("mm/vmalloc.c: huge-vmap:
> fail gracefully on unexpected huge vmap mappings")
> 
> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Nicholas Piggin 
> ---
>  mm/vmalloc.c | 41 ++---
>  1 file changed, 26 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index e6f352bf0498..62372f9e0167 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -34,7 +34,7 @@
>  #include 
>  #include 
>  #include 
> -
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -343,7 +343,9 @@ int is_vmalloc_or_module_addr(const void *x)
>  }
>  
>  /*
> - * Walk a vmap address to the struct page it maps.
> + * Walk a vmap address to the struct page it maps. Huge vmap mappings will
> + * return the tail page that corresponds to the base page address, which
> + * matches small vmap mappings.
>   */
>  struct page *vmalloc_to_page(const void *vmalloc_addr)
>  {
> @@ -363,25 +365,33 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>  
>   if (pgd_none(*pgd))
>   return NULL;
> + if (WARN_ON_ONCE(pgd_leaf(*pgd)))
> + return NULL; /* XXX: no allowance for huge pgd */
> + if (WARN_ON_ONCE(pgd_bad(*pgd)))
> + return NULL;
> +
>   p4d = p4d_offset(pgd, addr);
>   if (p4d_none(*p4d))
>   return NULL;
> - pud = pud_offset(p4d, addr);
> + if (p4d_leaf(*p4d))
> + return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
> + if (WARN_ON_ONCE(p4d_bad(*p4d)))
> + return NULL;
>  
> - /*
> -  * Don't dereference bad PUD or PMD (below) entries. This will also
> -  * identify huge mappings, which we may encounter on architectures
> -  * that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be
> -  * identified as vmalloc addresses by is_vmalloc_addr(), but are
> -  * not [unambiguously] associated with a struct page, so there is
> -  * no correct value to return for them.
> -  */
> - WARN_ON_ONCE(pud_bad(*pud));
> - if (pud_none(*pud) || pud_bad(*pud))
> + pud = pud_offset(p4d, addr);
> + if (pud_none(*pud))
> + return NULL;
> + if (pud_leaf(*pud))
> + return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> + if (WARN_ON_ONCE(pud_bad(*pud)))
>   return NULL;
> +
>   pmd = pmd_offset(pud, addr);
> - WARN_ON_ONCE(pmd_bad(*pmd));
> - if (pmd_none(*pmd) || pmd_bad(*pmd))
> + if (pmd_none(*pmd))
> + return NULL;
> + if (pmd_leaf(*pmd))
> + return pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> + if (WARN_ON_ONCE(pmd_bad(*pmd)))
>   return NULL;
>  
>   ptep = pte_offset_map(pmd, addr);
> @@ -389,6 +399,7 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>   if (pte_present(pte))
>   page = pte_page(pte);
>   pte_unmap(ptep);
> +
>   return page;
>  }
>  EXPORT_SYMBOL(vmalloc_to_page);
> 

LGTM. Thanks.

Reviewed-by: Miaohe Lin
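As a concrete illustration of the offset arithmetic above (assuming x86-64 with
4 KB base pages and 2 MB PMD huge mappings, i.e. PAGE_SHIFT = 12 and
PMD_SHIFT = 21): for an address that lies 0x3000 bytes into a PMD-mapped
region, addr & ~PMD_MASK == 0x3000 and 0x3000 >> PAGE_SHIFT == 3, so the
function returns pmd_page(*pmd) + 3, the same struct page that a base-page walk
of that address would have reached.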