Re: [PATCH v2 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address

2021-04-18 Thread Aili Yao
On Mon, 19 Apr 2021 11:36:58 +0900
Naoya Horiguchi  wrote:

> > > 2. In the function hwpoison_pte_range():
> > > if (pfn <= hwp->pfn && hwp->pfn < pfn + PMD_SIZE): this check seems like it
> > > should use PMD_SIZE/PAGE_SIZE or some macro like that?
> > 
> > Thanks, that's right.  HPAGE_PMD_NR seems to fit here.
> > We also need "#ifdef CONFIG_TRANSPARENT_HUGEPAGE" to use it.  
> 
> I found that the #ifdef is not necessary because the whole
> "if (ptl)" is compiled out.  So I don't add #ifdef.
> 
> Here's the v2 of 3/3.
> 
> Aili, could you test with it?
> 
> Thanks,
> Naoya Horiguchi
> 

I tested this v2 version. In my test, the patches worked as expected and the
previous issues didn't happen again.

Tested-by: Aili Yao 

Thanks,
Aili Yao

> -
> From: Naoya Horiguchi 
> Date: Tue, 13 Apr 2021 07:26:25 +0900
> Subject: [PATCH v2 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address
> 
> The previous patch solves the infinite MCE loop issue when multiple
> MCE events race.  The remaining issue is to make sure that all threads
> processing Action Required MCEs send the SIGBUS to the current process
> with the proper virtual address and error size.
> 
> This patch does a page table walk to find the error virtual
> address.  If we find multiple virtual addresses during the walk, we can't
> determine which one is correct, so we fall back to sending SIGBUS in
> kill_me_maybe() without error info as we do now.  This corner case needs
> to be solved in the future.
> 
> Signed-off-by: Naoya Horiguchi 
> ---
> change log v1 -> v2:
> - initialize local variables in check_hwpoisoned_entry() and
>   hwpoison_pte_range()
> - fix and improve logic to calculate error address offset.
> ---
>  arch/x86/kernel/cpu/mce/core.c |  13 ++-
>  include/linux/swapops.h|   5 ++
>  mm/memory-failure.c| 147 -
>  3 files changed, 161 insertions(+), 4 deletions(-)
> 
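For readers skimming the archive, here is a rough sketch of the idea described in the patch above. It is an approximation written for this summary, not the patch body itself: struct to_kill and set_to_kill() are the existing helpers in mm/memory-failure.c, hwpoison_pte_entry is a hypothetical pte-level callback, and the pmd-level handling (hwpoison_pte_range) is only hinted at in the comments.

        struct hwp_walk {
                struct to_kill tk;      /* remembers the single candidate address */
                unsigned long pfn;      /* pfn of the poisoned page */
                int flags;
        };

        static int hwpoison_pte_entry(pte_t *ptep, unsigned long addr,
                                      unsigned long next, struct mm_walk *walk)
        {
                struct hwp_walk *hwp = walk->private;
                pte_t pte = *ptep;

                /* set_to_kill() returns non-zero if an address was already
                 * recorded; that stops the walk and the caller falls back to
                 * a SIGBUS without an address, as the description explains.
                 * The posted patch also checks hwpoison swap entries via
                 * swapops.h, omitted here. */
                if (pte_present(pte) && pte_pfn(pte) == hwp->pfn)
                        return set_to_kill(&hwp->tk, addr, PAGE_SHIFT);
                return 0;
        }

        static const struct mm_walk_ops hwpoison_walk_ops = {
                .pte_entry = hwpoison_pte_entry,
                /* the real patch also needs a pmd_entry (hwpoison_pte_range)
                 * for pmd-mapped huge pages, see the HPAGE_PMD_NR discussion
                 * quoted above */
        };

        /* called with mmap_read_lock(p->mm) held:
         *      walk_page_range(p->mm, 0, TASK_SIZE, &hwpoison_walk_ops, &priv);
         */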




Re: [PATCH v1 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE

2021-04-16 Thread Aili Yao
On Tue, 13 Apr 2021 07:43:17 +0900
Naoya Horiguchi  wrote:

> Hi,
> 
> I wrote this patchset to materialize what I think is the current
> allowable solution mentioned by the previous discussion [1].
> I simply borrowed Tony's mutex patch and Aili's return code patch,
> then I queued another one to find error virtual address in the best
> effort manner.  I know that this is not a perfect solution, but
> should work for some typical case.
> 
> My simple testing showed this patchset seems to work as intended,
> but if you have the related testcases, could you please test and
> let me have some feedback?
> 
> Thanks,
> Naoya Horiguchi
> 
> [1]: 
> https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/
> ---
> Summary:
> 
> Aili Yao (1):
>   mm,hwpoison: return -EHWPOISON when page already
> 
> Naoya Horiguchi (1):
>   mm,hwpoison: add kill_accessing_process() to find error virtual address
> 
> Tony Luck (1):
>   mm/memory-failure: Use a mutex to avoid memory_failure() races
> 
>  arch/x86/kernel/cpu/mce/core.c |  13 +++-
>  include/linux/swapops.h|   5 ++
>  mm/memory-failure.c| 166 
> -
>  3 files changed, 178 insertions(+), 6 deletions(-)

Hi Naoya,

Thanks for your patch and the complete fix for this race issue.

I tested your patches; mostly they worked as expected, but in some cases they
failed. I checked and found some doubtful places, could you help confirm them?

1. There is a compile warning:
static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
  unsigned long end, struct mm_walk *walk)
{
struct hwp_walk *hwp = (struct hwp_walk *)walk->private;
int ret; <-- here

It seems this ret may not be initialized, and sometimes an uninitialized ret
may be returned as an error?

and for this:
static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
unsigned long poisoned_pfn, struct to_kill *tk)
{
unsigned long pfn;

I think it is better to initialize it too.

2. In the function hwpoison_pte_range():
if (pfn <= hwp->pfn && hwp->pfn < pfn + PMD_SIZE) -- this check seems like it
should use PMD_SIZE/PAGE_SIZE or some macro like that? (see the sketch after
this list)

3. unsigned long hwpoison_vaddr = addr + (hwp->pfn << PAGE_SHIFT & ~PMD_MASK);
this does not seem exactly accurate? (also covered by the sketch below)

4. static int set_to_kill(struct to_kill *tk, unsigned long addr, short shift)
{
if (tk->addr) {   <-- I am not sure about this check and whether it will
lead to failure.
return 1;
}
In my test it sometimes hits this branch; I don't know whether it is a
multiple-entry issue or a multiple-poison issue. When I hit this failure
there was not enough logging, and I couldn't reproduce it afterwards.
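As a concrete illustration of points 2 and 3, here is my own sketch of what the corrected checks might look like (not the posted v2 hunk; pfn, addr, ptl and ret are assumed to come from the surrounding hwpoison_pte_range() context):

        int ret = 0;    /* point 1: initialize, so the non-huge path cannot return garbage */

        if (ptl) {
                pfn = pmd_pfn(*pmdp);
                /* point 2: a pmd-mapped huge page covers HPAGE_PMD_NR pfns
                 * (i.e. PMD_SIZE / PAGE_SIZE), not PMD_SIZE pfns */
                if (pfn <= hwp->pfn && hwp->pfn < pfn + HPAGE_PMD_NR) {
                        /* point 3: offset is (error pfn - base pfn) pages
                         * from the start of the mapping */
                        unsigned long hwpoison_vaddr =
                                addr + ((hwp->pfn - pfn) << PAGE_SHIFT);

                        ret = set_to_kill(&hwp->tk, hwpoison_vaddr, PAGE_SHIFT);
                }
                spin_unlock(ptl);
                return ret;
        }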

Would you help confirm this? If there are any changes, please post again and I
will do the test again.

Thanks,
Aili Yao


Re: [RFC 0/4] Fix machine check recovery for copy_from_user

2021-04-09 Thread Aili Yao
On Thu, 8 Apr 2021 14:39:09 +
"Luck, Tony"  wrote:

> > I have one scenario you may want to take into account:
> >
> > If a copyin case occurs and write() returns early with your patch, the user
> > process may check the return value, see the error, and exit. The error page
> > would then be freed and might be allocated to another process or to the
> > kernel itself; the new owner would initialize it, triggering an SRAO. If it
> > is used by the kernel, we may do nothing about it, the kernel may still
> > touch it, and that leads to a panic.  
> 
> In this case kill_me_never() calls memory_failure() with flags == 0. I think 
> (hope!)
> that means that it will unmap the page from the task, but will not send a 
> signal.
> 
> When the task exits the PTE for this page has the swap/poison signature, so 
> the
> page is not freed for re-use.
> 
> -Tony
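As background for the "swap/poison signature" Tony mentions above, a simplified sketch of what the unmap step leaves in the page table (mm, address, ptep and page are assumed to come from the unmap context; the helpers are the hwpoison swap-entry ones from include/linux/swapops.h):

        /* Instead of clearing the pte, the unmap path installs a hwpoison
         * swap entry, so the mapping keeps pointing at "poison" rather than
         * at a page that could be freed and re-used. */
        swp_entry_t entry = make_hwpoison_entry(page);

        set_pte_at(mm, address, ptep, swp_entry_to_pte(entry));

        /* A later fault on this address sees is_hwpoison_entry() and gets
         * VM_FAULT_HWPOISON instead of touching the old page. */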

Oh, yes. Sorry for my rudeness and misunderstanding; I just happened to lose
control of my emotions and got confused by some other things.

Thanks!
Aili Yao


Re: [RFC 0/4] Fix machine check recovery for copy_from_user

2021-04-07 Thread Aili Yao
On Thu, 25 Mar 2021 17:02:31 -0700
Tony Luck  wrote:

> Maybe this is the way forward?  I made some poor choices before
> to treat poison consumption in the kernel when accessing user data
> (get_user() or copy_from_user()) ... in particular assuming that
> the right action was sending a SIGBUS to the task as if it had
> synchronously accessed the poison location.
> 
> First three patches may need to be combined (or broken up differently)
> for bisectability. But they are presented separately here since they
> touch separate parts of the problem.
> 
> Second part is definitely incomplete. But I'd like to check that it
> is the right approach before expending more brain cells in the maze
> of nested macros that is lib/iov_iter.c
> 
> Last part has been posted before. It covers the case where the kernel
> takes more than one swing at reading poison data before returning to
> user.
> 
> Tony Luck (4):
>   x86/mce: Fix copyin code to return -EFAULT on machine check.
>   mce/iter: Check for copyin failure & return error up stack
>   mce/copyin: fix to not SIGBUS when copying from user hits poison
>   x86/mce: Avoid infinite loop for copy from user recovery
> 
>  arch/x86/kernel/cpu/mce/core.c | 63 +-
>  arch/x86/kernel/cpu/mce/severity.c |  2 -
>  arch/x86/lib/copy_user_64.S| 18 +
>  fs/iomap/buffered-io.c |  8 +++-
>  include/linux/sched.h  |  2 +-
>  include/linux/uio.h|  2 +-
>  lib/iov_iter.c | 15 ++-
>  7 files changed, 77 insertions(+), 33 deletions(-)
> 

I have one scenario you may want to take into account:

If a copyin case occurs and write() returns early with your patch, the user
process may check the return value, see the error, and exit. The error page
would then be freed and might be allocated to another process or to the
kernel itself; the new owner would initialize it, triggering an SRAO. If it
is used by the kernel, we may do nothing about it, the kernel may still touch
it, and that leads to a panic.

Is this what we expect? 

Thanks!
Aili Yao


Re: [PATCH v7] mm/gup: check page hwpoison status for memory recovery failures.

2021-04-07 Thread Aili Yao
On Wed, 7 Apr 2021 01:54:28 +
HORIGUCHI NAOYA(堀口 直也)  wrote:

> On Tue, Apr 06, 2021 at 10:41:23AM +0800, Aili Yao wrote:
> > When we call get_user_pages() to pin a user page in memory, there may be a
> > hwpoison page. Currently we just handle the normal case, where the memory
> > recovery job finished correctly, and we will not return the hwpoison
> > page to callers; but for other cases, like when memory recovery fails and
> > the related user pte is not correctly set invalid, we will still
> > return the hwpoison page, and callers may touch it and cause a panic.
> > 
> > In gup.c, for a normal page, after we call follow_page_mask() we
> > return the related page pointer; for the other hwpoison case, with the pte
> > set invalid, it returns NULL, and we handle that in the if (!page)
> > branch. In this patch we filter out the hwpoison page in
> > follow_page_mask() and return an error code for the recovery-failure cases.
> > 
> > We check the page hwpoison status as early as possible, avoid the normal
> > follow-up procedure, and try not to grab the related pages.
> > 
> > Changes since v6:
> > - Fix wrong page pointer check in follow_trans_huge_pmd();
> > 
> > Signed-off-by: Aili Yao 
> > Cc: David Hildenbrand 
> > Cc: Matthew Wilcox 
> > Cc: Naoya Horiguchi 
> > Cc: Oscar Salvador 
> > Cc: Mike Kravetz 
> > Cc: Andrew Morton 
> > Cc: sta...@vger.kernel.org
> > ---
> >  mm/gup.c | 27 +++
> >  mm/huge_memory.c | 11 ---
> >  mm/hugetlb.c |  8 +++-
> >  mm/internal.h| 13 +
> >  4 files changed, 51 insertions(+), 8 deletions(-)  
> 
> Thank you for the work.
> 
> Looking through this patch, the internals of follow_page_mask() are
> very complicated, so it's not easy to make this hwpoison-aware.
> Now I'm getting unsure whether this is the best approach.
> What actually I imagined might be like below (which is totally
> untested, and I'm sorry about my previous misleading comments):
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index e40579624f10..a60a08fc7668 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1090,6 +1090,11 @@ static long __get_user_pages(struct mm_struct *mm,
>   } else if (IS_ERR(page)) {
>   ret = PTR_ERR(page);
>   goto out;
> + } else if (gup_flags & FOLL_HWPOISON && PageHWPoison(page)) {
> + if (gup_flags & FOLL_GET)
> + put_page(page);
> + ret = -EHWPOISON;
> + goto out;
>   }
>   if (pages) {
>   pages[i] = page;
> @@ -1532,7 +1537,7 @@ struct page *get_dump_page(unsigned long addr)
>   if (mmap_read_lock_killable(mm))
>   return NULL;
>   ret = __get_user_pages_locked(mm, addr, 1, , NULL, ,
> -   FOLL_FORCE | FOLL_DUMP | FOLL_GET);
> +   FOLL_FORCE | FOLL_DUMP | FOLL_GET | FOLL_HWPOISON);
>   if (locked)
>   mmap_read_unlock(mm);
>   return (ret == 1) ? page : NULL;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a86a58ef132d..03c3d3225c0d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4949,6 +4949,14 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   continue;
>   }
>  
> + if (flags & FOLL_HWPOISON && PageHWPoison(page)) {
> + vaddr += huge_page_size(h);
> + remainder -= pages_per_huge_page(h);
> + i += pages_per_huge_page(h);
> + spin_unlock(ptl);
> + continue;
> + }
> +
>   refs = min3(pages_per_huge_page(h) - pfn_offset,
>   (vma->vm_end - vaddr) >> PAGE_SHIFT, remainder);
>  
> 
> We can surely say that this change only affects get_user_pages() callers
> with FOLL_HWPOISON set, so this should pinpoint the current problem only.
> A side note is that the above change to follow_hugetlb_page() has room for
> refactoring to reduce duplicated code.
> 
> Could you try to test and complete it?

Got it, I will try to complete it and test it.

For the code:

long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   continue;
>   }
>  
> + if (flags & FOLL_HWPOISON && PageHWPoison(page)) {
> + vaddr += huge_page_size(h);
> + remainder -= 

[PATCH v7] mm/gup: check page hwpoison status for memory recovery failures.

2021-04-05 Thread Aili Yao
When we call get_user_pages() to pin a user page in memory, there may be a
hwpoison page. Currently we just handle the normal case, where the memory
recovery job finished correctly, and we will not return the hwpoison
page to callers; but for other cases, like when memory recovery fails and
the related user pte is not correctly set invalid, we will still
return the hwpoison page, and callers may touch it and cause a panic.

In gup.c, for a normal page, after we call follow_page_mask() we
return the related page pointer; for the other hwpoison case, with the pte
set invalid, it returns NULL, and we handle that in the if (!page)
branch. In this patch we filter out the hwpoison page in
follow_page_mask() and return an error code for the recovery-failure cases.

We check the page hwpoison status as early as possible, avoid the normal
follow-up procedure, and try not to grab the related pages.

Changes since v6:
- Fix wrong page pointer check in follow_trans_huge_pmd();

Signed-off-by: Aili Yao 
Cc: David Hildenbrand 
Cc: Matthew Wilcox 
Cc: Naoya Horiguchi 
Cc: Oscar Salvador 
Cc: Mike Kravetz 
Cc: Andrew Morton 
Cc: sta...@vger.kernel.org
---
 mm/gup.c | 27 +++
 mm/huge_memory.c | 11 ---
 mm/hugetlb.c |  8 +++-
 mm/internal.h| 13 +
 4 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index e40579624f10..88a93b89c03e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -433,6 +433,9 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
page = ERR_PTR(ret);
goto out;
}
+   } else if (PageHWPoison(page)) {
+   page = ERR_PTR(-EHWPOISON);
+   goto out;
}
 
if (flags & FOLL_SPLIT && PageTransCompound(page)) {
@@ -540,8 +543,13 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
page = follow_huge_pd(vma, address,
  __hugepd(pmd_val(pmdval)), flags,
  PMD_SHIFT);
-   if (page)
-   return page;
+   if (page) {
+   struct page *p = check_page_hwpoison(page);
+
+   if (p == ERR_PTR(-EHWPOISON) && flags & FOLL_GET)
+   put_page(page);
+   return p;
+   }
return no_page_table(vma, flags);
}
 retry:
@@ -643,7 +651,7 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
if (pud_huge(*pud) && is_vm_hugetlb_page(vma)) {
page = follow_huge_pud(mm, address, pud, flags);
if (page)
-   return page;
+   return check_page_hwpoison(page);
return no_page_table(vma, flags);
}
if (is_hugepd(__hugepd(pud_val(*pud {
@@ -652,6 +660,13 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
  PUD_SHIFT);
if (page)
return page;
+   if (page) {
+   struct page *p = check_page_hwpoison(page);
+
+   if (p == ERR_PTR(-EHWPOISON) && flags & FOLL_GET)
+   put_page(page);
+   return p;
+   }
return no_page_table(vma, flags);
}
if (pud_devmap(*pud)) {
@@ -1087,10 +1102,14 @@ static long __get_user_pages(struct mm_struct *mm,
 * struct page.
 */
goto next_page;
-   } else if (IS_ERR(page)) {
+   } else if (PTR_ERR(page) == -EHWPOISON) {
+   ret = (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
+   goto out;
+   }  else if (IS_ERR(page)) {
ret = PTR_ERR(page);
goto out;
}
+
if (pages) {
pages[i] = page;
flush_anon_page(vma, page, start);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ae907a9c2050..56ff2e83b67c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1349,6 +1349,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 {
struct mm_struct *mm = vma->vm_mm;
struct page *page = NULL;
+   struct page *tail = NULL;
 
assert_spin_locked(pmd_lockptr(mm, pmd));
 
@@ -1366,6 +1367,11 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
page = pmd_page(*pmd);
VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
 
+   tail = page + ((addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT);
+
+   if (PageHWPoison(tail))
+   return ERR_PTR(-EHWPOISON);
+
if (!try_grab_page

[PATCH v6] mm/gup: check page hwpoison status for memory recovery failures.

2021-04-05 Thread Aili Yao
When we call get_user_pages() to pin a user page in memory, there may be a
hwpoison page. Currently we just handle the normal case, where the memory
recovery job finished correctly, and we will not return the hwpoison
page to callers; but for other cases, like when memory recovery fails and
the related user pte is not correctly set invalid, we will still
return the hwpoison page, and callers may touch it and cause a panic.

In gup.c, for a normal page, after we call follow_page_mask() we
return the related page pointer; for the other hwpoison case, with the pte
set invalid, it returns NULL, and we handle that in the if (!page)
branch. In this patch we filter out the hwpoison page in
follow_page_mask() and return an error code for the recovery-failure cases.

We check the page hwpoison status as early as possible, avoid the normal
follow-up procedure, and try not to grab the related pages.

Signed-off-by: Aili Yao 
Cc: David Hildenbrand 
Cc: Matthew Wilcox 
Cc: Naoya Horiguchi 
Cc: Oscar Salvador 
Cc: Mike Kravetz 
Cc: Andrew Morton 
Cc: sta...@vger.kernel.org
---
 mm/gup.c | 27 +++
 mm/huge_memory.c |  9 +++--
 mm/hugetlb.c |  8 +++-
 mm/internal.h| 13 +
 4 files changed, 50 insertions(+), 7 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index e40579624f10..88a93b89c03e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -433,6 +433,9 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
page = ERR_PTR(ret);
goto out;
}
+   } else if (PageHWPoison(page)) {
+   page = ERR_PTR(-EHWPOISON);
+   goto out;
}
 
if (flags & FOLL_SPLIT && PageTransCompound(page)) {
@@ -540,8 +543,13 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
page = follow_huge_pd(vma, address,
  __hugepd(pmd_val(pmdval)), flags,
  PMD_SHIFT);
-   if (page)
-   return page;
+   if (page) {
+   struct page *p = check_page_hwpoison(page);
+
+   if (p == ERR_PTR(-EHWPOISON) && flags & FOLL_GET)
+   put_page(page);
+   return p;
+   }
return no_page_table(vma, flags);
}
 retry:
@@ -643,7 +651,7 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
if (pud_huge(*pud) && is_vm_hugetlb_page(vma)) {
page = follow_huge_pud(mm, address, pud, flags);
if (page)
-   return page;
+   return check_page_hwpoison(page);
return no_page_table(vma, flags);
}
if (is_hugepd(__hugepd(pud_val(*pud {
@@ -652,6 +660,13 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
  PUD_SHIFT);
if (page)
return page;
+   if (page) {
+   struct page *p = check_page_hwpoison(page);
+
+   if (p == ERR_PTR(-EHWPOISON) && flags & FOLL_GET)
+   put_page(page);
+   return p;
+   }
return no_page_table(vma, flags);
}
if (pud_devmap(*pud)) {
@@ -1087,10 +1102,14 @@ static long __get_user_pages(struct mm_struct *mm,
 * struct page.
 */
goto next_page;
-   } else if (IS_ERR(page)) {
+   } else if (PTR_ERR(page) == -EHWPOISON) {
+   ret = (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
+   goto out;
+   }  else if (IS_ERR(page)) {
ret = PTR_ERR(page);
goto out;
}
+
if (pages) {
pages[i] = page;
flush_anon_page(vma, page, start);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ae907a9c2050..3973d988e485 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1349,6 +1349,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 {
struct mm_struct *mm = vma->vm_mm;
struct page *page = NULL;
+   struct page *tail = NULL;
 
assert_spin_locked(pmd_lockptr(mm, pmd));
 
@@ -1366,6 +1367,11 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
page = pmd_page(*pmd);
VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
 
+   tail = page + ((addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT);
+
+   if (PageHWPoison(tail))
+   return ERR_PTR(-EHWPOISON);
+
if (!try_grab_page(page, flags))
return ERR_PTR(-ENOMEM);
 
@@ -1405,11 +1411

Re: [PATCH v3] mm,hwpoison: return -EHWPOISON when page already poisoned

2021-04-05 Thread Aili Yao
On Mon, 5 Apr 2021 13:50:18 +
HORIGUCHI NAOYA(堀口 直也)  wrote:

> On Fri, Apr 02, 2021 at 03:11:20PM +, Luck, Tony wrote:
> > >> Combined with my "mutex" patch (to get rid of races where 2nd process 
> > >> returns
> > >> early, but first process is still looking for mappings to unmap and tasks
> > >> to signal) this patch moves forward a bit. But I think it needs an
> > >> additional change here in kill_me_maybe() to just "return" if there is a
> > >> EHWPOISON return from memory_failure()
> > >> 
> > > Got this, Thanks for your reply!
> > > I will dig into this!
> > 
> > One problem with this approach is when the first task to find poison
> > fails to complete actions. Then the poison pages are not unmapped,
> > and just returning from kill_me_maybe() gets into a loop :-(
> 
> Yes, that's the pain point.  We need to send SIGBUS to the current process in
> the "already hardware poisoned" case of memory_failure().  SIGBUS should
> contain the error virtual address, but unfortunately walking the page table
> or using p->mce_vaddr is not always reliable now.
> 
> So as a second-best approach, we can extend the "walking page table"
> approach such that we walk over the whole virtual address space to make sure
> that the number of entries pointing to the error page is exactly 1.
> If that's the case, then we can confidently send SIGBUS with it.  If we find
> multiple entries pointing to the error page, then we give up guessing and
> send a normal SIGBUS to the current process.  That's not worse than now,
> and I think we need wait in the hope that the virtual address will be
> available in MCE handler.
> 
> Anyway I'll try to write a patch for this.

Yeah, the previous patch didn't address the multiple virtual address issue.
If there is a way to fix that, that would be great!

-- 
Thanks!
Aili Yao


Re: [PATCH v3] mm,hwpoison: return -EHWPOISON when page already poisoned

2021-04-01 Thread Aili Yao
On Thu, 1 Apr 2021 08:33:20 -0700
"Luck, Tony"  wrote:

> On Wed, Mar 31, 2021 at 07:25:40PM +0800, Aili Yao wrote:
> > When the page is already poisoned, another memory_failure() call in the
> > same page now return 0, meaning OK. For nested memory mce handling, this
> > behavior may lead to one mce looping, Example:
> > 
> > 1. When LMCE is enabled, and there are two processes A && B running on
> > different cores X && Y separately, which will access one same page, then
> > the page gets corrupted when process A accesses it, an MCE will be raised to
> > core X and the error processing is just underway.
> > 
> > 2.Then B access the page and trigger another MCE to core Y, it will also
> > do error process, it will see TestSetPageHWPoison be true, and 0 is
> > returned.
> > 
> > 3.The kill_me_maybe will check the return:
> > 
> > 1244 static void kill_me_maybe(struct callback_head *cb)
> > 1245 {
> > 
> > 1254 if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> > 1255 !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> > 1256 set_mce_nospec(p->mce_addr >> PAGE_SHIFT,
> > p->mce_whole_page);
> > 1257 sync_core();
> > 1258 return;
> > 1259 }
> > 
> > 1267 }
> 
> With your change memory_failure() will return -EHWPOISON for the
> second task that consumes poison ... so that "if" statement won't
> be true and so we fall into the following code:
> 
> 1273 if (p->mce_vaddr != (void __user *)-1l) {
> 1274 force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, 
> PAGE_SHIFT);
> 1275 } else {
> 1276 pr_err("Memory error not recovered");
> 1277 kill_me_now(cb);
> 1278 }
> 
> If this was a copy_from_user() machine check, p->mce_vaddr is set and
> the task gets a BUS_MCEERR_AR SIGBUS, otherwise we print that
> 
>   "Memory error not recovered"
> 
> message and send a generic SIGBUS.  I don't think either of those options
> is right.
> 
> Combined with my "mutex" patch (to get rid of races where 2nd process returns
> early, but first process is still looking for mappings to unmap and tasks
> to signal) this patch moves forward a bit. But I think it needs an
> additional change here in kill_me_maybe() to just "return" if there is a
> EHWPOISON return from memory_failure()
> 
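For reference, a sketch of how that additional change could look in kill_me_maybe(), based on my reading of Tony's suggestion and the excerpt quoted above (not a posted patch):

        int ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);

        if (!ret && !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
                set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
                sync_core();
                return;
        }

        /* Already poisoned: assume the first task to hit the error owns the
         * recovery and will deliver any SIGBUS; the follow-up in this thread
         * discusses what happens when that first attempt fails. */
        if (ret == -EHWPOISON)
                return;

        /* otherwise fall through to the existing mce_vaddr / kill_me_now() path */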
Got it, thanks for your reply!
I will dig into this!

-- 
Thanks!
Aili Yao


Re: [PATCH v5] mm/gup: check page hwposion status for coredump.

2021-03-31 Thread Aili Yao
On Wed, 31 Mar 2021 08:44:53 +0200
David Hildenbrand  wrote:

> On 31.03.21 06:32, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Wed, Mar 31, 2021 at 10:43:36AM +0800, Aili Yao wrote:  
> >> On Wed, 31 Mar 2021 01:52:59 + HORIGUCHI NAOYA(堀口 直也) 
> >>  wrote:  
> >>> On Fri, Mar 26, 2021 at 03:22:49PM +0100, David Hildenbrand wrote:  
> >>>> On 26.03.21 15:09, David Hildenbrand wrote:  
> >>>>> On 22.03.21 12:33, Aili Yao wrote:  
> >>>>>> When we do coredump for user process signal, this may be one SIGBUS 
> >>>>>> signal
> >>>>>> with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means this signal is
> >>>>>> resulted from ECC memory fail like SRAR or SRAO, we expect the memory
> >>>>>> recovery work is finished correctly, then the get_dump_page() will not
> >>>>>> return the error page as its process pte is set invalid by
> >>>>>> memory_failure().
> >>>>>>
> >>>>>> But memory_failure() may fail, and the process's related pte may not be
> >>>>>> correctly set invalid, for current code, we will return the poison 
> >>>>>> page,
> >>>>>> get it dumped, and then lead to system panic as its in kernel code.
> >>>>>>
> >>>>>> So check the hwpoison status in get_dump_page(), and if TRUE, return 
> >>>>>> NULL.
> >>>>>>
> >>>>>> There maybe other scenario that is also better to check hwposion status
> >>>>>> and not to panic, so make a wrapper for this check, Thanks to David's
> >>>>>> suggestion().
> >>>>>>
> >>>>>> Link: 
> >>>>>> https://lkml.kernel.org/r/20210319104437.6f30e80d@alex-virtual-machine
> >>>>>> Signed-off-by: Aili Yao 
> >>>>>> Cc: David Hildenbrand 
> >>>>>> Cc: Matthew Wilcox 
> >>>>>> Cc: Naoya Horiguchi 
> >>>>>> Cc: Oscar Salvador 
> >>>>>> Cc: Mike Kravetz 
> >>>>>> Cc: Aili Yao 
> >>>>>> Cc: sta...@vger.kernel.org
> >>>>>> Signed-off-by: Andrew Morton 
> >>>>>> ---
> >>>>>> mm/gup.c  |  4 
> >>>>>> mm/internal.h | 20 
> >>>>>> 2 files changed, 24 insertions(+)
> >>>>>>
> >>>>>> diff --git a/mm/gup.c b/mm/gup.c
> >>>>>> index e4c224c..6f7e1aa 100644
> >>>>>> --- a/mm/gup.c
> >>>>>> +++ b/mm/gup.c
> >>>>>> @@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
> >>>>>>  FOLL_FORCE | FOLL_DUMP | 
> >>>>>> FOLL_GET);
> >>>>>>if (locked)
> >>>>>>mmap_read_unlock(mm);  
> >>>>>
> >>>>> Thinking again, wouldn't we get -EFAULT from __get_user_pages_locked()
> >>>>> when stumbling over a hwpoisoned page?
> >>>>>
> >>>>> See __get_user_pages_locked()->__get_user_pages()->faultin_page():
> >>>>>
> >>>>> handle_mm_fault()->vm_fault_to_errno(), which translates
> >>>>> VM_FAULT_HWPOISON to -EFAULT, unless FOLL_HWPOISON is set (-> 
> >>>>> -EHWPOISON)
> >>>>>
> >>>>> ?  
> >>>
> >>> We could get -EFAULT, but sometimes not (depends on how memory_failure() 
> >>> fails).
> >>>
> >>> If we failed to unmap, the page table is not converted to hwpoison entry,
> >>> so __get_user_pages_locked() get the hwpoisoned page.
> >>>
> >>> If we successfully unmapped but failed in truncate_error_page() for 
> >>> example,
> >>> the processes mapping the page would get -EFAULT as expected.  But even in
> >>> this case, other processes could reach the error page via page cache and
> >>> __get_user_pages_locked() for them could return the hwpoisoned page.
> >>>  
> >>>>
> >>>> Or doesn't that happen as you describe "But memory_failure() may fail, 
> >>>> and
> >>>> the process's related pte may not be correctly set invalid" -- but why 
> >>>> does
> >>>> that 

[PATCH v3] mm,hwpoison: return -EHWPOISON when page already poisoned

2021-03-31 Thread Aili Yao
When the page is already poisoned, another memory_failure() call on the
same page now returns 0, meaning OK. For nested memory MCE handling, this
behavior may lead to an MCE loop. Example:

1. When LMCE is enabled and there are two processes A && B running on
different cores X && Y separately, both accessing the same page: the page
gets corrupted when process A accesses it, an MCE is raised to core X, and
its error processing is just underway.

2. Then B accesses the page and triggers another MCE on core Y. It also
does error processing, sees TestSetPageHWPoison already true, and 0 is
returned.

3.The kill_me_maybe will check the return:

1244 static void kill_me_maybe(struct callback_head *cb)
1245 {

1254 if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
1255 !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
1256 set_mce_nospec(p->mce_addr >> PAGE_SHIFT,
p->mce_whole_page);
1257 sync_core();
1258 return;
1259 }

1267 }

4. The error processing for B will end, and nothing may happen if
kill-early is not set; process B will re-execute the instruction, get
into MCE again, and then the loop happens. Also the set_mce_nospec()
here is not proper; refer to commit fd0e786d9d09 ("x86/mm,
mm/hwpoison: Don't unconditionally unmap kernel 1:1 pages").

Other callers that care about the return value of memory_failure() should
check why they want to process a memory error which has already been
processed. This behavior seems reasonable.

Signed-off-by: Aili Yao 
---
 mm/memory-failure.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 24210c9bd843..5cd42144b67c 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1228,7 +1228,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int 
flags)
if (TestSetPageHWPoison(head)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
   pfn);
-   return 0;
+   return -EHWPOISON;
}
 
num_poisoned_pages_inc();
@@ -1430,7 +1430,7 @@ int memory_failure(unsigned long pfn, int flags)
if (TestSetPageHWPoison(p)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
pfn);
-   return 0;
+   return -EHWPOISON;
}
 
orig_head = hpage = compound_head(p);
-- 
2.25.1



Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-31 Thread Aili Yao
On Thu, 25 Feb 2021 12:39:30 +0100
Oscar Salvador  wrote:

> On Thu, Feb 25, 2021 at 11:28:18AM +, HORIGUCHI NAOYA(堀口 直也) wrote:
> > Hi Aili,
> > 
> > I agree that this set_mce_nospec() is not expected to be called for
> > "already hwpoisoned" page because in the reported case the error
> > page is already contained and no need to resort changing cache mode.  
> 
> Out of curiosity, what is the current behaviour now?
> Say we have an ongoing MCE which has marked the page as HWPoison but
> memory_failure did not take any action on the page yet.
> And then, we have another MCE, which ends up there.
> set_mce_nospec might clear _PAGE_PRESENT bit.
> 
> Does that have any impact on the first MCE?
> 
> > It seems to me that memory_failure() does not return MF_XXX.  But yes,
> > returning some positive value for the reported case could be a solution.  
> 
> No, you are right. I somehow managed to confuse myself.
> I see now that MF_XXX return codes are filtered out in page_action.
> 
> > We could use some negative value (error code) to report the reported case,
> > then as you mentioned above, some callers need change to handle the
> > new case, and the same is true if you use some positive value.
> > My preference is -EHWPOISON, but other options are fine if justified well.  
> 
> -EHWPOISON seems like a good fit.
> 

Hi Oscar, David:

It has been a long time away from this topic, but I noticed today that I made
a stupid mistake: EHWPOISON is already declared, so we had better return
-EHWPOISON for this case.

Really sorry for this!

As the patch is still under review, I will post a new version. If I change
this, may I add your review tags here, please?

-- 
Thanks!
Aili Yao


Re: [PATCH v5] mm/gup: check page hwposion status for coredump.

2021-03-30 Thread Aili Yao
On Wed, 31 Mar 2021 01:52:59 +
HORIGUCHI NAOYA(堀口 直也)  wrote:

> On Fri, Mar 26, 2021 at 03:22:49PM +0100, David Hildenbrand wrote:
> > On 26.03.21 15:09, David Hildenbrand wrote:  
> > > On 22.03.21 12:33, Aili Yao wrote:  
> > > > When we do coredump for user process signal, this may be one SIGBUS 
> > > > signal
> > > > with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means this signal is
> > > > resulted from ECC memory fail like SRAR or SRAO, we expect the memory
> > > > recovery work is finished correctly, then the get_dump_page() will not
> > > > return the error page as its process pte is set invalid by
> > > > memory_failure().
> > > > 
> > > > But memory_failure() may fail, and the process's related pte may not be
> > > > correctly set invalid, for current code, we will return the poison page,
> > > > get it dumped, and then lead to system panic as its in kernel code.
> > > > 
> > > > So check the hwpoison status in get_dump_page(), and if TRUE, return 
> > > > NULL.
> > > > 
> > > > There maybe other scenario that is also better to check hwposion status
> > > > and not to panic, so make a wrapper for this check, Thanks to David's
> > > > suggestion().
> > > > 
> > > > Link: 
> > > > https://lkml.kernel.org/r/20210319104437.6f30e80d@alex-virtual-machine
> > > > Signed-off-by: Aili Yao 
> > > > Cc: David Hildenbrand 
> > > > Cc: Matthew Wilcox 
> > > > Cc: Naoya Horiguchi 
> > > > Cc: Oscar Salvador 
> > > > Cc: Mike Kravetz 
> > > > Cc: Aili Yao 
> > > > Cc: sta...@vger.kernel.org
> > > > Signed-off-by: Andrew Morton 
> > > > ---
> > > >mm/gup.c  |  4 
> > > >mm/internal.h | 20 
> > > >2 files changed, 24 insertions(+)
> > > > 
> > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > index e4c224c..6f7e1aa 100644
> > > > --- a/mm/gup.c
> > > > +++ b/mm/gup.c
> > > > @@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
> > > >   FOLL_FORCE | FOLL_DUMP | 
> > > > FOLL_GET);
> > > > if (locked)
> > > > mmap_read_unlock(mm);  
> > > 
> > > Thinking again, wouldn't we get -EFAULT from __get_user_pages_locked()
> > > when stumbling over a hwpoisoned page?
> > > 
> > > See __get_user_pages_locked()->__get_user_pages()->faultin_page():
> > > 
> > > handle_mm_fault()->vm_fault_to_errno(), which translates
> > > VM_FAULT_HWPOISON to -EFAULT, unless FOLL_HWPOISON is set (-> -EHWPOISON)
> > > 
> > > ?  
> 
> We could get -EFAULT, but sometimes not (depends on how memory_failure() 
> fails).
> 
> If we failed to unmap, the page table is not converted to hwpoison entry,
> so __get_user_pages_locked() get the hwpoisoned page.
> 
> If we successfully unmapped but failed in truncate_error_page() for example,
> the processes mapping the page would get -EFAULT as expected.  But even in
> this case, other processes could reach the error page via page cache and
> __get_user_pages_locked() for them could return the hwpoisoned page.
> 
> > 
> > Or doesn't that happen as you describe "But memory_failure() may fail, and
> > the process's related pte may not be correctly set invalid" -- but why does
> > that happen?  
> 
> Simply because memory_failure() doesn't handle some page types like ksm page
> and zero page. Or maybe shmem thp also belongs to this class.
> 
> > 
> > On a similar thought, should get_user_pages() never return a page that has
> > HWPoison set? E.g., check also for existing PTEs if the page is hwpoisoned? 
> >  
> 
> Make sense to me. Maybe inserting hwpoison check into follow_page_pte() and
> follow_huge_pmd() would work well.

I think we should take more care in broadcasting the hwpoison check to other
cases. SIGBUS coredump is a case that is supposed not to touch the poison
page, and if we return NULL for it, the coredump process can still finish
successfully.

Other cases may have the same requirement as coredump, but we need to
identify them; that's the poison check wrapper's purpose. If not, we may
break the integrity of the related action, which may be no better than a
panic.

-- 
Thanks!
Aili Yao


Re: [PATCH v2] x86/mce: fix wrong no-return-ip logic in do_machine_check()

2021-03-24 Thread Aili Yao
On Wed, 24 Mar 2021 10:59:50 +0800
Aili Yao  wrote:

> On Wed, 24 Feb 2021 10:39:21 +0800
> Aili Yao  wrote:
> 
> > On Tue, 23 Feb 2021 16:12:43 +
> > "Luck, Tony"  wrote:
> >   
> > > > What I think is that qemu has no easy way to get the MCE signature from
> > > > the host, or currently no method for this,
> > > > so qemu treats all AR cases as "no RIPV"; doing more is better than doing less.
> > > >
> > > 
> > > RIPV would be important in the guest in the case where the guest can fix 
> > > the problem that caused
> > > the machine check and return to the failed instruction to continue.
> > > 
> > > I think the only case where this happens is a fault in a read-only page 
> > > mapped from a file (typically
> > > code page, but could be a data page). In this case memory-failure() 
> > > unmaps the page with the poison
> > > but Linux can recover by reading data from the file into a new page.
> > > 
> > > Other cases we send SIGBUS (so go to the signal handler instead of to the 
> > > faulting instruction).
> > > 
> > > So it would be good if the state of RIPV could be added to the signal 
> > > state sent to qemu. If that
> > > isn't possible, then this full recovery case turns into another SIGBUS 
> > > case.
> > 
> > This KVM and VM case of failing recovery for SRAR is just one scenario I 
> > think,
> > If Intel guarantee that when memory SRAR is triggered, RIPV will always be 
> > set, then it's the job of qemu to
> > set the RIPV instead.
> > Or if When SRAR is triggered with RIPV cleared, the same issue will be true 
> > for host.
> > 
> > And I think it's better for VM to know the real RIPV value, It need more 
> > work in qemu and kernel if possible.
> > 
> > Thanks
> > Aili Yao  
> 
> Adding this topic to the qemu list; this is really a bad issue.
> 
> Issue report:
> When a VM receives an SRAR memory failure from the host, it always has RIPV
> cleared, and then the VM processes it and triggers a panic!
> 
> Can any qemu maintainer fix this?
> 
> Suggestion:
> qemu should get the true value of RIPV from the host, then inject it into the VM accordingly.

Sorry, my previous description may not have described the issue clearly.
I found this issue when doing a memory SRAR test for a KVM virtual machine;
the steps are:

1. Inject one uncorrectable error into a specific memory address A.
2. Then a user process in the VM accesses address A and triggers an MCE
exception on the host.
3. In do_machine_check() the kernel checks the related registers and does the
recovery job via memory_failure();
4. Normally a BUS_MCEERR_AR SIGBUS is sent to the specific core triggering
this error.
5. Qemu takes control and injects this event into the VM. All the information
qemu can currently get is the error code BUS_MCEERR_AR and the virtual
address. In the qemu inject function:
if (code == BUS_MCEERR_AR) {
status |= MCI_STATUS_AR | 0x134;
mcg_status |= MCG_STATUS_EIPV;
} else {
status |= 0xc0;
mcg_status |= MCG_STATUS_RIPV;
}
For BUS_MCEERR_AR case, MCG_STATUS_RIPV will always be cleared.

6. Then in the VM kernel, do_machine_check() will hit this:
if (!(m.mcgstatus & MCG_STATUS_RIPV))
kill_current_task = 1;
   and then go to force_sig(SIGBUS) without calling memory_failure();
   so for now, the page is not marked hwpoison.

7. The VM kernel wants to exit to user mode and then process the SIGBUS signal.
   As SIGBUS is a fatal signal, the coredump-related work will be invoked.

8. Then coredump will dump the user-space mapped memory, including the
error page.

9. Then the UE is triggered again, qemu takes control again and injects this
   UE event into the VM; this time the error is triggered in kernel code, and
   the VM panics.

I don't know how this issue can be fixed cleanly; maybe qemu developers can
help with this.
If qemu can fix this, that would be great!

-- 
Thanks!
Aili Yao


Re: [PATCH v2] x86/mce: fix wrong no-return-ip logic in do_machine_check()

2021-03-23 Thread Aili Yao
On Wed, 24 Feb 2021 10:39:21 +0800
Aili Yao  wrote:

> On Tue, 23 Feb 2021 16:12:43 +
> "Luck, Tony"  wrote:
> 
> > > What I think is that qemu has no easy way to get the MCE signature from
> > > the host, or currently no method for this,
> > > so qemu treats all AR cases as "no RIPV"; doing more is better than doing less.
> > 
> > RIPV would be important in the guest in the case where the guest can fix 
> > the problem that caused
> > the machine check and return to the failed instruction to continue.
> > 
> > I think the only case where this happens is a fault in a read-only page 
> > mapped from a file (typically
> > code page, but could be a data page). In this case memory-failure() unmaps 
> > the page with the poison
> > but Linux can recover by reading data from the file into a new page.
> > 
> > Other cases we send SIGBUS (so go to the signal handler instead of to the 
> > faulting instruction).
> > 
> > So it would be good if the state of RIPV could be added to the signal state 
> > sent to qemu. If that
> > isn't possible, then this full recovery case turns into another SIGBUS 
> > case.  
> 
> This KVM and VM case of failing recovery for SRAR is just one scenario I 
> think,
> If Intel guarantee that when memory SRAR is triggered, RIPV will always be 
> set, then it's the job of qemu to
> set the RIPV instead.
> Or if When SRAR is triggered with RIPV cleared, the same issue will be true 
> for host.
> 
> And I think it's better for VM to know the real RIPV value, It need more work 
> in qemu and kernel if possible.
> 
> Thanks
> Aili Yao

Adding this topic to the qemu list; this is really a bad issue.

Issue report:
When a VM receives an SRAR memory failure from the host, it always has RIPV
cleared, and then the VM processes it and triggers a panic!

Can any qemu maintainer fix this?

Suggestion:
qemu should get the true value of RIPV from the host, then inject it into the VM accordingly.

Thanks!
Aili Yao


[PATCH v5] mm/gup: check page hwposion status for coredump.

2021-03-22 Thread Aili Yao
When we do a coredump for a user process signal, it may be a SIGBUS signal
with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means the signal results
from an ECC memory failure like SRAR or SRAO. We expect the memory
recovery work to have finished correctly, so that get_dump_page() will not
return the error page, as its process pte has been set invalid by
memory_failure().

But memory_failure() may fail, and the process's related pte may not be
correctly set invalid; with the current code we will return the poison page,
get it dumped, and then cause a system panic as this is in kernel code.

So check the hwpoison status in get_dump_page(), and if TRUE, return NULL.

There may be other scenarios where it is also better to check the hwpoison
status and not panic, so make a wrapper for this check. Thanks to David's
suggestion.

Link: https://lkml.kernel.org/r/20210319104437.6f30e80d@alex-virtual-machine
Signed-off-by: Aili Yao 
Cc: David Hildenbrand 
Cc: Matthew Wilcox 
Cc: Naoya Horiguchi 
Cc: Oscar Salvador 
Cc: Mike Kravetz 
Cc: Aili Yao 
Cc: sta...@vger.kernel.org
Signed-off-by: Andrew Morton 
---
 mm/gup.c  |  4 
 mm/internal.h | 20 
 2 files changed, 24 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index e4c224c..6f7e1aa 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
  FOLL_FORCE | FOLL_DUMP | FOLL_GET);
if (locked)
mmap_read_unlock(mm);
+
+   if (ret == 1 && is_page_hwpoison(page))
+   return NULL;
+
return (ret == 1) ? page : NULL;
 }
 #endif /* CONFIG_ELF_CORE */
diff --git a/mm/internal.h b/mm/internal.h
index 25d2b2439..b751cef 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -97,6 +97,26 @@ static inline void set_page_refcounted(struct page *page)
set_page_count(page, 1);
 }
 
+/*
+ * When the kernel touches a user page, the page may have been marked
+ * poisoned but still be mapped in user space. If the kernel can guarantee
+ * data integrity and a successful operation without this page, it is
+ * better to check the poison status and avoid touching it, so as not to
+ * panic; a coredump for a process fatal signal is a sample case matching
+ * this scenario. If the kernel cannot guarantee data integrity, it is
+ * better not to call this function and let the kernel touch the poison
+ * page and panic.
+ */
+static inline bool is_page_hwpoison(struct page *page)
+{
+   if (PageHWPoison(page))
+   return true;
+   else if (PageHuge(page) && PageHWPoison(compound_head(page)))
+   return true;
+
+   return false;
+}
+
 extern unsigned long highest_memmap_pfn;
 
 /*
-- 
1.8.3.1



Re: [PATCH v3] mm/gup: check page posion status for coredump.

2021-03-21 Thread Aili Yao
On Sat, 20 Mar 2021 00:35:16 +
Matthew Wilcox  wrote:

> On Fri, Mar 19, 2021 at 10:44:37AM +0800, Aili Yao wrote:
> > +++ b/mm/gup.c
> > @@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
> >   FOLL_FORCE | FOLL_DUMP | FOLL_GET);
> > if (locked)
> > mmap_read_unlock(mm);
> > +
> > +   if (ret == 1 && is_page_poisoned(page))
> > +   return NULL;
> > +
> > return (ret == 1) ? page : NULL;
> >  }
> >  #endif /* CONFIG_ELF_CORE */
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 25d2b2439..902d993 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -97,6 +97,27 @@ static inline void set_page_refcounted(struct page *page)
> > set_page_count(page, 1);
> >  }
> >  
> > +/*
> > + * When kernel touch the user page, the user page may be have been marked
> > + * poison but still mapped in user space, if without this page, the kernel
> > + * can guarantee the data integrity and operation success, the kernel is
> > + * better to check the posion status and avoid touching it, be good not to
> > + * panic, coredump for process fatal signal is a sample case matching this
> > + * scenario. Or if kernel can't guarantee the data integrity, it's better
> > + * not to call this function, let kernel touch the poison page and get to
> > + * panic.
> > + */
> > +static inline bool is_page_poisoned(struct page *page)
> > +{
> > +   if (page != NULL) {  
> 
> Why are you checking page for NULL here?  How can it possibly be NULL?

For this get_dump_page() case it can't be NULL; I thought other places might
call this function and might not guarantee that. But yes, the kernel is a
safer place and checking page for NULL is not a common behavior.

Better to remove it. Thank you very much for pointing this out!

> > +   if (PageHWPoison(page))
> > +   return true;
> > +   else if (PageHuge(page) && PageHWPoison(compound_head(page)))
> > +   return true;
> > +   }
> > +   return 0;
> > +}
> > +
> >  extern unsigned long highest_memmap_pfn;
> >  
> >  /*
> > -- 
> > 1.8.3.1
> > 
> >   
-- 
Thanks!
Aili Yao


[PATCH v3] mm/gup: check page posion status for coredump.

2021-03-18 Thread Aili Yao
When we do a coredump for a user process signal, it may be a SIGBUS signal
with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means the signal results
from an ECC memory failure like SRAR or SRAO. We expect the memory
recovery work to have finished correctly, so that get_dump_page() will not
return the error page, as its process pte has been set invalid by
memory_failure().

But memory_failure() may fail, and the process's related pte may not be
correctly set invalid; with the current code we will return the poison page,
get it dumped, and then cause a system panic as this is in kernel code.

So check the poison status in get_dump_page(), and if TRUE, return NULL.

There may be other scenarios where it is also better to check the poison
status and not panic, so make a wrapper for this check. Thanks to David's
suggestion.

Signed-off-by: Aili Yao 
---
 mm/gup.c  |  4 
 mm/internal.h | 21 +
 2 files changed, 25 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index e4c224c..dcabe96 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
  FOLL_FORCE | FOLL_DUMP | FOLL_GET);
if (locked)
mmap_read_unlock(mm);
+
+   if (ret == 1 && is_page_poisoned(page))
+   return NULL;
+
return (ret == 1) ? page : NULL;
 }
 #endif /* CONFIG_ELF_CORE */
diff --git a/mm/internal.h b/mm/internal.h
index 25d2b2439..902d993 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -97,6 +97,27 @@ static inline void set_page_refcounted(struct page *page)
set_page_count(page, 1);
 }
 
+/*
+ * When the kernel touches a user page, the page may have been marked
+ * poisoned but still be mapped in user space. If the kernel can guarantee
+ * data integrity and a successful operation without this page, it is
+ * better to check the poison status and avoid touching it, so as not to
+ * panic; a coredump for a process fatal signal is a sample case matching
+ * this scenario. If the kernel cannot guarantee data integrity, it is
+ * better not to call this function and let the kernel touch the poison
+ * page and panic.
+ */
+static inline bool is_page_poisoned(struct page *page)
+{
+   if (page != NULL) {
+   if (PageHWPoison(page))
+   return true;
+   else if (PageHuge(page) && PageHWPoison(compound_head(page)))
+   return true;
+   }
+   return 0;
+}
+
 extern unsigned long highest_memmap_pfn;
 
 /*
-- 
1.8.3.1



Re: [PATCH] mm/gup: check page posion status for coredump.

2021-03-17 Thread Aili Yao
On Thu, 18 Mar 2021 04:46:00 +
Matthew Wilcox  wrote:

> On Wed, Mar 17, 2021 at 10:12:02AM +0100, David Hildenbrand wrote:
> > > + if (IS_ENABLED(CONFIG_MEMORY_FAILURE) && ret == 1) {
> > > + if (unlikely(PageHuge(page) && PageHWPoison(compound_head(page))))
> > > + ret = 0;
> > > + else if (unlikely(PageHWPoison(page)))
> > > + ret = 0;
> > > + }  
> > 
> > I wonder if a simple
> > 
> > if (PageHWPoison(compound_head(page)))
> > ret = 0;
> > 
> > won't suffice. But I guess the "issue" is compound pages that are not huge
> > pages or transparent huge pages.  
> 
> THPs don't set the HWPoison bit on the head page.
> 
> https://lore.kernel.org/linux-mm/20210316140947.ga3...@casper.infradead.org/
> 
> (and PAGEFLAG(HWPoison, hwpoison, PF_ANY))
> 
> By the way,
> 
> #ifdef CONFIG_MEMORY_FAILURE
> PAGEFLAG(HWPoison, hwpoison, PF_ANY)
> TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
> #define __PG_HWPOISON (1UL << PG_hwpoison)
> extern bool take_page_off_buddy(struct page *page);
> #else
> PAGEFLAG_FALSE(HWPoison)
> #define __PG_HWPOISON 0
> #endif
> 
> so there's no need for this 
>   if (IS_ENABLED(CONFIG_MEMORY_FAILURE)
> check, as it simply turns into
> 
>   if (PageHuge(page) && 0)
>   else if (0)
> 
> and the compiler can optimise it all away.

Yes, you are right, I will modify this later.
Thanks for the correction.

-- 
Thanks!
Aili Yao


[PATCH v2] mm/gup: check page posion status for coredump.

2021-03-17 Thread Aili Yao
When we do a coredump for a user process signal, it may be a SIGBUS signal
with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means the signal results
from an ECC memory failure like SRAR or SRAO. We expect the memory
recovery work to have finished correctly, so that get_dump_page() will not
return the error page, as its process pte has been set invalid by
memory_failure().

But memory_failure() may fail, and the process's related pte may not be
correctly set invalid; with the current code we will return the poison page,
get it dumped, and cause a system panic as this is in kernel code.

So check the poison status in get_dump_page(), and if TRUE, return NULL.

There may be other scenarios where it is also better to check the poison
status and not panic, so make a wrapper for this check, as suggested by
David Hildenbrand 

Signed-off-by: Aili Yao 
---
 mm/gup.c  |  4 
 mm/internal.h | 21 +
 2 files changed, 25 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index e4c224c..3b4703a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
  FOLL_FORCE | FOLL_DUMP | FOLL_GET);
if (locked)
mmap_read_unlock(mm);
+
+   if (ret == 1 && check_user_page_poison(page))
+   return NULL;
+
return (ret == 1) ? page : NULL;
 }
 #endif /* CONFIG_ELF_CORE */
diff --git a/mm/internal.h b/mm/internal.h
index 25d2b2439..777b3e5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -97,6 +97,27 @@ static inline void set_page_refcounted(struct page *page)
set_page_count(page, 1);
 }
 
+/*
+ * When kernel touch the user page, the user page may be have been marked
+ * poison but still mapped in user space, if without this page, the kernel
+ * can guarantee the data integrity and operation success, the kernel is
+ * better to check the posion status and avoid touching it, be good not to
+ * panic, coredump for process fatal signal is a sample case matching this
+ * scenario. Or if kernel can't guarantee the data integrity, it's better
+ * not to call this function, let kernel touch the poison page and get to
+ * panic.
+ */
+static inline int check_user_page_poison(struct page *page)
+{
+   if (IS_ENABLED(CONFIG_MEMORY_FAILURE) && page != NULL) {
+   if (unlikely(PageHuge(page) && PageHWPoison(compound_head(page))))
+   return true;
+   else if (unlikely(PageHWPoison(page)))
+   return true;
+   }
+   return 0;
+}
+
 extern unsigned long highest_memmap_pfn;
 
 /*
-- 
1.8.3.1



Re: [PATCH] mm/gup: check page posion status for coredump.

2021-03-17 Thread Aili Yao
On Wed, 17 Mar 2021 10:12:02 +0100
David Hildenbrand  wrote:

> 
> I wonder if a simple
> 
> if (PageHWPoison(compound_head(page)))
>   ret = 0;
> 
> won't suffice. But I guess the "issue" is compound pages that are not 
> huge pages or transparent huge pages.

Yes, the simple check won't suffice, as we mark hugetlb page poison in the
head page, and the other cases in the specific single page struct.

> If not, we certainly want a wrapper for that magic, otherwise we have to 
> replicate the same logic all over the place.
> 
> > +
> > return (ret == 1) ? page : NULL;
> >   }
> >   #endif /* CONFIG_ELF_CORE */
> >   
> 
> 

Yes, other places may have the same requirement as the coredump path, so
it's better to make a wrapper for this. But I am not familiar with those
specific scenarios, so this patch only covers the coredump case.

I will post a v2 patch for this.
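
For illustration, a hypothetical additional call site (the function and its
name are made up, not part of the posted patch), just to show how such a
wrapper could be used anywhere the kernel touches a user page it can live
without:

/* Hypothetical example: skip a poisoned user page when the operation
 * can still succeed without its contents. */
static int copy_optional_user_page(struct page *page, void *dst)
{
	if (check_user_page_poison(page))
		return -EHWPOISON;	/* caller treats the data as absent */

	memcpy(dst, page_address(page), PAGE_SIZE);
	return 0;
}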

-- 
Thanks!
Aili Yao


Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-17 Thread Aili Yao
On Tue, 16 Mar 2021 17:29:39 -0700
"Luck, Tony"  wrote:

> On Fri, Mar 12, 2021 at 11:48:31PM +, Luck, Tony wrote:
> > Thanks to the decode of the instruction we do have the virtual address. So 
> > we just need
> > a safe walk of pgd->p4d->pud->pmd->pte (truncated if we hit a huge page) 
> > with a write
> > of a "not-present" value. Maybe a different poison type from the one we get 
> > from
> > memory_failure() so that the #PF code can recognize this as a special case 
> > and do any
> > other work that we avoided because we were in #MC context.  
> 
> There is no such thing as the safe walk I describe above. In a multi
> threaded application other threads might munmap(2) bits of the address
> space which could free some of the p4d->pud->pmd->pte tables.
> 
> But the pgd table is guaranteed to exist as long as any of the threads
> in the process exist. Which gave rise to a new approach (partial credit
> to Dave Hansen who helped me understand why a more extreme patch that
> I wanted to use wasn't working ... and he pointed out the pgd persistence).
> 
> N.B. The code below DOES NOT WORK. My test application tried to do a write(2)
> syscall with some poison in the buffer. Goal is that write should return a
> short count (with only the bytes from the buffer leading up to the poison
> written to the file). Currently the process gets a suprise SIGSEGV in
> some glibc code :-(
> 
> The code would also need a bunch more work to handle the fact that
> in a multi-threaded application multiple threads might consume the
> poison, and some innocent threads not consuming the poison may also end
> up drawn into the melee in the page fault handler.
> 
> 
> The big picture.
> ---
> 
> This approach passes the buck for the recovery from the #MC handler
> (which has very limited options to do anything useful) to the #PF
> handler (which is a much friendlier environment where locks and mutexes
> can be obtained as needed).
> 
> The mechanism for this buck passing is for the #MC handler to set a
> reserved bit in the PGD entry for the user address where the machine
> check happened, flush the TLB for that address, and then return from
> the #MC handler to the kernel get_user()/copy_from_user() code which
> will re-execute the instruction to access the poison address, but this
> time take a #PF because of the reserved bit in the PGD.
> 
> There's a couple of changes bundled in here to help my debugging:
> 1) Turn off UCNA calls to memory_failure() ... just avoids races
>and make tests more determinstic for now
> 2) Disable "fast string" support ... avoid "REP MOVS" copy routines
>since I'm testing on an old system that treats poison consumed
>by REP MOVS as fatal.
> 
> Posted here for general comments on the approach.  On the plus
> side it avoids taking multiple machine checks in the kernel when
> it is accessing user data. I think it can also meet the goal set
> by Andy Lutomirski of avoiding SIGBUS from syscalls that happen
> to touch user poison. They should see either short counts for
> calls like write(2) that may partially succeed or -EFAULT if
> the system call totally failed.
> 

I have another view here, though it may be wrong; I post it for discussion:
when a process is signaled SIGBUS with BUS_MCEERR_AR, I think the signal
should come from a read. With the data read, the process can proceed, e.g.
process the data or make a decision. When the data can't be returned
because of poison, the process doesn't know what to do, and killing it is
the first option.

For a copyin case, however, the read is executed by the kernel, and on
poison the kernel will refuse to execute the current read and any further
operation. From the process's point of view, it seems to have a chance to
proceed.

If just an error code is returned, the process may or may not care, and it
may not handle the error correctly. The worst case seems to be that the
process touches the poison page again, triggers another MCE and is killed
accordingly.

That's not that bad?
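
To illustrate the point, a small hypothetical userspace caller (not taken
from any patch in this thread) that actually looks at the result of
write(2); a caller that skips these checks and blindly retries from the
same buffer offset ends up in the worst case described above:

#include <stdio.h>
#include <unistd.h>

/* Hypothetical caller: notice the short count or EFAULT that a poisoned
 * user buffer would produce instead of silently retrying into the poison. */
static ssize_t checked_write(int fd, const void *buf, size_t len)
{
	ssize_t n = write(fd, buf, len);

	if (n < 0)
		perror("write");		/* e.g. EFAULT, nothing copied */
	else if ((size_t)n < len)
		fprintf(stderr, "short write (%zd of %zu bytes)\n", n, len);

	return n;
}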

Thanks
Aili Yao

> 
> diff --git a/arch/x86/include/asm/pgtable_types.h 
> b/arch/x86/include/asm/pgtable_types.h
> index f24d7ef8fffa..e533eaf20834 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -23,6 +23,7 @@
>  #define _PAGE_BIT_SOFTW2 10  /* " */
>  #define _PAGE_BIT_SOFTW3 11  /* " */
>  #define _PAGE_BIT_PAT_LARGE  12  /* On 2MB or 1GB pages */
> +#define _PAGE_BIT_RESERVED1  51  /* Reserved bit */
>  #define _PAGE_BIT_SOFTW4 58  /* available for programmer */
>  #define _PAGE_BIT_PKEY_BIT0  59  /* Protection Keys, bit 1/4 */
>  #define _PAGE_BIT_PKEY_BIT1  60  /* Protection Keys, bi

[PATCH] mm/gup: check page poison status for coredump.

2021-03-17 Thread Aili Yao
When we do a coredump for a user process signal, it may be a SIGBUS signal
with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means the signal results
from an ECC memory failure like SRAR or SRAO. We expect the memory recovery
work to have finished correctly, so that get_dump_page() will not return
the error page, because its pte in the process has been invalidated by
memory_failure().

But memory_failure() may fail, and the process's related pte may not be
correctly invalidated. With the current code we then return the poison
page, it gets dumped, and the system panics because the access happens in
kernel code.

So check the poison status in get_dump_page(), and if the page is
poisoned, return NULL.

Signed-off-by: Aili Yao 
---
 mm/gup.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index e4c224c..499a496 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1536,6 +1536,14 @@ struct page *get_dump_page(unsigned long addr)
  FOLL_FORCE | FOLL_DUMP | FOLL_GET);
if (locked)
mmap_read_unlock(mm);
+
+   if (IS_ENABLED(CONFIG_MEMORY_FAILURE) && ret == 1) {
+		if (unlikely(PageHuge(page) && PageHWPoison(compound_head(page))))
+   ret = 0;
+   else if (unlikely(PageHWPoison(page)))
+   ret = 0;
+   }
+
return (ret == 1) ? page : NULL;
 }
 #endif /* CONFIG_ELF_CORE */
-- 
1.8.3.1



Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-17 Thread Aili Yao


> > 
> > Returning true means you stop walking when you find the first entry pointing
> > to a given pfn. But there could be multiple such entries, so if MCE SRAR is
> > triggered by memory access to the larger address in hwpoisoned entries, the
> > returned virtual address might be wrong.
> >   
> 
> I can't find the way to fix this, maybe the virtual address is contained in
> related register, but this is really beyong my knowledge.
> 
> This is a v2 RFC patch, add support for thp and 1G huge page errors.
> 

Sorry for the leftover debug info and other unclean modifications.

Here is a clean one.

Thanks
Aili Yao

From 2289276ba943cdcddbf3b5b2cdbcaff78690e2e8 Mon Sep 17 00:00:00 2001
From: Aili Yao 
Date: Wed, 17 Mar 2021 16:12:41 +0800
Subject: [PATCH] fix invalid SIGBUS address for recovery fail

Walk the current process's pages and compare with the pfn, then get the
user address and the related page_shift.

For THP, we can only split anonymous THP pages, so I think there may be no
race condition when walking and searching for the THP error page in that
case; for non-anonymous THP, the page flags and pte will not change. So
when the code gets here, it is either a race condition for a non-anonymous
THP page, or it comes from a recovery failure for an anonymous THP whose
page status will not change; I am not so sure about this.

For the case where we don't find the related virtual address, maybe
sending a BUS_MCEERR_AR signal with an invalid NULL address is a better
option, but I am not sure.

Also, this may get the wrong virtual address if the process has multiple
entries for the same page; I haven't found a way to get it right.

There may be other issues I have not recognized.
---
 arch/x86/kernel/cpu/mce/core.c |  12 +---
 include/linux/mm.h |   1 +
 mm/memory-failure.c| 127 +
 3 files changed, 131 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index db4afc5..4cb873c 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1246,7 +1246,7 @@ static void kill_me_maybe(struct callback_head *cb)
struct task_struct *p = container_of(cb, struct task_struct, 
mce_kill_me);
int flags = MF_ACTION_REQUIRED;
 
-   pr_err("Uncorrected hardware memory error in user-access at %llx", 
p->mce_addr);
+   pr_err("Uncorrected hardware memory error in user-access at %llx\n", 
p->mce_addr);
 
if (!p->mce_ripv)
flags |= MF_MUST_KILL;
@@ -1258,14 +1258,8 @@ static void kill_me_maybe(struct callback_head *cb)
return;
}
 
-   if (p->mce_vaddr != (void __user *)-1l) {
-   pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to 
%s:%d due to hardware memory corruption\n",
-   p->mce_addr >> PAGE_SHIFT, p->comm, p->pid);
-   force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
-   } else {
-   pr_err("Memory error not recovered");
-   kill_me_now(cb);
-   }
+   memory_failure_error(current, p->mce_addr >> PAGE_SHIFT);
+
 }
 
 static void queue_task_work(struct mce *m, int kill_current_task)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8..cff2f02 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3046,6 +3046,7 @@ enum mf_flags {
MF_SOFT_OFFLINE = 1 << 3,
 };
 extern int memory_failure(unsigned long pfn, int flags);
+extern void memory_failure_error(struct task_struct *p, unsigned long pfn);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 06f0061..359b42f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -56,6 +56,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 #include "ras/ras_event.h"
 
@@ -1553,6 +1554,132 @@ int memory_failure(unsigned long pfn, int flags)
 }
 EXPORT_SYMBOL_GPL(memory_failure);
 
+static int pte_range(pte_t *ptep, unsigned long addr, unsigned long next, 
struct mm_walk *walk)
+{
+   u64 *buff = (u64 *)walk->private;
+   u64 pfn = buff[0];
+   pte_t pte = *ptep;
+
+   if (!pte_none(pte) && !pte_present(pte)) {
+   swp_entry_t swp_temp = pte_to_swp_entry(pte);
+
+   if (is_hwpoison_entry(swp_temp) && swp_offset(swp_temp) == pfn)
+   goto find;
+   } else if (pte_pfn(pte) == pfn) {
+   goto find;
+   }
+
+   return 0;
+
+find:
+   buff[0] = addr;
+   buff[1] = PAGE_SHIFT;
+   return true;
+}
+
+static int pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
+struct mm_walk *walk)
+{
+  

Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-17 Thread Aili Yao


> 
> Returning true means you stop walking when you find the first entry pointing
> to a given pfn. But there could be multiple such entries, so if MCE SRAR is
> triggered by memory access to the larger address in hwpoisoned entries, the
> returned virtual address might be wrong.
> 

I can't find a way to fix this; maybe the virtual address is contained in
a related register, but that is really beyond my knowledge.

This is a v2 RFC patch; it adds support for THP and 1G huge page errors.

Thanks
Aili Yao

From 31b685609610b3b06c8fd98d866913dbfeb7e159 Mon Sep 17 00:00:00 2001
From: Aili Yao 
Date: Wed, 17 Mar 2021 15:34:15 +0800
Subject: [PATCH] fix invalid SIGBUS address for recovery fail

Walk the current process's pages and compare with the pfn, then get the
user address and the related page_shift.

For THP, we can only split anonymous THP pages, so I think there may be no
race condition when walking and searching for the THP error page in that
case; for non-anonymous THP, the page flags and pte will not change. So
when the code gets here, it is either a race condition for a non-anonymous
THP page, or it comes from a recovery failure for an anonymous THP whose
page status will not change; I am not so sure about this.

For the case where we don't find the related virtual address, maybe
sending a BUS_MCEERR_AR signal with an invalid NULL address is a better
option, but I am not sure.

Also, this may get the wrong virtual address if the process has multiple
entries for the same page; I haven't found a way to get it right.

There may be other issues I have not recognized.
---
 arch/x86/kernel/cpu/mce/core.c |  16 ++---
 include/linux/mm.h |   1 +
 mm/memory-failure.c| 145 -
 3 files changed, 152 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index db4afc5..1bf21cc 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -28,8 +28,12 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1246,7 +1250,7 @@ static void kill_me_maybe(struct callback_head *cb)
struct task_struct *p = container_of(cb, struct task_struct, 
mce_kill_me);
int flags = MF_ACTION_REQUIRED;
 
-   pr_err("Uncorrected hardware memory error in user-access at %llx", 
p->mce_addr);
+   pr_err("Uncorrected hardware memory error in user-access at %llx, 
%llx", p->mce_addr, p->mce_vaddr);
 
if (!p->mce_ripv)
flags |= MF_MUST_KILL;
@@ -1258,14 +1262,8 @@ static void kill_me_maybe(struct callback_head *cb)
return;
}
 
-   if (p->mce_vaddr != (void __user *)-1l) {
-   pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to 
%s:%d due to hardware memory corruption\n",
-   p->mce_addr >> PAGE_SHIFT, p->comm, p->pid);
-   force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
-   } else {
-   pr_err("Memory error not recovered");
-   kill_me_now(cb);
-   }
+   memory_failure_error(current, p->mce_addr >> PAGE_SHIFT);
+
 }
 
 static void queue_task_work(struct mce *m, int kill_current_task)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8..cff2f02 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3046,6 +3046,7 @@ enum mf_flags {
MF_SOFT_OFFLINE = 1 << 3,
 };
 extern int memory_failure(unsigned long pfn, int flags);
+extern void memory_failure_error(struct task_struct *p, unsigned long pfn);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 06f0061..aaf99a7 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -56,8 +56,10 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 #include "ras/ras_event.h"
+#include 
 
 int sysctl_memory_failure_early_kill __read_mostly = 0;
 
@@ -240,7 +242,7 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, 
int flags)
int ret = 0;
 
pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to hardware 
memory corruption\n",
-   pfn, t->comm, t->pid);
+   tk->addr, t->comm, t->pid);
 
if (flags & MF_ACTION_REQUIRED) {
WARN_ON_ONCE(t != current);
@@ -1417,6 +1419,7 @@ int memory_failure(unsigned long pfn, int flags)
 try_again:
if (PageHuge(p))
return memory_failure_hugetlb(pfn, flags);
+
if (TestSetPageHWPoison(p)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
pfn);
@@ -1553,6 +1556,146 @@ int 

Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-16 Thread Aili Yao
> As you answered in another email, p->mce_vaddr is set only on KERNEL_COPYIN 
> case,
> then if p->mce_vaddr is available also for generic SRAR MCE (I'm assuming 
> that we are
> talking about issues on race between generic SRAR MCE not only for 
> KERNEL_COPYIN case),
> that might be helpful, although it might be hard to implement.
> And I'm afraid that walking page table could find the wrong virtual address 
> if a process
> have multiple entries to the single page. We could send multiple SIGBUSs for 
> such case,
> but maybe that's not an optimal solution.

Yes, agreed. I can't find a way to address this multiple-entries case and
make sure we get the exact correct virtual address.

Still, I will post a v2 RFC patch for this issue, for discussion purposes
only!

Thanks!

> 
> From 147449a97e2ea3420ac3523f13523f5d30a13614 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi 
> Date: Tue, 16 Mar 2021 14:22:21 +0900
> Subject: [PATCH] pagemap: expose hwpoison entry
> 
> not-signed-off-by-yet: Naoya Horiguchi 
> ---
>  fs/proc/task_mmu.c  |  6 ++
>  include/linux/swapops.h | 12 
>  tools/vm/page-types.c   |  5 -
>  3 files changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 602e3a52884d..08cea209bae7 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1300,6 +1300,7 @@ struct pagemapread {
>  #define PM_PFRAME_MASK   GENMASK_ULL(PM_PFRAME_BITS - 1, 0)
>  #define PM_SOFT_DIRTYBIT_ULL(55)
>  #define PM_MMAP_EXCLUSIVEBIT_ULL(56)
> +#define PM_HWPOISON  BIT_ULL(60)
>  #define PM_FILE  BIT_ULL(61)
>  #define PM_SWAP  BIT_ULL(62)
>  #define PM_PRESENT   BIT_ULL(63)
> @@ -1385,6 +1386,11 @@ static pagemap_entry_t pte_to_pagemap_entry(struct 
> pagemapread *pm,
>   if (is_migration_entry(entry))
>   page = migration_entry_to_page(entry);
>  
> + if (is_hwpoison_entry(entry)) {
> + page = hwpoison_entry_to_page(entry);
> + flags |= PM_HWPOISON;
> + }
> +
>   if (is_device_private_entry(entry))
>   page = device_private_entry_to_page(entry);
>   }
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index d9b7c9132c2f..1b9dedbd06ab 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -323,6 +323,13 @@ static inline int is_hwpoison_entry(swp_entry_t entry)
>   return swp_type(entry) == SWP_HWPOISON;
>  }
>  
> +static inline struct page *hwpoison_entry_to_page(swp_entry_t entry)
> +{
> + struct page *p = pfn_to_page(swp_offset(entry));
> + WARN_ON(!PageHWPoison(p));
> + return p;
> +}
> +
>  static inline void num_poisoned_pages_inc(void)
>  {
>   atomic_long_inc(&num_poisoned_pages);
> @@ -345,6 +352,11 @@ static inline int is_hwpoison_entry(swp_entry_t swp)
>   return 0;
>  }
>  
> +static inline struct page *hwpoison_entry_to_page(swp_entry_t entry)
> +{
> + return NULL;
> +}
> +
>  static inline void num_poisoned_pages_inc(void)
>  {
>  }
> diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
> index 0517c744b04e..1160d5a14955 100644
> --- a/tools/vm/page-types.c
> +++ b/tools/vm/page-types.c
> @@ -53,6 +53,7 @@
>  #define PM_SWAP_OFFSET(x)(((x) & PM_PFRAME_MASK) >> MAX_SWAPFILES_SHIFT)
>  #define PM_SOFT_DIRTY(1ULL << 55)
>  #define PM_MMAP_EXCLUSIVE(1ULL << 56)
> +#define PM_HWPOISON  (1ULL << 60)
>  #define PM_FILE  (1ULL << 61)
>  #define PM_SWAP  (1ULL << 62)
>  #define PM_PRESENT   (1ULL << 63)
> @@ -311,6 +312,8 @@ static unsigned long pagemap_pfn(uint64_t val)
>  
>   if (val & PM_PRESENT)
>   pfn = PM_PFRAME(val);
> + else if (val & PM_HWPOISON)
> + pfn = PM_SWAP_OFFSET(val);
>   else
>   pfn = 0;
>  
> @@ -742,7 +745,7 @@ static void walk_vma(unsigned long index, unsigned long 
> count)
>   pfn = pagemap_pfn(buf[i]);
>   if (pfn)
>   walk_pfn(index + i, pfn, 1, buf[i]);
> - if (buf[i] & PM_SWAP)
> + else if (buf[i] & PM_SWAP)
>   walk_swap(index + i, buf[i]);
>   }
>  



-- 
Thanks!
Aili Yao


Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-11 Thread Aili Yao
On Thu, 11 Mar 2021 17:05:53 +
"Luck, Tony"  wrote:

> > I guess that p->mce_vaddr stores the virtual address of the error here.
> > If so, sending SIGBUS with the address looks enough as we do now, so why
> > do you walk page table to find the error virtual address?  
> 
> p->mce_vaddr only has the virtual address for the COPYIN case. In that code
> path we decode the kernel instruction that hit the fault in order to find the 
> virtual
> address. That's easy because:
> 
> 1) The kernel RIP is known to be good (can't page fault etc. on kernel 
> address).
> 2) There are only a half dozen instructions used by the kernel for get_user() 
> or
>  copy_from_user().
> 
> When the machine check happens during user execution accessing poison data
> we only have the physical address (from MCi_ADDR).
> 
> -Tony

Sorry to interrupt, but I am really confused here:
if it's a copyin case, has the page been mapped for the current process?
Will memory_failure() find it and unmap it? If that succeeds, will current
then be signaled with the correct vaddr and shift?

Maybe mce_vaddr is set correctly, but we may lose the correct page shift?

And for the copyin case, do we not need to call set_mce_nospec()?

-- 
Thanks!
Aili Yao


Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-11 Thread Aili Yao
On Thu, 11 Mar 2021 08:55:30 +
HORIGUCHI NAOYA(堀口 直也)  wrote:

> On Wed, Mar 10, 2021 at 02:10:42PM +0800, Aili Yao wrote:
> > On Fri, 5 Mar 2021 15:55:25 +
> > "Luck, Tony"  wrote:
> >   
> > > > From the walk, it seems we have got the virtual address, can we just 
> > > > send a SIGBUS with it?
> > > 
> > > If the walk wins the race and the pte for the poisoned page is still 
> > > valid, then yes.
> > > 
> > > But we could have:
> > > 
> > > CPU1CPU2
> > > memory_failure sets poison
> > > bit for struct page
> > > 
> > > 
> > > rmap finds page in task
> > > on CPU2 and sets PTE
> > > to not-valid-poison
> > > 
> > > memory_failure returns
> > > early because struct page
> > > already marked as poison
> > > 
> > > walk page tables looking
> > > for mapping - don't find it
> > > 
> > > -Tony  
> > 
> > While I don't think there is a race condition, and if you really think the 
> > pfn with SIGBUS is not
> > proper, I think following patch maybe one way.
> > I copy your abandon code, and make a little modification, and just now it 
> > pass
> > my simple test.
> > 
> > And also this is a RFC version, only valid if you think the pfn with SIGBUS 
> > is not right.
> > 
> > Thanks!
> > 
> > From a522ab8856e3a332a2318d57bb19f3c59594d462 Mon Sep 17 00:00:00 2001
> > From: Aili Yao 
> > Date: Wed, 10 Mar 2021 13:59:18 +0800
> > Subject: [PATCH] x86/mce: fix invalid SIGBUS address
> > 
> > walk the current process pte and compare with the pfn;
> > 1. only test for normal page and 2M hugetlb page;
> > 2. 1G hugetlb and transparentHuge is not support currently;
> > 3. May other fails is not recognized, This is a RFC version.
> > 
> > ---
> >  arch/x86/kernel/cpu/mce/core.c | 83 
> > --
> >  1 file changed, 80 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> > index db4afc5..65d7ef7 100644
> > --- a/arch/x86/kernel/cpu/mce/core.c
> > +++ b/arch/x86/kernel/cpu/mce/core.c
> > @@ -28,8 +28,12 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> > +#include 
> > +#include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include   
> 
> Maybe requiring many dependencies like this implies that you might be better
> to do below in mm/memory-failure.c instead of in this file.

Yes, agreed, I will change this. Thanks!
 
> > @@ -1235,6 +1239,81 @@ static void __mc_scan_banks(struct mce *m, struct 
> > pt_regs *regs, struct mce *fin
> > /* mce_clear_state will clear *final, save locally for use later */
> > *m = *final;
> >  }
> > +static int mc_pte_entry(pte_t *pte, unsigned long addr, unsigned long 
> > next, struct mm_walk *walk)
> > +{
> > +   u64 *buff = (u64 *)walk->private;
> > +   u64 pfn = buff[0];
> > +
> > +   if (!pte_present(*pte) && is_hwpoison_entry(pte_to_swp_entry(*pte)))
> > +   goto find;
> > +   else if (pte_pfn(*pte) == pfn)
> > +   goto find;
> > +
> > +   return 0;
> > +find:
> > +   buff[0] = addr;
> > +   buff[1] = PAGE_SHIFT;
> > +   return true;  
> 
> Returning true means you stop walking when you find the first entry pointing
> to a given pfn. But there could be multiple such entries, so if MCE SRAR is
> triggered by memory access to the larger address in hwpoisoned entries, the
> returned virtual address might be wrong.

Yes, we need to consider multiple poison page entries; I will fix this.
Thanks for your suggestion!
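
As a discussion aid, a rough sketch of one possible direction for that fix
(purely illustrative, not the posted patch): keep walking instead of
returning at the first hit, and count the matches so the caller can tell
the ambiguous multi-mapping case apart.

struct mc_walk_state {
	unsigned long pfn;
	unsigned long addr;	/* last matching virtual address */
	int nr_matches;
};

static int mc_pte_entry(pte_t *pte, unsigned long addr, unsigned long next,
			struct mm_walk *walk)
{
	struct mc_walk_state *st = walk->private;
	pte_t entry = *pte;

	if (!pte_none(entry) && !pte_present(entry)) {
		swp_entry_t swp = pte_to_swp_entry(entry);

		if (!is_hwpoison_entry(swp) || swp_offset(swp) != st->pfn)
			return 0;
	} else if (!pte_present(entry) || pte_pfn(entry) != st->pfn) {
		return 0;
	}

	st->addr = addr;
	st->nr_matches++;
	return 0;	/* keep walking: do not stop at the first match */
}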


> > +}
> > +
> > +extern bool is_hugetlb_entry_hwpoisoned(pte_t pte);
> > +
> > +static int mc_hugetlb_range(pte_t *ptep, unsigned long hmask,
> > +unsigned long addr, unsigned long end,
> > +struct mm_walk *walk)
> > +{
> > +   u64 *buff = (u64 *)walk->private;
> > +   u64 pfn = buff[0];
> > +   int shift = PMD_SHIFT;
> > +   pte_t pte =  huge_ptep_get(ptep);
> > +
> > +   if (unlikely(is_hugetlb_entry_hwpoisoned(pte)))
> > +   goto find;
> > +
> > +   if (pt

Re: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-03-10 Thread Aili Yao
On Wed, 10 Mar 2021 17:28:12 -0800
Andy Lutomirski  wrote:

> On Wed, Mar 10, 2021 at 5:19 PM Aili Yao  wrote:
> >
> > On Mon, 8 Mar 2021 11:00:28 -0800
> > Andy Lutomirski  wrote:
> >  
> > > > On Mar 8, 2021, at 10:31 AM, Luck, Tony  wrote:
> > > >
> > > >   
> > > >>
> > > >> Can you point me at that SIGBUS code in a current kernel?  
> > > >
> > > > It is in kill_me_maybe().  mce_vaddr is setup when we disassemble 
> > > > whatever get_user()
> > > > or copy from user variant was in use in the kernel when the poison 
> > > > memory was consumed.
> > > >
> > > >if (p->mce_vaddr != (void __user *)-1l) {
> > > >force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, 
> > > > PAGE_SHIFT);  
> > >
> > > Hmm. On the one hand, no one has complained yet. On the other hand, 
> > > hardware that supports this isn’t exactly common.
> > >
> > > We may need some actual ABI design here. We also need to make sure that 
> > > things like io_uring accesses or, more generally, anything using the 
> > > use_mm / use_temporary_mm ends up either sending no signal or sending a 
> > > signal to the right target.
> > >  
> > > >
> > > > Would it be any better if we used the BUS_MCEERR_AO code that goes into 
> > > > siginfo?  
> > >
> > > Dunno.  
> >
> > I have one thought here but don't know if it's proper:
> >
> > Previous patch use force_sig_mceerr to the user process for such a 
> > scenario; with this method
> > The SIGBUS can't be ignored as force_sig_mceerr() was designed to.
> >
> > If the user process don't want this signal, will it set signal config to 
> > ignore?
> > Maybe we can use a send_sig_mceerr() instead of force_sig_mceerr(), if 
> > process want to
> > ignore the SIGBUS, then it will ignore that, or it can also process the 
> > SIGBUS?  
> 
> I don't think the signal blocking mechanism makes sense for this.
> Blocking a signal is for saying that, if another process sends the
> signal (or an async event like ctrl-C), then the process doesn't want
> it.  Blocking doesn't block synchronous things like faults.
> 
> I think we need to at least fix the existing bug before we add more
> signals.  AFAICS the MCE_IN_KERNEL_COPYIN code is busted for kernel
> threads.

Got it, thanks!

I read https://man7.org/linux/man-pages/man2/write.2.html, and it seems the
write syscall is not expected to raise a signal; maybe a specific error
code for this scenario is enough.

-- 
Thanks!
Aili Yao


Re: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-03-10 Thread Aili Yao
On Mon, 8 Mar 2021 11:00:28 -0800
Andy Lutomirski  wrote:

> > On Mar 8, 2021, at 10:31 AM, Luck, Tony  wrote:
> > 
> >   
> >> 
> >> Can you point me at that SIGBUS code in a current kernel?  
> > 
> > It is in kill_me_maybe().  mce_vaddr is setup when we disassemble whatever 
> > get_user()
> > or copy from user variant was in use in the kernel when the poison memory 
> > was consumed.
> > 
> >if (p->mce_vaddr != (void __user *)-1l) {
> >force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);  
> 
> Hmm. On the one hand, no one has complained yet. On the other hand, hardware 
> that supports this isn’t exactly common.
> 
> We may need some actual ABI design here. We also need to make sure that 
> things like io_uring accesses or, more generally, anything using the use_mm / 
> use_temporary_mm ends up either sending no signal or sending a signal to the 
> right target.
> 
> > 
> > Would it be any better if we used the BUS_MCEERR_AO code that goes into 
> > siginfo?  
> 
> Dunno.

I have one thought here, though I don't know if it's proper:

the previous patch uses force_sig_mceerr() to signal the user process in
such a scenario; with this method the SIGBUS can't be ignored, which is
exactly what force_sig_mceerr() is designed for.

If the user process doesn't want this signal, will it configure the signal
to be ignored? Maybe we can use send_sig_mceerr() instead of
force_sig_mceerr(): if the process wants to ignore the SIGBUS it can ignore
it, and otherwise it can still handle the SIGBUS.
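
A minimal sketch of that suggestion as it might look in kill_me_maybe()
(illustrative only, assuming the existing send_sig_mceerr() helper; not a
posted patch):

	/* Sketch: deliver a normal (non-forced) SIGBUS so a process that
	 * has chosen to ignore or handle SIGBUS keeps that choice. */
	if (p->mce_vaddr != (void __user *)-1l)
		send_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT, p);
	else
		kill_me_now(cb);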

-- 
Thanks!
Aili Yao


Re: [PATCH v2] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-10 Thread Aili Yao
On Tue, 9 Mar 2021 08:28:24 +
HORIGUCHI NAOYA(堀口 直也)  wrote:

> On Tue, Mar 09, 2021 at 02:35:34PM +0800, Aili Yao wrote:
> > When the page is already poisoned, another memory_failure() call in the
> > same page now return 0, meaning OK. For nested memory mce handling, this
> > behavior may lead to mce looping, Example:
> > 
> > 1.When LCME is enabled, and there are two processes A && B running on
> > different core X && Y separately, which will access one same page, then
> > the page corrupted when process A access it, a MCE will be rasied to
> > core X and the error process is just underway.
> > 
> > 2.Then B access the page and trigger another MCE to core Y, it will also
> > do error process, it will see TestSetPageHWPoison be true, and 0 is
> > returned.
> > 
> > 3.The kill_me_maybe will check the return:
> > 
> > 1244 static void kill_me_maybe(struct callback_head *cb)
> > 1245 {
> > 
> > 1254 if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> > 1255 !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> > 1256 set_mce_nospec(p->mce_addr >> PAGE_SHIFT,
> > p->mce_whole_page);
> > 1257 sync_core();
> > 1258 return;
> > 1259 }
> > 
> > 1267 }
> > 
> > 4. The error process for B will end, and may nothing happened if
> > kill-early is not set, The process B will re-excute instruction and get
> > into mce again and then loop happens. And also the set_mce_nospec()
> > here is not proper, may refer to commit fd0e786d9d09 ("x86/mm,
> > mm/hwpoison: Don't unconditionally unmap kernel 1:1 pages").
> > 
> > For other cases which care the return value of memory_failure() should
> > check why they want to process a memory error which have already been
> > processed. This behavior seems reasonable.  
> 
> Other reviewers shared ideas about the returned value, but actually
> I'm not sure which the best one is (EBUSY is not that bad).
> What we need to fix the reported issue is to return non-zero value
> for "already poisoned" case (the value itself is not so important). 
> 
> Other callers of memory_failure() (mostly test programs) could see
> the change of return value, but they can already see EBUSY now and
> anyway they should check dmesg for more detail about why failed,
> so the impact of the change is not so big.
> 
> > 
> > Signed-off-by: Aili Yao   
> 
> Reviewed-by: Naoya Horiguchi 
> 
> Thanks,
> Naoya Horiguchi

Thanks!

And I found that my mail got lost on the mailing list!

-- 
Thanks!
Aili Yao


Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-09 Thread Aili Yao
On Fri, 5 Mar 2021 15:55:25 +
"Luck, Tony"  wrote:

> > From the walk, it seems we have got the virtual address, can we just send a 
> > SIGBUS with it?  
> 
> If the walk wins the race and the pte for the poisoned page is still valid, 
> then yes.
> 
> But we could have:
> 
> CPU1CPU2
> memory_failure sets poison
> bit for struct page
> 
> 
> rmap finds page in task
> on CPU2 and sets PTE
> to not-valid-poison
> 
> memory_failure returns
> early because struct page
> already marked as poison
> 
> walk page tables looking
> for mapping - don't find it
> 
> -Tony

While I don't think there is a race condition, if you really think that
SIGBUS with the pfn is not proper, I think the following patch may be one
way. I copied your abandoned code and made a small modification, and it has
just passed my simple test.

This is also an RFC version, only relevant if you think that SIGBUS with
the pfn is not right.

Thanks!

From a522ab8856e3a332a2318d57bb19f3c59594d462 Mon Sep 17 00:00:00 2001
From: Aili Yao 
Date: Wed, 10 Mar 2021 13:59:18 +0800
Subject: [PATCH] x86/mce: fix invalid SIGBUS address

Walk the current process's ptes and compare with the pfn.
1. Only tested for normal pages and 2M hugetlb pages.
2. 1G hugetlb and transparent huge pages are not supported currently.
3. Other failure modes may not be recognized; this is an RFC version.

---
 arch/x86/kernel/cpu/mce/core.c | 83 --
 1 file changed, 80 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index db4afc5..65d7ef7 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -28,8 +28,12 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1235,6 +1239,81 @@ static void __mc_scan_banks(struct mce *m, struct 
pt_regs *regs, struct mce *fin
/* mce_clear_state will clear *final, save locally for use later */
*m = *final;
 }
+static int mc_pte_entry(pte_t *pte, unsigned long addr, unsigned long next, 
struct mm_walk *walk)
+{
+   u64 *buff = (u64 *)walk->private;
+   u64 pfn = buff[0];
+
+   if (!pte_present(*pte) && is_hwpoison_entry(pte_to_swp_entry(*pte)))
+   goto find;
+   else if (pte_pfn(*pte) == pfn)
+   goto find;
+
+   return 0;
+find:
+   buff[0] = addr;
+   buff[1] = PAGE_SHIFT;
+   return true;
+}
+
+extern bool is_hugetlb_entry_hwpoisoned(pte_t pte);
+
+static int mc_hugetlb_range(pte_t *ptep, unsigned long hmask,
+unsigned long addr, unsigned long end,
+struct mm_walk *walk)
+{
+   u64 *buff = (u64 *)walk->private;
+   u64 pfn = buff[0];
+   int shift = PMD_SHIFT;
+   pte_t pte =  huge_ptep_get(ptep);
+
+   if (unlikely(is_hugetlb_entry_hwpoisoned(pte)))
+   goto find;
+
+   if (pte_pfn(*ptep) == pfn)
+   goto find;
+
+   return 0;
+find:
+   buff[0] = addr;
+   buff[1] = shift;
+   return true;
+}
+
+static struct mm_walk_ops walk = {
+   .pte_entry = mc_pte_entry,
+   .hugetlb_entry  = mc_hugetlb_range
+};
+
+void mc_memory_failure_error(struct task_struct *p, unsigned long pfn)
+{
+   u64 buff[2] = {pfn, 0};
+   struct page *page;
+   int ret = -1;
+
+   page = pfn_to_page(pfn);
+   if (!page)
+   goto force_sigbus;
+
+   if (is_zone_device_page(page))
+   goto force_sigbus;
+
+   mmap_read_lock(p->mm);
+	ret = walk_page_range(p->mm, 0, TASK_SIZE_MAX, &walk, (void *)buff);
+   mmap_read_unlock(p->mm);
+
+   if (ret && buff[0]) {
+   pr_err("Memory error may not recovered: %#llx: Sending SIGBUS 
to %s:%d due to hardware memory corruption\n",
+   buff[0], p->comm, p->pid);
+   force_sig_mceerr(BUS_MCEERR_AR, (void __user *)buff[0], 
buff[1]);
+   } else {
+force_sigbus:
+   pr_err("Memory error may not recovered, pfn: %#lx: Sending 
SIGBUS to %s:%d due to hardware memory corruption\n",
+   pfn, p->comm, p->pid);
+   force_sig_mceerr(BUS_MCEERR_AR, (void __user *)pfn, PAGE_SHIFT);
+   }
+
+}
 
 static void kill_me_now(struct callback_head *ch)
 {
@@ -1259,9 +1338,7 @@ static void kill_me_maybe(struct callback_head *cb)
}
 
if (p->mce_vaddr != (void __user *)-1l) {
-   pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to 
%s:%d due to hardware memory corruption\n",
-   p->mce_addr >> PAGE_SHIFT, p->

Re: [PATCH] mm/memory-failure: Use a mutex to avoid memory_failure() races

2021-03-08 Thread Aili Yao
On Tue, 9 Mar 2021 06:04:41 +
HORIGUCHI NAOYA(堀口 直也)  wrote:

> ...
> > 
> > If others are OK with this method, then I am OK too.
> > But I have two concerns, May you take into account:
> > 
> > 1. The memory_failure with 0 return code for race condition, then the 
> > kill_me_maybe() goes into branch:
> > if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> > !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> > set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> > sync_core();
> > return;
> > }
> > 
> > while we place set_mce_nospec() here is for a reason, please see commit 
> > fd0e786d9d09024f67b.
> > 
> > 2. When memory_failure return 0 and maybe return to user process, and it 
> > may re-execute the instruction triggering previous fault, this behavior
> > assume an implicit dependence that the related pte has been correctly set. 
> > or if not correctlily set, it will lead to infinite loop again.  
> 
> These seem to be separate issues from memory_failure()'s concurrency issue,
> so I'm still expecting that your patch is to be merged. Maybe do you want
> to update it based on the discussion (if it's concluded)?
> 
> Thanks,
> Naoya Horiguchi

I have submitted a v2 patch; please help review it.

Thanks!
 
-- 
Thanks!
Aili Yao


[PATCH v2] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-08 Thread Aili Yao
When the page is already poisoned, another memory_failure() call on the
same page now returns 0, meaning OK. For nested MCE handling, this
behavior may lead to an MCE loop. Example:

1. When LMCE is enabled, and there are two processes A and B running on
different cores X and Y separately, both accessing the same page: the
page gets corrupted when process A accesses it, an MCE is raised to
core X, and its error handling is just underway.

2. Then B accesses the page and triggers another MCE on core Y. It will
also do error handling, see TestSetPageHWPoison return true, and 0 is
returned.

3. kill_me_maybe() will check the return value:

1244 static void kill_me_maybe(struct callback_head *cb)
1245 {

1254 if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
1255 !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
1256 set_mce_nospec(p->mce_addr >> PAGE_SHIFT,
p->mce_whole_page);
1257 sync_core();
1258 return;
1259 }

1267 }

4. The error handling for B will then end, and nothing may happen if
early kill is not set. Process B will re-execute the instruction, run
into the MCE again, and the loop happens. Also the set_mce_nospec()
here is not proper; refer to commit fd0e786d9d09 ("x86/mm,
mm/hwpoison: Don't unconditionally unmap kernel 1:1 pages").

Other callers that care about the return value of memory_failure() should
consider why they want to handle a memory error that has already been
handled. This behavior seems reasonable.

Signed-off-by: Aili Yao 
---
 mm/memory-failure.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 24210c9bd843..b6bc77460ee1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1228,7 +1228,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int 
flags)
if (TestSetPageHWPoison(head)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
   pfn);
-   return 0;
+   return -EBUSY;
}
 
num_poisoned_pages_inc();
@@ -1430,7 +1430,7 @@ int memory_failure(unsigned long pfn, int flags)
if (TestSetPageHWPoison(p)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
pfn);
-   return 0;
+   return -EBUSY;
}
 
orig_head = hpage = compound_head(p);
-- 
2.25.1


Re: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-03-08 Thread Aili Yao
On Tue, 9 Mar 2021 10:14:52 +0800
Aili Yao  wrote:

> On Mon, 8 Mar 2021 18:31:07 +
> "Luck, Tony"  wrote:
> 
> > > Can you point me at that SIGBUS code in a current kernel?
> > 
> > It is in kill_me_maybe().  mce_vaddr is setup when we disassemble whatever 
> > get_user()
> > or copy from user variant was in use in the kernel when the poison memory 
> > was consumed.
> > 
> > if (p->mce_vaddr != (void __user *)-1l) {
> > force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
> > 
> > Would it be any better if we used the BUS_MCEERR_AO code that goes into 
> > siginfo?
> > 
> > That would make it match up better with what happens when poison is found
> > asynchronously by the patrol scrubber. I.e. the semantics are:
> > 
> > AR: You just touched poison at this address and need to do something about 
> > that.
> > AO: Just letting you know that you have some poison at the address in 
> > siginfo.
> > 
> > -Tony  
> 
> Is the kill action for this scenario in memory_failure()?

Does the current logic kill the process twice in this scenario?
I am confused.

-- 
Thanks!
Aili Yao


Re: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-03-08 Thread Aili Yao
On Mon, 8 Mar 2021 18:31:07 +
"Luck, Tony"  wrote:

> > Can you point me at that SIGBUS code in a current kernel?  
> 
> It is in kill_me_maybe().  mce_vaddr is setup when we disassemble whatever 
> get_user()
> or copy from user variant was in use in the kernel when the poison memory was 
> consumed.
> 
> if (p->mce_vaddr != (void __user *)-1l) {
> force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
> 
> Would it be any better if we used the BUS_MCEERR_AO code that goes into 
> siginfo?
> 
> That would make it match up better with what happens when poison is found
> asynchronously by the patrol scrubber. I.e. the semantics are:
> 
> AR: You just touched poison at this address and need to do something about 
> that.
> AO: Just letting you know that you have some poison at the address in siginfo.
> 
> -Tony

Is the kill action for this scenario in memory_failure()?

-- 
Thanks!
Aili Yao


Re: [PATCH] mm/memory-failure: Use a mutex to avoid memory_failure() races

2021-03-08 Thread Aili Yao
On Mon, 8 Mar 2021 14:55:04 -0800
"Luck, Tony"  wrote:

> There can be races when multiple CPUs consume poison from the same
> page. The first into memory_failure() atomically sets the HWPoison
> page flag and begins hunting for tasks that map this page. Eventually
> it invalidates those mappings and may send a SIGBUS to the affected
> tasks.
> 
> But while all that work is going on, other CPUs see a "success"
> return code from memory_failure() and so they believe the error
> has been handled and continue executing.
> 
> Fix by wrapping most of the internal parts of memory_failure() in
> a mutex.
> 
> Signed-off-by: Tony Luck 
> ---
>  mm/memory-failure.c | 19 +--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 24210c9bd843..c1509f4b565e 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1381,6 +1381,8 @@ static int memory_failure_dev_pagemap(unsigned long 
> pfn, int flags,
>   return rc;
>  }
>  
> +static DEFINE_MUTEX(mf_mutex);
> +
>  /**
>   * memory_failure - Handle memory failure of a page.
>   * @pfn: Page Number of the corrupted page
> @@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags)
>   return -ENXIO;
>   }
>  
> +	mutex_lock(&mf_mutex);
> +
>  try_again:
> - if (PageHuge(p))
> - return memory_failure_hugetlb(pfn, flags);
> + if (PageHuge(p)) {
> + res = memory_failure_hugetlb(pfn, flags);
> + goto out2;
> + }
> +
>   if (TestSetPageHWPoison(p)) {
>   pr_err("Memory failure: %#lx: already hardware poisoned\n",
>   pfn);
> +	mutex_unlock(&mf_mutex);
>   return 0;
>   }
>  
> @@ -1463,9 +1471,11 @@ int memory_failure(unsigned long pfn, int flags)
>   res = MF_FAILED;
>   }
>   action_result(pfn, MF_MSG_BUDDY, res);
> +	mutex_unlock(&mf_mutex);
>   return res == MF_RECOVERED ? 0 : -EBUSY;
>   } else {
>   action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, 
> MF_IGNORED);
> +	mutex_unlock(&mf_mutex);
>   return -EBUSY;
>   }
>   }
> @@ -1473,6 +1483,7 @@ int memory_failure(unsigned long pfn, int flags)
>   if (PageTransHuge(hpage)) {
>   if (try_to_split_thp_page(p, "Memory Failure") < 0) {
>   action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
> +	mutex_unlock(&mf_mutex);
>   return -EBUSY;
>   }
>   VM_BUG_ON_PAGE(!page_count(p), p);
> @@ -1517,6 +1528,7 @@ int memory_failure(unsigned long pfn, int flags)
>   num_poisoned_pages_dec();
>   unlock_page(p);
>   put_page(p);
> +	mutex_unlock(&mf_mutex);
>   return 0;
>   }
>   if (hwpoison_filter(p)) {
> @@ -1524,6 +1536,7 @@ int memory_failure(unsigned long pfn, int flags)
>   num_poisoned_pages_dec();
>   unlock_page(p);
>   put_page(p);
> +	mutex_unlock(&mf_mutex);
>   return 0;
>   }
>  
> @@ -1559,6 +1572,8 @@ int memory_failure(unsigned long pfn, int flags)
>   res = identify_page_state(pfn, p, page_flags);
>  out:
>   unlock_page(p);
> +out2:
> +	mutex_unlock(&mf_mutex);
>   return res;
>  }
>  EXPORT_SYMBOL_GPL(memory_failure);

If others are OK with this method, then I am OK too.
But I have two concerns; could you take them into account:

1. memory_failure() returns 0 for the race condition, and then
kill_me_maybe() goes into this branch:
if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
!(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
sync_core();
return;
}

while set_mce_nospec() is placed here for a reason; please see commit
fd0e786d9d09024f67b.

2. When memory_failure() returns 0, we may return to the user process,
and it may re-execute the instruction that triggered the previous fault.
This behavior assumes an implicit dependency that the related pte has been
set correctly; if it has not, it will lead to an infinite loop again.

-- 
Thanks!
Aili Yao


Re: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-03-08 Thread Aili Yao
On Sun, 7 Mar 2021 11:16:24 -0800
Andy Lutomirski  wrote:

> > > > >> Some programs may use read(2), write(2), etc as ways to check if
> > > > >> memory is valid without getting a signal.  They might not want
> > > > >> signals, which means that this feature might need to be 
> > > > >> configurable.  
> > > > >
> > > > > That sounds like an appalling hack. If users need such a mechanism
> > > > > we should create some better way to do that.
> > > > >  
> > > >
> > > > Appalling hack or not, it works. So, if we’re going to send a signal to 
> > > > user code that looks like it originated from a bina fide architectural 
> > > > recoverable fault, it needs to be recoverable.  A load from a failed 
> > > > NVDIMM page is such a fault. A *kernel* load is not. So we need to 
> > > > distinguish it somehow.  
> > >
> > > Sorry for my previous mis-understanding, and i have some questions:
> > > if programs use read,write to check if if memory is valid, does it really 
> > > want to cover the poison case?  
> 
> I don't know.
> 
> > > When for such a case, an error is returned,  can the program realize it's 
> > > hwposion issue not other software error and process correctly?  
> 
> Again, I don't know.  But changing the API like this seems potentialy
> dangerous and needs to be done with care.
> 
> > >
> > > if this is the proper action, the original posion flow in current code 
> > > from read and write need to change too.
> > >  
> >
> > Sorry, another question:
> >   When programs use read(2), write(2) as ways to check if memory is valid, 
> > does it really want to check if the user page the program provided is 
> > valid, not the destination or disk space valid?  
> 
> They may well be trying to see if their memory is valid.

Thanks for your reply; I don't know what to do here.
With the current code, if a user program writes to a block device (maybe as
a test) and its user copy page gets corrupted during the in-kernel copy,
the process is killed with a SIGBUS.
For the page fault case in this thread, however, the process just gets an
error returned.

-- 
Thanks!
Aili Yao


Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-04 Thread Aili Yao
On Fri, 5 Mar 2021 09:30:16 +0800
Aili Yao  wrote:

> On Thu, 4 Mar 2021 15:57:20 -0800
> "Luck, Tony"  wrote:
> 
> > On Thu, Mar 04, 2021 at 02:45:24PM +0800, Aili Yao wrote:
> > > > > if your methods works, should it be like this?
> > > > > 
> > > > > 1582 pteval = 
> > > > > swp_entry_to_pte(make_hwpoison_entry(subpage));
> > > > > 1583 if (PageHuge(page)) {
> > > > > 1584 
> > > > > hugetlb_count_sub(compound_nr(page), mm);
> > > > > 1585 set_huge_swap_pte_at(mm, address,
> > > > > 1586  pvmw.pte, 
> > > > > pteval,
> > > > > 1587  
> > > > > vma_mmu_pagesize(vma));
> > > > > 1588 } else {
> > > > > 1589 dec_mm_counter(mm, 
> > > > > mm_counter(page));
> > > > > 1590 set_pte_at(mm, address, 
> > > > > pvmw.pte, pteval);
> > > > > 1591 }
> > > > > 
> > > > > the page fault check if it's a poison page using is_hwpoison_entry(),
> > > > > 
> > > > 
> > > > And if it works, does we need some locking mechanism before we call 
> > > > walk_page_range();
> > > > if we lock, does we need to process the blocking interrupted error as 
> > > > other places will do?
> > > >   
> > > 
> > > And another thing:
> > > Do we need a call to flush_tlb_page(vma, address) to make the pte changes 
> > > into effect?  
> > 
> > Thanks for all the pointers.  I added them to the patch (see below).
> > [The pmd/pud cases may need some tweaking ... but just trying to get
> > the 4K page case working first]
> > 
> > I tried testing by skipping the call to memory_failure() and just
> > using this new code to search the page tables for current page and
> > marking it hwpoison (to simulate the case where 2nd process gets the
> > early return from memory_failure(). Something is still missing because I 
> > get:
> > 
> > [  481.911298] mce: pte_entry: matched pfn - mark poison & zap pte
> > [  481.917935] MCE: Killing einj_mem_uc: due to hardware memory 
> > corruption fault at 7fe64b33b400
> > [  482.933775] BUG: Bad page cache in process einj_mem_uc  pfn:408b6d6
> > [  482.940777] page:13ea6e96 refcount:3 mapcount:1 
> > mapping:e3a069d9 index:0x0 pfn:0x408b6d6
> > [  482.951355] memcg:94a809834000
> > [  482.955153] aops:shmem_aops ino:3c04
> > [  482.959142] flags: 
> > 0x97c0880015(locked|uptodate|lru|swapbacked|hwpoison)
> > [  482.967018] raw: 0097c0880015 94c80e93ec00 94c80e93ec00 
> > 94c80a9b25a8
> > [  482.975666] raw:   0003 
> > 94a809834000
> > [  482.984310] page dumped because: still mapped when deleted
> 
> From the walk, it seems we have got the virtual address, can we just send a 
> SIGBUS with it?

Is this walk proper for other memory-failure error cases? Can it be
applied to the if (p->mce_vaddr != (void __user *)-1l) branch in
kill_me_maybe()?

> > commit e5de44560b33e2d407704243566253a70f858a59
> > Author: Tony Luck 
> > Date:   Tue Mar 2 15:06:33 2021 -0800
> > 
> > x86/mce: Handle races between machine checks
> > 
> > When multiple CPUs hit the same poison memory there is a race. The
> > first CPU into memory_failure() atomically marks the page as poison
> > and continues processing to hunt down all the tasks that map this page
> > so that the virtual addresses can be marked not-present and SIGBUS
> > sent to the task that did the access.
> > 
> > Later CPUs get an early return from memory_failure() and may return
> > to user mode and access the poison again.
> > 
> > Add a new argument to memory_failure() so that it can indicate when
> > the race has been lost. Fix kill_me_maybe() to scan page tables in
> > this case to unmap pages.
> > 
> > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> > index 7962355436da..a52c6a772de2 100644
> > --- a/arch/x86/kernel/cpu/mce/core.c
> > +++ b/arch/x86/kernel/cpu/mce/core.c
> > @@ -28,8 +28,12 @@
> >  #include 
>

Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-04 Thread Aili Yao
On Thu, 4 Mar 2021 15:57:20 -0800
"Luck, Tony"  wrote:

> On Thu, Mar 04, 2021 at 02:45:24PM +0800, Aili Yao wrote:
> > > > if your methods works, should it be like this?
> > > > 
> > > > 1582 pteval = 
> > > > swp_entry_to_pte(make_hwpoison_entry(subpage));
> > > > 1583 if (PageHuge(page)) {
> > > > 1584 
> > > > hugetlb_count_sub(compound_nr(page), mm);
> > > > 1585 set_huge_swap_pte_at(mm, address,
> > > > 1586  pvmw.pte, 
> > > > pteval,
> > > > 1587  
> > > > vma_mmu_pagesize(vma));
> > > > 1588 } else {
> > > > 1589 dec_mm_counter(mm, 
> > > > mm_counter(page));
> > > > 1590 set_pte_at(mm, address, pvmw.pte, 
> > > > pteval);
> > > > 1591 }
> > > > 
> > > > the page fault check if it's a poison page using is_hwpoison_entry(),
> > > > 
> > > 
> > > And if it works, does we need some locking mechanism before we call 
> > > walk_page_range();
> > > if we lock, does we need to process the blocking interrupted error as 
> > > other places will do?
> > >   
> > 
> > And another thing:
> > Do we need a call to flush_tlb_page(vma, address) to make the pte changes 
> > into effect?  
> 
> Thanks for all the pointers.  I added them to the patch (see below).
> [The pmd/pud cases may need some tweaking ... but just trying to get
> the 4K page case working first]
> 
> I tried testing by skipping the call to memory_failure() and just
> using this new code to search the page tables for current page and
> marking it hwpoison (to simulate the case where 2nd process gets the
> early return from memory_failure(). Something is still missing because I get:
> 
> [  481.911298] mce: pte_entry: matched pfn - mark poison & zap pte
> [  481.917935] MCE: Killing einj_mem_uc: due to hardware memory 
> corruption fault at 7fe64b33b400
> [  482.933775] BUG: Bad page cache in process einj_mem_uc  pfn:408b6d6
> [  482.940777] page:13ea6e96 refcount:3 mapcount:1 
> mapping:e3a069d9 index:0x0 pfn:0x408b6d6
> [  482.951355] memcg:94a809834000
> [  482.955153] aops:shmem_aops ino:3c04
> [  482.959142] flags: 
> 0x97c0880015(locked|uptodate|lru|swapbacked|hwpoison)
> [  482.967018] raw: 0097c0880015 94c80e93ec00 94c80e93ec00 
> 94c80a9b25a8
> [  482.975666] raw:   0003 
> 94a809834000
> [  482.984310] page dumped because: still mapped when deleted

From the walk, it seems we have got the virtual address; can we just send a
SIGBUS with it?



> commit e5de44560b33e2d407704243566253a70f858a59
> Author: Tony Luck 
> Date:   Tue Mar 2 15:06:33 2021 -0800
> 
> x86/mce: Handle races between machine checks
> 
> When multiple CPUs hit the same poison memory there is a race. The
> first CPU into memory_failure() atomically marks the page as poison
> and continues processing to hunt down all the tasks that map this page
> so that the virtual addresses can be marked not-present and SIGBUS
> sent to the task that did the access.
> 
> Later CPUs get an early return from memory_failure() and may return
> to user mode and access the poison again.
> 
> Add a new argument to memory_failure() so that it can indicate when
> the race has been lost. Fix kill_me_maybe() to scan page tables in
> this case to unmap pages.
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 7962355436da..a52c6a772de2 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -28,8 +28,12 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> +#include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -637,6 +641,7 @@ static int uc_decode_notifier(struct notifier_block *nb, 
> unsigned long val,
>  {
>   struct mce *mce = (struct mce *)data;
>   unsigned long pfn;
> + int already = 0;
>  
>   if (!mce || !mce_usable_address(mce))
>   return NOTIFY_DONE;
> @@ -646,8 +651,9 @@ static int uc_decode_notifier(struct notifier_block *nb, 
> unsigned long val,
>   return NOTIFY_DO

Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-03 Thread Aili Yao
On Thu, 4 Mar 2021 12:19:41 +0800
Aili Yao  wrote:

> On Thu, 4 Mar 2021 10:16:53 +0800
> Aili Yao  wrote:
> 
> > On Wed, 3 Mar 2021 15:41:35 +
> > "Luck, Tony"  wrote:
> >   
> > > > For error address with sigbus, i think this is not an issue resulted by 
> > > > the patch i post, before my patch, the issue is already there.
> > > > I don't find a realizable way to get the correct address for same 
> > > > reason --- we don't know whether the page mapping is there or not when
> > > > we got to kill_me_maybe(), in some case, we may get it, but there are a 
> > > > lot of parallel issue need to consider, and if failed we have to 
> > > > fallback
> > > > to the error brach again, remaining current code may be an easy option; 
> > > >  
> > > 
> > > My RFC patch from yesterday removes the uncertainty about whether the 
> > > page is there or not. After it walks the page
> > > tables we know that the poison page isn't mapped (note that patch is RFC 
> > > for a reason ... I'm 90% sure that it should
> > > do a bit more that just clear the PRESENT bit).
> > > 
> > > So perhaps memory_failure() has queued a SIGBUS for this task, if so, we 
> > > take it when we return from kill_me_maybe()  
> 
And when this happens, the process will receive a SIGBUS at AO level; is
that proper, given it is not an AR?
> 
> > > If not, we will return to user mode and re-execute the failing 
> > > instruction ... but because the page is unmapped we will take a #PF
> > 
> > Got this, I have some error thoughts here.
> > 
> >   
> > > The x86 page fault handler will see that the page for this physical 
> > > address is marked HWPOISON, and it will send the SIGBUS
> > > (just like it does if the page had been removed by an earlier UCNA/SRAO 
> > > error).
> > 
> > if your methods works, should it be like this?
> > 
> > 1582 pteval = 
> > swp_entry_to_pte(make_hwpoison_entry(subpage));
> > 1583 if (PageHuge(page)) {
> > 1584 hugetlb_count_sub(compound_nr(page), 
> > mm);
> > 1585 set_huge_swap_pte_at(mm, address,
> > 1586  pvmw.pte, pteval,
> > 1587  
> > vma_mmu_pagesize(vma));
> > 1588 } else {
> > 1589 dec_mm_counter(mm, mm_counter(page));
> > 1590 set_pte_at(mm, address, pvmw.pte, 
> > pteval);
> > 1591         }
> > 
> > the page fault check if it's a poison page using is_hwpoison_entry(),
> >   
> 
> And if it works, does we need some locking mechanism before we call 
> walk_page_range();
> if we lock, does we need to process the blocking interrupted error as other 
> places will do?
> 

And another thing:
Do we need a call to flush_tlb_page(vma, address) to make the pte change take effect?
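
What I mean is something like this in the walk callback (just a sketch; vma,
addr, pte and new_pte are whatever the callback already has): once the pte is
rewritten outside the normal try_to_unmap() path, the stale TLB entry for that
address still needs to be invalidated.

	/* sketch: after rewriting the pte in the walk callback */
	set_pte_at(vma->vm_mm, addr, pte, new_pte);
	flush_tlb_page(vma, addr);	/* make the pte change visible to the CPU */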

-- 
Thanks!
Aili Yao


Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-03 Thread Aili Yao
On Thu, 4 Mar 2021 10:16:53 +0800
Aili Yao  wrote:

> On Wed, 3 Mar 2021 15:41:35 +
> "Luck, Tony"  wrote:
> 
> > > For error address with sigbus, i think this is not an issue resulted by 
> > > the patch i post, before my patch, the issue is already there.
> > > I don't find a realizable way to get the correct address for same reason 
> > > --- we don't know whether the page mapping is there or not when
> > > we got to kill_me_maybe(), in some case, we may get it, but there are a 
> > > lot of parallel issue need to consider, and if failed we have to fallback
> > > to the error brach again, remaining current code may be an easy option;   
> > >  
> > 
> > My RFC patch from yesterday removes the uncertainty about whether the page 
> > is there or not. After it walks the page
> > tables we know that the poison page isn't mapped (note that patch is RFC 
> > for a reason ... I'm 90% sure that it should
> > do a bit more that just clear the PRESENT bit).
> > 
> > So perhaps memory_failure() has queued a SIGBUS for this task, if so, we 
> > take it when we return from kill_me_maybe()

And when this happens, the process will receive a SIGBUS at AO level; is that proper, given that it is not an AR signal?

> > If not, we will return to user mode and re-execute the failing instruction 
> > ... but because the page is unmapped we will take a #PF  
> 
> Got this, I have some error thoughts here.
> 
> 
> > The x86 page fault handler will see that the page for this physical address 
> > is marked HWPOISON, and it will send the SIGBUS
> > (just like it does if the page had been removed by an earlier UCNA/SRAO 
> > error).  
> 
> if your methods works, should it be like this?
> 
> 1582 pteval = 
> swp_entry_to_pte(make_hwpoison_entry(subpage));
> 1583 if (PageHuge(page)) {
> 1584 hugetlb_count_sub(compound_nr(page), mm);
> 1585 set_huge_swap_pte_at(mm, address,
> 1586  pvmw.pte, pteval,
> 1587  
> vma_mmu_pagesize(vma));
> 1588 } else {
> 1589 dec_mm_counter(mm, mm_counter(page));
> 1590 set_pte_at(mm, address, pvmw.pte, 
> pteval);
> 1591 }
> 
> the page fault check if it's a poison page using is_hwpoison_entry(),
> 

And if it works, do we need some locking mechanism before we call walk_page_range()?
If we lock, do we need to handle being interrupted while blocked, as other call sites do?
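
For the locking part, what I imagine is only the mmap read lock around the walk,
roughly like this (a sketch; hwp_walk_ops and the private pointer are made-up
names, and walk_page_range() expects the mmap lock to be held):

	mmap_read_lock(current->mm);
	ret = walk_page_range(current->mm, 0, TASK_SIZE_MAX,
			      &hwp_walk_ops, &priv);
	mmap_read_unlock(current->mm);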

-- 
Thanks!
Aili Yao


Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-03 Thread Aili Yao
On Wed, 3 Mar 2021 15:41:35 +
"Luck, Tony"  wrote:

> > For error address with sigbus, i think this is not an issue resulted by the 
> > patch i post, before my patch, the issue is already there.
> > I don't find a realizable way to get the correct address for same reason 
> > --- we don't know whether the page mapping is there or not when
> > we got to kill_me_maybe(), in some case, we may get it, but there are a lot 
> > of parallel issue need to consider, and if failed we have to fallback
> > to the error brach again, remaining current code may be an easy option;  
> 
> My RFC patch from yesterday removes the uncertainty about whether the page is 
> there or not. After it walks the page
> tables we know that the poison page isn't mapped (note that patch is RFC for 
> a reason ... I'm 90% sure that it should
> do a bit more that just clear the PRESENT bit).
> 
> So perhaps memory_failure() has queued a SIGBUS for this task, if so, we take 
> it when we return from kill_me_maybe()
> 
> If not, we will return to user mode and re-execute the failing instruction 
> ... but because the page is unmapped we will take a #PF

Got it; my earlier thinking here was wrong.


> The x86 page fault handler will see that the page for this physical address 
> is marked HWPOISON, and it will send the SIGBUS
> (just like it does if the page had been removed by an earlier UCNA/SRAO 
> error).

If your method works, should it look like this?

1582 pteval = 
swp_entry_to_pte(make_hwpoison_entry(subpage));
1583 if (PageHuge(page)) {
1584 hugetlb_count_sub(compound_nr(page), mm);
1585 set_huge_swap_pte_at(mm, address,
1586  pvmw.pte, pteval,
1587  
vma_mmu_pagesize(vma));
1588 } else {
1589 dec_mm_counter(mm, mm_counter(page));
1590 set_pte_at(mm, address, pvmw.pte, pteval);
1591         }

The page fault path checks whether it is a poisoned page using is_hwpoison_entry().
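
For reference, the check I mean is roughly what do_swap_page() in mm/memory.c
already does (paraphrased sketch, not a literal copy):

	swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);

	if (non_swap_entry(entry) && is_hwpoison_entry(entry))
		ret = VM_FAULT_HWPOISON;	/* fault path then raises SIGBUS */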

-- 
Thanks!
Aili Yao


Re: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-03-03 Thread Aili Yao
On Mon, 1 Mar 2021 11:09:36 -0800
Andy Lutomirski  wrote:

> > On Mar 1, 2021, at 11:02 AM, Luck, Tony  wrote:
> > 
> >   
> >> 
> >> Some programs may use read(2), write(2), etc as ways to check if
> >> memory is valid without getting a signal.  They might not want
> >> signals, which means that this feature might need to be configurable.  
> > 
> > That sounds like an appalling hack. If users need such a mechanism
> > we should create some better way to do that.
> >   
> 
> Appalling hack or not, it works. So, if we’re going to send a signal to user 
> code that looks like it originated from a bina fide architectural recoverable 
> fault, it needs to be recoverable.  A load from a failed NVDIMM page is such 
> a fault. A *kernel* load is not. So we need to distinguish it somehow.

Sorry for my previous misunderstanding; I have some questions:
If programs use read or write to check whether memory is valid, do they really want that to cover the poison case?
When, in such a case, an error is returned, can the program realize that it is a hwpoison issue rather than some other software error, and handle it correctly?

If this is the proper action, the original poison flow in the current code for read and write needs to change too.

-- 
Thanks!
Aili Yao


Re: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-03-03 Thread Aili Yao
On Wed, 3 Mar 2021 20:24:02 +0800
Aili Yao  wrote:

> On Mon, 1 Mar 2021 11:09:36 -0800
> Andy Lutomirski  wrote:
> 
> > > On Mar 1, 2021, at 11:02 AM, Luck, Tony  wrote:
> > > 
> > > 
> > >> 
> > >> Some programs may use read(2), write(2), etc as ways to check if
> > >> memory is valid without getting a signal.  They might not want
> > >> signals, which means that this feature might need to be configurable.
> > > 
> > > That sounds like an appalling hack. If users need such a mechanism
> > > we should create some better way to do that.
> > > 
> > 
> > Appalling hack or not, it works. So, if we’re going to send a signal to 
> > user code that looks like it originated from a bina fide architectural 
> > recoverable fault, it needs to be recoverable.  A load from a failed NVDIMM 
> > page is such a fault. A *kernel* load is not. So we need to distinguish it 
> > somehow.  
> 
> Sorry for my previous mis-understanding, and i have some questions: 
> if programs use read,write to check if if memory is valid, does it really 
> want to cover the poison case? 
> When for such a case, an error is returned,  can the program realize it's 
> hwposion issue not other software error and process correctly?
> 
> if this is the proper action, the original posion flow in current code from 
> read and write need to change too.
> 

Sorry, another question:
  When programs use read(2) or write(2) as a way to check whether memory is valid, do they really want to check that the user page the program provided is valid, rather than that the destination or the disk space is valid?
  The patch does not affect that purpose, as it only applies to a user page that the program passes to write or to a similar syscall parameter.

-- 
Thanks!
Aili Yao


Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-03 Thread Aili Yao
Hi tony:

> On Tue, 2 Mar 2021 19:39:53 -0800
> "Luck, Tony"  wrote:
> 
> > On Fri, Feb 26, 2021 at 10:59:15AM +0800, Aili Yao wrote:  
> > > Hi naoya, tony:
> > > > > 
> > > > > Idea for what we should do next ... Now that x86 is calling 
> > > > > memory_failure()
> > > > > from user context ... maybe parallel calls for the same page should
> > > > > be blocked until the first caller completes so we can:
> > > > > a) know that pages are unmapped (if that happens)
> > > > > b) all get the same success/fail status  
> > > > 
> > > > One memory_failure() call changes the target page's status and
> > > > affects all mappings to all affected processes, so I think that
> > > > (ideally) we don't have to block other threads (letting them
> > > > early return seems fine).  Sometimes memory_failure() fails,
> > > > but even in such case, PG_hwpoison is set on the page and other
> > > > threads properly get SIGBUSs with this patch, so I think that
> > > > we can avoid the worst scenario (like system stall by MCE loop).
> > > > 
> > > I agree with naoya's point, if we block for this issue, Does this change 
> > > the result
> > > that the process should be killed? Or is there something other still need 
> > > to be considered?
> > 
> > Ok ... no blocking ... 

I did think about the blocking method and about the error-address issue with SIGBUS; here is my opinion, which may be helpful:

For blocking: if we block here, there is some undefined work that I think would have to be done.
Since we don't know whether process B, which triggers this error again, is an early-kill process or not, the previous memory_failure() call may not have added B to the kill list; and even if B is on the kill list, the error level for B is not properly set, as B should get an AR SIGBUS.

So we can't just wait. We must have some logic that adds process B to the kill list, and as this is an AR error another change to the current code is needed: more logic in kill_proc() or some other place.

Even if all that work is done right, there is one more serious scenario: we don't even know which step the previous memory_failure() call has reached, so the previous modification may not be useful at all. When this scenario happens, what can we do, block or return? If we finally return, an error code has to be passed back, so we go into the error-handling logic and a signal without the right address will be sent.

For the error address in the SIGBUS, I think this is not an issue introduced by the patch I posted; the issue was already there before my patch.
I don't see a practical way to get the correct address, for the same reason: we don't know whether the page mapping is still there or not when we get to kill_me_maybe(). In some cases we may get it, but there are a lot of concurrency issues to consider, and if that fails we have to fall back to the error branch again, so keeping the current code may be the easy option.

Any method or patch that solves the issue in a better way is OK to me; I want this issue fixed, and in a more complete way!

-- 
Thanks!
Aili Yao


Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-03 Thread Aili Yao
On Tue, 2 Mar 2021 19:39:53 -0800
"Luck, Tony"  wrote:

> On Fri, Feb 26, 2021 at 10:59:15AM +0800, Aili Yao wrote:
> > Hi naoya, tony:  
> > > > 
> > > > Idea for what we should do next ... Now that x86 is calling 
> > > > memory_failure()
> > > > from user context ... maybe parallel calls for the same page should
> > > > be blocked until the first caller completes so we can:
> > > > a) know that pages are unmapped (if that happens)
> > > > b) all get the same success/fail status
> > > 
> > > One memory_failure() call changes the target page's status and
> > > affects all mappings to all affected processes, so I think that
> > > (ideally) we don't have to block other threads (letting them
> > > early return seems fine).  Sometimes memory_failure() fails,
> > > but even in such case, PG_hwpoison is set on the page and other
> > > threads properly get SIGBUSs with this patch, so I think that
> > > we can avoid the worst scenario (like system stall by MCE loop).
> > >   
> > I agree with naoya's point, if we block for this issue, Does this change 
> > the result
> > that the process should be killed? Or is there something other still need 
> > to be considered?  
> 
> Ok ... no blocking ... I think someone in this thread suggested
> scanning the page tables to make sure the poisoned page had been
> unmapped.
> 
> There's a walk_page_range() function that does all the work for that.
> Just need to supply some callback routines that check whether a
> mapping includes the bad PFN and clear the PRESENT bit.
> 
> RFC patch below against v5.12-rc1
> 
> -Tony
> 
> From 8de23b7f1be00ad38e129690dfe0b1558fad5ff8 Mon Sep 17 00:00:00 2001
> From: Tony Luck 
> Date: Tue, 2 Mar 2021 15:06:33 -0800
> Subject: [PATCH] x86/mce: Handle races between machine checks
> 
> When multiple CPUs hit the same poison memory there is a race. The
> first CPU into memory_failure() atomically marks the page as poison
> and continues processing to hunt down all the tasks that map this page
> so that the virtual addresses can be marked not-present and SIGBUS
> sent to the task that did the access.
> 
> Later CPUs get an early return from memory_failure() and may return
> to user mode and access the poison again.
> 
> Add a new argument to memory_failure() so that it can indicate when
> the race has been lost. Fix kill_me_maybe() to scan page tables in
> this case to unmap pages.

> +
>  static void kill_me_now(struct callback_head *ch)
>  {
>   force_sig(SIGBUS);
> @@ -1257,15 +1304,19 @@ static void kill_me_maybe(struct callback_head *cb)
>  {
>   struct task_struct *p = container_of(cb, struct task_struct, 
> mce_kill_me);
>   int flags = MF_ACTION_REQUIRED;
> + int already = 0;
>  
>   pr_err("Uncorrected hardware memory error in user-access at %llx", 
> p->mce_addr);
>  
>   if (!p->mce_ripv)
>   flags |= MF_MUST_KILL;
>  
> - if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> + if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags, &already) &&
>   !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> - set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> + if (already)
> + walk_page_range(current->mm, 0, TASK_SIZE_MAX, , 
> (void *)(p->mce_addr >> PAGE_SHIFT));
> +     else
> + set_mce_nospec(p->mce_addr >> PAGE_SHIFT, 
> p->mce_whole_page);
>   sync_core();
>   return;

>   MF_MUST_KILL = 1 << 2,
>   MF_SOFT_OFFLINE = 1 << 3,
>  };

I have one doubt here: once the walk_page_range() call above has run, how does the process that triggered this error return if it has already consumed the wrong data?

-- 
Thanks!
Aili Yao


Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-02 Thread Aili Yao
On Fri, 26 Feb 2021 09:58:37 -0800
"Luck, Tony"  wrote:

> On Fri, Feb 26, 2021 at 10:52:50AM +0800, Aili Yao wrote:
> > Hi naoya,Oscar,david:  
> > >   
> > > > We could use some negative value (error code) to report the reported 
> > > > case,
> > > > then as you mentioned above, some callers need change to handle the
> > > > new case, and the same is true if you use some positive value.
> > > > My preference is -EHWPOISON, but other options are fine if justified 
> > > > well.
> > > 
> > > -EHWPOISON seems like a good fit.
> > >   
> > I am OK with the -EHWPOISON error code, But I have one doubt here:
> > When we return this -EHWPOISON error code, Does this means we have to add a 
> > new error code
> > to error-base.h or errno.h? Is this easy realized?  
> 
> The page already poisoned isn't really an error though. Just the result
> of a race condition.  What if we added an extra argument to memory_failure()
> so it can tell the caller that the specific reason for the early successful
> return is that the page was already poisoned?
> 

It may not be an error; is it reasonable to return a positive value like MF_HWPOISON? The 0 return code doesn't seem to tell the whole story.

Your patch seems safer, but I don't know whether such multi-module modifications are worth it for this case.
It really should be referred to the other maintainers and reviewers, who can give more expert suggestions.
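
To make the two options concrete, the call site in kill_me_maybe() would look
roughly like one of these (only a sketch; the two variants are alternatives,
not both):

	/* Variant A (your extra output argument): */
	int already = 0;

	ret = memory_failure(pfn, flags, &already);
	if (!ret && already)
		pr_err("pfn %#lx was already poisoned\n", pfn);

	/* Variant B (a distinct return value, no signature change): */
	ret = memory_failure(pfn, flags);
	if (ret == -EHWPOISON)		/* or a positive MF_HWPOISON */
		pr_err("pfn %#lx was already poisoned\n", pfn);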

Thanks!
Aili Yao


Re: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-03-01 Thread Aili Yao
Hi Luto:

> > > > At the very least, this needs a clear explanation of why your proposed 
> > > > behavior is better than the existing behavior.  
> > >
> > > The explanation is buried in that "can't trust the process" line.
> > >
> > > E.g. user space isn't good about checking for failed write(2) syscalls.
> > > So if the poison was in a user buffer passed to write(fd, buffer, count)
> > > sending a SIGBUS would be the action if they read the poison directly,
> > > so it seems reasonable to send the same signal if the kernel read their
> > > poison for them.
> > >
> > > It would avoid users that didn't check the return value merrily proceeding
> > > as if everything was ok.  
> >
> > Hi luto:
> >I will add more infomation:
> >Even if the process will check return value of syscall like write, I 
> > don't think
> > process will take proper action for this.
> >In test example, the return value will be errno is 14 (Bad Address), the 
> > process may not realize
> > this is a hw issue, and may take wrong action not as expected.
> >And totally, A hw error will rarely happen, and the hw error hitting 
> > this branch will be
> > more unlikely, the impaction without this patch is quite minor, but this is 
> > still not good enough, we should
> > make it better, right?  
> 
> There are a few issues I can imagine:
> 
> Some programs may use read(2), write(2), etc as ways to check if
> memory is valid without getting a signal.  They might not want
> signals, which means that this feature might need to be configurable.

I checked the code again and found that, for a poison page access, the process cannot ignore the SIGBUS signal even if the signal was set to SIG_IGN:

1298 /*
1299  * Force a signal that the process can't ignore: if necessary
1300  * we unblock the signal and change any SIG_IGN to SIG_DFL.
1301  *
1302  * Note: If we unblock the signal, we always reset it to SIG_DFL,
1303  * since we do not want to have a signal handler that was blocked
1304  * be invoked when user space had explicitly blocked it.
1305  *
1306  * We don't want to have recursive SIGSEGV's etc, for example,
1307  * that is why we also clear SIGNAL_UNKILLABLE.
1308  */
1309 static int
1310 force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t)

> It's worth making sure that this doesn't end up sending duplicate
> signals.  If nothing else, this would impact the vsyscall emulation
> code.

I don't fully get the meaning of "duplicate signals". SIGBUS is a fatal signal, and once it has been handled the process should exit, so another identical signal would not be processed, I think. Or, if the process catches the signal and does not exit, a duplicate SIGBUS does not seem to be a problem if that happens.

For vsyscall emulation:
I did check the related code. For this issue the access would be a read, like an instruction fetch, so it will not hit the modified branch but will go into the emulation code. It seems we cannot differentiate a vsyscall-emulation page fault from a hwpoison page fault; with the current code it may access the invalid page again and lead to a panic. This patch does not cover that scenario.

> Programs that get a signal might expect that the RIP that the signal
> frame points to is the instruction that caused the signal and that the
> instruction faulted without side effects.  For SIGSEGV, I would be
> especially nervous about this.  Maybe SIGBUS is safer.  For SIGSEGV,
> it's entirely valid to look at CR2 / si_fault_addr, fix it up, and
> return.  This would be completely *invalid* with your patch.  I'm not
> sure what to do about this.

Do you mean the patch will replace the SIGSEGV with a SIGBUS for the hwpoison case? I think SIGBUS is more accurate for this error.
Normally, for a poison access, the process should not simply return; an exit is the right outcome, or we would need another code path for this, I think.
This is the established way to handle a user poison access, like the other poison code branches in the kernel.
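
To show why SIGBUS seems more accurate to me: a user process that does want to
handle it can see the machine-check details in siginfo, roughly like this
(userspace sketch, for illustration only; fprintf in a signal handler is not
async-signal-safe, it is just to keep the sketch short):

	#include <signal.h>
	#include <stdio.h>

	static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
	{
		if (si->si_code == BUS_MCEERR_AR)	/* action-required hw poison */
			fprintf(stderr, "hwpoison at %p, lsb=%d\n",
				si->si_addr, si->si_addr_lsb);
	}

	int main(void)
	{
		struct sigaction sa = { .sa_sigaction = sigbus_handler,
					.sa_flags = SA_SIGINFO };

		sigaction(SIGBUS, &sa, NULL);
		/* ... access memory that may be poisoned ... */
		return 0;
	}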

Thanks!
Aili Yao


Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-02-25 Thread Aili Yao
Hi naoya, tony:
> > 
> > Idea for what we should do next ... Now that x86 is calling memory_failure()
> > from user context ... maybe parallel calls for the same page should
> > be blocked until the first caller completes so we can:
> > a) know that pages are unmapped (if that happens)
> > b) all get the same success/fail status  
> 
> One memory_failure() call changes the target page's status and
> affects all mappings to all affected processes, so I think that
> (ideally) we don't have to block other threads (letting them
> early return seems fine).  Sometimes memory_failure() fails,
> but even in such case, PG_hwpoison is set on the page and other
> threads properly get SIGBUSs with this patch, so I think that
> we can avoid the worst scenario (like system stall by MCE loop).
> 
I agree with Naoya's point. If we block for this issue, does that change the outcome that the process should be killed? Or is there something else that still needs to be considered?

Thanks!
Aili Yao  


Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-02-25 Thread Aili Yao
Hi naoya,Oscar,david:
> 
> > We could use some negative value (error code) to report the reported case,
> > then as you mentioned above, some callers need change to handle the
> > new case, and the same is true if you use some positive value.
> > My preference is -EHWPOISON, but other options are fine if justified well.  
> 
> -EHWPOISON seems like a good fit.
> 
I am OK with the -EHWPOISON error code, but I have one doubt here:
when we return -EHWPOISON, does that mean we have to add a new error code to error-base.h or errno.h? Is that easily done?
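
For reference, if I read the UAPI headers correctly, such an error code already
exists, so nothing new would need to be added (worth double-checking the exact
value):

	/* include/uapi/asm-generic/errno.h */
	#define EHWPOISON	133	/* Memory page has hardware error */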

Thanks!
Aili Yao



Re: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-02-24 Thread Aili Yao
On Tue, 23 Feb 2021 08:42:59 -0800
"Luck, Tony"  wrote:

> On Tue, Feb 23, 2021 at 07:33:46AM -0800, Andy Lutomirski wrote:
> >   
> > > On Feb 23, 2021, at 4:44 AM, Aili Yao  wrote:
> > > 
> > > On Fri, 5 Feb 2021 17:01:35 +0800
> > > Aili Yao  wrote:
> > >   
> > >> When one page is already hwpoisoned by MCE AO action, processes may not
> > >> be killed, processes mapping this page may make a syscall include this
> > >> page and result to trigger a VM_FAULT_HWPOISON fault, as it's in kernel
> > >> mode it may be fixed by fixup_exception, current code will just return
> > >> error code to user code.
> > >> 
> > >> This is not sufficient, we should send a SIGBUS to the process and log
> > >> the info to console, as we can't trust the process will handle the error
> > >> correctly.
> > >> 
> > >> Suggested-by: Feng Yang 
> > >> Signed-off-by: Aili Yao 
> > >> ---
> > >> arch/x86/mm/fault.c | 62 +
> > >> 1 file changed, 40 insertions(+), 22 deletions(-)
> > >>   
> > > Hi luto;
> > >  Is there any feedback?  
> > 
> > At the very least, this needs a clear explanation of why your proposed 
> > behavior is better than the existing behavior.  
> 
> The explanation is buried in that "can't trust the process" line.
> 
> E.g. user space isn't good about checking for failed write(2) syscalls.
> So if the poison was in a user buffer passed to write(fd, buffer, count)
> sending a SIGBUS would be the action if they read the poison directly,
> so it seems reasonable to send the same signal if the kernel read their
> poison for them.
> 
> It would avoid users that didn't check the return value merrily proceeding
> as if everything was ok.

Hi luto:
   I will add more information:
   Even if the process does check the return value of a syscall like write, I don't think the process will take the proper action for this.
   In the test example the return value is errno 14 (Bad address); the process may not realize this is a hardware issue and may take a wrong, unexpected action.
   Overall, a hardware error rarely happens, and a hardware error hitting this branch is even more unlikely, so the impact without this patch is quite minor; but it is still not good enough, and we should make it better, right?
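
To illustrate the point, a user-side sketch (hypothetical code, not from the
test program): from errno alone the caller cannot tell a plain bad pointer from
the kernel tripping over hwpoison in its buffer.

	#include <errno.h>
	#include <stdio.h>
	#include <unistd.h>

	/* sketch: why checking write(2)'s return value is not enough */
	int copy_check(int fd, const void *buf, size_t len)
	{
		ssize_t n = write(fd, buf, len);

		if (n < 0 && errno == EFAULT) {
			/*
			 * EFAULT could be a simple programming error (bad buf)
			 * or the kernel hitting hwpoison while copying from
			 * buf; errno alone cannot distinguish the two.
			 */
			perror("write");
			return -1;
		}
		return 0;
	}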

Thanks
Aili Yao





Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-02-24 Thread Aili Yao
On Wed, 24 Feb 2021 11:31:55 +0100
Oscar Salvador  wrote:


> I have some questions:
> 
> > 1.When LCME is enabled, and there are two processes A && B running on
> > different core X && Y separately, which will access one same page, then
> > the page corrupted when process A access it, a MCE will be rasied to
> > core X and the error process is just underway.  
> 
> When !LMCE, that is not a problem because new MCE needs to wait for the 
> ongoing MCE?

I am not sure whether this case can happen when !LMCE; when I realized this place might be an issue I tried to reproduce it, and my configuration had LMCE enabled.

> > 2.Then B access the page and trigger another MCE to core Y, it will also
> > do error process, it will see TestSetPageHWPoison be true, and 0 is
> > returned.  
> 
> For non-nested calls, that is no problem because the page will be taken out
> of business(unmapped from the processes), right? So no more MCE are possible.

Yes, I think that after the recovery job is finished, other processes that still access the page will hit a page fault and an error will be returned.
 
> > 
> > 3.The kill_me_maybe will check the return:
> > 
> > 1244 static void kill_me_maybe(struct callback_head *cb)
> > 1245 {
> > 
> > 1254 if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> > 1255 !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> > 1256 set_mce_nospec(p->mce_addr >> PAGE_SHIFT,  
> 
> So, IIUC, in case of a LMCE nested call, the second MCE will reach here.
> set_mce_nospec() will either mark the underlying page as not mapped/cached.
>
This set_mce_nospec() is not appropriate while the recovery job is still in flight; in my test this function failed.
 
> Should not have memory_failure()->hwpoison_user_mappings() unmapped the page
> from both process A and B? Or this is in case the ongoing MCE(process A) has
> not still unmapped anything, so process B can still access this page.
> 
What I care about is process B triggering the error again after process A; I don't know how it then returns and proceeds.

> So with your change, process B will be sent a SIGBUG, while process A is still
> handling the MCE, right?

Right!

> > p->mce_whole_page);
> > 1257 sync_core();
> > 1258 return;
> > 1259 }
> > 
> > 1267 }
> > 
> > 4. The error process for B will end, and may nothing happened if
> > kill-early is not set, We may let the wrong data go into effect.
> > 
> > For other cases which care the return value of memory_failure() should
> > check why they want to process a memory error which have already been
> > processed. This behavior seems reasonable.
> > 
> > In kill_me_maybe, log the fact about the memory may not recovered, and
> > we will kill the related process.
> > 
> > Signed-off-by: Aili Yao 
> > ---
> >  arch/x86/kernel/cpu/mce/core.c | 2 ++
> >  mm/memory-failure.c| 4 ++--
> >  2 files changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> > index e133ce1e562b..db4afc5bf15a 100644
> > --- a/arch/x86/kernel/cpu/mce/core.c
> > +++ b/arch/x86/kernel/cpu/mce/core.c
> > @@ -1259,6 +1259,8 @@ static void kill_me_maybe(struct callback_head *cb)
> > }
> >  
> > if (p->mce_vaddr != (void __user *)-1l) {
> > +   pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to 
> > %s:%d due to hardware memory corruption\n",
> > +   p->mce_addr >> PAGE_SHIFT, p->comm, p->pid);
> > force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
> > } else {
> > pr_err("Memory error not recovered");
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index e9481632fcd1..06f006174b8c 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1224,7 +1224,7 @@ static int memory_failure_hugetlb(unsigned long pfn, 
> > int flags)
> > if (TestSetPageHWPoison(head)) {
> > pr_err("Memory failure: %#lx: already hardware poisoned\n",
> >pfn);
> > -   return 0;
> > +   return -EBUSY;  
> 
> As David said, madvise_inject_error() will start returning -EBUSY now in case
> we madvise(MADV_HWPOISON) on an already hwpoisoned page.
> 
> AFAICS, memory_failure() can return 0, -Eerrors, and MF_XXX.
> Would it make sense to unify that? That way we could declare error codes that
> make somse sense (like MF_ALREA

[PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-02-23 Thread Aili Yao
When the page is already poisoned, another memory_failure() call on the same
page now returns 0, meaning OK. For nested memory MCE handling this behavior
may lead to a really serious problem. Example:

1. When LMCE is enabled and there are two processes A and B running on
different cores X and Y, both accessing the same page, the page becomes
corrupted when process A accesses it; an MCE is raised to core X and the
error handling is under way.

2. Then B accesses the page and triggers another MCE on core Y. It also starts
error handling, sees TestSetPageHWPoison return true, and 0 is returned.

3.The kill_me_maybe will check the return:

1244 static void kill_me_maybe(struct callback_head *cb)
1245 {

1254 if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
1255 !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
1256 set_mce_nospec(p->mce_addr >> PAGE_SHIFT,
p->mce_whole_page);
1257 sync_core();
1258 return;
1259 }

1267 }

4. The error handling for B then ends, and possibly nothing happens if
early-kill is not set; we may let the wrong data take effect.

Other callers that care about the return value of memory_failure() should
check why they want to handle a memory error that has already been handled,
so this behavior seems reasonable for them.

In kill_me_maybe(), log the fact that the memory may not have been recovered,
and kill the related process.

Signed-off-by: Aili Yao 
---
 arch/x86/kernel/cpu/mce/core.c | 2 ++
 mm/memory-failure.c| 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index e133ce1e562b..db4afc5bf15a 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1259,6 +1259,8 @@ static void kill_me_maybe(struct callback_head *cb)
}
 
if (p->mce_vaddr != (void __user *)-1l) {
+   pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to 
%s:%d due to hardware memory corruption\n",
+   p->mce_addr >> PAGE_SHIFT, p->comm, p->pid);
force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
} else {
pr_err("Memory error not recovered");
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e9481632fcd1..06f006174b8c 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1224,7 +1224,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int 
flags)
if (TestSetPageHWPoison(head)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
   pfn);
-   return 0;
+   return -EBUSY;
}
 
num_poisoned_pages_inc();
@@ -1420,7 +1420,7 @@ int memory_failure(unsigned long pfn, int flags)
if (TestSetPageHWPoison(p)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
pfn);
-   return 0;
+   return -EBUSY;
}
 
orig_head = hpage = compound_head(p);
-- 
2.25.1



Re: [PATCH v2] x86/mce: fix wrong no-return-ip logic in do_machine_check()

2021-02-23 Thread Aili Yao
On Tue, 23 Feb 2021 16:12:43 +
"Luck, Tony"  wrote:

> > What I think is qemu has not an easy to get the MCE signature from host or 
> > currently no methods for this
> > So qemu treat all AR will be No RIPV, Do more is better than do less.  
> 
> RIPV would be important in the guest in the case where the guest can fix the 
> problem that caused
> the machine check and return to the failed instruction to continue.
> 
> I think the only case where this happens is a fault in a read-only page 
> mapped from a file (typically
> code page, but could be a data page). In this case memory-failure() unmaps 
> the page with the posion
> but Linux can recover by reading data from the file into a new page.
> 
> Other cases we send SIGBUS (so go to the signal handler instead of to the 
> faulting instruction).
> 
> So it would be good if the state of RIPV could be added to the signal state 
> sent to qemu. If that
> isn't possible, then this full recovery case turns into another SIGBUS case.

This KVM/VM case of failing recovery for SRAR is just one scenario, I think.
If Intel guarantees that RIPV is always set when a memory SRAR is triggered, then it is qemu's job to set RIPV instead.
Otherwise, if an SRAR can be triggered with RIPV cleared, the same issue is true for the host.

And I think it is better for the VM to know the real RIPV value; that needs more work in qemu and the kernel, if possible.

Thanks
Aili Yao


Re: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-02-23 Thread Aili Yao
On Fri, 5 Feb 2021 17:01:35 +0800
Aili Yao  wrote:

> When one page is already hwpoisoned by MCE AO action, processes may not
> be killed, processes mapping this page may make a syscall include this
> page and result to trigger a VM_FAULT_HWPOISON fault, as it's in kernel
> mode it may be fixed by fixup_exception, current code will just return
> error code to user code.
> 
> This is not sufficient, we should send a SIGBUS to the process and log
> the info to console, as we can't trust the process will handle the error
> correctly.
> 
> Suggested-by: Feng Yang 
> Signed-off-by: Aili Yao 
> ---
>  arch/x86/mm/fault.c | 62 +
>  1 file changed, 40 insertions(+), 22 deletions(-)
> 
Hi luto;
  Is there any feedback?

Thanks
Aili Yao


Re: [PATCH v2] x86/mce: fix wrong no-return-ip logic in do_machine_check()

2021-02-23 Thread Aili Yao
On Tue, 23 Feb 2021 11:05:38 +0100
Borislav Petkov  wrote:

> On Tue, Feb 23, 2021 at 05:56:40PM +0800, Aili Yao wrote:
> > What i inject is AR error, and I don't see MCG_STATUS_RIPV flag.  
> 
> Then keep debugging qemu to figure out why that is.
> 

What I think is that qemu has no easy way to get the MCE signature from the host, or currently no method for this, so qemu treats every AR as having no RIPV; doing more is better than doing less.

Thanks
Aili Yao


Re: [PATCH v2] x86/mce: fix wrong no-return-ip logic in do_machine_check()

2021-02-23 Thread Aili Yao
On Tue, 23 Feb 2021 10:43:00 +0100
Borislav Petkov  wrote:

> On Tue, Feb 23, 2021 at 10:27:55AM +0800, Aili Yao wrote:
> > When Guest access one address with UE error, it will exit guest mode,
> > the host will do the recovery job, and then one SIGBUS is send to
> > the VCPU and qemu will catch the signal, there is only address and
> > error level no RIPV in signal, so qemu will assume RIPV is cleared and
> > inject the error into guest OS.  
> 
> Lemme see:
> 
> void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
> 
> /* If we get an action required MCE, it has been injected by KVM
>  * while the VM was running.  An action optional MCE instead should
>  * be coming from the main thread, which qemu_init_sigbus identifies
>  * as the "early kill" thread.
>  */
> assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
> 
> ...
> 
>   kvm_mce_inject(cpu, paddr, code);
> 
> in that function:
> 
> if (code == BUS_MCEERR_AR) {
> status |= MCI_STATUS_AR | 0x134;
> mcg_status |= MCG_STATUS_EIPV;
> } else {
> status |= 0xc0;
> mcg_status |= MCG_STATUS_RIPV;
> }
> 
> That looks like a valid RIP bit to me. Then cpu_x86_inject_mce() gets
> that mcg_status and injects it into the guest.

What I inject is an AR error, and I don't see the MCG_STATUS_RIPV flag.

Tks
Aili Yao




Re: [PATCH v2] x86/mce: fix wrong no-return-ip logic in do_machine_check()

2021-02-22 Thread Aili Yao
On Mon, 22 Feb 2021 13:45:50 +0100
Borislav Petkov  wrote:

> On Mon, Feb 22, 2021 at 08:35:49PM +0800, Aili Yao wrote:
> > Guest VM, the qemu has no way to know the RIPV value, so always get it
> > cleared.  
> 
> What does that mean?
> 
> The guest VM will get the MCE signature it gets from the host kernel so
> the host kernel most definitely knows the RIPV value.

When the guest accesses an address with a UE error, it exits guest mode and the host does the recovery job; then a SIGBUS is sent to the vCPU and qemu catches the signal. The signal carries only the address and the error level, no RIPV, so qemu assumes RIPV is cleared and injects the error into the guest OS.

> It looks like you're testing how guests will handle MCEs which the host
> has caught and wants to inject into the guest for further handling. What
> is your exact use case? Please explain in detail how I can reproduce it
> step-by-step locally.

Yeah, there are multiple steps I do:
1. A small test program in the guest OS accesses an address A that will have a UC error injected; the address is logged, and using vtop you can get the guest physical address.

2. Use "virsh qemu-monitor-command guest --hmp gpa2hva 0xx" to get the user virtual address.

3. Using vtop you can get the host physical address from the above user address.

4. Inject a 0x10 level error using the einj module.

5. Then, when the guest accesses the address, you will see what happens.

Please use the latest upstream kernel for the guest OS, and you may want to change monarch_timeout to a bigger value, or you will see other issues besides the one discussed here.

Tks

Best Regards!
Aili Yao


Re: [PATCH v2] x86/mce: fix wrong no-return-ip logic in do_machine_check()

2021-02-22 Thread Aili Yao
On Mon, 22 Feb 2021 13:22:41 +0100
Borislav Petkov  wrote:

> On Mon, Feb 22, 2021 at 08:17:23PM +0800, Aili Yao wrote:
> > AR (Action Required) flag, bit 55 - Indicates (when set) that MCA
> > error code specific recovery action must be...  
> 
> Give me the *exact* MCE signature you're injecting please.
> 
> Thx.
> 

For a guest VM, qemu has no way to know the RIPV value, so the guest always sees it cleared.

Hardware event. This is not a software error.
MCE 0
CPU 9 BANK 9 TSC 103d511e68c
RIP 33:401270
MISC 8c ADDR 10e91d000
TIME 1613974147 Mon Feb 22 01:09:07 2021
MCG status:EIPV MCIP LMCE
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
SRAR
MCA: Data CACHE Level-0 Data-Read Error
STATUS bd800134 MCGSTATUS e
MCGCAP 900010a APICID 9 SOCKETID 9
MICROCODE 1
CPUID Vendor Intel Family 6 Model 85 Step 7

Host:
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 1 TSC 1ee4f074462
RIP 33:4013a6
MISC 86 ADDR 10ed608000
TIME 1613985132 Mon Feb 22 17:12:12 2021
MCG status:RIPV EIPV MCIP LMCE
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
SRAR
MCA: Data CACHE Level-0 Data-Read Error
STATUS bd8000100134 MCGSTATUS f
MCGCAP f000c14 APICID 2 SOCKETID 0
MICROCODE 521
CPUID Vendor Intel Family 6 Model 85


Re: [PATCH v2] x86/mce: fix wrong no-return-ip logic in do_machine_check()

2021-02-22 Thread Aili Yao
On Mon, 22 Feb 2021 19:21:46 +0800
Aili Yao  wrote:

> On Mon, 22 Feb 2021 11:22:06 +0100
> Borislav Petkov  wrote:
> 
> > On Mon, Feb 22, 2021 at 06:08:19PM +0800, Aili Yao wrote:  
> > > So why would intel provide this MCG_STATUS_RIPV flag, it's better to
> > > remove it as it will never be set, and all the related logic for this
> > > flag is really needed ?
> > 
> > Why would it never be set - of course it will be. You don't set it. If
> > you wanna inject errors, then make sure you inject *valid* errors which
> > the hardware *actually* generates, not some random ones.
> >   
> 
> As far as I know, Most of RAS related tests are faked, not real errors, and 
> it's really meaningful.
> 
> You should better reproduce the issue I tried to fix, or at least read the 
> code more detailly and you will
> know if it's random and invalid
> 
I see this in SDM 325462:

AR (Action Required) flag, bit 55 - Indicates (when set) that MCA error code 
specific recovery action must be
performed by system software at the time this error was signaled. This recovery 
action must be completed
successfully before any additional work is scheduled for this processor. 
---
When the RIPV flag in the IA32_MCG_STATUS is clear, an alternative execution 
stream needs to be provided; 
--
when the MCA error code
specific recovery action cannot be successfully completed, 
system software must shut down
the system. When the AR flag in the IA32_MCi_STATUS register is clear, system 
software may still take MCA
error code specific recovery action but this is optional; system software can 
safely resume program execution
at the instruction pointer saved on the stack from the machine check exception 
when the RIPV flag in the
IA32_MCG_STATUS register is set.

Best Regards!
Aili Yao



Re: [PATCH v2] x86/mce: fix wrong no-return-ip logic in do_machine_check()

2021-02-22 Thread Aili Yao
On Mon, 22 Feb 2021 11:22:06 +0100
Borislav Petkov  wrote:

> On Mon, Feb 22, 2021 at 06:08:19PM +0800, Aili Yao wrote:
> > So why would intel provide this MCG_STATUS_RIPV flag, it's better to
> > remove it as it will never be set, and all the related logic for this
> > flag is really needed ?  
> 
> Why would it never be set - of course it will be. You don't set it. If
> you wanna inject errors, then make sure you inject *valid* errors which
> the hardware *actually* generates, not some random ones.
> 

As far as I know, most RAS-related tests use injected rather than real errors, and they are still really meaningful.

You had better reproduce the issue I tried to fix, or at least read the code in more detail; then you will know whether it is random and invalid.

Best Regards!
Aili Yao


Re: [PATCH v2] x86/mce: fix wrong no-return-ip logic in do_machine_check()

2021-02-22 Thread Aili Yao
On Mon, 22 Feb 2021 11:03:56 +0100
Borislav Petkov  wrote:

> On Mon, Feb 22, 2021 at 05:31:09PM +0800, Aili Yao wrote:
> > you can inject a memory UE to a VM, it should always be MCG_STATUS_RIPV 0.  
> 
> So the signature you injected is not something the hardware would
> generate - you just didn't set MCG_STATUS_RIPV.
> 
> If so, why should the code handle invalid signatures which the harware
> cannot generate?
> 

So why would Intel provide this MCG_STATUS_RIPV flag? It would be better to remove it if it will never be set; and is all the related logic for this flag really needed?


Re: [PATCH v2] x86/mce: fix wrong no-return-ip logic in do_machine_check()

2021-02-22 Thread Aili Yao
On Mon, 22 Feb 2021 10:24:03 +0100
Borislav Petkov  wrote:

> On Mon, Feb 22, 2021 at 11:50:07AM +0800, Aili Yao wrote:
> > From commit b2f9d678e28c ("x86/mce: Check for faults tagged in
> > EXTABLE_CLASS_FAULT exception table entries"), When there is a
> > memory MCE_AR_SEVERITY error with no return ip,  
> 
> What is a "no return ip" - MCG_STATUS_RIPV?

yes 

> How do you trigger this error?

You can inject a memory UE into a VM; it should always show MCG_STATUS_RIPV as 0.

Best Regard!
Aili Yao


[PATCH v2] x86/mce: fix wrong no-return-ip logic in do_machine_check()

2021-02-21 Thread Aili Yao
From commit b2f9d678e28c ("x86/mce: Check for faults tagged in
EXTABLE_CLASS_FAULT exception table entries"), when there is a memory
MCE_AR_SEVERITY error with no return IP, only a SIGBUS signal is sent to
current. As the page is not poisoned, the coredump step in the kernel for
the SIGBUS'd process will touch the error page again, which results in a
fatal error. We need to poison the page and then kill current in the
memory-failure module.

So fix it using the original checking method.

Signed-off-by: Aili Yao 
---
 arch/x86/kernel/cpu/mce/core.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index e133ce1e562b..70380d7d98b3 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1414,7 +1414,10 @@ noinstr void do_machine_check(struct pt_regs *regs)
/* If this triggers there is no way to recover. Die hard. */
BUG_ON(!on_thread_stack() || !user_mode(regs));
 
-   queue_task_work(&m, kill_current_task);
+   if (worst == MCE_AR_SEVERITY)
+   queue_task_work(&m, 0);
+   else if (kill_current_task)
+   queue_task_work(&m, kill_current_task);
 
} else {
/*
-- 
2.25.1



x86/mce: fix wrong no-return-ip logic in do_machine_check()

2021-02-21 Thread Aili Yao
From commit b2f9d678e28c ("x86/mce: Check for faults tagged in
EXTABLE_CLASS_FAULT exception table entries"), when there is a memory
MCE_AR_SEVERITY error with no return IP, only a SIGBUS signal is sent to
current. As the page is not poisoned, the coredump step in the kernel for
the SIGBUS'd process will touch the error page again, which results in a
fatal error. We need to poison the page and then kill current in the
memory-failure module.

So fix it using the original checking method.

Signed-off-by: Aili Yao 
---
 arch/x86/kernel/cpu/mce/core.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index e133ce1e562b..ae09b0279422 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1413,9 +1413,10 @@ noinstr void do_machine_check(struct pt_regs *regs)
if ((m.cs & 3) == 3) {
/* If this triggers there is no way to recover. Die hard. */
BUG_ON(!on_thread_stack() || !user_mode(regs));
-
-   queue_task_work(&m, kill_current_task);
-
+   if (worst == MCE_AR_SEVERITY)
+   queue_task_work(&m, 0);
+   else if (kill_current_task)
+   queue_task_work(&m, kill_current_task);
} else {
/*
 * Handle an MCE which has happened in kernel space but from
-- 
2.25.1



[PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-02-05 Thread Aili Yao
When a page has already been hwpoisoned by an MCE AO action, the processes
mapping it may not be killed. Such a process may then make a syscall that
touches this page and triggers a VM_FAULT_HWPOISON fault; as this happens in
kernel mode it may be fixed up by fixup_exception, and the current code just
returns an error code to user code.

This is not sufficient: we should send a SIGBUS to the process and log the
info to the console, as we can't trust the process to handle the error
correctly.

Suggested-by: Feng Yang 
Signed-off-by: Aili Yao 
---
 arch/x86/mm/fault.c | 62 +
 1 file changed, 40 insertions(+), 22 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 08f5f74cf989..62df798abb56 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -617,6 +617,30 @@ static void set_signal_archinfo(unsigned long address,
tsk->thread.cr2 = address;
 }
 
+static int
+do_sigbus_mceerr(unsigned long error_code, unsigned long address, vm_fault_t 
fault, int prepared)
+{
+   struct task_struct *tsk = current;
+   unsigned int lsb = 0;
+
+   if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
+   pr_err(
+   "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
+   tsk->comm, tsk->pid, address);
+   if (fault & VM_FAULT_HWPOISON_LARGE)
+   lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
+   if (fault & VM_FAULT_HWPOISON)
+   lsb = PAGE_SHIFT;
+   if (!prepared) {
+   sanitize_error_code(address, &error_code);
+   set_signal_archinfo(address, error_code);
+   }
+   force_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb);
+   return 1;
+   }
+   return 0;
+}
+
 static noinline void
 page_fault_oops(struct pt_regs *regs, unsigned long error_code,
unsigned long address)
@@ -694,7 +718,7 @@ page_fault_oops(struct pt_regs *regs, unsigned long 
error_code,
 
 static noinline void
 kernelmode_fixup_or_oops(struct pt_regs *regs, unsigned long error_code,
-unsigned long address, int signal, int si_code)
+unsigned long address, int signal, int si_code, 
vm_fault_t fault)
 {
WARN_ON_ONCE(user_mode(regs));
 
@@ -714,12 +738,17 @@ kernelmode_fixup_or_oops(struct pt_regs *regs, unsigned 
long error_code,
 * In this case we need to make sure we're not recursively
 * faulting through the emulate_vsyscall() logic.
 */
+
+   /* Sending MCERR Sigbus for page fault error from hwpoison */
+   if (IS_ENABLED(CONFIG_MEMORY_FAILURE)
+   && do_sigbus_mceerr(error_code, address, fault, 0))
+   return;
+
if (current->thread.sig_on_uaccess_err && signal) {
sanitize_error_code(address, &error_code);
 
set_signal_archinfo(address, error_code);
 
-   /* XXX: hwpoison faults will set the wrong code. */
force_sig_fault(signal, si_code, (void __user 
*)address);
}
 
@@ -782,7 +811,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long 
error_code,
struct task_struct *tsk = current;
 
if (!user_mode(regs)) {
-   kernelmode_fixup_or_oops(regs, error_code, address, pkey, 
si_code);
+   kernelmode_fixup_or_oops(regs, error_code, address, pkey, 
si_code, 0);
return;
}
 
@@ -914,7 +943,7 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, 
unsigned long address,
 {
/* Kernel mode? Handle exceptions or die: */
if (!user_mode(regs)) {
-   kernelmode_fixup_or_oops(regs, error_code, address, SIGBUS, 
BUS_ADRERR);
+   kernelmode_fixup_or_oops(regs, error_code, address, SIGBUS, 
BUS_ADRERR, fault);
return;
}
 
@@ -929,22 +958,11 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, 
unsigned long address,
 
set_signal_archinfo(address, error_code);
 
-#ifdef CONFIG_MEMORY_FAILURE
-   if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
-   struct task_struct *tsk = current;
-   unsigned lsb = 0;
-
-   pr_err(
-   "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
-   tsk->comm, tsk->pid, address);
-   if (fault & VM_FAULT_HWPOISON_LARGE)
-   lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
-   if (fault & VM_FAULT_HWPOISON)
-   lsb = PAGE_SHIFT;
-   force_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb);
+   /* Sending MCERR Sigbus for 

Re: [PATCH v2] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-02-04 Thread Aili Yao
On Thu, 4 Feb 2021 07:25:55 +
HORIGUCHI NAOYA(堀口 直也)  wrote:

> Hi Aili,
> 
> On Mon, Feb 01, 2021 at 04:17:49PM +0800, Aili Yao wrote:
> > When one page is already hwpoisoned by AO action, process may not be
> > killed, the process mapping this page may make a syscall include this
> > page and result to trigger a VM_FAULT_HWPOISON fault, if it's in kernel
> > mode it may be fixed by fixup_exception. Current code will just return
> > error code to user process.
> > 
> > This is not sufficient, we should send a SIGBUS to the process and log
> > the info to console, as we can't trust the process will handle the error
> > correctly.
> > 
> > Suggested-by: Feng Yang 
> > Signed-off-by: Aili Yao 
> > ---  
> ...
> 
> > @@ -662,12 +662,32 @@ no_context(struct pt_regs *regs, unsigned long 
> > error_code,
> >  * In this case we need to make sure we're not recursively
> >  * faulting through the emulate_vsyscall() logic.
> >  */
> > +
> > +   if (IS_ENABLED(CONFIG_MEMORY_FAILURE) &&
> > +   fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
> > +   unsigned int lsb = 0;
> > +
> > +   pr_err("MCE: Killing %s:%d due to hardware memory 
> > corruption fault at %lx\n",
> > +   current->comm, current->pid, address);
> > +
> > +   sanitize_error_code(address, _code);
> > +   set_signal_archinfo(address, error_code);
> > +
> > +   if (fault & VM_FAULT_HWPOISON_LARGE)
> > +   lsb = 
> > hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
> > +   if (fault & VM_FAULT_HWPOISON)
> > +   lsb = PAGE_SHIFT;
> > +
> > +   force_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, 
> > lsb);  
> 
> This part contains some duplicated code with do_sigbus(), so some refactoring 
> (like
> adding a common function) would be more helpful.

Yes, agreed; I will modify this and rebase onto the big fault series from tip.

Thanks
Aili Yao



Re: [PATCH v2] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-02-01 Thread Aili Yao
On Mon, 1 Feb 2021 08:58:27 -0800
Andy Lutomirski  wrote:

> On Mon, Feb 1, 2021 at 12:17 AM Aili Yao  wrote:
> >
> > When one page is already hwpoisoned by AO action, process may not be
> > killed, the process mapping this page may make a syscall include this
> > page and result to trigger a VM_FAULT_HWPOISON fault, if it's in kernel
> > mode it may be fixed by fixup_exception. Current code will just return
> > error code to user process.
> >
> > This is not sufficient, we should send a SIGBUS to the process and log
> > the info to console, as we can't trust the process will handle the error
> > correctly.  
> 
> Does this happen when one process gets SIGBUSed due to memory failure
> and another process independently hits the poisoned memory?  I'm not
> entirely convinced that this is a problem.
> 

OK, I will explain more; I hope this is helpful.
A page getting poisoned can be caused by at least two scenarios:
1. A user process accesses a corrupted address; memory_failure() is called,
which unmaps the page containing the corrupt memory cell, and the process
triggering the error is signaled with SIGBUS. Other processes sharing this page
get their related PTEs marked with SWP_HWPOISON; in the early-kill case these
other processes are also signaled with SIGBUS, while in the late-kill case a
process should be signaled when it touches the poisoned page.

2. A patrol-scrub UE error also triggers the same processing: the page is
unmapped and the PTE is marked with SWP_HWPOISON. In the late-kill case, the
process that touches the poisoned page triggers a page fault and should be
signaled with SIGBUS.

In this late-kill case, it normally hits the following code:

965 #ifdef CONFIG_MEMORY_FAILURE
966 if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
967 struct task_struct *tsk = current;
968 unsigned lsb = 0;
969 
970 pr_err(
971 "MCE: Killing %s:%d due to hardware memory corruption fault at 
%lx\n",
972 tsk->comm, tsk->pid, address);
973 if (fault & VM_FAULT_HWPOISON_LARGE)
974 lsb = 
hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
975 if (fault & VM_FAULT_HWPOISON)
976 lsb = PAGE_SHIFT;
977 force_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, 
lsb);
978 return;
979 }

Alternatively, a user process may make a syscall that includes the poisoned
user address (call it ADD_A) and requests that the content at ADD_A be copied
to kernel space. In such a case a page fault is triggered when the copy
starts. As it is in kernel mode and the address is in user space, the process
hits:

944 static void
945 do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long 
address,
946   vm_fault_t fault)
947 {
948 /* Kernel mode? Handle exceptions or die: */
949 if (!(error_code & X86_PF_USER)) {
950 no_context(regs, error_code, address, SIGBUS, 
BUS_ADRERR, fault);
951 return;
952 } 

In no_context(), fixup_exception() is called; the copy function in such a case
usually provides a fixup, which returns true, so no_context() returns, and
finally the syscall returns an error code (errno=14, Bad address) to the user
process, which then has no way to know where the real problem is.
From the syslog, we can't guarantee that the memory-error recovery log and the
user-process error will correlate closely in timestamp.

This behavior is not only in the latest upstream code but also applies to
older kernel versions.

This patch corrects it by sending a SIGBUS to the user process in such a
scenario, which is consistent with the other memory-poison cases.

Following is the test result:

Current 5.11 rc6:
 ./einj_mem_uc -S -c 1 -f copyin
0: copyin   vaddr = 0x7fed9808e400 paddr = 1b00c00400
./einj_mem_uc: couldn't write temp file (errno=14) 
Expected SIGBUS, didn't get one
page not present
Big surprise ... still running. Thought that would be fatal
Test passed

Current 5.11 rc6 with this patch:
./einj_mem_uc -S -c 1 -f copyin
0: copyin   vaddr = 0x7fef60e3d400 paddr = 14b7f43400
SIGBUS: addr = 0x7fef60e3d400
page not present
Big surprise ... still running. Thought that would be fatal
Test passed

There is also a small modification to einj_mem_uc.c that I missed in the previous
mail; without it the test result is not as expected.

306 int trigger_copyin(char *addr)
307 {

316         if ((ret = write(copyin_fd, addr - memcpy_runup, memcpy_size)) != memcpy_size) {

324 }


> In any case, this patch needs rebasing on top 

[PATCH v2] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-02-01 Thread Aili Yao
When a page is already hwpoisoned by an AO action, the process may not be
killed. A process mapping this page may then make a syscall that includes this
page and trigger a VM_FAULT_HWPOISON fault; since the fault is in kernel mode
it may be fixed up by fixup_exception, and the current code will just return an
error code to the user process.

This is not sufficient: we should send a SIGBUS to the process and log the info
to the console, as we can't trust the process to handle the error correctly.

Suggested-by: Feng Yang 
Signed-off-by: Aili Yao 
---
 arch/x86/mm/fault.c | 34 +++---
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f1f1b5a0956a..23095b94cf42 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -631,7 +631,7 @@ static void set_signal_archinfo(unsigned long address,
 
 static noinline void
 no_context(struct pt_regs *regs, unsigned long error_code,
-  unsigned long address, int signal, int si_code)
+  unsigned long address, int signal, int si_code, vm_fault_t fault)
 {
struct task_struct *tsk = current;
unsigned long flags;
@@ -662,12 +662,32 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 * In this case we need to make sure we're not recursively
 * faulting through the emulate_vsyscall() logic.
 */
+
+   if (IS_ENABLED(CONFIG_MEMORY_FAILURE) &&
+   fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
+   unsigned int lsb = 0;
+
+   pr_err("MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
+   current->comm, current->pid, address);
+
+   sanitize_error_code(address, &error_code);
+   set_signal_archinfo(address, error_code);
+
+   if (fault & VM_FAULT_HWPOISON_LARGE)
+   lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
+   if (fault & VM_FAULT_HWPOISON)
+   lsb = PAGE_SHIFT;
+
+   force_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb);
+
+   return;
+   }
+
if (current->thread.sig_on_uaccess_err && signal) {
	sanitize_error_code(address, &error_code);
 
set_signal_archinfo(address, error_code);
 
-   /* XXX: hwpoison faults will set the wrong code. */
	force_sig_fault(signal, si_code, (void __user *)address);
}
 
@@ -836,7 +856,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
if (is_f00f_bug(regs, address))
return;
 
-   no_context(regs, error_code, address, SIGSEGV, si_code);
+   no_context(regs, error_code, address, SIGSEGV, si_code, 0);
 }
 
 static noinline void
@@ -927,7 +947,7 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 {
/* Kernel mode? Handle exceptions or die: */
if (!(error_code & X86_PF_USER)) {
-   no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
+   no_context(regs, error_code, address, SIGBUS, BUS_ADRERR, fault);
return;
}
 
@@ -966,7 +986,7 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 		   unsigned long address, vm_fault_t fault)
 {
if (fatal_signal_pending(current) && !(error_code & X86_PF_USER)) {
-   no_context(regs, error_code, address, 0, 0);
+   no_context(regs, error_code, address, 0, 0, 0);
return;
}
 
@@ -974,7 +994,7 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
/* Kernel mode? Handle exceptions or die: */
if (!(error_code & X86_PF_USER)) {
no_context(regs, error_code, address,
-   SIGSEGV, SEGV_MAPERR);
+   SIGSEGV, SEGV_MAPERR, 0);
return;
}
 
@@ -1396,7 +1416,7 @@ void do_user_addr_fault(struct pt_regs *regs,
if (fault_signal_pending(fault, regs)) {
if (!user_mode(regs))
no_context(regs, hw_error_code, address, SIGBUS,
-   BUS_ADRERR);
+   BUS_ADRERR, 0);
return;
}
 
-- 
2.25.1



Re: [PATCH] x86/fault: Send SIGBUS to user process always for hwpoison page access.

2021-01-31 Thread Aili Yao
On Fri, 29 Jan 2021 14:55:29 -0800
"Luck, Tony"  wrote:

> Thanks for the explanation and test code. I think I see better what
> is going on here.
> 
> [I took your idea for using madvise(...MADV_HWPOISON) and added a new "-S"
> option to my einj_mem_uc test program to use madvise instead of ACPI/EINJ
> for injections. Update pushed here:
>   git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git ]
> 
> There have been some small changes to arch/x86/mm/fault.c since you wrote
> the patch.  Can you rebase to v5.11-rc5?

Yes, I will rebase it to the newest version.
 
> Also maybe this might be a case to use IS_ENABLED() instead of #ifdef to
> make the code a little less ugly. At least for the 2nd hunk in your patch
> this would work well:
> 
>   if (IS_ENABLED(CONFIG_MEMORY_FAILURE) &&
>   (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)))
>   no_context(regs, error_code, address, SIGBUS, BUS_MCEERR_AR);
>   else
>   no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
> 
> The first hunk might need a bit more thought.
> 
Do you mean the difference between force_sig_mceerr() and force_sig_fault()? I see a
hwpoison-related comment there, but it's better to follow the usual force_sig_mceerr()
path; I will modify this in a v2 patch.

Or, if you meant something else, feel free to post a better suggestion.

Thanks

-- 
Best Regards!

Aili Yao


Re: [PATCH] x86/fault: Send SIGBUS to user process always for hwpoison page access.

2021-01-28 Thread Aili Yao
On Thu, 28 Jan 2021 09:43:52 -0800
"Luck, Tony"  wrote:

> On Thu, Jan 28, 2021 at 07:43:26PM +0800, Aili Yao wrote:
> > when one page is already hwpoisoned by AO action, process may not be
> > killed, the process mapping this page may make a syscall include this
> > page and result to trigger a VM_FAULT_HWPOISON fault, as it's in kernel
> > mode it may be fixed by fixup_exception, current code will just return
> > error code to user process.  
> 
> Shouldn't the AO action that poisoned the page have also unmapped it?

Yes, the page has been unmapped by this code in mm/rmap.c:

1567                 if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
1568                         pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
1569                         if (PageHuge(page)) {
1570                                 hugetlb_count_sub(compound_nr(page), mm);
1571                                 set_huge_swap_pte_at(mm, address,
1572                                                      pvmw.pte, pteval,
1573                                                      vma_mmu_pagesize(vma));
1574                         } else {
1575                                 dec_mm_counter(mm, mm_counter(page));
1576                                 set_pte_at(mm, address, pvmw.pte, pteval);
1577                         }
1578 
1579                 }

The ptes of the processes mapping this page are marked with SWP_HWPOISON. But a
process may not be informed and may keep using the address that has now been
unmapped; accessing the page contents then triggers a page fault.
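
For reference, the fault path turns such a SWP_HWPOISON entry into a VM_FAULT_HWPOISON
result; a simplified, illustrative sketch of the classification done in
mm/memory.c:do_swap_page() (not verbatim kernel code, and classify_nonpresent_pte()
is a made-up helper name):

/* Simplified sketch only: how a non-present pte is classified.  The real code
 * lives in do_swap_page() (helpers from <linux/swapops.h>) and also handles
 * migration and device-private entries, which are elided here. */
static vm_fault_t classify_nonpresent_pte(swp_entry_t entry)
{
	if (!non_swap_entry(entry))
		return 0;			/* ordinary swap entry: swap the page back in */
	if (is_hwpoison_entry(entry))
		return VM_FAULT_HWPOISON;	/* pte set by the unmap in mm/rmap.c above */
	return VM_FAULT_SIGBUS;			/* other special entries, elided in this sketch */
}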

Normally (when the poisoned address is touched directly in user mode), it hits the
following code in arch/x86/mm/fault.c:

 945 #ifdef CONFIG_MEMORY_FAILURE
 946         if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
 947                 struct task_struct *tsk = current;
 948                 unsigned lsb = 0;
 949 
 950                 pr_err(
 951         "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
 952                         tsk->comm, tsk->pid, address);
 953                 if (fault & VM_FAULT_HWPOISON_LARGE)
 954                         lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
 955                 if (fault & VM_FAULT_HWPOISON)
 956                         lsb = PAGE_SHIFT;
 957                 force_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb);
 958                 return;
 959         }
 960 #endif
 961         force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)address);
 962 }

But when the user process does a syscall that performs a copy-in in kernel space, the
page fault triggered by that copy does not reach the above code; it actually goes to
this code in arch/x86/mm/fault.c:

 650 if (fixup_exception(regs, X86_TRAP_PF, error_code, address)) {
 674 /*
 675  * Barring that, we can do the fixup and be happy.
 676  */
 677 return;
 678 }
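
In other words, the exception-table fixup just makes the faulting user copy report
failure to its caller, so the syscall returns -EFAULT; a conceptual sketch of the
pattern (the real fixup is done with exception-table annotations in the uaccess
assembly, not open-coded like this, and copyin_with_fixup() is a made-up name):

/* Conceptual sketch only: what the fixup path achieves for a copy-in.
 * copy_from_user() returns the number of bytes that could NOT be copied,
 * and the faulting access is silently redirected to that error path. */
static long copyin_with_fixup(void *dst, const void __user *src, unsigned long len)
{
	if (copy_from_user(dst, src, len))
		return -EFAULT;		/* the caller only ever sees -EFAULT */
	return 0;
}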

> > 
> > This is not suffient, we should send a SIGBUS to the process and log the
> > info to console, as we can't trust the process will handle the error
> > correctly.  
> 
> I agree with this part ... few apps check for -EFAULT and do the right
> thing.  But I'm not sure how this happens. Can you provide a bit more
> detail on the steps
> 

Attached is a simple test program for this; you can run the test with:
./einj_mem_uc -f copyin2
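
Since the attachment is not inlined here, a rough, hypothetical sketch of what such a
copyin-style trigger can look like (this is not the attached einj_mem_uc.c code;
MADV_HWPOISON requires CAP_SYS_ADMIN and CONFIG_MEMORY_FAILURE):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical stand-in for the copyin2 test: poison a private anonymous page
 * with madvise(MADV_HWPOISON), then pass it to write() so the kernel copy-in
 * faults on the poisoned address.  Without the fix the write() just fails with
 * EFAULT; with the fix the process receives SIGBUS. */
int main(void)
{
	long pgsz = sysconf(_SC_PAGESIZE);
	char *buf = mmap(NULL, pgsz, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int fd = open("/tmp/copyin-test", O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (buf == MAP_FAILED || fd < 0) {
		perror("setup");
		return 1;
	}
	memset(buf, 0xaa, pgsz);			/* make sure the page is backed */

	if (madvise(buf, pgsz, MADV_HWPOISON) < 0) {	/* needs CAP_SYS_ADMIN */
		perror("madvise(MADV_HWPOISON)");
		return 1;
	}

	if (write(fd, buf, pgsz) < 0)			/* kernel copy-in hits the poison */
		printf("write failed: errno=%d (%s)\n", errno, strerror(errno));
	else
		printf("write unexpectedly succeeded\n");
	return 0;
}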

In my environment, the stack is:

[ 1583.063050] Memory failure: 0x1030254: recovery action for dirty LRU page: Recovered
[ 1583.071724] MCE: Killing einj_mem_uc:11139 due to hardware memory corruption fault at 7f4d59032000
[ 1583.081732] CPU: 38 PID: 11139 Comm: einj_mem_uc Kdump: loaded Not tainted 5.11.0-rc2+ #43
[ 1583.102607] Call Trace:
[ 1583.105338]  dump_stack+0x57/0x6a
[ 1583.109041]  no_context.cold+0xf6/0x284
[ 1583.113315]  mm_fault_error+0xc3/0x1b0
[ 1583.117503]  exc_page_fault+0x57/0x110
[ 1583.121690]  asm_exc_page_fault+0x1e/0x30
[ 1583.126159] RIP: 0010:__get_user_nocheck_8+0x10/0x13
[ 1583.131704] Code: 0f b7 10 31 c0 0f 01 ca c3 90 0f 01 cb 0f ae e8 8b 10 31 c0 0f 01 ca c3 66 90 0f 01 cb 0f ae e8 48 8b 10 31 c0 0f 01 ca c3 90 <0f> 01 ca 31 d2 48 c7 c0 f2 ff ff ff c3 cc cc cc 0f 1f 44 00 00 40
[ 1583.152659] RSP: 0018:b9e462917d90 EFLAGS: 00050293
[ 1583.158490] RAX: 7f4d59032000 RBX:  RCX: 7f4d59032000
[ 1583.166453] RDX:  RSI: 0200 RDI: 7f4d590321ff
[ 1583.174418] RBP: 0200 R08: 0200 R09: b9e462917e50
[ 1583.182382] R10: 0200 R11:  R12: b9e462917e60
[ 1583.190345] R13: 941470e93058 R14: 1000 R15: c0626540
[ 1583.198310]  iov_iter_fault_in_readable+0x4f/0x120
[ 1583.203657]  generic_perform_write+0x83/0x1c0
[ 1583.208520]  ext4_buffered_write_iter+0x93/0x150 [ext

[PATCH] x86/fault: Send SIGBUS to user process always for hwpoison page access.

2021-01-28 Thread Aili Yao
When one page is already hwpoisoned by an AO action, the process may not be
killed. A process mapping this page may make a syscall that includes this page
and trigger a VM_FAULT_HWPOISON fault; as it is in kernel mode it may be fixed
up by fixup_exception, and the current code will just return an error code to
the user process.

This is not sufficient: we should send a SIGBUS to the process and log the info
to the console, as we can't trust the process to handle the error correctly.

Suggested-by: Feng Yang 
Signed-off-by: Aili Yao 
---
 arch/x86/mm/fault.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f1f1b5a0956a..36d1e385512b 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -662,7 +662,16 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 * In this case we need to make sure we're not recursively
 * faulting through the emulate_vsyscall() logic.
 */
+#ifdef CONFIG_MEMORY_FAILURE
+   if (si_code == BUS_MCEERR_AR && signal == SIGBUS)
+   pr_err("MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
+   current->comm, current->pid, address);
+
+   if ((current->thread.sig_on_uaccess_err && signal) ||
+   (si_code == BUS_MCEERR_AR && signal == SIGBUS)) {
+#else
if (current->thread.sig_on_uaccess_err && signal) {
+#endif
	sanitize_error_code(address, &error_code);
 
set_signal_archinfo(address, error_code);
@@ -927,7 +936,14 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 {
/* Kernel mode? Handle exceptions or die: */
if (!(error_code & X86_PF_USER)) {
+#ifdef CONFIG_MEMORY_FAILURE
+   if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE))
+   no_context(regs, error_code, address, SIGBUS, BUS_MCEERR_AR);
+   else
+   no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
+#else
no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
+#endif
return;
}
 
-- 
2.25.1