[PATCH v3 2/3] mm,hwpoison: return -EHWPOISON when page already

2021-04-20 Thread Naoya Horiguchi
From: Aili Yao 

When a page is already poisoned, another memory_failure() call on the
same page currently returns 0, meaning OK. For nested MCE handling, this
behavior can lead to an MCE loop. Example:

1. LMCE is enabled, and two processes A and B running on different cores
X and Y access the same page. The page gets corrupted when process A
accesses it, an MCE is raised to core X, and error handling there is
underway.

2. Then B accesses the page and triggers another MCE on core Y. It also
enters error handling, sees TestSetPageHWPoison() return true, and gets
0 back.

3. kill_me_maybe() will check the return value:

1244 static void kill_me_maybe(struct callback_head *cb)
1245 {
...
1254 if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
1255 !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
1256 set_mce_nospec(p->mce_addr >> PAGE_SHIFT, 
p->mce_whole_page);
1257 sync_core();
1258 return;
1259 }
...
1267 }

4. Error handling for B ends, and nothing may happen if kill-early is
not set. Process B re-executes the instruction, takes the MCE again, and
the loop begins. The set_mce_nospec() here is also improper; see commit
fd0e786d9d09 ("x86/mm, mm/hwpoison: Don't unconditionally unmap kernel
1:1 pages").

Other callers that care about the return value of memory_failure()
should reconsider why they want to handle a memory error that has
already been processed. This behavior seems reasonable.
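As an illustration (not part of this patch), a caller that wants to tell
"freshly handled" apart from "already poisoned" could look roughly like
the sketch below; handle_mce_page(), pfn and whole_page are made-up
names, and set_mce_nospec()/sync_core() are only meaningful on x86:

	static void handle_mce_page(unsigned long pfn, bool whole_page)
	{
		int ret = memory_failure(pfn, MF_ACTION_REQUIRED);

		if (ret == -EHWPOISON)
			return;	/* handled earlier; error info already sent */
		if (!ret) {
			/* freshly handled and recovered */
			set_mce_nospec(pfn, whole_page);
			sync_core();
		}
	}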

Signed-off-by: Aili Yao 
Signed-off-by: Naoya Horiguchi 
---
 mm/memory-failure.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git v5.12-rc8/mm/memory-failure.c v5.12-rc8_patched/mm/memory-failure.c
index 4087308e4b32..39d0ff0339b9 100644
--- v5.12-rc8/mm/memory-failure.c
+++ v5.12-rc8_patched/mm/memory-failure.c
@@ -1228,7 +1228,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int 
flags)
if (TestSetPageHWPoison(head)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
   pfn);
-   return 0;
+   return -EHWPOISON;
}
 
num_poisoned_pages_inc();
@@ -1437,6 +1437,7 @@ int memory_failure(unsigned long pfn, int flags)
if (TestSetPageHWPoison(p)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
pfn);
+   res = -EHWPOISON;
goto unlock_mutex;
}
 
-- 
2.25.1



[PATCH v3 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address

2021-04-20 Thread Naoya Horiguchi
From: Naoya Horiguchi 

The previous patch solves the infinite MCE loop issue when multiple
MCE events race.  The remaining issue is to make sure that every thread
processing an Action Required MCE sends the SIGBUS to the current
process with the proper virtual address and error size.

This patch does a page table walk to find the error virtual address.
If the walk finds multiple virtual addresses, we currently can't
determine which one is correct, so we fall back to sending the SIGBUS
in kill_me_maybe() without error info, as we do now.  This corner case
needs to be solved in the future.
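A simplified sketch of the approach (the real helper added by this
patch, kill_accessing_process(), also covers hugetlb/PMD mappings and
more careful error handling; hwp_walk_ops is the mm_walk_ops table
defined elsewhere in this patch, and the fallback conditions are
simplified here):

	static int kill_accessing_process(struct task_struct *p,
					  unsigned long pfn, int flags)
	{
		int ret;
		struct hwp_walk priv = {
			.pfn = pfn,
		};
		priv.tk.tsk = p;

		mmap_read_lock(p->mm);
		ret = walk_page_range(p->mm, 0, TASK_SIZE, &hwp_walk_ops,
				      (void *)&priv);
		mmap_read_unlock(p->mm);

		/*
		 * Exactly one mapping found: send SIGBUS with the precise
		 * virtual address.  Otherwise return -EFAULT so the caller
		 * falls back to the old imprecise SIGBUS.
		 */
		if (!ret && priv.tk.addr) {
			kill_proc(&priv.tk, pfn, flags);
			return -EHWPOISON;
		}
		return -EFAULT;
	}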

Signed-off-by: Naoya Horiguchi 
Tested-by: Aili Yao 
---
change log v1 -> v2:
- initialize local variables in check_hwpoisoned_entry() and
  hwpoison_pte_range()
- fix and improve logic to calculate error address offset.
---
 arch/x86/kernel/cpu/mce/core.c |  13 ++-
 include/linux/swapops.h|   5 ++
 mm/memory-failure.c| 143 -
 3 files changed, 158 insertions(+), 3 deletions(-)

diff --git v5.12-rc8/arch/x86/kernel/cpu/mce/core.c 
v5.12-rc8_patched/arch/x86/kernel/cpu/mce/core.c
index 7962355436da..3ce23445a48c 100644
--- v5.12-rc8/arch/x86/kernel/cpu/mce/core.c
+++ v5.12-rc8_patched/arch/x86/kernel/cpu/mce/core.c
@@ -1257,19 +1257,28 @@ static void kill_me_maybe(struct callback_head *cb)
 {
struct task_struct *p = container_of(cb, struct task_struct, 
mce_kill_me);
int flags = MF_ACTION_REQUIRED;
+   int ret;
 
pr_err("Uncorrected hardware memory error in user-access at %llx", 
p->mce_addr);
 
if (!p->mce_ripv)
flags |= MF_MUST_KILL;
 
-   if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
-   !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
+   ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
+   if (!ret && !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
sync_core();
return;
}
 
+   /*
+* -EHWPOISON from memory_failure() means that it already sent SIGBUS
+* to the current process with the proper error info, so no need to
+* send it here again.
+*/
+   if (ret == -EHWPOISON)
+   return;
+
if (p->mce_vaddr != (void __user *)-1l) {
force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
} else {
diff --git v5.12-rc8/include/linux/swapops.h 
v5.12-rc8_patched/include/linux/swapops.h
index d9b7c9132c2f..98ea67fcf360 100644
--- v5.12-rc8/include/linux/swapops.h
+++ v5.12-rc8_patched/include/linux/swapops.h
@@ -323,6 +323,11 @@ static inline int is_hwpoison_entry(swp_entry_t entry)
return swp_type(entry) == SWP_HWPOISON;
 }
 
+static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry)
+{
+   return swp_offset(entry);
+}
+
 static inline void num_poisoned_pages_inc(void)
 {
	atomic_long_inc(&num_poisoned_pages);
diff --git v5.12-rc8/mm/memory-failure.c v5.12-rc8_patched/mm/memory-failure.c
index 39d0ff0339b9..7cc563e1770a 100644
--- v5.12-rc8/mm/memory-failure.c
+++ v5.12-rc8_patched/mm/memory-failure.c
@@ -56,6 +56,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 #include "ras/ras_event.h"
 
@@ -554,6 +555,141 @@ static void collect_procs(struct page *page, struct 
list_head *tokill,
collect_procs_file(page, tokill, force_early);
 }
 
+struct hwp_walk {
+   struct to_kill tk;
+   unsigned long pfn;
+   int flags;
+};
+
+static int set_to_kill(struct to_kill *tk, unsigned long addr, short shift)
+{
+   /* Abort pagewalk when finding multiple mappings to the error page. */
+   if (tk->addr)
+   return 1;
+   tk->addr = addr;
+   tk->size_shift = shift;
+   return 0;
+}
+
+static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
+   unsigned long poisoned_pfn, struct to_kill *tk)
+{
+   unsigned long pfn = 0;
+
+   if (pte_present(pte)) {
+   pfn = pte_pfn(pte);
+   } else {
+   swp_entry_t swp = pte_to_swp_entry(pte);
+
+   if (is_hwpoison_entry(swp))
+   pfn = hwpoison_entry_to_pfn(swp);
+   }
+
+   if (!pfn || pfn != poisoned_pfn)
+   return 0;
+
+   return set_to_kill(tk, addr, shift);
+}
+
+static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+   struct hwp_walk *hwp = (struct hwp_walk *)walk->private;
+   int ret = 0;
+   pte_t *ptep;
+   spinlock_t *ptl;
+
+   ptl = pmd_trans_huge_lock(pmdp, walk->vma);
+   if (ptl) {
+   pmd_t pmd = *pmdp;
+
+   if (pmd_present(pmd)) {
+   

[PATCH v3 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races

2021-04-20 Thread Naoya Horiguchi
From: Tony Luck 

There can be races when multiple CPUs consume poison from the same
page. The first into memory_failure() atomically sets the HWPoison
page flag and begins hunting for tasks that map this page. Eventually
it invalidates those mappings and may send a SIGBUS to the affected
tasks.

But while all that work is going on, other CPUs see a "success"
return code from memory_failure() and so they believe the error
has been handled and continue executing.

Fix by wrapping most of the internal parts of memory_failure() in
a mutex.
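The shape of the change, as a sketch of the pattern only (the full
conversion of every early return is in the diff below):

	static DEFINE_MUTEX(mf_mutex);

	int memory_failure(unsigned long pfn, int flags)
	{
		struct page *p = pfn_to_page(pfn);
		int res = 0;

		mutex_lock(&mf_mutex);

		/* representative early exit: page already handled */
		if (TestSetPageHWPoison(p))
			goto unlock_mutex;

		/* ... the rest of the existing handling ... */

	unlock_mutex:
		mutex_unlock(&mf_mutex);
		return res;
	}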

Signed-off-by: Tony Luck 
Signed-off-by: Naoya Horiguchi 
---
 mm/memory-failure.c | 37 -
 1 file changed, 24 insertions(+), 13 deletions(-)

diff --git v5.12-rc8/mm/memory-failure.c v5.12-rc8_patched/mm/memory-failure.c
index 24210c9bd843..4087308e4b32 100644
--- v5.12-rc8/mm/memory-failure.c
+++ v5.12-rc8_patched/mm/memory-failure.c
@@ -1381,6 +1381,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, 
int flags,
return rc;
 }
 
+static DEFINE_MUTEX(mf_mutex);
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -1404,7 +1406,7 @@ int memory_failure(unsigned long pfn, int flags)
struct page *hpage;
struct page *orig_head;
struct dev_pagemap *pgmap;
-   int res;
+   int res = 0;
unsigned long page_flags;
bool retry = true;
 
@@ -1424,13 +1426,18 @@ int memory_failure(unsigned long pfn, int flags)
return -ENXIO;
}
 
+   mutex_lock(&mf_mutex);
+
 try_again:
-   if (PageHuge(p))
-   return memory_failure_hugetlb(pfn, flags);
+   if (PageHuge(p)) {
+   res = memory_failure_hugetlb(pfn, flags);
+   goto unlock_mutex;
+   }
+
if (TestSetPageHWPoison(p)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
pfn);
-   return 0;
+   goto unlock_mutex;
}
 
orig_head = hpage = compound_head(p);
@@ -1463,17 +1470,19 @@ int memory_failure(unsigned long pfn, int flags)
res = MF_FAILED;
}
action_result(pfn, MF_MSG_BUDDY, res);
-   return res == MF_RECOVERED ? 0 : -EBUSY;
+   res = res == MF_RECOVERED ? 0 : -EBUSY;
} else {
action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, 
MF_IGNORED);
-   return -EBUSY;
+   res = -EBUSY;
}
+   goto unlock_mutex;
}
 
if (PageTransHuge(hpage)) {
if (try_to_split_thp_page(p, "Memory Failure") < 0) {
action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
-   return -EBUSY;
+   res = -EBUSY;
+   goto unlock_mutex;
}
VM_BUG_ON_PAGE(!page_count(p), p);
}
@@ -1497,7 +1506,7 @@ int memory_failure(unsigned long pfn, int flags)
if (PageCompound(p) && compound_head(p) != orig_head) {
action_result(pfn, MF_MSG_DIFFERENT_COMPOUND, MF_IGNORED);
res = -EBUSY;
-   goto out;
+   goto unlock_page;
}
 
/*
@@ -1517,14 +1526,14 @@ int memory_failure(unsigned long pfn, int flags)
num_poisoned_pages_dec();
unlock_page(p);
put_page(p);
-   return 0;
+   goto unlock_mutex;
}
if (hwpoison_filter(p)) {
if (TestClearPageHWPoison(p))
num_poisoned_pages_dec();
unlock_page(p);
put_page(p);
-   return 0;
+   goto unlock_mutex;
}
 
if (!PageTransTail(p) && !PageLRU(p))
@@ -1543,7 +1552,7 @@ int memory_failure(unsigned long pfn, int flags)
if (!hwpoison_user_mappings(p, pfn, flags, &hpage)) {
action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
res = -EBUSY;
-   goto out;
+   goto unlock_page;
}
 
/*
@@ -1552,13 +1561,15 @@ int memory_failure(unsigned long pfn, int flags)
if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
res = -EBUSY;
-   goto out;
+   goto unlock_page;
}
 
 identify_page_state:
res = identify_page_state(pfn, p, page_flags);
-out:
+unlock_page:
unlock_page(p);
+unlock_mutex:
+   mutex_unlock(&mf_mutex);
return res;
 }
 EXPORT_SYMBOL_GPL(memory_failure);
-- 
2.25.1



[PATCH v3 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE

2021-04-20 Thread Naoya Horiguchi
Hi,

I updated the series based on previous feedbacks/discussions.

Changelog v2 -> v3:
- 1/3 aligned returning code with "goto out" approach,
- 3/3 added one Tested-by tag,
- 3/3 fixed comment on kill_accessing_process().

Changelog v1 -> v2:
- 3/3 fixed on initializing local variables and calculation logic
  of error virtual address,

v1: 
https://lore.kernel.org/linux-mm/20210412224320.1747638-1-nao.horigu...@gmail.com/
v2 (only 3/3 is posted): 
https://lore.kernel.org/linux-mm/20210419023658.GA1962954@u2004/

Thanks,
Naoya Horiguchi

--- quote from cover letter of v1 ---

I wrote this patchset to materialize what I think is the currently
acceptable solution mentioned in the previous discussion [1].
I simply borrowed Tony's mutex patch and Aili's return code patch,
then queued another one to find the error virtual address in a
best-effort manner.  I know that this is not a perfect solution, but
it should work for some typical cases.

[1]: 
https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/
---
Summary:

Aili Yao (1):
  mm,hwpoison: return -EHWPOISON when page already

Naoya Horiguchi (1):
  mm,hwpoison: add kill_accessing_process() to find error virtual address

Tony Luck (1):
  mm/memory-failure: Use a mutex to avoid memory_failure() races

 arch/x86/kernel/cpu/mce/core.c |  13 ++-
 include/linux/swapops.h|   5 ++
 mm/memory-failure.c| 181 +
 3 files changed, 183 insertions(+), 16 deletions(-)


[PATCH v2 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address

2021-04-18 Thread Naoya Horiguchi
> > 2. In the function hwpoison_pte_range():
> > if (pfn <= hwp->pfn && hwp->pfn < pfn + PMD_SIZE) this check seem we should 
> > use PMD_SIZE/PAGE_SIZE or some macro like this?
> 
> Thanks, that's right.  HPAGE_PMD_NR seems to fit here.
> We also need "#ifdef CONFIG_TRANSPARENT_HUGEPAGE" to use it.

I found that the #ifdef is not necessary because the whole
"if (ptl)" block is compiled out.  So I didn't add the #ifdef.

Here's the v2 of 3/3.

Aili, could you test with it?

Thanks,
Naoya Horiguchi

-
From: Naoya Horiguchi 
Date: Tue, 13 Apr 2021 07:26:25 +0900
Subject: [PATCH v2 3/3] mm,hwpoison: add kill_accessing_process() to find error
 virtual address

The previous patch solves the infinite MCE loop issue when multiple
MCE events race.  The remaining issue is to make sure that every thread
processing an Action Required MCE sends the SIGBUS to the current
process with the proper virtual address and error size.

This patch does a page table walk to find the error virtual address.
If the walk finds multiple virtual addresses, we currently can't
determine which one is correct, so we fall back to sending the SIGBUS
in kill_me_maybe() without error info, as we do now.  This corner case
needs to be solved in the future.

Signed-off-by: Naoya Horiguchi 
---
change log v1 -> v2:
- initialize local variables in check_hwpoisoned_entry() and
  hwpoison_pte_range()
- fix and improve logic to calculate error address offset.
---
 arch/x86/kernel/cpu/mce/core.c |  13 ++-
 include/linux/swapops.h|   5 ++
 mm/memory-failure.c| 147 -
 3 files changed, 161 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7962355436da..3ce23445a48c 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1257,19 +1257,28 @@ static void kill_me_maybe(struct callback_head *cb)
 {
struct task_struct *p = container_of(cb, struct task_struct, 
mce_kill_me);
int flags = MF_ACTION_REQUIRED;
+   int ret;
 
pr_err("Uncorrected hardware memory error in user-access at %llx", 
p->mce_addr);
 
if (!p->mce_ripv)
flags |= MF_MUST_KILL;
 
-   if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
-   !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
+   ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
+   if (!ret && !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
sync_core();
return;
}
 
+   /*
+* -EHWPOISON from memory_failure() means that it already sent SIGBUS
+* to the current process with the proper error info, so no need to
+* send it here again.
+*/
+   if (ret == -EHWPOISON)
+   return;
+
if (p->mce_vaddr != (void __user *)-1l) {
force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
} else {
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index d9b7c9132c2f..98ea67fcf360 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -323,6 +323,11 @@ static inline int is_hwpoison_entry(swp_entry_t entry)
return swp_type(entry) == SWP_HWPOISON;
 }
 
+static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry)
+{
+   return swp_offset(entry);
+}
+
 static inline void num_poisoned_pages_inc(void)
 {
	atomic_long_inc(&num_poisoned_pages);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 368ef77e01f9..99dd4caf43cb 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -56,6 +56,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 #include "ras/ras_event.h"
 
@@ -554,6 +555,142 @@ static void collect_procs(struct page *page, struct 
list_head *tokill,
collect_procs_file(page, tokill, force_early);
 }
 
+struct hwp_walk {
+   struct to_kill tk;
+   unsigned long pfn;
+   int flags;
+};
+
+static int set_to_kill(struct to_kill *tk, unsigned long addr, short shift)
+{
+   /* Abort pagewalk when finding multiple mappings to the error page. */
+   if (tk->addr)
+   return 1;
+   tk->addr = addr;
+   tk->size_shift = shift;
+   return 0;
+}
+
+static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
+   unsigned long poisoned_pfn, struct to_kill *tk)
+{
+   unsigned long pfn = 0;
+
+   if (pte_present(pte)) {
+   pfn = pte_pfn(pte);
+   } else {
+   swp_entry_t swp = pte_to_swp_entry(pte);
+
+   if (is_hwpoison_entry(swp))
+   pfn = hwpoison_entry_to_pfn(swp);
+   }
+
+   if (!pfn || pfn != poisoned_pfn)

[PATCH v1 2/3] mm,hwpoison: return -EHWPOISON when page already

2021-04-12 Thread Naoya Horiguchi
From: Aili Yao 

When a page is already poisoned, another memory_failure() call on the
same page currently returns 0, meaning OK. For nested MCE handling, this
behavior can lead to an MCE loop. Example:

1. LMCE is enabled, and two processes A and B running on different cores
X and Y access the same page. The page gets corrupted when process A
accesses it, an MCE is raised to core X, and error handling there is
underway.

2. Then B accesses the page and triggers another MCE on core Y. It also
enters error handling, sees TestSetPageHWPoison() return true, and gets
0 back.

3. kill_me_maybe() will check the return value:

1244 static void kill_me_maybe(struct callback_head *cb)
1245 {
...
1254 if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
1255 !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
1256 set_mce_nospec(p->mce_addr >> PAGE_SHIFT, 
p->mce_whole_page);
1257 sync_core();
1258 return;
1259 }
...
1267 }

4. Error handling for B ends, and nothing may happen if kill-early is
not set. Process B re-executes the instruction, takes the MCE again, and
the loop begins. The set_mce_nospec() here is also improper; see commit
fd0e786d9d09 ("x86/mm, mm/hwpoison: Don't unconditionally unmap kernel
1:1 pages").

Other callers that care about the return value of memory_failure()
should reconsider why they want to handle a memory error that has
already been processed. This behavior seems reasonable.

Signed-off-by: Aili Yao 
Signed-off-by: Naoya Horiguchi 
---
 mm/memory-failure.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git v5.12-rc5/mm/memory-failure.c v5.12-rc5_patched/mm/memory-failure.c
index c1509f4b565e..368ef77e01f9 100644
--- v5.12-rc5/mm/memory-failure.c
+++ v5.12-rc5_patched/mm/memory-failure.c
@@ -1228,7 +1228,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int 
flags)
if (TestSetPageHWPoison(head)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
   pfn);
-   return 0;
+   return -EHWPOISON;
}
 
num_poisoned_pages_inc();
@@ -1438,7 +1438,7 @@ int memory_failure(unsigned long pfn, int flags)
pr_err("Memory failure: %#lx: already hardware poisoned\n",
pfn);
mutex_unlock(&mf_mutex);
-   return 0;
+   return -EHWPOISON;
}
 
orig_head = hpage = compound_head(p);
-- 
2.25.1



[PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address

2021-04-12 Thread Naoya Horiguchi
From: Naoya Horiguchi 

The previous patch solves the infinite MCE loop issue when multiple
MCE events race.  The remaining issue is to make sure that every thread
processing an Action Required MCE sends the SIGBUS to the current
process with the proper virtual address and error size.

This patch does a page table walk to find the error virtual address.
If the walk finds multiple virtual addresses, we currently can't
determine which one is correct, so we fall back to sending the SIGBUS
in kill_me_maybe() without error info, as we do now.  This corner case
needs to be solved in the future.

Signed-off-by: Naoya Horiguchi 
---
 arch/x86/kernel/cpu/mce/core.c |  13 ++-
 include/linux/swapops.h|   5 ++
 mm/memory-failure.c| 147 -
 3 files changed, 161 insertions(+), 4 deletions(-)

diff --git v5.12-rc5/arch/x86/kernel/cpu/mce/core.c 
v5.12-rc5_patched/arch/x86/kernel/cpu/mce/core.c
index 7962355436da..3ce23445a48c 100644
--- v5.12-rc5/arch/x86/kernel/cpu/mce/core.c
+++ v5.12-rc5_patched/arch/x86/kernel/cpu/mce/core.c
@@ -1257,19 +1257,28 @@ static void kill_me_maybe(struct callback_head *cb)
 {
struct task_struct *p = container_of(cb, struct task_struct, 
mce_kill_me);
int flags = MF_ACTION_REQUIRED;
+   int ret;
 
pr_err("Uncorrected hardware memory error in user-access at %llx", 
p->mce_addr);
 
if (!p->mce_ripv)
flags |= MF_MUST_KILL;
 
-   if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
-   !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
+   ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
+   if (!ret && !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
sync_core();
return;
}
 
+   /*
+* -EHWPOISON from memory_failure() means that it already sent SIGBUS
+* to the current process with the proper error info, so no need to
+* send it here again.
+*/
+   if (ret == -EHWPOISON)
+   return;
+
if (p->mce_vaddr != (void __user *)-1l) {
force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
} else {
diff --git v5.12-rc5/include/linux/swapops.h 
v5.12-rc5_patched/include/linux/swapops.h
index d9b7c9132c2f..98ea67fcf360 100644
--- v5.12-rc5/include/linux/swapops.h
+++ v5.12-rc5_patched/include/linux/swapops.h
@@ -323,6 +323,11 @@ static inline int is_hwpoison_entry(swp_entry_t entry)
return swp_type(entry) == SWP_HWPOISON;
 }
 
+static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry)
+{
+   return swp_offset(entry);
+}
+
 static inline void num_poisoned_pages_inc(void)
 {
	atomic_long_inc(&num_poisoned_pages);
diff --git v5.12-rc5/mm/memory-failure.c v5.12-rc5_patched/mm/memory-failure.c
index 368ef77e01f9..04e002bd573a 100644
--- v5.12-rc5/mm/memory-failure.c
+++ v5.12-rc5_patched/mm/memory-failure.c
@@ -56,6 +56,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 #include "ras/ras_event.h"
 
@@ -554,6 +555,142 @@ static void collect_procs(struct page *page, struct 
list_head *tokill,
collect_procs_file(page, tokill, force_early);
 }
 
+struct hwp_walk {
+   struct to_kill tk;
+   unsigned long pfn;
+   int flags;
+};
+
+static int set_to_kill(struct to_kill *tk, unsigned long addr, short shift)
+{
+   /* Abort pagewalk when finding multiple mappings to the error page. */
+   if (tk->addr)
+   return 1;
+   tk->addr = addr;
+   tk->size_shift = shift;
+   return 0;
+}
+
+static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
+   unsigned long poisoned_pfn, struct to_kill *tk)
+{
+   unsigned long pfn;
+
+   if (pte_present(pte)) {
+   pfn = pte_pfn(pte);
+   } else {
+   swp_entry_t swp = pte_to_swp_entry(pte);
+
+   if (is_hwpoison_entry(swp))
+   pfn = hwpoison_entry_to_pfn(swp);
+   }
+
+   if (!pfn || pfn != poisoned_pfn)
+   return 0;
+
+   return set_to_kill(tk, addr, shift);
+}
+
+static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+   struct hwp_walk *hwp = (struct hwp_walk *)walk->private;
+   int ret;
+   pte_t *ptep;
+   spinlock_t *ptl;
+
+   ptl = pmd_trans_huge_lock(pmdp, walk->vma);
+   if (ptl) {
+   pmd_t pmd = *pmdp;
+
+   if (pmd_present(pmd)) {
+   unsigned long pfn = pmd_pfn(pmd);
+
+   if (pfn <= hwp->pfn && hwp->pfn < pfn + PMD_SIZE) {
+   unsigned long hwpoison_vaddr = add

[PATCH v1 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races

2021-04-12 Thread Naoya Horiguchi
From: Tony Luck 

There can be races when multiple CPUs consume poison from the same
page. The first into memory_failure() atomically sets the HWPoison
page flag and begins hunting for tasks that map this page. Eventually
it invalidates those mappings and may send a SIGBUS to the affected
tasks.

But while all that work is going on, other CPUs see a "success"
return code from memory_failure() and so they believe the error
has been handled and continue executing.

Fix by wrapping most of the internal parts of memory_failure() in
a mutex.

Signed-off-by: Tony Luck 
Signed-off-by: Naoya Horiguchi 
---
 mm/memory-failure.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git v5.12-rc5/mm/memory-failure.c v5.12-rc5_patched/mm/memory-failure.c
index 24210c9bd843..c1509f4b565e 100644
--- v5.12-rc5/mm/memory-failure.c
+++ v5.12-rc5_patched/mm/memory-failure.c
@@ -1381,6 +1381,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, 
int flags,
return rc;
 }
 
+static DEFINE_MUTEX(mf_mutex);
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags)
return -ENXIO;
}
 
+   mutex_lock(&mf_mutex);
+
 try_again:
-   if (PageHuge(p))
-   return memory_failure_hugetlb(pfn, flags);
+   if (PageHuge(p)) {
+   res = memory_failure_hugetlb(pfn, flags);
+   goto out2;
+   }
+
if (TestSetPageHWPoison(p)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
pfn);
+   mutex_unlock(&mf_mutex);
return 0;
}
 
@@ -1463,9 +1471,11 @@ int memory_failure(unsigned long pfn, int flags)
res = MF_FAILED;
}
action_result(pfn, MF_MSG_BUDDY, res);
+   mutex_unlock(&mf_mutex);
return res == MF_RECOVERED ? 0 : -EBUSY;
} else {
action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, 
MF_IGNORED);
+   mutex_unlock(&mf_mutex);
return -EBUSY;
}
}
@@ -1473,6 +1483,7 @@ int memory_failure(unsigned long pfn, int flags)
if (PageTransHuge(hpage)) {
if (try_to_split_thp_page(p, "Memory Failure") < 0) {
action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
+   mutex_unlock(&mf_mutex);
return -EBUSY;
}
VM_BUG_ON_PAGE(!page_count(p), p);
@@ -1517,6 +1528,7 @@ int memory_failure(unsigned long pfn, int flags)
num_poisoned_pages_dec();
unlock_page(p);
put_page(p);
+   mutex_unlock(&mf_mutex);
return 0;
}
if (hwpoison_filter(p)) {
@@ -1524,6 +1536,7 @@ int memory_failure(unsigned long pfn, int flags)
num_poisoned_pages_dec();
unlock_page(p);
put_page(p);
+   mutex_unlock(&mf_mutex);
return 0;
}
 
@@ -1559,6 +1572,8 @@ int memory_failure(unsigned long pfn, int flags)
res = identify_page_state(pfn, p, page_flags);
 out:
unlock_page(p);
+out2:
+   mutex_unlock(_mutex);
return res;
 }
 EXPORT_SYMBOL_GPL(memory_failure);
-- 
2.25.1



[PATCH v1 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE

2021-04-12 Thread Naoya Horiguchi
Hi,

I wrote this patchset to materialize what I think is the currently
acceptable solution mentioned in the previous discussion [1].
I simply borrowed Tony's mutex patch and Aili's return code patch,
then queued another one to find the error virtual address in a
best-effort manner.  I know that this is not a perfect solution, but
it should work for some typical cases.

My simple testing showed this patchset seems to work as intended,
but if you have the related testcases, could you please test and
let me have some feedback?

Thanks,
Naoya Horiguchi

[1]: 
https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/
---
Summary:

Aili Yao (1):
  mm,hwpoison: return -EHWPOISON when page already

Naoya Horiguchi (1):
  mm,hwpoison: add kill_accessing_process() to find error virtual address

Tony Luck (1):
  mm/memory-failure: Use a mutex to avoid memory_failure() races

 arch/x86/kernel/cpu/mce/core.c |  13 +++-
 include/linux/swapops.h|   5 ++
 mm/memory-failure.c| 166 -
 3 files changed, 178 insertions(+), 6 deletions(-)


[PATCH v2] mm, hwpoison: do not lock page again when me_huge_page() successfully recovers

2021-03-05 Thread Naoya Horiguchi
Hello Oscar,

On Fri, Mar 05, 2021 at 08:26:58AM +0100, Oscar Salvador wrote:
> On Thu, Mar 04, 2021 at 03:44:37PM +0900, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi 
> 
> Hi Naoya,
> 
> good catch!
> 
> > Currently me_huge_page() temporary unlocks page to perform some actions
> > then locks it again later. My testcase (which calls hard-offline on some
> > tail page in a hugetlb, then accesses the address of the hugetlb range)
> > showed that page allocation code detects the page lock on buddy page and
> > printed out "BUG: Bad page state" message.  PG_hwpoison does not prevent
> > it because PG_hwpoison flag is set on any subpage of the hugetlb page
> > but the 2nd page lock is on the head page.
> 
> I am having difficulties to parse "PG_hwpoison does not prevent it because
> PG_hwpoison flag is set on any subpage of the hugetlb page".
> 
> What do you mean by that?

What I had in mind is that check_new_page_bad() does not consider
a page with __PG_HWPOISON as a bad page, so this flag works as a kind
of filter, but the filtering doesn't work in my case because the
"bad page" is not the actual hwpoisoned page.

Thanks for the nice comment; I've updated the patch below with this description.

> 
> > 
> > This patch suggests to drop the 2nd page lock to fix the issue.
> > 
> > Fixes: commit 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in 
> > unrecoverable memory error")
> > Cc: sta...@vger.kernel.org
> > Signed-off-by: Naoya Horiguchi 
> 
> The fix looks fine to me:
> 
> Reviewed-by: Oscar Salvador 

Thank you!

Have a nice weekend.
- Naoya

---
From eb05750c13fe9b637190410289a3168b097e Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi 
Date: Fri, 5 Mar 2021 21:44:47 +0900
Subject: [PATCH v2] mm, hwpoison: do not lock page again when me_huge_page()
 successfully recovers

Currently me_huge_page() temporarily unlocks the page to perform some
actions, then locks it again later. My testcase (which calls
hard-offline on some tail page in a hugetlb, then accesses the address
of the hugetlb range) showed that the page allocation code detects this
page lock on a buddy page and prints out a "BUG: Bad page state"
message.

check_new_page_bad() does not consider a page with __PG_HWPOISON as a
bad page, so this flag works as a kind of filter, but the filtering
doesn't work in this case because the "bad page" is not the actual
hwpoisoned page.

This patch drops the 2nd page lock to fix the issue.

Fixes: commit 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in 
unrecoverable memory error")
Cc: sta...@vger.kernel.org
Signed-off-by: Naoya Horiguchi 
Reviewed-by: Oscar Salvador 
---
 mm/memory-failure.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e9481632fcd1..d8aba15295c5 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -830,7 +830,6 @@ static int me_huge_page(struct page *p, unsigned long pfn)
page_ref_inc(p);
res = MF_RECOVERED;
}
-   lock_page(hpage);
}
 
return res;
@@ -1286,7 +1285,8 @@ static int memory_failure_hugetlb(unsigned long pfn, int 
flags)
 
res = identify_page_state(pfn, p, page_flags);
 out:
-   unlock_page(head);
+   if (PageLocked(head))
+   unlock_page(head);
return res;
 }
 
-- 
2.25.1


[PATCH v1] mm, hwpoison: do not lock page again when me_huge_page() successfully recovers

2021-03-03 Thread Naoya Horiguchi
From: Naoya Horiguchi 

Currently me_huge_page() temporarily unlocks the page to perform some
actions, then locks it again later. My testcase (which calls
hard-offline on some tail page in a hugetlb, then accesses the address
of the hugetlb range) showed that the page allocation code detects the
page lock on a buddy page and prints out a "BUG: Bad page state"
message.  PG_hwpoison does not prevent it because the PG_hwpoison flag
is set on some subpage of the hugetlb page while the 2nd page lock is
on the head page.

This patch drops the 2nd page lock to fix the issue.

Fixes: commit 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in 
unrecoverable memory error")
Cc: sta...@vger.kernel.org
Signed-off-by: Naoya Horiguchi 
---
 mm/memory-failure.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git v5.11/mm/memory-failure.c v5.11_patched/mm/memory-failure.c
index e9481632fcd1..d8aba15295c5 100644
--- v5.11/mm/memory-failure.c
+++ v5.11_patched/mm/memory-failure.c
@@ -830,7 +830,6 @@ static int me_huge_page(struct page *p, unsigned long pfn)
page_ref_inc(p);
res = MF_RECOVERED;
}
-   lock_page(hpage);
}
 
return res;
@@ -1286,7 +1285,8 @@ static int memory_failure_hugetlb(unsigned long pfn, int 
flags)
 
res = identify_page_state(pfn, p, page_flags);
 out:
-   unlock_page(head);
+   if (PageLocked(head))
+   unlock_page(head);
return res;
 }
 
-- 
2.25.1



Re: [PATCH v1] mm, hwpoison: enable error handling on shmem thp

2021-02-09 Thread Naoya Horiguchi
On Tue, Feb 09, 2021 at 11:46:40AM -0800, Andrew Morton wrote:
> On Tue,  9 Feb 2021 15:21:28 +0900 Naoya Horiguchi  
> wrote:
> 
> > Currently hwpoison code checks PageAnon() for thp and refuses to handle
> > errors on non-anonymous thps (just for historical reason).  We now
> > support non-anonymou thp like shmem one, so this patch suggests to enable
> > to handle shmem thps. Fortunately, we already have can_split_huge_page()
> > to check if a give thp is splittable, so this patch relies on it.
> 
> Thanks.  We're at -rc7 so I think I'll park this one for 5.12-rc1,
> unless it is more urgent than I believe it to be?

This patch is not so urgent, so I'm fine with targeting it at 5.12-rc1.

Thanks,
Naoya Horiguchi


[PATCH v1] mm, hwpoison: enable error handling on shmem thp

2021-02-08 Thread Naoya Horiguchi
From: Naoya Horiguchi 

Currently the hwpoison code checks PageAnon() for thp and refuses to
handle errors on non-anonymous thps (just for historical reasons).  We
now support non-anonymous thps such as shmem ones, so this patch
enables handling of shmem thps. Fortunately, we already have
can_split_huge_page() to check whether a given thp is splittable, so
this patch relies on it.

Signed-off-by: Naoya Horiguchi 
---
 mm/memory-failure.c | 34 +-
 1 file changed, 9 insertions(+), 25 deletions(-)

diff --git v5.11-rc6/mm/memory-failure.c v5.11-rc6_patched/mm/memory-failure.c
index e9481632fcd1..8c1575de0507 100644
--- v5.11-rc6/mm/memory-failure.c
+++ v5.11-rc6_patched/mm/memory-failure.c
@@ -956,20 +956,6 @@ static int __get_hwpoison_page(struct page *page)
 {
struct page *head = compound_head(page);
 
-   if (!PageHuge(head) && PageTransHuge(head)) {
-   /*
-* Non anonymous thp exists only in allocation/free time. We
-* can't handle such a case correctly, so let's give it up.
-* This should be better than triggering BUG_ON when kernel
-* tries to touch the "partially handled" page.
-*/
-   if (!PageAnon(head)) {
-   pr_err("Memory failure: %#lx: non anonymous thp\n",
-   page_to_pfn(page));
-   return 0;
-   }
-   }
-
if (get_page_unless_zero(head)) {
if (head == compound_head(page))
return 1;
@@ -1197,21 +1183,19 @@ static int identify_page_state(unsigned long pfn, 
struct page *p,
 
 static int try_to_split_thp_page(struct page *page, const char *msg)
 {
-   lock_page(page);
-   if (!PageAnon(page) || unlikely(split_huge_page(page))) {
-   unsigned long pfn = page_to_pfn(page);
+   struct page *head;
 
+   lock_page(page);
+   head = compound_head(page);
+   if (PageTransHuge(head) && can_split_huge_page(head, NULL) &&
+   !split_huge_page(page)) {
unlock_page(page);
-   if (!PageAnon(page))
-   pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
-   else
-   pr_info("%s: %#lx: thp split failed\n", msg, pfn);
-   put_page(page);
-   return -EBUSY;
+   return 0;
}
+   pr_info("%s: %#lx: thp split failed\n", msg, page_to_pfn(page));
unlock_page(page);
-
-   return 0;
+   put_page(page);
+   return -EBUSY;
 }
 
 static int memory_failure_hugetlb(unsigned long pfn, int flags)
-- 
2.25.1



Re: [PATCH v6 00/12] HWPOISON: soft offline rework

2020-08-11 Thread Naoya Horiguchi
On Tue, Aug 11, 2020 at 01:39:24PM -0400, Qian Cai wrote:
> On Tue, Aug 11, 2020 at 03:11:40AM +, HORIGUCHI NAOYA(堀口 直也) wrote:
> > I'm still not sure why the test succeeded by reverting these because
> > current mainline kernel provides similar mechanism to prevent reuse of
> > soft offlined page. So this success seems to me something suspicious.
> > 
> > To investigate more, I want to have additional info about the page states
> > of the relevant pages after soft offlining.  Could you collect it by the
> > following steps?
> > 
> >   - modify random.c not to run hotplug_memory() in 
> > migrate_huge_hotplug_memory(),
> >   - compile it and run "./random 1" once,
> >   - to collect page state with hwpoisoned pages, run "./page-types -Nlr -b 
> > hwpoison",
> > where page-types is available under tools/vm in kernel source tree.
> >   - choose a few pfns of soft offlined pages from kernel message
> > "Soft offlining pfn ...", and run "./page-types -Nlr -a ".
> 
> # ./page-types -Nlr -b hwpoison
> offsetlen flags
> 99a0001   __BX___
> 99c0001   __BX___
> 99e0001   __BX___
> 9a1   __BX___
> ba60001   __BX___
> baa0001   __BX___

Thank you.  It only shows 6 lines of records, which is unexpected to me
because random.c iterates soft-offlining 2 hugepages with madvise() 1000
times.  Somehow (maybe in an arch-specific way?) the other hwpoisoned
pages might have been cleared?  If they really were, the success of this
test is a false positive, and this patchset can be considered a fix.

> 
> Every single one of pfns was like this,
> 
> # ./page-types -Nlr -a 0x99a000
> offset  len  flags
> 99a000  1    __BX___
> 
> # ./page-types -Nlr -a 0x99e000
> offset  len  flags
> 99e000  1    __BX___
> 
> # ./page-types -Nlr -a 0x99c000
> offset  len  flags
> 99c000  1    __BX___


Re: [PATCH v3 13/15] mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page

2020-06-28 Thread Naoya Horiguchi
On Fri, Jun 26, 2020 at 10:23:43PM +0530, Naresh Kamboju wrote:
> On Wed, 24 Jun 2020 at 20:32,  wrote:
> >
> > From: Oscar Salvador 
> >
> > Merging soft_offline_huge_page and __soft_offline_page let us get rid of
> > quite some duplicated code, and makes the code much easier to follow.
> >
> > Now, __soft_offline_page will handle both normal and hugetlb pages.
> >
> > Note that move put_page() block to the beginning of page_handle_poison()
> > with drain_all_pages() in order to make sure that the target page is
> > freed and sent into free list to make take_page_off_buddy() work properly.
> >
> > Signed-off-by: Oscar Salvador 
> > Signed-off-by: Naoya Horiguchi 
> > ---
> > ChangeLog v2 -> v3:
> > - use page_is_file_lru() instead of page_is_file_cache(),
> > - add description about put_page() and drain_all_pages().
> > - fix coding style warnings by checkpatch.pl
> > ---
> >  mm/memory-failure.c | 185 
> >  1 file changed, 86 insertions(+), 99 deletions(-)
> >
> > diff --git v5.8-rc1-mmots-2020-06-20-21-44/mm/memory-failure.c 
> > v5.8-rc1-mmots-2020-06-20-21-44_patched/mm/memory-failure.c
> > index f744eb90c15c..22c904f6d17a 100644
> > --- v5.8-rc1-mmots-2020-06-20-21-44/mm/memory-failure.c
> > +++ v5.8-rc1-mmots-2020-06-20-21-44_patched/mm/memory-failure.c
> > @@ -78,14 +78,36 @@ EXPORT_SYMBOL_GPL(hwpoison_filter_dev_minor);
> >  EXPORT_SYMBOL_GPL(hwpoison_filter_flags_mask);
> >  EXPORT_SYMBOL_GPL(hwpoison_filter_flags_value);
> >
> > -static void page_handle_poison(struct page *page, bool release)
> > +static bool page_handle_poison(struct page *page, bool 
> > hugepage_or_freepage, bool release)
> >  {
> > +   if (release) {
> > +   put_page(page);
> > +   drain_all_pages(page_zone(page));
> > +   }
> > +
> > +   if (hugepage_or_freepage) {
> > +   /*
> > +* Doing this check for free pages is also fine since 
> > dissolve_free_huge_page
> > +* returns 0 for non-hugetlb pages as well.
> > +*/
> > +   if (dissolve_free_huge_page(page) || 
> > !take_page_off_buddy(page))
> > +   /*
> > +* The hugetlb page can end up being enqueued back into
> > +* the freelists by means of:
> > +* unmap_and_move_huge_page
> > +*  putback_active_hugepage
> > +*   put_page->free_huge_page
> > +*enqueue_huge_page
> > +* If this happens, we might lose the race against an 
> > allocation.
> > +*/
> > +   return false;
> > +   }
> >
> > SetPageHWPoison(page);
> > -   if (release)
> > -   put_page(page);
> > page_ref_inc(page);
> > num_poisoned_pages_inc();
> > +
> > +   return true;
> >  }
> >
> >  static int hwpoison_filter_dev(struct page *p)
> > @@ -1718,63 +1740,52 @@ static int get_any_page(struct page *page, unsigned 
> > long pfn)
> > return ret;
> >  }
> >
> > -static int soft_offline_huge_page(struct page *page)
> > +static bool isolate_page(struct page *page, struct list_head *pagelist)
> >  {
> > -   int ret;
> > -   unsigned long pfn = page_to_pfn(page);
> > -   struct page *hpage = compound_head(page);
> > -   LIST_HEAD(pagelist);
> > +   bool isolated = false;
> > +   bool lru = PageLRU(page);
> > +
> > +   if (PageHuge(page)) {
> > +   isolated = isolate_huge_page(page, pagelist);
> > +   } else {
> > +   if (lru)
> > +   isolated = !isolate_lru_page(page);
> > +   else
> > +   isolated = !isolate_movable_page(page, 
> > ISOLATE_UNEVICTABLE);
> > +
> > +   if (isolated)
> > +   list_add(&page->lru, pagelist);
> >
> > -   /*
> > -* This double-check of PageHWPoison is to avoid the race with
> > -* memory_failure(). See also comment in __soft_offline_page().
> > -*/
> > -   lock_page(hpage);
> > -   if (PageHWPoison(hpage)) {
> > -   unlock_page(hpage);
> > -   put_page(hpage);
> > -   pr_info("soft offline: %#lx hugepage already poisoned\n", 
> > pfn);
> > -

Re: [RFC PATCH v2 10/16] mm,hwpoison: Rework soft offline for free pages

2019-10-22 Thread Naoya Horiguchi
On Tue, Oct 22, 2019 at 12:33:25PM +0200, Oscar Salvador wrote:
> On Tue, Oct 22, 2019 at 12:24:57PM +0200, Michal Hocko wrote:
> > Yes, that makes a perfect sense. What I am saying that the migration
> > (aka trying to recover) is the main and only difference. The soft
> > offline should poison page tables when not able to migrate as well
> > IIUC.
> 
> Yeah, I see your point.
> I do not really why soft-offline strived so much to left the page
> untouched unless it was able to content the problem.
> 
> Note that if we start now to poison pages even if we could not 
> content them (in soft-offline mode), that is a big and visible user
> change.

It's declared that soft offline never disrupts userspace by design, so
if we poisoned page tables on migration failure, we could break this
and send SIGBUSes.  Then end users would complain that their processes
are killed by corrected (so non-urgent) errors.

Thanks,
Naoya Horiguchi


Re: [RFC PATCH v2 10/16] mm,hwpoison: Rework soft offline for free pages

2019-10-22 Thread Naoya Horiguchi
On Tue, Oct 22, 2019 at 11:58:52AM +0200, Oscar Salvador wrote:
> On Tue, Oct 22, 2019 at 11:22:56AM +0200, Michal Hocko wrote:
> > Hmm, that might be a misunderstanding on my end. I thought that it is
> > the MCE handler to say whether the failure is recoverable or not. If yes
> > then we can touch the content of the memory (that would imply the
> > migration). Other than that both paths should be essentially the same,
> > no? Well unrecoverable case would be essentially force migration failure
> > path.
> > 
> > MADV_HWPOISON is explicitly documented to test MCE handling IIUC:
> > : This feature is intended for testing of memory error-handling
> > : code; it is available only if the kernel was configured with
> > : CONFIG_MEMORY_FAILURE.
> > 
> > There is no explicit note about the type of the error that is injected
> > but I think it is reasonably safe to assume this is a recoverable one.
> 
> MADV_HWPOISON stands for hard-offline.
> MADV_SOFT_OFFLINE stands for soft-offline.

Maybe MADV_HWPOISON should have been named something like MADV_HARD_OFFLINE,
although it's an API and hard to change once implemented.

> 
> MADV_SOFT_OFFLINE (since Linux 2.6.33)
>   Soft offline the pages in the range specified by addr and
>   length.  The memory of each page in the specified range is
>   preserved (i.e., when next accessed, the same content will be
>   visible, but in a new physical page frame), and the original
>   page is offlined (i.e., no longer used, and taken out of
>   normal memory management).  The effect of the
>   MADV_SOFT_OFFLINE operation is invisible to (i.e., does not
>   change the semantics of) the calling process.
> 
>   This feature is intended for testing of memory error-handling
>   code;

Although this wording might not be clear enough, madvise(MADV_HWPOISON or
MADV_SOFT_OFFLINE) only covers the memory error handling part, not the MCE
handling part.  We have some other injection methods in the lower layers
like mce-inject and APEI.

> it is available only if the kernel was configured with
>   CONFIG_MEMORY_FAILURE.
> 
> 
> But both follow different approaches.
> 
> I think it is up to some controlers to trigger soft-offline or hard-offline:

Yes, I think so.  One use case of soft offline is triggered by the CMCI
interrupt on Intel CPUs.  The CMCI handler stores corrected error events in
/dev/mcelog.  mcelogd polls this device file, and if corrected errors occur
often enough (IIRC the default threshold is "10 events in 24 hours"),
mcelogd triggers soft-offline via soft_offline_page under /sys.
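For reference, a minimal user-space illustration of that sysfs trigger
(illustration only, not what mcelogd literally does; the file exists
when CONFIG_MEMORY_FAILURE=y and takes a physical address):

	#include <stdio.h>

	static int soft_offline(unsigned long long paddr)
	{
		FILE *f = fopen("/sys/devices/system/memory/soft_offline_page", "w");

		if (!f)
			return -1;
		fprintf(f, "%#llx\n", paddr);
		return fclose(f);
	}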

OTOH, hard-offline is triggered directly (more precisely, over a ring
buffer to a separate context) from the MCE handler.  mcelogd logs MCE
events but is not involved in the page offline logic.

> 
> static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, 
> int sev)
> {
> #ifdef CONFIG_ACPI_APEI_MEMORY_FAILURE
>   ...
> /* iff following two events can be handled properly by now */
> if (sec_sev == GHES_SEV_CORRECTED &&
> (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
> flags = MF_SOFT_OFFLINE;
> if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
> flags = 0;
> 
> if (flags != -1)
> memory_failure_queue(pfn, flags);
>   ...
> #endif
>  }
> 
> 
> static void memory_failure_work_func(struct work_struct *work)
> {
>   ...
>   for (;;) {
>   spin_lock_irqsave(&mf_cpu->lock, proc_flags);
>   gotten = kfifo_get(&mf_cpu->fifo, &entry);
>   spin_unlock_irqrestore(&mf_cpu->lock, proc_flags);
>   if (!gotten)
>   break;
>   if (entry.flags & MF_SOFT_OFFLINE)
>   soft_offline_page(pfn_to_page(entry.pfn), entry.flags);
>   else
>   memory_failure(entry.pfn, entry.flags);
>   }
>  }
> 
> AFAICS, for hard-offline case, a recovered event would be if:
> 
> - the page to shut down is already free
> - the page was unmapped
> 
> In some cases we need to kill the process if it holds dirty pages.

One caveat is that even if the process maps dirty error pages, we
don't have to kill it unless the error data is consumed.

Thanks,
Naoya Horiguchi


[PATCH 17/16] mm,hwpoison: introduce MF_MSG_UNSPLIT_THP

2019-10-21 Thread Naoya Horiguchi
On Mon, Oct 21, 2019 at 07:04:40AM +, Naoya Horiguchi wrote:
> On Thu, Oct 17, 2019 at 04:21:16PM +0200, Oscar Salvador wrote:
> > Place the THP's page handling in a helper and use it
> > from both hard and soft-offline machinery, so we get rid
> > of some duplicated code.
> > 
> > Signed-off-by: Oscar Salvador 
...
> > @@ -1288,21 +1307,8 @@ int memory_failure(unsigned long pfn, int flags)
> > }
> >  
> > if (PageTransHuge(hpage)) {
> > -   lock_page(p);
> > -   if (!PageAnon(p) || unlikely(split_huge_page(p))) {
> > -   unlock_page(p);
> > -   if (!PageAnon(p))
> > -   pr_err("Memory failure: %#lx: non anonymous 
> > thp\n",
> > -   pfn);
> > -   else
> > -   pr_err("Memory failure: %#lx: thp split 
> > failed\n",
> > -   pfn);
> > -   if (TestClearPageHWPoison(p))
> > -   num_poisoned_pages_dec();
> > -   put_page(p);
> > +   if (try_to_split_thp_page(p, "Memory Failure") < 0)
> > return -EBUSY;
> 
> Although this is not a cleanup thing, this failure path means that
> hwpoison is handled (PG_hwpoison is marked), so action_result() should
> be called.  I'll add a patch for this later.

Here it is.  Oscar, if you like, could you append this to
your tree in the next spin (with your credit or signed-off-by)?

Thanks,
Naoya Horiguchi
---
From b920f965485f6679ddc27e1a51da5bff7a5cc81a Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi 
Date: Mon, 21 Oct 2019 18:42:33 +0900
Subject: [PATCH] mm,hwpoison: introduce MF_MSG_UNSPLIT_THP

memory_failure() is supposed to call action_result() when it handles
a memory error event, but there's one missing case. So let's add it.

I found that include/ras/ras_event.h has some other MF_MSG_* entries
undefined, so this patch also adds them.

Signed-off-by: Naoya Horiguchi 
---
 include/linux/mm.h  | 1 +
 include/ras/ras_event.h | 3 +++
 mm/memory-failure.c | 5 -
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3eba26324ff1..022033cc6782 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2818,6 +2818,7 @@ enum mf_action_page_type {
MF_MSG_BUDDY,
MF_MSG_BUDDY_2ND,
MF_MSG_DAX,
+   MF_MSG_UNSPLIT_THP,
MF_MSG_UNKNOWN,
 };
 
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 36c5c5e38c1d..0bdbc0d17d2f 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -361,6 +361,7 @@ TRACE_EVENT(aer_event,
EM ( MF_MSG_POISONED_HUGE, "huge page already hardware poisoned" )  
\
EM ( MF_MSG_HUGE, "huge page" ) \
EM ( MF_MSG_FREE_HUGE, "free huge page" )   \
+   EM ( MF_MSG_NON_PMD_HUGE, "non-pmd-sized huge page" )   \
EM ( MF_MSG_UNMAP_FAILED, "unmapping failed page" ) \
EM ( MF_MSG_DIRTY_SWAPCACHE, "dirty swapcache page" )   \
EM ( MF_MSG_CLEAN_SWAPCACHE, "clean swapcache page" )   \
@@ -373,6 +374,8 @@ TRACE_EVENT(aer_event,
EM ( MF_MSG_TRUNCATED_LRU, "already truncated LRU page" )   \
EM ( MF_MSG_BUDDY, "free buddy page" )  \
EM ( MF_MSG_BUDDY_2ND, "free buddy page (2nd try)" )\
+   EM ( MF_MSG_DAX, "dax page" )   \
+   EM ( MF_MSG_UNSPLIT_THP, "unsplit thp" )\
EMe ( MF_MSG_UNKNOWN, "unknown page" )
 
 /*
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 46ca856703f6..b15086ad8948 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -583,6 +583,7 @@ static const char * const action_page_types[] = {
[MF_MSG_BUDDY]  = "free buddy page",
[MF_MSG_BUDDY_2ND]  = "free buddy page (2nd try)",
[MF_MSG_DAX]= "dax page",
+   [MF_MSG_UNSPLIT_THP]= "unsplit thp",
[MF_MSG_UNKNOWN]= "unknown page",
 };
 
@@ -1361,8 +1362,10 @@ int memory_failure(unsigned long pfn, int flags)
}
 
if (PageTransHuge(hpage)) {
-   if (try_to_split_thp_page(p, "Memory Failure") < 0)
+   if (try_to_split_thp_page(p, "Memory Failure") < 0) {
+   action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
return -EBUSY;
+   }
VM_BUG_ON_PAGE(!page_count(p), p);
hpage = compound_head(p);
}
-- 
2.17.1



Re: [RFC PATCH v2 14/16] mm,hwpoison: Return 0 if the page is already poisoned in soft-offline

2019-10-21 Thread Naoya Horiguchi
On Thu, Oct 17, 2019 at 04:21:21PM +0200, Oscar Salvador wrote:
> Currently, there is an inconsistency when calling soft-offline from
> different paths on a page that is already poisoned.
> 
> 1) madvise:
> 
> madvise_inject_error skips any poisoned page and continues
> the loop.
> If that was the only page to madvise, it returns 0.
> 
> 2) /sys/devices/system/memory/:
> 
> Whe calling soft_offline_page_store()->soft_offline_page(),
> we return -EBUSY in case the page is already poisoned.
> This is inconsistent with a) the above example and b)
> memory_failure, where we return 0 if the page was poisoned.
> 
> Fix this by dropping the PageHWPoison() check in madvise_inject_error,
> and let soft_offline_page return 0 if it finds the page already poisoned.
> 
> Please, note that this represents an user-api change, since now the return
> error when calling soft_offline_page_store()->soft_offline_page() will be 
> different.
> 
> Signed-off-by: Oscar Salvador 

Looks good to me.

Acked-by: Naoya Horiguchi 

> ---
>  mm/madvise.c| 3 ---
>  mm/memory-failure.c | 4 ++--
>  2 files changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 8a0b1f901d72..9ca48345ce45 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -887,9 +887,6 @@ static int madvise_inject_error(int behavior,
>*/
>   put_page(page);
>  
> - if (PageHWPoison(page))
> - continue;
> -
>   if (behavior == MADV_SOFT_OFFLINE) {
>   pr_info("Soft offlining pfn %#lx at process virtual 
> address %#lx\n",
>pfn, start);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3d491c0d3f91..c038896bedf0 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1767,7 +1767,7 @@ static int __soft_offline_page(struct page *page)
>   unlock_page(page);
>   put_page(page);
>   pr_info("soft offline: %#lx page already poisoned\n", pfn);
> - return -EBUSY;
> + return 0;
>   }
>  
>   if (!PageHuge(page))
> @@ -1866,7 +1866,7 @@ int soft_offline_page(struct page *page)
>  
>   if (PageHWPoison(page)) {
>   pr_info("soft offline: %#lx page already poisoned\n", pfn);
> - return -EBUSY;
> + return 0;
>   }
>  
>   get_online_mems();
> -- 
> 2.12.3
> 
> 


Re: [RFC PATCH v2 10/16] mm,hwpoison: Rework soft offline for free pages

2019-10-21 Thread Naoya Horiguchi
  del_page_from_free_area(page_head, area);
> + break_down_buddy_pages(zone, page_head, page, 0,
> +    buddy_order, area, migratetype);
> + ret = true;
> + break;

Indent with whitespace? You can find a few more coding style warnings
with checkpatch.pl.

BTW, if we consider making the unpoison mechanism keep up with the new
semantics, we will need the reverse operation of take_page_off_buddy().
Do you think that part will come as separate work?

Thanks,
Naoya Horiguchi

> +  }
> + }
> + spin_unlock_irqrestore(&zone->lock, flags);
> + return ret;
> + }
> +
> +/*
>   * Set PG_hwpoison flag if a given page is confirmed to be a free page.  This
>   * test is performed under the zone lock to prevent a race against page
>   * allocation.
> -- 
> 2.12.3
> 
> 


Re: [RFC PATCH v2 06/16] mm,hwpoison: Kill put_hwpoison_page

2019-10-21 Thread Naoya Horiguchi
On Thu, Oct 17, 2019 at 04:21:13PM +0200, Oscar Salvador wrote:
> After ("4e41a30c6d50: mm: hwpoison: adjust for new thp refcounting"),
> put_hwpoison_page got reduced to a put_page.
> Let us just use put_page instead.
> 
> Signed-off-by: Oscar Salvador 

Acked-by: Naoya Horiguchi 


Re: [RFC PATCH v2 05/16] mm,hwpoison: Un-export get_hwpoison_page and make it static

2019-10-21 Thread Naoya Horiguchi
On Thu, Oct 17, 2019 at 04:21:12PM +0200, Oscar Salvador wrote:
> Since get_hwpoison_page is only used in memory-failure code now,
> let us un-export it and make it private to that code.
> 
> Signed-off-by: Oscar Salvador 

Acked-by: Naoya Horiguchi 


Re: [RFC PATCH v2 09/16] mm,hwpoison: Unify THP handling for hard and soft offline

2019-10-21 Thread Naoya Horiguchi
On Thu, Oct 17, 2019 at 04:21:16PM +0200, Oscar Salvador wrote:
> Place the THP's page handling in a helper and use it
> from both hard and soft-offline machinery, so we get rid
> of some duplicated code.
> 
> Signed-off-by: Oscar Salvador 
> ---
>  mm/memory-failure.c | 48 ++--
>  1 file changed, 22 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 836d671fb74f..37b230b8cfe7 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1066,6 +1066,25 @@ static int identify_page_state(unsigned long pfn, 
> struct page *p,
>   return page_action(ps, p, pfn);
>  }
>  
> +static int try_to_split_thp_page(struct page *page, const char *msg)
> +{
> + lock_page(page);
> + if (!PageAnon(page) || unlikely(split_huge_page(page))) {
> + unsigned long pfn = page_to_pfn(page);
> +
> + unlock_page(page);
> + if (!PageAnon(page))
> + pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
> + else
> + pr_info("%s: %#lx: thp split failed\n", msg, pfn);
> + put_page(page);
> + return -EBUSY;
> + }
> + unlock_page(page);
> +
> + return 0;
> +}
> +
>  static int memory_failure_hugetlb(unsigned long pfn, int flags)
>  {
>   struct page *p = pfn_to_page(pfn);
> @@ -1288,21 +1307,8 @@ int memory_failure(unsigned long pfn, int flags)
>   }
>  
>   if (PageTransHuge(hpage)) {
> - lock_page(p);
> - if (!PageAnon(p) || unlikely(split_huge_page(p))) {
> - unlock_page(p);
> - if (!PageAnon(p))
> - pr_err("Memory failure: %#lx: non anonymous 
> thp\n",
> - pfn);
> - else
> - pr_err("Memory failure: %#lx: thp split 
> failed\n",
> - pfn);
> - if (TestClearPageHWPoison(p))
> - num_poisoned_pages_dec();
> - put_page(p);
> + if (try_to_split_thp_page(p, "Memory Failure") < 0)
>   return -EBUSY;

Although this is not a cleanup thing, this failure path means that
hwpoison is handled (PG_hwpoison is marked), so action_result() should
be called.  I'll add a patch for this later.

Anyway, this cleanup patch looks fine to me.

Acked-by: Naoya Horiguchi 

> - }
> - unlock_page(p);
>   VM_BUG_ON_PAGE(!page_count(p), p);
>   hpage = compound_head(p);
>   }
> @@ -1801,19 +1807,9 @@ static int soft_offline_in_use_page(struct page *page)
>   int mt;
>   struct page *hpage = compound_head(page);
>  
> - if (!PageHuge(page) && PageTransHuge(hpage)) {
> - lock_page(page);
> - if (!PageAnon(page) || unlikely(split_huge_page(page))) {
> - unlock_page(page);
> - if (!PageAnon(page))
> - pr_info("soft offline: %#lx: non anonymous 
> thp\n", page_to_pfn(page));
> - else
> - pr_info("soft offline: %#lx: thp split 
> failed\n", page_to_pfn(page));
> - put_page(page);
> + if (!PageHuge(page) && PageTransHuge(hpage))
> + if (try_to_split_thp_page(page, "soft offline") < 0)
>   return -EBUSY;
> - }
> - unlock_page(page);
> - }
>  
>   /*
>* Setting MIGRATE_ISOLATE here ensures that the page will be linked
> -- 
> 2.12.3
> 
> 


Re: [RFC PATCH v2 03/16] mm,madvise: Refactor madvise_inject_error

2019-10-21 Thread Naoya Horiguchi
On Thu, Oct 17, 2019 at 04:21:10PM +0200, Oscar Salvador wrote:
> Make a proper if-else condition for {hard,soft}-offline.
> 
> Signed-off-by: Oscar Salvador 

Acked-by: Naoya Horiguchi 


Re: [RFC PATCH v2 02/16] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED

2019-10-21 Thread Naoya Horiguchi
On Fri, Oct 18, 2019 at 01:52:27PM +0200, Michal Hocko wrote:
> On Thu 17-10-19 16:21:09, Oscar Salvador wrote:
> > From: Naoya Horiguchi 
> > 
> > The call to get_user_pages_fast is only to get the pointer to a struct
> > page of a given address, pinning it is memory-poisoning handler's job,
> > so drop the refcount grabbed by get_user_pages_fast
> > 
> > Signed-off-by: Naoya Horiguchi 
> > Signed-off-by: Oscar Salvador 
> > ---
> >  mm/madvise.c | 24 
> >  1 file changed, 12 insertions(+), 12 deletions(-)
> > 
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 2be9f3fdb05e..89ed9a22ff4f 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -878,16 +878,24 @@ static int madvise_inject_error(int behavior,
> >  */
> > order = compound_order(compound_head(page));
> >  
> > -   if (PageHWPoison(page)) {
> > -   put_page(page);
> > +   /*
> > +* The get_user_pages_fast() is just to get the pfn of the
> > +* given address, and the refcount has nothing to do with
> > +* what we try to test, so it should be released immediately.
> > +* This is racy but it's intended because the real hardware
> > +* errors could happen at any moment and memory error handlers
> > +* must properly handle the race.
> > +*/
> > +   put_page(page);
> > +
> > +   if (PageHWPoison(page))
> > continue;
> > -   }
> >  
> > if (behavior == MADV_SOFT_OFFLINE) {
> > pr_info("Soft offlining pfn %#lx at process virtual 
> > address %#lx\n",
> > pfn, start);
> >  
> > -   ret = soft_offline_page(page, MF_COUNT_INCREASED);
> > +   ret = soft_offline_page(page, 0);
> 
> What does prevent this struct page to go away completely?

Nothing does it.  Memory error handler tries to pin by itself and
then determines what state the page is in now.

Thanks,
Naoya Horiguchi

> 
> > if (ret)
> > return ret;
> > continue;
> > @@ -895,14 +903,6 @@ static int madvise_inject_error(int behavior,
> >  
> > pr_info("Injecting memory failure for pfn %#lx at process 
> > virtual address %#lx\n",
> > pfn, start);
> > -
> > -   /*
> > -* Drop the page reference taken by get_user_pages_fast(). In
> > -* the absence of MF_COUNT_INCREASED the memory_failure()
> > -* routine is responsible for pinning the page to prevent it
> > -* from being released back to the page allocator.
> > -*/
> > -   put_page(page);
> > ret = memory_failure(pfn, 0);
> > if (ret)
> > return ret;
> > -- 
> > 2.12.3
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs
> 


Re: [RFC PATCH v2 01/16] mm,hwpoison: cleanup unused PageHuge() check

2019-10-21 Thread Naoya Horiguchi
On Fri, Oct 18, 2019 at 01:48:32PM +0200, Michal Hocko wrote:
> On Thu 17-10-19 16:21:08, Oscar Salvador wrote:
> > From: Naoya Horiguchi 
> > 
> > Drop the PageHuge check since memory_failure forks into 
> > memory_failure_hugetlb()
> > for hugetlb pages.
> > 
> > Signed-off-by: Oscar Salvador 
> > Signed-off-by: Naoya Horiguchi 
> 
> s-o-b chain is reversed.
> 
> The code is a bit confusing. Doesn't this check aim for THP?

No, PageHuge() is false for thp, so this if branch is just dead code.

> AFAICS
> PageTransHuge(hpage) will split the THP or fail so PageTransHuge
> shouldn't be possible anymore, right?

Yes, that's right.

> But why does hwpoison_user_mappings
> still work with hpage then?

hwpoison_user_mappings() is called both from memory_failure() and
from memory_failure_hugetlb(), so it needs to handle both cases.

Thanks,
Naoya Horiguchi

> 
> > ---
> >  mm/memory-failure.c | 5 +
> >  1 file changed, 1 insertion(+), 4 deletions(-)
> > 
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 05c8c6df25e6..2cbadb58c7df 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1345,10 +1345,7 @@ int memory_failure(unsigned long pfn, int flags)
> >  * page_remove_rmap() in try_to_unmap_one(). So to determine page status
> >  * correctly, we save a copy of the page flags at this time.
> >  */
> > -   if (PageHuge(p))
> > -   page_flags = hpage->flags;
> > -   else
> > -   page_flags = p->flags;
> > +   page_flags = p->flags;
> >  
> > /*
> >  * unpoison always clear PG_hwpoison inside page lock
> > -- 
> > 2.12.3
> 
> -- 
> Michal Hocko
> SUSE Labs
> 


Re: memory offline infinite loop after soft offline

2019-10-20 Thread Naoya Horiguchi
On Fri, Oct 18, 2019 at 07:56:09AM -0400, Qian Cai wrote:
> 
> 
> On Oct 18, 2019, at 2:35 AM, Naoya Horiguchi 
> wrote:
> 
> 
> You're right, then I don't see how this happens. If the error hugepage was
> isolated without having PG_hwpoison set, it's unexpected and problematic.
> I'm testing myself with v5.4-rc2 (simply ran move_pages12 and did 
> hotremove
> /hotadd)
> but don't reproduce the issue yet.  Do we need specific kernel version/
> config
> to trigger this?
> 
> 
> This is reproducible on linux-next with the config. Not sure if it is
> reproducible on x86.
> 
> https://raw.githubusercontent.com/cailca/linux-mm/master/powerpc.config
> 
> and kernel cmdline if that matters
> 
> page_poison=on page_owner=on numa_balancing=enable \
> systemd.unified_cgroup_hierarchy=1 debug_guardpage_minorder=1 \
> page_alloc.shuffle=1

Thanks for the info.

> 
> BTW, where does the code set PG_hwpoison for the head page?

Precisely speaking, soft offline only sets PG_hwpoison after the target
hugepage is successfully dissolved (then it's not a hugepage any more),
so PG_hwpoison is set on the raw page in set_hwpoison_free_buddy_page().

In the move_pages12 case, madvise(MADV_SOFT_OFFLINE) is called for a range
of 2 hugepages, so the expected result is that page offsets 0 and 512
are marked as PG_hwpoison after injection.
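
For reference, the injection pattern boils down to something like the
following minimal user-space sketch (assuming 2MB hugepages are available
in the pool and the caller has CAP_SYS_ADMIN; this is not the LTP test
itself):

	#define _GNU_SOURCE
	#include <sys/mman.h>

	#ifndef MADV_SOFT_OFFLINE
	#define MADV_SOFT_OFFLINE 101	/* from asm-generic/mman-common.h */
	#endif
	#ifndef MAP_HUGETLB
	#define MAP_HUGETLB 0x40000
	#endif

	#define HPAGE_SIZE (2UL << 20)	/* assumed 2MB hugepage size */

	int main(void)
	{
		/* back a 2-hugepage range with hugetlb pages */
		char *p = mmap(NULL, 2 * HPAGE_SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		p[0] = p[HPAGE_SIZE] = 1;	/* fault in both hugepages */

		/*
		 * Soft-offline both hugepages: after they are dissolved, the
		 * raw pages at offsets 0 and 512 (in 4kB units from p) should
		 * end up with PG_hwpoison set.
		 */
		return madvise(p, 2 * HPAGE_SIZE, MADV_SOFT_OFFLINE) ? 1 : 0;
	}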

Looking at your dump_page() output, the end_pfn is at page offset 1
("page:c00c000800458040" likely points to pfn 0x11601).
The page belongs to a high-order buddy free page, but has neither
PageBuddy nor PageHWPoison set because it is not the head page or
the raw error page.

> Unfortunately, this does not solve the problem. It looks to me that in
> 
> soft_offline_huge_page(), set_hwpoison_free_buddy_page() will only set
> 
> PG_hwpoison for buddy pages, so the even the compound_head() has no 
> PG_hwpoison   
> set.  
> 

Your analysis is totally correct, and this behavior will be fixed by
the change (https://lkml.org/lkml/2019/10/17/551) in Oscar's rework.
The raw error page will be taken off the buddy system and the other
subpages will be properly split into lower-order pages (the PageBuddy
flags will be managed properly). So all possible cases will be covered
by the branches in __test_page_isolated_in_pageblock().
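
For reference, the branches in question then roughly look like this (a
paraphrased sketch of the loop in __test_page_isolated_in_pageblock(), not
the exact upstream code):

	while (pfn < end_pfn) {
		struct page *page = pfn_to_page(pfn);

		if (PageBuddy(page))
			/* a free buddy block: skip over the whole block */
			pfn += 1 << page_order(page);
		else if (skip_hwpoisoned_pages && PageHWPoison(page))
			/* a raw hwpoisoned page taken off the buddy system */
			pfn++;
		else
			break;	/* not isolated */
	}
	/* pfn < end_pfn here makes the caller return -EBUSY and retry */
	return pfn;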

Thanks,
Naoya Horiguchi


Re: memory offline infinite loop after soft offline

2019-10-18 Thread Naoya Horiguchi
On Fri, Oct 18, 2019 at 09:33:10AM +0200, Michal Hocko wrote:
> On Fri 18-10-19 06:32:22, Naoya Horiguchi wrote:
> > On Fri, Oct 18, 2019 at 08:06:35AM +0200, Michal Hocko wrote:
> > > On Fri 18-10-19 02:19:06, Naoya Horiguchi wrote:
> > > > On Thu, Oct 17, 2019 at 08:27:59PM +0200, Michal Hocko wrote:
> > > > > On Thu 17-10-19 14:07:13, Qian Cai wrote:
> > > > > > On Thu, 2019-10-17 at 12:01 +0200, Michal Hocko wrote:
> > > > > > > On Thu 17-10-19 09:34:10, Naoya Horiguchi wrote:
> > > > > > > > On Mon, Oct 14, 2019 at 10:39:14AM +0200, Michal Hocko wrote:
> > > > > > > 
> > > > > > > [...]
> > > > > > > > > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> > > > > > > > > index 89c19c0feadb..5fb3fee16fde 100644
> > > > > > > > > --- a/mm/page_isolation.c
> > > > > > > > > +++ b/mm/page_isolation.c
> > > > > > > > > @@ -274,7 +274,7 @@ 
> > > > > > > > > __test_page_isolated_in_pageblock(unsigned long pfn, unsigned 
> > > > > > > > > long end_pfn,
> > > > > > > > >* simple way to verify that as 
> > > > > > > > > VM_BUG_ON(), though.
> > > > > > > > >*/
> > > > > > > > >   pfn += 1 << page_order(page);
> > > > > > > > > - else if (skip_hwpoisoned_pages && 
> > > > > > > > > PageHWPoison(page))
> > > > > > > > > + else if (skip_hwpoisoned_pages && 
> > > > > > > > > PageHWPoison(compound_head(page)))
> > > > > > > > >   /* A HWPoisoned page cannot be also 
> > > > > > > > > PageBuddy */
> > > > > > > > >   pfn++;
> > > > > > > > >   else
> > > > > > > > 
> > > > > > > > This fix looks good to me. The original code only addresses 
> > > > > > > > hwpoisoned 4kB-page,
> > > > > > > > we seem to have this issue since the following commit,
> > > > > > > 
> > > > > > > Thanks a lot for double checking Naoya!
> > > > > > >  
> > > > > > > >   commit b023f46813cde6e3b8a8c24f432ff9c1fd8e9a64
> > > > > > > >   Author: Wen Congyang 
> > > > > > > >   Date:   Tue Dec 11 16:00:45 2012 -0800
> > > > > > > >   
> > > > > > > >   memory-hotplug: skip HWPoisoned page when offlining pages
> > > > > > > > 
> > > > > > > > and extension of LTP coverage finally discovered this.
> > > > > > > 
> > > > > > > Qian, could you give the patch some testing?
> > > > > > 
> > > > > > Unfortunately, this does not solve the problem. It looks to me that 
> > > > > > in
> > > > > > soft_offline_huge_page(), set_hwpoison_free_buddy_page() will only 
> > > > > > set
> > > > > > PG_hwpoison for buddy pages, so the even the compound_head() has no 
> > > > > > PG_hwpoison
> > > > > > set.
> > > > > > 
> > > > > > if (PageBuddy(page_head) && page_order(page_head) >= 
> > > > > > order) {
> > > > > > if (!TestSetPageHWPoison(page))
> > > > > > hwpoisoned = true;
> > > > > 
> > > > > This is more than unexpected. How are we supposed to find out that the
> > > > > page is poisoned? Any idea Naoya?
> > > > 
> > > > # sorry for my poor review...
> > > > 
> > > > We set PG_hwpoison bit only on the head page for hugetlb, that's because
> > > > we handle multiple pages as a single one for hugetlb. So it's enough
> > > > to check isolation only on the head page.  Simply skipping pfn cursor to
> > > > the page after the hugepage should avoid the infinite loop:
> > > 
> > > But the page dump Qian provided shows that the head page doesn't have
> > > HWPoison bit either. If it had then going pfn at a time should just work
> > > because all tail pages would be skipped. Or do I miss something?
> > 
> > You're right, then I don't see how this happens.
> 
> OK, this is a bit relieving. I thought that there are legitimate cases
> when none of the hugetlb gets the HWPoison bit (e.g. when the page has 0
> reference count which is the case here). That would be utterly broken
> because we would have no way to tell the page is hwpoisoned.
> 
> Anyway, do you think the patch as I've posted makes sense regardless
> another potential problem? Or would you like to resend yours which skips
> over tail pages at once?

I think the patch does no harm, but we may keep it pending until we
understand the mechanism of the bug. I'll dig into this more next week.

Thanks,
Naoya Horiguchi

Re: memory offline infinite loop after soft offline

2019-10-18 Thread Naoya Horiguchi
On Fri, Oct 18, 2019 at 08:06:35AM +0200, Michal Hocko wrote:
> On Fri 18-10-19 02:19:06, Naoya Horiguchi wrote:
> > On Thu, Oct 17, 2019 at 08:27:59PM +0200, Michal Hocko wrote:
> > > On Thu 17-10-19 14:07:13, Qian Cai wrote:
> > > > On Thu, 2019-10-17 at 12:01 +0200, Michal Hocko wrote:
> > > > > On Thu 17-10-19 09:34:10, Naoya Horiguchi wrote:
> > > > > > On Mon, Oct 14, 2019 at 10:39:14AM +0200, Michal Hocko wrote:
> > > > > 
> > > > > [...]
> > > > > > > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> > > > > > > index 89c19c0feadb..5fb3fee16fde 100644
> > > > > > > --- a/mm/page_isolation.c
> > > > > > > +++ b/mm/page_isolation.c
> > > > > > > @@ -274,7 +274,7 @@ __test_page_isolated_in_pageblock(unsigned 
> > > > > > > long pfn, unsigned long end_pfn,
> > > > > > >* simple way to verify that as VM_BUG_ON(), 
> > > > > > > though.
> > > > > > >*/
> > > > > > >   pfn += 1 << page_order(page);
> > > > > > > - else if (skip_hwpoisoned_pages && PageHWPoison(page))
> > > > > > > + else if (skip_hwpoisoned_pages && 
> > > > > > > PageHWPoison(compound_head(page)))
> > > > > > >   /* A HWPoisoned page cannot be also PageBuddy */
> > > > > > >   pfn++;
> > > > > > >   else
> > > > > > 
> > > > > > This fix looks good to me. The original code only addresses 
> > > > > > hwpoisoned 4kB-page,
> > > > > > we seem to have this issue since the following commit,
> > > > > 
> > > > > Thanks a lot for double checking Naoya!
> > > > >  
> > > > > >   commit b023f46813cde6e3b8a8c24f432ff9c1fd8e9a64
> > > > > >   Author: Wen Congyang 
> > > > > >   Date:   Tue Dec 11 16:00:45 2012 -0800
> > > > > >   
> > > > > >   memory-hotplug: skip HWPoisoned page when offlining pages
> > > > > > 
> > > > > > and extension of LTP coverage finally discovered this.
> > > > > 
> > > > > Qian, could you give the patch some testing?
> > > > 
> > > > Unfortunately, this does not solve the problem. It looks to me that in
> > > > soft_offline_huge_page(), set_hwpoison_free_buddy_page() will only set
> > > > PG_hwpoison for buddy pages, so the even the compound_head() has no 
> > > > PG_hwpoison
> > > > set.
> > > > 
> > > > if (PageBuddy(page_head) && page_order(page_head) >= 
> > > > order) {
> > > > if (!TestSetPageHWPoison(page))
> > > > hwpoisoned = true;
> > > 
> > > This is more than unexpected. How are we supposed to find out that the
> > > page is poisoned? Any idea Naoya?
> > 
> > # sorry for my poor review...
> > 
> > We set PG_hwpoison bit only on the head page for hugetlb, that's because
> > we handle multiple pages as a single one for hugetlb. So it's enough
> > to check isolation only on the head page.  Simply skipping pfn cursor to
> > the page after the hugepage should avoid the infinite loop:
> 
> But the page dump Qian provided shows that the head page doesn't have
> HWPoison bit either. If it had then going pfn at a time should just work
> because all tail pages would be skipped. Or do I miss something?

You're right; then I don't see how this happens. If the error hugepage was
isolated without having PG_hwpoison set, that's unexpected and problematic.
I'm testing myself with v5.4-rc2 (simply running move_pages12 and doing
hotremove/hotadd) but haven't reproduced the issue yet. Do we need a
specific kernel version/config to trigger this?

Thanks,
Naoya Horiguchi

>  
> >   @@ -274,9 +274,13 @@ __test_page_isolated_in_pageblock(unsigned long pfn, 
> > unsigned long end_pfn,
> >  * simple way to verify that as VM_BUG_ON(), though.
> >  */
> > pfn += 1 << page_order(page);
> >   - else if (skip_hwpoisoned_pages && PageHWPoison(page))
> >   - /* A HWPoisoned page cannot be also PageBuddy */
> >   - pfn++;
> >   + else if (skip_hwpoisoned_pages && 
> > PageHWPoison(compound_head(page)))
> >   + /*
> >   +  * A HWPoisoned page cannot be also PageBuddy.
> >   +  * PG_hwpoison could be set only on the head page in
> >   +  * hugetlb case, so no need to check tail pages.
> >   +  */
> >   + pfn += 1 << compound_order(page);
> > else
> > break;
> > }
> > 
> > Qian, could you please try this?
> > 
> > Thanks,
> > Naoya Horiguchi
> 
> -- 
> Michal Hocko
> SUSE Labs
> 

Re: memory offline infinite loop after soft offline

2019-10-18 Thread Naoya Horiguchi
On Thu, Oct 17, 2019 at 08:27:59PM +0200, Michal Hocko wrote:
> On Thu 17-10-19 14:07:13, Qian Cai wrote:
> > On Thu, 2019-10-17 at 12:01 +0200, Michal Hocko wrote:
> > > On Thu 17-10-19 09:34:10, Naoya Horiguchi wrote:
> > > > On Mon, Oct 14, 2019 at 10:39:14AM +0200, Michal Hocko wrote:
> > > 
> > > [...]
> > > > > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> > > > > index 89c19c0feadb..5fb3fee16fde 100644
> > > > > --- a/mm/page_isolation.c
> > > > > +++ b/mm/page_isolation.c
> > > > > @@ -274,7 +274,7 @@ __test_page_isolated_in_pageblock(unsigned long 
> > > > > pfn, unsigned long end_pfn,
> > > > >* simple way to verify that as VM_BUG_ON(), 
> > > > > though.
> > > > >*/
> > > > >   pfn += 1 << page_order(page);
> > > > > - else if (skip_hwpoisoned_pages && PageHWPoison(page))
> > > > > + else if (skip_hwpoisoned_pages && 
> > > > > PageHWPoison(compound_head(page)))
> > > > >   /* A HWPoisoned page cannot be also PageBuddy */
> > > > >   pfn++;
> > > > >   else
> > > > 
> > > > This fix looks good to me. The original code only addresses hwpoisoned 
> > > > 4kB-page,
> > > > we seem to have this issue since the following commit,
> > > 
> > > Thanks a lot for double checking Naoya!
> > >  
> > > >   commit b023f46813cde6e3b8a8c24f432ff9c1fd8e9a64
> > > >   Author: Wen Congyang 
> > > >   Date:   Tue Dec 11 16:00:45 2012 -0800
> > > >   
> > > >   memory-hotplug: skip HWPoisoned page when offlining pages
> > > > 
> > > > and extension of LTP coverage finally discovered this.
> > > 
> > > Qian, could you give the patch some testing?
> > 
> > Unfortunately, this does not solve the problem. It looks to me that in
> > soft_offline_huge_page(), set_hwpoison_free_buddy_page() will only set
> > PG_hwpoison for buddy pages, so the even the compound_head() has no 
> > PG_hwpoison
> > set.
> > 
> > if (PageBuddy(page_head) && page_order(page_head) >= order) {
> > if (!TestSetPageHWPoison(page))
> > hwpoisoned = true;
> 
> This is more than unexpected. How are we supposed to find out that the
> page is poisoned? Any idea Naoya?

# sorry for my poor review...

We set the PG_hwpoison bit only on the head page for hugetlb, because we
handle multiple pages as a single unit for hugetlb. So it's enough to
check isolation only on the head page.  Simply skipping the pfn cursor to
the page after the hugepage should avoid the infinite loop:

  @@ -274,9 +274,13 @@ __test_page_isolated_in_pageblock(unsigned long pfn, 
unsigned long end_pfn,
 * simple way to verify that as VM_BUG_ON(), though.
 */
pfn += 1 << page_order(page);
  - else if (skip_hwpoisoned_pages && PageHWPoison(page))
  - /* A HWPoisoned page cannot be also PageBuddy */
  - pfn++;
  + else if (skip_hwpoisoned_pages && 
PageHWPoison(compound_head(page)))
  + /*
  +  * A HWPoisoned page cannot be also PageBuddy.
  +  * PG_hwpoison could be set only on the head page in
  +  * hugetlb case, so no need to check tail pages.
  +  */
  + pfn += 1 << compound_order(page);
else
break;
}

Qian, could you please try this?

Thanks,
Naoya Horiguchi

Re: memory offline infinite loop after soft offline

2019-10-17 Thread Naoya Horiguchi
On Mon, Oct 14, 2019 at 10:39:14AM +0200, Michal Hocko wrote:
> On Fri 11-10-19 17:32:44, Qian Cai wrote:
> > # /opt/ltp/runtest/bin/move_pages12
> > move_pages12.c:263: INFO: Free RAM 258988928 kB
> > move_pages12.c:281: INFO: Increasing 2048kB hugepages pool on node 0 to 4
> > move_pages12.c:291: INFO: Increasing 2048kB hugepages pool on node 8 to 4
> > move_pages12.c:207: INFO: Allocating and freeing 4 hugepages on node 0
> > move_pages12.c:207: INFO: Allocating and freeing 4 hugepages on node 8
> > move_pages12.c:197: PASS: Bug not reproduced
> > move_pages12.c:197: PASS: Bug not reproduced
> > 
> > for mem in $(ls -d /sys/devices/system/memory/memory*); do
> > echo offline > $mem/state
> > echo online > $mem/state
> > done
> > 
> > That LTP move_pages12 test will first madvise(MADV_SOFT_OFFLINE) for a 
> > range.
> > Then, one of "echo offline" will trigger an infinite loop in 
> > __offline_pages()
> > here,
> > 
> > /* check again */
> > ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > NULL, check_pages_isolated_cb);
> > } while (ret);
> > 
> > because check_pages_isolated_cb() always return -EBUSY from
> > test_pages_isolated(),
> > 
> > 
> > pfn = __test_page_isolated_in_pageblock(start_pfn, end_pfn,
> > skip_hwpoisoned_pages);
> > ...
> > return pfn < end_pfn ? -EBUSY : 0;
> > 
> > The root cause is in __test_page_isolated_in_pageblock() where "pfn" is 
> > always
> > less than "end_pfn" because the associated page is not a PageBuddy.
> > 
> > while (pfn < end_pfn) {
> > ...
> > else
> > break;
> > 
> > return pfn;
> 
> Hmm, this is interesting. I would expect that this would hit the
> previous branch
>   if (skip_hwpoisoned_pages && PageHWPoison(page))
> and skip over hwpoisoned page. But I cannot seem to find that we would
> mark all tail pages HWPoison as well and so any tail page seem to
> confuse __test_page_isolated_in_pageblock.
> 
> Oscar is rewriting the hwpoison implementation but I am not sure
> how/whether he is handling this case differently. Naoya, shouldn't we do
> the following at least?

My apologies for the late response.

> ---
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 89c19c0feadb..5fb3fee16fde 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -274,7 +274,7 @@ __test_page_isolated_in_pageblock(unsigned long pfn, 
> unsigned long end_pfn,
>* simple way to verify that as VM_BUG_ON(), though.
>*/
>   pfn += 1 << page_order(page);
> - else if (skip_hwpoisoned_pages && PageHWPoison(page))
> + else if (skip_hwpoisoned_pages && 
> PageHWPoison(compound_head(page)))
>   /* A HWPoisoned page cannot be also PageBuddy */
>   pfn++;
>   else

This fix looks good to me. The original code only addresses hwpoisoned
4kB pages; we seem to have had this issue since the following commit,

  commit b023f46813cde6e3b8a8c24f432ff9c1fd8e9a64
  Author: Wen Congyang 
  Date:   Tue Dec 11 16:00:45 2012 -0800
  
  memory-hotplug: skip HWPoisoned page when offlining pages

and extension of LTP coverage finally discovered this.

Thanks,
Naoya Horiguchi

Re: [PATCH] mm, soft-offline: convert parameter to pfn

2019-10-17 Thread Naoya Horiguchi
On Thu, Oct 17, 2019 at 10:03:21AM +0200, Oscar Salvador wrote:
> On Thu, Oct 17, 2019 at 07:50:18AM +0000, Naoya Horiguchi wrote:
> > Actually I guess that !pfn_valid() never happens when called from
> > madvise_inject_error(), because madvise_inject_error() gets pfn via
> > get_user_pages_fast() which only returns valid page for valid pfn.
> > 
> > And we plan to remove MF_COUNT_INCREASED by Oscar's re-design work,
> > so I start feeling that this patch should come on top of his tree.
> 
> Hi Naoya,
> 
> I am pretty much done with my testing.
> If you feel like, I can take the patch and add it on top of [1]
> , then I will do some more testing and, if nothing pops up, I will
> send it upstream.

Hi Oscar,
Yes, please take it, thank you for speaking up.

> 
> [1] https://github.com/leberus/linux/tree/hwpoison-v2


Re: [PATCH] mm, soft-offline: convert parameter to pfn

2019-10-17 Thread Naoya Horiguchi
On Thu, Oct 17, 2019 at 09:16:42AM +0200, David Hildenbrand wrote:
> On 17.10.19 01:47, Naoya Horiguchi wrote:
> > On Wed, Oct 16, 2019 at 10:57:57AM +0200, David Hildenbrand wrote:
> > > On 16.10.19 10:54, Naoya Horiguchi wrote:
> > > > On Wed, Oct 16, 2019 at 10:34:52AM +0200, David Hildenbrand wrote:
> > > > > On 16.10.19 10:27, Naoya Horiguchi wrote:
> > > > > > On Wed, Oct 16, 2019 at 09:56:19AM +0200, David Hildenbrand wrote:
> > > > > > > On 16.10.19 09:09, Naoya Horiguchi wrote:
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > I wrote a simple cleanup for parameter of soft_offline_page(),
> > > > > > > > based on thread https://lkml.org/lkml/2019/10/11/57.
> > > > > > > > 
> > > > > > > > I know that we need more cleanup on hwpoison-inject, but I think
> > > > > > > > that will be mentioned in re-write patchset Oscar is preparing 
> > > > > > > > now.
> > > > > > > > So let me shared only this part as a separate one now.
> > > > > > ...
> > > > > > > 
> > > > > > > I think you should rebase that patch on linux-next (where the
> > > > > > > pfn_to_online_page() check is in place). I assume you'll want to 
> > > > > > > move the
> > > > > > > pfn_to_online_page() check into soft_offline_page() then as well?
> > > > > > 
> > > > > > I rebased to next-20191016. And yes, we will move 
> > > > > > pfn_to_online_page()
> > > > > > into soft offline code.  It seems that we can also move pfn_valid(),
> > > > > > but is simply moving like below good enough for you?
> > > > > 
> > > > > At least I can't am the patch to current next/master (due to
> > > > > pfn_to_online_page()).
> > > 
> > > Could also be that my "git am" skills failed as the mail was not a
> > > proper patch itself :)
> > 
> > Sorry for the inconvenience, my company email system breaks original
> > message by introducing quoted-printable format ('=20' or '=3D').
> > Most mail client usually handles it but git-am doesn't.
> > I give up using it and send via smtp.gmail.com.
> > 
> > > > @@ -1877,11 +1877,17 @@ static int soft_offline_free_page(struct page 
> > > > *page)
> > > >* This is not a 100% solution for all memory, but tries to be
> > > >* ``good enough'' for the majority of memory.
> > > >*/
> > > > -int soft_offline_page(struct page *page, int flags)
> > > > +int soft_offline_page(unsigned long pfn, int flags)
> > > >   {
> > > > int ret;
> > > > -   unsigned long pfn = page_to_pfn(page);
> > > > +   struct page *page;
> > > > +   if (!pfn_valid(pfn))
> > > > +   return -ENXIO;
> > > > +   /* Only online pages can be soft-offlined (esp., not 
> > > > ZONE_DEVICE). */
> > > > +   page = pfn_to_online_page(pfn);
> > > > +   if (!page)
> > > > +   return -EIO;
> > > > if (is_zone_device_page(page)) {
> > > 
> > > -> this is now no longer possible! So you can drop the whole if
> > > (is_zone_device) case
> > 
> > OK, thanks. I updated it.
> > 
> > Thanks,
> > Naoya Horiguchi
> > ---
> >  From 5faf227839b578726fe7f5ff414a153abb3b3a31 Mon Sep 17 00:00:00 2001
> > From: Naoya Horiguchi 
> > Date: Thu, 17 Oct 2019 08:40:53 +0900
> > Subject: [PATCH] mm, soft-offline: convert parameter to pfn
> > 
> > Currently soft_offline_page() receives struct page, and its sibling
> > memory_failure() receives pfn. This discrepancy looks weird and makes
> > precheck on pfn validity tricky. So let's align them.
> > 
> > Signed-off-by: Naoya Horiguchi 
> > ---
> >   drivers/base/memory.c |  7 +--
> >   include/linux/mm.h|  2 +-
> >   mm/madvise.c  |  2 +-
> >   mm/memory-failure.c   | 19 +--
> >   4 files changed, 12 insertions(+), 18 deletions(-)
> > 
> > diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> > index 55907c27075b..a757d9ed88a7 100644
> > --- a/drivers/base/memory.c
> > +++ b/drivers/base/memory.c
> > @@ -538,12 +538,7 @@ static ssize_t soft_offline_page_store(struct device *dev,

Re: [PATCH] mm, soft-offline: convert parameter to pfn

2019-10-16 Thread Naoya Horiguchi
On Wed, Oct 16, 2019 at 10:57:57AM +0200, David Hildenbrand wrote:
> On 16.10.19 10:54, Naoya Horiguchi wrote:
> >On Wed, Oct 16, 2019 at 10:34:52AM +0200, David Hildenbrand wrote:
> >>On 16.10.19 10:27, Naoya Horiguchi wrote:
> >>>On Wed, Oct 16, 2019 at 09:56:19AM +0200, David Hildenbrand wrote:
> >>>>On 16.10.19 09:09, Naoya Horiguchi wrote:
> >>>>>Hi,
> >>>>>
> >>>>>I wrote a simple cleanup for parameter of soft_offline_page(),
> >>>>>based on thread https://lkml.org/lkml/2019/10/11/57.
> >>>>>
> >>>>>I know that we need more cleanup on hwpoison-inject, but I think
> >>>>>that will be mentioned in re-write patchset Oscar is preparing now.
> >>>>>So let me shared only this part as a separate one now.
> >>>...
> >>>>
> >>>>I think you should rebase that patch on linux-next (where the
> >>>>pfn_to_online_page() check is in place). I assume you'll want to move the
> >>>>pfn_to_online_page() check into soft_offline_page() then as well?
> >>>
> >>>I rebased to next-20191016. And yes, we will move pfn_to_online_page()
> >>>into soft offline code.  It seems that we can also move pfn_valid(),
> >>>but is simply moving like below good enough for you?
> >>
> >>At least I can't am the patch to current next/master (due to
> >>pfn_to_online_page()).
> 
> Could also be that my "git am" skills failed as the mail was not a
> proper patch itself :)

Sorry for the inconvenience; my company email system breaks the original
message by introducing quoted-printable encoding ('=20' or '=3D').
Most mail clients handle it, but git-am doesn't.
I've given up on it and am sending via smtp.gmail.com.

> >@@ -1877,11 +1877,17 @@ static int soft_offline_free_page(struct page *page)
> >   * This is not a 100% solution for all memory, but tries to be
> >   * ``good enough'' for the majority of memory.
> >   */
> >-int soft_offline_page(struct page *page, int flags)
> >+int soft_offline_page(unsigned long pfn, int flags)
> >  {
> > int ret;
> >-unsigned long pfn = page_to_pfn(page);
> >+struct page *page;
> >+if (!pfn_valid(pfn))
> >+return -ENXIO;
> >+    /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
> >+page = pfn_to_online_page(pfn);
> >+if (!page)
> >+return -EIO;
> > if (is_zone_device_page(page)) {
> 
> -> this is now no longer possible! So you can drop the whole if
> (is_zone_device) case

OK, thanks. I updated it.

Thanks,
Naoya Horiguchi
---
>From 5faf227839b578726fe7f5ff414a153abb3b3a31 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi 
Date: Thu, 17 Oct 2019 08:40:53 +0900
Subject: [PATCH] mm, soft-offline: convert parameter to pfn

Currently soft_offline_page() receives struct page, and its sibling
memory_failure() receives pfn. This discrepancy looks weird and makes
precheck on pfn validity tricky. So let's align them.

Signed-off-by: Naoya Horiguchi 
---
 drivers/base/memory.c |  7 +--
 include/linux/mm.h|  2 +-
 mm/madvise.c  |  2 +-
 mm/memory-failure.c   | 19 +--
 4 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 55907c27075b..a757d9ed88a7 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -538,12 +538,7 @@ static ssize_t soft_offline_page_store(struct device *dev,
if (kstrtoull(buf, 0, &pfn) < 0)
return -EINVAL;
pfn >>= PAGE_SHIFT;
-   if (!pfn_valid(pfn))
-   return -ENXIO;
-   /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
-   if (!pfn_to_online_page(pfn))
-   return -EIO;
-   ret = soft_offline_page(pfn_to_page(pfn), 0);
+   ret = soft_offline_page(pfn, 0);
return ret == 0 ? count : ret;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 44d058723db9..fd360d208346 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2794,7 +2794,7 @@ extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
 extern void shake_page(struct page *p, int access);
 extern atomic_long_t num_poisoned_pages __read_mostly;
-extern int soft_offline_page(struct page *page, int flags);
+extern int soft_offline_page(unsigned long pfn, int flags);
 
 
 /*
diff --git a/mm/madvise.c b/mm/madvise.c
index 2be9f3fdb05e..99dd06fecfa9 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -887,7 +887,7 @@ static int madvise_inject_error(int behavior,
pr_info("Soft offlining pfn %#lx at process virtual 
address %#lx\n",

Re: [PATCH] mm, soft-offline: convert parameter to pfn

2019-10-16 Thread Naoya Horiguchi
On Wed, Oct 16, 2019 at 10:34:52AM +0200, David Hildenbrand wrote:
> On 16.10.19 10:27, Naoya Horiguchi wrote:
> > On Wed, Oct 16, 2019 at 09:56:19AM +0200, David Hildenbrand wrote:
> > > On 16.10.19 09:09, Naoya Horiguchi wrote:
> > > > Hi,
> > > > 
> > > > I wrote a simple cleanup for parameter of soft_offline_page(),
> > > > based on thread https://lkml.org/lkml/2019/10/11/57.
> > > > 
> > > > I know that we need more cleanup on hwpoison-inject, but I think
> > > > that will be mentioned in re-write patchset Oscar is preparing now.
> > > > So let me shared only this part as a separate one now.
> > ...
> > > 
> > > I think you should rebase that patch on linux-next (where the
> > > pfn_to_online_page() check is in place). I assume you'll want to move the
> > > pfn_to_online_page() check into soft_offline_page() then as well?
> > 
> > I rebased to next-20191016. And yes, we will move pfn_to_online_page()
> > into soft offline code.  It seems that we can also move pfn_valid(),
> > but is simply moving like below good enough for you?
> 
> At least I can't am the patch to current next/master (due to
> pfn_to_online_page()).
> 
> > 
> >@@ -1877,11 +1877,17 @@ static int soft_offline_free_page(struct page 
> > *page)
> >  * This is not a 100% solution for all memory, but tries to be
> >  * ``good enough'' for the majority of memory.
> >  */
> >-int soft_offline_page(struct page *page, int flags)
> >+int soft_offline_page(unsigned long pfn, int flags)
> > {
> > int ret;
> >-unsigned long pfn = page_to_pfn(page);
> >+struct page *page;
> >+if (!pfn_valid(pfn))
> >+return -ENXIO;
> >+/* Only online pages can be soft-offlined (esp., not 
> > ZONE_DEVICE). */
> >+if (!pfn_to_online_page(pfn))
> >+return -EIO;
> >+page = pfn_to_page(pfn);
> > if (is_zone_device_page(page)) {
> > pr_debug_ratelimited("soft_offline: %#lx page is device 
> > page\n",
> > pfn);
> >--
> > 
> > Or we might have an option to do as memory_failure() does like below:
> 
> In contrast to soft offlining, memory failure can deal with devmem. So I
> think the above makes sense.

OK, so here's the revised one.

Thanks,
Naoya Horiguchi
---
From: Naoya Horiguchi 
Date: Wed, 16 Oct 2019 17:00:33 +0900
Subject: [PATCH] mm, soft-offline: convert parameter to pfn

Currently soft_offline_page() receives struct page, and its sibling
memory_failure() receives pfn. This discrepancy looks weird and makes
precheck on pfn validity tricky. So let's align them.

Signed-off-by: Naoya Horiguchi 
---
 drivers/base/memory.c |  7 +--
 include/linux/mm.h|  2 +-
 mm/madvise.c  |  2 +-
 mm/memory-failure.c   | 14 ++
 4 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 55907c27075b..a757d9ed88a7 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -538,12 +538,7 @@ static ssize_t soft_offline_page_store(struct device *dev,
if (kstrtoull(buf, 0, &pfn) < 0)
return -EINVAL;
pfn >>= PAGE_SHIFT;
-   if (!pfn_valid(pfn))
-   return -ENXIO;
-   /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
-   if (!pfn_to_online_page(pfn))
-   return -EIO;
-   ret = soft_offline_page(pfn_to_page(pfn), 0);
+   ret = soft_offline_page(pfn, 0);
return ret == 0 ? count : ret;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 44d058723db9..fd360d208346 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2794,7 +2794,7 @@ extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
 extern void shake_page(struct page *p, int access);
 extern atomic_long_t num_poisoned_pages __read_mostly;
-extern int soft_offline_page(struct page *page, int flags);
+extern int soft_offline_page(unsigned long pfn, int flags);
 
 
 /*
diff --git a/mm/madvise.c b/mm/madvise.c
index 2be9f3fdb05e..99dd06fecfa9 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -887,7 +887,7 @@ static int madvise_inject_error(int behavior,
pr_info("Soft offlining pfn %#lx at process virtual 
address %#lx\n",
pfn, start);
 
-   ret = soft_offline_page(page, MF_COUNT_INCREASED);
+   ret = soft_offline_page(pfn, MF_COUNT_INCREASED);
if (ret)

Re: [PATCH] mm, soft-offline: convert parameter to pfn

2019-10-16 Thread Naoya Horiguchi
On Wed, Oct 16, 2019 at 09:56:19AM +0200, David Hildenbrand wrote:
> On 16.10.19 09:09, Naoya Horiguchi wrote:
> > Hi,
> > 
> > I wrote a simple cleanup for parameter of soft_offline_page(),
> > based on thread https://lkml.org/lkml/2019/10/11/57.
> > 
> > I know that we need more cleanup on hwpoison-inject, but I think
> > that will be mentioned in re-write patchset Oscar is preparing now.
> > So let me shared only this part as a separate one now.
...
> 
> I think you should rebase that patch on linux-next (where the
> pfn_to_online_page() check is in place). I assume you'll want to move the
> pfn_to_online_page() check into soft_offline_page() then as well?

I rebased to next-20191016. And yes, we will move pfn_to_online_page()
into the soft offline code.  It seems that we can also move pfn_valid(),
but is simply moving it like below good enough for you?

  @@ -1877,11 +1877,17 @@ static int soft_offline_free_page(struct page *page)
* This is not a 100% solution for all memory, but tries to be
* ``good enough'' for the majority of memory.
*/
  -int soft_offline_page(struct page *page, int flags)
  +int soft_offline_page(unsigned long pfn, int flags)
   {
int ret;
  - unsigned long pfn = page_to_pfn(page);
  + struct page *page;
   
  + if (!pfn_valid(pfn))
  + return -ENXIO;
  + /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
  + if (!pfn_to_online_page(pfn))
  + return -EIO;
  + page = pfn_to_page(pfn);
if (is_zone_device_page(page)) {
pr_debug_ratelimited("soft_offline: %#lx page is device page\n",
pfn);
  -- 

Or we might have an option to do as memory_failure() does like below:

  int memory_failure(unsigned long pfn, int flags)
  {
  
  p = pfn_to_online_page(pfn);
  if (!p) {
  if (pfn_valid(pfn)) {
  pgmap = get_dev_pagemap(pfn, NULL);
  if (pgmap)
  return memory_failure_dev_pagemap(pfn, flags,
pgmap);
  }
  pr_err("Memory failure: %#lx: memory outside kernel 
control\n",
      pfn);
  return -ENXIO;
  }

Thanks,
Naoya Horiguchi


[PATCH] mm, soft-offline: convert parameter to pfn

2019-10-16 Thread Naoya Horiguchi
Hi,

I wrote a simple cleanup for the parameter of soft_offline_page(),
based on the thread https://lkml.org/lkml/2019/10/11/57.

I know that we need more cleanup on hwpoison-inject, but I think
that will be covered in the re-write patchset Oscar is preparing now.
So let me share only this part separately now.

Thanks,
Naoya Horiguchi
---
From: Naoya Horiguchi 
Date: Wed, 16 Oct 2019 15:49:00 +0900
Subject: [PATCH] mm, soft-offline: convert parameter to pfn

Currently soft_offline_page() receives struct page, and its sibling
memory_failure() receives pfn. This discrepancy looks weird and makes
precheck on pfn validity tricky. So let's align them.

Signed-off-by: Naoya Horiguchi 
---
 drivers/base/memory.c | 2 +-
 include/linux/mm.h| 2 +-
 mm/madvise.c  | 2 +-
 mm/memory-failure.c   | 8 
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index e5485c22ef77..04e469c82852 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -540,7 +540,7 @@ static ssize_t soft_offline_page_store(struct device *dev,
pfn >>= PAGE_SHIFT;
if (!pfn_valid(pfn))
return -ENXIO;
-   ret = soft_offline_page(pfn_to_page(pfn));
+   ret = soft_offline_page(pfn);
return ret == 0 ? count : ret;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3eba26324ff1..0a452020edf5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2783,7 +2783,7 @@ extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
 extern void shake_page(struct page *p, int access);
 extern atomic_long_t num_poisoned_pages __read_mostly;
-extern int soft_offline_page(struct page *page);
+extern int soft_offline_page(unsigned long pfn);
 
 
 /*
diff --git a/mm/madvise.c b/mm/madvise.c
index fd221b610b52..df198d1e5e2e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -893,7 +893,7 @@ static int madvise_inject_error(int behavior,
if (behavior == MADV_SOFT_OFFLINE) {
pr_info("Soft offlining pfn %#lx at process virtual 
address %#lx\n",
 pfn, start);
-   ret = soft_offline_page(page);
+   ret = soft_offline_page(pfn);
if (ret)
return ret;
} else {
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4f16e0a7e7cc..eb4fd5e8d5e1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1514,7 +1514,7 @@ static void memory_failure_work_func(struct work_struct 
*work)
if (!gotten)
break;
if (entry.flags & MF_SOFT_OFFLINE)
-   soft_offline_page(pfn_to_page(entry.pfn));
+   soft_offline_page(entry.pfn);
else
memory_failure(entry.pfn, entry.flags);
}
@@ -1822,7 +1822,7 @@ static int soft_offline_free_page(struct page *page)
 
 /**
  * soft_offline_page - Soft offline a page.
- * @page: page to offline
+ * @pfn: pfn to soft-offline
  *
  * Returns 0 on success, otherwise negated errno.
  *
@@ -1841,10 +1841,10 @@ static int soft_offline_free_page(struct page *page)
  * This is not a 100% solution for all memory, but tries to be
  * ``good enough'' for the majority of memory.
  */
-int soft_offline_page(struct page *page)
+int soft_offline_page(unsigned long pfn)
 {
int ret;
-   unsigned long pfn = page_to_pfn(page);
+   struct page *page = pfn_to_page(pfn);
 
if (is_zone_device_page(page)) {
pr_debug_ratelimited("soft_offline: %#lx page is device page\n",
-- 
2.17.1



Re: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

2019-10-11 Thread Naoya Horiguchi
On Thu, Oct 10, 2019 at 09:17:42AM +0200, David Hildenbrand wrote:
> On 10.10.19 02:26, Naoya Horiguchi wrote:
> > On Wed, Oct 09, 2019 at 04:24:35PM +0200, David Hildenbrand wrote:
> >> We should check for pfn_to_online_page() to not access uninitialized
> >> memmaps. Reshuffle the code so we don't have to duplicate the error
> >> message.
> >>
> >> Cc: Naoya Horiguchi 
> >> Cc: Andrew Morton 
> >> Cc: Michal Hocko 
> >> Signed-off-by: David Hildenbrand 
> >> ---
> >>  mm/memory-failure.c | 14 --
> >>  1 file changed, 8 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >> index 7ef849da8278..e866e6e5660b 100644
> >> --- a/mm/memory-failure.c
> >> +++ b/mm/memory-failure.c
> >> @@ -1253,17 +1253,19 @@ int memory_failure(unsigned long pfn, int flags)
> >>if (!sysctl_memory_failure_recovery)
> >>panic("Memory failure on page %lx", pfn);
> >>  
> >> -  if (!pfn_valid(pfn)) {
> >> +  p = pfn_to_online_page(pfn);
> >> +  if (!p) {
> >> +  if (pfn_valid(pfn)) {
> >> +  pgmap = get_dev_pagemap(pfn, NULL);
> >> +  if (pgmap)
> >> +  return memory_failure_dev_pagemap(pfn, flags,
> >> +pgmap);
> >> +  }
> >>pr_err("Memory failure: %#lx: memory outside kernel control\n",
> >>pfn);
> >>return -ENXIO;
> >>}
> >>  
> >> -  pgmap = get_dev_pagemap(pfn, NULL);
> >> -  if (pgmap)
> >> -  return memory_failure_dev_pagemap(pfn, flags, pgmap);
> >> -
> >> -  p = pfn_to_page(pfn);
> > 
> > This change seems to assume that memory_failure_dev_pagemap() is never
> > called for online pages. Is it an intended behavior?
> > Or the concept "online pages" is not applicable to zone device pages?
> 
> Yes, that's the real culprit. ZONE_DEVICE/devmem pages are never online
> (SECTION_IS_ONLINE). The terminology "online" only applies to pages that
> were given to the buddy. And as we support sup-section hotadd for
> devmem, we cannot easily make use of the section flag it. I already
> proposed somewhere to convert SECTION_IS_ONLINE to a subsection bitmap
> and call it something like pfn_active().
> 
> pfn_online() would then be "pfn_active() && zone != ZONE_DEVICE". And we
> could use pfn_active() everywhere to test for initialized memmaps (well,
> besides some special cases like device reserved memory that does not
> span full sub-sections). Until now, nobody volunteered and I have other
> things to do.

Thank you for the explanation.  I'm not sure how hard it is now, but we'll
try to do this.  You fix the problem from the online/offline viewpoint now
and leave the remaining part untouched. I agree with that approach, and the
suggested change looks good to me.

Acked-by: Naoya Horiguchi 


Re: [PATCH v1] drivers/base/memory.c: Don't access uninitialized memmaps in soft_offline_page_store()

2019-10-11 Thread Naoya Horiguchi
On Thu, Oct 10, 2019 at 04:12:00PM +0200, David Hildenbrand wrote:
> Uninitialized memmaps contain garbage and in the worst case trigger kernel
> BUGs, especially with CONFIG_PAGE_POISONING. They should not get
> touched.
> 
> Right now, when trying to soft-offline a PFN that resides on a memory
> block that was never onlined, one gets a misleading error with
> CONFIG_PAGE_POISONING:
>   :/# echo 5637144576 > /sys/devices/system/memory/soft_offline_page
>   [   23.097167] soft offline: 0x15 page already poisoned
> 
> But the actual result depends on the garbage in the memmap.
> 
> soft_offline_page() can only work with online pages, it returns -EIO in
> case of ZONE_DEVICE. Make sure to only forward pages that are online
> (iow, managed by the buddy) and, therefore, have an initialized memmap.
> 
> Add a check against pfn_to_online_page() and similarly return -EIO.
> 
> Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to 
> zones until online") # visible after d0dc12e86b319
> Cc: Greg Kroah-Hartman 
> Cc: "Rafael J. Wysocki" 
> Cc: Michal Hocko 
> Cc: Andrew Morton 
> Signed-off-by: David Hildenbrand 
> ---
>  drivers/base/memory.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 6bea4f3f8040..55907c27075b 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -540,6 +540,9 @@ static ssize_t soft_offline_page_store(struct device *dev,
>   pfn >>= PAGE_SHIFT;
>   if (!pfn_valid(pfn))
>   return -ENXIO;
> + /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
> + if (!pfn_to_online_page(pfn))
> + return -EIO;

Acked-by: Naoya Horiguchi 

I think this check could be placed in soft_offline_page(), but that requires
a few more unrelated lines of changes due to the mismatch in parameter type
between memory_failure() and soft_offline_page().  This is not your problem,
and I plan to do some cleanup on the related interfaces, so this patch is fine.

- Naoya Horiguchi


Re: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

2019-10-11 Thread Naoya Horiguchi
On Thu, Oct 10, 2019 at 09:58:40AM +0200, David Hildenbrand wrote:
> On 10.10.19 09:52, David Hildenbrand wrote:
> > On 10.10.19 09:35, Michal Hocko wrote:
> >> On Thu 10-10-19 09:27:32, David Hildenbrand wrote:
> >>> On 09.10.19 16:43, Michal Hocko wrote:
> >>>> On Wed 09-10-19 16:24:35, David Hildenbrand wrote:
> >>>>> We should check for pfn_to_online_page() to not access uninitialized
> >>>>> memmaps. Reshuffle the code so we don't have to duplicate the error
> >>>>> message.
> >>>>>
> >>>>> Cc: Naoya Horiguchi 
> >>>>> Cc: Andrew Morton 
> >>>>> Cc: Michal Hocko 
> >>>>> Signed-off-by: David Hildenbrand 
> >>>>> ---
> >>>>>  mm/memory-failure.c | 14 --
> >>>>>  1 file changed, 8 insertions(+), 6 deletions(-)
> >>>>>
> >>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >>>>> index 7ef849da8278..e866e6e5660b 100644
> >>>>> --- a/mm/memory-failure.c
> >>>>> +++ b/mm/memory-failure.c
> >>>>> @@ -1253,17 +1253,19 @@ int memory_failure(unsigned long pfn, int flags)
> >>>>> if (!sysctl_memory_failure_recovery)
> >>>>> panic("Memory failure on page %lx", pfn);
> >>>>>  
> >>>>> -   if (!pfn_valid(pfn)) {
> >>>>> +   p = pfn_to_online_page(pfn);
> >>>>> +   if (!p) {
> >>>>> +   if (pfn_valid(pfn)) {
> >>>>> +   pgmap = get_dev_pagemap(pfn, NULL);
> >>>>> +   if (pgmap)
> >>>>> +   return memory_failure_dev_pagemap(pfn, 
> >>>>> flags,
> >>>>> + 
> >>>>> pgmap);
> >>>>> +   }
> >>>>> pr_err("Memory failure: %#lx: memory outside kernel 
> >>>>> control\n",
> >>>>> pfn);
> >>>>> return -ENXIO;
> >>>>
> >>>> Don't we need that earlier at hwpoison_inject level?
> >>>>
> >>>
> >>> Theoretically yes, this is another instance. But pfn_to_online_page(pfn)
> >>> alone would not be sufficient as discussed. We would, again, have to
> >>> special-case ZONE_DEVICE via things like get_dev_pagemap() ...
> >>>
> >>> But mm/hwpoison-inject.c:hwpoison_inject() is a pure debug feature either 
> >>> way:
> >>>
> >>>   /*
> >>>* Note that the below poison/unpoison interfaces do not involve
> >>>* hardware status change, hence do not require hardware support.
> >>>* They are mainly for testing hwpoison in software level.
> >>>*/
> >>>
> >>> So it's not that bad compared to memory_failure() called from real HW or
> >>> drivers/base/memory.c:soft_offline_page_store()/hard_offline_page_store()
> >>
> >> Yes, this is just a toy. And yes we need to handle zone device pages
> >> here because a) people likely want to test MCE behavior even on these
> >> pages and b) HW can really trigger MCEs there as well. I was just
> >> pointing that the patch is likely incomplete.
> >>
> > 
> > I rather think this deserves a separate patch as it is a separate
> > interface :)
> > 
> > I do wonder why hwpoison_inject() has to perform so much extra work
> > compared to other memory_failure() users. This smells like legacy
> > leftovers to me, but I might be wrong. The interface is fairly old,
> > though. Does anybody know why we need this magic? I can spot quite some
> > duplicate checks/things getting performed.

It concerns me too; this *is* old legacy code. I guess it was left as-is
because no one complained about it.  That's not good, so I'll do some cleanup.

> > 
> > Naiive me would just make the interface perform the same as
> > hard_offline_page_store(). But most probably I am not getting the real
> > purpose of both different interfaces.

Maybe for historical reasons, we have these slightly different interfaces:

- corrupt-pfn
  - purely for debugging purpose
  - paired with unpoison-pfn
  - used by in-kernel tool tools/vm/page-types.c
- hard_offline_page
  - paired with soft_offline_page
  - used by other userspace tools like mcelog

But this doesn't explain why they are implemented differently, so I think
both should be written in the same manner.
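
As a rough illustration of what "the same manner" could mean, the debugfs
corrupt-pfn handler might be reduced to a thin wrapper around
memory_failure(), similar to hard_offline_page_store(); this is only a
sketch, not the current mm/hwpoison-inject.c code:

	/* sketch: debugfs "hwpoison/corrupt-pfn" handler as a thin wrapper */
	static int hwpoison_inject(void *data, u64 val)
	{
		unsigned long pfn = val;

		if (!capable(CAP_SYS_ADMIN))
			return -EPERM;

		pr_info("Injecting memory failure at pfn %#lx\n", pfn);
		return memory_failure(pfn, 0);
	}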

> > 
> > HWPOISON_INJECT is only selected for DEBUG_KERNEL, so I would have
> > guessed that fixing this is not urgent.
> > 
> > BTW: mm/memory-failure.c:soft_offline_page() also looks wrong and needs
> > fixing to make sure we access initialized memmaps.
> > 
> 
> To be more precise, soft_offline_page_store() needs a
> pfn_to_online_page() check. Will send a patch.

Thanks for finding this.

- Naoya Horiguchi


Re: [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c

2019-10-10 Thread Naoya Horiguchi
On Fri, Oct 11, 2019 at 12:11:25AM +, Horiguchi Naoya(堀口 直也) wrote:
> On Thu, Oct 10, 2019 at 09:30:01AM +0200, David Hildenbrand wrote:
> > On 09.10.19 11:57, Naoya Horiguchi wrote:
> > > Hi David,
> > > 
> > > On Wed, Oct 09, 2019 at 11:12:04AM +0200, David Hildenbrand wrote:
> > >> There are various places where we access uninitialized memmaps, namely:
> > >> - /proc/kpagecount
> > >> - /proc/kpageflags
> > >> - /proc/kpagecgroup
> > >> - memory_failure() - which reuses stable_page_flags() from fs/proc/page.c
> > > 
> > > Ah right, memory_failure is another victim of this bug.
> > > 
> > >>
> > >> We have initialized memmaps either when the section is online or when
> > >> the page was initialized to the ZONE_DEVICE. Uninitialized memmaps 
> > >> contain
> > >> garbage and in the worst case trigger kernel BUGs, especially with
> > >> CONFIG_PAGE_POISONING.
> > >>
> > >> For example, not onlining a DIMM during boot and calling /proc/kpagecount
> > >> with CONFIG_PAGE_POISONING:
> > >> :/# cat /proc/kpagecount > tmp.test
> > >> [   95.600592] BUG: unable to handle page fault for address: 
> > >> fffe
> > >> [   95.601238] #PF: supervisor read access in kernel mode
> > >> [   95.601675] #PF: error_code(0x) - not-present page
> > >> [   95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0
> > >> [   95.602596] Oops:  [#1] SMP NOPTI
> > >> [   95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 
> > >> 5.4.0-rc1-next-20191004+ #11
> > >> [   95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> > >> BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
> > >> [   95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0
> > >> [   95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 
> > >> 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480
> > >> [   95.606450] RSP: 0018:a14e409b7e78 EFLAGS: 00010202
> > >> [   95.606904] RAX: fffe RBX: 0002 RCX: 
> > >> 
> > >> [   95.607519] RDX: 0001 RSI: 7f76b5595000 RDI: 
> > >> f3564500
> > >> [   95.608128] RBP: 7f76b5595000 R08: 0001 R09: 
> > >> 
> > >> [   95.608731] R10:  R11:  R12: 
> > >> 0014
> > >> [   95.609327] R13: 0002 R14: 7f76b5595000 R15: 
> > >> a14e409b7f08
> > >> [   95.609924] FS:  7f76b577d580() GS:8f41bd40() 
> > >> knlGS:
> > >> [   95.610599] CS:  0010 DS:  ES:  CR0: 80050033
> > >> [   95.611083] CR2: fffe CR3: 7896 CR4: 
> > >> 06f0
> > >> [   95.611686] Call Trace:
> > >> [   95.611906]  proc_reg_read+0x3c/0x60
> > >> [   95.612228]  vfs_read+0xc5/0x180
> > >> [   95.612505]  ksys_read+0x68/0xe0
> > >> [   95.612785]  do_syscall_64+0x5c/0xa0
> > >> [   95.613092]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > >>
> > >> Note that there are still two possible races as far as I can see:
> > >> - pfn_to_online_page() succeeding but the memory getting offlined and
> > >>   removed. get_online_mems() could help once we run into this.
> > >> - pfn_zone_device() succeeding but the memmap not being fully
> > >>   initialized yet. As the memmap is initialized outside of the memory
> > >>   hoptlug lock, get_online_mems() can't help.
> > >>
> > >> Let's keep the existing interfaces working with ZONE_DEVICE memory. We
> > >> can later come back and fix these rare races and eventually speed-up the
> > >> ZONE_DEVICE detection.
> > > 
> > > Actually, Toshiki is writing code to refactor and optimize the pfn walking
> > > part, where we find the pfn ranges covered by zone devices by running over
> > > xarray pgmap_array and use the range info to reduce pointer dereferences
> > > to speed up pfn walk. I hope he will share it soon.
> > 
> > AFAIKT, Michal is not a friend of special-casing PFN walkers in that
> > way. We should have a mechanism to detect if a memmap was initialized
> > without having to go via pgmap, special-casing. See my other mail where
> > I draft one basic approach.
> 
> OK, so considering your v2 approach, we could have another pfn_to_page()
> variant like pfn_to_zone_device_page(), where we check that a given pfn
> belongs to the memory section backed by zone memory, then another check if
> the pfn has initialized memmap or not, and return NULL if memmap not
> initialied.  We'll try this approach then, but if you find problems/concerns,
> please let me know.

Sorry, you already mentioned the details here,
https://lore.kernel.org/lkml/c6198acd-8ff7-c40c-cb4e-f0f12f841...@redhat.com/


Re: [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c

2019-10-10 Thread Naoya Horiguchi
On Thu, Oct 10, 2019 at 09:30:01AM +0200, David Hildenbrand wrote:
> On 09.10.19 11:57, Naoya Horiguchi wrote:
> > Hi David,
> > 
> > On Wed, Oct 09, 2019 at 11:12:04AM +0200, David Hildenbrand wrote:
> >> There are various places where we access uninitialized memmaps, namely:
> >> - /proc/kpagecount
> >> - /proc/kpageflags
> >> - /proc/kpagecgroup
> >> - memory_failure() - which reuses stable_page_flags() from fs/proc/page.c
> > 
> > Ah right, memory_failure is another victim of this bug.
> > 
> >>
> >> We have initialized memmaps either when the section is online or when
> >> the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain
> >> garbage and in the worst case trigger kernel BUGs, especially with
> >> CONFIG_PAGE_POISONING.
> >>
> >> For example, not onlining a DIMM during boot and calling /proc/kpagecount
> >> with CONFIG_PAGE_POISONING:
> >> :/# cat /proc/kpagecount > tmp.test
> >> [   95.600592] BUG: unable to handle page fault for address: 
> >> fffe
> >> [   95.601238] #PF: supervisor read access in kernel mode
> >> [   95.601675] #PF: error_code(0x) - not-present page
> >> [   95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0
> >> [   95.602596] Oops:  [#1] SMP NOPTI
> >> [   95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 
> >> 5.4.0-rc1-next-20191004+ #11
> >> [   95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> >> rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
> >> [   95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0
> >> [   95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 
> >> 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480
> >> [   95.606450] RSP: 0018:a14e409b7e78 EFLAGS: 00010202
> >> [   95.606904] RAX: fffe RBX: 0002 RCX: 
> >> 
> >> [   95.607519] RDX: 0001 RSI: 7f76b5595000 RDI: 
> >> f3564500
> >> [   95.608128] RBP: 7f76b5595000 R08: 0001 R09: 
> >> 
> >> [   95.608731] R10:  R11:  R12: 
> >> 0014
> >> [   95.609327] R13: 0002 R14: 7f76b5595000 R15: 
> >> a14e409b7f08
> >> [   95.609924] FS:  7f76b577d580() GS:8f41bd40() 
> >> knlGS:
> >> [   95.610599] CS:  0010 DS:  ES:  CR0: 80050033
> >> [   95.611083] CR2: fffe CR3: 7896 CR4: 
> >> 06f0
> >> [   95.611686] Call Trace:
> >> [   95.611906]  proc_reg_read+0x3c/0x60
> >> [   95.612228]  vfs_read+0xc5/0x180
> >> [   95.612505]  ksys_read+0x68/0xe0
> >> [   95.612785]  do_syscall_64+0x5c/0xa0
> >> [   95.613092]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>
> >> Note that there are still two possible races as far as I can see:
> >> - pfn_to_online_page() succeeding but the memory getting offlined and
> >>   removed. get_online_mems() could help once we run into this.
> >> - pfn_zone_device() succeeding but the memmap not being fully
> >>   initialized yet. As the memmap is initialized outside of the memory
> >>   hotplug lock, get_online_mems() can't help.
> >>
> >> Let's keep the existing interfaces working with ZONE_DEVICE memory. We
> >> can later come back and fix these rare races and eventually speed-up the
> >> ZONE_DEVICE detection.
> > 
> > Actually, Toshiki is writing code to refactor and optimize the pfn walking
> > part, where we find the pfn ranges covered by zone devices by running over
> > xarray pgmap_array and use the range info to reduce pointer dereferences
> > to speed up pfn walk. I hope he will share it soon.
> 
> AFAICT, Michal is not a friend of special-casing PFN walkers in that
> way. We should have a mechanism to detect if a memmap was initialized
> without having to special-case via pgmap. See my other mail where
> I draft one basic approach.

OK, so considering your v2 approach, we could have another pfn_to_page()
variant like pfn_to_zone_device_page(), where we first check that a given pfn
belongs to a memory section backed by ZONE_DEVICE memory, then check whether
the pfn has an initialized memmap or not, and return NULL if the memmap is
not initialized.  We'll try this approach then, but if you find
problems/concerns, please let me know.
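
Just to make the idea concrete, a minimal sketch of such a variant (the
helper names pfn_in_zone_device_section() and pfn_memmap_initialized() are
placeholders for illustration, not existing kernel API):

	static struct page *pfn_to_zone_device_page(unsigned long pfn)
	{
		if (!pfn_valid(pfn))
			return NULL;

		/* the pfn must be in a section backed by ZONE_DEVICE memory ... */
		if (!pfn_in_zone_device_section(pfn))
			return NULL;

		/* ... and its memmap must already be initialized. */
		if (!pfn_memmap_initialized(pfn))
			return NULL;

		return pfn_to_page(pfn);
	}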

Thanks,
Naoya Horiguchi


Re: [PATCH v4 0/2] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS issue

2019-10-09 Thread Naoya Horiguchi
On Wed, Oct 09, 2019 at 04:55:10PM -0700, Andrew Morton wrote:
> On Tue, 8 Oct 2019 23:18:31 +0000 Naoya Horiguchi  
> wrote:
> 
> > I think that this patchset is good enough and ready to be merged.
> > Andrew, could you consider queuing this series into your tree?
> 
> I'll treat that as an acked-by:.

thanks.

> 
> Do you think 2/2 should be backported into -stable trees?

Yes, I think so. Please add Cc: stable.

Thanks,
Naoya Horiguchi


Re: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

2019-10-09 Thread Naoya Horiguchi
On Wed, Oct 09, 2019 at 04:24:35PM +0200, David Hildenbrand wrote:
> We should check for pfn_to_online_page() to not access uninitialized
> memmaps. Reshuffle the code so we don't have to duplicate the error
> message.
> 
> Cc: Naoya Horiguchi 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/memory-failure.c | 14 --
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 7ef849da8278..e866e6e5660b 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1253,17 +1253,19 @@ int memory_failure(unsigned long pfn, int flags)
>   if (!sysctl_memory_failure_recovery)
>   panic("Memory failure on page %lx", pfn);
>  
> - if (!pfn_valid(pfn)) {
> + p = pfn_to_online_page(pfn);
> + if (!p) {
> + if (pfn_valid(pfn)) {
> + pgmap = get_dev_pagemap(pfn, NULL);
> + if (pgmap)
> + return memory_failure_dev_pagemap(pfn, flags,
> +   pgmap);
> + }
>   pr_err("Memory failure: %#lx: memory outside kernel control\n",
>   pfn);
>   return -ENXIO;
>   }
>  
> - pgmap = get_dev_pagemap(pfn, NULL);
> - if (pgmap)
> - return memory_failure_dev_pagemap(pfn, flags, pgmap);
> -
> - p = pfn_to_page(pfn);

This change seems to assume that memory_failure_dev_pagemap() is never
called for online pages. Is this intended behavior?
Or is the concept of "online pages" simply not applicable to zone device pages?

Thanks,
Naoya Horiguchi


Re: [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c

2019-10-09 Thread Naoya Horiguchi
Hi David,

On Wed, Oct 09, 2019 at 11:12:04AM +0200, David Hildenbrand wrote:
> There are various places where we access uninitialized memmaps, namely:
> - /proc/kpagecount
> - /proc/kpageflags
> - /proc/kpagecgroup
> - memory_failure() - which reuses stable_page_flags() from fs/proc/page.c

Ah right, memory_failure is another victim of this bug.

> 
> We have initialized memmaps either when the section is online or when
> the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain
> garbage and in the worst case trigger kernel BUGs, especially with
> CONFIG_PAGE_POISONING.
> 
> For example, not onlining a DIMM during boot and calling /proc/kpagecount
> with CONFIG_PAGE_POISONING:
> :/# cat /proc/kpagecount > tmp.test
> [   95.600592] BUG: unable to handle page fault for address: fffe
> [   95.601238] #PF: supervisor read access in kernel mode
> [   95.601675] #PF: error_code(0x) - not-present page
> [   95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0
> [   95.602596] Oops:  [#1] SMP NOPTI
> [   95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ 
> #11
> [   95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
> [   95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0
> [   95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 
> 03 3d ab 51 01 01 74 1d 48 8b 57 08 480
> [   95.606450] RSP: 0018:a14e409b7e78 EFLAGS: 00010202
> [   95.606904] RAX: fffe RBX: 0002 RCX: 
> 
> [   95.607519] RDX: 0001 RSI: 7f76b5595000 RDI: 
> f3564500
> [   95.608128] RBP: 7f76b5595000 R08: 0001 R09: 
> 
> [   95.608731] R10:  R11:  R12: 
> 0014
> [   95.609327] R13: 0002 R14: 7f76b5595000 R15: 
> a14e409b7f08
> [   95.609924] FS:  7f76b577d580() GS:8f41bd40() 
> knlGS:
> [   95.610599] CS:  0010 DS:  ES:  CR0: 80050033
> [   95.611083] CR2: fffe CR3: 7896 CR4: 
> 06f0
> [   95.611686] Call Trace:
> [   95.611906]  proc_reg_read+0x3c/0x60
> [   95.612228]  vfs_read+0xc5/0x180
> [   95.612505]  ksys_read+0x68/0xe0
> [   95.612785]  do_syscall_64+0x5c/0xa0
> [   95.613092]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Note that there are still two possible races as far as I can see:
> - pfn_to_online_page() succeeding but the memory getting offlined and
>   removed. get_online_mems() could help once we run into this.
> - pfn_zone_device() succeeding but the memmap not being fully
>   initialized yet. As the memmap is initialized outside of the memory
>   hotplug lock, get_online_mems() can't help.
> 
> Let's keep the existing interfaces working with ZONE_DEVICE memory. We
> can later come back and fix these rare races and eventually speed-up the
> ZONE_DEVICE detection.

Actually, Toshiki is writing code to refactor and optimize the pfn walking
part, where we find the pfn ranges covered by zone devices by running over
xarray pgmap_array and use the range info to reduce pointer dereferences
to speed up pfn walk. I hope he will share it soon.

Thanks,
Naoya Horiguchi

> This patch now also makes sure we don't dump data
> about memory blocks that are already offline again.
> 
> Reported-by: Qian Cai 
> Cc: Alexey Dobriyan 
> Cc: Naoya Horiguchi 
> Cc: Andrew Morton 
> Cc: Stephen Rothwell 
> Cc: Michal Hocko 
> Cc: Toshiki Fukasawa 
> Cc: David Hildenbrand 
> Cc: Konstantin Khlebnikov 
> Cc: Mike Rapoport 
> Cc: Anthony Yznaga 
> Cc: Jason Gunthorpe 
> Cc: Dan Williams 
> Cc: Logan Gunthorpe 
> Cc: Ira Weiny 
> Cc: "Aneesh Kumar K.V" 
> Cc: linux-fsde...@vger.kernel.org
> Signed-off-by: David Hildenbrand 
> ---
>  fs/proc/page.c   | 12 ++--
>  include/linux/memremap.h | 11 +--
>  mm/memory-failure.c  | 22 --
>  mm/memremap.c| 19 +++
>  4 files changed, 42 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index decd3fe39674..76502af461e2 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -42,7 +42,8 @@ static ssize_t kpagecount_read(struct file *file, char 
> __user *buf,
>   return -EINVAL;
>  
>   while (count > 0) {
> - if (pfn_valid(pfn))
> + if (pfn_valid(pfn) &&
> + (pfn_to_online_page(pfn) || pfn_zone_device(pfn)))
>   ppage = pfn_to_page(pfn);
>   else
>   ppage = NULL;
> @@ -97,9

Re: [PATCH v4 0/2] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS issue

2019-10-08 Thread Naoya Horiguchi
Hi Jane,

I think that this patchset is good enough and ready to be merged.
Andrew, could you consider queuing this series into your tree?

Thanks,
Naoya Horiguchi

On Tue, Oct 08, 2019 at 11:13:23AM -0700, Jane Chu wrote:
> Hi, Naoya,
> 
> What is the status of the patches?
> Is there anything I need to do from my end ?
> 
> Regards,
> -jane
> 
> On 8/6/2019 10:25 AM, Jane Chu wrote:
> > Change in v4:
> >   - remove trailing white space
> > 
> > Changes in v3:
> >   - move **tk cleanup to its own patch
> > 
> > Changes in v2:
> >   - move 'tk' allocations internal to add_to_kill(), suggested by Dan;
> >   - ran checkpatch.pl check, pointed out by Matthew;
> >   - Noaya pointed out that v1 would have missed the SIGKILL
> > if "tk->addr == -EFAULT", since the code returns early.
> > Incorporated Noaya's suggestion, also, skip VMAs where
> > "tk->size_shift == 0" for zone device page, and deliver SIGBUS
> > when "tk->size_shift != 0" so the payload is helpful;
> >   - added Suggested-by: Naoya Horiguchi 
> > 
> > Jane Chu (2):
> >mm/memory-failure.c clean up around tk pre-allocation
> >mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS if
> >  mmaped more than once
> > 
> >   mm/memory-failure.c | 62 
> > ++---
> >   1 file changed, 26 insertions(+), 36 deletions(-)
> > 
> 


Re: [PATCH 02/10] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED

2019-09-11 Thread Naoya Horiguchi
On Wed, Sep 11, 2019 at 12:27:22PM +0200, David Hildenbrand wrote:
> On 10.09.19 12:30, Oscar Salvador wrote:
> > From: Naoya Horiguchi 
> > 
> > Currently madvise_inject_error() pins the target via get_user_pages_fast.
> > The call to get_user_pages_fast is only to get the respective page
> > of a given address, but it is the job of the memory-poisoning handler
> > to deal with races, so drop the refcount grabbed by get_user_pages_fast.
> > 
> 
> Oh, and another question "it is the job of the memory-poisoning handler"
> - is that already properly implemented? (newbee question ¯\_(ツ)_/¯)

The above description might be confusing, sorry. It's intended to read like:

  The call to get_user_pages_fast is only to get the pointer to struct
  page of a given address, pinning it is memory-poisoning handler's job,
  so drop the refcount grabbed by get_user_pages_fast.

And pinning is done in get_hwpoison_page() for hard-offline and
get_any_page() for soft-offline.  For the soft-offline case, the refcount
semantics of poisoned pages are what this patchset tries to change/improve.
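
For reference, the hard-offline entry takes its own reference roughly like
below (simplified sketch, not an exact quote of memory_failure()):

	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
		/*
		 * No reference could be taken: the page is either a free
		 * buddy page or already gone, so it is handled as such
		 * instead of relying on a pin taken by the caller.
		 */
	}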

Thanks,
Naoya Horiguchi

Re: [PATCH 02/10] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED

2019-09-11 Thread Naoya Horiguchi
Hi David,

On Wed, Sep 11, 2019 at 12:23:24PM +0200, David Hildenbrand wrote:
> On 10.09.19 12:30, Oscar Salvador wrote:
> > From: Naoya Horiguchi 
> > 
> > Currently madvise_inject_error() pins the target via get_user_pages_fast.
> > The call to get_user_pages_fast is only to get the respective page
> > of a given address, but it is the job of the memory-poisoning handler
> > to deal with races, so drop the refcount grabbed by get_user_pages_fast.
> > 
> > Signed-off-by: Naoya Horiguchi 
> > Signed-off-by: Oscar Salvador 
> > ---
> >  mm/madvise.c | 25 +++--
> >  1 file changed, 11 insertions(+), 14 deletions(-)
> > 
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 6e023414f5c1..fbe6d402232c 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -883,6 +883,16 @@ static int madvise_inject_error(int behavior,
> > 		ret = get_user_pages_fast(start, 1, 0, &page);
> > if (ret != 1)
> > return ret;
> > +   /*
> > +* The get_user_pages_fast() is just to get the pfn of the
> > +* given address, and the refcount has nothing to do with
> > +* what we try to test, so it should be released immediately.
> > +* This is racy but it's intended because the real hardware
> > +* errors could happen at any moment and memory error handlers
> > +* must properly handle the race.
> > +*/
> > +   put_page(page);
> > +
> 
> I wonder if it would be clearer to do that after the page has been fully
> used  - e.g. after getting the pfn and the order (and then e.g.,
> symbolically setting the page pointer to 0).

Yes, this could be called just after page_to_pfn() below.
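
i.e. the put_page() would move down to roughly here (rough, untested sketch
of your suggestion):

		ret = get_user_pages_fast(start, 1, 0, &page);
		if (ret != 1)
			return ret;

		pfn = page_to_pfn(page);
		order = compound_order(compound_head(page));

		/*
		 * Done reading the struct page, so drop the pin here; from
		 * this point the memory error handler has to deal with
		 * races as it does for real hardware errors.
		 */
		put_page(page);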

> I guess the important part of this patch is to not have an elevated
> refcount while calling soft_offline_page().
> 

That's right.

> > pfn = page_to_pfn(page);
> >  
> > /*
> > @@ -892,16 +902,11 @@ static int madvise_inject_error(int behavior,
> >  */
> > order = compound_order(compound_head(page));
> >  
> > -   if (PageHWPoison(page)) {
> > -   put_page(page);
> > -   continue;
> > -   }
> 
> This change is not reflected in the changelog. I would have expected
> that only the put_page() would go. If this should go completely, I
> suggest a separate patch.
> 

I forget why I tried to remove the if block, and now I think only the
put_page() should go as you point out.

Thanks for the comment.

- Naoya


Re: [PATCH 00/10] Hwpoison soft-offline rework

2019-09-11 Thread Naoya Horiguchi
On Wed, Sep 11, 2019 at 08:35:26AM +0200, osalva...@suse.de wrote:
> On 2019-09-11 08:22, Naoya Horiguchi wrote:
> > I found another panic ...
> 
> Hi Naoya,
> 
> Thanks for giving it a try. Are these testcases public?
> I will definitely take a look and try to solve these cases.

It's available on https://github.com/Naoya-Horiguchi/mm_regression.
The README is a bit obsolete (sorry about that ...,) but you can run
the testcase like below:

  $ git clone https://github.com/Naoya-Horiguchi/mm_regression
  $ cd mm_regression
  mm_regression $ git clone https://github.com/Naoya-Horiguchi/test_core 
  mm_regression $ make
  // you might need to install some dependencies like numa library and 
mce-inject tool
  mm_regression $ make update_recipes

To run the single testcase, run the commands like below:

  mm_regression $ RECIPEFILES=cases/page_migration/hugetlb_migratepages_allocate1_noovercommit.auto2 bash run.sh
  mm_regression $ RECIPEFILES=cases/cases/mce_ksm_soft-offline_avoid_access.auto2 bash run.sh
  
You can run a set of many testcases with the commands like below:

  mm_regression $ RECIPEFILES=cases/cases/mce_ksm_* bash run.sh
  // run all ksm related testcases. I reproduced the panic with this command.

  mm_regression $ run_class=simple bash run.sh
  // run the set of minimum testcases I run for each releases.

Hopefully this will help you.

Thanks,
Naoya Horiguchi


Re: [PATCH 00/10] Hwpoison soft-offline rework

2019-09-11 Thread Naoya Horiguchi
29.061366] FS:  7fc9e208d740() GS:9fa9bdb0() 
> knlGS:
>   [ 1929.063842] CS:  0010 DS:  ES:  CR0: 80050033
>   [ 1929.065429] CR2: 55946c05d192 CR3: 0001365f2000 CR4: 
> 001406e0
>   [ 1929.067373] Call Trace:
>   [ 1929.068094]  soft_offline_page+0x2be/0x600
>   [ 1929.069246]  soft_offline_page_store+0xdf/0x110
>   [ 1929.070510]  kernfs_fop_write+0x116/0x190
>   [ 1929.071618]  vfs_write+0xa5/0x1a0
>   [ 1929.072614]  ksys_write+0x59/0xd0
>   [ 1929.073548]  do_syscall_64+0x5f/0x1a0
>   [ 1929.074554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>   [ 1929.075957] RIP: 0033:0x7fc9e217ded8
> 
> It seems that soft-offlining on ksm pages is affected by this changeset.
> Could you try to handle this?
> 
> - Naoya
> 
> On Tue, Sep 10, 2019 at 12:30:06PM +0200, Oscar Salvador wrote:
> >
> > This patchset was based on Naoya's hwpoison rework [1], so thanks to him
> > for the initial work.
> >
> > This patchset aims to fix some issues lying in soft-offline handling,
> > but it also takes the chance and takes some further steps to perform
> > cleanups and some refactoring as well.
> >
> >  - Motivation:
> >
> >    A customer and I were facing an issue where poisoned pages were returned
> >    back to user-space after having offlined them properly.
> >    This was only seen under some memory stress + soft offlining pages.
> >    After some analysis, it became clear that the problem was that
> >when kcompactd kicked in to migrate pages over, compaction_alloc
> >callback was handing poisoned pages to the migrate routine.
> >    Once this page was later faulted in, __do_page_fault returned
> >    VM_FAULT_HWPOISON, causing the process to be killed.
> >
> >All this could happen because isolate_freepages_block and
> >fast_isolate_freepages just check for the page to be PageBuddy,
> >and since 1) poisoned pages can be part of a higher order page
> >and 2) poisoned pages are also Page Buddy, they can sneak in easily.
> >
> >    I also saw some problems with swap pages, but I suspected them to be the
> >    same sort of problem, so I did not follow that trace.
> >
> >    The full explanation can be seen in [2].
> >
> >  - Approach:
> >
> >The taken approach is to not let poisoned pages hit neither
> >pcplists nor buddy freelists.
> >This is achieved by:
> >
> > In-use pages:
> >
> >* Normal pages
> >
> >1) do not release the last reference count after the
> >   invalidation/migration of the page.
> >2) the page is being handed to page_set_poison, which does:
> >   2a) sets PageHWPoison flag
> >   2b) calls put_page (only to be able to call __page_cache_release)
> >   Since poisoned pages are skipped in free_pages_prepare,
> >   this put_page is safe.
> >   2c) Sets the refcount to 1
> >
> >* Hugetlb pages
> >
> >1) Hand the page to page_set_poison after migration
> >2) page_set_poison does:
> >   2a) Calls dissolve_free_huge_page
> >   2b) If ranged to be dissolved contains poisoned pages,
> >   we free the range as order-0 pages (as we do with gigantic hugetlb 
> > page),
> >   so free_pages_prepare will skip them accordingly.
> >   2c) Sets the refcount to 1
> >
> > Free pages:
> >
> >* Normal pages:
> >
> >1) Take the page off the buddy freelist
> >2) Set PageHWPoison flag and set refcount to 1
> >
> >* Hugetlb pages
> >
> >1) Try to allocate a new hugetlb page to the pool
> >2) Take off the pool the poisoned hugetlb
> >
> >
> > With this patchset, I no longer see the issues I faced before.
> >
> > Note:
> > I presented this as an RFC to open discussion of the taken approach.
> > I think that further cleanups and refactors could be made, but I would
> > like to get some insight into the taken approach before touching more
> > code.
> >
> > Thanks
> >
> > [1] 
> > https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horigu...@ah.jp.nec.com/
> > [2] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u
> >
> > Naoya Horiguchi (5):
> >   mm,hwpoison: cleanup unused PageHuge() check
> >   mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
> >   mm,hwpoison-inject: don't pin for hwpoison_filter
> >   mm,hwpoison: remove MF_COUNT_INCREASED
> >   mm: remove flag argument from soft offline functions
> &

Re: [PATCH 00/10] Hwpoison soft-offline rework

2019-09-10 Thread Naoya Horiguchi
antic hugetlb 
> page),
>   so free_pages_prepare will skip them accordingly.
>   2c) Sets the refcount to 1
>
> Free pages:
>
>* Normal pages:
>
>1) Take the page off the buddy freelist
>2) Set PageHWPoison flag and set refcount to 1
>
>* Hugetlb pages
>
>1) Try to allocate a new hugetlb page to the pool
>2) Take off the pool the poisoned hugetlb
>
>
> With this patchset, I no longer see the issues I faced before.
>
> Note:
> I presented this as an RFC to open discussion of the taken approach.
> I think that further cleanups and refactors could be made, but I would
> like to get some insight into the taken approach before touching more
> code.
>
> Thanks
>
> [1] 
> https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horigu...@ah.jp.nec.com/
> [2] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u
>
> Naoya Horiguchi (5):
>   mm,hwpoison: cleanup unused PageHuge() check
>   mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
>   mm,hwpoison-inject: don't pin for hwpoison_filter
>   mm,hwpoison: remove MF_COUNT_INCREASED
>   mm: remove flag argument from soft offline functions
>
> Oscar Salvador (5):
>   mm,hwpoison: Unify THP handling for hard and soft offline
>   mm,hwpoison: Rework soft offline for in-use pages
>   mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
>   mm,hwpoison: Rework soft offline for free pages
>   mm,hwpoison: Use hugetlb_replace_page to replace free hugetlb pages
>
>  drivers/base/memory.c  |   2 +-
>  include/linux/mm.h |   9 +-
>  include/linux/page-flags.h |   5 -
>  mm/hugetlb.c   |  51 +++-
>  mm/hwpoison-inject.c   |  18 +--
>  mm/madvise.c   |  25 ++--
>  mm/memory-failure.c| 319 
> +
>  mm/migrate.c   |  11 +-
>  mm/page_alloc.c|  62 +++--
>  9 files changed, 267 insertions(+), 235 deletions(-)
>
> --
> 2.12.3
>
>


Re: poisoned pages do not play well in the buddy allocator

2019-08-26 Thread Naoya Horiguchi
On Mon, Aug 26, 2019 at 12:41:50PM +0200, Oscar Salvador wrote:
> Hi,
> 
> When analyzing a problem reported by one of our customers, I stumbled upon
> an issue that originates from the fact that poisoned pages end up in the
> buddy allocator.
> 
> Let me break down the steps that lead to the problem:
> 
> 1) We soft-offline a page
> 2) Page gets flagged as HWPoison and is being sent to the buddy allocator.
>This is done through set_hwpoison_free_buddy_page().
> 3) Kcompactd wakes up in order to perform some compaction.
> 4) compact_zone() will call migrate_pages()
> 5) migrate_pages() will try to get a new page from compaction_alloc() to 
> migrate to
> 6) if cc->freelist is empty, compaction_alloc() will call
> isolate_freepages_block()
> 7) isolate_freepages_block() only checks for PageBuddy() to assume that a page
>is OK to be used to migrate to. Since HWPoisoned pages are also PageBuddy,
>we add the page to the list. (The same problem exists in
>fast_isolate_freepages().)
> 
> The outcome of that is that we end up happily handing out poisoned pages
> from compaction_alloc(), so if we ever get a fault on such a page through
> *_fault, we will return VM_FAULT_HWPOISON, and the process will be killed.
> 
> I first thought that I could get away with it by checking PageHWPoison in
> {fast_isolate_freepages/isolate_freepages_block}, but not really.
> It might be that the page we are checking is an order > 0 page, so the first
> page might not be poisoned, but the one that follows might be, and we end up
> in the same situation.

Yes, this is the whole point of the current implementation.
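
To illustrate with a sketch (not a proposed fix): a naive check in the
isolation path only looks at the buddy head page, so a poisoned tail page
of a high-order free block still slips through:

	/* sketch only: an insufficient check inside isolate_freepages_block() */
	if (!PageBuddy(page))
		goto isolate_fail;

	/* this only catches poison on the head page ... */
	if (PageHWPoison(page))
		goto isolate_fail;

	/*
	 * ... but for an order > 0 buddy block any of the tail pages may
	 * be the poisoned one, so the whole block can still be handed out
	 * via compaction_alloc().
	 */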

> 
> After some more thought, I really came to the conclusion that HWPoison pages 
> should not
> really be in the buddy allocator, as this is only asking for problems.
> In this case it is only compaction code, but it could be happening somewhere 
> else,
> and one would expect that the pages you got from the buddy allocator are 
> __ready__ to use.
> 
> I __think__ that we thought we were safe to put HWPoison pages in the buddy 
> allocator as we
> perform health checks when getting a page from there, so we skip poisoned 
> pages
> 
> Of course, this is not the end of the story, now that someone got a page, if 
> he frees it,
> there is a high chance that this page ends up in a pcplist (I saw that).
> Unless we are on CONFIG_DEBUG_VM, we do not check the health of pages taken
> from the pcplist, as we do when getting a page from the buddy allocator.
> 
> I checked [1], and it seems that [2] was going towards fixing this kind of 
> issue.
> 
> I think it is about time to revamp the whole thing.
> 
> @Naoya: I could give it a try if you are busy.

Thanks for raising your hand. That's really wonderful. I think that the series
[1] is not merged yet, but not rejected either, so feel free to
reuse/update/revamp it.

> 
> [1] 
> https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horigu...@ah.jp.nec.com/
> [2] 
> https://lore.kernel.org/linux-mm/1541746035-13408-9-git-send-email-n-horigu...@ah.jp.nec.com/
> 
> -- 
> Oscar Salvador
> SUSE L3
> 


Re: ##freemail## Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage

2019-08-20 Thread Naoya Horiguchi
On Tue, Aug 20, 2019 at 03:03:55PM +0800, Wanpeng Li wrote:
> Cc Mel Gorman, Kirill, Dave Hansen,
> On Tue, 11 Jun 2019 at 07:51, Naoya Horiguchi  
> wrote:
> >
> > On Wed, May 29, 2019 at 04:31:01PM -0700, Mike Kravetz wrote:
> > > On 5/28/19 2:49 AM, Wanpeng Li wrote:
> > > > Cc Paolo,
> > > > Hi all,
> > > > On Wed, 14 Feb 2018 at 06:34, Mike Kravetz  
> > > > wrote:
> > > >>
> > > >> On 02/12/2018 06:48 PM, Michael Ellerman wrote:
> > > >>> Andrew Morton  writes:
> > > >>>
> > > >>>> On Thu, 08 Feb 2018 12:30:45 + Punit Agrawal 
> > > >>>>  wrote:
> > > >>>>
> > > >>>>>>
> > > >>>>>> So I don't think that the above test result means that errors are 
> > > >>>>>> properly
> > > >>>>>> handled, and the proposed patch should help for arm64.
> > > >>>>>
> > > >>>>> Although, the deviation of pud_huge() avoids a kernel crash the code
> > > >>>>> would be easier to maintain and reason about if arm64 helpers are
> > > >>>>> consistent with expectations by core code.
> > > >>>>>
> > > >>>>> I'll look to update the arm64 helpers once this patch gets merged. 
> > > >>>>> But
> > > >>>>> it would be helpful if there was a clear expression of semantics for
> > > >>>>> pud_huge() for various cases. Is there any version that can be used 
> > > >>>>> as
> > > >>>>> reference?
> > > >>>>
> > > >>>> Is that an ack or tested-by?
> > > >>>>
> > > >>>> Mike keeps plaintively asking the powerpc developers to take a look,
> > > >>>> but they remain steadfastly in hiding.
> > > >>>
> > > >>> Cc'ing linuxppc-dev is always a good idea :)
> > > >>>
> > > >>
> > > >> Thanks Michael,
> > > >>
> > > >> I was mostly concerned about use cases for soft/hard offline of huge 
> > > >> pages
> > > >> larger than PMD_SIZE on powerpc.  I know that powerpc supports PGD_SIZE
> > > >> huge pages, and soft/hard offline support was specifically added for 
> > > >> this.
> > > >> See, 94310cbcaa3c "mm/madvise: enable (soft|hard) offline of HugeTLB 
> > > >> pages
> > > >> at PGD level"
> > > >>
> > > >> This patch will disable that functionality.  So, at a minimum this is a
> > > >> 'heads up'.  If there are actual use cases that depend on this, then 
> > > >> more
> > > >> work/discussions will need to happen.  From the e-mail thread on 
> > > >> PGD_SIZE
> > > >> support, I can not tell if there is a real use case or this is just a
> > > >> 'nice to have'.
> > > >
> > > > 1GB hugetlbfs pages are used by DPDK and VMs in cloud deployment, we
> > > > encounter gup_pud_range() panic several times in product environment.
> > > > Is there any plan to reenable and fix arch codes?
> > >
> > > I too am aware of slightly more interest in 1G huge pages.  Suspect that 
> > > as
> > > Intel MMU capacity increases to handle more TLB entries there will be more
> > > and more interest.
> > >
> > > Personally, I am not looking at this issue.  Perhaps Naoya will comment as
> > > he know most about this code.
> >
> > Thanks for forwarding this to me, I'm feeling that memory error handling
> > on 1GB hugepages is demanded as a real use case.
> >
> > >
> > > > In addition, 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/mmu.c#n3213
> > > > The memory in guest can be 1GB/2MB/4K, though the host-backed memory
> > > > are 1GB hugetlbfs pages, after above PUD panic is fixed,
> > > > try_to_unmap() which is called in MCA recovery path will mark the PUD
> > > > hwpoison entry. The guest will vmexit and retry endlessly when
> > > > accessing any memory in the guest which is backed by this 1GB poisoned
> > > > hugetlbfs page. We have a plan to split this 1GB hugetblfs page by 2MB
> > > > hugetlbfs pages/4KB pages, maybe file remap to a virtual address 

Re: [PATCH] hugetlbfs: fix hugetlb page migration/fault race causing SIGBUS

2019-08-07 Thread Naoya Horiguchi
On Wed, Aug 07, 2019 at 05:05:33PM -0700, Mike Kravetz wrote:
> Li Wang discovered that LTP/move_page12 V2 sometimes triggers SIGBUS
> in the kernel-v5.2.3 testing.  This is caused by a race between hugetlb
> page migration and page fault.
> 
> If a hugetlb page can not be allocated to satisfy a page fault, the task
> is sent SIGBUS.  This is normal hugetlbfs behavior.  A hugetlb fault
> mutex exists to prevent two tasks from trying to instantiate the same
> page.  This protects against the situation where there is only one
> hugetlb page, and both tasks would try to allocate.  Without the mutex,
> one would fail and SIGBUS even though the other fault would be successful.
> 
> There is a similar race between hugetlb page migration and fault.
> Migration code will allocate a page for the target of the migration.
> It will then unmap the original page from all page tables.  It does
> this unmap by first clearing the pte and then writing a migration
> entry.  The page table lock is held for the duration of this clear and
> write operation.  However, the beginnings of the hugetlb page fault
> code optimistically checks the pte without taking the page table lock.
> If clear (as it can be during the migration unmap operation), a hugetlb
> page allocation is attempted to satisfy the fault.  Note that the page
> which will eventually satisfy this fault was already allocated by the
> migration code.  However, the allocation within the fault path could
> fail which would result in the task incorrectly being sent SIGBUS.
> 
> Ideally, we could take the hugetlb fault mutex in the migration code
> when modifying the page tables.  However, locks must be taken in the
> order of hugetlb fault mutex, page lock, page table lock.  This would
> require significant rework of the migration code.  Instead, the issue
> is addressed in the hugetlb fault code.  After failing to allocate a
> huge page, take the page table lock and check for huge_pte_none before
> returning an error.  This is the same check that must be made further
> in the code even if page allocation is successful.
> 
> Reported-by: Li Wang 
> Fixes: 290408d4a250 ("hugetlb: hugepage migration core")
> Signed-off-by: Mike Kravetz 
> Tested-by: Li Wang 

Thanks for the work and nice description.

Reviewed-by: Naoya Horiguchi 

> ---
>  mm/hugetlb.c | 19 +++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ede7e7f5d1ab..6d7296dd11b8 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3856,6 +3856,25 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  
>   page = alloc_huge_page(vma, haddr, 0);
>   if (IS_ERR(page)) {
> + /*
> +  * Returning error will result in faulting task being
> +  * sent SIGBUS.  The hugetlb fault mutex prevents two
> +  * tasks from racing to fault in the same page which
> +  * could result in false unable to allocate errors.
> +  * Page migration does not take the fault mutex, but
> +  * does a clear then write of pte's under page table
> +  * lock.  Page fault code could race with migration,
> +  * notice the clear pte and try to allocate a page
> +  * here.  Before returning error, get ptl and make
> +  * sure there really is no pte entry.
> +  */
> + ptl = huge_pte_lock(h, mm, ptep);
> + if (!huge_pte_none(huge_ptep_get(ptep))) {
> + ret = 0;
> + spin_unlock(ptl);
> + goto out;
> + }
> + spin_unlock(ptl);
>   ret = vmf_error(PTR_ERR(page));
>   goto out;
>   }
> -- 
> 2.20.1
> 
> 


Re: [PATCH v3 1/2] mm/memory-failure.c clean up around tk pre-allocation

2019-08-01 Thread Naoya Horiguchi
On Thu, Jul 25, 2019 at 04:01:40PM -0600, Jane Chu wrote:
> add_to_kill() expects the first 'tk' to be pre-allocated, it makes
> subsequent allocations on need basis, this makes the code a bit
> difficult to read. Move all the allocation internal to add_to_kill()
> and drop the **tk argument.
> 
> Signed-off-by: Jane Chu 

Acked-by: Naoya Horiguchi 

# somehow I sent 2 acks to 2/2, sorry about the noise.


Re: [PATCH v3 2/2] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS if mmaped more than once

2019-08-01 Thread Naoya Horiguchi
On Thu, Jul 25, 2019 at 04:01:41PM -0600, Jane Chu wrote:
> Mmap /dev/dax more than once, then read the poison location using address
> from one of the mappings. The other mappings due to not having the page
> mapped in will cause SIGKILLs delivered to the process. SIGKILL succeeds
> over SIGBUS, so the user process loses the opportunity to handle the UE.
> 
> Although one may add MAP_POPULATE to mmap(2) to work around the issue,
> MAP_POPULATE makes mapping 128GB of pmem several magnitudes slower, so
> isn't always an option.
> 
> Details -
> 
> ndctl inject-error --block=10 --count=1 namespace6.0
> 
> ./read_poison -x dax6.0 -o 5120 -m 2
> mmaped address 0x7f5bb660
> mmaped address 0x7f3cf360
> doing local read at address 0x7f3cf3601400
> Killed
> 
> Console messages in instrumented kernel -
> 
> mce: Uncorrected hardware memory error in user-access at edbe201400
> Memory failure: tk->addr = 7f5bb6601000
> Memory failure: address edbe201: call dev_pagemap_mapping_shift
> dev_pagemap_mapping_shift: page edbe201: no PUD
> Memory failure: tk->size_shift == 0
> Memory failure: Unable to find user space address edbe201 in read_poison
> Memory failure: tk->addr = 7f3cf3601000
> Memory failure: address edbe201: call dev_pagemap_mapping_shift
> Memory failure: tk->size_shift = 21
> Memory failure: 0xedbe201: forcibly killing read_poison:22434 because of 
> failure to unmap corrupted page
>   => to deliver SIGKILL
> Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory 
> corruption
>   => to deliver SIGBUS
> 
> Signed-off-by: Jane Chu 
> Suggested-by: Naoya Horiguchi 

Thanks for the fix.

Acked-by: Naoya Horiguchi 


Re: [PATCH v3 2/2] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS if mmaped more than once

2019-08-01 Thread Naoya Horiguchi
On Thu, Jul 25, 2019 at 04:01:41PM -0600, Jane Chu wrote:
> Mmap /dev/dax more than once, then read the poison location using address
> from one of the mappings. The other mappings due to not having the page
> mapped in will cause SIGKILLs delivered to the process. SIGKILL succeeds
> over SIGBUS, so the user process loses the opportunity to handle the UE.
> 
> Although one may add MAP_POPULATE to mmap(2) to work around the issue,
> MAP_POPULATE makes mapping 128GB of pmem several magnitudes slower, so
> isn't always an option.
> 
> Details -
> 
> ndctl inject-error --block=10 --count=1 namespace6.0
> 
> ./read_poison -x dax6.0 -o 5120 -m 2
> mmaped address 0x7f5bb660
> mmaped address 0x7f3cf360
> doing local read at address 0x7f3cf3601400
> Killed
> 
> Console messages in instrumented kernel -
> 
> mce: Uncorrected hardware memory error in user-access at edbe201400
> Memory failure: tk->addr = 7f5bb6601000
> Memory failure: address edbe201: call dev_pagemap_mapping_shift
> dev_pagemap_mapping_shift: page edbe201: no PUD
> Memory failure: tk->size_shift == 0
> Memory failure: Unable to find user space address edbe201 in read_poison
> Memory failure: tk->addr = 7f3cf3601000
> Memory failure: address edbe201: call dev_pagemap_mapping_shift
> Memory failure: tk->size_shift = 21
> Memory failure: 0xedbe201: forcibly killing read_poison:22434 because of 
> failure to unmap corrupted page
>   => to deliver SIGKILL
> Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory 
> corruption
>   => to deliver SIGBUS
> 
> Signed-off-by: Jane Chu 
> Suggested-by: Naoya Horiguchi 
> ---
>  mm/memory-failure.c | 22 +-
>  1 file changed, 13 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 51d5b20..f668c88 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -199,7 +199,6 @@ struct to_kill {
>   struct task_struct *tsk;
>   unsigned long addr;
>   short size_shift;
> - char addr_valid;
>  };
>  
>  /*
> @@ -318,22 +317,27 @@ static void add_to_kill(struct task_struct *tsk, struct 
> page *p,
>   }
>  
>   tk->addr = page_address_in_vma(p, vma);
> - tk->addr_valid = 1;
>   if (is_zone_device_page(p))
>   tk->size_shift = dev_pagemap_mapping_shift(p, vma);
>   else
>   tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
>  
>   /*
> -  * In theory we don't have to kill when the page was
> -  * munmaped. But it could be also a mremap. Since that's
> -  * likely very rare kill anyways just out of paranoia, but use
> -  * a SIGKILL because the error is not contained anymore.
> +  * Send SIGKILL if "tk->addr == -EFAULT". Also, as
> +  * "tk->size_shift" is always non-zero for !is_zone_device_page(),
> +  * so "tk->size_shift == 0" effectively checks no mapping on
> +  * ZONE_DEVICE. Indeed, when a devdax page is mmapped N times
> +  * to a process' address space, it's possible not all N VMAs
> +  * contain mappings for the page, but at least one VMA does.
> +  * Only deliver SIGBUS with payload derived from the VMA that
> +  * has a mapping for the page.
>*/
> - if (tk->addr == -EFAULT || tk->size_shift == 0) {
> + if (tk->addr == -EFAULT) { 
  ^
(sorry nitpicking...) there's a trailing whitespace.
Otherwise looks good to me.

Acked-by: Naoya Horiguchi 

>   pr_info("Memory failure: Unable to find user space address %lx 
> in %s\n",
>   page_to_pfn(p), tsk->comm);
> - tk->addr_valid = 0;
> + } else if (tk->size_shift == 0) {
> + kfree(tk);
> + return;
>   }
>  
>   get_task_struct(tsk);
> @@ -361,7 +365,7 @@ static void kill_procs(struct list_head *to_kill, int 
> forcekill, bool fail,
>* make sure the process doesn't catch the
>* signal and then access the memory. Just kill it.
>*/
> - if (fail || tk->addr_valid == 0) {
> + if (fail || tk->addr == -EFAULT) {
>   pr_err("Memory failure: %#lx: forcibly killing 
> %s:%d because of failure to unmap corrupted page\n",
>  pfn, tk->tsk->comm, tk->tsk->pid);
>   do_send_sig_info(SIGKILL, SEND_SIG_PRIV,
> -- 
> 1.8.3.1
> 
> 


Re: [PATCH v2 1/1] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS if mmaped more than once

2019-07-24 Thread Naoya Horiguchi
On Wed, Jul 24, 2019 at 04:33:23PM -0600, Jane Chu wrote:
> Mmap /dev/dax more than once, then read the poison location using address
> from one of the mappings. The other mappings due to not having the page
> mapped in will cause SIGKILLs delivered to the process. SIGKILL succeeds
> over SIGBUS, so the user process loses the opportunity to handle the UE.
> 
> Although one may add MAP_POPULATE to mmap(2) to work around the issue,
> MAP_POPULATE makes mapping 128GB of pmem several magnitudes slower, so
> isn't always an option.
> 
> Details -
> 
> ndctl inject-error --block=10 --count=1 namespace6.0
> 
> ./read_poison -x dax6.0 -o 5120 -m 2
> mmaped address 0x7f5bb660
> mmaped address 0x7f3cf360
> doing local read at address 0x7f3cf3601400
> Killed
> 
> Console messages in instrumented kernel -
> 
> mce: Uncorrected hardware memory error in user-access at edbe201400
> Memory failure: tk->addr = 7f5bb6601000
> Memory failure: address edbe201: call dev_pagemap_mapping_shift
> dev_pagemap_mapping_shift: page edbe201: no PUD
> Memory failure: tk->size_shift == 0
> Memory failure: Unable to find user space address edbe201 in read_poison
> Memory failure: tk->addr = 7f3cf3601000
> Memory failure: address edbe201: call dev_pagemap_mapping_shift
> Memory failure: tk->size_shift = 21
> Memory failure: 0xedbe201: forcibly killing read_poison:22434 because of 
> failure to unmap corrupted page
>   => to deliver SIGKILL
> Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory 
> corruption
>   => to deliver SIGBUS
> 
> Signed-off-by: Jane Chu 
> Suggested-by: Naoya Horiguchi 
> ---
>  mm/memory-failure.c | 62 
> ++---
>  1 file changed, 26 insertions(+), 36 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index d9cc660..bd4db33 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -199,7 +199,6 @@ struct to_kill {
>   struct task_struct *tsk;
>   unsigned long addr;
>   short size_shift;
> - char addr_valid;
>  };
>  
>  /*
> @@ -304,43 +303,43 @@ static unsigned long dev_pagemap_mapping_shift(struct 
> page *page,
>  /*
>   * Schedule a process for later kill.
>   * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
> - * TBD would GFP_NOIO be enough?
>   */
>  static void add_to_kill(struct task_struct *tsk, struct page *p,
>  struct vm_area_struct *vma,
> -struct list_head *to_kill,
> -struct to_kill **tkc)
> +struct list_head *to_kill)
>  {
>   struct to_kill *tk;
>  
> - if (*tkc) {
> - tk = *tkc;
> - *tkc = NULL;
> - } else {
> - tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
> - if (!tk) {
> - pr_err("Memory failure: Out of memory while machine 
> check handling\n");
> - return;
> - }
> + tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
> + if (!tk) {
> + pr_err("Memory failure: Out of memory while machine check 
> handling\n");
> + return;

As Dan pointed out, the cleanup part can be delivered as a separate patch.

>   }
> +
>   tk->addr = page_address_in_vma(p, vma);
> - tk->addr_valid = 1;
>   if (is_zone_device_page(p))
>   tk->size_shift = dev_pagemap_mapping_shift(p, vma);
>   else
>   tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
>  
>   /*
> -  * In theory we don't have to kill when the page was
> -  * munmaped. But it could be also a mremap. Since that's
> -  * likely very rare kill anyways just out of paranoia, but use
> -  * a SIGKILL because the error is not contained anymore.
> +  * Send SIGKILL if "tk->addr == -EFAULT". Also, as
> +  * "tk->size_shift" is always non-zero for !is_zone_device_page(),
> +  * so "tk->size_shift == 0" effectively checks no mapping on
> +  * ZONE_DEVICE. Indeed, when a devdax page is mmapped N times
> +  * to a process' address space, it's possible not all N VMAs
> +  * contain mappings for the page, but at least one VMA does.
> +  * Only deliver SIGBUS with payload derived from the VMA that
> +  * has a mapping for the page.

OK, so SIGBUSs are sent M times (where M is the number of mappings
for the page). Then I'm convinced that we need the "else if" block below.

Thanks,
Naoya Horiguchi

>*/
> - if (tk->addr == -EFAULT || tk->size_shift == 0) {
> 

Re: [PATCH] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS if mmaped more than once

2019-07-24 Thread Naoya Horiguchi
  short size_shift;
  - char addr_valid;
   };
   
   /*
  @@ -324,7 +323,6 @@ static void add_to_kill(struct task_struct *tsk, struct 
page *p,
}
}
tk->addr = page_address_in_vma(p, vma);
  - tk->addr_valid = 1;
if (is_zone_device_page(p))
tk->size_shift = dev_pagemap_mapping_shift(p, vma);
else
  @@ -336,11 +334,9 @@ static void add_to_kill(struct task_struct *tsk, struct 
page *p,
 * likely very rare kill anyways just out of paranoia, but use
 * a SIGKILL because the error is not contained anymore.
 */
  - if (tk->addr == -EFAULT || tk->size_shift == 0) {
  + if (tk->addr == -EFAULT)
pr_info("Memory failure: Unable to find user space address %lx 
in %s\n",
page_to_pfn(p), tsk->comm);
  - tk->addr_valid = 0;
  - }
get_task_struct(tsk);
tk->tsk = tsk;
	list_add_tail(&tk->nd, to_kill);
  @@ -366,7 +362,7 @@ static void kill_procs(struct list_head *to_kill, int 
forcekill, bool fail,
 * make sure the process doesn't catch the
 * signal and then access the memory. Just kill it.
 */
  - if (fail || tk->addr_valid == 0) {
  + if (fail || tk->addr == -EFAULT) {
pr_err("Memory failure: %#lx: forcibly killing 
%s:%d because of failure to unmap corrupted page\n",
   pfn, tk->tsk->comm, tk->tsk->pid);
        do_send_sig_info(SIGKILL, SEND_SIG_PRIV,

> > }
> > +   if (tk == *tkc)
> > +   *tkc = NULL;
> > get_task_struct(tsk);
> > tk->tsk = tsk;
> > 	list_add_tail(&tk->nd, to_kill);
> 
> 
> Concept and policy looks good to me, and I never did understand what
> the mremap() case was trying to protect against.
> 
> The patch is a bit difficult to read (not your fault) because of the
> odd way that add_to_kill() expects the first 'tk' to be pre-allocated.
> May I ask for a lead-in cleanup that moves all the allocation internal
> to add_to_kill() and drops the **tk argument?

I totally agree with this cleanup. Thanks for the comment.

Thanks,
Naoya Horiguchi


[PATCH v3 1/2] mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails

2019-06-17 Thread Naoya Horiguchi
The pass/fail of soft offline should be judged by checking whether the
raw error page was finally contained or not (i.e. the result of
set_hwpoison_free_buddy_page()), but the current code does not work like that.
So this patch suggests fixing it.

Without this fix, there are cases where madvise(MADV_SOFT_OFFLINE) may
not offline the original page and will not return an error.  It might
lead us to misjudge the test result when set_hwpoison_free_buddy_page()
actually fails.

Signed-off-by: Naoya Horiguchi 
Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining")
Cc:  # v4.19+
---
ChangeLog v2->v3:
- update patch description to clarify user visible change
---
 mm/memory-failure.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git v5.2-rc4/mm/memory-failure.c v5.2-rc4_patched/mm/memory-failure.c
index 8da0334..8ee7b16 100644
--- v5.2-rc4/mm/memory-failure.c
+++ v5.2-rc4_patched/mm/memory-failure.c
@@ -1730,6 +1730,8 @@ static int soft_offline_huge_page(struct page *page, int 
flags)
if (!ret) {
if (set_hwpoison_free_buddy_page(page))
num_poisoned_pages_inc();
+   else
+   ret = -EBUSY;
}
}
return ret;
-- 
2.7.0



[PATCH v3 0/2] fix return value issue of soft offlining hugepages

2019-06-17 Thread Naoya Horiguchi
Hi everyone,

This is v3 of the fix of return value issue of hugepage soft-offlining
(v2: https://lkml.org/lkml/2019/6/10/156).
Straightforwardly applied feedbacks on v2.

Thanks,
Naoya Horiguchi


[PATCH v3 2/2] mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge

2019-06-17 Thread Naoya Horiguchi
madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft offline
for hugepages with overcommitting enabled. That was caused by the suboptimal
code in current soft-offline code. See the following part:

ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
MIGRATE_SYNC, MR_MEMORY_FAILURE);
if (ret) {
...
} else {
/*
 * We set PG_hwpoison only when the migration source hugepage
 * was successfully dissolved, because otherwise hwpoisoned
 * hugepage remains on free hugepage list, then userspace will
 * find it as SIGBUS by allocation failure. That's not expected
 * in soft-offlining.
 */
ret = dissolve_free_huge_page(page);
if (!ret) {
if (set_hwpoison_free_buddy_page(page))
num_poisoned_pages_inc();
}
}
return ret;

Here dissolve_free_huge_page() returns -EBUSY if the migration source page
was freed into buddy in migrate_pages(), but even in that case we actually
have a chance that set_hwpoison_free_buddy_page() succeeds. So that means
the current code gives up offlining too early now.

dissolve_free_huge_page() checks that a given hugepage is suitable for
dissolving, where we should return success for !PageHuge() case because
the given hugepage is considered as already dissolved.

This change also affects other callers of dissolve_free_huge_page(),
which are cleaned up together.

Reported-by: Chen, Jerry T 
Tested-by: Chen, Jerry T 
Signed-off-by: Naoya Horiguchi 
Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining")
Cc:  # v4.19+
---
ChangeLog v2->v3:
- add PageHuge check in dissolve_free_huge_page() outside hugetlb_lock
- update comment on dissolve_free_huge_page() about return value
---
 mm/hugetlb.c| 29 -
 mm/memory-failure.c |  5 +
 2 files changed, 21 insertions(+), 13 deletions(-)

diff --git v5.2-rc4/mm/hugetlb.c v5.2-rc4_patched/mm/hugetlb.c
index ac843d3..ede7e7f 100644
--- v5.2-rc4/mm/hugetlb.c
+++ v5.2-rc4_patched/mm/hugetlb.c
@@ -1510,16 +1510,29 @@ static int free_pool_huge_page(struct hstate *h, 
nodemask_t *nodes_allowed,
 
 /*
  * Dissolve a given free hugepage into free buddy pages. This function does
- * nothing for in-use (including surplus) hugepages. Returns -EBUSY if the
- * dissolution fails because a give page is not a free hugepage, or because
- * free hugepages are fully reserved.
+ * nothing for in-use hugepages and non-hugepages.
+ * This function returns values like below:
+ *
+ *  -EBUSY: failed to dissolve free hugepages or the hugepage is in-use
+ *  (allocated or reserved.)
+ *   0: successfully dissolved free hugepages or the page is not a
+ *  hugepage (considered as already dissolved)
  */
 int dissolve_free_huge_page(struct page *page)
 {
int rc = -EBUSY;
 
+   /* Not to disrupt normal path by vainly holding hugetlb_lock */
+   if (!PageHuge(page))
+   return 0;
+
	spin_lock(&hugetlb_lock);
-   if (PageHuge(page) && !page_count(page)) {
+   if (!PageHuge(page)) {
+   rc = 0;
+   goto out;
+   }
+
+   if (!page_count(page)) {
struct page *head = compound_head(page);
struct hstate *h = page_hstate(head);
int nid = page_to_nid(head);
@@ -1564,11 +1577,9 @@ int dissolve_free_huge_pages(unsigned long start_pfn, 
unsigned long end_pfn)
 
for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order) {
page = pfn_to_page(pfn);
-   if (PageHuge(page) && !page_count(page)) {
-   rc = dissolve_free_huge_page(page);
-   if (rc)
-   break;
-   }
+   rc = dissolve_free_huge_page(page);
+   if (rc)
+   break;
}
 
return rc;
diff --git v5.2-rc4/mm/memory-failure.c v5.2-rc4_patched/mm/memory-failure.c
index 8ee7b16..d9cc660 100644
--- v5.2-rc4/mm/memory-failure.c
+++ v5.2-rc4_patched/mm/memory-failure.c
@@ -1856,11 +1856,8 @@ static int soft_offline_in_use_page(struct page *page, 
int flags)
 
 static int soft_offline_free_page(struct page *page)
 {
-   int rc = 0;
-   struct page *head = compound_head(page);
+   int rc = dissolve_free_huge_page(page);
 
-   if (PageHuge(head))
-   rc = dissolve_free_huge_page(page);
if (!rc) {
if (set_hwpoison_free_buddy_page(page))
num_poisoned_pages_inc();
-- 
2.7.0



Re: [PATCH v2 2/2] mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge

2019-06-12 Thread Naoya Horiguchi
On Tue, Jun 11, 2019 at 03:20:26PM +0530, Anshuman Khandual wrote:
> 
> On 06/10/2019 01:48 PM, Naoya Horiguchi wrote:
> > madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft offline
> > for hugepages with overcommitting enabled. That was caused by the suboptimal
> > code in current soft-offline code. See the following part:
> > 
> > ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
> > MIGRATE_SYNC, MR_MEMORY_FAILURE);
> > if (ret) {
> > ...
> > } else {
> > /*
> >  * We set PG_hwpoison only when the migration source hugepage
> >  * was successfully dissolved, because otherwise hwpoisoned
> >  * hugepage remains on free hugepage list, then userspace will
> >  * find it as SIGBUS by allocation failure. That's not expected
> >  * in soft-offlining.
> >  */
> > ret = dissolve_free_huge_page(page);
> > if (!ret) {
> > if (set_hwpoison_free_buddy_page(page))
> > num_poisoned_pages_inc();
> > }
> > }
> > return ret;
> > 
> > Here dissolve_free_huge_page() returns -EBUSY if the migration source page
> > was freed into buddy in migrate_pages(), but even in that case we actually
> 
> Over committed source pages will be released into buddy and the normal ones
> will not be ? dissolve_free_huge_page() returns -EBUSY because PageHuge()
> return negative on already released pages ? 

The answers for both questions here are yes.

> How dissolve_free_huge_page()
> will behave differently with over committed pages. I might be missing some
> recent developments here.

This dissolve_free_huge_page() should see a (free or reused) 4kB page when
overcommitting, and should see a (free or reused) huge page for non
overcommitting case.

> 
> > have a chance that set_hwpoison_free_buddy_page() succeeds. So that means
> > the current code gives up offlining too early now.
> 
> Hmm. It gives up early as the return value from dissolve_free_huge_page(EBUSY)
> gets back as the return code for soft_offline_huge_page() without attempting
> set_hwpoison_free_buddy_page() which still has a chance to succeed for freed
> normal buddy pages.

Exactly.

> 
> > 
> > dissolve_free_huge_page() checks that a given hugepage is suitable for
> > dissolving, where we should return success for !PageHuge() case because
> > the given hugepage is considered as already dissolved.
> 
> Right. It should return 0 (as a success) for freed normal buddy pages.
> Shouldn't it then check explicitly for PageBuddy() as well?

in new semantics, dissolve_free_huge_page() returns:

  0: successfully dissolved free hugepages or the page is already dissolved
  EBUSY: failed to dissolve free hugepages or the hugepage is in-use.

so for any type of non-hugepage, the return value is 0.

Thanks,
- Naoya 

> > 
> > This change also affects other callers of dissolve_free_huge_page(),
> > which are cleaned up together.
> > 
> > Reported-by: Chen, Jerry T 
> > Tested-by: Chen, Jerry T 
> > Signed-off-by: Naoya Horiguchi 
> > Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining")
> > Cc:  # v4.19+
> > ---
> >  mm/hugetlb.c| 15 +--
> >  mm/memory-failure.c |  5 +
> >  2 files changed, 10 insertions(+), 10 deletions(-)
> > 
> > diff --git v5.2-rc3/mm/hugetlb.c v5.2-rc3_patched/mm/hugetlb.c
> > index ac843d3..048d071 100644
> > --- v5.2-rc3/mm/hugetlb.c
> > +++ v5.2-rc3_patched/mm/hugetlb.c
> > @@ -1519,7 +1519,12 @@ int dissolve_free_huge_page(struct page *page)
> > int rc = -EBUSY;
> >  
> > 	spin_lock(&hugetlb_lock);
> > -   if (PageHuge(page) && !page_count(page)) {
> > +   if (!PageHuge(page)) {
> > +   rc = 0;
> > +   goto out;
> > +   }
> 
> With this early bail out it maintains the functionality when called from
> soft_offline_free_page() for normal pages. For huge page, it continues
> on the previous path.
> 
> > +
> > +   if (!page_count(page)) {
> > struct page *head = compound_head(page);
> > struct hstate *h = page_hstate(head);
> > int nid = page_to_nid(head);
> > @@ -1564,11 +1569,9 @@ int dissolve_free_huge_pages(unsigned long 
> > start_pfn, unsigned long end_pfn)
> >  
> > for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order) {
> > page = pfn_to_page(pfn);
> > -   if (PageHuge(page) && !page_count(page)) {
> 
> Right. These checks are now redundant.
> 


Re: [PATCH v2 2/2] mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge

2019-06-12 Thread Naoya Horiguchi
On Tue, Jun 11, 2019 at 10:16:03AM -0700, Mike Kravetz wrote:
> On 6/10/19 1:18 AM, Naoya Horiguchi wrote:
> > madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft offline
> > for hugepages with overcommitting enabled. That was caused by the suboptimal
> > code in current soft-offline code. See the following part:
> > 
> > ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
> > MIGRATE_SYNC, MR_MEMORY_FAILURE);
> > if (ret) {
> > ...
> > } else {
> > /*
> >  * We set PG_hwpoison only when the migration source hugepage
> >  * was successfully dissolved, because otherwise hwpoisoned
> >  * hugepage remains on free hugepage list, then userspace will
> >  * find it as SIGBUS by allocation failure. That's not expected
> >  * in soft-offlining.
> >  */
> > ret = dissolve_free_huge_page(page);
> > if (!ret) {
> > if (set_hwpoison_free_buddy_page(page))
> > num_poisoned_pages_inc();
> > }
> > }
> > return ret;
> > 
> > Here dissolve_free_huge_page() returns -EBUSY if the migration source page
> > was freed into buddy in migrate_pages(), but even in that case we actually
> > have a chance that set_hwpoison_free_buddy_page() succeeds. So that means
> > the current code gives up offlining too early now.
> > 
> > dissolve_free_huge_page() checks that a given hugepage is suitable for
> > dissolving, where we should return success for !PageHuge() case because
> > the given hugepage is considered as already dissolved.
> > 
> > This change also affects other callers of dissolve_free_huge_page(),
> > which are cleaned up together.
> > 
> > Reported-by: Chen, Jerry T 
> > Tested-by: Chen, Jerry T 
> > Signed-off-by: Naoya Horiguchi 
> > Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining")
> > Cc:  # v4.19+
> > ---
> >  mm/hugetlb.c| 15 +--
> >  mm/memory-failure.c |  5 +
> >  2 files changed, 10 insertions(+), 10 deletions(-)
> > 
> > diff --git v5.2-rc3/mm/hugetlb.c v5.2-rc3_patched/mm/hugetlb.c
> > index ac843d3..048d071 100644
> > --- v5.2-rc3/mm/hugetlb.c
> > +++ v5.2-rc3_patched/mm/hugetlb.c
> > @@ -1519,7 +1519,12 @@ int dissolve_free_huge_page(struct page *page)
> 
> Please update the function description for dissolve_free_huge_page() as
> well.  It currently says, "Returns -EBUSY if the dissolution fails because
> a give page is not a free hugepage" which is no longer true as a result of
> this change.

Thanks for pointing out, I completely missed that.

> 
> > int rc = -EBUSY;
> >  
> > 	spin_lock(&hugetlb_lock);
> > -   if (PageHuge(page) && !page_count(page)) {
> > +   if (!PageHuge(page)) {
> > +   rc = 0;
> > +   goto out;
> > +   }
> > +
> > +   if (!page_count(page)) {
> > struct page *head = compound_head(page);
> > struct hstate *h = page_hstate(head);
> > int nid = page_to_nid(head);
> > @@ -1564,11 +1569,9 @@ int dissolve_free_huge_pages(unsigned long 
> > start_pfn, unsigned long end_pfn)
> >  
> > for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order) {
> > page = pfn_to_page(pfn);
> > -   if (PageHuge(page) && !page_count(page)) {
> > -   rc = dissolve_free_huge_page(page);
> > -   if (rc)
> > -   break;
> > -   }
> 
> We may want to consider keeping at least the PageHuge(page) check before
> calling dissolve_free_huge_page().  dissolve_free_huge_pages is called as
> part of memory offline processing.  We do not know if the memory to be 
> offlined
> contains huge pages or not.  With your changes, we are taking hugetlb_lock
> on each call to dissolve_free_huge_page just to discover that the page is
> not a huge page.
> 
> You 'could' add a PageHuge(page) check to dissolve_free_huge_page before
> taking the lock.  However, you would need to check again after taking the
> lock.

Right, I'll do this.

What was in my mind when writing this was that I actually don't like
PageHuge() because it's slow (not inlined) and called from many places in
mm code, so I'd like to reduce its use where possible.
But I now see that dissolve_free_huge_page() is a relatively rare event
compared with hugepage allocation/free, so dissolve_free_huge_page() should
take the burden of prechecking PageHuge instead of speculatively taking
hugetlb_lock and disrupting the hot path, roughly like the sketch below.

Thanks,
- Naoya


Re: [PATCH v2 1/2] mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails

2019-06-12 Thread Naoya Horiguchi
On Tue, Jun 11, 2019 at 01:44:46PM +0530, Anshuman Khandual wrote:
> 
> 
> On 06/11/2019 06:27 AM, Naoya Horiguchi wrote:
> > On Mon, Jun 10, 2019 at 05:19:45PM -0700, Mike Kravetz wrote:
> >> On 6/10/19 1:18 AM, Naoya Horiguchi wrote:
> >>> The pass/fail of soft offline should be judged by checking whether the
> >>> raw error page was finally contained or not (i.e. the result of
> >>> set_hwpoison_free_buddy_page()), but the current code does not work like that.
> >>> So this patch suggests fixing it.
> >>>
> >>> Signed-off-by: Naoya Horiguchi 
> >>> Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining")
> >>> Cc:  # v4.19+
> >>
> >> Reviewed-by: Mike Kravetz 
> > 
> > Thank you, Mike.
> > 
> >>
> >> To follow-up on Andrew's comment/question about user visible effects.  
> >> Without
> >> this fix, there are cases where madvise(MADV_SOFT_OFFLINE) may not offline 
> >> the
> >> original page and will not return an error.
> > 
> > Yes, that's right.
> 
> Then should this be included in the commit message as well ?

Right, I'll clarify the point in the description.

Thanks,
- Naoya


Re: [PATCH v2 1/2] mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails

2019-06-10 Thread Naoya Horiguchi
On Mon, Jun 10, 2019 at 05:19:45PM -0700, Mike Kravetz wrote:
> On 6/10/19 1:18 AM, Naoya Horiguchi wrote:
> > The pass/fail of soft offline should be judged by checking whether the
> > raw error page was finally contained or not (i.e. the result of
> > set_hwpoison_free_buddy_page()), but the current code does not work like that.
> > So this patch suggests fixing it.
> > 
> > Signed-off-by: Naoya Horiguchi 
> > Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining")
> > Cc:  # v4.19+
> 
> Reviewed-by: Mike Kravetz 

Thank you, Mike.

> 
> To follow-up on Andrew's comment/question about user visible effects.  Without
> this fix, there are cases where madvise(MADV_SOFT_OFFLINE) may not offline the
> original page and will not return an error.

Yes, that's right.

>  Are there any other visible
> effects?

I can't think of other ones.

- Naoya


Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage

2019-06-10 Thread Naoya Horiguchi
On Wed, May 29, 2019 at 04:31:01PM -0700, Mike Kravetz wrote:
> On 5/28/19 2:49 AM, Wanpeng Li wrote:
> > Cc Paolo,
> > Hi all,
> > On Wed, 14 Feb 2018 at 06:34, Mike Kravetz  wrote:
> >>
> >> On 02/12/2018 06:48 PM, Michael Ellerman wrote:
> >>> Andrew Morton  writes:
> >>>
> >>>> On Thu, 08 Feb 2018 12:30:45 + Punit Agrawal  
> >>>> wrote:
> >>>>
> >>>>>>
> >>>>>> So I don't think that the above test result means that errors are 
> >>>>>> properly
> >>>>>> handled, and the proposed patch should help for arm64.
> >>>>>
> >>>>> Although, the deviation of pud_huge() avoids a kernel crash the code
> >>>>> would be easier to maintain and reason about if arm64 helpers are
> >>>>> consistent with expectations by core code.
> >>>>>
> >>>>> I'll look to update the arm64 helpers once this patch gets merged. But
> >>>>> it would be helpful if there was a clear expression of semantics for
> >>>>> pud_huge() for various cases. Is there any version that can be used as
> >>>>> reference?
> >>>>
> >>>> Is that an ack or tested-by?
> >>>>
> >>>> Mike keeps plaintively asking the powerpc developers to take a look,
> >>>> but they remain steadfastly in hiding.
> >>>
> >>> Cc'ing linuxppc-dev is always a good idea :)
> >>>
> >>
> >> Thanks Michael,
> >>
> >> I was mostly concerned about use cases for soft/hard offline of huge pages
> >> larger than PMD_SIZE on powerpc.  I know that powerpc supports PGD_SIZE
> >> huge pages, and soft/hard offline support was specifically added for this.
> >> See, 94310cbcaa3c "mm/madvise: enable (soft|hard) offline of HugeTLB pages
> >> at PGD level"
> >>
> >> This patch will disable that functionality.  So, at a minimum this is a
> >> 'heads up'.  If there are actual use cases that depend on this, then more
> >> work/discussions will need to happen.  From the e-mail thread on PGD_SIZE
> >> support, I can not tell if there is a real use case or this is just a
> >> 'nice to have'.
> > 
> > 1GB hugetlbfs pages are used by DPDK and VMs in cloud deployments; we have
> > encountered gup_pud_range() panics several times in production environments.
> > Is there any plan to re-enable this and fix the arch code?
> 
> I too am aware of slightly more interest in 1G huge pages.  Suspect that as
> Intel MMU capacity increases to handle more TLB entries there will be more
> and more interest.
> 
> Personally, I am not looking at this issue.  Perhaps Naoya will comment as
> he know most about this code.

Thanks for forwarding this to me. I feel that memory error handling
on 1GB hugepages is demanded as a real use case.

> 
> > In addition, 
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/mmu.c#n3213
> > The memory in guest can be 1GB/2MB/4K, though the host-backed memory
> > are 1GB hugetlbfs pages, after above PUD panic is fixed,
> > try_to_unmap() which is called in MCA recovery path will mark the PUD
> > hwpoison entry. The guest will vmexit and retry endlessly when
> > accessing any memory in the guest which is backed by this 1GB poisoned
> > hugetlbfs page. We have a plan to split this 1GB hugetblfs page by 2MB
> > hugetlbfs pages/4KB pages, maybe file remap to a virtual address range
> > which is 2MB/4KB page granularity, also split the KVM MMU 1GB SPTE
> > into 2MB/4KB and mark the offensive SPTE w/ a hwpoison flag, a sigbus
> > will be delivered to VM at page fault next time for the offensive
> > SPTE. Is this proposal acceptable?
> 
> I am not sure of the error handling design, but this does sound reasonable.

I agree that that's better.

> That block of code which potentially dissolves a huge page on memory error
> is hard to understand and I'm not sure if that is even the 'normal'
> functionality.  Certainly, we would hate to waste/poison an entire 1G page
> for an error on a small subsection.

Yes, that's not practical, so we first need to establish the code base for
2MB hugetlb splitting and then extend it to 1GB next.

Thanks,
Naoya Horiguchi


Re: [PATCH v2 1/2] mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails

2019-06-10 Thread Naoya Horiguchi
On Mon, Jun 10, 2019 at 02:20:33PM -0700, Andrew Morton wrote:
> On Mon, 10 Jun 2019 17:18:05 +0900 Naoya Horiguchi 
>  wrote:
> 
> > The pass/fail of soft offline should be judged by checking whether the
> > raw error page was finally contained or not (i.e. the result of
> > set_hwpoison_free_buddy_page()), but the current code does not work like that.
> > So this patch suggests fixing it.
> 
> Please describe the user-visible runtime effects of this change?

Sorry, could you replace the description as follows (I inserted one sentence)?

The pass/fail of soft offline should be judged by checking whether the
raw error page was finally contained or not (i.e. the result of
set_hwpoison_free_buddy_page()), but the current code does not work like that.
It might lead us to misjudge the test result when
set_hwpoison_free_buddy_page() fails.  So this patch suggests fixing it.

Thanks,
Naoya Horiguchi


[PATCH v2 1/2] mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails

2019-06-10 Thread Naoya Horiguchi
The pass/fail of soft offline should be judged by checking whether the
raw error page was finally contained or not (i.e. the result of
set_hwpoison_free_buddy_page()), but the current code does not work like that.
So this patch suggests fixing it.

Signed-off-by: Naoya Horiguchi 
Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining")
Cc:  # v4.19+
---
 mm/memory-failure.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git v5.2-rc3/mm/memory-failure.c v5.2-rc3_patched/mm/memory-failure.c
index fc8b517..7ea485e 100644
--- v5.2-rc3/mm/memory-failure.c
+++ v5.2-rc3_patched/mm/memory-failure.c
@@ -1733,6 +1733,8 @@ static int soft_offline_huge_page(struct page *page, int 
flags)
if (!ret) {
if (set_hwpoison_free_buddy_page(page))
num_poisoned_pages_inc();
+   else
+   ret = -EBUSY;
}
}
return ret;
-- 
2.7.0



[PATCH v2 2/2] mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge

2019-06-10 Thread Naoya Horiguchi
madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft offline
for hugepages with overcommitting enabled. That is caused by suboptimal
handling in the current soft-offline code. See the following part:

ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
MIGRATE_SYNC, MR_MEMORY_FAILURE);
if (ret) {
...
} else {
/*
 * We set PG_hwpoison only when the migration source hugepage
 * was successfully dissolved, because otherwise hwpoisoned
 * hugepage remains on free hugepage list, then userspace will
 * find it as SIGBUS by allocation failure. That's not expected
 * in soft-offlining.
 */
ret = dissolve_free_huge_page(page);
if (!ret) {
if (set_hwpoison_free_buddy_page(page))
num_poisoned_pages_inc();
}
}
return ret;

Here dissolve_free_huge_page() returns -EBUSY if the migration source page
was freed into buddy in migrate_pages(), but even in that case we actually
have a chance that set_hwpoison_free_buddy_page() succeeds. So that means
the current code gives up offlining too early.

dissolve_free_huge_page() checks that a given hugepage is suitable for
dissolving, where we should return success for !PageHuge() case because
the given hugepage is considered as already dissolved.

This change also affects other callers of dissolve_free_huge_page(),
which are cleaned up together.

Reported-by: Chen, Jerry T 
Tested-by: Chen, Jerry T 
Signed-off-by: Naoya Horiguchi 
Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining")
Cc:  # v4.19+
---
 mm/hugetlb.c| 15 +--
 mm/memory-failure.c |  5 +
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git v5.2-rc3/mm/hugetlb.c v5.2-rc3_patched/mm/hugetlb.c
index ac843d3..048d071 100644
--- v5.2-rc3/mm/hugetlb.c
+++ v5.2-rc3_patched/mm/hugetlb.c
@@ -1519,7 +1519,12 @@ int dissolve_free_huge_page(struct page *page)
int rc = -EBUSY;
 
spin_lock(&hugetlb_lock);
-   if (PageHuge(page) && !page_count(page)) {
+   if (!PageHuge(page)) {
+   rc = 0;
+   goto out;
+   }
+
+   if (!page_count(page)) {
struct page *head = compound_head(page);
struct hstate *h = page_hstate(head);
int nid = page_to_nid(head);
@@ -1564,11 +1569,9 @@ int dissolve_free_huge_pages(unsigned long start_pfn, 
unsigned long end_pfn)
 
for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order) {
page = pfn_to_page(pfn);
-   if (PageHuge(page) && !page_count(page)) {
-   rc = dissolve_free_huge_page(page);
-   if (rc)
-   break;
-   }
+   rc = dissolve_free_huge_page(page);
+   if (rc)
+   break;
}
 
return rc;
diff --git v5.2-rc3/mm/memory-failure.c v5.2-rc3_patched/mm/memory-failure.c
index 7ea485e..3a83e27 100644
--- v5.2-rc3/mm/memory-failure.c
+++ v5.2-rc3_patched/mm/memory-failure.c
@@ -1859,11 +1859,8 @@ static int soft_offline_in_use_page(struct page *page, 
int flags)
 
 static int soft_offline_free_page(struct page *page)
 {
-   int rc = 0;
-   struct page *head = compound_head(page);
+   int rc = dissolve_free_huge_page(page);
 
-   if (PageHuge(head))
-   rc = dissolve_free_huge_page(page);
if (!rc) {
if (set_hwpoison_free_buddy_page(page))
num_poisoned_pages_inc();
-- 
2.7.0



[PATCH v2 00/02] fix return value issue of soft offlining hugepages

2019-06-10 Thread Naoya Horiguchi
Hi everyone,

This is an update of the fix of return value issue of hugepage soft-offlining
(v1: https://patchwork.kernel.org/patch/10962135/).

The code itself has no change since v1 but I updated the description.
Jerry helped testing and finally confirmed that the patch is OK.

In the previous discussion, Mike pointed out that this problem contains
two separate issues (a problem in dissolve_free_huge_page() and a problem in
soft_offline_huge_page() itself) and I agree (although I stated
differently in v1). So I separated the patches.

Hopefully this will help finish off the issue.

Thanks,
Naoya Horiguchi


Re: [PATCH v1] mm: hugetlb: soft-offline: fix wrong return value of soft offline

2019-05-29 Thread Naoya Horiguchi
Hi Mike,

On Wed, May 29, 2019 at 11:44:50AM -0700, Mike Kravetz wrote:
> On 5/26/19 11:06 PM, Naoya Horiguchi wrote:
> > Soft offline events for hugetlb pages return -EBUSY when page migration
> > succeeded and dissolve_free_huge_page() failed, which can happen when
> > there're surplus hugepages. We should judge pass/fail of soft offline by
> > checking whether the raw error page was finally contained or not (i.e.
> > the result of set_hwpoison_free_buddy_page()), so this behavior is wrong.
> > 
> > This problem was introduced by the following change of commit 6bc9b56433b76
> > ("mm: fix race on soft-offlining"):
> > 
> > if (ret > 0)
> > ret = -EIO;
> > } else {
> > -   if (PageHuge(page))
> > -   dissolve_free_huge_page(page);
> > +   /*
> > +* We set PG_hwpoison only when the migration source 
> > hugepage
> > +* was successfully dissolved, because otherwise 
> > hwpoisoned
> > +* hugepage remains on free hugepage list, then 
> > userspace will
> > +* find it as SIGBUS by allocation failure. That's not 
> > expected
> > +* in soft-offlining.
> > +*/
> > +   ret = dissolve_free_huge_page(page);
> > +   if (!ret) {
> > +   if (set_hwpoison_free_buddy_page(page))
> > +   num_poisoned_pages_inc();
> > +   }
> > }
> > return ret;
> >  }
> > 
> > , so a simple fix is to restore the PageHuge precheck, but my code
> > reading shows that we already have PageHuge check in
> > dissolve_free_huge_page() with hugetlb_lock, which is a better place to
> > check it.  And currently dissolve_free_huge_page() returns -EBUSY for
> > !PageHuge but that's simply wrong because that case should be
> > considered as success (meaning that "the given hugetlb was already
> > dissolved.")
> 
> Hello Naoya,
> 
> I am having a little trouble understanding the situation.  The code above is
> in the routine soft_offline_huge_page, and occurs immediately after a call to
> migrate_pages() with 'page' being the only on the list of pages to be 
> migrated.
> In addition, since we are in soft_offline_huge_page, we know that page is
> a huge page (PageHuge) before the call to migrate_pages.
> 
> IIUC, the issue is that the migrate_pages call results in 'page' being
> dissolved into regular base pages.  Therefore, the call to
> dissolve_free_huge_page returns -EBUSY and we never end up setting 
> PageHWPoison
> on the (base) page which had the error.
> 
> It seems that for the original page to be dissolved, it must go through the
> free_huge_page routine.  Once that happens, it is possible for the (dissolved)
> pages to be allocated again.  Is that just a known race, or am I missing
> something?

No, your understanding is right.  I found that the last (and most important)
part of the patch description ("this behavior is wrong") might be wrong.
Sorry about that and let me correct myself:

  - before commit 6bc9b56433b76, the return value of soft offline is the
return value of migrate_pages(); dissolve_free_huge_page()'s return value is
ignored.

  - after commit 6bc9b56433b76, soft_offline_huge_page() returns success
only if dissolve_free_huge_page() returns success.

This change is *mainly OK* (meaning nothing is broken), but there still
remains room for improvement, that is, even in the "dissolved from
free_huge_page()" case, we can try to call set_hwpoison_free_buddy_page() to
contain the 4kB error page, but we don't try it now because
dissolve_free_huge_page() returns -EBUSY for the !PageHuge case.
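
With both patches applied, the tail of soft_offline_huge_page() would behave
roughly like this (sketch only, simplified):

  ret = dissolve_free_huge_page(page);    /* now returns 0 for !PageHuge too */
  if (!ret) {
          /*
           * try to contain the raw 4kB error page even when the hugepage
           * was already dissolved into buddy by migrate_pages()
           */
          if (set_hwpoison_free_buddy_page(page))
                  num_poisoned_pages_inc();
          else
                  ret = -EBUSY;
  }
  return ret;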

> 
> > This change affects other callers of dissolve_free_huge_page(),
> > which are also cleaned up by this patch.
> 
> It may just be me, but I am having a hard time separating the fix for this
> issue from the change to the dissolve_free_huge_page routine.  Would it be
> more clear or possible to create separate patches for these?

Yes, the change is actually an 'improvement' purely related to hugetlb,
and not really a 'bug fix'. So I'll update the description.
Maybe there is no need to separate it into two patches.

Thanks,
Naoya Horiguchi


[PATCH v1] mm: hugetlb: soft-offline: fix wrong return value of soft offline

2019-05-27 Thread Naoya Horiguchi
Soft offline events for hugetlb pages return -EBUSY when page migration
succeeded and dissolve_free_huge_page() failed, which can happen when
there're surplus hugepages. We should judge pass/fail of soft offline by
checking whether the raw error page was finally contained or not (i.e.
the result of set_hwpoison_free_buddy_page()), so this behavior is wrong.

This problem was introduced by the following change of commit 6bc9b56433b76
("mm: fix race on soft-offlining"):

if (ret > 0)
ret = -EIO;
} else {
-   if (PageHuge(page))
-   dissolve_free_huge_page(page);
+   /*
+* We set PG_hwpoison only when the migration source 
hugepage
+* was successfully dissolved, because otherwise hwpoisoned
+* hugepage remains on free hugepage list, then userspace 
will
+* find it as SIGBUS by allocation failure. That's not 
expected
+* in soft-offlining.
+*/
+   ret = dissolve_free_huge_page(page);
+   if (!ret) {
+   if (set_hwpoison_free_buddy_page(page))
+   num_poisoned_pages_inc();
+   }
}
return ret;
 }

, so a simple fix is to restore the PageHuge precheck, but my code
reading shows that we already have PageHuge check in
dissolve_free_huge_page() with hugetlb_lock, which is a better place to
check it.  And currently dissolve_free_huge_page() returns -EBUSY for
!PageHuge but that's simply wrong because that case should be
considered as success (meaning that "the given hugetlb was already
dissolved.")

This change affects other callers of dissolve_free_huge_page(),
which are also cleaned up by this patch.

Reported-by: Chen, Jerry T 
Signed-off-by: Naoya Horiguchi 
Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining")
Cc:  # v4.19+
---
 mm/hugetlb.c| 15 +--
 mm/memory-failure.c |  7 +++
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git v5.1-rc6-mmotm-2019-04-25-16-30/mm/hugetlb.c 
v5.1-rc6-mmotm-2019-04-25-16-30_patched/mm/hugetlb.c
index bf58cee..385899f 100644
--- v5.1-rc6-mmotm-2019-04-25-16-30/mm/hugetlb.c
+++ v5.1-rc6-mmotm-2019-04-25-16-30_patched/mm/hugetlb.c
@@ -1518,7 +1518,12 @@ int dissolve_free_huge_page(struct page *page)
int rc = -EBUSY;
 
spin_lock(&hugetlb_lock);
-   if (PageHuge(page) && !page_count(page)) {
+   if (!PageHuge(page)) {
+   rc = 0;
+   goto out;
+   }
+
+   if (!page_count(page)) {
struct page *head = compound_head(page);
struct hstate *h = page_hstate(head);
int nid = page_to_nid(head);
@@ -1563,11 +1568,9 @@ int dissolve_free_huge_pages(unsigned long start_pfn, 
unsigned long end_pfn)
 
for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order) {
page = pfn_to_page(pfn);
-   if (PageHuge(page) && !page_count(page)) {
-   rc = dissolve_free_huge_page(page);
-   if (rc)
-   break;
-   }
+   rc = dissolve_free_huge_page(page);
+   if (rc)
+   break;
}
 
return rc;
diff --git v5.1-rc6-mmotm-2019-04-25-16-30/mm/memory-failure.c 
v5.1-rc6-mmotm-2019-04-25-16-30_patched/mm/memory-failure.c
index fc8b517..3a83e27 100644
--- v5.1-rc6-mmotm-2019-04-25-16-30/mm/memory-failure.c
+++ v5.1-rc6-mmotm-2019-04-25-16-30_patched/mm/memory-failure.c
@@ -1733,6 +1733,8 @@ static int soft_offline_huge_page(struct page *page, int 
flags)
if (!ret) {
if (set_hwpoison_free_buddy_page(page))
num_poisoned_pages_inc();
+   else
+   ret = -EBUSY;
}
}
return ret;
@@ -1857,11 +1859,8 @@ static int soft_offline_in_use_page(struct page *page, 
int flags)
 
 static int soft_offline_free_page(struct page *page)
 {
-   int rc = 0;
-   struct page *head = compound_head(page);
+   int rc = dissolve_free_huge_page(page);
 
-   if (PageHuge(head))
-   rc = dissolve_free_huge_page(page);
if (!rc) {
if (set_hwpoison_free_buddy_page(page))
num_poisoned_pages_inc();
-- 
2.7.0



Re: [PATCH v2] mm, memory-failure: clarify error message

2019-05-20 Thread Naoya Horiguchi
On Mon, May 20, 2019 at 07:52:03PM -0600, Jane Chu wrote:
> Some users who install a SIGBUS handler that does a longjmp out,
> therefore keeping the process alive, are confused by the error
> message
>   "[188988.765862] Memory failure: 0x1840200: Killing
>cellsrv:33395 due to hardware memory corruption"
> Slightly modify the error message to improve clarity.
> 
> Signed-off-by: Jane Chu 

Acked-by: Naoya Horiguchi 

Thanks!


Re: [PATCH] mm, memory-failure: clarify error message

2019-05-20 Thread Naoya Horiguchi
On Fri, May 17, 2019 at 10:18:02AM +0530, Anshuman Khandual wrote:
> 
> 
> On 05/17/2019 09:38 AM, Jane Chu wrote:
> > Some user who install SIGBUS handler that does longjmp out
> 
> What is the longjmp about? Are you referring to the mechanism of catching the
> signal which was registered?

AFAIK, longjmp() might be useful for signal-based retrying, so highly
optimized applications like Oracle DB might want to utilize it to handle
memory errors at the application level, I guess (something like the sketch
below).
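
An illustrative userspace sketch (hypothetical code, not from any real
application):

  #include <setjmp.h>
  #include <signal.h>
  #include <stdio.h>

  static sigjmp_buf recover;

  static void sigbus_handler(int sig)
  {
          siglongjmp(recover, 1); /* jump out instead of re-running the access */
  }

  int main(void)
  {
          struct sigaction sa = { 0 };

          sa.sa_handler = sigbus_handler;
          sigemptyset(&sa.sa_mask);
          sigaction(SIGBUS, &sa, NULL);

          if (sigsetjmp(recover, 1) == 0) {
                  raise(SIGBUS);  /* stand-in for touching a poisoned page */
                  printf("not reached\n");
          } else {
                  printf("caught SIGBUS, retrying/skipping the access\n");
          }
          return 0;
  }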

> 
> > therefore keeping the process alive is confused by the error
> > message
> >   "[188988.765862] Memory failure: 0x1840200: Killing
> >cellsrv:33395 due to hardware memory corruption"
> 
> Its a valid point because those are two distinct actions.
> 
> > Slightly modify the error message to improve clarity.
> > 
> > Signed-off-by: Jane Chu 
> > ---
> >  mm/memory-failure.c | 7 ---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index fc8b517..14de5e2 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -216,10 +216,9 @@ static int kill_proc(struct to_kill *tk, unsigned long 
> > pfn, int flags)
> > short addr_lsb = tk->size_shift;
> > int ret;
> >  
> > -   pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory 
> > corruption\n",
> > -   pfn, t->comm, t->pid);
> > -
> > if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
> > +   pr_err("Memory failure: %#lx: Killing %s:%d due to hardware 
> > memory "
> > +   "corruption\n", pfn, t->comm, t->pid);
> > ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)tk->addr,
> >addr_lsb, current);
> > } else {
> > @@ -229,6 +228,8 @@ static int kill_proc(struct to_kill *tk, unsigned long 
> > pfn, int flags)
> >  * This could cause a loop when the user sets SIGBUS
> >  * to SIG_IGN, but hopefully no one will do that?
> >  */
> > +   pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to 
> > hardware "
> > +   "memory corruption\n", pfn, t->comm, t->pid);
> > ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr,
> >   addr_lsb, t);  /* synchronous? */
> 
> As both the pr_err() messages are very similar, could we not just switch
> between "Killing" and "Sending SIGBUS to" based on a variable, e.g.
> action_[kill|sigbus], evaluated previously with
> ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm)?

That might need an additional if statement, which I'm not sure is worth doing.
I think that the simplest fix for the reported problem (a confusing message)
is like below:

-   pr_err("Memory failure: %#lx: Killing %s:%d due to hardware 
memory corruption\n",
+   pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to 
hardware memory corruption\n",
pfn, t->comm, t->pid);

Or, if we have a good reason to separate the message for MF_ACTION_REQUIRED and
MF_ACTION_OPTIONAL, that might be OK.

Thanks,
Naoya Horiguchi


[PATCH v2] tools/power: turbostat: make output buffer extensible (Re: [PATCH v1] tools/power: turbostat: fix buffer overrun)

2019-04-18 Thread Naoya Horiguchi
I updated the patch with a trivial fix.
Could you review it?

- Naoya


From: Naoya Horiguchi 
Date: Fri, 19 Apr 2019 09:21:59 +0900
Subject: [PATCH v2] tools/power: turbostat: make output buffer extensible

"turbostat --Dump" could be terminated by general protection fault on
some latest hardwares which (for example) support 9 levels of C-states
and show 18 "tADDED" lines. That bloats the total output and finally
causes buffer overrun.  So this patch sugguests to extend the output
buffer when reaching the end.

This patch also removes duplicated "pc10:" line to reduce buffer usage.

Signed-off-by: Naoya Horiguchi 
---
ChangeLog v1->v2:
- moved va_end() before if block.
---
 tools/power/x86/turbostat/turbostat.c | 397 ++
 1 file changed, 210 insertions(+), 187 deletions(-)

diff --git a/tools/power/x86/turbostat/turbostat.c 
b/tools/power/x86/turbostat/turbostat.c
index c7727be9719f..767ada558d75 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -84,6 +84,7 @@ double tsc_tweak = 1.0;
 unsigned int show_pkg_only;
 unsigned int show_core_only;
 char *output_buffer, *outp;
+ssize_t outbuf_size;
 unsigned int do_rapl;
 unsigned int do_dts;
 unsigned int do_ptm;
@@ -625,6 +626,28 @@ unsigned long long bic_lookup(char *name_list, enum 
show_hide_mode mode)
return retval;
 }
 
+static void *append_to_output_buffer(const char *fmt, ...)
+{
+   va_list args;
+
+   va_start(args, fmt);
+   outp += vsprintf(outp, fmt, args);
+   va_end(args);
+
+   /* Approaching the buffer end, so extend it. */
+   if (outp - output_buffer >= (outbuf_size - 256)) {
+   int output_size = outp - output_buffer;
+
+   outbuf_size += 1024;
+   output_buffer = realloc(output_buffer, outbuf_size);
+   if (output_buffer == NULL)
+   err(-1, "realloc output buffer");
+   if (debug)
+   printf("Output buffer was extended.\n");
+   outp = output_buffer + output_size;
+   }
+   return outp;
+}
 
 void print_header(char *delim)
 {
@@ -632,173 +655,173 @@ void print_header(char *delim)
int printed = 0;
 
if (DO_BIC(BIC_USEC))
-   outp += sprintf(outp, "%susec", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%susec", (printed++ ? delim : 
""));
if (DO_BIC(BIC_TOD))
-   outp += sprintf(outp, "%sTime_Of_Day_Seconds", (printed++ ? 
delim : ""));
+   outp = append_to_output_buffer("%sTime_Of_Day_Seconds", 
(printed++ ? delim : ""));
if (DO_BIC(BIC_Package))
-   outp += sprintf(outp, "%sPackage", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%sPackage", (printed++ ? delim 
: ""));
if (DO_BIC(BIC_Die))
-   outp += sprintf(outp, "%sDie", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%sDie", (printed++ ? delim : 
""));
if (DO_BIC(BIC_Node))
-   outp += sprintf(outp, "%sNode", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%sNode", (printed++ ? delim : 
""));
if (DO_BIC(BIC_Core))
-   outp += sprintf(outp, "%sCore", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%sCore", (printed++ ? delim : 
""));
if (DO_BIC(BIC_CPU))
-   outp += sprintf(outp, "%sCPU", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%sCPU", (printed++ ? delim : 
""));
if (DO_BIC(BIC_APIC))
-   outp += sprintf(outp, "%sAPIC", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%sAPIC", (printed++ ? delim : 
""));
if (DO_BIC(BIC_X2APIC))
-   outp += sprintf(outp, "%sX2APIC", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%sX2APIC", (printed++ ? delim : 
""));
if (DO_BIC(BIC_Avg_MHz))
-   outp += sprintf(outp, "%sAvg_MHz", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%sAvg_MHz", (printed++ ? delim 
: ""));
if (DO_BIC(BIC_Busy))
-   outp += sprintf(outp, "%sBusy%%", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%sBusy%%", (printed++ ? delim : 
""));
if (DO_BIC(BIC_Bzy_MHz))
-   outp += sprintf(outp, "%sBzy_MHz", (printed++ ? delim : ""));

[PATCH] tools/power: turbostat: make output buffer extensible (Re: [PATCH v1] tools/power: turbostat: fix buffer overrun)

2019-04-03 Thread Naoya Horiguchi
Hi Prarit,

On Wed, Apr 03, 2019 at 07:42:45AM -0400, Prarit Bhargava wrote:
> 
> 
> On 4/3/19 3:02 AM, Naoya Horiguchi wrote:
> > turbostat could be terminated by a general protection fault on some recent
> > hardware which (for example) supports 9 levels of C-states and shows 18
> > "tADDED" lines. That bloats the total output and eventually causes a buffer
> > overrun.  So let's extend the buffer to avoid this.
> > 
> > This patch also removes duplicated "pc10:" line to reduce buffer usage.
> > 
> > Signed-off-by: Naoya Horiguchi 
> > ---
> >  tools/power/x86/turbostat/turbostat.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> > 
> > diff --git 
> > v5.1-rc3-mmotm-2019-04-02-17-16/tools/power/x86/turbostat/turbostat.c 
> > v5.1-rc3-mmotm-2019-04-02-17-16_patched/tools/power/x86/turbostat/turbostat.c
> > index c7727be..17b1f544 100644
> > --- v5.1-rc3-mmotm-2019-04-02-17-16/tools/power/x86/turbostat/turbostat.c
> > +++ 
> > v5.1-rc3-mmotm-2019-04-02-17-16_patched/tools/power/x86/turbostat/turbostat.c
> > @@ -861,7 +861,6 @@ int dump_counters(struct thread_data *t, struct 
> > core_data *c,
> > outp += sprintf(outp, "pc8: %016llX\n", p->pc8);
> > outp += sprintf(outp, "pc9: %016llX\n", p->pc9);
> > outp += sprintf(outp, "pc10: %016llX\n", p->pc10);
> > -   outp += sprintf(outp, "pc10: %016llX\n", p->pc10);
> > outp += sprintf(outp, "cpu_lpi: %016llX\n", p->cpu_lpi);
> > outp += sprintf(outp, "sys_lpi: %016llX\n", p->sys_lpi);
> > outp += sprintf(outp, "Joules PKG: %0X\n", p->energy_pkg);
> > @@ -5135,7 +5134,7 @@ int initialize_counters(int cpu_id)
> >  
> >  void allocate_output_buffer()
> >  {
> > -   output_buffer = calloc(1, (1 + topo.num_cpus) * 1024);
> > +   output_buffer = calloc(1, (1 + topo.num_cpus) * 2048);
> 
> Is there a better way to calculate the size of that buffer other than a magic
> number?

A straightforward way to calculate it is to define the list of printed items
and the buffer size needed for each one, then sum them up at initialization,
roughly like the sketch below. But that might make the code hard to maintain
because we already have many small items and they are not in a common format.
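
(Hypothetical sketch only; the item names and sizes are made up, not taken
from turbostat.)

  #include <stddef.h>

  struct out_item {
          const char *name;
          size_t max_bytes;       /* worst-case formatted width of the column */
  };

  static const struct out_item out_items[] = {
          { "usec",    12 },
          { "Busy%",    8 },
          { "Bzy_MHz", 10 },
          /* ... one entry per counter/column ... */
  };

  static size_t worst_case_buffer_size(int num_cpus)
  {
          size_t per_line = 0;
          size_t i;

          for (i = 0; i < sizeof(out_items) / sizeof(out_items[0]); i++)
                  per_line += out_items[i].max_bytes;

          return (1 + num_cpus) * per_line;
  }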

Another approach, independent of magic numbers or a fixed-size buffer, is to
extend the buffer with realloc() when we are approaching the end.
I hope the following patch might help.

# This patch is relatively large (~400 lines) but most are simple replacement
# of "sprintf(outp, ...)" with "append_to_output_buffer()".

Thanks,
Naoya Horiguchi
--
From: Naoya Horiguchi 
Date: Thu, 4 Apr 2019 11:54:28 +0900
Subject: [PATCH] tools/power: turbostat: make output buffer extensible

"turbostat --Dump" could be terminated by general protection fault on
some latest hardwares which (for example) support 9 levels of C-states
and show 18 "tADDED" lines. That bloats the total output and finally
causes buffer overrun.  So this patch sugguests to extend the output
buffer when reaching the end.

This patch also removes duplicated "pc10:" line to reduce buffer usage.

Signed-off-by: Naoya Horiguchi 
---
 tools/power/x86/turbostat/turbostat.c | 397 ++
 1 file changed, 210 insertions(+), 187 deletions(-)

diff --git a/tools/power/x86/turbostat/turbostat.c 
b/tools/power/x86/turbostat/turbostat.c
index c7727be9719f..41d41c532a3e 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -84,6 +84,7 @@ double tsc_tweak = 1.0;
 unsigned int show_pkg_only;
 unsigned int show_core_only;
 char *output_buffer, *outp;
+ssize_t outbuf_size;
 unsigned int do_rapl;
 unsigned int do_dts;
 unsigned int do_ptm;
@@ -625,6 +626,28 @@ unsigned long long bic_lookup(char *name_list, enum 
show_hide_mode mode)
return retval;
 }
 
+static void *append_to_output_buffer(const char *fmt, ...)
+{
+   va_list args;
+
+   va_start(args, fmt);
+   outp += vsprintf(outp, fmt, args);
+
+   /* Approaching the buffer end, so extend it. */
+   if (outp - output_buffer >= (outbuf_size - 256)) {
+   int output_size = outp - output_buffer;
+
+   outbuf_size += 1024;
+   output_buffer = realloc(output_buffer, outbuf_size);
+   if (output_buffer == NULL)
+   err(-1, "realloc output buffer");
+   if (debug)
+   printf("Output buffer was extended.\n");
+   outp = output_buffer + output_size;
+   }
+   va_end(args);
+   return outp;

[PATCH v1] tools/power: turbostat: fix buffer overrun

2019-04-03 Thread Naoya Horiguchi
turbostat could be terminated by a general protection fault on some recent
hardware which (for example) supports 9 levels of C-states and shows 18
"tADDED" lines. That bloats the total output and eventually causes a buffer
overrun.  So let's extend the buffer to avoid this.

This patch also removes duplicated "pc10:" line to reduce buffer usage.

Signed-off-by: Naoya Horiguchi 
---
 tools/power/x86/turbostat/turbostat.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git 
v5.1-rc3-mmotm-2019-04-02-17-16/tools/power/x86/turbostat/turbostat.c 
v5.1-rc3-mmotm-2019-04-02-17-16_patched/tools/power/x86/turbostat/turbostat.c
index c7727be..17b1f544 100644
--- v5.1-rc3-mmotm-2019-04-02-17-16/tools/power/x86/turbostat/turbostat.c
+++ 
v5.1-rc3-mmotm-2019-04-02-17-16_patched/tools/power/x86/turbostat/turbostat.c
@@ -861,7 +861,6 @@ int dump_counters(struct thread_data *t, struct core_data 
*c,
outp += sprintf(outp, "pc8: %016llX\n", p->pc8);
outp += sprintf(outp, "pc9: %016llX\n", p->pc9);
outp += sprintf(outp, "pc10: %016llX\n", p->pc10);
-   outp += sprintf(outp, "pc10: %016llX\n", p->pc10);
outp += sprintf(outp, "cpu_lpi: %016llX\n", p->cpu_lpi);
outp += sprintf(outp, "sys_lpi: %016llX\n", p->sys_lpi);
outp += sprintf(outp, "Joules PKG: %0X\n", p->energy_pkg);
@@ -5135,7 +5134,7 @@ int initialize_counters(int cpu_id)
 
 void allocate_output_buffer()
 {
-   output_buffer = calloc(1, (1 + topo.num_cpus) * 1024);
+   output_buffer = calloc(1, (1 + topo.num_cpus) * 2048);
outp = output_buffer;
if (outp == NULL)
err(-1, "calloc output buffer");
-- 
2.7.0



Re: [PATCH] mm/hugetlb: Get rid of NODEMASK_ALLOC

2019-04-02 Thread Naoya Horiguchi
On Tue, Apr 02, 2019 at 03:34:15PM +0200, Oscar Salvador wrote:
> NODEMASK_ALLOC is used to allocate a nodemask bitmap, and it does it by
> first determining whether it should be allocated on the stack or dynamically
> depending on NODES_SHIFT.
> Right now, it goes the dynamic path whenever the nodemask_t is above 32
> bytes.
> 
> Although we could bump it to a reasonable value, the largest a nodemask_t
> can get is 128 bytes, so since __nr_hugepages_store_common is called from
> a rather short stack we can just get rid of the NODEMASK_ALLOC call here.
> 
> This reduces some code churn and complexity.
> 
> Signed-off-by: Oscar Salvador 

nice cleanup.
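
For reference, the stack-vs-dynamic split being removed is conceptually like
this (paraphrased sketch, not the verbatim include/linux/nodemask.h text):

  #if NODES_SHIFT > 8     /* nodemask_t would be larger than 32 bytes */
  #define NODEMASK_ALLOC(type, name, gfp_flags) \
          type *name = kmalloc(sizeof(*name), gfp_flags)
  #define NODEMASK_FREE(m)        kfree(m)
  #else                   /* small enough to live on the stack */
  #define NODEMASK_ALLOC(type, name, gfp_flags) \
          type _##name, *name = &_##name
  #define NODEMASK_FREE(m)        do {} while (0)
  #endif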

Reviewed-by: Naoya Horiguchi 


Re: [PATCH v2 0/2] A couple hugetlbfs fixes

2019-04-02 Thread Naoya Horiguchi
On Thu, Mar 28, 2019 at 04:47:02PM -0700, Mike Kravetz wrote:
> I stumbled on these two hugetlbfs issues while looking at other things:
> - The 'restore reserve' functionality at page free time should not
>   be adjusting subpool counts.
> - A BUG can be triggered (not easily) due to temporarily mapping a
>   page before doing a COW.
> 
> Both are described in detail in the commit message of the patches.
> I would appreciate comments from Davidlohr Bueso as one patch is
> directly related to code he added in commit 8382d914ebf7.
> 
> I did not cc stable as the first problem has been around since reserves
> were added to hugetlbfs and nobody has noticed.  The second is very hard
> to hit/reproduce.
> 
> v2 - Update definition and all callers of hugetlb_fault_mutex_hash as
>  the arguments mm and vma are no longer used or necessary.
> 
> Mike Kravetz (2):
>   huegtlbfs: on restore reserve error path retain subpool reservation
>   hugetlb: use same fault hash key for shared and private mappings
> 
>  fs/hugetlbfs/inode.c|  7 ++-
>  include/linux/hugetlb.h |  4 +---
>  mm/hugetlb.c| 43 +
>  mm/userfaultfd.c|  3 +--
>  4 files changed, 26 insertions(+), 31 deletions(-)

Both fixes look fine to me.

Reviewed-by: Naoya Horiguchi 


Re: [PATCH REBASED] hugetlbfs: fix potential over/underflow setting node specific nr_hugepages

2019-03-28 Thread Naoya Horiguchi
On Thu, Mar 28, 2019 at 03:05:33PM -0700, Mike Kravetz wrote:
> The number of node specific huge pages can be set via a file such as:
> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
> When a node specific value is specified, the global number of huge
> pages must also be adjusted.  This adjustment is calculated as the
> specified node specific value + (global value - current node value).
> If the node specific value provided by the user is large enough, this
> calculation could overflow an unsigned long leading to a smaller
> than expected number of huge pages.
> 
> To fix, check the calculation for overflow.  If overflow is detected,
> use ULONG_MAX as the requested value.  This is inline with the user
> request to allocate as many huge pages as possible.
> 
> It was also noticed that the above calculation was done outside the
> hugetlb_lock.  Therefore, the values could be inconsistent and result
> in underflow.  To fix, the calculation is moved within the routine
> set_max_huge_pages() where the lock is held.
> 
> In addition, the code in __nr_hugepages_store_common() which tries to
> handle the case of not being able to allocate a node mask would likely
> result in incorrect behavior.  Luckily, it is very unlikely we will
> ever take this path.  If we do, simply return ENOMEM.
> 
> Reported-by: Jing Xiangfeng 
> Signed-off-by: Mike Kravetz 

Looks good to me.

Reviewed-by: Naoya Horiguchi 
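
Just to illustrate the wraparound this guards against, with made-up numbers
(standalone example, not kernel code):

  #include <stdio.h>
  #include <limits.h>

  int main(void)
  {
          /* illustrative only: global pool = 100 pages, node1 has 10,
             user writes ULONG_MAX to node1's nr_hugepages */
          unsigned long global = 100, node1 = 10;
          unsigned long count = ULONG_MAX;        /* user-specified value */
          unsigned long old_count = count;

          count += global - node1;                /* wraps around to 89 */
          if (count < old_count)                  /* overflow detected */
                  count = ULONG_MAX;              /* clamp, as the patch does */

          printf("adjusted count = %lu\n", count);
          return 0;
  }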

> ---
> This was sent upstream during 5.1 merge window, but dropped as it was
> based on an earlier version of Alex Ghiti's patch which was dropped.
> Now rebased on top of Alex Ghiti's "[PATCH v8 0/4] Fix free/allocation
> of runtime gigantic pages" series which was just added to mmotm.
> 
>  mm/hugetlb.c | 41 ++---
>  1 file changed, 34 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f3e84c1bef11..f79ae4e42159 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2287,13 +2287,33 @@ static int adjust_pool_surplus(struct hstate *h, 
> nodemask_t *nodes_allowed,
>  }
>  
>  #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> -static int set_max_huge_pages(struct hstate *h, unsigned long count,
> +static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
> nodemask_t *nodes_allowed)
>  {
>   unsigned long min_count, ret;
>  
>   spin_lock(&hugetlb_lock);
>  
> + /*
> +  * Check for a node specific request.
> +  * Changing node specific huge page count may require a corresponding
> +  * change to the global count.  In any case, the passed node mask
> +  * (nodes_allowed) will restrict alloc/free to the specified node.
> +  */
> + if (nid != NUMA_NO_NODE) {
> + unsigned long old_count = count;
> +
> + count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> + /*
> +  * User may have specified a large count value which caused the
> +  * above calculation to overflow.  In this case, they wanted
> +  * to allocate as many huge pages as possible.  Set count to
> +  * largest possible value to align with their intention.
> +  */
> + if (count < old_count)
> + count = ULONG_MAX;
> + }
> +
>   /*
>* Gigantic pages runtime allocation depend on the capability for large
>* page range allocation.
> @@ -2445,15 +2465,22 @@ static ssize_t __nr_hugepages_store_common(bool 
> obey_mempolicy,
>   }
>   } else if (nodes_allowed) {
>   /*
> -  * per node hstate attribute: adjust count to global,
> -  * but restrict alloc/free to the specified node.
> +  * Node specific request.  count adjustment happens in
> +  * set_max_huge_pages() after acquiring hugetlb_lock.
>*/
> - count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
>   init_nodemask_of_node(nodes_allowed, nid);
> - } else
> - nodes_allowed = &node_states[N_MEMORY];
> + } else {
> + /*
> +  * Node specific request, but we could not allocate the few
> +  * words required for a node mask.  We are unlikely to hit
> +  * this condition.  Since we can not pass down the appropriate
> +  * node mask, just return ENOMEM.
> +  */
> + err = -ENOMEM;
> + goto out;
> + }
>  
> - err = set_max_huge_pages(h, count, nodes_allowed);
> + err = set_max_huge_pages(h, count, nid, nodes_allowed);
>  
>  out:
>   if (nodes_allowed != &node_states[N_MEMORY])
> -- 
> 2.20.1
> 
> 


Re: [Qestion] Hit a WARN_ON_ONCE in try_to_unmap_one when runing syzkaller

2019-03-14 Thread Naoya Horiguchi
Hi,

On Wed, Mar 13, 2019 at 12:03:20AM +0800, zhong jiang wrote:
...
> 
> Minchan has changed the condition check from BUG_ON to WARN_ON_ONCE in
> try_to_unmap_one.
> However, it is still an abnormal condition when PageSwapBacked is not equal
> to PageSwapCache.
> 
> But is there any case that meets the condition in mainline?
> 
> It is assumed that PageSwapBacked(page) is true for an anonymous page, that
> is to say, PageSwapCache
> is false. However, that is impossible because we will update the pte to a
> hwpoison entry.
> 
> Because the page is locked, its page flags should not be changed except for
> PageSwapBacked

try_to_unmap_one() from hwpoison_user_mappings() could reach the
WARN_ON_ONCE() only if TTU_IGNORE_HWPOISON is set, because PageHWPoison()
is set at the beginning of memory_failure().

Setting TTU_IGNORE_HWPOISON can happen on the following two paths:

  static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
                                     int flags, struct page **hpagep)
  {
  ...
          if (PageSwapCache(p)) {
                  pr_err("Memory failure: %#lx: keeping poisoned page in swap cache\n",
                          pfn);
                  ttu |= TTU_IGNORE_HWPOISON;
          }
  ...
          mapping = page_mapping(hpage);
          if (!(flags & MF_MUST_KILL) && !PageDirty(hpage) && mapping &&
              mapping_cap_writeback_dirty(mapping)) {
                  if (page_mkclean(hpage)) {
                          SetPageDirty(hpage);
                  } else {
                          kill = 0;
                          ttu |= TTU_IGNORE_HWPOISON;
                          pr_info("Memory failure: %#lx: corrupted page was clean: dropped without side effects\n",
                                  pfn);
                  }
          }
  ...
          unmap_success = try_to_unmap(hpage, ttu);
  ...

So either of the above "ttu |= TTU_IGNORE_HWPOISON" should be executed.
I'm not sure which one, but both paths show printk messages, so if you
could have kernel message log, that might help ...

Thanks,
Naoya Horiguchi


Re: [PATCH v4] mm/hugetlb: Fix unsigned overflow in __nr_hugepages_store_common()

2019-03-04 Thread Naoya Horiguchi
On Tue, Feb 26, 2019 at 04:03:23PM -0800, Mike Kravetz wrote:
> On 2/26/19 2:36 PM, Andrew Morton wrote:
...
> >>
> >> +  } else {
> >>/*
> >> -   * per node hstate attribute: adjust count to global,
> >> -   * but restrict alloc/free to the specified node.
> >> +   * Node specific request, but we could not allocate
> >> +   * node mask.  Pass in ALL nodes, and clear nid.
> >> */
> > 
> > Ditto here, somewhat.

# I missed this part when reviewing yesterday for some reason, sorry.

> 
> I was just going to update the comments and send you a new patch, but
> but your comment got me thinking about this situation.  I did not really
> change the way this code operates.  As a reminder, the original code is like:
> 
> NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL | __GFP_NORETRY);
> 
> if (nid == NUMA_NO_NODE) {
>   /* do something */
> } else if (nodes_allowed) {
>   /* do something else */
> } else {
>   nodes_allowed = &node_states[N_MEMORY];
> }
> 
> So, the only way we get to that final else is if we can not allocate
> a node mask (kmalloc a few words).  Right?  I wonder why we should
> even try to continue in this case.  Why not just return right there?

Simply returning on allocation failure looks better to me.
As you mentioned below, current behavior for this 'else' case is not
helpful for anyone.

Thanks,
Naoya Horiguchi

> 
> The specified count value is either a request to increase or decrease the
> number of huge pages.  If we can't allocate a few words, we certainly
> are not going to find memory to create huge pages.  There 'might' be
> surplus pages which can be converted to permanent pages.  But remember
> this is a 'node specific' request and we can't allocate a mask to pass
> down to the conversion routines.  So, chances are good we would operate
> on the wrong node.  The same goes for a request to 'free' huge pages.
> Since we can't allocate a node mask, we are likely to free them from
> the wrong node.
> 
> Unless my reasoning above is incorrect, I think that final else block
> in __nr_hugepages_store_common() is wrong.
> 
> Any additional thoughts?
> -- 
> Mike Kravetz
> 


Re: [PATCH v4] mm/hugetlb: Fix unsigned overflow in __nr_hugepages_store_common()

2019-03-03 Thread Naoya Horiguchi
On Tue, Feb 26, 2019 at 11:32:24AM -0800, Mike Kravetz wrote:
> On 2/25/19 10:21 PM, David Rientjes wrote:
> > On Tue, 26 Feb 2019, Jing Xiangfeng wrote:
> >> On 2019/2/26 3:17, David Rientjes wrote:
> >>> On Mon, 25 Feb 2019, Mike Kravetz wrote:
> >>>
> >>>> Ok, what about just moving the calculation/check inside the lock as in 
> >>>> the
> >>>> untested patch below?
> >>>>
> >>>> Signed-off-by: Mike Kravetz 
> 
> 
> 
> >>>
> >>> Looks good; Jing, could you test that this fixes your case?
> >>
> >> Yes, I have tested this patch, it can also fix my case.
> > 
> > Great!
> > 
> > Reported-by: Jing Xiangfeng 
> > Tested-by: Jing Xiangfeng 
> > Acked-by: David Rientjes 
> 
> Thanks Jing and David!
> 
> Here is the patch with an updated commit message and above tags:
> 
> From: Mike Kravetz 
> Date: Tue, 26 Feb 2019 10:43:24 -0800
> Subject: [PATCH] hugetlbfs: fix potential over/underflow setting node specific
> nr_hugepages
> 
> The number of node specific huge pages can be set via a file such as:
> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
> When a node specific value is specified, the global number of huge
> pages must also be adjusted.  This adjustment is calculated as the
> specified node specific value + (global value - current node value).
> If the node specific value provided by the user is large enough, this
> calculation could overflow an unsigned long leading to a smaller
> than expected number of huge pages.
> 
> To fix, check the calculation for overflow.  If overflow is detected,
> use ULONG_MAX as the requested value.  This is inline with the user
> request to allocate as many huge pages as possible.
> 
> It was also noticed that the above calculation was done outside the
> hugetlb_lock.  Therefore, the values could be inconsistent and result
> in underflow.  To fix, the calculation is moved to within the routine
> set_max_huge_pages() where the lock is held.
> 
> Reported-by: Jing Xiangfeng 
> Signed-off-by: Mike Kravetz 
> Tested-by: Jing Xiangfeng 
> Acked-by: David Rientjes 

Looks good to me with improved comments.
Thanks everyone.

Reviewed-by: Naoya Horiguchi 

> ---
>  mm/hugetlb.c | 34 ++
>  1 file changed, 26 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index b37e3100b7cc..a7e4223d2df5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2274,7 +2274,7 @@ static int adjust_pool_surplus(struct hstate *h,
> nodemask_t *nodes_allowed,
>  }
> 
>  #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> -static int set_max_huge_pages(struct hstate *h, unsigned long count,
> +static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>   nodemask_t *nodes_allowed)
>  {
>   unsigned long min_count, ret;
> @@ -2289,6 +2289,23 @@ static int set_max_huge_pages(struct hstate *h, 
> unsigned
> long count,
>   goto decrease_pool;
>   }
> 
> + spin_lock(&hugetlb_lock);
> +
> + /*
> +  * Check for a node specific request.  Adjust global count, but
> +  * restrict alloc/free to the specified node.
> +  */
> + if (nid != NUMA_NO_NODE) {
> + unsigned long old_count = count;
> + count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> + /*
> +  * If user specified count causes overflow, set to
> +  * largest possible value.
> +  */
> + if (count < old_count)
> + count = ULONG_MAX;
> + }
> +
>   /*
>* Increase the pool size
>* First take pages out of surplus state.  Then make up the
> @@ -2300,7 +2317,6 @@ static int set_max_huge_pages(struct hstate *h, unsigned
> long count,
>* pool might be one hugepage larger than it needs to be, but
>* within all the constraints specified by the sysctls.
>*/
> - spin_lock(&hugetlb_lock);
>   while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
>   if (!adjust_pool_surplus(h, nodes_allowed, -1))
>   break;
> @@ -2421,16 +2437,18 @@ static ssize_t __nr_hugepages_store_common(bool
> obey_mempolicy,
>   nodes_allowed = &node_states[N_MEMORY];
>   }
>   } else if (nodes_allowed) {
> + /* Node specific request */
> + init_nodemask_of_node(nodes_allowed, nid);
> + } else {
>   /*
>

Re: [PATCH] mm: hwpoison: fix thp split handing in soft_offline_in_use_page()

2019-02-28 Thread Naoya Horiguchi
On Tue, Feb 26, 2019 at 10:34:32PM +0800, zhong jiang wrote:
> On 2019/2/26 21:51, Kirill A. Shutemov wrote:
> > On Tue, Feb 26, 2019 at 07:18:00PM +0800, zhong jiang wrote:
> >> From: zhongjiang 
> >>
> >> When soft_offline_in_use_page() runs on a thp tail page after pmd is plit,
> > s/plit/split/
> >
> >> we trigger the following VM_BUG_ON_PAGE():
> >>
> >> Memory failure: 0x3755ff: non anonymous thp
> >> __get_any_page: 0x3755ff: unknown zero refcount page type 2f8000
> >> Soft offlining pfn 0x34d805 at process virtual address 0x20fff000
> >> page:ea000d360140 count:0 mapcount:0 mapping: index:0x1
> >> flags: 0x2f8000()
> >> raw: 002f8000 ea000d360108 ea000d360188 
> >> raw: 0001   
> >> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> >> [ cut here ]
> >> kernel BUG at ./include/linux/mm.h:519!
> >>
> >> soft_offline_in_use_page() passed refcount and page lock from tail page to
> >> head page, which is not needed because we can pass any subpage to
> >> split_huge_page().
> > I don't see a description of what is going wrong and why the change will fix
> > it. From the description, it appears to be a cosmetic-only change.
> >
> > Please elaborate.
> When soft_offline_in_use_page runs on a thp tail page after the pmd is split
> and we pass the head page to split_huge_page, unfortunately the tail page
> can be freed or its refcount can drop to zero.

I guess that you have the similar fix on memory_failure() in your mind:

  commit c3901e722b2975666f42748340df798114742d6d
  Author: Naoya Horiguchi 
  Date:   Thu Nov 10 10:46:23 2016 -0800
  
  mm: hwpoison: fix thp split handling in memory_failure()

So it seems that I somehow missed fixing soft offline when I wrote commit
c3901e722b29, and now you found and fixed that. Thank you very much.
If you resend the patch with the typo fixed, could you add a reference to
c3901e722b29 in the patch description to show the linkage?
And you can add the following tags:

Fixes: 61f5d698cc97 ("mm: re-enable THP")
Acked-by: Naoya Horiguchi 

Thanks,
Naoya Horiguchi


Re: [PATCH] huegtlbfs: fix races and page leaks during migration

2019-02-25 Thread Naoya Horiguchi
ge_mapping() field
> is cleared yet page_private remains set until the page is actually freed
> by free_huge_page().  A page could be migrated while in this state.
> However, since page_mapping() is not set the hugetlbfs specific routine
> to transfer page_private is not called and we leak the page count in the
> filesystem.  To fix, check for this condition before migrating a huge
> page.  If the condition is detected, return EBUSY for the page.
> 
> Cc: 
> Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
> Signed-off-by: Mike Kravetz 
> ---
>  fs/hugetlbfs/inode.c | 12 
>  mm/hugetlb.c | 12 +---
>  mm/migrate.c | 11 +++
>  3 files changed, 32 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 32920a10100e..a7fa037b876b 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -859,6 +859,18 @@ static int hugetlbfs_migrate_page(struct address_space
> *mapping,
>   rc = migrate_huge_page_move_mapping(mapping, newpage, page);
>   if (rc != MIGRATEPAGE_SUCCESS)
>   return rc;
> +
> + /*
> +  * page_private is subpool pointer in hugetlb pages.  Transfer to
> +  * new page.  PagePrivate is not associated with page_private for
> +  * hugetlb pages and can not be set here as only page_huge_active
> +  * pages can be migrated.
> +  */
> + if (page_private(page)) {
> + set_page_private(newpage, page_private(page));
> + set_page_private(page, 0);
> + }
> +
>   if (mode != MIGRATE_SYNC_NO_COPY)
>   migrate_page_copy(newpage, page);
>   else
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a80832487981..e9c92e925b7e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
...
> @@ -3863,6 +3864,11 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>   }
> 
>   spin_unlock(ptl);
> +
> + /* Make newly allocated pages active */

You already have a perfect explanation about why we need this "if",

  > ... We could have got the page from the pagecache, and it could
  > be that the page is !page_huge_active() because it has been isolated for
  > migration.

so you could improve this comment with it.

Anyway, I agree with what and how you are trying to fix this.

Reviewed-by: Naoya Horiguchi 

Thanks,
Naoya Horiguchi


Re: [PATCH] huegtlbfs: fix page leak during migration of file pages

2019-02-11 Thread Naoya Horiguchi
On Mon, Feb 11, 2019 at 03:06:27PM -0800, Mike Kravetz wrote:
> On 2/7/19 11:31 PM, Naoya Horiguchi wrote:
> > On Thu, Feb 07, 2019 at 09:50:30PM -0800, Mike Kravetz wrote:
> >> On 2/7/19 6:31 PM, Naoya Horiguchi wrote:
> >>> On Thu, Feb 07, 2019 at 10:50:55AM -0800, Mike Kravetz wrote:
> >>>> On 1/30/19 1:14 PM, Mike Kravetz wrote:
> >>>>> +++ b/fs/hugetlbfs/inode.c
> >>>>> @@ -859,6 +859,16 @@ static int hugetlbfs_migrate_page(struct 
> >>>>> address_space *mapping,
> >>>>> rc = migrate_huge_page_move_mapping(mapping, newpage, page);
> >>>>> if (rc != MIGRATEPAGE_SUCCESS)
> >>>>> return rc;
> >>>>> +
> >>>>> +   /*
> >>>>> +* page_private is subpool pointer in hugetlb pages, transfer
> >>>>> +* if needed.
> >>>>> +*/
> >>>>> +   if (page_private(page) && !page_private(newpage)) {
> >>>>> +   set_page_private(newpage, page_private(page));
> >>>>> +   set_page_private(page, 0);
> >>>
> >>> You don't have to copy PagePrivate flag?
> >>>
> >>
> >> Well my original thought was no.  For hugetlb pages, PagePrivate is not
> >> associated with page_private.  It indicates a reservation was consumed.
> >> It is set  when a hugetlb page is newly allocated and the allocation is
> >> associated with a reservation and the global reservation count is
> >> decremented.  When the page is added to the page cache or rmap,
> >> PagePrivate is cleared.  If the page is free'ed before being added to page
> >> cache or rmap, PagePrivate tells free_huge_page to restore (increment) the
> >> reserve count as we did not 'instantiate' the page.
> >>
> >> So, PagePrivate is only set from the time a huge page is allocated until
> >> it is added to page cache or rmap.  My original thought was that the page
> >> could not be migrated during this time.  However, I am not sure if that
> >> reasoning is correct.  The page is not locked, so it would appear that it
> >> could be migrated?  But, if it can be migrated at this time then perhaps
> >> there are bigger issues for the (hugetlb) page fault code?
> > 
> > In my understanding, free hugetlb pages are not expected to be passed to
> > migrate_pages(), and currently that's ensured by each migration caller
> > which checks and avoids free hugetlb pages on its own.
> > migrate_pages() and its internal code are probably not aware of handling
> > free hugetlb pages, so if they are accidentally passed to migration code,
> > that's a big problem as you are concerned.
> > So the above reasoning should work, at least as long as this assumption is correct.
> > 
> > Most migration callers are not interested in moving free hugepages.
> > The one I'm not sure of is the code path from alloc_contig_range().
> > If someone thinks it's worthwhile to migrate free hugepages to get bigger
> > contiguous memory, he/she may try to enable that code path, and the assumption
> > will be broken.
> 
> You are correct.  We do not migrate free huge pages.  I was thinking more
> about problems if we migrate a page while it is being added to a task's page
> table as in hugetlb_no_page.
> 
> Commit bcc54222309c ("mm: hugetlb: introduce page_huge_active") addresses
> this issue, but I believe there is a bug in the implementation.
> isolate_huge_page contains this test:
> 
>   if (!page_huge_active(page) || !get_page_unless_zero(page)) {
>   ret = false;
>   goto unlock;
>   }
> 
> If the condition is not met, then the huge page can be isolated and migrated.
> 
> In hugetlb_no_page, there is this block of code:
> 
> page = alloc_huge_page(vma, haddr, 0);
> if (IS_ERR(page)) {
> ret = vmf_error(PTR_ERR(page));
> goto out;
> }
> clear_huge_page(page, address, pages_per_huge_page(h));
> __SetPageUptodate(page);
> set_page_huge_active(page);
> 
> if (vma->vm_flags & VM_MAYSHARE) {
> int err = huge_add_to_page_cache(page, mapping, idx);
> if (err) {
> put_page(page);
> if (err == -EEXIST)
> goto retry;
> goto out;
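
To make the window described above concrete, here is a toy userspace model of the
ordering problem (the flags and function names are stand-ins, not kernel APIs, and
the sleeps merely widen the window for demonstration): the fault path publishes
"active" before the page is in the page cache, so a concurrent isolation can grab
a page that is not yet reachable through the cache. Build with -pthread.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int page_active;		/* models page_huge_active */
static atomic_int page_in_cache;	/* models huge_add_to_page_cache done */

/* Models hugetlb_no_page(): activation happens before the cache insert. */
static void *fault_path(void *arg)
{
	(void)arg;
	atomic_store(&page_active, 1);		/* set_page_huge_active(), too early */
	usleep(10000);				/* the race window */
	atomic_store(&page_in_cache, 1);	/* huge_add_to_page_cache() */
	return NULL;
}

/* Models isolate_huge_page(): any "active" page is fair game. */
static void *isolation_path(void *arg)
{
	(void)arg;
	if (atomic_load(&page_active) && !atomic_load(&page_in_cache))
		puts("isolated a page that is not in the page cache yet -> leak");
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, fault_path, NULL);
	usleep(1000);				/* land inside the window */
	pthread_create(&b, NULL, isolation_path, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}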

Re: [PATCH] huegtlbfs: fix page leak during migration of file pages

2019-02-07 Thread Naoya Horiguchi
On Thu, Feb 07, 2019 at 09:50:30PM -0800, Mike Kravetz wrote:
> On 2/7/19 6:31 PM, Naoya Horiguchi wrote:
> > On Thu, Feb 07, 2019 at 10:50:55AM -0800, Mike Kravetz wrote:
> >> On 1/30/19 1:14 PM, Mike Kravetz wrote:
> >>> +++ b/fs/hugetlbfs/inode.c
> >>> @@ -859,6 +859,16 @@ static int hugetlbfs_migrate_page(struct 
> >>> address_space *mapping,
> >>>   rc = migrate_huge_page_move_mapping(mapping, newpage, page);
> >>>   if (rc != MIGRATEPAGE_SUCCESS)
> >>>   return rc;
> >>> +
> >>> + /*
> >>> +  * page_private is subpool pointer in hugetlb pages, transfer
> >>> +  * if needed.
> >>> +  */
> >>> + if (page_private(page) && !page_private(newpage)) {
> >>> + set_page_private(newpage, page_private(page));
> >>> + set_page_private(page, 0);
> > 
> > You don't have to copy PagePrivate flag?
> > 
> 
> Well my original thought was no.  For hugetlb pages, PagePrivate is not
> associated with page_private.  It indicates a reservation was consumed.
> It is set  when a hugetlb page is newly allocated and the allocation is
> associated with a reservation and the global reservation count is
> decremented.  When the page is added to the page cache or rmap,
> PagePrivate is cleared.  If the page is free'ed before being added to page
> cache or rmap, PagePrivate tells free_huge_page to restore (increment) the
> reserve count as we did not 'instantiate' the page.
> 
> So, PagePrivate is only set from the time a huge page is allocated until
> it is added to page cache or rmap.  My original thought was that the page
> could not be migrated during this time.  However, I am not sure if that
> reasoning is correct.  The page is not locked, so it would appear that it
> could be migrated?  But, if it can be migrated at this time then perhaps
> there are bigger issues for the (hugetlb) page fault code?

In my understanding, free hugetlb pages are not expected to be passed to
migrate_pages(), and currently that's ensured by each migration caller
which checks and avoids free hugetlb pages on its own.
migrate_pages() and its internal code are probably not aware of handling
free hugetlb pages, so if they are accidentally passed to migration code,
that's a big problem as you are concerned.
So the above reasoning should work, at least as long as this assumption is correct.

Most migration callers are not interested in moving free hugepages.
The one I'm not sure of is the code path from alloc_contig_range().
If someone thinks it's worthwhile to migrate free hugepages to get bigger
contiguous memory, he/she may try to enable that code path, and the assumption
will be broken.

Thanks,
Naoya Horiguchi

> 
> >>> +
> >>> + }
> >>> +
> >>>   if (mode != MIGRATE_SYNC_NO_COPY)
> >>>   migrate_page_copy(newpage, page);
> >>>   else
> >>> diff --git a/mm/migrate.c b/mm/migrate.c
> >>> index f7e4bfdc13b7..0d9708803553 100644
> >>> --- a/mm/migrate.c
> >>> +++ b/mm/migrate.c
> >>> @@ -703,8 +703,14 @@ void migrate_page_states(struct page *newpage, 
> >>> struct page *page)
> >>>*/
> >>>   if (PageSwapCache(page))
> >>>   ClearPageSwapCache(page);
> >>> - ClearPagePrivate(page);
> >>> - set_page_private(page, 0);
> >>> + /*
> >>> +  * Unlikely, but PagePrivate and page_private could potentially
> >>> +  * contain information needed at hugetlb free page time.
> >>> +  */
> >>> + if (!PageHuge(page)) {
> >>> + ClearPagePrivate(page);
> >>> + set_page_private(page, 0);
> >>> + }
> > 
> > # This argument is mainly for existing code...
> > 
> > According to the comment on migrate_page():
> > 
> > /*
> >  * Common logic to directly migrate a single LRU page suitable for
> >  * pages that do not use PagePrivate/PagePrivate2.
> >  *
> >  * Pages are locked upon entry and exit.
> >  */
> > int migrate_page(struct address_space *mapping, ...
> > 
> > So this common logic assumes that page_private is not used, so why do
> > we explicitly clear page_private in migrate_page_states()?
> 
> Perhaps someone else knows.  If not, I can do some git research and
> try to find out why.
> 
> > buffer_migrate_page(), which is commonly used for the case when
> > page_private is used, does that clearing outside migrate_page_states().
> > So I thought that hugetlbfs_migrate_page() could do it in a similar manner.
> > IOW, migrate_page_states() should not do anything on PagePrivate.
> > But there're a few other .migratepage callbacks, and I'm not sure all of
> > them are safe for the change, so this approach might not fit for a small 
> > fix.
> 
> I will look at those as well unless someone knows without researching.
> 
> > 
> > # BTW, there seems a typo in $SUBJECT.
> 
> Thanks!
> 
> -- 
> Mike Kravetz
> 


Re: [PATCH] huegtlbfs: fix page leak during migration of file pages

2019-02-07 Thread Naoya Horiguchi
So this common logic assumes that page_private is not used, so why do
we explicitly clear page_private in migrate_page_states()?
buffer_migrate_page(), which is commonly used for the case when
page_private is used, does that clearing outside migrate_page_states().
So I thought that hugetlbfs_migrate_page() could do it in a similar manner.
IOW, migrate_page_states() should not do anything on PagePrivate.
But there're a few other .migratepage callbacks, and I'm not sure all of
them are safe for the change, so this approach might not fit for a small fix.

# BTW, there seems a typo in $SUBJECT.

Thanks,
Naoya Horiguchi


Re: [PATCH] mm: hwpoison: use do_send_sig_info() instead of force_sig() (Re: PMEM error-handling forces SIGKILL causes kernel panic)

2019-01-16 Thread Naoya Horiguchi
Hi Jane,

On Wed, Jan 16, 2019 at 09:56:02AM -0800, Jane Chu wrote:
> Hi, Naoya,
> 
> On 1/16/2019 1:30 AM, Naoya Horiguchi wrote:
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 7c72f2a95785..831be5ff5f4d 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -372,7 +372,8 @@ static void kill_procs(struct list_head *to_kill, int forcekill, bool fail,
>  		if (fail || tk->addr_valid == 0) {
>  			pr_err("Memory failure: %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
>  			       pfn, tk->tsk->comm, tk->tsk->pid);
> -			force_sig(SIGKILL, tk->tsk);
> +			do_send_sig_info(SIGKILL, SEND_SIG_PRIV,
> +					 tk->tsk, PIDTYPE_PID);
>  		}
> 
> 
> Since we don't care the return from do_send_sig_info(), would you mind to
> prefix it with (void) ?

Sorry, I'm not sure about the benefit of casting a return value that is
just being ignored, so personally I'd like to keep the code simple.
Do you have something in mind?

- Naoya


[PATCH] mm: hwpoison: use do_send_sig_info() instead of force_sig() (Re: PMEM error-handling forces SIGKILL causes kernel panic)

2019-01-16 Thread Naoya Horiguchi
[ CCed Andrew and linux-mm ]

On Fri, Jan 11, 2019 at 08:14:02AM +, Horiguchi Naoya(堀口 直也) wrote:
> Hi Dan, Jane,
> 
> Thanks for the report.
> 
> On Wed, Jan 09, 2019 at 03:49:32PM -0800, Dan Williams wrote:
> > [ switch to text mail, add lkml and Naoya ]
> > 
> > On Wed, Jan 9, 2019 at 12:19 PM Jane Chu  wrote:
> ...
> > > 3. The hardware consists the latest revision CPU and Intel NVDIMM, we 
> > > suspected
> > >the CPU faulty because it generated MCE over PMEM UE in a unlikely high
> > >rate for any reasonable NVDIMM (like a few per 24hours).
> > >
> > > After swapping the CPU, the problem stopped reproducing.
> > >
> > > But one could argue that perhaps the faulty CPU exposed a small race 
> > > window
> > > from collect_procs() to unmap_mapping_range() and to kill_procs(), hence
> > > caught the kernel  PMEM error handler off guard.
> > 
> > There's definitely a race, and the implementation is buggy as can be
> > seen in __exit_signal:
> > 
> > sighand = rcu_dereference_check(tsk->sighand,
> > lockdep_tasklist_lock_is_held());
> > spin_lock(&sighand->siglock);
> > 
> > ...the memory-failure path needs to hold the proper locks before it
> > can assume that de-referencing tsk->sighand is valid.
> > 
> > > Also note, the same workload on the same faulty CPU were run on Linux 
> > > prior to
> > > the 4.19 PMEM error handling and did not encounter kernel crash, probably 
> > > because
> > > the prior HWPOISON handler did not force SIGKILL?
> > 
> > Before 4.19 this test should result in a machine-check reboot, not
> > much better than a kernel crash.
> > 
> > > Should we not to force the SIGKILL, or find a way to close the race 
> > > window?
> > 
> > The race should be closed by holding the proper tasklist and rcu read 
> > lock(s).
> 
> This reasoning and proposal sound right to me. I'm trying to reproduce
> this race (for non-pmem case,) but no luck for now. I'll investigate more.

I wrote/tested a patch for this issue.
I think that switching signal API effectively does proper locking.

Thanks,
Naoya Horiguchi
---
From 16dbf6105ff4831f73276d79d5df238ab467de76 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi 
Date: Wed, 16 Jan 2019 16:59:27 +0900
Subject: [PATCH] mm: hwpoison: use do_send_sig_info() instead of force_sig()

Currently memory_failure() is racy against process exit, which can
result in a kernel crash by NULL pointer dereference.

The root cause is that memory_failure() uses force_sig() to forcibly
kill asynchronous (meaning not in the current context) processes.  As
discussed in thread https://lkml.org/lkml/2010/6/8/236 years ago for
OOM fixes, this is not the right thing to do.  OOM solves this issue by
using do_send_sig_info() as done in commit d2d393099de2 ("signal:
oom_kill_task: use SEND_SIG_FORCED instead of force_sig()"), so this
patch suggests doing the same for hwpoison.  do_send_sig_info()
properly takes the siglock via lock_task_sighand(), so it is free from
the reported race.

I confirmed that the reported bug reproduces when some delay is inserted
in kill_procs(), and it never reproduces with this patch.

Note that memory_failure() can send another type of signal using
force_sig_mceerr(), and the reported race shouldn't happen with it
because force_sig_mceerr() is called only for synchronous processes
(i.e. BUS_MCEERR_AR happens only when some process accesses the
corrupted memory).

Reported-by: Jane Chu 
Cc: Dan Williams 
Cc: sta...@vger.kernel.org
Signed-off-by: Naoya Horiguchi 
---
 mm/memory-failure.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 7c72f2a95785..831be5ff5f4d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -372,7 +372,8 @@ static void kill_procs(struct list_head *to_kill, int forcekill, bool fail,
 		if (fail || tk->addr_valid == 0) {
 			pr_err("Memory failure: %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
 			       pfn, tk->tsk->comm, tk->tsk->pid);
-			force_sig(SIGKILL, tk->tsk);
+			do_send_sig_info(SIGKILL, SEND_SIG_PRIV,
+					 tk->tsk, PIDTYPE_PID);
 		}
 
/*
-- 
2.7.5
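
For reference, the receiving side of these signals can be observed from userspace.
The sketch below installs a SIGBUS handler that decodes BUS_MCEERR_AR/BUS_MCEERR_AO
and si_addr_lsb, and uses madvise(MADV_HWPOISON) purely as a test injector; the
injection needs CAP_SYS_ADMIN and CONFIG_MEMORY_FAILURE, and is an assumption for
demonstration, not part of this patch.

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* BUS_MCEERR_AR is sent to the thread that touched the bad page,
 * BUS_MCEERR_AO to other users of it; si_addr_lsb encodes the error
 * granularity (page shift, or the hugepage shift for hugetlb). */
static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	const char *kind = si->si_code == BUS_MCEERR_AR ? "AR" :
			   si->si_code == BUS_MCEERR_AO ? "AO" : "other";

	(void)sig; (void)ctx;
	/* fprintf() is not async-signal-safe; fine for a throwaway demo. */
	fprintf(stderr, "SIGBUS %s at %p, lsb=%d\n", kind, si->si_addr,
		(int)si->si_addr_lsb);
	_exit(1);
}

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	struct sigaction sa;
	char *p;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, NULL);

	p = mmap(NULL, psize, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	p[0] = 1;					/* fault the page in */

	if (madvise(p, psize, MADV_HWPOISON))		/* inject poison (root only) */
		perror("madvise(MADV_HWPOISON)");

	p[0] = 2;	/* expected to raise SIGBUS (typically BUS_MCEERR_AR) */
	return 0;
}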




Re: PMEM error-handling forces SIGKILL causes kernel panic

2019-01-11 Thread Naoya Horiguchi
Hi Dan, Jane,

Thanks for the report.

On Wed, Jan 09, 2019 at 03:49:32PM -0800, Dan Williams wrote:
> [ switch to text mail, add lkml and Naoya ]
> 
> On Wed, Jan 9, 2019 at 12:19 PM Jane Chu  wrote:
...
> > 3. The hardware consists the latest revision CPU and Intel NVDIMM, we 
> > suspected
> >the CPU faulty because it generated MCE over PMEM UE in a unlikely high
> >rate for any reasonable NVDIMM (like a few per 24hours).
> >
> > After swapping the CPU, the problem stopped reproducing.
> >
> > But one could argue that perhaps the faulty CPU exposed a small race window
> > from collect_procs() to unmap_mapping_range() and to kill_procs(), hence
> > caught the kernel  PMEM error handler off guard.
> 
> There's definitely a race, and the implementation is buggy as can be
> seen in __exit_signal:
> 
> sighand = rcu_dereference_check(tsk->sighand,
> lockdep_tasklist_lock_is_held());
> spin_lock(&sighand->siglock);
> 
> ...the memory-failure path needs to hold the proper locks before it
> can assume that de-referencing tsk->sighand is valid.
> 
> > Also note, the same workload on the same faulty CPU were run on Linux prior 
> > to
> > the 4.19 PMEM error handling and did not encounter kernel crash, probably 
> > because
> > the prior HWPOISON handler did not force SIGKILL?
> 
> Before 4.19 this test should result in a machine-check reboot, not
> much better than a kernel crash.
> 
> > Should we not to force the SIGKILL, or find a way to close the race window?
> 
> The race should be closed by holding the proper tasklist and rcu read lock(s).

This reasoning and proposal sound right to me. I'm trying to reproduce
this race (for the non-pmem case), but no luck so far. I'll investigate more.

Thanks,
Naoya Horiguchi


Re: [PATCH] tools/vm/page-types.c: fix "kpagecount returned fewer pages than expected" failures

2018-12-10 Thread Naoya Horiguchi
On Tue, Dec 04, 2018 at 02:24:29PM -0800, Anthony Yznaga wrote:
> Because kpagecount_read() fakes success if map counts are not being
> collected, clamp the page count passed to it by walk_pfn() to the pages
> value returned by the preceding call to kpageflags_read().
> 
> Fixes: 7f1d23e60718 ("tools/vm/page-types.c: include shared map counts")
> Signed-off-by: Anthony Yznaga 

Reviewed-by: Naoya Horiguchi 

> ---
>  tools/vm/page-types.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
> index 37908a83ddc2..1ff3a6c0367b 100644
> --- a/tools/vm/page-types.c
> +++ b/tools/vm/page-types.c
> @@ -701,7 +701,7 @@ static void walk_pfn(unsigned long voffset,
>   if (kpagecgroup_read(cgi, index, pages) != pages)
>   fatal("kpagecgroup returned fewer pages than expected");
>  
> - if (kpagecount_read(cnt, index, batch) != pages)
> + if (kpagecount_read(cnt, index, pages) != pages)
>   fatal("kpagecount returned fewer pages than expected");
>  
>   for (i = 0; i < pages; i++)
> -- 
> 1.8.3.1
> 
> 
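
As background for the indexing involved here: /proc/kpagecount is an array of one
u64 per pfn, and a batched read near the end of physical memory legitimately comes
back short, so a caller must trust only the count it actually got. A minimal
userspace sketch (needs root; the pfn argument and batch size are arbitrary):

#define _GNU_SOURCE
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BATCH 64

int main(int argc, char **argv)
{
	uint64_t counts[BATCH];
	uint64_t pfn = argc > 1 ? strtoull(argv[1], NULL, 0) : 0;
	ssize_t got;
	int fd;

	fd = open("/proc/kpagecount", O_RDONLY);
	if (fd < 0) {
		perror("open /proc/kpagecount");
		return 1;
	}
	/* One u64 per pfn, so the file offset for pfn N is N * 8. */
	got = pread(fd, counts, sizeof(counts), pfn * sizeof(uint64_t));
	if (got < 0) {
		perror("pread");
		return 1;
	}
	printf("requested %d entries at pfn %" PRIu64 ", got %zd\n",
	       BATCH, pfn, got / (ssize_t)sizeof(uint64_t));
	if (got >= (ssize_t)sizeof(uint64_t))
		printf("pfn %" PRIu64 ": raw mapcount %" PRIu64 "\n",
		       pfn, counts[0]);
	close(fd);
	return 0;
}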


Re: [PATCH v2] /proc/kpagecount: return 0 for special pages that are never mapped

2018-12-10 Thread Naoya Horiguchi
On Mon, Dec 10, 2018 at 02:35:13PM -0800, Anthony Yznaga wrote:
> Certain pages that are never mapped to userspace have a type
> indicated in the page_type field of their struct pages (e.g. PG_buddy).
> page_type overlaps with _mapcount so set the count to 0 and avoid
> calling page_mapcount() for these pages.
> 
> Signed-off-by: Anthony Yznaga 
> Acked-by: Matthew Wilcox 

Looks good to me.

Reviewed-by: Naoya Horiguchi 

> ---
>   v2 - incorporated feedback from Matthew Wilcox
> 
>  fs/proc/page.c | 2 +-
>  include/linux/page-flags.h | 6 ++
>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index 6c517b11acf8..40b05e0d4274 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -46,7 +46,7 @@ static ssize_t kpagecount_read(struct file *file, char 
> __user *buf,
>   ppage = pfn_to_page(pfn);
>   else
>   ppage = NULL;
> - if (!ppage || PageSlab(ppage))
> + if (!ppage || PageSlab(ppage) || page_has_type(ppage))
>   pcount = 0;
>   else
>   pcount = page_mapcount(ppage);
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 50ce1bddaf56..39b4494e29f1 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -669,6 +669,7 @@ static inline int TestClearPageDoubleMap(struct page 
> *page)
>  
>  #define PAGE_TYPE_BASE   0xf000
>  /* Reserve   0x007f to catch underflows of page_mapcount */
> +#define PAGE_MAPCOUNT_RESERVE-128
>  #define PG_buddy 0x0080
>  #define PG_balloon   0x0100
>  #define PG_kmemcg0x0200
> @@ -677,6 +678,11 @@ static inline int TestClearPageDoubleMap(struct page 
> *page)
>  #define PageType(page, flag) \
>   ((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
>  
> +static inline int page_has_type(struct page *page)
> +{
> + return (int)page->page_type < PAGE_MAPCOUNT_RESERVE;
> +}
> +
>  #define PAGE_TYPE_OPS(uname, lname)  \
>  static __always_inline int Page##uname(struct page *page)\
>  {\
> -- 
> 1.8.3.1
> 
> 
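
A small userspace illustration of the arithmetic behind the new page_has_type()
check above: page_type shares storage with _mapcount, which starts at -1, and a
type such as PG_buddy is recorded by clearing its bit (in the kernel's scheme at
the time of this patch), so any typed page reinterpreted as a signed int sits far
below PAGE_MAPCOUNT_RESERVE, while the raw _mapcount of a genuinely mapped page is
small. Constants are copied from the hunk; the demo is illustrative only.

#include <stdint.h>
#include <stdio.h>

#define PAGE_TYPE_BASE		0xf0000000u
#define PAGE_MAPCOUNT_RESERVE	-128
#define PG_buddy		0x00000080u

/* Same test as the patch: typed pages look like huge negative mapcounts. */
static int page_has_type(uint32_t page_type)
{
	return (int32_t)page_type < PAGE_MAPCOUNT_RESERVE;
}

/* Same test as PageType(): base bits set and the flag bit cleared. */
static int page_type_is(uint32_t page_type, uint32_t flag)
{
	return (page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE;
}

int main(void)
{
	uint32_t buddy = 0xffffffffu & ~PG_buddy;	/* _mapcount poison, PG_buddy set */
	uint32_t mapped = 3;				/* raw _mapcount of a mapped page */

	printf("buddy  0x%08x: has_type=%d is_buddy=%d\n",
	       buddy, page_has_type(buddy), page_type_is(buddy, PG_buddy));
	printf("mapped 0x%08x: has_type=%d is_buddy=%d\n",
	       mapped, page_has_type(mapped), page_type_is(mapped, PG_buddy));
	return 0;
}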


Re: [RFC PATCH] hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined

2018-12-06 Thread Naoya Horiguchi
On Thu, Dec 06, 2018 at 09:32:06AM +0100, Michal Hocko wrote:
> On Thu 06-12-18 05:21:38, Naoya Horiguchi wrote:
> > On Wed, Dec 05, 2018 at 05:57:16PM +0100, Michal Hocko wrote:
> > > On Wed 05-12-18 13:29:18, Michal Hocko wrote:
> > > [...]
> > > > After some more thinking I am not really sure the above reasoning is
> > > > still true with the current upstream kernel. Maybe I just managed to
> > > > confuse myself so please hold off on this patch for now. Testing by
> > > > Oscar has shown this patch is helping but the changelog might need to be
> > > > updated.
> > > 
> > > OK, so Oscar has nailed it down and it seems that 4.4 kernel we have
> > > been debugging on behaves slightly different. The underlying problem is
> > > the same though. So I have reworded the changelog and added "just in
> > > case" PageLRU handling. Naoya, maybe you have an argument that would
> > > make this void for current upstream kernels.
> > 
> > The following commit (not in 4.4.x stable tree) might explain the
> > difference you experienced:
> > 
> >   commit 286c469a988fbaf68e3a97ddf1e6c245c1446968  
> >   Author: Naoya Horiguchi   
> >   Date:   Wed May 3 14:56:22 2017 -0700
> >
> >   mm: hwpoison: call shake_page() after try_to_unmap() for mlocked page
> > 
> > This commit adds shake_page() for mlocked pages to make sure that the target
> > page is flushed out from LRU cache. Without this shake_page(), subsequent
> > delete_from_lru_cache() (from me_pagecache_clean()) fails to isolate it and
> > the page will eventually go back to the LRU list.  So this scenario leads to
> > a "hwpoisoned but still linked to LRU list" page.
> 
> OK, I see. So does that mean that the LRU handling is no longer needed
> and there is a guarantee that all kernels with the above commit cannot
> ever get an LRU page?

Theoretically there is no such guarantee, because try_to_unmap() is not
guaranteed to succeed, and memory_failure() returns immediately
when hwpoison_user_mappings() fails.
The following code (which comes after the hwpoison_user_mappings() block)
also implies that the target page can still have the PageLRU flag.

/*
 * Torn down by someone else?
 */
if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
	action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
	res = -EBUSY;
	goto out;
}

So I think it's OK to keep the "if (WARN_ON(PageLRU(page)))" block in
the current version of your patch.

Feel free to add my ack.

Acked-by: Naoya Horiguchi 

Thanks,
Naoya Horiguchi
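
Incidentally, the state discussed here (a hwpoisoned page that is still on the LRU)
is visible from userspace via /proc/kpageflags. A minimal sketch (needs root; bit
numbers are from include/uapi/linux/kernel-page-flags.h, and KPF_HWPOISON is only
reported when CONFIG_MEMORY_FAILURE is enabled):

#define _GNU_SOURCE
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define KPF_LRU		5	/* from include/uapi/linux/kernel-page-flags.h */
#define KPF_HWPOISON	19

int main(int argc, char **argv)
{
	uint64_t pfn, flags;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pfn>\n", argv[0]);
		return 1;
	}
	pfn = strtoull(argv[1], NULL, 0);

	fd = open("/proc/kpageflags", O_RDONLY);
	if (fd < 0) {
		perror("open /proc/kpageflags");
		return 1;
	}
	/* One u64 of flag bits per pfn. */
	if (pread(fd, &flags, sizeof(flags), pfn * sizeof(flags)) != sizeof(flags)) {
		fprintf(stderr, "pfn %" PRIu64 " out of range\n", pfn);
		return 1;
	}
	printf("pfn %" PRIu64 ": hwpoison=%d lru=%d\n", pfn,
	       !!(flags & (1ULL << KPF_HWPOISON)),
	       !!(flags & (1ULL << KPF_LRU)));
	close(fd);
	return 0;
}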


Re: [RFC PATCH] hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined

2018-12-05 Thread Naoya Horiguchi
On Wed, Dec 05, 2018 at 05:57:16PM +0100, Michal Hocko wrote:
> On Wed 05-12-18 13:29:18, Michal Hocko wrote:
> [...]
> > After some more thinking I am not really sure the above reasoning is
> > still true with the current upstream kernel. Maybe I just managed to
> > confuse myself so please hold off on this patch for now. Testing by
> > Oscar has shown this patch is helping but the changelog might need to be
> > updated.
> 
> OK, so Oscar has nailed it down and it seems that 4.4 kernel we have
> been debugging on behaves slightly different. The underlying problem is
> the same though. So I have reworded the changelog and added "just in
> case" PageLRU handling. Naoya, maybe you have an argument that would
> make this void for current upstream kernels.

The following commit (not in 4.4.x stable tree) might explain the
difference you experienced:

  commit 286c469a988fbaf68e3a97ddf1e6c245c1446968  
  Author: Naoya Horiguchi   
  Date:   Wed May 3 14:56:22 2017 -0700
   
  mm: hwpoison: call shake_page() after try_to_unmap() for mlocked page

This commit adds shake_page() for mlocked pages to make sure that the target
page is flushed out from LRU cache. Without this shake_page(), subsequent
delete_from_lru_cache() (from me_pagecache_clean()) fails to isolate it and
the page will eventually go back to the LRU list.  So this scenario leads to
a "hwpoisoned but still linked to the LRU list" page.

Thanks,
Naoya Horiguchi



Re: [RFC PATCH] hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined

2018-12-04 Thread Naoya Horiguchi
On Tue, Dec 04, 2018 at 10:35:49AM +0100, Michal Hocko wrote:
> On Tue 04-12-18 09:11:05, Naoya Horiguchi wrote:
> > On Tue, Dec 04, 2018 at 09:48:26AM +0100, Michal Hocko wrote:
> > > On Tue 04-12-18 07:21:16, Naoya Horiguchi wrote:
> > > > On Mon, Dec 03, 2018 at 11:03:09AM +0100, Michal Hocko wrote:
> > > > > From: Michal Hocko 
> > > > > 
> > > > > We have received a bug report that an injected MCE about faulty memory
> > > > > prevents memory offline to succeed. The underlying reason is that the
> > > > > HWPoison page has an elevated reference count and the migration keeps
> > > > > failing. There are two problems with that. First of all it is dubious
> > > > > to migrate the poisoned page because we know that accessing that 
> > > > > memory
> > > > > is possible to fail. Secondly it doesn't make any sense to migrate a
> > > > > potentially broken content and preserve the memory corruption over to 
> > > > > a
> > > > > new location.
> > > > > 
> > > > > Oscar has found out that it is the elevated reference count from
> > > > > memory_failure that is confusing the offlining path. HWPoisoned pages
> > > > > are isolated from the LRU list but __offline_pages might still try to
> > > > > migrate them if there is any preceding migrateable pages in the pfn
> > > > > range. Such a migration would fail due to the reference count but
> > > > > the migration code would put it back on the LRU list. This is quite
> > > > > wrong in itself but it would also make scan_movable_pages stumble over
> > > > > it again without any way out.
> > > > > 
> > > > > This means that the hotremove with hwpoisoned pages has never really
> > > > > worked (without a luck). HWPoisoning really needs a larger surgery
> > > > > but an immediate and backportable fix is to skip over these pages 
> > > > > during
> > > > > offlining. Even if they are still mapped for some reason then
> > > > > try_to_unmap should turn those mappings into hwpoison ptes and cause
> > > > > SIGBUS on access. Nobody should be really touching the content of the
> > > > > page so it should be safe to ignore them even when there is a pending
> > > > > reference count.
> > > > > 
> > > > > Debugged-by: Oscar Salvador 
> > > > > Cc: stable
> > > > > Signed-off-by: Michal Hocko 
> > > > > ---
> > > > > Hi,
> > > > > I am sending this as an RFC now because I am not fully sure I see all
> > > > > the consequences myself yet. This has passed a testing by Oscar but I
> > > > > would highly appreciate a review from Naoya about my assumptions about
> > > > > hwpoisoning. E.g. it is not entirely clear to me whether there is a
> > > > > potential case where the page might be still mapped.
> > > > 
> > > > One potential case is ksm page, for which we give up unmapping and leave
> > > > it unmapped. Rather than that I don't have any idea, but any new type of
> > > > page would be potentially categorized to this class.
> > > 
> > > Could you be more specific why hwpoison code gives up on ksm pages while
> > > we can safely unmap here?
> > 
> > Actually no big reason. Ksm pages never dominate memory, so we simply didn't
> > have strong motivation to save the pages.
> 
> OK, so the unmapping is safe. I will drop a comment. Does this look good
> to you?
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 08c576d5a633..ef5d42759aa2 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1370,7 +1370,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>  	/*
>  	 * HWPoison pages have elevated reference counts so the migration would
>  	 * fail on them. It also doesn't make any sense to migrate them in the
> -	 * first place. Still try to unmap such a page in case it is still mapped.
> +	 * first place. Still try to unmap such a page in case it is still mapped
> +	 * (e.g. current hwpoison implementation doesn't unmap KSM pages but keep
> +	 * the unmap as the catch all safety net).
>  	 */
>  	if (PageHWPoison(page)) {
>  		if (page_mapped(page))

Thanks, I'm fine with this part, which explains why we unmap here.

- Naoya


  1   2   3   4   5   6   7   8   9   10   >