Re: [PATCH 1/5] mm/migrate.c: remove unnecessary VM_BUG_ON_PAGE on putback_movable_page()

2021-03-20 Thread Rafael Aquini
On Sat, Mar 20, 2021 at 05:36:57AM -0400, Miaohe Lin wrote:
> The !PageLocked() check is implicitly done in PageMovable(). Remove this
> explicit one.
>

I'd just keep the explicit assertion in place, regardless.
But if you're going to stick with this patch, please fix it because it's 
removing the wrong assertion.
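To make the mismatch concrete: the changelog argues the !PageLocked() check
is redundant because PageMovable() already asserts it internally, so the
hunk it describes would have to drop the PageLocked assertion and keep the
PageMovable one -- roughly something like this (just a sketch of the stated
intent, not the posted diff):

-	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageMovable(page), page);
 	VM_BUG_ON_PAGE(!PageIsolated(page), page);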


> Signed-off-by: Miaohe Lin 
> ---
>  mm/migrate.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 47df0df8f21a..e4ca5ef508ea 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -146,7 +146,6 @@ void putback_movable_page(struct page *page)
>   struct address_space *mapping;
>  
>   VM_BUG_ON_PAGE(!PageLocked(page), page);
> - VM_BUG_ON_PAGE(!PageMovable(page), page);
>   VM_BUG_ON_PAGE(!PageIsolated(page), page);
>  
>   mapping = page_mapping(page);
> -- 
> 2.19.1
> 
> 



[PATCH v2] mm/slab_common: provide "slab_merge" option for !IS_ENABLED(CONFIG_SLAB_MERGE_DEFAULT) builds

2021-03-19 Thread Rafael Aquini
This is a minor addition to the allocator setup options that provides
a simple way to re-enable slab cache merging on demand, for builds
that run with CONFIG_SLAB_MERGE_DEFAULT unset by default.

Signed-off-by: Rafael Aquini 
---
v2 changelog:
* fix __setup("slab_merge", setup_slab_nomerge); typo
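
For testing convenience: on a kernel built with CONFIG_SLAB_MERGE_DEFAULT=n,
merging can then be re-enabled by booting with the new parameter on the
command line, e.g.

    ... slab_merge ...        (or slub_merge, when SLUB is the allocator)

With SLUB, the effect should be visible under /sys/kernel/slab, where merged
caches show up as aliases (symlinks) of a single cache.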

 Documentation/admin-guide/kernel-parameters.txt | 7 +++
 mm/slab_common.c| 8 
 2 files changed, 15 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 04545725f187..06519eecbfec 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4877,6 +4877,10 @@
 
slram=  [HW,MTD]
 
+   slab_merge  [MM]
+   Enable merging of slabs with similar size when the
+   kernel is built without CONFIG_SLAB_MERGE_DEFAULT.
+
slab_nomerge[MM]
Disable merging of slabs with similar size. May be
necessary if there is some reason to distinguish
@@ -4924,6 +4928,9 @@
lower than slub_max_order.
For more information see Documentation/vm/slub.rst.
 
+   slub_merge  [MM, SLUB]
+   Same with slab_merge.
+
slub_nomerge[MM, SLUB]
Same with slab_nomerge. This is supported for legacy.
See slab_nomerge for more information.
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 88e833986332..b84dd734b75f 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -71,11 +71,19 @@ static int __init setup_slab_nomerge(char *str)
return 1;
 }
 
+static int __init setup_slab_merge(char *str)
+{
+   slab_nomerge = false;
+   return 1;
+}
+
 #ifdef CONFIG_SLUB
 __setup_param("slub_nomerge", slub_nomerge, setup_slab_nomerge, 0);
+__setup_param("slub_merge", slub_merge, setup_slab_merge, 0);
 #endif
 
 __setup("slab_nomerge", setup_slab_nomerge);
+__setup("slab_merge", setup_slab_merge);
 
 /*
  * Determine the size of a slab object
-- 
2.26.2



Re: [PATCH] mm/slab_common: provide "slab_merge" option for !IS_ENABLED(CONFIG_SLAB_MERGE_DEFAULT) builds

2021-03-19 Thread Rafael Aquini
On Fri, Mar 19, 2021 at 03:22:33PM -0400, Rafael Aquini wrote:
> This is a minor addition to the allocator setup options that provides
> a simple way to re-enable slab cache merging on demand, for builds
> that run with CONFIG_SLAB_MERGE_DEFAULT unset by default.
> 
> Signed-off-by: Rafael Aquini 
> ---
>  Documentation/admin-guide/kernel-parameters.txt | 7 +++
>  mm/slab_common.c| 8 
>  2 files changed, 15 insertions(+)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index 04545725f187..06519eecbfec 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4877,6 +4877,10 @@
>  
>   slram=  [HW,MTD]
>  
> + slab_merge  [MM]
> + Enable merging of slabs with similar size when the
> + kernel is built without CONFIG_SLAB_MERGE_DEFAULT.
> +
>   slab_nomerge[MM]
>   Disable merging of slabs with similar size. May be
>   necessary if there is some reason to distinguish
> @@ -4924,6 +4928,9 @@
>   lower than slub_max_order.
>   For more information see Documentation/vm/slub.rst.
>  
> + slub_merge  [MM, SLUB]
> + Same with slab_merge.
> +
>   slub_nomerge[MM, SLUB]
>   Same with slab_nomerge. This is supported for legacy.
>   See slab_nomerge for more information.
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 88e833986332..30db72269036 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -71,11 +71,19 @@ static int __init setup_slab_nomerge(char *str)
>   return 1;
>  }
>  
> +static int __init setup_slab_merge(char *str)
> +{
> + slab_nomerge = false;
> + return 1;
> +}
> +
>  #ifdef CONFIG_SLUB
>  __setup_param("slub_nomerge", slub_nomerge, setup_slab_nomerge, 0);
> +__setup_param("slub_merge", slub_merge, setup_slab_merge, 0);
>  #endif
>  
>  __setup("slab_nomerge", setup_slab_nomerge);
> +__setup("slab_merge", setup_slab_nomerge);
DOH! I missed the typo here ^^   

Sorry.



[PATCH] mm/slab_common: provide "slab_merge" option for !IS_ENABLED(CONFIG_SLAB_MERGE_DEFAULT) builds

2021-03-19 Thread Rafael Aquini
This is a minor addition to the allocator setup options that provides
a simple way to re-enable slab cache merging on demand, for builds
that run with CONFIG_SLAB_MERGE_DEFAULT unset by default.

Signed-off-by: Rafael Aquini 
---
 Documentation/admin-guide/kernel-parameters.txt | 7 +++
 mm/slab_common.c| 8 
 2 files changed, 15 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 04545725f187..06519eecbfec 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4877,6 +4877,10 @@
 
slram=  [HW,MTD]
 
+   slab_merge  [MM]
+   Enable merging of slabs with similar size when the
+   kernel is built without CONFIG_SLAB_MERGE_DEFAULT.
+
slab_nomerge[MM]
Disable merging of slabs with similar size. May be
necessary if there is some reason to distinguish
@@ -4924,6 +4928,9 @@
lower than slub_max_order.
For more information see Documentation/vm/slub.rst.
 
+   slub_merge  [MM, SLUB]
+   Same with slab_merge.
+
slub_nomerge[MM, SLUB]
Same with slab_nomerge. This is supported for legacy.
See slab_nomerge for more information.
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 88e833986332..30db72269036 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -71,11 +71,19 @@ static int __init setup_slab_nomerge(char *str)
return 1;
 }
 
+static int __init setup_slab_merge(char *str)
+{
+   slab_nomerge = false;
+   return 1;
+}
+
 #ifdef CONFIG_SLUB
 __setup_param("slub_nomerge", slub_nomerge, setup_slab_nomerge, 0);
+__setup_param("slub_merge", slub_merge, setup_slab_merge, 0);
 #endif
 
 __setup("slab_nomerge", setup_slab_nomerge);
+__setup("slab_merge", setup_slab_nomerge);
 
 /*
  * Determine the size of a slab object
-- 
2.26.2



Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference

2020-10-05 Thread Rafael Aquini
On Thu, Oct 01, 2020 at 10:31:57AM -0400, Rafael Aquini wrote:
> On Fri, Sep 25, 2020 at 11:21:58AM +0800, Huang, Ying wrote:
> > Rafael Aquini  writes:
> > >> Or, can you help to run the test with a debug kernel based on upstream
> > >> kernel.  I can provide some debug patch.
> > >> 
> > >
> > > Sure, I can set your patches to run with the test cases we have that tend 
> > > to 
> > > reproduce the issue with some degree of success.
> > 
> > Thanks!
> > 
> > I found a race condition.  During THP splitting, "head" may be unlocked
> > before calling split_swap_cluster(), because head != page during
> > deferred splitting.  So we should call split_swap_cluster() before
> > unlocking.  The debug patch to do that is as below.  Can you help to
> > test it?
> > 
> > Best Regards,
> > Huang, Ying
> > 
> > 8<
> > From 24ce0736a9f587d2dba12f12491c88d3e296a491 Mon Sep 17 00:00:00 2001
> > From: Huang Ying 
> > Date: Fri, 25 Sep 2020 11:10:56 +0800
> > Subject: [PATCH] dbg: Call split_swap_clsuter() before unlock page during
> >  split THP
> > 
> > ---
> >  mm/huge_memory.c | 13 +++--
> >  1 file changed, 7 insertions(+), 6 deletions(-)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index faadc449cca5..8d79e5e6b46e 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2444,6 +2444,12 @@ static void __split_huge_page(struct page *page, 
> > struct list_head *list,
> >  
> > remap_page(head);
> >  
> > +   if (PageSwapCache(head)) {
> > +   swp_entry_t entry = { .val = page_private(head) };
> > +
> > +   split_swap_cluster(entry);
> > +   }
> > +
> > for (i = 0; i < HPAGE_PMD_NR; i++) {
> > struct page *subpage = head + i;
> > if (subpage == page)
> > @@ -2678,12 +2684,7 @@ int split_huge_page_to_list(struct page *page, 
> > struct list_head *list)
> > }
> >  
> > __split_huge_page(page, list, end, flags);
> > -   if (PageSwapCache(head)) {
> > -   swp_entry_t entry = { .val = page_private(head) };
> > -
> > -   ret = split_swap_cluster(entry);
> > -   } else
> > -   ret = 0;
> > +   ret = 0;
> > } else {
> > if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
> > pr_alert("total_mapcount: %u, page_count(): %u\n",
> > -- 
> > 2.28.0
> > 
> 
> I left it running for several days on several systems that had hit the
> crash before, and no crashes were observed with either the upstream
> kernel or the distro 4.18-based kernel.
> 
> I guess we can comfortably go with your patch. Thanks!
> 
>
Ping

Are you going to post this fix soon? Or would you rather have me
post it?

regards, 



Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference

2020-10-01 Thread Rafael Aquini
On Fri, Sep 25, 2020 at 11:21:58AM +0800, Huang, Ying wrote:
> Rafael Aquini  writes:
> >> Or, can you help to run the test with a debug kernel based on upstream
> >> kernel.  I can provide some debug patch.
> >> 
> >
> > Sure, I can set your patches to run with the test cases we have that tend 
> > to 
> > reproduce the issue with some degree of success.
> 
> Thanks!
> 
> I found a race condition.  During THP splitting, "head" may be unlocked
> before calling split_swap_cluster(), because head != page during
> deferred splitting.  So we should call split_swap_cluster() before
> unlocking.  The debug patch to do that is as below.  Can you help to
> test it?
> 
> Best Regards,
> Huang, Ying
> 
> 8<
> From 24ce0736a9f587d2dba12f12491c88d3e296a491 Mon Sep 17 00:00:00 2001
> From: Huang Ying 
> Date: Fri, 25 Sep 2020 11:10:56 +0800
> Subject: [PATCH] dbg: Call split_swap_clsuter() before unlock page during
>  split THP
> 
> ---
>  mm/huge_memory.c | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index faadc449cca5..8d79e5e6b46e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2444,6 +2444,12 @@ static void __split_huge_page(struct page *page, 
> struct list_head *list,
>  
>   remap_page(head);
>  
> + if (PageSwapCache(head)) {
> + swp_entry_t entry = { .val = page_private(head) };
> +
> + split_swap_cluster(entry);
> + }
> +
>   for (i = 0; i < HPAGE_PMD_NR; i++) {
>   struct page *subpage = head + i;
>   if (subpage == page)
> @@ -2678,12 +2684,7 @@ int split_huge_page_to_list(struct page *page, struct 
> list_head *list)
>   }
>  
>   __split_huge_page(page, list, end, flags);
> - if (PageSwapCache(head)) {
> - swp_entry_t entry = { .val = page_private(head) };
> -
> - ret = split_swap_cluster(entry);
> - } else
> - ret = 0;
> + ret = 0;
>   } else {
>   if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
>   pr_alert("total_mapcount: %u, page_count(): %u\n",
> -- 
> 2.28.0
> 

I left it running for several days on several systems that had hit the
crash before, and no crashes were observed with either the upstream
kernel or the distro 4.18-based kernel.

I guess we can comfortably go with your patch. Thanks!



Re: [PATCH 1/2] mempolicy: Rename MPOL_F_MORON to MPOL_F_MOPRON

2020-09-26 Thread Rafael Aquini
On Thu, Sep 24, 2020 at 04:25:08PM +0800, Huang Ying wrote:
> To follow code-of-conduct better.
> 
> Signed-off-by: "Huang, Ying" 
> Suggested-by: "Matthew Wilcox (Oracle)" 
> Cc: Andrew Morton 
> Cc: Ingo Molnar 
> Cc: Mel Gorman 
> Cc: Rik van Riel 
> Cc: Johannes Weiner 
> Cc: Dave Hansen 
> Cc: Andi Kleen 
> Cc: Michal Hocko 
> Cc: David Rientjes 
> ---
>  include/uapi/linux/mempolicy.h | 2 +-
>  kernel/sched/debug.c   | 2 +-
>  mm/mempolicy.c | 6 +++---
>  3 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index 3354774af61e..3c3666d017e6 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -60,7 +60,7 @@ enum {
>  #define MPOL_F_SHARED  (1 << 0)  /* identify shared policies */
>  #define MPOL_F_LOCAL   (1 << 1)  /* preferred local allocation */
>  #define MPOL_F_MOF   (1 << 3) /* this policy wants migrate on fault */
> -#define MPOL_F_MORON (1 << 4) /* Migrate On protnone Reference On Node */
> +#define MPOL_F_MOPRON(1 << 4) /* Migrate On Protnone Reference On 
> Node */
>  
>  
>  #endif /* _UAPI_LINUX_MEMPOLICY_H */
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 36c54265bb2b..26495a344d8d 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -844,7 +844,7 @@ static void sched_show_numa(struct task_struct *p, struct 
> seq_file *m)
>  
>   task_lock(p);
>   pol = p->mempolicy;
> - if (pol && !(pol->flags & MPOL_F_MORON))
> + if (pol && !(pol->flags & MPOL_F_MOPRON))
>   pol = NULL;
>   mpol_get(pol);
>   task_unlock(p);
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index eddbe4e56c73..62cd159aa46d 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2515,7 +2515,7 @@ int mpol_misplaced(struct page *page, struct 
> vm_area_struct *vma, unsigned long
>   }
>  
>   /* Migrate the page towards the node whose CPU is referencing it */
> - if (pol->flags & MPOL_F_MORON) {
> + if (pol->flags & MPOL_F_MOPRON) {
>   polnid = thisnid;
>  
>   if (!should_numa_migrate_memory(current, page, curnid, thiscpu))
> @@ -2806,7 +2806,7 @@ void __init numa_policy_init(void)
>   preferred_node_policy[nid] = (struct mempolicy) {
>   .refcnt = ATOMIC_INIT(1),
>   .mode = MPOL_PREFERRED,
> - .flags = MPOL_F_MOF | MPOL_F_MORON,
> + .flags = MPOL_F_MOF | MPOL_F_MOPRON,
>   .v = { .preferred_node = nid, },
>   };
>   }
> @@ -3014,7 +3014,7 @@ void mpol_to_str(char *buffer, int maxlen, struct 
> mempolicy *pol)
>   unsigned short mode = MPOL_DEFAULT;
>   unsigned short flags = 0;
>  
> - if (pol && pol != &default_policy && !(pol->flags & MPOL_F_MORON)) {
> + if (pol && pol != &default_policy && !(pol->flags & MPOL_F_MOPRON)) {
>   mode = pol->mode;
>   flags = pol->flags;
>   }
> -- 
> 2.28.0
> 
> 
Acked-by: Rafael Aquini 



Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference

2020-09-26 Thread Rafael Aquini
On Fri, Sep 25, 2020 at 11:21:58AM +0800, Huang, Ying wrote:
> Rafael Aquini  writes:
> >> Or, can you help to run the test with a debug kernel based on upstream
> >> kernel.  I can provide some debug patch.
> >> 
> >
> > Sure, I can set your patches to run with the test cases we have that tend 
> > to 
> > reproduce the issue with some degree of success.
> 
> Thanks!
> 
> I found a race condition.  During THP splitting, "head" may be unlocked
> before calling split_swap_cluster(), because head != page during
> deferred splitting.  So we should call split_swap_cluster() before
> unlocking.  The debug patch to do that is as below.  Can you help to
> test it?
>


I finally managed to grab a good crashdump and confirm that head is really
not locked. I still need to dig into it to figure out more about the
crash. I guess your patch will guarantee that lock on head, but it still
doesn't explain how we got the THP marked as PG_swapcache in the first
place, given that add_to_swap()->get_swap_page() should have failed,
right?

I'll give your patch a run over the weekend; hopefully we'll have more
info on this next week.

  
> Best Regards,
> Huang, Ying
> 
> 8<
> From 24ce0736a9f587d2dba12f12491c88d3e296a491 Mon Sep 17 00:00:00 2001
> From: Huang Ying 
> Date: Fri, 25 Sep 2020 11:10:56 +0800
> Subject: [PATCH] dbg: Call split_swap_clsuter() before unlock page during
>  split THP
> 
> ---
>  mm/huge_memory.c | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index faadc449cca5..8d79e5e6b46e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2444,6 +2444,12 @@ static void __split_huge_page(struct page *page, 
> struct list_head *list,
>  
>   remap_page(head);
>  
> + if (PageSwapCache(head)) {
> + swp_entry_t entry = { .val = page_private(head) };
> +
> + split_swap_cluster(entry);
> + }
> +
>   for (i = 0; i < HPAGE_PMD_NR; i++) {
>   struct page *subpage = head + i;
>   if (subpage == page)
> @@ -2678,12 +2684,7 @@ int split_huge_page_to_list(struct page *page, struct 
> list_head *list)
>   }
>  
>   __split_huge_page(page, list, end, flags);
> - if (PageSwapCache(head)) {
> - swp_entry_t entry = { .val = page_private(head) };
> -
> - ret = split_swap_cluster(entry);
> - } else
> - ret = 0;
> + ret = 0;
>   } else {
>   if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
>   pr_alert("total_mapcount: %u, page_count(): %u\n",
> -- 
> 2.28.0
> 



Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference

2020-09-24 Thread Rafael Aquini
On Thu, Sep 24, 2020 at 03:45:52PM +0800, Huang, Ying wrote:
> Rafael Aquini  writes:
> 
> > On Thu, Sep 24, 2020 at 11:51:17AM +0800, Huang, Ying wrote:
> >> Rafael Aquini  writes:
> >> > The bug here is quite simple: split_swap_cluster() misses checking for
> >> > lock_cluster() returning NULL before committing to change 
> >> > cluster_info->flags.
> >> 
> >> I don't think so.  We shouldn't run into this situation firstly.  So the
> >> "fix" hides the real bug instead of fixing it.  Just like we call
> >> VM_BUG_ON_PAGE(!PageLocked(head), head) in split_huge_page_to_list()
> >> instead of returning if !PageLocked(head) silently.
> >>
> >
> > Not the same thing, obviously, as you are going for an apples-to-carrots
> > comparison, but since you mentioned:
> >
> > split_huge_page_to_list() asserts (in debug builds) *page is locked,
> 
>   VM_BUG_ON_PAGE(!PageLocked(head), head);
> 
> It asserts *head instead of *page.
>
> > and later checks if *head bears the SwapCache flag. 
> > deferred_split_scan(), OTOH, doesn't hand down the compound head locked, 
> > but the 2nd page in the group instead.
> 
> No.  deferred_split_scan() will can trylock_page() on the 2nd page in
> the group, but
> 
> static inline int trylock_page(struct page *page)
> {
>   page = compound_head(page);
>   return (likely(!test_and_set_bit_lock(PG_locked, >flags)));
> }
> 
> So the head page will be locked instead.
> 

Yep, missed that. Thanks for straightening me out on this one.


> > This doesn't necessarily mean it's a problem, though, but it might help
> > with hitting the issue.
> >  
> >> > The fundamental problem has nothing to do with allocating, or not 
> >> > allocating
> >> > a swap cluster, but it has to do with the fact that the THP deferred 
> >> > split scan
> >> > can transiently race with swapcache insertion, and the fact that when 
> >> > you run
> >> > your swap area on rotational storage cluster_info is _always_ NULL.
> >> > split_swap_cluster() needs to check for lock_cluster() returning NULL 
> >> > because
> >> > that's one possible case, and it clearly fails to do so.
> >> 
> >> If there's a race, we should fix the race.  But the code path for
> >> swapcache insertion is,
> >> 
> >> add_to_swap()
> >>   get_swap_page() /* Return if fails to allocate */
> >>   add_to_swap_cache()
> >> SetPageSwapCache()
> >> 
> >> While the code path to split THP is,
> >> 
> >> split_huge_page_to_list()
> >>   if PageSwapCache()
> >> split_swap_cluster()
> >> 
> >> Both code paths are protected by the page lock.  So there should be some
> >> other reasons to trigger the bug.
> >
> > As mentioned above, no, they don't seem to be protected (at least, not by
> > the same page lock, depending on the case). While add_to_swap() will take
> > the page lock on the compound head, split_huge_page_to_list() does not.
> >
> >
> >> And again, for HDD, a THP shouldn't have PageSwapCache() set at the
> >> first place.  If so, the bug is that the flag is set and we should fix
> >> the setting.
> >> 
> >
> > I fail to follow your claim here. Where is the guarantee, in the code, that 
> > you'll never have a compound head in the swapcache? 
> 
> We may have a THP in the swap cache, only if non-rotational disk is used
> as swap device.  This is the design assumption of the THP swap support.
> And this is guaranteed via swap space allocation for THP will fail for
> HDD.  If the implementation doesn't guarantee this, we will fix the
> implementation to guarantee this.
> 
> >> > Run a workload that cause multiple THP COW, and add a memory hogger to 
> >> > create
> >> > memory pressure so you'll force the reclaimers to kick the registered
> >> > shrinkers. The trigger is not heavy swapping, and that's probably why
> >> > most swap test cases don't hit it. The window is tight, but you will get 
> >> > the
> >> > NULL pointer dereference.
> >> 
> >> Do you have a script to reproduce the bug?
> >> 
> >
> > Nope, a convoluted set of internal regression tests we have usually
> > triggers it. In the wild, customers running SAP HANA are seeing it
> > occasionally.
> 
> So you haven't reproduce the bug on upstream kernel?
> 

Have you seen the stack dump in the patch? It still reproduces with v

Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference

2020-09-24 Thread Rafael Aquini
On Thu, Sep 24, 2020 at 11:51:17AM +0800, Huang, Ying wrote:
> Rafael Aquini  writes:
> > The bug here is quite simple: split_swap_cluster() misses checking for
> > lock_cluster() returning NULL before committing to change 
> > cluster_info->flags.
> 
> I don't think so.  We shouldn't run into this situation firstly.  So the
> "fix" hides the real bug instead of fixing it.  Just like we call
> VM_BUG_ON_PAGE(!PageLocked(head), head) in split_huge_page_to_list()
> instead of returning if !PageLocked(head) silently.
>

Not the same thing, obviously, as you are going for an apples-to-carrots
comparison, but since you mentioned:

split_huge_page_to_list() asserts (in debug builds) *page is locked, 
and later checks if *head bears the SwapCache flag. 
deferred_split_scan(), OTOH, doesn't hand down the compound head locked, 
but the 2nd page in the group instead. 
This doesn't necessarily mean it's a problem, though, but it might help
with hitting the issue.

 
> > The fundamental problem has nothing to do with allocating, or not allocating
> > a swap cluster, but it has to do with the fact that the THP deferred split 
> > scan
> > can transiently race with swapcache insertion, and the fact that when you 
> > run
> > your swap area on rotational storage cluster_info is _always_ NULL.
> > split_swap_cluster() needs to check for lock_cluster() returning NULL 
> > because
> > that's one possible case, and it clearly fails to do so.
> 
> If there's a race, we should fix the race.  But the code path for
> swapcache insertion is,
> 
> add_to_swap()
>   get_swap_page() /* Return if fails to allocate */
>   add_to_swap_cache()
> SetPageSwapCache()
> 
> While the code path to split THP is,
> 
> split_huge_page_to_list()
>   if PageSwapCache()
> split_swap_cluster()
> 
> Both code paths are protected by the page lock.  So there should be some
> other reasons to trigger the bug.

As mentioned above, no, they don't seem to be protected (at least, not by
the same page lock, depending on the case). While add_to_swap() will take
the page lock on the compound head, split_huge_page_to_list() does not.


> And again, for HDD, a THP shouldn't have PageSwapCache() set at the
> first place.  If so, the bug is that the flag is set and we should fix
> the setting.
> 

I fail to follow your claim here. Where is the guarantee, in the code, that 
you'll never have a compound head in the swapcache? 

> > Run a workload that cause multiple THP COW, and add a memory hogger to 
> > create
> > memory pressure so you'll force the reclaimers to kick the registered
> > shrinkers. The trigger is not heavy swapping, and that's probably why
> > most swap test cases don't hit it. The window is tight, but you will get the
> > NULL pointer dereference.
> 
> Do you have a script to reproduce the bug?
> 

Nope, a convoluted set of internal regression tests we have usually
triggers it. In the wild, customers running SAP HANA are seeing it
occasionally.

> > Regardless of whether you find further bugs or not, this patch is needed
> > to correct a blunt coding mistake.
> 
> As above.  I don't agree with that.
> 

It's OK to disagree; split_swap_cluster() still misses the cluster_info NULL
check, though.



Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference

2020-09-23 Thread Rafael Aquini
On Thu, Sep 24, 2020 at 08:59:40AM +0800, Huang, Ying wrote:
> Rafael Aquini  writes:
> 
> > On Wed, Sep 23, 2020 at 01:13:49PM +0800, Huang, Ying wrote:
> >> Rafael Aquini  writes:
> >> 
> >> > On Wed, Sep 23, 2020 at 10:21:36AM +0800, Huang, Ying wrote:
> >> >> Hi, Rafael,
> >> >> 
> >> >> Rafael Aquini  writes:
> >> >> 
> >> >> > The swap area descriptor only gets struct swap_cluster_info 
> >> >> > *cluster_info
> >> >> > allocated if the swapfile is backed by non-rotational storage.
> >> >> > When the swap area is laid on top of ordinary disk spindles, 
> >> >> > lock_cluster()
> >> >> > will naturally return NULL.
> >> >> 
> >> >> Thanks for reporting.  But the bug looks strange.  Because in a system
> >> >> with only HDD swap devices, during THP swap out, the swap cluster
> >> >> shouldn't be allocated, as in
> >> >> 
> >> >> shrink_page_list()
> >> >>   add_to_swap()
> >> >> get_swap_page()
> >> >>   get_swap_pages()
> >> >> swap_alloc_cluster()
> >> >>
> >> >
> >> > The underlying problem is that swap_info_struct.cluster_info is always 
> >> > NULL 
> >> > on the rotational storage case.
> >> 
> >> Yes.
> >> 
> >> > So, it's very easy to follow that constructions 
> >> > like this one, in split_swap_cluster 
> >> >
> >> > ...
> >> > ci = lock_cluster(si, offset);
> >> > cluster_clear_huge(ci);
> >> > ...
> >> >
> >> > will go for a NULL pointer dereference, in that case, given that 
> >> > lock_cluster 
> >> > reads:
> >> >
> >> > ...
> >> >  struct swap_cluster_info *ci;
> >> > ci = si->cluster_info;
> >> > if (ci) {
> >> > ci += offset / SWAPFILE_CLUSTER;
> >> > spin_lock(&ci->lock);
> >> > }
> >> > return ci;
> >> > ...
> >> 
> >> But on HDD, we shouldn't call split_swap_cluster() at all, because we
> >> will not allocate swap cluster firstly.  So, if we run into this,
> >> there should be some other bug, we need to figure it out.
> >>
> >
> > split_swap_cluster() gets called by split_huge_page_to_list(),
> > if the page happens to be in the swapcache, and it will always
> > go that way, regardless the backing storage type:
> >
> > ...
> > __split_huge_page(page, list, end, flags);
> > if (PageSwapCache(head)) {
> > swp_entry_t entry = { .val = page_private(head) };
> >
> > ret = split_swap_cluster(entry);
> > } else
> > ret = 0;
> > ...
> >
> > The problem is not about allocating the swap_cluster -- it's obviously
> > not allocated in these cases. The problem is that on rotational
> > storage you don't even have the base structure that allows you to
> > keep the swap clusters (cluster_info does not get allocated, at all,
> > so si->cluster_info is always NULL)
> >
> > You can argue about other bugs all you want, it doesn't change
> > the fact that this code is incomplete as it sits, because it 
> > misses checking for a real case where lock_cluster() will return NULL.
> 
> I don't want to argue about anything.  I just want to fix the bug.  The
> fix here will hide the real bug instead of fixing it.  For the situation
> you described (PageSwapCache() returns true for a THP backed by a normal
> swap entry (not swap cluster)), we will run into other troubles too.  So
> we need to find the root cause and fix it.
>

The bug here is quite simple: split_swap_cluster() misses checking for 
lock_cluster() returning NULL before committing to change cluster_info->flags.

The fundamental problem has nothing to do with allocating, or not allocating 
a swap cluster, but it has to do with the fact that the THP deferred split scan
can transiently race with swapcache insertion, and the fact that when you run 
your swap area on rotational storage cluster_info is _always_ NULL. 
split_swap_cluster() needs to check for lock_cluster() returning NULL because 
that's one possible case, and it clearly fails to do so. 
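
For reference, this is roughly what the 5.9-era helper looks like (quoting
from memory, lightly trimmed), and the minimal form of the check I'm talking
about would be the early return marked below -- the posted patch puts the
equivalent check inside the cluster_{is,clear}_huge() wrappers instead:

int split_swap_cluster(swp_entry_t entry)
{
	struct swap_info_struct *si;
	struct swap_cluster_info *ci;
	unsigned long offset = swp_offset(entry);

	si = _swap_info_get(entry);
	if (!si)
		return -EBUSY;
	ci = lock_cluster(si, offset);
	if (!ci)	/* rotational device: si->cluster_info == NULL */
		return 0;
	cluster_clear_huge(ci);
	unlock_cluster(ci);
	return 0;
}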

Run a workload that cause multiple THP COW, and add a memory hogger to create 
memory pressure so you'll force the reclaimers to kick the registered
shrinkers. The trigger is not heavy swapping, and that's probably why
most swap test cases don't hit it. The window is tight, but you will get the 
NULL pointer dereference.

Regardless of whether you find further bugs or not, this patch is needed
to correct a blunt coding mistake.



Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference

2020-09-23 Thread Rafael Aquini
On Tue, Sep 22, 2020 at 12:47:50PM -0700, Andrew Morton wrote:
> On Tue, 22 Sep 2020 14:48:38 -0400 Rafael Aquini  wrote:
> 
> > The swap area descriptor only gets struct swap_cluster_info *cluster_info
> > allocated if the swapfile is backed by non-rotational storage.
> > When the swap area is laid on top of ordinary disk spindles, lock_cluster()
> > will naturally return NULL.
> > 
> > CONFIG_THP_SWAP exposes cluster_info infrastructure to a broader number of
> > use cases, and split_swap_cluster(), which is the counterpart of 
> > split_huge_page()
> > for the THPs in the swapcache, misses checking the return of lock_cluster 
> > before
> > operating on the cluster_info pointer.
> > 
> > This patch addresses that issue by adding a proper check for the pointer
> > not being NULL in the wrappers cluster_{is,clear}_huge(), in order to avoid
> > crashes similar to the one below:
> > 
> > ...
> >
> > Fixes: 59807685a7e77 ("mm, THP, swap: support splitting THP for THP swap 
> > out")
> > Signed-off-by: Rafael Aquini 
> 
> Did you consider cc:stable?
>

UGH! I missed adding it to my cc list. Shall I just forward it now, or
would you prefer a fresh repost?

 



Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference

2020-09-23 Thread Rafael Aquini
On Wed, Sep 23, 2020 at 01:13:49PM +0800, Huang, Ying wrote:
> Rafael Aquini  writes:
> 
> > On Wed, Sep 23, 2020 at 10:21:36AM +0800, Huang, Ying wrote:
> >> Hi, Rafael,
> >> 
> >> Rafael Aquini  writes:
> >> 
> >> > The swap area descriptor only gets struct swap_cluster_info *cluster_info
> >> > allocated if the swapfile is backed by non-rotational storage.
> >> > When the swap area is laid on top of ordinary disk spindles, 
> >> > lock_cluster()
> >> > will naturally return NULL.
> >> 
> >> Thanks for reporting.  But the bug looks strange.  Because in a system
> >> with only HDD swap devices, during THP swap out, the swap cluster
> >> shouldn't be allocated, as in
> >> 
> >> shrink_page_list()
> >>   add_to_swap()
> >> get_swap_page()
> >>   get_swap_pages()
> >> swap_alloc_cluster()
> >>
> >
> > The underlying problem is that swap_info_struct.cluster_info is always NULL 
> > on the rotational storage case.
> 
> Yes.
> 
> > So, it's very easy to follow that constructions 
> > like this one, in split_swap_cluster 
> >
> > ...
> > ci = lock_cluster(si, offset);
> > cluster_clear_huge(ci);
> > ...
> >
> > will go for a NULL pointer dereference, in that case, given that 
> > lock_cluster 
> > reads:
> >
> > ...
> > struct swap_cluster_info *ci;
> > ci = si->cluster_info;
> > if (ci) {
> > ci += offset / SWAPFILE_CLUSTER;
> > spin_lock(&ci->lock);
> > }
> > return ci;
> > ...
> 
> But on HDD, we shouldn't call split_swap_cluster() at all, because we
> will not allocate swap cluster firstly.  So, if we run into this,
> there should be some other bug, we need to figure it out.
>

split_swap_cluster() gets called by split_huge_page_to_list(),
if the page happens to be in the swapcache, and it will always
go that way, regardless the backing storage type:

...
__split_huge_page(page, list, end, flags);
if (PageSwapCache(head)) {
swp_entry_t entry = { .val = page_private(head) };

ret = split_swap_cluster(entry);
} else
ret = 0;
...

The problem is not about allocating the swap_cluster -- it's obviously
not allocated in these cases. The problem is that on rotational
storage you don't even have the base structure that allows you to
keep the swap clusters (cluster_info does not get allocated at all,
so si->cluster_info is always NULL).

You can argue about other bugs all you want; it doesn't change
the fact that this code is incomplete as it sits, because it
misses checking for a real case where lock_cluster() will return NULL.
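
For the record, the reason si->cluster_info stays NULL on spinning disks is
right in the swapon path; paraphrasing from mm/swapfile.c (trimmed, not a
verbatim quote):

	/* SYSCALL_DEFINE2(swapon), after the swap map has been set up */
	if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev))) {
		p->flags |= SWP_SOLIDSTATE;
		nr_cluster = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
		cluster_info = kvcalloc(nr_cluster, sizeof(*cluster_info),
					GFP_KERNEL);
		...
	}
	/*
	 * On a rotational device the branch is skipped, so cluster_info --
	 * and hence si->cluster_info -- is never allocated.
	 */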




Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference

2020-09-22 Thread Rafael Aquini
On Wed, Sep 23, 2020 at 10:21:36AM +0800, Huang, Ying wrote:
> Hi, Rafael,
> 
> Rafael Aquini  writes:
> 
> > The swap area descriptor only gets struct swap_cluster_info *cluster_info
> > allocated if the swapfile is backed by non-rotational storage.
> > When the swap area is laid on top of ordinary disk spindles, lock_cluster()
> > will naturally return NULL.
> 
> Thanks for reporting.  But the bug looks strange.  Because in a system
> with only HDD swap devices, during THP swap out, the swap cluster
> shouldn't be allocated, as in
> 
> shrink_page_list()
>   add_to_swap()
> get_swap_page()
>   get_swap_pages()
> swap_alloc_cluster()
>

The underlying problem is that swap_info_struct.cluster_info is always NULL
in the rotational storage case. So it's very easy to see that constructions
like this one, in split_swap_cluster()

...
ci = lock_cluster(si, offset);
cluster_clear_huge(ci);
...

will go for a NULL pointer dereference, in that case, given that lock_cluster 
reads:

...
struct swap_cluster_info *ci;
ci = si->cluster_info;
if (ci) {
ci += offset / SWAPFILE_CLUSTER;
spin_lock(&ci->lock);
}
return ci;
...




[PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference

2020-09-22 Thread Rafael Aquini
The swap area descriptor only gets struct swap_cluster_info *cluster_info
allocated if the swapfile is backed by non-rotational storage.
When the swap area is laid on top of ordinary disk spindles, lock_cluster()
will naturally return NULL.

CONFIG_THP_SWAP exposes cluster_info infrastructure to a broader number of
use cases, and split_swap_cluster(), which is the counterpart of 
split_huge_page()
for the THPs in the swapcache, misses checking the return of lock_cluster before
operating on the cluster_info pointer.

This patch addresses that issue by adding a proper check for the pointer
not being NULL in the wrappers cluster_{is,clear}_huge(), in order to avoid
crashes similar to the one below:

[ 5758.157556] BUG: kernel NULL pointer dereference, address: 0007
[ 5758.165331] #PF: supervisor write access in kernel mode
[ 5758.171161] #PF: error_code(0x0002) - not-present page
[ 5758.176894] PGD 0 P4D 0
[ 5758.179721] Oops: 0002 [#1] SMP PTI
[ 5758.183614] CPU: 10 PID: 316 Comm: kswapd1 Kdump: loaded Tainted: G S
   - ---  5.9.0-0.rc3.1.tst.el8.x86_64 #1
[ 5758.196717] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS 
SE5C600.86B.02.01.0002.082220131453 08/22/2013
[ 5758.208176] RIP: 0010:split_swap_cluster+0x47/0x60
[ 5758.213522] Code: c1 e3 06 48 c1 eb 0f 48 8d 1c d8 48 89 df e8 d0 20 6a 00 
80 63 07 fb 48 85 db 74 16 48 89 df c6 07 00 66 66 66 90 31 c0 5b c3 <80> 24 25 
07 00 00 00 fb 31 c0 5b c3 b8 f0 ff ff ff 5b c3 66 0f 1f
[ 5758.234478] RSP: 0018:b147442d7af0 EFLAGS: 00010246
[ 5758.240309] RAX:  RBX: 0014b217 RCX: b14779fd9000
[ 5758.248281] RDX: 0014b217 RSI: 9c52f2ab1400 RDI: 0014b217
[ 5758.256246] RBP: e00c51168080 R08: e00c5116fe08 R09: 9c52fffd3000
[ 5758.264208] R10: e00c511537c8 R11: 9c52fffd3c90 R12: 
[ 5758.272172] R13: e00c5117 R14: e00c5117 R15: e00c51168040
[ 5758.280134] FS:  () GS:9c52f2a8() 
knlGS:
[ 5758.289163] CS:  0010 DS:  ES:  CR0: 80050033
[ 5758.295575] CR2: 0007 CR3: 22a0e003 CR4: 000606e0
[ 5758.303538] Call Trace:
[ 5758.306273]  split_huge_page_to_list+0x88b/0x950
[ 5758.311433]  deferred_split_scan+0x1ca/0x310
[ 5758.316202]  do_shrink_slab+0x12c/0x2a0
[ 5758.320491]  shrink_slab+0x20f/0x2c0
[ 5758.324482]  shrink_node+0x240/0x6c0
[ 5758.328469]  balance_pgdat+0x2d1/0x550
[ 5758.332652]  kswapd+0x201/0x3c0
[ 5758.336157]  ? finish_wait+0x80/0x80
[ 5758.340147]  ? balance_pgdat+0x550/0x550
[ 5758.344525]  kthread+0x114/0x130
[ 5758.348126]  ? kthread_park+0x80/0x80
[ 5758.352214]  ret_from_fork+0x22/0x30
[ 5758.356203] Modules linked in: fuse zram rfkill sunrpc intel_rapl_msr 
intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp 
mgag200 iTCO_wdt crct10dif_pclmul iTCO_vendor_support drm_kms_helper 
crc32_pclmul ghash_clmulni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops 
cec rapl joydev intel_cstate ipmi_si ipmi_devintf drm intel_uncore i2c_i801 
ipmi_msghandler pcspkr lpc_ich mei_me i2c_smbus mei ioatdma ip_tables xfs 
libcrc32c sr_mod sd_mod cdrom t10_pi sg igb ahci libahci i2c_algo_bit 
crc32c_intel libata dca wmi dm_mirror dm_region_hash dm_log dm_mod
[ 5758.412673] CR2: 0007
[0.00] Linux version 5.9.0-0.rc3.1.tst.el8.x86_64 
(mockbu...@x86-vm-15.build.eng.bos.redhat.com) (gcc (GCC) 8.3.1 20191121 (Red 
Hat 8.3.1-5), GNU ld version 2.30-79.el8) #1 SMP Wed Sep 9 16:03:34 EDT 2020

Fixes: 59807685a7e77 ("mm, THP, swap: support splitting THP for THP swap out")
Signed-off-by: Rafael Aquini 
---
 mm/swapfile.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 12f59e641b5e..37ddf5e5c53b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -324,14 +324,15 @@ static inline void cluster_set_null(struct 
swap_cluster_info *info)
 
 static inline bool cluster_is_huge(struct swap_cluster_info *info)
 {
-   if (IS_ENABLED(CONFIG_THP_SWAP))
+   if (IS_ENABLED(CONFIG_THP_SWAP) && info)
return info->flags & CLUSTER_FLAG_HUGE;
return false;
 }
 
 static inline void cluster_clear_huge(struct swap_cluster_info *info)
 {
-   info->flags &= ~CLUSTER_FLAG_HUGE;
+   if (IS_ENABLED(CONFIG_THP_SWAP) && info)
+   info->flags &= ~CLUSTER_FLAG_HUGE;
 }
 
 static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct 
*si,
-- 
2.25.4



Re: [PATCH v2] mm, THP, swap: fix allocating cluster for swapfile by mistake

2020-08-20 Thread Rafael Aquini
On Fri, Aug 21, 2020 at 09:34:46AM +1000, Dave Chinner wrote:
> On Thu, Aug 20, 2020 at 12:53:23PM +0800, Gao Xiang wrote:
> > SWP_FS is used to make swap_{read,write}page() go through
> > the filesystem, and it's only used for swap files over
> > NFS. So, !SWP_FS means non NFS for now, it could be either
> > file backed or device backed. Something similar goes with
> > legacy SWP_FILE.
> > 
> > So in order to achieve the goal of the original patch,
> > SWP_BLKDEV should be used instead.
> > 
> > FS corruption can be observed with SSD device + XFS +
> > fragmented swapfile due to CONFIG_THP_SWAP=y.
> > 
> > I reproduced the issue with the following details:
> > 
> > Environment:
> > QEMU + upstream kernel + buildroot + NVMe (2 GB)
> > 
> > Kernel config:
> > CONFIG_BLK_DEV_NVME=y
> > CONFIG_THP_SWAP=y
> 
> Ok, so at it's core this is a swap file extent versus THP swap
> cluster alignment issue?
> 
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 6c26916e95fd..2937daf3ca02 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1074,7 +1074,7 @@ int get_swap_pages(int n_goal, swp_entry_t 
> > swp_entries[], int entry_size)
> > goto nextsi;
> > }
> > if (size == SWAPFILE_CLUSTER) {
> > -   if (!(si->flags & SWP_FS))
> > +   if (si->flags & SWP_BLKDEV)
> > n_ret = swap_alloc_cluster(si, swp_entries);
> > } else
> > n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> 
> IOWs, if you don't make this change, does the corruption problem go
> away if you align swap extents in iomap_swapfile_add_extent() to
> (SWAPFILE_CLUSTER * PAGE_SIZE) instead of just PAGE_SIZE?
> 

I suspect that will have to come with the 3rd, and final, part of the THP_SWAP
work Intel is doing. Right now, basically, all that's accomplished is deferring 
the THP split step when swapping out, so this change is what we need to
avoid stomping outside the file extent boundaries.


> I.e. if the swapfile extents are aligned correctly to huge page swap
> cluster size and alignment, does the swap clustering optimisations
> for swapping THP pages work correctly? And, if so, is there any
> performance benefit we get from enabling proper THP swap clustering
> on swapfiles?
>
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 



Re: [PATCH v2] mm, THP, swap: fix allocating cluster for swapfile by mistake

2020-08-20 Thread Rafael Aquini
On Thu, Aug 20, 2020 at 12:53:23PM +0800, Gao Xiang wrote:
> SWP_FS is used to make swap_{read,write}page() go through
> the filesystem, and it's only used for swap files over
> NFS. So, !SWP_FS means non NFS for now, it could be either
> file backed or device backed. Something similar goes with
> legacy SWP_FILE.
> 
> So in order to achieve the goal of the original patch,
> SWP_BLKDEV should be used instead.
> 
> FS corruption can be observed with SSD device + XFS +
> fragmented swapfile due to CONFIG_THP_SWAP=y.
> 
> I reproduced the issue with the following details:
> 
> Environment:
> QEMU + upstream kernel + buildroot + NVMe (2 GB)
> 
> Kernel config:
> CONFIG_BLK_DEV_NVME=y
> CONFIG_THP_SWAP=y
> 
> Some reproducable steps:
> mkfs.xfs -f /dev/nvme0n1
> mkdir /tmp/mnt
> mount /dev/nvme0n1 /tmp/mnt
> bs="32k"
> sz="1024m"# doesn't matter too much, I also tried 16m
> xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw
> 
> mkswap /tmp/mnt/sw
> swapon /tmp/mnt/sw
> 
> stress --vm 2 --vm-bytes 600M   # doesn't matter too much as well
> 
> Symptoms:
>  - FS corruption (e.g. checksum failure)
>  - memory corruption at: 0xd2808010
>  - segfault
> 
> Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file 
> backed swap device")
> Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
> Cc: "Huang, Ying" 
> Cc: Yang Shi 
> Cc: Rafael Aquini 
> Cc: Dave Chinner 
> Cc: stable 
> Signed-off-by: Gao Xiang 
> ---
> v1: https://lore.kernel.org/r/20200819195613.24269-1-hsiang...@redhat.com
> 
> changes since v1:
>  - improve commit message description
> 
> Hi Andrew,
> Kindly consider this one instead if no other concerns...
> 
> Thanks,
> Gao Xiang
> 
>  mm/swapfile.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 6c26916e95fd..2937daf3ca02 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1074,7 +1074,7 @@ int get_swap_pages(int n_goal, swp_entry_t 
> swp_entries[], int entry_size)
>   goto nextsi;
>   }
>   if (size == SWAPFILE_CLUSTER) {
> - if (!(si->flags & SWP_FS))
> + if (si->flags & SWP_BLKDEV)
>   n_ret = swap_alloc_cluster(si, swp_entries);
>   } else
>   n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> -- 
> 2.18.1
> 
Acked-by: Rafael Aquini 



Re: [PATCH] mm, THP, swap: fix allocating cluster for swapfile by mistake

2020-08-19 Thread Rafael Aquini
On Wed, Aug 19, 2020 at 01:05:06PM -0700, Andrew Morton wrote:
> On Thu, 20 Aug 2020 03:56:13 +0800 Gao Xiang  wrote:
> 
> > SWP_FS doesn't mean the device is file-backed swap device,
> > which just means each writeback request should go through fs
> > by DIO. Or it'll just use extents added by .swap_activate(),
> > but it also works as file-backed swap device.
> 
> This is very hard to understand :(
> 

I'll work with Gao to rephrase that message. Sorry!


> > So in order to achieve the goal of the original patch,
> > SWP_BLKDEV should be used instead.
> > 
> > FS corruption can be observed with SSD device + XFS +
> > fragmented swapfile due to CONFIG_THP_SWAP=y.
> > 
> > Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file 
> > backed swap device")
> > Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
> 
> Why do you think it has taken three years to discover this?
>

My bet here is that it's rare to go for a swapfile on non-rotational
devices, and even rarer to create the swapfile when the filesystem is
already fragmented. 
 
RHEL-8, v4.18-based, is starting to see more adopters among Red Hat's
customer base, hence the report now. We are also working on a secondary
issue related to CONFIG_THP_SWAP, where the deferred THP split
registered shrinker hits a NULL pointer dereference when the
swap device is backed by a rotational drive.

-- Rafael



Re: [PATCH] mm/page_alloc: fix documentation error and remove magic numbers

2020-06-24 Thread Rafael Aquini
On Wed, Jun 24, 2020 at 03:09:58PM +0100, Matthew Wilcox wrote:
> On Wed, Jun 24, 2020 at 10:07:27AM -0400, Rafael Aquini wrote:
> > On Wed, Jun 24, 2020 at 12:12:55PM +0100, Matthew Wilcox wrote:
> > > On Tue, Jun 23, 2020 at 11:27:12PM -0400, Joel Savitz wrote:
> > > > In addition, this patch replaces the magic number bounds with symbolic
> > > > constants to clarify the logic.
> > > 
> > > Why do people think this kind of thing makes the code easier to read?
> > > It actually makes it harder.  Unless the constants are used in more
> > > than one place, just leave the numbers where they are.
> > > 
> > > > @@ -7852,6 +7852,9 @@ void setup_per_zone_wmarks(void)
> > > >   * 8192MB: 11584k
> > > >   * 16384MB:16384k
> > > >   */
> > > > +static const int MIN_FREE_KBYTES_LOWER_BOUND = 1 << 7;
> > > > +static const int MIN_FREE_KBYTES_UPPER_BOUND = 1 << 18;
> > > > +
> > 
> > I think these constants would look better if declared as an enum.
> 
> Why does having to look in two different places make the code clearer?
>

It might not make it clearer in this particular case, because the meaning
was easy to take from the code, but it also doesn't make it harder to
read, so I don't have a strong opinion on this case.

Joel's approach, however, makes sense if you consider that it's generally
good practice to get rid of the unnamed magic-number anti-pattern.

Cheers,
-- Rafael



Re: [PATCH] mm/page_alloc: fix documentation error and remove magic numbers

2020-06-24 Thread Rafael Aquini
On Wed, Jun 24, 2020 at 12:12:55PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 23, 2020 at 11:27:12PM -0400, Joel Savitz wrote:
> > In addition, this patch replaces the magic number bounds with symbolic
> > constants to clarify the logic.
> 
> Why do people think this kind of thing makes the code easier to read?
> It actually makes it harder.  Unless the constants are used in more
> than one place, just leave the numbers where they are.
> 
> > @@ -7852,6 +7852,9 @@ void setup_per_zone_wmarks(void)
> >   * 8192MB: 11584k
> >   * 16384MB:16384k
> >   */
> > +static const int MIN_FREE_KBYTES_LOWER_BOUND = 1 << 7;
> > +static const int MIN_FREE_KBYTES_UPPER_BOUND = 1 << 18;
> > +

I think these constants would look better if declared as an enum.
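
Something along these lines is what I have in mind (illustrative only; the
names are the ones from Joel's patch, placement is up to him):

	/* bounds for the auto-tuned min_free_kbytes, in kB */
	enum {
		MIN_FREE_KBYTES_LOWER_BOUND = 1 << 7,	/* 128 */
		MIN_FREE_KBYTES_UPPER_BOUND = 1 << 18,	/* 262144 */
	};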

> >  int __meminit init_per_zone_wmark_min(void)
> >  {
> > unsigned long lowmem_kbytes;
> > @@ -7862,10 +7865,10 @@ int __meminit init_per_zone_wmark_min(void)
> >  
> > if (new_min_free_kbytes > user_min_free_kbytes) {
> > min_free_kbytes = new_min_free_kbytes;
> > -   if (min_free_kbytes < 128)
> > -   min_free_kbytes = 128;
> > -   if (min_free_kbytes > 262144)
> > -   min_free_kbytes = 262144;
> > +   if (min_free_kbytes < MIN_FREE_KBYTES_LOWER_BOUND)
> > +   min_free_kbytes = MIN_FREE_KBYTES_LOWER_BOUND;
> > +   if (min_free_kbytes > MIN_FREE_KBYTES_UPPER_BOUND)
> > +   min_free_kbytes = MIN_FREE_KBYTES_UPPER_BOUND;
> 
> The only thing I'd consider changing there is replacing 262144 with 256
> * 1024.  1 << 18 is not clearer!


> 



Re: [PATCH mmotm] mm/swap: fix livelock in __read_swap_cache_async()

2020-05-22 Thread Rafael Aquini
h has set SWAP_HAS_CACHE
> +  * in swap_map, but not yet added its page to swap cache.
> +  */
> + cond_resched();
>   }
>  
>   /*
> -  * The swap entry is ours to swap in. Prepare a new page.
> +  * The swap entry is ours to swap in. Prepare the new page.
>*/
>  
> - page = alloc_page_vma(gfp_mask, vma, addr);
> - if (!page)
> - goto fail_free;
> -
>   __SetPageLocked(page);
>   __SetPageSwapBacked(page);
>  
>   /* May fail (-ENOMEM) if XArray node allocation failed. */
> - if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL))
> + if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL)) {
> + put_swap_page(page, entry);
>   goto fail_unlock;
> + }
>  
> - if (mem_cgroup_charge(page, NULL, gfp_mask & GFP_KERNEL))
> - goto fail_delete;
> + if (mem_cgroup_charge(page, NULL, gfp_mask)) {
> + delete_from_swap_cache(page);
> + goto fail_unlock;
> + }
>  
> - /* Initiate read into locked page */
> + /* Caller will initiate read into locked page */
>   SetPageWorkingset(page);
>   lru_cache_add_anon(page);
>   *new_page_allocated = true;
>   return page;
>  
> -fail_delete:
> - delete_from_swap_cache(page);
>  fail_unlock:
>   unlock_page(page);
>   put_page(page);
> -fail_free:
> - swap_free(entry);
>   return NULL;
>  }
>  
> 
Acked-by: Rafael Aquini 



Re: [PATCH v2 15/15] mwl8k: use new module_firmware_crashed()

2020-05-15 Thread Rafael Aquini
On Fri, May 15, 2020 at 09:28:46PM +, Luis Chamberlain wrote:
> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.
> 
> Using a taint flag allows us to annotate when this happens clearly.
> 
> Cc: linux-wirel...@vger.kernel.org
> Cc: Lennert Buytenhek 
> Cc: Kalle Valo 
> Cc: "Gustavo A. R. Silva" 
> Cc: Johannes Berg 
> Cc: Ganapathi Bhat 
> Signed-off-by: Luis Chamberlain 
> ---
>  drivers/net/wireless/marvell/mwl8k.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/wireless/marvell/mwl8k.c 
> b/drivers/net/wireless/marvell/mwl8k.c
> index 97f23f93f6e7..d609ef1bb879 100644
> --- a/drivers/net/wireless/marvell/mwl8k.c
> +++ b/drivers/net/wireless/marvell/mwl8k.c
> @@ -1551,6 +1551,7 @@ static int mwl8k_tx_wait_empty(struct ieee80211_hw *hw)
>* the firmware has crashed
>*/
>   if (priv->hw_restart_in_progress) {
> + module_firmware_crashed();
>   if (priv->hw_restart_owner == current)
>   return 0;
>   else
> -- 
> 2.26.2
> 
Acked-by: Rafael Aquini 



Re: [PATCH v2 14/15] brcm80211: use new module_firmware_crashed()

2020-05-15 Thread Rafael Aquini
On Fri, May 15, 2020 at 09:28:45PM +, Luis Chamberlain wrote:
> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.
> 
> Using a taint flag allows us to annotate when this happens clearly.
> 
> Cc: linux-wirel...@vger.kernel.org
> Cc: brcm80211-dev-list@broadcom.com
> Cc: brcm80211-dev-l...@cypress.com
> Cc: Arend van Spriel 
> Cc: Franky Lin 
> Cc: Hante Meuleman 
> Cc: Chi-Hsien Lin 
> Cc: Wright Feng 
> Cc: Kalle Valo 
> Cc: "Rafał Miłecki" 
> Cc: Pieter-Paul Giesberts 
> Signed-off-by: Luis Chamberlain 
> ---
>  drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c 
> b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
> index c88655acc78c..d623f83568b3 100644
> --- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
> +++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
> @@ -1393,6 +1393,7 @@ void brcmf_fw_crashed(struct device *dev)
>   struct brcmf_pub *drvr = bus_if->drvr;
>  
>   bphy_err(drvr, "Firmware has halted or crashed\n");
> + module_firmware_crashed();
>  
>   brcmf_dev_coredump(dev);
>  
> -- 
> 2.26.2
> 
Acked-by: Rafael Aquini 



Re: [PATCH v2 13/15] ath6kl: use new module_firmware_crashed()

2020-05-15 Thread Rafael Aquini
On Fri, May 15, 2020 at 09:28:44PM +, Luis Chamberlain wrote:
> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.
> 
> Using a taint flag allows us to annotate when this happens clearly.
> 
> Cc: linux-wirel...@vger.kernel.org
> Cc: ath...@lists.infradead.org
> Cc: Kalle Valo 
> Signed-off-by: Luis Chamberlain 
> ---
>  drivers/net/wireless/ath/ath6kl/hif.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/wireless/ath/ath6kl/hif.c 
> b/drivers/net/wireless/ath/ath6kl/hif.c
> index d1942537ea10..cfd838607544 100644
> --- a/drivers/net/wireless/ath/ath6kl/hif.c
> +++ b/drivers/net/wireless/ath/ath6kl/hif.c
> @@ -120,6 +120,7 @@ static int ath6kl_hif_proc_dbg_intr(struct ath6kl_device 
> *dev)
>   int ret;
>  
>   ath6kl_warn("firmware crashed\n");
> + module_firmware_crashed();
>  
>   /*
>* read counter to clear the interrupt, the debug error interrupt is
> -- 
> 2.26.2
> 
Acked-by: Rafael Aquini 



Re: [PATCH v2 11/15] wimax/i2400m: use new module_firmware_crashed()

2020-05-15 Thread Rafael Aquini
On Fri, May 15, 2020 at 09:28:42PM +, Luis Chamberlain wrote:
> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.
> 
> Using a taint flag allows us to annotate when this happens clearly.
> 
> Cc: linux-wi...@intel.com
> Cc: Inaky Perez-Gonzalez 
> Signed-off-by: Luis Chamberlain 
> ---
>  drivers/net/wimax/i2400m/rx.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/wimax/i2400m/rx.c b/drivers/net/wimax/i2400m/rx.c
> index c9fb619a9e01..307a62ca183f 100644
> --- a/drivers/net/wimax/i2400m/rx.c
> +++ b/drivers/net/wimax/i2400m/rx.c
> @@ -712,6 +712,7 @@ void __i2400m_roq_queue(struct i2400m *i2400m, struct 
> i2400m_roq *roq,
>   dev_err(dev, "SW BUG? failed to insert packet\n");
>   dev_err(dev, "ERX: roq %p [ws %u] skb %p nsn %d sn %u\n",
>   roq, roq->ws, skb, nsn, roq_data->sn);
> + module_firmware_crashed();
>   skb_queue_walk(&roq->queue, skb_itr) {
>   roq_data_itr = (struct i2400m_roq_data *) &skb_itr->cb;
>   nsn_itr = __i2400m_roq_nsn(roq, roq_data_itr->sn);
> -- 
> 2.26.2
> 
Acked-by: Rafael Aquini 



Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

2020-05-15 Thread Rafael Aquini
On Fri, May 15, 2020 at 09:28:43PM +, Luis Chamberlain wrote:
> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.
> 
> Using a taint flag allows us to annotate when this happens clearly.
> 
> Cc: linux-wirel...@vger.kernel.org
> Cc: ath...@lists.infradead.org
> Cc: Kalle Valo 
> Signed-off-by: Luis Chamberlain 
> ---
>  drivers/net/wireless/ath/ath10k/pci.c  | 2 ++
>  drivers/net/wireless/ath/ath10k/sdio.c | 2 ++
>  drivers/net/wireless/ath/ath10k/snoc.c | 1 +
>  3 files changed, 5 insertions(+)
> 
> diff --git a/drivers/net/wireless/ath/ath10k/pci.c 
> b/drivers/net/wireless/ath/ath10k/pci.c
> index 1d941d53fdc9..6bd0f3b518b9 100644
> --- a/drivers/net/wireless/ath/ath10k/pci.c
> +++ b/drivers/net/wireless/ath/ath10k/pci.c
> @@ -1767,6 +1767,7 @@ static void ath10k_pci_fw_dump_work(struct work_struct 
> *work)
>   scnprintf(guid, sizeof(guid), "n/a");
>  
>   ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
> + module_firmware_crashed();
>   ath10k_print_driver_info(ar);
>   ath10k_pci_dump_registers(ar, crash_data);
>   ath10k_ce_dump_registers(ar, crash_data);
> @@ -2837,6 +2838,7 @@ static int ath10k_pci_hif_power_up(struct ath10k *ar,
>   if (ret) {
>   if (ath10k_pci_has_fw_crashed(ar)) {
>   ath10k_warn(ar, "firmware crashed during chip reset\n");
> + module_firmware_crashed();
>   ath10k_pci_fw_crashed_clear(ar);
>   ath10k_pci_fw_crashed_dump(ar);
>   }
> diff --git a/drivers/net/wireless/ath/ath10k/sdio.c 
> b/drivers/net/wireless/ath/ath10k/sdio.c
> index e2aff2254a40..d34ad289380f 100644
> --- a/drivers/net/wireless/ath/ath10k/sdio.c
> +++ b/drivers/net/wireless/ath/ath10k/sdio.c
> @@ -794,6 +794,7 @@ static int ath10k_sdio_mbox_proc_dbg_intr(struct ath10k 
> *ar)
>  
>   /* TODO: Add firmware crash handling */
>   ath10k_warn(ar, "firmware crashed\n");
> + module_firmware_crashed();
>  
>   /* read counter to clear the interrupt, the debug error interrupt is
>* counter 0.
> @@ -915,6 +916,7 @@ static int ath10k_sdio_mbox_proc_cpu_intr(struct ath10k 
> *ar)
>   if (cpu_int_status & MBOX_CPU_STATUS_ENABLE_ASSERT_MASK) {
>   ath10k_err(ar, "firmware crashed!\n");
>   queue_work(ar->workqueue, >restart_work);
> + module_firmware_crashed();
>   }
>   return ret;
>  }
> diff --git a/drivers/net/wireless/ath/ath10k/snoc.c 
> b/drivers/net/wireless/ath/ath10k/snoc.c
> index 354d49b1cd45..7cfc123c345c 100644
> --- a/drivers/net/wireless/ath/ath10k/snoc.c
> +++ b/drivers/net/wireless/ath/ath10k/snoc.c
> @@ -1451,6 +1451,7 @@ void ath10k_snoc_fw_crashed_dump(struct ath10k *ar)
>   scnprintf(guid, sizeof(guid), "n/a");
>  
>   ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
> + module_firmware_crashed();
>   ath10k_print_driver_info(ar);
>   ath10k_msa_dump_memory(ar, crash_data);
>   mutex_unlock(>dump_mutex);
> -- 
> 2.26.2
> 
Acked-by: Rafael Aquini 



Re: [PATCH v2 09/15] qed: use new module_firmware_crashed()

2020-05-15 Thread Rafael Aquini
On Fri, May 15, 2020 at 09:28:40PM +, Luis Chamberlain wrote:
> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.
> 
> Using a taint flag allows us to annotate when this happens clearly.
> 
> Cc: Ariel Elior 
> Cc: gr-everest-linux...@marvell.com
> Reviewed-by: Igor Russkikh 
> Signed-off-by: Luis Chamberlain 
> ---
>  drivers/net/ethernet/qlogic/qed/qed_mcp.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c 
> b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
> index 9624616806e7..aea200d465ef 100644
> --- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
> +++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
> @@ -566,6 +566,7 @@ _qed_mcp_cmd_and_union(struct qed_hwfn *p_hwfn,
>   DP_NOTICE(p_hwfn,
> "The MFW failed to respond to command 0x%08x [param 
> 0x%08x].\n",
> p_mb_params->cmd, p_mb_params->param);
> + module_firmware_crashed();
>   qed_mcp_print_cpu_info(p_hwfn, p_ptt);
>  
>   spin_lock_bh(_hwfn->mcp_info->cmd_lock);
> -- 
> 2.26.2
> 
Acked-by: Rafael Aquini 



Re: [PATCH v2 10/15] soc: qcom: ipa: use new module_firmware_crashed()

2020-05-15 Thread Rafael Aquini
On Fri, May 15, 2020 at 09:28:41PM +, Luis Chamberlain wrote:
> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.
> 
> Using a taint flag allows us to annotate when this happens clearly.
> 
> Cc: Alex Elder 
> Signed-off-by: Luis Chamberlain 
> ---
>  drivers/net/ipa/ipa_modem.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ipa/ipa_modem.c b/drivers/net/ipa/ipa_modem.c
> index ed10818dd99f..1790b87446ed 100644
> --- a/drivers/net/ipa/ipa_modem.c
> +++ b/drivers/net/ipa/ipa_modem.c
> @@ -285,6 +285,7 @@ static void ipa_modem_crashed(struct ipa *ipa)
>   struct device *dev = >pdev->dev;
>   int ret;
>  
> + module_firmware_crashed();
>   ipa_endpoint_modem_pause_all(ipa, true);
>  
>   ipa_endpoint_modem_hol_block_clear_all(ipa);
> -- 
> 2.26.2
> 
Acked-by: Rafael Aquini 



Re: [PATCH v2 08/15] ehea: use new module_firmware_crashed()

2020-05-15 Thread Rafael Aquini
On Fri, May 15, 2020 at 09:28:39PM +, Luis Chamberlain wrote:
> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.
> 
> Using a taint flag allows us to annotate when this happens clearly.
> 
> Cc: Douglas Miller 
> Signed-off-by: Luis Chamberlain 
> ---
>  drivers/net/ethernet/ibm/ehea/ehea_main.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/net/ethernet/ibm/ehea/ehea_main.c 
> b/drivers/net/ethernet/ibm/ehea/ehea_main.c
> index 0273fb7a9d01..6ae35067003f 100644
> --- a/drivers/net/ethernet/ibm/ehea/ehea_main.c
> +++ b/drivers/net/ethernet/ibm/ehea/ehea_main.c
> @@ -3285,6 +3285,8 @@ static void ehea_crash_handler(void)
>  {
>   int i;
>  
> + module_firmware_crashed();
> +
>   if (ehea_fw_handles.arr)
>   for (i = 0; i < ehea_fw_handles.num_entries; i++)
>       ehea_h_free_resource(ehea_fw_handles.arr[i].adh,
> -- 
> 2.26.2
> 
Acked-by: Rafael Aquini 



Re: [PATCH v2 07/15] cxgb4: use new module_firmware_crashed()

2020-05-15 Thread Rafael Aquini
On Fri, May 15, 2020 at 09:28:38PM +, Luis Chamberlain wrote:
> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.
> 
> Using a taint flag allows us to annotate when this happens clearly.
> 
> Cc: Vishal Kulkarni 
> Signed-off-by: Luis Chamberlain 
> ---
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
> b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
> index a70018f067aa..c67fc86c0e42 100644
> --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
> +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
> @@ -3646,6 +3646,7 @@ void t4_fatal_err(struct adapter *adap)
>* could be exposed to the adapter.  RDMA MWs for example...
>*/
>   t4_shutdown_adapter(adap);
> + module_firmware_crashed();
>   for_each_port(adap, port) {
>   struct net_device *dev = adap->port[port];
>  
> -- 
> 2.26.2
> 
Acked-by: Rafael Aquini 



Re: [PATCH v2 06/15] liquidio: use new module_firmware_crashed()

2020-05-15 Thread Rafael Aquini
On Fri, May 15, 2020 at 09:28:37PM +, Luis Chamberlain wrote:
> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.
> 
> Using a taint flag allows us to annotate when this happens clearly.
> 
> Cc: Derek Chickles 
> Cc: Satanand Burla 
> Cc: Felix Manlunas 
> Reviewed-by: Derek Chickles 
> Signed-off-by: Luis Chamberlain 
> ---
>  drivers/net/ethernet/cavium/liquidio/lio_main.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c 
> b/drivers/net/ethernet/cavium/liquidio/lio_main.c
> index 66d31c018c7e..f18085262982 100644
> --- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
> +++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
> @@ -801,6 +801,7 @@ static int liquidio_watchdog(void *param)
>   continue;
>  
>   WRITE_ONCE(oct->cores_crashed, true);
> + module_firmware_crashed();
>   other_oct = get_other_octeon_device(oct);
>   if (other_oct)
>       WRITE_ONCE(other_oct->cores_crashed, true);
> -- 
> 2.26.2
> 
Acked-by: Rafael Aquini 



Re: [PATCH v2 05/15] bna: use new module_firmware_crashed()

2020-05-15 Thread Rafael Aquini
On Fri, May 15, 2020 at 09:28:36PM +, Luis Chamberlain wrote:
> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.
> 
> Using a taint flag allows us to annotate when this happens clearly.
> 
> Cc: Rasesh Mody 
> Cc: Sudarsana Kalluru 
> Cc: gr-linux-nic-...@marvell.com
> Signed-off-by: Luis Chamberlain 
> ---
>  drivers/net/ethernet/brocade/bna/bfa_ioc.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/brocade/bna/bfa_ioc.c 
> b/drivers/net/ethernet/brocade/bna/bfa_ioc.c
> index e17bfc87da90..b3f44a912574 100644
> --- a/drivers/net/ethernet/brocade/bna/bfa_ioc.c
> +++ b/drivers/net/ethernet/brocade/bna/bfa_ioc.c
> @@ -927,6 +927,7 @@ bfa_iocpf_sm_disabled(struct bfa_iocpf *iocpf, enum 
> iocpf_event event)
>  static void
>  bfa_iocpf_sm_initfail_sync_entry(struct bfa_iocpf *iocpf)
>  {
> + module_firmware_crashed();
>   bfa_nw_ioc_debug_save_ftrc(iocpf->ioc);
>   bfa_ioc_hw_sem_get(iocpf->ioc);
>  }
> -- 
> 2.26.2
> 
Acked-by: Rafael Aquini 



Re: [PATCH v2 04/15] bnxt: use new module_firmware_crashed()

2020-05-15 Thread Rafael Aquini
On Fri, May 15, 2020 at 09:28:35PM +, Luis Chamberlain wrote:
> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.
> 
> Using a taint flag allows us to annotate when this happens clearly.
> 
> Cc: Michael Chan 
> Signed-off-by: Luis Chamberlain 
> ---
>  drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c 
> b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> index dd0c3f227009..5ba1bd0734e9 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> @@ -3503,6 +3503,7 @@ static int bnxt_get_dump_data(struct net_device *dev, 
> struct ethtool_dump *dump,
>  
>   dump->flag = bp->dump_flag;
>   if (dump->flag == BNXT_DUMP_CRASH) {
> + module_firmware_crashed();
>  #ifdef CONFIG_TEE_BNXT_FW
>   return tee_bnxt_copy_coredump(buf, 0, dump->len);
>  #endif
> -- 
> 2.26.2
> 
Acked-by: Rafael Aquini 



Re: [PATCH v2 03/15] bnx2x: use new module_firmware_crashed()

2020-05-15 Thread Rafael Aquini
On Fri, May 15, 2020 at 09:28:34PM +, Luis Chamberlain wrote:
> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.
> 
> Using a taint flag allows us to annotate when this happens clearly.
> 
> Cc: Ariel Elior 
> Cc: Sudarsana Kalluru 
> CC: gr-everest-linux...@marvell.com
> Signed-off-by: Luis Chamberlain 
> ---
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
> b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
> index db5107e7937c..c38b8c9c8af0 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
> @@ -909,6 +909,7 @@ void bnx2x_panic_dump(struct bnx2x *bp, bool disable_int)
>   bp->eth_stats.unrecoverable_error++;
>   DP(BNX2X_MSG_STATS, "stats_state - DISABLED\n");
>  
> + module_firmware_crashed();
>   BNX2X_ERR("begin crash dump -\n");
>  
>   /* Indices */
> -- 
> 2.26.2
> 
Acked-by: Rafael Aquini 



Re: [PATCH v2 02/15] ethernet/839: use new module_firmware_crashed()

2020-05-15 Thread Rafael Aquini
On Fri, May 15, 2020 at 09:28:33PM +, Luis Chamberlain wrote:
> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.
> 
> Using a taint flag allows us to annotate when this happens clearly.
> 
> Cc: "Michael S. Tsirkin" 
> Cc: Shannon Nelson 
> Cc: Jakub Kicinski 
> Cc: Heiner Kallweit 
> Signed-off-by: Luis Chamberlain 
> ---
>  drivers/net/ethernet/8390/axnet_cs.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/8390/axnet_cs.c 
> b/drivers/net/ethernet/8390/axnet_cs.c
> index aeae7966a082..8ad0200db8e9 100644
> --- a/drivers/net/ethernet/8390/axnet_cs.c
> +++ b/drivers/net/ethernet/8390/axnet_cs.c
> @@ -1358,9 +1358,11 @@ static void ei_receive(struct net_device *dev)
>*/
>   if ((netif_msg_rx_err(ei_local)) &&
>   this_frame != ei_local->current_page &&
> - (this_frame != 0x0 || rxing_page != 0xFF))
> + (this_frame != 0x0 || rxing_page != 0xFF)) {
> + module_firmware_crashed();
>   netdev_err(dev, "mismatched read page pointers %2x vs 
> %2x\n",
>  this_frame, ei_local->current_page);
> + }
>   
>   if (this_frame == rxing_page)   /* Read all the frames? */
>   break;  /* Done for now */
> -- 
> 2.26.2
> 
Acked-by: Rafael Aquini 



Re: [PATCH v2 01/15] taint: add module firmware crash taint support

2020-05-15 Thread Rafael Aquini
On Fri, May 15, 2020 at 09:28:32PM +, Luis Chamberlain wrote:
> Device driver firmware can crash, and sometimes, this can leave your
> system in a state which makes the device or subsystem completely
> useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
> of scraping some magical words from the kernel log, which is driver
> specific, is much easier. So instead provide a helper which lets drivers
> annotate this.
> 
> Once this happens, scrapers can easily look for modules taint flags
> for a firmware crash. This will taint both the kernel and respective
> calling module.
> 
> The new helper module_firmware_crashed() uses LOCKDEP_STILL_OK as this
> fact should in no way shape or form affect lockdep. This taint is device
> driver specific.
> 
> Signed-off-by: Luis Chamberlain 
> ---
>  Documentation/admin-guide/tainted-kernels.rst |  6 ++
>  include/linux/kernel.h|  3 ++-
>  include/linux/module.h| 13 +
>  include/trace/events/module.h |  3 ++-
>  kernel/module.c   |  5 +++--
>  kernel/panic.c|  1 +
>  tools/debugging/kernel-chktaint   |  7 +++
>  7 files changed, 34 insertions(+), 4 deletions(-)
> 

Reviewed-by: Rafael Aquini 
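
The patch body itself is not reproduced in this archive; only the diffstat
is quoted above. As a rough sketch of the shape such a helper takes -- the
flag name TAINT_FIRMWARE_CRASH and the use of the module's taints mask are
assumptions here, since the actual hunks are not visible -- it would look
something like:

    /*
     * Sketch only -- not the quoted patch.  Taints the running kernel
     * (visible in /proc/sys/kernel/tainted) and the calling module,
     * without disturbing lockdep state.
     */
    #define module_firmware_crashed()                                \
    do {                                                             \
            add_taint(TAINT_FIRMWARE_CRASH, LOCKDEP_STILL_OK);       \
            set_bit(TAINT_FIRMWARE_CRASH, &THIS_MODULE->taints);     \
    } while (0)

A scraper can then test the corresponding taint bit instead of grepping
driver-specific "firmware crashed" messages out of the kernel log.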



[PATCH v5] kernel: add panic_on_taint

2020-05-15 Thread Rafael Aquini
Analogously to the introduction of panic_on_warn, this patch introduces a kernel
option named panic_on_taint in order to provide a simple and generic way to stop
execution and catch a coredump when the kernel gets tainted by any given flag.

This is useful for debugging sessions as it avoids having to rebuild the kernel
to explicitly add calls to panic() into the code sites that introduce the taint
flags of interest. For instance, if one is interested in proceeding with a
post-mortem analysis at the point a given code path is hitting a bad page
(i.e. unaccount_page_cache_page(), or slab_bug()), a coredump can be collected
by rebooting the kernel with 'panic_on_taint=0x20' amended to the command line.

Another, perhaps less frequent, use for this option would be as a mean for
assuring a security policy case where only a subset of taints, or no single
taint (in paranoid mode), is allowed for the running system.
The optional switch 'nousertaint' is handy in this particular scenario,
as it will avoid userspace induced crashes by writes to sysctl interface
/proc/sys/kernel/tainted causing false positive hits for such policies.

Suggested-by: Qian Cai 
Signed-off-by: Rafael Aquini 
---
Changelog:
* v2: get rid of unnecessary/misguided compiler hints   (Luis)
  enhance documentation text for the new kernel parameter   (Randy)
* v3: drop sysctl interface, keep it only as a kernel parameter (Luis)
* v4: change panic_on_taint input from alphabetical taint flags
  to hexadecimal bitmasks, for clarity and extendability(Luis)
* v5: add doc note on the potential effects of panic_on_taint
  with nousertaint on writes to kernel.tainted sysctl knob  (Luis)

 Documentation/admin-guide/kdump/kdump.rst |  8 +
 .../admin-guide/kernel-parameters.txt | 13 +++
 Documentation/admin-guide/sysctl/kernel.rst   |  7 
 include/linux/kernel.h|  3 ++
 kernel/panic.c| 34 +++
 kernel/sysctl.c   | 11 +-
 6 files changed, 75 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kdump/kdump.rst 
b/Documentation/admin-guide/kdump/kdump.rst
index ac7e131d2935..2da65fef2a1c 100644
--- a/Documentation/admin-guide/kdump/kdump.rst
+++ b/Documentation/admin-guide/kdump/kdump.rst
@@ -521,6 +521,14 @@ will cause a kdump to occur at the panic() call.  In cases 
where a user wants
 to specify this during runtime, /proc/sys/kernel/panic_on_warn can be set to 1
 to achieve the same behaviour.
 
+Trigger Kdump on add_taint()
+
+
+The kernel parameter panic_on_taint facilitates a conditional call to panic()
+from within add_taint() whenever the value set in this bitmask matches with the
+bit flag being set by add_taint().
+This will cause a kdump to occur at the add_taint()->panic() call.
+
 Contact
 ===
 
diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index d9197499aad1..27b988acb4db 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3422,6 +3422,19 @@
bit 4: print ftrace buffer
bit 5: print all printk messages in buffer
 
+   panic_on_taint= Bitmask for conditionally call panic() in add_taint()
+   Format: <hex>[,nousertaint]
+   Hexadecimal bitmask representing the set of TAINT flags
+   that will cause the kernel to panic when add_taint() is
+   called with any of the flags in this set.
+   The optional switch "nousertaint" can be utilized to
+   prevent userspace forced crashes by writing to sysctl
+   /proc/sys/kernel/tainted any flagset matching with the
+   bitmask set on panic_on_taint.
+   See Documentation/admin-guide/tainted-kernels.rst for
+   extra details on the taint flags that users can pick
+   to compose the bitmask to assign to panic_on_taint.
+
panic_on_warn   panic() instead of WARN().  Useful to cause kdump
on a WARN().
 
diff --git a/Documentation/admin-guide/sysctl/kernel.rst 
b/Documentation/admin-guide/sysctl/kernel.rst
index c6c27db68d4c..427ce0a86b36 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1241,6 +1241,13 @@ ORed together. The letters are seen in "Tainted" line of 
Oops reports.
 
 See :doc:`/admin-guide/tainted-kernels` for more information.
 
+Note:
+  writes to this sysctl interface will fail with ``EINVAL`` if the kernel is
  booted with the command line option ``panic_on_taint=<bitmask>,nousertaint``
+  and any of the ORed together values being written to ``tainted`` match with
  the bitmask declared on panic_on_taint.
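
The kernel/panic.c and kernel/sysctl.c hunks of this patch are cut off by
the archive above. As a stand-alone model of the behaviour the changelog
and the documentation hunks describe -- this is only an illustration that
compiles in user space, not the patch code itself; the TAINT_BAD_PAGE bit
number (5, i.e. the 0x20 example) is assumed from the changelog -- the two
checks work roughly like this:

    #include <stdio.h>
    #include <stdbool.h>

    #define TAINT_BAD_PAGE     5        /* 'B', the 0x20 example above */
    #define TAINT_FLAGS_COUNT  18

    static unsigned long panic_on_taint = 1UL << TAINT_BAD_PAGE; /* =0x20 */
    static bool panic_on_taint_nousertaint = true;               /* ",nousertaint" */
    static unsigned long tainted_mask;

    /* add_taint(): panic when the new flag matches the boot-time mask */
    static void add_taint(unsigned int flag)
    {
            if (panic_on_taint & (1UL << flag)) {
                    printf("would panic(): taint %u matches panic_on_taint\n", flag);
                    return;
            }
            tainted_mask |= 1UL << flag;
    }

    /* proc_taint(): models a write to /proc/sys/kernel/tainted */
    static int proc_taint_write(unsigned long tmptaint)
    {
            /* with nousertaint, matching user writes fail (EINVAL in the kernel) */
            if (panic_on_taint_nousertaint && (tmptaint & panic_on_taint))
                    return -1;
            for (unsigned int i = 0; i < TAINT_FLAGS_COUNT; i++)
                    if (tmptaint & (1UL << i))
                            add_taint(i);
            return 0;
    }

    int main(void)
    {
            printf("write 0x20 -> %d\n", proc_taint_write(0x20)); /* rejected */
            printf("write 0x01 -> %d\n", proc_taint_write(0x01)); /* accepted */
            add_taint(TAINT_BAD_PAGE);  /* e.g. a bad page hits the panic path */
            return 0;
    }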

Re: [PATCH v4] kernel: add panic_on_taint

2020-05-13 Thread Rafael Aquini
On Wed, May 13, 2020 at 03:47:22PM +, Luis Chamberlain wrote:
> On Wed, May 13, 2020 at 11:00:26AM -0400, Rafael Aquini wrote:
> > Analogously to the introduction of panic_on_warn, this patch
> > introduces a kernel option named panic_on_taint in order to
> > provide a simple and generic way to stop execution and catch
> > a coredump when the kernel gets tainted by any given taint flag.
> > 
> > This is useful for debugging sessions as it avoids rebuilding
> > the kernel to explicitly add calls to panic() or BUG() into
> > code sites that introduce the taint flags of interest.
> > For instance, if one is interested in following up with
> > a post mortem analysis at the point a code path is hitting
> > a bad page (i.e. unaccount_page_cache_page(), or slab_bug()),
> > a crashdump could be collected by rebooting the kernel with
> > 'panic_on_taint=0x20' amended to the command line string.
> > 
> > Another, perhaps less frequent, use for this option would be
> > as a means for assuring a security policy case where only a
> > subset of taints, or no single taint (in paranoid mode),
> > is allowed for the running system.
> > The optional switch 'nousertaint' is handy in this particular
> > scenario as it will avoid userspace induced crashes by writes
> > to /proc/sys/kernel/tainted causing false positive hits for
> > such policies.
> > 
> > Suggested-by: Qian Cai 
> > Signed-off-by: Rafael Aquini 
> > ---
> > Changelog:
> > * v2: get rid of unnecessary/misguided compiler hints   (Luis)
> >   enhance documentation text for the new kernel parameter   (Randy)
> > * v3: drop sysctl interface, keep it only as a kernel parameter (Luis)
> > * v4: change panic_on_taint input from alphabetical taint flags
> >   to hexadecimal bitmasks, for clarity and extendability(Luis)
> > 
> >  Documentation/admin-guide/kdump/kdump.rst |  7 
> >  .../admin-guide/kernel-parameters.txt | 13 +++
> >  include/linux/kernel.h|  4 +++
> >  kernel/panic.c| 34 +++
> >  kernel/sysctl.c   | 11 +-
> >  5 files changed, 68 insertions(+), 1 deletion(-)
> > 
> > diff --git a/Documentation/admin-guide/kdump/kdump.rst 
> > b/Documentation/admin-guide/kdump/kdump.rst
> > index ac7e131d2935..2707de840fd3 100644
> > --- a/Documentation/admin-guide/kdump/kdump.rst
> > +++ b/Documentation/admin-guide/kdump/kdump.rst
> > @@ -521,6 +521,13 @@ will cause a kdump to occur at the panic() call.  In 
> > cases where a user wants
> >  to specify this during runtime, /proc/sys/kernel/panic_on_warn can be set 
> > to 1
> >  to achieve the same behaviour.
> >  
> > +Trigger Kdump on add_taint()
> > +
> > +
> > +The kernel parameter panic_on_taint facilitates calling panic() from within
> > +add_taint() whenever the value set in this bitmask matches with the bit 
> > flag
> > +being set by add_taint(). This will cause a kdump to occur at the panic() 
> > call.
> > +
> >  Contact
> >  ===
> >  
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> > b/Documentation/admin-guide/kernel-parameters.txt
> > index 7bc83f3d9bdf..ce17fdbec7d1 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -3401,6 +3401,19 @@
> > bit 4: print ftrace buffer
> > bit 5: print all printk messages in buffer
> >  
> > +   panic_on_taint= Bitmask for conditionally call panic() in add_taint()
> > +   Format: <hex>[,nousertaint]
> > +   Hexadecimal bitmask representing the set of TAINT flags
> > +   that will cause the kernel to panic when add_taint() is
> > +   called with any of the flags in this set.
> > +   The optional switch "nousertaint" can be utilized to
> > +   prevent userland forced crashes by writing to sysctl
> > +   /proc/sys/kernel/tainted any flagset matching with the
> > +   bitmask set on panic_on_taint.
> > +   See Documentation/admin-guide/tainted-kernels.rst for
> > +   extra details on the taint flags that users can pick
> > +   to compose the bitmask to assign to panic_on_taint.
> > +
> > panic_on_warn   panic() instead of WARN().  Useful to cause kdump
> > on a WARN().

[PATCH v4] kernel: add panic_on_taint

2020-05-13 Thread Rafael Aquini
Analogously to the introduction of panic_on_warn, this patch
introduces a kernel option named panic_on_taint in order to
provide a simple and generic way to stop execution and catch
a coredump when the kernel gets tainted by any given taint flag.

This is useful for debugging sessions as it avoids rebuilding
the kernel to explicitly add calls to panic() or BUG() into
code sites that introduce the taint flags of interest.
For instance, if one is interested in following up with
a post mortem analysis at the point a code path is hitting
a bad page (i.e. unaccount_page_cache_page(), or slab_bug()),
a crashdump could be collected by rebooting the kernel with
'panic_on_taint=0x20' amended to the command line string.

Another, perhaps less frequent, use for this option would be
as a means for assuring a security policy case where only a
subset of taints, or no single taint (in paranoid mode),
is allowed for the running system.
The optional switch 'nousertaint' is handy in this particular
scenario as it will avoid userspace induced crashes by writes
to /proc/sys/kernel/tainted causing false positive hits for
such policies.

Suggested-by: Qian Cai 
Signed-off-by: Rafael Aquini 
---
Changelog:
* v2: get rid of unnecessary/misguided compiler hints   (Luis)
  enhance documentation text for the new kernel parameter   (Randy)
* v3: drop sysctl interface, keep it only as a kernel parameter (Luis)
* v4: change panic_on_taint input from alphabetical taint flags
  to hexadecimal bitmasks, for clarity and extendability(Luis)

 Documentation/admin-guide/kdump/kdump.rst |  7 
 .../admin-guide/kernel-parameters.txt | 13 +++
 include/linux/kernel.h|  4 +++
 kernel/panic.c| 34 +++
 kernel/sysctl.c   | 11 +-
 5 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kdump/kdump.rst 
b/Documentation/admin-guide/kdump/kdump.rst
index ac7e131d2935..2707de840fd3 100644
--- a/Documentation/admin-guide/kdump/kdump.rst
+++ b/Documentation/admin-guide/kdump/kdump.rst
@@ -521,6 +521,13 @@ will cause a kdump to occur at the panic() call.  In cases 
where a user wants
 to specify this during runtime, /proc/sys/kernel/panic_on_warn can be set to 1
 to achieve the same behaviour.
 
+Trigger Kdump on add_taint()
+
+
+The kernel parameter panic_on_taint facilitates calling panic() from within
+add_taint() whenever the value set in this bitmask matches with the bit flag
+being set by add_taint(). This will cause a kdump to occur at the panic() call.
+
 Contact
 ===
 
diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 7bc83f3d9bdf..ce17fdbec7d1 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3401,6 +3401,19 @@
bit 4: print ftrace buffer
bit 5: print all printk messages in buffer
 
+   panic_on_taint= Bitmask for conditionally call panic() in add_taint()
+   Format: <hex>[,nousertaint]
+   Hexadecimal bitmask representing the set of TAINT flags
+   that will cause the kernel to panic when add_taint() is
+   called with any of the flags in this set.
+   The optional switch "nousertaint" can be utilized to
+   prevent userland forced crashes by writing to sysctl
+   /proc/sys/kernel/tainted any flagset matching with the
+   bitmask set on panic_on_taint.
+   See Documentation/admin-guide/tainted-kernels.rst for
+   extra details on the taint flags that users can pick
+   to compose the bitmask to assign to panic_on_taint.
+
panic_on_warn   panic() instead of WARN().  Useful to cause kdump
on a WARN().
 
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 9b7a8d74a9d6..70712944dffc 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -528,6 +528,8 @@ extern int panic_on_oops;
 extern int panic_on_unrecovered_nmi;
 extern int panic_on_io_nmi;
 extern int panic_on_warn;
+extern unsigned long panic_on_taint;
+extern bool panic_on_taint_nousertaint;
 extern int sysctl_panic_on_rcu_stall;
 extern int sysctl_panic_on_stackoverflow;
 
@@ -597,6 +599,8 @@ extern enum system_states {
 #define TAINT_RANDSTRUCT   17
 #define TAINT_FLAGS_COUNT  18
 
+#define TAINT_FLAGS_MAX((1UL << TAINT_FLAGS_COUNT) - 1)
+
 struct taint_flag {
char c_true;/* character printed when tainted */
char c_false;   /* character printed when not tainted */
diff --git a/kernel/panic.c b/kernel/panic.c
index b69ee9e76cb2..94b5c973770c 100644
--- 

Re: [PATCH v2] kernel: sysctl: ignore out-of-range taint bits introduced via kernel.tainted

2020-05-12 Thread Rafael Aquini
On Wed, May 13, 2020 at 12:39:53AM +, Luis Chamberlain wrote:
> On Tue, May 12, 2020 at 06:39:46PM -0400, Rafael Aquini wrote:
> > Users with SYS_ADMIN capability can add arbitrary taint flags
> > to the running kernel by writing to /proc/sys/kernel/tainted
> > or issuing the command 'sysctl -w kernel.tainted=...'.

I just notice 2 minor 'screw ups' on my part in the commit log:

> > These interface, however, are open for any integer value
This one probably needs to be reprhased as:
 "The interface, however, is ... "


> > and this might an invalid set of flags being committed to
and I'm missing a verb here, as it should read:
 "and this might cause an invalid ... "


I hope these are easy fixes, in the pre-merge step. (Sorry!)

> > the tainted_mask bitset.
> > 
> > This patch introduces a simple way for proc_taint() to ignore
> > any eventual invalid bit coming from the user input before
> > committing those bits to the kernel tainted_mask.
> > 
> > Signed-off-by: Rafael Aquini 
> 
> Reviewed-by: Luis Chamberlain 
> 

Thanks!
-- Rafael



[PATCH v2] kernel: sysctl: ignore out-of-range taint bits introduced via kernel.tainted

2020-05-12 Thread Rafael Aquini
Users with SYS_ADMIN capability can add arbitrary taint flags
to the running kernel by writing to /proc/sys/kernel/tainted
or issuing the command 'sysctl -w kernel.tainted=...'.
These interface, however, are open for any integer value
and this might an invalid set of flags being committed to
the tainted_mask bitset.

This patch introduces a simple way for proc_taint() to ignore
any eventual invalid bit coming from the user input before
committing those bits to the kernel tainted_mask.

Signed-off-by: Rafael Aquini 
---
Changelog:
v2: simplify the bit iterator within proc_taint(),
and silently drop out-of-range bits (akpm)

 kernel/sysctl.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8a176d8727a3..fcd46fc41206 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2628,10 +2628,9 @@ static int proc_taint(struct ctl_table *table, int write,
 * to everyone's atomic.h for this
 */
int i;
-   for (i = 0; i < BITS_PER_LONG && tmptaint >> i; i++) {
-   if ((tmptaint >> i) & 1)
+   for (i = 0; i < TAINT_FLAGS_COUNT; i++)
+   if ((1UL << i) & tmptaint)
add_taint(i, LOCKDEP_STILL_OK);
-   }
}
 
return err;
-- 
2.25.4
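
Purely as an illustration of what bounding the loop at TAINT_FLAGS_COUNT
achieves (a user-space model, not kernel code): any bit at position 18 or
above in the written value is simply never committed.

    #include <stdio.h>

    #define TAINT_FLAGS_COUNT 18
    #define TAINT_FLAGS_MAX   ((1UL << TAINT_FLAGS_COUNT) - 1)  /* 0x3ffff */

    int main(void)
    {
            unsigned long tmptaint = 0x100020;  /* bit 20 is not a valid taint */
            unsigned long committed = 0;
            int i;

            for (i = 0; i < TAINT_FLAGS_COUNT; i++)
                    if (tmptaint & (1UL << i))
                            committed |= 1UL << i;  /* kernel: add_taint(i, ...) */

            /* prints: committed = 0x20 (valid mask 0x3ffff) */
            printf("committed = %#lx (valid mask %#lx)\n",
                   committed, TAINT_FLAGS_MAX);
            return 0;
    }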



Re: [PATCH] kernel: sysctl: ignore out-of-range taint bits introduced via kernel.tainted

2020-05-12 Thread Rafael Aquini
On Tue, May 12, 2020 at 01:53:26PM -0700, Andrew Morton wrote:
> On Tue, 12 May 2020 13:46:53 -0400 Rafael Aquini  wrote:
> 
> > The sysctl knob
> 
> /proc/sys/kernel/tainted, yes?
> 
> > allows users with SYS_ADMIN capability to
> > taint the kernel with any arbitrary value, but this might
> > produce an invalid flags bitset being committed to tainted_mask.
> > 
> > This patch introduces a simple way for proc_taint() to ignore
> > any eventual invalid bit coming from the user input before
> > committing those bits to the kernel tainted_mask.
> > 
> > ...
> >
> > --- a/include/linux/kernel.h
> > +++ b/include/linux/kernel.h
> > @@ -597,6 +597,8 @@ extern enum system_states {
> >  #define TAINT_RANDSTRUCT   17
> >  #define TAINT_FLAGS_COUNT  18
> >  
> > +#define TAINT_FLAGS_MAX((1UL << TAINT_FLAGS_COUNT) - 1)
> > +
> >  struct taint_flag {
> > char c_true;/* character printed when tainted */
> > char c_false;   /* character printed when not tainted */
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index 8a176d8727a3..fb2d693fc08c 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -2623,11 +2623,23 @@ static int proc_taint(struct ctl_table *table, int 
> > write,
> > return err;
> >  
> > if (write) {
> > +   int i;
> > +
> > +   /*
> > +* Ignore user input that would cause the loop below
> > +* to commit arbitrary and out of valid range TAINT flags.
> > +*/
> > +   if (tmptaint > TAINT_FLAGS_MAX) {
> > +   tmptaint &= TAINT_FLAGS_MAX;
> > +   pr_warn_once("%s: out-of-range taint input ignored."
> > +" tainted_mask adjusted to 0x%lx\n",
> > +__func__, tmptaint);
> > +   }
> > +
> > /*
> >  * Poor man's atomic or. Not worth adding a primitive
> >  * to everyone's atomic.h for this
> >  */
> > -   int i;
> > for (i = 0; i < BITS_PER_LONG && tmptaint >> i; i++) {
> 
> Could simply replace BITS_PER_LONG with TAINT_FLAGS_COUNT here?
> 
> (That "&& tmptaint >> i" seems a rather silly optimization?)
> 
> > if ((tmptaint >> i) & 1)
> > add_taint(i, LOCKDEP_STILL_OK);
> 
> In fact the whole thing could be simplified down to
> 
>   for (i = 1; i <= TAINT_FLAGS_COUNT; i <<= 1)
>   if (i & tmptaint)
>   add_taint(...)
> 
> and silently drop out-of-range bits?
>

Sure!

-- Rafael



[PATCH] kernel: sysctl: ignore out-of-range taint bits introduced via kernel.tainted

2020-05-12 Thread Rafael Aquini
The sysctl knob allows users with SYS_ADMIN capability to
taint the kernel with any arbitrary value, but this might
produce an invalid flags bitset being committed to tainted_mask.

This patch introduces a simple way for proc_taint() to ignore
any eventual invalid bit coming from the user input before
committing those bits to the kernel tainted_mask.

Signed-off-by: Rafael Aquini 
---
 include/linux/kernel.h |  2 ++
 kernel/sysctl.c| 14 +-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 9b7a8d74a9d6..e8c22a9bbc95 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -597,6 +597,8 @@ extern enum system_states {
 #define TAINT_RANDSTRUCT   17
 #define TAINT_FLAGS_COUNT  18
 
+#define TAINT_FLAGS_MAX((1UL << TAINT_FLAGS_COUNT) - 1)
+
 struct taint_flag {
char c_true;/* character printed when tainted */
char c_false;   /* character printed when not tainted */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8a176d8727a3..fb2d693fc08c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2623,11 +2623,23 @@ static int proc_taint(struct ctl_table *table, int 
write,
return err;
 
if (write) {
+   int i;
+
+   /*
+* Ignore user input that would cause the loop below
+* to commit arbitrary and out of valid range TAINT flags.
+*/
+   if (tmptaint > TAINT_FLAGS_MAX) {
+   tmptaint &= TAINT_FLAGS_MAX;
+   pr_warn_once("%s: out-of-range taint input ignored."
+" tainted_mask adjusted to 0x%lx\n",
+__func__, tmptaint);
+   }
+
/*
 * Poor man's atomic or. Not worth adding a primitive
 * to everyone's atomic.h for this
 */
-   int i;
for (i = 0; i < BITS_PER_LONG && tmptaint >> i; i++) {
if ((tmptaint >> i) & 1)
add_taint(i, LOCKDEP_STILL_OK);
-- 
2.25.4



Re: [PATCH] kernel: sysctl: ignore invalid taint bits introduced via kernel.tainted and taint the kernel with TAINT_USER on writes

2020-05-12 Thread Rafael Aquini
On Tue, May 12, 2020 at 03:46:54PM +, Luis Chamberlain wrote:
> On Tue, May 12, 2020 at 10:49:06AM -0400, Rafael Aquini wrote:
> > On Tue, May 12, 2020 at 05:04:05AM +, Luis Chamberlain wrote:
> > > On Mon, May 11, 2020 at 09:03:13PM -0400, Rafael Aquini wrote:
> > > > On Tue, May 12, 2020 at 12:17:03AM +, Luis Chamberlain wrote:
> > > > > On Mon, May 11, 2020 at 07:59:14PM -0400, Rafael Aquini wrote:
> > > > > > On Mon, May 11, 2020 at 11:10:45PM +, Luis Chamberlain wrote:
> > > > > > > On Mon, May 11, 2020 at 05:59:04PM -0400, Rafael Aquini wrote:
> > > > > > > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > > > > > > > index 8a176d8727a3..f0a4fb38ac62 100644
> > > > > > > > --- a/kernel/sysctl.c
> > > > > > > > +++ b/kernel/sysctl.c
> > > > > > > > @@ -2623,17 +2623,32 @@ static int proc_taint(struct ctl_table 
> > > > > > > > *table, int write,
> > > > > > > > return err;
> > > > > > > >  
> > > > > > > > if (write) {
> > > > > > > > +   int i;
> > > > > > > > +
> > > > > > > > +   /*
> > > > > > > > +* Ignore user input that would make us 
> > > > > > > > committing
> > > > > > > > +* arbitrary invalid TAINT flags in the loop 
> > > > > > > > below.
> > > > > > > > +*/
> > > > > > > > +   tmptaint &= (1UL << TAINT_FLAGS_COUNT) - 1;
> > > > > > > 
> > > > > > > This looks good but we don't pr_warn() of information lost on 
> > > > > > > intention.
> > > > > > >
> > > > > > 
> > > > > > Are you thinking in sth like:
> > > > > > 
> > > > > > +   if (tmptaint > TAINT_FLAGS_MAX) {
> > > > > > +   tmptaint &= TAINT_FLAGS_MAX;
> > > > > > +   pr_warn("proc_taint: out-of-range invalid 
> > > > > > input ignored"
> > > > > > +   " tainted_mask adjusted to 0x%x\n", 
> > > > > > tmptaint);
> > > > > > +   }
> > > > > > ?
> > > > > 
> > > > > Sure that would clarify this.
> > > > > 
> > > > > > > > +
> > > > > > > > /*
> > > > > > > >  * Poor man's atomic or. Not worth adding a 
> > > > > > > > primitive
> > > > > > > >  * to everyone's atomic.h for this
> > > > > > > >  */
> > > > > > > > -   int i;
> > > > > > > > for (i = 0; i < BITS_PER_LONG && tmptaint >> i; 
> > > > > > > > i++) {
> > > > > > > > if ((tmptaint >> i) & 1)
> > > > > > > > add_taint(i, LOCKDEP_STILL_OK);
> > > > > > > > }
> > > > > > > > +
> > > > > > > > +   /*
> > > > > > > > +* Users with SYS_ADMIN capability can include 
> > > > > > > > any arbitrary
> > > > > > > > +* taint flag by writing to this interface. If 
> > > > > > > > that's the case,
> > > > > > > > +* we also need to mark the kernel "tainted by 
> > > > > > > > user".
> > > > > > > > +*/
> > > > > > > > +   add_taint(TAINT_USER, LOCKDEP_STILL_OK);
> > > > > > > 
> > > > > > > I'm in favor of this however I'd like to hear from Ted on if it 
> > > > > > > meets
> > > > > > > the original intention. I would think he had a good reason not to 
> > > > > > > add
> > > > > > > it here.
> > > > > > >
> > > > > > 
> > > >

Re: [PATCH] kernel: sysctl: ignore invalid taint bits introduced via kernel.tainted and taint the kernel with TAINT_USER on writes

2020-05-12 Thread Rafael Aquini
On Tue, May 12, 2020 at 05:04:05AM +, Luis Chamberlain wrote:
> On Mon, May 11, 2020 at 09:03:13PM -0400, Rafael Aquini wrote:
> > On Tue, May 12, 2020 at 12:17:03AM +, Luis Chamberlain wrote:
> > > On Mon, May 11, 2020 at 07:59:14PM -0400, Rafael Aquini wrote:
> > > > On Mon, May 11, 2020 at 11:10:45PM +, Luis Chamberlain wrote:
> > > > > On Mon, May 11, 2020 at 05:59:04PM -0400, Rafael Aquini wrote:
> > > > > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > > > > > index 8a176d8727a3..f0a4fb38ac62 100644
> > > > > > --- a/kernel/sysctl.c
> > > > > > +++ b/kernel/sysctl.c
> > > > > > @@ -2623,17 +2623,32 @@ static int proc_taint(struct ctl_table 
> > > > > > *table, int write,
> > > > > > return err;
> > > > > >  
> > > > > > if (write) {
> > > > > > +   int i;
> > > > > > +
> > > > > > +   /*
> > > > > > +* Ignore user input that would make us committing
> > > > > > +* arbitrary invalid TAINT flags in the loop below.
> > > > > > +*/
> > > > > > +   tmptaint &= (1UL << TAINT_FLAGS_COUNT) - 1;
> > > > > 
> > > > > This looks good but we don't pr_warn() of information lost on 
> > > > > intention.
> > > > >
> > > > 
> > > > Are you thinking in sth like:
> > > > 
> > > > +   if (tmptaint > TAINT_FLAGS_MAX) {
> > > > +   tmptaint &= TAINT_FLAGS_MAX;
> > > > +   pr_warn("proc_taint: out-of-range invalid input 
> > > > ignored"
> > > > +   " tainted_mask adjusted to 0x%x\n", 
> > > > tmptaint);
> > > > +   }
> > > > ?
> > > 
> > > Sure that would clarify this.
> > > 
> > > > > > +
> > > > > > /*
> > > > > >  * Poor man's atomic or. Not worth adding a primitive
> > > > > >  * to everyone's atomic.h for this
> > > > > >  */
> > > > > > -   int i;
> > > > > > for (i = 0; i < BITS_PER_LONG && tmptaint >> i; i++) {
> > > > > > if ((tmptaint >> i) & 1)
> > > > > > add_taint(i, LOCKDEP_STILL_OK);
> > > > > > }
> > > > > > +
> > > > > > +   /*
> > > > > > +* Users with SYS_ADMIN capability can include any 
> > > > > > arbitrary
> > > > > > +* taint flag by writing to this interface. If that's 
> > > > > > the case,
> > > > > > +* we also need to mark the kernel "tainted by user".
> > > > > > +*/
> > > > > > +   add_taint(TAINT_USER, LOCKDEP_STILL_OK);
> > > > > 
> > > > > I'm in favor of this however I'd like to hear from Ted on if it meets
> > > > > the original intention. I would think he had a good reason not to add
> > > > > it here.
> > > > >
> > > > 
> > > > Fair enough. The impression I got by reading Ted's original commit
> > > > message is that the intent was to have TAINT_USER as the flag set 
> > > > via this interface, even though the code was allowing for any 
> > > > arbitrary value.
> > > 
> > > That wasn't my reading, it was that the user did something very odd
> > > with user input which we don't like as kernel developers, and it gives
> > > us a way to prove: hey you did something stupid, sorry but I cannot
> > > support your kernel panic.
> > > 
> > > > I think it's OK to let the user fiddle with
> > > > the flags, as it's been allowed since the introduction of
> > > > this interface, but we need to reflect that fact in the
> > > > tainting itself. Since TAINT_USER is not used anywhere,
> > > 
> > > I see users of TAINT_USER sprinkled around
> > >
> > 
> > I meant in the original commit that introduced it
> > (commit 34f5a39899f3f3e815da64f48ddb72942d86c366). Sorry I
> > miscommunicated that.
> > 
> > In its current usage, it seems that the other places adding TAINT_USER
> > match with what is being proposed here: To signal when we have user 
> > fiddling with kernel / module parameters.
> 
> drivers/base/regmap/regmap-debugfs.c requires *manual* code changes
> to compile / enable some knob. i915 complains about unsafe module
> params such as module_param_cb_unsafe() core_param_unsafe(). Then
> drivers/soundwire/cadence_master.c is for when a debugfs dangerous
> param was used.
> 
> This still doesn't rule out the use of proc_taint() for testing taint,
> and that adding it may break some tests. So even though this would
> only affect some tests scripts, I can't say that adding this taint won't
> cause some headaches to someone. I wouldn't encourage its use on
> proc_taint() from what I can see so far.
>

OK, I'll repost without the hunk forcing the taint. If we eventually
come to the conclusion that tainting in proc_taint() is the right thing
to do, we can do that part of the change later.

Do you think we should use printk_ratelimited() in the ignore message,
instead? 

Cheers,
-- Rafael



Re: [PATCH] kernel: sysctl: ignore invalid taint bits introduced via kernel.tainted and taint the kernel with TAINT_USER on writes

2020-05-11 Thread Rafael Aquini
On Tue, May 12, 2020 at 12:17:03AM +, Luis Chamberlain wrote:
> On Mon, May 11, 2020 at 07:59:14PM -0400, Rafael Aquini wrote:
> > On Mon, May 11, 2020 at 11:10:45PM +, Luis Chamberlain wrote:
> > > On Mon, May 11, 2020 at 05:59:04PM -0400, Rafael Aquini wrote:
> > > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > > > index 8a176d8727a3..f0a4fb38ac62 100644
> > > > --- a/kernel/sysctl.c
> > > > +++ b/kernel/sysctl.c
> > > > @@ -2623,17 +2623,32 @@ static int proc_taint(struct ctl_table *table, 
> > > > int write,
> > > > return err;
> > > >  
> > > > if (write) {
> > > > +   int i;
> > > > +
> > > > +   /*
> > > > +* Ignore user input that would make us committing
> > > > +* arbitrary invalid TAINT flags in the loop below.
> > > > +*/
> > > > +   tmptaint &= (1UL << TAINT_FLAGS_COUNT) - 1;
> > > 
> > > This looks good but we don't pr_warn() of information lost on intention.
> > >
> > 
> > Are you thinking in sth like:
> > 
> > +   if (tmptaint > TAINT_FLAGS_MAX) {
> > +   tmptaint &= TAINT_FLAGS_MAX;
> > +   pr_warn("proc_taint: out-of-range invalid input 
> > ignored"
> > +   " tainted_mask adjusted to 0x%x\n", 
> > tmptaint);
> > +   }
> > ?
> 
> Sure that would clarify this.
> 
> > > > +
> > > > /*
> > > >  * Poor man's atomic or. Not worth adding a primitive
> > > >  * to everyone's atomic.h for this
> > > >  */
> > > > -   int i;
> > > > for (i = 0; i < BITS_PER_LONG && tmptaint >> i; i++) {
> > > > if ((tmptaint >> i) & 1)
> > > > add_taint(i, LOCKDEP_STILL_OK);
> > > > }
> > > > +
> > > > +   /*
> > > > +* Users with SYS_ADMIN capability can include any 
> > > > arbitrary
> > > > +* taint flag by writing to this interface. If that's 
> > > > the case,
> > > > +* we also need to mark the kernel "tainted by user".
> > > > +*/
> > > > +   add_taint(TAINT_USER, LOCKDEP_STILL_OK);
> > > 
> > > I'm in favor of this however I'd like to hear from Ted on if it meets
> > > the original intention. I would think he had a good reason not to add
> > > it here.
> > >
> > 
> > Fair enough. The impression I got by reading Ted's original commit
> > message is that the intent was to have TAINT_USER as the flag set 
> > via this interface, even though the code was allowing for any 
> > arbitrary value.
> 
> That wasn't my reading, it was that the user did something very odd
> with user input which we don't like as kernel developers, and it gives
> us a way to prove: hey you did something stupid, sorry but I cannot
> support your kernel panic.
> 
> > I think it's OK to let the user fiddle with
> > the flags, as it's been allowed since the introduction of
> > this interface, but we need to reflect that fact in the
> > tainting itself. Since TAINT_USER is not used anywhere,
> 
> I see users of TAINT_USER sprinkled around
>

I meant in the original commit that introduced it
(commit 34f5a39899f3f3e815da64f48ddb72942d86c366). Sorry I
miscommunicated that.

In its current usage, it seems that the other places adding TAINT_USER
match with what is being proposed here: To signal when we have user 
fiddling with kernel / module parameters.

 
> > this change perfectly communicates that fact without
> > the need for introducing yet another taint flag.
> 
> I'd be happy if we don't have introduce yet-anothe flag as well.
> But since Ted introduced it, without using the flag on the proc_taint()
> I'd like confirmation we won't screw things up with existing test cases
> which assume proc_taint() won't set this up. We'd therefore regress
> userspace.
> 
> This is why I'd like for us to be careful with this flag.
> 
>   Luis
> 



Re: [PATCH] kernel: sysctl: ignore invalid taint bits introduced via kernel.tainted and taint the kernel with TAINT_USER on writes

2020-05-11 Thread Rafael Aquini
On Mon, May 11, 2020 at 11:10:45PM +, Luis Chamberlain wrote:
> On Mon, May 11, 2020 at 05:59:04PM -0400, Rafael Aquini wrote:
> > The sysctl knob allows any user with SYS_ADMIN capability to
> > taint the kernel with any arbitrary value, but this might
> > produce an invalid flags bitset being committed to tainted_mask.
> > 
> > This patch introduces a simple way for proc_taint() to ignore
> > any eventual invalid bit coming from the user input before
> > committing those bits to the kernel tainted_mask, as well as
> > it makes clear use of TAINT_USER flag to mark the kernel
> > tainted by user every time a taint value is written
> > to the kernel.tainted sysctl.
> > 
> > Signed-off-by: Rafael Aquini 
> > ---
> >  kernel/sysctl.c | 17 -
> >  1 file changed, 16 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index 8a176d8727a3..f0a4fb38ac62 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -2623,17 +2623,32 @@ static int proc_taint(struct ctl_table *table, int 
> > write,
> > return err;
> >  
> > if (write) {
> > +   int i;
> > +
> > +   /*
> > +* Ignore user input that would make us committing
> > +* arbitrary invalid TAINT flags in the loop below.
> > +*/
> > +   tmptaint &= (1UL << TAINT_FLAGS_COUNT) - 1;
> 
> This looks good but we don't pr_warn() of information lost on intention.
>

Are you thinking in sth like:

+   if (tmptaint > TAINT_FLAGS_MAX) {
+   tmptaint &= TAINT_FLAGS_MAX;
+   pr_warn("proc_taint: out-of-range invalid input ignored"
+   " tainted_mask adjusted to 0x%x\n", tmptaint);
+   }

?
 
> > +
> > /*
> >  * Poor man's atomic or. Not worth adding a primitive
> >  * to everyone's atomic.h for this
> >  */
> > -   int i;
> > for (i = 0; i < BITS_PER_LONG && tmptaint >> i; i++) {
> > if ((tmptaint >> i) & 1)
> > add_taint(i, LOCKDEP_STILL_OK);
> > }
> > +
> > +   /*
> > +* Users with SYS_ADMIN capability can include any arbitrary
> > +* taint flag by writing to this interface. If that's the case,
> > +* we also need to mark the kernel "tainted by user".
> > +*/
> > +   add_taint(TAINT_USER, LOCKDEP_STILL_OK);
> 
> I'm in favor of this however I'd like to hear from Ted on if it meets
> the original intention. I would think he had a good reason not to add
> it here.
>

Fair enough. The impression I got by reading Ted's original commit
message is that the intent was to have TAINT_USER as the flag set 
via this interface, even though the code was allowing for any 
arbitrary value. I think it's OK to let the user fiddle with
the flags, as it's been allowed since the introduction of
this interface, but we need to reflect that fact in the
tainting itself. Since TAINT_USER is not used anywhere,
this change perfectly communicates that fact without
the need for introducing yet another taint flag.

Cheers!
-- Rafael



[PATCH] kernel: sysctl: ignore invalid taint bits introduced via kernel.tainted and taint the kernel with TAINT_USER on writes

2020-05-11 Thread Rafael Aquini
The sysctl knob allows any user with SYS_ADMIN capability to
taint the kernel with any arbitrary value, but this might
produce an invalid flags bitset being committed to tainted_mask.

This patch introduces a simple way for proc_taint() to ignore
any eventual invalid bit coming from the user input before
committing those bits to the kernel tainted_mask, as well as
it makes clear use of TAINT_USER flag to mark the kernel
tainted by user every time a taint value is written
to the kernel.tainted sysctl.

Signed-off-by: Rafael Aquini 
---
 kernel/sysctl.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8a176d8727a3..f0a4fb38ac62 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2623,17 +2623,32 @@ static int proc_taint(struct ctl_table *table, int 
write,
return err;
 
if (write) {
+   int i;
+
+   /*
+* Ignore user input that would make us committing
+* arbitrary invalid TAINT flags in the loop below.
+*/
+   tmptaint &= (1UL << TAINT_FLAGS_COUNT) - 1;
+
/*
 * Poor man's atomic or. Not worth adding a primitive
 * to everyone's atomic.h for this
 */
-   int i;
for (i = 0; i < BITS_PER_LONG && tmptaint >> i; i++) {
if ((tmptaint >> i) & 1)
add_taint(i, LOCKDEP_STILL_OK);
}
+
+   /*
+* Users with SYS_ADMIN capability can include any arbitrary
+* taint flag by writing to this interface. If that's the case,
+* we also need to mark the kernel "tainted by user".
+*/
+   add_taint(TAINT_USER, LOCKDEP_STILL_OK);
}
 
+
return err;
 }
 
-- 
2.25.4



Re: [PATCH v3] kernel: add panic_on_taint

2020-05-11 Thread Rafael Aquini
On Mon, May 11, 2020 at 06:24:55PM +, Luis Chamberlain wrote:
> On Sat, May 09, 2020 at 09:57:37AM -0400, Rafael Aquini wrote:
> > +Trigger Kdump on add_taint()
> > +
> > +
> > +The kernel parameter, panic_on_taint, calls panic() from within 
> > add_taint(),
> > +whenever the value set in this bitmask matches with the bit flag being set
> > +by add_taint(). This will cause a kdump to occur at the panic() call.
> > +In cases where a user wants to specify this during runtime,
> > +/proc/sys/kernel/panic_on_taint can be set to a respective bitmask value
> > +to achieve the same behaviour.
> > +
> >  Contact
> >  ===
> >  
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> > b/Documentation/admin-guide/kernel-parameters.txt
> > index 7bc83f3d9bdf..4a69fe49a70d 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -3404,6 +3404,21 @@
> > panic_on_warn   panic() instead of WARN().  Useful to cause kdump
> > on a WARN().
> >  
> > +   panic_on_taint= [KNL] conditionally panic() in add_taint()
> > +   Format: 
> > +   Specifies, as a string, the TAINT flag set that will
> > +   compose a bitmask for calling panic() when the kernel
> > +   gets tainted.
> > +   See Documentation/admin-guide/tainted-kernels.rst for
> > +   details on the taint flags that users can pick to
> > +   compose the bitmask to assign to panic_on_taint.
> > +   When the string is prefixed with a '-' the bitmask
> > +   set in panic_on_taint will be mutually exclusive
> > +   with the sysctl knob kernel.tainted, and any attempt
> > +   to write to that sysctl will fail with -EINVAL for
> > +   any taint value that masks with the flags set for
> > +   this option.
> 
> This talks about using a string, but that it sets a bitmask. Its not
> very clear that one must use the string representation from each taint
> flag. Also, I don't think to use the character representation as we
> limit ourselves to the alphabet and quirky what-should-be-arbitrary
> characters that represent the taint flags. The taint flag character
> representation is juse useful for human reading of a panic, but I think
> because of the limitation of the mask with the alphabet this was not
> such a great idea long term.
>

I respectfully disagree with you on this one, but that might be just
because I'm a non-native English speaker and the cause of confusion is 
not very clear to me. Would you mind providing a blurb with the text
you think would be clearer in this regard?

Also, making the option's input match something one is used to
seeing in taint reports makes it easier to use. It would still be
a human setting the boot option, so it makes sense to utilize a
known/meaningful representation for the panic_on_taint input.

 
> So, I don't think we should keep on extending the alphabet use case, a
> simple digit representation would suffice. I think this means we'd need
> two params one for exclusive and one for the value of the taint.
> 
> Using a hex value or number also lets us make the input value shorter.
>

Although you're right about the limitation of an alphabetical representation,
the truth is that taint flags are not introduced that frequently.
Considering how often these flags were introduced in the past,
one could infer we probably will not run out of alphabet in the next 20
years (a couple of new flags get introduced every 2 years or so, on
average), and the rate of change seems to be slowing down
considerably: in the past 5 years, we've seen only 2 new flags.

I'm not against your suggestion, but I think it makes the feature
clumsier to utilize, as you now require 2 kernel cmdline options
instead of just one, plus a less intuitive way of setting it via
base16 values. All at the expense of a theoretical shortage of
taint flags.

If the others that were already OK with the simple string interface 
don't object your change suggestions in that regard, I'll refactor
the parser to meet them. 


> If a kernel boots with panic-on-taint flag not yet supported, we don't
> complain, therefore getting a false sense of security that we will panic
> with a not yet supported taint flag. I think we should pr_warn() or
> fail to boot when that happens.
>

Bear in mind that these very early printouts to the log buffer (like the
ones eventually produced by setting panic_on_taint) might get lost on
verbose-logging, long-running systems, but apart from that, I see no
problem in being a little more verbose in that parser. I'll make
the changes for the next revision, if no one else objects to them.

Cheers!
-- Rafael
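
For reference, the hexadecimal form that the discussion converged on is
just an OR of the taint bit positions. The bit numbers below follow the
usual TAINT_* ordering (TAINT_BAD_PAGE is bit 5, TAINT_USER bit 6); treat
them as an illustration rather than something quoted from this thread:

    #include <stdio.h>

    #define TAINT_BAD_PAGE  5   /* 'B' */
    #define TAINT_USER      6   /* 'U' */

    int main(void)
    {
            /* panic on bad pages and on user-forced taints */
            unsigned long mask = (1UL << TAINT_BAD_PAGE) | (1UL << TAINT_USER);

            /* prints 0x60 -> boot with panic_on_taint=0x60 */
            printf("panic_on_taint=%#lx\n", mask);
            return 0;
    }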



Re: [PATCH v3] kernel: add panic_on_taint

2020-05-10 Thread Rafael Aquini
On Sun, May 10, 2020 at 10:59:21AM +0800, Baoquan He wrote:
> On 05/09/20 at 09:57am, Rafael Aquini wrote:
> > Analogously to the introduction of panic_on_warn, this patch
> > introduces a kernel option named panic_on_taint in order to
> > provide a simple and generic way to stop execution and catch
> > a coredump when the kernel gets tainted by any given taint flag.
> > 
> > This is useful for debugging sessions as it avoids rebuilding
> > the kernel to explicitly add calls to panic() or BUG() into
> > code sites that introduce the taint flags of interest.
> > Another, perhaps less frequent, use for this option would be
> > as a means for assuring a security policy (in paranoid mode)
> > case where no single taint is allowed for the running system.
> > 
> > Suggested-by: Qian Cai 
> > Signed-off-by: Rafael Aquini 
> > ---
> > Changelog:
> > * v2: get rid of unnecessary/misguided compiler hints   (Luis)
> > * v2: enhance documentation text for the new kernel parameter   (Randy)
> > * v3: drop sysctl interface, keep it only as a kernel parameter (Luis)
> > 
> >  Documentation/admin-guide/kdump/kdump.rst | 10 +
> >  .../admin-guide/kernel-parameters.txt | 15 +++
> >  include/linux/kernel.h|  2 +
> >  kernel/panic.c| 40 +++
> >  kernel/sysctl.c   |  9 -
> >  5 files changed, 75 insertions(+), 1 deletion(-)
> > 
> > diff --git a/Documentation/admin-guide/kdump/kdump.rst 
> > b/Documentation/admin-guide/kdump/kdump.rst
> > index ac7e131d2935..de3cf6d377cc 100644
> > --- a/Documentation/admin-guide/kdump/kdump.rst
> > +++ b/Documentation/admin-guide/kdump/kdump.rst
> > @@ -521,6 +521,16 @@ will cause a kdump to occur at the panic() call.  In 
> > cases where a user wants
> >  to specify this during runtime, /proc/sys/kernel/panic_on_warn can be set 
> > to 1
> >  to achieve the same behaviour.
> >  
> > +Trigger Kdump on add_taint()
> > +
> > +
> > +The kernel parameter, panic_on_taint, calls panic() from within 
> > add_taint(),
> > +whenever the value set in this bitmask matches with the bit flag being set
> > +by add_taint(). This will cause a kdump to occur at the panic() call.
> > +In cases where a user wants to specify this during runtime,
> > +/proc/sys/kernel/panic_on_taint can be set to a respective bitmask value
> > +to achieve the same behaviour.
> > +
> >  Contact
> >  ===
> >  
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> > b/Documentation/admin-guide/kernel-parameters.txt
> > index 7bc83f3d9bdf..4a69fe49a70d 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -3404,6 +3404,21 @@
> > panic_on_warn   panic() instead of WARN().  Useful to cause kdump
> > on a WARN().
> >  
> > +   panic_on_taint= [KNL] conditionally panic() in add_taint()
> > +   Format: 
>   Changed it as 'Format: ' to be
> consistent with the existing other options?

I can resubmit with the change, if it's a strong requirement and the surgery
cannot be done at merge time.


> > +   Specifies, as a string, the TAINT flag set that will
> > +   compose a bitmask for calling panic() when the kernel
> > +   gets tainted.
> > +   See Documentation/admin-guide/tainted-kernels.rst for
> > +   details on the taint flags that users can pick to
> > +   compose the bitmask to assign to panic_on_taint.
> > +   When the string is prefixed with a '-' the bitmask
> > +   set in panic_on_taint will be mutually exclusive
> > +   with the sysctl knob kernel.tainted, and any attempt
> > +   to write to that sysctl will fail with -EINVAL for
> > +   any taint value that masks with the flags set for
> > +   this option.
> > +
> > crash_kexec_post_notifiers
> > Run kdump after running panic-notifiers and dumping
> > kmsg. This only for the users who doubt kdump always
> > diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> > index 9b7a8d74a9d6..66bc102cb59a 100644
> > --- a/include/linux/kernel.h
> > +++ b/include/linux/kernel.h
> > @@ -528,6 +528,8 @@ extern int panic_on_oops;
> >  e

Re: [PATCH 01/15] taint: add module firmware crash taint support

2020-05-09 Thread Rafael Aquini
On Sat, May 09, 2020 at 04:35:38AM +, Luis Chamberlain wrote:
> Device driver firmware can crash, and sometimes, this can leave your
> system in a state which makes the device or subsystem completely
> useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
> of scraping some magical words from the kernel log, which is driver
> specific, is much easier. So instead provide a helper which lets drivers
> annotate this.
> 
> Once this happens, scrapers can easily look for modules taint flags
> for a firmware crash. This will taint both the kernel and respective
> calling module.
> 
> The new helper module_firmware_crashed() uses LOCKDEP_STILL_OK as this
> fact should in no way shape or form affect lockdep. This taint is device
> driver specific.
> 
> Signed-off-by: Luis Chamberlain 
> ---
>  include/linux/kernel.h|  3 ++-
>  include/linux/module.h| 13 +
>  include/trace/events/module.h |  3 ++-
>  kernel/module.c   |  5 +++--
>  kernel/panic.c|  1 +
>  5 files changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 04a5885cec1b..19e1541c82c7 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -601,7 +601,8 @@ extern enum system_states {
>  #define TAINT_LIVEPATCH  15
>  #define TAINT_AUX16
>  #define TAINT_RANDSTRUCT 17
> -#define TAINT_FLAGS_COUNT18
> +#define TAINT_FIRMWARE_CRASH 18
> +#define TAINT_FLAGS_COUNT19
>

We are still missing the documentation bits for this
new flag, though.

How about having a blurb similar to:

diff --git a/Documentation/admin-guide/tainted-kernels.rst 
b/Documentation/admin-guide/tainted-kernels.rst
index 71e9184a9079..5c6a9e2478b0 100644
--- a/Documentation/admin-guide/tainted-kernels.rst
+++ b/Documentation/admin-guide/tainted-kernels.rst
@@ -100,6 +100,7 @@ Bit  Log  Number  Reason that got the kernel tainted
  15  _/K   32768  kernel has been live patched
  16  _/X   65536  auxiliary taint, defined for and used by distros
  17  _/T  131072  kernel was built with the struct randomization plugin
+ 18  _/Q  262144  driver firmware crash annotation
 ===  ===  ==  

 Note: The character ``_`` is representing a blank in this table to make reading
@@ -162,3 +163,7 @@ More detailed explanation for tainting
  produce extremely unusual kernel structure layouts (even performance
  pathological ones), which is important to know when debugging. Set at
  build time.
+
+ 18) ``Q`` Device drivers might annotate the kernel with this taint, in cases
+ their firmware might have crashed leaving the driver in a crippled and
+ potentially useless state.




>  struct taint_flag {
>   char c_true;/* character printed when tainted */
> diff --git a/include/linux/module.h b/include/linux/module.h
> index 2c2e988bcf10..221200078180 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -697,6 +697,14 @@ static inline bool is_livepatch_module(struct module 
> *mod)
>  bool is_module_sig_enforced(void);
>  void set_module_sig_enforced(void);
>  
> +void add_taint_module(struct module *mod, unsigned flag,
> +   enum lockdep_ok lockdep_ok);
> +
> +static inline void module_firmware_crashed(void)
> +{
> + add_taint_module(THIS_MODULE, TAINT_FIRMWARE_CRASH, LOCKDEP_STILL_OK);
> +}
> +
>  #else /* !CONFIG_MODULES... */
>  
>  static inline struct module *__module_address(unsigned long addr)
> @@ -844,6 +852,11 @@ void *dereference_module_function_descriptor(struct 
> module *mod, void *ptr)
>   return ptr;
>  }
>  
> +static inline void module_firmware_crashed(void)
> +{
> + add_taint(TAINT_FIRMWARE_CRASH, LOCKDEP_STILL_OK);
> +}
> +
>  #endif /* CONFIG_MODULES */
>  
>  #ifdef CONFIG_SYSFS
> diff --git a/include/trace/events/module.h b/include/trace/events/module.h
> index 097485c73c01..b749ea25affd 100644
> --- a/include/trace/events/module.h
> +++ b/include/trace/events/module.h
> @@ -26,7 +26,8 @@ struct module;
>   { (1UL << TAINT_OOT_MODULE),"O" },  \
>   { (1UL << TAINT_FORCED_MODULE), "F" },  \
>   { (1UL << TAINT_CRAP),  "C" },  \
> - { (1UL << TAINT_UNSIGNED_MODULE),   "E" })
> + { (1UL << TAINT_UNSIGNED_MODULE),   "E" },  \
> + { (1UL << TAINT_FIRMWARE_CRASH),"Q" })
>  
>  TRACE_EVENT(module_load,
>  
> diff --git a/kernel/module.c b/kernel/module.c
> index 80faaf2116dd..f98e8c25c6b4 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -325,12 +325,13 @@ static inline int strong_try_module_get(struct module 
> *mod)
>   return -ENOENT;
>  }
>  
> -static inline void add_taint_module(struct module *mod, unsigned flag,
> - enum lockdep_ok lockdep_ok)
> +void 

Re: [PATCH v2] kernel: add panic_on_taint

2020-05-09 Thread Rafael Aquini
On Sat, May 09, 2020 at 03:48:54AM +, Luis Chamberlain wrote:
> On Fri, May 08, 2020 at 08:47:19AM -0400, Rafael Aquini wrote:
> > On Thu, May 07, 2020 at 10:25:58PM +, Luis Chamberlain wrote:
> > > On Thu, May 07, 2020 at 06:06:06PM -0400, Rafael Aquini wrote:
> > > > On Thu, May 07, 2020 at 08:33:40PM +, Luis Chamberlain wrote:
> > > > > I *think* that a cmdline route to enable this would likely remove the
> > > > > need for the kernel config for this. But even with Vlastimil's work
> > > > > merged, I think we'd want yet-another value to enable / disable this
> > > > > feature. Do we need yet-another-taint flag to tell us that this 
> > > > > feature
> > > > > was enabled?
> > > > >
> > > > 
> > > > I guess it makes sense to get rid of the sysctl interface for
> > > > proc_on_taint, and only keep it as a cmdline option. 
> > > 
> > > That would be easier to support and k3eps this simple.
> > > 
> > > > But the real issue seems to be, regardless we go with a cmdline-only 
> > > > option
> > > > or not, the ability of proc_taint() to set any arbitrary taint flag 
> > > > other than just marking the kernel with TAINT_USER. 
> > > 
> > > I think we would have no other option but to add a new TAINT flag so
> > > that we know that the taint flag was modified by a user. Perhaps just
> > > re-using TAINT_USER when proc_taint() would suffice.
> > >
> > 
> > We might not need an extra taint flag if, perhaps, we could make these
> > two features mutually exclusive. The idea here is that bitmasks added 
> > via panic_on_taint get filtered out in proc_taint(), so a malicious 
> > user couldn't exploit the latter interface to easily panic the system,
> > when the first one is also in use. 
> 
> I get it, however I I can still see the person who enables enabling
> panic-on-tain wanting to know if proc_taint() was used. So even if
> it was not on their mask, if it was modified that seems like important
> information for a bug report analysis.
>

For that purpose (tracking user taints), I think something along these
lines would work:

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8a176d8727a3..651a82c13621 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2602,6 +2602,9 @@ int proc_douintvec(struct ctl_table *table, int write,
 do_proc_douintvec_conv, NULL);
 }

+/* track which taint bits were set by the user */
+static unsigned long user_tainted;
+
 /*
  * Taint values can only be increased
  * This means we can safely use a temporary.
@@ -2629,11 +2632,20 @@ static int proc_taint(struct ctl_table *table, int 
write,
 */
int i;
for (i = 0; i < BITS_PER_LONG && tmptaint >> i; i++) {
-   if ((tmptaint >> i) & 1)
+   if ((tmptaint >> i) & 1) {
+   set_bit(i, &user_tainted);
add_taint(i, LOCKDEP_STILL_OK);
+   }
}
}

+   /*
+* Users with SYS_ADMIN capability can fiddle with any arbitrary
+* taint flag through this interface.
+* If that's the case, we also need to mark the kernel "tainted by user"
+*/
+   add_taint(TAINT_USER, LOCKDEP_STILL_OK);
+
return err;
 }


I don't think, though, that it's panic_on_taint's job to track that. I posted a v3 for
this feature with a way to select whether one wants to avoid user-forced taints
triggering panic() for flags also set in panic_on_taint.
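
For reference, that exclusive ('-' prefixed) mode boils down to a check
along these lines somewhere in the proc_taint() write path; this is only
a sketch with a made-up helper name, not the code from the posted patch
(panic_on_taint and panic_on_taint_exclusive are the globals the v3
patch introduces):

#include <linux/errno.h>
#include <linux/kernel.h>

/*
 * Hypothetical helper: called from proc_taint() after tmptaint has been
 * copied in from userspace, before any bit gets applied via add_taint().
 */
static int user_taint_allowed(unsigned long tmptaint)
{
	if (panic_on_taint_exclusive && (tmptaint & panic_on_taint))
		return -EINVAL;
	return 0;
}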

Cheers,

-- Rafael



[PATCH v3] kernel: add panic_on_taint

2020-05-09 Thread Rafael Aquini
Analogously to the introduction of panic_on_warn, this patch
introduces a kernel option named panic_on_taint in order to
provide a simple and generic way to stop execution and catch
a coredump when the kernel gets tainted by any given taint flag.

This is useful for debugging sessions as it avoids rebuilding
the kernel to explicitly add calls to panic() or BUG() into
code sites that introduce the taint flags of interest.
Another, perhaps less frequent, use for this option would be
as a mean for assuring a security policy (in paranoid mode)
case where no single taint is allowed for the running system.

Suggested-by: Qian Cai 
Signed-off-by: Rafael Aquini 
---
Changelog:
* v2: get rid of unnecessary/misguided compiler hints   (Luis)
* v2: enhance documentation text for the new kernel parameter   (Randy)
* v3: drop sysctl interface, keep it only as a kernel parameter (Luis)

 Documentation/admin-guide/kdump/kdump.rst | 10 +
 .../admin-guide/kernel-parameters.txt | 15 +++
 include/linux/kernel.h|  2 +
 kernel/panic.c| 40 +++
 kernel/sysctl.c   |  9 -
 5 files changed, 75 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kdump/kdump.rst 
b/Documentation/admin-guide/kdump/kdump.rst
index ac7e131d2935..de3cf6d377cc 100644
--- a/Documentation/admin-guide/kdump/kdump.rst
+++ b/Documentation/admin-guide/kdump/kdump.rst
@@ -521,6 +521,16 @@ will cause a kdump to occur at the panic() call.  In cases 
where a user wants
 to specify this during runtime, /proc/sys/kernel/panic_on_warn can be set to 1
 to achieve the same behaviour.
 
+Trigger Kdump on add_taint()
+
+
+The kernel parameter, panic_on_taint, calls panic() from within add_taint(),
+whenever the value set in this bitmask matches with the bit flag being set
+by add_taint(). This will cause a kdump to occur at the panic() call.
+In cases where a user wants to specify this during runtime,
+/proc/sys/kernel/panic_on_taint can be set to a respective bitmask value
+to achieve the same behaviour.
+
 Contact
 ===
 
diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 7bc83f3d9bdf..4a69fe49a70d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3404,6 +3404,21 @@
panic_on_warn   panic() instead of WARN().  Useful to cause kdump
on a WARN().
 
+   panic_on_taint= [KNL] conditionally panic() in add_taint()
+   Format: 
+   Specifies, as a string, the TAINT flag set that will
+   compose a bitmask for calling panic() when the kernel
+   gets tainted.
+   See Documentation/admin-guide/tainted-kernels.rst for
+   details on the taint flags that users can pick to
+   compose the bitmask to assign to panic_on_taint.
+   When the string is prefixed with a '-' the bitmask
+   set in panic_on_taint will be mutually exclusive
+   with the sysctl knob kernel.tainted, and any attempt
+   to write to that sysctl will fail with -EINVAL for
+   any taint value that masks with the flags set for
+   this option.
+
crash_kexec_post_notifiers
Run kdump after running panic-notifiers and dumping
kmsg. This only for the users who doubt kdump always
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 9b7a8d74a9d6..66bc102cb59a 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -528,6 +528,8 @@ extern int panic_on_oops;
 extern int panic_on_unrecovered_nmi;
 extern int panic_on_io_nmi;
 extern int panic_on_warn;
+extern unsigned long panic_on_taint;
+extern bool panic_on_taint_exclusive;
 extern int sysctl_panic_on_rcu_stall;
 extern int sysctl_panic_on_stackoverflow;
 
diff --git a/kernel/panic.c b/kernel/panic.c
index b69ee9e76cb2..65c62f8a1de8 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -44,6 +45,8 @@ static int pause_on_oops_flag;
 static DEFINE_SPINLOCK(pause_on_oops_lock);
 bool crash_kexec_post_notifiers;
 int panic_on_warn __read_mostly;
+unsigned long panic_on_taint;
+bool panic_on_taint_exclusive = false;
 
 int panic_timeout = CONFIG_PANIC_TIMEOUT;
 EXPORT_SYMBOL_GPL(panic_timeout);
@@ -434,6 +437,11 @@ void add_taint(unsigned flag, enum lockdep_ok lockdep_ok)
pr_warn("Disabling lock debugging due to kernel taint\n");
 
set_bit(flag, &tainted_mask);
+
+   if (tainted_mask & panic_on_taint) {
+   panic_on_taint = 0;
+   panic("

Re: [RFC] taint: add module firmware crash taint support

2020-05-08 Thread Rafael Aquini
On Fri, May 08, 2020 at 02:14:38AM +, Luis Chamberlain wrote:
> Device driver firmware can crash, and sometimes, this can leave your
> system in a state which makes the device or subsystem completely
> useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
> of scraping some magical words from the kernel log, which is driver
> specific, is much easier. So instead provide a helper which lets drivers
> annotate this.
> 
> Once this happens, scrapers can easily scrape modules taint flags.
> This will taint both the kernel and respective calling module.
> 
> The new helper module_firmware_crashed() uses LOCKDEP_STILL_OK as
> this fact should in no way shape or form affect lockdep. This taint
> is device driver specific.
> 
> Signed-off-by: Luis Chamberlain 
> ---
> 
> Below is the full diff stat of manual inspection throughout the kernel
> when this happens. My methodology is to just scrape for "crash" and
> then study the driver a bit to see if indeed it seems like that the
> firmware crashes there. In *many* cases there is even infrastructure
> for this, so this is sometimes clearly obvious. Some other times it
> required a bit of deciphering.
> 
> The diff stat below is what I have so far, however the patch below
> only includes the drivers that start with Q, as they were a source of
> inspiration for this, and to make this RFC easier to read.
> 
> If this seems sensible, I can follow up with the kernel helper first,
> and then tackle each subsystem independently.
> 
> I purposely skipped review of remoteproc and virtualization. That should
> require its own separate careful review and considerations.
> 
...
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 04a5885cec1b..19e1541c82c7 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -601,7 +601,8 @@ extern enum system_states {
>  #define TAINT_LIVEPATCH  15
>  #define TAINT_AUX16
>  #define TAINT_RANDSTRUCT 17
> -#define TAINT_FLAGS_COUNT18
> +#define TAINT_FIRMWARE_CRASH 18
> +#define TAINT_FLAGS_COUNT19
>  

As others have already mentioned (and I guess it was your idea too)
it would be nicer to split the changes into a set.

We also will need to update Documentation/admin-guide/tainted-kernels.rst
to reflect the inclusion of this new flag.

-- Rafael



Re: [PATCH v2] kernel: add panic_on_taint

2020-05-08 Thread Rafael Aquini
On Thu, May 07, 2020 at 10:25:58PM +, Luis Chamberlain wrote:
> On Thu, May 07, 2020 at 06:06:06PM -0400, Rafael Aquini wrote:
> > On Thu, May 07, 2020 at 08:33:40PM +, Luis Chamberlain wrote:
> > > I *think* that a cmdline route to enable this would likely remove the
> > > need for the kernel config for this. But even with Vlastimil's work
> > > merged, I think we'd want yet-another value to enable / disable this
> > > feature. Do we need yet-another-taint flag to tell us that this feature
> > > was enabled?
> > >
> > 
> > I guess it makes sense to get rid of the sysctl interface for
> > proc_on_taint, and only keep it as a cmdline option. 
> 
> That would be easier to support and k3eps this simple.
> 
> > But the real issue seems to be, regardless we go with a cmdline-only option
> > or not, the ability of proc_taint() to set any arbitrary taint flag 
> > other than just marking the kernel with TAINT_USER. 
> 
> I think we would have no other option but to add a new TAINT flag so
> that we know that the taint flag was modified by a user. Perhaps just
> re-using TAINT_USER when proc_taint() would suffice.
>

We might not need an extra taint flag if, perhaps, we could make these
two features mutually exclusive. The idea here is that bitmasks added 
via panic_on_taint get filtered out in proc_taint(), so a malicious 
user couldn't exploit the latter interface to easily panic the system,
when the first one is also in use. 
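
Sketched as a hypothetical helper (the name is made up), the filtering
amounts to masking the user-supplied value in proc_taint() before the
add_taint() loop runs; note the v3 patch elsewhere in this thread ended
up rejecting such writes with -EINVAL instead of silently dropping bits:

static inline unsigned long filter_user_taint(unsigned long tmptaint)
{
	/* drop any user-requested bits that would trip panic_on_taint */
	return tmptaint & ~panic_on_taint;
}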
 
-- Rafael



Re: [PATCH] kernel: add panic_on_taint

2020-05-07 Thread Rafael Aquini
On Thu, May 07, 2020 at 07:07:20PM -0400, Qian Cai wrote:
> 
> 
> > On May 7, 2020, at 6:15 PM, Rafael Aquini  wrote:
> > 
> > It's a reasonable and self-contained feature that we have a valid use for. 
> > I honestly fail to see it causing that amount of annoyance as you are 
> > suggesting here.
> 
> It is not a big trouble yet, but keeping an obsolete patch that not very 
> straightforward to figure out that it will be superseded by the 
> panic_on_taint patch will only cause more confusion the longer it has stayed 
> in linux-next.
> 
> The thing is that even if you can’t get this panic_on_taint (the superior 
> solution) patch accepted for some reasons, someone else could still work on 
> it until it get merged.
> 
> Thus, I failed to see any possibility we will go back to the inferior 
> solution (mm-slub-add-panic_on_error-to-the-debug-facilities.patch) by all 
> means.
>

There are plenty of examples of things being added, changed, and
removed in -next. IOW, living in a transient state. I think it's 
a reasonable compromise to keep it while the other one is being
ironed out.

The fact that you prefer one solution to another doesn't
invalidate the one you dislike. 

Cheers,
-- Rafael



Re: [PATCH] kernel: add panic_on_taint

2020-05-07 Thread Rafael Aquini
On Thu, May 07, 2020 at 06:05:27PM -0400, Qian Cai wrote:
> 
> 
> > On May 7, 2020, at 4:42 PM, Rafael Aquini  wrote:
> > 
> > On Wed, May 06, 2020 at 10:50:19PM -0400, Qian Cai wrote:
> >> 
> >> 
> >>> On May 6, 2020, at 6:28 PM, Rafael Aquini  wrote:
> >>> 
> >>> Analogously to the introduction of panic_on_warn, this patch
> >>> introduces a kernel option named panic_on_taint in order to
> >>> provide a simple and generic way to stop execution and catch
> >>> a coredump when the kernel gets tainted by any given taint flag.
> >>> 
> >>> This is useful for debugging sessions as it avoids rebuilding
> >>> the kernel to explicitly add calls to panic() or BUG() into
> >>> code sites that introduce the taint flags of interest.
> >>> Another, perhaps less frequent, use for this option would be
> >>> as a mean for assuring a security policy (in paranoid mode)
> >>> case where no single taint is allowed for the running system.
> >> 
> >> Andrew, you can drop the patch below from -mm now because that one is now 
> >> obsolete,
> >> 
> >> mm-slub-add-panic_on_error-to-the-debug-facilities.patch
> >> 
> > Please, don't drop it yet. I'll send a patch to get rid of the bits,
> > once this one gets accepted, if it gets accepted.
> 
> Why do you ever want that obsolete patch even show up in linux-next to 
> potentailly waste other people/bots time to test it and develop things on top 
> of it?
>

It's a reasonable and self-contained feature that we have a valid use for. 
I honestly fail to see it causing that amount of annoyance as you are 
suggesting here.



Re: [PATCH v2] kernel: add panic_on_taint

2020-05-07 Thread Rafael Aquini
On Thu, May 07, 2020 at 08:33:40PM +, Luis Chamberlain wrote:
> On Thu, May 07, 2020 at 02:47:05PM -0400, Rafael Aquini wrote:
> > On Thu, May 07, 2020 at 02:43:16PM -0400, Rafael Aquini wrote:
> > > On Thu, May 07, 2020 at 06:22:57PM +, Luis Chamberlain wrote:
> > > > On Thu, May 07, 2020 at 02:06:31PM -0400, Rafael Aquini wrote:
> > > > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > > > > index 8a176d8727a3..b80ab660d727 100644
> > > > > --- a/kernel/sysctl.c
> > > > > +++ b/kernel/sysctl.c
> > > > > @@ -1217,6 +1217,13 @@ static struct ctl_table kern_table[] = {
> > > > >   .extra1 = SYSCTL_ZERO,
> > > > >   .extra2 = SYSCTL_ONE,
> > > > >   },
> > > > > + {
> > > > > + .procname   = "panic_on_taint",
> > > > > + .data   = &panic_on_taint,
> > > > > + .maxlen = sizeof(unsigned long),
> > > > > + .mode   = 0644,
> > > > > + .proc_handler   = proc_doulongvec_minmax,
> > > > > + },
> > > > 
> > > > You sent this out before I could reply to the other thread on v1.
> > > > My thoughts on the min / max values, or lack here:
> > > > 
> > > > 
> > > > Valid range doesn't mean "currently allowed defined" masks. 
> > > > 
> > > > 
> > > > For example, if you expect to panic due to a taint, but a new taint type
> > > > you want was not added on an older kernel you would be under a very
> > > > *false* sense of security that your kernel may not have hit such a
> > > > taint, but the reality of the situation was that the kernel didn't
> > > > support that taint flag only added in future kernels.   
> > > > 
> > > > 
> > > > You may need to define a new flag (MAX_TAINT) which should be the last
> > > > value + 1, the allowed max values would be  
> > > > 
> > > > 
> > > > (2^MAX_TAINT)-1 
> > > > 
> > > > 
> > > > or  
> > > > 
> > > > 
> > > > (1<<MAX_TAINT)-1
> > > > Since this is to *PANIC* I think we do want to test ranges and ensure
> > > > only valid ones are allowed.
> > > >
> > > 
> > > Ok. I'm thinking in:
> > > 
> > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > > index 8a176d8727a3..ee492431e7b0 100644
> > > --- a/kernel/sysctl.c
> > > +++ b/kernel/sysctl.c
> > > @@ -1217,6 +1217,15 @@ static struct ctl_table kern_table[] = {
> > > .extra1 = SYSCTL_ZERO,
> > > .extra2 = SYSCTL_ONE,
> > > },
> > > +   {
> > > +   .procname   = "panic_on_taint",
> > > +   .data   = &panic_on_taint,
> > > +   .maxlen = sizeof(unsigned long),
> > > +   .mode   = 0644,
> > > +   .proc_handler   = proc_doulongvec_minmax,
> > > +   .extra1 = SYSCTL_ZERO,
> > > +   .extra2 = (1 << TAINT_FLAGS_COUNT << 1) - 1,
> > 
> > Without that crap, obviously. Sorry. That was a screw up on my side,
> > when copyin' and pasting.
> 
> I really think that the implications of this needs a bit further review,
> hence the wider CCs.
> 
> Since this can trivially crash a system, I think we need to be careful
> about this value. First, proc_doulongvec_minmax() will not suffice alone,
> we'll *at least* want to check for capable(CAP_SYS_ADMIN)) as in
> proc_taint().  Second first note that we *always* build proc_taint(), if
> just CONFIG_PROC_SYSCTL is enabled. That has been the way since it got
> merged via commit 34f5a39899f3f ("Add TAINT_USER and ability to set
> taint flags from userspace") since v2.6.21. We need to evaluate if this
> little *new* knob you are introducing merits its own kconfig tucked away
> under debugging first. The ship has already sailed for proc_taint().
> Anyo

Re: [PATCH] kernel: add panic_on_taint

2020-05-07 Thread Rafael Aquini
On Wed, May 06, 2020 at 10:50:19PM -0400, Qian Cai wrote:
> 
> 
> > On May 6, 2020, at 6:28 PM, Rafael Aquini  wrote:
> > 
> > Analogously to the introduction of panic_on_warn, this patch
> > introduces a kernel option named panic_on_taint in order to
> > provide a simple and generic way to stop execution and catch
> > a coredump when the kernel gets tainted by any given taint flag.
> > 
> > This is useful for debugging sessions as it avoids rebuilding
> > the kernel to explicitly add calls to panic() or BUG() into
> > code sites that introduce the taint flags of interest.
> > Another, perhaps less frequent, use for this option would be
> > as a mean for assuring a security policy (in paranoid mode)
> > case where no single taint is allowed for the running system.
> 
> Andrew, you can drop the patch below from -mm now because that one is now 
> obsolete,
> 
> mm-slub-add-panic_on_error-to-the-debug-facilities.patch
>
Please, don't drop it yet. I'll send a patch to get rid of the bits,
once this one gets accepted, if it gets accepted.

-- Rafael 



Re: [PATCH v2] kernel: add panic_on_taint

2020-05-07 Thread Rafael Aquini
On Thu, May 07, 2020 at 06:50:46PM +, Luis Chamberlain wrote:
> On Thu, May 07, 2020 at 02:06:31PM -0400, Rafael Aquini wrote:
> > Another, perhaps less frequent, use for this option would be
> > as a mean for assuring a security policy (in paranoid mode)
> > case where no single taint is allowed for the running system.
> 
> If used for this purpose then we must add a new TAINT flag for
> proc_taint() was used, otherwise we can cheat to show a taint
> *did* happen, where in fact it never happened, some punk just
> echo'd a value into the kernel's /proc/sys/kernel/tainted.
>

To accomplish that, the punk would need to be root, though, in which 
case everything else is doomed already.
 
> Forunately proc_taint() only allows to *increment* the taint, not
> reduce.
> 
>   Luis
> 



Re: [PATCH v2] kernel: add panic_on_taint

2020-05-07 Thread Rafael Aquini
On Thu, May 07, 2020 at 02:43:16PM -0400, Rafael Aquini wrote:
> On Thu, May 07, 2020 at 06:22:57PM +, Luis Chamberlain wrote:
> > On Thu, May 07, 2020 at 02:06:31PM -0400, Rafael Aquini wrote:
> > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > > index 8a176d8727a3..b80ab660d727 100644
> > > --- a/kernel/sysctl.c
> > > +++ b/kernel/sysctl.c
> > > @@ -1217,6 +1217,13 @@ static struct ctl_table kern_table[] = {
> > >   .extra1 = SYSCTL_ZERO,
> > >   .extra2 = SYSCTL_ONE,
> > >   },
> > > + {
> > > + .procname   = "panic_on_taint",
> > > + .data   = &panic_on_taint,
> > > + .maxlen = sizeof(unsigned long),
> > > + .mode   = 0644,
> > > + .proc_handler   = proc_doulongvec_minmax,
> > > + },
> > 
> > You sent this out before I could reply to the other thread on v1.
> > My thoughts on the min / max values, or lack here:
> > 
> > 
> > Valid range doesn't mean "currently allowed defined" masks. 
> > 
> > 
> > For example, if you expect to panic due to a taint, but a new taint type
> > you want was not added on an older kernel you would be under a very
> > *false* sense of security that your kernel may not have hit such a
> > taint, but the reality of the situation was that the kernel didn't
> > support that taint flag only added in future kernels.   
> > 
> > 
> > You may need to define a new flag (MAX_TAINT) which should be the last
> > value + 1, the allowed max values would be  
> > 
> > 
> > (2^MAX_TAINT)-1 
> > 
> > 
> > or  
> > 
> > 
> > (1<<MAX_TAINT)-1
> > Since this is to *PANIC* I think we do want to test ranges and ensure
> > only valid ones are allowed.
> >
> 
> Ok. I'm thinking in:
> 
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 8a176d8727a3..ee492431e7b0 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1217,6 +1217,15 @@ static struct ctl_table kern_table[] = {
> .extra1 = SYSCTL_ZERO,
> .extra2 = SYSCTL_ONE,
> },
> +   {
> +   .procname   = "panic_on_taint",
> +   .data   = &panic_on_taint,
> +   .maxlen = sizeof(unsigned long),
> +   .mode   = 0644,
> +   .proc_handler   = proc_doulongvec_minmax,
> +   .extra1 = SYSCTL_ZERO,
> +   .extra2 = (1 << TAINT_FLAGS_COUNT << 1) - 1,

Without that crap, obviously. Sorry. That was a screw up on my side,
when copyin' and pasting.

-- Rafael

> +   },
> 
> 
> Would that address your concerns wrt this one?
> 
> Cheers!
> -- Rafael



Re: [PATCH v2] kernel: add panic_on_taint

2020-05-07 Thread Rafael Aquini
On Thu, May 07, 2020 at 06:22:57PM +, Luis Chamberlain wrote:
> On Thu, May 07, 2020 at 02:06:31PM -0400, Rafael Aquini wrote:
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index 8a176d8727a3..b80ab660d727 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -1217,6 +1217,13 @@ static struct ctl_table kern_table[] = {
> > .extra1 = SYSCTL_ZERO,
> > .extra2 = SYSCTL_ONE,
> > },
> > +   {
> > +   .procname   = "panic_on_taint",
> > +   .data   = &panic_on_taint,
> > +   .maxlen = sizeof(unsigned long),
> > +   .mode   = 0644,
> > +   .proc_handler   = proc_doulongvec_minmax,
> > +   },
> 
> You sent this out before I could reply to the other thread on v1.
> My thoughts on the min / max values, or lack here:
>   
>   
> Valid range doesn't mean "currently allowed defined" masks.   
>   
> 
> For example, if you expect to panic due to a taint, but a new taint type
> you want was not added on an older kernel you would be under a very
> *false* sense of security that your kernel may not have hit such a
> taint, but the reality of the situation was that the kernel didn't
> support that taint flag only added in future kernels. 
>   
> 
> You may need to define a new flag (MAX_TAINT) which should be the last
> value + 1, the allowed max values would be
>   
> 
> (2^MAX_TAINT)-1   
>   
> 
> or
>   
> 
> (1<<MAX_TAINT)-1
> Since this is to *PANIC* I think we do want to test ranges and ensure
> only valid ones are allowed.
>

Ok. I'm thinking in:

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8a176d8727a3..ee492431e7b0 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1217,6 +1217,15 @@ static struct ctl_table kern_table[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_ONE,
},
+   {
+   .procname   = "panic_on_taint",
+   .data   = &panic_on_taint,
+   .maxlen = sizeof(unsigned long),
+   .mode   = 0644,
+   .proc_handler   = proc_doulongvec_minmax,
+   .extra1 = SYSCTL_ZERO,
+   .extra2 = (1 << TAINT_FLAGS_COUNT << 1) - 1,
+   },


Would that address your concerns wrt this one?

Cheers!
-- Rafael



[PATCH v2] kernel: add panic_on_taint

2020-05-07 Thread Rafael Aquini
Analogously to the introduction of panic_on_warn, this patch
introduces a kernel option named panic_on_taint in order to
provide a simple and generic way to stop execution and catch
a coredump when the kernel gets tainted by any given taint flag.

This is useful for debugging sessions as it avoids rebuilding
the kernel to explicitly add calls to panic() or BUG() into
code sites that introduce the taint flags of interest.
Another, perhaps less frequent, use for this option would be
as a mean for assuring a security policy (in paranoid mode)
case where no single taint is allowed for the running system.

Suggested-by: Qian Cai 
Signed-off-by: Rafael Aquini 
---
Changelog, from v1:
* get rid of unnecessary/misguided compiler hints   (Luis)
* enhance documentation text for the new kernel parameter   (Randy)

 Documentation/admin-guide/kdump/kdump.rst  | 10 ++
 .../admin-guide/kernel-parameters.txt  |  7 +++
 Documentation/admin-guide/sysctl/kernel.rst| 18 ++
 include/linux/kernel.h |  1 +
 kernel/panic.c |  7 +++
 kernel/sysctl.c|  7 +++
 6 files changed, 50 insertions(+)

diff --git a/Documentation/admin-guide/kdump/kdump.rst 
b/Documentation/admin-guide/kdump/kdump.rst
index ac7e131d2935..de3cf6d377cc 100644
--- a/Documentation/admin-guide/kdump/kdump.rst
+++ b/Documentation/admin-guide/kdump/kdump.rst
@@ -521,6 +521,16 @@ will cause a kdump to occur at the panic() call.  In cases 
where a user wants
 to specify this during runtime, /proc/sys/kernel/panic_on_warn can be set to 1
 to achieve the same behaviour.
 
+Trigger Kdump on add_taint()
+
+
+The kernel parameter, panic_on_taint, calls panic() from within add_taint(),
+whenever the value set in this bitmask matches with the bit flag being set
+by add_taint(). This will cause a kdump to occur at the panic() call.
+In cases where a user wants to specify this during runtime,
+/proc/sys/kernel/panic_on_taint can be set to a respective bitmask value
+to achieve the same behaviour.
+
 Contact
 ===
 
diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 7bc83f3d9bdf..6aa524928399 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3404,6 +3404,13 @@
panic_on_warn   panic() instead of WARN().  Useful to cause kdump
on a WARN().
 
+   panic_on_taint= Bitmask for calling panic() when the kernel gets tainted
+   and the taint bit set matches with the assigned bitmask.
+   See Documentation/admin-guide/tainted-kernels.rst for
+   details on the taint flag bits that user can pick to
+   compose the bitmask to assign to panic_on_taint.
+   When set to 0 (default), this option is disabled.
+
crash_kexec_post_notifiers
Run kdump after running panic-notifiers and dumping
kmsg. This only for the users who doubt kdump always
diff --git a/Documentation/admin-guide/sysctl/kernel.rst 
b/Documentation/admin-guide/sysctl/kernel.rst
index 0d427fd10941..60500d5c1ee0 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -658,6 +658,24 @@ a kernel rebuild when attempting to kdump at the location 
of a WARN().
 = 
 
 
+panic_on_taint
+==
+
+Bitmask for calling panic() in the add_taint() path.
+This is useful to avoid a kernel rebuild when attempting to
+kdump at the insertion of any specific TAINT flags.
+When set to 0 (default), the non-crashing default behavior
+of add_panic() is maintained.
+
+See :doc:`/admin-guide/tainted-kernels` for a complete list of
+taint flag bits and their decimal decoded values.
+
+So, for example, to panic if the kernel gets tainted due to
+occurrences of bad pages and/or machine check errors, a user can::
+
+  echo 48 > /proc/sys/kernel/panic_on_taint
+
+
 panic_print
 ===
 
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 9b7a8d74a9d6..518b9fd381c2 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -528,6 +528,7 @@ extern int panic_on_oops;
 extern int panic_on_unrecovered_nmi;
 extern int panic_on_io_nmi;
 extern int panic_on_warn;
+extern unsigned long panic_on_taint;
 extern int sysctl_panic_on_rcu_stall;
 extern int sysctl_panic_on_stackoverflow;
 
diff --git a/kernel/panic.c b/kernel/panic.c
index b69ee9e76cb2..d284241e702b 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -44,6 +44,7 @@ static int pause_on_oops_flag;
 static DEFINE_SPINLOCK(pause_on_oops_lock);
 bool crash_kexec_post_notifiers;
 int panic_on_warn __read_mostly;
+unsigned long panic_on_taint;
 
 int panic_time

Re: [PATCH v2] mm: expand documentation over __read_mostly

2020-05-07 Thread Rafael Aquini
On Thu, May 07, 2020 at 04:14:24PM +, Luis Chamberlain wrote:
> __read_mostly can easily be misused by folks, its not meant for
> just read-only data. There are performance reasons for using it, but
> we also don't provide any guidance about its use. Provide a bit more
> guidance over its use.
> 
> Acked-by: Christoph Lameter 
> Signed-off-by: Luis Chamberlain 
> ---
> 
> This v2 just has a few spelling fixes.
> 
>  include/linux/cache.h | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/cache.h b/include/linux/cache.h
> index 750621e41d1c..8106fb304fa7 100644
> --- a/include/linux/cache.h
> +++ b/include/linux/cache.h
> @@ -15,8 +15,14 @@
>  
>  /*
>   * __read_mostly is used to keep rarely changing variables out of frequently
> - * updated cachelines. If an architecture doesn't support it, ignore the
> - * hint.
> + * updated cachelines. Its use should be reserved for data that is used
> + * frequently in hot paths. Performance traces can help decide when to use
> + * this. You want __read_mostly data to be tightly packed, so that in the
> + * best case multiple frequently read variables for a hot path will be next
> + * to each other in order to reduce the number of cachelines needed to
> + * execute a critical path. We should be mindful and selective of its use.
> + * ie: if you're going to use it please supply a *good* justification in your
> + * commit log
>   */
>  #ifndef __read_mostly
>  #define __read_mostly
> -- 
> 2.25.1
> 
Acked-by: Rafael Aquini 



Re: [PATCH] kernel: add panic_on_taint

2020-05-06 Thread Rafael Aquini
On Wed, May 06, 2020 at 11:24:48PM +, Luis Chamberlain wrote:
> On Wed, May 06, 2020 at 06:28:15PM -0400, Rafael Aquini wrote:
> > Analogously to the introduction of panic_on_warn, this patch
> > introduces a kernel option named panic_on_taint in order to
> > provide a simple and generic way to stop execution and catch
> > a coredump when the kernel gets tainted by any given taint flag.
> > 
> > This is useful for debugging sessions as it avoids rebuilding
> > the kernel to explicitly add calls to panic() or BUG() into
> > code sites that introduce the taint flags of interest.
> > Another, perhaps less frequent, use for this option would be
> > as a mean for assuring a security policy (in paranoid mode)
> > case where no single taint is allowed for the running system.
> > 
> > Suggested-by: Qian Cai 
> > Signed-off-by: Rafael Aquini 
> > ---
> >  Documentation/admin-guide/kdump/kdump.rst | 10 ++
> >  .../admin-guide/kernel-parameters.txt |  3 ++
> >  Documentation/admin-guide/sysctl/kernel.rst   | 36 +++
> >  include/linux/kernel.h|  1 +
> >  kernel/panic.c|  7 
> >  kernel/sysctl.c   |  7 
> >  6 files changed, 64 insertions(+)
> > 
> > diff --git a/Documentation/admin-guide/kdump/kdump.rst 
> > b/Documentation/admin-guide/kdump/kdump.rst
> > index ac7e131d2935..de3cf6d377cc 100644
> > --- a/Documentation/admin-guide/kdump/kdump.rst
> > +++ b/Documentation/admin-guide/kdump/kdump.rst
> > @@ -521,6 +521,16 @@ will cause a kdump to occur at the panic() call.  In 
> > cases where a user wants
> >  to specify this during runtime, /proc/sys/kernel/panic_on_warn can be set 
> > to 1
> >  to achieve the same behaviour.
> >  
> > +Trigger Kdump on add_taint()
> > +
> > +
> > +The kernel parameter, panic_on_taint, calls panic() from within 
> > add_taint(),
> > +whenever the value set in this bitmask matches with the bit flag being set
> > +by add_taint(). This will cause a kdump to occur at the panic() call.
> > +In cases where a user wants to specify this during runtime,
> > +/proc/sys/kernel/panic_on_taint can be set to a respective bitmask value
> > +to achieve the same behaviour.
> > +
> >  Contact
> >  ===
> >  
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> > b/Documentation/admin-guide/kernel-parameters.txt
> > index 7bc83f3d9bdf..75c02c1841b2 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -3404,6 +3404,9 @@
> > panic_on_warn   panic() instead of WARN().  Useful to cause kdump
> > on a WARN().
> >  
> > +   panic_on_taint  panic() when the kernel gets tainted, if the taint
> > +   flag being set matches with the assigned bitmask.
> > +
> > crash_kexec_post_notifiers
> > Run kdump after running panic-notifiers and dumping
> > kmsg. This only for the users who doubt kdump always
> > diff --git a/Documentation/admin-guide/sysctl/kernel.rst 
> > b/Documentation/admin-guide/sysctl/kernel.rst
> > index 0d427fd10941..5b880102f2e3 100644
> > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > @@ -658,6 +658,42 @@ a kernel rebuild when attempting to kdump at the 
> > location of a WARN().
> >  = 
> >  
> >  
> > +panic_on_taint
> > +==
> > +
> > +Bitmask for calling panic() in the add_taint() path.
> > +This is useful to avoid a kernel rebuild when attempting to
> > +kdump at the insertion of any specific TAINT flags.
> > +When set to 0 (default) add_taint() default behavior is maintained.
> > +
> > +== 
> > +bit  0 TAINT_PROPRIETARY_MODULE
> > +bit  1 TAINT_FORCED_MODULE
> > +bit  2 TAINT_CPU_OUT_OF_SPEC
> > +bit  3 TAINT_FORCED_RMMOD
> > +bit  4 TAINT_MACHINE_CHECK
> > +bit  5 TAINT_BAD_PAGE
> > +bit  6 TAINT_USER
> > +bit  7 TAINT_DIE
> > +bit  8 TAINT_OVERRIDDEN_ACPI_TABLE
> > +bit  9 TAINT_WARN
> > +bit 10 TAINT_CRAP
> > +bit 11 TAINT_FIRMWARE_WORKAROUND
> > +bit 12 TAINT_OOT_MODULE
> > +bit 13 TAINT_UNSIGNED_MODULE
> > +bit 14 TAINT_SOFTLOCKUP
> > +bit 15 TAINT_LIVEPATCH
> > +bit 16 TAINT_AUX
> > +bit 17 TAINT_RANDSTRUCT

Re: [PATCH] mm: expland documentation over __read_mostly

2020-05-06 Thread Rafael Aquini
On Wed, May 06, 2020 at 11:13:53PM +, Luis Chamberlain wrote:
> __read_mostly can easily be misused by folks, its not meant for
> just read-only data. There are performance reasons for using it, but
> we also don't provide any guidance about its use. Provide a bit more
> guidance over it use.
   s/it/its

same goes for the subject, as I think there is a minor typo: s/expland/expand

> 
> Acked-by: Christoph Lameter 
> Signed-off-by: Luis Chamberlain 
> ---
> 
> I sent this 2 years ago, but it fell through the cracks. This time
> I'm adding Andrew Morton now, the fix0r-of-falling-through-the-cracks.
> 
> Resending as I just saw a patch which doesn't clearly justifiy the
> merits of the use of __read_mostly on it.
> 

That would be my fault! (sorry) Given the rationale below, the patch I sent
really doesn't need the hint. Thanks for the extra bit of education here.

(Not an excuse) At a glance over the source tree, though, it seems most
of the hinting cases are using the hint in the misguided way.
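
For what it's worth, a made-up example of the intended pattern: the
annotated variable is written rarely (at init, or via some tunable) but
read on every iteration of a hot path, so packing it with other
read-mostly data keeps it off frequently dirtied cachelines. The names
below are purely illustrative:

#include <linux/cache.h>
#include <linux/jiffies.h>

/* written once at init (or rarely via a tunable), read on every poll */
static unsigned long poll_interval __read_mostly = HZ;

static inline bool poll_due(unsigned long last_poll)
{
	/* hot-path read of a rarely changing value */
	return time_after(jiffies, last_poll + poll_interval);
}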


>  include/linux/cache.h | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/cache.h b/include/linux/cache.h
> index 750621e41d1c..8106fb304fa7 100644
> --- a/include/linux/cache.h
> +++ b/include/linux/cache.h
> @@ -15,8 +15,14 @@
>  
>  /*
>   * __read_mostly is used to keep rarely changing variables out of frequently
> - * updated cachelines. If an architecture doesn't support it, ignore the
> - * hint.
> + * updated cachelines. Its use should be reserved for data that is used
> + * frequently in hot paths. Performance traces can help decide when to use
> + * this. You want __read_mostly data to be tightly packed, so that in the
> + * best case multiple frequently read variables for a hot path will be next
> + * to each other in order to reduce the number of cachelines needed to
> + * execute a critial path. We should be mindful and selective of its use.
> + * ie: if you're going to use it please supply a *good* justification in your
> + * commit log
>   */
>  #ifndef __read_mostly
>  #define __read_mostly
> -- 
> 2.25.1
> 

Acked-by: Rafael Aquini 



[PATCH] kernel: add panic_on_taint

2020-05-06 Thread Rafael Aquini
Analogously to the introduction of panic_on_warn, this patch
introduces a kernel option named panic_on_taint in order to
provide a simple and generic way to stop execution and catch
a coredump when the kernel gets tainted by any given taint flag.

This is useful for debugging sessions as it avoids rebuilding
the kernel to explicitly add calls to panic() or BUG() into
code sites that introduce the taint flags of interest.
Another, perhaps less frequent, use for this option would be
as a mean for assuring a security policy (in paranoid mode)
case where no single taint is allowed for the running system.

Suggested-by: Qian Cai 
Signed-off-by: Rafael Aquini 
---
 Documentation/admin-guide/kdump/kdump.rst | 10 ++
 .../admin-guide/kernel-parameters.txt |  3 ++
 Documentation/admin-guide/sysctl/kernel.rst   | 36 +++
 include/linux/kernel.h|  1 +
 kernel/panic.c|  7 
 kernel/sysctl.c   |  7 
 6 files changed, 64 insertions(+)

diff --git a/Documentation/admin-guide/kdump/kdump.rst 
b/Documentation/admin-guide/kdump/kdump.rst
index ac7e131d2935..de3cf6d377cc 100644
--- a/Documentation/admin-guide/kdump/kdump.rst
+++ b/Documentation/admin-guide/kdump/kdump.rst
@@ -521,6 +521,16 @@ will cause a kdump to occur at the panic() call.  In cases 
where a user wants
 to specify this during runtime, /proc/sys/kernel/panic_on_warn can be set to 1
 to achieve the same behaviour.
 
+Trigger Kdump on add_taint()
+
+
+The kernel parameter, panic_on_taint, calls panic() from within add_taint(),
+whenever the value set in this bitmask matches with the bit flag being set
+by add_taint(). This will cause a kdump to occur at the panic() call.
+In cases where a user wants to specify this during runtime,
+/proc/sys/kernel/panic_on_taint can be set to a respective bitmask value
+to achieve the same behaviour.
+
 Contact
 ===
 
diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 7bc83f3d9bdf..75c02c1841b2 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3404,6 +3404,9 @@
panic_on_warn   panic() instead of WARN().  Useful to cause kdump
on a WARN().
 
+   panic_on_taint  panic() when the kernel gets tainted, if the taint
+   flag being set matches with the assigned bitmask.
+
crash_kexec_post_notifiers
Run kdump after running panic-notifiers and dumping
kmsg. This only for the users who doubt kdump always
diff --git a/Documentation/admin-guide/sysctl/kernel.rst 
b/Documentation/admin-guide/sysctl/kernel.rst
index 0d427fd10941..5b880102f2e3 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -658,6 +658,42 @@ a kernel rebuild when attempting to kdump at the location 
of a WARN().
 = 
 
 
+panic_on_taint
+==
+
+Bitmask for calling panic() in the add_taint() path.
+This is useful to avoid a kernel rebuild when attempting to
+kdump at the insertion of any specific TAINT flags.
+When set to 0 (default) add_taint() default behavior is maintained.
+
+== 
+bit  0 TAINT_PROPRIETARY_MODULE
+bit  1 TAINT_FORCED_MODULE
+bit  2 TAINT_CPU_OUT_OF_SPEC
+bit  3 TAINT_FORCED_RMMOD
+bit  4 TAINT_MACHINE_CHECK
+bit  5 TAINT_BAD_PAGE
+bit  6 TAINT_USER
+bit  7 TAINT_DIE
+bit  8 TAINT_OVERRIDDEN_ACPI_TABLE
+bit  9 TAINT_WARN
+bit 10 TAINT_CRAP
+bit 11 TAINT_FIRMWARE_WORKAROUND
+bit 12 TAINT_OOT_MODULE
+bit 13 TAINT_UNSIGNED_MODULE
+bit 14 TAINT_SOFTLOCKUP
+bit 15 TAINT_LIVEPATCH
+bit 16 TAINT_AUX
+bit 17 TAINT_RANDSTRUCT
+bit 18 TAINT_FLAGS_COUNT
+== 
+
+So, for example, to panic if the kernel gets tainted due to
+occurrences of bad pages and/or machine check errors, a user can::
+
+  echo 48 > /proc/sys/kernel/panic_on_taint
+
+
 panic_print
 ===
 
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 9b7a8d74a9d6..518b9fd381c2 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -528,6 +528,7 @@ extern int panic_on_oops;
 extern int panic_on_unrecovered_nmi;
 extern int panic_on_io_nmi;
 extern int panic_on_warn;
+extern unsigned long panic_on_taint;
 extern int sysctl_panic_on_rcu_stall;
 extern int sysctl_panic_on_stackoverflow;
 
diff --git a/kernel/panic.c b/kernel/panic.c
index b69ee9e76cb2..e2d4771ab911 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -44,6 +44,7 @@ static int pause_on_oops_flag;
 static DEFINE_SPINLOCK(pause_on_oops_lock);
 bool crash_kexec_post_notifiers;
 int panic_on_warn __read_mostly;
+unsigned long panic_on_taint __read_mostly;
 
 int panic_timeout = CONFIG_PANIC_TIMEOUT;
 EXPORT_SYMBOL_GPL(panic_timeout);
@@ -43

Re: [PATCH] mm: slub: add panic_on_error to the debug facilities

2020-05-03 Thread Rafael Aquini
On Sat, May 02, 2020 at 11:16:30PM +, Christopher Lameter wrote:
> On Fri, 1 May 2020, Rafael Aquini wrote:
> 
> > Sometimes it is desirable to override SLUB's debug facilities
> > default behavior upon stumbling on a cache or object error
> > and just stop the execution in order to grab a coredump, at
> > the error-spotting time, instead of trying to fix the issue
> > and report in an attempt to keep the system rolling.
> 
> The stopping of execution on an error is the default behavior. Usually
> you get some OOPS somewhere when data is corrupted and that causes a core
> dump.
> 
> SLUB can fix the issue and continue if enabled by specifying special
> options on boot. That is *not* the default.
>
It is the default behavior when slub_debug is turned on, which is what
this patch is trying to override, when needed. We've been seeing the
need for such a feature because, more often than not, letting the system
keep running until it crashes somewhere else, after hitting occurrences
reported by slub_debug, ends up clobbering clues to the original issue.

-- Rafael



Re: [PATCH] mm: slub: add panic_on_error to the debug facilities

2020-05-03 Thread Rafael Aquini
On Fri, May 01, 2020 at 07:17:24PM -0400, Qian Cai wrote:
> 
> 
> > On May 1, 2020, at 5:54 PM, Rafael Aquini  wrote:
> > 
> > It seems like a good idea which also would required "adding things"
> > elsewhere, but doesn't look mutually exclusive with the approach here.
> 
> Also, what’s so special about these bad pages here that deserve a crash dump, 
> but not other TAINT_BAD_PAGE places?
>

In that light, yes, they're not that different, and we could better leverage
a panic_on_taint mechanism (similar to panic_on_warn).

I'll get that posted soon.
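
The core of that mechanism, as it ended up in the panic_on_taint patches
posted elsewhere in this archive, is a small check in add_taint(); the
sketch below mirrors those patches, with an illustrative panic message
since the corresponding hunk is truncated in the archive (tainted_mask
and panic_on_taint live in kernel/panic.c):

#include <linux/bitops.h>
#include <linux/debug_locks.h>
#include <linux/kernel.h>

void add_taint(unsigned flag, enum lockdep_ok lockdep_ok)
{
	if (lockdep_ok == LOCKDEP_NOW_UNRELIABLE && __debug_locks_off())
		pr_warn("Disabling lock debugging due to kernel taint\n");

	set_bit(flag, &tainted_mask);

	if (tainted_mask & panic_on_taint) {
		/* cleared first so the panic path cannot recurse on a new taint */
		panic_on_taint = 0;
		panic("panic_on_taint set and kernel got tainted (flag %u)", flag);
	}
}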

Thanks.
-- Rafael 



Re: [PATCH] mm: slub: add panic_on_error to the debug facilities

2020-05-01 Thread Rafael Aquini
On Fri, May 01, 2020 at 05:29:19PM -0400, Qian Cai wrote:
> 
> 
> > On May 1, 2020, at 5:15 PM, Rafael Aquini  wrote:
> > 
> > Sometimes it is desirable to override SLUB's debug facilities
> > default behavior upon stumbling on a cache or object error
> > and just stop the execution in order to grab a coredump, at
> > the error-spotting time, instead of trying to fix the issue
> > and report in an attempt to keep the system rolling.
> > 
> > This patch introduces a new debug flag SLAB_PANIC_ON_ERROR,
> > along with its related SLUB-machinery, in order to extend
> > current slub_debug facilites and provide the aforementioned
> > behavior override.
> 
> Instead of adding those things everywhere. How about adding something like 
> panic_on_taint? Then, you could write specific taint flags you are interested 
> in to that file because slab_bug() will taint it TAINT_BAD_PAGE.
>
It seems like a good idea which would also require "adding things"
elsewhere, but it doesn't look mutually exclusive with the approach here.

Thanks
-- Rafael  



Re: [PATCH] mm: slub: add panic_on_error to the debug facilities

2020-05-01 Thread Rafael Aquini
On Fri, May 01, 2020 at 02:37:51PM -0700, Andrew Morton wrote:
> On Fri,  1 May 2020 17:15:40 -0400 Rafael Aquini  wrote:
> 
> > Sometimes it is desirable to override SLUB's debug facilities
> > default behavior upon stumbling on a cache or object error
> > and just stop the execution in order to grab a coredump, at
> > the error-spotting time, instead of trying to fix the issue
> > and report in an attempt to keep the system rolling.
> > 
> > This patch introduces a new debug flag SLAB_PANIC_ON_ERROR,
> > along with its related SLUB-machinery, in order to extend
> > current slub_debug facilites and provide the aforementioned
> > behavior override.
> > 
> 
> Sees reasonable.
> 
> > --- a/Documentation/vm/slub.rst
> > +++ b/Documentation/vm/slub.rst
> > @@ -54,6 +54,8 @@ Possible debug options are::
> > caused higher minimum slab orders
> > -   Switch all debugging off (useful if the kernel is
> > configured with CONFIG_SLUB_DEBUG_ON)
> > +   C   Toggle panic on error (crash) to allow for post-mortem
> > +   analysis of a coredump taken at the error-spotting time
> 
> nit: "toggle" means to switch to the other state.  But what we're doing
> here is to set to the "on" state.  This:
>

Thanks Andrew, that's indeed much better.
 
> --- 
> a/Documentation/vm/slub.rst~mm-slub-add-panic_on_error-to-the-debug-facilities-fix
> +++ a/Documentation/vm/slub.rst
> @@ -49,12 +49,12 @@ Possible debug options are::
>   P   Poisoning (object and padding)
>   U   User tracking (free and alloc)
>   T   Trace (please only use on single slabs)
> - A   Toggle failslab filter mark for the cache
> + A   Enable failslab filter mark for the cache
>   O   Switch debugging off for caches that would have
>   caused higher minimum slab orders
>   -   Switch all debugging off (useful if the kernel is
>   configured with CONFIG_SLUB_DEBUG_ON)
> - C   Toggle panic on error (crash) to allow for post-mortem
> + C   Enable panic on error (crash) to allow for post-mortem
>   analysis of a coredump taken at the error-spotting time
>  
>  F.e. in order to boot just with sanity checks and red zoning one would 
> specify::
> _
> 
> 



Re: [PATCH v3] mm/slub: Fix incorrect interpretation of s->offset

2020-05-01 Thread Rafael Aquini
> * This is the case if we do RCU, have a constructor or
>* destructor or are poisoning the objects.
> +  *
> +  * The assumption that s->offset >= s->inuse means free
> +  * pointer is outside of the object is used in the
> +  * freeptr_outside_object() function. If that is no
> +  * longer true, the function needs to be modified.
>*/
>   s->offset = size;
>   size += sizeof(void *);
> -- 
> 2.18.1
> 
> 
Acked-by: Rafael Aquini 



[PATCH] mm: slub: add panic_on_error to the debug facilities

2020-05-01 Thread Rafael Aquini
Sometimes it is desirable to override SLUB's debug facilities
default behavior upon stumbling on a cache or object error
and just stop the execution in order to grab a coredump, at
the error-spotting time, instead of trying to fix the issue
and report in an attempt to keep the system rolling.

This patch introduces a new debug flag SLAB_PANIC_ON_ERROR,
along with its related SLUB-machinery, in order to extend
current slub_debug facilites and provide the aforementioned
behavior override.

Signed-off-by: Rafael Aquini 
---
 Documentation/vm/slub.rst |  2 ++
 include/linux/slab.h  |  2 ++
 mm/slab.h |  3 ++-
 mm/slub.c | 44 ++-
 4 files changed, 40 insertions(+), 11 deletions(-)

diff --git a/Documentation/vm/slub.rst b/Documentation/vm/slub.rst
index 933ada4368ff..51b18c28ec78 100644
--- a/Documentation/vm/slub.rst
+++ b/Documentation/vm/slub.rst
@@ -54,6 +54,8 @@ Possible debug options are::
caused higher minimum slab orders
-   Switch all debugging off (useful if the kernel is
configured with CONFIG_SLUB_DEBUG_ON)
+   C   Toggle panic on error (crash) to allow for post-mortem
+   analysis of a coredump taken at the error-spotting time
 
 F.e. in order to boot just with sanity checks and red zoning one would 
specify::
 
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 6d454886bcaf..e3496ad7859f 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -25,6 +25,8 @@
  */
 /* DEBUG: Perform (expensive) checks on alloc/free */
 #define SLAB_CONSISTENCY_CHECKS((slab_flags_t __force)0x0100U)
+/* DEBUG: panic on error (forced crash) */
+#define SLAB_PANIC_ON_ERROR((slab_flags_t __force)0x0200U)
 /* DEBUG: Red zone objs in a cache */
 #define SLAB_RED_ZONE  ((slab_flags_t __force)0x0400U)
 /* DEBUG: Poison objects */
diff --git a/mm/slab.h b/mm/slab.h
index 207c83ef6e06..27116f8683a1 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -198,7 +198,8 @@ static inline slab_flags_t kmem_cache_flags(unsigned int 
object_size,
 #define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER)
 #elif defined(CONFIG_SLUB_DEBUG)
 #define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
- SLAB_TRACE | SLAB_CONSISTENCY_CHECKS)
+ SLAB_TRACE | SLAB_CONSISTENCY_CHECKS | \
+ SLAB_PANIC_ON_ERROR)
 #else
 #define SLAB_DEBUG_FLAGS (0)
 #endif
diff --git a/mm/slub.c b/mm/slub.c
index 9bf44955c4f1..8b4fc002b865 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -700,8 +700,6 @@ static void print_trailer(struct kmem_cache *s, struct page 
*page, u8 *p)
/* Beginning of the filler is the free pointer */
print_section(KERN_ERR, "Padding ", p + off,
  size_from_object(s) - off);
-
-   dump_stack();
 }
 
 void object_err(struct kmem_cache *s, struct page *page,
@@ -709,6 +707,9 @@ void object_err(struct kmem_cache *s, struct page *page,
 {
slab_bug(s, "%s", reason);
print_trailer(s, page, object);
+   if (unlikely(s->flags & SLAB_PANIC_ON_ERROR))
+   panic("BUG: %s: %s", s->name, reason);
+   dump_stack();
 }
 
 static __printf(3, 4) void slab_err(struct kmem_cache *s, struct page *page,
@@ -722,6 +723,8 @@ static __printf(3, 4) void slab_err(struct kmem_cache *s, 
struct page *page,
va_end(args);
slab_bug(s, "%s", buf);
print_page_info(page);
+   if (unlikely(s->flags & SLAB_PANIC_ON_ERROR))
+   panic("BUG: %s: %s", s->name, buf);
dump_stack();
 }
 
@@ -771,7 +774,7 @@ static int check_bytes_and_report(struct kmem_cache *s, 
struct page *page,
fault, end - 1, fault - addr,
fault[0], value);
print_trailer(s, page, object);
-
+   dump_stack();
restore_bytes(s, what, value, fault, end);
return 0;
 }
@@ -1173,13 +1176,14 @@ static inline int free_consistency_checks(struct 
kmem_cache *s,
if (!PageSlab(page)) {
slab_err(s, page, "Attempt to free object(0x%p) outside 
of slab",
 object);
-   } else if (!page->slab_cache) {
-   pr_err("SLUB : no slab for object 0x%p.\n",
-  object);
-   dump_stack();
-   } else
-   object_err(s, page, object,
-   "page slab pointer corrupt.");
+   } else {
+   char reason[80];
+
+   snprintf(reason, sizeof(reason),
+"page slab pointer corruption: 0x%p 

Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-23 Thread Rafael Aquini
On Wed, Oct 23, 2019 at 10:48:13AM -0400, Qian Cai wrote:
> On Wed, 2019-10-23 at 10:30 -0400, Waiman Long wrote:
> > On 10/22/19 5:59 PM, Andrew Morton wrote:
> > > On Tue, 22 Oct 2019 12:21:56 -0400 Waiman Long  wrote:
> > > 
> > > > The pagetypeinfo_showfree_print() function prints out the number of
> > > > free blocks for each of the page orders and migrate types. The current
> > > > code just iterates each of the free lists to get counts.  There are
> > > > bug reports about hard lockup panics when reading the /proc/pagetypeinfo
> > > > file just because it takes too long to iterate all the free lists within
> > > > a zone while holding the zone lock with irq disabled.
> > > > 
> > > > Given the fact that /proc/pagetypeinfo is readable by all, the
> > > > possibility
> > > > of crashing a system by the simple act of reading /proc/pagetypeinfo
> > > > by any user is a security problem that needs to be addressed.
> > > 
> > > Yes.
> > > 
> > > > There is a free_area structure associated with each page order. There
> > > > is also a nr_free count within the free_area for all the different
> > > > migration types combined. Tracking the number of free list entries
> > > > for each migration type will probably add some overhead to the fast
> > > > paths like moving pages from one migration type to another which may
> > > > not be desirable.
> > > > 
> > > > We can actually skip iterating the list of one of the migration types
> > > > and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
> > > > is usually the largest one on large memory systems, this is the one
> > > > to be skipped. Since the printing order is migration-type => order, we
> > > > will have to store the counts in an internal 2D array before printing
> > > > them out.
> > > > 
> > > > Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> > > > zone lock for too long blocking out other zone lock waiters from being
> > > > run. This can be problematic for systems with large amount of memory.
> > > > So a check is added to temporarily release the lock and reschedule if
> > > > more than 64k of list entries have been iterated for each order. With
> > > > a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> > > > entries before releasing the lock.
> > > > 
> > > > ...
> > > > 
> > > > --- a/mm/vmstat.c
> > > > +++ b/mm/vmstat.c
> > > > @@ -1373,23 +1373,54 @@ static void pagetypeinfo_showfree_print(struct 
> > > > seq_file *m,
> > > > pg_data_t *pgdat, struct zone 
> > > > *zone)
> > > >  {
> > > > int order, mtype;
> > > > +   unsigned long nfree[MAX_ORDER][MIGRATE_TYPES];
> > > 
> > > 600+ bytes is a bit much.  I guess it's OK in this situation.
> > > 
> > 
> > This function is called by reading /proc/pagetypeinfo. The call stack is
> > rather shallow:
> > 
> > PID: 58188  TASK: 938a4d4f1fa0  CPU: 2   COMMAND: "sosreport"
> >  #0 [9483bf488e48] crash_nmi_callback at b8c551d7
> >  #1 [9483bf488e58] nmi_handle at b931d8cc
> >  #2 [9483bf488eb0] do_nmi at b931dba8
> >  #3 [9483bf488ef0] end_repeat_nmi at b931cd69
> >     [exception RIP: pagetypeinfo_showfree_print+0x73]
> >     RIP: b8db7173  RSP: 938b9fcbfda0  RFLAGS: 0006
> >     RAX: f0c9946d7020  RBX: 96073ffd5528  RCX: 
> >     RDX: 001c7764  RSI: b9676ab1  RDI: 
> >     RBP: 938b9fcbfdd0   R8: 000a   R9: fffe
> >     R10:   R11: 938b9fcbfc36  R12: 942b97758240
> >     R13: b942f730  R14: 96073ffd5000  R15: 96073ffd5180
> >     ORIG_RAX:   CS: 0010  SS: 0018
> > ---  ---
> >  #4 [938b9fcbfda0] pagetypeinfo_showfree_print at b8db7173
> >  #5 [938b9fcbfdd8] walk_zones_in_node at b8db74df
> >  #6 [938b9fcbfe20] pagetypeinfo_show at b8db7a29
> >  #7 [938b9fcbfe48] seq_read at b8e45c3c
> >  #8 [938b9fcbfeb8] proc_reg_read at b8e95070
> >  #9 [938b9fcbfed8] vfs_read at b8e1f2af
> > #10 [938b9fcbff08] sys_read at b8e2017f
> > #11 [938b9fcbff50] system_call_fastpath at b932579b
> > 
> > So we should not be in any risk of overflowing the stack.
> > 
> > > > -   for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > > > -   seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > > > -   pgdat->node_id,
> > > > -   zone->name,
> > > > -   migratetype_names[mtype]);
> > > > -   for (order = 0; order < MAX_ORDER; ++order) {
> > > > +   lockdep_assert_held(&zone->lock);
> > > > +   lockdep_assert_irqs_disabled();
> > > > +
> > > > +   /*
> > > > +* MIGRATE_MOVABLE is usually the largest one in large memory
> > > > +* systems. We skip iterating that 

Re: [RFC PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users

2019-10-23 Thread Rafael Aquini
On Wed, Oct 23, 2019 at 12:27:36PM +0200, Michal Hocko wrote:
> From: Michal Hocko 
> 
> /proc/pagetypeinfo is a debugging tool to examine internal page
> allocator state wrt to fragmentation. It is not very useful for
> any other use so normal users really do not need to read this file.
> 
> Waiman Long has noticed that reading this file can have negative side
> effects because zone->lock is necessary for gathering data and that
> a) interferes with the page allocator and its users and b) can lead to
> hard lockups on large machines which have very long free_list.
> 
> Reduce both issues by simply not exporting the file to regular users.
> 
> Reported-by: Waiman Long 
> Cc: stable
> Signed-off-by: Michal Hocko 
> ---
>  mm/vmstat.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 6afc892a148a..4e885ecd44d1 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1972,7 +1972,7 @@ void __init init_mm_internals(void)
>  #endif
>  #ifdef CONFIG_PROC_FS
> 	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
> -	proc_create_seq("pagetypeinfo", 0444, NULL, &pagetypeinfo_op);
> +	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);
> 	proc_create_seq("vmstat", 0444, NULL, &vmstat_op);
> 	proc_create_seq("zoneinfo", 0444, NULL, &zoneinfo_op);
>  #endif
> -- 
> 2.20.1
>
 
Acked-by: Rafael Aquini 



Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

2019-10-23 Thread Rafael Aquini
On Wed, Oct 23, 2019 at 03:32:05PM +0200, Vlastimil Babka wrote:
> On 10/23/19 12:27 PM, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > pagetypeinfo_showfree_print is called by zone->lock held in irq mode.
> > This is not really nice because it blocks both any interrupts on that
> > cpu and the page allocator. On large machines this might even trigger
> > the hard lockup detector.
> > 
> > Considering the pagetypeinfo is a debugging tool we do not really need
> > > exact numbers here. The primary reason to look at the output is to see
> > how pageblocks are spread among different migratetypes therefore putting
> > a bound on the number of pages on the free_list sounds like a reasonable
> > tradeoff.
> > 
> > The new output will simply tell
> > [...]
> > Node6, zone   Normal, type  Movable >10 >10 >10 >10 
> >  41019  31560  23996  10054   3229983648
> > 
> > instead of
> > Node6, zone   Normal, type  Movable 399568 294127 221558 102119  
> > 41019  31560  23996  10054   3229983648
> > 
> > The limit has been chosen arbitrary and it is a subject of a future
> > change should there be a need for that.
> > 
> > Suggested-by: Andrew Morton 
> > Signed-off-by: Michal Hocko 
> 
> Hmm dunno, I would rather e.g. hide the file behind some config or boot
> option than do this. Or move it to /sys/kernel/debug ?
>

You beat me to it. I was going to suggest moving it to debug, as well.


 
> > ---
> >  mm/vmstat.c | 19 ++-
> >  1 file changed, 18 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 4e885ecd44d1..762034fc3b83 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct 
> > seq_file *m,
> >  
> > area = &(zone->free_area[order]);
> >  
> > -   list_for_each(curr, &area->free_list[mtype])
> > +   list_for_each(curr, &area->free_list[mtype]) {
> > freecount++;
> > +   /*
> > +* Cap the free_list iteration because it might
> > +* be really large and we are under a spinlock
> > +* so a long time spent here could trigger a
> > +* hard lockup detector. Anyway this is a
> > +* debugging tool so knowing there is a handful
> > +* of pages in this order should be more than
> > +* sufficient
> > +*/
> > +   if (freecount > 100000) {
> > +   seq_printf(m, ">%6lu ", freecount);
> > +   spin_unlock_irq(&zone->lock);
> > +   cond_resched();
> > +   spin_lock_irq(&zone->lock);
> > +   continue;
> > +   }
> > +   }
> > seq_printf(m, "%6lu ", freecount);
> > }
> > seq_putc(m, '\n');
> > 
> 



Re: [RESEND PATCH v2] mm/oom_killer: Add task UID to info message on an oom kill

2019-09-20 Thread Rafael Aquini
On Fri, Sep 20, 2019 at 05:13:40PM -0700, Andrew Morton wrote:
> On Thu, 13 Jun 2019 10:23:18 +0200 Michal Hocko  wrote:
> 
> > On Wed 12-06-19 13:57:53, Joel Savitz wrote:
> > > In the event of an oom kill, useful information about the killed
> > > process is printed to dmesg. Users, especially system administrators,
> > > will find it useful to immediately see the UID of the process.
> > 
> > Could you be more specific please? We already print uid when dumping
> > eligible tasks so it is not overly hard to find that information in the
> > oom report. Well, except when dumping of eligible tasks is disabled. Is
> > this what you are after?
> > 
> > Please always be specific about usecases in the changelog. A terse
> > statement that something is useful doesn't tell much very often.
> > 
> 
> 
> I'll add this to the chagnelog:
> 
> : We already print uid when dumping eligible tasks so it is not overly hard
> : to find that information in the oom report.  However this information is
> : unavailable then dumping of eligible tasks is disabled.
 

Thanks Andrew! Just a minor nit there: 's/then/when/'


Acked-by: Rafael Aquini 
> 


Re: [RESEND PATCH v2] mm/oom_killer: Add task UID to info message on an oom kill

2019-06-12 Thread Rafael Aquini
On Wed, Jun 12, 2019 at 01:57:53PM -0400, Joel Savitz wrote:
> In the event of an oom kill, useful information about the killed
> process is printed to dmesg. Users, especially system administrators,
> will find it useful to immediately see the UID of the process.
> 
> In the following example, abuse_the_ram is the name of a program
> that attempts to iteratively allocate all available memory until it is
> stopped by force.
> 
> Current message:
> 
> Out of memory: Killed process 35389 (abuse_the_ram)
> total-vm:133718232kB, anon-rss:129624980kB, file-rss:0kB,
> shmem-rss:0kB
> 
> Patched message:
> 
> Out of memory: Killed process 2739 (abuse_the_ram),
> total-vm:133880028kB, anon-rss:129754836kB, file-rss:0kB,
> shmem-rss:0kB, UID 0
> 
> 
> Suggested-by: David Rientjes 
> Signed-off-by: Joel Savitz 
> ---
>  mm/oom_kill.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 3a2484884cfd..af2e3faa72a0 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -874,12 +874,13 @@ static void __oom_kill_process(struct task_struct 
> *victim, const char *message)
>*/
>   do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
>   mark_oom_victim(victim);
> - pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, 
> file-rss:%lukB, shmem-rss:%lukB\n",
> + pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, 
> file-rss:%lukB, shmem-rss:%lukB, UID %d\n",
>   message, task_pid_nr(victim), victim->comm,
>   K(victim->mm->total_vm),
>   K(get_mm_counter(victim->mm, MM_ANONPAGES)),
>   K(get_mm_counter(victim->mm, MM_FILEPAGES)),
> - K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
> +     K(get_mm_counter(victim->mm, MM_SHMEMPAGES)),
> + from_kuid(&init_user_ns, task_uid(victim)));
>   task_unlock(victim);
>  
>   /*
> -- 
> 2.18.1
> 
Acked-by: Rafael Aquini 


Re: [PATCH v3] fs/proc: add VmTaskSize field to /proc/$$/status

2019-05-08 Thread Rafael Aquini
On Tue, May 07, 2019 at 11:37:16PM -0700, Yury Norov wrote:
> On Tue, May 07, 2019 at 08:54:31AM -0400, Rafael Aquini wrote:
> > On Mon, May 06, 2019 at 11:53:43AM -0400, Joel Savitz wrote:
> > > There is currently no easy and architecture-independent way to find the
> > > lowest unusable virtual address available to a process without
> > > brute-force calculation. This patch allows a user to easily retrieve
> > > this value via /proc//status.
> > > 
> > > Using this patch, any program that previously needed to waste cpu cycles
> > > recalculating a non-sensitive process-dependent value already known to
> > > the kernel can now be optimized to use this mechanism.
> > > 
> > > Signed-off-by: Joel Savitz 
> > > ---
> > >  Documentation/filesystems/proc.txt | 2 ++
> > >  fs/proc/task_mmu.c | 2 ++
> > >  2 files changed, 4 insertions(+)
> > > 
> > > diff --git a/Documentation/filesystems/proc.txt 
> > > b/Documentation/filesystems/proc.txt
> > > index 66cad5c86171..1c6a912e3975 100644
> > > --- a/Documentation/filesystems/proc.txt
> > > +++ b/Documentation/filesystems/proc.txt
> > > @@ -187,6 +187,7 @@ read the file /proc/PID/status:
> > >VmLib:  1412 kB
> > >VmPTE:20 kb
> > >VmSwap:0 kB
> > > +  VmTaskSize:137438953468 kB
> > >HugetlbPages:  0 kB
> > >CoreDumping:0
> > >THP_enabled: 1
> > > @@ -263,6 +264,7 @@ Table 1-2: Contents of the status files (as of 4.19)
> > >   VmPTE   size of page table entries
> > >   VmSwap  amount of swap used by anonymous private 
> > > data
> > >   (shmem swap usage is not included)
> > > + VmTaskSize  lowest unusable address in process virtual 
> > > memory
> > 
> > Can we change this help text to "size of process' virtual address space 
> > memory" ?
> 
> Agree. Or go in other direction and make it VmEnd
> 
> > >   HugetlbPagessize of hugetlb memory portions
> > >   CoreDumping process's memory is currently being dumped
> > >   (killing the process may lead to a 
> > > corrupted core)
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index 95ca1fe7283c..0af7081f7b19 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -74,6 +74,8 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> > >   seq_put_decimal_ull_width(m,
> > >   " kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8);
> > >   SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
> > > + seq_put_decimal_ull_width(m,
> > > + " kB\nVmTaskSize:\t", mm->task_size >> 10, 8);
> > >   seq_puts(m, " kB\n");
> > >   hugetlb_report_usage(m, mm);
> > >  }
> 
> I'm OK with technical part, but I still have questions not answered
> (or wrongly answered) in v1 and v2. Below is the very detailed
> description of the concerns I have.
> 
> 1. What is the exact reason for it? Original version tells about some
> test that takes so much time that you were able to drink a cup of
> coffee before it was done. The test as you said implements linear
> search to find the last page and so is of O(n). If it's only for some
> random test, I think the kernel can survive without it. Do you have a
> real example of useful programs that suffer without this information?
> 
> 
> 2. I have nothing against taking breaks and see nothing weird if
> ineffective algorithms take time. On my system (x86, Ubuntu) the last
> mapped region according to /proc//maps is:
> ff60-ff601000 r-xp  00:00 0 [vsyscall]
> So to find the required address, we have to inspect 2559 pages. With a
> binary search it would take 12 iterations at max. If my calculation is
> wrong or your environment is completely different - please elaborate.
> 
> 3. As far as I can see, Linux currently does not support dynamic
> TASK_SIZE. It means that for any platform+ABI combination VmTaskSize
> will be always the same. So VmTaskSize would be effectively useless waste

The assumption that it is fixed and can be decided at compile time is not
necessarily true, unfortunately. One could adjust PAGE_OFFSET, at kernel
config time, to provide different splits for the virtual address space,
which will eventually affect TASK_SIZE (see arch/x86/Kconfig).

 
> o

Re: [PATCH v3] fs/proc: add VmTaskSize field to /proc/$$/status

2019-05-07 Thread Rafael Aquini
On Mon, May 06, 2019 at 11:53:43AM -0400, Joel Savitz wrote:
> There is currently no easy and architecture-independent way to find the
> lowest unusable virtual address available to a process without
> brute-force calculation. This patch allows a user to easily retrieve
> this value via /proc//status.
> 
> Using this patch, any program that previously needed to waste cpu cycles
> recalculating a non-sensitive process-dependent value already known to
> the kernel can now be optimized to use this mechanism.
> 
> Signed-off-by: Joel Savitz 
> ---
>  Documentation/filesystems/proc.txt | 2 ++
>  fs/proc/task_mmu.c | 2 ++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/Documentation/filesystems/proc.txt 
> b/Documentation/filesystems/proc.txt
> index 66cad5c86171..1c6a912e3975 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -187,6 +187,7 @@ read the file /proc/PID/status:
>VmLib:  1412 kB
>VmPTE:20 kb
>VmSwap:0 kB
> +  VmTaskSize:137438953468 kB
>HugetlbPages:  0 kB
>CoreDumping:0
>THP_enabled: 1
> @@ -263,6 +264,7 @@ Table 1-2: Contents of the status files (as of 4.19)
>   VmPTE   size of page table entries
>   VmSwap  amount of swap used by anonymous private data
>   (shmem swap usage is not included)
> + VmTaskSize  lowest unusable address in process virtual 
> memory

Can we change this help text to "size of process' virtual address space
memory"?

>   HugetlbPagessize of hugetlb memory portions
>   CoreDumping process's memory is currently being dumped
>   (killing the process may lead to a corrupted 
> core)
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 95ca1fe7283c..0af7081f7b19 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -74,6 +74,8 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>   seq_put_decimal_ull_width(m,
>   " kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8);
>   SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
> + seq_put_decimal_ull_width(m,
> + " kB\nVmTaskSize:\t", mm->task_size >> 10, 8);
>   seq_puts(m, " kB\n");
>   hugetlb_report_usage(m, mm);
>  }
> -- 
> 2.18.1
> 
Acked-by: Rafael Aquini 


Re: [PATCH v3 1/2] kernel/sys: add PR_GET_TASK_SIZE option to prctl(2)

2019-05-03 Thread Rafael Aquini
On Fri, May 03, 2019 at 02:08:31PM -0700, Yury Norov wrote:
> On Fri, May 03, 2019 at 02:10:20PM -0400, Joel Savitz wrote:
> > When PR_GET_TASK_SIZE is passed to prctl, the kernel will attempt to
> > copy the value of TASK_SIZE to the userspace address in arg2.
> > 
> > It is important that we account for the case of the userspace task
> > running in 32-bit compat mode on a 64-bit kernel. As such, we must be
> > careful to copy the correct number of bytes to userspace to avoid stack
> > corruption.
> > 
> > Suggested-by: Yuri Norov 
> 
> I actually didn't suggest that. If you _really_ need TASK_SIZE to
> be exposed, I would suggest to expose it in kernel headers. TASK_SIZE
> is a compile-time information, and it may available for userspace at
> compile time as well.
> 
> > Suggested-by: Alexey Dobriyan 
> > Signed-off-by: Joel Savitz 
> > ---
> >  include/uapi/linux/prctl.h |  3 +++
> >  kernel/sys.c   | 23 +++
> >  2 files changed, 26 insertions(+)
> > 
> > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> > index 094bb03b9cc2..2c261c461952 100644
> > --- a/include/uapi/linux/prctl.h
> > +++ b/include/uapi/linux/prctl.h
> > @@ -229,4 +229,7 @@ struct prctl_mm_map {
> >  # define PR_PAC_APDBKEY(1UL << 3)
> >  # define PR_PAC_APGAKEY(1UL << 4)
> > 
> > +/* Get the process virtual memory size (i.e. the highest usable VM 
> > address) */
> > +#define PR_GET_TASK_SIZE   55
> > +
> >  #endif /* _LINUX_PRCTL_H */
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 12df0e5434b8..709584400070 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -2252,6 +2252,26 @@ static int propagate_has_child_subreaper(struct 
> > task_struct *p, void *data)
> > return 1;
> >  }
> > 
> > +static int prctl_get_tasksize(void __user *uaddr)
> > +{
> > +   unsigned long current_task_size, current_word_size;
> > +
> > +   current_task_size = TASK_SIZE;
> > +   current_word_size = sizeof(unsigned long);
> > +
> > +#ifdef CONFIG_64BIT
> > +   /* On 64-bit architecture, we must check whether the current thread
> > +* is running in 32-bit compat mode. If it is, we can simply cut
> > +* the size in half. This avoids corruption of the userspace stack.
> > +*/
> > +   if (test_thread_flag(TIF_ADDR32))
> 
> It breaks build for all architectures except x86 since TIF_ADDR32 is
> defined for x86 only.

Or we could get TIF_32BIT also defined for x86 (same value as
TIF_ADDR32) and check for it instead, i.e.:

...
#if defined(CONFIG_64BIT) && defined(TIF_32BIT)
if (test_thread_flag(TIF_32BIT))
... 

which is also uglier and keeps adding unnecessary complexity to a very
simple task. At this point, I think we should just give up on trying
this via prctl(2) and do it via /proc/<pid>/status instead.
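
Spelled out, that hypothetical TIF_32BIT variant would look roughly like
the sketch below (never merged, purely illustrative of the extra
complexity this approach drags in):

	static int prctl_get_tasksize(void __user *uaddr)
	{
		unsigned long task_size = TASK_SIZE;
		unsigned long word_size = sizeof(unsigned long);

	#if defined(CONFIG_64BIT) && defined(TIF_32BIT)
		/* 32-bit compat task on a 64-bit kernel: copy only 4 bytes back */
		if (test_thread_flag(TIF_32BIT))
			word_size >>= 1;
	#endif
		return copy_to_user(uaddr, &task_size, word_size) ? -EFAULT : 0;
	}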


> 
> In comment to v2 I suggested you to stick to fixed-size data type to
> avoid exactly this problem.
> 
> NACK
> 
> Yury
> 
> > +   current_word_size >>= 1;
> > +#endif
> > +
> > +   return copy_to_user(uaddr, &current_task_size, current_word_size) ? 
> > -EFAULT : 0;
> > +}
> > +
> >  int __weak arch_prctl_spec_ctrl_get(struct task_struct *t, unsigned long 
> > which)
> >  {
> > return -EINVAL;
> > @@ -2486,6 +2506,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, 
> > arg2, unsigned long, arg3,
> > return -EINVAL;
> > error = PAC_RESET_KEYS(me, arg2);
> > break;
> > +   case PR_GET_TASK_SIZE:
> > +   error = prctl_get_tasksize((void *)arg2);
> > +   break;
> > default:
> > error = -EINVAL;
> > break;
> > --
> > 2.18.1


Re: [PATCH v3 0/2] sys/prctl: expose TASK_SIZE value to userspace

2019-05-03 Thread Rafael Aquini
On Fri, May 03, 2019 at 01:49:12PM -0700, Yury Norov wrote:
> On Fri, May 03, 2019 at 02:10:19PM -0400, Joel Savitz wrote:
> > In the mainline kernel, there is no quick mechanism to get the virtual
> > memory size of the current process from userspace.
> > 
> > Despite the current state of affairs, this information is available to the
> > user through several means, one being a linear search of the entire address
> > space. This is an inefficient use of cpu cycles.
> > 
> > A component of the libhugetlb kernel test does exactly this, and as
> > systems' address spaces increase beyond 32-bits, this method becomes
> > exceedingly tedious.
> > 
> > For example, on a ppc64le system with a 47-bit address space, the linear
> > search causes the test to hang for some unknown amount of time. I
> > couldn't give you an exact number because I just ran it for about 10-20
> > minutes and went to go do something else, probably to get coffee or
> > something, and when I came back, I just killed the test and patched it
> > to use this new mechanism. I re-ran my new version of the test using a
> > kernel with this patch, and of course it passed through the previously
> > bottlenecking codepath nearly instantaneously.
> > 
> > As such, I propose that the prctl syscall be extended to include the
> > option to retrieve TASK_SIZE from the kernel.
> > 
> > This patch will allow us to upgrade an O(n) codepath to O(1) in an
> > architecture-independent manner, and provide a mechanism for future
> > generations to do the same.
> 
> So the only reason for the new API is boosting some random poorly
> written userspace test? Why don't you introduce binary search instead?
>

There's no real cost in exposing a value that is already known to the
kernel, anyway, as long as it's not a freaking hassle to do so (like
trying to go with this prctl(2) stunt). We just need to get it properly
exported alongside the task's other VM-related values in /proc/<pid>/status.
 
> Look at /proc//maps. It may help to reduce the memory area to be
> checked.
>  
> > Changes from v2:
> >  We now account for the case of 32-bit compat userspace on a 64-bit kernel
> >  More detail about the nature of TASK_SIZE in documentation
> > 
> > Joel Savitz(2):
> >   sys/prctl: add PR_GET_TASK_SIZE option to prctl(2)
> >   prctl.2: Document the new PR_GET_TASK_SIZE option
> > 
> >  include/uapi/linux/prctl.h |  3 +++
> >  kernel/sys.c   | 23 +++
> >  2 files changed, 26 insertions(+)
> > 
> >  man2/prctl.2 | 10 ++
> >  1 file changed, 10 insertions(+)
> > --
> > 2.18.1


Re: [PATCH v3 1/2] kernel/sys: add PR_GET_TASK_SIZE option to prctl(2)

2019-05-03 Thread Rafael Aquini
On Fri, May 03, 2019 at 02:08:31PM -0700, Yury Norov wrote:
> On Fri, May 03, 2019 at 02:10:20PM -0400, Joel Savitz wrote:
> > When PR_GET_TASK_SIZE is passed to prctl, the kernel will attempt to
> > copy the value of TASK_SIZE to the userspace address in arg2.
> > 
> > It is important that we account for the case of the userspace task
> > running in 32-bit compat mode on a 64-bit kernel. As such, we must be
> > careful to copy the correct number of bytes to userspace to avoid stack
> > corruption.
> > 
> > Suggested-by: Yuri Norov 
> 
> I actually didn't suggest that. If you _really_ need TASK_SIZE to
> be exposed, I would suggest to expose it in kernel headers. TASK_SIZE
> is a compile-time information, and it may available for userspace at
> compile time as well.
> 

TASK_SIZE is a runtime-resolved macro, dependent on the thread currently
running on the CPU. It's not a compile-time constant.

Anyway, it's clear by now that going through prctl(2), although
interesting, as suggested by Alexey, wasn't worth the hassle, as it poses
more issues than it can possibly solve.

A better way to get this value exposed to userspace is through
/proc/<pid>/status, where one can utilize TASK_SIZE_OF(mm->owner), or
simply mm->task_size, which seems to be properly assigned for each arch.
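
Just to illustrate the intended O(1) usage on the userspace side, a
minimal sketch of a reader for the proposed VmTaskSize field (field name
as proposed in this series, not in mainline) could be:

	#include <stdio.h>
	#include <string.h>

	/* returns the reported TASK_SIZE in kB, or 0 if the field is absent */
	static unsigned long long read_vm_task_size(void)
	{
		char line[256];
		unsigned long long kb = 0;
		FILE *f = fopen("/proc/self/status", "r");

		if (!f)
			return 0;
		while (fgets(line, sizeof(line), f)) {
			if (!strncmp(line, "VmTaskSize:", 11)) {
				sscanf(line + 11, "%llu", &kb);
				break;
			}
		}
		fclose(f);
		return kb;
	}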


> > Suggested-by: Alexey Dobriyan 
> > Signed-off-by: Joel Savitz 
> > ---
> >  include/uapi/linux/prctl.h |  3 +++
> >  kernel/sys.c   | 23 +++
> >  2 files changed, 26 insertions(+)
> > 
> > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> > index 094bb03b9cc2..2c261c461952 100644
> > --- a/include/uapi/linux/prctl.h
> > +++ b/include/uapi/linux/prctl.h
> > @@ -229,4 +229,7 @@ struct prctl_mm_map {
> >  # define PR_PAC_APDBKEY(1UL << 3)
> >  # define PR_PAC_APGAKEY(1UL << 4)
> > 
> > +/* Get the process virtual memory size (i.e. the highest usable VM 
> > address) */
> > +#define PR_GET_TASK_SIZE   55
> > +
> >  #endif /* _LINUX_PRCTL_H */
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 12df0e5434b8..709584400070 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -2252,6 +2252,26 @@ static int propagate_has_child_subreaper(struct 
> > task_struct *p, void *data)
> > return 1;
> >  }
> > 
> > +static int prctl_get_tasksize(void __user *uaddr)
> > +{
> > +   unsigned long current_task_size, current_word_size;
> > +
> > +   current_task_size = TASK_SIZE;
> > +   current_word_size = sizeof(unsigned long);
> > +
> > +#ifdef CONFIG_64BIT
> > +   /* On 64-bit architecture, we must check whether the current thread
> > +* is running in 32-bit compat mode. If it is, we can simply cut
> > +* the size in half. This avoids corruption of the userspace stack.
> > +*/
> > +   if (test_thread_flag(TIF_ADDR32))
> 
> It breaks build for all architectures except x86 since TIF_ADDR32 is
> defined for x86 only.
> 
> In comment to v2 I suggested you to stick to fixed-size data type to
> avoid exactly this problem.
> 
> NACK
> 
> Yury
> 
> > +   current_word_size >>= 1;
> > +#endif
> > +
> > +   return copy_to_user(uaddr, &current_task_size, current_word_size) ? 
> > -EFAULT : 0;
> > +}
> > +
> >  int __weak arch_prctl_spec_ctrl_get(struct task_struct *t, unsigned long 
> > which)
> >  {
> > return -EINVAL;
> > @@ -2486,6 +2506,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, 
> > arg2, unsigned long, arg3,
> > return -EINVAL;
> > error = PAC_RESET_KEYS(me, arg2);
> > break;
> > +   case PR_GET_TASK_SIZE:
> > +   error = prctl_get_tasksize((void *)arg2);
> > +   break;
> > default:
> > error = -EINVAL;
> > break;
> > --
> > 2.18.1


Re: [PATCH v2 2/2] prctl.2: Document the new PR_GET_TASK_SIZE option

2019-05-02 Thread Rafael Aquini
On Thu, May 02, 2019 at 03:23:12PM -0700, Yury Norov wrote:
> чт, 2 мая 2019 г. в 13:52, Joel Savitz :
> >
> > Add a short explanation of the new PR_GET_TASK_SIZE option for the benefit
> > of future generations.
> >
> > Signed-off-by: Joel Savitz 
> > ---
> >  man2/prctl.2 | 9 +
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/man2/prctl.2 b/man2/prctl.2
> > index 06d8e13c7..35a6a3919 100644
> > --- a/man2/prctl.2
> > +++ b/man2/prctl.2
> > @@ -49,6 +49,7 @@
> >  .\" 2013-01-10 Kees Cook, document PR_SET_PTRACER
> >  .\" 2012-02-04 Michael Kerrisk, document PR_{SET,GET}_CHILD_SUBREAPER
> >  .\" 2014-11-10 Dave Hansen, document PR_MPX_{EN,DIS}ABLE_MANAGEMENT
> > +.\" 2019-05-02 Joel Savitz, document PR_GET_TASK_SIZE
> >  .\"
> >  .\"
> >  .TH PRCTL 2 2019-03-06 "Linux" "Linux Programmer's Manual"
> > @@ -1375,6 +1376,14 @@ system call on Tru64).
> >  for information on versions and architectures)
> >  Return unaligned access control bits, in the location pointed to by
> >  .IR "(unsigned int\ *) arg2" .
> > +.TP
> > +.B PR_GET_TASK_SIZE
> > +Copy the value of TASK_SIZE to the userspace address in
> > +.IR "(unsigned long\ *) arg2" .
> 
> This is a bad idea to use pointers to size-undefined types in interface 
> because
> that way you have to introduce compat versions of interface functions.
> I'd recommend you to use u64 or unsigned long long here.
>
unsigned long long seems to make little sense too as prctl(2) extra arguments 
are of unsigned long type (good for passing the pointer address, in this case).

A pointer to an unsigned long var is OK for native builds; for the
compat-mode issue you correctly pointed out, the storage size needs to be
dealt with on the kernel side, by checking test_thread_flag(TIF_ADDR32)
before proceeding with copy_to_user().

 
> The comment not clear for reader not familiar with kernel internals.
> Can you rephrase
> TASK_SIZE like 'the (next after) highest possible userspace address',
> or similar?
> 
> For the updated version could you please CC to yury.no...@gmail.com?
> 
> > +Return
> > +.B EFAULT
> > +if this operation fails.
> > +
> >  .SH RETURN VALUE
> >  On success,
> >  .BR PR_GET_DUMPABLE ,
> > --
> > 2.18.1
> >


Re: [PATCH v2 0/2] sys/prctl: expose TASK_SIZE value to userspace

2019-05-02 Thread Rafael Aquini
On Thu, May 02, 2019 at 04:52:20PM -0400, Joel Savitz wrote:
> In the mainline kernel, there is no quick mechanism to get the virtual
> memory size of the current process from userspace.
> 
> Despite the current state of affairs, this information is available to the
> user through several means, one being a linear search of the entire address
> space. This is an inefficient use of cpu cycles.
> 
> A component of the libhugetlb kernel test does exactly this, and as
> systems' address spaces increase beyond 32-bits, this method becomes
> exceedingly tedious.
> 
> For example, on a ppc64le system with a 47-bit address space, the linear
> search causes the test to hang for some unknown amount of time. I
> couldn't give you an exact number because I just ran it for about 10-20
> minutes and went to go do something else, probably to get coffee or
> something, and when I came back, I just killed the test and patched it
> to use this new mechanism. I re-ran my new version of the test using a
> kernel with this patch, and of course it passed through the previously
> bottlenecking codepath nearly instantaneously.
> 
> As such, I propose that the prctl syscall be extended to include the
> option to retrieve TASK_SIZE from the kernel.
> 
> This patch will allow us to upgrade an O(n) codepath to O(1) in an
> architecture-independent manner, and provide a mechanism for others
> to do the same.
> 
> Joel Savitz(2):
>   sys/prctl: add PR_GET_TASK_SIZE option to prctl(2)
>   prctl.2: Document the new PR_GET_TASK_SIZE option
> 
>  include/uapi/linux/prctl.h |  3 +++
>  kernel/sys.c   | 10 ++
>  2 files changed, 13 insertions(+)
> 
>  man2/prctl.2 | 9 +
>  1 file changed, 9 insertions(+)
> 
> --
> 2.18.1
>

If you decide to repost patch 1/2 to sort out the minor nit I
pointed out, you can keep my ack.
 
Acked-by: Rafael Aquini 


Re: [PATCH v2 1/2] kernel/sys: add PR_GET_TASK_SIZE option to prctl(2)

2019-05-02 Thread Rafael Aquini
On Thu, May 02, 2019 at 04:52:21PM -0400, Joel Savitz wrote:
> When PR_GET_TASK_SIZE is passed to prctl, the kernel will attempt to
> copy the value of TASK_SIZE to the userspace address in arg2.
> 
> Suggested-by: Alexey Dobriyan 
> Signed-off-by: Joel Savitz 
> ---
>  include/uapi/linux/prctl.h |  3 +++
>  kernel/sys.c   | 10 ++
>  2 files changed, 13 insertions(+)
> 
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 094bb03b9cc2..2335fe0a8db8 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -229,4 +229,7 @@ struct prctl_mm_map {
>  # define PR_PAC_APDBKEY  (1UL << 3)
>  # define PR_PAC_APGAKEY  (1UL << 4)
>  
> +/* Get the process virtual memory size */
> +#define PR_GET_TASK_SIZE 55
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 12df0e5434b8..7ced7dbd035d 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2252,6 +2252,13 @@ static int propagate_has_child_subreaper(struct 
> task_struct *p, void *data)
>   return 1;
>  }
>  
> +static int prctl_get_tasksize(void __user * uaddr)
> +{
> + unsigned long task_size = TASK_SIZE;
> + return copy_to_user(uaddr, &task_size, sizeof(unsigned long))
> + ? -EFAULT : 0;

Minor nitpick: I would keep all of the ternary statement above on the
same line, to help improve readability (even though it goes a little
past 80 columns of text).

> +}
> +
>  int __weak arch_prctl_spec_ctrl_get(struct task_struct *t, unsigned long 
> which)
>  {
>   return -EINVAL;
> @@ -2486,6 +2493,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, 
> arg2, unsigned long, arg3,
>   return -EINVAL;
>   error = PAC_RESET_KEYS(me, arg2);
>   break;
> + case PR_GET_TASK_SIZE:
> + error = prctl_get_tasksize((void *)arg2) ;
> + break;
>   default:
>   error = -EINVAL;
>   break;
> -- 
> 2.18.1
> 


Re: [PATCH v2] fs/proc: add VmTaskSize field to /proc/$$/status

2019-04-26 Thread Rafael Aquini
On Fri, Apr 26, 2019 at 03:02:08PM -0400, Joel Savitz wrote:
> In the mainline kernel, there is no quick mechanism to get the virtual
> memory size of the current process from userspace.
> 
> Despite the current state of affairs, this information is available to the
> user through several means, one being a linear search of the entire address
> space. This is an inefficient use of cpu cycles.
> 
> A component of the libhugetlb kernel test does exactly this, and as
> systems' address spaces increase beyond 32-bits, this method becomes
> exceedingly tedious.
> 
> For example, on a ppc64le system with a 47-bit address space, the linear
> search causes the test to hang for some unknown amount of time. I
> couldn't give you an exact number because I just ran it for about 10-20
> minutes and went to go do something else, probably to get coffee or
> something, and when I came back, I just killed the test and patched it
> to use this new mechanism. I re-ran my new version of the test using a
> kernel with this patch, and of course it passed through the previously
> bottlenecking codepath nearly instantaneously.
> 
> This patched enabled me to upgrade an O(n) codepath to O(1) in an
> architecture-independent manner.
> 
> Signed-off-by: Joel Savitz 
> ---
>  Documentation/filesystems/proc.txt | 2 ++
>  fs/proc/task_mmu.c | 5 -
>  2 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/filesystems/proc.txt 
> b/Documentation/filesystems/proc.txt
> index 66cad5c86171..1c6a912e3975 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -187,6 +187,7 @@ read the file /proc/PID/status:
>VmLib:  1412 kB
>VmPTE:20 kb
>VmSwap:0 kB
> +  VmTaskSize:137438953468 kB
>HugetlbPages:  0 kB
>CoreDumping:0
>THP_enabled: 1
> @@ -263,6 +264,7 @@ Table 1-2: Contents of the status files (as of 4.19)
>   VmPTE   size of page table entries
>   VmSwap  amount of swap used by anonymous private data
>   (shmem swap usage is not included)
> + VmTaskSize  size of entire virtual address space of a 
> process
>   HugetlbPagessize of hugetlb memory portions
>   CoreDumping process's memory is currently being dumped
>   (killing the process may lead to a corrupted 
> core)
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 95ca1fe7283c..0ddd51479f90 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -74,7 +74,10 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>   seq_put_decimal_ull_width(m,
>   " kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8);
>   SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
> - seq_puts(m, " kB\n");
> + SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
> + seq_put_decimal_ull_width(m,
> + " kB\nVmTaskSize:\t", TASK_SIZE >> 10, 8);
> + seq_puts(m, " kB\n");
>   hugetlb_report_usage(m, mm);
>  }
>  #undef SEQ_PUT_DEC
> -- 
> 2.18.1
> 
Acked-by: Rafael Aquini 


Re: [PATCH v2] mm/oom_killer: Add task UID to info message on an oom kill

2019-04-26 Thread Rafael Aquini
On Fri, Apr 26, 2019 at 02:59:57PM -0400, Joel Savitz wrote:
> In the event of an oom kill, useful information about the killed
> process is printed to dmesg. Users, especially system administrators,
> will find it useful to immediately see the UID of the process.
> 
> In the following example, abuse_the_ram is the name of a program
> that attempts to iteratively allocate all available memory until it is
> stopped by force.
> 
> Current message:
> 
> Out of memory: Killed process 35389 (abuse_the_ram)
> total-vm:133718232kB, anon-rss:129624980kB, file-rss:0kB,
> shmem-rss:0kB
> 
> Patched message:
> 
> Out of memory: Killed process 2739 (abuse_the_ram),
> total-vm:133880028kB, anon-rss:129754836kB, file-rss:0kB,
> shmem-rss:0kB, UID 0
> 
> Signed-off-by: Joel Savitz 
> ---
>  mm/oom_kill.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 3a2484884cfd..af2e3faa72a0 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -874,12 +874,13 @@ static void __oom_kill_process(struct task_struct 
> *victim, const char *message)
>*/
>   do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
>   mark_oom_victim(victim);
> - pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, 
> file-rss:%lukB, shmem-rss:%lukB\n",
> + pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, 
> file-rss:%lukB, shmem-rss:%lukB, UID %d\n",
>   message, task_pid_nr(victim), victim->comm,
>   K(victim->mm->total_vm),
>   K(get_mm_counter(victim->mm, MM_ANONPAGES)),
>   K(get_mm_counter(victim->mm, MM_FILEPAGES)),
> - K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
> + K(get_mm_counter(victim->mm, MM_SHMEMPAGES)),
> + from_kuid(&init_user_ns, task_uid(victim)));
>   task_unlock(victim);
>  
>   /*
> -- 
> 2.18.1
> 
Acked-by: Rafael Aquini 


Re: [PATCH] fs/proc: add VmTaskSize field to /proc/$$/status

2019-04-25 Thread Rafael Aquini
On Thu, Apr 25, 2019 at 04:57:47PM -0400, Joel Savitz wrote:
> Currently, there is no fast mechanism to get the virtual memory size of
> the current process from userspace. This information is available to the
> user through several means, one being a linear search of the entire address
> space. This is the method used by a component of the libhugetlb kernel
> test, and using the mechanism proposed in this patch, the time complexity
> of that test would be upgraded to constant time from linear time. This is
> especially relevant on 64-bit architechtures where a linear search of
> the address space could take an absurd amount of time. Using this
> mechanism, the modification to the test component would be portable
> across all architechtures.
> 
> Signed-off-by: Joel Savitz 
> ---
>  fs/proc/task_mmu.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 92a91e7816d8..f64b9a949624 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -74,7 +74,10 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>   seq_put_decimal_ull_width(m,
>   " kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8);
>   SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
> - seq_puts(m, " kB\n");
> + SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
> + seq_put_decimal_ull_width(m,
> + " kB\nVmTaskSize:\t", TASK_SIZE >> 10, 8);
> + seq_puts(m, " kB\n");
>   hugetlb_report_usage(m, mm);
>  }
>  #undef SEQ_PUT_DEC
> -- 
> 2.18.1
> 
Acked-by: Rafael Aquini 


Re: [PATCH] mm/oom_killer: Add task UID to info message on an oom kill

2019-04-25 Thread Rafael Aquini
On Wed, Apr 24, 2019 at 02:20:13PM -0400, Joel Savitz wrote:
> In the event of an oom kill, useful information about the killed
> process is printed to dmesg. Users, especially system administrators,
> will find it useful to immediately see the UID of the process.
> 
> In the following example, abuse_the_ram is the name of a program
> that attempts to iteratively allocate all available memory until it is
> stopped by force.
> 
> Current message:
> 
> Out of memory: Killed process 35389 (abuse_the_ram)
> total-vm:133718232kB, anon-rss:129624980kB, file-rss:0kB,
> shmem-rss:0kB
> 
> Patched message:
> 
> Out of memory: Killed process 2739 (abuse_the_ram), UID 0,
> total-vm:133880028kB, anon-rss:129754836kB, file-rss:0kB,
> shmem-rss:0kB
> 
> Signed-off-by: Joel Savitz 
> ---
>  mm/oom_kill.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 3a2484884cfd..22972648b758 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -874,9 +874,9 @@ static void __oom_kill_process(struct task_struct 
> *victim, const char *message)
>*/
>   do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
>   mark_oom_victim(victim);
> - pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, 
> file-rss:%lukB, shmem-rss:%lukB\n",
> + pr_err("%s: Killed process %d (%s), UID %d, total-vm:%lukB, 
> anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>   message, task_pid_nr(victim), victim->comm,
> - K(victim->mm->total_vm),
> + task_uid(victim).val, K(victim->mm->total_vm),
>   K(get_mm_counter(victim->mm, MM_ANONPAGES)),
>   K(get_mm_counter(victim->mm, MM_FILEPAGES)),
>   K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
> -- 
> 2.18.1
> 
Acked-by: Rafael Aquini 


Re: [PATCH v1] mm: balloon: drop unused function stubs

2019-04-02 Thread Rafael Aquini
On Fri, Mar 29, 2019 at 01:26:49PM +0100, David Hildenbrand wrote:
> These are leftovers from the pre-"general non-lru movable page" era.
> 
> Signed-off-by: David Hildenbrand 
> ---
>  include/linux/balloon_compaction.h | 15 ---
>  1 file changed, 15 deletions(-)
> 
> diff --git a/include/linux/balloon_compaction.h 
> b/include/linux/balloon_compaction.h
> index f111c780ef1d..f31521dcb09a 100644
> --- a/include/linux/balloon_compaction.h
> +++ b/include/linux/balloon_compaction.h
> @@ -151,21 +151,6 @@ static inline void balloon_page_delete(struct page *page)
>   list_del(&page->lru);
>  }
>  
> -static inline bool __is_movable_balloon_page(struct page *page)
> -{
> - return false;
> -}
> -
> -static inline bool balloon_page_movable(struct page *page)
> -{
> - return false;
> -}
> -
> -static inline bool isolated_balloon_page(struct page *page)
> -{
> - return false;
> -}
> -
>  static inline bool balloon_page_isolate(struct page *page)
>  {
>   return false;
> -- 
> 2.17.2
> 
Acked-by: Rafael Aquini 


Re: [PATCH] mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified

2019-03-20 Thread Rafael Aquini
On Wed, Mar 20, 2019 at 02:35:56AM +0800, Yang Shi wrote:
> When MPOL_MF_STRICT was specified and an existing page was already
> on a node that does not follow the policy, mbind() should return -EIO.
> But commit 6f4576e3687b ("mempolicy: apply page table walker on
> queue_pages_range()") broke the rule.
> 
> And, commit c8633798497c ("mm: mempolicy: mbind and migrate_pages
> support thp migration") didn't return the correct value for THP mbind()
> too.
> 
> If MPOL_MF_STRICT is set, ignore vma_migratable() to make sure it reaches
> queue_pages_to_pte_range() or queue_pages_pmd() to check if an existing
> page was already on a node that does not follow the policy.  And,
> non-migratable vma may be used, return -EIO too if MPOL_MF_MOVE or
> MPOL_MF_MOVE_ALL was specified.
> 
> Tested with 
> https://github.com/metan-ucw/ltp/blob/master/testcases/kernel/syscalls/mbind/mbind02.c
> 
> Fixes: 6f4576e3687b ("mempolicy: apply page table walker on 
> queue_pages_range()")
> Reported-by: Cyril Hrubis 
> Cc: Vlastimil Babka 
> Cc: sta...@vger.kernel.org
> Suggested-by: Kirill A. Shutemov 
> Signed-off-by: Yang Shi 
> Signed-off-by: Oscar Salvador 
> ---
>  mm/mempolicy.c | 40 +---
>  1 file changed, 33 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index abe7a67..401c817 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -447,6 +447,13 @@ static inline bool queue_pages_required(struct page 
> *page,
>   return node_isset(nid, *qp->nmask) == !(flags & MPOL_MF_INVERT);
>  }
>  
> +/*
> + * The queue_pages_pmd() may have three kind of return value.
> + * 1 - pages are placed on he right node or queued successfully.
> + * 0 - THP get split.
> + * -EIO - is migration entry or MPOL_MF_STRICT was specified and an existing
> + *page was already on a node that does not follow the policy.
> + */
>  static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
>   unsigned long end, struct mm_walk *walk)
>  {
> @@ -456,7 +463,7 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, 
> unsigned long addr,
>   unsigned long flags;
>  
>   if (unlikely(is_pmd_migration_entry(*pmd))) {
> - ret = 1;
> + ret = -EIO;
>   goto unlock;
>   }
>   page = pmd_page(*pmd);
> @@ -473,8 +480,15 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, 
> unsigned long addr,
>   ret = 1;
>   flags = qp->flags;
>   /* go to thp migration */
> - if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
> + if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
> + if (!vma_migratable(walk->vma)) {
> + ret = -EIO;
> + goto unlock;
> + }
> +
>   migrate_page_add(page, qp->pagelist, flags);
> + } else
> + ret = -EIO;
>  unlock:
>   spin_unlock(ptl);
>  out:
> @@ -499,8 +513,10 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned 
> long addr,
>   ptl = pmd_trans_huge_lock(pmd, vma);
>   if (ptl) {
>   ret = queue_pages_pmd(pmd, ptl, addr, end, walk);
> - if (ret)
> + if (ret > 0)
>   return 0;
> + else if (ret < 0)
> + return ret;
>   }
>  
>   if (pmd_trans_unstable(pmd))
> @@ -521,11 +537,16 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned 
> long addr,
>   continue;
>   if (!queue_pages_required(page, qp))
>   continue;
> - migrate_page_add(page, qp->pagelist, flags);
> + if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
> + if (!vma_migratable(vma))
> + break;
> + migrate_page_add(page, qp->pagelist, flags);
> + } else
> + break;
>   }
>   pte_unmap_unlock(pte - 1, ptl);
>   cond_resched();
> - return 0;
> + return addr != end ? -EIO : 0;
>  }
>  
>  static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
> @@ -595,7 +616,12 @@ static int queue_pages_test_walk(unsigned long start, 
> unsigned long end,
>   unsigned long endvma = vma->vm_end;
>   unsigned long flags = qp->flags;
>  
> - if (!vma_migratable(vma))
> + /*
> +  * Need check MPOL_MF_STRICT to return -EIO if possible
> +  * regardless of vma_migratable
> +  */ 
> + if (!vma_migratable(vma) &&
> + !(flags & MPOL_MF_STRICT))
>   return 1;
>  
>   if (endvma > end)
> @@ -622,7 +648,7 @@ static int queue_pages_test_walk(unsigned long start, 
> unsigned long end,
>   }
>  
>   /* queue pages from current vma */
> - if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
> + if (flags & MPOL_MF_VALID)
>   return 0;
>   return 1;
>  }
> -- 
> 1.8.3.1
> 
Acked-by: Rafael Aquini 
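
For context, the restored MPOL_MF_STRICT semantics can be exercised from
userspace with something along these lines (a rough sketch, not the LTP
mbind02 test itself; the node number is an assumption):

	#include <numaif.h>	/* mbind(), MPOL_BIND, MPOL_MF_STRICT; link with -lnuma */
	#include <sys/mman.h>
	#include <string.h>
	#include <stdio.h>
	#include <errno.h>

	int main(void)
	{
		size_t len = 2 * 1024 * 1024;
		unsigned long nodemask = 1UL << 1;	/* assumes node 1 != where pages land */
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		memset(p, 0, len);	/* fault pages in on the current node */

		/* pages already sit outside the mask -> expect -1 with errno == EIO */
		if (mbind(p, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask),
			  MPOL_MF_STRICT) != 0)
			printf("mbind: errno=%d (EIO is %d)\n", errno, EIO);
		return 0;
	}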


Re: [PATCH v3] mm/memory.c: do_fault: avoid usage of stale vm_area_struct

2019-03-03 Thread Rafael Aquini
On Sun, Mar 03, 2019 at 08:28:04AM +0100, Jan Stancek wrote:
> LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8.
> This is a stress test, where one thread mmaps/writes/munmaps memory area
> and other thread is trying to read from it:
> 
>   CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51
>   Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
>   Krnl PSW : 0404e0018000 001ac8d8 (__lock_acquire+0x7/0x7a8)
>   Call Trace:
>   ([<>]   (null))
>[<001adae4>] lock_acquire+0xec/0x258
>[<0080d1ac>] _raw_spin_lock_bh+0x5c/0x98
>[<0012a780>] page_table_free+0x48/0x1a8
>[<002f6e54>] do_fault+0xdc/0x670
>[<002fadae>] __handle_mm_fault+0x416/0x5f0
>[<002fb138>] handle_mm_fault+0x1b0/0x320
>[<001248cc>] do_dat_exception+0x19c/0x2c8
>[<0080e5ee>] pgm_check_handler+0x19e/0x200
> 
> page_table_free() is called with NULL mm parameter, but because
> "0" is a valid address on s390 (see S390_lowcore), it keeps
> going until it eventually crashes in lockdep's lock_acquire.
> This crash is reproducible at least since 4.14.
> 
> Problem is that "vmf->vma" used in do_fault() can become stale.
> Because mmap_sem may be released, other threads can come in,
> call munmap() and cause "vma" be returned to kmem cache, and
> get zeroed/re-initialized and re-used:
> 
> handle_mm_fault   |
>   __handle_mm_fault   |
> do_fault  |
>   vma = vmf->vma  |
>   do_read_fault   |
> __do_fault|
>   vma->vm_ops->fault(vmf);|
> mmap_sem is released  |
>   |
>   | do_munmap()
>   |   remove_vma_list()
>   | remove_vma()
>   |   vm_area_free()
>   | # vma is released
>   | ...
>   | # same vma is allocated
>   | # from kmem cache
>   | do_mmap()
>   |   vm_area_alloc()
>   | memset(vma, 0, ...)
>   |
>   pte_free(vma->vm_mm, ...);  |
> page_table_free   |
>   spin_lock_bh(&mm->context.lock);|
>|
> 
> Cache mm_struct to avoid using potentially stale "vma".
> 
> [1] 
> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
> 
> Signed-off-by: Jan Stancek 
> Reviewed-by: Andrea Arcangeli 
> ---
>  mm/memory.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index e11ca9dd823f..e8d69ade5acc 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3517,10 +3517,13 @@ static vm_fault_t do_shared_fault(struct vm_fault 
> *vmf)
>   * but allow concurrent faults).
>   * The mmap_sem may have been released depending on flags and our
>   * return value.  See filemap_fault() and __lock_page_or_retry().
> + * If mmap_sem is released, vma may become invalid (for example
> + * by other thread calling munmap()).
>   */
>  static vm_fault_t do_fault(struct vm_fault *vmf)
>  {
>   struct vm_area_struct *vma = vmf->vma;
> + struct mm_struct *vm_mm = vma->vm_mm;
>   vm_fault_t ret;
>  
>   /*
> @@ -3561,7 +3564,7 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
>  
>   /* preallocated pagetable is unused: free it */
>   if (vmf->prealloc_pte) {
> - pte_free(vma->vm_mm, vmf->prealloc_pte);
> + pte_free(vm_mm, vmf->prealloc_pte);
>   vmf->prealloc_pte = NULL;
>   }
>   return ret;
> -- 
> 1.8.3.1
> 
Acked-by: Rafael Aquini 


Re: [PATCH RFC] mm: migrate: don't rely on PageMovable() of newpage after unlocking it

2019-01-28 Thread Rafael Aquini
On Mon, Jan 28, 2019 at 10:01:56AM -0500, Rafael Aquini wrote:
> On Mon, Jan 28, 2019 at 03:38:38PM +0100, David Hildenbrand wrote:
> > On 28.01.19 14:35, Michal Hocko wrote:
> > > On Mon 28-01-19 14:22:52, David Hildenbrand wrote:
> > >> On 28.01.19 14:21, Michal Hocko wrote:
> > >>> On Mon 28-01-19 14:14:28, David Hildenbrand wrote:
> > >>>> On 28.01.19 14:07, Michal Hocko wrote:
> > >>>>> On Mon 28-01-19 13:16:09, David Hildenbrand wrote:
> > >>>>> [...]
> > >>>>>> My theory:
> > >>>>>>
> > >>>>>> In __unmap_and_move(), we lock the old and newpage and perform the
> > >>>>>> migration. In case of vitio-balloon, the new page will become
> > >>>>>> movable, the old page will no longer be movable.
> > >>>>>>
> > >>>>>> However, after unlocking newpage, I think there is nothing stopping
> > >>>>>> the newpage from getting dequeued and freed by virtio-balloon. This
> > >>>>>> will result in the newpage
> > >>>>>> 1. No longer having PageMovable()
> > >>>>>> 2. Getting moved to the local list before finally freeing it (using
> > >>>>>>page->lru)
> > >>>>>
> > >>>>> Does that mean that the virtio-balloon can change the Movable state
> > >>>>> while there are other users of the page? Can you point to the code 
> > >>>>> that
> > >>>>> does it? How come this can be safe at all? Or is the PageMovable 
> > >>>>> stable
> > >>>>> only under the page lock?
> > >>>>>
> > >>>>
> > >>>> PageMovable is stable under the lock. The relevant instructions are in
> > >>>>
> > >>>> mm/balloon_compaction.c and include/linux/balloon_compaction.h
> > >>>
> > >>> OK, I have just checked __ClearPageMovable and it indeed requires
> > >>> PageLock. Then we also have to move is_lru = __PageMovable(page) after
> > >>> the page lock.
> > >>>
> > >>
> > >> I assume that is fine as is as the page is isolated? (yes, it will be
> > >> modified later when moving but we are interested in the original state)
> > > 
> > > OK, I've missed that the page is indeed isolated. Then the patch makes
> > > sense to me.
> > > 
> > 
> > Thanks Michal. I assume this has broken ever since balloon compaction
> > was introduced. I'll wait a little more and then resend as !RFC with a
> > cc-stable tag.
> >
> 
> Yes, balloon deflation could always race against migration.
> This race was a problem initially, and was dealt with via:
> 
> commit 117aad1e9e4d97448d1df3f84b08bd65811e6d6a
> Author: Rafael Aquini 
> Date:   Mon Sep 30 13:45:16 2013 -0700
> 
> mm: avoid reinserting isolated balloon pages into LRU lists
> 
>  
> 
> I think this upstream patch has re-introduced it, in a more subtle way,
> as we're stumbling on it now, again:
> 
> commit d6d86c0a7f8ddc5b38cf089222cb1d9540762dc2
> Author: Konstantin Khlebnikov 
> Date:   Thu Oct 9 15:29:27 2014 -0700
> 
> mm/balloon_compaction: redesign ballooned pages management
> 
> 
> 
> In this particular migration race case, virtio balloon deflation would
> not see it before
> 
> commit b1123ea6d3b3da25af5c8a9d843bd07ab63213f4
> Author: Minchan Kim 
> Date:   Tue Jul 26 15:23:09 2016 -0700
> 
> mm: balloon: use general non-lru movable page feature
> 
> as the recently released balloon page would be post-processed 
> without the page->lru list handling, which for migration stability
> purposes must be done under the protection of page_lock.
> 
> 

Missing part here:

I think your patch addresses this new case.


Acked-by: Rafael Aquini 


> get rid of balloon reference count.

^^ this was a leftover (sorry about my fat fingers)
> 
> 
> -- Rafael


Re: [PATCH RFC] mm: migrate: don't rely on PageMovable() of newpage after unlocking it

2019-01-28 Thread Rafael Aquini
On Mon, Jan 28, 2019 at 03:38:38PM +0100, David Hildenbrand wrote:
> On 28.01.19 14:35, Michal Hocko wrote:
> > On Mon 28-01-19 14:22:52, David Hildenbrand wrote:
> >> On 28.01.19 14:21, Michal Hocko wrote:
> >>> On Mon 28-01-19 14:14:28, David Hildenbrand wrote:
> >>>> On 28.01.19 14:07, Michal Hocko wrote:
> >>>>> On Mon 28-01-19 13:16:09, David Hildenbrand wrote:
> >>>>> [...]
> >>>>>> My theory:
> >>>>>>
> >>>>>> In __unmap_and_move(), we lock the old and newpage and perform the
> >>>>>> migration. In case of virtio-balloon, the new page will become
> >>>>>> movable, the old page will no longer be movable.
> >>>>>>
> >>>>>> However, after unlocking newpage, I think there is nothing stopping
> >>>>>> the newpage from getting dequeued and freed by virtio-balloon. This
> >>>>>> will result in the newpage
> >>>>>> 1. No longer having PageMovable()
> >>>>>> 2. Getting moved to the local list before finally freeing it (using
> >>>>>>page->lru)
> >>>>>
> >>>>> Does that mean that the virtio-balloon can change the Movable state
> >>>>> while there are other users of the page? Can you point to the code that
> >>>>> does it? How come this can be safe at all? Or is the PageMovable stable
> >>>>> only under the page lock?
> >>>>>
> >>>>
> >>>> PageMovable is stable under the lock. The relevant instructions are in
> >>>>
> >>>> mm/balloon_compaction.c and include/linux/balloon_compaction.h
> >>>
> >>> OK, I have just checked __ClearPageMovable and it indeed requires
> >>> PageLock. Then we also have to move is_lru = __PageMovable(page) after
> >>> the page lock.
> >>>
> >>
> >> I assume that is fine as is as the page is isolated? (yes, it will be
> >> modified later when moving but we are interested in the original state)
> > 
> > OK, I've missed that the page is indeed isolated. Then the patch makes
> > sense to me.
> > 
> 
> Thanks Michal. I assume this has broken ever since balloon compaction
> was introduced. I'll wait a little more and then resend as !RFC with a
> cc-stable tag.
>

Yes, balloon deflation could always race against migration.
This race was a problem, initially, and was dealt with via:

commit 117aad1e9e4d97448d1df3f84b08bd65811e6d6a
Author: Rafael Aquini 
Date:   Mon Sep 30 13:45:16 2013 -0700

mm: avoid reinserting isolated balloon pages into LRU lists

 

I think this upstream patch has re-introduced it in a more subtle way,
which is why we're stumbling over it again now:

commit d6d86c0a7f8ddc5b38cf089222cb1d9540762dc2
Author: Konstantin Khlebnikov 
Date:   Thu Oct 9 15:29:27 2014 -0700

mm/balloon_compaction: redesign ballooned pages management



For this particular race against migration, virtio balloon deflation
would not see it before

commit b1123ea6d3b3da25af5c8a9d843bd07ab63213f4
Author: Minchan Kim 
Date:   Tue Jul 26 15:23:09 2016 -0700

mm: balloon: use general non-lru movable page feature

as the recently released balloon page would be post-processed 
without the page->lru list handling, which for migration stability
purposes must be done under the protection of page_lock.
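
To make the window a bit more concrete, the ordering we are worried about
looks roughly like this (illustrative pseudo-code only -- the calls are the
mm/migrate.c helpers as I remember them, not the actual __unmap_and_move()
body nor your diff):

        lock_page(page);
        lock_page(newpage);
        move_to_new_page(newpage, page, mode);  /* newpage is now balloon-owned */
        unlock_page(newpage);
        unlock_page(page);
        /*
         * RACE: nothing pins the movable state of newpage from here on;
         * the balloon can deflate, dequeue and free it, clearing
         * PageMovable() and reusing newpage->lru for its own bookkeeping.
         */
        if (__PageMovable(newpage))     /* unreliable after the unlock, */
                put_page(newpage);      /* so the state has to be sampled
                                         * while the page is still locked */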


get rid of balloon reference count.


-- Rafael


[PATCH v2] sysctl: clean up nr_pdflush_threads leftover

2018-11-29 Thread Rafael Aquini
nr_pdflush_threads has been long deprecated and
removed, but a remnant of its glorious past is
still around in the CTL_VM names enum. This patch
is a minor clean-up for that case.

Reviewed-by: William Kucharski 
Signed-off-by: Rafael Aquini 
---
v2:
 - adjust typo "was;" for VM_UNUSED2 comment   [wkucharski]
 - add colon after "int" for VM_UNUSED15 comment   [wkucharski]

 include/uapi/linux/sysctl.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
index d71013fffaf6..9df59925ed55 100644
--- a/include/uapi/linux/sysctl.h
+++ b/include/uapi/linux/sysctl.h
@@ -161,7 +161,7 @@ enum
 enum
 {
VM_UNUSED1=1,   /* was: struct: Set vm swapping control */
-   VM_UNUSED2=2,   /* was; int: Linear or sqrt() swapout for hogs */
+   VM_UNUSED2=2,   /* was: int: Linear or sqrt() swapout for hogs */
VM_UNUSED3=3,   /* was: struct: Set free page thresholds */
VM_UNUSED4=4,   /* Spare */
VM_OVERCOMMIT_MEMORY=5, /* Turn off the virtual memory safety limit */
@@ -174,7 +174,7 @@ enum
VM_DIRTY_RATIO=12,  /* dirty_ratio */
VM_DIRTY_WB_CS=13,  /* dirty_writeback_centisecs */
VM_DIRTY_EXPIRE_CS=14,  /* dirty_expire_centisecs */
-   VM_NR_PDFLUSH_THREADS=15, /* nr_pdflush_threads */
+   VM_UNUSED15=15, /* was: int: nr_pdflush_threads */
VM_OVERCOMMIT_RATIO=16, /* percent of RAM to allow overcommit in */
VM_PAGEBUF=17,  /* struct: Control pagebuf parameters */
VM_HUGETLB_PAGES=18,/* int: Number of available Huge Pages */
-- 
2.17.2



[PATCH] sysctl: clean up nr_pdflush_threads leftover

2018-11-28 Thread Rafael Aquini
nr_pdflush_threads has been long deprecated and
removed, but a remnant of its glorious past is
still around in the CTL_VM names enum. This patch
is a minor clean-up for that case.

Signed-off-by: Rafael Aquini 
---
 include/uapi/linux/sysctl.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
index d71013fffaf6..dad5a8f93343 100644
--- a/include/uapi/linux/sysctl.h
+++ b/include/uapi/linux/sysctl.h
@@ -174,7 +174,7 @@ enum
VM_DIRTY_RATIO=12,  /* dirty_ratio */
VM_DIRTY_WB_CS=13,  /* dirty_writeback_centisecs */
VM_DIRTY_EXPIRE_CS=14,  /* dirty_expire_centisecs */
-   VM_NR_PDFLUSH_THREADS=15, /* nr_pdflush_threads */
+   VM_UNUSED15=15, /* was: int nr_pdflush_threads */
VM_OVERCOMMIT_RATIO=16, /* percent of RAM to allow overcommit in */
VM_PAGEBUF=17,  /* struct: Control pagebuf parameters */
VM_HUGETLB_PAGES=18,/* int: Number of available Huge Pages */
-- 
2.17.2



Re: [PATCH] mm: be more informative in OOM task list

2018-07-04 Thread Rafael Aquini
On Tue, Jul 03, 2018 at 06:34:48PM -0700, David Rientjes wrote:
> On Sun, 1 Jul 2018, Rodrigo Freire wrote:
> 
> > The default page memory unit of OOM task dump events might not be
> > intuitive for the non-initiated when debugging OOM events. Add
> > a small printk prior to the task dump informing that the memory
> > units are actually memory _pages_.
> > 
> > Signed-off-by: Rodrigo Freire 
> > ---
> >  mm/oom_kill.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 84081e7..b4d9557 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -392,6 +392,7 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
> > struct task_struct *p;
> > struct task_struct *task;
> >  
> > +   pr_info("Tasks state (memory values in pages):\n");
> > pr_info("[ pid ]   uid  tgid total_vm  rss pgtables_bytes swapents oom_score_adj name\n");
> > rcu_read_lock();
> > for_each_process(p) {
> 
> As the author of dump_tasks(), and having seen these values misinterpreted
> on more than one occasion, I think this is a valuable addition.
> 
> Could you also expand out the "pid" field to allow for seven digits 
> instead of five?  I think everything else is aligned.
> 
> Feel free to add
> 
> Acked-by: David Rientjes 
> 
> to a v2.
>

Same here, for a v2:
 
Acked-by: Rafael Aquini 
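
FWIW, the wider pid column David is asking for would amount to something
like the sketch below (column spacing here is illustrative and the per-task
format string is quoted from memory -- whatever the v2 actually carries is
authoritative):

        /*
         * Sketch only: widen the header's pid column from 5 to 7 chars and
         * bump the matching per-task specifier from "[%5d]" to "[%7d]".
         */
        pr_info("Tasks state (memory values in pages):\n");
        pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name\n");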

