Re: [PATCH] mm: take memory hotplug lock within numa_zonelist_order_handler()

2017-07-27 Thread Michal Hocko
On Wed 26-07-17 15:25:03, Andrew Morton wrote:
> On Wed, 26 Jul 2017 14:19:52 +0200 Michal Hocko  wrote:
> 
> > > > Please note that this code has been removed by
> > > > http://lkml.kernel.org/r/20170721143915.14161-2-mho...@kernel.org. It
> > > > will get to linux-next as soon as Andrew releases a new version mmotm
> > > > tree.
> > > 
> > > We still would need something for 4.13, no?
> > 
> > If this presents a real problem then yes. Has this happened in a real
> > workload or during some artificial test? I mean the code has been like
> > that for ages and nobody noticed/reported any problems.
> > 
> > That being said, I do not have anything against your patch. It is
> > trivial to rebase mine on top of yours. I am just not sure it is worth
> > the code churn. E.g. do you think this patch is a stable backport
> > material?
> 
> As I understand it, Heiko's patch fixes 4.13-rc1's 3f906ba23689a
> ("mm/memory-hotplug: switch locking to a percpu rwsem") (and should
> have been tagged as such!).  So no -stable backport is needed - we just
> need to get 4.13.x fixed.
> 
> So I grabbed this for Linusing this week or next, and fixed up your
> for-4.14 patch thusly:

OK. Feel free to add
Acked-by: Michal Hocko 

And thanks for the rebase which looks correct.

> From: Michal Hocko 
> Subject: mm, page_alloc: rip out ZONELIST_ORDER_ZONE
> 
> Patch series "cleanup zonelists initialization", v1.
> 
> This is aimed at cleaning up the zonelists initialization code we have
> but the primary motivation was a bug report [2] which got resolved, but
> the usage of stop_machine is just too ugly to live. Most patches are
> straightforward but three of them need special consideration.
> 
> Patch 1 removes zone ordered zonelists completely. I am CCing linux-api
> because this is a user visible change. As I argue in the patch
> description I do not think we have a strong usecase for it these days.
> I have kept sysctl in place and warn into the log if somebody tries to
> configure zone lists ordering. If somebody has a real usecase for it
> we can revert this patch but I do not expect anybody will actually notice
> runtime differences. This patch is not strictly needed for the rest but
> it made patch 6 easier to implement.
> 
> Patch 7 removes stop_machine from build_all_zonelists without adding any
> special synchronization between iterators and updater which I _believe_
> is acceptable as explained in the changelog. I hope I am not missing
> anything.
> 
> Patch 8 then removes zonelists_mutex, which is kind of ugly as well and
> not really needed AFAICS, but care should be taken when double-checking
> my thinking.
> 
> 
> This patch (of 9):
> 
> Supporting zone ordered zonelists costs us just a lot of code while the
> usefulness is arguable if existent at all.  Mel has already made node
> ordering default on 64b systems.  32b systems are still using
> ZONELIST_ORDER_ZONE because it is considered better to fallback to a
> different NUMA node rather than consume precious lowmem zones.
> 
> This argument is, however, weakened by the fact that the memory reclaim has
> been reworked to be node rather than zone oriented.  This means that
> lowmem requests have to skip over all highmem pages on LRUs already and so
> zone ordering doesn't save the reclaim time much.  So the only advantage
> of the zone ordering is under a light memory pressure when highmem
> requests do not ever hit into lowmem zones and the lowmem pressure doesn't
> need to reclaim.
> 
> Considering that 32b NUMA systems are rather suboptimal already and it is
> generally advisable to use a 64b kernel on such HW, I believe we should
> rather care about the code maintainability and just get rid of
> ZONELIST_ORDER_ZONE altogether.  Keep the sysctl in place and warn if somebody
> tries to set zone ordering either from the kernel command line or the sysctl.
> 
> Link: http://lkml.kernel.org/r/20170721143915.14161-2-mho...@kernel.org
> Signed-off-by: Michal Hocko 
> Acked-by: Mel Gorman 
> Acked-by: Vlastimil Babka 
> Cc: Johannes Weiner 
> Cc: Joonsoo Kim 
> Cc: Shaohua Li 
> Cc: Toshi Kani 
> Cc: 
> Signed-off-by: Andrew Morton 
> ---
> 
>  Documentation/admin-guide/kernel-parameters.txt |2 
>  Documentation/sysctl/vm.txt |4 
>  Documentation/vm/numa   |7 
>  include/linux/mmzone.h  |2 
>  kernel/sysctl.c |2 
>  mm/page_alloc.c |  183 +-
>  6 files changed, 30 insertions(+), 170 deletions(-)
> 
> diff -puN Documentation/admin-guide/kernel-parameters.txt~mm-page_alloc-rip-out-zonelist_order_zone Documentation/admin-guide/kernel-parameters.txt
> --- a/Documentation/admin-guide/kernel-parameters.txt~mm-page_alloc-rip-out-zonelist_order_zone
> +++ a/Documentation/admin-guide/kernel-parameters.txt
> @@ -2769,7 +2769,7 @@
>   Allowed values are enable and disable
>  
>   numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
> - one of ['zone', 'node', 'default'] can be specified
> + 'node', 'default' can be specified
>   This can be set from sysctl after boot.
>   See Documentation/sysctl/vm.txt for details.

Re: [PATCH] mm: take memory hotplug lock within numa_zonelist_order_handler()

2017-07-26 Thread Andrew Morton
On Wed, 26 Jul 2017 14:19:52 +0200 Michal Hocko  wrote:

> > > Please note that this code has been removed by
> > > http://lkml.kernel.org/r/20170721143915.14161-2-mho...@kernel.org. It
> > > will get to linux-next as soon as Andrew releases a new version mmotm
> > > tree.
> > 
> > We still would need something for 4.13, no?
> 
> If this presents a real problem then yes. Has this happened in a real
> workload or during some artificial test? I mean the code has been like
> that for ages and nobody noticed/reported any problems.
> 
> That being said, I do not have anything against your patch. It is
> trivial to rebase mine on top of yours. I am just not sure it is worth
> the code churn. E.g. do you think this patch is a stable backport
> material?

As I understand it, Heiko's patch fixes 4.13-rc1's 3f906ba23689a
("mm/memory-hotplug: switch locking to a percpu rwsem") (and should
have been tagged as such!).  So no -stable backport is needed - we just
need to get 4.13.x fixed.

So I grabbed this for Linusing this week or next, and fixed up your
for-4.14 patch thusly:


From: Michal Hocko 
Subject: mm, page_alloc: rip out ZONELIST_ORDER_ZONE

Patch series "cleanup zonelists initialization", v1.

This is aimed at cleaning up the zonelists initialization code we have
but the primary motivation was a bug report [2] which got resolved, but
the usage of stop_machine is just too ugly to live. Most patches are
straightforward but three of them need special consideration.

Patch 1 removes zone ordered zonelists completely. I am CCing linux-api
because this is a user visible change. As I argue in the patch
description I do not think we have a strong usecase for it these days.
I have kept sysctl in place and warn into the log if somebody tries to
configure zone lists ordering. If somebody has a real usecase for it
we can revert this patch but I do not expect anybody will actually notice
runtime differences. This patch is not strictly needed for the rest but
it made patch 6 easier to implement.

Patch 7 removes stop_machine from build_all_zonelists without adding any
special synchronization between iterators and updater which I _believe_
is acceptable as explained in the changelog. I hope I am not missing
anything.

Patch 8 then removes zonelists_mutex, which is kind of ugly as well and
not really needed AFAICS, but care should be taken when double-checking
my thinking.


This patch (of 9):

Supporting zone ordered zonelists costs us just a lot of code while the
usefulness is arguable if existent at all.  Mel has already made node
ordering default on 64b systems.  32b systems are still using
ZONELIST_ORDER_ZONE because it is considered better to fallback to a
different NUMA node rather than consume precious lowmem zones.

This argument is, however, weakened by the fact that the memory reclaim has
been reworked to be node rather than zone oriented.  This means that
lowmem requests have to skip over all highmem pages on LRUs already and so
zone ordering doesn't save the reclaim time much.  So the only advantage
of the zone ordering is under a light memory pressure when highmem
requests do not ever hit into lowmem zones and the lowmem pressure doesn't
need to reclaim.

Considering that 32b NUMA systems are rather suboptimal already and it is
generally advisable to use a 64b kernel on such HW, I believe we should
rather care about the code maintainability and just get rid of
ZONELIST_ORDER_ZONE altogether.  Keep the sysctl in place and warn if somebody
tries to set zone ordering either from the kernel command line or the sysctl.

Link: http://lkml.kernel.org/r/20170721143915.14161-2-mho...@kernel.org
Signed-off-by: Michal Hocko 
Acked-by: Mel Gorman 
Acked-by: Vlastimil Babka 
Cc: Johannes Weiner 
Cc: Joonsoo Kim 
Cc: Shaohua Li 
Cc: Toshi Kani 
Cc: 
Signed-off-by: Andrew Morton 
---

 Documentation/admin-guide/kernel-parameters.txt |2 
 Documentation/sysctl/vm.txt |4 
 Documentation/vm/numa   |7 
 include/linux/mmzone.h  |2 
 kernel/sysctl.c |2 
 mm/page_alloc.c |  183 +-
 6 files changed, 30 insertions(+), 170 deletions(-)

diff -puN 
Documentation/admin-guide/kernel-parameters.txt~mm-page_alloc-rip-out-zonelist_order_zone
 Documentation/admin-guide/kernel-parameters.txt
--- 
a/Documentation/admin-guide/kernel-parameters.txt~mm-page_alloc-rip-out-zonelist_order_zone
+++ a/Documentation/admin-guide/kernel-parameters.txt
@@ -2769,7 +2769,7 @@
Allowed values are enable and disable
 
numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
-   one of ['zone', 'node', 'default'] can be specified
+   'node', 'default' can be specified
This can be set from sysctl after boot.
See Documentation/sysctl/vm.txt for details.

Re: [PATCH] mm: take memory hotplug lock within numa_zonelist_order_handler()

2017-07-26 Thread Thomas Gleixner
On Wed, 26 Jul 2017, Heiko Carstens wrote:
> Andre Wild reported the following warning:
> 
> WARNING: CPU: 2 PID: 1205 at kernel/cpu.c:240 lockdep_assert_cpus_held+0x4c/0x60
> Modules linked in:
> CPU: 2 PID: 1205 Comm: bash Not tainted 4.13.0-rc2-00022-gfd2b2c57ec20 #10
> Hardware name: IBM 2964 N96 702 (z/VM 6.4.0)
> task: 701d8100 task.stack: 73594000
> Krnl PSW : 0704f0018000 00145e24 (lockdep_assert_cpus_held+0x4c/0x60)
> ...
> Call Trace:
>  lockdep_assert_cpus_held+0x42/0x60)
>  stop_machine_cpuslocked+0x62/0xf0
>  build_all_zonelists+0x92/0x150
>  numa_zonelist_order_handler+0x102/0x150
>  proc_sys_call_handler.isra.12+0xda/0x118
>  proc_sys_write+0x34/0x48
>  __vfs_write+0x3c/0x178
>  vfs_write+0xbc/0x1a0
>  SyS_write+0x66/0xc0
>  system_call+0xc4/0x2b0
>  locks held by bash/1205:
>  #0:  (sb_writers#4){.+.+.+}, at: [<0037b29e>] vfs_write+0xa6/0x1a0
>  #1:  (zl_order_mutex){+.+...}, at: [<002c8e4c>] numa_zonelist_order_handler+0x44/0x150
>  #2:  (zonelists_mutex){+.+...}, at: [<002c8efc>] numa_zonelist_order_handler+0xf4/0x150
> Last Breaking-Event-Address:
>  [<00145e20>] lockdep_assert_cpus_held+0x48/0x60
> 
> This can be easily triggered with e.g.
> 
>  >echo n > /proc/sys/vm/numa_zonelist_order
> 
> With commit 3f906ba23689a ("mm/memory-hotplug: switch locking to a
> percpu rwsem") memory hotplug locking was changed to fix a potential
> deadlock. This also switched the stop_machine() invocation within
> build_all_zonelists() to stop_machine_cpuslocked() which now expects
> that online cpus are locked when being called.
> 
> This assumption does not hold when build_all_zonelists() is called
> from numa_zonelist_order_handler(). To fix this, simply add a
> mem_hotplug_begin()/mem_hotplug_done() pair to numa_zonelist_order_handler().

Sorry, I missed that call path when I did the conversion. So yes, that
needs some protection.

Thanks,

tglx



Re: [PATCH] mm: take memory hotplug lock within numa_zonelist_order_handler()

2017-07-26 Thread Michal Hocko
On Wed 26-07-17 13:48:12, Heiko Carstens wrote:
> On Wed, Jul 26, 2017 at 01:31:12PM +0200, Michal Hocko wrote:
> > On Wed 26-07-17 13:17:38, Heiko Carstens wrote:
> > [...]
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 6d30e914afb6..fc32aa81f359 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -4891,9 +4891,11 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
> > >   NUMA_ZONELIST_ORDER_LEN);
> > >   user_zonelist_order = oldval;
> > >   } else if (oldval != user_zonelist_order) {
> > > + mem_hotplug_begin();
> > >   mutex_lock(&zonelists_mutex);
> > >   build_all_zonelists(NULL, NULL);
> > >   mutex_unlock(&zonelists_mutex);
> > > + mem_hotplug_done();
> > >   }
> > >   }
> > >  out:
> > 
> > Please note that this code has been removed by
> > http://lkml.kernel.org/r/20170721143915.14161-2-mho...@kernel.org. It
> > will get to linux-next as soon as Andrew releases a new version mmotm
> > tree.
> 
> We still would need something for 4.13, no?

If this presents a real problem then yes. Has this happened in a real
workload or during some artificial test? I mean the code has been like
that for ages and nobody noticed/reported any problems.

That being said, I do not have anything against your patch. It is
trivial to rebase mine on top of yours. I am just not sure it is worth
the code churn. E.g. do you think this patch is a stable backport
material?
-- 
Michal Hocko
SUSE Labs



Re: [PATCH] mm: take memory hotplug lock within numa_zonelist_order_handler()

2017-07-26 Thread Heiko Carstens
On Wed, Jul 26, 2017 at 01:31:12PM +0200, Michal Hocko wrote:
> On Wed 26-07-17 13:17:38, Heiko Carstens wrote:
> [...]
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 6d30e914afb6..fc32aa81f359 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4891,9 +4891,11 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
> > NUMA_ZONELIST_ORDER_LEN);
> > user_zonelist_order = oldval;
> > } else if (oldval != user_zonelist_order) {
> > +   mem_hotplug_begin();
> > mutex_lock(&zonelists_mutex);
> > build_all_zonelists(NULL, NULL);
> > mutex_unlock(&zonelists_mutex);
> > +   mem_hotplug_done();
> > }
> > }
> >  out:
> 
> Please note that this code has been removed by
> http://lkml.kernel.org/r/20170721143915.14161-2-mho...@kernel.org. It
> will get to linux-next as soon as Andrew releases a new version mmotm
> tree.

We still would need something for 4.13, no?




Re: [PATCH] mm: take memory hotplug lock within numa_zonelist_order_handler()

2017-07-26 Thread Michal Hocko
On Wed 26-07-17 13:17:38, Heiko Carstens wrote:
[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6d30e914afb6..fc32aa81f359 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4891,9 +4891,11 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
>   NUMA_ZONELIST_ORDER_LEN);
>   user_zonelist_order = oldval;
>   } else if (oldval != user_zonelist_order) {
> + mem_hotplug_begin();
>   mutex_lock(&zonelists_mutex);
>   build_all_zonelists(NULL, NULL);
>   mutex_unlock(&zonelists_mutex);
> + mem_hotplug_done();
>   }
>   }
>  out:

Please note that this code has been removed by
http://lkml.kernel.org/r/20170721143915.14161-2-mho...@kernel.org. It
will get to linux-next as soon as Andrew releases a new version mmotm
tree.

-- 
Michal Hocko
SUSE Labs



[PATCH] mm: take memory hotplug lock within numa_zonelist_order_handler()

2017-07-26 Thread Heiko Carstens
Andre Wild reported the following warning:

WARNING: CPU: 2 PID: 1205 at kernel/cpu.c:240 lockdep_assert_cpus_held+0x4c/0x60
Modules linked in:
CPU: 2 PID: 1205 Comm: bash Not tainted 4.13.0-rc2-00022-gfd2b2c57ec20 #10
Hardware name: IBM 2964 N96 702 (z/VM 6.4.0)
task: 701d8100 task.stack: 73594000
Krnl PSW : 0704f0018000 00145e24 (lockdep_assert_cpus_held+0x4c/0x60)
...
Call Trace:
 lockdep_assert_cpus_held+0x42/0x60)
 stop_machine_cpuslocked+0x62/0xf0
 build_all_zonelists+0x92/0x150
 numa_zonelist_order_handler+0x102/0x150
 proc_sys_call_handler.isra.12+0xda/0x118
 proc_sys_write+0x34/0x48
 __vfs_write+0x3c/0x178
 vfs_write+0xbc/0x1a0
 SyS_write+0x66/0xc0
 system_call+0xc4/0x2b0
 locks held by bash/1205:
 #0:  (sb_writers#4){.+.+.+}, at: [<0037b29e>] vfs_write+0xa6/0x1a0
 #1:  (zl_order_mutex){+.+...}, at: [<002c8e4c>] numa_zonelist_order_handler+0x44/0x150
 #2:  (zonelists_mutex){+.+...}, at: [<002c8efc>] numa_zonelist_order_handler+0xf4/0x150
Last Breaking-Event-Address:
 [<00145e20>] lockdep_assert_cpus_held+0x48/0x60

This can be easily triggered with e.g.

 >echo n > /proc/sys/vm/numa_zonelist_order

With commit 3f906ba23689a ("mm/memory-hotplug: switch locking to a
percpu rwsem") memory hotplug locking was changed to fix a potential
deadlock. This also switched the stop_machine() invocation within
build_all_zonelists() to stop_machine_cpuslocked() which now expects
that online cpus are locked when being called.

This assumption does not hold when build_all_zonelists() is called
from numa_zonelist_order_handler(). To fix this, simply add a
mem_hotplug_begin()/mem_hotplug_done() pair to numa_zonelist_order_handler().

Reported-by: Andre Wild 
Cc: KAMEZAWA Hiroyuki 
Cc: Thomas Gleixner 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Signed-off-by: Heiko Carstens 
---
 mm/page_alloc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d30e914afb6..fc32aa81f359 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4891,9 +4891,11 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
NUMA_ZONELIST_ORDER_LEN);
user_zonelist_order = oldval;
} else if (oldval != user_zonelist_order) {
+   mem_hotplug_begin();
mutex_lock(&zonelists_mutex);
build_all_zonelists(NULL, NULL);
mutex_unlock(&zonelists_mutex);
+   mem_hotplug_done();
}
}
 out:
-- 
2.11.2


