Re: [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists

2017-07-20 Thread Michal Hocko
On Thu 20-07-17 09:24:58, Vlastimil Babka wrote:
> On 07/14/2017 10:00 AM, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > build_all_zonelists has been (ab)using stop_machine to make sure that
> > zonelists do not change while somebody is looking at them. This is
> > just a gross hack because a) it complicates the context from which
> > we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
> > switch locking to a percpu rwsem")) and b) it is not really necessary
> > especially after "mm, page_alloc: simplify zonelist initialization".
> > 
> > Updates of the zonelists happen very seldom, basically only when a zone
> > becomes populated during memory online or when it loses all the memory
> > during offline. A racing iteration over zonelists could either miss a
> > zone or try to work on one zone twice. Both of these are something we
> > can live with occasionally because there will always be at least one
> > zone visible so we are not likely to fail allocation too easily for
> > example.
> > 
> > Signed-off-by: Michal Hocko 
> 
> Some stress testing of this would still be worth doing, IMHO.

I have run the pathological online/offline of the single memblock in the
movable zone while stressing the same small node with some memory pressure.
Node 1, zone      DMA
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 943, 943, 943)
Node 1, zone    DMA32
  pages free     227310
        min      8294
        low      10367
        high     12440
        spanned  262112
        present  262112
        managed  241436
        protection: (0, 0, 0, 0)
Node 1, zone   Normal
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0, 1024)
Node 1, zone  Movable
  pages free     32722
        min      85
        low      117
        high     149
        spanned  32768
        present  32768
        managed  32768
        protection: (0, 0, 0, 0)

root@test1:/sys/devices/system/node/node1# while true
do 
echo offline > memory34/state
echo online_movable > memory34/state
done

root@test1:/mnt/data/test/linux-3.7-rc5# numactl --preferred=1 make -j4

and it survived without any unexpected behavior. While this is not
really great test coverage, it should exercise the allocation path
quite a lot.

I can add this to the changelog if you think it is worth it.
 
> Acked-by: Vlastimil Babka 
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists

2017-07-20 Thread Vlastimil Babka
On 07/14/2017 10:00 AM, Michal Hocko wrote:
> From: Michal Hocko 
> 
> build_all_zonelists has been (ab)using stop_machine to make sure that
> zonelists do not change while somebody is looking at them. This is
> just a gross hack because a) it complicates the context from which
> we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
> switch locking to a percpu rwsem")) and b) it is not really necessary
> especially after "mm, page_alloc: simplify zonelist initialization".
> 
> Updates of the zonelists happen very seldom, basically only when a zone
> becomes populated during memory online or when it loses all the memory
> during offline. A racing iteration over zonelists could either miss a
> zone or try to work on one zone twice. Both of these are something we
> can live with occasionally because there will always be at least one
> zone visible so we are not likely to fail allocation too easily for
> example.
> 
> Signed-off-by: Michal Hocko 

Some stress testing of this would still be worth doing, IMHO.

Acked-by: Vlastimil Babka 

> ---
>  mm/page_alloc.c | 9 ++---
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 78bd62418380..217889ecd13f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5066,8 +5066,7 @@ static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>   */
>  DEFINE_MUTEX(zonelists_mutex);
>  
> -/* return values int just for stop_machine() */
> -static int __build_all_zonelists(void *data)
> +static void __build_all_zonelists(void *data)
>  {
>  	int nid;
>  	int cpu;
> @@ -5103,8 +5102,6 @@ static int __build_all_zonelists(void *data)
>  			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
>  #endif
>  	}
> -
> -	return 0;
>  }
>  
>  static noinline void __init
> @@ -5147,9 +5144,7 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
>  	if (system_state == SYSTEM_BOOTING) {
>  		build_all_zonelists_init();
>  	} else {
> -		/* we have to stop all cpus to guarantee there is no user
> -		   of zonelist */
> -		stop_machine_cpuslocked(__build_all_zonelists, pgdat, NULL);
> +		__build_all_zonelists(pgdat);
>  		/* cpuset refresh routine should be here */
>  	}
>  	vm_total_pages = nr_free_pagecache_pages();
> 



Re: [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists

2017-07-20 Thread Vlastimil Babka
On 07/14/2017 01:45 PM, Michal Hocko wrote:
> On Fri 14-07-17 13:43:21, Michal Hocko wrote:
>> On Fri 14-07-17 13:29:14, Vlastimil Babka wrote:
>>> On 07/14/2017 10:00 AM, Michal Hocko wrote:
>>>> From: Michal Hocko
>>>>
>>>> build_all_zonelists has been (ab)using stop_machine to make sure that
>>>> zonelists do not change while somebody is looking at them. This is
>>>> just a gross hack because a) it complicates the context from which
>>>> we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
>>>> switch locking to a percpu rwsem")) and b) it is not really necessary
>>>> especially after "mm, page_alloc: simplify zonelist initialization".
>>>>
>>>> Updates of the zonelists happen very seldom, basically only when a zone
>>>> becomes populated during memory online or when it loses all the memory
>>>> during offline. A racing iteration over zonelists could either miss a
>>>> zone or try to work on one zone twice. Both of these are something we
>>>> can live with occasionally because there will always be at least one
>>>> zone visible so we are not likely to fail allocation too easily for
>>>> example.
>>>
>>> Given the experience with cpusets and mempolicies, I would rather
>>> avoid the risk of allocation not seeing the only zone(s) that are
>>> allowed by its nodemask, and triggering premature OOM.
>>
>> I would argue those are a different beast because they are directly
>> under the control of a not fully privileged user and change between the
>> empty nodemask and cpusets very often. For this one to trigger we
>> would have to online/offline the last memory block in the zone very
>> often and that doesn't resemble a sensible use case even remotely.

OK.

>>> So maybe the
>>> updates could be done in a way to avoid that, e.g. first append a copy
>>> of the old zonelist to the end, then overwrite and terminate with NULL.
>>> But if this requires any barriers or something similar on the iteration
>>> site, which is performance critical, then it's bad.
>>> Maybe a seqcount, that the iteration side only starts checking in the
>>> slowpath? Like we have with cpusets now.
>>> I know that Mel noted that stop_machine() also never had such guarantees
>>> to prevent this, but it could have made the chances smaller.
>>
>> I think we can come up with some scheme but is this really worth it
>> considering how unlikely the whole thing is? Well, if somebody hits a
>> premature OOM killer or allocation failures it would have to be along
>> with heavy memory hotplug operations and then it would be quite easy
>> to spot what is going on and try to fix it. I would rather not
>> overcomplicate it, to be honest.

Fine, we can always add it later.

> And one more thing, Mel has already brought this up in his response.
> stop_machine has very roughly the same strength wrt. a double zone
> visit or a missed zone because we do not restart zonelist iteration.

I know, that's why I wrote "I know that Mel noted that stop_machine()
also never had such guarantees to prevent this, but it could have made
the chances smaller." But I don't have any good proof that your patch is
indeed making things worse, so let's apply and see...
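
For readers following the thread, the seqcount idea mentioned above (a counter that the allocation side consults only in its slow path, as is done for cpusets) might look roughly like the userspace sketch below. This is an illustration under stated assumptions, not kernel code and not part of the patch: `zl_seq`, `zl_read_begin`, `zl_read_retry`, `alloc_with_retry` and the demo allocator are all hypothetical names, C11 atomics stand in for the kernel's seqcount primitives, and the zonelist data itself is not modelled.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical update counter: odd while a rebuild is running, even otherwise. */
struct zl_seq {
	atomic_uint seq;
};

/* Writer side: bracket the zonelist rebuild with these two calls. */
static void zl_write_begin(struct zl_seq *s) { atomic_fetch_add(&s->seq, 1u); }
static void zl_write_end(struct zl_seq *s)   { atomic_fetch_add(&s->seq, 1u); }

/* Reader side: take a cookie, and later ask whether a rebuild overlapped. */
static unsigned zl_read_begin(struct zl_seq *s) { return atomic_load(&s->seq); }

static bool zl_read_retry(struct zl_seq *s, unsigned cookie)
{
	/* Odd cookie: we started mid-rebuild. Changed value: a rebuild ran since. */
	return (cookie & 1u) || atomic_load(&s->seq) != cookie;
}

typedef bool (*try_alloc_fn)(void *ctx);

/*
 * Try an allocation. The fast path never touches the counter; only a
 * failed attempt enters the slow path, which retries as long as a
 * rebuild may have hidden a zone from the scan.
 */
static bool alloc_with_retry(struct zl_seq *s, try_alloc_fn try_alloc,
			     void *ctx, unsigned *attempts)
{
	unsigned cookie;

	(*attempts)++;
	if (try_alloc(ctx))
		return true;

	do {
		cookie = zl_read_begin(s);
		(*attempts)++;
		if (try_alloc(ctx))
			return true;
	} while (zl_read_retry(s, cookie));

	return false;	/* the list was stable and still nothing fit */
}

/* Demo-only allocator: fails twice, and a rebuild races with the second failure. */
struct demo_ctx {
	struct zl_seq *seq;
	unsigned calls;
};

static bool demo_try_alloc(void *p)
{
	struct demo_ctx *c = p;

	c->calls++;
	if (c->calls == 2) {
		/* Pretend a hotplug rebuild ran while we were scanning. */
		zl_write_begin(c->seq);
		zl_write_end(c->seq);
	}
	return c->calls >= 3;
}
```

With the demo allocator, the fast-path attempt fails, the first slow-path attempt fails while a simulated rebuild bumps the counter, and the retry then succeeds instead of reporting a failure. The cost on the common path is zero counter reads, which is the property asked for above; whether the extra slow-path logic is worth it is exactly what the thread leaves open.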



Re: [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists

2017-07-14 Thread Mel Gorman
On Fri, Jul 14, 2017 at 01:00:25PM +0200, Michal Hocko wrote:
> On Fri 14-07-17 10:59:32, Mel Gorman wrote:
> > On Fri, Jul 14, 2017 at 10:00:04AM +0200, Michal Hocko wrote:
> > > From: Michal Hocko 
> > > 
> > > build_all_zonelists has been (ab)using stop_machine to make sure that
> > > zonelists do not change while somebody is looking at them. This is
> > > just a gross hack because a) it complicates the context from which
> > > we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
> > > switch locking to a percpu rwsem")) and b) it is not really necessary
> > > especially after "mm, page_alloc: simplify zonelist initialization".
> > > 
> > > Updates of the zonelists happen very seldom, basically only when a zone
> > > becomes populated during memory online or when it loses all the memory
> > > during offline. A racing iteration over zonelists could either miss a
> > > zone or try to work on one zone twice. Both of these are something we
> > > can live with occasionally because there will always be at least one
> > > zone visible so we are not likely to fail allocation too easily for
> > > example.
> > > 
> > > Signed-off-by: Michal Hocko 
> > 
> > This patch is contingent on the last patch which updates in place
> > instead of zeroing the early part of the zonelist first but needs to fix
> > the stack usage issues. I think it's also worth pointing out in the
> > changelog that stop_machine never gave the guarantees it claimed as a
> > process iterating through the zonelist can be stopped so when it resumes
> > the zonelist has changed underneath it. Doing it online is roughly
> > equivalent in terms of safety.
> 
> OK, what about the following addendum?
> "
> Please note that the original stop_machine approach doesn't really
> provide a better exclusion because the iteration might be interrupted
> halfway (unless the whole iteration is preempt disabled, which is not the
> case in most cases) so some zones could still be seen twice or a
> zone missed.
> "

Works for me.

-- 
Mel Gorman
SUSE Labs
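
The point made in this exchange, that a reader stopped in the middle of an iteration resumes on a changed list no matter how the update was serialized, can be replayed deterministically in userspace. The sketch below is only an illustration: `zonelist_sim`, `reader_sim` and the helper functions are hypothetical, the "zones" are plain strings, and the interruption is simulated by calling the reader in two steps around an in-place rebuild.

```c
#include <stddef.h>
#include <string.h>

#define MAX_ZONES 8

/* Hypothetical stand-in for a zonelist: a NULL-terminated array of zone names. */
struct zonelist_sim {
	const char *zones[MAX_ZONES + 1];
};

/* Hypothetical reader cursor: remembers where the iteration was stopped. */
struct reader_sim {
	size_t pos;
};

/*
 * Visit at most max_steps entries starting at the saved position and
 * record them in seen[] from index nseen on. Returns the number visited.
 */
static size_t reader_step(struct reader_sim *r, const struct zonelist_sim *zl,
			  size_t max_steps, const char **seen, size_t nseen)
{
	size_t n = 0;

	while (n < max_steps && zl->zones[r->pos] != NULL) {
		seen[nseen + n] = zl->zones[r->pos];
		r->pos++;
		n++;
	}
	return n;
}

/* Rebuild the list in place, as a memory online/offline event would. */
static void rebuild(struct zonelist_sim *zl, const char *const *new_zones)
{
	size_t i;

	for (i = 0; i < MAX_ZONES && new_zones[i] != NULL; i++)
		zl->zones[i] = new_zones[i];
	zl->zones[i] = NULL;
}

/* Count how often name appears in seen[0..n). */
static size_t count_seen(const char **seen, size_t n, const char *name)
{
	size_t i, c = 0;

	for (i = 0; i < n; i++)
		if (strcmp(seen[i], name) == 0)
			c++;
	return c;
}
```

Starting from the list { "Normal", "DMA32" }, letting the reader take one step, rebuilding to { "Movable", "Normal", "DMA32" } and then letting the reader finish yields the visits Normal, Normal, DMA32: one zone seen twice and the newly populated zone missed. Nothing in the replay depends on whether the rebuild ran under stop_machine or directly, since the reader's saved position survives either way.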


Re: [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists

2017-07-14 Thread Michal Hocko
On Fri 14-07-17 13:43:21, Michal Hocko wrote:
> On Fri 14-07-17 13:29:14, Vlastimil Babka wrote:
> > On 07/14/2017 10:00 AM, Michal Hocko wrote:
> > > From: Michal Hocko 
> > > 
> > > build_all_zonelists has been (ab)using stop_machine to make sure that
> > > zonelists do not change while somebody is looking at them. This is
> > > just a gross hack because a) it complicates the context from which
> > > we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
> > > switch locking to a percpu rwsem")) and b) it is not really necessary
> > > especially after "mm, page_alloc: simplify zonelist initialization".
> > > 
> > > Updates of the zonelists happen very seldom, basically only when a zone
> > > becomes populated during memory online or when it loses all the memory
> > > during offline. A racing iteration over zonelists could either miss a
> > > zone or try to work on one zone twice. Both of these are something we
> > > can live with occasionally because there will always be at least one
> > > zone visible so we are not likely to fail allocation too easily for
> > > example.
> > 
> > Given the experience with cpusets and mempolicies, I would rather
> > avoid the risk of allocation not seeing the only zone(s) that are
> > allowed by its nodemask, and triggering premature OOM.
> 
> I would argue those are a different beast because they are directly
> under the control of a not fully privileged user and change between the
> empty nodemask and cpusets very often. For this one to trigger we
> would have to online/offline the last memory block in the zone very
> often and that doesn't resemble a sensible use case even remotely.
> 
> > So maybe the
> > updates could be done in a way to avoid that, e.g. first append a copy
> > of the old zonelist to the end, then overwrite and terminate with NULL.
> > But if this requires any barriers or something similar on the iteration
> > site, which is performance critical, then it's bad.
> > Maybe a seqcount, that the iteration side only starts checking in the
> > slowpath? Like we have with cpusets now.
> > I know that Mel noted that stop_machine() also never had such guarantees
> > to prevent this, but it could have made the chances smaller.
> 
> I think we can come up with some scheme but is this really worth it
> considering how unlikely the whole thing is? Well, if somebody hits a
> premature OOM killer or allocation failures it would have to be along
> with heavy memory hotplug operations and then it would be quite easy
> to spot what is going on and try to fix it. I would rather not
> overcomplicate it, to be honest.

And one more thing, Mel has already brought this up in his response.
stop_machine has very roughly the same strength wrt. a double zone
visit or a missed zone because we do not restart zonelist iteration.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists

2017-07-14 Thread Michal Hocko
On Fri 14-07-17 13:29:14, Vlastimil Babka wrote:
> On 07/14/2017 10:00 AM, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > build_all_zonelists has been (ab)using stop_machine to make sure that
> > zonelists do not change while somebody is looking at them. This is
> > just a gross hack because a) it complicates the context from which
> > we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
> > switch locking to a percpu rwsem")) and b) it is not really necessary
> > especially after "mm, page_alloc: simplify zonelist initialization".
> > 
> > Updates of the zonelists happen very seldom, basically only when a zone
> > becomes populated during memory online or when it loses all the memory
> > during offline. A racing iteration over zonelists could either miss a
> > zone or try to work on one zone twice. Both of these are something we
> > can live with occasionally because there will always be at least one
> > zone visible so we are not likely to fail allocation too easily for
> > example.
> 
> Given the experience with cpusets and mempolicies, I would rather
> avoid the risk of allocation not seeing the only zone(s) that are
> allowed by its nodemask, and triggering premature OOM.

I would argue those are a different beast because they are directly
under the control of a not fully privileged user and change between the
empty nodemask and cpusets very often. For this one to trigger we
would have to online/offline the last memory block in the zone very
often and that doesn't resemble a sensible use case even remotely.

> So maybe the
> updates could be done in a way to avoid that, e.g. first append a copy
> of the old zonelist to the end, then overwrite and terminate with NULL.
> But if this requires any barriers or something similar on the iteration
> site, which is performance critical, then it's bad.
> Maybe a seqcount, that the iteration side only starts checking in the
> slowpath? Like we have with cpusets now.
> I know that Mel noted that stop_machine() also never had such guarantees
> to prevent this, but it could have made the chances smaller.

I think we can come up with some scheme but is this really worth it
considering how unlikely the whole thing is? Well, if somebody hits a
premature OOM killer or allocation failures it would have to be along
with heavy memory hotplug operations and then it would be quite easy
to spot what is going on and try to fix it. I would rather not
overcomplicate it, to be honest.

> > Signed-off-by: Michal Hocko 
> > ---
> >  mm/page_alloc.c | 9 ++---
> >  1 file changed, 2 insertions(+), 7 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 78bd62418380..217889ecd13f 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5066,8 +5066,7 @@ static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
> >   */
> >  DEFINE_MUTEX(zonelists_mutex);
> >  
> > -/* return values int just for stop_machine() */
> > -static int __build_all_zonelists(void *data)
> > +static void __build_all_zonelists(void *data)
> >  {
> >  	int nid;
> >  	int cpu;
> > @@ -5103,8 +5102,6 @@ static int __build_all_zonelists(void *data)
> >  			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
> >  #endif
> >  	}
> > -
> > -	return 0;
> >  }
> >  
> >  static noinline void __init
> > @@ -5147,9 +5144,7 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
> >  	if (system_state == SYSTEM_BOOTING) {
> >  		build_all_zonelists_init();
> >  	} else {
> > -		/* we have to stop all cpus to guarantee there is no user
> > -		   of zonelist */
> > -		stop_machine_cpuslocked(__build_all_zonelists, pgdat, NULL);
> > +		__build_all_zonelists(pgdat);
> >  		/* cpuset refresh routine should be here */
> >  	}
> >  	vm_total_pages = nr_free_pagecache_pages();
> > 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists

2017-07-14 Thread Michal Hocko
On Fri 14-07-17 13:29:14, Vlastimil Babka wrote:
> On 07/14/2017 10:00 AM, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > build_all_zonelists has been (ab)using stop_machine to make sure that
> > zonelists do not change while somebody is looking at them. This is
> > is just a gross hack because a) it complicates the context from which
> > we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
> > switch locking to a percpu rwsem")) and b) is is not really necessary
> > especially after "mm, page_alloc: simplify zonelist initialization".
> > 
> > Updates of the zonelists happen very seldom, basically only when a zone
> > becomes populated during memory online or when it loses all the memory
> > during offline. A racing iteration over zonelists could either miss a
> > zone or try to work on one zone twice. Both of these are something we
> > can live with occasionally because there will always be at least one
> > zone visible so we are not likely to fail allocation too easily for
> > example.
> 
> Given the experience with with cpusets and mempolicies, I would rather
> avoid the risk of allocation not seeing the only zone(s) that are
> allowed by its nodemask, and triggering premature OOM.

I would argue, those are a different beast because they are directly
under control of not fully priviledged user and change between the empty
nodemask and cpusets very often. For this one to trigger we
would have to online/offline the last memory block in the zone very
often and that doesn't resemble a sensible usecase even remotely.

> So maybe the
> updates could be done in a way to avoid that, e.g. first append a copy
> of the old zonelist to the end, then overwrite and terminate with NULL.
> But if this requires any barriers or something similar on the iteration
> site, which is performance critical, then it's bad.
> Maybe a seqcount, that the iteration side only starts checking in the
> slowpath? Like we have with cpusets now.
> I know that Mel noted that stop_machine() also never had such guarantees
> to prevent this, but it could have made the chances smaller.

I think we can come up with some scheme, but is this really worth it
considering how unlikely the whole thing is? Well, if somebody hits a
premature OOM kill or allocation failures, it would have to be along
with heavy memory hotplug operations, and then it would be quite easy
to spot what is going on and try to fix it. I would rather not
overcomplicate it, to be honest.

> > Signed-off-by: Michal Hocko 
> > ---
> >  mm/page_alloc.c | 9 ++---
> >  1 file changed, 2 insertions(+), 7 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 78bd62418380..217889ecd13f 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5066,8 +5066,7 @@ static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
> >   */
> >  DEFINE_MUTEX(zonelists_mutex);
> >  
> > -/* return values int just for stop_machine() */
> > -static int __build_all_zonelists(void *data)
> > +static void __build_all_zonelists(void *data)
> >  {
> > int nid;
> > int cpu;
> > @@ -5103,8 +5102,6 @@ static int __build_all_zonelists(void *data)
> > set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
> >  #endif
> > }
> > -
> > -   return 0;
> >  }
> >  
> >  static noinline void __init
> > @@ -5147,9 +5144,7 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
> > if (system_state == SYSTEM_BOOTING) {
> > build_all_zonelists_init();
> > } else {
> > -   /* we have to stop all cpus to guarantee there is no user
> > -  of zonelist */
> > -   stop_machine_cpuslocked(__build_all_zonelists, pgdat, NULL);
> > +   __build_all_zonelists(pgdat);
> > /* cpuset refresh routine should be here */
> > }
> > vm_total_pages = nr_free_pagecache_pages();
> > 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists

2017-07-14 Thread Vlastimil Babka
On 07/14/2017 10:00 AM, Michal Hocko wrote:
> From: Michal Hocko 
> 
> build_all_zonelists has been (ab)using stop_machine to make sure that
> zonelists do not change while somebody is looking at them. This is
> just a gross hack because a) it complicates the context from which
> we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
> switch locking to a percpu rwsem")) and b) it is not really necessary
> especially after "mm, page_alloc: simplify zonelist initialization".
> 
> Updates of the zonelists happen very seldom, basically only when a zone
> becomes populated during memory online or when it loses all the memory
> during offline. A racing iteration over zonelists could either miss a
> zone or try to work on one zone twice. Both of these are something we
> can live with occasionally because there will always be at least one
> zone visible so we are not likely to fail allocation too easily for
> example.

Given the experience with cpusets and mempolicies, I would rather
avoid the risk of allocation not seeing the only zone(s) that are
allowed by its nodemask, and triggering premature OOM. So maybe the
updates could be done in a way to avoid that, e.g. first append a copy
of the old zonelist to the end, then overwrite and terminate with NULL.
But if this requires any barriers or something similar on the iteration
site, which is performance critical, then it's bad.
Maybe a seqcount, that the iteration side only starts checking in the
slowpath? Like we have with cpusets now.
I know that Mel noted that stop_machine() also never had such guarantees
to prevent this, but it could have made the chances smaller.
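For illustration only, the seqcount idea above might look roughly like
the following userspace sketch. It uses C11 atomics rather than the
kernel's seqcount_t API, and everything in it is an assumption made for
the example: the simplified struct zonelist (zone names instead of zone
pointers), zl_rebuild(), zl_contains() and zl_find_slowpath() are
hypothetical names, and a single writer is assumed, as if updates were
serialized by zonelists_mutex. The fast-path scan pays no extra cost;
only the slow path rechecks the counter before it gives up.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_ZONES 8

/* Hypothetical stand-in for a zonelist: NULL-terminated array of zone names. */
struct zonelist {
	atomic_uint seq;	/* even: stable, odd: rebuild in progress */
	_Atomic(const char *) zones[MAX_ZONES + 1];
};

static void zl_init(struct zonelist *zl)
{
	atomic_init(&zl->seq, 0);
	for (size_t i = 0; i <= MAX_ZONES; i++)
		atomic_init(&zl->zones[i], (const char *)NULL);
}

/* Writer side: bump the counter before and after the in-place rebuild. */
static void zl_rebuild(struct zonelist *zl, const char *const *names, size_t n)
{
	if (n > MAX_ZONES)
		n = MAX_ZONES;
	atomic_fetch_add(&zl->seq, 1);		/* now odd */
	for (size_t i = 0; i < n; i++)
		atomic_store(&zl->zones[i], names[i]);
	atomic_store(&zl->zones[n], (const char *)NULL);
	atomic_fetch_add(&zl->seq, 1);		/* even again */
}

/* Fast path: plain scan, never looks at the sequence counter. */
static bool zl_contains(struct zonelist *zl, const char *name)
{
	for (size_t i = 0; i < MAX_ZONES; i++) {
		const char *z = atomic_load(&zl->zones[i]);

		if (!z)
			break;
		if (strcmp(z, name) == 0)
			return true;
	}
	return false;
}

/*
 * Slow path: report "not found" (the premature-OOM case) only if no
 * rebuild started or finished while we were scanning; otherwise rescan.
 */
static bool zl_find_slowpath(struct zonelist *zl, const char *name)
{
	for (;;) {
		unsigned int start = atomic_load(&zl->seq);

		if (zl_contains(zl, name))
			return true;
		if (!(start & 1) && atomic_load(&zl->seq) == start)
			return false;
	}
}
```

The sketch uses sequentially consistent atomics throughout for
simplicity; whether a real implementation could get away with weaker
ordering on the iteration side is exactly the open question raised
above.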

> Signed-off-by: Michal Hocko 
> ---
>  mm/page_alloc.c | 9 ++---
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 78bd62418380..217889ecd13f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5066,8 +5066,7 @@ static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>   */
>  DEFINE_MUTEX(zonelists_mutex);
>  
> -/* return values int just for stop_machine() */
> -static int __build_all_zonelists(void *data)
> +static void __build_all_zonelists(void *data)
>  {
>   int nid;
>   int cpu;
> @@ -5103,8 +5102,6 @@ static int __build_all_zonelists(void *data)
>   set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
>  #endif
>   }
> -
> - return 0;
>  }
>  
>  static noinline void __init
> @@ -5147,9 +5144,7 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
>   if (system_state == SYSTEM_BOOTING) {
>   build_all_zonelists_init();
>   } else {
> - /* we have to stop all cpus to guarantee there is no user
> -of zonelist */
> - stop_machine_cpuslocked(__build_all_zonelists, pgdat, NULL);
> + __build_all_zonelists(pgdat);
>   /* cpuset refresh routine should be here */
>   }
>   vm_total_pages = nr_free_pagecache_pages();
> 



Re: [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists

2017-07-14 Thread Michal Hocko
On Fri 14-07-17 10:59:32, Mel Gorman wrote:
> On Fri, Jul 14, 2017 at 10:00:04AM +0200, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > build_all_zonelists has been (ab)using stop_machine to make sure that
> > zonelists do not change while somebody is looking at them. This is
> > just a gross hack because a) it complicates the context from which
> > we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
> > switch locking to a percpu rwsem")) and b) it is not really necessary
> > especially after "mm, page_alloc: simplify zonelist initialization".
> > 
> > Updates of the zonelists happen very seldom, basically only when a zone
> > becomes populated during memory online or when it loses all the memory
> > during offline. A racing iteration over zonelists could either miss a
> > zone or try to work on one zone twice. Both of these are something we
> > can live with occasionally because there will always be at least one
> > zone visible so we are not likely to fail allocation too easily for
> > example.
> > 
> > Signed-off-by: Michal Hocko 
> 
> This patch is contingent on the last patch which updates in place
> instead of zeroing the early part of the zonelist first but needs to fix
> the stack usage issues. I think it's also worth pointing out in the
> changelog that stop_machine never gave the guarantees it claimed as a
> process iterating through the zonelist can be stopped so when it resumes
> the zonelist has changed underneath it. Doing it online is roughly
> equivalent in terms of safety.

OK, what about the following addendum?
"
Please note that the original stop_machine approach doesn't really
provide a better exclusion because the iteration might be interrupted
halfway (unless the whole iteration is preempt disabled, which is not the
case in most cases), so some zones could still be seen twice or a
zone missed.
"
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists

2017-07-14 Thread Mel Gorman
On Fri, Jul 14, 2017 at 10:00:04AM +0200, Michal Hocko wrote:
> From: Michal Hocko 
> 
> build_all_zonelists has been (ab)using stop_machine to make sure that
> zonelists do not change while somebody is looking at them. This is
> just a gross hack because a) it complicates the context from which
> we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
> switch locking to a percpu rwsem")) and b) it is not really necessary
> especially after "mm, page_alloc: simplify zonelist initialization".
> 
> Updates of the zonelists happen very seldom, basically only when a zone
> becomes populated during memory online or when it loses all the memory
> during offline. A racing iteration over zonelists could either miss a
> zone or try to work on one zone twice. Both of these are something we
> can live with occasionally because there will always be at least one
> zone visible so we are not likely to fail allocation too easily for
> example.
> 
> Signed-off-by: Michal Hocko 

This patch is contingent on the last patch which updates in place
instead of zeroing the early part of the zonelist first but needs to fix
the stack usage issues. I think it's also worth pointing out in the
changelog that stop_machine never gave the guarantees it claimed as a
process iterating through the zonelist can be stopped so when it resumes
the zonelist has changed underneath it. Doing it online is roughly
equivalent in terms of safety.
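
To make the point about a stopped iterator concrete, here is a small
deterministic userspace sketch. It is not kernel code and all names in
it (struct walk, walk_steps(), rebuild_with_movable(), count_seen())
are hypothetical. The walker visits one entry and is then "stopped";
the zonelist is rebuilt in place while nothing else runs, which is all
that stop_machine could ensure; the walker then resumes from its saved
position. It ends up seeing one zone twice and missing the newly added
one.

```c
#include <stddef.h>
#include <string.h>

#define MAX_ZONES 4

/* Record of what an iterating task observed. */
struct walk {
	const char *seen[2 * MAX_ZONES];
	size_t nr_seen;
};

/*
 * Visit at most 'steps' entries starting at *pos and remember them.
 * The caller keeps *pos across calls, like a task that was preempted
 * in the middle of its zonelist walk.
 */
static void walk_steps(const char *const *zonelist, size_t *pos,
		       size_t steps, struct walk *w)
{
	for (size_t i = 0; i < steps; i++) {
		if (!zonelist[*pos] || w->nr_seen == 2 * MAX_ZONES)
			return;
		w->seen[w->nr_seen++] = zonelist[(*pos)++];
	}
}

/* In-place rebuild after the Movable zone became populated. */
static void rebuild_with_movable(const char **zonelist)
{
	zonelist[0] = "Movable";
	zonelist[1] = "Normal";
	zonelist[2] = "DMA32";
	zonelist[3] = NULL;
}

/* How many times did the walker see the given zone? */
static size_t count_seen(const struct walk *w, const char *name)
{
	size_t n = 0;

	for (size_t i = 0; i < w->nr_seen; i++)
		if (strcmp(w->seen[i], name) == 0)
			n++;
	return n;
}
```

Starting from { "Normal", "DMA32", NULL }, calling walk_steps() for one
step, then rebuild_with_movable(), then walk_steps() to the end leaves
the walker having seen Normal twice, DMA32 once and Movable not at all,
even though the rebuild itself never ran concurrently with a step.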

-- 
Mel Gorman
SUSE Labs

