Re: [PATCH 2/2] mm: Update NUMA counter threshold size

2017-08-22 Thread Mel Gorman
On Tue, Aug 22, 2017 at 11:21:31AM +0800, kemi wrote:
> 
> 
> > On 2017-08-15 17:58, Mel Gorman wrote:
> > On Tue, Aug 15, 2017 at 04:45:36PM +0800, Kemi Wang wrote:
> >>  Threshold   CPU cycles   Throughput(88 threads)
> >>  32  799 241760478
> >>  64  640 301628829
> >>  125 537 358906028 <==> system by default (base)
> >>  256 468 412397590
> >>  512 428 450550704
> >>  4096 399 482520943
> >>  2   394 489009617
> >>  3   395 488017817
> >>  32765   394(-26.6%) 488932078(+36.2%) <==> with this patchset
> >>  N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
> >>
> >> Signed-off-by: Kemi Wang 
> >> Suggested-by: Dave Hansen 
> >> Suggested-by: Ying Huang 
> >> ---
> >>  include/linux/mmzone.h |  4 ++--
> >>  include/linux/vmstat.h |  6 +-
> >>  mm/vmstat.c| 23 ++-
> >>  3 files changed, 17 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index 0b11ba7..7eaf0e8 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -282,8 +282,8 @@ struct per_cpu_pageset {
> >>struct per_cpu_pages pcp;
> >>  #ifdef CONFIG_NUMA
> >>s8 expire;
> >> -  s8 numa_stat_threshold;
> >> -  s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
> >> +  s16 numa_stat_threshold;
> >> +  s16 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
> > 
> > I'm fairly sure this pushes the size of that structure into the next
> > cache line which is not welcome.
> > 
> Hi Mel
>   I am refreshing this patch. Would you please be more explicit about what
> "that structure" refers to?
>   If you mean "struct per_cpu_pageset": on a 64-bit machine, this structure
> still occupies two cache lines after extending s8 to s16/u16, so that should
> not be a problem.

You're right, I was in error. I miscalculated badly initially. It still
fits in as expected.

> On a 32-bit machine, we probably do not need to extend
> the size of vm_numa_stat_diff[], since 32-bit kernels are rarely used on large
> NUMA systems and s8/u8 is large enough there; in that case, we can keep
> "struct per_cpu_pageset" at the same size.
> 

I don't believe it's worth the complexity of making this
bitness-specific. 32-bit takes penalties in other places and besides,
32-bit does not necessarily mean a change in cache line size.

Fortunately, I think you should still be able to gain a bit more by
special-casing the fact that the counter only ever increments and always
doing a full spill of the counters instead of half. If so, then using u16
instead of s16 should also reduce the update frequency. However, if you find
it's too complex and the gain is too marginal, then I'll ack it without that.
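
(Roughly: an s16 counter tops out at 32767 while a u16 tops out at 65535, so
together with a full spill-and-clear the threshold could be nearly doubled
again without growing the structure.)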

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 2/2] mm: Update NUMA counter threshold size

2017-08-21 Thread kemi


On 2017-08-15 17:58, Mel Gorman wrote:
> On Tue, Aug 15, 2017 at 04:45:36PM +0800, Kemi Wang wrote:
>>  Threshold   CPU cycles   Throughput(88 threads)
>>  32  799 241760478
>>  64  640 301628829
>>  125 537 358906028 <==> system by default (base)
>>  256 468 412397590
>>  512 428 450550704
>>  4096 399 482520943
>>  2   394 489009617
>>  3   395 488017817
>>  32765   394(-26.6%) 488932078(+36.2%) <==> with this patchset
>>  N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
>>
>> Signed-off-by: Kemi Wang 
>> Suggested-by: Dave Hansen 
>> Suggested-by: Ying Huang 
>> ---
>>  include/linux/mmzone.h |  4 ++--
>>  include/linux/vmstat.h |  6 +-
>>  mm/vmstat.c| 23 ++-
>>  3 files changed, 17 insertions(+), 16 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 0b11ba7..7eaf0e8 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -282,8 +282,8 @@ struct per_cpu_pageset {
>>  struct per_cpu_pages pcp;
>>  #ifdef CONFIG_NUMA
>>  s8 expire;
>> -s8 numa_stat_threshold;
>> -s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
>> +s16 numa_stat_threshold;
>> +s16 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
> 
> I'm fairly sure this pushes the size of that structure into the next
> cache line which is not welcome.
> 
Hi Mel
  I am refreshing this patch. Would you please be more explicit about what
"that structure" refers to?
  If you mean "struct per_cpu_pageset": on a 64-bit machine, this structure
still occupies two cache lines after extending s8 to s16/u16, so that should
not be a problem. On a 32-bit machine, we probably do not need to extend
the size of vm_numa_stat_diff[], since 32-bit kernels are rarely used on large
NUMA systems and s8/u8 is large enough there; in that case, we can keep
"struct per_cpu_pageset" at the same size.

 If you mean "s16 vm_numa_stat_diff[]" and want to keep it in a single cache
line, we could probably add some padding after "s8 expire" to achieve that.

Again, thanks for your comments; they help make this patch cleaner.
> vm_numa_stat_diff is an always incrementing field. How much do you gain
> if this becomes a u8 and you remove any code that deals with negative
> values? That would double the threshold without consuming another cache line.
> 
> Furthermore, the stats in question are only ever incremented by one.
> That means that any calculation related to overlap can be removed and
> special-cased, since it will never overlap by more than 1. That potentially
> removes code that is required for other stats but not locality stats.
> This may give enough savings to avoid moving to s16.
> 
> Very broadly speaking, I like what you're doing, but I would like to see
> more work on reducing any unnecessary code in that path (such as dealing
> with overlaps for single increments) and treat increasing the cache footprint
> only as a very last resort.
> 
>> 


Re: [PATCH 2/2] mm: Update NUMA counter threshold size

2017-08-15 Thread kemi


On 2017-08-16 00:55, Tim Chen wrote:
> On 08/15/2017 02:58 AM, Mel Gorman wrote:
>> On Tue, Aug 15, 2017 at 04:45:36PM +0800, Kemi Wang wrote:

>> I'm fairly sure this pushes the size of that structure into the next
>> cache line which is not welcome.
>> vm_numa_stat_diff is an always incrementing field. How much do you gain
>> if this becomes a u8 and you remove any code that deals with negative
>> values? That would double the threshold without consuming another cache line.
> 
> Doubling the threshold and counter size will help, but not as much
> as raising them above the u8 limit, as seen in Kemi's data:
> 
>   125 537 358906028 <==> system by default (base)
>   256 468 412397590
>   32765   394(-26.6%) 488932078(+36.2%) <==> with this patchset
> 
> For a small system, making them u8 makes sense.  For larger ones, the
> frequent overflow of the local counter into the global counter still
> causes a lot of cache bouncing.  Kemi can perhaps collect some data
> to see what the gain is from making the counters u8. 
> 
Tim, thanks for your answer. That is what I wanted to clarify.

Also, please note that in the current code path the cpu-local counter
(e.g. vm_numa_stat_diff[]) is set to -threshold/2 once the per-zone counter
is updated. This weakens the benefit of changing s8 to u8 in this case.
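
For reference, the update path I am describing looks roughly like the generic
__inc_zone_state() pattern in mm/vmstat.c. A simplified sketch (not the exact
kernel code):

/*
 * Sketch of the existing per-cpu stat update: on overflow, the local diff
 * plus an overstep is folded into the global counter and the local counter
 * is parked at -overstep, so switching s8 to u8 alone does not double the
 * headroom before the next global update.
 */
static void sketch_inc_zone_state(struct zone *zone, enum zone_stat_item item)
{
        struct per_cpu_pageset __percpu *pcp = zone->pageset;
        s8 __percpu *p = pcp->vm_stat_diff + item;
        s8 v, t;

        v = __this_cpu_inc_return(*p);
        t = __this_cpu_read(pcp->stat_threshold);
        if (unlikely(v > t)) {
                s8 overstep = t >> 1;

                zone_page_state_add(v + overstep, zone, item);
                __this_cpu_write(*p, -overstep);
        }
}
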
>>
>> Furthermore, the stats in question are only ever incremented by one.
>> That means that any calculation related to overlap can be removed and
>> special-cased, since it will never overlap by more than 1. That potentially
>> removes code that is required for other stats but not locality stats.
>> This may give enough savings to avoid moving to s16.
>>
>> Very broadly speaking, I like what you're doing, but I would like to see
>> more work on reducing any unnecessary code in that path (such as dealing
>> with overlaps for single increments) and treat increasing the cache footprint
>> only as a very last resort.
>>
Agreed. I will think about it more.

>>>  #endif
>>>  #ifdef CONFIG_SMP
>>> s8 stat_threshold;
>>> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>>> index 1e19379..d97cc34 100644
>>> --- a/include/linux/vmstat.h
>>> +++ b/include/linux/vmstat.h
>>> @@ -125,10 +125,14 @@ static inline unsigned long global_numa_state(enum zone_numa_stat_item item)
>>> return x;
>>>  }
>>>  



Re: [PATCH 2/2] mm: Update NUMA counter threshold size

2017-08-15 Thread kemi

>>  
>> -static inline unsigned long zone_numa_state(struct zone *zone,
>> +static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
>>  enum zone_numa_stat_item item)
>>  {
>>  long x = atomic_long_read(&zone->vm_numa_stat[item]);
>> +int cpu;
>> +
>> +for_each_online_cpu(cpu)
>> +x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
>>  
>>  return x;
>>  }
> 
> This does not appear to be related to the current patch. It either
> should be merged with the previous patch or stand on its own.
> 
OK. I can move it to a separate patch if that does not make anyone unhappy,
since it is not clean to introduce a functionality change in the first patch.

>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 5a7fa30..c7f50ed 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -30,6 +30,8 @@
>>  
>>  #include "internal.h"
>>  
>> +#define NUMA_STAT_THRESHOLD  32765
>> +
> 
> This should be expressed in terms of the type and not a hard-coded value.
> 
OK, thanks. I will follow that suggestion.
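
Something along these lines would express it in terms of the type (just a
sketch, assuming the small headroom below the s16 maximum is intentional):

/* Keep a little headroom below S16_MAX for the inc-then-check path. */
#define NUMA_STAT_THRESHOLD     (S16_MAX - 2)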


Re: [PATCH 2/2] mm: Update NUMA counter threshold size

2017-08-15 Thread Mel Gorman
On Tue, Aug 15, 2017 at 10:51:21AM -0700, Tim Chen wrote:
> On 08/15/2017 10:30 AM, Mel Gorman wrote:
> > On Tue, Aug 15, 2017 at 09:55:39AM -0700, Tim Chen wrote:
> 
> >>
> >> Doubling the threshold and counter size will help, but not as much
> >> as raising them above the u8 limit, as seen in Kemi's data:
> >>
> >>   125 537 358906028 <==> system by default (base)
> >>   256 468 412397590
> >>   32765   394(-26.6%) 488932078(+36.2%) <==> with this patchset
> >>
> >> For a small system, making them u8 makes sense.  For larger ones, the
> >> frequent overflow of the local counter into the global counter still
> >> causes a lot of cache bouncing.  Kemi can perhaps collect some data
> >> to see what the gain is from making the counters u8. 
> >>
> > 
> > The same comments hold. The increase of a cache line is undesirable but
> > there are other places where the overall cost can be reduced by special
> > casing based on how this counter is used (always incrementing by one).
> 
> Can you be more explicit about what optimization you are suggesting here and
> what changes to inc/dec_zone_page_state?  It seems to me that we will still
> overflow the local counter at the same frequency unless the threshold and
> counter size are changed.

One of the helpers added is __inc_zone_numa_state, which doesn't have a
symmetrical __dec_zone_numa_state because the counter only ever increments.
Because of this, there is little or no motivation to update the global value
by threshold >> 1: that offset exists for counters with both inc and dec, to
avoid a corner case whereby a loop of inc/dec would overflow every time.
Instead, you can always apply the full threshold and clear the local counter,
which is fewer operations and halves the frequency at which the global value
needs to be updated.
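
As an illustration only (assuming vm_numa_stat_diff[] becomes u16 as suggested
and folding straight into the atomic global counter; the exact form is up to
you), the increment path could be as simple as:

/* Sketch: increment-only NUMA stat with a full spill-and-clear. */
static void sketch_inc_zone_numa_state(struct zone *zone,
                                       enum zone_numa_stat_item item)
{
        struct per_cpu_pageset __percpu *pcp = zone->pageset;
        u16 __percpu *p = pcp->vm_numa_stat_diff + item;
        u16 v;

        v = __this_cpu_inc_return(*p);
        if (unlikely(v > NUMA_STAT_THRESHOLD)) {
                /* Fold the whole local count into the global counter ... */
                atomic_long_add(v, &zone->vm_numa_stat[item]);
                /*
                 * ... and clear it; no -overstep is needed when the counter
                 * only ever increments.
                 */
                __this_cpu_write(*p, 0);
        }
}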

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 2/2] mm: Update NUMA counter threshold size

2017-08-15 Thread Tim Chen
On 08/15/2017 10:30 AM, Mel Gorman wrote:
> On Tue, Aug 15, 2017 at 09:55:39AM -0700, Tim Chen wrote:

>>
>> Doubling the threshold and counter size will help, but not as much
>> as raising them above the u8 limit, as seen in Kemi's data:
>>
>>   125 537 358906028 <==> system by default (base)
>>   256 468 412397590
>>   32765   394(-26.6%) 488932078(+36.2%) <==> with this patchset
>>
>> For a small system, making them u8 makes sense.  For larger ones, the
>> frequent overflow of the local counter into the global counter still
>> causes a lot of cache bouncing.  Kemi can perhaps collect some data
>> to see what the gain is from making the counters u8. 
>>
> 
> The same comments hold. The increase of a cache line is undesirable but
> there are other places where the overall cost can be reduced by special
> casing based on how this counter is used (always incrementing by one).

Can you be more explicit about what optimization you are suggesting here and
what changes to inc/dec_zone_page_state?  It seems to me that we will still
overflow the local counter at the same frequency unless the threshold and
counter size are changed.

Thanks.

Tim

> It would be preferable to address those first and see how close that gets
> to the performance of doubling the necessary storage for a counter.
> 



Re: [PATCH 2/2] mm: Update NUMA counter threshold size

2017-08-15 Thread Mel Gorman
On Tue, Aug 15, 2017 at 09:55:39AM -0700, Tim Chen wrote:
> On 08/15/2017 02:58 AM, Mel Gorman wrote:
> > On Tue, Aug 15, 2017 at 04:45:36PM +0800, Kemi Wang wrote:
> >>  Threshold   CPU cycles   Throughput(88 threads)
> >>  32  799 241760478
> >>  64  640 301628829
> >>  125 537 358906028 <==> system by default (base)
> >>  256 468 412397590
> >>  512 428 450550704
> >>  4096 399 482520943
> >>  2   394 489009617
> >>  3   395 488017817
> >>  32765   394(-26.6%) 488932078(+36.2%) <==> with this patchset
> >>  N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
> >>
> >> Signed-off-by: Kemi Wang 
> >> Suggested-by: Dave Hansen 
> >> Suggested-by: Ying Huang 
> >> ---
> >>  include/linux/mmzone.h |  4 ++--
> >>  include/linux/vmstat.h |  6 +-
> >>  mm/vmstat.c| 23 ++-
> >>  3 files changed, 17 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index 0b11ba7..7eaf0e8 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -282,8 +282,8 @@ struct per_cpu_pageset {
> >>struct per_cpu_pages pcp;
> >>  #ifdef CONFIG_NUMA
> >>s8 expire;
> >> -  s8 numa_stat_threshold;
> >> -  s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
> >> +  s16 numa_stat_threshold;
> >> +  s16 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
> > 
> > I'm fairly sure this pushes the size of that structure into the next
> > cache line which is not welcome.
> > 
> > vm_numa_stat_diff is an always incrementing field. How much do you gain
> > if this becomes a u8 and you remove any code that deals with negative
> > values? That would double the threshold without consuming another cache
> > line.
> 
> Doubling the threshold and counter size will help, but not as much
> as raising them above the u8 limit, as seen in Kemi's data:
> 
>   125 537 358906028 <==> system by default (base)
>   256 468 412397590
>   32765   394(-26.6%) 488932078(+36.2%) <==> with this patchset
> 
> For a small system, making them u8 makes sense.  For larger ones, the
> frequent overflow of the local counter into the global counter still
> causes a lot of cache bouncing.  Kemi can perhaps collect some data
> to see what the gain is from making the counters u8. 
> 

The same comments hold. The increase of a cache line is undesirable but
there are other places where the overall cost can be reduced by special
casing based on how this counter is used (always incrementing by one).
It would be preferable to address those first and see how close that gets
to the performance of doubling the necessary storage for a counter.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 2/2] mm: Update NUMA counter threshold size

2017-08-15 Thread Tim Chen
On 08/15/2017 02:58 AM, Mel Gorman wrote:
> On Tue, Aug 15, 2017 at 04:45:36PM +0800, Kemi Wang wrote:
>>  Threshold   CPU cycles   Throughput(88 threads)
>>  32  799 241760478
>>  64  640 301628829
>>  125 537 358906028 <==> system by default (base)
>>  256 468 412397590
>>  512 428 450550704
>>  4096 399 482520943
>>  2   394 489009617
>>  3   395 488017817
>>  32765   394(-26.6%) 488932078(+36.2%) <==> with this patchset
>>  N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
>>
>> Signed-off-by: Kemi Wang 
>> Suggested-by: Dave Hansen 
>> Suggested-by: Ying Huang 
>> ---
>>  include/linux/mmzone.h |  4 ++--
>>  include/linux/vmstat.h |  6 +-
>>  mm/vmstat.c| 23 ++-
>>  3 files changed, 17 insertions(+), 16 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 0b11ba7..7eaf0e8 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -282,8 +282,8 @@ struct per_cpu_pageset {
>>  struct per_cpu_pages pcp;
>>  #ifdef CONFIG_NUMA
>>  s8 expire;
>> -s8 numa_stat_threshold;
>> -s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
>> +s16 numa_stat_threshold;
>> +s16 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
> 
> I'm fairly sure this pushes the size of that structure into the next
> cache line which is not welcome.
> 
> vm_numa_stat_diff is an always incrementing field. How much do you gain
> if this becomes a u8 and you remove any code that deals with negative
> values? That would double the threshold without consuming another cache line.

Doubling the threshold and counter size will help, but not as much
as raising them above the u8 limit, as seen in Kemi's data:

  125 537 358906028 <==> system by default (base)
  256 468 412397590
  32765   394(-26.6%) 488932078(+36.2%) <==> with this patchset

For a small system, making them u8 makes sense.  For larger ones, the
frequent overflow of the local counter into the global counter still
causes a lot of cache bouncing.  Kemi can perhaps collect some data
to see what the gain is from making the counters u8. 

> 
> Furthermore, the stats in question are only ever incremented by one.
> That means that any calculation related to overlap can be removed and
> special-cased, since it will never overlap by more than 1. That potentially
> removes code that is required for other stats but not locality stats.
> This may give enough savings to avoid moving to s16.
> 
> Very broadly speaking, I like what you're doing, but I would like to see
> more work on reducing any unnecessary code in that path (such as dealing
> with overlaps for single increments) and treat increasing the cache footprint
> only as a very last resort.
> 
>>  #endif
>>  #ifdef CONFIG_SMP
>>  s8 stat_threshold;
>> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>> index 1e19379..d97cc34 100644
>> --- a/include/linux/vmstat.h
>> +++ b/include/linux/vmstat.h
>> @@ -125,10 +125,14 @@ static inline unsigned long global_numa_state(enum zone_numa_stat_item item)
>>  return x;
>>  }
>>  
>> -static inline unsigned long zone_numa_state(struct zone *zone,
>> +static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
>>  enum zone_numa_stat_item item)
>>  {
>>  long x = atomic_long_read(&zone->vm_numa_stat[item]);
>> +int cpu;
>> +
>> +for_each_online_cpu(cpu)
>> +x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
>>  
>>  return x;
>>  }
> 
> This does not appear to be related to the current patch. It either
> should be merged with the previous patch or stand on its own.
> 
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 5a7fa30..c7f50ed 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -30,6 +30,8 @@
>>  
>>  #include "internal.h"
>>  
>> +#define NUMA_STAT_THRESHOLD  32765
>> +
> 
> This should be expressed in terms of the type and not a hard-coded value.
> 



Re: [PATCH 2/2] mm: Update NUMA counter threshold size

2017-08-15 Thread Mel Gorman
On Tue, Aug 15, 2017 at 04:45:36PM +0800, Kemi Wang wrote:
>  Threshold   CPU cycles   Throughput(88 threads)
>  32  799 241760478
>  64  640 301628829
>  125 537 358906028 <==> system by default (base)
>  256 468 412397590
>  512 428 450550704
>  4096 399 482520943
>  2   394 489009617
>  3   395 488017817
>  32765   394(-26.6%) 488932078(+36.2%) <==> with this patchset
>  N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
> 
> Signed-off-by: Kemi Wang 
> Suggested-by: Dave Hansen 
> Suggested-by: Ying Huang 
> ---
>  include/linux/mmzone.h |  4 ++--
>  include/linux/vmstat.h |  6 +-
>  mm/vmstat.c| 23 ++-
>  3 files changed, 17 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 0b11ba7..7eaf0e8 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -282,8 +282,8 @@ struct per_cpu_pageset {
>   struct per_cpu_pages pcp;
>  #ifdef CONFIG_NUMA
>   s8 expire;
> - s8 numa_stat_threshold;
> - s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
> + s16 numa_stat_threshold;
> + s16 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];

I'm fairly sure this pushes the size of that structure into the next
cache line which is not welcome.

vm_numa_stat_diff is an always incrementing field. How much do you gain
if this becomes a u8 and you remove any code that deals with negative
values? That would double the threshold without consuming another cache line.

Furthermore, the stats in question are only ever incremented by one.
That means that any calculation related to overlap can be removed and
special-cased, since it will never overlap by more than 1. That potentially
removes code that is required for other stats but not locality stats.
This may give enough savings to avoid moving to s16.

Very broadly speaking, I like what you're doing, but I would like to see
more work on reducing any unnecessary code in that path (such as dealing
with overlaps for single increments) and treat increasing the cache footprint
only as a very last resort.

>  #endif
>  #ifdef CONFIG_SMP
>   s8 stat_threshold;
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 1e19379..d97cc34 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -125,10 +125,14 @@ static inline unsigned long global_numa_state(enum zone_numa_stat_item item)
>   return x;
>  }
>  
> -static inline unsigned long zone_numa_state(struct zone *zone,
> +static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
>   enum zone_numa_stat_item item)
>  {
>   long x = atomic_long_read(&zone->vm_numa_stat[item]);
> + int cpu;
> +
> + for_each_online_cpu(cpu)
> + x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
>  
>   return x;
>  }

This does not appear to be related to the current patch. It either
should be merged with the previous patch or stand on its own.

> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 5a7fa30..c7f50ed 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -30,6 +30,8 @@
>  
>  #include "internal.h"
>  
> +#define NUMA_STAT_THRESHOLD  32765
> +

This should be expressed in terms of the type and not a hard-coded value.

-- 
Mel Gorman
SUSE Labs


[PATCH 2/2] mm: Update NUMA counter threshold size

2017-08-15 Thread Kemi Wang
There is significant overhead from cache bouncing caused by updating the zone
counters (the NUMA-associated counters) in parallel during multi-threaded page
allocation (suggested by Dave Hansen).

This patch changes the NUMA counter threshold to a fixed size of 32765,
because a small threshold greatly increases how often the local per-cpu
counters spill into the global counter. The per-cpu NUMA counts
(vm_numa_stat_diff[]) are added to zone->vm_numa_stat[] when a user *reads*
the value of a NUMA counter, to eliminate the deviation (suggested by
Ying Huang).

The rationale is that these statistics counters do not need to be read often,
unlike other VM counters, so it is not a problem to use a large threshold and
make readers more expensive.
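(The cost shows up on the read side: as the zone_numa_state_snapshot() hunk
below shows, a read now walks the vm_numa_stat_diff[] of every online cpu
before returning the value.)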

With this patchset, we see a 26.6% drop in CPU cycles (537 --> 394) per
single page allocation and reclaim on Jesper's page_bench03 benchmark.

Benchmark provided by Jesper Dangaard Brouer (loop count increased to 1000):
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

 Threshold   CPU cycles   Throughput(88 threads)
 32  799 241760478
 64  640 301628829
 125 537 358906028 <==> system by default (base)
 256 468 412397590
 512 428 450550704
 4096 399 482520943
 2   394 489009617
 3   395 488017817
 32765   394(-26.6%) 488932078(+36.2%) <==> with this patchset
 N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics

Signed-off-by: Kemi Wang 
Suggested-by: Dave Hansen 
Suggested-by: Ying Huang 
---
 include/linux/mmzone.h |  4 ++--
 include/linux/vmstat.h |  6 +-
 mm/vmstat.c| 23 ++-
 3 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0b11ba7..7eaf0e8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -282,8 +282,8 @@ struct per_cpu_pageset {
struct per_cpu_pages pcp;
 #ifdef CONFIG_NUMA
s8 expire;
-   s8 numa_stat_threshold;
-   s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
+   s16 numa_stat_threshold;
+   s16 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
 #endif
 #ifdef CONFIG_SMP
s8 stat_threshold;
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 1e19379..d97cc34 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -125,10 +125,14 @@ static inline unsigned long global_numa_state(enum zone_numa_stat_item item)
return x;
 }
 
-static inline unsigned long zone_numa_state(struct zone *zone,
+static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
enum zone_numa_stat_item item)
 {
long x = atomic_long_read(&zone->vm_numa_stat[item]);
+   int cpu;
+
+   for_each_online_cpu(cpu)
+   x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
 
return x;
 }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 5a7fa30..c7f50ed 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -30,6 +30,8 @@
 
 #include "internal.h"
 
+#define NUMA_STAT_THRESHOLD  32765
+
 #ifdef CONFIG_VM_EVENT_COUNTERS
 DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
 EXPORT_PER_CPU_SYMBOL(vm_event_states);
@@ -196,7 +198,7 @@ void refresh_zone_stat_thresholds(void)
= threshold;
 #ifdef CONFIG_NUMA
per_cpu_ptr(zone->pageset, cpu)->numa_stat_threshold
-   = threshold;
+   = NUMA_STAT_THRESHOLD;
 #endif
/* Base nodestat threshold on the largest populated zone. */
pgdat_threshold = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold;
@@ -231,14 +233,9 @@ void set_pgdat_percpu_threshold(pg_data_t *pgdat,
continue;
 
threshold = (*calculate_pressure)(zone);
-   for_each_online_cpu(cpu) {
+   for_each_online_cpu(cpu)
per_cpu_ptr(zone->pageset, cpu)->stat_threshold
= threshold;
-#ifdef CONFIG_NUMA
-   per_cpu_ptr(zone->pageset, cpu)->numa_stat_threshold
-   = threshold;
-#endif
-   }
}
 }
 
@@ -872,13 +869,13 @@ void __inc_zone_numa_state(struct zone *zone,
 enum zone_numa_stat_item item)
 {
struct per_cpu_pageset __percpu *pcp = zone->pageset;
-   s8 __percpu *p = pcp->vm_numa_stat_diff + item;
-   s8 v, t;
+   s16 __percpu *p = pcp->vm_numa_stat_diff + item;
+   s16 v, t;
 
v = __this_cpu_inc_return(*p);
t = __this_cpu_read(pcp->numa_stat_threshold);
if (unlikely(v > t)) {