Re: [PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16

2017-12-19 Thread kemi


On 2017-12-20 01:21, Christopher Lameter wrote:
> On Tue, 19 Dec 2017, Michal Hocko wrote:
> 
>>> Well the reason for s8 was to keep the data structures small so that
>>> they fit in the higher-level cpu caches. The larger these structures
>>> become, the more cachelines are used by the counters and the greater
>>> the performance impact on code that should not be affected by this
>>> overhead.
>>
>> I am not sure I understand. We usually do not access more counters in
>> a single code path (well, PGALLOC and the NUMA counters are more of an
>> exception). So it is rarely an advantage that the whole array is in the
>> same cache line. Besides that, this is allocated by the percpu
>> allocator, which aligns to the type size rather than to cache lines
>> AFAICS.
> 
> I thought we were talking about NUMA counters here?
> 
> Regardless: A typical fault, system call or OS action will access multiple
> zone and node counters when allocating or freeing memory. Enlarging the
> fields will increase the number of cachelines touched.
> 

Yes, theoretically we touch one more cache line. But I don't think it
would be a problem.
1) Not all of the counters need to be accessed in the fast path of page
allocation; the counters covered by a single cache line are usually
enough for that, so we probably do not need to touch an extra cache
line. I tend to agree with Michal's argument.
Besides, in the slow paths where the code is protected by the zone lock
or the lru lock, touching one more cache line would not be a big
problem, since many other cache lines are accessed there anyway.

2) Enlarging vm_node_stat_diff from s8 to s16 allows a larger value to
be kept in the local per-cpu counters, which makes it possible to
reduce the global counter update frequency. Thus, we can gain a benefit
by reducing expensive cache bouncing.
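
To make the mechanism concrete, below is a minimal, simplified model of
the per-cpu differential counter idea (illustrative userspace C, not
the kernel code; the names and the example threshold are made up). It
shows why a wider per-cpu diff type permits a larger stat_threshold and
therefore fewer updates of the shared global counter:

#include <stdatomic.h>

typedef short s16;              /* the local per-cpu delta, previously s8 */

struct stat_counter {
        atomic_long global;     /* shared across all CPUs */
};

struct pcp_stat {
        s16 stat_threshold;     /* e.g. a value above 127, which would not fit in s8 */
        s16 diff;               /* per-cpu accumulated delta */
};

/* called on the owning CPU, e.g. with preemption disabled */
static void inc_stat(struct stat_counter *sc, struct pcp_stat *pcp)
{
        s16 v = ++pcp->diff;
        s16 t = pcp->stat_threshold;

        if (v > t) {
                /* overstep by t/2 so the next flush does not happen right away */
                s16 overstep = t >> 1;

                atomic_fetch_add(&sc->global, (long)(v + overstep));
                pcp->diff = -overstep;
        }
}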

Well, if you still have some concerns, I can post some data for
will-it-scale.page_fault1.
The benchmark forks nr_cpu processes, and each process runs the
following in a loop (a simplified sketch of the loop is shown after the
link below):
1) mmap() 128M of anonymous space;
2) write to each page there to trigger actual page allocation;
3) munmap() it.
https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
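
For reference, here is a simplified sketch of what each worker process
does, based only on the description above (the real test is in the
will-it-scale repository linked above; this is not that code and the
helper name is made up):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

#define MEMSIZE (128UL * 1024 * 1024)   /* 128M of anonymous memory */

static void page_fault1_loop(unsigned long *iterations)
{
        long pagesize = sysconf(_SC_PAGESIZE);

        for (;;) {
                char *p = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED)
                        break;

                /* write one byte per page to force actual page allocation */
                for (unsigned long off = 0; off < MEMSIZE; off += pagesize)
                        p[off] = 1;

                munmap(p, MEMSIZE);
                (*iterations)++;
        }
}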

Or you can provide some other benchmarks on which you want to see performance 
impact.

>> Maybe it used to be all different back then when the code was added,
>> but arguing about cache lines seems to be a bit problematic here. Maybe
>> you have some specific workloads which can prove me wrong?
> 
> Run a workload that does some page faults? Heavy allocation and freeing of
> memory?
> 
> Maybe that is no longer relevant since the number of counters is so
> large that the accesses are so sparse that each action pulls in a whole
> cacheline. That would be something we tried to avoid when implementing
> the differentials.
> 
> 


Re: [PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16

2017-12-19 Thread kemi


On 2017-12-19 20:38, Michal Hocko wrote:
> On Tue 19-12-17 14:39:23, Kemi Wang wrote:
>> The s8 type used for the vm_diff_nodestat[] local cpu counters forces
>> frequent global counter updates because of its small range, especially
>> for monotonically increasing counters such as the NUMA counters with
>> more and more cpus/nodes. This patch extends the type of
>> vm_diff_nodestat from s8 to s16 without any functional change.
>>
>>                                   before   after
>> sizeof(struct per_cpu_nodestat)      28      68
> 
> So it is 40B * num_cpus * num_nodes. Nothing really catastrophic IMHO
> but the changelog is a bit silent about any numbers. This is a
> performance optimization, so it had better give us some.
>  

This patch does not make any functional change, so there is no
performance gain from it by itself, I suppose.
I guess you are talking about the performance gain from the third
patch, which increases the threshold size of the NUMA counters.

>> Signed-off-by: Kemi Wang 
>> ---
>>  include/linux/mmzone.h |  4 ++--
>>  mm/vmstat.c            | 16 ++++++++--------
>>  2 files changed, 10 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index c06d880..2da6b6f 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -289,8 +289,8 @@ struct per_cpu_pageset {
>>  };
>>  
>>  struct per_cpu_nodestat {
>> -s8 stat_threshold;
>> -s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
>> +s16 stat_threshold;
>> +s16 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
>>  };
>>  
>>  #endif /* !__GENERATING_BOUNDS.H */
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 1dd12ae..9c681cc 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -332,7 +332,7 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
>>  long delta)
>>  {
>>  struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
>> -s8 __percpu *p = pcp->vm_node_stat_diff + item;
>> +s16 __percpu *p = pcp->vm_node_stat_diff + item;
>>  long x;
>>  long t;
>>  
>> @@ -390,13 +390,13 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
>>  void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>>  {
>>  struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
>> -s8 __percpu *p = pcp->vm_node_stat_diff + item;
>> -s8 v, t;
>> +s16 __percpu *p = pcp->vm_node_stat_diff + item;
>> +s16 v, t;
>>  
>>  v = __this_cpu_inc_return(*p);
>>  t = __this_cpu_read(pcp->stat_threshold);
>>  if (unlikely(v > t)) {
>> -s8 overstep = t >> 1;
>> +s16 overstep = t >> 1;
>>  
>>  node_page_state_add(v + overstep, pgdat, item);
>>  __this_cpu_write(*p, -overstep);
>> @@ -434,13 +434,13 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
>>  void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>>  {
>>  struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
>> -s8 __percpu *p = pcp->vm_node_stat_diff + item;
>> -s8 v, t;
>> +s16 __percpu *p = pcp->vm_node_stat_diff + item;
>> +s16 v, t;
>>  
>>  v = __this_cpu_dec_return(*p);
>>  t = __this_cpu_read(pcp->stat_threshold);
>>  if (unlikely(v < - t)) {
>> -s8 overstep = t >> 1;
>> +s16 overstep = t >> 1;
>>  
>>  node_page_state_add(v - overstep, pgdat, item);
>>  __this_cpu_write(*p, overstep);
>> @@ -533,7 +533,7 @@ static inline void mod_node_state(struct pglist_data *pgdat,
>> enum node_stat_item item, int delta, int overstep_mode)
>>  {
>>  struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
>> -s8 __percpu *p = pcp->vm_node_stat_diff + item;
>> +s16 __percpu *p = pcp->vm_node_stat_diff + item;
>>  long o, n, t, z;
>>  
>>  do {
>> -- 
>> 2.7.4
>>
> 


Re: [PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16

2017-12-19 Thread Christopher Lameter
On Tue, 19 Dec 2017, Michal Hocko wrote:

> > Well the reason for s8 was to keep the data structures small so that
> > they fit in the higher-level cpu caches. The larger these structures
> > become, the more cachelines are used by the counters and the greater
> > the performance impact on code that should not be affected by this
> > overhead.
>
> I am not sure I understand. We usually do not access more counters in
> a single code path (well, PGALLOC and the NUMA counters are more of an
> exception). So it is rarely an advantage that the whole array is in the
> same cache line. Besides that, this is allocated by the percpu
> allocator, which aligns to the type size rather than to cache lines
> AFAICS.

I thought we were talking about NUMA counters here?

Regardless: A typical fault, system call or OS action will access multiple
zone and node counters when allocating or freeing memory. Enlarging the
fields will increase the number of cachelines touched.
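
As a rough way to put numbers on this (a sketch, assuming 64-byte cache
lines and taking the 28-byte/68-byte sizes from the patch changelog;
the real footprint also depends on where the percpu allocator places
the structure):

#include <stdio.h>

#define CACHE_LINE 64UL         /* assumed cache line size */

/* lines covered when the structure starts at a cache-line boundary */
static unsigned long lines_spanned(unsigned long size)
{
        return (size + CACHE_LINE - 1) / CACHE_LINE;
}

int main(void)
{
        printf("s8  per_cpu_nodestat (28 bytes): %lu line(s)\n",
               lines_spanned(28));      /* -> 1 */
        printf("s16 per_cpu_nodestat (68 bytes): %lu line(s)\n",
               lines_spanned(68));      /* -> 2 */
        return 0;
}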

> Maybe it used to be all different back then when the code was added,
> but arguing about cache lines seems to be a bit problematic here. Maybe
> you have some specific workloads which can prove me wrong?

Run a workload that does some page faults? Heavy allocation and freeing of
memory?

Maybe that is no longer relevant since the number of counters is so
large that the accesses are so sparse that each action pulls in a whole
cacheline. That would be something we tried to avoid when implementing
the differentials.




Re: [PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16

2017-12-19 Thread Michal Hocko
On Tue 19-12-17 10:05:48, Christopher Lameter wrote:
> On Tue, 19 Dec 2017, Kemi Wang wrote:
> 
> > The s8 type used for the vm_diff_nodestat[] local cpu counters forces
> > frequent global counter updates because of its small range, especially
> > for monotonically increasing counters such as the NUMA counters with
> > more and more cpus/nodes. This patch extends the type of
> > vm_diff_nodestat from s8 to s16 without any functional change.
> 
> Well the reason for s8 was to keep the data structures small so that
> they fit in the higher-level cpu caches. The larger these structures
> become, the more cachelines are used by the counters and the greater
> the performance impact on code that should not be affected by this
> overhead.
 
I am not sure I understand. We usually do not access more counters in
a single code path (well, PGALLOC and the NUMA counters are more of an
exception). So it is rarely an advantage that the whole array is in the
same cache line. Besides that, this is allocated by the percpu
allocator, which aligns to the type size rather than to cache lines
AFAICS.

Maybe it used to be all different back then when the code was added,
but arguing about cache lines seems to be a bit problematic here. Maybe
you have some specific workloads which can prove me wrong?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16

2017-12-19 Thread Christopher Lameter
On Tue, 19 Dec 2017, Kemi Wang wrote:

> The s8 type used for the vm_diff_nodestat[] local cpu counters forces
> frequent global counter updates because of its small range, especially
> for monotonically increasing counters such as the NUMA counters with
> more and more cpus/nodes. This patch extends the type of
> vm_diff_nodestat from s8 to s16 without any functional change.

Well the reason for s8 was to keep the data structures small so that
they fit in the higher-level cpu caches. The larger these structures
become, the more cachelines are used by the counters and the greater
the performance impact on code that should not be affected by this
overhead.



Re: [PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16

2017-12-19 Thread Michal Hocko
On Tue 19-12-17 14:39:23, Kemi Wang wrote:
> The s8 type used for the vm_diff_nodestat[] local cpu counters forces
> frequent global counter updates because of its small range, especially
> for monotonically increasing counters such as the NUMA counters with
> more and more cpus/nodes. This patch extends the type of
> vm_diff_nodestat from s8 to s16 without any functional change.
> 
>                                    before   after
> sizeof(struct per_cpu_nodestat)       28      68

So it is 40B * num_cpus * num_nodes. Nothing really catastrophic IMHO
but the changelog is a bit silent about any numbers. This is a
performance optimization, so it had better give us some.
 
> Signed-off-by: Kemi Wang 
> ---
>  include/linux/mmzone.h |  4 ++--
>  mm/vmstat.c            | 16 ++++++++--------
>  2 files changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index c06d880..2da6b6f 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -289,8 +289,8 @@ struct per_cpu_pageset {
>  };
>  
>  struct per_cpu_nodestat {
> - s8 stat_threshold;
> - s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
> + s16 stat_threshold;
> + s16 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
>  };
>  
>  #endif /* !__GENERATING_BOUNDS.H */
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 1dd12ae..9c681cc 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -332,7 +332,7 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
>   long delta)
>  {
>   struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
> - s8 __percpu *p = pcp->vm_node_stat_diff + item;
> + s16 __percpu *p = pcp->vm_node_stat_diff + item;
>   long x;
>   long t;
>  
> @@ -390,13 +390,13 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
>  void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>  {
>   struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
> - s8 __percpu *p = pcp->vm_node_stat_diff + item;
> - s8 v, t;
> + s16 __percpu *p = pcp->vm_node_stat_diff + item;
> + s16 v, t;
>  
>   v = __this_cpu_inc_return(*p);
>   t = __this_cpu_read(pcp->stat_threshold);
>   if (unlikely(v > t)) {
> - s8 overstep = t >> 1;
> + s16 overstep = t >> 1;
>  
>   node_page_state_add(v + overstep, pgdat, item);
>   __this_cpu_write(*p, -overstep);
> @@ -434,13 +434,13 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
>  void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>  {
>   struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
> - s8 __percpu *p = pcp->vm_node_stat_diff + item;
> - s8 v, t;
> + s16 __percpu *p = pcp->vm_node_stat_diff + item;
> + s16 v, t;
>  
>   v = __this_cpu_dec_return(*p);
>   t = __this_cpu_read(pcp->stat_threshold);
>   if (unlikely(v < - t)) {
> - s8 overstep = t >> 1;
> + s16 overstep = t >> 1;
>  
>   node_page_state_add(v - overstep, pgdat, item);
>   __this_cpu_write(*p, overstep);
> @@ -533,7 +533,7 @@ static inline void mod_node_state(struct pglist_data *pgdat,
> enum node_stat_item item, int delta, int overstep_mode)
>  {
>   struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
> - s8 __percpu *p = pcp->vm_node_stat_diff + item;
> + s16 __percpu *p = pcp->vm_node_stat_diff + item;
>   long o, n, t, z;
>  
>   do {
> -- 
> 2.7.4
> 

-- 
Michal Hocko
SUSE Labs

