Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-11 Thread Vlastimil Babka
On 10/10/2017 05:39 PM, Dave Hansen wrote:
> On 10/10/2017 07:57 AM, Michal Hocko wrote:
>>> But, let's be honest, this leaves us with an option that nobody is ever
>>> going to turn on.  IOW, nobody except a very small portion of our users
>>> will ever see any benefit from this.
>> But aren't those small groups who would like to squeeze every single
>> cycle out from the page allocator path the targeted audience?
> 
> They're the reason we started looking at this.  They also care the most.
> 
> But, the cost of these stats, especially we get more and more cores in a
> NUMA node is really making them show up in profiles.  It would be nice
> to get rid of them there, too.

Furthermore, the group that actually looks at those stats, could be also
expected to be quite small. The group that cares neither about the
stats, nor relies on top allocator performance, might still arguably
benefit from improved allocator performance, but won't for sure benefit
from the stats.

> Aaron, do you remember offhand how much of the allocator overhead was
> coming from NUMA stats?
> 



Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-11 Thread Vlastimil Babka
On 10/10/2017 05:39 PM, Dave Hansen wrote:
> On 10/10/2017 07:57 AM, Michal Hocko wrote:
>>> But, let's be honest, this leaves us with an option that nobody is ever
>>> going to turn on.  IOW, nobody except a very small portion of our users
>>> will ever see any benefit from this.
>> But aren't those small groups who would like to squeeze every single
>> cycle out from the page allocator path the targeted audience?
> 
> They're the reason we started looking at this.  They also care the most.
> 
> But, the cost of these stats, especially we get more and more cores in a
> NUMA node is really making them show up in profiles.  It would be nice
> to get rid of them there, too.

Furthermore, the group that actually looks at those stats, could be also
expected to be quite small. The group that cares neither about the
stats, nor relies on top allocator performance, might still arguably
benefit from improved allocator performance, but won't for sure benefit
from the stats.

> Aaron, do you remember offhand how much of the allocator overhead was
> coming from NUMA stats?
> 



Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-10 Thread Andrew Morton
On Tue, 10 Oct 2017 19:51:43 +0200 Michal Hocko  wrote:

> On Tue 10-10-17 08:39:47, Dave Hansen wrote:
> > On 10/10/2017 07:57 AM, Michal Hocko wrote:
> > >> But, let's be honest, this leaves us with an option that nobody is ever
> > >> going to turn on.  IOW, nobody except a very small portion of our users
> > >> will ever see any benefit from this.
> > > But aren't those small groups who would like to squeeze every single
> > > cycle out from the page allocator path the targeted audience?
> > 
> > They're the reason we started looking at this.  They also care the most.
> > 
> > But, the cost of these stats, especially we get more and more cores in a
> > NUMA node is really making them show up in profiles.  It would be nice
> > to get rid of them there, too.
> 
> I am not opposing to the auto mode. I am just not sure it is a safe
> default and I also think that we should add this on top if it is really
> needed.

Yup.  Let's keep things simple unless a real need is demonstrated, please.


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-10 Thread Andrew Morton
On Tue, 10 Oct 2017 19:51:43 +0200 Michal Hocko  wrote:

> On Tue 10-10-17 08:39:47, Dave Hansen wrote:
> > On 10/10/2017 07:57 AM, Michal Hocko wrote:
> > >> But, let's be honest, this leaves us with an option that nobody is ever
> > >> going to turn on.  IOW, nobody except a very small portion of our users
> > >> will ever see any benefit from this.
> > > But aren't those small groups who would like to squeeze every single
> > > cycle out from the page allocator path the targeted audience?
> > 
> > They're the reason we started looking at this.  They also care the most.
> > 
> > But, the cost of these stats, especially we get more and more cores in a
> > NUMA node is really making them show up in profiles.  It would be nice
> > to get rid of them there, too.
> 
> I am not opposing to the auto mode. I am just not sure it is a safe
> default and I also think that we should add this on top if it is really
> needed.

Yup.  Let's keep things simple unless a real need is demonstrated, please.


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-10 Thread Michal Hocko
On Tue 10-10-17 08:39:47, Dave Hansen wrote:
> On 10/10/2017 07:57 AM, Michal Hocko wrote:
> >> But, let's be honest, this leaves us with an option that nobody is ever
> >> going to turn on.  IOW, nobody except a very small portion of our users
> >> will ever see any benefit from this.
> > But aren't those small groups who would like to squeeze every single
> > cycle out from the page allocator path the targeted audience?
> 
> They're the reason we started looking at this.  They also care the most.
> 
> But, the cost of these stats, especially we get more and more cores in a
> NUMA node is really making them show up in profiles.  It would be nice
> to get rid of them there, too.

I am not opposing to the auto mode. I am just not sure it is a safe
default and I also think that we should add this on top if it is really
needed.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-10 Thread Michal Hocko
On Tue 10-10-17 08:39:47, Dave Hansen wrote:
> On 10/10/2017 07:57 AM, Michal Hocko wrote:
> >> But, let's be honest, this leaves us with an option that nobody is ever
> >> going to turn on.  IOW, nobody except a very small portion of our users
> >> will ever see any benefit from this.
> > But aren't those small groups who would like to squeeze every single
> > cycle out from the page allocator path the targeted audience?
> 
> They're the reason we started looking at this.  They also care the most.
> 
> But, the cost of these stats, especially we get more and more cores in a
> NUMA node is really making them show up in profiles.  It would be nice
> to get rid of them there, too.

I am not opposing to the auto mode. I am just not sure it is a safe
default and I also think that we should add this on top if it is really
needed.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-10 Thread Dave Hansen
On 10/10/2017 07:57 AM, Michal Hocko wrote:
>> But, let's be honest, this leaves us with an option that nobody is ever
>> going to turn on.  IOW, nobody except a very small portion of our users
>> will ever see any benefit from this.
> But aren't those small groups who would like to squeeze every single
> cycle out from the page allocator path the targeted audience?

They're the reason we started looking at this.  They also care the most.

But, the cost of these stats, especially we get more and more cores in a
NUMA node is really making them show up in profiles.  It would be nice
to get rid of them there, too.

Aaron, do you remember offhand how much of the allocator overhead was
coming from NUMA stats?


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-10 Thread Dave Hansen
On 10/10/2017 07:57 AM, Michal Hocko wrote:
>> But, let's be honest, this leaves us with an option that nobody is ever
>> going to turn on.  IOW, nobody except a very small portion of our users
>> will ever see any benefit from this.
> But aren't those small groups who would like to squeeze every single
> cycle out from the page allocator path the targeted audience?

They're the reason we started looking at this.  They also care the most.

But, the cost of these stats, especially we get more and more cores in a
NUMA node is really making them show up in profiles.  It would be nice
to get rid of them there, too.

Aaron, do you remember offhand how much of the allocator overhead was
coming from NUMA stats?


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-10 Thread Christopher Lameter
On Tue, 10 Oct 2017, Michal Hocko wrote:

> > But, let's be honest, this leaves us with an option that nobody is ever
> > going to turn on.  IOW, nobody except a very small portion of our users
> > will ever see any benefit from this.
>
> But aren't those small groups who would like to squeeze every single
> cycle out from the page allocator path the targeted audience?

Those have long sine raised the white flag and succumbed to the
featuritis. Resigned to try to keep the bloat restricted to a couple of
cores so that the rest of the cores stay usable.



Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-10 Thread Christopher Lameter
On Tue, 10 Oct 2017, Michal Hocko wrote:

> > But, let's be honest, this leaves us with an option that nobody is ever
> > going to turn on.  IOW, nobody except a very small portion of our users
> > will ever see any benefit from this.
>
> But aren't those small groups who would like to squeeze every single
> cycle out from the page allocator path the targeted audience?

Those have long sine raised the white flag and succumbed to the
featuritis. Resigned to try to keep the bloat restricted to a couple of
cores so that the rest of the cores stay usable.



Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-10 Thread Michal Hocko
On Tue 10-10-17 07:53:50, Dave Hansen wrote:
> On 10/10/2017 07:31 AM, Michal Hocko wrote:
> > On Tue 10-10-17 07:29:31, Dave Hansen wrote:
> >> On 10/09/2017 10:49 PM, Michal Hocko wrote:
> >>> Anyway I still stand by my position that this sounds over-engineered and
> >>> a simple 0/1 resp. on/off interface would be both simpler and safer. If
> >>> anybody wants an auto mode it can be added later (as a value 2 resp.
> >>> auto).
> >>
> >> 0/1 with the default set to the strict, slower mode?
> > 
> > yes, keep the current semantic and allow users who care to disable
> > something that stands in the way.
> 
> But, let's be honest, this leaves us with an option that nobody is ever
> going to turn on.  IOW, nobody except a very small portion of our users
> will ever see any benefit from this.

But aren't those small groups who would like to squeeze every single
cycle out from the page allocator path the targeted audience?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-10 Thread Michal Hocko
On Tue 10-10-17 07:53:50, Dave Hansen wrote:
> On 10/10/2017 07:31 AM, Michal Hocko wrote:
> > On Tue 10-10-17 07:29:31, Dave Hansen wrote:
> >> On 10/09/2017 10:49 PM, Michal Hocko wrote:
> >>> Anyway I still stand by my position that this sounds over-engineered and
> >>> a simple 0/1 resp. on/off interface would be both simpler and safer. If
> >>> anybody wants an auto mode it can be added later (as a value 2 resp.
> >>> auto).
> >>
> >> 0/1 with the default set to the strict, slower mode?
> > 
> > yes, keep the current semantic and allow users who care to disable
> > something that stands in the way.
> 
> But, let's be honest, this leaves us with an option that nobody is ever
> going to turn on.  IOW, nobody except a very small portion of our users
> will ever see any benefit from this.

But aren't those small groups who would like to squeeze every single
cycle out from the page allocator path the targeted audience?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-10 Thread Dave Hansen
On 10/10/2017 07:31 AM, Michal Hocko wrote:
> On Tue 10-10-17 07:29:31, Dave Hansen wrote:
>> On 10/09/2017 10:49 PM, Michal Hocko wrote:
>>> Anyway I still stand by my position that this sounds over-engineered and
>>> a simple 0/1 resp. on/off interface would be both simpler and safer. If
>>> anybody wants an auto mode it can be added later (as a value 2 resp.
>>> auto).
>>
>> 0/1 with the default set to the strict, slower mode?
> 
> yes, keep the current semantic and allow users who care to disable
> something that stands in the way.

But, let's be honest, this leaves us with an option that nobody is ever
going to turn on.  IOW, nobody except a very small portion of our users
will ever see any benefit from this.



Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-10 Thread Dave Hansen
On 10/10/2017 07:31 AM, Michal Hocko wrote:
> On Tue 10-10-17 07:29:31, Dave Hansen wrote:
>> On 10/09/2017 10:49 PM, Michal Hocko wrote:
>>> Anyway I still stand by my position that this sounds over-engineered and
>>> a simple 0/1 resp. on/off interface would be both simpler and safer. If
>>> anybody wants an auto mode it can be added later (as a value 2 resp.
>>> auto).
>>
>> 0/1 with the default set to the strict, slower mode?
> 
> yes, keep the current semantic and allow users who care to disable
> something that stands in the way.

But, let's be honest, this leaves us with an option that nobody is ever
going to turn on.  IOW, nobody except a very small portion of our users
will ever see any benefit from this.



Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-10 Thread Michal Hocko
On Tue 10-10-17 07:29:31, Dave Hansen wrote:
> On 10/09/2017 10:49 PM, Michal Hocko wrote:
> > On Mon 09-10-17 09:55:49, Michal Hocko wrote:
> >> I haven't checked closely but what happens (or should happen) when you
> >> do a partial read? Should you get an inconsistent results? Or is this
> >> impossible?
> > Well, after thinking about it little bit more, partial reads are always
> > inconsistent so this wouldn't add a new problem.
> > 
> > Anyway I still stand by my position that this sounds over-engineered and
> > a simple 0/1 resp. on/off interface would be both simpler and safer. If
> > anybody wants an auto mode it can be added later (as a value 2 resp.
> > auto).
> 
> 0/1 with the default set to the strict, slower mode?

yes, keep the current semantic and allow users who care to disable
something that stands in the way.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-10 Thread Michal Hocko
On Tue 10-10-17 07:29:31, Dave Hansen wrote:
> On 10/09/2017 10:49 PM, Michal Hocko wrote:
> > On Mon 09-10-17 09:55:49, Michal Hocko wrote:
> >> I haven't checked closely but what happens (or should happen) when you
> >> do a partial read? Should you get an inconsistent results? Or is this
> >> impossible?
> > Well, after thinking about it little bit more, partial reads are always
> > inconsistent so this wouldn't add a new problem.
> > 
> > Anyway I still stand by my position that this sounds over-engineered and
> > a simple 0/1 resp. on/off interface would be both simpler and safer. If
> > anybody wants an auto mode it can be added later (as a value 2 resp.
> > auto).
> 
> 0/1 with the default set to the strict, slower mode?

yes, keep the current semantic and allow users who care to disable
something that stands in the way.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-10 Thread Dave Hansen
On 10/09/2017 10:49 PM, Michal Hocko wrote:
> On Mon 09-10-17 09:55:49, Michal Hocko wrote:
>> I haven't checked closely but what happens (or should happen) when you
>> do a partial read? Should you get an inconsistent results? Or is this
>> impossible?
> Well, after thinking about it little bit more, partial reads are always
> inconsistent so this wouldn't add a new problem.
> 
> Anyway I still stand by my position that this sounds over-engineered and
> a simple 0/1 resp. on/off interface would be both simpler and safer. If
> anybody wants an auto mode it can be added later (as a value 2 resp.
> auto).

0/1 with the default set to the strict, slower mode?


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-10 Thread Dave Hansen
On 10/09/2017 10:49 PM, Michal Hocko wrote:
> On Mon 09-10-17 09:55:49, Michal Hocko wrote:
>> I haven't checked closely but what happens (or should happen) when you
>> do a partial read? Should you get an inconsistent results? Or is this
>> impossible?
> Well, after thinking about it little bit more, partial reads are always
> inconsistent so this wouldn't add a new problem.
> 
> Anyway I still stand by my position that this sounds over-engineered and
> a simple 0/1 resp. on/off interface would be both simpler and safer. If
> anybody wants an auto mode it can be added later (as a value 2 resp.
> auto).

0/1 with the default set to the strict, slower mode?


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-09 Thread kemi


On 2017年10月10日 13:49, Michal Hocko wrote:
> On Mon 09-10-17 09:55:49, Michal Hocko wrote:
>> I haven't checked closely but what happens (or should happen) when you
>> do a partial read? Should you get an inconsistent results? Or is this
>> impossible?
> 
> Well, after thinking about it little bit more, partial reads are always
> inconsistent so this wouldn't add a new problem.
> 
> Anyway I still stand by my position that this sounds over-engineered and
> a simple 0/1 resp. on/off interface would be both simpler and safer. If
> anybody wants an auto mode it can be added later (as a value 2 resp.
> auto).
> 

It sounds good to me. If Andrew also tends to be a simple 0/1, I will submit
V4 patch for it. Thanks


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-09 Thread kemi


On 2017年10月10日 13:49, Michal Hocko wrote:
> On Mon 09-10-17 09:55:49, Michal Hocko wrote:
>> I haven't checked closely but what happens (or should happen) when you
>> do a partial read? Should you get an inconsistent results? Or is this
>> impossible?
> 
> Well, after thinking about it little bit more, partial reads are always
> inconsistent so this wouldn't add a new problem.
> 
> Anyway I still stand by my position that this sounds over-engineered and
> a simple 0/1 resp. on/off interface would be both simpler and safer. If
> anybody wants an auto mode it can be added later (as a value 2 resp.
> auto).
> 

It sounds good to me. If Andrew also tends to be a simple 0/1, I will submit
V4 patch for it. Thanks


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-09 Thread Michal Hocko
On Mon 09-10-17 09:55:49, Michal Hocko wrote:
> I haven't checked closely but what happens (or should happen) when you
> do a partial read? Should you get an inconsistent results? Or is this
> impossible?

Well, after thinking about it little bit more, partial reads are always
inconsistent so this wouldn't add a new problem.

Anyway I still stand by my position that this sounds over-engineered and
a simple 0/1 resp. on/off interface would be both simpler and safer. If
anybody wants an auto mode it can be added later (as a value 2 resp.
auto).
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-09 Thread Michal Hocko
On Mon 09-10-17 09:55:49, Michal Hocko wrote:
> I haven't checked closely but what happens (or should happen) when you
> do a partial read? Should you get an inconsistent results? Or is this
> impossible?

Well, after thinking about it little bit more, partial reads are always
inconsistent so this wouldn't add a new problem.

Anyway I still stand by my position that this sounds over-engineered and
a simple 0/1 resp. on/off interface would be both simpler and safer. If
anybody wants an auto mode it can be added later (as a value 2 resp.
auto).
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-09 Thread Michal Hocko
On Mon 09-10-17 14:34:11, kemi wrote:
> 
> 
> On 2017年10月03日 17:23, Michal Hocko wrote:
> > On Thu 28-09-17 14:11:41, Kemi Wang wrote:
> >> This is the second step which introduces a tunable interface that allow
> >> numa stats configurable for optimizing zone_statistics(), as suggested by
> >> Dave Hansen and Ying Huang.
> >>
> >> =
> >> When page allocation performance becomes a bottleneck and you can tolerate
> >> some possible tool breakage and decreased numa counter precision, you can
> >> do:
> >>echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
> >> In this case, numa counter update is ignored. We can see about
> >> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
> >> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
> >> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
> >> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
> >> memory).
> >>
> >> Benchmark link provided by Jesper D Brouer(increase loop times to
> >> 1000):
> >> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
> >> bench
> >>
> >> =
> >> When page allocation performance is not a bottleneck and you want all
> >> tooling to work, you can do:
> >>echo [S|s]trict > /proc/sys/vm/numa_stats_mode
> >>
> >> =
> >> We recommend automatic detection of numa statistics by system, this is also
> >> system default configuration, you can do:
> >>echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> >> In this case, numa counter update is skipped unless it has been read by
> >> users at least once, e.g. cat /proc/zoneinfo.
> > 
> > I am still not convinced the auto mode is worth all the additional code
> > and a safe default to use. The whole thing could have been 0/1 with a
> > simpler parsing and less code to catch readers.
> > 
> 
> I understood your concern. 
> Well, we may get rid of auto mode if there is some obvious disadvantage
> here. Now, I tend to keep it because most people may not touch this interface,
> and auto mode is helpful in such case.

But you cannot guarantee it won't break any existing users, can you?
Besides I do not remember anybody complaining about the performance
impact of these counters other than very specialized workloads which are
going to disable the accounting altogether. So I simply fail to see a
reason to add more code with a questionable semantic (see below on
partial reads).

> > E.g. why do we have to do static_branch_enable on any read or even
> > vmstat_stop? Wouldn't open be sufficient?
> > 
> 
> NUMA stats is used in four files:
> /proc/zoneinfo
> /proc/vmstat
> /sys/devices/system/node/node*/numastat
> /sys/devices/system/node/node*/vmstat
> In auto mode, each *read* will trigger the update of NUMA counter. 
> So, we should make sure the target branch is jumped to the branch 
> for NUMA counter update once the file is read from user space.
> the intension of static_branch_enable in vmstat_stop(in the call site 
> of file->file_ops.read) is for reading /proc/vmstat in case.  
> 
> I guess the *open* means file->file_op.open here, right?
> Do you suggest to move static_branch_enable to file->file_op.open? Thanks.

I haven't checked closely but what happens (or should happen) when you
do a partial read? Should you get an inconsistent results? Or is this
impossible?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-09 Thread Michal Hocko
On Mon 09-10-17 14:34:11, kemi wrote:
> 
> 
> On 2017年10月03日 17:23, Michal Hocko wrote:
> > On Thu 28-09-17 14:11:41, Kemi Wang wrote:
> >> This is the second step which introduces a tunable interface that allow
> >> numa stats configurable for optimizing zone_statistics(), as suggested by
> >> Dave Hansen and Ying Huang.
> >>
> >> =
> >> When page allocation performance becomes a bottleneck and you can tolerate
> >> some possible tool breakage and decreased numa counter precision, you can
> >> do:
> >>echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
> >> In this case, numa counter update is ignored. We can see about
> >> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
> >> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
> >> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
> >> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
> >> memory).
> >>
> >> Benchmark link provided by Jesper D Brouer(increase loop times to
> >> 1000):
> >> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
> >> bench
> >>
> >> =
> >> When page allocation performance is not a bottleneck and you want all
> >> tooling to work, you can do:
> >>echo [S|s]trict > /proc/sys/vm/numa_stats_mode
> >>
> >> =
> >> We recommend automatic detection of numa statistics by system, this is also
> >> system default configuration, you can do:
> >>echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> >> In this case, numa counter update is skipped unless it has been read by
> >> users at least once, e.g. cat /proc/zoneinfo.
> > 
> > I am still not convinced the auto mode is worth all the additional code
> > and a safe default to use. The whole thing could have been 0/1 with a
> > simpler parsing and less code to catch readers.
> > 
> 
> I understood your concern. 
> Well, we may get rid of auto mode if there is some obvious disadvantage
> here. Now, I tend to keep it because most people may not touch this interface,
> and auto mode is helpful in such case.

But you cannot guarantee it won't break any existing users, can you?
Besides I do not remember anybody complaining about the performance
impact of these counters other than very specialized workloads which are
going to disable the accounting altogether. So I simply fail to see a
reason to add more code with a questionable semantic (see below on
partial reads).

> > E.g. why do we have to do static_branch_enable on any read or even
> > vmstat_stop? Wouldn't open be sufficient?
> > 
> 
> NUMA stats is used in four files:
> /proc/zoneinfo
> /proc/vmstat
> /sys/devices/system/node/node*/numastat
> /sys/devices/system/node/node*/vmstat
> In auto mode, each *read* will trigger the update of NUMA counter. 
> So, we should make sure the target branch is jumped to the branch 
> for NUMA counter update once the file is read from user space.
> the intension of static_branch_enable in vmstat_stop(in the call site 
> of file->file_ops.read) is for reading /proc/vmstat in case.  
> 
> I guess the *open* means file->file_op.open here, right?
> Do you suggest to move static_branch_enable to file->file_op.open? Thanks.

I haven't checked closely but what happens (or should happen) when you
do a partial read? Should you get an inconsistent results? Or is this
impossible?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-09 Thread kemi


On 2017年10月03日 17:23, Michal Hocko wrote:
> On Thu 28-09-17 14:11:41, Kemi Wang wrote:
>> This is the second step which introduces a tunable interface that allow
>> numa stats configurable for optimizing zone_statistics(), as suggested by
>> Dave Hansen and Ying Huang.
>>
>> =
>> When page allocation performance becomes a bottleneck and you can tolerate
>> some possible tool breakage and decreased numa counter precision, you can
>> do:
>>  echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
>> In this case, numa counter update is ignored. We can see about
>> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
>> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
>> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
>> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
>> memory).
>>
>> Benchmark link provided by Jesper D Brouer(increase loop times to
>> 1000):
>> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
>> bench
>>
>> =
>> When page allocation performance is not a bottleneck and you want all
>> tooling to work, you can do:
>>  echo [S|s]trict > /proc/sys/vm/numa_stats_mode
>>
>> =
>> We recommend automatic detection of numa statistics by system, this is also
>> system default configuration, you can do:
>>  echo [A|a]uto > /proc/sys/vm/numa_stats_mode
>> In this case, numa counter update is skipped unless it has been read by
>> users at least once, e.g. cat /proc/zoneinfo.
> 
> I am still not convinced the auto mode is worth all the additional code
> and a safe default to use. The whole thing could have been 0/1 with a
> simpler parsing and less code to catch readers.
> 

I understood your concern. 
Well, we may get rid of auto mode if there is some obvious disadvantage
here. Now, I tend to keep it because most people may not touch this interface,
and auto mode is helpful in such case.

> E.g. why do we have to do static_branch_enable on any read or even
> vmstat_stop? Wouldn't open be sufficient?
> 

NUMA stats is used in four files:
/proc/zoneinfo
/proc/vmstat
/sys/devices/system/node/node*/numastat
/sys/devices/system/node/node*/vmstat
In auto mode, each *read* will trigger the update of NUMA counter. 
So, we should make sure the target branch is jumped to the branch 
for NUMA counter update once the file is read from user space.
the intension of static_branch_enable in vmstat_stop(in the call site 
of file->file_ops.read) is for reading /proc/vmstat in case.  

I guess the *open* means file->file_op.open here, right?
Do you suggest to move static_branch_enable to file->file_op.open? Thanks.

>> @@ -153,6 +153,8 @@ static DEVICE_ATTR(meminfo, S_IRUGO, node_read_meminfo, 
>> NULL);
>>  static ssize_t node_read_numastat(struct device *dev,
>>  struct device_attribute *attr, char *buf)
>>  {
>> +if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
>> +static_branch_enable(_numa_stats_mode_key);
>>  return sprintf(buf,
>> "numa_hit %lu\n"
>> "numa_miss %lu\n"
>> @@ -186,6 +188,8 @@ static ssize_t node_read_vmstat(struct device *dev,
>>  n += sprintf(buf+n, "%s %lu\n",
>>   vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>>   sum_zone_numa_state(nid, i));
>> +if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
>> +static_branch_enable(_numa_stats_mode_key);
>>  #endif
>>  
>>  for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> [...]
>> @@ -1582,6 +1703,10 @@ static int zoneinfo_show(struct seq_file *m, void 
>> *arg)
>>  {
>>  pg_data_t *pgdat = (pg_data_t *)arg;
>>  walk_zones_in_node(m, pgdat, false, false, zoneinfo_show_print);
>> +#ifdef CONFIG_NUMA
>> +if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
>> +static_branch_enable(_numa_stats_mode_key);
>> +#endif
>>  return 0;
>>  }
>>  
>> @@ -1678,6 +1803,10 @@ static int vmstat_show(struct seq_file *m, void *arg)
>>  
>>  static void vmstat_stop(struct seq_file *m, void *arg)
>>  {
>> +#ifdef CONFIG_NUMA
>> +if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
>> +static_branch_enable(_numa_stats_mode_key);
>> +#endif
>>  kfree(m->private);
>>  m->private = NULL;
>>  }
>> -- 
>> 2.7.4
>>
> 


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-09 Thread kemi


On 2017年10月03日 17:23, Michal Hocko wrote:
> On Thu 28-09-17 14:11:41, Kemi Wang wrote:
>> This is the second step which introduces a tunable interface that allow
>> numa stats configurable for optimizing zone_statistics(), as suggested by
>> Dave Hansen and Ying Huang.
>>
>> =
>> When page allocation performance becomes a bottleneck and you can tolerate
>> some possible tool breakage and decreased numa counter precision, you can
>> do:
>>  echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
>> In this case, numa counter update is ignored. We can see about
>> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
>> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
>> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
>> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
>> memory).
>>
>> Benchmark link provided by Jesper D Brouer(increase loop times to
>> 1000):
>> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
>> bench
>>
>> =
>> When page allocation performance is not a bottleneck and you want all
>> tooling to work, you can do:
>>  echo [S|s]trict > /proc/sys/vm/numa_stats_mode
>>
>> =
>> We recommend automatic detection of numa statistics by system, this is also
>> system default configuration, you can do:
>>  echo [A|a]uto > /proc/sys/vm/numa_stats_mode
>> In this case, numa counter update is skipped unless it has been read by
>> users at least once, e.g. cat /proc/zoneinfo.
> 
> I am still not convinced the auto mode is worth all the additional code
> and a safe default to use. The whole thing could have been 0/1 with a
> simpler parsing and less code to catch readers.
> 

I understood your concern. 
Well, we may get rid of auto mode if there is some obvious disadvantage
here. Now, I tend to keep it because most people may not touch this interface,
and auto mode is helpful in such case.

> E.g. why do we have to do static_branch_enable on any read or even
> vmstat_stop? Wouldn't open be sufficient?
> 

NUMA stats is used in four files:
/proc/zoneinfo
/proc/vmstat
/sys/devices/system/node/node*/numastat
/sys/devices/system/node/node*/vmstat
In auto mode, each *read* will trigger the update of NUMA counter. 
So, we should make sure the target branch is jumped to the branch 
for NUMA counter update once the file is read from user space.
the intension of static_branch_enable in vmstat_stop(in the call site 
of file->file_ops.read) is for reading /proc/vmstat in case.  

I guess the *open* means file->file_op.open here, right?
Do you suggest to move static_branch_enable to file->file_op.open? Thanks.

>> @@ -153,6 +153,8 @@ static DEVICE_ATTR(meminfo, S_IRUGO, node_read_meminfo, 
>> NULL);
>>  static ssize_t node_read_numastat(struct device *dev,
>>  struct device_attribute *attr, char *buf)
>>  {
>> +if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
>> +static_branch_enable(_numa_stats_mode_key);
>>  return sprintf(buf,
>> "numa_hit %lu\n"
>> "numa_miss %lu\n"
>> @@ -186,6 +188,8 @@ static ssize_t node_read_vmstat(struct device *dev,
>>  n += sprintf(buf+n, "%s %lu\n",
>>   vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>>   sum_zone_numa_state(nid, i));
>> +if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
>> +static_branch_enable(_numa_stats_mode_key);
>>  #endif
>>  
>>  for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> [...]
>> @@ -1582,6 +1703,10 @@ static int zoneinfo_show(struct seq_file *m, void 
>> *arg)
>>  {
>>  pg_data_t *pgdat = (pg_data_t *)arg;
>>  walk_zones_in_node(m, pgdat, false, false, zoneinfo_show_print);
>> +#ifdef CONFIG_NUMA
>> +if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
>> +static_branch_enable(_numa_stats_mode_key);
>> +#endif
>>  return 0;
>>  }
>>  
>> @@ -1678,6 +1803,10 @@ static int vmstat_show(struct seq_file *m, void *arg)
>>  
>>  static void vmstat_stop(struct seq_file *m, void *arg)
>>  {
>> +#ifdef CONFIG_NUMA
>> +if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
>> +static_branch_enable(_numa_stats_mode_key);
>> +#endif
>>  kfree(m->private);
>>  m->private = NULL;
>>  }
>> -- 
>> 2.7.4
>>
> 


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-08 Thread kemi


On 2017年09月29日 15:03, Vlastimil Babka wrote:
> On 09/28/2017 08:11 AM, Kemi Wang wrote:
>> This is the second step which introduces a tunable interface that allow
>> numa stats configurable for optimizing zone_statistics(), as suggested by
>> Dave Hansen and Ying Huang.
>>
>> =
>> When page allocation performance becomes a bottleneck and you can tolerate
>> some possible tool breakage and decreased numa counter precision, you can
>> do:
>>  echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
>> In this case, numa counter update is ignored. We can see about
>> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
>> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
>> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
>> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
>> memory).
>>
>> Benchmark link provided by Jesper D Brouer(increase loop times to
>> 1000):
>> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
>> bench
>>
>> =
>> When page allocation performance is not a bottleneck and you want all
>> tooling to work, you can do:
>>  echo [S|s]trict > /proc/sys/vm/numa_stats_mode
>>
>> =
>> We recommend automatic detection of numa statistics by system, this is also
>> system default configuration, you can do:
>>  echo [A|a]uto > /proc/sys/vm/numa_stats_mode
>> In this case, numa counter update is skipped unless it has been read by
>> users at least once, e.g. cat /proc/zoneinfo.
>>
>> Branch target selection with jump label:
>> a) When numa_stats_mode is changed to *strict*, jump to the branch for numa
>> counters update.
>> b) When numa_stats_mode is changed to *coarse*, return back directly.
>> c) When numa_stats_mode is changed to *auto*, the branch target used in
>> last time is kept, and the branch target is changed to the branch for numa
>> counters update once numa counters are *read* by users.
>>
>> Therefore, with the help of jump label, the page allocation performance is
>> hardly affected when numa counters are updated with a call in
>> zone_statistics(). Meanwhile, the auto mode can give people benefit without
>> manual tuning.
>>
>> Many thanks to Michal Hocko, Dave Hansen and Ying Huang for comments to
>> help improve the original patch.
>>
>> ChangeLog:
>>   V2->V3:
>>   a) Propose a better way to use jump label to eliminate the overhead of
>>   branch selection in zone_statistics(), as inspired by Ying Huang;
>>   b) Add a paragraph in commit log to describe the way for branch target
>>   selection;
>>   c) Use a more descriptive name numa_stats_mode instead of vmstat_mode,
>>   and change the description accordingly, as suggested by Michal Hocko;
>>   d) Make this functionality NUMA-specific via ifdef
>>
>>   V1->V2:
>>   a) Merge to one patch;
>>   b) Use jump label to eliminate the overhead of branch selection;
>>   c) Add a single-time log message at boot time to help tell users what
>>   happened.
>>
>> Reported-by: Jesper Dangaard Brouer 
>> Suggested-by: Dave Hansen 
>> Suggested-by: Ying Huang 
>> Signed-off-by: Kemi Wang 
>> ---
>>  Documentation/sysctl/vm.txt |  24 +
>>  drivers/base/node.c |   4 ++
>>  include/linux/vmstat.h  |  23 
>>  init/main.c |   3 ++
>>  kernel/sysctl.c |   7 +++
>>  mm/page_alloc.c |  10 
>>  mm/vmstat.c | 129 
>> 
>>  7 files changed, 200 insertions(+)
>>
>> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
>> index 9baf66a..e310e69 100644
>> --- a/Documentation/sysctl/vm.txt
>> +++ b/Documentation/sysctl/vm.txt
>> @@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm:
>>  - swappiness
>>  - user_reserve_kbytes
>>  - vfs_cache_pressure
>> +- numa_stats_mode
>>  - watermark_scale_factor
>>  - zone_reclaim_mode
>>  
>> @@ -843,6 +844,29 @@ ten times more freeable objects than there are.
>>  
>>  =
>>  
>> +numa_stats_mode
>> +
>> +This interface allows numa statistics configurable.
>> +
>> +When page allocation performance becomes a bottleneck and you can tolerate
>> +some possible tool breakage and decreased numa counter precision, you can
>> +do:
>> +echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
>> +
>> +When page allocation performance is not a bottleneck and you want all
>> +tooling to work, you can do:
>> +echo [S|s]trict > /proc/sys/vm/numa_stat_mode
>> +
>> +We recommend automatic detection of numa statistics by system, because numa
>> +statistics does not affect system's decision and it is very rarely

Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-08 Thread kemi


On 2017年09月29日 15:03, Vlastimil Babka wrote:
> On 09/28/2017 08:11 AM, Kemi Wang wrote:
>> This is the second step which introduces a tunable interface that allow
>> numa stats configurable for optimizing zone_statistics(), as suggested by
>> Dave Hansen and Ying Huang.
>>
>> =
>> When page allocation performance becomes a bottleneck and you can tolerate
>> some possible tool breakage and decreased numa counter precision, you can
>> do:
>>  echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
>> In this case, numa counter update is ignored. We can see about
>> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
>> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
>> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
>> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
>> memory).
>>
>> Benchmark link provided by Jesper D Brouer(increase loop times to
>> 1000):
>> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
>> bench
>>
>> =
>> When page allocation performance is not a bottleneck and you want all
>> tooling to work, you can do:
>>  echo [S|s]trict > /proc/sys/vm/numa_stats_mode
>>
>> =
>> We recommend automatic detection of numa statistics by system, this is also
>> system default configuration, you can do:
>>  echo [A|a]uto > /proc/sys/vm/numa_stats_mode
>> In this case, numa counter update is skipped unless it has been read by
>> users at least once, e.g. cat /proc/zoneinfo.
>>
>> Branch target selection with jump label:
>> a) When numa_stats_mode is changed to *strict*, jump to the branch for numa
>> counters update.
>> b) When numa_stats_mode is changed to *coarse*, return back directly.
>> c) When numa_stats_mode is changed to *auto*, the branch target used in
>> last time is kept, and the branch target is changed to the branch for numa
>> counters update once numa counters are *read* by users.
>>
>> Therefore, with the help of jump label, the page allocation performance is
>> hardly affected when numa counters are updated with a call in
>> zone_statistics(). Meanwhile, the auto mode can give people benefit without
>> manual tuning.
>>
>> Many thanks to Michal Hocko, Dave Hansen and Ying Huang for comments to
>> help improve the original patch.
>>
>> ChangeLog:
>>   V2->V3:
>>   a) Propose a better way to use jump label to eliminate the overhead of
>>   branch selection in zone_statistics(), as inspired by Ying Huang;
>>   b) Add a paragraph in commit log to describe the way for branch target
>>   selection;
>>   c) Use a more descriptive name numa_stats_mode instead of vmstat_mode,
>>   and change the description accordingly, as suggested by Michal Hocko;
>>   d) Make this functionality NUMA-specific via ifdef
>>
>>   V1->V2:
>>   a) Merge to one patch;
>>   b) Use jump label to eliminate the overhead of branch selection;
>>   c) Add a single-time log message at boot time to help tell users what
>>   happened.
>>
>> Reported-by: Jesper Dangaard Brouer 
>> Suggested-by: Dave Hansen 
>> Suggested-by: Ying Huang 
>> Signed-off-by: Kemi Wang 
>> ---
>>  Documentation/sysctl/vm.txt |  24 +
>>  drivers/base/node.c |   4 ++
>>  include/linux/vmstat.h  |  23 
>>  init/main.c |   3 ++
>>  kernel/sysctl.c |   7 +++
>>  mm/page_alloc.c |  10 
>>  mm/vmstat.c | 129 
>> 
>>  7 files changed, 200 insertions(+)
>>
>> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
>> index 9baf66a..e310e69 100644
>> --- a/Documentation/sysctl/vm.txt
>> +++ b/Documentation/sysctl/vm.txt
>> @@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm:
>>  - swappiness
>>  - user_reserve_kbytes
>>  - vfs_cache_pressure
>> +- numa_stats_mode
>>  - watermark_scale_factor
>>  - zone_reclaim_mode
>>  
>> @@ -843,6 +844,29 @@ ten times more freeable objects than there are.
>>  
>>  =
>>  
>> +numa_stats_mode
>> +
>> +This interface allows numa statistics configurable.
>> +
>> +When page allocation performance becomes a bottleneck and you can tolerate
>> +some possible tool breakage and decreased numa counter precision, you can
>> +do:
>> +echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
>> +
>> +When page allocation performance is not a bottleneck and you want all
>> +tooling to work, you can do:
>> +echo [S|s]trict > /proc/sys/vm/numa_stat_mode
>> +
>> +We recommend automatic detection of numa statistics by system, because numa
>> +statistics does not affect system's decision and it is very rarely
>> +consumed. you can do:
>> +echo [A|a]uto > /proc/sys/vm/numa_stats_mode
>> 

Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-03 Thread Michal Hocko
On Thu 28-09-17 14:11:41, Kemi Wang wrote:
> This is the second step which introduces a tunable interface that allow
> numa stats configurable for optimizing zone_statistics(), as suggested by
> Dave Hansen and Ying Huang.
> 
> =
> When page allocation performance becomes a bottleneck and you can tolerate
> some possible tool breakage and decreased numa counter precision, you can
> do:
>   echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
> In this case, numa counter update is ignored. We can see about
> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
> memory).
> 
> Benchmark link provided by Jesper D Brouer(increase loop times to
> 1000):
> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
> bench
> 
> =
> When page allocation performance is not a bottleneck and you want all
> tooling to work, you can do:
>   echo [S|s]trict > /proc/sys/vm/numa_stats_mode
> 
> =
> We recommend automatic detection of numa statistics by system, this is also
> system default configuration, you can do:
>   echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> In this case, numa counter update is skipped unless it has been read by
> users at least once, e.g. cat /proc/zoneinfo.

I am still not convinced the auto mode is worth all the additional code
and a safe default to use. The whole thing could have been 0/1 with a
simpler parsing and less code to catch readers.

E.g. why do we have to do static_branch_enable on any read or even
vmstat_stop? Wouldn't open be sufficient?

> @@ -153,6 +153,8 @@ static DEVICE_ATTR(meminfo, S_IRUGO, node_read_meminfo, 
> NULL);
>  static ssize_t node_read_numastat(struct device *dev,
>   struct device_attribute *attr, char *buf)
>  {
> + if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> + static_branch_enable(_numa_stats_mode_key);
>   return sprintf(buf,
>  "numa_hit %lu\n"
>  "numa_miss %lu\n"
> @@ -186,6 +188,8 @@ static ssize_t node_read_vmstat(struct device *dev,
>   n += sprintf(buf+n, "%s %lu\n",
>vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>sum_zone_numa_state(nid, i));
> + if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> + static_branch_enable(_numa_stats_mode_key);
>  #endif
>  
>   for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
[...]
> @@ -1582,6 +1703,10 @@ static int zoneinfo_show(struct seq_file *m, void *arg)
>  {
>   pg_data_t *pgdat = (pg_data_t *)arg;
>   walk_zones_in_node(m, pgdat, false, false, zoneinfo_show_print);
> +#ifdef CONFIG_NUMA
> + if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> + static_branch_enable(_numa_stats_mode_key);
> +#endif
>   return 0;
>  }
>  
> @@ -1678,6 +1803,10 @@ static int vmstat_show(struct seq_file *m, void *arg)
>  
>  static void vmstat_stop(struct seq_file *m, void *arg)
>  {
> +#ifdef CONFIG_NUMA
> + if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> + static_branch_enable(_numa_stats_mode_key);
> +#endif
>   kfree(m->private);
>   m->private = NULL;
>  }
> -- 
> 2.7.4
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-03 Thread Michal Hocko
On Thu 28-09-17 14:11:41, Kemi Wang wrote:
> This is the second step which introduces a tunable interface that allow
> numa stats configurable for optimizing zone_statistics(), as suggested by
> Dave Hansen and Ying Huang.
> 
> =
> When page allocation performance becomes a bottleneck and you can tolerate
> some possible tool breakage and decreased numa counter precision, you can
> do:
>   echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
> In this case, numa counter update is ignored. We can see about
> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
> memory).
> 
> Benchmark link provided by Jesper D Brouer(increase loop times to
> 1000):
> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
> bench
> 
> =
> When page allocation performance is not a bottleneck and you want all
> tooling to work, you can do:
>   echo [S|s]trict > /proc/sys/vm/numa_stats_mode
> 
> =
> We recommend automatic detection of numa statistics by system, this is also
> system default configuration, you can do:
>   echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> In this case, numa counter update is skipped unless it has been read by
> users at least once, e.g. cat /proc/zoneinfo.

I am still not convinced the auto mode is worth all the additional code
and a safe default to use. The whole thing could have been 0/1 with a
simpler parsing and less code to catch readers.

E.g. why do we have to do static_branch_enable on any read or even
vmstat_stop? Wouldn't open be sufficient?

> @@ -153,6 +153,8 @@ static DEVICE_ATTR(meminfo, S_IRUGO, node_read_meminfo, 
> NULL);
>  static ssize_t node_read_numastat(struct device *dev,
>   struct device_attribute *attr, char *buf)
>  {
> + if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> + static_branch_enable(_numa_stats_mode_key);
>   return sprintf(buf,
>  "numa_hit %lu\n"
>  "numa_miss %lu\n"
> @@ -186,6 +188,8 @@ static ssize_t node_read_vmstat(struct device *dev,
>   n += sprintf(buf+n, "%s %lu\n",
>vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>sum_zone_numa_state(nid, i));
> + if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> + static_branch_enable(_numa_stats_mode_key);
>  #endif
>  
>   for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
[...]
> @@ -1582,6 +1703,10 @@ static int zoneinfo_show(struct seq_file *m, void *arg)
>  {
>   pg_data_t *pgdat = (pg_data_t *)arg;
>   walk_zones_in_node(m, pgdat, false, false, zoneinfo_show_print);
> +#ifdef CONFIG_NUMA
> + if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> + static_branch_enable(_numa_stats_mode_key);
> +#endif
>   return 0;
>  }
>  
> @@ -1678,6 +1803,10 @@ static int vmstat_show(struct seq_file *m, void *arg)
>  
>  static void vmstat_stop(struct seq_file *m, void *arg)
>  {
> +#ifdef CONFIG_NUMA
> + if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> + static_branch_enable(_numa_stats_mode_key);
> +#endif
>   kfree(m->private);
>   m->private = NULL;
>  }
> -- 
> 2.7.4
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-09-29 Thread Vlastimil Babka
[+CC linux-api]

On 09/28/2017 08:11 AM, Kemi Wang wrote:
> This is the second step which introduces a tunable interface that allow
> numa stats configurable for optimizing zone_statistics(), as suggested by
> Dave Hansen and Ying Huang.
> 
> =
> When page allocation performance becomes a bottleneck and you can tolerate
> some possible tool breakage and decreased numa counter precision, you can
> do:
>   echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
> In this case, numa counter update is ignored. We can see about
> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
> memory).
> 
> Benchmark link provided by Jesper D Brouer(increase loop times to
> 1000):
> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
> bench
> 
> =
> When page allocation performance is not a bottleneck and you want all
> tooling to work, you can do:
>   echo [S|s]trict > /proc/sys/vm/numa_stats_mode
> 
> =
> We recommend automatic detection of numa statistics by system, this is also
> system default configuration, you can do:
>   echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> In this case, numa counter update is skipped unless it has been read by
> users at least once, e.g. cat /proc/zoneinfo.
> 
> Branch target selection with jump label:
> a) When numa_stats_mode is changed to *strict*, jump to the branch for numa
> counters update.
> b) When numa_stats_mode is changed to *coarse*, return back directly.
> c) When numa_stats_mode is changed to *auto*, the branch target used in
> last time is kept, and the branch target is changed to the branch for numa
> counters update once numa counters are *read* by users.
> 
> Therefore, with the help of jump label, the page allocation performance is
> hardly affected when numa counters are updated with a call in
> zone_statistics(). Meanwhile, the auto mode can give people benefit without
> manual tuning.
> 
> Many thanks to Michal Hocko, Dave Hansen and Ying Huang for comments to
> help improve the original patch.
> 
> ChangeLog:
>   V2->V3:
>   a) Propose a better way to use jump label to eliminate the overhead of
>   branch selection in zone_statistics(), as inspired by Ying Huang;
>   b) Add a paragraph in commit log to describe the way for branch target
>   selection;
>   c) Use a more descriptive name numa_stats_mode instead of vmstat_mode,
>   and change the description accordingly, as suggested by Michal Hocko;
>   d) Make this functionality NUMA-specific via ifdef
> 
>   V1->V2:
>   a) Merge to one patch;
>   b) Use jump label to eliminate the overhead of branch selection;
>   c) Add a single-time log message at boot time to help tell users what
>   happened.
> 
> Reported-by: Jesper Dangaard Brouer 
> Suggested-by: Dave Hansen 
> Suggested-by: Ying Huang 
> Signed-off-by: Kemi Wang 
> ---
>  Documentation/sysctl/vm.txt |  24 +
>  drivers/base/node.c |   4 ++
>  include/linux/vmstat.h  |  23 
>  init/main.c |   3 ++
>  kernel/sysctl.c |   7 +++
>  mm/page_alloc.c |  10 
>  mm/vmstat.c | 129 
> 
>  7 files changed, 200 insertions(+)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 9baf66a..e310e69 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm:
>  - swappiness
>  - user_reserve_kbytes
>  - vfs_cache_pressure
> +- numa_stats_mode
>  - watermark_scale_factor
>  - zone_reclaim_mode
>  
> @@ -843,6 +844,29 @@ ten times more freeable objects than there are.
>  
>  =
>  
> +numa_stats_mode
> +
> +This interface allows numa statistics configurable.
> +
> +When page allocation performance becomes a bottleneck and you can tolerate
> +some possible tool breakage and decreased numa counter precision, you can
> +do:
> + echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
> +
> +When page allocation performance is not a bottleneck and you want all
> +tooling to work, you can do:
> + echo [S|s]trict > /proc/sys/vm/numa_stat_mode
> +
> +We recommend automatic detection of numa statistics by system, because numa
> +statistics does not affect system's decision and it is very rarely
> +consumed. you can do:
> + echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> +This is also system default configuration, 

Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-09-29 Thread Vlastimil Babka
[+CC linux-api]

On 09/28/2017 08:11 AM, Kemi Wang wrote:
> This is the second step which introduces a tunable interface that allow
> numa stats configurable for optimizing zone_statistics(), as suggested by
> Dave Hansen and Ying Huang.
> 
> =
> When page allocation performance becomes a bottleneck and you can tolerate
> some possible tool breakage and decreased numa counter precision, you can
> do:
>   echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
> In this case, numa counter update is ignored. We can see about
> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
> memory).
> 
> Benchmark link provided by Jesper D Brouer(increase loop times to
> 1000):
> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
> bench
> 
> =
> When page allocation performance is not a bottleneck and you want all
> tooling to work, you can do:
>   echo [S|s]trict > /proc/sys/vm/numa_stats_mode
> 
> =
> We recommend automatic detection of numa statistics by system, this is also
> system default configuration, you can do:
>   echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> In this case, numa counter update is skipped unless it has been read by
> users at least once, e.g. cat /proc/zoneinfo.
> 
> Branch target selection with jump label:
> a) When numa_stats_mode is changed to *strict*, jump to the branch for numa
> counters update.
> b) When numa_stats_mode is changed to *coarse*, return back directly.
> c) When numa_stats_mode is changed to *auto*, the branch target used in
> last time is kept, and the branch target is changed to the branch for numa
> counters update once numa counters are *read* by users.
> 
> Therefore, with the help of jump label, the page allocation performance is
> hardly affected when numa counters are updated with a call in
> zone_statistics(). Meanwhile, the auto mode can give people benefit without
> manual tuning.
> 
> Many thanks to Michal Hocko, Dave Hansen and Ying Huang for comments to
> help improve the original patch.
> 
> ChangeLog:
>   V2->V3:
>   a) Propose a better way to use jump label to eliminate the overhead of
>   branch selection in zone_statistics(), as inspired by Ying Huang;
>   b) Add a paragraph in commit log to describe the way for branch target
>   selection;
>   c) Use a more descriptive name numa_stats_mode instead of vmstat_mode,
>   and change the description accordingly, as suggested by Michal Hocko;
>   d) Make this functionality NUMA-specific via ifdef
> 
>   V1->V2:
>   a) Merge to one patch;
>   b) Use jump label to eliminate the overhead of branch selection;
>   c) Add a single-time log message at boot time to help tell users what
>   happened.
> 
> Reported-by: Jesper Dangaard Brouer 
> Suggested-by: Dave Hansen 
> Suggested-by: Ying Huang 
> Signed-off-by: Kemi Wang 
> ---
>  Documentation/sysctl/vm.txt |  24 +
>  drivers/base/node.c |   4 ++
>  include/linux/vmstat.h  |  23 
>  init/main.c |   3 ++
>  kernel/sysctl.c |   7 +++
>  mm/page_alloc.c |  10 
>  mm/vmstat.c | 129 
> 
>  7 files changed, 200 insertions(+)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 9baf66a..e310e69 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm:
>  - swappiness
>  - user_reserve_kbytes
>  - vfs_cache_pressure
> +- numa_stats_mode
>  - watermark_scale_factor
>  - zone_reclaim_mode
>  
> @@ -843,6 +844,29 @@ ten times more freeable objects than there are.
>  
>  =
>  
> +numa_stats_mode
> +
> +This interface allows numa statistics configurable.
> +
> +When page allocation performance becomes a bottleneck and you can tolerate
> +some possible tool breakage and decreased numa counter precision, you can
> +do:
> + echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
> +
> +When page allocation performance is not a bottleneck and you want all
> +tooling to work, you can do:
> + echo [S|s]trict > /proc/sys/vm/numa_stat_mode
> +
> +We recommend automatic detection of numa statistics by system, because numa
> +statistics does not affect system's decision and it is very rarely
> +consumed. you can do:
> + echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> +This is also system default configuration, with this default setting, numa
> +counters update is skipped unless the counter is 

Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-09-29 Thread Vlastimil Babka
On 09/28/2017 11:29 PM, Andrew Morton wrote:
> On Thu, 28 Sep 2017 14:11:41 +0800 Kemi Wang  wrote:
> 
>> This is the second step which introduces a tunable interface that allow
>> numa stats configurable for optimizing zone_statistics(), as suggested by
>> Dave Hansen and Ying Huang.
> 
> Looks OK I guess.
> 
> I fiddled with it a lot.  Please consider:
> 
> From: Andrew Morton 
> Subject: mm-sysctl-make-numa-stats-configurable-fix
> 
> - tweak documentation
> 
> - move advisory message from start_kernel() into mm_init() (I'm not sure
>   we really need this message)

Actually, I'm not sure we need any of the current messages, or to have
them at higher priority than pr_debug()? They are all triggered by admin
action, or unconditionally upon boot.
OTOH I think that an useful message that's currently missing would be
when the static_key_enable() is triggered in auto mode. Bonus points for
including the name of the process and the stat file that was read.
However static_key_enable() returns void and not whether it actually
flipped the switch, so it's not trivial.


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-09-29 Thread Vlastimil Babka
On 09/28/2017 11:29 PM, Andrew Morton wrote:
> On Thu, 28 Sep 2017 14:11:41 +0800 Kemi Wang  wrote:
> 
>> This is the second step which introduces a tunable interface that allow
>> numa stats configurable for optimizing zone_statistics(), as suggested by
>> Dave Hansen and Ying Huang.
> 
> Looks OK I guess.
> 
> I fiddled with it a lot.  Please consider:
> 
> From: Andrew Morton 
> Subject: mm-sysctl-make-numa-stats-configurable-fix
> 
> - tweak documentation
> 
> - move advisory message from start_kernel() into mm_init() (I'm not sure
>   we really need this message)

Actually, I'm not sure we need any of the current messages, or to have
them at higher priority than pr_debug()? They are all triggered by admin
action, or unconditionally upon boot.
OTOH I think that an useful message that's currently missing would be
when the static_key_enable() is triggered in auto mode. Bonus points for
including the name of the process and the stat file that was read.
However static_key_enable() returns void and not whether it actually
flipped the switch, so it's not trivial.


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-09-29 Thread Vlastimil Babka
On 09/28/2017 08:11 AM, Kemi Wang wrote:
> This is the second step which introduces a tunable interface that allow
> numa stats configurable for optimizing zone_statistics(), as suggested by
> Dave Hansen and Ying Huang.
> 
> =
> When page allocation performance becomes a bottleneck and you can tolerate
> some possible tool breakage and decreased numa counter precision, you can
> do:
>   echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
> In this case, numa counter update is ignored. We can see about
> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
> memory).
> 
> Benchmark link provided by Jesper D Brouer(increase loop times to
> 1000):
> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
> bench
> 
> =
> When page allocation performance is not a bottleneck and you want all
> tooling to work, you can do:
>   echo [S|s]trict > /proc/sys/vm/numa_stats_mode
> 
> =
> We recommend automatic detection of numa statistics by system, this is also
> system default configuration, you can do:
>   echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> In this case, numa counter update is skipped unless it has been read by
> users at least once, e.g. cat /proc/zoneinfo.
> 
> Branch target selection with jump label:
> a) When numa_stats_mode is changed to *strict*, jump to the branch for numa
> counters update.
> b) When numa_stats_mode is changed to *coarse*, return back directly.
> c) When numa_stats_mode is changed to *auto*, the branch target used in
> last time is kept, and the branch target is changed to the branch for numa
> counters update once numa counters are *read* by users.
> 
> Therefore, with the help of jump label, the page allocation performance is
> hardly affected when numa counters are updated with a call in
> zone_statistics(). Meanwhile, the auto mode can give people benefit without
> manual tuning.
> 
> Many thanks to Michal Hocko, Dave Hansen and Ying Huang for comments to
> help improve the original patch.
> 
> ChangeLog:
>   V2->V3:
>   a) Propose a better way to use jump label to eliminate the overhead of
>   branch selection in zone_statistics(), as inspired by Ying Huang;
>   b) Add a paragraph in commit log to describe the way for branch target
>   selection;
>   c) Use a more descriptive name numa_stats_mode instead of vmstat_mode,
>   and change the description accordingly, as suggested by Michal Hocko;
>   d) Make this functionality NUMA-specific via ifdef
> 
>   V1->V2:
>   a) Merge to one patch;
>   b) Use jump label to eliminate the overhead of branch selection;
>   c) Add a single-time log message at boot time to help tell users what
>   happened.
> 
> Reported-by: Jesper Dangaard Brouer 
> Suggested-by: Dave Hansen 
> Suggested-by: Ying Huang 
> Signed-off-by: Kemi Wang 
> ---
>  Documentation/sysctl/vm.txt |  24 +
>  drivers/base/node.c |   4 ++
>  include/linux/vmstat.h  |  23 
>  init/main.c |   3 ++
>  kernel/sysctl.c |   7 +++
>  mm/page_alloc.c |  10 
>  mm/vmstat.c | 129 
> 
>  7 files changed, 200 insertions(+)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 9baf66a..e310e69 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm:
>  - swappiness
>  - user_reserve_kbytes
>  - vfs_cache_pressure
> +- numa_stats_mode
>  - watermark_scale_factor
>  - zone_reclaim_mode
>  
> @@ -843,6 +844,29 @@ ten times more freeable objects than there are.
>  
>  =
>  
> +numa_stats_mode
> +
> +This interface allows numa statistics configurable.
> +
> +When page allocation performance becomes a bottleneck and you can tolerate
> +some possible tool breakage and decreased numa counter precision, you can
> +do:
> + echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
> +
> +When page allocation performance is not a bottleneck and you want all
> +tooling to work, you can do:
> + echo [S|s]trict > /proc/sys/vm/numa_stat_mode
> +
> +We recommend automatic detection of numa statistics by system, because numa
> +statistics does not affect system's decision and it is very rarely
> +consumed. you can do:
> + echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> +This is also system default configuration, with this default 

Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-09-29 Thread Vlastimil Babka
On 09/28/2017 08:11 AM, Kemi Wang wrote:
> This is the second step which introduces a tunable interface that allow
> numa stats configurable for optimizing zone_statistics(), as suggested by
> Dave Hansen and Ying Huang.
> 
> =
> When page allocation performance becomes a bottleneck and you can tolerate
> some possible tool breakage and decreased numa counter precision, you can
> do:
>   echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
> In this case, numa counter update is ignored. We can see about
> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
> memory).
> 
> Benchmark link provided by Jesper D Brouer(increase loop times to
> 1000):
> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
> bench
> 
> =
> When page allocation performance is not a bottleneck and you want all
> tooling to work, you can do:
>   echo [S|s]trict > /proc/sys/vm/numa_stats_mode
> 
> =
> We recommend automatic detection of numa statistics by system, this is also
> system default configuration, you can do:
>   echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> In this case, numa counter update is skipped unless it has been read by
> users at least once, e.g. cat /proc/zoneinfo.
> 
> Branch target selection with jump label:
> a) When numa_stats_mode is changed to *strict*, jump to the branch for numa
> counters update.
> b) When numa_stats_mode is changed to *coarse*, return back directly.
> c) When numa_stats_mode is changed to *auto*, the branch target used in
> last time is kept, and the branch target is changed to the branch for numa
> counters update once numa counters are *read* by users.
> 
> Therefore, with the help of jump label, the page allocation performance is
> hardly affected when numa counters are updated with a call in
> zone_statistics(). Meanwhile, the auto mode can give people benefit without
> manual tuning.
> 
> Many thanks to Michal Hocko, Dave Hansen and Ying Huang for comments to
> help improve the original patch.
> 
> ChangeLog:
>   V2->V3:
>   a) Propose a better way to use jump label to eliminate the overhead of
>   branch selection in zone_statistics(), as inspired by Ying Huang;
>   b) Add a paragraph in commit log to describe the way for branch target
>   selection;
>   c) Use a more descriptive name numa_stats_mode instead of vmstat_mode,
>   and change the description accordingly, as suggested by Michal Hocko;
>   d) Make this functionality NUMA-specific via ifdef
> 
>   V1->V2:
>   a) Merge to one patch;
>   b) Use jump label to eliminate the overhead of branch selection;
>   c) Add a single-time log message at boot time to help tell users what
>   happened.
> 
> Reported-by: Jesper Dangaard Brouer 
> Suggested-by: Dave Hansen 
> Suggested-by: Ying Huang 
> Signed-off-by: Kemi Wang 
> ---
>  Documentation/sysctl/vm.txt |  24 +
>  drivers/base/node.c |   4 ++
>  include/linux/vmstat.h  |  23 
>  init/main.c |   3 ++
>  kernel/sysctl.c |   7 +++
>  mm/page_alloc.c |  10 
>  mm/vmstat.c | 129 
> 
>  7 files changed, 200 insertions(+)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 9baf66a..e310e69 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm:
>  - swappiness
>  - user_reserve_kbytes
>  - vfs_cache_pressure
> +- numa_stats_mode
>  - watermark_scale_factor
>  - zone_reclaim_mode
>  
> @@ -843,6 +844,29 @@ ten times more freeable objects than there are.
>  
>  =
>  
> +numa_stats_mode
> +
> +This interface allows numa statistics configurable.
> +
> +When page allocation performance becomes a bottleneck and you can tolerate
> +some possible tool breakage and decreased numa counter precision, you can
> +do:
> + echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
> +
> +When page allocation performance is not a bottleneck and you want all
> +tooling to work, you can do:
> + echo [S|s]trict > /proc/sys/vm/numa_stat_mode
> +
> +We recommend automatic detection of numa statistics by system, because numa
> +statistics does not affect system's decision and it is very rarely
> +consumed. you can do:
> + echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> +This is also system default configuration, with this default setting, numa
> +counters update is skipped unless the counter is *read* by users at 

Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-09-28 Thread kemi


On 2017年09月29日 05:29, Andrew Morton wrote:
> On Thu, 28 Sep 2017 14:11:41 +0800 Kemi Wang  wrote:
> 
>> This is the second step which introduces a tunable interface that allow
>> numa stats configurable for optimizing zone_statistics(), as suggested by
>> Dave Hansen and Ying Huang.
> 
> Looks OK I guess.
> 
> I fiddled with it a lot.  Please consider:
> 

Thanks for your help to make it more graceful! I will be more careful next time.
There may be a typo error in Documentation/sysctl/vm.txt, see comment below.

> From: Andrew Morton 
> Subject: mm-sysctl-make-numa-stats-configurable-fix
> 
> - tweak documentation
> 
> - move advisory message from start_kernel() into mm_init() (I'm not sure
>   we really need this message)
> 
> - use strcasecmp() in __parse_vm_numa_stats_mode()
> 
> - clean up coding style amd nessages in sysctl_vm_numa_stats_mode_handler()
> 
> Cc: Aaron Lu 
> Cc: Andi Kleen 
> Cc: Christopher Lameter 
> Cc: Dave Hansen 
> Cc: Jesper Dangaard Brouer 
> Cc: Johannes Weiner 
> Cc: Jonathan Corbet 
> Cc: Kees Cook 
> Cc: Kemi Wang 
> Cc: "Luis R . Rodriguez" 
> Cc: Mel Gorman 
> Cc: Michal Hocko 
> Cc: Sebastian Andrzej Siewior 
> Cc: Tim Chen 
> Cc: Vlastimil Babka 
> Cc: Ying Huang 
> Signed-off-by: Andrew Morton 
> ---
> 
>  Documentation/sysctl/vm.txt |   15 ++---
>  init/main.c |6 ++---
>  mm/vmstat.c |   39 +++---
>  3 files changed, 29 insertions(+), 31 deletions(-)
> 
> diff -puN 
> Documentation/sysctl/vm.txt~mm-sysctl-make-numa-stats-configurable-fix 
> Documentation/sysctl/vm.txt
> --- a/Documentation/sysctl/vm.txt~mm-sysctl-make-numa-stats-configurable-fix
> +++ a/Documentation/sysctl/vm.txt
> @@ -853,7 +853,7 @@ ten times more freeable objects than the
>  
>  numa_stats_mode
>  
> -This interface allows numa statistics configurable.
> +This interface allows runtime configuration *or* numa statistics.
>  

typo? or->of/for?

>  When page allocation performance becomes a bottleneck and you can tolerate
>  some possible tool breakage and decreased numa counter precision, you can
> @@ -864,13 +864,14 @@ When page allocation performance is not
>  tooling to work, you can do:
>   echo [S|s]trict > /proc/sys/vm/numa_stat_mode
>  
> -We recommend automatic detection of numa statistics by system, because numa
> -statistics does not affect system's decision and it is very rarely
> -consumed. you can do:
> +We recommend automatic detection of numa statistics by system, because
> +numa statistics do not affect system decisions and it is very rarely
> +consumed.  In this case you can do:
>   echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> -This is also system default configuration, with this default setting, numa
> -counters update is skipped unless the counter is *read* by users at least
> -once.
> +
> +This is the system default configuration.  With this default setting, numa
> +counter updates are skipped until the counter is *read* by userspace at
> +least once.
>  
>  ==
>  
> diff -puN drivers/base/node.c~mm-sysctl-make-numa-stats-configurable-fix 
> drivers/base/node.c
> diff -puN include/linux/vmstat.h~mm-sysctl-make-numa-stats-configurable-fix 
> include/linux/vmstat.h
> diff -puN init/main.c~mm-sysctl-make-numa-stats-configurable-fix init/main.c
> --- a/init/main.c~mm-sysctl-make-numa-stats-configurable-fix
> +++ a/init/main.c
> @@ -504,6 +504,9 @@ static void __init mm_init(void)
>   pgtable_init();
>   vmalloc_init();
>   ioremap_huge_init();
> +#ifdef CONFIG_NUMA
> + pr_info("vmstat: NUMA stat updates are skipped unless they have been 
> used\n");
> +#endif
>  }
>  
>  asmlinkage __visible void __init start_kernel(void)
> @@ -567,9 +570,6 @@ asmlinkage __visible void __init start_k
>   sort_main_extable();
>   trap_init();
>   mm_init();
> -#ifdef CONFIG_NUMA
> - pr_info("vmstat: NUMA stats is skipped unless it has been consumed\n");
> -#endif
>  
>   ftrace_init();
>  
> diff -puN kernel/sysctl.c~mm-sysctl-make-numa-stats-configurable-fix 
> kernel/sysctl.c
> diff -puN mm/page_alloc.c~mm-sysctl-make-numa-stats-configurable-fix 
> mm/page_alloc.c
> diff -puN mm/vmstat.c~mm-sysctl-make-numa-stats-configurable-fix mm/vmstat.c
> --- a/mm/vmstat.c~mm-sysctl-make-numa-stats-configurable-fix
> +++ a/mm/vmstat.c
> @@ -40,13 +40,11 @@ static DEFINE_MUTEX(vm_numa_stats_mode_l
>  
>  static int __parse_vm_numa_stats_mode(char *s)
>  {
> - const char *str = s;
> -
> - if (strcmp(str, "auto") == 0 || strcmp(str, 

Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-09-28 Thread kemi


On 2017年09月29日 05:29, Andrew Morton wrote:
> On Thu, 28 Sep 2017 14:11:41 +0800 Kemi Wang  wrote:
> 
>> This is the second step which introduces a tunable interface that allow
>> numa stats configurable for optimizing zone_statistics(), as suggested by
>> Dave Hansen and Ying Huang.
> 
> Looks OK I guess.
> 
> I fiddled with it a lot.  Please consider:
> 

Thanks for your help to make it more graceful! I will be more careful next time.
There may be a typo error in Documentation/sysctl/vm.txt, see comment below.

> From: Andrew Morton 
> Subject: mm-sysctl-make-numa-stats-configurable-fix
> 
> - tweak documentation
> 
> - move advisory message from start_kernel() into mm_init() (I'm not sure
>   we really need this message)
> 
> - use strcasecmp() in __parse_vm_numa_stats_mode()
> 
> - clean up coding style amd nessages in sysctl_vm_numa_stats_mode_handler()
> 
> Cc: Aaron Lu 
> Cc: Andi Kleen 
> Cc: Christopher Lameter 
> Cc: Dave Hansen 
> Cc: Jesper Dangaard Brouer 
> Cc: Johannes Weiner 
> Cc: Jonathan Corbet 
> Cc: Kees Cook 
> Cc: Kemi Wang 
> Cc: "Luis R . Rodriguez" 
> Cc: Mel Gorman 
> Cc: Michal Hocko 
> Cc: Sebastian Andrzej Siewior 
> Cc: Tim Chen 
> Cc: Vlastimil Babka 
> Cc: Ying Huang 
> Signed-off-by: Andrew Morton 
> ---
> 
>  Documentation/sysctl/vm.txt |   15 ++---
>  init/main.c |6 ++---
>  mm/vmstat.c |   39 +++---
>  3 files changed, 29 insertions(+), 31 deletions(-)
> 
> diff -puN 
> Documentation/sysctl/vm.txt~mm-sysctl-make-numa-stats-configurable-fix 
> Documentation/sysctl/vm.txt
> --- a/Documentation/sysctl/vm.txt~mm-sysctl-make-numa-stats-configurable-fix
> +++ a/Documentation/sysctl/vm.txt
> @@ -853,7 +853,7 @@ ten times more freeable objects than the
>  
>  numa_stats_mode
>  
> -This interface allows numa statistics configurable.
> +This interface allows runtime configuration *or* numa statistics.
>  

typo? or->of/for?

>  When page allocation performance becomes a bottleneck and you can tolerate
>  some possible tool breakage and decreased numa counter precision, you can
> @@ -864,13 +864,14 @@ When page allocation performance is not
>  tooling to work, you can do:
>   echo [S|s]trict > /proc/sys/vm/numa_stat_mode
>  
> -We recommend automatic detection of numa statistics by system, because numa
> -statistics does not affect system's decision and it is very rarely
> -consumed. you can do:
> +We recommend automatic detection of numa statistics by system, because
> +numa statistics do not affect system decisions and it is very rarely
> +consumed.  In this case you can do:
>   echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> -This is also system default configuration, with this default setting, numa
> -counters update is skipped unless the counter is *read* by users at least
> -once.
> +
> +This is the system default configuration.  With this default setting, numa
> +counter updates are skipped until the counter is *read* by userspace at
> +least once.
>  
>  ==
>  
> diff -puN drivers/base/node.c~mm-sysctl-make-numa-stats-configurable-fix 
> drivers/base/node.c
> diff -puN include/linux/vmstat.h~mm-sysctl-make-numa-stats-configurable-fix 
> include/linux/vmstat.h
> diff -puN init/main.c~mm-sysctl-make-numa-stats-configurable-fix init/main.c
> --- a/init/main.c~mm-sysctl-make-numa-stats-configurable-fix
> +++ a/init/main.c
> @@ -504,6 +504,9 @@ static void __init mm_init(void)
>   pgtable_init();
>   vmalloc_init();
>   ioremap_huge_init();
> +#ifdef CONFIG_NUMA
> + pr_info("vmstat: NUMA stat updates are skipped unless they have been 
> used\n");
> +#endif
>  }
>  
>  asmlinkage __visible void __init start_kernel(void)
> @@ -567,9 +570,6 @@ asmlinkage __visible void __init start_k
>   sort_main_extable();
>   trap_init();
>   mm_init();
> -#ifdef CONFIG_NUMA
> - pr_info("vmstat: NUMA stats is skipped unless it has been consumed\n");
> -#endif
>  
>   ftrace_init();
>  
> diff -puN kernel/sysctl.c~mm-sysctl-make-numa-stats-configurable-fix 
> kernel/sysctl.c
> diff -puN mm/page_alloc.c~mm-sysctl-make-numa-stats-configurable-fix 
> mm/page_alloc.c
> diff -puN mm/vmstat.c~mm-sysctl-make-numa-stats-configurable-fix mm/vmstat.c
> --- a/mm/vmstat.c~mm-sysctl-make-numa-stats-configurable-fix
> +++ a/mm/vmstat.c
> @@ -40,13 +40,11 @@ static DEFINE_MUTEX(vm_numa_stats_mode_l
>  
>  static int __parse_vm_numa_stats_mode(char *s)
>  {
> - const char *str = s;
> -
> - if (strcmp(str, "auto") == 0 || strcmp(str, "Auto") == 0)
> + if (strcasecmp(s, "auto"))
>   vm_numa_stats_mode = VM_NUMA_STAT_AUTO_MODE;
> - else if (strcmp(str, "strict") == 0 || strcmp(str, "Strict") == 0)
> + else if (strcasecmp(s, "strict") == 0)
>   vm_numa_stats_mode = VM_NUMA_STAT_STRICT_MODE;
> - else if (strcmp(str, "coarse") == 0 || strcmp(str, "Coarse") == 0)
> + else if (strcasecmp(s, 

Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-09-28 Thread Andrew Morton
On Thu, 28 Sep 2017 14:11:41 +0800 Kemi Wang  wrote:

> This is the second step which introduces a tunable interface that allow
> numa stats configurable for optimizing zone_statistics(), as suggested by
> Dave Hansen and Ying Huang.

Looks OK I guess.

I fiddled with it a lot.  Please consider:

From: Andrew Morton 
Subject: mm-sysctl-make-numa-stats-configurable-fix

- tweak documentation

- move advisory message from start_kernel() into mm_init() (I'm not sure
  we really need this message)

- use strcasecmp() in __parse_vm_numa_stats_mode()

- clean up coding style amd nessages in sysctl_vm_numa_stats_mode_handler()

Cc: Aaron Lu 
Cc: Andi Kleen 
Cc: Christopher Lameter 
Cc: Dave Hansen 
Cc: Jesper Dangaard Brouer 
Cc: Johannes Weiner 
Cc: Jonathan Corbet 
Cc: Kees Cook 
Cc: Kemi Wang 
Cc: "Luis R . Rodriguez" 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Sebastian Andrzej Siewior 
Cc: Tim Chen 
Cc: Vlastimil Babka 
Cc: Ying Huang 
Signed-off-by: Andrew Morton 
---

 Documentation/sysctl/vm.txt |   15 ++---
 init/main.c |6 ++---
 mm/vmstat.c |   39 +++---
 3 files changed, 29 insertions(+), 31 deletions(-)

diff -puN 
Documentation/sysctl/vm.txt~mm-sysctl-make-numa-stats-configurable-fix 
Documentation/sysctl/vm.txt
--- a/Documentation/sysctl/vm.txt~mm-sysctl-make-numa-stats-configurable-fix
+++ a/Documentation/sysctl/vm.txt
@@ -853,7 +853,7 @@ ten times more freeable objects than the
 
 numa_stats_mode
 
-This interface allows numa statistics configurable.
+This interface allows runtime configuration or numa statistics.
 
 When page allocation performance becomes a bottleneck and you can tolerate
 some possible tool breakage and decreased numa counter precision, you can
@@ -864,13 +864,14 @@ When page allocation performance is not
 tooling to work, you can do:
echo [S|s]trict > /proc/sys/vm/numa_stat_mode
 
-We recommend automatic detection of numa statistics by system, because numa
-statistics does not affect system's decision and it is very rarely
-consumed. you can do:
+We recommend automatic detection of numa statistics by system, because
+numa statistics do not affect system decisions and it is very rarely
+consumed.  In this case you can do:
echo [A|a]uto > /proc/sys/vm/numa_stats_mode
-This is also system default configuration, with this default setting, numa
-counters update is skipped unless the counter is *read* by users at least
-once.
+
+This is the system default configuration.  With this default setting, numa
+counter updates are skipped until the counter is *read* by userspace at
+least once.
 
 ==
 
diff -puN drivers/base/node.c~mm-sysctl-make-numa-stats-configurable-fix 
drivers/base/node.c
diff -puN include/linux/vmstat.h~mm-sysctl-make-numa-stats-configurable-fix 
include/linux/vmstat.h
diff -puN init/main.c~mm-sysctl-make-numa-stats-configurable-fix init/main.c
--- a/init/main.c~mm-sysctl-make-numa-stats-configurable-fix
+++ a/init/main.c
@@ -504,6 +504,9 @@ static void __init mm_init(void)
pgtable_init();
vmalloc_init();
ioremap_huge_init();
+#ifdef CONFIG_NUMA
+   pr_info("vmstat: NUMA stat updates are skipped unless they have been 
used\n");
+#endif
 }
 
 asmlinkage __visible void __init start_kernel(void)
@@ -567,9 +570,6 @@ asmlinkage __visible void __init start_k
sort_main_extable();
trap_init();
mm_init();
-#ifdef CONFIG_NUMA
-   pr_info("vmstat: NUMA stats is skipped unless it has been consumed\n");
-#endif
 
ftrace_init();
 
diff -puN kernel/sysctl.c~mm-sysctl-make-numa-stats-configurable-fix 
kernel/sysctl.c
diff -puN mm/page_alloc.c~mm-sysctl-make-numa-stats-configurable-fix 
mm/page_alloc.c
diff -puN mm/vmstat.c~mm-sysctl-make-numa-stats-configurable-fix mm/vmstat.c
--- a/mm/vmstat.c~mm-sysctl-make-numa-stats-configurable-fix
+++ a/mm/vmstat.c
@@ -40,13 +40,11 @@ static DEFINE_MUTEX(vm_numa_stats_mode_l
 
 static int __parse_vm_numa_stats_mode(char *s)
 {
-   const char *str = s;
-
-   if (strcmp(str, "auto") == 0 || strcmp(str, "Auto") == 0)
+   if (strcasecmp(s, "auto"))
vm_numa_stats_mode = VM_NUMA_STAT_AUTO_MODE;
-   else if (strcmp(str, "strict") == 0 || strcmp(str, "Strict") == 0)
+   else if (strcasecmp(s, "strict") == 0)
vm_numa_stats_mode = VM_NUMA_STAT_STRICT_MODE;
-   else if (strcmp(str, "coarse") == 0 || strcmp(str, "Coarse") == 0)
+   else if (strcasecmp(s, "coarse"))

Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-09-28 Thread Andrew Morton
On Thu, 28 Sep 2017 14:11:41 +0800 Kemi Wang  wrote:

> This is the second step which introduces a tunable interface that allow
> numa stats configurable for optimizing zone_statistics(), as suggested by
> Dave Hansen and Ying Huang.

Looks OK I guess.

I fiddled with it a lot.  Please consider:

From: Andrew Morton 
Subject: mm-sysctl-make-numa-stats-configurable-fix

- tweak documentation

- move advisory message from start_kernel() into mm_init() (I'm not sure
  we really need this message)

- use strcasecmp() in __parse_vm_numa_stats_mode()

- clean up coding style amd nessages in sysctl_vm_numa_stats_mode_handler()

Cc: Aaron Lu 
Cc: Andi Kleen 
Cc: Christopher Lameter 
Cc: Dave Hansen 
Cc: Jesper Dangaard Brouer 
Cc: Johannes Weiner 
Cc: Jonathan Corbet 
Cc: Kees Cook 
Cc: Kemi Wang 
Cc: "Luis R . Rodriguez" 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Sebastian Andrzej Siewior 
Cc: Tim Chen 
Cc: Vlastimil Babka 
Cc: Ying Huang 
Signed-off-by: Andrew Morton 
---

 Documentation/sysctl/vm.txt |   15 ++---
 init/main.c |6 ++---
 mm/vmstat.c |   39 +++---
 3 files changed, 29 insertions(+), 31 deletions(-)

diff -puN 
Documentation/sysctl/vm.txt~mm-sysctl-make-numa-stats-configurable-fix 
Documentation/sysctl/vm.txt
--- a/Documentation/sysctl/vm.txt~mm-sysctl-make-numa-stats-configurable-fix
+++ a/Documentation/sysctl/vm.txt
@@ -853,7 +853,7 @@ ten times more freeable objects than the
 
 numa_stats_mode
 
-This interface allows numa statistics configurable.
+This interface allows runtime configuration or numa statistics.
 
 When page allocation performance becomes a bottleneck and you can tolerate
 some possible tool breakage and decreased numa counter precision, you can
@@ -864,13 +864,14 @@ When page allocation performance is not
 tooling to work, you can do:
echo [S|s]trict > /proc/sys/vm/numa_stat_mode
 
-We recommend automatic detection of numa statistics by system, because numa
-statistics does not affect system's decision and it is very rarely
-consumed. you can do:
+We recommend automatic detection of numa statistics by system, because
+numa statistics do not affect system decisions and it is very rarely
+consumed.  In this case you can do:
echo [A|a]uto > /proc/sys/vm/numa_stats_mode
-This is also system default configuration, with this default setting, numa
-counters update is skipped unless the counter is *read* by users at least
-once.
+
+This is the system default configuration.  With this default setting, numa
+counter updates are skipped until the counter is *read* by userspace at
+least once.
 
 ==
 
diff -puN drivers/base/node.c~mm-sysctl-make-numa-stats-configurable-fix 
drivers/base/node.c
diff -puN include/linux/vmstat.h~mm-sysctl-make-numa-stats-configurable-fix 
include/linux/vmstat.h
diff -puN init/main.c~mm-sysctl-make-numa-stats-configurable-fix init/main.c
--- a/init/main.c~mm-sysctl-make-numa-stats-configurable-fix
+++ a/init/main.c
@@ -504,6 +504,9 @@ static void __init mm_init(void)
pgtable_init();
vmalloc_init();
ioremap_huge_init();
+#ifdef CONFIG_NUMA
+   pr_info("vmstat: NUMA stat updates are skipped unless they have been 
used\n");
+#endif
 }
 
 asmlinkage __visible void __init start_kernel(void)
@@ -567,9 +570,6 @@ asmlinkage __visible void __init start_k
sort_main_extable();
trap_init();
mm_init();
-#ifdef CONFIG_NUMA
-   pr_info("vmstat: NUMA stats is skipped unless it has been consumed\n");
-#endif
 
ftrace_init();
 
diff -puN kernel/sysctl.c~mm-sysctl-make-numa-stats-configurable-fix 
kernel/sysctl.c
diff -puN mm/page_alloc.c~mm-sysctl-make-numa-stats-configurable-fix 
mm/page_alloc.c
diff -puN mm/vmstat.c~mm-sysctl-make-numa-stats-configurable-fix mm/vmstat.c
--- a/mm/vmstat.c~mm-sysctl-make-numa-stats-configurable-fix
+++ a/mm/vmstat.c
@@ -40,13 +40,11 @@ static DEFINE_MUTEX(vm_numa_stats_mode_l
 
 static int __parse_vm_numa_stats_mode(char *s)
 {
-   const char *str = s;
-
-   if (strcmp(str, "auto") == 0 || strcmp(str, "Auto") == 0)
+   if (strcasecmp(s, "auto"))
vm_numa_stats_mode = VM_NUMA_STAT_AUTO_MODE;
-   else if (strcmp(str, "strict") == 0 || strcmp(str, "Strict") == 0)
+   else if (strcasecmp(s, "strict") == 0)
vm_numa_stats_mode = VM_NUMA_STAT_STRICT_MODE;
-   else if (strcmp(str, "coarse") == 0 || strcmp(str, "Coarse") == 0)
+   else if (strcasecmp(s, "coarse"))
vm_numa_stats_mode = VM_NUMA_STAT_COARSE_MODE;
else {
pr_warn("Ignoring invalid vm_numa_stats_mode value: %s\n", s);
@@ -86,30 +84,29 @@ int sysctl_vm_numa_stats_mode_handler(st
/* no change */
mutex_unlock(_numa_stats_mode_lock);
return 0;
-   } else if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
+