Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On 10/10/2017 05:39 PM, Dave Hansen wrote: > On 10/10/2017 07:57 AM, Michal Hocko wrote: >>> But, let's be honest, this leaves us with an option that nobody is ever >>> going to turn on. IOW, nobody except a very small portion of our users >>> will ever see any benefit from this. >> But aren't those small groups who would like to squeeze every single >> cycle out from the page allocator path the targeted audience? > > They're the reason we started looking at this. They also care the most. > > But, the cost of these stats, especially we get more and more cores in a > NUMA node is really making them show up in profiles. It would be nice > to get rid of them there, too. Furthermore, the group that actually looks at those stats, could be also expected to be quite small. The group that cares neither about the stats, nor relies on top allocator performance, might still arguably benefit from improved allocator performance, but won't for sure benefit from the stats. > Aaron, do you remember offhand how much of the allocator overhead was > coming from NUMA stats? >
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On 10/10/2017 05:39 PM, Dave Hansen wrote: > On 10/10/2017 07:57 AM, Michal Hocko wrote: >>> But, let's be honest, this leaves us with an option that nobody is ever >>> going to turn on. IOW, nobody except a very small portion of our users >>> will ever see any benefit from this. >> But aren't those small groups who would like to squeeze every single >> cycle out from the page allocator path the targeted audience? > > They're the reason we started looking at this. They also care the most. > > But, the cost of these stats, especially we get more and more cores in a > NUMA node is really making them show up in profiles. It would be nice > to get rid of them there, too. Furthermore, the group that actually looks at those stats, could be also expected to be quite small. The group that cares neither about the stats, nor relies on top allocator performance, might still arguably benefit from improved allocator performance, but won't for sure benefit from the stats. > Aaron, do you remember offhand how much of the allocator overhead was > coming from NUMA stats? >
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On Tue, 10 Oct 2017 19:51:43 +0200 Michal Hockowrote: > On Tue 10-10-17 08:39:47, Dave Hansen wrote: > > On 10/10/2017 07:57 AM, Michal Hocko wrote: > > >> But, let's be honest, this leaves us with an option that nobody is ever > > >> going to turn on. IOW, nobody except a very small portion of our users > > >> will ever see any benefit from this. > > > But aren't those small groups who would like to squeeze every single > > > cycle out from the page allocator path the targeted audience? > > > > They're the reason we started looking at this. They also care the most. > > > > But, the cost of these stats, especially we get more and more cores in a > > NUMA node is really making them show up in profiles. It would be nice > > to get rid of them there, too. > > I am not opposing to the auto mode. I am just not sure it is a safe > default and I also think that we should add this on top if it is really > needed. Yup. Let's keep things simple unless a real need is demonstrated, please.
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On Tue, 10 Oct 2017 19:51:43 +0200 Michal Hocko wrote: > On Tue 10-10-17 08:39:47, Dave Hansen wrote: > > On 10/10/2017 07:57 AM, Michal Hocko wrote: > > >> But, let's be honest, this leaves us with an option that nobody is ever > > >> going to turn on. IOW, nobody except a very small portion of our users > > >> will ever see any benefit from this. > > > But aren't those small groups who would like to squeeze every single > > > cycle out from the page allocator path the targeted audience? > > > > They're the reason we started looking at this. They also care the most. > > > > But, the cost of these stats, especially we get more and more cores in a > > NUMA node is really making them show up in profiles. It would be nice > > to get rid of them there, too. > > I am not opposing to the auto mode. I am just not sure it is a safe > default and I also think that we should add this on top if it is really > needed. Yup. Let's keep things simple unless a real need is demonstrated, please.
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On Tue 10-10-17 08:39:47, Dave Hansen wrote: > On 10/10/2017 07:57 AM, Michal Hocko wrote: > >> But, let's be honest, this leaves us with an option that nobody is ever > >> going to turn on. IOW, nobody except a very small portion of our users > >> will ever see any benefit from this. > > But aren't those small groups who would like to squeeze every single > > cycle out from the page allocator path the targeted audience? > > They're the reason we started looking at this. They also care the most. > > But, the cost of these stats, especially we get more and more cores in a > NUMA node is really making them show up in profiles. It would be nice > to get rid of them there, too. I am not opposing to the auto mode. I am just not sure it is a safe default and I also think that we should add this on top if it is really needed. -- Michal Hocko SUSE Labs
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On Tue 10-10-17 08:39:47, Dave Hansen wrote: > On 10/10/2017 07:57 AM, Michal Hocko wrote: > >> But, let's be honest, this leaves us with an option that nobody is ever > >> going to turn on. IOW, nobody except a very small portion of our users > >> will ever see any benefit from this. > > But aren't those small groups who would like to squeeze every single > > cycle out from the page allocator path the targeted audience? > > They're the reason we started looking at this. They also care the most. > > But, the cost of these stats, especially we get more and more cores in a > NUMA node is really making them show up in profiles. It would be nice > to get rid of them there, too. I am not opposing to the auto mode. I am just not sure it is a safe default and I also think that we should add this on top if it is really needed. -- Michal Hocko SUSE Labs
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On 10/10/2017 07:57 AM, Michal Hocko wrote: >> But, let's be honest, this leaves us with an option that nobody is ever >> going to turn on. IOW, nobody except a very small portion of our users >> will ever see any benefit from this. > But aren't those small groups who would like to squeeze every single > cycle out from the page allocator path the targeted audience? They're the reason we started looking at this. They also care the most. But, the cost of these stats, especially we get more and more cores in a NUMA node is really making them show up in profiles. It would be nice to get rid of them there, too. Aaron, do you remember offhand how much of the allocator overhead was coming from NUMA stats?
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On 10/10/2017 07:57 AM, Michal Hocko wrote: >> But, let's be honest, this leaves us with an option that nobody is ever >> going to turn on. IOW, nobody except a very small portion of our users >> will ever see any benefit from this. > But aren't those small groups who would like to squeeze every single > cycle out from the page allocator path the targeted audience? They're the reason we started looking at this. They also care the most. But, the cost of these stats, especially we get more and more cores in a NUMA node is really making them show up in profiles. It would be nice to get rid of them there, too. Aaron, do you remember offhand how much of the allocator overhead was coming from NUMA stats?
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On Tue, 10 Oct 2017, Michal Hocko wrote: > > But, let's be honest, this leaves us with an option that nobody is ever > > going to turn on. IOW, nobody except a very small portion of our users > > will ever see any benefit from this. > > But aren't those small groups who would like to squeeze every single > cycle out from the page allocator path the targeted audience? Those have long sine raised the white flag and succumbed to the featuritis. Resigned to try to keep the bloat restricted to a couple of cores so that the rest of the cores stay usable.
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On Tue, 10 Oct 2017, Michal Hocko wrote: > > But, let's be honest, this leaves us with an option that nobody is ever > > going to turn on. IOW, nobody except a very small portion of our users > > will ever see any benefit from this. > > But aren't those small groups who would like to squeeze every single > cycle out from the page allocator path the targeted audience? Those have long sine raised the white flag and succumbed to the featuritis. Resigned to try to keep the bloat restricted to a couple of cores so that the rest of the cores stay usable.
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On Tue 10-10-17 07:53:50, Dave Hansen wrote: > On 10/10/2017 07:31 AM, Michal Hocko wrote: > > On Tue 10-10-17 07:29:31, Dave Hansen wrote: > >> On 10/09/2017 10:49 PM, Michal Hocko wrote: > >>> Anyway I still stand by my position that this sounds over-engineered and > >>> a simple 0/1 resp. on/off interface would be both simpler and safer. If > >>> anybody wants an auto mode it can be added later (as a value 2 resp. > >>> auto). > >> > >> 0/1 with the default set to the strict, slower mode? > > > > yes, keep the current semantic and allow users who care to disable > > something that stands in the way. > > But, let's be honest, this leaves us with an option that nobody is ever > going to turn on. IOW, nobody except a very small portion of our users > will ever see any benefit from this. But aren't those small groups who would like to squeeze every single cycle out from the page allocator path the targeted audience? -- Michal Hocko SUSE Labs
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On Tue 10-10-17 07:53:50, Dave Hansen wrote: > On 10/10/2017 07:31 AM, Michal Hocko wrote: > > On Tue 10-10-17 07:29:31, Dave Hansen wrote: > >> On 10/09/2017 10:49 PM, Michal Hocko wrote: > >>> Anyway I still stand by my position that this sounds over-engineered and > >>> a simple 0/1 resp. on/off interface would be both simpler and safer. If > >>> anybody wants an auto mode it can be added later (as a value 2 resp. > >>> auto). > >> > >> 0/1 with the default set to the strict, slower mode? > > > > yes, keep the current semantic and allow users who care to disable > > something that stands in the way. > > But, let's be honest, this leaves us with an option that nobody is ever > going to turn on. IOW, nobody except a very small portion of our users > will ever see any benefit from this. But aren't those small groups who would like to squeeze every single cycle out from the page allocator path the targeted audience? -- Michal Hocko SUSE Labs
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On 10/10/2017 07:31 AM, Michal Hocko wrote: > On Tue 10-10-17 07:29:31, Dave Hansen wrote: >> On 10/09/2017 10:49 PM, Michal Hocko wrote: >>> Anyway I still stand by my position that this sounds over-engineered and >>> a simple 0/1 resp. on/off interface would be both simpler and safer. If >>> anybody wants an auto mode it can be added later (as a value 2 resp. >>> auto). >> >> 0/1 with the default set to the strict, slower mode? > > yes, keep the current semantic and allow users who care to disable > something that stands in the way. But, let's be honest, this leaves us with an option that nobody is ever going to turn on. IOW, nobody except a very small portion of our users will ever see any benefit from this.
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On 10/10/2017 07:31 AM, Michal Hocko wrote: > On Tue 10-10-17 07:29:31, Dave Hansen wrote: >> On 10/09/2017 10:49 PM, Michal Hocko wrote: >>> Anyway I still stand by my position that this sounds over-engineered and >>> a simple 0/1 resp. on/off interface would be both simpler and safer. If >>> anybody wants an auto mode it can be added later (as a value 2 resp. >>> auto). >> >> 0/1 with the default set to the strict, slower mode? > > yes, keep the current semantic and allow users who care to disable > something that stands in the way. But, let's be honest, this leaves us with an option that nobody is ever going to turn on. IOW, nobody except a very small portion of our users will ever see any benefit from this.
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On Tue 10-10-17 07:29:31, Dave Hansen wrote: > On 10/09/2017 10:49 PM, Michal Hocko wrote: > > On Mon 09-10-17 09:55:49, Michal Hocko wrote: > >> I haven't checked closely but what happens (or should happen) when you > >> do a partial read? Should you get an inconsistent results? Or is this > >> impossible? > > Well, after thinking about it little bit more, partial reads are always > > inconsistent so this wouldn't add a new problem. > > > > Anyway I still stand by my position that this sounds over-engineered and > > a simple 0/1 resp. on/off interface would be both simpler and safer. If > > anybody wants an auto mode it can be added later (as a value 2 resp. > > auto). > > 0/1 with the default set to the strict, slower mode? yes, keep the current semantic and allow users who care to disable something that stands in the way. -- Michal Hocko SUSE Labs
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On Tue 10-10-17 07:29:31, Dave Hansen wrote: > On 10/09/2017 10:49 PM, Michal Hocko wrote: > > On Mon 09-10-17 09:55:49, Michal Hocko wrote: > >> I haven't checked closely but what happens (or should happen) when you > >> do a partial read? Should you get an inconsistent results? Or is this > >> impossible? > > Well, after thinking about it little bit more, partial reads are always > > inconsistent so this wouldn't add a new problem. > > > > Anyway I still stand by my position that this sounds over-engineered and > > a simple 0/1 resp. on/off interface would be both simpler and safer. If > > anybody wants an auto mode it can be added later (as a value 2 resp. > > auto). > > 0/1 with the default set to the strict, slower mode? yes, keep the current semantic and allow users who care to disable something that stands in the way. -- Michal Hocko SUSE Labs
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On 10/09/2017 10:49 PM, Michal Hocko wrote: > On Mon 09-10-17 09:55:49, Michal Hocko wrote: >> I haven't checked closely but what happens (or should happen) when you >> do a partial read? Should you get an inconsistent results? Or is this >> impossible? > Well, after thinking about it little bit more, partial reads are always > inconsistent so this wouldn't add a new problem. > > Anyway I still stand by my position that this sounds over-engineered and > a simple 0/1 resp. on/off interface would be both simpler and safer. If > anybody wants an auto mode it can be added later (as a value 2 resp. > auto). 0/1 with the default set to the strict, slower mode?
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On 10/09/2017 10:49 PM, Michal Hocko wrote: > On Mon 09-10-17 09:55:49, Michal Hocko wrote: >> I haven't checked closely but what happens (or should happen) when you >> do a partial read? Should you get an inconsistent results? Or is this >> impossible? > Well, after thinking about it little bit more, partial reads are always > inconsistent so this wouldn't add a new problem. > > Anyway I still stand by my position that this sounds over-engineered and > a simple 0/1 resp. on/off interface would be both simpler and safer. If > anybody wants an auto mode it can be added later (as a value 2 resp. > auto). 0/1 with the default set to the strict, slower mode?
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On 2017年10月10日 13:49, Michal Hocko wrote: > On Mon 09-10-17 09:55:49, Michal Hocko wrote: >> I haven't checked closely but what happens (or should happen) when you >> do a partial read? Should you get an inconsistent results? Or is this >> impossible? > > Well, after thinking about it little bit more, partial reads are always > inconsistent so this wouldn't add a new problem. > > Anyway I still stand by my position that this sounds over-engineered and > a simple 0/1 resp. on/off interface would be both simpler and safer. If > anybody wants an auto mode it can be added later (as a value 2 resp. > auto). > It sounds good to me. If Andrew also tends to be a simple 0/1, I will submit V4 patch for it. Thanks
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On 2017年10月10日 13:49, Michal Hocko wrote: > On Mon 09-10-17 09:55:49, Michal Hocko wrote: >> I haven't checked closely but what happens (or should happen) when you >> do a partial read? Should you get an inconsistent results? Or is this >> impossible? > > Well, after thinking about it little bit more, partial reads are always > inconsistent so this wouldn't add a new problem. > > Anyway I still stand by my position that this sounds over-engineered and > a simple 0/1 resp. on/off interface would be both simpler and safer. If > anybody wants an auto mode it can be added later (as a value 2 resp. > auto). > It sounds good to me. If Andrew also tends to be a simple 0/1, I will submit V4 patch for it. Thanks
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On Mon 09-10-17 09:55:49, Michal Hocko wrote: > I haven't checked closely but what happens (or should happen) when you > do a partial read? Should you get an inconsistent results? Or is this > impossible? Well, after thinking about it little bit more, partial reads are always inconsistent so this wouldn't add a new problem. Anyway I still stand by my position that this sounds over-engineered and a simple 0/1 resp. on/off interface would be both simpler and safer. If anybody wants an auto mode it can be added later (as a value 2 resp. auto). -- Michal Hocko SUSE Labs
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On Mon 09-10-17 09:55:49, Michal Hocko wrote: > I haven't checked closely but what happens (or should happen) when you > do a partial read? Should you get an inconsistent results? Or is this > impossible? Well, after thinking about it little bit more, partial reads are always inconsistent so this wouldn't add a new problem. Anyway I still stand by my position that this sounds over-engineered and a simple 0/1 resp. on/off interface would be both simpler and safer. If anybody wants an auto mode it can be added later (as a value 2 resp. auto). -- Michal Hocko SUSE Labs
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On Mon 09-10-17 14:34:11, kemi wrote: > > > On 2017年10月03日 17:23, Michal Hocko wrote: > > On Thu 28-09-17 14:11:41, Kemi Wang wrote: > >> This is the second step which introduces a tunable interface that allow > >> numa stats configurable for optimizing zone_statistics(), as suggested by > >> Dave Hansen and Ying Huang. > >> > >> = > >> When page allocation performance becomes a bottleneck and you can tolerate > >> some possible tool breakage and decreased numa counter precision, you can > >> do: > >>echo [C|c]oarse > /proc/sys/vm/numa_stats_mode > >> In this case, numa counter update is ignored. We can see about > >> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim > >> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu > >> cycles per single page allocation and reclaim on Jesper's page_bench03 (88 > >> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G > >> memory). > >> > >> Benchmark link provided by Jesper D Brouer(increase loop times to > >> 1000): > >> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/ > >> bench > >> > >> = > >> When page allocation performance is not a bottleneck and you want all > >> tooling to work, you can do: > >>echo [S|s]trict > /proc/sys/vm/numa_stats_mode > >> > >> = > >> We recommend automatic detection of numa statistics by system, this is also > >> system default configuration, you can do: > >>echo [A|a]uto > /proc/sys/vm/numa_stats_mode > >> In this case, numa counter update is skipped unless it has been read by > >> users at least once, e.g. cat /proc/zoneinfo. > > > > I am still not convinced the auto mode is worth all the additional code > > and a safe default to use. The whole thing could have been 0/1 with a > > simpler parsing and less code to catch readers. > > > > I understood your concern. > Well, we may get rid of auto mode if there is some obvious disadvantage > here. Now, I tend to keep it because most people may not touch this interface, > and auto mode is helpful in such case. But you cannot guarantee it won't break any existing users, can you? Besides I do not remember anybody complaining about the performance impact of these counters other than very specialized workloads which are going to disable the accounting altogether. So I simply fail to see a reason to add more code with a questionable semantic (see below on partial reads). > > E.g. why do we have to do static_branch_enable on any read or even > > vmstat_stop? Wouldn't open be sufficient? > > > > NUMA stats is used in four files: > /proc/zoneinfo > /proc/vmstat > /sys/devices/system/node/node*/numastat > /sys/devices/system/node/node*/vmstat > In auto mode, each *read* will trigger the update of NUMA counter. > So, we should make sure the target branch is jumped to the branch > for NUMA counter update once the file is read from user space. > the intension of static_branch_enable in vmstat_stop(in the call site > of file->file_ops.read) is for reading /proc/vmstat in case. > > I guess the *open* means file->file_op.open here, right? > Do you suggest to move static_branch_enable to file->file_op.open? Thanks. I haven't checked closely but what happens (or should happen) when you do a partial read? Should you get an inconsistent results? Or is this impossible? -- Michal Hocko SUSE Labs
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On Mon 09-10-17 14:34:11, kemi wrote: > > > On 2017年10月03日 17:23, Michal Hocko wrote: > > On Thu 28-09-17 14:11:41, Kemi Wang wrote: > >> This is the second step which introduces a tunable interface that allow > >> numa stats configurable for optimizing zone_statistics(), as suggested by > >> Dave Hansen and Ying Huang. > >> > >> = > >> When page allocation performance becomes a bottleneck and you can tolerate > >> some possible tool breakage and decreased numa counter precision, you can > >> do: > >>echo [C|c]oarse > /proc/sys/vm/numa_stats_mode > >> In this case, numa counter update is ignored. We can see about > >> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim > >> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu > >> cycles per single page allocation and reclaim on Jesper's page_bench03 (88 > >> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G > >> memory). > >> > >> Benchmark link provided by Jesper D Brouer(increase loop times to > >> 1000): > >> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/ > >> bench > >> > >> = > >> When page allocation performance is not a bottleneck and you want all > >> tooling to work, you can do: > >>echo [S|s]trict > /proc/sys/vm/numa_stats_mode > >> > >> = > >> We recommend automatic detection of numa statistics by system, this is also > >> system default configuration, you can do: > >>echo [A|a]uto > /proc/sys/vm/numa_stats_mode > >> In this case, numa counter update is skipped unless it has been read by > >> users at least once, e.g. cat /proc/zoneinfo. > > > > I am still not convinced the auto mode is worth all the additional code > > and a safe default to use. The whole thing could have been 0/1 with a > > simpler parsing and less code to catch readers. > > > > I understood your concern. > Well, we may get rid of auto mode if there is some obvious disadvantage > here. Now, I tend to keep it because most people may not touch this interface, > and auto mode is helpful in such case. But you cannot guarantee it won't break any existing users, can you? Besides I do not remember anybody complaining about the performance impact of these counters other than very specialized workloads which are going to disable the accounting altogether. So I simply fail to see a reason to add more code with a questionable semantic (see below on partial reads). > > E.g. why do we have to do static_branch_enable on any read or even > > vmstat_stop? Wouldn't open be sufficient? > > > > NUMA stats is used in four files: > /proc/zoneinfo > /proc/vmstat > /sys/devices/system/node/node*/numastat > /sys/devices/system/node/node*/vmstat > In auto mode, each *read* will trigger the update of NUMA counter. > So, we should make sure the target branch is jumped to the branch > for NUMA counter update once the file is read from user space. > the intension of static_branch_enable in vmstat_stop(in the call site > of file->file_ops.read) is for reading /proc/vmstat in case. > > I guess the *open* means file->file_op.open here, right? > Do you suggest to move static_branch_enable to file->file_op.open? Thanks. I haven't checked closely but what happens (or should happen) when you do a partial read? Should you get an inconsistent results? Or is this impossible? -- Michal Hocko SUSE Labs
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On 2017年10月03日 17:23, Michal Hocko wrote: > On Thu 28-09-17 14:11:41, Kemi Wang wrote: >> This is the second step which introduces a tunable interface that allow >> numa stats configurable for optimizing zone_statistics(), as suggested by >> Dave Hansen and Ying Huang. >> >> = >> When page allocation performance becomes a bottleneck and you can tolerate >> some possible tool breakage and decreased numa counter precision, you can >> do: >> echo [C|c]oarse > /proc/sys/vm/numa_stats_mode >> In this case, numa counter update is ignored. We can see about >> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim >> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu >> cycles per single page allocation and reclaim on Jesper's page_bench03 (88 >> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G >> memory). >> >> Benchmark link provided by Jesper D Brouer(increase loop times to >> 1000): >> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/ >> bench >> >> = >> When page allocation performance is not a bottleneck and you want all >> tooling to work, you can do: >> echo [S|s]trict > /proc/sys/vm/numa_stats_mode >> >> = >> We recommend automatic detection of numa statistics by system, this is also >> system default configuration, you can do: >> echo [A|a]uto > /proc/sys/vm/numa_stats_mode >> In this case, numa counter update is skipped unless it has been read by >> users at least once, e.g. cat /proc/zoneinfo. > > I am still not convinced the auto mode is worth all the additional code > and a safe default to use. The whole thing could have been 0/1 with a > simpler parsing and less code to catch readers. > I understood your concern. Well, we may get rid of auto mode if there is some obvious disadvantage here. Now, I tend to keep it because most people may not touch this interface, and auto mode is helpful in such case. > E.g. why do we have to do static_branch_enable on any read or even > vmstat_stop? Wouldn't open be sufficient? > NUMA stats is used in four files: /proc/zoneinfo /proc/vmstat /sys/devices/system/node/node*/numastat /sys/devices/system/node/node*/vmstat In auto mode, each *read* will trigger the update of NUMA counter. So, we should make sure the target branch is jumped to the branch for NUMA counter update once the file is read from user space. the intension of static_branch_enable in vmstat_stop(in the call site of file->file_ops.read) is for reading /proc/vmstat in case. I guess the *open* means file->file_op.open here, right? Do you suggest to move static_branch_enable to file->file_op.open? Thanks. >> @@ -153,6 +153,8 @@ static DEVICE_ATTR(meminfo, S_IRUGO, node_read_meminfo, >> NULL); >> static ssize_t node_read_numastat(struct device *dev, >> struct device_attribute *attr, char *buf) >> { >> +if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE) >> +static_branch_enable(_numa_stats_mode_key); >> return sprintf(buf, >> "numa_hit %lu\n" >> "numa_miss %lu\n" >> @@ -186,6 +188,8 @@ static ssize_t node_read_vmstat(struct device *dev, >> n += sprintf(buf+n, "%s %lu\n", >> vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], >> sum_zone_numa_state(nid, i)); >> +if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE) >> +static_branch_enable(_numa_stats_mode_key); >> #endif >> >> for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > [...] >> @@ -1582,6 +1703,10 @@ static int zoneinfo_show(struct seq_file *m, void >> *arg) >> { >> pg_data_t *pgdat = (pg_data_t *)arg; >> walk_zones_in_node(m, pgdat, false, false, zoneinfo_show_print); >> +#ifdef CONFIG_NUMA >> +if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE) >> +static_branch_enable(_numa_stats_mode_key); >> +#endif >> return 0; >> } >> >> @@ -1678,6 +1803,10 @@ static int vmstat_show(struct seq_file *m, void *arg) >> >> static void vmstat_stop(struct seq_file *m, void *arg) >> { >> +#ifdef CONFIG_NUMA >> +if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE) >> +static_branch_enable(_numa_stats_mode_key); >> +#endif >> kfree(m->private); >> m->private = NULL; >> } >> -- >> 2.7.4 >> >
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On 2017年10月03日 17:23, Michal Hocko wrote: > On Thu 28-09-17 14:11:41, Kemi Wang wrote: >> This is the second step which introduces a tunable interface that allow >> numa stats configurable for optimizing zone_statistics(), as suggested by >> Dave Hansen and Ying Huang. >> >> = >> When page allocation performance becomes a bottleneck and you can tolerate >> some possible tool breakage and decreased numa counter precision, you can >> do: >> echo [C|c]oarse > /proc/sys/vm/numa_stats_mode >> In this case, numa counter update is ignored. We can see about >> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim >> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu >> cycles per single page allocation and reclaim on Jesper's page_bench03 (88 >> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G >> memory). >> >> Benchmark link provided by Jesper D Brouer(increase loop times to >> 1000): >> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/ >> bench >> >> = >> When page allocation performance is not a bottleneck and you want all >> tooling to work, you can do: >> echo [S|s]trict > /proc/sys/vm/numa_stats_mode >> >> = >> We recommend automatic detection of numa statistics by system, this is also >> system default configuration, you can do: >> echo [A|a]uto > /proc/sys/vm/numa_stats_mode >> In this case, numa counter update is skipped unless it has been read by >> users at least once, e.g. cat /proc/zoneinfo. > > I am still not convinced the auto mode is worth all the additional code > and a safe default to use. The whole thing could have been 0/1 with a > simpler parsing and less code to catch readers. > I understood your concern. Well, we may get rid of auto mode if there is some obvious disadvantage here. Now, I tend to keep it because most people may not touch this interface, and auto mode is helpful in such case. > E.g. why do we have to do static_branch_enable on any read or even > vmstat_stop? Wouldn't open be sufficient? > NUMA stats is used in four files: /proc/zoneinfo /proc/vmstat /sys/devices/system/node/node*/numastat /sys/devices/system/node/node*/vmstat In auto mode, each *read* will trigger the update of NUMA counter. So, we should make sure the target branch is jumped to the branch for NUMA counter update once the file is read from user space. the intension of static_branch_enable in vmstat_stop(in the call site of file->file_ops.read) is for reading /proc/vmstat in case. I guess the *open* means file->file_op.open here, right? Do you suggest to move static_branch_enable to file->file_op.open? Thanks. >> @@ -153,6 +153,8 @@ static DEVICE_ATTR(meminfo, S_IRUGO, node_read_meminfo, >> NULL); >> static ssize_t node_read_numastat(struct device *dev, >> struct device_attribute *attr, char *buf) >> { >> +if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE) >> +static_branch_enable(_numa_stats_mode_key); >> return sprintf(buf, >> "numa_hit %lu\n" >> "numa_miss %lu\n" >> @@ -186,6 +188,8 @@ static ssize_t node_read_vmstat(struct device *dev, >> n += sprintf(buf+n, "%s %lu\n", >> vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], >> sum_zone_numa_state(nid, i)); >> +if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE) >> +static_branch_enable(_numa_stats_mode_key); >> #endif >> >> for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > [...] >> @@ -1582,6 +1703,10 @@ static int zoneinfo_show(struct seq_file *m, void >> *arg) >> { >> pg_data_t *pgdat = (pg_data_t *)arg; >> walk_zones_in_node(m, pgdat, false, false, zoneinfo_show_print); >> +#ifdef CONFIG_NUMA >> +if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE) >> +static_branch_enable(_numa_stats_mode_key); >> +#endif >> return 0; >> } >> >> @@ -1678,6 +1803,10 @@ static int vmstat_show(struct seq_file *m, void *arg) >> >> static void vmstat_stop(struct seq_file *m, void *arg) >> { >> +#ifdef CONFIG_NUMA >> +if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE) >> +static_branch_enable(_numa_stats_mode_key); >> +#endif >> kfree(m->private); >> m->private = NULL; >> } >> -- >> 2.7.4 >> >
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On 2017年09月29日 15:03, Vlastimil Babka wrote: > On 09/28/2017 08:11 AM, Kemi Wang wrote: >> This is the second step which introduces a tunable interface that allow >> numa stats configurable for optimizing zone_statistics(), as suggested by >> Dave Hansen and Ying Huang. >> >> = >> When page allocation performance becomes a bottleneck and you can tolerate >> some possible tool breakage and decreased numa counter precision, you can >> do: >> echo [C|c]oarse > /proc/sys/vm/numa_stats_mode >> In this case, numa counter update is ignored. We can see about >> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim >> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu >> cycles per single page allocation and reclaim on Jesper's page_bench03 (88 >> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G >> memory). >> >> Benchmark link provided by Jesper D Brouer(increase loop times to >> 1000): >> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/ >> bench >> >> = >> When page allocation performance is not a bottleneck and you want all >> tooling to work, you can do: >> echo [S|s]trict > /proc/sys/vm/numa_stats_mode >> >> = >> We recommend automatic detection of numa statistics by system, this is also >> system default configuration, you can do: >> echo [A|a]uto > /proc/sys/vm/numa_stats_mode >> In this case, numa counter update is skipped unless it has been read by >> users at least once, e.g. cat /proc/zoneinfo. >> >> Branch target selection with jump label: >> a) When numa_stats_mode is changed to *strict*, jump to the branch for numa >> counters update. >> b) When numa_stats_mode is changed to *coarse*, return back directly. >> c) When numa_stats_mode is changed to *auto*, the branch target used in >> last time is kept, and the branch target is changed to the branch for numa >> counters update once numa counters are *read* by users. >> >> Therefore, with the help of jump label, the page allocation performance is >> hardly affected when numa counters are updated with a call in >> zone_statistics(). Meanwhile, the auto mode can give people benefit without >> manual tuning. >> >> Many thanks to Michal Hocko, Dave Hansen and Ying Huang for comments to >> help improve the original patch. >> >> ChangeLog: >> V2->V3: >> a) Propose a better way to use jump label to eliminate the overhead of >> branch selection in zone_statistics(), as inspired by Ying Huang; >> b) Add a paragraph in commit log to describe the way for branch target >> selection; >> c) Use a more descriptive name numa_stats_mode instead of vmstat_mode, >> and change the description accordingly, as suggested by Michal Hocko; >> d) Make this functionality NUMA-specific via ifdef >> >> V1->V2: >> a) Merge to one patch; >> b) Use jump label to eliminate the overhead of branch selection; >> c) Add a single-time log message at boot time to help tell users what >> happened. >> >> Reported-by: Jesper Dangaard Brouer>> Suggested-by: Dave Hansen >> Suggested-by: Ying Huang >> Signed-off-by: Kemi Wang >> --- >> Documentation/sysctl/vm.txt | 24 + >> drivers/base/node.c | 4 ++ >> include/linux/vmstat.h | 23 >> init/main.c | 3 ++ >> kernel/sysctl.c | 7 +++ >> mm/page_alloc.c | 10 >> mm/vmstat.c | 129 >> >> 7 files changed, 200 insertions(+) >> >> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt >> index 9baf66a..e310e69 100644 >> --- a/Documentation/sysctl/vm.txt >> +++ b/Documentation/sysctl/vm.txt >> @@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm: >> - swappiness >> - user_reserve_kbytes >> - vfs_cache_pressure >> +- numa_stats_mode >> - watermark_scale_factor >> - zone_reclaim_mode >> >> @@ -843,6 +844,29 @@ ten times more freeable objects than there are. >> >> = >> >> +numa_stats_mode >> + >> +This interface allows numa statistics configurable. >> + >> +When page allocation performance becomes a bottleneck and you can tolerate >> +some possible tool breakage and decreased numa counter precision, you can >> +do: >> +echo [C|c]oarse > /proc/sys/vm/numa_stats_mode >> + >> +When page allocation performance is not a bottleneck and you want all >> +tooling to work, you can do: >> +echo [S|s]trict > /proc/sys/vm/numa_stat_mode >> + >> +We recommend automatic detection of numa statistics by system, because numa >> +statistics does not affect system's decision and it is very rarely
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On 2017年09月29日 15:03, Vlastimil Babka wrote: > On 09/28/2017 08:11 AM, Kemi Wang wrote: >> This is the second step which introduces a tunable interface that allow >> numa stats configurable for optimizing zone_statistics(), as suggested by >> Dave Hansen and Ying Huang. >> >> = >> When page allocation performance becomes a bottleneck and you can tolerate >> some possible tool breakage and decreased numa counter precision, you can >> do: >> echo [C|c]oarse > /proc/sys/vm/numa_stats_mode >> In this case, numa counter update is ignored. We can see about >> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim >> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu >> cycles per single page allocation and reclaim on Jesper's page_bench03 (88 >> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G >> memory). >> >> Benchmark link provided by Jesper D Brouer(increase loop times to >> 1000): >> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/ >> bench >> >> = >> When page allocation performance is not a bottleneck and you want all >> tooling to work, you can do: >> echo [S|s]trict > /proc/sys/vm/numa_stats_mode >> >> = >> We recommend automatic detection of numa statistics by system, this is also >> system default configuration, you can do: >> echo [A|a]uto > /proc/sys/vm/numa_stats_mode >> In this case, numa counter update is skipped unless it has been read by >> users at least once, e.g. cat /proc/zoneinfo. >> >> Branch target selection with jump label: >> a) When numa_stats_mode is changed to *strict*, jump to the branch for numa >> counters update. >> b) When numa_stats_mode is changed to *coarse*, return back directly. >> c) When numa_stats_mode is changed to *auto*, the branch target used in >> last time is kept, and the branch target is changed to the branch for numa >> counters update once numa counters are *read* by users. >> >> Therefore, with the help of jump label, the page allocation performance is >> hardly affected when numa counters are updated with a call in >> zone_statistics(). Meanwhile, the auto mode can give people benefit without >> manual tuning. >> >> Many thanks to Michal Hocko, Dave Hansen and Ying Huang for comments to >> help improve the original patch. >> >> ChangeLog: >> V2->V3: >> a) Propose a better way to use jump label to eliminate the overhead of >> branch selection in zone_statistics(), as inspired by Ying Huang; >> b) Add a paragraph in commit log to describe the way for branch target >> selection; >> c) Use a more descriptive name numa_stats_mode instead of vmstat_mode, >> and change the description accordingly, as suggested by Michal Hocko; >> d) Make this functionality NUMA-specific via ifdef >> >> V1->V2: >> a) Merge to one patch; >> b) Use jump label to eliminate the overhead of branch selection; >> c) Add a single-time log message at boot time to help tell users what >> happened. >> >> Reported-by: Jesper Dangaard Brouer >> Suggested-by: Dave Hansen >> Suggested-by: Ying Huang >> Signed-off-by: Kemi Wang >> --- >> Documentation/sysctl/vm.txt | 24 + >> drivers/base/node.c | 4 ++ >> include/linux/vmstat.h | 23 >> init/main.c | 3 ++ >> kernel/sysctl.c | 7 +++ >> mm/page_alloc.c | 10 >> mm/vmstat.c | 129 >> >> 7 files changed, 200 insertions(+) >> >> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt >> index 9baf66a..e310e69 100644 >> --- a/Documentation/sysctl/vm.txt >> +++ b/Documentation/sysctl/vm.txt >> @@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm: >> - swappiness >> - user_reserve_kbytes >> - vfs_cache_pressure >> +- numa_stats_mode >> - watermark_scale_factor >> - zone_reclaim_mode >> >> @@ -843,6 +844,29 @@ ten times more freeable objects than there are. >> >> = >> >> +numa_stats_mode >> + >> +This interface allows numa statistics configurable. >> + >> +When page allocation performance becomes a bottleneck and you can tolerate >> +some possible tool breakage and decreased numa counter precision, you can >> +do: >> +echo [C|c]oarse > /proc/sys/vm/numa_stats_mode >> + >> +When page allocation performance is not a bottleneck and you want all >> +tooling to work, you can do: >> +echo [S|s]trict > /proc/sys/vm/numa_stat_mode >> + >> +We recommend automatic detection of numa statistics by system, because numa >> +statistics does not affect system's decision and it is very rarely >> +consumed. you can do: >> +echo [A|a]uto > /proc/sys/vm/numa_stats_mode >>
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On Thu 28-09-17 14:11:41, Kemi Wang wrote: > This is the second step which introduces a tunable interface that allow > numa stats configurable for optimizing zone_statistics(), as suggested by > Dave Hansen and Ying Huang. > > = > When page allocation performance becomes a bottleneck and you can tolerate > some possible tool breakage and decreased numa counter precision, you can > do: > echo [C|c]oarse > /proc/sys/vm/numa_stats_mode > In this case, numa counter update is ignored. We can see about > *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim > on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu > cycles per single page allocation and reclaim on Jesper's page_bench03 (88 > threads) running on a 2-Socket Broadwell-based server (88 threads, 126G > memory). > > Benchmark link provided by Jesper D Brouer(increase loop times to > 1000): > https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/ > bench > > = > When page allocation performance is not a bottleneck and you want all > tooling to work, you can do: > echo [S|s]trict > /proc/sys/vm/numa_stats_mode > > = > We recommend automatic detection of numa statistics by system, this is also > system default configuration, you can do: > echo [A|a]uto > /proc/sys/vm/numa_stats_mode > In this case, numa counter update is skipped unless it has been read by > users at least once, e.g. cat /proc/zoneinfo. I am still not convinced the auto mode is worth all the additional code and a safe default to use. The whole thing could have been 0/1 with a simpler parsing and less code to catch readers. E.g. why do we have to do static_branch_enable on any read or even vmstat_stop? Wouldn't open be sufficient? > @@ -153,6 +153,8 @@ static DEVICE_ATTR(meminfo, S_IRUGO, node_read_meminfo, > NULL); > static ssize_t node_read_numastat(struct device *dev, > struct device_attribute *attr, char *buf) > { > + if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE) > + static_branch_enable(_numa_stats_mode_key); > return sprintf(buf, > "numa_hit %lu\n" > "numa_miss %lu\n" > @@ -186,6 +188,8 @@ static ssize_t node_read_vmstat(struct device *dev, > n += sprintf(buf+n, "%s %lu\n", >vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], >sum_zone_numa_state(nid, i)); > + if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE) > + static_branch_enable(_numa_stats_mode_key); > #endif > > for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) [...] > @@ -1582,6 +1703,10 @@ static int zoneinfo_show(struct seq_file *m, void *arg) > { > pg_data_t *pgdat = (pg_data_t *)arg; > walk_zones_in_node(m, pgdat, false, false, zoneinfo_show_print); > +#ifdef CONFIG_NUMA > + if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE) > + static_branch_enable(_numa_stats_mode_key); > +#endif > return 0; > } > > @@ -1678,6 +1803,10 @@ static int vmstat_show(struct seq_file *m, void *arg) > > static void vmstat_stop(struct seq_file *m, void *arg) > { > +#ifdef CONFIG_NUMA > + if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE) > + static_branch_enable(_numa_stats_mode_key); > +#endif > kfree(m->private); > m->private = NULL; > } > -- > 2.7.4 > -- Michal Hocko SUSE Labs
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On Thu 28-09-17 14:11:41, Kemi Wang wrote: > This is the second step which introduces a tunable interface that allow > numa stats configurable for optimizing zone_statistics(), as suggested by > Dave Hansen and Ying Huang. > > = > When page allocation performance becomes a bottleneck and you can tolerate > some possible tool breakage and decreased numa counter precision, you can > do: > echo [C|c]oarse > /proc/sys/vm/numa_stats_mode > In this case, numa counter update is ignored. We can see about > *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim > on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu > cycles per single page allocation and reclaim on Jesper's page_bench03 (88 > threads) running on a 2-Socket Broadwell-based server (88 threads, 126G > memory). > > Benchmark link provided by Jesper D Brouer(increase loop times to > 1000): > https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/ > bench > > = > When page allocation performance is not a bottleneck and you want all > tooling to work, you can do: > echo [S|s]trict > /proc/sys/vm/numa_stats_mode > > = > We recommend automatic detection of numa statistics by system, this is also > system default configuration, you can do: > echo [A|a]uto > /proc/sys/vm/numa_stats_mode > In this case, numa counter update is skipped unless it has been read by > users at least once, e.g. cat /proc/zoneinfo. I am still not convinced the auto mode is worth all the additional code and a safe default to use. The whole thing could have been 0/1 with a simpler parsing and less code to catch readers. E.g. why do we have to do static_branch_enable on any read or even vmstat_stop? Wouldn't open be sufficient? > @@ -153,6 +153,8 @@ static DEVICE_ATTR(meminfo, S_IRUGO, node_read_meminfo, > NULL); > static ssize_t node_read_numastat(struct device *dev, > struct device_attribute *attr, char *buf) > { > + if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE) > + static_branch_enable(_numa_stats_mode_key); > return sprintf(buf, > "numa_hit %lu\n" > "numa_miss %lu\n" > @@ -186,6 +188,8 @@ static ssize_t node_read_vmstat(struct device *dev, > n += sprintf(buf+n, "%s %lu\n", >vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], >sum_zone_numa_state(nid, i)); > + if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE) > + static_branch_enable(_numa_stats_mode_key); > #endif > > for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) [...] > @@ -1582,6 +1703,10 @@ static int zoneinfo_show(struct seq_file *m, void *arg) > { > pg_data_t *pgdat = (pg_data_t *)arg; > walk_zones_in_node(m, pgdat, false, false, zoneinfo_show_print); > +#ifdef CONFIG_NUMA > + if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE) > + static_branch_enable(_numa_stats_mode_key); > +#endif > return 0; > } > > @@ -1678,6 +1803,10 @@ static int vmstat_show(struct seq_file *m, void *arg) > > static void vmstat_stop(struct seq_file *m, void *arg) > { > +#ifdef CONFIG_NUMA > + if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE) > + static_branch_enable(_numa_stats_mode_key); > +#endif > kfree(m->private); > m->private = NULL; > } > -- > 2.7.4 > -- Michal Hocko SUSE Labs
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
[+CC linux-api] On 09/28/2017 08:11 AM, Kemi Wang wrote: > This is the second step which introduces a tunable interface that allow > numa stats configurable for optimizing zone_statistics(), as suggested by > Dave Hansen and Ying Huang. > > = > When page allocation performance becomes a bottleneck and you can tolerate > some possible tool breakage and decreased numa counter precision, you can > do: > echo [C|c]oarse > /proc/sys/vm/numa_stats_mode > In this case, numa counter update is ignored. We can see about > *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim > on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu > cycles per single page allocation and reclaim on Jesper's page_bench03 (88 > threads) running on a 2-Socket Broadwell-based server (88 threads, 126G > memory). > > Benchmark link provided by Jesper D Brouer(increase loop times to > 1000): > https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/ > bench > > = > When page allocation performance is not a bottleneck and you want all > tooling to work, you can do: > echo [S|s]trict > /proc/sys/vm/numa_stats_mode > > = > We recommend automatic detection of numa statistics by system, this is also > system default configuration, you can do: > echo [A|a]uto > /proc/sys/vm/numa_stats_mode > In this case, numa counter update is skipped unless it has been read by > users at least once, e.g. cat /proc/zoneinfo. > > Branch target selection with jump label: > a) When numa_stats_mode is changed to *strict*, jump to the branch for numa > counters update. > b) When numa_stats_mode is changed to *coarse*, return back directly. > c) When numa_stats_mode is changed to *auto*, the branch target used in > last time is kept, and the branch target is changed to the branch for numa > counters update once numa counters are *read* by users. > > Therefore, with the help of jump label, the page allocation performance is > hardly affected when numa counters are updated with a call in > zone_statistics(). Meanwhile, the auto mode can give people benefit without > manual tuning. > > Many thanks to Michal Hocko, Dave Hansen and Ying Huang for comments to > help improve the original patch. > > ChangeLog: > V2->V3: > a) Propose a better way to use jump label to eliminate the overhead of > branch selection in zone_statistics(), as inspired by Ying Huang; > b) Add a paragraph in commit log to describe the way for branch target > selection; > c) Use a more descriptive name numa_stats_mode instead of vmstat_mode, > and change the description accordingly, as suggested by Michal Hocko; > d) Make this functionality NUMA-specific via ifdef > > V1->V2: > a) Merge to one patch; > b) Use jump label to eliminate the overhead of branch selection; > c) Add a single-time log message at boot time to help tell users what > happened. > > Reported-by: Jesper Dangaard Brouer> Suggested-by: Dave Hansen > Suggested-by: Ying Huang > Signed-off-by: Kemi Wang > --- > Documentation/sysctl/vm.txt | 24 + > drivers/base/node.c | 4 ++ > include/linux/vmstat.h | 23 > init/main.c | 3 ++ > kernel/sysctl.c | 7 +++ > mm/page_alloc.c | 10 > mm/vmstat.c | 129 > > 7 files changed, 200 insertions(+) > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt > index 9baf66a..e310e69 100644 > --- a/Documentation/sysctl/vm.txt > +++ b/Documentation/sysctl/vm.txt > @@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm: > - swappiness > - user_reserve_kbytes > - vfs_cache_pressure > +- numa_stats_mode > - watermark_scale_factor > - zone_reclaim_mode > > @@ -843,6 +844,29 @@ ten times more freeable objects than there are. > > = > > +numa_stats_mode > + > +This interface allows numa statistics configurable. > + > +When page allocation performance becomes a bottleneck and you can tolerate > +some possible tool breakage and decreased numa counter precision, you can > +do: > + echo [C|c]oarse > /proc/sys/vm/numa_stats_mode > + > +When page allocation performance is not a bottleneck and you want all > +tooling to work, you can do: > + echo [S|s]trict > /proc/sys/vm/numa_stat_mode > + > +We recommend automatic detection of numa statistics by system, because numa > +statistics does not affect system's decision and it is very rarely > +consumed. you can do: > + echo [A|a]uto > /proc/sys/vm/numa_stats_mode > +This is also system default configuration,
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
[+CC linux-api] On 09/28/2017 08:11 AM, Kemi Wang wrote: > This is the second step which introduces a tunable interface that allow > numa stats configurable for optimizing zone_statistics(), as suggested by > Dave Hansen and Ying Huang. > > = > When page allocation performance becomes a bottleneck and you can tolerate > some possible tool breakage and decreased numa counter precision, you can > do: > echo [C|c]oarse > /proc/sys/vm/numa_stats_mode > In this case, numa counter update is ignored. We can see about > *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim > on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu > cycles per single page allocation and reclaim on Jesper's page_bench03 (88 > threads) running on a 2-Socket Broadwell-based server (88 threads, 126G > memory). > > Benchmark link provided by Jesper D Brouer(increase loop times to > 1000): > https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/ > bench > > = > When page allocation performance is not a bottleneck and you want all > tooling to work, you can do: > echo [S|s]trict > /proc/sys/vm/numa_stats_mode > > = > We recommend automatic detection of numa statistics by system, this is also > system default configuration, you can do: > echo [A|a]uto > /proc/sys/vm/numa_stats_mode > In this case, numa counter update is skipped unless it has been read by > users at least once, e.g. cat /proc/zoneinfo. > > Branch target selection with jump label: > a) When numa_stats_mode is changed to *strict*, jump to the branch for numa > counters update. > b) When numa_stats_mode is changed to *coarse*, return back directly. > c) When numa_stats_mode is changed to *auto*, the branch target used in > last time is kept, and the branch target is changed to the branch for numa > counters update once numa counters are *read* by users. > > Therefore, with the help of jump label, the page allocation performance is > hardly affected when numa counters are updated with a call in > zone_statistics(). Meanwhile, the auto mode can give people benefit without > manual tuning. > > Many thanks to Michal Hocko, Dave Hansen and Ying Huang for comments to > help improve the original patch. > > ChangeLog: > V2->V3: > a) Propose a better way to use jump label to eliminate the overhead of > branch selection in zone_statistics(), as inspired by Ying Huang; > b) Add a paragraph in commit log to describe the way for branch target > selection; > c) Use a more descriptive name numa_stats_mode instead of vmstat_mode, > and change the description accordingly, as suggested by Michal Hocko; > d) Make this functionality NUMA-specific via ifdef > > V1->V2: > a) Merge to one patch; > b) Use jump label to eliminate the overhead of branch selection; > c) Add a single-time log message at boot time to help tell users what > happened. > > Reported-by: Jesper Dangaard Brouer > Suggested-by: Dave Hansen > Suggested-by: Ying Huang > Signed-off-by: Kemi Wang > --- > Documentation/sysctl/vm.txt | 24 + > drivers/base/node.c | 4 ++ > include/linux/vmstat.h | 23 > init/main.c | 3 ++ > kernel/sysctl.c | 7 +++ > mm/page_alloc.c | 10 > mm/vmstat.c | 129 > > 7 files changed, 200 insertions(+) > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt > index 9baf66a..e310e69 100644 > --- a/Documentation/sysctl/vm.txt > +++ b/Documentation/sysctl/vm.txt > @@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm: > - swappiness > - user_reserve_kbytes > - vfs_cache_pressure > +- numa_stats_mode > - watermark_scale_factor > - zone_reclaim_mode > > @@ -843,6 +844,29 @@ ten times more freeable objects than there are. > > = > > +numa_stats_mode > + > +This interface allows numa statistics configurable. > + > +When page allocation performance becomes a bottleneck and you can tolerate > +some possible tool breakage and decreased numa counter precision, you can > +do: > + echo [C|c]oarse > /proc/sys/vm/numa_stats_mode > + > +When page allocation performance is not a bottleneck and you want all > +tooling to work, you can do: > + echo [S|s]trict > /proc/sys/vm/numa_stat_mode > + > +We recommend automatic detection of numa statistics by system, because numa > +statistics does not affect system's decision and it is very rarely > +consumed. you can do: > + echo [A|a]uto > /proc/sys/vm/numa_stats_mode > +This is also system default configuration, with this default setting, numa > +counters update is skipped unless the counter is
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On 09/28/2017 11:29 PM, Andrew Morton wrote: > On Thu, 28 Sep 2017 14:11:41 +0800 Kemi Wangwrote: > >> This is the second step which introduces a tunable interface that allow >> numa stats configurable for optimizing zone_statistics(), as suggested by >> Dave Hansen and Ying Huang. > > Looks OK I guess. > > I fiddled with it a lot. Please consider: > > From: Andrew Morton > Subject: mm-sysctl-make-numa-stats-configurable-fix > > - tweak documentation > > - move advisory message from start_kernel() into mm_init() (I'm not sure > we really need this message) Actually, I'm not sure we need any of the current messages, or to have them at higher priority than pr_debug()? They are all triggered by admin action, or unconditionally upon boot. OTOH I think that an useful message that's currently missing would be when the static_key_enable() is triggered in auto mode. Bonus points for including the name of the process and the stat file that was read. However static_key_enable() returns void and not whether it actually flipped the switch, so it's not trivial.
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On 09/28/2017 11:29 PM, Andrew Morton wrote: > On Thu, 28 Sep 2017 14:11:41 +0800 Kemi Wang wrote: > >> This is the second step which introduces a tunable interface that allow >> numa stats configurable for optimizing zone_statistics(), as suggested by >> Dave Hansen and Ying Huang. > > Looks OK I guess. > > I fiddled with it a lot. Please consider: > > From: Andrew Morton > Subject: mm-sysctl-make-numa-stats-configurable-fix > > - tweak documentation > > - move advisory message from start_kernel() into mm_init() (I'm not sure > we really need this message) Actually, I'm not sure we need any of the current messages, or to have them at higher priority than pr_debug()? They are all triggered by admin action, or unconditionally upon boot. OTOH I think that an useful message that's currently missing would be when the static_key_enable() is triggered in auto mode. Bonus points for including the name of the process and the stat file that was read. However static_key_enable() returns void and not whether it actually flipped the switch, so it's not trivial.
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On 09/28/2017 08:11 AM, Kemi Wang wrote: > This is the second step which introduces a tunable interface that allow > numa stats configurable for optimizing zone_statistics(), as suggested by > Dave Hansen and Ying Huang. > > = > When page allocation performance becomes a bottleneck and you can tolerate > some possible tool breakage and decreased numa counter precision, you can > do: > echo [C|c]oarse > /proc/sys/vm/numa_stats_mode > In this case, numa counter update is ignored. We can see about > *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim > on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu > cycles per single page allocation and reclaim on Jesper's page_bench03 (88 > threads) running on a 2-Socket Broadwell-based server (88 threads, 126G > memory). > > Benchmark link provided by Jesper D Brouer(increase loop times to > 1000): > https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/ > bench > > = > When page allocation performance is not a bottleneck and you want all > tooling to work, you can do: > echo [S|s]trict > /proc/sys/vm/numa_stats_mode > > = > We recommend automatic detection of numa statistics by system, this is also > system default configuration, you can do: > echo [A|a]uto > /proc/sys/vm/numa_stats_mode > In this case, numa counter update is skipped unless it has been read by > users at least once, e.g. cat /proc/zoneinfo. > > Branch target selection with jump label: > a) When numa_stats_mode is changed to *strict*, jump to the branch for numa > counters update. > b) When numa_stats_mode is changed to *coarse*, return back directly. > c) When numa_stats_mode is changed to *auto*, the branch target used in > last time is kept, and the branch target is changed to the branch for numa > counters update once numa counters are *read* by users. > > Therefore, with the help of jump label, the page allocation performance is > hardly affected when numa counters are updated with a call in > zone_statistics(). Meanwhile, the auto mode can give people benefit without > manual tuning. > > Many thanks to Michal Hocko, Dave Hansen and Ying Huang for comments to > help improve the original patch. > > ChangeLog: > V2->V3: > a) Propose a better way to use jump label to eliminate the overhead of > branch selection in zone_statistics(), as inspired by Ying Huang; > b) Add a paragraph in commit log to describe the way for branch target > selection; > c) Use a more descriptive name numa_stats_mode instead of vmstat_mode, > and change the description accordingly, as suggested by Michal Hocko; > d) Make this functionality NUMA-specific via ifdef > > V1->V2: > a) Merge to one patch; > b) Use jump label to eliminate the overhead of branch selection; > c) Add a single-time log message at boot time to help tell users what > happened. > > Reported-by: Jesper Dangaard Brouer> Suggested-by: Dave Hansen > Suggested-by: Ying Huang > Signed-off-by: Kemi Wang > --- > Documentation/sysctl/vm.txt | 24 + > drivers/base/node.c | 4 ++ > include/linux/vmstat.h | 23 > init/main.c | 3 ++ > kernel/sysctl.c | 7 +++ > mm/page_alloc.c | 10 > mm/vmstat.c | 129 > > 7 files changed, 200 insertions(+) > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt > index 9baf66a..e310e69 100644 > --- a/Documentation/sysctl/vm.txt > +++ b/Documentation/sysctl/vm.txt > @@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm: > - swappiness > - user_reserve_kbytes > - vfs_cache_pressure > +- numa_stats_mode > - watermark_scale_factor > - zone_reclaim_mode > > @@ -843,6 +844,29 @@ ten times more freeable objects than there are. > > = > > +numa_stats_mode > + > +This interface allows numa statistics configurable. > + > +When page allocation performance becomes a bottleneck and you can tolerate > +some possible tool breakage and decreased numa counter precision, you can > +do: > + echo [C|c]oarse > /proc/sys/vm/numa_stats_mode > + > +When page allocation performance is not a bottleneck and you want all > +tooling to work, you can do: > + echo [S|s]trict > /proc/sys/vm/numa_stat_mode > + > +We recommend automatic detection of numa statistics by system, because numa > +statistics does not affect system's decision and it is very rarely > +consumed. you can do: > + echo [A|a]uto > /proc/sys/vm/numa_stats_mode > +This is also system default configuration, with this default
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On 09/28/2017 08:11 AM, Kemi Wang wrote: > This is the second step which introduces a tunable interface that allow > numa stats configurable for optimizing zone_statistics(), as suggested by > Dave Hansen and Ying Huang. > > = > When page allocation performance becomes a bottleneck and you can tolerate > some possible tool breakage and decreased numa counter precision, you can > do: > echo [C|c]oarse > /proc/sys/vm/numa_stats_mode > In this case, numa counter update is ignored. We can see about > *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim > on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu > cycles per single page allocation and reclaim on Jesper's page_bench03 (88 > threads) running on a 2-Socket Broadwell-based server (88 threads, 126G > memory). > > Benchmark link provided by Jesper D Brouer(increase loop times to > 1000): > https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/ > bench > > = > When page allocation performance is not a bottleneck and you want all > tooling to work, you can do: > echo [S|s]trict > /proc/sys/vm/numa_stats_mode > > = > We recommend automatic detection of numa statistics by system, this is also > system default configuration, you can do: > echo [A|a]uto > /proc/sys/vm/numa_stats_mode > In this case, numa counter update is skipped unless it has been read by > users at least once, e.g. cat /proc/zoneinfo. > > Branch target selection with jump label: > a) When numa_stats_mode is changed to *strict*, jump to the branch for numa > counters update. > b) When numa_stats_mode is changed to *coarse*, return back directly. > c) When numa_stats_mode is changed to *auto*, the branch target used in > last time is kept, and the branch target is changed to the branch for numa > counters update once numa counters are *read* by users. > > Therefore, with the help of jump label, the page allocation performance is > hardly affected when numa counters are updated with a call in > zone_statistics(). Meanwhile, the auto mode can give people benefit without > manual tuning. > > Many thanks to Michal Hocko, Dave Hansen and Ying Huang for comments to > help improve the original patch. > > ChangeLog: > V2->V3: > a) Propose a better way to use jump label to eliminate the overhead of > branch selection in zone_statistics(), as inspired by Ying Huang; > b) Add a paragraph in commit log to describe the way for branch target > selection; > c) Use a more descriptive name numa_stats_mode instead of vmstat_mode, > and change the description accordingly, as suggested by Michal Hocko; > d) Make this functionality NUMA-specific via ifdef > > V1->V2: > a) Merge to one patch; > b) Use jump label to eliminate the overhead of branch selection; > c) Add a single-time log message at boot time to help tell users what > happened. > > Reported-by: Jesper Dangaard Brouer > Suggested-by: Dave Hansen > Suggested-by: Ying Huang > Signed-off-by: Kemi Wang > --- > Documentation/sysctl/vm.txt | 24 + > drivers/base/node.c | 4 ++ > include/linux/vmstat.h | 23 > init/main.c | 3 ++ > kernel/sysctl.c | 7 +++ > mm/page_alloc.c | 10 > mm/vmstat.c | 129 > > 7 files changed, 200 insertions(+) > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt > index 9baf66a..e310e69 100644 > --- a/Documentation/sysctl/vm.txt > +++ b/Documentation/sysctl/vm.txt > @@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm: > - swappiness > - user_reserve_kbytes > - vfs_cache_pressure > +- numa_stats_mode > - watermark_scale_factor > - zone_reclaim_mode > > @@ -843,6 +844,29 @@ ten times more freeable objects than there are. > > = > > +numa_stats_mode > + > +This interface allows numa statistics configurable. > + > +When page allocation performance becomes a bottleneck and you can tolerate > +some possible tool breakage and decreased numa counter precision, you can > +do: > + echo [C|c]oarse > /proc/sys/vm/numa_stats_mode > + > +When page allocation performance is not a bottleneck and you want all > +tooling to work, you can do: > + echo [S|s]trict > /proc/sys/vm/numa_stat_mode > + > +We recommend automatic detection of numa statistics by system, because numa > +statistics does not affect system's decision and it is very rarely > +consumed. you can do: > + echo [A|a]uto > /proc/sys/vm/numa_stats_mode > +This is also system default configuration, with this default setting, numa > +counters update is skipped unless the counter is *read* by users at
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On 2017年09月29日 05:29, Andrew Morton wrote: > On Thu, 28 Sep 2017 14:11:41 +0800 Kemi Wangwrote: > >> This is the second step which introduces a tunable interface that allow >> numa stats configurable for optimizing zone_statistics(), as suggested by >> Dave Hansen and Ying Huang. > > Looks OK I guess. > > I fiddled with it a lot. Please consider: > Thanks for your help to make it more graceful! I will be more careful next time. There may be a typo error in Documentation/sysctl/vm.txt, see comment below. > From: Andrew Morton > Subject: mm-sysctl-make-numa-stats-configurable-fix > > - tweak documentation > > - move advisory message from start_kernel() into mm_init() (I'm not sure > we really need this message) > > - use strcasecmp() in __parse_vm_numa_stats_mode() > > - clean up coding style amd nessages in sysctl_vm_numa_stats_mode_handler() > > Cc: Aaron Lu > Cc: Andi Kleen > Cc: Christopher Lameter > Cc: Dave Hansen > Cc: Jesper Dangaard Brouer > Cc: Johannes Weiner > Cc: Jonathan Corbet > Cc: Kees Cook > Cc: Kemi Wang > Cc: "Luis R . Rodriguez" > Cc: Mel Gorman > Cc: Michal Hocko > Cc: Sebastian Andrzej Siewior > Cc: Tim Chen > Cc: Vlastimil Babka > Cc: Ying Huang > Signed-off-by: Andrew Morton > --- > > Documentation/sysctl/vm.txt | 15 ++--- > init/main.c |6 ++--- > mm/vmstat.c | 39 +++--- > 3 files changed, 29 insertions(+), 31 deletions(-) > > diff -puN > Documentation/sysctl/vm.txt~mm-sysctl-make-numa-stats-configurable-fix > Documentation/sysctl/vm.txt > --- a/Documentation/sysctl/vm.txt~mm-sysctl-make-numa-stats-configurable-fix > +++ a/Documentation/sysctl/vm.txt > @@ -853,7 +853,7 @@ ten times more freeable objects than the > > numa_stats_mode > > -This interface allows numa statistics configurable. > +This interface allows runtime configuration *or* numa statistics. > typo? or->of/for? > When page allocation performance becomes a bottleneck and you can tolerate > some possible tool breakage and decreased numa counter precision, you can > @@ -864,13 +864,14 @@ When page allocation performance is not > tooling to work, you can do: > echo [S|s]trict > /proc/sys/vm/numa_stat_mode > > -We recommend automatic detection of numa statistics by system, because numa > -statistics does not affect system's decision and it is very rarely > -consumed. you can do: > +We recommend automatic detection of numa statistics by system, because > +numa statistics do not affect system decisions and it is very rarely > +consumed. In this case you can do: > echo [A|a]uto > /proc/sys/vm/numa_stats_mode > -This is also system default configuration, with this default setting, numa > -counters update is skipped unless the counter is *read* by users at least > -once. > + > +This is the system default configuration. With this default setting, numa > +counter updates are skipped until the counter is *read* by userspace at > +least once. > > == > > diff -puN drivers/base/node.c~mm-sysctl-make-numa-stats-configurable-fix > drivers/base/node.c > diff -puN include/linux/vmstat.h~mm-sysctl-make-numa-stats-configurable-fix > include/linux/vmstat.h > diff -puN init/main.c~mm-sysctl-make-numa-stats-configurable-fix init/main.c > --- a/init/main.c~mm-sysctl-make-numa-stats-configurable-fix > +++ a/init/main.c > @@ -504,6 +504,9 @@ static void __init mm_init(void) > pgtable_init(); > vmalloc_init(); > ioremap_huge_init(); > +#ifdef CONFIG_NUMA > + pr_info("vmstat: NUMA stat updates are skipped unless they have been > used\n"); > +#endif > } > > asmlinkage __visible void __init start_kernel(void) > @@ -567,9 +570,6 @@ asmlinkage __visible void __init start_k > sort_main_extable(); > trap_init(); > mm_init(); > -#ifdef CONFIG_NUMA > - pr_info("vmstat: NUMA stats is skipped unless it has been consumed\n"); > -#endif > > ftrace_init(); > > diff -puN kernel/sysctl.c~mm-sysctl-make-numa-stats-configurable-fix > kernel/sysctl.c > diff -puN mm/page_alloc.c~mm-sysctl-make-numa-stats-configurable-fix > mm/page_alloc.c > diff -puN mm/vmstat.c~mm-sysctl-make-numa-stats-configurable-fix mm/vmstat.c > --- a/mm/vmstat.c~mm-sysctl-make-numa-stats-configurable-fix > +++ a/mm/vmstat.c > @@ -40,13 +40,11 @@ static DEFINE_MUTEX(vm_numa_stats_mode_l > > static int __parse_vm_numa_stats_mode(char *s) > { > - const char *str = s; > - > - if (strcmp(str, "auto") == 0 || strcmp(str,
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On 2017年09月29日 05:29, Andrew Morton wrote: > On Thu, 28 Sep 2017 14:11:41 +0800 Kemi Wang wrote: > >> This is the second step which introduces a tunable interface that allow >> numa stats configurable for optimizing zone_statistics(), as suggested by >> Dave Hansen and Ying Huang. > > Looks OK I guess. > > I fiddled with it a lot. Please consider: > Thanks for your help to make it more graceful! I will be more careful next time. There may be a typo error in Documentation/sysctl/vm.txt, see comment below. > From: Andrew Morton > Subject: mm-sysctl-make-numa-stats-configurable-fix > > - tweak documentation > > - move advisory message from start_kernel() into mm_init() (I'm not sure > we really need this message) > > - use strcasecmp() in __parse_vm_numa_stats_mode() > > - clean up coding style amd nessages in sysctl_vm_numa_stats_mode_handler() > > Cc: Aaron Lu > Cc: Andi Kleen > Cc: Christopher Lameter > Cc: Dave Hansen > Cc: Jesper Dangaard Brouer > Cc: Johannes Weiner > Cc: Jonathan Corbet > Cc: Kees Cook > Cc: Kemi Wang > Cc: "Luis R . Rodriguez" > Cc: Mel Gorman > Cc: Michal Hocko > Cc: Sebastian Andrzej Siewior > Cc: Tim Chen > Cc: Vlastimil Babka > Cc: Ying Huang > Signed-off-by: Andrew Morton > --- > > Documentation/sysctl/vm.txt | 15 ++--- > init/main.c |6 ++--- > mm/vmstat.c | 39 +++--- > 3 files changed, 29 insertions(+), 31 deletions(-) > > diff -puN > Documentation/sysctl/vm.txt~mm-sysctl-make-numa-stats-configurable-fix > Documentation/sysctl/vm.txt > --- a/Documentation/sysctl/vm.txt~mm-sysctl-make-numa-stats-configurable-fix > +++ a/Documentation/sysctl/vm.txt > @@ -853,7 +853,7 @@ ten times more freeable objects than the > > numa_stats_mode > > -This interface allows numa statistics configurable. > +This interface allows runtime configuration *or* numa statistics. > typo? or->of/for? > When page allocation performance becomes a bottleneck and you can tolerate > some possible tool breakage and decreased numa counter precision, you can > @@ -864,13 +864,14 @@ When page allocation performance is not > tooling to work, you can do: > echo [S|s]trict > /proc/sys/vm/numa_stat_mode > > -We recommend automatic detection of numa statistics by system, because numa > -statistics does not affect system's decision and it is very rarely > -consumed. you can do: > +We recommend automatic detection of numa statistics by system, because > +numa statistics do not affect system decisions and it is very rarely > +consumed. In this case you can do: > echo [A|a]uto > /proc/sys/vm/numa_stats_mode > -This is also system default configuration, with this default setting, numa > -counters update is skipped unless the counter is *read* by users at least > -once. > + > +This is the system default configuration. With this default setting, numa > +counter updates are skipped until the counter is *read* by userspace at > +least once. > > == > > diff -puN drivers/base/node.c~mm-sysctl-make-numa-stats-configurable-fix > drivers/base/node.c > diff -puN include/linux/vmstat.h~mm-sysctl-make-numa-stats-configurable-fix > include/linux/vmstat.h > diff -puN init/main.c~mm-sysctl-make-numa-stats-configurable-fix init/main.c > --- a/init/main.c~mm-sysctl-make-numa-stats-configurable-fix > +++ a/init/main.c > @@ -504,6 +504,9 @@ static void __init mm_init(void) > pgtable_init(); > vmalloc_init(); > ioremap_huge_init(); > +#ifdef CONFIG_NUMA > + pr_info("vmstat: NUMA stat updates are skipped unless they have been > used\n"); > +#endif > } > > asmlinkage __visible void __init start_kernel(void) > @@ -567,9 +570,6 @@ asmlinkage __visible void __init start_k > sort_main_extable(); > trap_init(); > mm_init(); > -#ifdef CONFIG_NUMA > - pr_info("vmstat: NUMA stats is skipped unless it has been consumed\n"); > -#endif > > ftrace_init(); > > diff -puN kernel/sysctl.c~mm-sysctl-make-numa-stats-configurable-fix > kernel/sysctl.c > diff -puN mm/page_alloc.c~mm-sysctl-make-numa-stats-configurable-fix > mm/page_alloc.c > diff -puN mm/vmstat.c~mm-sysctl-make-numa-stats-configurable-fix mm/vmstat.c > --- a/mm/vmstat.c~mm-sysctl-make-numa-stats-configurable-fix > +++ a/mm/vmstat.c > @@ -40,13 +40,11 @@ static DEFINE_MUTEX(vm_numa_stats_mode_l > > static int __parse_vm_numa_stats_mode(char *s) > { > - const char *str = s; > - > - if (strcmp(str, "auto") == 0 || strcmp(str, "Auto") == 0) > + if (strcasecmp(s, "auto")) > vm_numa_stats_mode = VM_NUMA_STAT_AUTO_MODE; > - else if (strcmp(str, "strict") == 0 || strcmp(str, "Strict") == 0) > + else if (strcasecmp(s, "strict") == 0) > vm_numa_stats_mode = VM_NUMA_STAT_STRICT_MODE; > - else if (strcmp(str, "coarse") == 0 || strcmp(str, "Coarse") == 0) > + else if (strcasecmp(s,
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On Thu, 28 Sep 2017 14:11:41 +0800 Kemi Wangwrote: > This is the second step which introduces a tunable interface that allow > numa stats configurable for optimizing zone_statistics(), as suggested by > Dave Hansen and Ying Huang. Looks OK I guess. I fiddled with it a lot. Please consider: From: Andrew Morton Subject: mm-sysctl-make-numa-stats-configurable-fix - tweak documentation - move advisory message from start_kernel() into mm_init() (I'm not sure we really need this message) - use strcasecmp() in __parse_vm_numa_stats_mode() - clean up coding style amd nessages in sysctl_vm_numa_stats_mode_handler() Cc: Aaron Lu Cc: Andi Kleen Cc: Christopher Lameter Cc: Dave Hansen Cc: Jesper Dangaard Brouer Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Kees Cook Cc: Kemi Wang Cc: "Luis R . Rodriguez" Cc: Mel Gorman Cc: Michal Hocko Cc: Sebastian Andrzej Siewior Cc: Tim Chen Cc: Vlastimil Babka Cc: Ying Huang Signed-off-by: Andrew Morton --- Documentation/sysctl/vm.txt | 15 ++--- init/main.c |6 ++--- mm/vmstat.c | 39 +++--- 3 files changed, 29 insertions(+), 31 deletions(-) diff -puN Documentation/sysctl/vm.txt~mm-sysctl-make-numa-stats-configurable-fix Documentation/sysctl/vm.txt --- a/Documentation/sysctl/vm.txt~mm-sysctl-make-numa-stats-configurable-fix +++ a/Documentation/sysctl/vm.txt @@ -853,7 +853,7 @@ ten times more freeable objects than the numa_stats_mode -This interface allows numa statistics configurable. +This interface allows runtime configuration or numa statistics. When page allocation performance becomes a bottleneck and you can tolerate some possible tool breakage and decreased numa counter precision, you can @@ -864,13 +864,14 @@ When page allocation performance is not tooling to work, you can do: echo [S|s]trict > /proc/sys/vm/numa_stat_mode -We recommend automatic detection of numa statistics by system, because numa -statistics does not affect system's decision and it is very rarely -consumed. you can do: +We recommend automatic detection of numa statistics by system, because +numa statistics do not affect system decisions and it is very rarely +consumed. In this case you can do: echo [A|a]uto > /proc/sys/vm/numa_stats_mode -This is also system default configuration, with this default setting, numa -counters update is skipped unless the counter is *read* by users at least -once. + +This is the system default configuration. With this default setting, numa +counter updates are skipped until the counter is *read* by userspace at +least once. == diff -puN drivers/base/node.c~mm-sysctl-make-numa-stats-configurable-fix drivers/base/node.c diff -puN include/linux/vmstat.h~mm-sysctl-make-numa-stats-configurable-fix include/linux/vmstat.h diff -puN init/main.c~mm-sysctl-make-numa-stats-configurable-fix init/main.c --- a/init/main.c~mm-sysctl-make-numa-stats-configurable-fix +++ a/init/main.c @@ -504,6 +504,9 @@ static void __init mm_init(void) pgtable_init(); vmalloc_init(); ioremap_huge_init(); +#ifdef CONFIG_NUMA + pr_info("vmstat: NUMA stat updates are skipped unless they have been used\n"); +#endif } asmlinkage __visible void __init start_kernel(void) @@ -567,9 +570,6 @@ asmlinkage __visible void __init start_k sort_main_extable(); trap_init(); mm_init(); -#ifdef CONFIG_NUMA - pr_info("vmstat: NUMA stats is skipped unless it has been consumed\n"); -#endif ftrace_init(); diff -puN kernel/sysctl.c~mm-sysctl-make-numa-stats-configurable-fix kernel/sysctl.c diff -puN mm/page_alloc.c~mm-sysctl-make-numa-stats-configurable-fix mm/page_alloc.c diff -puN mm/vmstat.c~mm-sysctl-make-numa-stats-configurable-fix mm/vmstat.c --- a/mm/vmstat.c~mm-sysctl-make-numa-stats-configurable-fix +++ a/mm/vmstat.c @@ -40,13 +40,11 @@ static DEFINE_MUTEX(vm_numa_stats_mode_l static int __parse_vm_numa_stats_mode(char *s) { - const char *str = s; - - if (strcmp(str, "auto") == 0 || strcmp(str, "Auto") == 0) + if (strcasecmp(s, "auto")) vm_numa_stats_mode = VM_NUMA_STAT_AUTO_MODE; - else if (strcmp(str, "strict") == 0 || strcmp(str, "Strict") == 0) + else if (strcasecmp(s, "strict") == 0) vm_numa_stats_mode = VM_NUMA_STAT_STRICT_MODE; - else if (strcmp(str, "coarse") == 0 || strcmp(str, "Coarse") == 0) + else if (strcasecmp(s, "coarse"))
Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
On Thu, 28 Sep 2017 14:11:41 +0800 Kemi Wang wrote: > This is the second step which introduces a tunable interface that allow > numa stats configurable for optimizing zone_statistics(), as suggested by > Dave Hansen and Ying Huang. Looks OK I guess. I fiddled with it a lot. Please consider: From: Andrew Morton Subject: mm-sysctl-make-numa-stats-configurable-fix - tweak documentation - move advisory message from start_kernel() into mm_init() (I'm not sure we really need this message) - use strcasecmp() in __parse_vm_numa_stats_mode() - clean up coding style amd nessages in sysctl_vm_numa_stats_mode_handler() Cc: Aaron Lu Cc: Andi Kleen Cc: Christopher Lameter Cc: Dave Hansen Cc: Jesper Dangaard Brouer Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Kees Cook Cc: Kemi Wang Cc: "Luis R . Rodriguez" Cc: Mel Gorman Cc: Michal Hocko Cc: Sebastian Andrzej Siewior Cc: Tim Chen Cc: Vlastimil Babka Cc: Ying Huang Signed-off-by: Andrew Morton --- Documentation/sysctl/vm.txt | 15 ++--- init/main.c |6 ++--- mm/vmstat.c | 39 +++--- 3 files changed, 29 insertions(+), 31 deletions(-) diff -puN Documentation/sysctl/vm.txt~mm-sysctl-make-numa-stats-configurable-fix Documentation/sysctl/vm.txt --- a/Documentation/sysctl/vm.txt~mm-sysctl-make-numa-stats-configurable-fix +++ a/Documentation/sysctl/vm.txt @@ -853,7 +853,7 @@ ten times more freeable objects than the numa_stats_mode -This interface allows numa statistics configurable. +This interface allows runtime configuration or numa statistics. When page allocation performance becomes a bottleneck and you can tolerate some possible tool breakage and decreased numa counter precision, you can @@ -864,13 +864,14 @@ When page allocation performance is not tooling to work, you can do: echo [S|s]trict > /proc/sys/vm/numa_stat_mode -We recommend automatic detection of numa statistics by system, because numa -statistics does not affect system's decision and it is very rarely -consumed. you can do: +We recommend automatic detection of numa statistics by system, because +numa statistics do not affect system decisions and it is very rarely +consumed. In this case you can do: echo [A|a]uto > /proc/sys/vm/numa_stats_mode -This is also system default configuration, with this default setting, numa -counters update is skipped unless the counter is *read* by users at least -once. + +This is the system default configuration. With this default setting, numa +counter updates are skipped until the counter is *read* by userspace at +least once. == diff -puN drivers/base/node.c~mm-sysctl-make-numa-stats-configurable-fix drivers/base/node.c diff -puN include/linux/vmstat.h~mm-sysctl-make-numa-stats-configurable-fix include/linux/vmstat.h diff -puN init/main.c~mm-sysctl-make-numa-stats-configurable-fix init/main.c --- a/init/main.c~mm-sysctl-make-numa-stats-configurable-fix +++ a/init/main.c @@ -504,6 +504,9 @@ static void __init mm_init(void) pgtable_init(); vmalloc_init(); ioremap_huge_init(); +#ifdef CONFIG_NUMA + pr_info("vmstat: NUMA stat updates are skipped unless they have been used\n"); +#endif } asmlinkage __visible void __init start_kernel(void) @@ -567,9 +570,6 @@ asmlinkage __visible void __init start_k sort_main_extable(); trap_init(); mm_init(); -#ifdef CONFIG_NUMA - pr_info("vmstat: NUMA stats is skipped unless it has been consumed\n"); -#endif ftrace_init(); diff -puN kernel/sysctl.c~mm-sysctl-make-numa-stats-configurable-fix kernel/sysctl.c diff -puN mm/page_alloc.c~mm-sysctl-make-numa-stats-configurable-fix mm/page_alloc.c diff -puN mm/vmstat.c~mm-sysctl-make-numa-stats-configurable-fix mm/vmstat.c --- a/mm/vmstat.c~mm-sysctl-make-numa-stats-configurable-fix +++ a/mm/vmstat.c @@ -40,13 +40,11 @@ static DEFINE_MUTEX(vm_numa_stats_mode_l static int __parse_vm_numa_stats_mode(char *s) { - const char *str = s; - - if (strcmp(str, "auto") == 0 || strcmp(str, "Auto") == 0) + if (strcasecmp(s, "auto")) vm_numa_stats_mode = VM_NUMA_STAT_AUTO_MODE; - else if (strcmp(str, "strict") == 0 || strcmp(str, "Strict") == 0) + else if (strcasecmp(s, "strict") == 0) vm_numa_stats_mode = VM_NUMA_STAT_STRICT_MODE; - else if (strcmp(str, "coarse") == 0 || strcmp(str, "Coarse") == 0) + else if (strcasecmp(s, "coarse")) vm_numa_stats_mode = VM_NUMA_STAT_COARSE_MODE; else { pr_warn("Ignoring invalid vm_numa_stats_mode value: %s\n", s); @@ -86,30 +84,29 @@ int sysctl_vm_numa_stats_mode_handler(st /* no change */ mutex_unlock(_numa_stats_mode_lock); return 0; - } else if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE) +