Re: [RFC PATCH] mm, page_alloc: double zone's batchsize

2018-07-11 Thread Lu, Aaron
On Wed, 2018-07-11 at 14:35 -0700, Andrew Morton wrote:
> On Wed, 11 Jul 2018 13:58:55 +0800 Aaron Lu  wrote:
> 
> > [550 lines of changelog]
> 
> OK, I'm convinced ;)  That was a lot of work - thanks for being exhaustive.

Thanks Andrew.
I think the credit goes to Dave Hansen since he has been very careful
about this change and asked me to do all those 2nd-phase tests to
make sure we didn't get any surprises after doubling the batch size :-)

I think the LKP robot will run even more tests to capture possible
regressions, if any.

Re: [LKP] [lkp-robot] [mm, memcontrol] 309fe96bfc: vm-scalability.throughput +23.0% improvement

2018-05-28 Thread Lu, Aaron
On Mon, 2018-05-28 at 14:03 +0200, Michal Hocko wrote:
> On Mon 28-05-18 19:40:19, kernel test robot wrote:
> > 
> > Greeting,
> > 
> > FYI, we noticed a +23.0% improvement of vm-scalability.throughput due to 
> > commit:
> > 
> > 
> > commit: 309fe96bfc0ae387f53612927a8f0dc3eb056efd ("mm, memcontrol: 
> > implement memory.swap.events")
> > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> 
> This doesn't make any sense to me. The patch merely adds an accounting.
> It doesn't optimize anything. So I strongly suspect the result is just
> misleading or the test (environment) misconfigured. Not the first time
> I am seeing something like that I am afraid.
> 

Most likely the same situation as:
"
FYI, we noticed a -27.2% regression of will-it-scale.per_process_ops
due to commit:


commit: e27be240df53f1a20c659168e722b5d9f16cc7f4 ("mm: memcg: make sure
memory.events is uptodate when waking pollers")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
"

There, the performance change turned out to be due to a layout change of
'struct mem_cgroup':
http://lkml.kernel.org/r/20180528085201.ga2...@intel.com
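
To illustrate what a layout change like that can do: adding a field to a large
structure can change which hot members share a 64-byte cache line, and that
alone can move a cache-bound benchmark by double digits. A minimal,
self-contained sketch of the effect (the struct and field names below are made
up for illustration; this is not the real 'struct mem_cgroup'):

#include <stdalign.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical layout: two counters hammered by different CPUs.  Whether
 * they land in the same 64-byte cache line depends on everything declared
 * before them, so inserting an unrelated field can separate (or co-locate)
 * them and change how much the cache line bounces between CPUs. */
struct stats_before {
    unsigned long hot_a;        /* updated by one set of CPUs */
    char other[48];             /* unrelated fields in between */
    unsigned long hot_b;        /* same cache line as hot_a -> false sharing */
};

struct stats_after {
    unsigned long hot_a;
    char other[48];
    unsigned long new_counter;  /* the inserted field... */
    unsigned long hot_b;        /* ...pushes hot_b into the next cache line */
};

/* The kernel's usual cure is ____cacheline_aligned_in_smp; in plain C11: */
struct stats_pinned {
    alignas(64) unsigned long hot_a;
    alignas(64) unsigned long hot_b;
};

int main(void)
{
    printf("before: hot_a in line %zu, hot_b in line %zu\n",
           offsetof(struct stats_before, hot_a) / 64,
           offsetof(struct stats_before, hot_b) / 64);
    printf("after:  hot_a in line %zu, hot_b in line %zu\n",
           offsetof(struct stats_after, hot_a) / 64,
           offsetof(struct stats_after, hot_b) / 64);
    return 0;
}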

Re: [LKP] [lkp-robot] [fs/locks] 52306e882f: stress-ng.lockofd.ops_per_sec -11% regression

2017-12-05 Thread Lu, Aaron
On Tue, 2017-12-05 at 06:01 -0500, Jeff Layton wrote:
> On Tue, 2017-12-05 at 13:57 +0800, Aaron Lu wrote:
> > On Wed, Nov 08, 2017 at 03:22:33PM +0800, Aaron Lu wrote:
> > > On Thu, Sep 28, 2017 at 04:02:23PM +0800, kernel test robot wrote:
> > > > 
> > > > Greeting,
> > > > 
> > > > FYI, we noticed a -11% regression of stress-ng.lockofd.ops_per_sec due 
> > > > to commit:
> > > > 
> > > > 
> > > > commit: 52306e882f77d3fd73f91435c41373d634acc5d2 ("fs/locks: Use 
> > > > allocation rather than the stack in fcntl_getlk()")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > 
> > > It's been a while; I wonder what you think of this regression?
> > > 
> > > The test stresses byte-range locks AFAICS, and since the commit uses
> > > dynamic allocation instead of the stack for the 'struct file_lock', it seems
> > > natural that performance regressed for this test.
> > > 
> > > Now the question is, do we care about the performance regression here?
> > 
> > I'd appreciate it if you could share your opinion on this, thanks.
> > 
> > Regards,
> > Aaron
> >  
> 
> Sorry I missed your earlier mail about this. My feeling is to not worry

Never mind :)

> about it. struct file_lock is rather large, so putting it on the stack
> was always a bit dangerous, and F_GETLK is a rather uncommon operation
> anyway.
> 
> That said, if there are real-world workloads that have regressed because
> of this patch, I'm definitely open to backing it out.
> 
> Does anyone else have opinions on the matter?

Your comments make sense to me, thanks for the reply.
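
For anyone reaching this thread from the report, the pattern the commit swapped
is roughly the following; a minimal userspace sketch, not the actual fs/locks.c
code, with plain calloc()/free() standing in for the kernel allocator. The
extra allocator round trip on every F_GETLK/F_OFD_GETLK call is what the
byte-range-lock stress test is measuring:

#include <stdlib.h>
#include <string.h>

/* Stand-in for the kernel's struct file_lock, which is large enough that
 * keeping it on the stack was considered risky. */
struct file_lock_like { char payload[240]; };

/* Old pattern (before 52306e882f): the lock lives on the stack. */
static int getlk_on_stack(void)
{
    struct file_lock_like fl;            /* ~240 bytes of stack */
    memset(&fl, 0, sizeof(fl));
    /* ... fill in the request, look up any conflicting lock ... */
    return 0;
}

/* New pattern: heap-allocate it, at the cost of an allocator round trip on
 * every call -- the likely source of the -11% in the stress test. */
static int getlk_allocated(void)
{
    struct file_lock_like *fl = calloc(1, sizeof(*fl));  /* kernel: kzalloc() */
    if (!fl)
        return -1;                       /* kernel: -ENOMEM */
    /* ... same work as above ... */
    free(fl);                            /* kernel: kfree() */
    return 0;
}

int main(void) { return getlk_on_stack() | getlk_allocated(); }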

Re: [PATCH] mm/page_alloc: make sure __rmqueue() etc. always inline

2017-10-17 Thread Lu, Aaron
On Tue, 2017-10-17 at 13:32 +0200, Vlastimil Babka wrote:
> On 10/13/2017 08:31 AM, Aaron Lu wrote:
> > __rmqueue(), __rmqueue_fallback(), __rmqueue_smallest() and
> > __rmqueue_cma_fallback() are all in the page allocator's hot path and
> > are better finished as soon as possible. One way to make them faster
> > is by making them inline. But as Andrew Morton and Andi Kleen pointed
> > out:
> > https://lkml.org/lkml/2017/10/10/1252
> > https://lkml.org/lkml/2017/10/10/1279
> > To make sure they are inlined, we should use __always_inline for them.
> > 
> > With the will-it-scale/page_fault1/process benchmark, when using nr_cpu
> > processes to stress buddy, the results for will-it-scale.processes with
> > and without the patch are:
> > 
> > On a 2-sockets Intel-Skylake machine:
> > 
> >  compiler        base        head
> > gcc-4.4.7     6496131     6911823   +6.4%
> > gcc-4.9.4     7225110     7731072   +7.0%
> > gcc-5.4.1     7054224     7688146   +9.0%
> > gcc-6.2.0     7059794     7651675   +8.4%
> > 
> > On a 4-sockets Intel-Skylake machine:
> > 
> >  compiler        base        head
> > gcc-4.4.7    13162890    13508193   +2.6%
> > gcc-4.9.4    14997463    15484353   +3.2%
> > gcc-5.4.1    14708711    15449805   +5.0%
> > gcc-6.2.0    14574099    15349204   +5.3%
> > 
> > The above 4 compilers are used because I've done the tests through Intel's
> > Linux Kernel Performance (LKP) infrastructure and they are the compilers
> > available there.
> > 
> > The benefit is smaller on the 4-socket machine because the lock contention
> > there (perf-profile/native_queued_spin_lock_slowpath=81%) is less severe
> > than on the 2-socket machine (85%).
> > 
> > What the benchmark does is: it forks nr_cpu processes and then each
> > process does the following:
> > 1 mmap() 128M anonymous space;
> > 2 writes to each page there to trigger actual page allocation;
> > 3 munmap() it.
> > in a loop.
> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
> 
> Are transparent hugepages enabled? If yes, __rmqueue() is called from
> rmqueue(), and there's only one page fault (and __rmqueue()) per 512
> "writes to each page". If not, __rmqueue() is called from rmqueue_bulk()
> in bursts once pcplists are depleted. I guess it's the latter, otherwise
> I wouldn't expect a function call to have such visible overhead.

THP is disabled. I should have mentioned this in the changelog, sorry
about that.
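
For reference, the per-process loop described in the quoted changelog boils
down to roughly this; a simplified sketch, the real test is page_fault1.c in
the will-it-scale repository linked above:

#include <sys/mman.h>
#include <unistd.h>

#define MEMSIZE (128UL * 1024 * 1024)    /* 128M of anonymous memory */

/* Each of the nr_cpu forked processes runs this loop.  Every first write to
 * a page faults, so with THP disabled the kernel does one order-0 page
 * allocation per 4K page -- which is why the pcp/buddy path is so hot. */
static void stress_page_alloc(long pagesize)
{
    for (;;) {
        char *p = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return;
        for (unsigned long off = 0; off < MEMSIZE; off += pagesize)
            p[off] = 1;                  /* touch each page -> fault -> alloc */
        munmap(p, MEMSIZE);              /* give all the pages back */
    }
}

int main(void)
{
    stress_page_alloc(sysconf(_SC_PAGESIZE));
    return 0;
}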

> 
> I guess what would help much more would be a bulk __rmqueue_smallest()
> to grab multiple pages from the freelists. But can't argue with your

Do I understand you correctly that you are suggesting a bulk
__rmqueue_smallest(), say __rmqueue_smallest_bulk()? With that, instead
of looping pcp->batch times in rmqueue_bulk(), a single call to
__rmqueue_smallest_bulk() would be enough, and __rmqueue_smallest_bulk()
itself would loop pcp->batch times to get those pages?

Then it feels like __rmqueue_smallest_bulk() has become rmqueue_bulk(),
or am I missing something?
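
To make the question concrete, a toy sketch of the two shapes being discussed,
with a trivial linked list standing in for the zone's free lists; the names
grab_one()/grab_bulk() are made up here, this is not mm/page_alloc.c:

#include <stdio.h>

struct page_node { struct page_node *next; };   /* toy stand-in for a free page */

/* What __rmqueue_smallest() does today, conceptually: detach one page.
 * rmqueue_bulk() calls this pcp->batch times under zone->lock. */
static struct page_node *grab_one(struct page_node **freelist)
{
    struct page_node *p = *freelist;
    if (p)
        *freelist = p->next;
    return p;
}

/* The bulk variant: detach up to 'batch' pages in one call.  If this is all
 * it does, the loop has merely moved out of rmqueue_bulk() -- but a real
 * bulk helper could also amortize per-call work (fallback checks, list
 * bookkeeping) across the whole batch instead of redoing it per page. */
static int grab_bulk(struct page_node **freelist, struct page_node **out, int batch)
{
    int n = 0;
    while (n < batch && *freelist)
        out[n++] = grab_one(freelist);
    return n;
}

int main(void)
{
    struct page_node nodes[64], *freelist = NULL, *out[31];
    for (int i = 0; i < 64; i++) {       /* build a small free list */
        nodes[i].next = freelist;
        freelist = &nodes[i];
    }
    printf("grabbed %d pages in one bulk call\n", grab_bulk(&freelist, out, 31));
    return 0;
}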

> numbers against this patch.
> 
> > Binary size wise, I have locally built them with different compilers:
> > 
> > [aaron@aaronlu obj]$ size */*/mm/page_alloc.o
> >    text    data   bss    dec    hex filename
> >   37409    9904  8524  55837   da1d gcc-4.9.4/base/mm/page_alloc.o
> >   38273    9904  8524  56701   dd7d gcc-4.9.4/head/mm/page_alloc.o
> >   37465    9840  8428  55733   d9b5 gcc-5.5.0/base/mm/page_alloc.o
> >   38169    9840  8428  56437   dc75 gcc-5.5.0/head/mm/page_alloc.o
> >   37573    9840  8428  55841   da21 gcc-6.4.0/base/mm/page_alloc.o
> >   38261    9840  8428  56529   dcd1 gcc-6.4.0/head/mm/page_alloc.o
> >   36863    9840  8428  55131   d75b gcc-7.2.0/base/mm/page_alloc.o
> >   37711    9840  8428  55979   daab gcc-7.2.0/head/mm/page_alloc.o
> > 
> > Text size increased about 800 bytes for mm/page_alloc.o.
> 
> BTW, do you know about ./scripts/bloat-o-meter? :)

NO!!! Thanks for bringing this up :)

> With gcc 7.2.1:
> > ./scripts/bloat-o-meter base.o mm/page_alloc.o
> 
> add/remove: 1/2 grow/shrink: 2/0 up/down: 2493/-1649 (844)

Nice, it clearly shows the 844 bytes of bloat.

> function                      old     new   delta
> get_page_from_freelist       2898    4937   +2039
> steal_suitable_fallback         -     365    +365
> find_suitable_fallback         31     120     +89
> find_suitable_fallback.part   115       -    -115
> __rmqueue                    1534       -   -1534
> 
> 
> > [aaron@aaronlu obj]$ size */*/vmlinux
> >     text      data      bss       dec      hex filename
> > 10342757   5903208 17723392 33969357  20654cd gcc-4.9.4/base/vmlinux
> > 10342757   5903208 17723392 33969357  20654cd gcc-4.9.4/head/vmlinux
> > 10332448   5836608 17715200 33884256  2050860 gcc-5.5.0/base/vmlinux
> > 10332448   5836608 17715200 33884256  2050860 gcc-5.5.0/head/vmlinux
> > 10094546   5836696 17715200 

Re: [PATCH] acpi-video: Fix brightness keys for Thinkpad X240

2015-07-22 Thread Lu, Aaron
On Wed, 2015-07-22 at 07:52 +0200, Mathieu OTHACEHE wrote:
> The Thinkpad X240 laptop has a working acpi_video backlight control, but
> when using the default native backlight control, the brightness keys do
> not work.

Is it because the events are not sent to user space?

Can you please file a bug in https://bugzilla.kernel.org/ under the
ACPI/Power-Video category and attach its acpidump/dmesg there? Thanks.

Regards,
Aaron

> 
> This patch forces acpi_video use for this laptop by adding an exception
> in video_detect_dmi_table.
> 
> Signed-off-by: Mathieu OTHACEHE 
> ---
>  drivers/acpi/video_detect.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/acpi/video_detect.c 
> b/drivers/acpi/video_detect.c
> index 815f75e..c4bc5f2 100644
> --- a/drivers/acpi/video_detect.c
> +++ b/drivers/acpi/video_detect.c
> @@ -173,6 +173,14 @@ static const struct dmi_system_id 
> video_detect_dmi_table[] = {
>   DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad X201s"),
>   },
>   },
> + {
> +  .callback = video_detect_force_video,
> +  .ident = "ThinkPad X240",
> +  .matches = {
> + DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
> + DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad X240"),
> + },
> + },
>  
>   /* The native backlight controls do not work on some older 
> machines */
>   {
