Re: [PATCH -mm -V3 00/21] mm, THP, swap: Swapout/swapin THP in one piece

2018-06-05 Thread Daniel Jordan
On Tue, Jun 05, 2018 at 12:30:13PM +0800, Huang, Ying wrote:
> Daniel Jordan  writes:
> 
> > On Wed, May 23, 2018 at 04:26:04PM +0800, Huang, Ying wrote:
> >> And for all, Any comment is welcome!
> >> 
> >> This patchset is based on the 2018-05-18 head of mmotm/master.
> >
> > Trying to review this and it doesn't apply to mmotm-2018-05-18-16-44.  git
> > fails on patch 10:
> >
> > Applying: mm, THP, swap: Support to count THP swapin and its fallback
> > error: Documentation/vm/transhuge.rst: does not exist in index
> > Patch failed at 0010 mm, THP, swap: Support to count THP swapin and its fallback
> >
> > Sure enough, this tag has Documentation/vm/transhuge.txt but not the .rst
> > version.  Was this the tag you meant?  If so did you pull in some of Mike
> > Rapoport's doc changes on top?
> 
> I use the mmotm tree at
> 
> git://git.cmpxchg.org/linux-mmotm.git
> 
> Maybe you are using the other one?

Yes I was, and I didn't know about this other tree, thanks!  Working my way
through your changes now.

> 
> >>             base                   optimized
> >> ---------------- ---------------------------
> >>          %stddev    %change          %stddev
> >>              \          |                \
> >>    1417897 ±  2%    +992.8%   15494673        vm-scalability.throughput
> >>    1020489 ±  4%   +1091.2%   12156349        vmstat.swap.si
> >>    1255093 ±  3%    +940.3%   13056114        vmstat.swap.so
> >>    1259769 ±  7%   +1818.3%   24166779        meminfo.AnonHugePages
> >>   28021761           -10.7%   25018848 ±  2%  meminfo.AnonPages
> >>   64080064 ±  4%     -95.6%    2787565 ± 33%  interrupts.CAL:Function_call_interrupts
> >>      13.91 ±  5%      -13.8       0.10 ± 27%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
> >> 
> > ...snip...
> >> test, while in the optimized kernel that figure is 96.6%.  TLB
> >> flushing IPIs (represented as interrupts.CAL:Function_call_interrupts)
> >> were reduced by 95.6%, while cycles spent in spinlocks dropped from
> >> 13.9% to 0.1%.  These are performance benefits of THP swapout/swapin too.
> >
> > Which spinlocks are we spending less time on?
> 
> "perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.mem_cgroup_commit_charge.do_swap_page.__handle_mm_fault":
>  4.39,
> "perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages":
>  1.53,
> "perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.get_page_from_freelist.__alloc_pages_slowpath.__alloc_pages_nodemask":
>  1.34,
> "perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.swapcache_free_entries.free_swap_slot.do_swap_page":
>  1.02,
> "perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.shrink_inactive_list.shrink_node_memcg.shrink_node":
>  0.61,
> "perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.shrink_active_list.shrink_node_memcg.shrink_node":
>  0.54,

Nice, seems like lru_lock followed by zone->lock are the main improvements.
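
To make that reading concrete, the call chains above can be grouped by the
lock each path presumably contends on.  This is only a sketch: the
percentages are copied from the list above, but the function-to-lock mapping
(LOCK_OF) is an assumption based on the mm code of that era and is not stated
anywhere in the thread.

```python
from collections import defaultdict

# Cycles (%) spent in native_queued_spin_lock_slowpath, keyed by the
# function that takes the lock.  Numbers copied from the list above.
SLOWPATH_CYCLES = {
    "mem_cgroup_commit_charge": 4.39,
    "free_pcppages_bulk": 1.53,
    "get_page_from_freelist": 1.34,
    "swapcache_free_entries": 1.02,
    "shrink_inactive_list": 0.61,
    "shrink_active_list": 0.54,
}

# ASSUMPTION: which lock each caller contends on (not given in the thread).
LOCK_OF = {
    "mem_cgroup_commit_charge": "lru_lock",
    "shrink_inactive_list": "lru_lock",
    "shrink_active_list": "lru_lock",
    "free_pcppages_bulk": "zone->lock",
    "get_page_from_freelist": "zone->lock",
    "swapcache_free_entries": "swap_info lock",
}


def cycles_by_lock(cycles, lock_of):
    """Sum the slowpath cycle percentages per lock."""
    totals = defaultdict(float)
    for func, pct in cycles.items():
        totals[lock_of[func]] += pct
    return {lock: round(total, 2) for lock, total in totals.items()}


totals = cycles_by_lock(SLOWPATH_CYCLES, LOCK_OF)
# Print the locks from most to least contended.
for lock, pct in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{lock:16s} {pct:5.2f}")
```

Under that assumed mapping, lru_lock accounts for the largest share and
zone->lock for the second largest, which is consistent with the summary above.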


Re: [PATCH -mm -V3 00/21] mm, THP, swap: Swapout/swapin THP in one piece

2018-06-04 Thread Huang, Ying
Daniel Jordan  writes:

> On Wed, May 23, 2018 at 04:26:04PM +0800, Huang, Ying wrote:
>> And for all, Any comment is welcome!
>> 
>> This patchset is based on the 2018-05-18 head of mmotm/master.
>
> Trying to review this and it doesn't apply to mmotm-2018-05-18-16-44.  git
> fails on patch 10:
>
> Applying: mm, THP, swap: Support to count THP swapin and its fallback
> error: Documentation/vm/transhuge.rst: does not exist in index
> Patch failed at 0010 mm, THP, swap: Support to count THP swapin and its fallback
>
> Sure enough, this tag has Documentation/vm/transhuge.txt but not the .rst
> version.  Was this the tag you meant?  If so did you pull in some of Mike
> Rapoport's doc changes on top?

I use the mmotm tree at

git://git.cmpxchg.org/linux-mmotm.git

Maybe you are using the other one?

>>             base                   optimized
>> ---------------- ---------------------------
>>          %stddev    %change          %stddev
>>              \          |                \
>>    1417897 ±  2%    +992.8%   15494673        vm-scalability.throughput
>>    1020489 ±  4%   +1091.2%   12156349        vmstat.swap.si
>>    1255093 ±  3%    +940.3%   13056114        vmstat.swap.so
>>    1259769 ±  7%   +1818.3%   24166779        meminfo.AnonHugePages
>>   28021761           -10.7%   25018848 ±  2%  meminfo.AnonPages
>>   64080064 ±  4%     -95.6%    2787565 ± 33%  interrupts.CAL:Function_call_interrupts
>>      13.91 ±  5%      -13.8       0.10 ± 27%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
>> 
> ...snip...
>> test, while in the optimized kernel that figure is 96.6%.  TLB
>> flushing IPIs (represented as interrupts.CAL:Function_call_interrupts)
>> were reduced by 95.6%, while cycles spent in spinlocks dropped from
>> 13.9% to 0.1%.  These are performance benefits of THP swapout/swapin too.
>
> Which spinlocks are we spending less time on?

"perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.mem_cgroup_commit_charge.do_swap_page.__handle_mm_fault":
 4.39,
"perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages":
 1.53,
"perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.get_page_from_freelist.__alloc_pages_slowpath.__alloc_pages_nodemask":
 1.34,
"perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.swapcache_free_entries.free_swap_slot.do_swap_page":
 1.02,
"perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.shrink_inactive_list.shrink_node_memcg.shrink_node":
 0.61,
"perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.shrink_active_list.shrink_node_memcg.shrink_node":
 0.54,

Best Regards,
Huang, Ying


Re: [PATCH -mm -V3 00/21] mm, THP, swap: Swapout/swapin THP in one piece

2018-06-04 Thread Daniel Jordan
On Wed, May 23, 2018 at 04:26:04PM +0800, Huang, Ying wrote:
> And for all, Any comment is welcome!
> 
> This patchset is based on the 2018-05-18 head of mmotm/master.

Trying to review this and it doesn't apply to mmotm-2018-05-18-16-44.  git
fails on patch 10:

Applying: mm, THP, swap: Support to count THP swapin and its fallback
error: Documentation/vm/transhuge.rst: does not exist in index
Patch failed at 0010 mm, THP, swap: Support to count THP swapin and its fallback

Sure enough, this tag has Documentation/vm/transhuge.txt but not the .rst
version.  Was this the tag you meant?  If so did you pull in some of Mike
Rapoport's doc changes on top?

>             base                   optimized
> ---------------- ---------------------------
>          %stddev    %change          %stddev
>              \          |                \
>    1417897 ±  2%    +992.8%   15494673        vm-scalability.throughput
>    1020489 ±  4%   +1091.2%   12156349        vmstat.swap.si
>    1255093 ±  3%    +940.3%   13056114        vmstat.swap.so
>    1259769 ±  7%   +1818.3%   24166779        meminfo.AnonHugePages
>   28021761           -10.7%   25018848 ±  2%  meminfo.AnonPages
>   64080064 ±  4%     -95.6%    2787565 ± 33%  interrupts.CAL:Function_call_interrupts
>      13.91 ±  5%      -13.8       0.10 ± 27%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
> 
...snip...
> test, while in the optimized kernel that figure is 96.6%.  TLB
> flushing IPIs (represented as interrupts.CAL:Function_call_interrupts)
> were reduced by 95.6%, while cycles spent in spinlocks dropped from
> 13.9% to 0.1%.  These are performance benefits of THP swapout/swapin too.

Which spinlocks are we spending less time on?


Re: [PATCH -mm -V3 00/21] mm, THP, swap: Swapout/swapin THP in one piece

2018-06-01 Thread Huang, Ying
Naoya Horiguchi  writes:

> On Wed, May 23, 2018 at 04:26:04PM +0800, Huang, Ying wrote:
>> From: Huang Ying 
>> 
>> Hi, Andrew, could you help me to check whether the overall design is
>> reasonable?
>> 
>> Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
>> swap part of the patchset?  Especially [02/21], [03/21], [04/21],
>> [05/21], [06/21], [07/21], [08/21], [09/21], [10/21], [11/21],
>> [12/21], [20/21].
>> 
>> Hi, Andrea and Kirill, could you help me to review the THP part of the
>> patchset?  Especially [01/21], [07/21], [09/21], [11/21], [13/21],
>> [15/21], [16/21], [17/21], [18/21], [19/21], [20/21], [21/21].
>> 
>> Hi, Johannes and Michal, could you help me to review the cgroup part
>> of the patchset?  Especially [14/21].
>> 
>> And for all, Any comment is welcome!
>
> Hi Ying Huang,
> I've read through this series and find no issue.

Thanks a lot for your review!

> It seems that THP swapout never happens if the swap devices are backed
> by rotational storage.  I guess that's because this feature depends on
> the swap cluster searching algorithm, which only supports non-rotational
> storage.
>
> I think that this limitation is OK because non-rotational storage is
> better for swap devices (most future users will use it).  But I think
> it's better to document the limitation somewhere, because the swap
> cluster is an in-kernel thing and we can't assume that end users know
> about it.

Yes.  I will try to document it somewhere.
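
Until that is documented, a user can check whether a given disk is reported
as rotational by reading its queue/rotational attribute in sysfs.  The sketch
below is illustrative only: the helper names are hypothetical, `disk` must be
a whole-disk name (for example "sda" or "nvme0n1", not a partition), and the
rule "non-rotational means THP swapout is possible" is taken from the
discussion above rather than from any documented interface.

```python
from pathlib import Path


def is_rotational(disk, sysfs_root="/sys/block"):
    """Return True if the kernel reports `disk` as rotational storage.

    Reads <sysfs_root>/<disk>/queue/rotational, which contains "1" for
    rotational devices and "0" for non-rotational ones.
    """
    flag = Path(sysfs_root) / disk / "queue" / "rotational"
    return flag.read_text().strip() == "1"


def thp_swap_candidate(disk, sysfs_root="/sys/block"):
    """Hypothetical helper: True if `disk` could be used for THP swapout.

    ASSUMPTION (from the discussion above): THP swapout depends on swap
    clusters, which are only used for non-rotational devices.
    """
    return not is_rotational(disk, sysfs_root)
```

For example, `thp_swap_candidate("nvme0n1")` would be expected to return True
on a typical NVMe SSD, while a spinning disk would return False.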

Best Regards,
Huang, Ying

> Thanks,
> Naoya Horiguchi
>
>> 
>> This patchset is based on the 2018-05-18 head of mmotm/master.
>> 
>> This is the final step of THP (Transparent Huge Page) swap
>> optimization.  After the first and second steps, splitting the huge
>> page is delayed from almost the first step of swapout to after
>> swapout has finished.  In this step, we avoid splitting the THP for
>> swapout and swap the THP out and in as one piece.
>> 
>> We tested the patchset with the vm-scalability benchmark swap-w-seq
>> test case, with 16 processes.  The test case forks 16 processes.  Each
>> process allocates a large anonymous memory range and writes it from
>> beginning to end for 8 rounds.  The first round will swapout, while
>> the remaining rounds will swapin and swapout.  The test is done on a
>> Xeon E5 v3 system; the swap device used is a RAM-simulated PMEM
>> (persistent memory) device.  The test results are as follows:
>> 
>>             base                   optimized
>> ---------------- ---------------------------
>>          %stddev    %change          %stddev
>>              \          |                \
>>    1417897 ±  2%    +992.8%   15494673        vm-scalability.throughput
>>    1020489 ±  4%   +1091.2%   12156349        vmstat.swap.si
>>    1255093 ±  3%    +940.3%   13056114        vmstat.swap.so
>>    1259769 ±  7%   +1818.3%   24166779        meminfo.AnonHugePages
>>   28021761           -10.7%   25018848 ±  2%  meminfo.AnonPages
>>   64080064 ±  4%     -95.6%    2787565 ± 33%  interrupts.CAL:Function_call_interrupts
>>      13.91 ±  5%      -13.8       0.10 ± 27%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
>> 
>> The benchmark score (bytes written per second) improved by 992.8%.
>> The swapout/swapin throughput improved by 1008% (from about 2.17GB/s
>> to 24.04GB/s).  The performance difference is huge.  In the base
>> kernel, the THP is swapped out and split during the first round of
>> writing, so the remaining rounds only swap normal pages in and out.
>> In the optimized kernel, the THP is kept after the first swapout, so
>> THP swapin and swapout are used in the remaining rounds.  This shows
>> the key benefit of swapping a THP out and in as one piece: the THP is
>> kept instead of being split.  The meminfo data confirms this: in the
>> base kernel only 4.5% of anonymous pages are THP during the test,
>> while in the optimized kernel that figure is 96.6%.  TLB flushing IPIs
>> (represented as interrupts.CAL:Function_call_interrupts) were reduced
>> by 95.6%, while cycles spent in spinlocks dropped from 13.9% to 0.1%.
>> These are performance benefits of THP swapout/swapin too.
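
The derived figures quoted above (2.17GB/s, 24.04GB/s, 1008%, 4.5%, 96.6%)
can be cross-checked against the table.  This is a rough sketch, assuming
that vmstat.swap.si and vmstat.swap.so are reported in KiB/s and that "GB"
here means 2^30 bytes; neither unit is stated in the thread.

```python
KIB_PER_GIB = 1024 * 1024  # ASSUMPTION: si/so are in KiB/s, GB means GiB

# Values copied from the table above as (base, optimized).
swap_si = (1020489, 12156349)      # vmstat.swap.si
swap_so = (1255093, 13056114)      # vmstat.swap.so
anon_huge = (1259769, 24166779)    # meminfo.AnonHugePages
anon_pages = (28021761, 25018848)  # meminfo.AnonPages


def pct_change(base, new):
    """Relative change from base to new, in percent."""
    return (new - base) / base * 100


# Combined swapin + swapout throughput for each kernel.
base_kib = swap_si[0] + swap_so[0]
opt_kib = swap_si[1] + swap_so[1]
base_gib = base_kib / KIB_PER_GIB
opt_gib = opt_kib / KIB_PER_GIB
swap_gain = pct_change(base_kib, opt_kib)

# Share of anonymous memory that is backed by THP.
thp_share_base = anon_huge[0] / anon_pages[0] * 100
thp_share_opt = anon_huge[1] / anon_pages[1] * 100

print(f"swap throughput: {base_gib:.2f} -> {opt_gib:.2f} GB/s ({swap_gain:+.0f}%)")
print(f"THP share of anon memory: {thp_share_base:.1f}% -> {thp_share_opt:.1f}%")
```

With those unit assumptions, the results round to the numbers in the text.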
>> 
>> Below is the description for all steps of THP swap optimization.
>> 
>> Recently, the performance of storage devices has improved so fast
>> that we cannot saturate the disk bandwidth with a single logical CPU
>> when swapping pages, even on a high-end server machine, because
>> storage device performance has improved faster than that of a single
>> logical CPU.  It seems that this trend will not change in the near
>> future.  On the other hand, THP becomes more and more popular because
>> of increased memory sizes.  So it becomes necessary to optimize THP
>> swap performance.
>> 
>> The advantages to swapout/swapin a THP in one piece include:
>> 
>> - Batch various swap operations for the THP.  Many operations need to
>>   be done once per THP instead of per normal page, for example,
>>   allocating/freeing the swap space, writing/reading the swap 

Re: [PATCH -mm -V3 00/21] mm, THP, swap: Swapout/swapin THP in one piece

2018-06-01 Thread Naoya Horiguchi
On Wed, May 23, 2018 at 04:26:04PM +0800, Huang, Ying wrote:
> From: Huang Ying 
> 
> Hi, Andrew, could you help me to check whether the overall design is
> reasonable?
> 
> Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
> swap part of the patchset?  Especially [02/21], [03/21], [04/21],
> [05/21], [06/21], [07/21], [08/21], [09/21], [10/21], [11/21],
> [12/21], [20/21].
> 
> Hi, Andrea and Kirill, could you help me to review the THP part of the
> patchset?  Especially [01/21], [07/21], [09/21], [11/21], [13/21],
> [15/21], [16/21], [17/21], [18/21], [19/21], [20/21], [21/21].
> 
> Hi, Johannes and Michal, could you help me to review the cgroup part
> of the patchset?  Especially [14/21].
> 
> And for all, Any comment is welcome!

Hi Ying Huang,
I've read through this series and find no issue.

It seems that THP swapout never happens if the swap devices are backed
by rotational storage.  I guess that's because this feature depends on
the swap cluster searching algorithm, which only supports non-rotational
storage.

I think that this limitation is OK because non-rotational storage is
better for swap devices (most future users will use it).  But I think
it's better to document the limitation somewhere, because the swap
cluster is an in-kernel thing and we can't assume that end users know
about it.

Thanks,
Naoya Horiguchi

> 
> This patchset is based on the 2018-05-18 head of mmotm/master.
> 
> This is the final step of THP (Transparent Huge Page) swap
> optimization.  After the first and second steps, splitting the huge
> page is delayed from almost the first step of swapout to after
> swapout has finished.  In this step, we avoid splitting the THP for
> swapout and swap the THP out and in as one piece.
> 
> We tested the patchset with the vm-scalability benchmark swap-w-seq
> test case, with 16 processes.  The test case forks 16 processes.  Each
> process allocates a large anonymous memory range and writes it from
> beginning to end for 8 rounds.  The first round will swapout, while
> the remaining rounds will swapin and swapout.  The test is done on a
> Xeon E5 v3 system; the swap device used is a RAM-simulated PMEM
> (persistent memory) device.  The test results are as follows:
> 
>             base                   optimized
> ---------------- ---------------------------
>          %stddev    %change          %stddev
>              \          |                \
>    1417897 ±  2%    +992.8%   15494673        vm-scalability.throughput
>    1020489 ±  4%   +1091.2%   12156349        vmstat.swap.si
>    1255093 ±  3%    +940.3%   13056114        vmstat.swap.so
>    1259769 ±  7%   +1818.3%   24166779        meminfo.AnonHugePages
>   28021761           -10.7%   25018848 ±  2%  meminfo.AnonPages
>   64080064 ±  4%     -95.6%    2787565 ± 33%  interrupts.CAL:Function_call_interrupts
>      13.91 ±  5%      -13.8       0.10 ± 27%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
> 
> The benchmark score (bytes written per second) improved by 992.8%.
> The swapout/swapin throughput improved by 1008% (from about 2.17GB/s
> to 24.04GB/s).  The performance difference is huge.  In the base
> kernel, the THP is swapped out and split during the first round of
> writing, so the remaining rounds only swap normal pages in and out.
> In the optimized kernel, the THP is kept after the first swapout, so
> THP swapin and swapout are used in the remaining rounds.  This shows
> the key benefit of swapping a THP out and in as one piece: the THP is
> kept instead of being split.  The meminfo data confirms this: in the
> base kernel only 4.5% of anonymous pages are THP during the test,
> while in the optimized kernel that figure is 96.6%.  TLB flushing IPIs
> (represented as interrupts.CAL:Function_call_interrupts) were reduced
> by 95.6%, while cycles spent in spinlocks dropped from 13.9% to 0.1%.
> These are performance benefits of THP swapout/swapin too.
> 
> Below is the description for all steps of THP swap optimization.
> 
> Recently, the performance of storage devices has improved so fast
> that we cannot saturate the disk bandwidth with a single logical CPU
> when swapping pages, even on a high-end server machine, because
> storage device performance has improved faster than that of a single
> logical CPU.  It seems that this trend will not change in the near
> future.  On the other hand, THP becomes more and more popular because
> of increased memory sizes.  So it becomes necessary to optimize THP
> swap performance.
> 
> The advantages to swapout/swapin a THP in one piece include:
> 
> - Batch various swap operations for the THP.  Many operations need to
>   be done once per THP instead of per normal page, for example,
>   allocating/freeing the swap space, writing/reading the swap space,
>   flushing TLB, page fault, etc.  This will improve the performance of
>   the THP swap greatly.
> 
> - The THP swap space read/write will be large sequential IO (2M on
>   x86_64).  It is particularly helpful for the swapin, 

Re: [PATCH -mm -V3 00/21] mm, THP, swap: Swapout/swapin THP in one piece

2018-06-01 Thread Naoya Horiguchi
On Wed, May 23, 2018 at 04:26:04PM +0800, Huang, Ying wrote:
> From: Huang Ying 
> 
> Hi, Andrew, could you help me to check whether the overall design is
> reasonable?
> 
> Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
> swap part of the patchset?  Especially [02/21], [03/21], [04/21],
> [05/21], [06/21], [07/21], [08/21], [09/21], [10/21], [11/21],
> [12/21], [20/21].
> 
> Hi, Andrea and Kirill, could you help me to review the THP part of the
> patchset?  Especially [01/21], [07/21], [09/21], [11/21], [13/21],
> [15/21], [16/21], [17/21], [18/21], [19/21], [20/21], [21/21].
> 
> Hi, Johannes and Michal, could you help me to review the cgroup part
> of the patchset?  Especially [14/21].
> 
> And for all, Any comment is welcome!

Hi Ying Huang,
I've read through this series and find no issue.

It seems that thp swapout never happens if swap devices are backed by
rotation storages.  I guess that's because this feature depends on swap
cluster searching algorithm which only supports non-rotational storages.

I think that this limitation is OK because non-rotational storage is
better for swap device (most future users will use it). But I think
it's better to document the limitation somewhere because swap cluster
is in-kernel thing and we can't assume that end users know about it.

Thanks,
Naoya Horiguchi

> 
> This patchset is based on the 2018-05-18 head of mmotm/master.
> 
> This is the final step of THP (Transparent Huge Page) swap
> optimization.  After the first and second step, the splitting huge
> page is delayed from almost the first step of swapout to after swapout
> has been finished.  In this step, we avoid splitting THP for swapout
> and swapout/swapin the THP in one piece.
> 
> We tested the patchset with vm-scalability benchmark swap-w-seq test
> case, with 16 processes.  The test case forks 16 processes.  Each
> process allocates large anonymous memory range, and writes it from
> begin to end for 8 rounds.  The first round will swapout, while the
> remaining rounds will swapin and swapout.  The test is done on a Xeon
> E5 v3 system, the swap device used is a RAM simulated PMEM (persistent
> memory) device.  The test result is as follow,
> 
>            base             optimized
> ---------------- --------------------------
>          %stddev     %change         %stddev
>              \          |                \
>    1417897 ±  2%    +992.8%   15494673        vm-scalability.throughput
>    1020489 ±  4%   +1091.2%   12156349        vmstat.swap.si
>    1255093 ±  3%    +940.3%   13056114        vmstat.swap.so
>    1259769 ±  7%   +1818.3%   24166779        meminfo.AnonHugePages
>   28021761           -10.7%   25018848 ±  2%  meminfo.AnonPages
>   64080064 ±  4%     -95.6%    2787565 ± 33%  interrupts.CAL:Function_call_interrupts
>      13.91 ±  5%     -13.8        0.10 ± 27%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
> 
> Here, the score of the benchmark (bytes written per second) improved
> by 992.8%.  The swapout/swapin throughput improved by 1008% (from
> about 2.17GB/s to 24.04GB/s).  The performance difference is huge.  In
> the base kernel, the THP is swapped out and split during the first
> round of writing, so in the remaining rounds there is only normal page
> swapin and swapout.  In the optimized kernel, the THP is kept after
> the first swapout, so THP swapin and swapout are used in the remaining
> rounds.  This shows the key benefit of swapout/swapin of the THP in
> one piece: the THP is kept instead of being split.  The meminfo
> information verifies this: in the base kernel only 4.5% of anonymous
> pages are THP during the test, while in the optimized kernel that is
> 96.6%.  The TLB flushing IPIs (represented as
> interrupts.CAL:Function_call_interrupts) were reduced by 95.6%, while
> the cycles spent on spinlocks were reduced from 13.9% to 0.1%.  These
> are performance benefits of THP swapout/swapin too.
> 
> Below is the description for all steps of THP swap optimization.
> 
> Recently, the performance of storage devices has improved so fast that
> we cannot saturate the disk bandwidth with a single logical CPU when
> doing page swapping, even on a high-end server machine, because the
> performance of the storage device has improved faster than that of a
> single logical CPU.  And it seems that the trend will not change in
> the near future.  On the other hand, THP becomes more and more popular
> because of increased memory size.  So it becomes necessary to optimize
> THP swap performance.
> 
> The advantages to swapout/swapin a THP in one piece include:
> 
> - Batch various swap operations for the THP.  Many operations need to
>   be done once per THP instead of per normal page, for example,
>   allocating/freeing the swap space, writing/reading the swap space,
>   flushing TLB, page fault, etc.  This will improve the performance of
>   the THP swap greatly.
> 
> - The THP swap space read/write will be large sequential IO (2M on
>   x86_64).  It is particularly helpful for the swapin, 

[PATCH -mm -V3 00/21] mm, THP, swap: Swapout/swapin THP in one piece

2018-05-23 Thread Huang, Ying
From: Huang Ying 

Hi, Andrew, could you help me to check whether the overall design is
reasonable?

Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
swap part of the patchset?  Especially [02/21], [03/21], [04/21],
[05/21], [06/21], [07/21], [08/21], [09/21], [10/21], [11/21],
[12/21], [20/21].

Hi, Andrea and Kirill, could you help me to review the THP part of the
patchset?  Especially [01/21], [07/21], [09/21], [11/21], [13/21],
[15/21], [16/21], [17/21], [18/21], [19/21], [20/21], [21/21].

Hi, Johannes and Michal, could you help me to review the cgroup part
of the patchset?  Especially [14/21].

And to all, any comment is welcome!

This patchset is based on the 2018-05-18 head of mmotm/master.

This is the final step of the THP (Transparent Huge Page) swap
optimization.  After the first and second steps, the splitting of the
huge page is delayed from almost the first step of swapout to after
swapout has finished.  In this step, we avoid splitting the THP for
swapout and swapout/swapin the THP in one piece.

We tested the patchset with the vm-scalability benchmark swap-w-seq
test case, with 16 processes.  The test case forks 16 processes.  Each
process allocates a large anonymous memory range and writes it from
beginning to end for 8 rounds.  The first round will swapout, while
the remaining rounds will swapin and swapout.  The test was done on a
Xeon E5 v3 system; the swap device used is a RAM-simulated PMEM
(persistent memory) device.  The test results are as follows,

           base             optimized
---------------- --------------------------
         %stddev     %change         %stddev
             \          |                \
   1417897 ±  2%    +992.8%   15494673        vm-scalability.throughput
   1020489 ±  4%   +1091.2%   12156349        vmstat.swap.si
   1255093 ±  3%    +940.3%   13056114        vmstat.swap.so
   1259769 ±  7%   +1818.3%   24166779        meminfo.AnonHugePages
  28021761           -10.7%   25018848 ±  2%  meminfo.AnonPages
  64080064 ±  4%     -95.6%    2787565 ± 33%  interrupts.CAL:Function_call_interrupts
     13.91 ±  5%     -13.8        0.10 ± 27%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath

Here, the score of the benchmark (bytes written per second) improved
by 992.8%.  The swapout/swapin throughput improved by 1008% (from
about 2.17GB/s to 24.04GB/s).  The performance difference is huge.  In
the base kernel, the THP is swapped out and split during the first
round of writing, so in the remaining rounds there is only normal page
swapin and swapout.  In the optimized kernel, the THP is kept after
the first swapout, so THP swapin and swapout are used in the remaining
rounds.  This shows the key benefit of swapout/swapin of the THP in
one piece: the THP is kept instead of being split.  The meminfo
information verifies this: in the base kernel only 4.5% of anonymous
pages are THP during the test, while in the optimized kernel that is
96.6%.  The TLB flushing IPIs (represented as
interrupts.CAL:Function_call_interrupts) were reduced by 95.6%, while
the cycles spent on spinlocks were reduced from 13.9% to 0.1%.  These
are performance benefits of THP swapout/swapin too.
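
For anyone who wants to re-derive the summary figures from the table,
the arithmetic can be sketched as below.  This is only an illustration:
the numbers are copied from the table, the helper names are made up
here, and it assumes that vmstat.swap.si and vmstat.swap.so are
reported in KiB per second, which is what makes the 2.17GB/s and
24.04GB/s figures come out.

```python
# Values copied from the base and optimized columns of the result table.
BASE = {"throughput": 1417897, "swap_si": 1020489, "swap_so": 1255093,
        "anon_huge": 1259769, "anon": 28021761, "ipi": 64080064}
OPT = {"throughput": 15494673, "swap_si": 12156349, "swap_so": 13056114,
       "anon_huge": 24166779, "anon": 25018848, "ipi": 2787565}

KIB_PER_GIB = 1024 * 1024  # assumption: si/so are in KiB per second


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def swap_gib_per_s(row: dict) -> float:
    """Combined swapin + swapout rate in GiB per second."""
    return (row["swap_si"] + row["swap_so"]) / KIB_PER_GIB


def thp_share(row: dict) -> float:
    """Percentage of anonymous memory that is backed by THP."""
    return row["anon_huge"] / row["anon"] * 100.0


print(f"throughput change: {pct_change(BASE['throughput'], OPT['throughput']):+.1f}%")
print(f"swap rate: {swap_gib_per_s(BASE):.2f} -> {swap_gib_per_s(OPT):.2f} GiB/s")
print(f"swap rate change: {pct_change(swap_gib_per_s(BASE), swap_gib_per_s(OPT)):+.0f}%")
print(f"THP share: {thp_share(BASE):.1f}% -> {thp_share(OPT):.1f}%")
print(f"IPI change: {pct_change(BASE['ipi'], OPT['ipi']):+.1f}%")
```

The 1008% figure is the change in the sum of swapin and swapout, which
is why it sits between the +1091.2% and +940.3% of the individual rows.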

Below is the description for all steps of THP swap optimization.

Recently, the performance of storage devices has improved so fast that
we cannot saturate the disk bandwidth with a single logical CPU when
doing page swapping, even on a high-end server machine, because the
performance of the storage device has improved faster than that of a
single logical CPU.  And it seems that the trend will not change in
the near future.  On the other hand, THP becomes more and more popular
because of increased memory size.  So it becomes necessary to optimize
THP swap performance.

The advantages to swapout/swapin a THP in one piece include:

- Batch various swap operations for the THP.  Many operations need to
  be done once per THP instead of per normal page, for example,
  allocating/freeing the swap space, writing/reading the swap space,
  flushing TLB, page fault, etc.  This will improve the performance of
  the THP swap greatly.

- The THP swap space read/write will be large sequential IO (2M on
  x86_64).  It is particularly helpful for swapin, which is usually 4k
  random IO.  This will improve the performance of the THP swap too.

- It will help with memory fragmentation, especially when THP is
  heavily used by the applications.  The THP order pages will be freed
  up after THP swapout.

- It will improve the THP utilization on systems with swap turned on,
  because the speed at which khugepaged collapses normal pages into a
  THP is quite slow.  After the THP is split during swapout, it will
  take quite a long time for the normal pages to collapse back into a
  THP after being swapped in.  High THP utilization helps the
  efficiency of the page based memory management too.
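
To give a feel for the batching factor in the first two points, here is
a rough back-of-the-envelope sketch.  It assumes a 2M THP (as stated
above for x86_64) and a 4k normal page; the function name is
illustrative only, and it counts "one operation per unit" in the
abstract rather than modelling any real kernel path.

```python
THP_SIZE = 2 * 1024 * 1024   # 2M THP on x86_64, as stated above
BASE_PAGE_SIZE = 4 * 1024    # assumed 4k normal page


def per_page_ops_for_one_thp(split: bool) -> int:
    """Count how many times a per-unit operation (swap slot allocation,
    IO submission, TLB flush, page fault, ...) runs for one THP worth
    of memory.

    split=True  -> the THP was split, so the operation runs per 4k page.
    split=False -> the THP is handled in one piece, so it runs once.
    """
    pages_per_thp = THP_SIZE // BASE_PAGE_SIZE
    return pages_per_thp if split else 1


print("split:", per_page_ops_for_one_thp(split=True))
print("one piece:", per_page_ops_for_one_thp(split=False))
```

Under these assumptions each batched operation replaces 512 per-page
ones, and the IO becomes one 2M request instead of up to 512 separate
4k requests.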

There are some concerns regarding THP swapin, mainly because possible
enlarged read/write IO size (for swapout/swapin) may put more overhead
on