Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-25 Thread Huang, Ying
Shaohua Li  writes:

> On Fri, Sep 23, 2016 at 10:32:39AM +0800, Huang, Ying wrote:
>> Rik van Riel  writes:
>> 
>> > On Thu, 2016-09-22 at 15:56 -0700, Shaohua Li wrote:
>> >> On Wed, Sep 07, 2016 at 09:45:59AM -0700, Huang, Ying wrote:
>> >> >
>> >> > - It will help the memory fragmentation, especially when the THP is
>> >> >   heavily used by the applications.  The 2M continuous pages will be
>> >> >   free up after THP swapping out.
>> >> 
>> >> So this is impossible without THP swapin. While 2M swapout makes a
>> >> lot of
>> >> sense, I doubt 2M swapin is really useful. What kind of application
>> >> is
>> >> 'optimized' to do sequential memory access?
>> >
>> > I suspect a lot of this will depend on the ratio of storage
>> > speed to CPU & RAM speed.
>> >
>> > When swapping to a spinning disk, it makes sense to avoid
>> > extra memory use on swapin, and work in 4kB blocks.
>> 
>> For spinning disk, the THP swap optimization will be turned off in
>> current implementation.  Because huge swap cluster allocation based on
>> swap cluster management, which is available only for non-rotating block
>> devices (blk_queue_nonrot()).
>
> For 2m swapin, as long as one byte is changed in the 2m, next time we must do
> 2m swapout. There is huge waste of memory and IO bandwidth and increases
> unnecessary memory pressure. 2M IO will very easily saturate a very fast SSD
> and makes IO the bottleneck. Not sure about NVRAM though.

One solution is to make 2M swapin configurable, maybe via a sysfs file
in /sys/kernel/mm/transparent_hugepage/, so that we can turn it on only
for really fast storage devices, such as NVRAM.
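
To make that concrete, such a knob would presumably follow the pattern of
the other attributes under /sys/kernel/mm/transparent_hugepage/.  A minimal
sketch (the name swapin_2m_enabled and the backing variable are purely
illustrative; nothing like this exists in the patchset):

/*
 * Illustrative sketch only: a hypothetical
 * /sys/kernel/mm/transparent_hugepage/swapin_2m_enabled attribute that
 * the 2M swapin path could consult.  Name and variable are made up.
 */
static bool thp_swapin_2m_enabled __read_mostly;	/* default: off */

static ssize_t swapin_2m_enabled_show(struct kobject *kobj,
				      struct kobj_attribute *attr, char *buf)
{
	return sprintf(buf, "%d\n", thp_swapin_2m_enabled);
}

static ssize_t swapin_2m_enabled_store(struct kobject *kobj,
				       struct kobj_attribute *attr,
				       const char *buf, size_t count)
{
	int err = kstrtobool(buf, &thp_swapin_2m_enabled);

	return err ? err : count;
}

static struct kobj_attribute swapin_2m_enabled_attr =
	__ATTR(swapin_2m_enabled, 0644,
	       swapin_2m_enabled_show, swapin_2m_enabled_store);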

Best Regards,
Huang, Ying


Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-25 Thread Minchan Kim
On Sun, Sep 25, 2016 at 12:18:49PM -0700, Shaohua Li wrote:
> On Fri, Sep 23, 2016 at 10:32:39AM +0800, Huang, Ying wrote:
> > Rik van Riel  writes:
> > 
> > > On Thu, 2016-09-22 at 15:56 -0700, Shaohua Li wrote:
> > >> On Wed, Sep 07, 2016 at 09:45:59AM -0700, Huang, Ying wrote:
> > >> > 
> > >> > - It will help the memory fragmentation, especially when the THP is
> > >> >   heavily used by the applications.  The 2M continuous pages will
> > >> > be
> > >> >   free up after THP swapping out.
> > >> 
> > >> So this is impossible without THP swapin. While 2M swapout makes a
> > >> lot of
> > >> sense, I doubt 2M swapin is really useful. What kind of application
> > >> is
> > >> 'optimized' to do sequential memory access?
> > >
> > > I suspect a lot of this will depend on the ratio of storage
> > > speed to CPU & RAM speed.
> > >
> > > When swapping to a spinning disk, it makes sense to avoid
> > > extra memory use on swapin, and work in 4kB blocks.
> > 
> > For spinning disk, the THP swap optimization will be turned off in
> > current implementation.  Because huge swap cluster allocation based on
> > swap cluster management, which is available only for non-rotating block
> > devices (blk_queue_nonrot()).
> 
> For 2m swapin, as long as one byte is changed in the 2m, next time we must do
> 2m swapout. There is huge waste of memory and IO bandwidth and increases
> unnecessary memory pressure. 2M IO will very easily saturate a very fast SSD

I agree. No doubt THP swapout is helpful for overall performance, but
THP swapin should be more careful. It would add memory pressure that
could evict warm pages, which mitigates THP's benefit. THP swapin would
also increase minor fault latency.

If we want to swap in a THP, I think we need something to guarantee that
the subpages of the swapped-out THP were hot and had temporal locality,
so that it is worth swapping in the whole THP at the cost of evicting
other memory.

Maybe it would not matter so much in MADVISE mode, where userspace knows
the pros and cons and chose it. The problem would be there in ALWAYS mode.

One idea is to raise the bar for collapsing a THP, for example by
reducing khugepaged_max_ptes_none and introducing khugepaged_max_pte_ref.
With that, khugepaged would collapse 4K pages into a THP only if most of
the subpages are mapped and hot.
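
As a rough illustration of that heuristic (a sketch only, not existing
kernel code; khugepaged_max_ptes_none is a real tunable today, while the
"referenced" threshold corresponds to the proposed khugepaged_max_pte_ref),
the collapse decision might look like:

/*
 * Sketch of the heuristic above, not existing kernel code.  ptes_none and
 * ptes_referenced would be counted while scanning the 512 PTEs of a
 * candidate 2M range; the thresholds correspond to the existing
 * khugepaged_max_ptes_none tunable and the proposed khugepaged_max_pte_ref.
 */
static inline bool worth_collapsing(unsigned int ptes_none,
				    unsigned int ptes_referenced,
				    unsigned int max_ptes_none,
				    unsigned int min_ptes_referenced)
{
	/* collapse only if mostly mapped and mostly recently referenced */
	return ptes_none <= max_ptes_none &&
	       ptes_referenced >= min_ptes_referenced;
}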


Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-25 Thread Shaohua Li
On Fri, Sep 23, 2016 at 10:32:39AM +0800, Huang, Ying wrote:
> Rik van Riel  writes:
> 
> > On Thu, 2016-09-22 at 15:56 -0700, Shaohua Li wrote:
> >> On Wed, Sep 07, 2016 at 09:45:59AM -0700, Huang, Ying wrote:
> >> > 
> >> > - It will help the memory fragmentation, especially when the THP is
> >> >   heavily used by the applications.  The 2M continuous pages will
> >> > be
> >> >   free up after THP swapping out.
> >> 
> >> So this is impossible without THP swapin. While 2M swapout makes a
> >> lot of
> >> sense, I doubt 2M swapin is really useful. What kind of application
> >> is
> >> 'optimized' to do sequential memory access?
> >
> > I suspect a lot of this will depend on the ratio of storage
> > speed to CPU & RAM speed.
> >
> > When swapping to a spinning disk, it makes sense to avoid
> > extra memory use on swapin, and work in 4kB blocks.
> 
> For spinning disk, the THP swap optimization will be turned off in
> current implementation.  Because huge swap cluster allocation based on
> swap cluster management, which is available only for non-rotating block
> devices (blk_queue_nonrot()).

For 2M swapin, as long as one byte is changed in the 2M, next time we must do
a 2M swapout. That is a huge waste of memory and IO bandwidth and adds
unnecessary memory pressure. 2M IO will very easily saturate a very fast SSD
and make IO the bottleneck. Not sure about NVRAM though.

Thanks,
Shaohua


Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-22 Thread Huang, Ying
Rik van Riel  writes:

> On Thu, 2016-09-22 at 15:56 -0700, Shaohua Li wrote:
>> On Wed, Sep 07, 2016 at 09:45:59AM -0700, Huang, Ying wrote:
>> > 
>> > - It will help the memory fragmentation, especially when the THP is
>> >   heavily used by the applications.  The 2M continuous pages will
>> > be
>> >   free up after THP swapping out.
>> 
>> So this is impossible without THP swapin. While 2M swapout makes a
>> lot of
>> sense, I doubt 2M swapin is really useful. What kind of application
>> is
>> 'optimized' to do sequential memory access?
>
> I suspect a lot of this will depend on the ratio of storage
> speed to CPU & RAM speed.
>
> When swapping to a spinning disk, it makes sense to avoid
> extra memory use on swapin, and work in 4kB blocks.

For spinning disks, the THP swap optimization will be turned off in the
current implementation, because huge swap cluster allocation is based on
the swap cluster management, which is available only for non-rotating
block devices (blk_queue_nonrot()).
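
A minimal sketch of that gating, assuming the check is done against the
swap device's request queue (illustrative only, not necessarily the
patchset's exact code):

/*
 * Sketch: attempt huge swap cluster allocation (and hence THP swapout)
 * only when the swap device is non-rotational, mirroring how the swap
 * cluster management is gated today.
 */
static bool thp_swap_supported(struct swap_info_struct *si)
{
	struct request_queue *q = si->bdev ? bdev_get_queue(si->bdev) : NULL;

	return q && blk_queue_nonrot(q);
}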

> When swapping to NVRAM, it makes sense to use 2MB blocks,
> because that storage can handle data faster than we can
> manage 4kB pages in the VM.

Best Regards,
Huang, Ying


Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-22 Thread Huang, Ying
Hi, Shaohua,

Thanks for comments!

Shaohua Li  writes:

> On Wed, Sep 07, 2016 at 09:45:59AM -0700, Huang, Ying wrote:
>> 
>> The advantages of the THP swap support include:

Sorry for the confusion.  These are the advantages of the final goal,
that is, avoiding splitting/collapsing the THP during swap out/in, not
the advantages of this patchset alone.  This patchset is just the first
step toward the final goal, so some advantages of the final goal are not
reflected in this patchset.

>> - Batch the swap operations for the THP to reduce lock
>>   acquiring/releasing, including allocating/freeing the swap space,
>>   adding/deleting to/from the swap cache, and writing/reading the swap
>>   space, etc.  This will help improve the performance of the THP swap.
>> 
>> - The THP swap space read/write will be 2M sequential IO.  It is
>>   particularly helpful for the swap read, which usually are 4k random
>>   IO.  This will improve the performance of the THP swap too.
>
> I think this is not a problem. Even with current early split, we are 
> allocating
> swap entry sequentially, after IO is dispatched, block layer will merge IO to
> big size.

Yes.  For swap out, the original implementation can already merge the IO
into big requests.  But for THP swap out, instead of allocating one bio
for each 4k page in a THP, we can allocate one bio for the whole THP.
This avoids spending many useless CPU cycles on splitting and then
merging.  I think this will help performance on fast storage devices.
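
To illustrate the difference, a sketch under the assumption that the
THP's 512 subpages are physically contiguous (this is not the patchset's
code; the helper name is made up):

/*
 * Sketch: because a THP is physically contiguous, the whole 2M unit can
 * be described by a single bio with one segment, instead of building 512
 * per-4k bios and relying on the block layer to merge them again.
 */
static struct bio *thp_swap_bio_sketch(struct page *thp_head, gfp_t gfp)
{
	struct bio *bio = bio_alloc(gfp, 1);	/* one segment is enough */

	if (bio)
		bio_add_page(bio, thp_head, HPAGE_PMD_SIZE, 0);
	return bio;
}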

>> - It will help the memory fragmentation, especially when the THP is
>>   heavily used by the applications.  The 2M continuous pages will be
>>   free up after THP swapping out.
>
> So this is impossible without THP swapin. While 2M swapout makes a lot of
> sense, I doubt 2M swapin is really useful. What kind of application is
> 'optimized' to do sequential memory access?

Although applications usually don't do much sequential memory access,
they still have spatial locality.  And with 2M swap in, a page that was a
THP before being swapped out remains a THP after being swapped in.  It
can be mapped into a PMD of the application, which will help reduce TLB
contention.

> One advantage of THP swapout is to reduce TLB flush. Eg, when we split 2m to 
> 4k
> pages, we set swap entry for the 4k pages since your patch already allocates
> swap entry before the split, so we only do tlb flush once in the split. 
> Without
> the delay THP split, we do twice tlb flush (split and unmap of swapout). I
> don't see this in the patches, do I misread the code?

Combining THP splitting with unmapping?  That sounds like a good idea.
It is not implemented in this patchset because I have not thought about
that before :).

In the next step of THP swap support, I will further delay the THP
splitting until after swapping out has finished.  At that time, we will
avoid calling split_huge_page_to_list() during swapping out, so the TLB
flush only needs to be done once, for the unmap.

Best Regards,
Huang, Ying

> Thanks,
> Shaohua


Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-22 Thread Rik van Riel
On Thu, 2016-09-22 at 15:56 -0700, Shaohua Li wrote:
> On Wed, Sep 07, 2016 at 09:45:59AM -0700, Huang, Ying wrote:
> > 
> > - It will help the memory fragmentation, especially when the THP is
> >   heavily used by the applications.  The 2M continuous pages will
> > be
> >   free up after THP swapping out.
> 
> So this is impossible without THP swapin. While 2M swapout makes a
> lot of
> sense, I doubt 2M swapin is really useful. What kind of application
> is
> 'optimized' to do sequential memory access?

I suspect a lot of this will depend on the ratio of storage
speed to CPU & RAM speed.

When swapping to a spinning disk, it makes sense to avoid
extra memory use on swapin, and work in 4kB blocks.

When swapping to NVRAM, it makes sense to use 2MB blocks,
because that storage can handle data faster than we can
manage 4kB pages in the VM.

-- 
All Rights Reversed.


Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-22 Thread Andi Kleen
"Chen, Tim C"  writes:

>>
>>So this is impossible without THP swapin. While 2M swapout makes a lot of
>>sense, I doubt 2M swapin is really useful. What kind of application is 
>>'optimized'
>>to do sequential memory access?

Anything that touches regions larger than 4K, where we want the kernel
to do minimal work to manage the swapping.

>
> We waste a lot of cpu cycles to re-compact 4K pages back to a large page
> under THP.  Swapping it back in as a single large page can avoid
> fragmentation and this overhead.

Also splitting something just to merge it again is wasteful.

A lot of big improvements in the block and VM and network layers
over the years came from avoiding that kind of wasteful work.

-Andi


RE: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-22 Thread Chen, Tim C
>
>So this is impossible without THP swapin. While 2M swapout makes a lot of
>sense, I doubt 2M swapin is really useful. What kind of application is 
>'optimized'
>to do sequential memory access?

We waste a lot of cpu cycles to re-compact 4K pages back to a large page
under THP.  Swapping it back in as a single large page can avoid
fragmentation and this overhead.

Thanks.

Tim


Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-22 Thread Shaohua Li
On Wed, Sep 07, 2016 at 09:45:59AM -0700, Huang, Ying wrote:
> 
> The advantages of the THP swap support include:
> 
> - Batch the swap operations for the THP to reduce lock
>   acquiring/releasing, including allocating/freeing the swap space,
>   adding/deleting to/from the swap cache, and writing/reading the swap
>   space, etc.  This will help improve the performance of the THP swap.
> 
> - The THP swap space read/write will be 2M sequential IO.  It is
>   particularly helpful for the swap read, which usually are 4k random
>   IO.  This will improve the performance of the THP swap too.

I think this is not a problem. Even with the current early split, we
allocate the swap entries sequentially; after the IO is dispatched, the
block layer will merge the IO into big requests.

> - It will help the memory fragmentation, especially when the THP is
>   heavily used by the applications.  The 2M continuous pages will be
>   free up after THP swapping out.

So this is impossible without THP swapin. While 2M swapout makes a lot of
sense, I doubt 2M swapin is really useful. What kind of application is
'optimized' to do sequential memory access?

One advantage of THP swapout is reducing TLB flushes. E.g., when we split
2M into 4k pages, we set the swap entries for the 4k pages, since your
patch already allocates the swap entries before the split, so we only do
the TLB flush once, in the split. Without the delayed THP split, we do
the TLB flush twice (in the split and in the unmap of swapout). I don't
see this in the patches; do I misread the code?

Thanks,
Shaohua


Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-19 Thread Minchan Kim
Hi Huang,

On Tue, Sep 20, 2016 at 10:54:35AM +0800, Huang, Ying wrote:
> Hi, Minchan,
> 
> Minchan Kim  writes:
> > Hi Huang,
> >
> > On Sun, Sep 18, 2016 at 09:53:39AM +0800, Huang, Ying wrote:
> >> Minchan Kim  writes:
> >> 
> >> > On Tue, Sep 13, 2016 at 04:53:49PM +0800, Huang, Ying wrote:
> >> >> Minchan Kim  writes:
> >> >> > On Tue, Sep 13, 2016 at 02:40:00PM +0800, Huang, Ying wrote:
> >> >> >> Minchan Kim  writes:
> >> >> >> 
> >> >> >> > Hi Huang,
> >> >> >> >
> >> >> >> > On Fri, Sep 09, 2016 at 01:35:12PM -0700, Huang, Ying wrote:
> >> >> >> >
> 
> [snip]
> 
> >> >> > 1. If we solve batching swapout, then how is THP split for swapout 
> >> >> > bad?
> >> >> > 2. Also, how is current conservatie swapin from khugepaged bad?
> >> >> >
> >> >> > I think it's one of decision point for the motivation of your work
> >> >> > and for 1, we need batching swapout feature.
> >> >> >
> >> >> > I am saying again that I'm not against your goal but only concern
> >> >> > is approach. If you don't agree, please ignore me.
> >> >> 
> >> >> I am glad to discuss my final goal, that is, swapping out/in the full
> >> >> THP without splitting.  Why I want to do that is copied as below,
> >> >
> >> > Yes, it's your *final* goal but what if it couldn't be acceptable
> >> > on second step you mentioned above, for example?
> >> >
> >> > Unncessary binded implementation to rejected work.
> >> 
> >> So I want to discuss my final goal.  If people accept my final goal,
> >> this is resolved.  If people don't accept, I will reconsider it.
> >
> > No.
> >
> > Please keep it in mind. There are lots of factors the project would
> > be broken during going on by several reasons because we are human being
> > so we can simply miss something clear and realize it later that it's
> > not feasible. Otherwise, others can show up with better idea for the
> > goal or fix other subsystem which can affect your goals.
> > I don't want to say such boring theoretical stuffs any more.
> >
> > My point is patchset should be self-contained if you really want to go
> > with step-by-step approach because we are likely to miss something
> > *easily*.
> >
> >> 
> >> > If you want to achieve your goal step by step, please consider if
> >> > one of step you are thinking could be rejected but steps already
> >> > merged should be self-contained without side-effect.
> >> 
> >> What is the side-effect or possible regressions of the step 1 as in this
> >
> > Adding code complexity for unproved feature.
> >
> > When I read your steps, your *most important* goal is to avoid split/
> > collapsing anon THP page for swap out/in. As a bonus with the approach,
> > we could increase swapout/in bandwidth, too. Do I understand correctly?
> 
> It's hard to say what is the *most important* goal.  But it is clear
> that to improve swapout/in performance isn't the only goal.  The other
> goal to avoid split/collapsing THP page for swap out/in is very
> important too.

Okay, then, couldn't you focus on one goal per patchset? After solving
one problem, move on to the next one. What's the problem?
One of your goals is swapout performance, and it's the same as Tim's work.
That's why I wanted to make your patchset based on Tim's work. But if you
want your patch first, please make the patchset independent of your other
goal so everyone can review easily and focus on *a* problem.
In your patchset, the THP split delaying part could be folded into your
second patchset, which is to avoid THP split/collapsing.

> 
> > However, swap-in/out bandwidth enhance is common requirement for both
> > normal and THP page and with Tim's work, we could enhance swapout path.
> >
> > So, I think you should give us to number about how THP split is bad
> > for the swapout bandwidth even though we applied Tim's work.
> > If it's serious, next approach is yours that we could tweak swap code
> > be aware of a THP to avoid splitting a THP.
> 
> It's not only about CPU cycles spent in splitting and collapsing THP,
> but also how to make THP work effectively on systems with swap turned
> on.
> 
> To avoid disturbing user applications etc., THP collapsing doesn't work
> aggressively to collapse anonymous pages into THP.  This means, once the
> THP is split, it will take quite long time (wall time, instead of CPU
> cycles) to be collapsed to become a THP, especially on machines with
> large memory size.  And on systems with swap turned on, THP will be
> split during swap out/in now.  If much swapping out/in is triggered
> during system running, it is possible that many THP is split, and have
> no chance to be collapsed.  Even if the THP that has been split gets
> opportunity to be collapsed again, the applications lose the opportunity
> to take advantage of the THP for quite long time too.  And the memory
> will be fragmented during the process, this makes it hard to allocate
> new THP.  The end result is that THP usage is very low in this
> situation.  One solution is to avoid to split/collapse THP during swap
> out/in.

Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-19 Thread Huang, Ying
Minchan Kim  writes:

> Hi Huang,
>
> On Tue, Sep 20, 2016 at 10:54:35AM +0800, Huang, Ying wrote:
>> Hi, Minchan,
>> 
>> Minchan Kim  writes:
>> > Hi Huang,
>> >
>> > On Sun, Sep 18, 2016 at 09:53:39AM +0800, Huang, Ying wrote:
>> >> Minchan Kim  writes:
>> >> 
>> >> > On Tue, Sep 13, 2016 at 04:53:49PM +0800, Huang, Ying wrote:
>> >> >> Minchan Kim  writes:
>> >> >> > On Tue, Sep 13, 2016 at 02:40:00PM +0800, Huang, Ying wrote:
>> >> >> >> Minchan Kim  writes:
>> >> >> >> 
>> >> >> >> > Hi Huang,
>> >> >> >> >
>> >> >> >> > On Fri, Sep 09, 2016 at 01:35:12PM -0700, Huang, Ying wrote:
>> >> >> >> >
>> 
>> [snip]
>> 
>> >> >> > 1. If we solve batching swapout, then how is THP split for swapout 
>> >> >> > bad?
>> >> >> > 2. Also, how is current conservatie swapin from khugepaged bad?
>> >> >> >
>> >> >> > I think it's one of decision point for the motivation of your work
>> >> >> > and for 1, we need batching swapout feature.
>> >> >> >
>> >> >> > I am saying again that I'm not against your goal but only concern
>> >> >> > is approach. If you don't agree, please ignore me.
>> >> >> 
>> >> >> I am glad to discuss my final goal, that is, swapping out/in the full
>> >> >> THP without splitting.  Why I want to do that is copied as below,
>> >> >
>> >> > Yes, it's your *final* goal but what if it couldn't be acceptable
>> >> > on second step you mentioned above, for example?
>> >> >
>> >> > Unncessary binded implementation to rejected work.
>> >> 
>> >> So I want to discuss my final goal.  If people accept my final goal,
>> >> this is resolved.  If people don't accept, I will reconsider it.
>> >
>> > No.
>> >
>> > Please keep it in mind. There are lots of factors the project would
>> > be broken during going on by several reasons because we are human being
>> > so we can simply miss something clear and realize it later that it's
>> > not feasible. Otherwise, others can show up with better idea for the
>> > goal or fix other subsystem which can affect your goals.
>> > I don't want to say such boring theoretical stuffs any more.
>> >
>> > My point is patchset should be self-contained if you really want to go
>> > with step-by-step approach because we are likely to miss something
>> > *easily*.
>> >
>> >> 
>> >> > If you want to achieve your goal step by step, please consider if
>> >> > one of step you are thinking could be rejected but steps already
>> >> > merged should be self-contained without side-effect.
>> >> 
>> >> What is the side-effect or possible regressions of the step 1 as in this
>> >
>> > Adding code complexity for unproved feature.
>> >
>> > When I read your steps, your *most important* goal is to avoid split/
>> > collapsing anon THP page for swap out/in. As a bonus with the approach,
>> > we could increase swapout/in bandwidth, too. Do I understand correctly?
>> 
>> It's hard to say what is the *most important* goal.  But it is clear
>> that to improve swapout/in performance isn't the only goal.  The other
>> goal to avoid split/collapsing THP page for swap out/in is very
>> important too.
>
> Okay, then, couldn't you focus a goal in patchset? After solving a problem,
> then next one. What's the problem?
> One of your goal is swapout performance and it's same with Tim's work.
> That's why I wanted to make your patchset based on Tim's work. But if you
> want your patch first, please make patchset independent with your other goal
> so everyone can review easily and focus on *a* problem.
> In your patchset, THP split delaying part could be folded into in your second
> patchset which is to avoid THP split/collapsing.

I thought multiple goals for one patchset were common.  But if you want
just one goal for review, I suggest reviewing the patchset for the goal
of avoiding splitting/collapsing the anon THP page for swap out/in.  And
this patchset is just the first step toward that.

>> > However, swap-in/out bandwidth enhance is common requirement for both
>> > normal and THP page and with Tim's work, we could enhance swapout path.
>> >
>> > So, I think you should give us to number about how THP split is bad
>> > for the swapout bandwidth even though we applied Tim's work.
>> > If it's serious, next approach is yours that we could tweak swap code
>> > be aware of a THP to avoid splitting a THP.
>> 
>> It's not only about CPU cycles spent in splitting and collapsing THP,
>> but also how to make THP work effectively on systems with swap turned
>> on.
>> 
>> To avoid disturbing user applications etc., THP collapsing doesn't work
>> aggressively to collapse anonymous pages into THP.  This means, once the
>> THP is split, it will take quite long time (wall time, instead of CPU
>> cycles) to be collapsed to become a THP, especially on machines with
>> large memory size.  And on systems with swap turned on, THP will be
>> split during swap out/in now.  If much swapping out/in is triggered
>> during system running, it is possible that many THP is split, and have
>> no chance to be collapsed.  Even if the THP that

Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-19 Thread Huang, Ying
Hi, Minchan,

Minchan Kim  writes:
> Hi Huang,
>
> On Sun, Sep 18, 2016 at 09:53:39AM +0800, Huang, Ying wrote:
>> Minchan Kim  writes:
>> 
>> > On Tue, Sep 13, 2016 at 04:53:49PM +0800, Huang, Ying wrote:
>> >> Minchan Kim  writes:
>> >> > On Tue, Sep 13, 2016 at 02:40:00PM +0800, Huang, Ying wrote:
>> >> >> Minchan Kim  writes:
>> >> >> 
>> >> >> > Hi Huang,
>> >> >> >
>> >> >> > On Fri, Sep 09, 2016 at 01:35:12PM -0700, Huang, Ying wrote:
>> >> >> >

[snip]

>> >> > 1. If we solve batching swapout, then how is THP split for swapout bad?
>> >> > 2. Also, how is current conservatie swapin from khugepaged bad?
>> >> >
>> >> > I think it's one of decision point for the motivation of your work
>> >> > and for 1, we need batching swapout feature.
>> >> >
>> >> > I am saying again that I'm not against your goal but only concern
>> >> > is approach. If you don't agree, please ignore me.
>> >> 
>> >> I am glad to discuss my final goal, that is, swapping out/in the full
>> >> THP without splitting.  Why I want to do that is copied as below,
>> >
>> > Yes, it's your *final* goal but what if it couldn't be acceptable
>> > on second step you mentioned above, for example?
>> >
>> > Unncessary binded implementation to rejected work.
>> 
>> So I want to discuss my final goal.  If people accept my final goal,
>> this is resolved.  If people don't accept, I will reconsider it.
>
> No.
>
> Please keep it in mind. There are lots of factors the project would
> be broken during going on by several reasons because we are human being
> so we can simply miss something clear and realize it later that it's
> not feasible. Otherwise, others can show up with better idea for the
> goal or fix other subsystem which can affect your goals.
> I don't want to say such boring theoretical stuffs any more.
>
> My point is patchset should be self-contained if you really want to go
> with step-by-step approach because we are likely to miss something
> *easily*.
>
>> 
>> > If you want to achieve your goal step by step, please consider if
>> > one of step you are thinking could be rejected but steps already
>> > merged should be self-contained without side-effect.
>> 
>> What is the side-effect or possible regressions of the step 1 as in this
>
> Adding code complexity for unproved feature.
>
> When I read your steps, your *most important* goal is to avoid split/
> collapsing anon THP page for swap out/in. As a bonus with the approach,
> we could increase swapout/in bandwidth, too. Do I understand correctly?

It's hard to say what is the *most important* goal.  But it is clear
that improving swapout/in performance isn't the only goal.  The other
goal, avoiding splitting/collapsing the THP page for swap out/in, is very
important too.

> However, swap-in/out bandwidth enhance is common requirement for both
> normal and THP page and with Tim's work, we could enhance swapout path.
>
> So, I think you should give us to number about how THP split is bad
> for the swapout bandwidth even though we applied Tim's work.
> If it's serious, next approach is yours that we could tweak swap code
> be aware of a THP to avoid splitting a THP.

It's not only about CPU cycles spent in splitting and collapsing THP,
but also how to make THP work effectively on systems with swap turned
on.

To avoid disturbing user applications etc., THP collapsing doesn't work
aggressively to collapse anonymous pages into a THP.  This means that,
once a THP is split, it will take quite a long time (wall time, not CPU
cycles) to be collapsed back into a THP, especially on machines with a
large memory size.  And on systems with swap turned on, THPs will be
split during swap out/in now.  If much swapping out/in is triggered while
the system is running, it is possible that many THPs are split and have
no chance to be collapsed.  Even if a THP that has been split gets the
opportunity to be collapsed again, the application loses the chance to
take advantage of the THP for quite a long time too.  And the memory will
be fragmented during the process, which makes it hard to allocate new
THPs.  The end result is that THP usage is very low in this situation.
One solution is to avoid splitting/collapsing the THP during swap
out/in.

> For THP swap-in, I think it's another topic we should discuss.
> For each step, it's orthogonal work so it shouldn't rely on next goal.
>
>
>> patchset?  Lacks the opportunity to allocate consecutive 512 swap slots
>> in 2 non-free swap clusters?  I don't think that is a regression,
>> because the patchset will NOT make free swap clusters consumed faster
>> than that in current code.  Even if it were better to allocate
>> consecutive 512 swap slots in 2 non-free swap clusters, it could be an
>> incremental improvement to the simple solution in this patchset.  That
>> is, to allocate 512 swap slots, the simple solution is:
>> 
>> a) Try to allocate a free swap cluster
>> b) If a) fails, give up
>> 
>> The improved solution could be (if it were needed finally)
>> 
>> a) Try to alloca

Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-19 Thread Hugh Dickins
On Wed, 7 Sep 2016, Huang, Ying wrote:
> From: Huang Ying 
> 
> This patchset is to optimize the performance of Transparent Huge Page
> (THP) swap.
> 
> Hi, Andrew, could you help me to check whether the overall design is
> reasonable?
> 
> Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
> swap part of the patchset?  Especially [01/10], [04/10], [05/10],
> [06/10], [07/10], [10/10].

Sorry, I am very far from having time to do so.

Hugh


Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-19 Thread Tim Chen
On Mon, 2016-09-19 at 16:11 +0900, Minchan Kim wrote:
> Hi Tim,
> 
> On Tue, Sep 13, 2016 at 11:52:27PM +, Chen, Tim C wrote:
> > 
> > > 
> > > > 
> > > > 
> > > > - Avoid CPU time for splitting, collapsing THP across swap out/in.
> > > Yes, if you want, please give us how bad it is.
> > > 
> > It could be pretty bad.  In an experiment with THP turned on and we
> > enter swap, 50% of the cpu are spent in the page compaction path.  
> It's page compaction overhead, especially, pageblock_pfn_to_page.
> Why is it related to overhead THP split for swapout?
> I don't understand.

Today you have to split a large page into 4K pages to swap it out.
Then after you swap in all the 4K pages, you have to re-compact
them back into a large page.

If you can swap the large page out as a contiguous unit, and swap
it back in as a single large page, the splitting and re-compaction
back into a large page can be avoided.

Tim



Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-19 Thread Minchan Kim
Hi Tim,

On Tue, Sep 13, 2016 at 11:52:27PM +, Chen, Tim C wrote:
> >>
> >> - Avoid CPU time for splitting, collapsing THP across swap out/in.
> >
> >Yes, if you want, please give us how bad it is.
> >
> 
> It could be pretty bad.  In an experiment with THP turned on and we
> enter swap, 50% of the cpu are spent in the page compaction path.  

It's page compaction overhead, especially pageblock_pfn_to_page.
Why is it related to the overhead of THP splitting for swapout?
I don't understand.

> So if we could deal with units of large page for swap, the splitting
> and compaction of ordinary pages to large page overhead could be avoided.
> 
>51.89%51.89%:1688  [kernel.kallsyms]   [k] 
> pageblock_pfn_to_page   
>   |
>   --- pageblock_pfn_to_page
>  |  
>  |--64.57%-- compaction_alloc
>  |  migrate_pages
>  |  compact_zone
>  |  compact_zone_order
>  |  try_to_compact_pages
>  |  __alloc_pages_direct_compact
>  |  __alloc_pages_nodemask
>  |  alloc_pages_vma
>  |  do_huge_pmd_anonymous_page
>  |  handle_mm_fault
>  |  __do_page_fault
>  |  do_page_fault
>  |  page_fault
>  |  0x401d9a
>  |  
>  |--34.62%-- compact_zone
>  |  compact_zone_order
>  |  try_to_compact_pages
>  |  __alloc_pages_direct_compact
>  |  __alloc_pages_nodemask
>  |  alloc_pages_vma
>  |  do_huge_pmd_anonymous_page
>  |  handle_mm_fault
>  |  __do_page_fault
>  |  do_page_fault
>  |  page_fault
>  |  0x401d9a
>   --0.81%-- [...]
> 
> Tim


Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-19 Thread Minchan Kim
Hi Huang,

On Sun, Sep 18, 2016 at 09:53:39AM +0800, Huang, Ying wrote:
> Minchan Kim  writes:
> 
> > On Tue, Sep 13, 2016 at 04:53:49PM +0800, Huang, Ying wrote:
> >> Minchan Kim  writes:
> >> > On Tue, Sep 13, 2016 at 02:40:00PM +0800, Huang, Ying wrote:
> >> >> Minchan Kim  writes:
> >> >> 
> >> >> > Hi Huang,
> >> >> >
> >> >> > On Fri, Sep 09, 2016 at 01:35:12PM -0700, Huang, Ying wrote:
> >> >> >
> >> >> > < snip >
> >> >> >
> >> >> >> >> Recently, the performance of the storage devices improved so fast 
> >> >> >> >> that
> >> >> >> >> we cannot saturate the disk bandwidth when do page swap out even 
> >> >> >> >> on a
> >> >> >> >> high-end server machine.  Because the performance of the storage
> >> >> >> >> device improved faster than that of CPU.  And it seems that the 
> >> >> >> >> trend
> >> >> >> >> will not change in the near future.  On the other hand, the THP
> >> >> >> >> becomes more and more popular because of increased memory size.  
> >> >> >> >> So it
> >> >> >> >> becomes necessary to optimize THP swap performance.
> >> >> >> >> 
> >> >> >> >> The advantages of the THP swap support include:
> >> >> >> >> 
> >> >> >> >> - Batch the swap operations for the THP to reduce lock
> >> >> >> >>   acquiring/releasing, including allocating/freeing the swap 
> >> >> >> >> space,
> >> >> >> >>   adding/deleting to/from the swap cache, and writing/reading the 
> >> >> >> >> swap
> >> >> >> >>   space, etc.  This will help improve the performance of the THP 
> >> >> >> >> swap.
> >> >> >> >> 
> >> >> >> >> - The THP swap space read/write will be 2M sequential IO.  It is
> >> >> >> >>   particularly helpful for the swap read, which usually are 4k 
> >> >> >> >> random
> >> >> >> >>   IO.  This will improve the performance of the THP swap too.
> >> >> >> >> 
> >> >> >> >> - It will help the memory fragmentation, especially when the THP 
> >> >> >> >> is
> >> >> >> >>   heavily used by the applications.  The 2M continuous pages will 
> >> >> >> >> be
> >> >> >> >>   free up after THP swapping out.
> >> >> >> >
> >> >> >> > I just read patchset right now and still doubt why the all changes
> >> >> >> > should be coupled with THP tightly. Many parts(e.g., you introduced
> >> >> >> > or modifying existing functions for making them THP specific) could
> >> >> >> > just take page_list and the number of pages then would handle them
> >> >> >> > without THP awareness.
> >> >> >> 
> >> >> >> I am glad if my change could help normal pages swapping too.  And we 
> >> >> >> can
> >> >> >> change these functions to work for normal pages when necessary.
> >> >> >
> >> >> > Sure but it would be less painful that THP awareness swapout is
> >> >> > based on multiple normal pages swapout. For exmaple, we don't
> >> >> > touch delay THP split part(i.e., split a THP into 512 pages like
> >> >> > as-is) and enhances swapout further like Tim's suggestion
> >> >> > for mulitple normal pages swapout. With that, it might be enough
> >> >> > for fast-storage without needing THP awareness.
> >> >> >
> >> >> > My *point* is let's approach step by step.
> >> >> > First of all, go with batching normal pages swapout and if it's
> >> >> > not enough, dive into further optimization like introducing
> >> >> > THP-aware swapout.
> >> >> >
> >> >> > I believe it's natural development process to evolve things
> >> >> > without over-engineering.
> >> >> 
> >> >> My target is not only the THP swap out acceleration, but also the full
> >> >> THP swap out/in support without splitting THP.  This patchset is just
> >> >> the first step of the full THP swap support.
> >> >> 
> >> >> >> > For example, if the nr_pages is larger than SWAPFILE_CLUSTER, we
> >> >> >> > can try to allocate new cluster. With that, we could allocate new
> >> >> >> > clusters to meet nr_pages requested or bail out if we fail to 
> >> >> >> > allocate
> >> >> >> > and fallback to 0-order page swapout. With that, swap layer could
> >> >> >> > support multiple order-0 pages by batch.
> >> >> >> >
> >> >> >> > IMO, I really want to land Tim Chen's batching swapout work first.
> >> >> >> > With Tim Chen's work, I expect we can make better refactoring
> >> >> >> > for batching swap before adding more confuse to the swap layer.
> >> >> >> > (I expect it would share several pieces of code for or would be 
> >> >> >> > base
> >> >> >> > for batching allocation of swapcache, swapslot)
> >> >> >> 
> >> >> >> I don't think there is hard conflict between normal pages swapping
> >> >> >> optimizing and THP swap optimizing.  Some code may be shared between
> >> >> >> them.  That is good for both sides.
> >> >> >> 
> >> >> >> > After that, we could enhance swap for big contiguous batching
> >> >> >> > like THP and finally we might make it be aware of THP specific to
> >> >> >> > enhance further.
> >> >> >> >
> >> >> >> > A thing I remember you aruged: you want to swapin 512 pages
> >> >> >> > all at once unconditionally. It's really worth to discuss if
> >> >> >> > your design is g

Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-17 Thread Huang, Ying
Minchan Kim  writes:

> On Tue, Sep 13, 2016 at 04:53:49PM +0800, Huang, Ying wrote:
>> Minchan Kim  writes:
>> > On Tue, Sep 13, 2016 at 02:40:00PM +0800, Huang, Ying wrote:
>> >> Minchan Kim  writes:
>> >> 
>> >> > Hi Huang,
>> >> >
>> >> > On Fri, Sep 09, 2016 at 01:35:12PM -0700, Huang, Ying wrote:
>> >> >
>> >> > < snip >
>> >> >
>> >> >> >> Recently, the performance of the storage devices improved so fast 
>> >> >> >> that
>> >> >> >> we cannot saturate the disk bandwidth when do page swap out even on 
>> >> >> >> a
>> >> >> >> high-end server machine.  Because the performance of the storage
>> >> >> >> device improved faster than that of CPU.  And it seems that the 
>> >> >> >> trend
>> >> >> >> will not change in the near future.  On the other hand, the THP
>> >> >> >> becomes more and more popular because of increased memory size.  So 
>> >> >> >> it
>> >> >> >> becomes necessary to optimize THP swap performance.
>> >> >> >> 
>> >> >> >> The advantages of the THP swap support include:
>> >> >> >> 
>> >> >> >> - Batch the swap operations for the THP to reduce lock
>> >> >> >>   acquiring/releasing, including allocating/freeing the swap space,
>> >> >> >>   adding/deleting to/from the swap cache, and writing/reading the 
>> >> >> >> swap
>> >> >> >>   space, etc.  This will help improve the performance of the THP 
>> >> >> >> swap.
>> >> >> >> 
>> >> >> >> - The THP swap space read/write will be 2M sequential IO.  It is
>> >> >> >>   particularly helpful for the swap read, which usually are 4k 
>> >> >> >> random
>> >> >> >>   IO.  This will improve the performance of the THP swap too.
>> >> >> >> 
>> >> >> >> - It will help the memory fragmentation, especially when the THP is
>> >> >> >>   heavily used by the applications.  The 2M continuous pages will be
>> >> >> >>   free up after THP swapping out.
>> >> >> >
>> >> >> > I just read patchset right now and still doubt why the all changes
>> >> >> > should be coupled with THP tightly. Many parts(e.g., you introduced
>> >> >> > or modifying existing functions for making them THP specific) could
>> >> >> > just take page_list and the number of pages then would handle them
>> >> >> > without THP awareness.
>> >> >> 
>> >> >> I am glad if my change could help normal pages swapping too.  And we 
>> >> >> can
>> >> >> change these functions to work for normal pages when necessary.
>> >> >
>> >> > Sure but it would be less painful that THP awareness swapout is
>> >> > based on multiple normal pages swapout. For exmaple, we don't
>> >> > touch delay THP split part(i.e., split a THP into 512 pages like
>> >> > as-is) and enhances swapout further like Tim's suggestion
>> >> > for mulitple normal pages swapout. With that, it might be enough
>> >> > for fast-storage without needing THP awareness.
>> >> >
>> >> > My *point* is let's approach step by step.
>> >> > First of all, go with batching normal pages swapout and if it's
>> >> > not enough, dive into further optimization like introducing
>> >> > THP-aware swapout.
>> >> >
>> >> > I believe it's natural development process to evolve things
>> >> > without over-engineering.
>> >> 
>> >> My target is not only the THP swap out acceleration, but also the full
>> >> THP swap out/in support without splitting THP.  This patchset is just
>> >> the first step of the full THP swap support.
>> >> 
>> >> >> > For example, if the nr_pages is larger than SWAPFILE_CLUSTER, we
>> >> >> > can try to allocate new cluster. With that, we could allocate new
>> >> >> > clusters to meet nr_pages requested or bail out if we fail to 
>> >> >> > allocate
>> >> >> > and fallback to 0-order page swapout. With that, swap layer could
>> >> >> > support multiple order-0 pages by batch.
>> >> >> >
>> >> >> > IMO, I really want to land Tim Chen's batching swapout work first.
>> >> >> > With Tim Chen's work, I expect we can make better refactoring
>> >> >> > for batching swap before adding more confuse to the swap layer.
>> >> >> > (I expect it would share several pieces of code for or would be base
>> >> >> > for batching allocation of swapcache, swapslot)
>> >> >> 
>> >> >> I don't think there is hard conflict between normal pages swapping
>> >> >> optimizing and THP swap optimizing.  Some code may be shared between
>> >> >> them.  That is good for both sides.
>> >> >> 
>> >> >> > After that, we could enhance swap for big contiguous batching
>> >> >> > like THP and finally we might make it be aware of THP specific to
>> >> >> > enhance further.
>> >> >> >
>> >> >> > A thing I remember you aruged: you want to swapin 512 pages
>> >> >> > all at once unconditionally. It's really worth to discuss if
>> >> >> > your design is going for the way.
>> >> >> > I doubt it's generally good idea. Because, currently, we try to
>> >> >> > swap in swapped out pages in THP page with conservative approach
>> >> >> > but your direction is going to opposite way.
>> >> >> >
>> >> >> > [mm, thp: convert from optimistic swapin collapsing to conservative]
>> >> >

RE: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-13 Thread Chen, Tim C
>>
>> - Avoid CPU time for splitting, collapsing THP across swap out/in.
>
>Yes, if you want, please give us how bad it is.
>

It could be pretty bad.  In an experiment with THP turned on where we
enter swap, 50% of the CPU is spent in the page compaction path.
So if we could deal with units of large pages for swap, the overhead of
splitting and compacting ordinary pages back into large pages could be
avoided.

   51.89%  51.89%  :1688  [kernel.kallsyms]  [k] pageblock_pfn_to_page
  |
  --- pageblock_pfn_to_page
 |  
 |--64.57%-- compaction_alloc
 |  migrate_pages
 |  compact_zone
 |  compact_zone_order
 |  try_to_compact_pages
 |  __alloc_pages_direct_compact
 |  __alloc_pages_nodemask
 |  alloc_pages_vma
 |  do_huge_pmd_anonymous_page
 |  handle_mm_fault
 |  __do_page_fault
 |  do_page_fault
 |  page_fault
 |  0x401d9a
 |  
 |--34.62%-- compact_zone
 |  compact_zone_order
 |  try_to_compact_pages
 |  __alloc_pages_direct_compact
 |  __alloc_pages_nodemask
 |  alloc_pages_vma
 |  do_huge_pmd_anonymous_page
 |  handle_mm_fault
 |  __do_page_fault
 |  do_page_fault
 |  page_fault
 |  0x401d9a
  --0.81%-- [...]

Tim


Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-13 Thread Andrea Arcangeli
Hello,

On Tue, Sep 13, 2016 at 04:53:49PM +0800, Huang, Ying wrote:
> I am glad to discuss my final goal, that is, swapping out/in the full
> THP without splitting.  Why I want to do that is copied as below,

I think that is a fine objective. It wasn't implemented initially just
to keep things simple.

Doing it will reduce swap fragmentation (provided we can find a
physically contiguous piece of swap space to swap the THP out to in the
first place), it will make all the other heuristics that try to keep the
swap space contiguous less relevant, and it should increase the swap
bandwidth significantly, at least on spindle disks. I personally see it
as a positive that we rely less on those and on the readahead swapin.

> >> >> The disadvantage are:
> >> >> 
> >> >> - Increase the memory pressure when swap in THP.

That is always true with THP enabled to "always". It is the tradeoff. It
still cannot use more RAM than userland ever allocated in the vma as
virtual memory. If userland doesn't ever need such memory, it can free it
by zapping the vma and the THP will be split. If the vma is zapped
while the THP is natively swapped out, the zapped portion of the swap
space shall be released as well. So ultimately userland always controls
the cap on the max virtual memory (RAM+swap) the kernel decides to use
with THP enabled to "always".

> I think it is important to use 2M pages as much as possible to deal with
> the big memory problem.  Do you agree?

I agree.

Thanks,
Andrea


Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-13 Thread Minchan Kim
On Tue, Sep 13, 2016 at 04:53:49PM +0800, Huang, Ying wrote:
> Minchan Kim  writes:
> > On Tue, Sep 13, 2016 at 02:40:00PM +0800, Huang, Ying wrote:
> >> Minchan Kim  writes:
> >> 
> >> > Hi Huang,
> >> >
> >> > On Fri, Sep 09, 2016 at 01:35:12PM -0700, Huang, Ying wrote:
> >> >
> >> > < snip >
> >> >
> >> >> >> Recently, the performance of the storage devices improved so fast 
> >> >> >> that
> >> >> >> we cannot saturate the disk bandwidth when do page swap out even on a
> >> >> >> high-end server machine.  Because the performance of the storage
> >> >> >> device improved faster than that of CPU.  And it seems that the trend
> >> >> >> will not change in the near future.  On the other hand, the THP
> >> >> >> becomes more and more popular because of increased memory size.  So 
> >> >> >> it
> >> >> >> becomes necessary to optimize THP swap performance.
> >> >> >> 
> >> >> >> The advantages of the THP swap support include:
> >> >> >> 
> >> >> >> - Batch the swap operations for the THP to reduce lock
> >> >> >>   acquiring/releasing, including allocating/freeing the swap space,
> >> >> >>   adding/deleting to/from the swap cache, and writing/reading the 
> >> >> >> swap
> >> >> >>   space, etc.  This will help improve the performance of the THP 
> >> >> >> swap.
> >> >> >> 
> >> >> >> - The THP swap space read/write will be 2M sequential IO.  It is
> >> >> >>   particularly helpful for the swap read, which usually are 4k random
> >> >> >>   IO.  This will improve the performance of the THP swap too.
> >> >> >> 
> >> >> >> - It will help the memory fragmentation, especially when the THP is
> >> >> >>   heavily used by the applications.  The 2M continuous pages will be
> >> >> >>   free up after THP swapping out.
> >> >> >
> >> >> > I just read patchset right now and still doubt why the all changes
> >> >> > should be coupled with THP tightly. Many parts(e.g., you introduced
> >> >> > or modifying existing functions for making them THP specific) could
> >> >> > just take page_list and the number of pages then would handle them
> >> >> > without THP awareness.
> >> >> 
> >> >> I am glad if my change could help normal pages swapping too.  And we can
> >> >> change these functions to work for normal pages when necessary.
> >> >
> >> > Sure but it would be less painful that THP awareness swapout is
> >> > based on multiple normal pages swapout. For exmaple, we don't
> >> > touch delay THP split part(i.e., split a THP into 512 pages like
> >> > as-is) and enhances swapout further like Tim's suggestion
> >> > for mulitple normal pages swapout. With that, it might be enough
> >> > for fast-storage without needing THP awareness.
> >> >
> >> > My *point* is let's approach step by step.
> >> > First of all, go with batching normal pages swapout and if it's
> >> > not enough, dive into further optimization like introducing
> >> > THP-aware swapout.
> >> >
> >> > I believe it's natural development process to evolve things
> >> > without over-engineering.
> >> 
> >> My target is not only the THP swap out acceleration, but also the full
> >> THP swap out/in support without splitting THP.  This patchset is just
> >> the first step of the full THP swap support.
> >> 
> >> >> > For example, if the nr_pages is larger than SWAPFILE_CLUSTER, we
> >> >> > can try to allocate new cluster. With that, we could allocate new
> >> >> > clusters to meet nr_pages requested or bail out if we fail to allocate
> >> >> > and fallback to 0-order page swapout. With that, swap layer could
> >> >> > support multiple order-0 pages by batch.
> >> >> >
> >> >> > IMO, I really want to land Tim Chen's batching swapout work first.
> >> >> > With Tim Chen's work, I expect we can make better refactoring
> >> >> > for batching swap before adding more confuse to the swap layer.
> >> >> > (I expect it would share several pieces of code for or would be base
> >> >> > for batching allocation of swapcache, swapslot)
> >> >> 
> >> >> I don't think there is hard conflict between normal pages swapping
> >> >> optimizing and THP swap optimizing.  Some code may be shared between
> >> >> them.  That is good for both sides.
> >> >> 
> >> >> > After that, we could enhance swap for big contiguous batching
> >> >> > like THP and finally we might make it be aware of THP specific to
> >> >> > enhance further.
> >> >> >
> >> >> > A thing I remember you aruged: you want to swapin 512 pages
> >> >> > all at once unconditionally. It's really worth to discuss if
> >> >> > your design is going for the way.
> >> >> > I doubt it's generally good idea. Because, currently, we try to
> >> >> > swap in swapped out pages in THP page with conservative approach
> >> >> > but your direction is going to opposite way.
> >> >> >
> >> >> > [mm, thp: convert from optimistic swapin collapsing to conservative]
> >> >> >
> >> >> > I think general approach(i.e., less effective than targeting
> >> >> > implement for your own specific goal but less hacky and better job
> >> >> > for many cases) i

Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-13 Thread Huang, Ying
Minchan Kim  writes:
> On Tue, Sep 13, 2016 at 02:40:00PM +0800, Huang, Ying wrote:
>> Minchan Kim  writes:
>> 
>> > Hi Huang,
>> >
>> > On Fri, Sep 09, 2016 at 01:35:12PM -0700, Huang, Ying wrote:
>> >
>> > < snip >
>> >
>> >> >> Recently, the performance of the storage devices improved so fast that
>> >> >> we cannot saturate the disk bandwidth when do page swap out even on a
>> >> >> high-end server machine.  Because the performance of the storage
>> >> >> device improved faster than that of CPU.  And it seems that the trend
>> >> >> will not change in the near future.  On the other hand, the THP
>> >> >> becomes more and more popular because of increased memory size.  So it
>> >> >> becomes necessary to optimize THP swap performance.
>> >> >> 
>> >> >> The advantages of the THP swap support include:
>> >> >> 
>> >> >> - Batch the swap operations for the THP to reduce lock
>> >> >>   acquiring/releasing, including allocating/freeing the swap space,
>> >> >>   adding/deleting to/from the swap cache, and writing/reading the swap
>> >> >>   space, etc.  This will help improve the performance of the THP swap.
>> >> >> 
>> >> >> - The THP swap space read/write will be 2M sequential IO.  It is
>> >> >>   particularly helpful for the swap read, which usually are 4k random
>> >> >>   IO.  This will improve the performance of the THP swap too.
>> >> >> 
>> >> >> - It will help the memory fragmentation, especially when the THP is
>> >> >>   heavily used by the applications.  The 2M continuous pages will be
>> >> >>   free up after THP swapping out.
>> >> >
>> >> > I just read patchset right now and still doubt why the all changes
>> >> > should be coupled with THP tightly. Many parts(e.g., you introduced
>> >> > or modifying existing functions for making them THP specific) could
>> >> > just take page_list and the number of pages then would handle them
>> >> > without THP awareness.
>> >> 
>> >> I am glad if my change could help normal pages swapping too.  And we can
>> >> change these functions to work for normal pages when necessary.
>> >
>> > Sure but it would be less painful that THP awareness swapout is
>> > based on multiple normal pages swapout. For exmaple, we don't
>> > touch delay THP split part(i.e., split a THP into 512 pages like
>> > as-is) and enhances swapout further like Tim's suggestion
>> > for mulitple normal pages swapout. With that, it might be enough
>> > for fast-storage without needing THP awareness.
>> >
>> > My *point* is let's approach step by step.
>> > First of all, go with batching normal pages swapout and if it's
>> > not enough, dive into further optimization like introducing
>> > THP-aware swapout.
>> >
>> > I believe it's natural development process to evolve things
>> > without over-engineering.
>> 
>> My target is not only the THP swap out acceleration, but also the full
>> THP swap out/in support without splitting THP.  This patchset is just
>> the first step of the full THP swap support.
>> 
>> >> > For example, if the nr_pages is larger than SWAPFILE_CLUSTER, we
>> >> > can try to allocate new cluster. With that, we could allocate new
>> >> > clusters to meet nr_pages requested or bail out if we fail to allocate
>> >> > and fallback to 0-order page swapout. With that, swap layer could
>> >> > support multiple order-0 pages by batch.
>> >> >
>> >> > IMO, I really want to land Tim Chen's batching swapout work first.
>> >> > With Tim Chen's work, I expect we can make better refactoring
>> >> > for batching swap before adding more confuse to the swap layer.
>> >> > (I expect it would share several pieces of code for or would be base
>> >> > for batching allocation of swapcache, swapslot)
>> >> 
>> >> I don't think there is hard conflict between normal pages swapping
>> >> optimizing and THP swap optimizing.  Some code may be shared between
>> >> them.  That is good for both sides.
>> >> 
>> >> > After that, we could enhance swap for big contiguous batching
>> >> > like THP and finally we might make it be aware of THP specific to
>> >> > enhance further.
>> >> >
>> >> > A thing I remember you aruged: you want to swapin 512 pages
>> >> > all at once unconditionally. It's really worth to discuss if
>> >> > your design is going for the way.
>> >> > I doubt it's generally good idea. Because, currently, we try to
>> >> > swap in swapped out pages in THP page with conservative approach
>> >> > but your direction is going to opposite way.
>> >> >
>> >> > [mm, thp: convert from optimistic swapin collapsing to conservative]
>> >> >
>> >> > I think general approach(i.e., less effective than targeting
>> >> > implement for your own specific goal but less hacky and better job
>> >> > for many cases) is to rely/improve on the swap readahead.
>> >> > If most of subpages of a THP page are really workingset, swap readahead
>> >> > could work well.
>> >> >
>> >> > Yeah, it's fairly vague feedback so sorry if I miss something clear.
>> >> 
>> >> Yes.  I want to go to the direction that to swap in

Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-13 Thread Minchan Kim
On Tue, Sep 13, 2016 at 02:40:00PM +0800, Huang, Ying wrote:
> Minchan Kim  writes:
> 
> > Hi Huang,
> >
> > On Fri, Sep 09, 2016 at 01:35:12PM -0700, Huang, Ying wrote:
> >
> > < snip >
> >
> >> >> Recently, the performance of the storage devices improved so fast that
> >> >> we cannot saturate the disk bandwidth when do page swap out even on a
> >> >> high-end server machine.  Because the performance of the storage
> >> >> device improved faster than that of CPU.  And it seems that the trend
> >> >> will not change in the near future.  On the other hand, the THP
> >> >> becomes more and more popular because of increased memory size.  So it
> >> >> becomes necessary to optimize THP swap performance.
> >> >> 
> >> >> The advantages of the THP swap support include:
> >> >> 
> >> >> - Batch the swap operations for the THP to reduce lock
> >> >>   acquiring/releasing, including allocating/freeing the swap space,
> >> >>   adding/deleting to/from the swap cache, and writing/reading the swap
> >> >>   space, etc.  This will help improve the performance of the THP swap.
> >> >> 
> >> >> - The THP swap space read/write will be 2M sequential IO.  It is
> >> >>   particularly helpful for the swap read, which is usually 4k random
> >> >>   IO.  This will improve the performance of the THP swap too.
> >> >>
> >> >> - It will help with memory fragmentation, especially when THP is
> >> >>   heavily used by the applications.  The 2M contiguous pages will be
> >> >>   freed up after the THP is swapped out.
> >> >
> >> > I just read the patchset right now and still doubt why all the changes
> >> > should be coupled so tightly with THP. Many parts (e.g., the functions
> >> > you introduced or modified to make them THP-specific) could just take a
> >> > page_list and the number of pages and then handle them
> >> > without THP awareness.
> >> 
> >> I would be glad if my changes could help normal page swapping too.  And we
> >> can change these functions to work for normal pages when necessary.
> >
> > Sure, but it would be less painful if THP-aware swapout were based
> > on multiple normal page swapout. For example, we don't
> > touch the delay-THP-split part (i.e., we keep splitting a THP into
> > 512 pages as-is) and enhance swapout further along the lines of Tim's
> > suggestion for multiple normal page swapout. With that, it might be
> > enough for fast storage without needing THP awareness.
> >
> > My *point* is: let's approach this step by step.
> > First of all, go with batching normal page swapout, and if that's
> > not enough, dive into further optimization like introducing
> > THP-aware swapout.
> >
> > I believe it's a natural development process to evolve things
> > without over-engineering.
> 
> My target is not only the THP swap out acceleration, but also the full
> THP swap out/in support without splitting THP.  This patchset is just
> the first step of the full THP swap support.
> 
> >> > For example, if nr_pages is larger than SWAPFILE_CLUSTER, we
> >> > can try to allocate a new cluster. With that, we could allocate new
> >> > clusters to meet the requested nr_pages, or bail out if the allocation
> >> > fails and fall back to 0-order page swapout. With that, the swap layer
> >> > could support multiple order-0 pages in a batch.
> >> >
> >> > IMO, I really want to land Tim Chen's batching swapout work first.
> >> > With Tim Chen's work, I expect we can do better refactoring
> >> > for batching swap before adding more confusion to the swap layer.
> >> > (I expect it would share several pieces of code with, or be the base
> >> > for, batching allocation of swap cache and swap slots.)
> >> 
> >> I don't think there is a hard conflict between optimizing normal page
> >> swapping and optimizing THP swapping.  Some code may be shared between
> >> them.  That is good for both sides.
> >> 
> >> > After that, we could enhance swap for big contiguous batching
> >> > like THP, and finally we might make it THP-aware to enhance it
> >> > further.
> >> >
> >> > A thing I remember you argued: you want to swap in 512 pages
> >> > all at once, unconditionally. It's really worth discussing whether
> >> > your design should go that way.
> >> > I doubt it's a generally good idea. Currently, we try to
> >> > swap in the swapped-out pages of a THP with a conservative approach,
> >> > but your direction goes the opposite way.
> >> >
> >> > [mm, thp: convert from optimistic swapin collapsing to conservative]
> >> >
> >> > I think the general approach (i.e., less effective than a targeted
> >> > implementation for your own specific goal, but less hacky and doing a
> >> > better job for many cases) is to rely on and improve the swap readahead.
> >> > If most of the subpages of a THP are really workingset, swap readahead
> >> > could work well.
> >> >
> >> > Yeah, it's fairly vague feedback, so sorry if I miss something clear.
> >> 
> >> Yes.  I want to go in the direction of swapping in 512 pages together.
> >> And I think it is a good opportunity to discuss that now.  The advantages
> >> of swapping in 

Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-12 Thread Huang, Ying
Minchan Kim  writes:

> Hi Huang,
>
> On Fri, Sep 09, 2016 at 01:35:12PM -0700, Huang, Ying wrote:
>
> < snip >
>
>> >> Recently, the performance of storage devices improved so fast that
>> >> we cannot saturate the disk bandwidth when doing page swap out, even on
>> >> a high-end server machine, because the performance of the storage
>> >> devices improved faster than that of the CPU.  And it seems that the
>> >> trend will not change in the near future.  On the other hand, THP
>> >> becomes more and more popular because of increased memory size.  So it
>> >> becomes necessary to optimize THP swap performance.
>> >> 
>> >> The advantages of the THP swap support include:
>> >> 
>> >> - Batch the swap operations for the THP to reduce lock
>> >>   acquiring/releasing, including allocating/freeing the swap space,
>> >>   adding/deleting to/from the swap cache, and writing/reading the swap
>> >>   space, etc.  This will help improve the performance of the THP swap.
>> >> 
>> >> - The THP swap space read/write will be 2M sequential IO.  It is
>> >>   particularly helpful for the swap read, which is usually 4k random
>> >>   IO.  This will improve the performance of the THP swap too.
>> >>
>> >> - It will help with memory fragmentation, especially when THP is
>> >>   heavily used by the applications.  The 2M contiguous pages will be
>> >>   freed up after the THP is swapped out.
>> >
>> > I just read the patchset right now and still doubt why all the changes
>> > should be coupled so tightly with THP. Many parts (e.g., the functions
>> > you introduced or modified to make them THP-specific) could just take a
>> > page_list and the number of pages and then handle them
>> > without THP awareness.
>> 
>> I would be glad if my changes could help normal page swapping too.  And we
>> can change these functions to work for normal pages when necessary.
>
> Sure, but it would be less painful if THP-aware swapout were based
> on multiple normal page swapout. For example, we don't
> touch the delay-THP-split part (i.e., we keep splitting a THP into
> 512 pages as-is) and enhance swapout further along the lines of Tim's
> suggestion for multiple normal page swapout. With that, it might be
> enough for fast storage without needing THP awareness.
>
> My *point* is: let's approach this step by step.
> First of all, go with batching normal page swapout, and if that's
> not enough, dive into further optimization like introducing
> THP-aware swapout.
>
> I believe it's a natural development process to evolve things
> without over-engineering.

My target is not only the THP swap out acceleration, but also the full
THP swap out/in support without splitting THP.  This patchset is just
the first step of the full THP swap support.

>> > For example, if nr_pages is larger than SWAPFILE_CLUSTER, we
>> > can try to allocate a new cluster. With that, we could allocate new
>> > clusters to meet the requested nr_pages, or bail out if the allocation
>> > fails and fall back to 0-order page swapout. With that, the swap layer
>> > could support multiple order-0 pages in a batch.
>> >
>> > IMO, I really want to land Tim Chen's batching swapout work first.
>> > With Tim Chen's work, I expect we can do better refactoring
>> > for batching swap before adding more confusion to the swap layer.
>> > (I expect it would share several pieces of code with, or be the base
>> > for, batching allocation of swap cache and swap slots.)
>> 
>> I don't think there is a hard conflict between optimizing normal page
>> swapping and optimizing THP swapping.  Some code may be shared between
>> them.  That is good for both sides.
>> 
>> > After that, we could enhance swap for big contiguous batching
>> > like THP, and finally we might make it THP-aware to enhance it
>> > further.
>> >
>> > A thing I remember you argued: you want to swap in 512 pages
>> > all at once, unconditionally. It's really worth discussing whether
>> > your design should go that way.
>> > I doubt it's a generally good idea. Currently, we try to
>> > swap in the swapped-out pages of a THP with a conservative approach,
>> > but your direction goes the opposite way.
>> >
>> > [mm, thp: convert from optimistic swapin collapsing to conservative]
>> >
>> > I think the general approach (i.e., less effective than a targeted
>> > implementation for your own specific goal, but less hacky and doing a
>> > better job for many cases) is to rely on and improve the swap readahead.
>> > If most of the subpages of a THP are really workingset, swap readahead
>> > could work well.
>> >
>> > Yeah, it's fairly vague feedback, so sorry if I miss something clear.
>> 
>> Yes.  I want to go in the direction of swapping in 512 pages together.
>> And I think it is a good opportunity to discuss that now.  The advantages
>> of swapping in 512 pages together are:
>> 
>> - Improve the performance of swap-in IO by turning a small read size
>>   into a big 512-page read size.
>>
>> - Keep THP across swap out/in.  As the memory size becomes larger and
>>   larger, 4k pages bring more and

Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-12 Thread Minchan Kim
Hi Huang,

On Fri, Sep 09, 2016 at 01:35:12PM -0700, Huang, Ying wrote:

< snip >

> >> Recently, the performance of storage devices improved so fast that
> >> we cannot saturate the disk bandwidth when doing page swap out, even on
> >> a high-end server machine, because the performance of the storage
> >> devices improved faster than that of the CPU.  And it seems that the
> >> trend will not change in the near future.  On the other hand, THP
> >> becomes more and more popular because of increased memory size.  So it
> >> becomes necessary to optimize THP swap performance.
> >> 
> >> The advantages of the THP swap support include:
> >> 
> >> - Batch the swap operations for the THP to reduce lock
> >>   acquiring/releasing, including allocating/freeing the swap space,
> >>   adding/deleting to/from the swap cache, and writing/reading the swap
> >>   space, etc.  This will help improve the performance of the THP swap.
> >> 
> >> - The THP swap space read/write will be 2M sequential IO.  It is
> >>   particularly helpful for the swap read, which is usually 4k random
> >>   IO.  This will improve the performance of the THP swap too.
> >>
> >> - It will help with memory fragmentation, especially when THP is
> >>   heavily used by the applications.  The 2M contiguous pages will be
> >>   freed up after the THP is swapped out.
> >
> > I just read the patchset right now and still doubt why all the changes
> > should be coupled so tightly with THP. Many parts (e.g., the functions
> > you introduced or modified to make them THP-specific) could just take a
> > page_list and the number of pages and then handle them
> > without THP awareness.
> 
> I would be glad if my changes could help normal page swapping too.  And we
> can change these functions to work for normal pages when necessary.

Sure, but it would be less painful if THP-aware swapout were based
on multiple normal page swapout. For example, we don't touch the
delay-THP-split part (i.e., we keep splitting a THP into 512 pages
as-is) and enhance swapout further along the lines of Tim's suggestion
for multiple normal page swapout. With that, it might be enough
for fast storage without needing THP awareness.
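
A rough sketch of that direction, with invented helper names
(swap_out_one(), swap_out_batch()): the THP is still split as it is
today, and the gain comes from handing the resulting order-0 pages to
the swapout path as one batch rather than teaching that path about THP.

#include <stdio.h>
#include <stddef.h>

#define HPAGE_PAGES 512

struct page { long pfn; };

static long next_slot;                  /* pretend swap-slot cursor */

/* As-is path: one slot allocation and one 4k write per subpage. */
static void swap_out_one(struct page *p)
{
    long slot = next_slot++;            /* take lock, allocate, drop lock */
    (void)p; (void)slot;                /* ... then submit one 4k write   */
}

/* Suggested path: the same order-0 pages, handled as a single batch. */
static void swap_out_batch(struct page *pages, size_t nr)
{
    long first = next_slot;             /* reserve nr slots in one pass */
    next_slot += nr;
    (void)pages;
    printf("batched %zu pages into slots %ld..%ld\n",
           nr, first, first + (long)nr - 1);
}

int main(void)
{
    struct page subpages[HPAGE_PAGES];  /* output of the usual THP split */

    for (size_t i = 0; i < HPAGE_PAGES; i++)
        subpages[i].pfn = (long)i;

    for (size_t i = 0; i < HPAGE_PAGES; i++)    /* today: 512 separate swapouts */
        swap_out_one(&subpages[i]);

    swap_out_batch(subpages, HPAGE_PAGES);      /* proposed: one batched call */
    return 0;
}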

My *point* is: let's approach this step by step.
First of all, go with batching normal page swapout, and if that's
not enough, dive into further optimization like introducing
THP-aware swapout.

I believe it's a natural development process to evolve things
without over-engineering.

> 
> > For example, if nr_pages is larger than SWAPFILE_CLUSTER, we
> > can try to allocate a new cluster. With that, we could allocate new
> > clusters to meet the requested nr_pages, or bail out if the allocation
> > fails and fall back to 0-order page swapout. With that, the swap layer
> > could support multiple order-0 pages in a batch.
> >
> > IMO, I really want to land Tim Chen's batching swapout work first.
> > With Tim Chen's work, I expect we can do better refactoring
> > for batching swap before adding more confusion to the swap layer.
> > (I expect it would share several pieces of code with, or be the base
> > for, batching allocation of swap cache and swap slots.)
> 
> I don't think there is a hard conflict between optimizing normal page
> swapping and optimizing THP swapping.  Some code may be shared between
> them.  That is good for both sides.
> 
> > After that, we could enhance swap for big contiguous batching
> > like THP, and finally we might make it THP-aware to enhance it
> > further.
> >
> > A thing I remember you argued: you want to swap in 512 pages
> > all at once, unconditionally. It's really worth discussing whether
> > your design should go that way.
> > I doubt it's a generally good idea. Currently, we try to
> > swap in the swapped-out pages of a THP with a conservative approach,
> > but your direction goes the opposite way.
> >
> > [mm, thp: convert from optimistic swapin collapsing to conservative]
> >
> > I think the general approach (i.e., less effective than a targeted
> > implementation for your own specific goal, but less hacky and doing a
> > better job for many cases) is to rely on and improve the swap readahead.
> > If most of the subpages of a THP are really workingset, swap readahead
> > could work well.
> >
> > Yeah, it's fairly vague feedback, so sorry if I miss something clear.
> 
> Yes.  I want to go in the direction of swapping in 512 pages together.
> And I think it is a good opportunity to discuss that now.  The advantages
> of swapping in 512 pages together are:
> 
> - Improve the performance of swap-in IO by turning a small read size
>   into a big 512-page read size.
>
> - Keep THP across swap out/in.  As the memory size becomes larger and
>   larger, 4k pages bring more and more burden to memory management.
>   One solution is to use 2M pages as much as possible, which will
>   greatly reduce the management burden, for example through a much
>   shorter LRU list.
>
> The disadvantages are:
>
> - Increased memory pressure when swapping in a THP.
>
> - Some pages swapped in may not be needed

Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-09 Thread Huang, Ying
Hi, Minchan,

Minchan Kim  writes:
> Hi Huang,
>
> On Wed, Sep 07, 2016 at 09:45:59AM -0700, Huang, Ying wrote:
>> From: Huang Ying 
>> 
>> This patchset is to optimize the performance of Transparent Huge Page
>> (THP) swap.
>> 
>> Hi, Andrew, could you help me to check whether the overall design is
>> reasonable?
>> 
>> Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
>> swap part of the patchset?  Especially [01/10], [04/10], [05/10],
>> [06/10], [07/10], [10/10].
>> 
>> Hi, Andrea and Kirill, could you help me to review the THP part of the
>> patchset?  Especially [02/10], [03/10], [09/10] and [10/10].
>> 
>> Hi, Johannes, Michal and Vladimir, I am not very confident about the
>> memory cgroup part, especially [02/10] and [03/10].  Could you help me
>> to review it?
>> 
>> And for all, any comment is welcome!
>> 
>> 
>> Recently, the performance of storage devices improved so fast that
>> we cannot saturate the disk bandwidth when doing page swap out, even on
>> a high-end server machine, because the performance of the storage
>> devices improved faster than that of the CPU.  And it seems that the
>> trend will not change in the near future.  On the other hand, THP
>> becomes more and more popular because of increased memory size.  So it
>> becomes necessary to optimize THP swap performance.
>> 
>> The advantages of the THP swap support include:
>> 
>> - Batch the swap operations for the THP to reduce lock
>>   acquiring/releasing, including allocating/freeing the swap space,
>>   adding/deleting to/from the swap cache, and writing/reading the swap
>>   space, etc.  This will help improve the performance of the THP swap.
>> 
>> - The THP swap space read/write will be 2M sequential IO.  It is
>>   particularly helpful for the swap read, which is usually 4k random
>>   IO.  This will improve the performance of the THP swap too.
>>
>> - It will help with memory fragmentation, especially when THP is
>>   heavily used by the applications.  The 2M contiguous pages will be
>>   freed up after the THP is swapped out.
>
> I just read the patchset right now and still doubt why all the changes
> should be coupled so tightly with THP. Many parts (e.g., the functions
> you introduced or modified to make them THP-specific) could just take a
> page_list and the number of pages and then handle them
> without THP awareness.

I would be glad if my changes could help normal page swapping too.  And we
can change these functions to work for normal pages when necessary.

> For example, if nr_pages is larger than SWAPFILE_CLUSTER, we
> can try to allocate a new cluster. With that, we could allocate new
> clusters to meet the requested nr_pages, or bail out if the allocation
> fails and fall back to 0-order page swapout. With that, the swap layer
> could support multiple order-0 pages in a batch.
>
> IMO, I really want to land Tim Chen's batching swapout work first.
> With Tim Chen's work, I expect we can do better refactoring
> for batching swap before adding more confusion to the swap layer.
> (I expect it would share several pieces of code with, or be the base
> for, batching allocation of swap cache and swap slots.)

I don't think there is a hard conflict between optimizing normal page
swapping and optimizing THP swapping.  Some code may be shared between
them.  That is good for both sides.

> After that, we could enhance swap for big contiguous batching
> like THP, and finally we might make it THP-aware to enhance it
> further.
>
> A thing I remember you argued: you want to swap in 512 pages
> all at once, unconditionally. It's really worth discussing whether
> your design should go that way.
> I doubt it's a generally good idea. Currently, we try to
> swap in the swapped-out pages of a THP with a conservative approach,
> but your direction goes the opposite way.
>
> [mm, thp: convert from optimistic swapin collapsing to conservative]
>
> I think the general approach (i.e., less effective than a targeted
> implementation for your own specific goal, but less hacky and doing a
> better job for many cases) is to rely on and improve the swap readahead.
> If most of the subpages of a THP are really workingset, swap readahead
> could work well.
>
> Yeah, it's fairly vague feedback, so sorry if I miss something clear.

Yes.  I want to go in the direction of swapping in 512 pages together.
And I think it is a good opportunity to discuss that now.  The advantages
of swapping in 512 pages together are:

- Improve the performance of swap-in IO by turning a small read size
  into a big 512-page read size (see the sketch after this message).

- Keep THP across swap out/in.  As the memory size becomes larger and
  larger, 4k pages bring more and more burden to memory management.
  One solution is to use 2M pages as much as possible, which will
  greatly reduce the management burden, for example through a much
  shorter LRU list.

The disadvantages are:

- Increased memory pressure when swapping in a THP.

- Some pages swapped in may not be needed in the near future.

Because of the disadvan
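
To make the swap-in IO advantage above concrete, here is a small
user-space sketch of the two read patterns, using an ordinary file
("swapfile.img", a made-up name) as a stand-in for the swap device; the
random offsets only mimic how 4k swap slots are typically scattered.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define PAGE_SIZE   4096
#define HPAGE_PAGES 512                 /* 512 * 4k == 2M */

int main(void)
{
    char *buf = malloc(HPAGE_PAGES * PAGE_SIZE);
    int fd = open("swapfile.img", O_RDONLY);

    if (fd < 0 || !buf)
        return 1;

    /* 4k swap-in: 512 separate reads, typically at random offsets. */
    for (int i = 0; i < HPAGE_PAGES; i++) {
        off_t slot = random() % 1000000;        /* pretend swap slot */
        pread(fd, buf + i * PAGE_SIZE, PAGE_SIZE, slot * PAGE_SIZE);
    }

    /* THP swap-in: the 512 slots are contiguous, so one 2M read. */
    pread(fd, buf, HPAGE_PAGES * PAGE_SIZE, 0);

    close(fd);
    free(buf);
    return 0;
}

The second pattern issues one large request instead of 512 small ones,
which is the pattern that can actually keep a fast SSD busy.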

Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-09 Thread Tim Chen
On Fri, 2016-09-09 at 14:43 +0900, Minchan Kim wrote:
> Hi Huang,
> 
> On Wed, Sep 07, 2016 at 09:45:59AM -0700, Huang, Ying wrote:
> > 
> > From: Huang Ying 
> > 
> > This patchset is to optimize the performance of Transparent Huge Page
> > (THP) swap.
> > 
> > Hi, Andrew, could you help me to check whether the overall design is
> > reasonable?
> > 
> > Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
> > swap part of the patchset?  Especially [01/10], [04/10], [05/10],
> > [06/10], [07/10], [10/10].
> > 
> > Hi, Andrea and Kirill, could you help me to review the THP part of the
> > patchset?  Especially [02/10], [03/10], [09/10] and [10/10].
> > 
> > Hi, Johannes, Michal and Vladimir, I am not very confident about the
> > memory cgroup part, especially [02/10] and [03/10].  Could you help me
> > to review it?
> > 
> > And for all, any comment is welcome!
> > 
> > 
> > Recently, the performance of storage devices improved so fast that
> > we cannot saturate the disk bandwidth when doing page swap out, even on
> > a high-end server machine, because the performance of the storage
> > devices improved faster than that of the CPU.  And it seems that the
> > trend will not change in the near future.  On the other hand, THP
> > becomes more and more popular because of increased memory size.  So it
> > becomes necessary to optimize THP swap performance.
> > 
> > The advantages of the THP swap support include:
> > 
> > - Batch the swap operations for the THP to reduce lock
> >   acquiring/releasing, including allocating/freeing the swap space,
> >   adding/deleting to/from the swap cache, and writing/reading the swap
> >   space, etc.  This will help improve the performance of the THP swap.
> > 
> > - The THP swap space read/write will be 2M sequential IO.  It is
> >   particularly helpful for the swap read, which is usually 4k random
> >   IO.  This will improve the performance of the THP swap too.
> >
> > - It will help with memory fragmentation, especially when THP is
> >   heavily used by the applications.  The 2M contiguous pages will be
> >   freed up after the THP is swapped out.
> I just read the patchset right now and still doubt why all the changes
> should be coupled so tightly with THP. Many parts (e.g., the functions
> you introduced or modified to make them THP-specific) could just take a
> page_list and the number of pages and then handle them
> without THP awareness.
> 
> For example, if nr_pages is larger than SWAPFILE_CLUSTER, we
> can try to allocate a new cluster. With that, we could allocate new
> clusters to meet the requested nr_pages, or bail out if the allocation
> fails and fall back to 0-order page swapout. With that, the swap layer
> could support multiple order-0 pages in a batch.
>
> IMO, I really want to land Tim Chen's batching swapout work first.
> With Tim Chen's work, I expect we can do better refactoring
> for batching swap before adding more confusion to the swap layer.
> (I expect it would share several pieces of code with, or be the base
> for, batching allocation of swap cache and swap slots.)

Minchan,

Ying and I do plan to send out a new patch series on batching swapout
and swapin, plus a few other optimizations for the swapping of
regular-sized pages.

Hopefully we'll be able to do that soon, after we fix up a few
things and retest.

Tim



Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out

2016-09-08 Thread Minchan Kim
Hi Huang,

On Wed, Sep 07, 2016 at 09:45:59AM -0700, Huang, Ying wrote:
> From: Huang Ying 
> 
> This patchset is to optimize the performance of Transparent Huge Page
> (THP) swap.
> 
> Hi, Andrew, could you help me to check whether the overall design is
> reasonable?
> 
> Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
> swap part of the patchset?  Especially [01/10], [04/10], [05/10],
> [06/10], [07/10], [10/10].
> 
> Hi, Andrea and Kirill, could you help me to review the THP part of the
> patchset?  Especially [02/10], [03/10], [09/10] and [10/10].
> 
> Hi, Johannes, Michal and Vladimir, I am not very confident about the
> memory cgroup part, especially [02/10] and [03/10].  Could you help me
> to review it?
> 
> And for all, any comment is welcome!
> 
> 
> Recently, the performance of storage devices improved so fast that
> we cannot saturate the disk bandwidth when doing page swap out, even on
> a high-end server machine, because the performance of the storage
> devices improved faster than that of the CPU.  And it seems that the
> trend will not change in the near future.  On the other hand, THP
> becomes more and more popular because of increased memory size.  So it
> becomes necessary to optimize THP swap performance.
> 
> The advantages of the THP swap support include:
> 
> - Batch the swap operations for the THP to reduce lock
>   acquiring/releasing, including allocating/freeing the swap space,
>   adding/deleting to/from the swap cache, and writing/reading the swap
>   space, etc.  This will help improve the performance of the THP swap.
> 
> - The THP swap space read/write will be 2M sequential IO.  It is
>   particularly helpful for the swap read, which is usually 4k random
>   IO.  This will improve the performance of the THP swap too.
>
> - It will help with memory fragmentation, especially when THP is
>   heavily used by the applications.  The 2M contiguous pages will be
>   freed up after the THP is swapped out.
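
As a rough illustration of the first bullet above, here is a toy model
of the lock batching, with a pthread mutex standing in for the swap
device lock; nothing below is kernel API, it only shows one lock
round-trip per THP instead of one per subpage.

#include <pthread.h>

#define HPAGE_PAGES 512

static pthread_mutex_t swap_lock = PTHREAD_MUTEX_INITIALIZER;
static long next_free_slot;

/* Per-page path: 512 lock/unlock pairs for one 2M THP. */
static void alloc_slots_per_page(long *slots)
{
    for (int i = 0; i < HPAGE_PAGES; i++) {
        pthread_mutex_lock(&swap_lock);
        slots[i] = next_free_slot++;
        pthread_mutex_unlock(&swap_lock);
    }
}

/* Batched path: one lock/unlock pair covers all 512 allocations. */
static void alloc_slots_batched(long *slots)
{
    pthread_mutex_lock(&swap_lock);
    for (int i = 0; i < HPAGE_PAGES; i++)
        slots[i] = next_free_slot++;
    pthread_mutex_unlock(&swap_lock);
}

int main(void)
{
    long slots[HPAGE_PAGES];

    alloc_slots_per_page(slots);
    alloc_slots_batched(slots);
    return 0;
}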

I just read the patchset right now and still doubt why all the changes
should be coupled so tightly with THP. Many parts (e.g., the functions
you introduced or modified to make them THP-specific) could just take a
page_list and the number of pages and then handle them
without THP awareness.

For example, if nr_pages is larger than SWAPFILE_CLUSTER, we
can try to allocate a new cluster. With that, we could allocate new
clusters to meet the requested nr_pages, or bail out if the allocation
fails and fall back to 0-order page swapout. With that, the swap layer
could support multiple order-0 pages in a batch.
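
A minimal user-space sketch of that allocation policy, assuming a flat
swap_map[] grouped into 512-slot clusters; get_swap_slots(),
first_free_cluster() and the layout are illustrative stand-ins, not the
real swapfile.c interfaces.

#include <stdbool.h>
#include <stdio.h>

#define SWAPFILE_CLUSTER 512
#define NR_CLUSTERS      8

static unsigned char swap_map[NR_CLUSTERS * SWAPFILE_CLUSTER];  /* 0 == free */

/* Find a cluster whose 512 slots are all free, or return -1. */
static int first_free_cluster(void)
{
    for (int c = 0; c < NR_CLUSTERS; c++) {
        bool all_free = true;
        for (int i = 0; i < SWAPFILE_CLUSTER; i++)
            if (swap_map[c * SWAPFILE_CLUSTER + i])
                all_free = false;
        if (all_free)
            return c;
    }
    return -1;
}

/*
 * Allocate nr slots.  A whole-cluster request is served from one free
 * cluster in a single pass; otherwise, or if no free cluster is left,
 * fall back to scattered 0-order slots.  Returns how many slots were
 * written to out[].
 */
static int get_swap_slots(int nr, int *out)
{
    int got = 0;

    if (nr == SWAPFILE_CLUSTER) {
        int c = first_free_cluster();
        if (c >= 0) {
            for (int i = 0; i < SWAPFILE_CLUSTER; i++) {
                swap_map[c * SWAPFILE_CLUSTER + i] = 1;
                out[got++] = c * SWAPFILE_CLUSTER + i;
            }
            return got;         /* one contiguous 2M-worth of slots */
        }
        /* no free cluster: the caller would split the THP and retry */
    }
    for (int off = 0; off < NR_CLUSTERS * SWAPFILE_CLUSTER && got < nr; off++)
        if (!swap_map[off]) {
            swap_map[off] = 1;
            out[got++] = off;
        }
    return got;
}

int main(void)
{
    int slots[SWAPFILE_CLUSTER];
    int got = get_swap_slots(SWAPFILE_CLUSTER, slots);

    printf("allocated %d slots starting at offset %d\n", got, slots[0]);
    return 0;
}

If first_free_cluster() fails, the fallback keeps today's behaviour of
scattering 0-order slots, which is exactly the bail-out path described
above.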

IMO, I really want to land Tim Chen's batching swapout work first.
With Tim Chen's work, I expect we can do better refactoring
for batching swap before adding more confusion to the swap layer.
(I expect it would share several pieces of code with, or be the base
for, batching allocation of swap cache and swap slots.)

After that, we could enhance swap for big contiguous batching
like THP, and finally we might make it THP-aware to enhance it
further.

A thing I remember you argued: you want to swap in 512 pages
all at once, unconditionally. It's really worth discussing whether
your design should go that way.
I doubt it's a generally good idea. Currently, we try to
swap in the swapped-out pages of a THP with a conservative approach,
but your direction goes the opposite way.

[mm, thp: convert from optimistic swapin collapsing to conservative]

I think the general approach (i.e., less effective than a targeted
implementation for your own specific goal, but less hacky and doing a
better job for many cases) is to rely on and improve the swap readahead.
If most of the subpages of a THP are really workingset, swap readahead
could work well.
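
For reference, the existing swap readahead window is driven by
/proc/sys/vm/page_cluster: up to 2^page_cluster pages around the
faulting entry are read.  A much simplified model of that windowing
(the real swapin readahead also shrinks the window when readahead pages
go unused, and disables itself when page_cluster is 0):

#include <stdio.h>

static unsigned long page_cluster = 3;  /* /proc/sys/vm/page_cluster default */

static void swap_readahead_window(unsigned long offset,
                                  unsigned long *start, unsigned long *end)
{
    unsigned long mask = (1UL << page_cluster) - 1;

    *start = offset & ~mask;    /* aligned start of the window */
    *end   = offset | mask;     /* last slot in the window */
}

int main(void)
{
    unsigned long start, end;

    swap_readahead_window(1027, &start, &end);
    printf("fault at slot 1027 -> read slots %lu..%lu (%lu pages)\n",
           start, end, end - start + 1);
    return 0;
}

With the default page_cluster of 3 this reads at most 8 pages, far from
the 512 pages that a full THP swap-in would pull in.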

Yeah, it's fairly vague feedback, so sorry if I miss something clear.