Re: [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead

2017-09-14 Thread Huang, Ying
Minchan Kim  writes:

> On Fri, Sep 15, 2017 at 11:15:08AM +0800, Huang, Ying wrote:
>> Minchan Kim  writes:
>> 
>> > On Thu, Sep 14, 2017 at 08:01:30PM +0800, Huang, Ying wrote:
>> >> Minchan Kim  writes:
>> >> 
>> >> > On Wed, Sep 13, 2017 at 02:02:29PM -0700, Andrew Morton wrote:
>> >> >> On Wed, 13 Sep 2017 10:40:19 +0900 Minchan Kim  wrote:
>> >> >> 
>> >> >> > Every zram user, such as a low-end Android device, has used a
>> >> >> > page-cluster of 0 to disable swap readahead, because zram has no
>> >> >> > seek cost and works as a synchronous IO operation, so if we read
>> >> >> > ahead multiple pages, the swap fault latency would be (4K *
>> >> >> > readahead window size). IOW, readahead is meaningful only if it
>> >> >> > doesn't hurt the faulted page's latency.
>> >> >> >
>> >> >> > However, this patch introduces an additional knob,
>> >> >> > /sys/kernel/mm/swap/vma_ra_max_order, alongside page-cluster. It
>> >> >> > means the setups of existing users who have disabled swap readahead
>> >> >> > no longer work until those users become aware of the new knob and
>> >> >> > modify their scripts/code to disable vma_ra_max_order as well as
>> >> >> > page-cluster.
>> >> >> >
>> >> >> > I say it's a *regression* and wanted to fix it, but Huang's opinion
>> >> >> > is that it's not a functional regression, so userspace should fix
>> >> >> > itself.
>> >> >> > Please look at the details of the discussion in
>> >> >> > http://lkml.kernel.org/r/%3c1505183833-4739-4-git-send-email-minc...@kernel.org%3E
>> >> >> 
>> >> >> hm, tricky problem.  I do agree that linking the physical and virtual
>> >> >> readahead schemes in the proposed fashion is unfortunate.  I also agree
>> >> >> that breaking existing setups (a bit) is also unfortunate.
>> >> >> 
>> >> >> Would it help if, when page-cluster is written to zero, we do
>> >> >> 
>> >> >> printk_once("physical readahead disabled, virtual readahead still
>> >> >> enabled.  Disable virtual readahead via
>> >> >> /sys/kernel/mm/swap/vma_ra_max_order").
>> >> >> 
>> >> >> Or something like that.  It's pretty lame, but it should help alert the
>> >> >> zram-readahead-disabling people to the issue?
>> >> >
>> >> > It was my last resort. If we cannot find other ways after all, yes, it
>> >> > would be the minimum we should do. But it still breaks users who
>> >> > don't/can't read the alert and modify their programs.
>> >> >
>> >> > How about this?
>> >> >
>> >> > Can't we make vma-based readahead a config option?
>> >> > With that, users who have no interest in readahead don't enable
>> >> > vma-based readahead. In this case, page-cluster works as expected
>> >> > ("disable readahead completely"), so it doesn't break anything.
>> >> 
>> >> Now, users can choose between VMA based readahead and the original
>> >> readahead at runtime via the following knob:
>> >> 
>> >> /sys/kernel/mm/swap/vma_ra_enabled
>> >
>> > It's not a config option and it is enabled by default. IOW, it's under
>> > the radar, so current users cannot notice it. That's why we want to emit
>> > a big fat warning when an old user sets page-cluster to 0. However, as
>> > Andrew said, it's lame.
>> >
>> > If we make it a config option, product makers and kernel-upgrading users
>> > have a chance to notice and read the description, so they can become
>> > aware of the two weird knobs and solve the problem in advance without
>> > the printk_once warning. If a user has no interest in swap readahead, or
>> > skips the new config option by mistake, physical readahead works as
>> > before, which means no regression.
>> 
>> I am OK with making it a config option.  But I think VMA based swap
>> readahead should be enabled by default, because per my understanding the
>> default should be set for the most common desktop users, and VMA based
>> swap readahead should benefit them.  The people who need to turn off swap
>> readahead are some special users, and the original swap readahead default
>> configuration isn't for them either.
>
> Okay. I don't care which one is the default if it is a config option.
> It still gives users a chance to notice the new algorithm so they can
> decide. It is absolutely better than a silent regression and the printk
> trick. Please add more description of the two parallel readahead
> algorithms somewhere (e.g., vm.txt) so users can understand the situation
> exactly and handle both tunable knobs at the same time.

Sure.

Best Regards,
Huang, Ying


Re: [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead

2017-09-14 Thread Minchan Kim
On Fri, Sep 15, 2017 at 11:15:08AM +0800, Huang, Ying wrote:
> Minchan Kim  writes:
> 
> > On Thu, Sep 14, 2017 at 08:01:30PM +0800, Huang, Ying wrote:
> >> Minchan Kim  writes:
> >> 
> >> > On Wed, Sep 13, 2017 at 02:02:29PM -0700, Andrew Morton wrote:
> >> >> On Wed, 13 Sep 2017 10:40:19 +0900 Minchan Kim  wrote:
> >> >> 
> >> >> > Every zram user, such as a low-end Android device, has used a
> >> >> > page-cluster of 0 to disable swap readahead, because zram has no
> >> >> > seek cost and works as a synchronous IO operation, so if we read
> >> >> > ahead multiple pages, the swap fault latency would be (4K *
> >> >> > readahead window size). IOW, readahead is meaningful only if it
> >> >> > doesn't hurt the faulted page's latency.
> >> >> >
> >> >> > However, this patch introduces an additional knob,
> >> >> > /sys/kernel/mm/swap/vma_ra_max_order, alongside page-cluster. It
> >> >> > means the setups of existing users who have disabled swap readahead
> >> >> > no longer work until those users become aware of the new knob and
> >> >> > modify their scripts/code to disable vma_ra_max_order as well as
> >> >> > page-cluster.
> >> >> >
> >> >> > I say it's a *regression* and wanted to fix it, but Huang's opinion
> >> >> > is that it's not a functional regression, so userspace should fix
> >> >> > itself.
> >> >> > Please look at the details of the discussion in
> >> >> > http://lkml.kernel.org/r/%3c1505183833-4739-4-git-send-email-minc...@kernel.org%3E
> >> >> 
> >> >> hm, tricky problem.  I do agree that linking the physical and virtual
> >> >> readahead schemes in the proposed fashion is unfortunate.  I also agree
> >> >> that breaking existing setups (a bit) is also unfortunate.
> >> >> 
> >> >> Would it help if, when page-cluster is written to zero, we do
> >> >> 
> >> >> printk_once("physical readahead disabled, virtual readahead still
> >> >> enabled.  Disable virtual readahead via
> >> >> /sys/kernel/mm/swap/vma_ra_max_order").
> >> >> 
> >> >> Or something like that.  It's pretty lame, but it should help alert the
> >> >> zram-readahead-disabling people to the issue?
> >> >
> >> > It was my last resort. If we cannot find other ways after all, yes, it
> >> > would be the minimum we should do. But it still breaks users who
> >> > don't/can't read the alert and modify their programs.
> >> >
> >> > How about this?
> >> >
> >> > Can't we make vma-based readahead a config option?
> >> > With that, users who have no interest in readahead don't enable
> >> > vma-based readahead. In this case, page-cluster works as expected
> >> > ("disable readahead completely"), so it doesn't break anything.
> >> 
> >> Now, users can choose between VMA based readahead and the original
> >> readahead at runtime via the following knob:
> >> 
> >> /sys/kernel/mm/swap/vma_ra_enabled
> >
> > It's not a config option and it is enabled by default. IOW, it's under
> > the radar, so current users cannot notice it. That's why we want to emit
> > a big fat warning when an old user sets page-cluster to 0. However, as
> > Andrew said, it's lame.
> >
> > If we make it a config option, product makers and kernel-upgrading users
> > have a chance to notice and read the description, so they can become
> > aware of the two weird knobs and solve the problem in advance without
> > the printk_once warning. If a user has no interest in swap readahead, or
> > skips the new config option by mistake, physical readahead works as
> > before, which means no regression.
> 
> I am OK with making it a config option.  But I think VMA based swap
> readahead should be enabled by default, because per my understanding the
> default should be set for the most common desktop users, and VMA based
> swap readahead should benefit them.  The people who need to turn off swap
> readahead are some special users, and the original swap readahead default
> configuration isn't for them either.

Okay. I don't care which one is the default if it is a config option.
It still gives users a chance to notice the new algorithm so they can
decide. It is absolutely better than a silent regression and the printk
trick. Please add more description of the two parallel readahead algorithms
somewhere (e.g., vm.txt) so users can understand the situation exactly and
handle both tunable knobs at the same time.


Re: [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead

2017-09-14 Thread Huang, Ying
Minchan Kim  writes:

> On Thu, Sep 14, 2017 at 08:01:30PM +0800, Huang, Ying wrote:
>> Minchan Kim  writes:
>> 
>> > On Wed, Sep 13, 2017 at 02:02:29PM -0700, Andrew Morton wrote:
>> >> On Wed, 13 Sep 2017 10:40:19 +0900 Minchan Kim  wrote:
>> >> 
>> >> > Every zram user, such as a low-end Android device, has used a
>> >> > page-cluster of 0 to disable swap readahead, because zram has no seek
>> >> > cost and works as a synchronous IO operation, so if we read ahead
>> >> > multiple pages, the swap fault latency would be (4K * readahead
>> >> > window size). IOW, readahead is meaningful only if it doesn't hurt
>> >> > the faulted page's latency.
>> >> >
>> >> > However, this patch introduces an additional knob,
>> >> > /sys/kernel/mm/swap/vma_ra_max_order, alongside page-cluster. It
>> >> > means the setups of existing users who have disabled swap readahead
>> >> > no longer work until those users become aware of the new knob and
>> >> > modify their scripts/code to disable vma_ra_max_order as well as
>> >> > page-cluster.
>> >> >
>> >> > I say it's a *regression* and wanted to fix it, but Huang's opinion
>> >> > is that it's not a functional regression, so userspace should fix
>> >> > itself.
>> >> > Please look at the details of the discussion in
>> >> > http://lkml.kernel.org/r/%3c1505183833-4739-4-git-send-email-minc...@kernel.org%3E
>> >> 
>> >> hm, tricky problem.  I do agree that linking the physical and virtual
>> >> readahead schemes in the proposed fashion is unfortunate.  I also agree
>> >> that breaking existing setups (a bit) is also unfortunate.
>> >> 
>> >> Would it help if, when page-cluster is written to zero, we do
>> >> 
>> >> printk_once("physical readahead disabled, virtual readahead still
>> >> enabled.  Disable virtual readahead via
>> >> /sys/kernel/mm/swap/vma_ra_max_order").
>> >> 
>> >> Or something like that.  It's pretty lame, but it should help alert the
>> >> zram-readahead-disabling people to the issue?
>> >
>> > It was my last resort. If we cannot find other ways after all, yes, it
>> > would be the minimum we should do. But it still breaks users who
>> > don't/can't read the alert and modify their programs.
>> >
>> > How about this?
>> >
>> > Can't we make vma-based readahead a config option?
>> > With that, users who have no interest in readahead don't enable
>> > vma-based readahead. In this case, page-cluster works as expected
>> > ("disable readahead completely"), so it doesn't break anything.
>> 
>> Now, users can choose between VMA based readahead and the original
>> readahead at runtime via the following knob:
>> 
>> /sys/kernel/mm/swap/vma_ra_enabled
>
> It's not a config option and it is enabled by default. IOW, it's under the
> radar, so current users cannot notice it. That's why we want to emit a big
> fat warning when an old user sets page-cluster to 0. However, as Andrew
> said, it's lame.
>
> If we make it a config option, product makers and kernel-upgrading users
> have a chance to notice and read the description, so they can become aware
> of the two weird knobs and solve the problem in advance without the
> printk_once warning. If a user has no interest in swap readahead, or skips
> the new config option by mistake, physical readahead works as before,
> which means no regression.

I am OK with making it a config option.  But I think VMA based swap
readahead should be enabled by default, because per my understanding the
default should be set for the most common desktop users, and VMA based swap
readahead should benefit them.  The people who need to turn off swap
readahead are some special users, and the original swap readahead default
configuration isn't for them either.

Best Regards,
Huang, Ying

>>  
>> Best Regards,
>> Huang, Ying
>> 
>> 
>> > People who want to use the upcoming vma-based readahead can enable the
>> > feature, and we can mention such unfortunate things in the
>> > config/document description somewhere so upcoming users will be aware
>> > of those two unfortunate knobs.


Re: [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead

2017-09-14 Thread Andrew Morton
On Thu, 14 Sep 2017 22:14:46 +0900 Minchan Kim  wrote:

> > Now, users can choose between VMA based readahead and the original
> > readahead at runtime via the following knob:
> > 
> > /sys/kernel/mm/swap/vma_ra_enabled
> 
> It's not a config option and it is enabled by default. IOW, it's under the
> radar, so current users cannot notice it. That's why we want to emit a big
> fat warning when an old user sets page-cluster to 0. However, as Andrew
> said, it's lame.
>
> If we make it a config option, product makers and kernel-upgrading users
> have a chance to notice and read the description, so they can become aware
> of the two weird knobs and solve the problem in advance without the
> printk_once warning. If a user has no interest in swap readahead, or skips
> the new config option by mistake, physical readahead works as before,
> which means no regression.

Yup, a Kconfig option sounds like a good idea.  And that's a bit more
friendly to tiny kernels as well.



Re: [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead

2017-09-14 Thread Minchan Kim
On Thu, Sep 14, 2017 at 08:01:30PM +0800, Huang, Ying wrote:
> Minchan Kim  writes:
> 
> > On Wed, Sep 13, 2017 at 02:02:29PM -0700, Andrew Morton wrote:
> >> On Wed, 13 Sep 2017 10:40:19 +0900 Minchan Kim  wrote:
> >> 
> >> > Every zram user, such as a low-end Android device, has used a
> >> > page-cluster of 0 to disable swap readahead, because zram has no seek
> >> > cost and works as a synchronous IO operation, so if we read ahead
> >> > multiple pages, the swap fault latency would be (4K * readahead
> >> > window size). IOW, readahead is meaningful only if it doesn't hurt
> >> > the faulted page's latency.
> >> >
> >> > However, this patch introduces an additional knob,
> >> > /sys/kernel/mm/swap/vma_ra_max_order, alongside page-cluster. It
> >> > means the setups of existing users who have disabled swap readahead
> >> > no longer work until those users become aware of the new knob and
> >> > modify their scripts/code to disable vma_ra_max_order as well as
> >> > page-cluster.
> >> >
> >> > I say it's a *regression* and wanted to fix it, but Huang's opinion
> >> > is that it's not a functional regression, so userspace should fix
> >> > itself.
> >> > Please look at the details of the discussion in
> >> > http://lkml.kernel.org/r/%3c1505183833-4739-4-git-send-email-minc...@kernel.org%3E
> >> 
> >> hm, tricky problem.  I do agree that linking the physical and virtual
> >> readahead schemes in the proposed fashion is unfortunate.  I also agree
> >> that breaking existing setups (a bit) is also unfortunate.
> >> 
> >> Would it help if, when page-cluster is written to zero, we do
> >> 
> >> printk_once("physical readahead disabled, virtual readahead still
> >> enabled.  Disable virtual readahead via
> >> /sys/kernel/mm/swap/vma_ra_max_order").
> >> 
> >> Or something like that.  It's pretty lame, but it should help alert the
> >> zram-readahead-disabling people to the issue?
> >
> > It was my last resort. If we cannot find other ways after all, yes, it
> > would be the minimum we should do. But it still breaks users who
> > don't/can't read the alert and modify their programs.
> >
> > How about this?
> >
> > Can't we make vma-based readahead a config option?
> > With that, users who have no interest in readahead don't enable
> > vma-based readahead. In this case, page-cluster works as expected
> > ("disable readahead completely"), so it doesn't break anything.
> 
> Now, users can choose between VMA based readahead and the original
> readahead at runtime via the following knob:
> 
> /sys/kernel/mm/swap/vma_ra_enabled

It's not a config option and it is enabled by default. IOW, it's under the
radar, so current users cannot notice it. That's why we want to emit a big
fat warning when an old user sets page-cluster to 0. However, as Andrew
said, it's lame.

If we make it a config option, product makers and kernel-upgrading users
have a chance to notice and read the description, so they can become aware
of the two weird knobs and solve the problem in advance without the
printk_once warning. If a user has no interest in swap readahead, or skips
the new config option by mistake, physical readahead works as before, which
means no regression.

>  
> Best Regards,
> Huang, Ying
> 
> 
> > People who want to use the upcoming vma-based readahead can enable the
> > feature, and we can mention such unfortunate things in the
> > config/document description somewhere so upcoming users will be aware
> > of those two unfortunate knobs.


Re: [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead

2017-09-14 Thread Huang, Ying
Minchan Kim  writes:

> On Wed, Sep 13, 2017 at 02:02:29PM -0700, Andrew Morton wrote:
>> On Wed, 13 Sep 2017 10:40:19 +0900 Minchan Kim  wrote:
>> 
>> > Every zram user, such as a low-end Android device, has used a
>> > page-cluster of 0 to disable swap readahead, because zram has no seek
>> > cost and works as a synchronous IO operation, so if we read ahead
>> > multiple pages, the swap fault latency would be (4K * readahead window
>> > size). IOW, readahead is meaningful only if it doesn't hurt the faulted
>> > page's latency.
>> >
>> > However, this patch introduces an additional knob,
>> > /sys/kernel/mm/swap/vma_ra_max_order, alongside page-cluster. It means
>> > the setups of existing users who have disabled swap readahead no longer
>> > work until those users become aware of the new knob and modify their
>> > scripts/code to disable vma_ra_max_order as well as page-cluster.
>> >
>> > I say it's a *regression* and wanted to fix it, but Huang's opinion is
>> > that it's not a functional regression, so userspace should fix itself.
>> > Please look at the details of the discussion in
>> > http://lkml.kernel.org/r/%3c1505183833-4739-4-git-send-email-minc...@kernel.org%3E
>> 
>> hm, tricky problem.  I do agree that linking the physical and virtual
>> readahead schemes in the proposed fashion is unfortunate.  I also agree
>> that breaking existing setups (a bit) is also unfortunate.
>> 
>> Would it help if, when page-cluster is written to zero, we do
>> 
>> printk_once("physical readahead disabled, virtual readahead still
>> enabled.  Disable virtual readahead via
>> /sys/kernel/mm/swap/vma_ra_max_order").
>> 
>> Or something like that.  It's pretty lame, but it should help alert the
>> zram-readahead-disabling people to the issue?
>
> It was my last resort. If we cannot find other ways after all, yes, it
> would be the minimum we should do. But it still breaks users who
> don't/can't read the alert and modify their programs.
>
> How about this?
>
> Can't we make vma-based readahead a config option?
> With that, users who have no interest in readahead don't enable vma-based
> readahead. In this case, page-cluster works as expected ("disable
> readahead completely"), so it doesn't break anything.

Now, users can choose between VMA based readahead and the original
readahead at runtime via the following knob:

/sys/kernel/mm/swap/vma_ra_enabled
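
For reference, a sketch of what a setup script currently has to do at
runtime to disable readahead entirely; the knob paths are the ones in this
thread, while the accepted value format for vma_ra_enabled and using 0 for
vma_ra_max_order are my assumptions:

# Sketch only: disable both swap readahead schemes at runtime.
echo 0 > /proc/sys/vm/page-cluster               # original (physical) readahead
echo false > /sys/kernel/mm/swap/vma_ra_enabled  # assumed value format
echo 0 > /sys/kernel/mm/swap/vma_ra_max_order    # assumed: minimal VMA window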

Best Regards,
Huang, Ying


> People who want to use the upcoming vma-based readahead can enable the
> feature, and we can mention such unfortunate things in the config/document
> description somewhere so upcoming users will be aware of those two
> unfortunate knobs.


Re: [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead

2017-09-14 Thread Minchan Kim
On Thu, Sep 14, 2017 at 08:53:04AM +0800, Huang, Ying wrote:
> Hi, Andrew,
> 
> Andrew Morton  writes:
> 
> > On Wed, 13 Sep 2017 10:40:19 +0900 Minchan Kim  wrote:
> >
> >> Every zram user, such as a low-end Android device, has used a
> >> page-cluster of 0 to disable swap readahead, because zram has no seek
> >> cost and works as a synchronous IO operation, so if we read ahead
> >> multiple pages, the swap fault latency would be (4K * readahead window
> >> size). IOW, readahead is meaningful only if it doesn't hurt the faulted
> >> page's latency.
> >>
> >> However, this patch introduces an additional knob,
> >> /sys/kernel/mm/swap/vma_ra_max_order, alongside page-cluster. It means
> >> the setups of existing users who have disabled swap readahead no longer
> >> work until those users become aware of the new knob and modify their
> >> scripts/code to disable vma_ra_max_order as well as page-cluster.
> >>
> >> I say it's a *regression* and wanted to fix it, but Huang's opinion is
> >> that it's not a functional regression, so userspace should fix itself.
> >> Please look at the details of the discussion in
> >> http://lkml.kernel.org/r/%3c1505183833-4739-4-git-send-email-minc...@kernel.org%3E
> >
> > hm, tricky problem.  I do agree that linking the physical and virtual
> > readahead schemes in the proposed fashion is unfortunate.  I also agree
> > that breaking existing setups (a bit) is also unfortunate.
> >
> > Would it help if, when page-cluster is written to zero, we do
> >
> > printk_once("physical readahead disabled, virtual readahead still
> > enabled.  Disable virtual readahead via
> > /sys/kernel/mm/swap/vma_ra_max_order").
> >
> > Or something like that.  It's pretty lame, but it should help alert the
> > zram-readahead-disabling people to the issue?
> 
> This sounds good to me.
> 
> Hi, Minchan, what do you think about this?  I think for low-end Android
> devices, where the end user may have no opportunity to upgrade to the
> latest kernel, the device vendor should take care of this.  For desktop
> users, the warning proposed by Andrew may help remind them of the new
> knob.

Yes, that would be an option. At least we should alert the user to give
them a chance to fix it. However, can't we make vma-based readahead a new
config option? Please look at the details in my reply to Andrew.

With that, there is no regression for current users and, as a bonus, users
can measure both algorithms with their real workloads rather than with an
artificial benchmark. I think recency vs. spatial locality each have their
own pros and cons, so that kind of soft landing would be a safer option
than a sudden replacement. After a while, we can set the new algorithm as
the default.


Re: [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead

2017-09-14 Thread Minchan Kim
On Wed, Sep 13, 2017 at 02:02:29PM -0700, Andrew Morton wrote:
> On Wed, 13 Sep 2017 10:40:19 +0900 Minchan Kim  wrote:
> 
> > Every zram user, such as a low-end Android device, has used a
> > page-cluster of 0 to disable swap readahead, because zram has no seek
> > cost and works as a synchronous IO operation, so if we read ahead
> > multiple pages, the swap fault latency would be (4K * readahead window
> > size). IOW, readahead is meaningful only if it doesn't hurt the faulted
> > page's latency.
> >
> > However, this patch introduces an additional knob,
> > /sys/kernel/mm/swap/vma_ra_max_order, alongside page-cluster. It means
> > the setups of existing users who have disabled swap readahead no longer
> > work until those users become aware of the new knob and modify their
> > scripts/code to disable vma_ra_max_order as well as page-cluster.
> >
> > I say it's a *regression* and wanted to fix it, but Huang's opinion is
> > that it's not a functional regression, so userspace should fix itself.
> > Please look at the details of the discussion in
> > http://lkml.kernel.org/r/%3c1505183833-4739-4-git-send-email-minc...@kernel.org%3E
> 
> hm, tricky problem.  I do agree that linking the physical and virtual
> readahead schemes in the proposed fashion is unfortunate.  I also agree
> that breaking existing setups (a bit) is also unfortunate.
> 
> Would it help if, when page-cluster is written to zero, we do
> 
> printk_once("physical readahead disabled, virtual readahead still
> enabled.  Disable virtual readahead via
> /sys/kernel/mm/swap/vma_ra_max_order").
> 
> Or something like that.  It's pretty lame, but it should help alert the
> zram-readahead-disabling people to the issue?

It was my last resort. If we cannot find other ways after all, yes, it would
be the minimum we should do. But it still breaks users who don't/can't read
the alert and modify their programs.

How about this?

Can't we make vma-based readahead a config option?
With that, users who have no interest in readahead don't enable vma-based
readahead. In this case, page-cluster works as expected ("disable readahead
completely"), so it doesn't break anything.

People who want to use the upcoming vma-based readahead can enable the
feature, and we can mention such unfortunate things in the config/document
description somewhere so upcoming users will be aware of those two
unfortunate knobs.
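
A minimal sketch of what such a Kconfig option could look like; the symbol
name, default, and help text below are illustrative, not from the patch set:

config VMA_SWAP_READAHEAD
	bool "VMA based swap readahead"
	depends on SWAP
	default n
	help
	  Read ahead the virtual pages around the fault address instead of
	  the consecutive slots around the faulting slot in the swap device.
	  If this is disabled, setting vm.page-cluster to 0 keeps disabling
	  swap readahead completely, as before.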


Re: [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead

2017-09-13 Thread Huang, Ying
Hi, Andrew,

Andrew Morton  writes:

> On Wed, 13 Sep 2017 10:40:19 +0900 Minchan Kim  wrote:
>
>> Every zram user, such as a low-end Android device, has used a
>> page-cluster of 0 to disable swap readahead, because zram has no seek
>> cost and works as a synchronous IO operation, so if we read ahead
>> multiple pages, the swap fault latency would be (4K * readahead window
>> size). IOW, readahead is meaningful only if it doesn't hurt the faulted
>> page's latency.
>>
>> However, this patch introduces an additional knob,
>> /sys/kernel/mm/swap/vma_ra_max_order, alongside page-cluster. It means
>> the setups of existing users who have disabled swap readahead no longer
>> work until those users become aware of the new knob and modify their
>> scripts/code to disable vma_ra_max_order as well as page-cluster.
>>
>> I say it's a *regression* and wanted to fix it, but Huang's opinion is
>> that it's not a functional regression, so userspace should fix itself.
>> Please look at the details of the discussion in
>> http://lkml.kernel.org/r/%3c1505183833-4739-4-git-send-email-minc...@kernel.org%3E
>
> hm, tricky problem.  I do agree that linking the physical and virtual
> readahead schemes in the proposed fashion is unfortunate.  I also agree
> that breaking existing setups (a bit) is also unfortunate.
>
> Would it help if, when page-cluster is written to zero, we do
>
> printk_once("physical readahead disabled, virtual readahead still
> enabled.  Disable virtual readahead via
> /sys/kernel/mm/swap/vma_ra_max_order").
>
> Or something like that.  It's pretty lame, but it should help alert the
> zram-readahead-disabling people to the issue?

This sounds good to me.

Hi, Minchan, what do you think about this?  I think for low-end Android
devices, where the end user may have no opportunity to upgrade to the
latest kernel, the device vendor should take care of this.  For desktop
users, the warning proposed by Andrew may help remind them of the new knob.

Best Regards,
Huang, Ying


Re: [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead

2017-09-13 Thread Andrew Morton
On Wed, 13 Sep 2017 10:40:19 +0900 Minchan Kim  wrote:

> Every zram user, such as a low-end Android device, has used a
> page-cluster of 0 to disable swap readahead, because zram has no seek
> cost and works as a synchronous IO operation, so if we read ahead
> multiple pages, the swap fault latency would be (4K * readahead window
> size). IOW, readahead is meaningful only if it doesn't hurt the faulted
> page's latency.
>
> However, this patch introduces an additional knob,
> /sys/kernel/mm/swap/vma_ra_max_order, alongside page-cluster. It means
> the setups of existing users who have disabled swap readahead no longer
> work until those users become aware of the new knob and modify their
> scripts/code to disable vma_ra_max_order as well as page-cluster.
>
> I say it's a *regression* and wanted to fix it, but Huang's opinion is
> that it's not a functional regression, so userspace should fix itself.
> Please look at the details of the discussion in
> http://lkml.kernel.org/r/%3c1505183833-4739-4-git-send-email-minc...@kernel.org%3E

hm, tricky problem.  I do agree that linking the physical and virtual
readahead schemes in the proposed fashion is unfortunate.  I also agree
that breaking existing setups (a bit) is also unfortunate.

Would it help if, when page-cluster is written to zero, we do

printk_once("physical readahead disabled, virtual readahead still
enabled.  Disable virtual readahead via
/sys/kernel/mm/swap/vma_ra_max_order").

Or something like that.  It's pretty lame, but it should help alert the
zram-readahead-disabling people to the issue?
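
A minimal sketch of how that one-time hint could be hooked into the
vm.page-cluster handler; the wrapper name and message text are illustrative,
the 2017-era handler signature is assumed, and only proc_dointvec_minmax()
and pr_warn_once() are existing kernel APIs:

#include <linux/mm.h>		/* page_cluster */
#include <linux/printk.h>
#include <linux/sysctl.h>

/* Sketch only: wrap the vm.page-cluster handler so that writing 0 emits a
 * one-time hint about the separate VMA based readahead knob. */
static int page_cluster_sysctl_handler(struct ctl_table *table, int write,
				       void __user *buffer, size_t *lenp,
				       loff_t *ppos)
{
	int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);

	if (!ret && write && !page_cluster)
		pr_warn_once("physical swap readahead disabled, virtual readahead still enabled; disable it via /sys/kernel/mm/swap/vma_ra_max_order\n");
	return ret;
}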


Re: [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead

2017-09-12 Thread Minchan Kim
On Mon, Aug 07, 2017 at 01:40:36PM +0800, Huang, Ying wrote:
> From: Huang Ying 
> 
> Swap readahead is an important mechanism to reduce swap-in latency.
> Although a pure sequential memory access pattern isn't very common for
> anonymous memory, spatial locality is still considered valid.
> 
> In the original swap readahead implementation, consecutive blocks in the
> swap device are read ahead based on a global spatial locality estimation.
> But consecutive blocks in the swap device just reflect the order of page
> reclaiming and don't necessarily reflect the access pattern in virtual
> memory.  And different tasks in the system may have different access
> patterns, which makes the global spatial locality estimation incorrect.
> 
> In this patch, when a page fault occurs, the virtual pages near the fault
> address are read ahead instead of the swap slots near the faulting swap
> slot in the swap device.  This avoids reading ahead unrelated swap slots.
> At the same time, swap readahead is changed to work per-VMA instead of
> globally, so that the different access patterns of different VMAs can be
> distinguished and different readahead policies applied accordingly.  The
> original core readahead detection and scaling algorithm is reused, because
> it is an effective algorithm for detecting spatial locality.

Andrew,

Every zram user, such as a low-end Android device, has used a page-cluster
of 0 to disable swap readahead, because zram has no seek cost and works as
a synchronous IO operation, so if we read ahead multiple pages, the swap
fault latency would be (4K * readahead window size). IOW, readahead is
meaningful only if it doesn't hurt the faulted page's latency.
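
For example, assuming the kernel's default page-cluster of 3: the readahead
window is 2^3 = 8 pages, so a single zram fault may have to wait for
8 * 4K = 32K of synchronous work before the faulting page is usable.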

However, this patch introduces an additional knob,
/sys/kernel/mm/swap/vma_ra_max_order, alongside page-cluster. It means the
setups of existing users who have disabled swap readahead no longer work
until those users become aware of the new knob and modify their
scripts/code to disable vma_ra_max_order as well as page-cluster.

I say it's a *regression* and wanted to fix it, but Huang's opinion is that
it's not a functional regression, so userspace should fix itself.
Please look at the details of the discussion in
http://lkml.kernel.org/r/%3c1505183833-4739-4-git-send-email-minc...@kernel.org%3E

The discussion has not been productive, so it's time to follow the
maintainer's opinion. Could you share your opinion?

Thanks.


[PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead

2017-08-06 Thread Huang, Ying
From: Huang Ying 

Swap readahead is an important mechanism to reduce swap-in latency.
Although a pure sequential memory access pattern isn't very common for
anonymous memory, spatial locality is still considered valid.

In the original swap readahead implementation, consecutive blocks in the
swap device are read ahead based on a global spatial locality estimation.
But consecutive blocks in the swap device just reflect the order of page
reclaiming and don't necessarily reflect the access pattern in virtual
memory.  And different tasks in the system may have different access
patterns, which makes the global spatial locality estimation incorrect.

In this patch, when a page fault occurs, the virtual pages near the fault
address are read ahead instead of the swap slots near the faulting swap
slot in the swap device.  This avoids reading ahead unrelated swap slots.
At the same time, swap readahead is changed to work per-VMA instead of
globally, so that the different access patterns of different VMAs can be
distinguished and different readahead policies applied accordingly.  The
original core readahead detection and scaling algorithm is reused, because
it is an effective algorithm for detecting spatial locality.
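
As an illustration of the core idea, here is a simplified, self-contained
sketch of computing a readahead window around the fault address and
clamping it to the VMA; the names and structure are illustrative, not the
kernel's:

#include <stdio.h>

#define PAGE_SHIFT 12

/* A page-aligned virtual address range to read ahead. */
struct ra_window {
	unsigned long start;	/* first byte of the window */
	unsigned long end;	/* one byte past the window */
};

/*
 * Read ahead the virtual pages around the faulting address, clamped to
 * the containing VMA, instead of consecutive slots in the swap device.
 */
static struct ra_window vma_ra_window(unsigned long fault_va,
				      unsigned long vma_start,
				      unsigned long vma_end,
				      unsigned int win_order)
{
	unsigned long win = 1UL << win_order;	/* window size in pages */
	unsigned long vpn = fault_va >> PAGE_SHIFT; /* virtual page number */
	unsigned long start = vpn > win / 2 ? vpn - win / 2 : 0;
	unsigned long end = start + win;
	struct ra_window w;

	/* Never read ahead outside the VMA: those pages are unrelated. */
	if (start < (vma_start >> PAGE_SHIFT))
		start = vma_start >> PAGE_SHIFT;
	if (end > (vma_end >> PAGE_SHIFT))
		end = vma_end >> PAGE_SHIFT;

	w.start = start << PAGE_SHIFT;
	w.end = end << PAGE_SHIFT;
	return w;
}

int main(void)
{
	/* Fault at 0x7f0000042000 in a VMA [0x7f0000040000, 0x7f0000080000). */
	struct ra_window w = vma_ra_window(0x7f0000042000UL, 0x7f0000040000UL,
					   0x7f0000080000UL, 3);

	printf("readahead [%#lx, %#lx)\n", w.start, w.end);
	return 0;
}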

The tests and results are as follows.

Common test condition
=====================

Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
Swap device: NVMe disk

Micro-benchmark with combined access pattern
============================================

vm-scalability, sequential swap test case: 4 processes eat 50G of virtual
memory space, repeating the sequential memory writes for 300 seconds.  The
first round of writing triggers swap-out; the following rounds trigger
sequential swap-in and swap-out.

At the same time, the vm-scalability random swap test case runs in the
background: 8 processes eat 30G of virtual memory space, repeating random
memory writes for 300 seconds.  This triggers random swap-in in the
background.

This is a combined workload with sequential and random memory access at the
same time.  The results (for the sequential workload) are as follows:

                 Base         Optimized
                 ----         ---------
throughput       345413 KB/s  414029 KB/s (+19.9%)
latency.average  97.14 us     61.06 us (-37.1%)
latency.50th     2 us         1 us
latency.60th     2 us         1 us
latency.70th     98 us        2 us
latency.80th     160 us       2 us
latency.90th     260 us       217 us
latency.95th     346 us       369 us
latency.99th     1.34 ms      1.09 ms
ra_hit%          52.69%       99.98%

The original swap readahead algorithm is confused by the background random
access workload, so its readahead hit rate is lower.  The VMA-based
readahead algorithm works much better.

Linpack
=======

The test memory size is bigger than RAM to trigger swapping.

              Base      Optimized
              ----      ---------
elapsed_time  393.49 s  329.88 s (-16.2%)
ra_hit%       86.21%    98.82%

The scores of the base and optimized kernels show no visible changes.  But
the elapsed time is reduced and the readahead hit rate is improved, so the
optimized kernel performs better during the startup and teardown stages.
And the absolute readahead hit rate is high, which shows that spatial
locality is still valid in some practical workloads.

Signed-off-by: "Huang, Ying" 
Cc: Johannes Weiner 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Fengguang Wu 
Cc: Tim Chen 
Cc: Dave Hansen 
---
 include/linux/mm_types.h |   1 +
 include/linux/swap.h |  57 -
 mm/memory.c  |  23 +++--
 mm/shmem.c   |   2 +-
 mm/swap_state.c  | 215 +++
 5 files changed, 273 insertions(+), 25 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7f384bb62d8e..5c02027050a2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -335,6 +335,7 @@ struct vm_area_struct {
struct file * vm_file;  /* File we map to (can be NULL). */
void * vm_private_data; /* was vm_pte (shared mem) */
 
+   atomic_long_t swap_readahead_info;
 #ifndef CONFIG_MMU
struct vm_region *vm_region;/* NOMMU mapping region */
 #endif
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 76f1632eea5a..61d63379e956 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -251,6 +251,25 @@ struct swap_info_struct {
struct swap_cluster_list discard_clusters; /* discard clusters list */
 };
 
+#ifdef CONFIG_64BIT
+#define SWAP_RA_ORDER_CEILING  5
+#else
+/* Avoid stack overflow, because we need to save part of page table */
+#define SWAP_RA_ORDER_CEILING  3
+#define SWAP_RA_PTE_CACHE_SIZE (1 << SWAP_RA_ORDER_CEILING)
+#endif
+
+struct vma_swap_re