Re: [PATCH 08/31] mm, vmscan: simplify the logic deciding whether kswapd sleeps

2016-07-18 Thread Joonsoo Kim
On Mon, Jul 18, 2016 at 08:51:16AM +0200, Vlastimil Babka wrote:
> On 07/18/2016 07:07 AM, Joonsoo Kim wrote:
> >On Thu, Jul 14, 2016 at 10:32:09AM +0200, Vlastimil Babka wrote:
> >>On 07/14/2016 07:23 AM, Joonsoo Kim wrote:
> >>
> >>I don't think there's a problem in the scenario? Kswapd will keep
> >>being woken up and reclaim from the node lru. It will hit and free
> >>any low zone pages that are on the lru, even though it doesn't
> >>"balance for low zone". Eventually it will either satisfy the
> >>constrained allocation by reclaiming those low-zone pages during the
> >>repeated wakeups, or the low-zone wakeups will stop coming together
> >>with higher-zone wakeups and then it will reclaim the low-zone pages
> >>in a single low-zone wakeup. If the zone-constrained request is not
> >
> >Yes, the probability of this would be low.
> >
> >>allowed to fail, then it will just keep waking up kswapd and waiting
> >>for the progress. If it's allowed to fail (i.e. not __GFP_NOFAIL),
> >>but not allowed to direct reclaim, it goes "goto nopage" rather
> >>quickly in __alloc_pages_slowpath(), without any waiting for
> >>kswapd's progress, so there's not really much difference whether the
> >>kswapd wakeup picked up a low classzone or not. Note the
> >
> >Hmm... Even if the allocation could fail, we should do our best to prevent
> >failure. Relying on luck isn't a good idea to me.
> 
> But "Doing our best" has to have some sane limits. Allocation, that

Ensuring that we do something for the requested zone at least once isn't insane.

> cannot direct reclaim already relies on luck. And we are not really
> changing this. The allocation will "goto nopage" before kswapd can
> even wake up and start doing something, regardless of the classzone_idx
> used.

But this patch makes things worse. Even if the next allocation comes
after kswapd has woken up and started doing something, the low zone
would not be balanced due to the max() classzone_idx, and the allocation
could fail. That is what this patch changes, and that is what worries me.

Thanks.



Re: [PATCH 08/31] mm, vmscan: simplify the logic deciding whether kswapd sleeps

2016-07-18 Thread Vlastimil Babka

On 07/18/2016 07:07 AM, Joonsoo Kim wrote:
> On Thu, Jul 14, 2016 at 10:32:09AM +0200, Vlastimil Babka wrote:
>> On 07/14/2016 07:23 AM, Joonsoo Kim wrote:
>>
>> I don't think there's a problem in the scenario? Kswapd will keep
>> being woken up and reclaim from the node lru. It will hit and free
>> any low zone pages that are on the lru, even though it doesn't
>> "balance for low zone". Eventually it will either satisfy the
>> constrained allocation by reclaiming those low-zone pages during the
>> repeated wakeups, or the low-zone wakeups will stop coming together
>> with higher-zone wakeups and then it will reclaim the low-zone pages
>> in a single low-zone wakeup. If the zone-constrained request is not
>
> Yes, the probability of this would be low.
>
>> allowed to fail, then it will just keep waking up kswapd and waiting
>> for the progress. If it's allowed to fail (i.e. not __GFP_NOFAIL),
>> but not allowed to direct reclaim, it goes "goto nopage" rather
>> quickly in __alloc_pages_slowpath(), without any waiting for
>> kswapd's progress, so there's not really much difference whether the
>> kswapd wakeup picked up a low classzone or not. Note the
>
> Hmm... Even if the allocation could fail, we should do our best to
> prevent failure. Relying on luck isn't a good idea to me.

But "Doing our best" has to have some sane limits. An allocation that
cannot direct reclaim already relies on luck. And we are not really
changing this. The allocation will "goto nopage" before kswapd can even
wake up and start doing something, regardless of the classzone_idx used.



> Thanks.

>> __GFP_NOFAIL but ~__GFP_DIRECT_RECLAIM is a WARN_ON_ONCE() scenario,
>> so definitely not common...

>>> Thanks.




Re: [PATCH 08/31] mm, vmscan: simplify the logic deciding whether kswapd sleeps

2016-07-17 Thread Joonsoo Kim
On Thu, Jul 14, 2016 at 10:32:09AM +0200, Vlastimil Babka wrote:
> On 07/14/2016 07:23 AM, Joonsoo Kim wrote:
> >On Fri, Jul 08, 2016 at 11:11:47AM +0100, Mel Gorman wrote:
> >>On Fri, Jul 08, 2016 at 11:44:47AM +0900, Joonsoo Kim wrote:
> >>
> >>It doesn't stop reclaiming for the lower zones. It's reclaiming the LRU
> >>for the whole node that may or may not have lower zone pages at the end
> >>of the LRU. If it does, then the allocation request will be satisfied.
> >>If it does not, then kswapd will think the node is balanced and get
> >>rewoken to do a zone-constrained reclaim pass.
> >
> >If the zone-constrained request could go through the direct reclaim
> >path, there would be no problem. But, please assume that the request is
> >zone-constrained without __GFP_DIRECT_RECLAIM, which is common in some
> >device driver implementations. And, please further assume that this
> >request always comes together with a zone-unconstrained allocation
> >request. In this case, your max() logic will set kswapd_classzone_idx
> >to the highest zone index and the re-woken kswapd would not balance for
> >the low zone again. In the end, the zone-constrained allocation request
> >without __GFP_DIRECT_RECLAIM could fail.
> 
> I don't think there's a problem in the scenario? Kswapd will keep
> being woken up and reclaim from the node lru. It will hit and free
> any low zone pages that are on the lru, even though it doesn't
> "balance for low zone". Eventually it will either satisfy the
> constrained allocation by reclaiming those low-zone pages during the
> repeated wakeups, or the low-zone wakeups will stop coming together
> with higher-zone wakeups and then it will reclaim the low-zone pages
> in a single low-zone wakeup. If the zone-constrained request is not

Yes, the probability of this would be low.

> allowed to fail, then it will just keep waking up kswapd and waiting
> for the progress. If it's allowed to fail (i.e. not __GFP_NOFAIL),
> but not allowed to direct reclaim, it goes "goto nopage" rather
> quickly in __alloc_pages_slowpath(), without any waiting for
> kswapd's progress, so there's not really much difference whether the
> kswapd wakeup picked up a low classzone or not. Note the

Hmm... Even if the allocation could fail, we should do our best to prevent
failure. Relying on luck isn't a good idea to me.

Thanks.

> __GFP_NOFAIL but ~__GFP_DIRECT_RECLAIM is a WARN_ON_ONCE() scenario,
> so definitely not common...
> 
> >Thanks.



Re: [PATCH 08/31] mm, vmscan: simplify the logic deciding whether kswapd sleeps

2016-07-17 Thread Joonsoo Kim
On Thu, Jul 14, 2016 at 10:05:00AM +0100, Mel Gorman wrote:
> On Thu, Jul 14, 2016 at 02:23:32PM +0900, Joonsoo Kim wrote:
> > > 
> > > > > > And, I'd like to know why max() is used for classzone_idx rather than
> > > > > > min()? I think that kswapd should balance the lowest zone requested.
> > > > > > 
> > > > > 
> > > > > If there are two allocation requests -- one zone-constrained and the other
> > > > > zone-unconstrained, it does not make sense to have kswapd skip the pages
> > > > > usable for the zone-unconstrained and waste a load of CPU. You could
> > > > 
> > > > I agree that, in this case, it's not good to skip the pages usable
> > > > for the zone-unconstrained request. But what I am concerned about is that
> > > > kswapd stops reclaiming prematurely from the point of view of the
> > > > zone-constrained requestor.
> > > 
> > > It doesn't stop reclaiming for the lower zones. It's reclaiming the LRU
> > > for the whole node that may or may not have lower zone pages at the end
> > > of the LRU. If it does, then the allocation request will be satisfied.
> > > If it does not, then kswapd will think the node is balanced and get
> > > rewoken to do a zone-constrained reclaim pass.
> > 
> > If zone-constrained request could go direct reclaim pass, there would
> > be no problem. But, please assume that request is zone-constrained
> > without __GFP_DIRECT_RECLAIM which is common for some device driver
> > implementation.
> 
> Then it's likely GFP_ATOMIC and it'll wake kswapd on each failure. If
> kswapd is constantly awake for highmem requests then we're reclaiming
> everything anyway.  Remember that if kswapd is reclaiming for higher zones,
> it'll still cover the lower zones eventually. There is no guarantee that
> skipping the highmem pages will satisfy the atomic allocations any faster
> but consuming the CPU to skip the pages is a definite cost.

Okay.

> 
> Even worse, skipping highmem pages when highmem pages are required may
> make lowmem pressure worse because those pages are freed faster and can
> be consumed by zone-unconstrained requests.

Okay.

> 
> If this really is a problem in practice then we can consider having
> allocation requests that are zone-constrained and !__GFP_DIRECT_RECLAIM
> set a flag and use the min classzone for the wakeup. That flag remains
> set until kswapd takes at least one pass using the lower classzone and
> clears it. The classzone will not be adjusted higher until that flag is

It would work.

> cleared. I don't think we should do it without evidence that it's a real
> problem, because kswapd would potentially burn useless CPU, with the
> potential for higher lowmem pressure.

Hmmm... I think differently. Your patch changes the current behaviour
without any evidence. Code simplification cannot compensate for a
potential stability issue. Before your patch, kswapd tried to balance
for the minimum classzone, so until a disadvantage of that approach is
proved, it's better to keep the original logic.

Thanks.



Re: [PATCH 08/31] mm, vmscan: simplify the logic deciding whether kswapd sleeps

2016-07-14 Thread Mel Gorman
On Thu, Jul 14, 2016 at 02:23:32PM +0900, Joonsoo Kim wrote:
> > 
> > > > > And, I'd like to know why max() is used for classzone_idx rather than
> > > > > min()? I think that kswapd should balance the lowest zone requested.
> > > > > 
> > > > 
> > > > If there are two allocation requests -- one zone-constrained and the other
> > > > zone-unconstrained, it does not make sense to have kswapd skip the pages
> > > > usable for the zone-unconstrained and waste a load of CPU. You could
> > > 
> > > I agree that, in this case, it's not good to skip the pages usable
> > > for the zone-unconstrained request. But what I am concerned about is that
> > > kswapd stops reclaiming prematurely from the point of view of the
> > > zone-constrained requestor.
> > 
> > It doesn't stop reclaiming for the lower zones. It's reclaiming the LRU
> > for the whole node that may or may not have lower zone pages at the end
> > of the LRU. If it does, then the allocation request will be satisfied.
> > If it does not, then kswapd will think the node is balanced and get
> > rewoken to do a zone-constrained reclaim pass.
> 
> If zone-constrained request could go direct reclaim pass, there would
> be no problem. But, please assume that request is zone-constrained
> without __GFP_DIRECT_RECLAIM which is common for some device driver
> implementation.

Then it's likely GFP_ATOMIC and it'll wake kswapd on each failure. If
kswapd is constantly awake for highmem requests then we're reclaiming
everything anyway.  Remember that if kswapd is reclaiming for higher zones,
it'll still cover the lower zones eventually. There is no guarantee that
skipping the highmem pages will satisfy the atomic allocations any faster
but consuming the CPU to skip the pages is a definite cost.

Even worse, skipping highmem pages when highmem pages are required may
make lowmem pressure worse because those pages are freed faster and can
be consumed by zone-unconstrained requests.

If this really is a problem in practice then we can consider having
allocation requests that are zone-constrained and !__GFP_DIRECT_RECLAIM
set a flag and use the min classzone for the wakeup. That flag remains
set until kswapd takes at least one pass using the lower classzone and
clears it. The classzone will not be adjusted higher until that flag is
cleared. I don't think we should do it without evidence that it's a real
problem, because kswapd would potentially burn useless CPU, with the
potential for higher lowmem pressure.

-- 
Mel Gorman
SUSE Labs



Re: [PATCH 08/31] mm, vmscan: simplify the logic deciding whether kswapd sleeps

2016-07-14 Thread Vlastimil Babka

On 07/14/2016 07:23 AM, Joonsoo Kim wrote:
> On Fri, Jul 08, 2016 at 11:11:47AM +0100, Mel Gorman wrote:
>> On Fri, Jul 08, 2016 at 11:44:47AM +0900, Joonsoo Kim wrote:
>>
>> It doesn't stop reclaiming for the lower zones. It's reclaiming the LRU
>> for the whole node that may or may not have lower zone pages at the end
>> of the LRU. If it does, then the allocation request will be satisfied.
>> If it does not, then kswapd will think the node is balanced and get
>> rewoken to do a zone-constrained reclaim pass.
>
> If the zone-constrained request could go through the direct reclaim
> path, there would be no problem. But, please assume that the request is
> zone-constrained without __GFP_DIRECT_RECLAIM, which is common in some
> device driver implementations. And, please further assume that this
> request always comes together with a zone-unconstrained allocation
> request. In this case, your max() logic will set kswapd_classzone_idx
> to the highest zone index and the re-woken kswapd would not balance for
> the low zone again. In the end, the zone-constrained allocation request
> without __GFP_DIRECT_RECLAIM could fail.


I don't think there's a problem in the scenario? Kswapd will keep being 
woken up and reclaim from the node lru. It will hit and free any low 
zone pages that are on the lru, even though it doesn't "balance for low 
zone". Eventually it will either satisfy the constrained allocation by 
reclaiming those low-zone pages during the repeated wakeups, or the 
low-zone wakeups will stop coming together with higher-zone wakeups and 
then it will reclaim the low-zone pages in a single low-zone wakeup. If 
the zone-constrained request is not allowed to fail, then it will just 
keep waking up kswapd and waiting for the progress. If it's allowed to 
fail (i.e. not __GFP_NOFAIL), but not allowed to direct reclaim, it goes 
"goto nopage" rather quickly in __alloc_pages_slowpath(), without any 
waiting for kswapd's progress, so there's not really much difference 
whether the kswapd wakeup picked up a low classzone or not. Note the 
__GFP_NOFAIL but ~__GFP_DIRECT_RECLAIM is a WARN_ON_ONCE() scenario, so 
definitely not common...
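
For reference, the bail-out in question looks roughly like this in
__alloc_pages_slowpath() (paraphrased from the v4.7-era mm/page_alloc.c,
not a verbatim quote):

	/* Caller is not willing to reclaim, we can't balance anything */
	if (!can_direct_reclaim) {
		/*
		 * All existing users of __GFP_NOFAIL are blockable, so
		 * warn of any new users that actually allow this
		 * combination, then give up anyway.
		 */
		WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL);
		goto nopage;
	}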



> Thanks.





Re: [PATCH 08/31] mm, vmscan: simplify the logic deciding whether kswapd sleeps

2016-07-13 Thread Joonsoo Kim
On Fri, Jul 08, 2016 at 11:11:47AM +0100, Mel Gorman wrote:
> On Fri, Jul 08, 2016 at 11:44:47AM +0900, Joonsoo Kim wrote:
> > > > > @@ -3390,12 +3386,24 @@ static int kswapd(void *p)
> > > > >  		 * We can speed up thawing tasks if we don't call balance_pgdat
> > > > >  		 * after returning from the refrigerator
> > > > >  		 */
> > > > > -		if (!ret) {
> > > > > -			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> > > > > +		if (ret)
> > > > > +			continue;
> > > > >  
> > > > > -			/* return value ignored until next patch */
> > > > > -			balance_pgdat(pgdat, order, classzone_idx);
> > > > > -		}
> > > > > +		/*
> > > > > +		 * Reclaim begins at the requested order but if a high-order
> > > > > +		 * reclaim fails then kswapd falls back to reclaiming for
> > > > > +		 * order-0. If that happens, kswapd will consider sleeping
> > > > > +		 * for the order it finished reclaiming at (reclaim_order)
> > > > > +		 * but kcompactd is woken to compact for the original
> > > > > +		 * request (alloc_order).
> > > > > +		 */
> > > > > +		trace_mm_vmscan_kswapd_wake(pgdat->node_id, alloc_order);
> > > > > +		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
> > > > > +		if (reclaim_order < alloc_order)
> > > > > +			goto kswapd_try_sleep;
> > > > 
> > > > This 'goto' would cause kswapd to sleep prematurely. We need to check
> > > > *new* pgdat->kswapd_order and classzone_idx even in this case.
> > > > 
> > > 
> > > It only matters if the next request coming in is also a high-order request,
> > > but one thing that needs to be avoided is kswapd staying awake for long
> > > periods of time constantly reclaiming for high-order pages. This is why the
> > > check means "If we reclaimed for high-order and failed, then consider
> > > sleeping now". If allocations still require it, they direct reclaim instead.
> > 
> > But assume that the next request is a zone-constrained allocation. We need
> > to balance memory for it, but kswapd would skip it.
> > 
> 
> Then it'll also be woken up again in the very near future by the
> zone-constrained allocation. If the zone is at the min watermark, then
> it'll have direct reclaimed but between min and low, it'll be a simple
> wakeup.
> 
> The premature sleep, wakeup with new requests logic was a complete mess.
> However, what I did do is remove the -1 handling of kswapd_classzone_idx
> and the goto full-sleep. In the event of a premature wakeup,
> it'll recheck for wakeups and if none has occurred, it'll use the old
> classzone information.
> 
> Note that it will *not* use the original allocation order if it's a
> premature sleep. This is because it's known that high-order reclaim
> failed in the recent past and restarting it has a high risk of
> overreclaiming.
> 
> > > > And, I'd like to know why max() is used for classzone_idx rather than
> > > > min()? I think that kswapd should balance the lowest zone requested.
> > > > 
> > > 
> > > If there are two allocation requests -- one zone-constrained and the other
> > > zone-unconstrained, it does not make sense to have kswapd skip the pages
> > > usable for the zone-unconstrained and waste a load of CPU. You could
> > 
> > I agree that, in this case, it's not good to skip the pages usable
> > for the zone-unconstrained request. But what I am concerned about is that
> > kswapd stops reclaiming prematurely from the point of view of the
> > zone-constrained requestor.
> 
> It doesn't stop reclaiming for the lower zones. It's reclaiming the LRU
> for the whole node that may or may not have lower zone pages at the end
> of the LRU. If it does, then the allocation request will be satisfied.
> If it does not, then kswapd will think the node is balanced and get
> rewoken to do a zone-constrained reclaim pass.

If the zone-constrained request could go through the direct reclaim
path, there would be no problem. But, please assume that the request is
zone-constrained without __GFP_DIRECT_RECLAIM, which is common in some
device driver implementations. And, please further assume that this
request always comes together with a zone-unconstrained allocation
request. In this case, your max() logic will set kswapd_classzone_idx
to the highest zone index and the re-woken kswapd would not balance for
the low zone again. In the end, the zone-constrained allocation request
without __GFP_DIRECT_RECLAIM could fail.

Thanks.



Re: [PATCH 08/31] mm, vmscan: simplify the logic deciding whether kswapd sleeps

2016-07-08 Thread Mel Gorman
On Fri, Jul 08, 2016 at 11:44:47AM +0900, Joonsoo Kim wrote:
> > > > @@ -3390,12 +3386,24 @@ static int kswapd(void *p)
> > > >  		 * We can speed up thawing tasks if we don't call balance_pgdat
> > > >  		 * after returning from the refrigerator
> > > >  		 */
> > > > -		if (!ret) {
> > > > -			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> > > > +		if (ret)
> > > > +			continue;
> > > >  
> > > > -			/* return value ignored until next patch */
> > > > -			balance_pgdat(pgdat, order, classzone_idx);
> > > > -		}
> > > > +		/*
> > > > +		 * Reclaim begins at the requested order but if a high-order
> > > > +		 * reclaim fails then kswapd falls back to reclaiming for
> > > > +		 * order-0. If that happens, kswapd will consider sleeping
> > > > +		 * for the order it finished reclaiming at (reclaim_order)
> > > > +		 * but kcompactd is woken to compact for the original
> > > > +		 * request (alloc_order).
> > > > +		 */
> > > > +		trace_mm_vmscan_kswapd_wake(pgdat->node_id, alloc_order);
> > > > +		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
> > > > +		if (reclaim_order < alloc_order)
> > > > +			goto kswapd_try_sleep;
> > > 
> > > This 'goto' would cause kswapd to sleep prematurely. We need to check
> > > *new* pgdat->kswapd_order and classzone_idx even in this case.
> > > 
> > 
> > It only matters if the next request coming in is also a high-order request,
> > but one thing that needs to be avoided is kswapd staying awake for long
> > periods of time constantly reclaiming for high-order pages. This is why the
> > check means "If we reclaimed for high-order and failed, then consider
> > sleeping now". If allocations still require it, they direct reclaim instead.
> 
> But assume that the next request is a zone-constrained allocation. We need
> to balance memory for it, but kswapd would skip it.
> 

Then it'll also be woken up again in the very near future by the
zone-constrained allocation. If the zone is at the min watermark, then
it'll have direct reclaimed but between min and low, it'll be a simple
wakeup.
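
To spell out that watermark behaviour, a toy model (not kernel code) of
what a fastpath attempt does depending on where the zone's free pages sit:

	enum outcome { ALLOC_QUIET, ALLOC_WAKE_KSWAPD, FASTPATH_FAIL };

	/* Above low: allocate, kswapd stays asleep. Between min and low:
	 * allocate but wake kswapd. Below min: the fastpath fails and the
	 * caller direct reclaims -- or, if it may not, gives up. */
	static enum outcome fastpath(long free, long wmark_min, long wmark_low)
	{
		if (free > wmark_low)
			return ALLOC_QUIET;
		if (free > wmark_min)
			return ALLOC_WAKE_KSWAPD;
		return FASTPATH_FAIL;
	}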

The premature sleep, wakeup with new requests logic was a complete mess.
However, what I did do is remove the -1 handling of kswapd_classzone_idx
and the goto full-sleep. In the event of a premature wakeup,
it'll recheck for wakeups and if none has occurred, it'll use the old
classzone information.

Note that it will *not* use the original allocation order if it's a
premature sleep. This is because it's known that high-order reclaim
failed in the recent past and restarting it has a high risk of
overreclaiming.

> > > And, I'd like to know why max() is used for classzone_idx rather than
> > > min()? I think that kswapd should balance the lowest zone requested.
> > > 
> > 
> > If there are two allocation requests -- one zone-constrained and the other
> > zone-unconstrained, it does not make sense to have kswapd skip the pages
> > usable for the zone-unconstrained and waste a load of CPU. You could
> 
> I agree that, in this case, it's not good to skip the pages usable
> for the zone-unconstrained request. But what I am concerned about is that
> kswapd stops reclaiming prematurely from the point of view of the
> zone-constrained requestor.

It doesn't stop reclaiming for the lower zones. It's reclaiming the LRU
for the whole node that may or may not have lower zone pages at the end
of the LRU. If it does, then the allocation request will be satisfied.
If it does not, then kswapd will think the node is balanced and get
rewoken to do a zone-constrained reclaim pass.

-- 
Mel Gorman
SUSE Labs



Re: [PATCH 08/31] mm, vmscan: simplify the logic deciding whether kswapd sleeps

2016-07-07 Thread Joonsoo Kim
On Thu, Jul 07, 2016 at 11:17:01AM +0100, Mel Gorman wrote:
> On Thu, Jul 07, 2016 at 10:20:39AM +0900, Joonsoo Kim wrote:
> > > @@ -3249,9 +3249,19 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, 
> > > int order,
> > >  
> > >   prepare_to_wait(>kswapd_wait, , TASK_INTERRUPTIBLE);
> > >  
> > > + /*
> > > +  * If kswapd has not been woken recently, then kswapd goes fully
> > > +  * to sleep. kcompactd may still need to wake if the original
> > > +  * request was high-order.
> > > +  */
> > > + if (classzone_idx == -1) {
> > > + wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
> > > + classzone_idx = MAX_NR_ZONES - 1;
> > > + goto full_sleep;
> > > + }
> > 
> > Wouldn't passing -1 to kcompactd cause a problem?
> > 
> 
> No, it ends up doing a wakeup and then going back to sleep which is not
> what is required. I'll fix it.
> 
> > > @@ -3390,12 +3386,24 @@ static int kswapd(void *p)
> > >  		 * We can speed up thawing tasks if we don't call balance_pgdat
> > >  		 * after returning from the refrigerator
> > >  		 */
> > > -		if (!ret) {
> > > -			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> > > +		if (ret)
> > > +			continue;
> > >  
> > > -			/* return value ignored until next patch */
> > > -			balance_pgdat(pgdat, order, classzone_idx);
> > > -		}
> > > +		/*
> > > +		 * Reclaim begins at the requested order but if a high-order
> > > +		 * reclaim fails then kswapd falls back to reclaiming for
> > > +		 * order-0. If that happens, kswapd will consider sleeping
> > > +		 * for the order it finished reclaiming at (reclaim_order)
> > > +		 * but kcompactd is woken to compact for the original
> > > +		 * request (alloc_order).
> > > +		 */
> > > +		trace_mm_vmscan_kswapd_wake(pgdat->node_id, alloc_order);
> > > +		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
> > > +		if (reclaim_order < alloc_order)
> > > +			goto kswapd_try_sleep;
> > 
> > This 'goto' would cause kswapd to sleep prematurely. We need to check
> > *new* pgdat->kswapd_order and classzone_idx even in this case.
> > 
> 
> It only matters if the next request coming in is also a high-order request,
> but one thing that needs to be avoided is kswapd staying awake for long
> periods of time constantly reclaiming for high-order pages. This is why the
> check means "If we reclaimed for high-order and failed, then consider
> sleeping now". If allocations still require it, they direct reclaim instead.

But assume that the next request is a zone-constrained allocation. We need
to balance memory for it, but kswapd would skip it.

> 
> "Fixing" this potentially causes reclaim storms from kswapd.
> 
> > > @@ -3418,10 +3426,10 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
> > >  	if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
> > >  		return;
> > >  	pgdat = zone->zone_pgdat;
> > > -	if (pgdat->kswapd_max_order < order) {
> > > -		pgdat->kswapd_max_order = order;
> > > -		pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
> > > -	}
> > > +	if (pgdat->kswapd_classzone_idx == -1)
> > > +		pgdat->kswapd_classzone_idx = classzone_idx;
> > > +	pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx, classzone_idx);
> > > +	pgdat->kswapd_order = max(pgdat->kswapd_order, order);
> > 
> > Now, updating pgdat->kswapd_order and kswapd_classzone_idx happens
> > unconditionally. Before your patch, they were only updated toward a
> > harder constraint (e.g. a higher order).
> > 
> 
> So? It's updating the request to suit the requirements of all pending
> allocation requests that woke kswapd.
> 
> > And, I'd like to know why max() is used for classzone_idx rather than
> > min()? I think that kswapd should balance the lowest zone requested.
> > 
> 
> If there are two allocation requests -- one zone-constrained and the other
> zone-unconstrained, it does not make sense to have kswapd skip the pages
> usable for the zone-unconstrained and waste a load of CPU. You could

I agree that, in this case, it's not good to skip the pages usable
for the zone-unconstrained request. But what I am concerned about is
that kswapd stops reclaiming prematurely from the point of view of the
zone-constrained requestor. Kswapd decides to stop reclaiming once one
eligible zone is balanced, and this max() makes the eligible zones
higher than the one the zone-constrained requestor wants. For example,
if wakeups arrive for both ZONE_DMA32 and ZONE_NORMAL, max() records
ZONE_NORMAL, so kswapd may stop once ZONE_NORMAL is balanced while
ZONE_DMA32 is still unbalanced.

Thanks.

> argue that using min would satisfy the zone-constrained allocation faster
> but that's at the cost of delaying the zone-unconstrained allocation and
> wasting CPU. Bear in mind that using max may mean some lowmem pages get
> freed anyway due to LRU order.
> 
> -- 
> Mel Gorman
> SUSE Labs


Re: [PATCH 08/31] mm, vmscan: simplify the logic deciding whether kswapd sleeps

2016-07-07 Thread Mel Gorman
On Thu, Jul 07, 2016 at 10:20:39AM +0900, Joonsoo Kim wrote:
> > @@ -3249,9 +3249,19 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
> >  
> >  	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> >  
> > +	/*
> > +	 * If kswapd has not been woken recently, then kswapd goes fully
> > +	 * to sleep. kcompactd may still need to wake if the original
> > +	 * request was high-order.
> > +	 */
> > +	if (classzone_idx == -1) {
> > +		wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
> > +		classzone_idx = MAX_NR_ZONES - 1;
> > +		goto full_sleep;
> > +	}
> 
> Wouldn't passing -1 to kcompactd cause a problem?
> 

No, it ends up doing a wakeup and then going back to sleep which is not
what is required. I'll fix it.

> > @@ -3390,12 +3386,24 @@ static int kswapd(void *p)
> >  * We can speed up thawing tasks if we don't call balance_pgdat
> >  * after returning from the refrigerator
> >  */
> > -   if (!ret) {
> > -   trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> > +   if (ret)
> > +   continue;
> >  
> > -   /* return value ignored until next patch */
> > -   balance_pgdat(pgdat, order, classzone_idx);
> > -   }
> > +   /*
> > +* Reclaim begins at the requested order but if a high-order
> > +* reclaim fails then kswapd falls back to reclaiming for
> > +* order-0. If that happens, kswapd will consider sleeping
> > +* for the order it finished reclaiming at (reclaim_order)
> > +* but kcompactd is woken to compact for the original
> > +* request (alloc_order).
> > +*/
> > +   trace_mm_vmscan_kswapd_wake(pgdat->node_id, alloc_order);
> > +   reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
> > +   if (reclaim_order < alloc_order)
> > +   goto kswapd_try_sleep;
> 
> This 'goto' would cause kswapd to sleep prematurely. We need to check
> *new* pgdat->kswapd_order and classzone_idx even in this case.
> 

It only matters if the next request coming in is also a high-order
request, but one thing that needs to be avoided is kswapd staying awake
for long periods of time constantly reclaiming for high-order pages.
This is why the check means "If we reclaimed for high-order and failed,
then consider sleeping now". If allocations still require it, they
direct reclaim instead.

"Fixing" this potentially causes reclaim storms from kswapd.

> > @@ -3418,10 +3426,10 @@ void wakeup_kswapd(struct zone *zone, int order, 
> > enum zone_type classzone_idx)
> > if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
> > return;
> > pgdat = zone->zone_pgdat;
> > -   if (pgdat->kswapd_max_order < order) {
> > -   pgdat->kswapd_max_order = order;
> > -   pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
> > -   }
> > +   if (pgdat->kswapd_classzone_idx == -1)
> > +   pgdat->kswapd_classzone_idx = classzone_idx;
> > +   pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx, classzone_idx);
> > +   pgdat->kswapd_order = max(pgdat->kswapd_order, order);
> 
> Now, updating pgdat->kswapd_max_order and classzone_idx happens
> unconditionally. Before your patch, they were only updated toward the
> harder constraint (e.g. a higher order).
> 

So? It's updating the request to suit the requirements of all pending
allocation requests that woke kswapd.

> And, I'd like to know why max() is used for classzone_idx rather than
> min()? I think that kswapd should balance the lowest zone requested.
> 

If there are two allocation requests -- one zone-constrained and the other
zone-unconstrained, it does not make sense to have kswapd skip the pages
usable for the zone-unconstrained and waste a load of CPU. You could
argue that using min would satisfy the zone-constrained allocation faster
but that's at the cost of delaying the zone-unconstrained allocation and
wasting CPU. Bear in mind that using max may mean some lowmem pages get
freed anyway due to LRU order.
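
As a user-space sketch of how racing wakeups merge under this scheme
(toy model, not the kernel code):

    #define max(a, b) ((a) > (b) ? (a) : (b))

    struct pgdat_stub { int kswapd_classzone_idx, kswapd_order; };

    static void record_wakeup(struct pgdat_stub *p, int classzone_idx, int order)
    {
            if (p->kswapd_classzone_idx == -1)      /* no wakeup pending yet */
                    p->kswapd_classzone_idx = classzone_idx;
            p->kswapd_classzone_idx = max(p->kswapd_classzone_idx, classzone_idx);
            p->kswapd_order = max(p->kswapd_order, order);
    }

A ZONE_DMA32 (index 1) wakeup followed by a ZONE_NORMAL (index 2) one
leaves index 2 behind, so kswapd reclaims node-wide rather than
restricting itself to lowmem.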

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 08/31] mm, vmscan: simplify the logic deciding whether kswapd sleeps

2016-07-07 Thread Mel Gorman
On Thu, Jul 07, 2016 at 02:51:21PM +0900, Minchan Kim wrote:
> > It becomes difficult to tell the difference between "no wakeup and init to
> > zone 0" and "wakeup and reclaim for zone 0". At least that's the problem
> > I ran into when I tried before settling on -1.
> 
> Sorry for bothering you several times. I cannot parse what you mean.
> I didn't mean that -1 is a problem here, but why do we need the two
> lines below that I removed?
> 

What you have should be fine. The hazard initially was that both
classzone_idx and kswapd_classzone_idx are enums, and the signedness of
an enum is implementation-defined. Using max_t avoids that, but it's
subtle. I prefer the obvious check of kswapd_classzone_idx == -1 because
it is clearer that we're checking for an initialised value instead of
depending on a side-effect of the casting in max_t to do the right thing.
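
A self-contained illustration of that hazard (user-space C, hypothetical
zone values):

    #include <stdio.h>

    enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, MAX_NR_ZONES };

    /* kernel-style max_t: force both sides to 'type' before comparing */
    #define max_t(type, a, b) ((type)(a) > (type)(b) ? (type)(a) : (type)(b))

    int main(void)
    {
            enum zone_type idx = (enum zone_type)-1; /* "not woken" sentinel */

            /* If the enum is unsigned on this ABI, the sentinel looks like
             * a huge value and wins any plain comparison. */
            printf("plain comparison keeps sentinel? %s\n",
                   idx > ZONE_NORMAL ? "yes" : "no");
            /* max_t(int, ...) casts first, so -1 loses to any real index. */
            printf("max_t(int, ...) = %d\n", max_t(int, idx, ZONE_NORMAL));
            return 0;
    }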

I can apply it if you wish, I just don't think it helps.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 08/31] mm, vmscan: simplify the logic deciding whether kswapd sleeps

2016-07-06 Thread Minchan Kim
On Wed, Jul 06, 2016 at 09:31:21AM +0100, Mel Gorman wrote:
> On Wed, Jul 06, 2016 at 09:30:54AM +0900, Minchan Kim wrote:
> > On Tue, Jul 05, 2016 at 11:26:39AM +0100, Mel Gorman wrote:
> > 
> > 
> > 
> > > > > @@ -3418,10 +3426,10 @@ void wakeup_kswapd(struct zone *zone, int 
> > > > > order, enum zone_type classzone_idx)
> > > > >   if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
> > > > >   return;
> > > > >   pgdat = zone->zone_pgdat;
> > > > > - if (pgdat->kswapd_max_order < order) {
> > > > > - pgdat->kswapd_max_order = order;
> > > > > - pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
> > > > > - }
> > > > > + if (pgdat->kswapd_classzone_idx == -1)
> > > > > + pgdat->kswapd_classzone_idx = classzone_idx;
> > > > 
> > > > It's tricky. Couldn't we change kswapd_classzone_idx to integer type
> > > > and remove if above if condition?
> > > > 
> > > 
> > > It's tricky and not necessarily better overall. It's perfectly possible
> > > to be woken up for zone index 0 so it's changing -1 to another magic
> > > value.
> > 
> > I don't get it. What is a problem with this?
> > 
> 
> It becomes difficult to tell the difference between "no wakeup and init to
> zone 0" and "wakeup and reclaim for zone 0". At least that's the problem
> I ran into when I tried before settling on -1.

Sorry for bothering you several times. I cannot parse what you mean.
I didn't mean that -1 is a problem here, but why do we need the two
lines below that I removed?

IOW, what's the problem if we apply the patch below?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c538a8c..6eb23f5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3413,9 +3413,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
return;
pgdat = zone->zone_pgdat;
-   if (pgdat->kswapd_classzone_idx == -1)
-   pgdat->kswapd_classzone_idx = classzone_idx;
-   pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx, classzone_idx);
+   pgdat->kswapd_classzone_idx = max_t(int, pgdat->kswapd_classzone_idx, classzone_idx);
pgdat->kswapd_order = max(pgdat->kswapd_order, order);
if (!waitqueue_active(&pgdat->kswapd_wait))
return;

> 
> -- 
> Mel Gorman
> SUSE Labs


Re: [PATCH 08/31] mm, vmscan: simplify the logic deciding whether kswapd sleeps

2016-07-06 Thread Joonsoo Kim
On Fri, Jul 01, 2016 at 09:01:16PM +0100, Mel Gorman wrote:
> kswapd goes through some complex steps trying to figure out if it should
> stay awake based on the classzone_idx and the requested order.  It is
> unnecessarily complex and passes in an invalid classzone_idx to
> balance_pgdat().  What matters most of all is whether a larger order has
> been requested and whether kswapd successfully reclaimed at the previous
> order.  This patch irons out the logic to check just that and the end
> result is less headache inducing.
> 
> Signed-off-by: Mel Gorman 
> Acked-by: Johannes Weiner 
> Acked-by: Vlastimil Babka 
> ---
>  include/linux/mmzone.h |   5 ++-
>  mm/memory_hotplug.c|   5 ++-
>  mm/page_alloc.c|   2 +-
>  mm/vmscan.c| 102 
> ++---
>  4 files changed, 62 insertions(+), 52 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 258c20758e80..eb74e63df5cf 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -667,8 +667,9 @@ typedef struct pglist_data {
>   wait_queue_head_t pfmemalloc_wait;
>   struct task_struct *kswapd; /* Protected by
>  mem_hotplug_begin/end() */
> - int kswapd_max_order;
> - enum zone_type classzone_idx;
> + int kswapd_order;
> + enum zone_type kswapd_classzone_idx;
> +
>  #ifdef CONFIG_COMPACTION
>   int kcompactd_max_order;
>   enum zone_type kcompactd_classzone_idx;
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index c5278360ca66..065140ecd081 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1209,9 +1209,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 
> start)
>  
>   arch_refresh_nodedata(nid, pgdat);
>   } else {
> - /* Reset the nr_zones and classzone_idx to 0 before reuse */
> + /* Reset the nr_zones, order and classzone_idx before reuse */
>   pgdat->nr_zones = 0;
> - pgdat->classzone_idx = 0;
> + pgdat->kswapd_order = 0;
> + pgdat->kswapd_classzone_idx = 0;
>   }
>  
>   /* we can use NODE_DATA(nid) from here */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 59e4463e5dce..f58548139bf2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6084,7 +6084,7 @@ void __paginginit free_area_init_node(int nid, unsigned 
> long *zones_size,
>   unsigned long end_pfn = 0;
>  
>   /* pg_data_t should be reset to zero when it's allocated */
> - WARN_ON(pgdat->nr_zones || pgdat->classzone_idx);
> + WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);
>  
>   reset_deferred_meminit(pgdat);
>   pgdat->node_id = nid;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a52167eabc96..b524d3b72527 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2762,7 +2762,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>  
>   /* kswapd must be awake if processes are being throttled */
>   if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
> - pgdat->classzone_idx = min(pgdat->classzone_idx,
> + pgdat->kswapd_classzone_idx = min(pgdat->kswapd_classzone_idx,
>   (enum zone_type)ZONE_NORMAL);
>   wake_up_interruptible(&pgdat->kswapd_wait);
>   }
> @@ -3238,8 +3238,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
> int classzone_idx)
>   return sc.order;
>  }
>  
> -static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
> - int classzone_idx, int balanced_classzone_idx)
> +static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
> + int classzone_idx)
>  {
>   long remaining = 0;
>   DEFINE_WAIT(wait);
> @@ -3249,9 +3249,19 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int 
> order,
>  
>   prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
>  
> + /*
> +  * If kswapd has not been woken recently, then kswapd goes fully
> +  * to sleep. kcompactd may still need to wake if the original
> +  * request was high-order.
> +  */
> + if (classzone_idx == -1) {
> + wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
> + classzone_idx = MAX_NR_ZONES - 1;
> + goto full_sleep;
> + }

Would passing -1 to kcompactd cause a problem?

> +
>   /* Try to sleep for a short interval */
> - if (prepare_kswapd_sleep(pgdat, order, remaining,
> - balanced_classzone_idx)) {
> + if (prepare_kswapd_sleep(pgdat, reclaim_order, remaining, classzone_idx)) {
>   /*
>* Compaction records what page blocks it recently failed to
>* isolate pages from and skips them in the future scanning.
> @@ -3264,19 +3274,19 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int 
> 

Re: [PATCH 08/31] mm, vmscan: simplify the logic deciding whether kswapd sleeps

2016-07-06 Thread Mel Gorman
On Wed, Jul 06, 2016 at 09:30:54AM +0900, Minchan Kim wrote:
> On Tue, Jul 05, 2016 at 11:26:39AM +0100, Mel Gorman wrote:
> 
> 
> 
> > > > @@ -3418,10 +3426,10 @@ void wakeup_kswapd(struct zone *zone, int 
> > > > order, enum zone_type classzone_idx)
> > > > if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
> > > > return;
> > > > pgdat = zone->zone_pgdat;
> > > > -   if (pgdat->kswapd_max_order < order) {
> > > > -   pgdat->kswapd_max_order = order;
> > > > -   pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
> > > > -   }
> > > > +   if (pgdat->kswapd_classzone_idx == -1)
> > > > +   pgdat->kswapd_classzone_idx = classzone_idx;
> > > 
> > > It's tricky. Couldn't we change kswapd_classzone_idx to integer type
> > > and remove if above if condition?
> > > 
> > 
> > It's tricky and not necessarily better overall. It's perfectly possible
> > to be woken up for zone index 0 so it's changing -1 to another magic
> > value.
> 
> I don't get it. What is a problem with this?
> 

It becomes difficult to tell the difference between "no wakeup and init to
zone 0" and "wakeup and reclaim for zone 0". At least that's the problem
I ran into when I tried before settling on -1.
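
Spelling the ambiguity out (hypothetical, with no sentinel at all):

    pgdat->kswapd_classzone_idx = 0;    /* init: kswapd was never woken */
    ...
    wakeup_kswapd(zone, 0, ZONE_DMA);   /* stores 0 again: woken for the
                                         * lowest zone (index 0)        */
    /* a later reader of kswapd_classzone_idx cannot tell these two
     * states apart, which is exactly what the -1 sentinel fixes */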

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 08/31] mm, vmscan: simplify the logic deciding whether kswapd sleeps

2016-07-05 Thread Minchan Kim
On Tue, Jul 05, 2016 at 11:26:39AM +0100, Mel Gorman wrote:



> > > @@ -3418,10 +3426,10 @@ void wakeup_kswapd(struct zone *zone, int order, 
> > > enum zone_type classzone_idx)
> > >   if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
> > >   return;
> > >   pgdat = zone->zone_pgdat;
> > > - if (pgdat->kswapd_max_order < order) {
> > > - pgdat->kswapd_max_order = order;
> > > - pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
> > > - }
> > > + if (pgdat->kswapd_classzone_idx == -1)
> > > + pgdat->kswapd_classzone_idx = classzone_idx;
> > 
> > It's tricky. Couldn't we change kswapd_classzone_idx to integer type
> > and remove if above if condition?
> > 
> 
> It's tricky and not necessarily better overall. It's perfectly possible
> to be woken up for zone index 0 so it's changing -1 to another magic
> value.

I don't get it. What is a problem with this?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c538a8c..6eb23f5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3413,9 +3413,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
return;
pgdat = zone->zone_pgdat;
-   if (pgdat->kswapd_classzone_idx == -1)
-   pgdat->kswapd_classzone_idx = classzone_idx;
-   pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx, classzone_idx);
+   pgdat->kswapd_classzone_idx = max_t(int, pgdat->kswapd_classzone_idx, classzone_idx);
pgdat->kswapd_order = max(pgdat->kswapd_order, order);
if (!waitqueue_active(&pgdat->kswapd_wait))
return;


Re: [PATCH 08/31] mm, vmscan: simplify the logic deciding whether kswapd sleeps

2016-07-05 Thread Mel Gorman
On Tue, Jul 05, 2016 at 02:59:31PM +0900, Minchan Kim wrote:
> > @@ -3249,9 +3249,19 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, 
> > int order,
> >  
> > prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> >  
> > +   /*
> > +* If kswapd has not been woken recently, then kswapd goes fully
> > +* to sleep. kcompactd may still need to wake if the original
> > +* request was high-order.
> > +*/
> > +   if (classzone_idx == -1) {
> > +   wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
> > +   classzone_idx = MAX_NR_ZONES - 1;
> > +   goto full_sleep;
> > +   }
> > +
> > /* Try to sleep for a short interval */
> > -   if (prepare_kswapd_sleep(pgdat, order, remaining,
> > -   balanced_classzone_idx)) {
> > +   if (prepare_kswapd_sleep(pgdat, reclaim_order, remaining, classzone_idx)) {
> 
> 
> Just trivial, but since this is a cleanup patch I suggest one.
> If it doesn't help readability, just ignore, please.
> 
> The first prepare_kswapd_sleep call always gets a 0 remaining value, so
> it's a pointless argument for the function. We could remove it and
> check remaining before the second prepare_kswapd_sleep call.
> 

Yeah, fair point. I added a new patch that does this near the end of
the series with the other patches that avoid unnecessarily passing
parameters.
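
A minimal sketch of that cleanup (hypothetical signature, not the final
patch): drop 'remaining' from the helper and test it only before the
second call:

    /* first attempt: nothing has been slept yet, remaining is always 0 */
    if (prepare_kswapd_sleep(pgdat, reclaim_order, classzone_idx)) {
            ...
            remaining = schedule_timeout(HZ/10);
            finish_wait(&pgdat->kswapd_wait, &wait);
            prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
    }

    /* only an undisturbed short sleep permits the full sleep */
    if (!remaining &&
        prepare_kswapd_sleep(pgdat, reclaim_order, classzone_idx)) {
            ...
    }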

> > @@ -3418,10 +3426,10 @@ void wakeup_kswapd(struct zone *zone, int order, 
> > enum zone_type classzone_idx)
> > if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
> > return;
> > pgdat = zone->zone_pgdat;
> > -   if (pgdat->kswapd_max_order < order) {
> > -   pgdat->kswapd_max_order = order;
> > -   pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
> > -   }
> > +   if (pgdat->kswapd_classzone_idx == -1)
> > +   pgdat->kswapd_classzone_idx = classzone_idx;
> 
> It's tricky. Couldn't we change kswapd_classzone_idx to integer type
> and remove if above if condition?
> 

It's tricky and not necessarily better overall. It's perfectly possible
to be woken up for zone index 0 so it's changing -1 to another magic
value.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 08/31] mm, vmscan: simplify the logic deciding whether kswapd sleeps

2016-07-04 Thread Minchan Kim
On Fri, Jul 01, 2016 at 09:01:16PM +0100, Mel Gorman wrote:
> kswapd goes through some complex steps trying to figure out if it should
> stay awake based on the classzone_idx and the requested order.  It is
> unnecessarily complex and passes in an invalid classzone_idx to
> balance_pgdat().  What matters most of all is whether a larger order has
> been requested and whether kswapd successfully reclaimed at the previous
> order.  This patch irons out the logic to check just that and the end
> result is less headache inducing.
> 
> Signed-off-by: Mel Gorman 
> Acked-by: Johannes Weiner 
> Acked-by: Vlastimil Babka 
> ---
>  include/linux/mmzone.h |   5 ++-
>  mm/memory_hotplug.c|   5 ++-
>  mm/page_alloc.c|   2 +-
>  mm/vmscan.c| 102 
> ++---
>  4 files changed, 62 insertions(+), 52 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 258c20758e80..eb74e63df5cf 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -667,8 +667,9 @@ typedef struct pglist_data {
>   wait_queue_head_t pfmemalloc_wait;
>   struct task_struct *kswapd; /* Protected by
>  mem_hotplug_begin/end() */
> - int kswapd_max_order;
> - enum zone_type classzone_idx;
> + int kswapd_order;
> + enum zone_type kswapd_classzone_idx;
> +
>  #ifdef CONFIG_COMPACTION
>   int kcompactd_max_order;
>   enum zone_type kcompactd_classzone_idx;
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index c5278360ca66..065140ecd081 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1209,9 +1209,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 
> start)
>  
>   arch_refresh_nodedata(nid, pgdat);
>   } else {
> - /* Reset the nr_zones and classzone_idx to 0 before reuse */
> + /* Reset the nr_zones, order and classzone_idx before reuse */
>   pgdat->nr_zones = 0;
> - pgdat->classzone_idx = 0;
> + pgdat->kswapd_order = 0;
> + pgdat->kswapd_classzone_idx = 0;
>   }
>  
>   /* we can use NODE_DATA(nid) from here */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 59e4463e5dce..f58548139bf2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6084,7 +6084,7 @@ void __paginginit free_area_init_node(int nid, unsigned 
> long *zones_size,
>   unsigned long end_pfn = 0;
>  
>   /* pg_data_t should be reset to zero when it's allocated */
> - WARN_ON(pgdat->nr_zones || pgdat->classzone_idx);
> + WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);
>  
>   reset_deferred_meminit(pgdat);
>   pgdat->node_id = nid;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a52167eabc96..b524d3b72527 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2762,7 +2762,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>  
>   /* kswapd must be awake if processes are being throttled */
>   if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
> - pgdat->classzone_idx = min(pgdat->classzone_idx,
> + pgdat->kswapd_classzone_idx = min(pgdat->kswapd_classzone_idx,
>   (enum zone_type)ZONE_NORMAL);
>   wake_up_interruptible(&pgdat->kswapd_wait);
>   }
> @@ -3238,8 +3238,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
> int classzone_idx)
>   return sc.order;
>  }
>  
> -static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
> - int classzone_idx, int balanced_classzone_idx)
> +static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
> + int classzone_idx)
>  {
>   long remaining = 0;
>   DEFINE_WAIT(wait);
> @@ -3249,9 +3249,19 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int 
> order,
>  
>   prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
>  
> + /*
> +  * If kswapd has not been woken recently, then kswapd goes fully
> +  * to sleep. kcompactd may still need to wake if the original
> +  * request was high-order.
> +  */
> + if (classzone_idx == -1) {
> + wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
> + classzone_idx = MAX_NR_ZONES - 1;
> + goto full_sleep;
> + }
> +
>   /* Try to sleep for a short interval */
> - if (prepare_kswapd_sleep(pgdat, order, remaining,
> - balanced_classzone_idx)) {
> + if (prepare_kswapd_sleep(pgdat, reclaim_order, remaining, classzone_idx)) {


Just trivial, but since this is a cleanup patch I suggest one.
If it doesn't help readability, just ignore, please.

The first prepare_kswapd_sleep call always gets a 0 remaining value, so
it's a pointless argument for the function. We could remove it and
check remaining before the second prepare_kswapd_sleep call.

full_sleep:



[PATCH 08/31] mm, vmscan: simplify the logic deciding whether kswapd sleeps

2016-07-01 Thread Mel Gorman
kswapd goes through some complex steps trying to figure out if it should
stay awake based on the classzone_idx and the requested order.  It is
unnecessarily complex and passes in an invalid classzone_idx to
balance_pgdat().  What matters most of all is whether a larger order has
been requested and whether kswapd successfully reclaimed at the previous
order.  This patch irons out the logic to check just that and the end
result is less headache inducing.

Signed-off-by: Mel Gorman 
Acked-by: Johannes Weiner 
Acked-by: Vlastimil Babka 
---
 include/linux/mmzone.h |   5 ++-
 mm/memory_hotplug.c|   5 ++-
 mm/page_alloc.c|   2 +-
 mm/vmscan.c| 102 ++---
 4 files changed, 62 insertions(+), 52 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 258c20758e80..eb74e63df5cf 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -667,8 +667,9 @@ typedef struct pglist_data {
wait_queue_head_t pfmemalloc_wait;
struct task_struct *kswapd; /* Protected by
   mem_hotplug_begin/end() */
-   int kswapd_max_order;
-   enum zone_type classzone_idx;
+   int kswapd_order;
+   enum zone_type kswapd_classzone_idx;
+
 #ifdef CONFIG_COMPACTION
int kcompactd_max_order;
enum zone_type kcompactd_classzone_idx;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c5278360ca66..065140ecd081 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1209,9 +1209,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 
start)
 
arch_refresh_nodedata(nid, pgdat);
} else {
-   /* Reset the nr_zones and classzone_idx to 0 before reuse */
+   /* Reset the nr_zones, order and classzone_idx before reuse */
pgdat->nr_zones = 0;
-   pgdat->classzone_idx = 0;
+   pgdat->kswapd_order = 0;
+   pgdat->kswapd_classzone_idx = 0;
}
 
/* we can use NODE_DATA(nid) from here */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 59e4463e5dce..f58548139bf2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6084,7 +6084,7 @@ void __paginginit free_area_init_node(int nid, unsigned 
long *zones_size,
unsigned long end_pfn = 0;
 
/* pg_data_t should be reset to zero when it's allocated */
-   WARN_ON(pgdat->nr_zones || pgdat->classzone_idx);
+   WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);
 
reset_deferred_meminit(pgdat);
pgdat->node_id = nid;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a52167eabc96..b524d3b72527 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2762,7 +2762,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 
/* kswapd must be awake if processes are being throttled */
if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
-   pgdat->classzone_idx = min(pgdat->classzone_idx,
+   pgdat->kswapd_classzone_idx = min(pgdat->kswapd_classzone_idx,
(enum zone_type)ZONE_NORMAL);
wake_up_interruptible(&pgdat->kswapd_wait);
}
@@ -3238,8 +3238,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int 
classzone_idx)
return sc.order;
 }
 
-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
-   int classzone_idx, int balanced_classzone_idx)
+static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
+   int classzone_idx)
 {
long remaining = 0;
DEFINE_WAIT(wait);
@@ -3249,9 +3249,19 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int 
order,
 
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 
+   /*
+* If kswapd has not been woken recently, then kswapd goes fully
+* to sleep. kcompactd may still need to wake if the original
+* request was high-order.
+*/
+   if (classzone_idx == -1) {
+   wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
+   classzone_idx = MAX_NR_ZONES - 1;
+   goto full_sleep;
+   }
+
/* Try to sleep for a short interval */
-   if (prepare_kswapd_sleep(pgdat, order, remaining,
-   balanced_classzone_idx)) {
+   if (prepare_kswapd_sleep(pgdat, reclaim_order, remaining, classzone_idx)) {
/*
 * Compaction records what page blocks it recently failed to
 * isolate pages from and skips them in the future scanning.
@@ -3264,19 +3274,19 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int 
order,
 * We have freed the memory, now we should compact it to make
 * allocation of the requested order possible.
 */
-   wakeup_kcompactd(pgdat, order, classzone_idx);
+ 

[PATCH 08/31] mm, vmscan: simplify the logic deciding whether kswapd sleeps

2016-07-01 Thread Mel Gorman
kswapd goes through some complex steps trying to figure out if it should
stay awake based on the classzone_idx and the requested order.  It is
unnecessarily complex and passes in an invalid classzone_idx to
balance_pgdat().  What matters most of all is whether a larger order has
been requested and whether kswapd successfully reclaimed at the previous
order.  This patch irons out the logic to check just that and the end
result is less headache inducing.

Link: 
http://lkml.kernel.org/r/1466518566-30034-9-git-send-email-mgor...@techsingularity.net
Signed-off-by: Mel Gorman 
Acked-by: Johannes Weiner 
Acked-by: Vlastimil Babka 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
---
 include/linux/mmzone.h |   5 ++-
 mm/memory_hotplug.c|   5 ++-
 mm/page_alloc.c|   2 +-
 mm/vmscan.c| 102 ++---
 4 files changed, 62 insertions(+), 52 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 258c20758e80..eb74e63df5cf 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -667,8 +667,9 @@ typedef struct pglist_data {
wait_queue_head_t pfmemalloc_wait;
struct task_struct *kswapd; /* Protected by
   mem_hotplug_begin/end() */
-   int kswapd_max_order;
-   enum zone_type classzone_idx;
+   int kswapd_order;
+   enum zone_type kswapd_classzone_idx;
+
 #ifdef CONFIG_COMPACTION
int kcompactd_max_order;
enum zone_type kcompactd_classzone_idx;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c5278360ca66..065140ecd081 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1209,9 +1209,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 
start)
 
arch_refresh_nodedata(nid, pgdat);
} else {
-   /* Reset the nr_zones and classzone_idx to 0 before reuse */
+   /* Reset the nr_zones, order and classzone_idx before reuse */
pgdat->nr_zones = 0;
-   pgdat->classzone_idx = 0;
+   pgdat->kswapd_order = 0;
+   pgdat->kswapd_classzone_idx = 0;
}
 
/* we can use NODE_DATA(nid) from here */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 59e4463e5dce..f58548139bf2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6084,7 +6084,7 @@ void __paginginit free_area_init_node(int nid, unsigned 
long *zones_size,
unsigned long end_pfn = 0;
 
/* pg_data_t should be reset to zero when it's allocated */
-   WARN_ON(pgdat->nr_zones || pgdat->classzone_idx);
+   WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);
 
reset_deferred_meminit(pgdat);
pgdat->node_id = nid;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a52167eabc96..b524d3b72527 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2762,7 +2762,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 
/* kswapd must be awake if processes are being throttled */
if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
-   pgdat->classzone_idx = min(pgdat->classzone_idx,
+   pgdat->kswapd_classzone_idx = min(pgdat->kswapd_classzone_idx,
(enum zone_type)ZONE_NORMAL);
wake_up_interruptible(&pgdat->kswapd_wait);
}
@@ -3238,8 +3238,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int 
classzone_idx)
return sc.order;
 }
 
-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
-   int classzone_idx, int balanced_classzone_idx)
+static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
+   int classzone_idx)
 {
long remaining = 0;
DEFINE_WAIT(wait);
@@ -3249,9 +3249,19 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int 
order,
 
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 
+   /*
+* If kswapd has not been woken recently, then kswapd goes fully
+* to sleep. kcompactd may still need to wake if the original
+* request was high-order.
+*/
+   if (classzone_idx == -1) {
+   wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
+   classzone_idx = MAX_NR_ZONES - 1;
+   goto full_sleep;
+   }
+
/* Try to sleep for a short interval */
-   if (prepare_kswapd_sleep(pgdat, order, remaining,
-   balanced_classzone_idx)) {
+   if (prepare_kswapd_sleep(pgdat, reclaim_order, remaining, classzone_idx)) {
/*
 * Compaction records what page blocks it recently failed to
 * isolate pages from and skips them in the future scanning.
@@ -3264,19 +3274,19 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int 
order,
 * We have freed the memory, now we should compact it to make