Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
Hugh Dickins wrote:
> On Thu, 20 Jul 2017, Tetsuo Handa wrote:
> > Hugh Dickins wrote:
> > > You probably won't welcome getting into alternatives at this late stage;
> > > but after hacking around it one way or another because of its pointless
> > > lockups, I lost patience with that too_many_isolated() loop a few months
> > > back (on realizing the enormous number of pages that may be isolated via
> > > migrate_pages(2)), and we've been running nicely since with something
> > > like:
> > >
> > >	bool got_mutex = false;
> > >
> > >	if (unlikely(too_many_isolated(pgdat, file, sc))) {
> > >		if (mutex_lock_killable(&pgdat->too_many_isolated))
> > >			return SWAP_CLUSTER_MAX;
> > >		got_mutex = true;
> > >	}
> > >	...
> > >	if (got_mutex)
> > >		mutex_unlock(&pgdat->too_many_isolated);
> > >
> > > Using a mutex to provide the intended throttling, without an infinite
> > > loop or an arbitrary delay; and without having to worry (as we often did)
> > > about whether those numbers in too_many_isolated() are really appropriate.
> > > No premature OOMs complained of yet.
> >
> > Roughly speaking, there is a moment where shrink_inactive_list() acts
> > like below.
> >
> >	bool got_mutex = false;
> >
> >	if (!current_is_kswapd()) {
> >		if (mutex_lock_killable(&pgdat->too_many_isolated))
> >			return SWAP_CLUSTER_MAX;
> >		got_mutex = true;
> >	}
> >
> >	// kswapd is blocked here waiting for !current_is_kswapd().
>
> That would be a shame, for kswapd to wait for !current_is_kswapd()!

Yes, but current code (not your patch) does allow kswapd to wait for
memory allocations of !current_is_kswapd() threads to complete.

> But seriously, I think I understand what you mean by that, you're
> thinking that kswapd would be waiting on some other task to clear
> the too_many_isolated() condition?

Yes.

> No, it does not work that way: kswapd (never seeing too_many_isolated()
> because that always says false when current_is_kswapd()) never tries to
> take the pgdat->too_many_isolated mutex itself: it does not wait there
> at all, although other tasks may be waiting there at the time.

I know. I described the behavior of your patch on the assumption that my
guess (that your "..." part corresponds to kswapd doing writepage) is
correct.

> Perhaps my naming the mutex "too_many_isolated", same as the function,
> is actually confusing, when I had intended it to be helpful.

Not confusing at all. It is helpful. I just wanted to confirm what comes
in your "..." part.

> >	if (got_mutex)
> >		mutex_unlock(&pgdat->too_many_isolated);
> >
> > > But that was on a different kernel, and there I did have to make sure
> > > that PF_MEMALLOC always prevented us from nesting: I'm not certain of
> > > that in the current kernel (but do remember Johannes changing the memcg
> > > end to make it use PF_MEMALLOC too). I offer the preview above, to see
> > > if you're interested in that alternative: if you are, then I'll go ahead
> > > and make it into an actual patch against v4.13-rc.
> >
> > I don't know what your actual patch looks like, but the problem is that
> > pgdat->too_many_isolated waits for kswapd while kswapd waits for
> > pgdat->too_many_isolated; nobody can unlock pgdat->too_many_isolated if
> > once we hit it.
>
> Not so (and we'd hardly be finding it a useful patch if that were so).

Current code allows kswapd to wait for memory allocation of
!current_is_kswapd() threads, and thus !current_is_kswapd() threads wait
for current_is_kswapd() threads while current_is_kswapd() threads wait
for !current_is_kswapd() threads; nobody can make too_many_isolated()
false once we hit it. Hence, this patch is proposed.

Thanks.
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Thu, 20 Jul 2017, Michal Hocko wrote:
> On Wed 19-07-17 18:54:40, Hugh Dickins wrote:
> [...]
> > You probably won't welcome getting into alternatives at this late stage;
> > but after hacking around it one way or another because of its pointless
> > lockups, I lost patience with that too_many_isolated() loop a few months
> > back (on realizing the enormous number of pages that may be isolated via
> > migrate_pages(2)), and we've been running nicely since with something like:
> >
> >	bool got_mutex = false;
> >
> >	if (unlikely(too_many_isolated(pgdat, file, sc))) {
> >		if (mutex_lock_killable(&pgdat->too_many_isolated))
> >			return SWAP_CLUSTER_MAX;
> >		got_mutex = true;
> >	}
> >	...
> >	if (got_mutex)
> >		mutex_unlock(&pgdat->too_many_isolated);
> >
> > Using a mutex to provide the intended throttling, without an infinite
> > loop or an arbitrary delay; and without having to worry (as we often did)
> > about whether those numbers in too_many_isolated() are really appropriate.
> > No premature OOMs complained of yet.
> >
> > But that was on a different kernel, and there I did have to make sure
> > that PF_MEMALLOC always prevented us from nesting: I'm not certain of
> > that in the current kernel (but do remember Johannes changing the memcg
> > end to make it use PF_MEMALLOC too). I offer the preview above, to see
> > if you're interested in that alternative: if you are, then I'll go ahead
> > and make it into an actual patch against v4.13-rc.
>
> I would rather get rid of any additional locking here and my ultimate
> goal is to make throttling at the page allocator layer rather than
> inside the reclaim.

Fair enough, I'm certainly in no hurry to send the patch, but thought it
worth mentioning.

Hugh
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Thu, 20 Jul 2017, Tetsuo Handa wrote:
> Hugh Dickins wrote:
> > You probably won't welcome getting into alternatives at this late stage;
> > but after hacking around it one way or another because of its pointless
> > lockups, I lost patience with that too_many_isolated() loop a few months
> > back (on realizing the enormous number of pages that may be isolated via
> > migrate_pages(2)), and we've been running nicely since with something like:
> >
> >	bool got_mutex = false;
> >
> >	if (unlikely(too_many_isolated(pgdat, file, sc))) {
> >		if (mutex_lock_killable(&pgdat->too_many_isolated))
> >			return SWAP_CLUSTER_MAX;
> >		got_mutex = true;
> >	}
> >	...
> >	if (got_mutex)
> >		mutex_unlock(&pgdat->too_many_isolated);
> >
> > Using a mutex to provide the intended throttling, without an infinite
> > loop or an arbitrary delay; and without having to worry (as we often did)
> > about whether those numbers in too_many_isolated() are really appropriate.
> > No premature OOMs complained of yet.
>
> Roughly speaking, there is a moment where shrink_inactive_list() acts
> like below.
>
>	bool got_mutex = false;
>
>	if (!current_is_kswapd()) {
>		if (mutex_lock_killable(&pgdat->too_many_isolated))
>			return SWAP_CLUSTER_MAX;
>		got_mutex = true;
>	}
>
>	// kswapd is blocked here waiting for !current_is_kswapd().

That would be a shame, for kswapd to wait for !current_is_kswapd()!

But seriously, I think I understand what you mean by that: you're
thinking that kswapd would be waiting on some other task to clear
the too_many_isolated() condition?

No, it does not work that way: kswapd (never seeing too_many_isolated()
because that always says false when current_is_kswapd()) never tries to
take the pgdat->too_many_isolated mutex itself: it does not wait there
at all, although other tasks may be waiting there at the time.

Perhaps my naming the mutex "too_many_isolated", same as the function,
is actually confusing, when I had intended it to be helpful.

>	if (got_mutex)
>		mutex_unlock(&pgdat->too_many_isolated);
>
> > But that was on a different kernel, and there I did have to make sure
> > that PF_MEMALLOC always prevented us from nesting: I'm not certain of
> > that in the current kernel (but do remember Johannes changing the memcg
> > end to make it use PF_MEMALLOC too). I offer the preview above, to see
> > if you're interested in that alternative: if you are, then I'll go ahead
> > and make it into an actual patch against v4.13-rc.
>
> I don't know what your actual patch looks like, but the problem is that
> pgdat->too_many_isolated waits for kswapd while kswapd waits for
> pgdat->too_many_isolated; nobody can unlock pgdat->too_many_isolated if
> once we hit it.

Not so (and we'd hardly be finding it a useful patch if that were so).

Hugh
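[Editorial note: Hugh's actual patch is not posted in this thread, but the behaviour he describes — direct reclaimers serializing on a per-node mutex while kswapd bypasses it entirely — can be sketched as a userspace pthread analogue. Everything here (`struct node`, `nr_isolated`, the threshold of 64, the return value of 32) is invented for illustration, and the "killable" aspect of mutex_lock_killable() is omitted; only the control flow mirrors the snippet quoted above.]

```c
#include <pthread.h>
#include <stdbool.h>

/* Userspace sketch of the mutex-based throttle: when too many pages are
 * isolated, direct reclaimers queue on a per-node mutex; kswapd-like
 * callers skip the check entirely and so can never block on it. */
struct node {
	pthread_mutex_t too_many_isolated;	/* throttles direct reclaimers */
	int nr_isolated;			/* pretend isolation counter */
};

static bool too_many_isolated(struct node *n)
{
	return n->nr_isolated > 64;		/* arbitrary illustrative limit */
}

/* Returns the number of pages "reclaimed". */
static int shrink_list(struct node *n, bool is_kswapd)
{
	bool got_mutex = false;

	/* kswapd never evaluates too_many_isolated() as true, so it never
	 * touches the mutex and cannot wait on other reclaimers here. */
	if (!is_kswapd && too_many_isolated(n)) {
		pthread_mutex_lock(&n->too_many_isolated);
		got_mutex = true;
	}

	int reclaimed = 32;			/* ... actual reclaim work ... */

	if (got_mutex)
		pthread_mutex_unlock(&n->too_many_isolated);
	return reclaimed;
}
```

Because the kswapd-like caller never reaches the mutex path, it can only be delayed by a direct reclaimer through means outside this function (e.g. waiting on writeback) — which is exactly the distinction Hugh draws above.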
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Fri 21-07-17 16:01:04, Andrew Morton wrote:
> On Thu, 20 Jul 2017 08:56:26 +0200 Michal Hocko wrote:
>
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> > > >  	int file = is_file_lru(lru);
> > > >  	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > > >  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> > > > +	bool stalled = false;
> > > >
> > > >  	while (unlikely(too_many_isolated(pgdat, file, sc))) {
> > > > -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > > +		if (stalled)
> > > > +			return 0;
> > > > +
> > > > +		/* wait a bit for the reclaimer. */
> > > > +		schedule_timeout_interruptible(HZ/10);
> > >
> > > a) if this task has signal_pending(), this falls straight through
> > >    and I suspect the code breaks?
> >
> > It will not break. It will return to the allocation path more quickly
> > but no over-reclaim will happen and it will/should get throttled there.
> > So nothing critical.
> >
> > > b) replacing congestion_wait() with schedule_timeout_interruptible()
> > >    means this task no longer contributes to load average here and it's
> > >    a (slightly) user-visible change.
> >
> > you are right. I am not sure it matters but it might be visible.
> >
> > > c) msleep_interruptible() is nicer
> > >
> > > d) IOW, methinks we should be using msleep() here?
> >
> > OK, I do not have objections. Are you going to squash this in or want a
> > separate patch explaining all the above?
>
> I'd prefer to have a comment explaining why interruptible sleep is
> being used, because that "what if signal_pending()" case is rather a
> red flag.

I didn't really consider interruptible vs. uninterruptible sleep, so it
wasn't a deliberate decision. Now that you have brought up the above
points, I am OK changing it to uninterruptible. Here is a fixup. I am
fine with this either folded in or as a separate patch.

---
From 4b295fc1625d499a2e1283b036aea005158b33cc Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Mon, 24 Jul 2017 08:43:23 +0200
Subject: [PATCH] mm, vmscan: throttle too_many_isolated by uninterruptible
 sleep

"mm, vmscan: do not loop on too_many_isolated for ever" has added an
interruptible sleep into shrink_inactive_list when we have isolated too
many pages. Andrew has noticed that we used to sleep in uninterruptible
sleep previously (in congestion_wait) and that changing that is wrong
for at least two reasons:
	- the waiting task would not participate in the load average anymore
	- a task with a pending signal will not sleep and will bail out
	  immediately
While neither of those issues is critical in any way, they are
unnecessary. The interruptible sleep was more an oversight than a
deliberate decision. Fix this by using msleep instead.

Spotted-by: Andrew Morton
Signed-off-by: Michal Hocko
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7ba9467b6d58..ed0c29a3db16 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1749,7 +1749,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 			return 0;
 
 		/* wait a bit for the reclaimer. */
-		schedule_timeout_interruptible(HZ/10);
+		msleep(100);
 		stalled = true;
 
 		/* We are about to die and free our memory. Return now. */
-- 
2.13.2

-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Thu, 20 Jul 2017 08:56:26 +0200 Michal Hocko wrote:

> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> > >  	int file = is_file_lru(lru);
> > >  	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > >  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> > > +	bool stalled = false;
> > >
> > >  	while (unlikely(too_many_isolated(pgdat, file, sc))) {
> > > -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > +		if (stalled)
> > > +			return 0;
> > > +
> > > +		/* wait a bit for the reclaimer. */
> > > +		schedule_timeout_interruptible(HZ/10);
> >
> > a) if this task has signal_pending(), this falls straight through
> >    and I suspect the code breaks?
>
> It will not break. It will return to the allocation path more quickly
> but no over-reclaim will happen and it will/should get throttled there.
> So nothing critical.
>
> > b) replacing congestion_wait() with schedule_timeout_interruptible()
> >    means this task no longer contributes to load average here and it's
> >    a (slightly) user-visible change.
>
> you are right. I am not sure it matters but it might be visible.
>
> > c) msleep_interruptible() is nicer
> >
> > d) IOW, methinks we should be using msleep() here?
>
> OK, I do not have objections. Are you going to squash this in or want a
> separate patch explaining all the above?

I'd prefer to have a comment explaining why interruptible sleep is being
used, because that "what if signal_pending()" case is rather a red flag.

Is it the case that fall-through-if-signal_pending() is the *preferred*
behaviour? If so, the comment should explain this.

If it isn't the preferred behaviour then using uninterruptible sleep
sounds better to me, if only because it saves us from having to test a
rather tricky and rare case.
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Wed 19-07-17 18:54:40, Hugh Dickins wrote:
[...]
> You probably won't welcome getting into alternatives at this late stage;
> but after hacking around it one way or another because of its pointless
> lockups, I lost patience with that too_many_isolated() loop a few months
> back (on realizing the enormous number of pages that may be isolated via
> migrate_pages(2)), and we've been running nicely since with something like:
>
>	bool got_mutex = false;
>
>	if (unlikely(too_many_isolated(pgdat, file, sc))) {
>		if (mutex_lock_killable(&pgdat->too_many_isolated))
>			return SWAP_CLUSTER_MAX;
>		got_mutex = true;
>	}
>	...
>	if (got_mutex)
>		mutex_unlock(&pgdat->too_many_isolated);
>
> Using a mutex to provide the intended throttling, without an infinite
> loop or an arbitrary delay; and without having to worry (as we often did)
> about whether those numbers in too_many_isolated() are really appropriate.
> No premature OOMs complained of yet.
>
> But that was on a different kernel, and there I did have to make sure
> that PF_MEMALLOC always prevented us from nesting: I'm not certain of
> that in the current kernel (but do remember Johannes changing the memcg
> end to make it use PF_MEMALLOC too). I offer the preview above, to see
> if you're interested in that alternative: if you are, then I'll go ahead
> and make it into an actual patch against v4.13-rc.

I would rather get rid of any additional locking here and my ultimate
goal is to make throttling at the page allocator layer rather than
inside the reclaim.

-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
Hugh Dickins wrote:
> You probably won't welcome getting into alternatives at this late stage;
> but after hacking around it one way or another because of its pointless
> lockups, I lost patience with that too_many_isolated() loop a few months
> back (on realizing the enormous number of pages that may be isolated via
> migrate_pages(2)), and we've been running nicely since with something like:
>
>	bool got_mutex = false;
>
>	if (unlikely(too_many_isolated(pgdat, file, sc))) {
>		if (mutex_lock_killable(&pgdat->too_many_isolated))
>			return SWAP_CLUSTER_MAX;
>		got_mutex = true;
>	}
>	...
>	if (got_mutex)
>		mutex_unlock(&pgdat->too_many_isolated);
>
> Using a mutex to provide the intended throttling, without an infinite
> loop or an arbitrary delay; and without having to worry (as we often did)
> about whether those numbers in too_many_isolated() are really appropriate.
> No premature OOMs complained of yet.

Roughly speaking, there is a moment where shrink_inactive_list() acts
like below.

	bool got_mutex = false;

	if (!current_is_kswapd()) {
		if (mutex_lock_killable(&pgdat->too_many_isolated))
			return SWAP_CLUSTER_MAX;
		got_mutex = true;
	}

	// kswapd is blocked here waiting for !current_is_kswapd().

	if (got_mutex)
		mutex_unlock(&pgdat->too_many_isolated);

> But that was on a different kernel, and there I did have to make sure
> that PF_MEMALLOC always prevented us from nesting: I'm not certain of
> that in the current kernel (but do remember Johannes changing the memcg
> end to make it use PF_MEMALLOC too). I offer the preview above, to see
> if you're interested in that alternative: if you are, then I'll go ahead
> and make it into an actual patch against v4.13-rc.

I don't know what your actual patch looks like, but the problem is that
pgdat->too_many_isolated waits for kswapd while kswapd waits for
pgdat->too_many_isolated; nobody can unlock pgdat->too_many_isolated if
once we hit it.
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Wed 19-07-17 15:20:14, Andrew Morton wrote:
> On Mon, 10 Jul 2017 09:48:42 +0200 Michal Hocko wrote:
>
> > From: Michal Hocko
> >
> > Tetsuo Handa has reported [1][2][3] that direct reclaimers might get stuck
> > in too_many_isolated loop basically for ever because the last few pages
> > on the LRU lists are isolated by the kswapd which is stuck on fs locks
> > when doing the pageout or slab reclaim. This in turn means that there is
> > nobody to actually trigger the oom killer and the system is basically
> > unusable.
> >
> > too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
> > direct reclaim when too many pages are isolated already") to prevent
> > from pre-mature oom killer invocations because back then no reclaim
> > progress could indeed trigger the OOM killer too early. But since the
> > oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> > the allocation/reclaim retry loop considers all the reclaimable pages
> > and throttles the allocation at that layer so we can loosen the direct
> > reclaim throttling.
> >
> > Make shrink_inactive_list loop over too_many_isolated bounded and returns
> > immediately when the situation hasn't resolved after the first sleep.
> > Replace congestion_wait by a simple schedule_timeout_interruptible because
> > we are not really waiting on the IO congestion in this path.
> >
> > Please note that this patch can theoretically cause the OOM killer to
> > trigger earlier while there are many pages isolated for the reclaim
> > which makes progress only very slowly. This would be obvious from the oom
> > report as the number of isolated pages are printed there. If we ever hit
> > this should_reclaim_retry should consider those numbers in the evaluation
> > in one way or another.
>
> Need to figure out which kernels to patch. Maybe just 4.13-rc after a
> week or two?

I do not think we need to rush it and the next merge window should be
just OK.
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> > 	int file = is_file_lru(lru);
> > 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> > +	bool stalled = false;
> >
> > 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
> > -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +		if (stalled)
> > +			return 0;
> > +
> > +		/* wait a bit for the reclaimer. */
> > +		schedule_timeout_interruptible(HZ/10);
>
> a) if this task has signal_pending(), this falls straight through
>    and I suspect the code breaks?

It will not break. It will return to the allocation path more quickly
but no over-reclaim will happen and it will/should get throttled there.
So nothing critical.

> b) replacing congestion_wait() with schedule_timeout_interruptible()
>    means this task no longer contributes to load average here and it's
>    a (slightly) user-visible change.

you are right. I am not sure it matters but it might be visible.

> c) msleep_interruptible() is nicer
>
> d) IOW, methinks we should be using msleep() here?

OK, I do not have objections. Are you going to squash this in or want a
separate patch explaining all the above?
--
Michal Hocko
SUSE Labs
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Mon, 10 Jul 2017, Michal Hocko wrote:
> From: Michal Hocko
>
> Tetsuo Handa has reported [1][2][3] that direct reclaimers might get stuck
> in too_many_isolated loop basically for ever because the last few pages
> on the LRU lists are isolated by the kswapd which is stuck on fs locks
> when doing the pageout or slab reclaim. This in turn means that there is
> nobody to actually trigger the oom killer and the system is basically
> unusable.
>
> too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
> direct reclaim when too many pages are isolated already") to prevent
> from pre-mature oom killer invocations because back then no reclaim
> progress could indeed trigger the OOM killer too early. But since the
> oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> the allocation/reclaim retry loop considers all the reclaimable pages
> and throttles the allocation at that layer so we can loosen the direct
> reclaim throttling.
>
> Make shrink_inactive_list loop over too_many_isolated bounded and returns
> immediately when the situation hasn't resolved after the first sleep.
> Replace congestion_wait by a simple schedule_timeout_interruptible because
> we are not really waiting on the IO congestion in this path.
>
> Please note that this patch can theoretically cause the OOM killer to
> trigger earlier while there are many pages isolated for the reclaim
> which makes progress only very slowly. This would be obvious from the oom
> report as the number of isolated pages are printed there. If we ever hit
> this should_reclaim_retry should consider those numbers in the evaluation
> in one way or another.
> > [1] > http://lkml.kernel.org/r/201602092349.acg81273.osvtmjqhlof...@i-love.sakura.ne.jp > [2] > http://lkml.kernel.org/r/201702212335.djb30777.jofmhsftvlq...@i-love.sakura.ne.jp > [3] > http://lkml.kernel.org/r/201706300914.ceh95859.fmqolvfhjft...@i-love.sakura.ne.jp > > Acked-by: Mel Gorman > Tested-by: Tetsuo Handa > Signed-off-by: Michal Hocko > --- > Hi, > I am resubmitting this patch previously sent here > http://lkml.kernel.org/r/20170307133057.26182-1-mho...@kernel.org. > > Johannes and Rik had some concerns that this could lead to premature > OOM kills. I agree with them that we need a better throttling > mechanism. Until now we didn't give the issue described above a high > priority because it usually required a really insane workload to > trigger. But it seems that the issue can be reproduced also without > having an insane number of competing threads [3]. > > Moreover, the issue also triggers very often while testing heavy memory > pressure and so prevents further development of hardening of that area > (http://lkml.kernel.org/r/201707061948.icj18763.tvfoqfohmjf...@i-love.sakura.ne.jp). > Tetsuo hasn't seen any negative effect of this patch in his oom stress > tests so I think we should go with this simple patch for now and think > about something more robust long term. > > That being said I suggest merging this (after spending the full release > cycle in linux-next) for the time being until we come up with a more > clever solution. 
>
> mm/vmscan.c | 8 +++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c15b2e4c47ca..4ae069060ae5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> 	int file = is_file_lru(lru);
> 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> +	bool stalled = false;
>
> 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
> -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> +		if (stalled)
> +			return 0;
> +
> +		/* wait a bit for the reclaimer. */
> +		schedule_timeout_interruptible(HZ/10);
> +		stalled = true;
>
> 		/* We are about to die and free our memory. Return now. */
> 		if (fatal_signal_pending(current))
> --

You probably won't welcome getting into alternatives at this late stage;
but after hacking around it one way or another because of its pointless
lockups, I lost patience with that too_many_isolated() loop a few months
back (on realizing the enormous number of pages that may be isolated via
migrate_pages(2)), and we've been running nicely since with something like:

	bool got_mutex = false;

	if (unlikely(too_many_isolated(pgdat, file, sc))) {
		if (mutex_lock_killable(&pgdat->too_many_isolated))
			return SWAP_CLUSTER_MAX;
		got_mutex = true;
	}
	...
	if (got_mutex)
		mutex_unlock(&pgdat->too_many_isolated);

Using a mutex to provide the intended throttling, without an infinite
loop or an arbitrary delay; and without having to worry (as we often did)
about whether those numbers in too_many_isolated() are really appropriate.
No premature OOMs complained of yet.
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Mon, 10 Jul 2017 09:48:42 +0200 Michal Hocko wrote:
> From: Michal Hocko
>
> Tetsuo Handa has reported [1][2][3] that direct reclaimers might get stuck
> in too_many_isolated loop basically for ever because the last few pages
> on the LRU lists are isolated by the kswapd which is stuck on fs locks
> when doing the pageout or slab reclaim. This in turn means that there is
> nobody to actually trigger the oom killer and the system is basically
> unusable.
>
> too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
> direct reclaim when too many pages are isolated already") to prevent
> from pre-mature oom killer invocations because back then no reclaim
> progress could indeed trigger the OOM killer too early. But since the
> oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> the allocation/reclaim retry loop considers all the reclaimable pages
> and throttles the allocation at that layer so we can loosen the direct
> reclaim throttling.
>
> Make shrink_inactive_list loop over too_many_isolated bounded and returns
> immediately when the situation hasn't resolved after the first sleep.
> Replace congestion_wait by a simple schedule_timeout_interruptible because
> we are not really waiting on the IO congestion in this path.
>
> Please note that this patch can theoretically cause the OOM killer to
> trigger earlier while there are many pages isolated for the reclaim
> which makes progress only very slowly. This would be obvious from the oom
> report as the number of isolated pages are printed there. If we ever hit
> this should_reclaim_retry should consider those numbers in the evaluation
> in one way or another.

Need to figure out which kernels to patch. Maybe just 4.13-rc after a
week or two?
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> 	int file = is_file_lru(lru);
> 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> +	bool stalled = false;
>
> 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
> -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> +		if (stalled)
> +			return 0;
> +
> +		/* wait a bit for the reclaimer. */
> +		schedule_timeout_interruptible(HZ/10);

a) if this task has signal_pending(), this falls straight through
   and I suspect the code breaks?

b) replacing congestion_wait() with schedule_timeout_interruptible()
   means this task no longer contributes to load average here and it's
   a (slightly) user-visible change.

c) msleep_interruptible() is nicer

d) IOW, methinks we should be using msleep() here?

> +		stalled = true;
>
> 		/* We are about to die and free our memory. Return now. */
> 		if (fatal_signal_pending(current))

(Gets distracted by the thought that we should do
s/msleep/msleep_uninterruptible/g)
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Mon 10-07-17 12:58:59, Johannes Weiner wrote:
> On Mon, Jul 10, 2017 at 09:58:03AM -0400, Rik van Riel wrote:
> > On Mon, 2017-07-10 at 09:48 +0200, Michal Hocko wrote:
> > >
> > > Johannes and Rik had some concerns that this could lead to premature
> > > OOM kills. I agree with them that we need a better throttling
> > > mechanism. Until now we didn't give the issue described above a high
> > > priority because it usually required a really insane workload to
> > > trigger. But it seems that the issue can be reproduced also without
> > > having an insane number of competing threads [3].
> >
> > My worries stand, but lets fix the real observed bug, and not worry
> > too much about the theoretical bug for now.
> >
> > Acked-by: Rik van Riel
>
> I agree with this.
>
> Acked-by: Johannes Weiner

Thanks to both of you. Just to make it clear. I really do want to
address the throttling problem longterm properly. I do not have any
great ideas to be honest. I am busy with other things so it might be
quite some time before I come up with something.
--
Michal Hocko
SUSE Labs
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Mon, Jul 10, 2017 at 09:58:03AM -0400, Rik van Riel wrote:
> On Mon, 2017-07-10 at 09:48 +0200, Michal Hocko wrote:
> >
> > Johannes and Rik had some concerns that this could lead to premature
> > OOM kills. I agree with them that we need a better throttling
> > mechanism. Until now we didn't give the issue described above a high
> > priority because it usually required a really insane workload to
> > trigger. But it seems that the issue can be reproduced also without
> > having an insane number of competing threads [3].
>
> My worries stand, but lets fix the real observed bug, and not worry
> too much about the theoretical bug for now.
>
> Acked-by: Rik van Riel

I agree with this.

Acked-by: Johannes Weiner
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Mon, 2017-07-10 at 09:48 +0200, Michal Hocko wrote: > Johannes and Rik had some concerns that this could lead to premature > OOM kills. I agree with them that we need a better throttling > mechanism. Until now we didn't give the issue described above a high > priority because it usually required a really insane workload to > trigger. But it seems that the issue can be reproduced also without > having an insane number of competing threads [3]. My worries stand, but lets fix the real observed bug, and not worry too much about the theoretical bug for now. Acked-by: Rik van Riel
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On 07/10/2017 09:48 AM, Michal Hocko wrote:
> From: Michal Hocko
>
> Tetsuo Handa has reported [1][2][3] that direct reclaimers might get stuck
> in too_many_isolated loop basically for ever because the last few pages
> on the LRU lists are isolated by the kswapd which is stuck on fs locks
> when doing the pageout or slab reclaim. This in turn means that there is
> nobody to actually trigger the oom killer and the system is basically
> unusable.
>
> too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
> direct reclaim when too many pages are isolated already") to prevent
> from pre-mature oom killer invocations because back then no reclaim
> progress could indeed trigger the OOM killer too early. But since the
> oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> the allocation/reclaim retry loop considers all the reclaimable pages
> and throttles the allocation at that layer so we can loosen the direct
> reclaim throttling.
>
> Make shrink_inactive_list loop over too_many_isolated bounded and returns
> immediately when the situation hasn't resolved after the first sleep.
> Replace congestion_wait by a simple schedule_timeout_interruptible because
> we are not really waiting on the IO congestion in this path.
>
> Please note that this patch can theoretically cause the OOM killer to
> trigger earlier while there are many pages isolated for the reclaim
> which makes progress only very slowly. This would be obvious from the oom
> report as the number of isolated pages are printed there. If we ever hit
> this should_reclaim_retry should consider those numbers in the evaluation
> in one way or another.
>
> [1] http://lkml.kernel.org/r/201602092349.acg81273.osvtmjqhlof...@i-love.sakura.ne.jp
> [2] http://lkml.kernel.org/r/201702212335.djb30777.jofmhsftvlq...@i-love.sakura.ne.jp
> [3] http://lkml.kernel.org/r/201706300914.ceh95859.fmqolvfhjft...@i-love.sakura.ne.jp
>
> Acked-by: Mel Gorman
> Tested-by: Tetsuo Handa
> Signed-off-by: Michal Hocko

Let's hope there won't be premature OOM's then.

Acked-by: Vlastimil Babka
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On 07/10/2017 09:48 AM, Michal Hocko wrote: > From: Michal Hocko > > Tetsuo Handa has reported [1][2][3]that direct reclaimers might get stuck > in too_many_isolated loop basically for ever because the last few pages > on the LRU lists are isolated by the kswapd which is stuck on fs locks > when doing the pageout or slab reclaim. This in turn means that there is > nobody to actually trigger the oom killer and the system is basically > unusable. > > too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle > direct reclaim when too many pages are isolated already") to prevent > from pre-mature oom killer invocations because back then no reclaim > progress could indeed trigger the OOM killer too early. But since the > oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection") > the allocation/reclaim retry loop considers all the reclaimable pages > and throttles the allocation at that layer so we can loosen the direct > reclaim throttling. > > Make shrink_inactive_list loop over too_many_isolated bounded and returns > immediately when the situation hasn't resolved after the first sleep. > Replace congestion_wait by a simple schedule_timeout_interruptible because > we are not really waiting on the IO congestion in this path. > > Please note that this patch can theoretically cause the OOM killer to > trigger earlier while there are many pages isolated for the reclaim > which makes progress only very slowly. This would be obvious from the oom > report as the number of isolated pages are printed there. If we ever hit > this should_reclaim_retry should consider those numbers in the evaluation > in one way or another. 
> [1] http://lkml.kernel.org/r/201602092349.acg81273.osvtmjqhlof...@i-love.sakura.ne.jp
> [2] http://lkml.kernel.org/r/201702212335.djb30777.jofmhsftvlq...@i-love.sakura.ne.jp
> [3] http://lkml.kernel.org/r/201706300914.ceh95859.fmqolvfhjft...@i-love.sakura.ne.jp
>
> Acked-by: Mel Gorman
> Tested-by: Tetsuo Handa
> Signed-off-by: Michal Hocko

Let's hope there won't be premature OOM's then.

Acked-by: Vlastimil Babka
[PATCH] mm, vmscan: do not loop on too_many_isolated for ever
From: Michal Hocko

Tetsuo Handa has reported [1][2][3] that direct reclaimers might get stuck
in the too_many_isolated loop basically for ever because the last few pages
on the LRU lists are isolated by kswapd, which is stuck on fs locks when
doing pageout or slab reclaim. This in turn means that there is nobody to
actually trigger the OOM killer and the system is basically unusable.

too_many_isolated was introduced by 35cd78156c49 ("vmscan: throttle direct
reclaim when too many pages are isolated already") to prevent premature OOM
killer invocations, because back then no reclaim progress could indeed
trigger the OOM killer too early. But since the OOM detection rework
0a0337e0d1d1 ("mm, oom: rework oom detection"), the allocation/reclaim
retry loop considers all the reclaimable pages and throttles the allocation
at that layer, so we can loosen the direct reclaim throttling.

Make the shrink_inactive_list loop over too_many_isolated bounded and
return immediately when the situation hasn't resolved after the first
sleep. Replace congestion_wait by a simple schedule_timeout_interruptible
because we are not really waiting on IO congestion in this path.

Please note that this patch can theoretically cause the OOM killer to
trigger earlier while there are many pages isolated for reclaim which
makes progress only very slowly. This would be obvious from the OOM report,
as the number of isolated pages is printed there. If we ever hit this,
should_reclaim_retry should consider those numbers in the evaluation in
one way or another.
[1] http://lkml.kernel.org/r/201602092349.acg81273.osvtmjqhlof...@i-love.sakura.ne.jp
[2] http://lkml.kernel.org/r/201702212335.djb30777.jofmhsftvlq...@i-love.sakura.ne.jp
[3] http://lkml.kernel.org/r/201706300914.ceh95859.fmqolvfhjft...@i-love.sakura.ne.jp

Acked-by: Mel Gorman
Tested-by: Tetsuo Handa
Signed-off-by: Michal Hocko
---
Hi,
I am resubmitting this patch previously sent here
http://lkml.kernel.org/r/20170307133057.26182-1-mho...@kernel.org.

Johannes and Rik had some concerns that this could lead to premature OOM
kills. I agree with them that we need a better throttling mechanism. Until
now we didn't give the issue described above a high priority because it
usually required a really insane workload to trigger. But it seems that the
issue can be reproduced also without having an insane number of competing
threads [3]. Moreover, the issue also triggers very often while testing
heavy memory pressure and so prevents further development of hardening of
that area
(http://lkml.kernel.org/r/201707061948.icj18763.tvfoqfohmjf...@i-love.sakura.ne.jp).
Tetsuo hasn't seen any negative effect of this patch in his oom stress
tests, so I think we should go with this simple patch for now and think
about something more robust long term.

That being said, I suggest merging this (after spending the full release
cycle in linux-next) for the time being until we come up with a more
clever solution.

 mm/vmscan.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c15b2e4c47ca..4ae069060ae5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	int file = is_file_lru(lru);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	bool stalled = false;
 
 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		if (stalled)
+			return 0;
+
+		/* wait a bit for the reclaimer. */
+		schedule_timeout_interruptible(HZ/10);
+		stalled = true;
 
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
-- 
2.11.0
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
Michal Hocko wrote:
> On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > It is really hard to pursue this half solution when there is no clear
> > > indication it helps in your testing. So could you try to test with only
> > > this patch on top of the current linux-next tree (or Linus tree) and see
> > > if you can reproduce the problem?
> >
> > With this patch on top of next-20170630, I no longer hit this problem.
> > (Of course, this is because this patch eliminates the infinite loop.)
>
> I assume you haven't seen other negative side effects, like unexpected
> OOMs etc... Are you willing to give your Tested-by?

I didn't see other negative side effects.

Tested-by: Tetsuo Handa

We need a long time for testing this patch in linux-next.git (and I give up
this handy bug for finding other bugs under almost-OOM situations).
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > It is really hard to pursue this half solution when there is no clear
> > indication it helps in your testing. So could you try to test with only
> > this patch on top of the current linux-next tree (or Linus tree) and see
> > if you can reproduce the problem?
>
> With this patch on top of next-20170630, I no longer hit this problem.
> (Of course, this is because this patch eliminates the infinite loop.)

I assume you haven't seen other negative side effects, like unexpected
OOMs etc... Are you willing to give your Tested-by?
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
[this is getting tangent again and I will not respond any further if this
turns into yet another flame]

On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > I really do appreciate your testing because it uncovers corner cases
> > most people do not test for and we can actually make the code better in
> > the end.
>
> That statement does not get to my heart at all. Collision between your
> approach and my approach is wasting both your time and my time.
>
> I've reported this too_many_isolated() trap three years ago at
> http://lkml.kernel.org/r/201407022140.bfj13092.qvosjtfmfhl...@i-love.sakura.ne.jp .
> Do you know that we already wasted 3 years without any attention?

And how many real bugs have we seen in those three years? Well, zero AFAIR,
except for your corner case testing. So while I never dismissed the
problem, I've been saying this is not that trivial to fix, as my attempt to
address it and the review feedback I've received shows.

> You are rejecting serialization under OOM without giving a chance to test
> side effects of serialization under OOM at linux-next.git. I call such an
> attitude "speculation", which you never accept.

No, I am rejecting abusing the lock for a purpose it is not aimed for.

> Look at mem_cgroup_out_of_memory(). Memcg OOM does use serialization.
> In the first place, if the system is under global OOM (which is a more
> serious situation than memcg OOM), delay caused by serialization will not
> matter. Rather, I consider that making sure that the system does not get
> locked up is more important. I'm reporting that serialization helps
> facilitating the OOM killer/reaper operations, avoiding lockups, and
> solving global OOM situations smoothly. But you are refusing my report
> without giving a chance to test what side effects will pop up at
> linux-next.git.

You are mixing oranges with apples here. We do synchronize the memcg oom
killer the same way as the global one.

> Knowledge about OOM situations is hardly shared among Linux developers
> and users, and is far from an object of concern. Like shown by the
> cgroup-aware OOM killer proposal, what will happen if we restrict
> 0 <= oom_victims <= 1 is not shared among developers.
>
> How many developers joined my OOM watchdog proposal? Every time and ever
> it is a confrontation between you and me. You, as effectively the only
> participant, showing negative attitude is effectively a Nacked-by:
> response without an alternative proposal.

This is something all of us have to fight with. There are only so many MM
developers. You have to justify your changes in order to attract other
developers/users. You are basing your changes on speculations and what-ifs
for workloads that most developers consider borderline and
misconfigurations already.

> Not everybody can afford testing with absolutely latest upstream kernels.
> Not being prepared to obtain information for analysis using distributor
> kernels makes it impossible to compare whether users' problems are
> already fixed in upstream kernels, makes it impossible to identify
> patches which need to be backported to distributor kernels, and is bad
> for customers using distributor kernels. Of course, it is possible that
> distributors decide not to allow users to obtain information for
> analysis, but such a decision cannot become a reason we cannot prepare
> to obtain information for analysis in upstream kernels.

If you have to work with distribution kernels then talk to distribution
people. It is that simple. You are surely not using those systems just
because of a fancy logo...

[...]

> > this way of pushing your patch is really annoying. Please do realize
> > that repeating the same thing all around will not make a patch more
> > likely to merge. You have proposed something, nobody has nacked it
> > so it waits for people to actually find it important enough to justify
> > the additional code. So please stop this.
>
> When will people find time to judge it? We already wasted three years,
> and knowledge about OOM situations is hardly shared among Linux
> developers and users, and will unlikely be an object of concern. How many
> years (or decades) will we waste more? The MM subsystem will change
> meanwhile and we will just ignore old kernels.
>
> If you do want me to stop bringing the watchdog here and there, please do
> show an alternative approach which I can tolerate. If you cannot afford
> it, please allow me to involve people (e.g. you make calls for joining my
> proposals, because you are asking me to wait until people find time to
> judge it). Please do realize that just repeatedly saying "wait patiently"
> helps nothing.

You really have to realize that there will hardly be more interest in your
reports when they do not reflect real life situations. I have said (several
times) that those issues should be addressed eventually, but there are more
pressing issues which do trigger in real life and they have precedence.
Should we add a
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
Michal Hocko wrote:
> I really do appreciate your testing because it uncovers corner cases
> most people do not test for and we can actually make the code better in
> the end.

That statement does not get to my heart at all. Collision between your
approach and my approach is wasting both your time and my time.

I've reported this too_many_isolated() trap three years ago at
http://lkml.kernel.org/r/201407022140.bfj13092.qvosjtfmfhl...@i-love.sakura.ne.jp .
Do you know that we already wasted 3 years without any attention?

You are rejecting serialization under OOM without giving a chance to test
side effects of serialization under OOM at linux-next.git. I call such an
attitude "speculation", which you never accept.

Look at mem_cgroup_out_of_memory(). Memcg OOM does use serialization.
In the first place, if the system is under global OOM (which is a more
serious situation than memcg OOM), delay caused by serialization will not
matter. Rather, I consider that making sure that the system does not get
locked up is more important. I'm reporting that serialization helps
facilitating the OOM killer/reaper operations, avoiding lockups, and
solving global OOM situations smoothly. But you are refusing my report
without giving a chance to test what side effects will pop up at
linux-next.git.

Knowledge about OOM situations is hardly shared among Linux developers and
users, and is far from an object of concern. Like shown by the
cgroup-aware OOM killer proposal, what will happen if we restrict
0 <= oom_victims <= 1 is not shared among developers.

How many developers joined my OOM watchdog proposal? Every time and ever it
is a confrontation between you and me. You, as effectively the only
participant, showing negative attitude is effectively a Nacked-by:
response without an alternative proposal.

Not everybody can afford testing with absolutely latest upstream kernels.
Not being prepared to obtain information for analysis using distributor
kernels makes it impossible to compare whether users' problems are already
fixed in upstream kernels, makes it impossible to identify patches which
need to be backported to distributor kernels, and is bad for customers
using distributor kernels. Of course, it is possible that distributors
decide not to allow users to obtain information for analysis, but such a
decision cannot become a reason we cannot prepare to obtain information
for analysis in upstream kernels.

Suppose I take a step back and tolerate the burden of sitting in front of
a console 24 hours a day, every day of the year so that users can press
SysRq when something goes wrong; how nice it would be if all in-flight
allocation requests were printed upon SysRq. show_workqueue_state() being
called upon SysRq-t is to some degree useful. In fact, my proposal was
such an approach before I serialized using a kernel thread (e.g.
http://lkml.kernel.org/r/201411231351.hja17065.vhqsfojftlf...@i-love.sakura.ne.jp
which I proposed two and a half years ago). Though, while my proposal was
left ignored, I learned that showing only the current thread is not
sufficient, and updated my watchdog to show other threads (e.g. kswapd)
using serialization.

A patch at
http://lkml.kernel.org/r/201505232339.dab00557.vfflhmsojfo...@i-love.sakura.ne.jp
which I posted two years ago also includes a proposal for handling the
infinite shrink_inactive_list() problem. After all, this
shrink_inactive_list() problem was ignored for three years without getting
a chance to even be tested at linux-next.git. Sigh...

I know my proposals might not be best. But you cannot afford showing
alternative proposals because you are putting higher priority on other
problems. And other developers cannot afford participating because they
are not interested in, or do not share knowledge of, this problem.

My proposals do not constrain future kernels. We can revert my proposals
when they become no longer needed. My proposals are meaningful as an
interim approach, but you never accept approaches which do not match your
will (or desire). Even without giving people a chance to test what side
effects will crop up, how can your "I really do appreciate your testing"
statement get to my heart?

My watchdog allows detecting problems which were previously overlooked
unless putting unrealistic burden (e.g. standing by 24 hours a day, every
day of the year). You ask people to prove that it is a MM problem. But I
am dissatisfied that you are letting alone proposals which help judging
whether it is a MM problem.

> this way of pushing your patch is really annoying. Please do realize
> that repeating the same thing all around will not make a patch more
> likely to merge. You have proposed something, nobody has nacked it
> so it waits for people to actually find it important enough to justify
> the additional code. So please stop this.

When will people find time to judge it? We already wasted three years, and
knowledge about OOM situations is hardly shared among Linux developers and
users, and will
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Sat 01-07-17 00:59:56, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 30-06-17 09:14:22, Tetsuo Handa wrote:
> > [...]
> > > Ping? Ping? When are we going to apply this patch or the watchdog patch?
> > > This problem occurs with not so insane stress like shown below.
> > > I can't test almost-OOM situations because the test likely falls into
> > > either the printk() v.s. oom_lock lockup problem or this
> > > too_many_isolated() problem.
> >
> > So you are saying that the patch fixes this issue. Do I understand you
> > correctly? And you do not see any other negative side effects with it
> > applied?
>
> I hit this problem using
> http://lkml.kernel.org/r/20170626130346.26314-1-mho...@kernel.org
> on next-20170628. We won't be able to test whether the patch fixes this
> issue without seeing any other negative side effects without sending this
> patch to linux-next.git. But at least we know that even if this patch is
> sent to linux-next.git, we will still see bugs like
> http://lkml.kernel.org/r/201703031948.chj81278.vohsfffoolj...@i-love.sakura.ne.jp .

It is really hard to pursue this half solution when there is no clear
indication it helps in your testing. So could you try to test with only
this patch on top of the current linux-next tree (or Linus tree) and see
if you can reproduce the problem? It is possible that there are other
potential problems but we at least need to know whether it is worth going
with the patch now.

[...]

> > Rik, Johannes what do you think? Should we go with the simpler approach
> > for now and think of a better plan longterm?
>
> I don't hurry if we can check using a watchdog whether this problem is
> occurring in the real world. I have to test corner cases because the
> watchdog is missing.
>
> Watchdog does not introduce negative side effects, will avoid soft
> lockups like
> http://lkml.kernel.org/r/cam_iqpwupvgc2ky8m-9yukects+zkjidasnymx7rmcbjbfy...@mail.gmail.com ,
> will avoid console_unlock() v.s. oom_lock mutex lockups due to
> warn_alloc(), and will catch similar bugs which people are failing to
> reproduce.

this way of pushing your patch is really annoying. Please do realize
that repeating the same thing all around will not make a patch more
likely to merge. You have proposed something, nobody has nacked it
so it waits for people to actually find it important enough to justify
the additional code. So please stop this.

I really do appreciate your testing because it uncovers corner cases
most people do not test for and we can actually make the code better in
the end.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
Michal Hocko wrote:
> On Fri 30-06-17 09:14:22, Tetsuo Handa wrote:
> [...]
> > Ping? Ping? When are we going to apply this patch or watchdog patch?
> > This problem occurs with not so insane stress like shown below.
> > I can't test almost OOM situation because test likely falls into either
> > printk() v.s. oom_lock lockup problem or this too_many_isolated() problem.
>
> So you are saying that the patch fixes this issue. Do I understand you
> correctly? And you do not see any other negative side effects with it
> applied?

I hit this problem using
http://lkml.kernel.org/r/20170626130346.26314-1-mho...@kernel.org
on next-20170628. We won't be able to test whether the patch fixes this issue
without seeing any other negative side effects without sending this patch to
linux-next.git. But at least we know that even if this patch is sent to
linux-next.git, we will still see bugs like
http://lkml.kernel.org/r/201703031948.chj81278.vohsfffoolj...@i-love.sakura.ne.jp .

> I am sorry I didn't have much time to think about feedback from Johannes
> yet. A more robust throttling method is surely due but also not trivial.
> So I am not sure how to proceed. It is true that your last test case
> with only 10 processes fighting resembles the reality much better than
> hundreds (AFAIR) that you were using previously.

Even if hundreds are running, most of them are simply blocked inside open()
at down_write() (like an example from serial-20170423-2.txt.xz shown below).
The actual number of processes fighting for memory is always less than 100.

 ? __schedule+0x1d2/0x5a0
 ? schedule+0x2d/0x80
 ? rwsem_down_write_failed+0x1f9/0x370
 ? walk_component+0x43/0x270
 ? call_rwsem_down_write_failed+0x13/0x20
 ? down_write+0x24/0x40
 ? path_openat+0x670/0x1210
 ? do_filp_open+0x8c/0x100
 ? getname_flags+0x47/0x1e0
 ? do_sys_open+0x121/0x200
 ? do_syscall_64+0x5c/0x140
 ? entry_SYSCALL64_slow_path+0x25/0x25

> Rik, Johannes, what do you think? Should we go with the simpler approach
> for now and think of a better plan long term?

I don't hurry if we can check using watchdog whether this problem is
occurring in the real world. I have to test corner cases because watchdog is
missing.

Watchdog does not introduce negative side effects, will avoid soft lockups
like
http://lkml.kernel.org/r/cam_iqpwupvgc2ky8m-9yukects+zkjidasnymx7rmcbjbfy...@mail.gmail.com ,
will avoid console_unlock() v.s. oom_lock mutex lockups due to warn_alloc(),
and will catch similar bugs which people are failing to reproduce.
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Fri 30-06-17 09:14:22, Tetsuo Handa wrote:
[...]
> Ping? Ping? When are we going to apply this patch or watchdog patch?
> This problem occurs with not so insane stress like shown below.
> I can't test almost OOM situation because test likely falls into either
> printk() v.s. oom_lock lockup problem or this too_many_isolated() problem.

So you are saying that the patch fixes this issue. Do I understand you
correctly? And you do not see any other negative side effects with it
applied?

I am sorry I didn't have much time to think about feedback from Johannes yet.
A more robust throttling method is surely due but also not trivial. So I am
not sure how to proceed. It is true that your last test case with only 10
processes fighting resembles the reality much better than the hundreds
(AFAIR) that you were using previously.

Rik, Johannes, what do you think? Should we go with the simpler approach for
now and think of a better plan long term?
--
Michal Hocko
SUSE Labs
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 09-03-17 13:05:40, Johannes Weiner wrote:
> > > On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> > > > It only does this to some extent. If reclaim made no progress, for
> > > > example due to immediately bailing out because the number of already
> > > > isolated pages is too high (due to many parallel reclaimers), the
> > > > code could hit the "no_progress_loops > MAX_RECLAIM_RETRIES" test
> > > > without ever looking at the number of reclaimable pages.
> > >
> > > Hm, there is no early return there, actually. We bump the loop counter
> > > every time it happens, but then *do* look at the reclaimable pages.
> > >
> > > > Could that create problems if we have many concurrent reclaimers?
> > >
> > > With increased concurrency, the likelihood of OOM will go up if we
> > > remove the unlimited wait for isolated pages, that much is true.
> > >
> > > I'm not sure that's a bad thing, however, because we want the OOM
> > > killer to be predictable and timely. So a reasonable wait time in
> > > between 0 and forever before an allocating thread gives up under
> > > extreme concurrency makes sense to me.
> > >
> > > > It may be OK, I just do not understand all the implications.
> > > >
> > > > I like the general direction your patch takes the code in,
> > > > but I would like to understand it better...
> > >
> > > I feel the same way. The throttling logic doesn't seem to be very well
> > > thought out at the moment, making it hard to reason about what happens
> > > in certain scenarios.
> > >
> > > In that sense, this patch isn't really an overall improvement to the
> > > way things work. It patches a hole that seems to be exploitable only
> > > from an artificial OOM torture test, at the risk of regressing high
> > > concurrency workloads that may or may not be artificial.
> > >
> > > Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> > > behind this patch. Can we think about a general model to deal with
> > > allocation concurrency?
> >
> > I am definitely not against. There is no reason to rush the patch in.
>
> I don't hurry if we can check using watchdog whether this problem is
> occurring in the real world. I have to test corner cases because watchdog
> is missing.
>
> > My main point behind this patch was to reduce unbound loops from inside
> > the reclaim path and push any throttling up the call chain to the
> > page allocator path because I believe that it is easier to reason
> > about them at that level. The direct reclaim should be as simple as
> > possible without too many side effects otherwise we end up in a highly
> > unpredictable behavior. This was a first step in that direction and my
> > testing so far didn't show any regressions.
> >
> > > Unlimited parallel direct reclaim is kinda bonkers in the first place.
> > > How about checking for excessive isolation counts from the page
> > > allocator and putting allocations on a waitqueue?
> >
> > I would be interested in details here.
>
> That will help implementing __GFP_KILLABLE.
> https://bugzilla.kernel.org/show_bug.cgi?id=192981#c15

Ping? Ping? When are we going to apply this patch or watchdog patch?
This problem occurs with not so insane stress like shown below.
I can't test almost OOM situation because test likely falls into either
printk() v.s. oom_lock lockup problem or this too_many_isolated() problem.
--
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>

int main(int argc, char *argv[])
{
	static char buffer[4096] = { };
	char *buf = NULL;
	unsigned long size;
	int i;
	for (i = 0; i < 10; i++) {
		if (fork() == 0) {
			int fd = open("/proc/self/oom_score_adj", O_WRONLY);
			write(fd, "1000", 4);
			close(fd);
			sleep(1);
			if (!i)
				pause();
			snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid());
			fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
			while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer))
				fsync(fd);
			_exit(0);
		}
	}
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sleep(2);
	/* Will cause OOM due to overcommit */
	for (i = 0; i < size; i += 4096)
		buf[i] = 0;
	return 0;
}
--
Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170629-3.txt.xz .
[  190.924887] a.out           D13296  2191   2172 0x00000080
[  190.927121] Call Trace:
[  190.928304]
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Mon, Apr 24, 2017 at 10:06:32PM +0900, Tetsuo Handa wrote:
> Stanislaw Gruszka wrote:
> > On Sun, Apr 23, 2017 at 07:24:21PM +0900, Tetsuo Handa wrote:
> > > On 2017/03/10 20:44, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > I am definitely not against. There is no reason to rush the patch in.
> > > >
> > > > I don't hurry if we can check using watchdog whether this problem is
> > > > occurring in the real world. I have to test corner cases because
> > > > watchdog is missing.
> > >
> > > Ping?
> > >
> > > This problem can occur even immediately after the first invocation of
> > > the OOM killer. I believe this problem can occur in the real world.
> > > When are we going to apply this patch or watchdog patch?
> > >
> > > [    0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity)
> > > (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr
> > > 23 17:38:02 JST 2017
> > > [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+
> > > root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106
> > > crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none
> > > sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8
> > > debug_guardpage_minorder=1
> >
> > Are you debugging memory corruption problem?
>
> No. Just random testing, trying to find how we can avoid flooding of
> warn_alloc_stall() warning messages while also avoiding ratelimiting.

This is not the right way to stress the mm subsystem; the
debug_guardpage_minorder= option is for _debug_ purposes. Use mem= instead if
you want to limit available memory.

> > FWIW, if you use debug_guardpage_minorder= you can expect any
> > allocation memory problems. This option is intended to debug
> > memory corruption bugs and it shrinks available memory in
> > artificial way. Taking that, I don't think justifying any
> > patch, by problem happened when debug_guardpage_minorder= is
> > used, is reasonable.
> >
> > Stanislaw
>
> This problem occurs without debug_guardpage_minorder= parameter and

So please justify your patches by that.

Thanks
Stanislaw
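Stanislaw's suggestion — limiting RAM with the generic mem= parameter rather
than a debug option — means appending something like the following to the
kernel boot command line (the value is only an example):

```
mem=2G
```

Unlike debug_guardpage_minorder=, which carves guard pages out of the
allocator and changes its behavior, mem= simply caps the memory the kernel
uses at boot, so reclaim is stressed without the debug-only side effects.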
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
Stanislaw Gruszka wrote:
> On Sun, Apr 23, 2017 at 07:24:21PM +0900, Tetsuo Handa wrote:
> > On 2017/03/10 20:44, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > I am definitely not against. There is no reason to rush the patch in.
> > >
> > > I don't hurry if we can check using watchdog whether this problem is
> > > occurring in the real world. I have to test corner cases because
> > > watchdog is missing.
> >
> > Ping?
> >
> > This problem can occur even immediately after the first invocation of
> > the OOM killer. I believe this problem can occur in the real world.
> > When are we going to apply this patch or watchdog patch?
> >
> > [    0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity)
> > (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr
> > 23 17:38:02 JST 2017
> > [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+
> > root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106
> > crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none
> > sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8
> > debug_guardpage_minorder=1
>
> Are you debugging memory corruption problem?

No. Just random testing, trying to find how we can avoid flooding of
warn_alloc_stall() warning messages while also avoiding ratelimiting.

> FWIW, if you use debug_guardpage_minorder= you can expect any
> allocation memory problems. This option is intended to debug
> memory corruption bugs and it shrinks available memory in
> artificial way. Taking that, I don't think justifying any
> patch, by problem happened when debug_guardpage_minorder= is
> used, is reasonable.
>
> Stanislaw

This problem occurs without the debug_guardpage_minorder= parameter as well:
--
[    0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 17:38:02 JST 2017
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 vconsole.font=latarcyrheb-sun16 security=none sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8
(...snipped...)

CentOS Linux 7 (Core)
Kernel 4.11.0-rc7-next-20170421+ on an x86_64

ccsecurity login: [   31.882531] ip6_tables: (C) 2000-2006 Netfilter Core Team
[   32.550187] Ebtables v2.0 registered
[   32.730371] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[   32.926518] IPv6: ADDRCONF(NETDEV_UP): ens32: link is not ready
[   32.928310] e1000: ens32 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[   32.930960] IPv6: ADDRCONF(NETDEV_CHANGE): ens32: link becomes ready
[   33.741378] Netfilter messages via NETLINK v0.30.
[   33.807350] ip_set: protocol 6
[   37.581002] nf_conntrack: default automatic helper assignment has been turned off for security reasons and CT-based firewall rule not found. Use the iptables CT target to attach helpers instead.
[   38.072689] IPv6: ADDRCONF(NETDEV_UP): ens35: link is not ready
[   38.074419] e1000: ens35 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[   38.077222] IPv6: ADDRCONF(NETDEV_CHANGE): ens35: link becomes ready
[   92.753140] gmain invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0
[   92.763445] gmain cpuset=/ mems_allowed=0
[   92.767634] CPU: 2 PID: 2733 Comm: gmain Not tainted 4.11.0-rc7-next-20170421+ #588
[   92.773624] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   92.781790] Call Trace:
[   92.782630]  ? dump_stack+0x5c/0x7d
[   92.783902]  ? dump_header+0x97/0x233
[   92.785427]  ? ktime_get+0x30/0x90
[   92.786390]  ? delayacct_end+0x35/0x60
[   92.787433]  ? do_try_to_free_pages+0x2ca/0x370
[   92.789157]  ? oom_kill_process+0x223/0x3e0
[   92.790502]  ? has_capability_noaudit+0x17/0x20
[   92.791761]  ? oom_badness+0xeb/0x160
[   92.792783]  ? out_of_memory+0x10b/0x490
[   92.793872]  ? __alloc_pages_slowpath+0x701/0x8e2
[   92.795603]  ? __alloc_pages_nodemask+0x1ed/0x210
[   92.796902]  ? alloc_pages_current+0x7a/0x100
[   92.798115]  ? filemap_fault+0x2e9/0x5e0
[   92.799204]  ? filemap_map_pages+0x185/0x3a0
[   92.800402]  ? xfs_filemap_fault+0x2f/0x50 [xfs]
[   92.801678]  ? __do_fault+0x15/0x70
[   92.802651]  ? __handle_mm_fault+0xb0f/0x11e0
[   92.805141]  ? handle_mm_fault+0xc5/0x220
[   92.807261]  ? __do_page_fault+0x21e/0x4b0
[   92.809203]  ? do_page_fault+0x2b/0x70
[   92.811018]  ? do_syscall_64+0x137/0x140
[   92.812554]  ? page_fault+0x28/0x30
[   92.813855] Mem-Info:
[   92.815009] active_anon:437483 inactive_anon:2097 isolated_anon:0
[   92.815009]  active_file:0 inactive_file:104 isolated_file:41
[   92.815009]  unevictable:0 dirty:10 writeback:0 unstable:0
[   92.815009]  slab_reclaimable:2439
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Sun, Apr 23, 2017 at 07:24:21PM +0900, Tetsuo Handa wrote: > On 2017/03/10 20:44, Tetsuo Handa wrote: > > Michal Hocko wrote: > >> On Thu 09-03-17 13:05:40, Johannes Weiner wrote: > >>> On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote: > It only does this to some extent. If reclaim made > no progress, for example due to immediately bailing > out because the number of already isolated pages is > too high (due to many parallel reclaimers), the code > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES" > test without ever looking at the number of reclaimable > pages. > >>> > >>> Hm, there is no early return there, actually. We bump the loop counter > >>> every time it happens, but then *do* look at the reclaimable pages. > >>> > Could that create problems if we have many concurrent > reclaimers? > >>> > >>> With increased concurrency, the likelihood of OOM will go up if we > >>> remove the unlimited wait for isolated pages, that much is true. > >>> > >>> I'm not sure that's a bad thing, however, because we want the OOM > >>> killer to be predictable and timely. So a reasonable wait time in > >>> between 0 and forever before an allocating thread gives up under > >>> extreme concurrency makes sense to me. > >>> > It may be OK, I just do not understand all the implications. > > I like the general direction your patch takes the code in, > but I would like to understand it better... > >>> > >>> I feel the same way. The throttling logic doesn't seem to be very well > >>> thought out at the moment, making it hard to reason about what happens > >>> in certain scenarios. > >>> > >>> In that sense, this patch isn't really an overall improvement to the > >>> way things work. It patches a hole that seems to be exploitable only > >>> from an artificial OOM torture test, at the risk of regressing high > >>> concurrency workloads that may or may not be artificial. 
> >>> > >>> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency > >>> behind this patch. Can we think about a general model to deal with > >>> allocation concurrency? > >> > >> I am definitely not against. There is no reason to rush the patch in. > > > > I don't hurry if we can check using watchdog whether this problem is > > occurring > > in the real world. I have to test corner cases because watchdog is missing. > > > Ping? > > This problem can occur even immediately after the first invocation of > the OOM killer. I believe this problem can occur in the real world. > When are we going to apply this patch or the watchdog patch? > > > [0.00] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc > version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 > 17:38:02 JST 2017 > [0.00] Command line: > BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ > root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 > crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none > sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8 > debug_guardpage_minorder=1 Are you debugging a memory corruption problem? FWIW, if you use debug_guardpage_minorder= you can expect all sorts of memory allocation problems. This option is intended for debugging memory corruption bugs and it shrinks available memory in an artificial way. Given that, I don't think it is reasonable to justify any patch by a problem that happened while debug_guardpage_minorder= was in use. Stanislaw
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On 2017/03/10 20:44, Tetsuo Handa wrote: > Michal Hocko wrote: >> On Thu 09-03-17 13:05:40, Johannes Weiner wrote: >>> On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote: It only does this to some extent. If reclaim made no progress, for example due to immediately bailing out because the number of already isolated pages is too high (due to many parallel reclaimers), the code could hit the "no_progress_loops > MAX_RECLAIM_RETRIES" test without ever looking at the number of reclaimable pages. >>> >>> Hm, there is no early return there, actually. We bump the loop counter >>> every time it happens, but then *do* look at the reclaimable pages. >>> Could that create problems if we have many concurrent reclaimers? >>> >>> With increased concurrency, the likelihood of OOM will go up if we >>> remove the unlimited wait for isolated pages, that much is true. >>> >>> I'm not sure that's a bad thing, however, because we want the OOM >>> killer to be predictable and timely. So a reasonable wait time in >>> between 0 and forever before an allocating thread gives up under >>> extreme concurrency makes sense to me. >>> It may be OK, I just do not understand all the implications. I like the general direction your patch takes the code in, but I would like to understand it better... >>> >>> I feel the same way. The throttling logic doesn't seem to be very well >>> thought out at the moment, making it hard to reason about what happens >>> in certain scenarios. >>> >>> In that sense, this patch isn't really an overall improvement to the >>> way things work. It patches a hole that seems to be exploitable only >>> from an artificial OOM torture test, at the risk of regressing high >>> concurrency workloads that may or may not be artificial. >>> >>> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency >>> behind this patch. Can we think about a general model to deal with >>> allocation concurrency? >> >> I am definitely not against. 
There is no reason to rush the patch in. > > I don't hurry if we can check using watchdog whether this problem is occurring > in the real world. I have to test corner cases because watchdog is missing. > Ping? This problem can occur even immediately after the first invocation of the OOM killer. I believe this problem can occur in the real world. When are we going to apply this patch or watchdog patch? [0.00] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 17:38:02 JST 2017 [0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8 debug_guardpage_minorder=1 (...snipped...) CentOS Linux 7 (Core) Kernel 4.11.0-rc7-next-20170421+ on an x86_64 ccsecurity login: [ 32.406723] ip6_tables: (C) 2000-2006 Netfilter Core Team [ 32.852917] Ebtables v2.0 registered [ 33.034402] nf_conntrack version 0.5.0 (16384 buckets, 65536 max) [ 33.467929] e1000: ens32 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None [ 33.475728] IPv6: ADDRCONF(NETDEV_UP): ens32: link is not ready [ 33.478910] IPv6: ADDRCONF(NETDEV_CHANGE): ens32: link becomes ready [ 33.950365] Netfilter messages via NETLINK v0.30. [ 33.983449] ip_set: protocol 6 [ 37.335966] e1000: ens35 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None [ 37.337587] IPv6: ADDRCONF(NETDEV_UP): ens35: link is not ready [ 37.339925] IPv6: ADDRCONF(NETDEV_CHANGE): ens35: link becomes ready [ 39.940942] nf_conntrack: default automatic helper assignment has been turned off for security reasons and CT-based firewall rule not found. Use the iptables CT target to attach helpers instead. 
[ 98.926202] a.out invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0 [ 98.932977] a.out cpuset=/ mems_allowed=0 [ 98.934780] CPU: 1 PID: 2972 Comm: a.out Not tainted 4.11.0-rc7-next-20170421+ #588 [ 98.937988] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 98.942193] Call Trace: [ 98.942942] ? dump_stack+0x5c/0x7d [ 98.943907] ? dump_header+0x97/0x233 [ 98.945334] ? ktime_get+0x30/0x90 [ 98.946290] ? delayacct_end+0x35/0x60 [ 98.947319] ? do_try_to_free_pages+0x2ca/0x370 [ 98.948554] ? oom_kill_process+0x223/0x3e0 [ 98.949715] ? has_capability_noaudit+0x17/0x20 [ 98.950948] ? oom_badness+0xeb/0x160 [ 98.951962] ? out_of_memory+0x10b/0x490 [ 98.953030] ? __alloc_pages_slowpath+0x701/0x8e2 [ 98.954313] ? __alloc_pages_nodemask+0x1ed/0x210 [ 98.956242] ? alloc_pages_vma+0x9f/0x220 [ 98.957486] ?
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On 2017/03/10 20:44, Tetsuo Handa wrote: > Michal Hocko wrote: >> On Thu 09-03-17 13:05:40, Johannes Weiner wrote: It may be OK, I just do not understand all the implications. I like the general direction your patch takes the code in, but I would like to understand it better... >>> >>> I feel the same way. The throttling logic doesn't seem to be very well >>> thought out at the moment, making it hard to reason about what happens >>> in certain scenarios. >>> >>> In that sense, this patch isn't really an overall improvement to the >>> way things work. It patches a hole that seems to be exploitable only >>> from an artificial OOM torture test, at the risk of regressing high >>> concurrency workloads that may or may not be artificial. >>> >>> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency >>> behind this patch. Can we think about a general model to deal with >>> allocation concurrency? >> >> I am definitely not against. There is no reason to rush the patch in. > > I don't hurry if we can check using watchdog whether this problem is occurring > in the real world. I have to test corner cases because watchdog is missing. Today I tested linux-next-20170321 with not so insane stress, and I again hit this problem. Thus, I think this problem might occur in the real world. http://I-love.SAKURA.ne.jp/tmp/serial-20170321.txt.xz (Logs up to before swapoff are eliminated.) 
-- [ 2250.175109] MemAlloc-Info: stalling=16 dying=0 exiting=4 victim=0 oom_count=1155386 [ 2257.535653] MemAlloc-Info: stalling=16 dying=0 exiting=4 victim=0 oom_count=1155386 [ 2319.806880] MemAlloc-Info: stalling=19 dying=0 exiting=4 victim=0 oom_count=1155386 [ 2320.722282] MemAlloc-Info: stalling=19 dying=0 exiting=4 victim=0 oom_count=1155386 [ 2381.243393] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386 [ 2389.777052] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386 [ 2450.878287] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386 [ 2459.386321] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386 [ 2520.500633] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386 [ 2529.042088] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386 --
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
Michal Hocko wrote: > On Thu 09-03-17 13:05:40, Johannes Weiner wrote: > > On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote: > > > It only does this to some extent. If reclaim made > > > no progress, for example due to immediately bailing > > > out because the number of already isolated pages is > > > too high (due to many parallel reclaimers), the code > > > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES" > > > test without ever looking at the number of reclaimable > > > pages. > > > > Hm, there is no early return there, actually. We bump the loop counter > > every time it happens, but then *do* look at the reclaimable pages. > > > > > Could that create problems if we have many concurrent > > > reclaimers? > > > > With increased concurrency, the likelihood of OOM will go up if we > > remove the unlimited wait for isolated pages, that much is true. > > > > I'm not sure that's a bad thing, however, because we want the OOM > > killer to be predictable and timely. So a reasonable wait time in > > between 0 and forever before an allocating thread gives up under > > extreme concurrency makes sense to me. > > > > > It may be OK, I just do not understand all the implications. > > > > > > I like the general direction your patch takes the code in, > > > but I would like to understand it better... > > > > I feel the same way. The throttling logic doesn't seem to be very well > > thought out at the moment, making it hard to reason about what happens > > in certain scenarios. > > > > In that sense, this patch isn't really an overall improvement to the > > way things work. It patches a hole that seems to be exploitable only > > from an artificial OOM torture test, at the risk of regressing high > > concurrency workloads that may or may not be artificial. > > > > Unless I'm mistaken, there doesn't seem to be a whole lot of urgency > > behind this patch. Can we think about a general model to deal with > > allocation concurrency? 
> > I am definitely not against. There is no reason to rush the patch in. I am not in a hurry if we can check, using a watchdog, whether this problem is occurring in the real world. I have to test corner cases because the watchdog is missing. > My main point behind this patch was to reduce unbound loops from inside > the reclaim path and push any throttling up the call chain to the > page allocator path because I believe that it is easier to reason > about them at that level. The direct reclaim should be as simple as > possible without too many side effects otherwise we end up with highly > unpredictable behavior. This was a first step in that direction and my > testing so far didn't show any regressions. > > > Unlimited parallel direct reclaim is kinda > > bonkers in the first place. How about checking for excessive isolation > > counts from the page allocator and putting allocations on a waitqueue? > > I would be interested in details here. That will help implement __GFP_KILLABLE. https://bugzilla.kernel.org/show_bug.cgi?id=192981#c15
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Thu 09-03-17 17:18:00, Rik van Riel wrote: > On Thu, 2017-03-09 at 13:05 -0500, Johannes Weiner wrote: > > On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote: > > > > > > It only does this to some extent. If reclaim made > > > no progress, for example due to immediately bailing > > > out because the number of already isolated pages is > > > too high (due to many parallel reclaimers), the code > > > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES" > > > test without ever looking at the number of reclaimable > > > pages. > > Hm, there is no early return there, actually. We bump the loop > > counter > > every time it happens, but then *do* look at the reclaimable pages. > > Am I looking at an old tree? I see this code > before we look at the reclaimable pages. > > /* > * Make sure we converge to OOM if we cannot make any progress > * several times in the row. > */ > if (*no_progress_loops > MAX_RECLAIM_RETRIES) { > /* Before OOM, exhaust highatomic_reserve */ > return unreserve_highatomic_pageblock(ac, true); > } I believe that Johannes meant cases where we do not exhaust all the reclaim retries and fail early because there are no reclaimable pages during the watermark check. > > > Could that create problems if we have many concurrent > > > reclaimers? > > With increased concurrency, the likelihood of OOM will go up if we > > remove the unlimited wait for isolated pages, that much is true. > > > > I'm not sure that's a bad thing, however, because we want the OOM > > killer to be predictable and timely. So a reasonable wait time in > > between 0 and forever before an allocating thread gives up under > > extreme concurrency makes sense to me. > > That is a fair point, a faster OOM kill is preferable > to a system that is livelocked. > > > Unless I'm mistaken, there doesn't seem to be a whole lot of urgency > > behind this patch. Can we think about a general model to deal with > > allocation concurrency? 
Unlimited parallel direct reclaim is kinda > > bonkers in the first place. How about checking for excessive > > isolation > > counts from the page allocator and putting allocations on a > > waitqueue? > > The (limited) number of reclaimers can still do a > relatively fast OOM kill, if none of them manage > to make progress. Well, we can estimate how much memory those relatively few reclaimers can isolate and try to reclaim. Even if we have hundreds of them, which already strikes me as a large number, we are at 100*SWAP_CLUSTER_MAX pages, which is not all that much. And we are effectively OOM if there is no other reclaimable memory left. All we need is to put some upper bound in place. We already have throttle_direct_reclaim but it doesn't really throttle the maximum number of reclaimers. -- Michal Hocko SUSE Labs
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Thu 09-03-17 17:18:00, Rik van Riel wrote:
> On Thu, 2017-03-09 at 13:05 -0500, Johannes Weiner wrote:
> > On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> > > It only does this to some extent. If reclaim made
> > > no progress, for example due to immediately bailing
> > > out because the number of already isolated pages is
> > > too high (due to many parallel reclaimers), the code
> > > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> > > test without ever looking at the number of reclaimable
> > > pages.
> >
> > Hm, there is no early return there, actually. We bump the loop counter
> > every time it happens, but then *do* look at the reclaimable pages.
>
> Am I looking at an old tree? I see this code
> before we look at the reclaimable pages.
>
>         /*
>          * Make sure we converge to OOM if we cannot make any progress
>          * several times in the row.
>          */
>         if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
>                 /* Before OOM, exhaust highatomic_reserve */
>                 return unreserve_highatomic_pageblock(ac, true);
>         }

I believe that Johannes meant cases where we do not exhaust all the
reclaim retries and fail early because there are no reclaimable pages
during the watermark check.

> > > Could that create problems if we have many concurrent
> > > reclaimers?
> >
> > With increased concurrency, the likelihood of OOM will go up if we
> > remove the unlimited wait for isolated pages, that much is true.
> >
> > I'm not sure that's a bad thing, however, because we want the OOM
> > killer to be predictable and timely. So a reasonable wait time in
> > between 0 and forever before an allocating thread gives up under
> > extreme concurrency makes sense to me.
>
> That is a fair point, a faster OOM kill is preferable
> to a system that is livelocked.
>
> > Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> > behind this patch. Can we think about a general model to deal with
> > allocation concurrency? Unlimited parallel direct reclaim is kinda
> > bonkers in the first place. How about checking for excessive isolation
> > counts from the page allocator and putting allocations on a waitqueue?
>
> The (limited) number of reclaimers can still do a
> relatively fast OOM kill, if none of them manage
> to make progress.

Well, we can estimate how much memory those relatively few reclaimers
can isolate and try to reclaim. Even if we have hundreds of them, which
is already toward the large end to me, we are at 100*SWAP_CLUSTER_MAX
pages, which is not all that much. And we are effectively OOM if there
is no other reclaimable memory left. All we need is just to put some
upper bound. We already have throttle_direct_reclaim but it doesn't
really throttle the maximum number of reclaimers.
--
Michal Hocko
SUSE Labs
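For a concrete sense of the numbers Michal is estimating: SWAP_CLUSTER_MAX is 32 pages in the kernel, so even a hundred concurrent direct reclaimers pin only a few megabytes of isolated pages. A minimal stand-alone sketch of that arithmetic (the 4 KiB page size is an assumption for illustration, not taken from the thread):

```c
#include <assert.h>

/* SWAP_CLUSTER_MAX is the per-call isolation batch size, 32 pages
 * in the kernel (include/linux/swap.h). */
#define SWAP_CLUSTER_MAX 32UL

/* Upper bound on pages that `reclaimers` concurrent direct reclaimers
 * can have isolated at once, one batch each. */
static unsigned long max_isolated_pages(unsigned long reclaimers)
{
	return reclaimers * SWAP_CLUSTER_MAX;
}

/* Same bound in KiB, assuming 4 KiB pages (architecture dependent). */
static unsigned long max_isolated_kb(unsigned long reclaimers)
{
	return max_isolated_pages(reclaimers) * 4;
}
```

With 100 reclaimers this comes to 3200 pages, about 12.5 MiB, which is what makes the "not all that much" argument plausible.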
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Thu 09-03-17 13:05:40, Johannes Weiner wrote:
> On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> > It only does this to some extent. If reclaim made
> > no progress, for example due to immediately bailing
> > out because the number of already isolated pages is
> > too high (due to many parallel reclaimers), the code
> > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> > test without ever looking at the number of reclaimable
> > pages.
>
> Hm, there is no early return there, actually. We bump the loop counter
> every time it happens, but then *do* look at the reclaimable pages.
>
> > Could that create problems if we have many concurrent
> > reclaimers?
>
> With increased concurrency, the likelihood of OOM will go up if we
> remove the unlimited wait for isolated pages, that much is true.
>
> I'm not sure that's a bad thing, however, because we want the OOM
> killer to be predictable and timely. So a reasonable wait time in
> between 0 and forever before an allocating thread gives up under
> extreme concurrency makes sense to me.
>
> > It may be OK, I just do not understand all the implications.
> >
> > I like the general direction your patch takes the code in,
> > but I would like to understand it better...
>
> I feel the same way. The throttling logic doesn't seem to be very well
> thought out at the moment, making it hard to reason about what happens
> in certain scenarios.
>
> In that sense, this patch isn't really an overall improvement to the
> way things work. It patches a hole that seems to be exploitable only
> from an artificial OOM torture test, at the risk of regressing high
> concurrency workloads that may or may not be artificial.
>
> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> behind this patch. Can we think about a general model to deal with
> allocation concurrency?

I am definitely not against. There is no reason to rush the patch in.
My main point behind this patch was to reduce unbound loops inside the
reclaim path and push any throttling up the call chain to the page
allocator, because I believe it is easier to reason about throttling at
that level. Direct reclaim should be as simple as possible, without too
many side effects; otherwise we end up with highly unpredictable
behavior. This was a first step in that direction, and my testing so
far didn't show any regressions.

> Unlimited parallel direct reclaim is kinda
> bonkers in the first place. How about checking for excessive isolation
> counts from the page allocator and putting allocations on a waitqueue?

I would be interested in details here.
--
Michal Hocko
SUSE Labs
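Johannes' waitqueue idea, which Michal asks for details about, might look roughly like the following userspace model: allocating threads check a shared isolated-pages counter and sleep until reclaimers put pages back. Everything here is an illustrative assumption, not proposed kernel code; the pthread mutex and condition variable stand in for a kernel waitqueue, and ISOLATE_LIMIT is an arbitrary threshold:

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t drained = PTHREAD_COND_INITIALIZER;
static long nr_isolated;            /* models NR_ISOLATED_ANON + NR_ISOLATED_FILE */
#define ISOLATE_LIMIT 128           /* arbitrary threshold for the sketch */

/* A reclaimer takes pages off the LRU. */
static void isolate_pages(long nr)
{
	pthread_mutex_lock(&lock);
	nr_isolated += nr;
	pthread_mutex_unlock(&lock);
}

/* Reclaim finished with a batch; wake allocators if we dropped
 * back below the threshold. */
static void putback_pages(long nr)
{
	pthread_mutex_lock(&lock);
	nr_isolated -= nr;
	if (nr_isolated < ISOLATE_LIMIT)
		pthread_cond_broadcast(&drained);
	pthread_mutex_unlock(&lock);
}

/* Called from the allocator path: sleep while too many pages are
 * isolated, instead of spinning inside shrink_inactive_list(). */
static void throttle_allocation(void)
{
	pthread_mutex_lock(&lock);
	while (nr_isolated >= ISOLATE_LIMIT)
		pthread_cond_wait(&drained, &lock);
	pthread_mutex_unlock(&lock);
}
```

The point of throttling at this level, rather than in the reclaim loop, is that a bounded set of reclaimers can still converge on a fast OOM kill while everyone else waits in one well-defined place.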
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Thu, 2017-03-09 at 13:05 -0500, Johannes Weiner wrote:
> On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> > It only does this to some extent. If reclaim made
> > no progress, for example due to immediately bailing
> > out because the number of already isolated pages is
> > too high (due to many parallel reclaimers), the code
> > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> > test without ever looking at the number of reclaimable
> > pages.
>
> Hm, there is no early return there, actually. We bump the loop counter
> every time it happens, but then *do* look at the reclaimable pages.

Am I looking at an old tree? I see this code
before we look at the reclaimable pages.

        /*
         * Make sure we converge to OOM if we cannot make any progress
         * several times in the row.
         */
        if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
                /* Before OOM, exhaust highatomic_reserve */
                return unreserve_highatomic_pageblock(ac, true);
        }

> > Could that create problems if we have many concurrent
> > reclaimers?
>
> With increased concurrency, the likelihood of OOM will go up if we
> remove the unlimited wait for isolated pages, that much is true.
>
> I'm not sure that's a bad thing, however, because we want the OOM
> killer to be predictable and timely. So a reasonable wait time in
> between 0 and forever before an allocating thread gives up under
> extreme concurrency makes sense to me.

That is a fair point, a faster OOM kill is preferable
to a system that is livelocked.

> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> behind this patch. Can we think about a general model to deal with
> allocation concurrency? Unlimited parallel direct reclaim is kinda
> bonkers in the first place. How about checking for excessive isolation
> counts from the page allocator and putting allocations on a waitqueue?

The (limited) number of reclaimers can still do a
relatively fast OOM kill, if none of them manage
to make progress.

That should avoid the potential issue you and I both pointed out, and,
as a bonus, it might actually be faster than letting all the tasks in
the system into the direct reclaim code simultaneously.
--
All rights reversed
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote: > It only does this to some extent. If reclaim made > no progress, for example due to immediately bailing > out because the number of already isolated pages is > too high (due to many parallel reclaimers), the code > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES" > test without ever looking at the number of reclaimable > pages. Hm, there is no early return there, actually. We bump the loop counter every time it happens, but then *do* look at the reclaimable pages. > Could that create problems if we have many concurrent > reclaimers? With increased concurrency, the likelihood of OOM will go up if we remove the unlimited wait for isolated pages, that much is true. I'm not sure that's a bad thing, however, because we want the OOM killer to be predictable and timely. So a reasonable wait time in between 0 and forever before an allocating thread gives up under extreme concurrency makes sense to me. > It may be OK, I just do not understand all the implications. > > I like the general direction your patch takes the code in, > but I would like to understand it better... I feel the same way. The throttling logic doesn't seem to be very well thought out at the moment, making it hard to reason about what happens in certain scenarios. In that sense, this patch isn't really an overall improvement to the way things work. It patches a hole that seems to be exploitable only from an artificial OOM torture test, at the risk of regressing high concurrency workloads that may or may not be artificial. Unless I'm mistaken, there doesn't seem to be a whole lot of urgency behind this patch. Can we think about a general model to deal with allocation concurrency? Unlimited parallel direct reclaim is kinda bonkers in the first place. How about checking for excessive isolation counts from the page allocator and putting allocations on a waitqueue?
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Thu 09-03-17 09:16:25, Rik van Riel wrote:
> On Thu, 2017-03-09 at 10:12 +0100, Michal Hocko wrote:
> > On Wed 08-03-17 10:54:57, Rik van Riel wrote:
> > > In fact, false OOM kills with that kind of workload is
> > > how we ended up getting the "too many isolated" logic
> > > in the first place.
> >
> > Right, but the retry logic was considerably different than what we
> > have these days. should_reclaim_retry considers the amount of
> > reclaimable memory. As I've said earlier, if we see a report where
> > the oom hits prematurely with many NR_ISOLATED* we know how to fix
> > that.
>
> Would it be enough to simply reset no_progress_loops
> in this check inside should_reclaim_retry, if we know
> pageout IO is pending?
>
>         if (!did_some_progress) {
>                 unsigned long write_pending;
>
>                 write_pending = zone_page_state_snapshot(zone,
>                                                 NR_ZONE_WRITE_PENDING);
>
>                 if (2 * write_pending > reclaimable) {
>                         congestion_wait(BLK_RW_ASYNC, HZ/10);
>                         return true;
>                 }
>         }

I am not really sure what problem we are trying to solve right now, to
be honest. I would prefer to keep the logic simpler rather than
over-engineer something that is not even needed.
--
Michal Hocko
SUSE Labs
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Tue, Mar 07, 2017 at 02:30:57PM +0100, Michal Hocko wrote:
> From: Michal Hocko
>
> Tetsuo Handa has reported [1][2] that direct reclaimers might get stuck
> in the too_many_isolated loop basically for ever because the last few
> pages on the LRU lists are isolated by kswapd, which is stuck on fs
> locks when doing the pageout or slab reclaim. This in turn means that
> there is nobody to actually trigger the oom killer and the system is
> basically unusable.
>
> too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
> direct reclaim when too many pages are isolated already") to prevent
> pre-mature oom killer invocations, because back then no reclaim
> progress could indeed trigger the OOM killer too early. But since the
> oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection") the
> allocation/reclaim retry loop considers all the reclaimable pages and
> throttles the allocation at that layer, so we can loosen the direct
> reclaim throttling.
>
> Make the shrink_inactive_list loop over too_many_isolated bounded and
> return immediately when the situation hasn't resolved after the first
> sleep. Replace congestion_wait by a simple schedule_timeout_interruptible
> because we are not really waiting on the IO congestion in this path.
>
> Please note that this patch can theoretically cause the OOM killer to
> trigger earlier while there are many pages isolated for the reclaim
> which makes progress only very slowly. This would be obvious from the
> oom report, as the number of isolated pages is printed there. If we
> ever hit this, should_reclaim_retry should consider those numbers in
> the evaluation in one way or another.
>
> [1] http://lkml.kernel.org/r/201602092349.acg81273.osvtmjqhlof...@i-love.sakura.ne.jp
> [2] http://lkml.kernel.org/r/201702212335.djb30777.jofmhsftvlq...@i-love.sakura.ne.jp
>
> Signed-off-by: Michal Hocko

Acked-by: Mel Gorman
--
Mel Gorman
SUSE Labs
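The behavior the changelog describes, one bounded stall instead of an endless too_many_isolated() loop, can be modeled in plain C. This is an illustrative userspace sketch, not the actual shrink_inactive_list() change; the predicate and the sleep are stand-ins, and the counter exists only so the model can be exercised:

```c
#include <stdbool.h>

enum reclaim_result { RECLAIM_BAILED, RECLAIM_PROCEEDED };

/* Stand-in for too_many_isolated(): reports "too many" for a
 * configurable number of checks, then clears. */
static int too_many_checks_left;

static bool too_many_isolated_now(void)
{
	if (too_many_checks_left > 0) {
		too_many_checks_left--;
		return true;
	}
	return false;
}

/* Models schedule_timeout_interruptible(); a no-op in this sketch. */
static void short_sleep(void)
{
}

/* Patched behavior: stall once, re-check, and bail out instead of
 * looping, leaving the OOM decision to the allocator's retry logic. */
static enum reclaim_result throttled_reclaim(void)
{
	if (too_many_isolated_now()) {
		short_sleep();
		if (too_many_isolated_now())
			return RECLAIM_BAILED;
	}
	return RECLAIM_PROCEEDED;
}
```

The key property, visible in the model, is that the function returns after at most one sleep whether or not the isolation pressure resolves; the old code could spin here indefinitely.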
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Thu, 2017-03-09 at 10:12 +0100, Michal Hocko wrote:
> On Wed 08-03-17 10:54:57, Rik van Riel wrote:
> > In fact, false OOM kills with that kind of workload is
> > how we ended up getting the "too many isolated" logic
> > in the first place.
>
> Right, but the retry logic was considerably different than what we
> have these days. should_reclaim_retry considers amount of reclaimable
> memory. As I've said earlier if we see a report where the oom hits
> prematurely with many NR_ISOLATED* we know how to fix that.

Would it be enough to simply reset no_progress_loops
in this check inside should_reclaim_retry, if we know
pageout IO is pending?

        if (!did_some_progress) {
                unsigned long write_pending;

                write_pending = zone_page_state_snapshot(zone,
                                                NR_ZONE_WRITE_PENDING);

                if (2 * write_pending > reclaimable) {
                        congestion_wait(BLK_RW_ASYNC, HZ/10);
                        return true;
                }
        }

--
All rights reversed
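The condition in Rik's snippet treats the stall as pageout IO in flight whenever pending writeback covers more than half of the remaining reclaimable pages, and retries instead of counting toward OOM. A tiny extracted model of just that predicate (the function name is a stand-in for illustration):

```c
#include <stdbool.h>

/* True when more than half of the reclaimable pages are already queued
 * for writeback; Rik's snippet then waits for the IO to complete
 * instead of advancing no_progress_loops toward an OOM kill. */
static bool mostly_write_pending(unsigned long write_pending,
				 unsigned long reclaimable)
{
	return 2 * write_pending > reclaimable;
}
```

For example, with 100 reclaimable pages the check trips at 51 pages pending writeback, but not at exactly 50, since the comparison is strict.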
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Wed 08-03-17 10:54:57, Rik van Riel wrote:
> On Wed, 2017-03-08 at 10:21 +0100, Michal Hocko wrote:
> > > Could that create problems if we have many concurrent
> > > reclaimers?
> >
> > As the changelog mentions it might cause a premature oom killer
> > invocation theoretically. We could easily see that from the oom report
> > by checking isolated counters. My testing didn't trigger that though
> > and I was hammering the page allocator path from many threads.
> >
> > I suspect some artificial tests can trigger that, I am not so sure
> > about reasonable workloads. If we see this happening though then the
> > fix would be to resurrect my previous attempt to track NR_ISOLATED*
> > per zone and use them in the allocator retry logic.
>
> I am not sure the workload in question is "artificial".
> A heavily forking (or multi-threaded) server running out
> of physical memory could easily get hundreds of tasks
> doing direct reclaim simultaneously.

Yes, some of my OOM tests (fork many short-lived processes while there
is strong memory pressure and a lot of IO going on) are doing this and
I haven't hit a premature OOM yet. It is hard to tune those tests for
almost-OOM but not quite there, though. Usually you either find a
steady state or really run out of memory.

> In fact, false OOM kills with that kind of workload is
> how we ended up getting the "too many isolated" logic
> in the first place.

Right, but the retry logic was considerably different than what we have
these days. should_reclaim_retry considers the amount of reclaimable
memory. As I've said earlier, if we see a report where the oom hits
prematurely with many NR_ISOLATED* we know how to fix that.

> I am perfectly fine with moving the retry logic up like
> you did, but think it may make sense to check the number
> of reclaimable pages if we have too many isolated pages,
> instead of risking a too-early OOM kill.

Actually that was my initial attempt, but for that we would need
per-zone NR_ISOLATED* counters. Mel was against that and wanted to
start with the simpler approach if it works reasonably well, which it
seems it does from my experience so far (but reality can surprise, as
I've seen so many times already).
--
Michal Hocko
SUSE Labs
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Wed, 2017-03-08 at 10:21 +0100, Michal Hocko wrote:
> > Could that create problems if we have many concurrent
> > reclaimers?
>
> As the changelog mentions it might cause a premature oom killer
> invocation theoretically. We could easily see that from the oom report
> by checking isolated counters. My testing didn't trigger that though
> and I was hammering the page allocator path from many threads.
>
> I suspect some artificial tests can trigger that, I am not so sure
> about reasonable workloads. If we see this happening though then the
> fix would be to resurrect my previous attempt to track NR_ISOLATED*
> per zone and use them in the allocator retry logic.

I am not sure the workload in question is "artificial".
A heavily forking (or multi-threaded) server running out
of physical memory could easily get hundreds of tasks
doing direct reclaim simultaneously.

In fact, false OOM kills with that kind of workload is
how we ended up getting the "too many isolated" logic
in the first place.

I am perfectly fine with moving the retry logic up like
you did, but think it may make sense to check the number
of reclaimable pages if we have too many isolated pages,
instead of risking a too-early OOM kill.
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Tue 07-03-17 14:52:36, Rik van Riel wrote: > On Tue, 2017-03-07 at 14:30 +0100, Michal Hocko wrote: > > From: Michal Hocko> > > > Tetsuo Handa has reported [1][2] that direct reclaimers might get > > stuck > > in too_many_isolated loop basically for ever because the last few > > pages > > on the LRU lists are isolated by the kswapd which is stuck on fs > > locks > > when doing the pageout or slab reclaim. This in turn means that there > > is > > nobody to actually trigger the oom killer and the system is basically > > unusable. > > > > too_many_isolated has been introduced by 35cd78156c49 ("vmscan: > > throttle > > direct reclaim when too many pages are isolated already") to prevent > > from pre-mature oom killer invocations because back then no reclaim > > progress could indeed trigger the OOM killer too early. But since the > > oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection") > > the allocation/reclaim retry loop considers all the reclaimable pages > > and throttles the allocation at that layer so we can loosen the > > direct > > reclaim throttling. > > It only does this to some extent. If reclaim made > no progress, for example due to immediately bailing > out because the number of already isolated pages is > too high (due to many parallel reclaimers), the code > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES" > test without ever looking at the number of reclaimable > pages. > > Could that create problems if we have many concurrent > reclaimers? As the changelog mentions it might cause a premature oom killer invocation theoretically. We could easily see that from the oom report by checking isolated counters. My testing didn't trigger that though and I was hammering the page allocator path from many threads. I suspect some artificial tests can trigger that, I am not so sure about reasonabel workloads. 
If we see this happening though then the fix would be to resurrect my previous attempt to track NR_ISOLATED* per zone and use them in the allocator retry logic. -- Michal Hocko SUSE Labs
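[Editor's note] A hedged sketch of the fix Michal mentions: resurrect per-zone NR_ISOLATED* counters and let the allocator's retry logic treat isolated pages as still reclaimable. The struct and function names below are assumptions for illustration only, not the actual kernel interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-zone counters; field names are illustrative. */
struct zone_model {
	unsigned long reclaimable;	/* pages still on the LRU lists */
	unsigned long isolated_anon;	/* would be NR_ISOLATED_ANON */
	unsigned long isolated_file;	/* would be NR_ISOLATED_FILE */
};

static bool should_reclaim_retry_model(const struct zone_model *z)
{
	/*
	 * Pages isolated by other reclaimers have merely been taken off
	 * the LRU temporarily; counting them avoids a premature "nothing
	 * reclaimable" verdict while parallel reclaim is in flight.
	 */
	unsigned long available = z->reclaimable +
				  z->isolated_anon + z->isolated_file;
	return available > 0;
}
```

The design point: the throttling decision moves out of the reclaim loop and into the allocator's retry logic, where the isolated counters can be weighed against everything else.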
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Tue, 2017-03-07 at 14:30 +0100, Michal Hocko wrote: > From: Michal Hocko > > Tetsuo Handa has reported [1][2] that direct reclaimers might get > stuck > in too_many_isolated loop basically for ever because the last few > pages > on the LRU lists are isolated by the kswapd which is stuck on fs > locks > when doing the pageout or slab reclaim. This in turn means that there > is > nobody to actually trigger the oom killer and the system is basically > unusable. > > too_many_isolated has been introduced by 35cd78156c49 ("vmscan: > throttle > direct reclaim when too many pages are isolated already") to prevent > from pre-mature oom killer invocations because back then no reclaim > progress could indeed trigger the OOM killer too early. But since the > oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection") > the allocation/reclaim retry loop considers all the reclaimable pages > and throttles the allocation at that layer so we can loosen the > direct > reclaim throttling. It only does this to some extent. If reclaim made no progress, for example due to immediately bailing out because the number of already isolated pages is too high (due to many parallel reclaimers), the code could hit the "no_progress_loops > MAX_RECLAIM_RETRIES" test without ever looking at the number of reclaimable pages. Could that create problems if we have many concurrent reclaimers? It may be OK, I just do not understand all the implications. I like the general direction your patch takes the code in, but I would like to understand it better... -- All rights reversed
[PATCH] mm, vmscan: do not loop on too_many_isolated for ever
From: Michal Hocko Tetsuo Handa has reported [1][2] that direct reclaimers might get stuck in too_many_isolated loop basically for ever because the last few pages on the LRU lists are isolated by the kswapd which is stuck on fs locks when doing the pageout or slab reclaim. This in turn means that there is nobody to actually trigger the oom killer and the system is basically unusable. too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle direct reclaim when too many pages are isolated already") to prevent pre-mature oom killer invocations because back then no reclaim progress could indeed trigger the OOM killer too early. But since the oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection") the allocation/reclaim retry loop considers all the reclaimable pages and throttles the allocation at that layer so we can loosen the direct reclaim throttling. Make the shrink_inactive_list loop over too_many_isolated bounded and return immediately when the situation hasn't resolved after the first sleep. Replace congestion_wait by a simple schedule_timeout_interruptible because we are not really waiting on the IO congestion in this path. Please note that this patch can theoretically cause the OOM killer to trigger earlier while there are many pages isolated for the reclaim which makes progress only very slowly. This would be obvious from the oom report as the number of isolated pages is printed there. If we ever hit this, should_reclaim_retry should consider those numbers in the evaluation in one way or another. [1] http://lkml.kernel.org/r/201602092349.acg81273.osvtmjqhlof...@i-love.sakura.ne.jp [2] http://lkml.kernel.org/r/201702212335.djb30777.jofmhsftvlq...@i-love.sakura.ne.jp Signed-off-by: Michal Hocko --- Hi, Tetsuo helped to test this patch [3] and couldn't reproduce the hang inside the page allocator anymore. Thanks! He was able to see a different lockup though.
This time this is more related to how XFS is doing the inode reclaim from the WQ context. This is being discussed [4] and I believe it is unrelated to this change. I believe this change is still an improvement because it reduces chances of an unbounded loop inside the reclaim path so we have a) more reliable detection of the lockup from the allocator path and b) more deterministic retry loop logic. Thoughts/complaints/suggestions? [3] http://lkml.kernel.org/r/201702261530.jdd56292.ofolfhqtvmj...@i-love.sakura.ne.jp [4] http://lkml.kernel.org/r/20170303133950.gd31...@dhcp22.suse.cz mm/vmscan.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index c15b2e4c47ca..4ae069060ae5 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, int file = is_file_lru(lru); struct pglist_data *pgdat = lruvec_pgdat(lruvec); struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; + bool stalled = false; while (unlikely(too_many_isolated(pgdat, file, sc))) { - congestion_wait(BLK_RW_ASYNC, HZ/10); + if (stalled) + return 0; + + /* wait a bit for the reclaimer. */ + schedule_timeout_interruptible(HZ/10); + stalled = true; /* We are about to die and free our memory. Return now. */ if (fatal_signal_pending(current)) -- 2.11.0
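[Editor's note] A userspace model of the control flow the patch introduces: instead of looping on too_many_isolated() for ever, sleep at most once and then return 0 so the caller's retry logic decides what happens next. Here too_many_isolated() is modelled by a countdown of "still congested" answers, and the sleep itself is elided; this is a sketch of the shape of the change, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the patched shrink_inactive_list() entry: congested_checks is
 * how many consecutive times the too_many_isolated() check would still
 * report pressure. Returns 0 ("no progress, bail") or 32 ("reclaimed").
 */
static unsigned long shrink_inactive_model(int congested_checks)
{
	bool stalled = false;

	while (congested_checks-- > 0) {	/* too_many_isolated() */
		if (stalled)
			return 0;	/* second hit: give up, don't loop */
		/* schedule_timeout_interruptible(HZ/10) would sleep here */
		stalled = true;
	}
	return 32;	/* pressure gone (or never there): do the reclaim */
}
```

The key property: transient pressure (gone after one sleep) still leads to reclaim, while persistent pressure returns after exactly one sleep instead of spinning indefinitely, leaving the retry/OOM decision to the allocator.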