Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
Hugh Dickins wrote:
> On Thu, 20 Jul 2017, Tetsuo Handa wrote:
> > Hugh Dickins wrote:
> > > You probably won't welcome getting into alternatives at this late stage;
> > > but after hacking around it one way or another because of its pointless
> > > lockups, I lost patience with that too_many_isolated() loop a few months
> > > back (on realizing the enormous number of pages that may be isolated via
> > > migrate_pages(2)), and we've been running nicely since with something
> > > like:
> > >
> > >	bool got_mutex = false;
> > >
> > >	if (unlikely(too_many_isolated(pgdat, file, sc))) {
> > >		if (mutex_lock_killable(&pgdat->too_many_isolated))
> > >			return SWAP_CLUSTER_MAX;
> > >		got_mutex = true;
> > >	}
> > >	...
> > >	if (got_mutex)
> > >		mutex_unlock(&pgdat->too_many_isolated);
> > >
> > > Using a mutex to provide the intended throttling, without an infinite
> > > loop or an arbitrary delay; and without having to worry (as we often did)
> > > about whether those numbers in too_many_isolated() are really appropriate.
> > > No premature OOMs complained of yet.
> >
> > Roughly speaking, there is a moment where shrink_inactive_list() acts
> > like below.
> >
> >	bool got_mutex = false;
> >
> >	if (!current_is_kswapd()) {
> >		if (mutex_lock_killable(&pgdat->too_many_isolated))
> >			return SWAP_CLUSTER_MAX;
> >		got_mutex = true;
> >	}
> >
> >	// kswapd is blocked here waiting for !current_is_kswapd().
>
> That would be a shame, for kswapd to wait for !current_is_kswapd()!

Yes, but current code (not your patch) does allow kswapd to wait for
memory allocations of !current_is_kswapd() threads to complete.

> But seriously, I think I understand what you mean by that, you're
> thinking that kswapd would be waiting on some other task to clear
> the too_many_isolated() condition?

Yes.

> No, it does not work that way: kswapd (never seeing too_many_isolated()
> because that always says false when current_is_kswapd()) never tries to
> take the pgdat->too_many_isolated mutex itself: it does not wait there
> at all, although other tasks may be waiting there at the time.

I know. I described the behavior of your patch on the assumption that my
guess (that your "..." part corresponds to kswapd doing writepage) is
correct.

> Perhaps my naming the mutex "too_many_isolated", same as the function,
> is actually confusing, when I had intended it to be helpful.

Not confusing at all. It is helpful. I just wanted to confirm what comes
in your "..." part.

> >	if (got_mutex)
> >		mutex_unlock(&pgdat->too_many_isolated);
> >
> > > But that was on a different kernel, and there I did have to make sure
> > > that PF_MEMALLOC always prevented us from nesting: I'm not certain of
> > > that in the current kernel (but do remember Johannes changing the memcg
> > > end to make it use PF_MEMALLOC too). I offer the preview above, to see
> > > if you're interested in that alternative: if you are, then I'll go ahead
> > > and make it into an actual patch against v4.13-rc.
> >
> > I don't know what your actual patch looks like, but the problem is that
> > pgdat->too_many_isolated waits for kswapd while kswapd waits for
> > pgdat->too_many_isolated; nobody can unlock pgdat->too_many_isolated if
> > once we hit it.
>
> Not so (and we'd hardly be finding it a useful patch if that were so).

Current code allows kswapd to wait for memory allocation of
!current_is_kswapd() threads, and thus !current_is_kswapd() threads wait
for current_is_kswapd() threads while current_is_kswapd() threads wait
for !current_is_kswapd() threads; nobody can make too_many_isolated()
false once we hit it. Hence, this patch is proposed.

Thanks.
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Thu, 20 Jul 2017, Michal Hocko wrote:
> On Wed 19-07-17 18:54:40, Hugh Dickins wrote:
> [...]
> > You probably won't welcome getting into alternatives at this late stage;
> > but after hacking around it one way or another because of its pointless
> > lockups, I lost patience with that too_many_isolated() loop a few months
> > back (on realizing the enormous number of pages that may be isolated via
> > migrate_pages(2)), and we've been running nicely since with something like:
> >
> >	bool got_mutex = false;
> >
> >	if (unlikely(too_many_isolated(pgdat, file, sc))) {
> >		if (mutex_lock_killable(&pgdat->too_many_isolated))
> >			return SWAP_CLUSTER_MAX;
> >		got_mutex = true;
> >	}
> >	...
> >	if (got_mutex)
> >		mutex_unlock(&pgdat->too_many_isolated);
> >
> > Using a mutex to provide the intended throttling, without an infinite
> > loop or an arbitrary delay; and without having to worry (as we often did)
> > about whether those numbers in too_many_isolated() are really appropriate.
> > No premature OOMs complained of yet.
> >
> > But that was on a different kernel, and there I did have to make sure
> > that PF_MEMALLOC always prevented us from nesting: I'm not certain of
> > that in the current kernel (but do remember Johannes changing the memcg
> > end to make it use PF_MEMALLOC too). I offer the preview above, to see
> > if you're interested in that alternative: if you are, then I'll go ahead
> > and make it into an actual patch against v4.13-rc.
>
> I would rather get rid of any additional locking here and my ultimate
> goal is to make throttling at the page allocator layer rather than
> inside the reclaim.

Fair enough, I'm certainly in no hurry to send the patch, but thought it
worth mentioning.

Hugh
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Thu, 20 Jul 2017, Tetsuo Handa wrote:
> Hugh Dickins wrote:
> > You probably won't welcome getting into alternatives at this late stage;
> > but after hacking around it one way or another because of its pointless
> > lockups, I lost patience with that too_many_isolated() loop a few months
> > back (on realizing the enormous number of pages that may be isolated via
> > migrate_pages(2)), and we've been running nicely since with something like:
> >
> >	bool got_mutex = false;
> >
> >	if (unlikely(too_many_isolated(pgdat, file, sc))) {
> >		if (mutex_lock_killable(&pgdat->too_many_isolated))
> >			return SWAP_CLUSTER_MAX;
> >		got_mutex = true;
> >	}
> >	...
> >	if (got_mutex)
> >		mutex_unlock(&pgdat->too_many_isolated);
> >
> > Using a mutex to provide the intended throttling, without an infinite
> > loop or an arbitrary delay; and without having to worry (as we often did)
> > about whether those numbers in too_many_isolated() are really appropriate.
> > No premature OOMs complained of yet.
>
> Roughly speaking, there is a moment where shrink_inactive_list() acts
> like below.
>
>	bool got_mutex = false;
>
>	if (!current_is_kswapd()) {
>		if (mutex_lock_killable(&pgdat->too_many_isolated))
>			return SWAP_CLUSTER_MAX;
>		got_mutex = true;
>	}
>
>	// kswapd is blocked here waiting for !current_is_kswapd().

That would be a shame, for kswapd to wait for !current_is_kswapd()!

But seriously, I think I understand what you mean by that: you're
thinking that kswapd would be waiting on some other task to clear
the too_many_isolated() condition?

No, it does not work that way: kswapd (never seeing too_many_isolated()
because that always says false when current_is_kswapd()) never tries to
take the pgdat->too_many_isolated mutex itself: it does not wait there
at all, although other tasks may be waiting there at the time.

Perhaps my naming the mutex "too_many_isolated", same as the function,
is actually confusing, when I had intended it to be helpful.

>	if (got_mutex)
>		mutex_unlock(&pgdat->too_many_isolated);
>
> > But that was on a different kernel, and there I did have to make sure
> > that PF_MEMALLOC always prevented us from nesting: I'm not certain of
> > that in the current kernel (but do remember Johannes changing the memcg
> > end to make it use PF_MEMALLOC too). I offer the preview above, to see
> > if you're interested in that alternative: if you are, then I'll go ahead
> > and make it into an actual patch against v4.13-rc.
>
> I don't know what your actual patch looks like, but the problem is that
> pgdat->too_many_isolated waits for kswapd while kswapd waits for
> pgdat->too_many_isolated; nobody can unlock pgdat->too_many_isolated if
> once we hit it.

Not so (and we'd hardly be finding it a useful patch if that were so).

Hugh
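[Editorial note: Hugh's actual patch is not posted in this thread, but the behaviour he describes — direct reclaimers serializing on a per-node mutex while kswapd bypasses it entirely — can be sketched as a userspace pthread analogue. Everything here (`struct node`, `nr_isolated`, the threshold of 64, the return value of 32) is invented for illustration, and the "killable" aspect of mutex_lock_killable() is omitted; only the control flow mirrors the snippet quoted above.]

```c
#include <pthread.h>
#include <stdbool.h>

/* Userspace sketch of the mutex-based throttle: when too many pages are
 * isolated, direct reclaimers queue on a per-node mutex; kswapd-like
 * callers skip the check entirely and so can never block on it. */
struct node {
	pthread_mutex_t too_many_isolated;	/* throttles direct reclaimers */
	int nr_isolated;			/* pretend isolation counter */
};

static bool too_many_isolated(struct node *n)
{
	return n->nr_isolated > 64;		/* arbitrary illustrative limit */
}

/* Returns the number of pages "reclaimed". */
static int shrink_list(struct node *n, bool is_kswapd)
{
	bool got_mutex = false;

	/* kswapd never evaluates too_many_isolated() as true, so it never
	 * touches the mutex and cannot wait on other reclaimers here. */
	if (!is_kswapd && too_many_isolated(n)) {
		pthread_mutex_lock(&n->too_many_isolated);
		got_mutex = true;
	}

	int reclaimed = 32;			/* ... actual reclaim work ... */

	if (got_mutex)
		pthread_mutex_unlock(&n->too_many_isolated);
	return reclaimed;
}
```

Because the kswapd-like caller never reaches the mutex path, it can only be delayed by a direct reclaimer through means outside this function (e.g. waiting on writeback) — which is exactly the distinction Hugh draws above.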
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Fri 21-07-17 16:01:04, Andrew Morton wrote:
> On Thu, 20 Jul 2017 08:56:26 +0200 Michal Hocko wrote:
>
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> > > >  	int file = is_file_lru(lru);
> > > >  	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > > >  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> > > > +	bool stalled = false;
> > > >
> > > >  	while (unlikely(too_many_isolated(pgdat, file, sc))) {
> > > > -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > > +		if (stalled)
> > > > +			return 0;
> > > > +
> > > > +		/* wait a bit for the reclaimer. */
> > > > +		schedule_timeout_interruptible(HZ/10);
> > >
> > > a) if this task has signal_pending(), this falls straight through
> > >    and I suspect the code breaks?
> >
> > It will not break. It will return to the allocation path more quickly
> > but no over-reclaim will happen and it will/should get throttled there.
> > So nothing critical.
> >
> > > b) replacing congestion_wait() with schedule_timeout_interruptible()
> > >    means this task no longer contributes to load average here and it's
> > >    a (slightly) user-visible change.
> >
> > you are right. I am not sure it matters but it might be visible.
> >
> > > c) msleep_interruptible() is nicer
> > >
> > > d) IOW, methinks we should be using msleep() here?
> >
> > OK, I do not have objections. Are you going to squash this in or want a
> > separate patch explaining all the above?
>
> I'd prefer to have a comment explaining why interruptible sleep is
> being used, because that "what if signal_pending()" case is rather a
> red flag.

I didn't really consider interruptible vs. uninterruptible sleep, so it
wasn't a deliberate decision. Now that you have brought up the above
points, I am OK changing it to uninterruptible. Here is a fixup. I am
fine with this either folded in or as a separate patch.

---
From 4b295fc1625d499a2e1283b036aea005158b33cc Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Mon, 24 Jul 2017 08:43:23 +0200
Subject: [PATCH] mm, vmscan: throttle too_many_isolated by uninterruptible
 sleep

"mm, vmscan: do not loop on too_many_isolated for ever" has added an
interruptible sleep into shrink_inactive_list when we have isolated too
many pages. Andrew has noticed that we used to sleep in uninterruptible
sleep previously (in congestion_wait) and that changing that is wrong
for at least two reasons:
	- the waiting task would not participate in the load average anymore
	- a task with a pending signal will not sleep and will bail out
	  immediately
While neither of those issues is critical in any way, they are
unnecessary. The interruptible sleep was more an oversight than a
deliberate decision. Fix this by using msleep instead.

Spotted-by: Andrew Morton
Signed-off-by: Michal Hocko
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7ba9467b6d58..ed0c29a3db16 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1749,7 +1749,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 			return 0;
 
 		/* wait a bit for the reclaimer. */
-		schedule_timeout_interruptible(HZ/10);
+		msleep(100);
 		stalled = true;
 
 		/* We are about to die and free our memory. Return now. */
-- 
2.13.2

-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Thu, 20 Jul 2017 08:56:26 +0200 Michal Hocko wrote:

> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> > >  	int file = is_file_lru(lru);
> > >  	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > >  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> > > +	bool stalled = false;
> > >
> > >  	while (unlikely(too_many_isolated(pgdat, file, sc))) {
> > > -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > +		if (stalled)
> > > +			return 0;
> > > +
> > > +		/* wait a bit for the reclaimer. */
> > > +		schedule_timeout_interruptible(HZ/10);
> >
> > a) if this task has signal_pending(), this falls straight through
> >    and I suspect the code breaks?
>
> It will not break. It will return to the allocation path more quickly
> but no over-reclaim will happen and it will/should get throttled there.
> So nothing critical.
>
> > b) replacing congestion_wait() with schedule_timeout_interruptible()
> >    means this task no longer contributes to load average here and it's
> >    a (slightly) user-visible change.
>
> you are right. I am not sure it matters but it might be visible.
>
> > c) msleep_interruptible() is nicer
> >
> > d) IOW, methinks we should be using msleep() here?
>
> OK, I do not have objections. Are you going to squash this in or want a
> separate patch explaining all the above?

I'd prefer to have a comment explaining why interruptible sleep is being
used, because that "what if signal_pending()" case is rather a red flag.

Is it the case that fall-through-if-signal_pending() is the *preferred*
behaviour? If so, the comment should explain this.

If it isn't the preferred behaviour then using uninterruptible sleep
sounds better to me, if only because it saves us from having to test a
rather tricky and rare case.
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Wed 19-07-17 18:54:40, Hugh Dickins wrote:
[...]
> You probably won't welcome getting into alternatives at this late stage;
> but after hacking around it one way or another because of its pointless
> lockups, I lost patience with that too_many_isolated() loop a few months
> back (on realizing the enormous number of pages that may be isolated via
> migrate_pages(2)), and we've been running nicely since with something like:
>
>	bool got_mutex = false;
>
>	if (unlikely(too_many_isolated(pgdat, file, sc))) {
>		if (mutex_lock_killable(&pgdat->too_many_isolated))
>			return SWAP_CLUSTER_MAX;
>		got_mutex = true;
>	}
>	...
>	if (got_mutex)
>		mutex_unlock(&pgdat->too_many_isolated);
>
> Using a mutex to provide the intended throttling, without an infinite
> loop or an arbitrary delay; and without having to worry (as we often did)
> about whether those numbers in too_many_isolated() are really appropriate.
> No premature OOMs complained of yet.
>
> But that was on a different kernel, and there I did have to make sure
> that PF_MEMALLOC always prevented us from nesting: I'm not certain of
> that in the current kernel (but do remember Johannes changing the memcg
> end to make it use PF_MEMALLOC too). I offer the preview above, to see
> if you're interested in that alternative: if you are, then I'll go ahead
> and make it into an actual patch against v4.13-rc.

I would rather get rid of any additional locking here and my ultimate
goal is to make throttling at the page allocator layer rather than
inside the reclaim.

-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
Hugh Dickins wrote:
> You probably won't welcome getting into alternatives at this late stage;
> but after hacking around it one way or another because of its pointless
> lockups, I lost patience with that too_many_isolated() loop a few months
> back (on realizing the enormous number of pages that may be isolated via
> migrate_pages(2)), and we've been running nicely since with something like:
>
>	bool got_mutex = false;
>
>	if (unlikely(too_many_isolated(pgdat, file, sc))) {
>		if (mutex_lock_killable(&pgdat->too_many_isolated))
>			return SWAP_CLUSTER_MAX;
>		got_mutex = true;
>	}
>	...
>	if (got_mutex)
>		mutex_unlock(&pgdat->too_many_isolated);
>
> Using a mutex to provide the intended throttling, without an infinite
> loop or an arbitrary delay; and without having to worry (as we often did)
> about whether those numbers in too_many_isolated() are really appropriate.
> No premature OOMs complained of yet.

Roughly speaking, there is a moment where shrink_inactive_list() acts
like below.

	bool got_mutex = false;

	if (!current_is_kswapd()) {
		if (mutex_lock_killable(&pgdat->too_many_isolated))
			return SWAP_CLUSTER_MAX;
		got_mutex = true;
	}

	// kswapd is blocked here waiting for !current_is_kswapd().

	if (got_mutex)
		mutex_unlock(&pgdat->too_many_isolated);

> But that was on a different kernel, and there I did have to make sure
> that PF_MEMALLOC always prevented us from nesting: I'm not certain of
> that in the current kernel (but do remember Johannes changing the memcg
> end to make it use PF_MEMALLOC too). I offer the preview above, to see
> if you're interested in that alternative: if you are, then I'll go ahead
> and make it into an actual patch against v4.13-rc.

I don't know what your actual patch looks like, but the problem is that
pgdat->too_many_isolated waits for kswapd while kswapd waits for
pgdat->too_many_isolated; nobody can unlock pgdat->too_many_isolated if
once we hit it.
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Wed 19-07-17 15:20:14, Andrew Morton wrote:
> On Mon, 10 Jul 2017 09:48:42 +0200 Michal Hocko wrote:
>
> > From: Michal Hocko
> >
> > Tetsuo Handa has reported [1][2][3] that direct reclaimers might get stuck
> > in too_many_isolated loop basically for ever because the last few pages
> > on the LRU lists are isolated by the kswapd which is stuck on fs locks
> > when doing the pageout or slab reclaim. This in turn means that there is
> > nobody to actually trigger the oom killer and the system is basically
> > unusable.
> >
> > too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
> > direct reclaim when too many pages are isolated already") to prevent
> > from pre-mature oom killer invocations because back then no reclaim
> > progress could indeed trigger the OOM killer too early. But since the
> > oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> > the allocation/reclaim retry loop considers all the reclaimable pages
> > and throttles the allocation at that layer so we can loosen the direct
> > reclaim throttling.
> >
> > Make shrink_inactive_list loop over too_many_isolated bounded and returns
> > immediately when the situation hasn't resolved after the first sleep.
> > Replace congestion_wait by a simple schedule_timeout_interruptible because
> > we are not really waiting on the IO congestion in this path.
> >
> > Please note that this patch can theoretically cause the OOM killer to
> > trigger earlier while there are many pages isolated for the reclaim
> > which makes progress only very slowly. This would be obvious from the oom
> > report as the number of isolated pages are printed there. If we ever hit
> > this should_reclaim_retry should consider those numbers in the evaluation
> > in one way or another.
>
> Need to figure out which kernels to patch. Maybe just 4.13-rc after a
> week or two?

I do not think we need to rush it and the next merge window should be
just OK.
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> > 	int file = is_file_lru(lru);
> > 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> > +	bool stalled = false;
> >
> > 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
> > -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +		if (stalled)
> > +			return 0;
> > +
> > +		/* wait a bit for the reclaimer. */
> > +		schedule_timeout_interruptible(HZ/10);
>
> a) if this task has signal_pending(), this falls straight through
>    and I suspect the code breaks?

It will not break. It will return to the allocation path more quickly
but no over-reclaim will happen and it will/should get throttled there.
So nothing critical.

> b) replacing congestion_wait() with schedule_timeout_interruptible()
>    means this task no longer contributes to load average here and it's
>    a (slightly) user-visible change.

you are right. I am not sure it matters but it might be visible.

> c) msleep_interruptible() is nicer
>
> d) IOW, methinks we should be using msleep() here?

OK, I do not have objections. Are you going to squash this in or want a
separate patch explaining all the above?
--
Michal Hocko
SUSE Labs
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Mon, 10 Jul 2017, Michal Hocko wrote:
> From: Michal Hocko
>
> Tetsuo Handa has reported [1][2][3] that direct reclaimers might get stuck
> in too_many_isolated loop basically for ever because the last few pages
> on the LRU lists are isolated by the kswapd which is stuck on fs locks
> when doing the pageout or slab reclaim. This in turn means that there is
> nobody to actually trigger the oom killer and the system is basically
> unusable.
>
> too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
> direct reclaim when too many pages are isolated already") to prevent
> from pre-mature oom killer invocations because back then no reclaim
> progress could indeed trigger the OOM killer too early. But since the
> oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> the allocation/reclaim retry loop considers all the reclaimable pages
> and throttles the allocation at that layer so we can loosen the direct
> reclaim throttling.
>
> Make shrink_inactive_list loop over too_many_isolated bounded and returns
> immediately when the situation hasn't resolved after the first sleep.
> Replace congestion_wait by a simple schedule_timeout_interruptible because
> we are not really waiting on the IO congestion in this path.
>
> Please note that this patch can theoretically cause the OOM killer to
> trigger earlier while there are many pages isolated for the reclaim
> which makes progress only very slowly. This would be obvious from the oom
> report as the number of isolated pages are printed there. If we ever hit
> this should_reclaim_retry should consider those numbers in the evaluation
> in one way or another.
> > [1] > http://lkml.kernel.org/r/201602092349.acg81273.osvtmjqhlof...@i-love.sakura.ne.jp > [2] > http://lkml.kernel.org/r/201702212335.djb30777.jofmhsftvlq...@i-love.sakura.ne.jp > [3] > http://lkml.kernel.org/r/201706300914.ceh95859.fmqolvfhjft...@i-love.sakura.ne.jp > > Acked-by: Mel Gorman > Tested-by: Tetsuo Handa > Signed-off-by: Michal Hocko > --- > Hi, > I am resubmitting this patch previously sent here > http://lkml.kernel.org/r/20170307133057.26182-1-mho...@kernel.org. > > Johannes and Rik had some concerns that this could lead to premature > OOM kills. I agree with them that we need a better throttling > mechanism. Until now we didn't give the issue described above a high > priority because it usually required a really insane workload to > trigger. But it seems that the issue can be reproduced also without > having an insane number of competing threads [3]. > > Moreover, the issue also triggers very often while testing heavy memory > pressure and so prevents further development of hardening of that area > (http://lkml.kernel.org/r/201707061948.icj18763.tvfoqfohmjf...@i-love.sakura.ne.jp). > Tetsuo hasn't seen any negative effect of this patch in his oom stress > tests so I think we should go with this simple patch for now and think > about something more robust long term. > > That being said I suggest merging this (after spending the full release > cycle in linux-next) for the time being until we come up with a more > clever solution. 
>
> mm/vmscan.c | 8 +++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c15b2e4c47ca..4ae069060ae5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> 	int file = is_file_lru(lru);
> 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> +	bool stalled = false;
>
> 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
> -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> +		if (stalled)
> +			return 0;
> +
> +		/* wait a bit for the reclaimer. */
> +		schedule_timeout_interruptible(HZ/10);
> +		stalled = true;
>
> 		/* We are about to die and free our memory. Return now. */
> 		if (fatal_signal_pending(current))
> --

You probably won't welcome getting into alternatives at this late stage;
but after hacking around it one way or another because of its pointless
lockups, I lost patience with that too_many_isolated() loop a few months
back (on realizing the enormous number of pages that may be isolated via
migrate_pages(2)), and we've been running nicely since with something like:

	bool got_mutex = false;

	if (unlikely(too_many_isolated(pgdat, file, sc))) {
		if (mutex_lock_killable(&pgdat->too_many_isolated))
			return SWAP_CLUSTER_MAX;
		got_mutex = true;
	}
	...
	if (got_mutex)
		mutex_unlock(&pgdat->too_many_isolated);

Using a mutex to provide the intended throttling, without an infinite
loop or an arbitrary delay; and without having to worry (as we often did)
about whether those numbers in too_many_isolated() are really appropriate.
No premature OOMs complained of yet.
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Mon, 10 Jul 2017 09:48:42 +0200 Michal Hocko wrote:
> From: Michal Hocko
>
> Tetsuo Handa has reported [1][2][3] that direct reclaimers might get stuck
> in too_many_isolated loop basically for ever because the last few pages
> on the LRU lists are isolated by the kswapd which is stuck on fs locks
> when doing the pageout or slab reclaim. This in turn means that there is
> nobody to actually trigger the oom killer and the system is basically
> unusable.
>
> too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
> direct reclaim when too many pages are isolated already") to prevent
> from pre-mature oom killer invocations because back then no reclaim
> progress could indeed trigger the OOM killer too early. But since the
> oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> the allocation/reclaim retry loop considers all the reclaimable pages
> and throttles the allocation at that layer so we can loosen the direct
> reclaim throttling.
>
> Make shrink_inactive_list loop over too_many_isolated bounded and returns
> immediately when the situation hasn't resolved after the first sleep.
> Replace congestion_wait by a simple schedule_timeout_interruptible because
> we are not really waiting on the IO congestion in this path.
>
> Please note that this patch can theoretically cause the OOM killer to
> trigger earlier while there are many pages isolated for the reclaim
> which makes progress only very slowly. This would be obvious from the oom
> report as the number of isolated pages are printed there. If we ever hit
> this should_reclaim_retry should consider those numbers in the evaluation
> in one way or another.

Need to figure out which kernels to patch. Maybe just 4.13-rc after a
week or two?
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> 	int file = is_file_lru(lru);
> 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> +	bool stalled = false;
>
> 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
> -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> +		if (stalled)
> +			return 0;
> +
> +		/* wait a bit for the reclaimer. */
> +		schedule_timeout_interruptible(HZ/10);

a) if this task has signal_pending(), this falls straight through
   and I suspect the code breaks?

b) replacing congestion_wait() with schedule_timeout_interruptible()
   means this task no longer contributes to load average here and it's
   a (slightly) user-visible change.

c) msleep_interruptible() is nicer

d) IOW, methinks we should be using msleep() here?

> +		stalled = true;
>
> 		/* We are about to die and free our memory. Return now. */
> 		if (fatal_signal_pending(current))

(Gets distracted by the thought that we should do
s/msleep/msleep_uninterruptible/g)
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Mon 10-07-17 12:58:59, Johannes Weiner wrote:
> On Mon, Jul 10, 2017 at 09:58:03AM -0400, Rik van Riel wrote:
> > On Mon, 2017-07-10 at 09:48 +0200, Michal Hocko wrote:
> > >
> > > Johannes and Rik had some concerns that this could lead to premature
> > > OOM kills. I agree with them that we need a better throttling
> > > mechanism. Until now we didn't give the issue described above a high
> > > priority because it usually required a really insane workload to
> > > trigger. But it seems that the issue can be reproduced also without
> > > having an insane number of competing threads [3].
> >
> > My worries stand, but lets fix the real observed bug, and not worry
> > too much about the theoretical bug for now.
> >
> > Acked-by: Rik van Riel
>
> I agree with this.
>
> Acked-by: Johannes Weiner

Thanks to both of you. Just to make it clear. I really do want to
address the throttling problem longterm properly. I do not have any
great ideas to be honest. I am busy with other things so it might be
quite some time before I come up with something.
--
Michal Hocko
SUSE Labs
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Mon, Jul 10, 2017 at 09:58:03AM -0400, Rik van Riel wrote:
> On Mon, 2017-07-10 at 09:48 +0200, Michal Hocko wrote:
> >
> > Johannes and Rik had some concerns that this could lead to premature
> > OOM kills. I agree with them that we need a better throttling
> > mechanism. Until now we didn't give the issue described above a high
> > priority because it usually required a really insane workload to
> > trigger. But it seems that the issue can be reproduced also without
> > having an insane number of competing threads [3].
>
> My worries stand, but lets fix the real observed bug, and not worry
> too much about the theoretical bug for now.
>
> Acked-by: Rik van Riel

I agree with this.

Acked-by: Johannes Weiner
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Mon, 2017-07-10 at 09:48 +0200, Michal Hocko wrote: > Johannes and Rik had some concerns that this could lead to premature > OOM kills. I agree with them that we need a better throttling > mechanism. Until now we didn't give the issue described above a high > priority because it usually required a really insane workload to > trigger. But it seems that the issue can be reproduced also without > having an insane number of competing threads [3]. My worries stand, but lets fix the real observed bug, and not worry too much about the theoretical bug for now. Acked-by: Rik van Riel
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On 07/10/2017 09:48 AM, Michal Hocko wrote:
> From: Michal Hocko
>
> Tetsuo Handa has reported [1][2][3] that direct reclaimers might get stuck
> in too_many_isolated loop basically for ever because the last few pages
> on the LRU lists are isolated by the kswapd which is stuck on fs locks
> when doing the pageout or slab reclaim. This in turn means that there is
> nobody to actually trigger the oom killer and the system is basically
> unusable.
>
> too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
> direct reclaim when too many pages are isolated already") to prevent
> from pre-mature oom killer invocations because back then no reclaim
> progress could indeed trigger the OOM killer too early. But since the
> oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> the allocation/reclaim retry loop considers all the reclaimable pages
> and throttles the allocation at that layer so we can loosen the direct
> reclaim throttling.
>
> Make shrink_inactive_list loop over too_many_isolated bounded and returns
> immediately when the situation hasn't resolved after the first sleep.
> Replace congestion_wait by a simple schedule_timeout_interruptible because
> we are not really waiting on the IO congestion in this path.
>
> Please note that this patch can theoretically cause the OOM killer to
> trigger earlier while there are many pages isolated for the reclaim
> which makes progress only very slowly. This would be obvious from the oom
> report as the number of isolated pages are printed there. If we ever hit
> this should_reclaim_retry should consider those numbers in the evaluation
> in one way or another.
>
> [1] http://lkml.kernel.org/r/201602092349.acg81273.osvtmjqhlof...@i-love.sakura.ne.jp
> [2] http://lkml.kernel.org/r/201702212335.djb30777.jofmhsftvlq...@i-love.sakura.ne.jp
> [3] http://lkml.kernel.org/r/201706300914.ceh95859.fmqolvfhjft...@i-love.sakura.ne.jp
>
> Acked-by: Mel Gorman
> Tested-by: Tetsuo Handa
> Signed-off-by: Michal Hocko

Let's hope there won't be premature OOM's then.

Acked-by: Vlastimil Babka
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On 07/10/2017 09:48 AM, Michal Hocko wrote: > From: Michal Hocko > > Tetsuo Handa has reported [1][2][3]that direct reclaimers might get stuck > in too_many_isolated loop basically for ever because the last few pages > on the LRU lists are isolated by the kswapd which is stuck on fs locks > when doing the pageout or slab reclaim. This in turn means that there is > nobody to actually trigger the oom killer and the system is basically > unusable. > > too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle > direct reclaim when too many pages are isolated already") to prevent > from pre-mature oom killer invocations because back then no reclaim > progress could indeed trigger the OOM killer too early. But since the > oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection") > the allocation/reclaim retry loop considers all the reclaimable pages > and throttles the allocation at that layer so we can loosen the direct > reclaim throttling. > > Make shrink_inactive_list loop over too_many_isolated bounded and returns > immediately when the situation hasn't resolved after the first sleep. > Replace congestion_wait by a simple schedule_timeout_interruptible because > we are not really waiting on the IO congestion in this path. > > Please note that this patch can theoretically cause the OOM killer to > trigger earlier while there are many pages isolated for the reclaim > which makes progress only very slowly. This would be obvious from the oom > report as the number of isolated pages are printed there. If we ever hit > this should_reclaim_retry should consider those numbers in the evaluation > in one way or another. 
> [1] http://lkml.kernel.org/r/201602092349.acg81273.osvtmjqhlof...@i-love.sakura.ne.jp
> [2] http://lkml.kernel.org/r/201702212335.djb30777.jofmhsftvlq...@i-love.sakura.ne.jp
> [3] http://lkml.kernel.org/r/201706300914.ceh95859.fmqolvfhjft...@i-love.sakura.ne.jp
>
> Acked-by: Mel Gorman
> Tested-by: Tetsuo Handa
> Signed-off-by: Michal Hocko

Let's hope there won't be premature OOM's then.

Acked-by: Vlastimil Babka
[PATCH] mm, vmscan: do not loop on too_many_isolated for ever
From: Michal Hocko

Tetsuo Handa has reported [1][2][3] that direct reclaimers might get stuck
in the too_many_isolated loop basically for ever because the last few pages
on the LRU lists are isolated by kswapd, which is stuck on fs locks when
doing pageout or slab reclaim. This in turn means that there is nobody to
actually trigger the OOM killer and the system is basically unusable.

too_many_isolated was introduced by 35cd78156c49 ("vmscan: throttle direct
reclaim when too many pages are isolated already") to prevent premature OOM
killer invocations, because back then no reclaim progress could indeed
trigger the OOM killer too early. But since the OOM detection rework
0a0337e0d1d1 ("mm, oom: rework oom detection"), the allocation/reclaim
retry loop considers all the reclaimable pages and throttles the allocation
at that layer, so we can loosen the direct reclaim throttling.

Make the shrink_inactive_list loop over too_many_isolated bounded and
return immediately when the situation hasn't resolved after the first
sleep. Replace congestion_wait by a simple schedule_timeout_interruptible
because we are not really waiting on IO congestion in this path.

Please note that this patch can theoretically cause the OOM killer to
trigger earlier while there are many pages isolated for reclaim which
makes progress only very slowly. This would be obvious from the OOM report,
as the number of isolated pages is printed there. If we ever hit this,
should_reclaim_retry should consider those numbers in the evaluation in
one way or another.
[1] http://lkml.kernel.org/r/201602092349.acg81273.osvtmjqhlof...@i-love.sakura.ne.jp
[2] http://lkml.kernel.org/r/201702212335.djb30777.jofmhsftvlq...@i-love.sakura.ne.jp
[3] http://lkml.kernel.org/r/201706300914.ceh95859.fmqolvfhjft...@i-love.sakura.ne.jp

Acked-by: Mel Gorman
Tested-by: Tetsuo Handa
Signed-off-by: Michal Hocko
---
Hi,
I am resubmitting this patch previously sent here
http://lkml.kernel.org/r/20170307133057.26182-1-mho...@kernel.org.

Johannes and Rik had some concerns that this could lead to premature OOM
kills. I agree with them that we need a better throttling mechanism. Until
now we didn't give the issue described above a high priority because it
usually required a really insane workload to trigger. But it seems that the
issue can be reproduced also without having an insane number of competing
threads [3]. Moreover, the issue also triggers very often while testing
heavy memory pressure and so prevents further development of hardening of
that area
(http://lkml.kernel.org/r/201707061948.icj18763.tvfoqfohmjf...@i-love.sakura.ne.jp).
Tetsuo hasn't seen any negative effect of this patch in his oom stress
tests, so I think we should go with this simple patch for now and think
about something more robust long term.

That being said, I suggest merging this (after spending the full release
cycle in linux-next) for the time being until we come up with a more
clever solution.

 mm/vmscan.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c15b2e4c47ca..4ae069060ae5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	int file = is_file_lru(lru);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	bool stalled = false;
 
 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		if (stalled)
+			return 0;
+
+		/* wait a bit for the reclaimer. */
+		schedule_timeout_interruptible(HZ/10);
+		stalled = true;
 
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
-- 
2.11.0
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
Michal Hocko wrote:
> On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > It is really hard to pursue this half solution when there is no clear
> > > indication it helps in your testing. So could you try to test with only
> > > this patch on top of the current linux-next tree (or Linus tree) and see
> > > if you can reproduce the problem?
> >
> > With this patch on top of next-20170630, I no longer hit this problem.
> > (Of course, this is because this patch eliminates the infinite loop.)
>
> I assume you haven't seen other negative side effects, like unexpected
> OOMs etc... Are you willing to give your Tested-by?

I didn't see other negative side effects.

Tested-by: Tetsuo Handa

We need a long time for testing this patch in linux-next.git (and I give up
this handy bug for finding other bugs under almost-OOM situations).
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > It is really hard to pursue this half solution when there is no clear
> > indication it helps in your testing. So could you try to test with only
> > this patch on top of the current linux-next tree (or Linus tree) and see
> > if you can reproduce the problem?
>
> With this patch on top of next-20170630, I no longer hit this problem.
> (Of course, this is because this patch eliminates the infinite loop.)

I assume you haven't seen other negative side effects, like unexpected
OOMs etc... Are you willing to give your Tested-by?
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
[this is getting tangent again and I will not respond any further if this
turns into yet another flame]

On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > I really do appreciate your testing because it uncovers corner cases
> > most people do not test for and we can actually make the code better in
> > the end.
>
> That statement does not get to my heart at all. Collision between your
> approach and my approach is wasting both your time and my time.
>
> I've reported this too_many_isolated() trap three years ago at
> http://lkml.kernel.org/r/201407022140.bfj13092.qvosjtfmfhl...@i-love.sakura.ne.jp .
> Do you know that we already wasted 3 years without any attention?

And how many real bugs have we seen in those three years? Well, zero AFAIR,
except for your corner case testing. So while I never dismissed the
problem, I've been saying this is not that trivial to fix, as my attempt to
address it and the review feedback I've received shows.

> You are rejecting serialization under OOM without giving a chance to test
> side effects of serialization under OOM at linux-next.git. I call such an
> attitude "speculation", which you never accept.

No, I am rejecting abusing the lock for a purpose it is not aimed for.

> Look at mem_cgroup_out_of_memory(). Memcg OOM does use serialization.
> In the first place, if the system is under global OOM (which is a more
> serious situation than memcg OOM), delay caused by serialization will not
> matter. Rather, I consider that making sure that the system does not get
> locked up is more important. I'm reporting that serialization helps
> facilitating the OOM killer/reaper operations, avoiding lockups, and
> solving global OOM situations smoothly. But you are refusing my report
> without giving a chance to test what side effects will pop up at
> linux-next.git.

You are mixing oranges with apples here. We do synchronize the memcg oom
killer the same way as the global one.

> Knowledge about OOM situations is hardly shared among Linux developers
> and users, and is far from an object of concern. Like shown by the
> cgroup-aware OOM killer proposal, what will happen if we restrict
> 0 <= oom_victims <= 1 is not shared among developers.
>
> How many developers joined my OOM watchdog proposal? Every time and ever
> it is a confrontation between you and me. You, as effectively the only
> participant, showing negative attitude is effectively a Nacked-by:
> response without an alternative proposal.

This is something all of us have to fight with. There are only so many MM
developers. You have to justify your changes in order to attract other
developers/users. You are basing your changes on speculations and what-ifs
for workloads that most developers consider borderline and
misconfigurations already.

> Not everybody can afford testing with absolutely latest upstream kernels.
> Not being prepared to obtain information for analysis using distributor
> kernels makes it impossible to compare whether users' problems are
> already fixed in upstream kernels, makes it impossible to identify
> patches which need to be backported to distributor kernels, and is bad
> for customers using distributor kernels. Of course, it is possible that
> distributors decide not to allow users to obtain information for
> analysis, but such a decision cannot become a reason we cannot prepare
> to obtain information for analysis in upstream kernels.

If you have to work with distribution kernels then talk to distribution
people. It is that simple. You are surely not using those systems just
because of a fancy logo...

[...]

> > this way of pushing your patch is really annoying. Please do realize
> > that repeating the same thing all around will not make a patch more
> > likely to merge. You have proposed something, nobody has nacked it
> > so it waits for people to actually find it important enough to justify
> > the additional code. So please stop this.
>
> When will people find time to judge it? We already wasted three years,
> and knowledge about OOM situations is hardly shared among Linux
> developers and users, and will unlikely be an object of concern. How many
> years (or decades) will we waste more? The MM subsystem will change
> meanwhile and we will just ignore old kernels.
>
> If you do want me to stop bringing the watchdog here and there, please do
> show an alternative approach which I can tolerate. If you cannot afford
> it, please allow me to involve people (e.g. you make calls for joining my
> proposals, because you are asking me to wait until people find time to
> judge it). Please do realize that just repeatedly saying "wait patiently"
> helps nothing.

You really have to realize that there will hardly be more interest in your
reports when they do not reflect real life situations. I have said (several
times) that those issues should be addressed eventually, but there are more
pressing issues which do trigger in real life and they have precedence.
Should we add a
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
Michal Hocko wrote:
> I really do appreciate your testing because it uncovers corner cases
> most people do not test for and we can actually make the code better in
> the end.

That statement does not get to my heart at all. Collision between your
approach and my approach is wasting both your time and my time.

I've reported this too_many_isolated() trap three years ago at
http://lkml.kernel.org/r/201407022140.bfj13092.qvosjtfmfhl...@i-love.sakura.ne.jp .
Do you know that we already wasted 3 years without any attention?

You are rejecting serialization under OOM without giving a chance to test
side effects of serialization under OOM at linux-next.git. I call such an
attitude "speculation", which you never accept.

Look at mem_cgroup_out_of_memory(). Memcg OOM does use serialization.
In the first place, if the system is under global OOM (which is a more
serious situation than memcg OOM), delay caused by serialization will not
matter. Rather, I consider that making sure that the system does not get
locked up is more important. I'm reporting that serialization helps
facilitating the OOM killer/reaper operations, avoiding lockups, and
solving global OOM situations smoothly. But you are refusing my report
without giving a chance to test what side effects will pop up at
linux-next.git.

Knowledge about OOM situations is hardly shared among Linux developers and
users, and is far from an object of concern. Like shown by the
cgroup-aware OOM killer proposal, what will happen if we restrict
0 <= oom_victims <= 1 is not shared among developers.

How many developers joined my OOM watchdog proposal? Every time and ever it
is a confrontation between you and me. You, as effectively the only
participant, showing negative attitude is effectively a Nacked-by:
response without an alternative proposal.

Not everybody can afford testing with absolutely latest upstream kernels.
Not being prepared to obtain information for analysis using distributor
kernels makes it impossible to compare whether users' problems are already
fixed in upstream kernels, makes it impossible to identify patches which
need to be backported to distributor kernels, and is bad for customers
using distributor kernels. Of course, it is possible that distributors
decide not to allow users to obtain information for analysis, but such a
decision cannot become a reason we cannot prepare to obtain information
for analysis in upstream kernels.

Suppose I take a step back and tolerate the burden of sitting in front of
a console 24 hours a day, every day of the year so that users can press
SysRq when something goes wrong; how nice it would be if all in-flight
allocation requests were printed upon SysRq. show_workqueue_state() being
called upon SysRq-t is to some degree useful. In fact, my proposal was
such an approach before I serialized using a kernel thread (e.g.
http://lkml.kernel.org/r/201411231351.hja17065.vhqsfojftlf...@i-love.sakura.ne.jp
which I proposed two and a half years ago). Though, while my proposal was
left ignored, I learned that showing only the current thread is not
sufficient, and updated my watchdog to show other threads (e.g. kswapd)
using serialization.

A patch at
http://lkml.kernel.org/r/201505232339.dab00557.vfflhmsojfo...@i-love.sakura.ne.jp
which I posted two years ago also includes a proposal for handling the
infinite shrink_inactive_list() problem. After all, this
shrink_inactive_list() problem was ignored for three years without getting
a chance to even be tested at linux-next.git. Sigh...

I know my proposals might not be best. But you cannot afford showing
alternative proposals because you are putting higher priority on other
problems. And other developers cannot afford participating because they
are not interested in, or do not share knowledge of, this problem.

My proposals do not constrain future kernels. We can revert my proposals
when they become no longer needed. My proposals are meaningful as an
interim approach, but you never accept approaches which do not match your
will (or desire). Even without giving people a chance to test what side
effects will crop up, how can your "I really do appreciate your testing"
statement get to my heart?

My watchdog allows detecting problems which were previously overlooked
unless putting unrealistic burden (e.g. standing by 24 hours a day, every
day of the year). You ask people to prove that it is a MM problem. But I
am dissatisfied that you are letting alone proposals which help judging
whether it is a MM problem.

> this way of pushing your patch is really annoying. Please do realize
> that repeating the same thing all around will not make a patch more
> likely to merge. You have proposed something, nobody has nacked it
> so it waits for people to actually find it important enough to justify
> the additional code. So please stop this.

When will people find time to judge it? We already wasted three years, and
knowledge about OOM situations is hardly shared among Linux developers and
users, and will
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Sat 01-07-17 00:59:56, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 30-06-17 09:14:22, Tetsuo Handa wrote:
> > [...]
> > > Ping? Ping? When are we going to apply this patch or the watchdog patch?
> > > This problem occurs with not so insane stress like shown below.
> > > I can't test almost-OOM situations because the test likely falls into
> > > either the printk() v.s. oom_lock lockup problem or this
> > > too_many_isolated() problem.
> >
> > So you are saying that the patch fixes this issue. Do I understand you
> > correctly? And you do not see any other negative side effects with it
> > applied?
>
> I hit this problem using
> http://lkml.kernel.org/r/20170626130346.26314-1-mho...@kernel.org
> on next-20170628. We won't be able to test whether the patch fixes this
> issue without seeing any other negative side effects without sending this
> patch to linux-next.git. But at least we know that even if this patch is
> sent to linux-next.git, we will still see bugs like
> http://lkml.kernel.org/r/201703031948.chj81278.vohsfffoolj...@i-love.sakura.ne.jp .

It is really hard to pursue this half solution when there is no clear
indication it helps in your testing. So could you try to test with only
this patch on top of the current linux-next tree (or Linus tree) and see
if you can reproduce the problem? It is possible that there are other
potential problems but we at least need to know whether it is worth going
with the patch now.

[...]

> > Rik, Johannes what do you think? Should we go with the simpler approach
> > for now and think of a better plan longterm?
>
> I don't hurry if we can check using a watchdog whether this problem is
> occurring in the real world. I have to test corner cases because the
> watchdog is missing.
>
> Watchdog does not introduce negative side effects, will avoid soft
> lockups like
> http://lkml.kernel.org/r/cam_iqpwupvgc2ky8m-9yukects+zkjidasnymx7rmcbjbfy...@mail.gmail.com ,
> will avoid console_unlock() v.s. oom_lock mutex lockups due to
> warn_alloc(), and will catch similar bugs which people are failing to
> reproduce.

this way of pushing your patch is really annoying. Please do realize
that repeating the same thing all around will not make a patch more
likely to merge. You have proposed something, nobody has nacked it
so it waits for people to actually find it important enough to justify
the additional code. So please stop this.

I really do appreciate your testing because it uncovers corner cases
most people do not test for and we can actually make the code better in
the end.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
Michal Hocko wrote:
> On Fri 30-06-17 09:14:22, Tetsuo Handa wrote:
> [...]
> > Ping? Ping? When are we going to apply this patch or watchdog patch?
> > This problem occurs with not so insane stress like shown below.
> > I can't test almost OOM situation because test likely falls into either
> > printk() v.s. oom_lock lockup problem or this too_many_isolated() problem.
>
> So you are saying that the patch fixes this issue. Do I understand you
> correctly? And you do not see any other negative side effects with it
> applied?

I hit this problem using
http://lkml.kernel.org/r/20170626130346.26314-1-mho...@kernel.org
on next-20170628. We won't be able to test whether the patch fixes this issue
without seeing any other negative side effects without sending this patch to
linux-next.git. But at least we know that even if this patch is sent to
linux-next.git, we will still see bugs like
http://lkml.kernel.org/r/201703031948.chj81278.vohsfffoolj...@i-love.sakura.ne.jp .

> I am sorry I didn't have much time to think about feedback from Johannes
> yet. A more robust throttling method is surely due but also not trivial.
> So I am not sure how to proceed. It is true that your last test case
> with only 10 processes fighting resembles the reality much better than
> hundreds (AFAIR) that you were using previously.

Even if hundreds are running, most of them are simply blocked inside open()
at down_write() (like an example from serial-20170423-2.txt.xz shown below).
The actual number of processes fighting for memory is always less than 100.

 ? __schedule+0x1d2/0x5a0
 ? schedule+0x2d/0x80
 ? rwsem_down_write_failed+0x1f9/0x370
 ? walk_component+0x43/0x270
 ? call_rwsem_down_write_failed+0x13/0x20
 ? down_write+0x24/0x40
 ? path_openat+0x670/0x1210
 ? do_filp_open+0x8c/0x100
 ? getname_flags+0x47/0x1e0
 ? do_sys_open+0x121/0x200
 ? do_syscall_64+0x5c/0x140
 ? entry_SYSCALL64_slow_path+0x25/0x25

> Rik, Johannes, what do you think? Should we go with the simpler approach
> for now and think of a better plan long term?

I don't hurry if we can check using watchdog whether this problem is
occurring in the real world. I have to test corner cases because watchdog is
missing.

Watchdog does not introduce negative side effects, will avoid soft lockups
like
http://lkml.kernel.org/r/cam_iqpwupvgc2ky8m-9yukects+zkjidasnymx7rmcbjbfy...@mail.gmail.com ,
will avoid console_unlock() v.s. oom_lock mutex lockups due to warn_alloc(),
and will catch similar bugs which people are failing to reproduce.
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Fri 30-06-17 09:14:22, Tetsuo Handa wrote:
[...]
> Ping? Ping? When are we going to apply this patch or watchdog patch?
> This problem occurs with not so insane stress like shown below.
> I can't test almost OOM situation because test likely falls into either
> printk() v.s. oom_lock lockup problem or this too_many_isolated() problem.

So you are saying that the patch fixes this issue. Do I understand you
correctly? And you do not see any other negative side effects with it
applied?

I am sorry I didn't have much time to think about feedback from Johannes yet.
A more robust throttling method is surely due but also not trivial. So I am
not sure how to proceed. It is true that your last test case with only 10
processes fighting resembles the reality much better than the hundreds
(AFAIR) that you were using previously.

Rik, Johannes, what do you think? Should we go with the simpler approach for
now and think of a better plan long term?
--
Michal Hocko
SUSE Labs
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 09-03-17 13:05:40, Johannes Weiner wrote:
> > > On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> > > > It only does this to some extent. If reclaim made no progress, for
> > > > example due to immediately bailing out because the number of already
> > > > isolated pages is too high (due to many parallel reclaimers), the
> > > > code could hit the "no_progress_loops > MAX_RECLAIM_RETRIES" test
> > > > without ever looking at the number of reclaimable pages.
> > >
> > > Hm, there is no early return there, actually. We bump the loop counter
> > > every time it happens, but then *do* look at the reclaimable pages.
> > >
> > > > Could that create problems if we have many concurrent reclaimers?
> > >
> > > With increased concurrency, the likelihood of OOM will go up if we
> > > remove the unlimited wait for isolated pages, that much is true.
> > >
> > > I'm not sure that's a bad thing, however, because we want the OOM
> > > killer to be predictable and timely. So a reasonable wait time in
> > > between 0 and forever before an allocating thread gives up under
> > > extreme concurrency makes sense to me.
> > >
> > > > It may be OK, I just do not understand all the implications.
> > > >
> > > > I like the general direction your patch takes the code in,
> > > > but I would like to understand it better...
> > >
> > > I feel the same way. The throttling logic doesn't seem to be very well
> > > thought out at the moment, making it hard to reason about what happens
> > > in certain scenarios.
> > >
> > > In that sense, this patch isn't really an overall improvement to the
> > > way things work. It patches a hole that seems to be exploitable only
> > > from an artificial OOM torture test, at the risk of regressing high
> > > concurrency workloads that may or may not be artificial.
> > >
> > > Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> > > behind this patch. Can we think about a general model to deal with
> > > allocation concurrency?
> >
> > I am definitely not against. There is no reason to rush the patch in.
>
> I don't hurry if we can check using watchdog whether this problem is
> occurring in the real world. I have to test corner cases because watchdog
> is missing.
>
> > My main point behind this patch was to reduce unbound loops from inside
> > the reclaim path and push any throttling up the call chain to the
> > page allocator path because I believe that it is easier to reason
> > about them at that level. The direct reclaim should be as simple as
> > possible without too many side effects otherwise we end up in a highly
> > unpredictable behavior. This was a first step in that direction and my
> > testing so far didn't show any regressions.
> >
> > > Unlimited parallel direct reclaim is kinda bonkers in the first place.
> > > How about checking for excessive isolation counts from the page
> > > allocator and putting allocations on a waitqueue?
> >
> > I would be interested in details here.
>
> That will help implementing __GFP_KILLABLE.
> https://bugzilla.kernel.org/show_bug.cgi?id=192981#c15

Ping? Ping? When are we going to apply this patch or watchdog patch?
This problem occurs with not so insane stress like shown below.
I can't test almost OOM situation because test likely falls into either
printk() v.s. oom_lock lockup problem or this too_many_isolated() problem.
--
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>

int main(int argc, char *argv[])
{
	static char buffer[4096] = { };
	char *buf = NULL;
	unsigned long size;
	int i;
	for (i = 0; i < 10; i++) {
		if (fork() == 0) {
			int fd = open("/proc/self/oom_score_adj", O_WRONLY);
			write(fd, "1000", 4);
			close(fd);
			sleep(1);
			if (!i)
				pause();
			snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid());
			fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
			while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer))
				fsync(fd);
			_exit(0);
		}
	}
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sleep(2);
	/* Will cause OOM due to overcommit */
	for (i = 0; i < size; i += 4096)
		buf[i] = 0;
	return 0;
}
--
Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170629-3.txt.xz .
[  190.924887] a.out           D13296  2191   2172 0x00000080
[  190.927121] Call Trace:
[  190.928304]
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Mon, Apr 24, 2017 at 10:06:32PM +0900, Tetsuo Handa wrote:
> Stanislaw Gruszka wrote:
> > On Sun, Apr 23, 2017 at 07:24:21PM +0900, Tetsuo Handa wrote:
> > > On 2017/03/10 20:44, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > I am definitely not against. There is no reason to rush the patch in.
> > > >
> > > > I don't hurry if we can check using watchdog whether this problem is
> > > > occurring in the real world. I have to test corner cases because
> > > > watchdog is missing.
> > >
> > > Ping?
> > >
> > > This problem can occur even immediately after the first invocation of
> > > the OOM killer. I believe this problem can occur in the real world.
> > > When are we going to apply this patch or watchdog patch?
> > >
> > > [    0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity)
> > > (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr
> > > 23 17:38:02 JST 2017
> > > [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+
> > > root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106
> > > crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none
> > > sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8
> > > debug_guardpage_minorder=1
> >
> > Are you debugging memory corruption problem?
>
> No. Just random testing, trying to find how we can avoid flooding of
> warn_alloc_stall() warning messages while also avoiding ratelimiting.

This is not the right way to stress the mm subsystem; the
debug_guardpage_minorder= option is for _debug_ purposes. Use mem= instead if
you want to limit available memory.

> > FWIW, if you use debug_guardpage_minorder= you can expect any
> > allocation memory problems. This option is intended to debug
> > memory corruption bugs and it shrinks available memory in
> > artificial way. Taking that, I don't think justifying any
> > patch, by problem happened when debug_guardpage_minorder= is
> > used, is reasonable.
> >
> > Stanislaw
>
> This problem occurs without debug_guardpage_minorder= parameter and

So please justify your patches by that.

Thanks
Stanislaw
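Stanislaw's suggestion — limiting RAM with the generic mem= parameter rather
than a debug option — means appending something like the following to the
kernel boot command line (the value is only an example):

```
mem=2G
```

Unlike debug_guardpage_minorder=, which carves guard pages out of the
allocator and changes its behavior, mem= simply caps the memory the kernel
uses at boot, so reclaim is stressed without the debug-only side effects.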
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
Stanislaw Gruszka wrote:
> On Sun, Apr 23, 2017 at 07:24:21PM +0900, Tetsuo Handa wrote:
> > On 2017/03/10 20:44, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > I am definitely not against. There is no reason to rush the patch in.
> > >
> > > I don't hurry if we can check using watchdog whether this problem is
> > > occurring in the real world. I have to test corner cases because
> > > watchdog is missing.
> >
> > Ping?
> >
> > This problem can occur even immediately after the first invocation of
> > the OOM killer. I believe this problem can occur in the real world.
> > When are we going to apply this patch or watchdog patch?
> >
> > [    0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity)
> > (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr
> > 23 17:38:02 JST 2017
> > [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+
> > root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106
> > crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none
> > sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8
> > debug_guardpage_minorder=1
>
> Are you debugging memory corruption problem?

No. Just random testing, trying to find how we can avoid flooding of
warn_alloc_stall() warning messages while also avoiding ratelimiting.

> FWIW, if you use debug_guardpage_minorder= you can expect any
> allocation memory problems. This option is intended to debug
> memory corruption bugs and it shrinks available memory in
> artificial way. Taking that, I don't think justifying any
> patch, by problem happened when debug_guardpage_minorder= is
> used, is reasonable.
>
> Stanislaw

This problem occurs without the debug_guardpage_minorder= parameter as well:
--
[    0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 17:38:02 JST 2017
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 vconsole.font=latarcyrheb-sun16 security=none sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8
(...snipped...)

CentOS Linux 7 (Core)
Kernel 4.11.0-rc7-next-20170421+ on an x86_64

ccsecurity login: [   31.882531] ip6_tables: (C) 2000-2006 Netfilter Core Team
[   32.550187] Ebtables v2.0 registered
[   32.730371] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[   32.926518] IPv6: ADDRCONF(NETDEV_UP): ens32: link is not ready
[   32.928310] e1000: ens32 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[   32.930960] IPv6: ADDRCONF(NETDEV_CHANGE): ens32: link becomes ready
[   33.741378] Netfilter messages via NETLINK v0.30.
[   33.807350] ip_set: protocol 6
[   37.581002] nf_conntrack: default automatic helper assignment has been turned off for security reasons and CT-based firewall rule not found. Use the iptables CT target to attach helpers instead.
[   38.072689] IPv6: ADDRCONF(NETDEV_UP): ens35: link is not ready
[   38.074419] e1000: ens35 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[   38.077222] IPv6: ADDRCONF(NETDEV_CHANGE): ens35: link becomes ready
[   92.753140] gmain invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0
[   92.763445] gmain cpuset=/ mems_allowed=0
[   92.767634] CPU: 2 PID: 2733 Comm: gmain Not tainted 4.11.0-rc7-next-20170421+ #588
[   92.773624] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   92.781790] Call Trace:
[   92.782630]  ? dump_stack+0x5c/0x7d
[   92.783902]  ? dump_header+0x97/0x233
[   92.785427]  ? ktime_get+0x30/0x90
[   92.786390]  ? delayacct_end+0x35/0x60
[   92.787433]  ? do_try_to_free_pages+0x2ca/0x370
[   92.789157]  ? oom_kill_process+0x223/0x3e0
[   92.790502]  ? has_capability_noaudit+0x17/0x20
[   92.791761]  ? oom_badness+0xeb/0x160
[   92.792783]  ? out_of_memory+0x10b/0x490
[   92.793872]  ? __alloc_pages_slowpath+0x701/0x8e2
[   92.795603]  ? __alloc_pages_nodemask+0x1ed/0x210
[   92.796902]  ? alloc_pages_current+0x7a/0x100
[   92.798115]  ? filemap_fault+0x2e9/0x5e0
[   92.799204]  ? filemap_map_pages+0x185/0x3a0
[   92.800402]  ? xfs_filemap_fault+0x2f/0x50 [xfs]
[   92.801678]  ? __do_fault+0x15/0x70
[   92.802651]  ? __handle_mm_fault+0xb0f/0x11e0
[   92.805141]  ? handle_mm_fault+0xc5/0x220
[   92.807261]  ? __do_page_fault+0x21e/0x4b0
[   92.809203]  ? do_page_fault+0x2b/0x70
[   92.811018]  ? do_syscall_64+0x137/0x140
[   92.812554]  ? page_fault+0x28/0x30
[   92.813855] Mem-Info:
[   92.815009] active_anon:437483 inactive_anon:2097 isolated_anon:0
[   92.815009]  active_file:0 inactive_file:104 isolated_file:41
[   92.815009]  unevictable:0 dirty:10 writeback:0 unstable:0
[   92.815009]  slab_reclaimable:2439
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Sun, Apr 23, 2017 at 07:24:21PM +0900, Tetsuo Handa wrote: > On 2017/03/10 20:44, Tetsuo Handa wrote: > > Michal Hocko wrote: > >> On Thu 09-03-17 13:05:40, Johannes Weiner wrote: > >>> On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote: > It only does this to some extent. If reclaim made > no progress, for example due to immediately bailing > out because the number of already isolated pages is > too high (due to many parallel reclaimers), the code > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES" > test without ever looking at the number of reclaimable > pages. > >>> > >>> Hm, there is no early return there, actually. We bump the loop counter > >>> every time it happens, but then *do* look at the reclaimable pages. > >>> > Could that create problems if we have many concurrent > reclaimers? > >>> > >>> With increased concurrency, the likelihood of OOM will go up if we > >>> remove the unlimited wait for isolated pages, that much is true. > >>> > >>> I'm not sure that's a bad thing, however, because we want the OOM > >>> killer to be predictable and timely. So a reasonable wait time in > >>> between 0 and forever before an allocating thread gives up under > >>> extreme concurrency makes sense to me. > >>> > It may be OK, I just do not understand all the implications. > > I like the general direction your patch takes the code in, > but I would like to understand it better... > >>> > >>> I feel the same way. The throttling logic doesn't seem to be very well > >>> thought out at the moment, making it hard to reason about what happens > >>> in certain scenarios. > >>> > >>> In that sense, this patch isn't really an overall improvement to the > >>> way things work. It patches a hole that seems to be exploitable only > >>> from an artificial OOM torture test, at the risk of regressing high > >>> concurrency workloads that may or may not be artificial. 
> >>> > >>> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency > >>> behind this patch. Can we think about a general model to deal with > >>> allocation concurrency? > >> > >> I am definitely not against. There is no reason to rush the patch in. > > > > I don't hurry if we can check using watchdog whether this problem is > > occurring > > in the real world. I have to test corner cases because watchdog is missing. > > > Ping? > > This problem can occur even immediately after the first invocation of > the OOM killer. I believe this problem can occur in the real world. > When are we going to apply this patch or the watchdog patch? > > > [0.00] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc > version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 > 17:38:02 JST 2017 > [0.00] Command line: > BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ > root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 > crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none > sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8 > debug_guardpage_minorder=1 Are you debugging a memory corruption problem? FWIW, if you use debug_guardpage_minorder= you can expect all sorts of memory allocation problems. This option is intended for debugging memory corruption bugs and it shrinks available memory in an artificial way. Given that, I don't think it is reasonable to justify any patch by a problem that happened while debug_guardpage_minorder= was in use. Stanislaw
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On 2017/03/10 20:44, Tetsuo Handa wrote: > Michal Hocko wrote: >> On Thu 09-03-17 13:05:40, Johannes Weiner wrote: >>> On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote: It only does this to some extent. If reclaim made no progress, for example due to immediately bailing out because the number of already isolated pages is too high (due to many parallel reclaimers), the code could hit the "no_progress_loops > MAX_RECLAIM_RETRIES" test without ever looking at the number of reclaimable pages. >>> >>> Hm, there is no early return there, actually. We bump the loop counter >>> every time it happens, but then *do* look at the reclaimable pages. >>> Could that create problems if we have many concurrent reclaimers? >>> >>> With increased concurrency, the likelihood of OOM will go up if we >>> remove the unlimited wait for isolated pages, that much is true. >>> >>> I'm not sure that's a bad thing, however, because we want the OOM >>> killer to be predictable and timely. So a reasonable wait time in >>> between 0 and forever before an allocating thread gives up under >>> extreme concurrency makes sense to me. >>> It may be OK, I just do not understand all the implications. I like the general direction your patch takes the code in, but I would like to understand it better... >>> >>> I feel the same way. The throttling logic doesn't seem to be very well >>> thought out at the moment, making it hard to reason about what happens >>> in certain scenarios. >>> >>> In that sense, this patch isn't really an overall improvement to the >>> way things work. It patches a hole that seems to be exploitable only >>> from an artificial OOM torture test, at the risk of regressing high >>> concurrency workloads that may or may not be artificial. >>> >>> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency >>> behind this patch. Can we think about a general model to deal with >>> allocation concurrency? >> >> I am definitely not against. 
There is no reason to rush the patch in. > > I don't hurry if we can check using watchdog whether this problem is occurring > in the real world. I have to test corner cases because watchdog is missing. > Ping? This problem can occur even immediately after the first invocation of the OOM killer. I believe this problem can occur in the real world. When are we going to apply this patch or watchdog patch? [0.00] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 17:38:02 JST 2017 [0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8 debug_guardpage_minorder=1 (...snipped...) CentOS Linux 7 (Core) Kernel 4.11.0-rc7-next-20170421+ on an x86_64 ccsecurity login: [ 32.406723] ip6_tables: (C) 2000-2006 Netfilter Core Team [ 32.852917] Ebtables v2.0 registered [ 33.034402] nf_conntrack version 0.5.0 (16384 buckets, 65536 max) [ 33.467929] e1000: ens32 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None [ 33.475728] IPv6: ADDRCONF(NETDEV_UP): ens32: link is not ready [ 33.478910] IPv6: ADDRCONF(NETDEV_CHANGE): ens32: link becomes ready [ 33.950365] Netfilter messages via NETLINK v0.30. [ 33.983449] ip_set: protocol 6 [ 37.335966] e1000: ens35 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None [ 37.337587] IPv6: ADDRCONF(NETDEV_UP): ens35: link is not ready [ 37.339925] IPv6: ADDRCONF(NETDEV_CHANGE): ens35: link becomes ready [ 39.940942] nf_conntrack: default automatic helper assignment has been turned off for security reasons and CT-based firewall rule not found. Use the iptables CT target to attach helpers instead. 
[ 98.926202] a.out invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0 [ 98.932977] a.out cpuset=/ mems_allowed=0 [ 98.934780] CPU: 1 PID: 2972 Comm: a.out Not tainted 4.11.0-rc7-next-20170421+ #588 [ 98.937988] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 98.942193] Call Trace: [ 98.942942] ? dump_stack+0x5c/0x7d [ 98.943907] ? dump_header+0x97/0x233 [ 98.945334] ? ktime_get+0x30/0x90 [ 98.946290] ? delayacct_end+0x35/0x60 [ 98.947319] ? do_try_to_free_pages+0x2ca/0x370 [ 98.948554] ? oom_kill_process+0x223/0x3e0 [ 98.949715] ? has_capability_noaudit+0x17/0x20 [ 98.950948] ? oom_badness+0xeb/0x160 [ 98.951962] ? out_of_memory+0x10b/0x490 [ 98.953030] ? __alloc_pages_slowpath+0x701/0x8e2 [ 98.954313] ? __alloc_pages_nodemask+0x1ed/0x210 [ 98.956242] ? alloc_pages_vma+0x9f/0x220 [ 98.957486] ?
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On 2017/03/10 20:44, Tetsuo Handa wrote: > Michal Hocko wrote: >> On Thu 09-03-17 13:05:40, Johannes Weiner wrote: It may be OK, I just do not understand all the implications. I like the general direction your patch takes the code in, but I would like to understand it better... >>> >>> I feel the same way. The throttling logic doesn't seem to be very well >>> thought out at the moment, making it hard to reason about what happens >>> in certain scenarios. >>> >>> In that sense, this patch isn't really an overall improvement to the >>> way things work. It patches a hole that seems to be exploitable only >>> from an artificial OOM torture test, at the risk of regressing high >>> concurrency workloads that may or may not be artificial. >>> >>> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency >>> behind this patch. Can we think about a general model to deal with >>> allocation concurrency? >> >> I am definitely not against. There is no reason to rush the patch in. > > I don't hurry if we can check using watchdog whether this problem is occurring > in the real world. I have to test corner cases because watchdog is missing. Today I tested linux-next-20170321 with not so insane stress, and I again hit this problem. Thus, I think this problem might occur in the real world. http://I-love.SAKURA.ne.jp/tmp/serial-20170321.txt.xz (Logs up to before swapoff are eliminated.) 
-- [ 2250.175109] MemAlloc-Info: stalling=16 dying=0 exiting=4 victim=0 oom_count=1155386 [ 2257.535653] MemAlloc-Info: stalling=16 dying=0 exiting=4 victim=0 oom_count=1155386 [ 2319.806880] MemAlloc-Info: stalling=19 dying=0 exiting=4 victim=0 oom_count=1155386 [ 2320.722282] MemAlloc-Info: stalling=19 dying=0 exiting=4 victim=0 oom_count=1155386 [ 2381.243393] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386 [ 2389.777052] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386 [ 2450.878287] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386 [ 2459.386321] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386 [ 2520.500633] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386 [ 2529.042088] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386 --
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
Michal Hocko wrote: > On Thu 09-03-17 13:05:40, Johannes Weiner wrote: > > On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote: > > > It only does this to some extent. If reclaim made > > > no progress, for example due to immediately bailing > > > out because the number of already isolated pages is > > > too high (due to many parallel reclaimers), the code > > > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES" > > > test without ever looking at the number of reclaimable > > > pages. > > > > Hm, there is no early return there, actually. We bump the loop counter > > every time it happens, but then *do* look at the reclaimable pages. > > > > > Could that create problems if we have many concurrent > > > reclaimers? > > > > With increased concurrency, the likelihood of OOM will go up if we > > remove the unlimited wait for isolated pages, that much is true. > > > > I'm not sure that's a bad thing, however, because we want the OOM > > killer to be predictable and timely. So a reasonable wait time in > > between 0 and forever before an allocating thread gives up under > > extreme concurrency makes sense to me. > > > > > It may be OK, I just do not understand all the implications. > > > > > > I like the general direction your patch takes the code in, > > > but I would like to understand it better... > > > > I feel the same way. The throttling logic doesn't seem to be very well > > thought out at the moment, making it hard to reason about what happens > > in certain scenarios. > > > > In that sense, this patch isn't really an overall improvement to the > > way things work. It patches a hole that seems to be exploitable only > > from an artificial OOM torture test, at the risk of regressing high > > concurrency workloads that may or may not be artificial. > > > > Unless I'm mistaken, there doesn't seem to be a whole lot of urgency > > behind this patch. Can we think about a general model to deal with > > allocation concurrency? 
> > I am definitely not against. There is no reason to rush the patch in. I am not in a hurry if we can check, using a watchdog, whether this problem is occurring in the real world. I have to test corner cases because the watchdog is missing. > My main point behind this patch was to reduce unbound loops from inside > the reclaim path and push any throttling up the call chain to the > page allocator path because I believe that it is easier to reason > about them at that level. The direct reclaim should be as simple as > possible without too many side effects otherwise we end up with highly > unpredictable behavior. This was a first step in that direction and my > testing so far didn't show any regressions. > > > Unlimited parallel direct reclaim is kinda > > bonkers in the first place. How about checking for excessive isolation > > counts from the page allocator and putting allocations on a waitqueue? > > I would be interested in details here. That will help implement __GFP_KILLABLE. https://bugzilla.kernel.org/show_bug.cgi?id=192981#c15
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Thu 09-03-17 17:18:00, Rik van Riel wrote: > On Thu, 2017-03-09 at 13:05 -0500, Johannes Weiner wrote: > > On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote: > > > > > > It only does this to some extent. If reclaim made > > > no progress, for example due to immediately bailing > > > out because the number of already isolated pages is > > > too high (due to many parallel reclaimers), the code > > > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES" > > > test without ever looking at the number of reclaimable > > > pages. > > Hm, there is no early return there, actually. We bump the loop > > counter > > every time it happens, but then *do* look at the reclaimable pages. > > Am I looking at an old tree? I see this code > before we look at the reclaimable pages. > > /* > * Make sure we converge to OOM if we cannot make any progress > * several times in the row. > */ > if (*no_progress_loops > MAX_RECLAIM_RETRIES) { > /* Before OOM, exhaust highatomic_reserve */ > return unreserve_highatomic_pageblock(ac, true); > } I believe that Johannes meant cases where we do not exhaust all the reclaim retries and fail early because there are no reclaimable pages during the watermark check. > > > Could that create problems if we have many concurrent > > > reclaimers? > > With increased concurrency, the likelihood of OOM will go up if we > > remove the unlimited wait for isolated pages, that much is true. > > > > I'm not sure that's a bad thing, however, because we want the OOM > > killer to be predictable and timely. So a reasonable wait time in > > between 0 and forever before an allocating thread gives up under > > extreme concurrency makes sense to me. > > That is a fair point, a faster OOM kill is preferable > to a system that is livelocked. > > > Unless I'm mistaken, there doesn't seem to be a whole lot of urgency > > behind this patch. Can we think about a general model to deal with > > allocation concurrency? 
Unlimited parallel direct reclaim is kinda > > bonkers in the first place. How about checking for excessive > > isolation > > counts from the page allocator and putting allocations on a > > waitqueue? > > The (limited) number of reclaimers can still do a > relatively fast OOM kill, if none of them manage > to make progress. Well, we can estimate how much memory those relatively few reclaimers can isolate and try to reclaim. Even if we have hundreds of them, which already strikes me as a large number, we are at 100*SWAP_CLUSTER_MAX pages, which is not all that much. And we are effectively OOM if there is no other reclaimable memory left. All we need is to put some upper bound in place. We already have throttle_direct_reclaim but it doesn't really throttle the maximum number of reclaimers. -- Michal Hocko SUSE Labs
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Thu 09-03-17 17:18:00, Rik van Riel wrote:
> On Thu, 2017-03-09 at 13:05 -0500, Johannes Weiner wrote:
> > On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> > > It only does this to some extent. If reclaim made
> > > no progress, for example due to immediately bailing
> > > out because the number of already isolated pages is
> > > too high (due to many parallel reclaimers), the code
> > > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> > > test without ever looking at the number of reclaimable
> > > pages.
> >
> > Hm, there is no early return there, actually. We bump the loop counter
> > every time it happens, but then *do* look at the reclaimable pages.
>
> Am I looking at an old tree? I see this code
> before we look at the reclaimable pages.
>
>         /*
>          * Make sure we converge to OOM if we cannot make any progress
>          * several times in the row.
>          */
>         if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
>                 /* Before OOM, exhaust highatomic_reserve */
>                 return unreserve_highatomic_pageblock(ac, true);
>         }

I believe that Johannes meant cases where we do not exhaust all the
reclaim retries and fail early because there are no reclaimable pages
during the watermark check.

> > > Could that create problems if we have many concurrent
> > > reclaimers?
> >
> > With increased concurrency, the likelihood of OOM will go up if we
> > remove the unlimited wait for isolated pages, that much is true.
> >
> > I'm not sure that's a bad thing, however, because we want the OOM
> > killer to be predictable and timely. So a reasonable wait time in
> > between 0 and forever before an allocating thread gives up under
> > extreme concurrency makes sense to me.
>
> That is a fair point, a faster OOM kill is preferable
> to a system that is livelocked.
>
> > Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> > behind this patch. Can we think about a general model to deal with
> > allocation concurrency? Unlimited parallel direct reclaim is kinda
> > bonkers in the first place. How about checking for excessive isolation
> > counts from the page allocator and putting allocations on a waitqueue?
>
> The (limited) number of reclaimers can still do a
> relatively fast OOM kill, if none of them manage
> to make progress.

Well, we can estimate how much memory those relatively few reclaimers
can isolate and try to reclaim. Even if we have hundreds of them, which
is already toward the large end to me, we are at 100*SWAP_CLUSTER_MAX
pages, which is not all that much. And we are effectively OOM if there
is no other reclaimable memory left. All we need is just to put some
upper bound. We already have throttle_direct_reclaim but it doesn't
really throttle the maximum number of reclaimers.
--
Michal Hocko
SUSE Labs
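For a concrete sense of the numbers Michal is estimating: SWAP_CLUSTER_MAX is 32 pages in the kernel, so even a hundred concurrent direct reclaimers pin only a few megabytes of isolated pages. A minimal stand-alone sketch of that arithmetic (the 4 KiB page size is an assumption for illustration, not taken from the thread):

```c
#include <assert.h>

/* SWAP_CLUSTER_MAX is the per-call isolation batch size, 32 pages
 * in the kernel (include/linux/swap.h). */
#define SWAP_CLUSTER_MAX 32UL

/* Upper bound on pages that `reclaimers` concurrent direct reclaimers
 * can have isolated at once, one batch each. */
static unsigned long max_isolated_pages(unsigned long reclaimers)
{
	return reclaimers * SWAP_CLUSTER_MAX;
}

/* Same bound in KiB, assuming 4 KiB pages (architecture dependent). */
static unsigned long max_isolated_kb(unsigned long reclaimers)
{
	return max_isolated_pages(reclaimers) * 4;
}
```

With 100 reclaimers this comes to 3200 pages, about 12.5 MiB, which is what makes the "not all that much" argument plausible.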
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Thu 09-03-17 13:05:40, Johannes Weiner wrote:
> On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> > It only does this to some extent. If reclaim made
> > no progress, for example due to immediately bailing
> > out because the number of already isolated pages is
> > too high (due to many parallel reclaimers), the code
> > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> > test without ever looking at the number of reclaimable
> > pages.
>
> Hm, there is no early return there, actually. We bump the loop counter
> every time it happens, but then *do* look at the reclaimable pages.
>
> > Could that create problems if we have many concurrent
> > reclaimers?
>
> With increased concurrency, the likelihood of OOM will go up if we
> remove the unlimited wait for isolated pages, that much is true.
>
> I'm not sure that's a bad thing, however, because we want the OOM
> killer to be predictable and timely. So a reasonable wait time in
> between 0 and forever before an allocating thread gives up under
> extreme concurrency makes sense to me.
>
> > It may be OK, I just do not understand all the implications.
> >
> > I like the general direction your patch takes the code in,
> > but I would like to understand it better...
>
> I feel the same way. The throttling logic doesn't seem to be very well
> thought out at the moment, making it hard to reason about what happens
> in certain scenarios.
>
> In that sense, this patch isn't really an overall improvement to the
> way things work. It patches a hole that seems to be exploitable only
> from an artificial OOM torture test, at the risk of regressing high
> concurrency workloads that may or may not be artificial.
>
> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> behind this patch. Can we think about a general model to deal with
> allocation concurrency?

I am definitely not against. There is no reason to rush the patch in.
My main point behind this patch was to reduce unbound loops inside the
reclaim path and push any throttling up the call chain to the page
allocator, because I believe it is easier to reason about throttling at
that level. Direct reclaim should be as simple as possible, without too
many side effects; otherwise we end up with highly unpredictable
behavior. This was a first step in that direction, and my testing so
far didn't show any regressions.

> Unlimited parallel direct reclaim is kinda
> bonkers in the first place. How about checking for excessive isolation
> counts from the page allocator and putting allocations on a waitqueue?

I would be interested in details here.
--
Michal Hocko
SUSE Labs
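Johannes' waitqueue idea, which Michal asks for details about, might look roughly like the following userspace model: allocating threads check a shared isolated-pages counter and sleep until reclaimers put pages back. Everything here is an illustrative assumption, not proposed kernel code; the pthread mutex and condition variable stand in for a kernel waitqueue, and ISOLATE_LIMIT is an arbitrary threshold:

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t drained = PTHREAD_COND_INITIALIZER;
static long nr_isolated;            /* models NR_ISOLATED_ANON + NR_ISOLATED_FILE */
#define ISOLATE_LIMIT 128           /* arbitrary threshold for the sketch */

/* A reclaimer takes pages off the LRU. */
static void isolate_pages(long nr)
{
	pthread_mutex_lock(&lock);
	nr_isolated += nr;
	pthread_mutex_unlock(&lock);
}

/* Reclaim finished with a batch; wake allocators if we dropped
 * back below the threshold. */
static void putback_pages(long nr)
{
	pthread_mutex_lock(&lock);
	nr_isolated -= nr;
	if (nr_isolated < ISOLATE_LIMIT)
		pthread_cond_broadcast(&drained);
	pthread_mutex_unlock(&lock);
}

/* Called from the allocator path: sleep while too many pages are
 * isolated, instead of spinning inside shrink_inactive_list(). */
static void throttle_allocation(void)
{
	pthread_mutex_lock(&lock);
	while (nr_isolated >= ISOLATE_LIMIT)
		pthread_cond_wait(&drained, &lock);
	pthread_mutex_unlock(&lock);
}
```

The point of throttling at this level, rather than in the reclaim loop, is that a bounded set of reclaimers can still converge on a fast OOM kill while everyone else waits in one well-defined place.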
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Thu, 2017-03-09 at 13:05 -0500, Johannes Weiner wrote:
> On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> > It only does this to some extent. If reclaim made
> > no progress, for example due to immediately bailing
> > out because the number of already isolated pages is
> > too high (due to many parallel reclaimers), the code
> > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> > test without ever looking at the number of reclaimable
> > pages.
>
> Hm, there is no early return there, actually. We bump the loop counter
> every time it happens, but then *do* look at the reclaimable pages.

Am I looking at an old tree? I see this code
before we look at the reclaimable pages.

        /*
         * Make sure we converge to OOM if we cannot make any progress
         * several times in the row.
         */
        if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
                /* Before OOM, exhaust highatomic_reserve */
                return unreserve_highatomic_pageblock(ac, true);
        }

> > Could that create problems if we have many concurrent
> > reclaimers?
>
> With increased concurrency, the likelihood of OOM will go up if we
> remove the unlimited wait for isolated pages, that much is true.
>
> I'm not sure that's a bad thing, however, because we want the OOM
> killer to be predictable and timely. So a reasonable wait time in
> between 0 and forever before an allocating thread gives up under
> extreme concurrency makes sense to me.

That is a fair point, a faster OOM kill is preferable
to a system that is livelocked.

> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> behind this patch. Can we think about a general model to deal with
> allocation concurrency? Unlimited parallel direct reclaim is kinda
> bonkers in the first place. How about checking for excessive isolation
> counts from the page allocator and putting allocations on a waitqueue?

The (limited) number of reclaimers can still do a
relatively fast OOM kill, if none of them manage
to make progress.

That should avoid the potential issue you and I both pointed out, and,
as a bonus, it might actually be faster than letting all the tasks in
the system into the direct reclaim code simultaneously.
--
All rights reversed
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote: > It only does this to some extent. If reclaim made > no progress, for example due to immediately bailing > out because the number of already isolated pages is > too high (due to many parallel reclaimers), the code > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES" > test without ever looking at the number of reclaimable > pages. Hm, there is no early return there, actually. We bump the loop counter every time it happens, but then *do* look at the reclaimable pages. > Could that create problems if we have many concurrent > reclaimers? With increased concurrency, the likelihood of OOM will go up if we remove the unlimited wait for isolated pages, that much is true. I'm not sure that's a bad thing, however, because we want the OOM killer to be predictable and timely. So a reasonable wait time in between 0 and forever before an allocating thread gives up under extreme concurrency makes sense to me. > It may be OK, I just do not understand all the implications. > > I like the general direction your patch takes the code in, > but I would like to understand it better... I feel the same way. The throttling logic doesn't seem to be very well thought out at the moment, making it hard to reason about what happens in certain scenarios. In that sense, this patch isn't really an overall improvement to the way things work. It patches a hole that seems to be exploitable only from an artificial OOM torture test, at the risk of regressing high concurrency workloads that may or may not be artificial. Unless I'm mistaken, there doesn't seem to be a whole lot of urgency behind this patch. Can we think about a general model to deal with allocation concurrency? Unlimited parallel direct reclaim is kinda bonkers in the first place. How about checking for excessive isolation counts from the page allocator and putting allocations on a waitqueue?
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Thu 09-03-17 09:16:25, Rik van Riel wrote:
> On Thu, 2017-03-09 at 10:12 +0100, Michal Hocko wrote:
> > On Wed 08-03-17 10:54:57, Rik van Riel wrote:
> > > In fact, false OOM kills with that kind of workload is
> > > how we ended up getting the "too many isolated" logic
> > > in the first place.
> >
> > Right, but the retry logic was considerably different than what we
> > have these days. should_reclaim_retry considers the amount of
> > reclaimable memory. As I've said earlier, if we see a report where
> > the oom hits prematurely with many NR_ISOLATED* we know how to fix
> > that.
>
> Would it be enough to simply reset no_progress_loops
> in this check inside should_reclaim_retry, if we know
> pageout IO is pending?
>
>         if (!did_some_progress) {
>                 unsigned long write_pending;
>
>                 write_pending = zone_page_state_snapshot(zone,
>                                                 NR_ZONE_WRITE_PENDING);
>
>                 if (2 * write_pending > reclaimable) {
>                         congestion_wait(BLK_RW_ASYNC, HZ/10);
>                         return true;
>                 }
>         }

I am not really sure what problem we are trying to solve right now, to
be honest. I would prefer to keep the logic simpler rather than
over-engineer something that is not even needed.
--
Michal Hocko
SUSE Labs
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Tue, Mar 07, 2017 at 02:30:57PM +0100, Michal Hocko wrote:
> From: Michal Hocko
>
> Tetsuo Handa has reported [1][2] that direct reclaimers might get stuck
> in the too_many_isolated loop basically for ever because the last few
> pages on the LRU lists are isolated by kswapd, which is stuck on fs
> locks when doing the pageout or slab reclaim. This in turn means that
> there is nobody to actually trigger the oom killer and the system is
> basically unusable.
>
> too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
> direct reclaim when too many pages are isolated already") to prevent
> pre-mature oom killer invocations, because back then no reclaim
> progress could indeed trigger the OOM killer too early. But since the
> oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection") the
> allocation/reclaim retry loop considers all the reclaimable pages and
> throttles the allocation at that layer, so we can loosen the direct
> reclaim throttling.
>
> Make the shrink_inactive_list loop over too_many_isolated bounded and
> return immediately when the situation hasn't resolved after the first
> sleep. Replace congestion_wait by a simple schedule_timeout_interruptible
> because we are not really waiting on the IO congestion in this path.
>
> Please note that this patch can theoretically cause the OOM killer to
> trigger earlier while there are many pages isolated for the reclaim
> which makes progress only very slowly. This would be obvious from the
> oom report, as the number of isolated pages is printed there. If we
> ever hit this, should_reclaim_retry should consider those numbers in
> the evaluation in one way or another.
>
> [1] http://lkml.kernel.org/r/201602092349.acg81273.osvtmjqhlof...@i-love.sakura.ne.jp
> [2] http://lkml.kernel.org/r/201702212335.djb30777.jofmhsftvlq...@i-love.sakura.ne.jp
>
> Signed-off-by: Michal Hocko

Acked-by: Mel Gorman
--
Mel Gorman
SUSE Labs
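The behavior the changelog describes, one bounded stall instead of an endless too_many_isolated() loop, can be modeled in plain C. This is an illustrative userspace sketch, not the actual shrink_inactive_list() change; the predicate and the sleep are stand-ins, and the counter exists only so the model can be exercised:

```c
#include <stdbool.h>

enum reclaim_result { RECLAIM_BAILED, RECLAIM_PROCEEDED };

/* Stand-in for too_many_isolated(): reports "too many" for a
 * configurable number of checks, then clears. */
static int too_many_checks_left;

static bool too_many_isolated_now(void)
{
	if (too_many_checks_left > 0) {
		too_many_checks_left--;
		return true;
	}
	return false;
}

/* Models schedule_timeout_interruptible(); a no-op in this sketch. */
static void short_sleep(void)
{
}

/* Patched behavior: stall once, re-check, and bail out instead of
 * looping, leaving the OOM decision to the allocator's retry logic. */
static enum reclaim_result throttled_reclaim(void)
{
	if (too_many_isolated_now()) {
		short_sleep();
		if (too_many_isolated_now())
			return RECLAIM_BAILED;
	}
	return RECLAIM_PROCEEDED;
}
```

The key property, visible in the model, is that the function returns after at most one sleep whether or not the isolation pressure resolves; the old code could spin here indefinitely.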
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Thu, 2017-03-09 at 10:12 +0100, Michal Hocko wrote:
> On Wed 08-03-17 10:54:57, Rik van Riel wrote:
> > In fact, false OOM kills with that kind of workload is
> > how we ended up getting the "too many isolated" logic
> > in the first place.
>
> Right, but the retry logic was considerably different than what we
> have these days. should_reclaim_retry considers amount of reclaimable
> memory. As I've said earlier if we see a report where the oom hits
> prematurely with many NR_ISOLATED* we know how to fix that.

Would it be enough to simply reset no_progress_loops
in this check inside should_reclaim_retry, if we know
pageout IO is pending?

        if (!did_some_progress) {
                unsigned long write_pending;

                write_pending = zone_page_state_snapshot(zone,
                                                NR_ZONE_WRITE_PENDING);

                if (2 * write_pending > reclaimable) {
                        congestion_wait(BLK_RW_ASYNC, HZ/10);
                        return true;
                }
        }

--
All rights reversed
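The condition in Rik's snippet treats the stall as pageout IO in flight whenever pending writeback covers more than half of the remaining reclaimable pages, and retries instead of counting toward OOM. A tiny extracted model of just that predicate (the function name is a stand-in for illustration):

```c
#include <stdbool.h>

/* True when more than half of the reclaimable pages are already queued
 * for writeback; Rik's snippet then waits for the IO to complete
 * instead of advancing no_progress_loops toward an OOM kill. */
static bool mostly_write_pending(unsigned long write_pending,
				 unsigned long reclaimable)
{
	return 2 * write_pending > reclaimable;
}
```

For example, with 100 reclaimable pages the check trips at 51 pages pending writeback, but not at exactly 50, since the comparison is strict.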
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Wed 08-03-17 10:54:57, Rik van Riel wrote:
> On Wed, 2017-03-08 at 10:21 +0100, Michal Hocko wrote:
> > > Could that create problems if we have many concurrent
> > > reclaimers?
> >
> > As the changelog mentions it might cause a premature oom killer
> > invocation theoretically. We could easily see that from the oom report
> > by checking isolated counters. My testing didn't trigger that though
> > and I was hammering the page allocator path from many threads.
> >
> > I suspect some artificial tests can trigger that, I am not so sure
> > about reasonable workloads. If we see this happening though then the
> > fix would be to resurrect my previous attempt to track NR_ISOLATED*
> > per zone and use them in the allocator retry logic.
>
> I am not sure the workload in question is "artificial".
> A heavily forking (or multi-threaded) server running out
> of physical memory could easily get hundreds of tasks
> doing direct reclaim simultaneously.

Yes, some of my OOM tests (fork many short-lived processes while there
is strong memory pressure and a lot of IO going on) are doing this and
I haven't hit a premature OOM yet. It is hard to tune those tests for
almost-OOM but not quite there, though. Usually you either find a
steady state or really run out of memory.

> In fact, false OOM kills with that kind of workload is
> how we ended up getting the "too many isolated" logic
> in the first place.

Right, but the retry logic was considerably different than what we have
these days. should_reclaim_retry considers the amount of reclaimable
memory. As I've said earlier, if we see a report where the oom hits
prematurely with many NR_ISOLATED* we know how to fix that.

> I am perfectly fine with moving the retry logic up like
> you did, but think it may make sense to check the number
> of reclaimable pages if we have too many isolated pages,
> instead of risking a too-early OOM kill.

Actually that was my initial attempt, but for that we would need
per-zone NR_ISOLATED* counters. Mel was against that and wanted to
start with the simpler approach if it works reasonably well, which it
seems it does from my experience so far (but reality can surprise, as
I've seen so many times already).
--
Michal Hocko
SUSE Labs
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Wed, 2017-03-08 at 10:21 +0100, Michal Hocko wrote:
> > Could that create problems if we have many concurrent
> > reclaimers?
>
> As the changelog mentions it might cause a premature oom killer
> invocation theoretically. We could easily see that from the oom report
> by checking isolated counters. My testing didn't trigger that though
> and I was hammering the page allocator path from many threads.
>
> I suspect some artificial tests can trigger that, I am not so sure
> about reasonable workloads. If we see this happening though then the
> fix would be to resurrect my previous attempt to track NR_ISOLATED*
> per zone and use them in the allocator retry logic.

I am not sure the workload in question is "artificial".
A heavily forking (or multi-threaded) server running out
of physical memory could easily get hundreds of tasks
doing direct reclaim simultaneously.

In fact, false OOM kills with that kind of workload is
how we ended up getting the "too many isolated" logic
in the first place.

I am perfectly fine with moving the retry logic up like
you did, but think it may make sense to check the number
of reclaimable pages if we have too many isolated pages,
instead of risking a too-early OOM kill.
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Tue 07-03-17 14:52:36, Rik van Riel wrote: > On Tue, 2017-03-07 at 14:30 +0100, Michal Hocko wrote: > > From: Michal Hocko> > > > Tetsuo Handa has reported [1][2] that direct reclaimers might get > > stuck > > in too_many_isolated loop basically for ever because the last few > > pages > > on the LRU lists are isolated by the kswapd which is stuck on fs > > locks > > when doing the pageout or slab reclaim. This in turn means that there > > is > > nobody to actually trigger the oom killer and the system is basically > > unusable. > > > > too_many_isolated has been introduced by 35cd78156c49 ("vmscan: > > throttle > > direct reclaim when too many pages are isolated already") to prevent > > from pre-mature oom killer invocations because back then no reclaim > > progress could indeed trigger the OOM killer too early. But since the > > oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection") > > the allocation/reclaim retry loop considers all the reclaimable pages > > and throttles the allocation at that layer so we can loosen the > > direct > > reclaim throttling. > > It only does this to some extent. If reclaim made > no progress, for example due to immediately bailing > out because the number of already isolated pages is > too high (due to many parallel reclaimers), the code > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES" > test without ever looking at the number of reclaimable > pages. > > Could that create problems if we have many concurrent > reclaimers? As the changelog mentions it might cause a premature oom killer invocation theoretically. We could easily see that from the oom report by checking isolated counters. My testing didn't trigger that though and I was hammering the page allocator path from many threads. I suspect some artificial tests can trigger that, I am not so sure about reasonabel workloads. 
If we see this happening though then the fix would be to resurrect my previous attempt to track NR_ISOLATED* per zone and use them in the allocator retry logic. -- Michal Hocko SUSE Labs
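[Editor's note] A hedged sketch of the fix Michal mentions: resurrect per-zone NR_ISOLATED* counters and let the allocator's retry logic treat isolated pages as still reclaimable. The struct and function names below are assumptions for illustration only, not the actual kernel interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-zone counters; field names are illustrative. */
struct zone_model {
	unsigned long reclaimable;	/* pages still on the LRU lists */
	unsigned long isolated_anon;	/* would be NR_ISOLATED_ANON */
	unsigned long isolated_file;	/* would be NR_ISOLATED_FILE */
};

static bool should_reclaim_retry_model(const struct zone_model *z)
{
	/*
	 * Pages isolated by other reclaimers have merely been taken off
	 * the LRU temporarily; counting them avoids a premature "nothing
	 * reclaimable" verdict while parallel reclaim is in flight.
	 */
	unsigned long available = z->reclaimable +
				  z->isolated_anon + z->isolated_file;
	return available > 0;
}
```

The design point: the throttling decision moves out of the reclaim loop and into the allocator's retry logic, where the isolated counters can be weighed against everything else.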
Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
On Tue, 2017-03-07 at 14:30 +0100, Michal Hocko wrote: > From: Michal Hocko > > Tetsuo Handa has reported [1][2] that direct reclaimers might get > stuck > in too_many_isolated loop basically for ever because the last few > pages > on the LRU lists are isolated by the kswapd which is stuck on fs > locks > when doing the pageout or slab reclaim. This in turn means that there > is > nobody to actually trigger the oom killer and the system is basically > unusable. > > too_many_isolated has been introduced by 35cd78156c49 ("vmscan: > throttle > direct reclaim when too many pages are isolated already") to prevent > from pre-mature oom killer invocations because back then no reclaim > progress could indeed trigger the OOM killer too early. But since the > oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection") > the allocation/reclaim retry loop considers all the reclaimable pages > and throttles the allocation at that layer so we can loosen the > direct > reclaim throttling. It only does this to some extent. If reclaim made no progress, for example due to immediately bailing out because the number of already isolated pages is too high (due to many parallel reclaimers), the code could hit the "no_progress_loops > MAX_RECLAIM_RETRIES" test without ever looking at the number of reclaimable pages. Could that create problems if we have many concurrent reclaimers? It may be OK, I just do not understand all the implications. I like the general direction your patch takes the code in, but I would like to understand it better... -- All rights reversed
[PATCH] mm, vmscan: do not loop on too_many_isolated for ever
From: Michal Hocko Tetsuo Handa has reported [1][2] that direct reclaimers might get stuck in too_many_isolated loop basically for ever because the last few pages on the LRU lists are isolated by the kswapd which is stuck on fs locks when doing the pageout or slab reclaim. This in turn means that there is nobody to actually trigger the oom killer and the system is basically unusable. too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle direct reclaim when too many pages are isolated already") to prevent pre-mature oom killer invocations because back then no reclaim progress could indeed trigger the OOM killer too early. But since the oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection") the allocation/reclaim retry loop considers all the reclaimable pages and throttles the allocation at that layer so we can loosen the direct reclaim throttling. Make the shrink_inactive_list loop over too_many_isolated bounded and return immediately when the situation hasn't resolved after the first sleep. Replace congestion_wait by a simple schedule_timeout_interruptible because we are not really waiting on the IO congestion in this path. Please note that this patch can theoretically cause the OOM killer to trigger earlier while there are many pages isolated for the reclaim which makes progress only very slowly. This would be obvious from the oom report as the number of isolated pages is printed there. If we ever hit this, should_reclaim_retry should consider those numbers in the evaluation in one way or another. [1] http://lkml.kernel.org/r/201602092349.acg81273.osvtmjqhlof...@i-love.sakura.ne.jp [2] http://lkml.kernel.org/r/201702212335.djb30777.jofmhsftvlq...@i-love.sakura.ne.jp Signed-off-by: Michal Hocko --- Hi, Tetsuo helped to test this patch [3] and couldn't reproduce the hang inside the page allocator anymore. Thanks! He was able to see a different lockup though.
This time this is more related to how XFS is doing the inode reclaim from the WQ context. This is being discussed [4] and I believe it is unrelated to this change. I believe this change is still an improvement because it reduces chances of an unbounded loop inside the reclaim path so we have a) more reliable detection of the lockup from the allocator path and b) more deterministic retry loop logic. Thoughts/complaints/suggestions? [3] http://lkml.kernel.org/r/201702261530.jdd56292.ofolfhqtvmj...@i-love.sakura.ne.jp [4] http://lkml.kernel.org/r/20170303133950.gd31...@dhcp22.suse.cz mm/vmscan.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index c15b2e4c47ca..4ae069060ae5 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, int file = is_file_lru(lru); struct pglist_data *pgdat = lruvec_pgdat(lruvec); struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; + bool stalled = false; while (unlikely(too_many_isolated(pgdat, file, sc))) { - congestion_wait(BLK_RW_ASYNC, HZ/10); + if (stalled) + return 0; + + /* wait a bit for the reclaimer. */ + schedule_timeout_interruptible(HZ/10); + stalled = true; /* We are about to die and free our memory. Return now. */ if (fatal_signal_pending(current)) -- 2.11.0
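[Editor's note] A userspace model of the control flow the patch introduces: instead of looping on too_many_isolated() for ever, sleep at most once and then return 0 so the caller's retry logic decides what happens next. Here too_many_isolated() is modelled by a countdown of "still congested" answers, and the sleep itself is elided; this is a sketch of the shape of the change, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the patched shrink_inactive_list() entry: congested_checks is
 * how many consecutive times the too_many_isolated() check would still
 * report pressure. Returns 0 ("no progress, bail") or 32 ("reclaimed").
 */
static unsigned long shrink_inactive_model(int congested_checks)
{
	bool stalled = false;

	while (congested_checks-- > 0) {	/* too_many_isolated() */
		if (stalled)
			return 0;	/* second hit: give up, don't loop */
		/* schedule_timeout_interruptible(HZ/10) would sleep here */
		stalled = true;
	}
	return 32;	/* pressure gone (or never there): do the reclaim */
}
```

The key property: transient pressure (gone after one sleep) still leads to reclaim, while persistent pressure returns after exactly one sleep instead of spinning indefinitely, leaving the retry/OOM decision to the allocator.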