Re: [RFC PATCH] memcg, oom: throttle dump_header for memcg ooms without eligible tasks

2018-10-16 Thread Michal Hocko
On Tue 16-10-18 20:05:47, Tetsuo Handa wrote:
> On 2018/10/16 18:20, Michal Hocko wrote:
> >> Anyway, I'm OK if we apply _BOTH_ your patch and my patch. Or I'm OK with
> >> the simplified one shown below (because you don't like a per-memcg limit).
> > 
> > My patch is adding a rate-limit! I really fail to see why we need yet
> > another one on top of it. This is just ridiculous. I can see reasons to
> > tune that rate limit but adding 2 different mechanisms is just wrong.
> > 
> > If your NAK to unifying the ratelimit for dump_header for all paths
> > still holds then I do not care too much to push it forward. But I find
> > this way of giving review feedback counterproductive.
> > 
> 
> Your patch is _NOT_ adding a rate-limit for
> 
>   "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
>   "Out of memory and no killable processes...\n"
> 
> lines!

And I've said I do not have objections to an _incremental_ patch that moves
the ratelimit up, given a clear cost/benefit evaluation.
-- 
Michal Hocko
SUSE Labs
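
A rough sketch of what "moving the ratelimit up" could look like; this is an
illustration only, not the actual patch under discussion, and oom_hdr_rs is a
made-up name. Hoisting a single ratelimit check above the whole header would
cover the "invoked oom-killer" and "no killable processes" lines as well:

	/*
	 * Sketch: one ratelimit state guarding the entire header.
	 * DEFAULT_RATELIMIT_INTERVAL is 5 * HZ and DEFAULT_RATELIMIT_BURST
	 * is 10, i.e. at most 10 dumps per 5 seconds.
	 */
	static DEFINE_RATELIMIT_STATE(oom_hdr_rs, DEFAULT_RATELIMIT_INTERVAL,
				      DEFAULT_RATELIMIT_BURST);

	if (__ratelimit(&oom_hdr_rs)) {
		dump_header(oc, NULL);
		pr_warn("Out of memory and no killable processes...\n");
	}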


Re: [RFC PATCH] memcg, oom: throttle dump_header for memcg ooms without eligible tasks

2018-10-16 Thread Tetsuo Handa
On 2018/10/16 18:20, Michal Hocko wrote:
>> Anyway, I'm OK if we apply _BOTH_ your patch and my patch. Or I'm OK with
>> the simplified one shown below (because you don't like a per-memcg limit).
> 
> My patch is adding a rate-limit! I really fail to see why we need yet
> another one on top of it. This is just ridiculous. I can see reasons to
> tune that rate limit but adding 2 different mechanisms is just wrong.
> 
> If your NAK to unifying the ratelimit for dump_header for all paths
> still holds then I do not care too much to push it forward. But I find
> this way of giving review feedback counterproductive.
> 

Your patch is _NOT_ adding a rate-limit for

  "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, 
oom_score_adj=%hd\n"
  "Out of memory and no killable processes...\n"

lines!

[   97.519229] Out of memory and no killable processes...
[   97.522060] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.525507] Out of memory and no killable processes...
[   97.528817] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.532345] Out of memory and no killable processes...
[   97.534813] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.538270] Out of memory and no killable processes...
[   97.541449] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.546268] Out of memory and no killable processes...
[   97.548823] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.552399] Out of memory and no killable processes...
[   97.554939] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.558616] Out of memory and no killable processes...
[   97.562257] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.565998] Out of memory and no killable processes...
[   97.568642] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.572169] Out of memory and no killable processes...
[   97.575200] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.579357] Out of memory and no killable processes...
[   97.581912] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.585414] Out of memory and no killable processes...
[   97.589191] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.593586] Out of memory and no killable processes...
[   97.596527] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.600118] Out of memory and no killable processes...
[   97.603237] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.606837] Out of memory and no killable processes...
[   97.611550] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.615244] Out of memory and no killable processes...
[   97.617859] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.621634] Out of memory and no killable processes...
[   97.624884] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.629256] Out of memory and no killable processes...
[   97.631885] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.635367] Out of memory and no killable processes...
[   97.638033] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.641827] Out of memory and no killable processes...
[   97.641993] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.648453] Out of memory and no killable processes...
[   97.651481] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.655082] Out of memory and no killable processes...
[   97.657941] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.663036] Out of memory and no killable processes...
[   97.665890] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.669473] Out of memory and no killable processes...

We don't need to print these lines every few milliseconds. Even if this is an
exceptional case, it is a DoS for console users. Printing once (or a few
times) per minute would be enough. Otherwise, users cannot see what they are
typing

Re: [RFC PATCH] memcg, oom: throttle dump_header for memcg ooms without eligible tasks

2018-10-16 Thread Michal Hocko
On Tue 16-10-18 09:55:06, Tetsuo Handa wrote:
> On 2018/10/15 22:35, Michal Hocko wrote:
> >> Nobody can prove that it never kills some machine. This is just one
> >> example result of one example stress test tried in my environment. Since
> >> I am a secure programming person from the security subsystem, I really
> >> hate your "Can you trigger it?" resistance. Since this is the OOM path,
> >> which nobody tests, starting from being prepared for the worst case
> >> keeps things simple.
> > 
> > There is simply no way to be generally safe in this kind of situation. As
> > soon as your console is so slow that you cannot push the oom report
> > through, there is only one single option left and that is to disable the
> > oom report altogether. And that might be a viable option.
> 
> There is a way to be safe in this kind of situation. The way is to make sure
> that printk() is called with enough of an interval. That is, count the
> interval between the end of the previous printk() messages and the beginning
> of the next printk() messages.

You are simply wrong, because any interval is meaningless without knowing the
printk throughput.

[...]

> lines on every page fault event. A kernel which consumes multiple
> milliseconds on each page fault event (due to printk() messages from the
> dysfunctional OOM killer) is stupid.

Not if it represents an unusual situation where no eligible task is
available, because this is an exceptional case where the cost of the printk
is simply not relevant.

[...]

I am sorry to skip a large part of your message, but this discussion, like
many others, doesn't lead anywhere. You simply refuse to understand some of
the core assumptions in this area.

> Anyway, I'm OK if we apply _BOTH_ your patch and my patch. Or I'm OK with
> the simplified one shown below (because you don't like a per-memcg limit).

My patch is adding a rate-limit! I really fail to see why we need yet
another one on top of it. This is just ridiculous. I can see reasons to
tune that rate limit but adding 2 different mechanisms is just wrong.

If your NAK to unifying the ratelimit for dump_header for all paths
still holds then I do not care too much to push it forward. But I find
this way of giving review feedback counterproductive.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] memcg, oom: throttle dump_header for memcg ooms without eligible tasks

2018-10-15 Thread Tetsuo Handa
On 2018/10/15 22:35, Michal Hocko wrote:
>> Nobody can prove that it never kills some machine. This is just one example
>> result of one example stress test tried in my environment. Since I am a
>> secure programming person from the security subsystem, I really hate your
>> "Can you trigger it?" resistance. Since this is the OOM path, which nobody
>> tests, starting from being prepared for the worst case keeps things simple.
> 
> There is simply no way to be generally safe in this kind of situation. As
> soon as your console is so slow that you cannot push the oom report
> through, there is only one single option left and that is to disable the
> oom report altogether. And that might be a viable option.

There is a way to be safe in this kind of situation. The way is to make sure
that printk() is called with enough of an interval. That is, count the
interval between the end of the previous printk() messages and the beginning
of the next printk() messages.
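
A minimal sketch of that idea (the names oom_last_done and
oom_report_allowed() are hypothetical, not from any posted patch): record
jiffies after an output burst finishes and measure the gap from that point,
so a slow console automatically stretches the effective interval:

	/* Sketch: allow a report only if at least 60s have passed since the
	 * END of the previous report; names are hypothetical. */
	static unsigned long oom_last_done;

	static bool oom_report_allowed(void)
	{
		return !oom_last_done ||
		       time_after(jiffies, oom_last_done + 60 * HZ);
	}

	...
	if (oom_report_allowed()) {
		dump_header(oc, NULL);
		pr_warn("Out of memory and no killable processes...\n");
		oom_last_done = jiffies;	/* set after printing finishes */
	}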

And you are misunderstanding my patch. Although my patch does not ratelimit
the first occurrence of a memcg OOM in each memcg domain (because the first

	dump_header(oc, NULL);
	pr_warn("Out of memory and no killable processes...\n");

output, which is serialized by the oom_lock mutex, is usually useful
information to get), my patch cannot cause a lockup because it ratelimits
subsequent outputs in any memcg domain. That is, while my patch might cause

  "** %u printk messages dropped **\n"

when we have hundreds of different memcgs triggering this path around the
same time, it refrains from "keeping on disturbing the administrator's manual
recovery operation from the console by printing

  "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
  "Out of memory and no killable processes...\n"

on each page fault event from hundreds of different memcgs triggering this
path".

There is no need to print

  "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
  "Out of memory and no killable processes...\n"

lines on every page fault event. A kernel which consumes multiple
milliseconds on each page fault event (due to printk() messages from the
dysfunctional OOM killer) is stupid.

>   But fiddling
> with a per-memcg limit is not going to fly. Just realize what will happen
> if you have hundreds of different memcgs triggering this path around the
> same time.

You have just said that "No killable process should be a rare event which
requires a seriously misconfigured memcg to happen so wildly.", and now you
refer to a very bad case, "Just realize what will happen if you have hundreds
of different memcgs triggering this path around the same time.", which makes
your former comment suspicious.

> 
> So can you start being reasonable and finally try to look at a wider
> picture, please?
> 

Honestly, I can't look at a wider picture because I have never been shown a
picture by you. What we are doing is an endless loop of "let's do ...
because ..." and "hmm, our assumption was wrong because ..."; that is, making
changes without first considering the worst case. For example, the OOM-victim
problem which David Rientjes is complaining about is that our assumption that
"__oom_reap_task_mm() can reclaim the majority of memory" was wrong. (And
your proposal to hand over is getting no response.) For another example, the
__set_oom_adj() path which Yong-Taek Lee is trying to optimize is that our
assumption that "we already succeeded enforcing the same oom_score_adj among
multiple thread groups" was wrong. For yet another example, CVE-2018-1000200
and CVE-2016-10723 were caused by ignoring my concern. And the funny thing is
that you negated the rationale of "mm, oom: remove sleep from under oom_lock"
with "mm, oom: remove oom_lock from oom_reaper" after only 4 days...

Anyway, I'm OK if we apply _BOTH_ your patch and my patch. Or I'm OK with
the simplified one shown below (because you don't like a per-memcg limit).

---
 mm/oom_kill.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..9056f9b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1106,6 +1106,11 @@ bool out_of_memory(struct oom_control *oc)
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
+		static unsigned long last_warned;
+
+		if ((is_sysrq_oom(oc) || is_memcg_oom(oc)) &&
+		    time_in_range(jiffies, last_warned, last_warned + 60 * HZ))
+			return false;
 		dump_header(oc, NULL);
 		pr_warn("Out of memory and no killable processes...\n");
 		/*
@@ -1115,6 +1120,7 @@ bool out_of_memory(struct oom_control *oc)
 		 */
 		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
 			panic("System is deadlocked on memory\n");
+		last_warned = jiffies;
 	}
 	if (oc->chosen && 
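
For reference, the open-coded throttle above admits at most one warning per
60 seconds for sysrq/memcg OOMs, and it suppresses the dump by bailing out of
out_of_memory() entirely rather than by skipping only the printk. Roughly the
same rate could be expressed with the kernel's standard helper (a sketch, not
part of either patch; no_victim_rs is a made-up name):

	/* Sketch: at most 1 warning per 60-second interval. */
	static DEFINE_RATELIMIT_STATE(no_victim_rs, 60 * HZ, 1);

	if ((is_sysrq_oom(oc) || is_memcg_oom(oc)) &&
	    !__ratelimit(&no_victim_rs))
		return false;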

Re: [RFC PATCH] memcg, oom: throttle dump_header for memcg ooms without eligible tasks

2018-10-15 Thread Michal Hocko
On Mon 15-10-18 21:47:08, Tetsuo Handa wrote:
> On 2018/10/15 20:24, Michal Hocko wrote:
> > On Mon 15-10-18 19:57:35, Tetsuo Handa wrote:
> >> On 2018/10/15 17:19, Michal Hocko wrote:
> >>> As so many dozens of times before, I will point you to the incremental
> >>> nature of changes we really prefer in mm land. We are also after a
> >>> simplicity which your proposal lacks in many aspects. You seem to ignore
> >>> that general approach, and I have a hard time considering your NAK as
> >>> relevant feedback. Going to an extreme and basing a complex solution on
> >>> it is not going to fly. No killable process should be a rare event which
> >>> requires a seriously misconfigured memcg to happen so wildly. If you can
> >>> trigger it with normal user privileges then it would be a clear bug to
> >>> address rather than work around with printk throttling.
> >>>
> >>
> >> I can trigger 200+ times / 900+ lines / 69KB+ of needless OOM messages
> >> with normal user privileges. This is a lot of needless noise/delay.
> > 
> > I am pretty sure you have understood the part of my message you have
> > chosen not to quote, where I have said that the specific rate limiting
> > decisions can be changed based on reasonable configurations. There is
> > absolutely zero reason to NAK a natural decision to unify the throttling
> > and cook a per-memcg way for a very specific path instead.
> > 
> >> "No killable process" is not a rare event, even without root privileges.
> >>
> >> [root@ccsecurity kumaneko]# time ./a.out
> >> Killed
> >>
> >> real	0m2.396s
> >> user	0m0.000s
> >> sys	0m2.970s
> >> [root@ccsecurity ~]# dmesg | grep 'no killable' | wc -l
> >> 202
> >> [root@ccsecurity ~]# dmesg | wc
> >>     942    7335   70716
> > 
> > OK, so this is 70kB worth of data pushed through the console. Is this
> > really killing any machine?
> > 
> 
> Nobody can prove that it never kills some machine. This is just one example
> result of one example stress test tried in my environment. Since I am a
> secure programming person from the security subsystem, I really hate your
> "Can you trigger it?" resistance. Since this is the OOM path, which nobody
> tests, starting from being prepared for the worst case keeps things simple.

There is simply no way to be generally safe in this kind of situation. As
soon as your console is so slow that you cannot push the oom report
through, there is only one single option left and that is to disable the
oom report altogether. And that might be a viable option. But fiddling
with a per-memcg limit is not going to fly. Just realize what will happen
if you have hundreds of different memcgs triggering this path around the
same time.

So can you start being reasonable and finally try to look at a wider
picture, please?

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] memcg, oom: throttle dump_header for memcg ooms without eligible tasks

2018-10-15 Thread Tetsuo Handa
On 2018/10/15 20:24, Michal Hocko wrote:
> On Mon 15-10-18 19:57:35, Tetsuo Handa wrote:
>> On 2018/10/15 17:19, Michal Hocko wrote:
>>> As so many dozens of times before, I will point you to the incremental
>>> nature of changes we really prefer in mm land. We are also after a
>>> simplicity which your proposal lacks in many aspects. You seem to ignore
>>> that general approach, and I have a hard time considering your NAK as
>>> relevant feedback. Going to an extreme and basing a complex solution on
>>> it is not going to fly. No killable process should be a rare event which
>>> requires a seriously misconfigured memcg to happen so wildly. If you can
>>> trigger it with normal user privileges then it would be a clear bug to
>>> address rather than work around with printk throttling.
>>>
>>
>> I can trigger 200+ times / 900+ lines / 69KB+ of needless OOM messages
>> with normal user privileges. This is a lot of needless noise/delay.
> 
> I am pretty sure you have understood the part of my message you have
> chosen not to quote, where I have said that the specific rate limiting
> decisions can be changed based on reasonable configurations. There is
> absolutely zero reason to NAK a natural decision to unify the throttling
> and cook a per-memcg way for a very specific path instead.
> 
>> "No killable process" is not a rare event, even without root privileges.
>>
>> [root@ccsecurity kumaneko]# time ./a.out
>> Killed
>>
>> real	0m2.396s
>> user	0m0.000s
>> sys	0m2.970s
>> [root@ccsecurity ~]# dmesg | grep 'no killable' | wc -l
>> 202
>> [root@ccsecurity ~]# dmesg | wc
>>     942    7335   70716
> 
> OK, so this is 70kB worth of data pushed through the console. Is this
> really killing any machine?
> 

Nobody can prove that it never kills some machine. This is just one example
result of one example stress test tried in my environment. Since I am a
secure programming person from the security subsystem, I really hate your
"Can you trigger it?" resistance. Since this is the OOM path, which nobody
tests, starting from being prepared for the worst case keeps things simple.


Re: [RFC PATCH] memcg, oom: throttle dump_header for memcg ooms without eligible tasks

2018-10-15 Thread Michal Hocko
On Mon 15-10-18 19:57:35, Tetsuo Handa wrote:
> On 2018/10/15 17:19, Michal Hocko wrote:
> > As so many dozens of times before, I will point you to the incremental
> > nature of changes we really prefer in mm land. We are also after a
> > simplicity which your proposal lacks in many aspects. You seem to ignore
> > that general approach, and I have a hard time considering your NAK as
> > relevant feedback. Going to an extreme and basing a complex solution on
> > it is not going to fly. No killable process should be a rare event which
> > requires a seriously misconfigured memcg to happen so wildly. If you can
> > trigger it with normal user privileges then it would be a clear bug to
> > address rather than work around with printk throttling.
> > 
> 
> I can trigger 200+ times / 900+ lines / 69KB+ of needless OOM messages
> with normal user privileges. This is a lot of needless noise/delay.

I am pretty sure you have understood the part of my message you have
chosen not to quote, where I have said that the specific rate limiting
decisions can be changed based on reasonable configurations. There is
absolutely zero reason to NAK a natural decision to unify the throttling
and cook a per-memcg way for a very specific path instead.

> "No killable process" is not a rare event, even without root privileges.
>
> [root@ccsecurity kumaneko]# time ./a.out
> Killed
>
> real	0m2.396s
> user	0m0.000s
> sys	0m2.970s
> [root@ccsecurity ~]# dmesg | grep 'no killable' | wc -l
> 202
> [root@ccsecurity ~]# dmesg | wc
>     942    7335   70716

OK, so this is 70kB worth of data pushed through the console. Is this
really killing any machine?
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] memcg, oom: throttle dump_header for memcg ooms without eligible tasks

2018-10-15 Thread Tetsuo Handa
On 2018/10/15 17:19, Michal Hocko wrote:
> As so many dozens of times before, I will point you to the incremental
> nature of changes we really prefer in mm land. We are also after a
> simplicity which your proposal lacks in many aspects. You seem to ignore
> that general approach, and I have a hard time considering your NAK as
> relevant feedback. Going to an extreme and basing a complex solution on
> it is not going to fly. No killable process should be a rare event which
> requires a seriously misconfigured memcg to happen so wildly. If you can
> trigger it with normal user privileges then it would be a clear bug to
> address rather than work around with printk throttling.
> 

I can trigger 200+ times / 900+ lines / 69KB+ of needless OOM messages
with normal user privileges. This is a lot of needless noise/delay.
"No killable process" is not a rare event, even without root privileges.

[root@ccsecurity kumaneko]# time ./a.out
Killed

real	0m2.396s
user	0m0.000s
sys	0m2.970s
[root@ccsecurity ~]# dmesg | grep 'no killable' | wc -l
202
[root@ccsecurity ~]# dmesg | wc
    942    7335   70716
[root@ccsecurity ~]#


#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>

#define NUMTHREADS 256
#define MMAPSIZE 4 * 10485760
#define STACKSIZE 4096
static int pipe_fd[2] = { EOF, EOF };
/* Each thread maps MMAPSIZE bytes and faults them in from /dev/zero. */
static int memory_eater(void *unused)
{
	int fd = open("/dev/zero", O_RDONLY);
	char *buf = mmap(NULL, MMAPSIZE, PROT_WRITE | PROT_READ,
			 MAP_ANONYMOUS | MAP_SHARED, EOF, 0);
	/* pipe_fd is never set up with pipe(), so this read fails at once
	 * (EBADF) and the thread proceeds straight to eating memory. */
	read(pipe_fd[0], buf, 1);
	read(fd, buf, MMAPSIZE);
	pause();
	return 0;
}
int main(int argc, char *argv[])
{
	int i;
	char *stack;
	FILE *fp;
	const unsigned long size = 1048576 * 200;
	/* Create a memcg with a 200MB limit and move ourselves into it. */
	mkdir("/sys/fs/cgroup/memory/test1", 0755);
	fp = fopen("/sys/fs/cgroup/memory/test1/memory.limit_in_bytes", "w");
	fprintf(fp, "%lu\n", size);
	fclose(fp);
	fp = fopen("/sys/fs/cgroup/memory/test1/tasks", "w");
	fprintf(fp, "%u\n", getpid());
	fclose(fp);
	/* Drop privileges to an unprivileged UID/GID (-2, i.e. 65534). */
	if (setgid(-2) || setuid(-2))
		return 1;
	stack = mmap(NULL, STACKSIZE * NUMTHREADS, PROT_WRITE | PROT_READ,
		     MAP_ANONYMOUS | MAP_SHARED, EOF, 0);
	for (i = 0; i < NUMTHREADS; i++)
		if (clone(memory_eater, stack + (i + 1) * STACKSIZE,
			  CLONE_SIGHAND | CLONE_THREAD | CLONE_VM | CLONE_FS |
			  CLONE_FILES, NULL) == -1)
			break;
	sleep(1);
	close(pipe_fd[1]);
	pause();
	return 0;
}
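
For completeness, the testcase can be built and run as in the transcripts
above (assuming the source is saved as a.out.c; a cgroup v1 memory controller
must be mounted at /sys/fs/cgroup/memory, and the program has to start as
root so that it can create the memcg before dropping privileges):

[root@ccsecurity kumaneko]# gcc -Wall -O2 -o a.out a.out.c
[root@ccsecurity kumaneko]# time ./a.out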




Re: [RFC PATCH] memcg, oom: throttle dump_header for memcg ooms without eligible tasks

2018-10-15 Thread Michal Hocko
On Sat 13-10-18 20:28:38, Tetsuo Handa wrote:
> On 2018/10/13 20:22, Johannes Weiner wrote:
> > On Sat, Oct 13, 2018 at 08:09:30PM +0900, Tetsuo Handa wrote:
> >> -- Michal's patch --
> >>
> >> 73133 lines (5.79MB) of kernel messages per one run
> >>
> >> [root@ccsecurity ~]# time ./a.out
> >>
> >> real	3m44.389s
> >> user	0m0.000s
> >> sys	3m42.334s
> >>
> >> [root@ccsecurity ~]# time ./a.out
> >>
> >> real	3m41.767s
> >> user	0m0.004s
> >> sys	3m39.779s
> >>
> >> -- My v2 patch --
> >>
> >> 50 lines (3.40 KB) of kernel messages per one run
> >>
> >> [root@ccsecurity ~]# time ./a.out
> >>
> >> real	0m5.227s
> >> user	0m0.000s
> >> sys	0m4.950s
> >>
> >> [root@ccsecurity ~]# time ./a.out
> >>
> >> real	0m5.249s
> >> user	0m0.000s
> >> sys	0m4.956s
> > 
> > Your patch is suppressing information that I want to have and my
> > console can handle, just because your console is slow, even though
> > there is no need to use that console at that log level.
> 
> My patch is not suppressing information you want to have.
> My patch is mainly suppressing
> 
> [   52.393146] Out of memory and no killable processes...
> [   52.395195] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
> [   52.398623] Out of memory and no killable processes...
> [   52.401195] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
> [   52.404356] Out of memory and no killable processes...
> [   52.406492] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
> [   52.409595] Out of memory and no killable processes...
> [   52.411745] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
> [   52.415588] Out of memory and no killable processes...
> [   52.418484] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
> [   52.421904] Out of memory and no killable processes...
> [   52.424273] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
> 
> lines which Michal's patch cannot suppress.

This was a deliberate decision because the allocation failure context is
usually useful information to get. If this is killing a reasonably
configured machine then we can move the ratelimit up and suppress that
information. This will always be a cost vs. benefit decision, and as such
it should be argued in the changelog.

As so many dozens of times before, I will point you to the incremental
nature of changes we really prefer in mm land. We are also after a
simplicity which your proposal lacks in many aspects. You seem to ignore
that general approach, and I have a hard time considering your NAK as
relevant feedback. Going to an extreme and basing a complex solution on
it is not going to fly. No killable process should be a rare event which
requires a seriously misconfigured memcg to happen so wildly. If you can
trigger it with normal user privileges then it would be a clear bug to
address rather than work around with printk throttling.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] memcg, oom: throttle dump_header for memcg ooms without eligible tasks

2018-10-13 Thread Tetsuo Handa
On 2018/10/13 20:22, Johannes Weiner wrote:
> On Sat, Oct 13, 2018 at 08:09:30PM +0900, Tetsuo Handa wrote:
>> -- Michal's patch --
>>
>> 73133 lines (5.79MB) of kernel messages per one run
>>
>> [root@ccsecurity ~]# time ./a.out
>>
>> real	3m44.389s
>> user	0m0.000s
>> sys	3m42.334s
>>
>> [root@ccsecurity ~]# time ./a.out
>>
>> real	3m41.767s
>> user	0m0.004s
>> sys	3m39.779s
>>
>> -- My v2 patch --
>>
>> 50 lines (3.40 KB) of kernel messages per one run
>>
>> [root@ccsecurity ~]# time ./a.out
>>
>> real	0m5.227s
>> user	0m0.000s
>> sys	0m4.950s
>>
>> [root@ccsecurity ~]# time ./a.out
>>
>> real	0m5.249s
>> user	0m0.000s
>> sys	0m4.956s
> 
> Your patch is suppressing information that I want to have and my
> console can handle, just because your console is slow, even though
> there is no need to use that console at that log level.

My patch is not suppressing information you want to have.
My patch is mainly suppressing

[   52.393146] Out of memory and no killable processes...
[   52.395195] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   52.398623] Out of memory and no killable processes...
[   52.401195] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   52.404356] Out of memory and no killable processes...
[   52.406492] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   52.409595] Out of memory and no killable processes...
[   52.411745] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   52.415588] Out of memory and no killable processes...
[   52.418484] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   52.421904] Out of memory and no killable processes...
[   52.424273] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000

lines which Michal's patch cannot suppress.

Also, my console is console=ttyS0,115200n8. Not slow at all.

> 
> NAK to your patch. I think you're looking at this from the wrong
> angle. A console that takes almost 4 minutes to print 70k lines
> shouldn't be the baseline for how verbose KERN_INFO is.
> 

Run the testcase in your environment.


Re: [RFC PATCH] memcg, oom: throttle dump_header for memcg ooms without eligible tasks

2018-10-13 Thread Johannes Weiner
On Sat, Oct 13, 2018 at 08:09:30PM +0900, Tetsuo Handa wrote:
> -- Michal's patch --
> 
> 73133 lines (5.79MB) of kernel messages per one run
> 
> [root@ccsecurity ~]# time ./a.out
> 
> real    3m44.389s
> user    0m0.000s
> sys     3m42.334s
> 
> [root@ccsecurity ~]# time ./a.out
> 
> real    3m41.767s
> user    0m0.004s
> sys     3m39.779s
> 
> -- My v2 patch --
> 
> 50 lines (3.40 KB) of kernel messages per one run
> 
> [root@ccsecurity ~]# time ./a.out
> 
> real    0m5.227s
> user    0m0.000s
> sys     0m4.950s
> 
> [root@ccsecurity ~]# time ./a.out
> 
> real    0m5.249s
> user    0m0.000s
> sys     0m4.956s

Your patch is suppressing information that I want to have and my
console can handle, just because your console is slow, even though
there is no need to use that console at that log level.

NAK to your patch. I think you're looking at this from the wrong
angle. A console that takes almost 4 minutes to print 70k lines
shouldn't be the baseline for how verbose KERN_INFO is.


Re: [RFC PATCH] memcg, oom: throttle dump_header for memcg ooms without eligible tasks

2018-10-13 Thread Tetsuo Handa
On 2018/10/12 21:58, Tetsuo Handa wrote:
> On 2018/10/12 21:41, Johannes Weiner wrote:
>> On Fri, Oct 12, 2018 at 09:10:40PM +0900, Tetsuo Handa wrote:
>>> On 2018/10/12 21:08, Michal Hocko wrote:
> > So not more than 10 dumps in each 5s interval. That looks reasonable
> > to me. By the time it starts dropping data you have more than enough
> > information to go on already.

Not reasonable at all.

> Yeah. Unless we have a storm coming from many different cgroups in
> parallel. But even then we have the allocation context for each OOM so
> we are not losing everything. Should we ever tune this, it can be done
> later with some explicit examples.
>
> > Acked-by: Johannes Weiner 
>
> Thanks! I will post the patch to Andrew early next week.


One thread from one cgroup is sufficient. I don't think that Michal's patch
is an appropriate mitigation. It still needlessly floods the kernel log buffer
and significantly defers the recovery operation (73,133 lines and ~3m44s per
run versus 50 lines and ~5s, per the numbers below).

Nacked-by: Tetsuo Handa 

-- Testcase --

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Needs root and the cgroup-v1 memory controller mounted at
 * /sys/fs/cgroup/memory. */
int main(int argc, char *argv[])
{
        FILE *fp;
        const unsigned long size = 1048576 * 200;       /* 200MB */
        char *buf = malloc(size);

        /* Create a memcg capped at half of the allocation... */
        mkdir("/sys/fs/cgroup/memory/test1", 0755);
        fp = fopen("/sys/fs/cgroup/memory/test1/memory.limit_in_bytes", "w");
        fprintf(fp, "%lu\n", size / 2);
        fclose(fp);
        /* ...move this process into it... */
        fp = fopen("/sys/fs/cgroup/memory/test1/tasks", "w");
        fprintf(fp, "%u\n", getpid());
        fclose(fp);
        /* ...mark it OOM-unkillable... */
        fp = fopen("/proc/self/oom_score_adj", "w");
        fprintf(fp, "-1000\n");
        fclose(fp);
        /* ...then fault the whole buffer in, forcing memcg OOMs. */
        fp = fopen("/dev/zero", "r");
        fread(buf, 1, size, fp);
        fclose(fp);
        return 0;
}
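
Why this runs away (an illustrative sketch with a hypothetical helper name,
not the real mm/ code paths): the process is OOM-unkillable inside a memcg
capped at half of its allocation, so every charge past the limit invokes the
OOM killer, which finds no victim, prints a full report, and lets the charge
retry:

#include <linux/oom.h>

/* Kernel-context sketch only; charge_fits_under_limit() is a made-up
 * stand-in for the real charge attempt. */
static bool charge_fits_under_limit(struct oom_control *oc);

static void charge_one_page(struct oom_control *oc)
{
        while (!charge_fits_under_limit(oc)) {
                /*
                 * No eligible victim: out_of_memory() still runs
                 * dump_header() plus "Out of memory and no killable
                 * processes...", returns false, and the charge is
                 * simply retried -- one full report per iteration.
                 */
                out_of_memory(oc);
        }
}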

-- Michal's patch --

73133 lines (5.79MB) of kernel messages per one run

[root@ccsecurity ~]# time ./a.out

real    3m44.389s
user    0m0.000s
sys     3m42.334s

[root@ccsecurity ~]# time ./a.out

real    3m41.767s
user    0m0.004s
sys     3m39.779s

-- My v2 patch --

50 lines (3.40 KB) of kernel messages per one run

[root@ccsecurity ~]# time ./a.out

real    0m5.227s
user    0m0.000s
sys     0m4.950s

[root@ccsecurity ~]# time ./a.out

real    0m5.249s
user    0m0.000s
sys     0m4.956s



Re: [RFC PATCH] memcg, oom: throttle dump_header for memcg ooms without eligible tasks

2018-10-12 Thread Tetsuo Handa
Calling printk() people. ;-)

On 2018/10/12 21:41, Johannes Weiner wrote:
> On Fri, Oct 12, 2018 at 09:10:40PM +0900, Tetsuo Handa wrote:
>> On 2018/10/12 21:08, Michal Hocko wrote:
>>>> So not more than 10 dumps in each 5s interval. That looks reasonable
>>>> to me. By the time it starts dropping data you have more than enough
>>>> information to go on already.
>>>
>>> Yeah. Unless we have a storm coming from many different cgroups in
>>> parallel. But even then we have the allocation context for each OOM so
>>> we are not losing everything. Should we ever tune this, it can be done
>>> later with some explicit examples.
>>>
>>>> Acked-by: Johannes Weiner 
>>>
>>> Thanks! I will post the patch to Andrew early next week.
>>>
>>
>> How do you handle environments where one dump takes e.g. 3 seconds?
>> Counting the delay from the first message in the previous dump to the
>> first message in the next dump is not safe. Unless we count the delay
>> from the last message in the previous dump to the first message in the
>> next dump, we cannot guarantee that the system won't lock up due to
>> printk() flooding.
> 
> How is that different from any other printk ratelimiting? If a dump
> takes 3 seconds you need to fix your console. It doesn't make sense to
> design KERN_INFO messages for the slowest serial consoles out there.

You can't fix the console. It is a hardware limitation.

> 
> That's what we did, btw. We used to patch out the OOM header because
> our serial console was so bad, but obviously that's not a generic
> upstream solution. We've since changed the loglevel on the serial and
> use netconsole[1] for the chattier loglevels.
> 
> [1] https://github.com/facebook/fbkutils/tree/master/netconsd
> 


Re: [RFC PATCH] memcg, oom: throttle dump_header for memcg ooms without eligible tasks

2018-10-12 Thread Johannes Weiner
On Fri, Oct 12, 2018 at 09:10:40PM +0900, Tetsuo Handa wrote:
> On 2018/10/12 21:08, Michal Hocko wrote:
> >> So not more than 10 dumps in each 5s interval. That looks reasonable
> >> to me. By the time it starts dropping data you have more than enough
> >> information to go on already.
> > 
> > Yeah. Unless we have a storm coming from many different cgroups in
> > parallel. But even then we have the allocation context for each OOM so
> > we are not losing everything. Should we ever tune this, it can be done
> > later with some explicit examples.
> > 
> >> Acked-by: Johannes Weiner 
> > 
> > Thanks! I will post the patch to Andrew early next week.
> > 
> 
> How do you handle environments where one dump takes e.g. 3 seconds?
> Counting the delay from the first message in the previous dump to the
> first message in the next dump is not safe. Unless we count the delay
> from the last message in the previous dump to the first message in the
> next dump, we cannot guarantee that the system won't lock up due to
> printk() flooding.

How is that different from any other printk ratelimiting? If a dump
takes 3 seconds you need to fix your console. It doesn't make sense to
design KERN_INFO messages for the slowest serial consoles out there.

That's what we did, btw. We used to patch out the OOM header because
our serial console was so bad, but obviously that's not a generic
upstream solution. We've since changed the loglevel on the serial and
use netconsole[1] for the chattier loglevels.

[1] https://github.com/facebook/fbkutils/tree/master/netconsd
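
For context on why the loglevel split works (a hedged restatement of standard
printk semantics, not code from this thread): the console loglevel only gates
what gets pushed to consoles; every message still lands in the kernel log
buffer, where dmesg, /dev/kmsg, and netconsole readers pick it up.

/* Illustrative paraphrase of the usual printk flow; the helper names
 * stand in for internals of kernel/printk/printk.c. */
static void printk_sketch(int msg_level, const char *text)
{
        log_store(msg_level, text);             /* always buffered: dmesg,   */
                                                /* /dev/kmsg, netconsole ... */
        if (msg_level < console_loglevel)       /* keep serial at e.g. 4 so  */
                call_console_drivers(text);     /* KERN_INFO (6) skips it    */
}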


Re: [RFC PATCH] memcg, oom: throttle dump_header for memcg ooms without eligible tasks

2018-10-12 Thread Tetsuo Handa
On 2018/10/12 21:08, Michal Hocko wrote:
>> So not more than 10 dumps in each 5s interval. That looks reasonable
>> to me. By the time it starts dropping data you have more than enough
>> information to go on already.
> 
> Yeah. Unless we have a storm coming from many different cgroups in
> parallel. But even then we have the allocation context for each OOM so
> we are not losing everything. Should we ever tune this, it can be done
> later with some explicit examples.
> 
>> Acked-by: Johannes Weiner 
> 
> Thanks! I will post the patch to Andrew early next week.
> 

How do you handle environments where one dump takes e.g. 3 seconds?
Counting the delay from the first message in the previous dump to the
first message in the next dump is not safe. Unless we count the delay
from the last message in the previous dump to the first message in the
next dump, we cannot guarantee that the system won't lock up due to
printk() flooding.
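
The scheme being asked for, sketched (illustrative only, assuming jiffies-based
timing; this is not a posted patch): re-arm the window from the end of the
previous dump rather than from its start, so a dump that itself takes seconds
still leaves a guaranteed quiet gap before the next one.

#include <linux/jiffies.h>

/* Sketch of "count from the last message of the previous dump";
 * the names are made up. */
static unsigned long next_dump_allowed;         /* in jiffies */

static bool oom_dump_may_start(void)
{
        return time_after_eq(jiffies, next_dump_allowed);
}

static void oom_dump_finished(void)
{
        /* open the next window only after this dump has fully drained */
        next_dump_allowed = jiffies + 60 * HZ;
}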


Re: [RFC PATCH] memcg, oom: throttle dump_header for memcg ooms without eligible tasks

2018-10-12 Thread Michal Hocko
On Fri 12-10-18 07:20:08, Johannes Weiner wrote:
> On Wed, Oct 10, 2018 at 05:11:35PM +0200, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > syzbot has noticed that it can trigger RCU stalls from the memcg oom
> > path:
> > RIP: 0010:dump_stack+0x358/0x3ab lib/dump_stack.c:118
> > Code: 74 0c 48 c7 c7 f0 f5 31 89 e8 9f 0e 0e fa 48 83 3d 07 15 7d 01 00 0f
> > 84 63 fe ff ff e8 1c 89 c9 f9 48 8b bd 70 ff ff ff 57 9d <0f> 1f 44 00 00
> > e8 09 89 c9 f9 48 8b 8d 68 ff ff ff b8 ff ff 37 00
> > RSP: 0018:88017d3a5c70 EFLAGS: 0246 ORIG_RAX: ff13
> > RAX: 0004 RBX: 11263ebe RCX: c90001e5a000
> > RDX: 0004 RSI: 87b4e0f4 RDI: 0246
> > RBP: 88017d3a5d18 R08: 8801d7e02480 R09: fbfff13da030
> > R10: fbfff13da030 R11: 0003 R12: 11002fa74b96
> > R13:  R14: 0200 R15: 
> >   dump_header+0x27b/0xf72 mm/oom_kill.c:441
> >   out_of_memory.cold.30+0xf/0x184 mm/oom_kill.c:1109
> >   mem_cgroup_out_of_memory+0x15e/0x210 mm/memcontrol.c:1386
> >   mem_cgroup_oom mm/memcontrol.c:1701 [inline]
> >   try_charge+0xb7c/0x1710 mm/memcontrol.c:2260
> >   mem_cgroup_try_charge+0x627/0xe20 mm/memcontrol.c:5892
> >   mem_cgroup_try_charge_delay+0x1d/0xa0 mm/memcontrol.c:5907
> >   shmem_getpage_gfp+0x186b/0x4840 mm/shmem.c:1784
> >   shmem_fault+0x25f/0x960 mm/shmem.c:1982
> >   __do_fault+0x100/0x6b0 mm/memory.c:2996
> >   do_read_fault mm/memory.c:3408 [inline]
> >   do_fault mm/memory.c:3531 [inline]
> > 
> > The primary reason for the stall lies in the expensive printk handling
> > of the oom report flood: a misconfiguration on the syzbot side meant
> > that there was simply no eligible task, because all tasks had
> > OOM_SCORE_ADJ_MIN set. This generates an oom report for each allocation
> > from the memcg context.
> > 
> > While normal workloads should be much more careful about potential heavy
> > memory consumers that are OOM disabled, it makes some sense to rate limit
> > the potentially expensive oom reports for cases when no eligible victim
> > is found. Do that by moving the rate limit logic inside dump_header.
> > We no longer rely on the caller to do that. It was only oom_kill_process
> > which has been throttling. The other two call sites simply didn't have to
> > care, because one just panicked on the OOM when configured that way, and
> > the no-eligible-task case would panic for the global case as well. Memcg
> > changed the picture because we do not panic and we might have multiple
> > sources of the same event.
> > 
> > Once we are here, make sure that the reason to trigger the OOM is
> > printed without ratelimiting, because this is really valuable for
> > debugging what happened.
> > 
> > Reported-by: syzbot+77e6b28a7a7106ad0...@syzkaller.appspotmail.com
> > Cc: g...@fb.com
> > Cc: han...@cmpxchg.org
> > Cc: kirill.shute...@linux.intel.com
> > Cc: linux-kernel@vger.kernel.org
> > Cc: penguin-ker...@i-love.sakura.ne.jp
> > Cc: rient...@google.com
> > Cc: yan...@alibaba-inc.com
> > Signed-off-by: Michal Hocko 
> 
> So not more than 10 dumps in each 5s interval. That looks reasonable
> to me. By the time it starts dropping data you have more than enough
> information to go on already.

Yeah. Unless we have a storm coming from many different cgroups in
parallel. But even then we have the allocation context for each OOM so
we are not losing everything. Should we ever tune this, it can be done
later with some explicit examples.

> Acked-by: Johannes Weiner 

Thanks! I will post the patch to Andrew early next week.
-- 
Michal Hocko
SUSE Labs
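
The patch body itself is not quoted in this thread; as a hedged sketch of the
shape the changelog describes (the rate limit moved inside dump_header(), with
the one-line OOM reason exempt from throttling), it would look roughly like:

#include <linux/ratelimit.h>

/* Sketch only -- the posted patch may differ in detail. */
static void dump_header(struct oom_control *oc, struct task_struct *p)
{
        /* the defaults allow a burst of 10 reports per 5 second window */
        static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
                                      DEFAULT_RATELIMIT_BURST);

        /* the reason line is deliberately printed without ratelimiting */
        pr_warn("%s invoked oom-killer: order=%d, oom_score_adj=%hd\n",
                current->comm, oc->order, current->signal->oom_score_adj);

        if (!__ratelimit(&oom_rs))
                return;

        /* the expensive part is throttled: stack, meminfo, task list */
        dump_stack();
        show_mem(0, oc->nodemask);
        if (sysctl_oom_dump_tasks)
                dump_tasks(oc->memcg, oc->nodemask);
}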


Re: [RFC PATCH] memcg, oom: throttle dump_header for memcg ooms without eligible tasks

2018-10-12 Thread Johannes Weiner
On Wed, Oct 10, 2018 at 05:11:35PM +0200, Michal Hocko wrote:
> From: Michal Hocko 
> 
> syzbot has noticed that it can trigger RCU stalls from the memcg oom
> path:
> RIP: 0010:dump_stack+0x358/0x3ab lib/dump_stack.c:118
> Code: 74 0c 48 c7 c7 f0 f5 31 89 e8 9f 0e 0e fa 48 83 3d 07 15 7d 01 00 0f
> 84 63 fe ff ff e8 1c 89 c9 f9 48 8b bd 70 ff ff ff 57 9d <0f> 1f 44 00 00
> e8 09 89 c9 f9 48 8b 8d 68 ff ff ff b8 ff ff 37 00
> RSP: 0018:88017d3a5c70 EFLAGS: 0246 ORIG_RAX: ff13
> RAX: 0004 RBX: 11263ebe RCX: c90001e5a000
> RDX: 0004 RSI: 87b4e0f4 RDI: 0246
> RBP: 88017d3a5d18 R08: 8801d7e02480 R09: fbfff13da030
> R10: fbfff13da030 R11: 0003 R12: 11002fa74b96
> R13:  R14: 0200 R15: 
>   dump_header+0x27b/0xf72 mm/oom_kill.c:441
>   out_of_memory.cold.30+0xf/0x184 mm/oom_kill.c:1109
>   mem_cgroup_out_of_memory+0x15e/0x210 mm/memcontrol.c:1386
>   mem_cgroup_oom mm/memcontrol.c:1701 [inline]
>   try_charge+0xb7c/0x1710 mm/memcontrol.c:2260
>   mem_cgroup_try_charge+0x627/0xe20 mm/memcontrol.c:5892
>   mem_cgroup_try_charge_delay+0x1d/0xa0 mm/memcontrol.c:5907
>   shmem_getpage_gfp+0x186b/0x4840 mm/shmem.c:1784
>   shmem_fault+0x25f/0x960 mm/shmem.c:1982
>   __do_fault+0x100/0x6b0 mm/memory.c:2996
>   do_read_fault mm/memory.c:3408 [inline]
>   do_fault mm/memory.c:3531 [inline]
> 
> The primary reason for the stall lies in the expensive printk handling
> of the oom report flood: a misconfiguration on the syzbot side meant
> that there was simply no eligible task, because all tasks had
> OOM_SCORE_ADJ_MIN set. This generates an oom report for each allocation
> from the memcg context.
> 
> While normal workloads should be much more careful about potential heavy
> memory consumers that are OOM disabled, it makes some sense to rate limit
> the potentially expensive oom reports for cases when no eligible victim
> is found. Do that by moving the rate limit logic inside dump_header.
> We no longer rely on the caller to do that. It was only oom_kill_process
> which has been throttling. The other two call sites simply didn't have to
> care, because one just panicked on the OOM when configured that way, and
> the no-eligible-task case would panic for the global case as well. Memcg
> changed the picture because we do not panic and we might have multiple
> sources of the same event.
> 
> Once we are here, make sure that the reason to trigger the OOM is
> printed without ratelimiting, because this is really valuable for
> debugging what happened.
> 
> Reported-by: syzbot+77e6b28a7a7106ad0...@syzkaller.appspotmail.com
> Cc: g...@fb.com
> Cc: han...@cmpxchg.org
> Cc: kirill.shute...@linux.intel.com
> Cc: linux-kernel@vger.kernel.org
> Cc: penguin-ker...@i-love.sakura.ne.jp
> Cc: rient...@google.com
> Cc: yan...@alibaba-inc.com
> Signed-off-by: Michal Hocko 

So not more than 10 dumps in each 5s interval. That looks reasonable
to me. By the time it starts dropping data you have more than enough
information to go on already.

Acked-by: Johannes Weiner 


Re: [RFC PATCH] memcg, oom: throttle dump_header for memcg ooms without eligible tasks

2018-10-12 Thread Tetsuo Handa
On 2018/10/11 15:37, Tetsuo Handa wrote:
> Michal Hocko wrote:
>> Once we are here, make sure that the reason to trigger the OOM is
>> printed without ratelimiting because this is really valuable to
>> debug what happened.
> 
> Here is my version.
> 

Hmm, a per-mem_cgroup flag would be better than a per-task_struct flag.

From 5e80805e4988f23372a3b352fd4185e6bad53f58 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa 
Date: Fri, 12 Oct 2018 16:24:49 +0900
Subject: [PATCH v2] mm: memcontrol: Don't flood OOM messages with no eligible task.

syzbot is hitting an RCU stall at shmem_fault() [1].
This is because memcg OOM events with no eligible task (the current
thread is marked as OOM-unkillable) keep calling dump_header() from
out_of_memory(), a path enabled by commit 3100dab2aa09dc6e ("mm:
memcontrol: print proper OOM header when no eligible victim left.").

Let's make sure that the next dump_header() waits for at least 60 seconds
after the previous "Out of memory and no killable processes..." message,
if the requested memcg has already reported it. This change allows threads
in the requested memcg to complete the memory allocation requests needed
for recovery, and also lets us see whether threads in the requested memcg
are actually performing recovery.

[1] https://syzkaller.appspot.com/bug?id=4ae3fff7fcf4c33a47c1192d2d62d2e03efffa64

Signed-off-by: Tetsuo Handa 
Reported-by: syzbot 
Cc: Johannes Weiner 
Cc: Michal Hocko 
---
 include/linux/memcontrol.h |  3 +++
 mm/oom_kill.c  | 23 +++
 2 files changed, 26 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 652f602..8da2340 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -230,6 +230,9 @@ struct mem_cgroup {
 	 */
 	bool oom_group;
 
+	/* Previous OOM-kill request failed due to no eligible task? */
+	bool oom_no_eligible_warned;
+
 	/* protected by memcg_oom_lock */
 	bool oom_lock;
 	int under_oom;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..52a8319 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1106,6 +1106,21 @@ bool out_of_memory(struct oom_control *oc)
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
+#ifdef CONFIG_MEMCG
+		/*
+		 * Don't flood with dump_header() if already reported, in case
+		 * current->signal->oom_score_adj == OOM_SCORE_ADJ_MIN. Maybe
+		 * this variable should be per "struct mem_cgroup". But since
+		 * we can't prove that multiple concurrent memcg OOM without
+		 * eligible task won't cause flooding, choose global variable
+		 * for safety.
+		 */
+		static u64 last_warned;
+
+		if (is_memcg_oom(oc) && oc->memcg->oom_no_eligible_warned &&
+		    time_before64(get_jiffies_64(), last_warned + 60 * HZ))
+			return false;
+#endif
 		dump_header(oc, NULL);
 		pr_warn("Out of memory and no killable processes...\n");
 		/*
@@ -1115,6 +1130,14 @@ bool out_of_memory(struct oom_control *oc)
 		 */
 		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
 			panic("System is deadlocked on memory\n");
+#ifdef CONFIG_MEMCG
+		if (is_memcg_oom(oc)) {
+			oc->memcg->oom_no_eligible_warned = true;
+			last_warned = get_jiffies_64();
+		}
+	} else if (is_memcg_oom(oc)) {
+		oc->memcg->oom_no_eligible_warned = false;
+#endif
 	}
 	if (oc->chosen && oc->chosen != (void *)-1UL)
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
-- 
1.8.3.1
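
Restated as a standalone predicate for readability (a sketch of the gate the
v2 patch adds, reusing its last_warned and flag; not a separate proposal): the
per-memcg flag means the first no-victim report of each memcg always gets
through, while the single global timestamp bounds the aggregate report rate
across all memcgs.

/* Readability sketch of the v2 gate above; illustrative only. */
static bool suppress_no_victim_report(struct oom_control *oc, u64 last_warned)
{
        if (!is_memcg_oom(oc))
                return false;   /* global and sysrq paths are unchanged */
        return oc->memcg->oom_no_eligible_warned &&
               time_before64(get_jiffies_64(), last_warned + 60 * HZ);
}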


Re: [RFC PATCH] memcg, oom: throttle dump_header for memcg ooms without eligible tasks

2018-10-11 Thread Tetsuo Handa
Michal Hocko wrote:
> Once we are here, make sure that the reason to trigger the OOM is
> printed without ratelimiting because this is really valuable to
> debug what happened.

Here is my version.

From 0c9ab34fd01837d4c85794042ecb9e922c9eed5a Mon Sep 17 00:00:00 2001
From: Tetsuo Handa 
Date: Thu, 11 Oct 2018 15:27:42 +0900
Subject: [PATCH] mm: memcontrol: Don't flood OOM messages with no eligible task.

syzbot is hitting an RCU stall at shmem_fault() [1].
This is because memcg OOM events with no eligible task (the current
thread is marked as OOM-unkillable) keep calling dump_header() from
out_of_memory(), a path enabled by commit 3100dab2aa09dc6e ("mm:
memcontrol: print proper OOM header when no eligible victim left.").

Let's make sure that the next dump_header() waits for at least 60 seconds
after the previous "Out of memory and no killable processes..." message,
if the current thread has already reported it. This change allows the
current thread to complete the memory allocation requests needed for
recovery, and also lets us see whether the current thread is actually
performing recovery.

[1] https://syzkaller.appspot.com/bug?id=4ae3fff7fcf4c33a47c1192d2d62d2e03efffa64

Signed-off-by: Tetsuo Handa 
Reported-by: syzbot 
Cc: Johannes Weiner 
Cc: Michal Hocko 
---
 include/linux/sched.h |  1 +
 mm/oom_kill.c | 23 +++
 2 files changed, 24 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 977cb57..701cf3c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -723,6 +723,7 @@ struct task_struct {
 #endif
 #ifdef CONFIG_MEMCG
 	unsigned in_user_fault:1;
+	unsigned memcg_oom_no_eligible_warned:1;
 #ifdef CONFIG_MEMCG_KMEM
 	unsigned memcg_kmem_skip_account:1;
 #endif
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..f954d99 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1106,6 +1106,21 @@ bool out_of_memory(struct oom_control *oc)
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
+#ifdef CONFIG_MEMCG
+		/*
+		 * Don't flood with dump_header() if already reported, in case
+		 * current->signal->oom_score_adj == OOM_SCORE_ADJ_MIN. Maybe
+		 * this variable should be per "struct mem_cgroup". But since
+		 * we can't prove that multiple concurrent memcg OOM without
+		 * eligible task won't cause flooding, choose global variable
+		 * for safety.
+		 */
+		static u64 last_warned;
+
+		if (is_memcg_oom(oc) && current->memcg_oom_no_eligible_warned
+		    && time_before64(get_jiffies_64(), last_warned + 60 * HZ))
+			return false;
+#endif
 		dump_header(oc, NULL);
 		pr_warn("Out of memory and no killable processes...\n");
 		/*
@@ -1115,6 +1130,14 @@ bool out_of_memory(struct oom_control *oc)
 		 */
 		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
 			panic("System is deadlocked on memory\n");
+#ifdef CONFIG_MEMCG
+		if (is_memcg_oom(oc)) {
+			last_warned = get_jiffies_64();
+			current->memcg_oom_no_eligible_warned = 1;
+		}
+	} else if (is_memcg_oom(oc)) {
+		current->memcg_oom_no_eligible_warned = 0;
+#endif
 	}
 	if (oc->chosen && oc->chosen != (void *)-1UL)
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
-- 
1.8.3.1


