Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves

2014-01-21 Thread Michal Hocko
On Mon 20-01-14 22:13:21, David Rientjes wrote:
> On Thu, 16 Jan 2014, Michal Hocko wrote:
> 
> > > The heuristic may have existed for ages, but the proposed memcg 
> > > configuration for preserving memory such that userspace oom handlers may 
> > > run such as
> > > 
> > >_root__
> > >   /   \
> > >   user oom
> > >  /\   /   \
> > >  AB   a   b
> > > 
> > > where user/memory.limit_in_bytes == [amount of present RAM] + 
> > > oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be 
> > > problematic, including Johannes' buggy bypass for charges in memcgs with 
> > > pending memcgs that has since been fixed after I identified it.  This 
> > > bypass is included.  Processes attached to "a" and "b" are userspace oom 
> > > handlers for processes attached to "A" and "B", respectively.
> > > 
> > > The amount of memory you're talking about is proportional to the number 
> > > of 
> > > processes that have pending SIGKILLs (and now those with PF_EXITING set), 
> > > the former of which is obviously more concerning since they could be 
> > > charging memory at any point in the kernel that would succeed. 
> > 
> > I understand your concerns. Yes, excessive charges might be dangerous. I
> > haven't dismissed that when you mentioned it earlier. I am just
> > repeatedly asking how much memory are we talking about, how real is the
> > issue and what are all the other conseqeunces. And for some reason you
> > are not providing that information (or maybe I am just not seeing that
> > in your responses) and that is why we are stuck in circle.
> > 
> 
> Wtf are you talking about?  You're adding a bypass in this patch and then 
> you're asking me to go and see how much memory it could potentially bypass 
> and take away from oom handlers under the above memcg configuration?

No. You are mixing two things. One of them is adding PF_EXITING bypass
while the other is removing fatal_signal_pending bypass.

The first one is a subset of the later and it doesn't add an excessive
amount of charges because there are no direct allocations after
exit_signals. You haven't shown that this is not true and your only
concern was that this might change in future. Besides that my argument
was that even if such an allocation led to the global OOM the task would
be given TIF_MEMDIE and nothing would be killed.

The other part is fatal_signal_pending which we have there for ages
and you want to remove it. In order to do that I am asking you for
some data backing up that removal. You keep repeating your arguments
but they lack data or at least show code paths which would wildly
allocate after task has been killed which wouldn't be fixable by
fatal_signal_pending check in the caller to show that the issue is real.
Besides that you are completely ignoring other concerns I have
mentioned, e.g. possible performance regressions when a pointless
reclaim slows existing tags.

Please try to understand that this is not Black thing.

> This seems like something you should provide before throwing out
> patches that nobody has tested if you want to make the argument that
> the above memcg configuration is valid for handling userspace oom
> notifications.
> 
> And you certainly have dismissed what I've mentioned earlier when I said 
> that anybody can add memory allocation to the exit path later on and 
> nobody is going to think about how much memory this is going to bypass to 
> the root memcg and potentially take away from userspace oom handlers.

If this happens then it has to be fixed and if not fixable then
reconsider this heuristic.

> There's two possible ways to forward this:
> 
>  - avoid bypass to the root memcg in every possible case such that the
>above memcg configuration actually makes a guarantee to userspace oom
>handlers attached to it, or
> 
>  - provide per-memcg memory reserves such that userspace oom handlers can
>allocate and charge memory without the above memcg configuration so 
>there is a guarantee.

David, you are aware that there are memory allocations that are out of
memcg/kmem scope, aren't you? This means that whether you add memcg
charge-reserves or access to memory reserves to memcg OOM killers then
you still can never rule out the global OOM killer.

> What's not acceptable, now or ever, is suggesting a solution to a problem 
> that is supposed to guarantee some resource and then allow under some 
> circumstances that resource to be completely depleted such that the 
> solution never works.

And yet you still haven't shown that such depletion is real. E.g. g-u-p
backs off when it sees fatal signal pending other callers that allocate
charged memory should do the same.

> > Yes, and apart from GFP_NOFAIL we are allowing to bypass only those that
> > should terminate in a short time. I think that having a setup with a
> > guarantee of never triggering the 

Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves

2014-01-21 Thread Michal Hocko
On Mon 20-01-14 22:13:21, David Rientjes wrote:
 On Thu, 16 Jan 2014, Michal Hocko wrote:
 
   The heuristic may have existed for ages, but the proposed memcg 
   configuration for preserving memory such that userspace oom handlers may 
   run such as
   
  _root__
 /   \
 user oom
/\   /   \
AB   a   b
   
   where user/memory.limit_in_bytes == [amount of present RAM] + 
   oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be 
   problematic, including Johannes' buggy bypass for charges in memcgs with 
   pending memcgs that has since been fixed after I identified it.  This 
   bypass is included.  Processes attached to a and b are userspace oom 
   handlers for processes attached to A and B, respectively.
   
   The amount of memory you're talking about is proportional to the number 
   of 
   processes that have pending SIGKILLs (and now those with PF_EXITING set), 
   the former of which is obviously more concerning since they could be 
   charging memory at any point in the kernel that would succeed. 
  
  I understand your concerns. Yes, excessive charges might be dangerous. I
  haven't dismissed that when you mentioned it earlier. I am just
  repeatedly asking how much memory are we talking about, how real is the
  issue and what are all the other conseqeunces. And for some reason you
  are not providing that information (or maybe I am just not seeing that
  in your responses) and that is why we are stuck in circle.
  
 
 Wtf are you talking about?  You're adding a bypass in this patch and then 
 you're asking me to go and see how much memory it could potentially bypass 
 and take away from oom handlers under the above memcg configuration?

No. You are mixing two things. One of them is adding PF_EXITING bypass
while the other is removing fatal_signal_pending bypass.

The first one is a subset of the later and it doesn't add an excessive
amount of charges because there are no direct allocations after
exit_signals. You haven't shown that this is not true and your only
concern was that this might change in future. Besides that my argument
was that even if such an allocation led to the global OOM the task would
be given TIF_MEMDIE and nothing would be killed.

The other part is fatal_signal_pending which we have there for ages
and you want to remove it. In order to do that I am asking you for
some data backing up that removal. You keep repeating your arguments
but they lack data or at least show code paths which would wildly
allocatecharge after task has been killed which wouldn't be fixable by
fatal_signal_pending check in the caller to show that the issue is real.
Besides that you are completely ignoring other concerns I have
mentioned, e.g. possible performance regressions when a pointless
reclaim slows existing tags.

Please try to understand that this is not BlackWhite thing.

 This seems like something you should provide before throwing out
 patches that nobody has tested if you want to make the argument that
 the above memcg configuration is valid for handling userspace oom
 notifications.
 
 And you certainly have dismissed what I've mentioned earlier when I said 
 that anybody can add memory allocation to the exit path later on and 
 nobody is going to think about how much memory this is going to bypass to 
 the root memcg and potentially take away from userspace oom handlers.

If this happens then it has to be fixed and if not fixable then
reconsider this heuristic.

 There's two possible ways to forward this:
 
  - avoid bypass to the root memcg in every possible case such that the
above memcg configuration actually makes a guarantee to userspace oom
handlers attached to it, or
 
  - provide per-memcg memory reserves such that userspace oom handlers can
allocate and charge memory without the above memcg configuration so 
there is a guarantee.

David, you are aware that there are memory allocations that are out of
memcg/kmem scope, aren't you? This means that whether you add memcg
charge-reserves or access to memory reserves to memcg OOM killers then
you still can never rule out the global OOM killer.

 What's not acceptable, now or ever, is suggesting a solution to a problem 
 that is supposed to guarantee some resource and then allow under some 
 circumstances that resource to be completely depleted such that the 
 solution never works.

And yet you still haven't shown that such depletion is real. E.g. g-u-p
backs off when it sees fatal signal pending other callers that allocate
charged memory should do the same.

  Yes, and apart from GFP_NOFAIL we are allowing to bypass only those that
  should terminate in a short time. I think that having a setup with a
  guarantee of never triggering the global OOM is too ambitious and I am
  even skeptical it would be achievable.
  
 
 Short time is meaningless if the 

Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves

2014-01-20 Thread David Rientjes
On Thu, 16 Jan 2014, Michal Hocko wrote:

> > The heuristic may have existed for ages, but the proposed memcg 
> > configuration for preserving memory such that userspace oom handlers may 
> > run such as
> > 
> >  _root__
> > /   \
> > user oom
> >/\   /   \
> >AB   a   b
> > 
> > where user/memory.limit_in_bytes == [amount of present RAM] + 
> > oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be 
> > problematic, including Johannes' buggy bypass for charges in memcgs with 
> > pending memcgs that has since been fixed after I identified it.  This 
> > bypass is included.  Processes attached to "a" and "b" are userspace oom 
> > handlers for processes attached to "A" and "B", respectively.
> > 
> > The amount of memory you're talking about is proportional to the number of 
> > processes that have pending SIGKILLs (and now those with PF_EXITING set), 
> > the former of which is obviously more concerning since they could be 
> > charging memory at any point in the kernel that would succeed. 
> 
> I understand your concerns. Yes, excessive charges might be dangerous. I
> haven't dismissed that when you mentioned it earlier. I am just
> repeatedly asking how much memory are we talking about, how real is the
> issue and what are all the other conseqeunces. And for some reason you
> are not providing that information (or maybe I am just not seeing that
> in your responses) and that is why we are stuck in circle.
> 

Wtf are you talking about?  You're adding a bypass in this patch and then 
you're asking me to go and see how much memory it could potentially bypass 
and take away from oom handlers under the above memcg configuration?  This 
seems like something you should provide before throwing out patches that 
nobody has tested if you want to make the argument that the above memcg 
configuration is valid for handling userspace oom notifications.

And you certainly have dismissed what I've mentioned earlier when I said 
that anybody can add memory allocation to the exit path later on and 
nobody is going to think about how much memory this is going to bypass to 
the root memcg and potentially take away from userspace oom handlers.

There's two possible ways to forward this:

 - avoid bypass to the root memcg in every possible case such that the
   above memcg configuration actually makes a guarantee to userspace oom
   handlers attached to it, or

 - provide per-memcg memory reserves such that userspace oom handlers can
   allocate and charge memory without the above memcg configuration so 
   there is a guarantee.

What's not acceptable, now or ever, is suggesting a solution to a problem 
that is supposed to guarantee some resource and then allow under some 
circumstances that resource to be completely depleted such that the 
solution never works.

> Yes, and apart from GFP_NOFAIL we are allowing to bypass only those that
> should terminate in a short time. I think that having a setup with a
> guarantee of never triggering the global OOM is too ambitious and I am
> even skeptical it would be achievable.
> 

"Short time" is meaningless if the memory allocation causes memory to not 
be available to userspace oom handlers.  If allocations are allowed to be 
charged because you're in the exit() path or because you have SIGKILL, 
that can result in a system oom condition that would prevent userspace 
from being able to handle them.

> > I'm debating both fatal_signal_pending() and PF_EXITING here since they 
> > are now both bypasses, we need to remove fatal_signal_pending().  My 
> > simple question with your patch: how do you guarantee memory to processes 
> > attached to "a" and "b"?
> 
> The only way you can get that _guarantee_ is to account all the memory
> allocations. And that is not implemented and I would even question
> whether it is worthwhile. So we still have to live with a possibility
> of triggering the global OOM killer. That's why I believe we need to be
> able to tell the kernel what is the user policy for oom killer (that is
> a different discussion though).
> 

So you're saying that Tejun's suggested userspace oom handler 
configuration is pointless, correct?  We can certainly provide a guarantee 
if memory is reserved specifically for userspace oom handling like I 
proposed, the same way that memory reserves are guaranteed for oom killed 
processes.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves

2014-01-20 Thread David Rientjes
On Thu, 16 Jan 2014, Michal Hocko wrote:

  The heuristic may have existed for ages, but the proposed memcg 
  configuration for preserving memory such that userspace oom handlers may 
  run such as
  
   _root__
  /   \
  user oom
 /\   /   \
 AB   a   b
  
  where user/memory.limit_in_bytes == [amount of present RAM] + 
  oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be 
  problematic, including Johannes' buggy bypass for charges in memcgs with 
  pending memcgs that has since been fixed after I identified it.  This 
  bypass is included.  Processes attached to a and b are userspace oom 
  handlers for processes attached to A and B, respectively.
  
  The amount of memory you're talking about is proportional to the number of 
  processes that have pending SIGKILLs (and now those with PF_EXITING set), 
  the former of which is obviously more concerning since they could be 
  charging memory at any point in the kernel that would succeed. 
 
 I understand your concerns. Yes, excessive charges might be dangerous. I
 haven't dismissed that when you mentioned it earlier. I am just
 repeatedly asking how much memory are we talking about, how real is the
 issue and what are all the other conseqeunces. And for some reason you
 are not providing that information (or maybe I am just not seeing that
 in your responses) and that is why we are stuck in circle.
 

Wtf are you talking about?  You're adding a bypass in this patch and then 
you're asking me to go and see how much memory it could potentially bypass 
and take away from oom handlers under the above memcg configuration?  This 
seems like something you should provide before throwing out patches that 
nobody has tested if you want to make the argument that the above memcg 
configuration is valid for handling userspace oom notifications.

And you certainly have dismissed what I've mentioned earlier when I said 
that anybody can add memory allocation to the exit path later on and 
nobody is going to think about how much memory this is going to bypass to 
the root memcg and potentially take away from userspace oom handlers.

There's two possible ways to forward this:

 - avoid bypass to the root memcg in every possible case such that the
   above memcg configuration actually makes a guarantee to userspace oom
   handlers attached to it, or

 - provide per-memcg memory reserves such that userspace oom handlers can
   allocate and charge memory without the above memcg configuration so 
   there is a guarantee.

What's not acceptable, now or ever, is suggesting a solution to a problem 
that is supposed to guarantee some resource and then allow under some 
circumstances that resource to be completely depleted such that the 
solution never works.

 Yes, and apart from GFP_NOFAIL we are allowing to bypass only those that
 should terminate in a short time. I think that having a setup with a
 guarantee of never triggering the global OOM is too ambitious and I am
 even skeptical it would be achievable.
 

Short time is meaningless if the memory allocation causes memory to not 
be available to userspace oom handlers.  If allocations are allowed to be 
charged because you're in the exit() path or because you have SIGKILL, 
that can result in a system oom condition that would prevent userspace 
from being able to handle them.

  I'm debating both fatal_signal_pending() and PF_EXITING here since they 
  are now both bypasses, we need to remove fatal_signal_pending().  My 
  simple question with your patch: how do you guarantee memory to processes 
  attached to a and b?
 
 The only way you can get that _guarantee_ is to account all the memory
 allocations. And that is not implemented and I would even question
 whether it is worthwhile. So we still have to live with a possibility
 of triggering the global OOM killer. That's why I believe we need to be
 able to tell the kernel what is the user policy for oom killer (that is
 a different discussion though).
 

So you're saying that Tejun's suggested userspace oom handler 
configuration is pointless, correct?  We can certainly provide a guarantee 
if memory is reserved specifically for userspace oom handling like I 
proposed, the same way that memory reserves are guaranteed for oom killed 
processes.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves

2014-01-16 Thread Michal Hocko
On Wed 15-01-14 13:19:21, David Rientjes wrote:
> On Wed, 15 Jan 2014, Michal Hocko wrote:
> 
> > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > > index b8dfed1b9d87..b86fbb04b7c6 100644
> > > > > > --- a/mm/memcontrol.c
> > > > > > +++ b/mm/memcontrol.c
> > > > > > @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct 
> > > > > > mm_struct *mm,
> > > > > >  * MEMDIE process.
> > > > > >  */
> > > > > > if (unlikely(test_thread_flag(TIF_MEMDIE)
> > > > > > -|| fatal_signal_pending(current)))
> > > > > > +|| fatal_signal_pending(current))
> > > > > > +|| current->flags & PF_EXITING)
> > > > > > goto bypass;
> > > > > >  
> > > > > > if (unlikely(task_in_memcg_oom(current)))
> > > > > 
> > > > > This would become problematic if significant amount of memory is 
> > > > > charged 
> > > > > in the exit() path. 
> > > > 
> > > > But this would hurt also for fatal_signal_pending tasks, wouldn't it?
> > > 
> > > Yes, and as I've said twice now, that should be removed. 
> > 
> > And you failed to provide any relevant data to back your suggestions. I
> > have told you that we have these heuristics for ages and we need a
> > strong justification to drop them. So if you really think that they are
> > not appropriate then back your statements with real data.
> > 
> > E.g. measure how much memory are we talking about.
> 
> The heuristic may have existed for ages, but the proposed memcg 
> configuration for preserving memory such that userspace oom handlers may 
> run such as
> 
>_root__
>   /   \
>   user oom
>  /\   /   \
>  AB   a   b
> 
> where user/memory.limit_in_bytes == [amount of present RAM] + 
> oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be 
> problematic, including Johannes' buggy bypass for charges in memcgs with 
> pending memcgs that has since been fixed after I identified it.  This 
> bypass is included.  Processes attached to "a" and "b" are userspace oom 
> handlers for processes attached to "A" and "B", respectively.
> 
> The amount of memory you're talking about is proportional to the number of 
> processes that have pending SIGKILLs (and now those with PF_EXITING set), 
> the former of which is obviously more concerning since they could be 
> charging memory at any point in the kernel that would succeed. 

I understand your concerns. Yes, excessive charges might be dangerous. I
haven't dismissed that when you mentioned it earlier. I am just
repeatedly asking how much memory are we talking about, how real is the
issue and what are all the other conseqeunces. And for some reason you
are not providing that information (or maybe I am just not seeing that
in your responses) and that is why we are stuck in circle.

For example your some_fudge needs to consider memory which is not
accounted for. That is rather hard to predict because it depends
on drivers (or whoever calls page allocator directly) you have and
the current load. Also all tasks living in the root memcg are not
accounted. So you would need to have some pillow to be safe. Do
fatal_signal_pending tasks allocate way much more than your pillow?

I am also not sure whether fatal_signal_pending failing charge (without
going via OOM which would set TIF_MEMDIE) could prevent exiting task.

Finally there is still a risk for regressions when a killed task causes a
pointless reclaim just to free few pages for a task which will free
memory few seconds later.

> The latter 
> is concerning only if future memory allocation post-PF_EXITING would be 
> become significant and nobody is going to think about oom memcg bypasses 
> in such a case.
> 
> To use the configuration suggested above, we need to prevent as many 
> bypasses as possible to the root memcg.  Otherwise, the memory protected 
> for the "oom" memcg from processes constrained by the limit of "user" is 
> no longer protected.  This isn't only a problem with the bypasses here in 
> the charging path, but also unaccounted kernel memory, for example.

Yes, and apart from GFP_NOFAIL we are allowing to bypass only those that
should terminate in a short time. I think that having a setup with a
guarantee of never triggering the global OOM is too ambitious and I am
even skeptical it would be achievable.

> For this to be usable, we need to ensure that the limit of the "oom" memcg 
> is protected for the userspace oom handlers that are attached.  With a 
> charge bypassed to the root memcg greater than or equal to the limit of 
> the "oom" memcg OR cumulative charges bypassed to the root memcg greater 
> than or equal to the limit of the "oom" memcg by processes with pending 
> SIGKILLs, userspace oom handlers cannot respond.  That's particuarly 
> dangerous without a memcg oom kill delay, as proposed before, since 
> 

Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves

2014-01-16 Thread Michal Hocko
On Wed 15-01-14 13:19:21, David Rientjes wrote:
 On Wed, 15 Jan 2014, Michal Hocko wrote:
 
  diff --git a/mm/memcontrol.c b/mm/memcontrol.c
  index b8dfed1b9d87..b86fbb04b7c6 100644
  --- a/mm/memcontrol.c
  +++ b/mm/memcontrol.c
  @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct 
  mm_struct *mm,
   * MEMDIE process.
   */
  if (unlikely(test_thread_flag(TIF_MEMDIE)
  -|| fatal_signal_pending(current)))
  +|| fatal_signal_pending(current))
  +|| current-flags  PF_EXITING)
  goto bypass;
   
  if (unlikely(task_in_memcg_oom(current)))
 
 This would become problematic if significant amount of memory is 
 charged 
 in the exit() path. 

But this would hurt also for fatal_signal_pending tasks, wouldn't it?
   
   Yes, and as I've said twice now, that should be removed. 
  
  And you failed to provide any relevant data to back your suggestions. I
  have told you that we have these heuristics for ages and we need a
  strong justification to drop them. So if you really think that they are
  not appropriate then back your statements with real data.
  
  E.g. measure how much memory are we talking about.
 
 The heuristic may have existed for ages, but the proposed memcg 
 configuration for preserving memory such that userspace oom handlers may 
 run such as
 
_root__
   /   \
   user oom
  /\   /   \
  AB   a   b
 
 where user/memory.limit_in_bytes == [amount of present RAM] + 
 oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be 
 problematic, including Johannes' buggy bypass for charges in memcgs with 
 pending memcgs that has since been fixed after I identified it.  This 
 bypass is included.  Processes attached to a and b are userspace oom 
 handlers for processes attached to A and B, respectively.
 
 The amount of memory you're talking about is proportional to the number of 
 processes that have pending SIGKILLs (and now those with PF_EXITING set), 
 the former of which is obviously more concerning since they could be 
 charging memory at any point in the kernel that would succeed. 

I understand your concerns. Yes, excessive charges might be dangerous. I
haven't dismissed that when you mentioned it earlier. I am just
repeatedly asking how much memory are we talking about, how real is the
issue and what are all the other conseqeunces. And for some reason you
are not providing that information (or maybe I am just not seeing that
in your responses) and that is why we are stuck in circle.

For example your some_fudge needs to consider memory which is not
accounted for. That is rather hard to predict because it depends
on drivers (or whoever calls page allocator directly) you have and
the current load. Also all tasks living in the root memcg are not
accounted. So you would need to have some pillow to be safe. Do
fatal_signal_pending tasks allocate way much more than your pillow?

I am also not sure whether fatal_signal_pending failing charge (without
going via OOM which would set TIF_MEMDIE) could prevent exiting task.

Finally there is still a risk for regressions when a killed task causes a
pointless reclaim just to free few pages for a task which will free
memory few seconds later.

 The latter 
 is concerning only if future memory allocation post-PF_EXITING would be 
 become significant and nobody is going to think about oom memcg bypasses 
 in such a case.
 
 To use the configuration suggested above, we need to prevent as many 
 bypasses as possible to the root memcg.  Otherwise, the memory protected 
 for the oom memcg from processes constrained by the limit of user is 
 no longer protected.  This isn't only a problem with the bypasses here in 
 the charging path, but also unaccounted kernel memory, for example.

Yes, and apart from GFP_NOFAIL we are allowing to bypass only those that
should terminate in a short time. I think that having a setup with a
guarantee of never triggering the global OOM is too ambitious and I am
even skeptical it would be achievable.

 For this to be usable, we need to ensure that the limit of the oom memcg 
 is protected for the userspace oom handlers that are attached.  With a 
 charge bypassed to the root memcg greater than or equal to the limit of 
 the oom memcg OR cumulative charges bypassed to the root memcg greater 
 than or equal to the limit of the oom memcg by processes with pending 
 SIGKILLs, userspace oom handlers cannot respond.  That's particuarly 
 dangerous without a memcg oom kill delay, as proposed before, since 
 userspace must disable oom killing entirely for both A and B for 
 userspace notification to be meaningful, since all processes are now 
 livelocked.
 
   These bypasses should be given to one thread and 

Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves

2014-01-15 Thread David Rientjes
On Wed, 15 Jan 2014, Michal Hocko wrote:

> > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > index b8dfed1b9d87..b86fbb04b7c6 100644
> > > > > --- a/mm/memcontrol.c
> > > > > +++ b/mm/memcontrol.c
> > > > > @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct 
> > > > > mm_struct *mm,
> > > > >* MEMDIE process.
> > > > >*/
> > > > >   if (unlikely(test_thread_flag(TIF_MEMDIE)
> > > > > -  || fatal_signal_pending(current)))
> > > > > +  || fatal_signal_pending(current))
> > > > > +  || current->flags & PF_EXITING)
> > > > >   goto bypass;
> > > > >  
> > > > >   if (unlikely(task_in_memcg_oom(current)))
> > > > 
> > > > This would become problematic if significant amount of memory is 
> > > > charged 
> > > > in the exit() path. 
> > > 
> > > But this would hurt also for fatal_signal_pending tasks, wouldn't it?
> > 
> > Yes, and as I've said twice now, that should be removed. 
> 
> And you failed to provide any relevant data to back your suggestions. I
> have told you that we have these heuristics for ages and we need a
> strong justification to drop them. So if you really think that they are
> not appropriate then back your statements with real data.
> 
> E.g. measure how much memory are we talking about.
> 

The heuristic may have existed for ages, but the proposed memcg 
configuration for preserving memory such that userspace oom handlers may 
run such as

 _root__
/   \
user oom
   /\   /   \
   AB   a   b

where user/memory.limit_in_bytes == [amount of present RAM] + 
oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be 
problematic, including Johannes' buggy bypass for charges in memcgs with 
pending memcgs that has since been fixed after I identified it.  This 
bypass is included.  Processes attached to "a" and "b" are userspace oom 
handlers for processes attached to "A" and "B", respectively.

The amount of memory you're talking about is proportional to the number of 
processes that have pending SIGKILLs (and now those with PF_EXITING set), 
the former of which is obviously more concerning since they could be 
charging memory at any point in the kernel that would succeed.  The latter 
is concerning only if future memory allocation post-PF_EXITING would be 
become significant and nobody is going to think about oom memcg bypasses 
in such a case.

To use the configuration suggested above, we need to prevent as many 
bypasses as possible to the root memcg.  Otherwise, the memory protected 
for the "oom" memcg from processes constrained by the limit of "user" is 
no longer protected.  This isn't only a problem with the bypasses here in 
the charging path, but also unaccounted kernel memory, for example.

For this to be usable, we need to ensure that the limit of the "oom" memcg 
is protected for the userspace oom handlers that are attached.  With a 
charge bypassed to the root memcg greater than or equal to the limit of 
the "oom" memcg OR cumulative charges bypassed to the root memcg greater 
than or equal to the limit of the "oom" memcg by processes with pending 
SIGKILLs, userspace oom handlers cannot respond.  That's particuarly 
dangerous without a memcg oom kill delay, as proposed before, since 
userspace must disable oom killing entirely for both "A" and "B" for 
userspace notification to be meaningful, since all processes are now 
livelocked.

> > These bypasses should be given to one thread and one thread only,
> > which would be the oom killed thread if it needs access to memory
> > reserves to either allocate memory or charge memory.
> 
> There is no way to determine whether a task has been killed due to user
> space OOM killer or by a regular kill.
> 

I'm referring to only granting TIF_MEMDIE to a single process in any memcg 
hierarchy at or below the memcg that has encountered its limit to avoid 
granting it to many processes and bypassing their charges to the root 
memcg; the same variation of the above code, but going through the memcg 
oom killer to get TIF_MEMDIE first.  We must be vigilant and only grant 
TIF_MEMDIE for the process that shall exit.

> > If you are suggesting we use the "user" and "oom" top-level memcg 
> > hierarchy for allowing memory to be available for userspace system oom 
> > handlers, then this has become important when in the past it may have been 
> > a minor point.
> 
> I am not sure it would be _that_ important and if that really becomes to
> be the case then we should deal with it. So far I haven't see any
> evidence there is a lot of memory charged on the exit path.
> 

I'm debating both fatal_signal_pending() and PF_EXITING here since they 
are now both bypasses, we need to remove fatal_signal_pending().  My 
simple question with your patch: how do you guarantee 

Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves

2014-01-15 Thread Michal Hocko
On Fri 10-01-14 13:33:01, David Rientjes wrote:
> On Fri, 10 Jan 2014, Michal Hocko wrote:
> 
> > > > ---
> > > >  mm/memcontrol.c | 3 ++-
> > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index b8dfed1b9d87..b86fbb04b7c6 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct 
> > > > mm_struct *mm,
> > > >  * MEMDIE process.
> > > >  */
> > > > if (unlikely(test_thread_flag(TIF_MEMDIE)
> > > > -|| fatal_signal_pending(current)))
> > > > +|| fatal_signal_pending(current))
> > > > +|| current->flags & PF_EXITING)
> > > > goto bypass;
> > > >  
> > > > if (unlikely(task_in_memcg_oom(current)))
> > > 
> > > This would become problematic if significant amount of memory is charged 
> > > in the exit() path. 
> > 
> > But this would hurt also for fatal_signal_pending tasks, wouldn't it?
> 
> Yes, and as I've said twice now, that should be removed. 

And you failed to provide any relevant data to back your suggestions. I
have told you that we have these heuristics for ages and we need a
strong justification to drop them. So if you really think that they are
not appropriate then back your statements with real data.

E.g. measure how much memory are we talking about.

> These bypasses should be given to one thread and one thread only,
> which would be the oom killed thread if it needs access to memory
> reserves to either allocate memory or charge memory.

There is no way to determine whether a task has been killed due to user
space OOM killer or by a regular kill.

> If you are suggesting we use the "user" and "oom" top-level memcg 
> hierarchy for allowing memory to be available for userspace system oom 
> handlers, then this has become important when in the past it may have been 
> a minor point.

I am not sure it would be _that_ important and if that really becomes to
be the case then we should deal with it. So far I haven't see any
evidence there is a lot of memory charged on the exit path.

> > Besides that I do not see any source of allocation after exit_signals.
> > 
> 
> That's fine for today but may not be in the future.  If memory allocation 
> is done after PF_EXITING in the future, are people going to check memcg 
> bypasses?  No.  And now we have additional memory bypass to root that will 
> cause our userspace system oom hanlders to be oom themselves with the 
> suggested configuration.
> 
> Using the "user" and "oom" top-level memcg hierarchy is a double edged 
> sword, we must attempt to prevent all of these bypasses as much as 
> possible.  The only relevant bypass here is for TIF_MEMDIE which would be 
> set if necessary for the one thread that needs it.

TIF_MEMDIE doesn't work for userspace OOM killers. So we cannot rely on
this flag currently.

> > > I don't know of an egregious amount of memory being 
> > > allocated and charged after PF_EXITING is set, but if it happens in the 
> > > future then this could potentially cause system oom conditions even in 
> > > memcg configurations 
> > 
> > Even if that happens then the global OOM killer would give the exiting
> > task access to memory reserves and wouldn't kill anything else.
> > 
> > So I am not sure what problem do you see exactly.
> > 
> 
> Userspace system oom handlers being able to handle memcg oom conditions in 
> the top-level "user" memcg as proposed by Tejun.  If the global oom killer 
> becomes a part of that discussion at all, then the userspace system oom 
> handler never got a chance to handle the "user" oom.
> 
> > Besides that allocating egregious amount of memory after exit_signals
> > sounds fundamentally broken to me.
> > 
> 
> Egregious could be defined as allocating a few bytes multiplied by 
> thousands of threads in PF_EXITING.

Does this happen in the real life.

Look, I have no objections to make the OOM handling better but it would
help a lot to build new heuristics based on some data in hands. I tried
to repeat that again and again but it seems to not help. I do not want
to end up with new sets of heuristics that break other stuff jut because
they made sense in the context of the specific usecase.

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves

2014-01-15 Thread Michal Hocko
On Fri 10-01-14 13:33:01, David Rientjes wrote:
 On Fri, 10 Jan 2014, Michal Hocko wrote:
 
---
 mm/memcontrol.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b8dfed1b9d87..b86fbb04b7c6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct 
mm_struct *mm,
 * MEMDIE process.
 */
if (unlikely(test_thread_flag(TIF_MEMDIE)
-|| fatal_signal_pending(current)))
+|| fatal_signal_pending(current))
+|| current-flags  PF_EXITING)
goto bypass;
 
if (unlikely(task_in_memcg_oom(current)))
   
   This would become problematic if significant amount of memory is charged 
   in the exit() path. 
  
  But this would hurt also for fatal_signal_pending tasks, wouldn't it?
 
 Yes, and as I've said twice now, that should be removed. 

And you failed to provide any relevant data to back your suggestions. I
have told you that we have these heuristics for ages and we need a
strong justification to drop them. So if you really think that they are
not appropriate then back your statements with real data.

E.g. measure how much memory are we talking about.

 These bypasses should be given to one thread and one thread only,
 which would be the oom killed thread if it needs access to memory
 reserves to either allocate memory or charge memory.

There is no way to determine whether a task has been killed due to user
space OOM killer or by a regular kill.

 If you are suggesting we use the user and oom top-level memcg 
 hierarchy for allowing memory to be available for userspace system oom 
 handlers, then this has become important when in the past it may have been 
 a minor point.

I am not sure it would be _that_ important and if that really becomes to
be the case then we should deal with it. So far I haven't see any
evidence there is a lot of memory charged on the exit path.

  Besides that I do not see any source of allocation after exit_signals.
  
 
 That's fine for today but may not be in the future.  If memory allocation 
 is done after PF_EXITING in the future, are people going to check memcg 
 bypasses?  No.  And now we have additional memory bypass to root that will 
 cause our userspace system oom hanlders to be oom themselves with the 
 suggested configuration.
 
 Using the user and oom top-level memcg hierarchy is a double edged 
 sword, we must attempt to prevent all of these bypasses as much as 
 possible.  The only relevant bypass here is for TIF_MEMDIE which would be 
 set if necessary for the one thread that needs it.

TIF_MEMDIE doesn't work for userspace OOM killers. So we cannot rely on
this flag currently.

   I don't know of an egregious amount of memory being 
   allocated and charged after PF_EXITING is set, but if it happens in the 
   future then this could potentially cause system oom conditions even in 
   memcg configurations 
  
  Even if that happens then the global OOM killer would give the exiting
  task access to memory reserves and wouldn't kill anything else.
  
  So I am not sure what problem do you see exactly.
  
 
 Userspace system oom handlers being able to handle memcg oom conditions in 
 the top-level user memcg as proposed by Tejun.  If the global oom killer 
 becomes a part of that discussion at all, then the userspace system oom 
 handler never got a chance to handle the user oom.
 
  Besides that allocating egregious amount of memory after exit_signals
  sounds fundamentally broken to me.
  
 
 Egregious could be defined as allocating a few bytes multiplied by 
 thousands of threads in PF_EXITING.

Does this happen in the real life.

Look, I have no objections to make the OOM handling better but it would
help a lot to build new heuristics based on some data in hands. I tried
to repeat that again and again but it seems to not help. I do not want
to end up with new sets of heuristics that break other stuff jut because
they made sense in the context of the specific usecase.

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves

2014-01-15 Thread David Rientjes
On Wed, 15 Jan 2014, Michal Hocko wrote:

 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index b8dfed1b9d87..b86fbb04b7c6 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct 
 mm_struct *mm,
* MEMDIE process.
*/
   if (unlikely(test_thread_flag(TIF_MEMDIE)
 -  || fatal_signal_pending(current)))
 +  || fatal_signal_pending(current))
 +  || current-flags  PF_EXITING)
   goto bypass;
  
   if (unlikely(task_in_memcg_oom(current)))

This would become problematic if significant amount of memory is 
charged 
in the exit() path. 
   
   But this would hurt also for fatal_signal_pending tasks, wouldn't it?
  
  Yes, and as I've said twice now, that should be removed. 
 
 And you failed to provide any relevant data to back your suggestions. I
 have told you that we have these heuristics for ages and we need a
 strong justification to drop them. So if you really think that they are
 not appropriate then back your statements with real data.
 
 E.g. measure how much memory are we talking about.
 

The heuristic may have existed for ages, but the proposed memcg 
configuration for preserving memory such that userspace oom handlers may 
run such as

 _root__
/   \
user oom
   /\   /   \
   AB   a   b

where user/memory.limit_in_bytes == [amount of present RAM] + 
oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be 
problematic, including Johannes' buggy bypass for charges in memcgs with 
pending memcgs that has since been fixed after I identified it.  This 
bypass is included.  Processes attached to a and b are userspace oom 
handlers for processes attached to A and B, respectively.

The amount of memory you're talking about is proportional to the number of 
processes that have pending SIGKILLs (and now those with PF_EXITING set), 
the former of which is obviously more concerning since they could be 
charging memory at any point in the kernel that would succeed.  The latter 
is concerning only if future memory allocation post-PF_EXITING would be 
become significant and nobody is going to think about oom memcg bypasses 
in such a case.

To use the configuration suggested above, we need to prevent as many 
bypasses as possible to the root memcg.  Otherwise, the memory protected 
for the oom memcg from processes constrained by the limit of user is 
no longer protected.  This isn't only a problem with the bypasses here in 
the charging path, but also unaccounted kernel memory, for example.

For this to be usable, we need to ensure that the limit of the oom memcg 
is protected for the userspace oom handlers that are attached.  With a 
charge bypassed to the root memcg greater than or equal to the limit of 
the oom memcg OR cumulative charges bypassed to the root memcg greater 
than or equal to the limit of the oom memcg by processes with pending 
SIGKILLs, userspace oom handlers cannot respond.  That's particuarly 
dangerous without a memcg oom kill delay, as proposed before, since 
userspace must disable oom killing entirely for both A and B for 
userspace notification to be meaningful, since all processes are now 
livelocked.

  These bypasses should be given to one thread and one thread only,
  which would be the oom killed thread if it needs access to memory
  reserves to either allocate memory or charge memory.
 
 There is no way to determine whether a task has been killed due to user
 space OOM killer or by a regular kill.
 

I'm referring to only granting TIF_MEMDIE to a single process in any memcg 
hierarchy at or below the memcg that has encountered its limit to avoid 
granting it to many processes and bypassing their charges to the root 
memcg; the same variation of the above code, but going through the memcg 
oom killer to get TIF_MEMDIE first.  We must be vigilant and only grant 
TIF_MEMDIE for the process that shall exit.

  If you are suggesting we use the user and oom top-level memcg 
  hierarchy for allowing memory to be available for userspace system oom 
  handlers, then this has become important when in the past it may have been 
  a minor point.
 
 I am not sure it would be _that_ important and if that really becomes to
 be the case then we should deal with it. So far I haven't see any
 evidence there is a lot of memory charged on the exit path.
 

I'm debating both fatal_signal_pending() and PF_EXITING here since they 
are now both bypasses, we need to remove fatal_signal_pending().  My 
simple question with your patch: how do you guarantee memory to processes 
attached to a and b?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to 

Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves

2014-01-10 Thread David Rientjes
On Fri, 10 Jan 2014, Michal Hocko wrote:

> > > ---
> > >  mm/memcontrol.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index b8dfed1b9d87..b86fbb04b7c6 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct 
> > > *mm,
> > >* MEMDIE process.
> > >*/
> > >   if (unlikely(test_thread_flag(TIF_MEMDIE)
> > > -  || fatal_signal_pending(current)))
> > > +  || fatal_signal_pending(current))
> > > +  || current->flags & PF_EXITING)
> > >   goto bypass;
> > >  
> > >   if (unlikely(task_in_memcg_oom(current)))
> > 
> > This would become problematic if significant amount of memory is charged 
> > in the exit() path. 
> 
> But this would hurt also for fatal_signal_pending tasks, wouldn't it?

Yes, and as I've said twice now, that should be removed.  These bypasses 
should be given to one thread and one thread only, which would be the oom 
killed thread if it needs access to memory reserves to either allocate 
memory or charge memory.

If you are suggesting we use the "user" and "oom" top-level memcg 
hierarchy for allowing memory to be available for userspace system oom 
handlers, then this has become important when in the past it may have been 
a minor point.

> Besides that I do not see any source of allocation after exit_signals.
> 

That's fine for today but may not be in the future.  If memory allocation 
is done after PF_EXITING in the future, are people going to check memcg 
bypasses?  No.  And now we have additional memory bypass to root that will 
cause our userspace system oom hanlders to be oom themselves with the 
suggested configuration.

Using the "user" and "oom" top-level memcg hierarchy is a double edged 
sword, we must attempt to prevent all of these bypasses as much as 
possible.  The only relevant bypass here is for TIF_MEMDIE which would be 
set if necessary for the one thread that needs it.

> > I don't know of an egregious amount of memory being 
> > allocated and charged after PF_EXITING is set, but if it happens in the 
> > future then this could potentially cause system oom conditions even in 
> > memcg configurations 
> 
> Even if that happens then the global OOM killer would give the exiting
> task access to memory reserves and wouldn't kill anything else.
> 
> So I am not sure what problem do you see exactly.
> 

Userspace system oom handlers being able to handle memcg oom conditions in 
the top-level "user" memcg as proposed by Tejun.  If the global oom killer 
becomes a part of that discussion at all, then the userspace system oom 
handler never got a chance to handle the "user" oom.

> Besides that allocating egregious amount of memory after exit_signals
> sounds fundamentally broken to me.
> 

Egregious could be defined as allocating a few bytes multiplied by 
thousands of threads in PF_EXITING.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves

2014-01-10 Thread Michal Hocko
On Thu 09-01-14 13:40:10, David Rientjes wrote:
> On Thu, 9 Jan 2014, Michal Hocko wrote:
> 
> > Eric has reported that he can see task(s) stuck in memcg OOM handler
> > regularly. The only way out is to
> > echo 0 > $GROUP/memory.oom_controll
> > His usecase is:
> > - Setup a hierarchy with memory and the freezer
> >   (disable kernel oom and have a process watch for oom).
> > - In that memory cgroup add a process with one thread per cpu.
> > - In one thread slowly allocate once per second I think it is 16M of ram
> >   and mlock and dirty it (just to force the pages into ram and stay there).
> > - When oom is achieved loop:
> >   * attempt to freeze all of the tasks.
> >   * if frozen send every task SIGKILL, unfreeze, remove the directory in
> > cgroupfs.
> > 
> > Eric has then pinpointed the issue to be memcg specific.
> > 
> > All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
> > Those that have received fatal signal will bypass the charge and should
> > continue on their way out. The tricky part is that the exit path might
> > trigger a page fault (e.g. exit_robust_list), thus the memcg charge,
> > while its memcg is still under OOM because nobody has released any
> > charges yet.
> > Unlike with the in-kernel OOM handler the exiting task doesn't get
> > TIF_MEMDIE set so it doesn't shortcut futher charges of the killed task
> > and falls to the memcg OOM again without any way out of it as there are
> > no fatal signals pending anymore.
> > 
> > This patch fixes the issue by checking PF_EXITING early in
> > __mem_cgroup_try_charge and bypass the charge same as if it had fatal
> > signal pending or TIF_MEMDIE set.
> > 
> > Normally exiting tasks (aka not killed) will bypass the charge now but
> > this should be OK as the task is leaving and will release memory and
> > increasing the memory pressure just to release it in a moment seems
> > dubious wasting of cycles. Besides that charges after exit_signals
> > should be rare.
> > 
> > Reported-by: Eric W. Biederman 
> > Signed-off-by: Michal Hocko 
> 
> Is this tested?

By Eric? No AFAIK. I wasn't able to reproduce the issue myself.

> > ---
> >  mm/memcontrol.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index b8dfed1b9d87..b86fbb04b7c6 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct 
> > *mm,
> >  * MEMDIE process.
> >  */
> > if (unlikely(test_thread_flag(TIF_MEMDIE)
> > -|| fatal_signal_pending(current)))
> > +|| fatal_signal_pending(current))
> > +|| current->flags & PF_EXITING)
> > goto bypass;
> >  
> > if (unlikely(task_in_memcg_oom(current)))
> 
> This would become problematic if significant amount of memory is charged 
> in the exit() path. 

But this would hurt also for fatal_signal_pending tasks, wouldn't it?
Besides that I do not see any source of allocation after exit_signals.

> I don't know of an egregious amount of memory being 
> allocated and charged after PF_EXITING is set, but if it happens in the 
> future then this could potentially cause system oom conditions even in 
> memcg configurations 

Even if that happens then the global OOM killer would give the exiting
task access to memory reserves and wouldn't kill anything else.

So I am not sure what problem do you see exactly.

Besides that allocating egregious amount of memory after exit_signals
sounds fundamentally broken to me.

> that are designed such as the one Tejun suggested to 
> be able to handle such conditions in userspace:
> 
>___root___
>   /  \
>   useroom
>   /  \/ \
>   A  BC D
> 
> where the limit of user is equal to the amount of system memory minus 
> whatever amount of memory is needed by the system oom handler attached as 
> a descendant of oom and still allows the limits of A + B to exceed the 
> limit of user.
> 
> So how do we ensure that memory allocations in the exit() path don't cause 
> system oom conditions whereas the above configuration no longer provides 
> any strict guarantee?
> 
> Thanks.

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves

2014-01-10 Thread Michal Hocko
On Thu 09-01-14 13:40:10, David Rientjes wrote:
 On Thu, 9 Jan 2014, Michal Hocko wrote:
 
  Eric has reported that he can see task(s) stuck in memcg OOM handler
  regularly. The only way out is to
  echo 0  $GROUP/memory.oom_controll
  His usecase is:
  - Setup a hierarchy with memory and the freezer
(disable kernel oom and have a process watch for oom).
  - In that memory cgroup add a process with one thread per cpu.
  - In one thread slowly allocate once per second I think it is 16M of ram
and mlock and dirty it (just to force the pages into ram and stay there).
  - When oom is achieved loop:
* attempt to freeze all of the tasks.
* if frozen send every task SIGKILL, unfreeze, remove the directory in
  cgroupfs.
  
  Eric has then pinpointed the issue to be memcg specific.
  
  All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
  Those that have received fatal signal will bypass the charge and should
  continue on their way out. The tricky part is that the exit path might
  trigger a page fault (e.g. exit_robust_list), thus the memcg charge,
  while its memcg is still under OOM because nobody has released any
  charges yet.
  Unlike with the in-kernel OOM handler the exiting task doesn't get
  TIF_MEMDIE set so it doesn't shortcut futher charges of the killed task
  and falls to the memcg OOM again without any way out of it as there are
  no fatal signals pending anymore.
  
  This patch fixes the issue by checking PF_EXITING early in
  __mem_cgroup_try_charge and bypass the charge same as if it had fatal
  signal pending or TIF_MEMDIE set.
  
  Normally exiting tasks (aka not killed) will bypass the charge now but
  this should be OK as the task is leaving and will release memory and
  increasing the memory pressure just to release it in a moment seems
  dubious wasting of cycles. Besides that charges after exit_signals
  should be rare.
  
  Reported-by: Eric W. Biederman ebied...@xmission.com
  Signed-off-by: Michal Hocko mho...@suse.cz
 
 Is this tested?

By Eric? No AFAIK. I wasn't able to reproduce the issue myself.

  ---
   mm/memcontrol.c | 3 ++-
   1 file changed, 2 insertions(+), 1 deletion(-)
  
  diff --git a/mm/memcontrol.c b/mm/memcontrol.c
  index b8dfed1b9d87..b86fbb04b7c6 100644
  --- a/mm/memcontrol.c
  +++ b/mm/memcontrol.c
  @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct 
  *mm,
   * MEMDIE process.
   */
  if (unlikely(test_thread_flag(TIF_MEMDIE)
  -|| fatal_signal_pending(current)))
  +|| fatal_signal_pending(current))
  +|| current-flags  PF_EXITING)
  goto bypass;
   
  if (unlikely(task_in_memcg_oom(current)))
 
 This would become problematic if significant amount of memory is charged 
 in the exit() path. 

But this would hurt also for fatal_signal_pending tasks, wouldn't it?
Besides that I do not see any source of allocation after exit_signals.

 I don't know of an egregious amount of memory being 
 allocated and charged after PF_EXITING is set, but if it happens in the 
 future then this could potentially cause system oom conditions even in 
 memcg configurations 

Even if that happens then the global OOM killer would give the exiting
task access to memory reserves and wouldn't kill anything else.

So I am not sure what problem do you see exactly.

Besides that allocating egregious amount of memory after exit_signals
sounds fundamentally broken to me.

 that are designed such as the one Tejun suggested to 
 be able to handle such conditions in userspace:
 
___root___
   /  \
   useroom
   /  \/ \
   A  BC D
 
 where the limit of user is equal to the amount of system memory minus 
 whatever amount of memory is needed by the system oom handler attached as 
 a descendant of oom and still allows the limits of A + B to exceed the 
 limit of user.
 
 So how do we ensure that memory allocations in the exit() path don't cause 
 system oom conditions whereas the above configuration no longer provides 
 any strict guarantee?
 
 Thanks.

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves

2014-01-10 Thread David Rientjes
On Fri, 10 Jan 2014, Michal Hocko wrote:

   ---
mm/memcontrol.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
   
   diff --git a/mm/memcontrol.c b/mm/memcontrol.c
   index b8dfed1b9d87..b86fbb04b7c6 100644
   --- a/mm/memcontrol.c
   +++ b/mm/memcontrol.c
   @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct 
   *mm,
  * MEMDIE process.
  */
 if (unlikely(test_thread_flag(TIF_MEMDIE)
   -  || fatal_signal_pending(current)))
   +  || fatal_signal_pending(current))
   +  || current-flags  PF_EXITING)
 goto bypass;

 if (unlikely(task_in_memcg_oom(current)))
  
  This would become problematic if significant amount of memory is charged 
  in the exit() path. 
 
 But this would hurt also for fatal_signal_pending tasks, wouldn't it?

Yes, and as I've said twice now, that should be removed.  These bypasses 
should be given to one thread and one thread only, which would be the oom 
killed thread if it needs access to memory reserves to either allocate 
memory or charge memory.

If you are suggesting we use the user and oom top-level memcg 
hierarchy for allowing memory to be available for userspace system oom 
handlers, then this has become important when in the past it may have been 
a minor point.

 Besides that I do not see any source of allocation after exit_signals.
 

That's fine for today but may not be in the future.  If memory allocation 
is done after PF_EXITING in the future, are people going to check memcg 
bypasses?  No.  And now we have additional memory bypass to root that will 
cause our userspace system oom hanlders to be oom themselves with the 
suggested configuration.

Using the user and oom top-level memcg hierarchy is a double edged 
sword, we must attempt to prevent all of these bypasses as much as 
possible.  The only relevant bypass here is for TIF_MEMDIE which would be 
set if necessary for the one thread that needs it.

  I don't know of an egregious amount of memory being 
  allocated and charged after PF_EXITING is set, but if it happens in the 
  future then this could potentially cause system oom conditions even in 
  memcg configurations 
 
 Even if that happens then the global OOM killer would give the exiting
 task access to memory reserves and wouldn't kill anything else.
 
 So I am not sure what problem do you see exactly.
 

Userspace system oom handlers being able to handle memcg oom conditions in 
the top-level user memcg as proposed by Tejun.  If the global oom killer 
becomes a part of that discussion at all, then the userspace system oom 
handler never got a chance to handle the user oom.

 Besides that allocating egregious amount of memory after exit_signals
 sounds fundamentally broken to me.
 

Egregious could be defined as allocating a few bytes multiplied by 
thousands of threads in PF_EXITING.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves

2014-01-09 Thread David Rientjes
On Thu, 9 Jan 2014, Michal Hocko wrote:

> Eric has reported that he can see task(s) stuck in memcg OOM handler
> regularly. The only way out is to
>   echo 0 > $GROUP/memory.oom_controll
> His usecase is:
> - Setup a hierarchy with memory and the freezer
>   (disable kernel oom and have a process watch for oom).
> - In that memory cgroup add a process with one thread per cpu.
> - In one thread slowly allocate once per second I think it is 16M of ram
>   and mlock and dirty it (just to force the pages into ram and stay there).
> - When oom is achieved loop:
>   * attempt to freeze all of the tasks.
>   * if frozen send every task SIGKILL, unfreeze, remove the directory in
> cgroupfs.
> 
> Eric has then pinpointed the issue to be memcg specific.
> 
> All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
> Those that have received fatal signal will bypass the charge and should
> continue on their way out. The tricky part is that the exit path might
> trigger a page fault (e.g. exit_robust_list), thus the memcg charge,
> while its memcg is still under OOM because nobody has released any
> charges yet.
> Unlike with the in-kernel OOM handler the exiting task doesn't get
> TIF_MEMDIE set so it doesn't shortcut futher charges of the killed task
> and falls to the memcg OOM again without any way out of it as there are
> no fatal signals pending anymore.
> 
> This patch fixes the issue by checking PF_EXITING early in
> __mem_cgroup_try_charge and bypass the charge same as if it had fatal
> signal pending or TIF_MEMDIE set.
> 
> Normally exiting tasks (aka not killed) will bypass the charge now but
> this should be OK as the task is leaving and will release memory and
> increasing the memory pressure just to release it in a moment seems
> dubious wasting of cycles. Besides that charges after exit_signals
> should be rare.
> 
> Reported-by: Eric W. Biederman 
> Signed-off-by: Michal Hocko 

Is this tested?

> ---
>  mm/memcontrol.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b8dfed1b9d87..b86fbb04b7c6 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>* MEMDIE process.
>*/
>   if (unlikely(test_thread_flag(TIF_MEMDIE)
> -  || fatal_signal_pending(current)))
> +  || fatal_signal_pending(current))
> +  || current->flags & PF_EXITING)
>   goto bypass;
>  
>   if (unlikely(task_in_memcg_oom(current)))

This would become problematic if significant amount of memory is charged 
in the exit() path.  I don't know of an egregious amount of memory being 
allocated and charged after PF_EXITING is set, but if it happens in the 
future then this could potentially cause system oom conditions even in 
memcg configurations that are designed such as the one Tejun suggested to 
be able to handle such conditions in userspace:

 ___root___
/  \
useroom
/  \/ \
A  BC D

where the limit of user is equal to the amount of system memory minus 
whatever amount of memory is needed by the system oom handler attached as 
a descendant of oom and still allows the limits of A + B to exceed the 
limit of user.

So how do we ensure that memory allocations in the exit() path don't cause 
system oom conditions whereas the above configuration no longer provides 
any strict guarantee?

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves

2014-01-09 Thread Michal Hocko
On Wed 08-01-14 11:33:19, Michal Hocko wrote:
> On Tue 07-01-14 16:25:03, Andrew Morton wrote:
> [...]
> > > OK, so can we at least agree on the patch posted here:
> > > https://lkml.org/lkml/2013/12/12/129. This is a real bug and definitely
> > > worth fixing.
> > 
> > Yes, can we please get Eric's bug fixed?  I don't believe that Eric has
> > tested either https://lkml.org/lkml/2013/12/12/129 or
> > http://ozlabs.org/~akpm/mmots/broken-out/mm-memcg-avoid-oom-notification-when-current-needs-access-to-memory-reserves.patch.
> > Is he the only person who can reproduce this?
> 
> I have gathered 3 patches from all the discussion and plan to post them
> today or tomorrow as the time permits. https://lkml.org/lkml/2013/12/12/129
> will be a part of it.

OK, I've decided to post the oom notification parts later because they
will likely generate some discussion which might distract from the
actual fix so here it goes (can be applied on both mmotm and the current
Linus' tree):
---
>From 6185aaf3cd429b1ecf2d11d424e29b597feb7811 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Thu, 12 Dec 2013 11:37:27 +0100
Subject: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM

Eric has reported that he can see task(s) stuck in memcg OOM handler
regularly. The only way out is to
echo 0 > $GROUP/memory.oom_controll
His usecase is:
- Setup a hierarchy with memory and the freezer
  (disable kernel oom and have a process watch for oom).
- In that memory cgroup add a process with one thread per cpu.
- In one thread slowly allocate once per second I think it is 16M of ram
  and mlock and dirty it (just to force the pages into ram and stay there).
- When oom is achieved loop:
  * attempt to freeze all of the tasks.
  * if frozen send every task SIGKILL, unfreeze, remove the directory in
cgroupfs.

Eric has then pinpointed the issue to be memcg specific.

All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
Those that have received fatal signal will bypass the charge and should
continue on their way out. The tricky part is that the exit path might
trigger a page fault (e.g. exit_robust_list), thus the memcg charge,
while its memcg is still under OOM because nobody has released any
charges yet.
Unlike with the in-kernel OOM handler the exiting task doesn't get
TIF_MEMDIE set so it doesn't shortcut futher charges of the killed task
and falls to the memcg OOM again without any way out of it as there are
no fatal signals pending anymore.

This patch fixes the issue by checking PF_EXITING early in
__mem_cgroup_try_charge and bypass the charge same as if it had fatal
signal pending or TIF_MEMDIE set.

Normally exiting tasks (aka not killed) will bypass the charge now but
this should be OK as the task is leaving and will release memory and
increasing the memory pressure just to release it in a moment seems
dubious wasting of cycles. Besides that charges after exit_signals
should be rare.

Reported-by: Eric W. Biederman 
Signed-off-by: Michal Hocko 
---
 mm/memcontrol.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b8dfed1b9d87..b86fbb04b7c6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 * MEMDIE process.
 */
if (unlikely(test_thread_flag(TIF_MEMDIE)
-|| fatal_signal_pending(current)))
+|| fatal_signal_pending(current))
+|| current->flags & PF_EXITING)
goto bypass;
 
if (unlikely(task_in_memcg_oom(current)))
-- 
1.8.5.2

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves

2014-01-09 Thread Michal Hocko
On Wed 08-01-14 11:33:19, Michal Hocko wrote:
 On Tue 07-01-14 16:25:03, Andrew Morton wrote:
 [...]
   OK, so can we at least agree on the patch posted here:
   https://lkml.org/lkml/2013/12/12/129. This is a real bug and definitely
   worth fixing.
  
  Yes, can we please get Eric's bug fixed?  I don't believe that Eric has
  tested either https://lkml.org/lkml/2013/12/12/129 or
  http://ozlabs.org/~akpm/mmots/broken-out/mm-memcg-avoid-oom-notification-when-current-needs-access-to-memory-reserves.patch.
  Is he the only person who can reproduce this?
 
 I have gathered 3 patches from all the discussion and plan to post them
 today or tomorrow as the time permits. https://lkml.org/lkml/2013/12/12/129
 will be a part of it.

OK, I've decided to post the oom notification parts later because they
will likely generate some discussion which might distract from the
actual fix so here it goes (can be applied on both mmotm and the current
Linus' tree):
---
From 6185aaf3cd429b1ecf2d11d424e29b597feb7811 Mon Sep 17 00:00:00 2001
From: Michal Hocko mho...@suse.cz
Date: Thu, 12 Dec 2013 11:37:27 +0100
Subject: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM

Eric has reported that he can see task(s) stuck in memcg OOM handler
regularly. The only way out is to
echo 0  $GROUP/memory.oom_controll
His usecase is:
- Setup a hierarchy with memory and the freezer
  (disable kernel oom and have a process watch for oom).
- In that memory cgroup add a process with one thread per cpu.
- In one thread slowly allocate once per second I think it is 16M of ram
  and mlock and dirty it (just to force the pages into ram and stay there).
- When oom is achieved loop:
  * attempt to freeze all of the tasks.
  * if frozen send every task SIGKILL, unfreeze, remove the directory in
cgroupfs.

Eric has then pinpointed the issue to be memcg specific.

All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
Those that have received fatal signal will bypass the charge and should
continue on their way out. The tricky part is that the exit path might
trigger a page fault (e.g. exit_robust_list), thus the memcg charge,
while its memcg is still under OOM because nobody has released any
charges yet.
Unlike with the in-kernel OOM handler the exiting task doesn't get
TIF_MEMDIE set so it doesn't shortcut futher charges of the killed task
and falls to the memcg OOM again without any way out of it as there are
no fatal signals pending anymore.

This patch fixes the issue by checking PF_EXITING early in
__mem_cgroup_try_charge and bypass the charge same as if it had fatal
signal pending or TIF_MEMDIE set.

Normally exiting tasks (aka not killed) will bypass the charge now but
this should be OK as the task is leaving and will release memory and
increasing the memory pressure just to release it in a moment seems
dubious wasting of cycles. Besides that charges after exit_signals
should be rare.

Reported-by: Eric W. Biederman ebied...@xmission.com
Signed-off-by: Michal Hocko mho...@suse.cz
---
 mm/memcontrol.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b8dfed1b9d87..b86fbb04b7c6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 * MEMDIE process.
 */
if (unlikely(test_thread_flag(TIF_MEMDIE)
-|| fatal_signal_pending(current)))
+|| fatal_signal_pending(current))
+|| current-flags  PF_EXITING)
goto bypass;
 
if (unlikely(task_in_memcg_oom(current)))
-- 
1.8.5.2

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves

2014-01-09 Thread David Rientjes
On Thu, 9 Jan 2014, Michal Hocko wrote:

 Eric has reported that he can see task(s) stuck in memcg OOM handler
 regularly. The only way out is to
   echo 0  $GROUP/memory.oom_controll
 His usecase is:
 - Setup a hierarchy with memory and the freezer
   (disable kernel oom and have a process watch for oom).
 - In that memory cgroup add a process with one thread per cpu.
 - In one thread slowly allocate once per second I think it is 16M of ram
   and mlock and dirty it (just to force the pages into ram and stay there).
 - When oom is achieved loop:
   * attempt to freeze all of the tasks.
   * if frozen send every task SIGKILL, unfreeze, remove the directory in
 cgroupfs.
 
 Eric has then pinpointed the issue to be memcg specific.
 
 All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
 Those that have received fatal signal will bypass the charge and should
 continue on their way out. The tricky part is that the exit path might
 trigger a page fault (e.g. exit_robust_list), thus the memcg charge,
 while its memcg is still under OOM because nobody has released any
 charges yet.
 Unlike with the in-kernel OOM handler the exiting task doesn't get
 TIF_MEMDIE set so it doesn't shortcut futher charges of the killed task
 and falls to the memcg OOM again without any way out of it as there are
 no fatal signals pending anymore.
 
 This patch fixes the issue by checking PF_EXITING early in
 __mem_cgroup_try_charge and bypass the charge same as if it had fatal
 signal pending or TIF_MEMDIE set.
 
 Normally exiting tasks (aka not killed) will bypass the charge now but
 this should be OK as the task is leaving and will release memory and
 increasing the memory pressure just to release it in a moment seems
 dubious wasting of cycles. Besides that charges after exit_signals
 should be rare.
 
 Reported-by: Eric W. Biederman ebied...@xmission.com
 Signed-off-by: Michal Hocko mho...@suse.cz

Is this tested?

 ---
  mm/memcontrol.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)
 
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index b8dfed1b9d87..b86fbb04b7c6 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
* MEMDIE process.
*/
   if (unlikely(test_thread_flag(TIF_MEMDIE)
 -  || fatal_signal_pending(current)))
 +  || fatal_signal_pending(current))
 +  || current-flags  PF_EXITING)
   goto bypass;
  
   if (unlikely(task_in_memcg_oom(current)))

This would become problematic if significant amount of memory is charged 
in the exit() path.  I don't know of an egregious amount of memory being 
allocated and charged after PF_EXITING is set, but if it happens in the 
future then this could potentially cause system oom conditions even in 
memcg configurations that are designed such as the one Tejun suggested to 
be able to handle such conditions in userspace:

 ___root___
/  \
useroom
/  \/ \
A  BC D

where the limit of user is equal to the amount of system memory minus 
whatever amount of memory is needed by the system oom handler attached as 
a descendant of oom and still allows the limits of A + B to exceed the 
limit of user.

So how do we ensure that memory allocations in the exit() path don't cause 
system oom conditions whereas the above configuration no longer provides 
any strict guarantee?

Thanks.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/